Advanced Psychological Statistics: Final Project    
# <center> Overperception of moral outrage in online social networks inflates beliefs about intergroup hostility </center>  

### <center> Group 5: 陈可遇、李沐紫、张心怡 </center>

![Image Name](https://cdn.kesci.com/upload/image/rkz1ehen1l.png?imageView2/0/w/960/h/960)


## Twitter field studies examining overperception of outrage  
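Our group addresses two research questions:  
1. Do observers tend to **overperceive** the **moral outrage** expressed by authors?  
2. Is this **overperception** related to how much observers use social media each day to learn about politics (**political social media use**)?  

* Hypothesis: People overperceive the outrage that authors convey on social media, and the more often observers use social media, the more they overperceive outrage.  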

> Brady, W. J., McLoughlin, K. L., Torres, M. P., Luo, K. F., Gendron, M., & Crockett, M. J. (2023). Overperception of moral outrage in online social networks inflates beliefs about intergroup hostility. *Nature Human Behaviour, 7*(6), Article 6. https://doi.org/10.1038/s41562-023-01582-0

In [2]:
# Import the PyMC modelling package and analysis tools such as ArviZ
import pymc as pm
import arviz as az
import seaborn as sns
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import pandas as pd
import ipywidgets
import bambi as bmb

# Suppress non-essential warnings
import warnings
warnings.filterwarnings("ignore")

## 1 Study 1: Do observers tend to overperceive authors' outrage?

### 1.1 Data import

In [10]:
# Load the Study 1 observer data
df1_raw = pd.read_csv('/home/mw/input/data6907/study1_data_raw.csv')

# Recode party as a string: 1 → Democrat, 2 → Republican, anything else → Other
df1_raw['party'] = np.where(df1_raw['party'] == 1, "Democrat", 
                               np.where(df1_raw['party'] == 2, "Republican", 'Other'))


# Data cleaning
# Drop cases with party == 'Other'
df1_raw = df1_raw[df1_raw['party'] != 'Other']

# Keep only cases that passed the comprehension check (comp_check == 1)
df1_raw = df1_raw[df1_raw['comp_check'] == 1]


# Mean perceived outrage per tweet across observers
df1_raw1 = df1_raw.filter(regex='_or').mean()
# Keep every second entry (positions 0, 2, 4, ...)
df1_raw1 = df1_raw1[0:len(df1_raw1):2]

df1_raw1 = df1_raw1.reset_index()

df1_raw1 = df1_raw1.rename(columns={0:'outrage'})


# Code the group variable: 2 → observers
df1_raw1["group"] = 2

In [11]:
# Load the authors' self-reported data
df1_raw2 = pd.read_csv('/home/mw/input/data6907/study1_self_report.csv')
df1_raw2 = df1_raw2['sr_outrage']
df1_raw2 = df1_raw2.reset_index()
df1_raw2 = df1_raw2.rename(columns={'sr_outrage':'outrage'})

# Code the group variable: 1 → authors
df1_raw2['group'] = 1

# Combine the author and observer data
df1 = pd.concat([df1_raw1, df1_raw2])
del df1['index']

In [178]:
df1

| | outrage | group |
|---|---|---|
| 0 | 3.333333 | 2 |
| 1 | 4.809524 | 2 |
| 2 | 5.619048 | 2 |
| 3 | 5.571429 | 2 |
| 4 | 4.809524 | 2 |
| ... | ... | ... |
| 128 | 1.000000 | 1 |
| 129 | 6.000000 | 1 |
| 130 | 5.000000 | 1 |
| 131 | 1.000000 | 1 |


In [12]:
# Process the Study 2 data with the same steps
df1_2_raw = pd.read_csv('/home/mw/input/data6907/study2_data_raw.csv')

# Recode party as a string: 1 → Democrat, 2 → Republican, anything else → Other
df1_2_raw['party'] = np.where(df1_2_raw['party'] == 1, "Democrat", 
                               np.where(df1_2_raw['party'] == 2, "Republican", 'Other'))

# Data cleaning
# Drop cases with party == 'Other'
df1_2_raw = df1_2_raw[df1_2_raw['party'] != 'Other']

# Keep only cases that passed the comprehension check (comp_check == 1)
df1_2_raw = df1_2_raw[df1_2_raw['comp_check'] == 1]

# Mean perceived outrage per tweet across observers
df1_2_raw1 = df1_2_raw.filter(regex='_or').mean()
df1_2_raw1 = df1_2_raw1[0:len(df1_2_raw1):2]

df1_2_raw1 = df1_2_raw1.reset_index()

df1_2_raw1 = df1_2_raw1.rename(columns={0:'outrage'})

# Code the group variable: 2 → observers
df1_2_raw1["group"] = 2

df1_2_raw2 = pd.read_csv('/home/mw/input/data6907/study2_self_report.csv')
df1_2_raw2 = df1_2_raw2['sr_outrage']
df1_2_raw2 = df1_2_raw2.reset_index()
df1_2_raw2 = df1_2_raw2.rename(columns={'sr_outrage':'outrage'})

# Code the group variable: 1 → authors
df1_2_raw2['group'] = 1

df1_2 = pd.concat([df1_2_raw1, df1_2_raw2])
del df1_2['index']

In [182]:
df1_2

| | outrage | group |
|---|---|---|
| 0 | 2.76 | 1 |
| 1 | 5.56 | 1 |
| 2 | 5.24 | 1 |
| 3 | 5.84 | 1 |
| 4 | 5.24 | 1 |
| ... | ... | ... |
| 192 | 1.00 | 2 |
| 193 | 6.00 | 2 |
| 194 | 7.00 | 2 |
| 195 | 7.00 | 2 |


In [13]:
sns.displot(df1, x="outrage", hue="group")

<seaborn.axisgrid.FacetGrid at 0x7f350e82ce20>

In [6]:
# Descriptive statistics: outrage ratings for authors (group 1) and observers (group 2)
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.boxplot(data=df1,
            x="group", 
            y="outrage",
            ax=axes[0])
axes[0].set_title('study1')

sns.boxplot(data=df1_2, 
            x="group", 
            y="outrage",
            ax=axes[1])
axes[1].set_title('study2')

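# Adjust the spacing between subplots
plt.tight_layout()

plt.show()
# The author group shows no lower whisker, possibly because the distance between the minimum and the first quartile is less than 1.5 times the interquartile range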

In [7]:
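# Group outrage by source (authors vs. observers) and compute the mean and variance within each group
df1.groupby(['group'])['outrage'].agg(['mean','var'])
# (Note: a NaN would indicate a group with no data)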

| group | mean | var |
|---|---|---|
| 1 | 3.954887 | 5.467646 |
| 2 | 4.535734 | 0.919433 |
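
As a quick sanity check on the table above, the raw gap between the two group means can be computed directly. This is only a descriptive sketch: it hard-codes the two means printed above (group 1 = authors, group 2 = observers), and the variable names are illustrative rather than taken from the notebook.

```python
# Group means copied from the groupby output above
author_mean = 3.954887    # group 1: authors' self-reported outrage
observer_mean = 4.535734  # group 2: observers' perceived outrage

# Raw difference and ratio of the means (descriptive only, no uncertainty attached)
gap = observer_mean - author_mean
ratio = observer_mean / author_mean

print(f"gap = {gap:.3f}, ratio = {ratio:.3f}")
```

The ratio is the more relevant number for the models in Section 1.2, because both use an exponential mean function and therefore describe group differences multiplicatively.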



### 1.2 Model specification  

**Model1: Normal regression (log link)**  
Do observers tend to overperceive authors' moral outrage?  

1. **Predictor:** $X_{i} \text{→ group : authors / observers}$  
2. **Outcome:** $Y_{i} \text{→ perceived outrage}$  
3. **Model:**  

$$  
\begin{equation}  
\begin{array}{lcrl}  
\text{data:} & \hspace{.05in} &   Y_{i} | \beta_{0}, \beta_{1}, \sigma & \stackrel{ind}{\sim} N\left(\mu_{i}, \sigma^2\right) \;\; \text{ with } \;\;  
\mu_{i} = e^{\beta_0 + \beta_1 X_{i}} \\  
\text{priors:} & & \beta_{0}  & \sim N(1.38, 0.5^2) \\  
                    & & \beta_1  & \sim N(0, 0.5^2) \\  
                    & & \sigma & \sim \text{Exp}(0.6) \\   
\end{array}  
\end{equation}  
$$  
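
Because the mean is modelled on the log scale, the prior on $\beta_0$ is easier to read after exponentiating. The sketch below uses only the prior values stated above. Note that with $X$ coded 1/2, $\beta_0$ alone is the log-mean at a hypothetical $X = 0$, so this is a rough guide rather than a prior on either group's mean.

```python
import math

# Prior N(1.38, 0.5^2) on beta_0, as stated in the model above
beta0_mu, beta0_sd = 1.38, 0.5

# Back-transform the prior centre and a rough +/- 2 SD band to the outcome scale
center = math.exp(beta0_mu)
low = math.exp(beta0_mu - 2 * beta0_sd)
high = math.exp(beta0_mu + 2 * beta0_sd)

print(f"prior centre ~ {center:.2f}, rough +/-2 SD band ~ [{low:.2f}, {high:.2f}]")
```

The centre lands near 4, roughly the middle of the 1 to 7 range seen in the `outrage` column, while the band is asymmetric because of the exponential transform.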


**Model1 (alternative): Poisson regression**  
Do observers tend to overperceive authors' moral outrage?  

1. **Predictor:** $X_{i} \text{→ group : authors / observers}$  
2. **Outcome:** $Y_{i} \text{→ perceived outrage}$  
3. **Model:**  

$$  
\begin{equation}  
\begin{array}{lcrl}  
\text{data:} & \hspace{.025in} & Y_i|\beta_0,\beta_1 & \stackrel{ind}{\sim} Pois\left(\lambda_i \right) \;\; \text{ with }  \;\;  
\lambda_i = e^{\beta_0 + \beta_1 X_{i}} \\  
\text{priors:} & & \beta_{0} & \sim N\left(1.38, 0.5^2 \right) \\  
&& \beta_1 & \sim \text{Gamma}(\mu = 0.001, \sigma = 0.1) \\  
 \end{array}  
\end{equation}  
$$

#### 1.2.1 Choosing priors

In [8]:
with pm.Model(coords = {"obs_id": df1.index}) as linear_model1:
    linear_model1.add_coord('obs_id',df1.index, mutable=True)
    x = pm.MutableData("x",df1.group)                     # predictor: group
    y = pm.MutableData('y', df1.outrage, dims = 'obs_id') # outcome: outrage

    beta_0 = pm.Normal("beta_0", mu=1.38, sigma=0.5)      # prior for beta_0
    beta_1 = pm.Normal("beta_1", mu=0, sigma=0.5)         # prior for beta_1
    sigma = pm.Exponential("sigma", 0.6)                  # prior for sigma

    # mu: expected outrage, i.e. the linear predictor passed through exp (log link)
    mu = pm.Deterministic("mu", pm.math.exp(beta_0 + beta_1*x), dims="obs_id")

    # Likelihood: y ~ N(mu, sigma); the observed outrage ratings are passed in via `observed`
    likelihood = pm.Normal("y_est", mu=mu, sigma=sigma, observed=y)

In [9]:
with pm.Model(coords = {"obs_id": df1.index}) as linear_model2:

    beta_0 = pm.Normal("beta_0", mu=1.38, sigma=0.5)           # prior for beta_0
    beta_1 = pm.Gamma("beta_1", mu=0.001, sigma=0.1)          # prior for beta_1

    x = pm.MutableData("x",df1.group)                     # predictor: group

    # Expected value lambda, combining the predictor with the priors
    lam = pm.Deterministic(
        "lam", 
        ##------------------------------------------------
        #  Note: pm.math.exp applies the inverse of the log link
        #------------------------------------------------
        pm.math.exp(beta_0 + beta_1*x), dims="obs_id")


    # Likelihood: y ~ Poisson(lam); the observed outrage ratings are passed in via `observed`
    likelihood = pm.Poisson("y_est", mu=lam, observed=df1.outrage)
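
PyMC's `pm.Gamma` accepts either a shape/rate pair (`alpha`, `beta`) or a mean/SD pair (`mu`, `sigma`), with `alpha = mu**2 / sigma**2` and `beta = mu / sigma**2`. The sketch below converts the `mu=0.001, sigma=0.1` prior used for `beta_1` above into shape/rate form. The helper name `gamma_shape_rate` is hypothetical.

```python
def gamma_shape_rate(mu, sigma):
    """Convert a Gamma mean/SD pair into shape (alpha) and rate (beta)."""
    alpha = mu**2 / sigma**2
    beta = mu / sigma**2
    return alpha, beta

# Prior used for beta_1 in the Poisson model above
alpha, beta = gamma_shape_rate(mu=0.001, sigma=0.1)
print(f"alpha = {alpha:.6f}, beta = {beta:.3f}")
```

A shape this far below 1 concentrates almost all prior mass next to zero while keeping a long right tail. That geometry may be one reason the sampler later reports many divergences for this model. This is an interpretation, not something the notebook tests.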

#### 1.2.2 Prior predictive check

In [10]:
# Prior predictive samples for the Normal regression model
normal_prior =pm.sample_prior_predictive(samples=50, 
                                          model=linear_model1,
                                          var_names=["mu"],
                                          random_seed=84735)
# Prior predictive samples for the Poisson regression model
poisson_prior =pm.sample_prior_predictive(samples=50, 
                                          model=linear_model2,
                                          var_names=["lam"],
                                          random_seed=84735)

Sampling: [beta_0, beta_1]
Sampling: [beta_0, beta_1]


In [11]:
fig, ax = plt.subplots(1,2,figsize=(15, 5))

# Prior predictive regression lines for the Normal model, with group as the predictor
ax[0].plot(normal_prior.constant_data["x"], 
           normal_prior.prior["mu"].stack(sample=("chain", "draw")), c="k", alpha=0.4)

# Title and axis labels
ax[0].set_title('Normal regression', fontsize=20) 
ax[0].set_xlabel('Group', fontsize=16) 
ax[0].set_ylabel('Outrage', fontsize=16) 

# Prior predictive regression lines for the Poisson model, with group as the predictor
ax[1].plot(poisson_prior.constant_data["x"], 
           poisson_prior.prior["lam"].stack(sample=("chain", "draw")), c="k", alpha=0.4)

# Title, axis labels, and plain y-axis tick format
ax[1].set_title('Poisson regression', fontsize=20) 
ax[1].set_xlabel('Group', fontsize=16) 
ax[1].set_ylabel('Outrage', fontsize=16) 
ax[1].ticklabel_format(axis='y', style='plain')

sns.despine()
plt.show()

In [155]:
pm.model_to_graphviz(linear_model1)

In [156]:
pm.model_to_graphviz(linear_model2)

#### 1.2.3 Fitting the model

In [12]:
#===========================
#     Note: the following code may take about 1-2 minutes to run
#===========================
with linear_model1:
    linear_model1_trace = pm.sample(draws=5000,                   # number of MCMC draws per chain
                      tune=1000,                    # number of tuning (warm-up) steps used to adapt the sampler
                      chains=4,                     # number of chains
                      discard_tuned_samples=True,  # tuning draws are discarded after sampling
                      random_seed=84735)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1, sigma]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 32 seconds.


In [13]:
#===========================
#     Note: the following code may take about 1-2 minutes to run
#===========================
with linear_model2:
    linear_model2_trace = pm.sample(draws=5000,                   # number of MCMC draws per chain
                      tune=1000,                    # number of tuning (warm-up) steps used to adapt the sampler
                      chains=4,                     # number of chains
                      discard_tuned_samples=True,  # tuning draws are discarded after sampling
                      random_seed=84735)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 32 seconds.
There were 13684 divergences after tuning. Increase `target_accept` or reparameterize.


### 1.3 Evaluation  

#### 1.3.1 MCMC diagnostics

In [14]:
# Trace plots for the Normal model's posterior samples
az.plot_trace(linear_model1_trace, var_names=["beta_0","beta_1"],
              figsize=(15,10),
              compact=False)
plt.show()

In [15]:
# Trace plots for the Poisson model's posterior samples
az.plot_trace(linear_model2_trace, var_names=["beta_0","beta_1"],
              figsize=(15,10),
              compact=False)
plt.show()

#### 1.3.2 Posterior predictive models

In [16]:
# Posterior predictive samples for the Normal model
with linear_model1:
    linear_model1_ppc = pm.sample_posterior_predictive(linear_model1_trace, random_seed=84735)

# Posterior predictive samples for the Poisson model
with linear_model2:
    linear_model2_ppc = pm.sample_posterior_predictive(linear_model2_trace, random_seed=84735)

Sampling: [y_est]


Sampling: [y_est]


In [17]:
az.plot_ppc(linear_model1_ppc, num_pp_samples=500)
az.plot_ppc(linear_model2_ppc, num_pp_samples=500)

plt.show()

In [18]:
from statistics import median
def MAE(model_ppc):
    # Posterior predictive mean for each observation (averaged over chains and draws)
    pre_x = model_ppc.posterior_predictive["y_est"].stack(sample=("chain", "draw"))
    pre_y_mean = pre_x.mean(axis=1).values

    # Pair each observed outrage value with its posterior predictive mean
    MAE = pd.DataFrame({
        "outrage_ppc_mean": pre_y_mean,
        "outrage_original": df1.outrage
    })

    # Absolute prediction error
    MAE["pre_error"] = abs(MAE["outrage_original"] -\
                            MAE["outrage_ppc_mean"])

    # Finally, take the median of the absolute prediction errors
    MAE = median(MAE.pre_error)
    return MAE
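
The `MAE` function above returns the median (not the mean) of the absolute prediction errors. A minimal standalone sketch of the same statistic on made-up numbers, using only the standard library; the function name `median_abs_error` and the toy values are illustrative:

```python
from statistics import median

def median_abs_error(observed, predicted):
    """Median of |observed - predicted| across observations."""
    return median([abs(o - p) for o, p in zip(observed, predicted)])

# Toy data (hypothetical): observed ratings and posterior predictive means
observed = [3.0, 5.0, 1.0, 6.0, 4.0]
predicted = [4.0, 4.5, 3.0, 4.5, 4.0]

# Absolute errors are 1.0, 0.5, 2.0, 1.5, 0.0, so the median is 1.0
result = median_abs_error(observed, predicted)
print(result)
```

Using the median makes the summary less sensitive to a few badly predicted observations than a mean absolute error would be.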

In [19]:
# Count the observations that fall outside the posterior predictive HDI
def counter_outlier(model_ppc, hdi_prob=0.95):
    # Store the az.summary output (a DataFrame) in `hdi`
    hdi = az.summary(model_ppc, kind="stats", hdi_prob=hdi_prob)
    lower = hdi.iloc[:,2].values   # lower HDI bound
    upper = hdi.iloc[:,3].values   # upper HDI bound

    # Attach the observed outrage ratings for comparison
    y_obs = model_ppc.observed_data["y_est"].values

    # Flag observed outrage values that lie outside the posterior predictive 95% HDI, then count them
    hdi["verify"] = (y_obs <= lower) | (y_obs >= upper)
    hdi["y_obs"] = y_obs
    hdi_num = sum(hdi["verify"])

    return hdi_num

In [20]:
# Plotting function for posterior predictive intervals
def plot_ppi(linear_model_ppc):
    fig, ax =  plt.subplots(figsize=(8,6))

    # Store the az.summary output (a DataFrame) in `hdi_multi`
    hdi_multi = az.summary(linear_model_ppc, hdi_prob=0.95)
    hdi_multi = hdi_multi.reset_index()
    # Attach the observed group and outrage values for comparison
    hdi_multi["x_obs"] = linear_model_ppc.constant_data["x"].values
    hdi_multi["y_obs"] = linear_model_ppc.observed_data["y_est"].values

    # Draw the 95% HDI for each observation
    HDI = ax.vlines(hdi_multi["x_obs"], 
            hdi_multi["hdi_2.5%"], hdi_multi["hdi_97.5%"], 
            color="orange", 
            alpha=0.5,
            label="95% HDI")

    # Scatter the observed values, coloured by whether they fall inside the interval
    colors = np.where((hdi_multi["y_obs"] >= hdi_multi["hdi_2.5%"]) & (hdi_multi["y_obs"] <= hdi_multi["hdi_97.5%"]), 
                    '#2F5597', '#C00000')
    ax.scatter(hdi_multi["x_obs"], hdi_multi["y_obs"],
            c = colors,
            zorder = 2)

    # Legend colours, markers, and labels
    legend_color = ['#2F5597', '#C00000']
    handles = [plt.Line2D([0], [0], 
                        marker='o', 
                        color='w', 
                        markerfacecolor=color, markersize=10) for color in legend_color]
    handles += [HDI]
    labels = ['Within HDI', 'Outside HDI','95% HDI']

    # Axis labels and title
    ax.set_xlabel('group', fontsize=14) 
    ax.set_ylabel('outrage', fontsize=14)
    fig.legend(handles=handles, labels=labels, loc='outside upper right')
    fig.suptitle('Posterior Predictive Interval', fontsize=16)
    sns.despine()

# Draw the plots
plot_ppi(linear_model1_ppc)
plot_ppi(linear_model2_ppc)

In [21]:
# Median absolute error for each model
linear_model1_MAE = MAE(linear_model1_ppc)
linear_model2_MAE = MAE(linear_model2_ppc)

# Number of observations outside the 95% HDI for each model
linear_model1_outliers = counter_outlier(linear_model1_ppc)
linear_model2_outliers = counter_outlier(linear_model2_ppc)

print(f"Normal regression (model 1) MAE: {linear_model1_MAE:.2f}")
print(f"Normal regression (model 1) outside 95% HDI: {linear_model1_outliers:.2f}")

print(f"Poisson regression (model 2) MAE: {linear_model2_MAE:.2f}")
print(f"Poisson regression (model 2) outside 95% HDI: {linear_model2_outliers:.2f}")

Normal regression (model 1) MAE: 1.09
Normal regression (model 1) outside 95% HDI: 0.00
Poisson regression (model 2) MAE: 1.25
Poisson regression (model 2) outside 95% HDI: 38.00


#### 1.3.3 Model comparison

In [22]:
with linear_model1:
   pm.compute_log_likelihood(linear_model1_trace)
with linear_model2:
   pm.compute_log_likelihood(linear_model2_trace)

In [23]:
comparison_list = {
    "model1(normal)":linear_model1_trace,
    "model2(poisson)":linear_model2_trace,
}
az.compare(comparison_list)

| | rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale |
|---|---|---|---|---|---|---|---|---|---|
| model1(normal) | 0 | -533.605303 | 2.50469 | 0.0 | 1.0 | 8.991188 | 0.0 | False | log |
| model2(poisson) | 1 | -540.472214 | 0.827994 | 6.86691 | 5.950795e-14 | 6.943151 | 2.903548 | False | log |
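
One common way to read the comparison table is to express the difference in `elpd_loo` in units of its standard error (`dse`). The sketch below only re-uses the numbers printed above. The variable names and the idea of reporting the ratio are illustrative, and any cut-off for "large enough" is a judgment call rather than a rule.

```python
# Values copied from the az.compare output above
elpd_normal = -533.605303
elpd_poisson = -540.472214
dse = 2.903548  # standard error of the elpd difference

# Difference in expected log predictive density and its size relative to dse
elpd_diff = elpd_normal - elpd_poisson
diff_in_se = elpd_diff / dse

print(f"elpd_diff = {elpd_diff:.3f}, about {diff_in_se:.2f} standard errors")
```

The difference favours the Normal model by a little over two standard errors. Since the Poisson fit also reported many divergences, its `elpd_loo` estimate is best treated with caution.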


#### 1.3.4 Prediction on new data

In [24]:
new_coords = {"obs_id": df1_2.index}

with linear_model1:
    # Swap in the Study 2 data
    pm.set_data({"x": df1_2.group,
                 "y": df1_2.outrage},
                coords=new_coords
                )   
    
    # Generate predictions for the outcome variable
    pred_trace = pm.sample_posterior_predictive(linear_model1_trace, 
                                                var_names=["y_est"],
                                                predictions=True,
                                                extend_inferencedata=True,
                                                random_seed=84735)

Sampling: [y_est]


In [25]:
pred_trace.log_likelihood["y_est"].stack(sample = ("chain", "draw"))

In [26]:
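# stack(sample=("chain", "draw")) merges the 4*5000 posterior draws for each observation into a single `sample` dimension
# Each observation has 20000 values, so averaging over dim="sample" gives one value per observation
# Note: the log_likelihood group holds pointwise log-likelihoods, so y_est below is a mean log-likelihood, not a predicted outrage score
pred_pi = pred_trace.log_likelihood["y_est"].stack(sample = ("chain", "draw")).mean(dim="sample")
# Convert to a DataFrame
pred_pi = pred_pi.to_dataframe()
pred_pi["group"] = pred_trace.constant_data.x.values
pred_pi["outrage"] = pred_trace.observed_data.y_est.values
pred_pi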

| y_est_dim_0 | y_est | group | outrage |
|---|---|---|---|
| 0 | -1.727035 | 2 | 3.333333 |
| 1 | -1.517279 | 2 | 4.809524 |
| 2 | -1.692262 | 2 | 5.619048 |
| 3 | -1.676282 | 2 | 5.571429 |
| 4 | -1.517279 | 2 | 4.809524 |
| ... | ... | ... | ... |
| 261 | -2.876427 | 1 | 1.000000 |
| 262 | -2.157538 | 1 | 6.000000 |
| 263 | -1.674382 | 1 | 5.000000 |
| 264 | -2.876427 | 1 | 1.000000 |


In [27]:
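figs, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
# Right panel: observed outrage by group
sns.regplot(x = pred_pi["group"], 
                y = pred_pi["outrage"],
                label="observed data", 
                color='#C00000',
                ax=ax2)

# Right panel: mean pointwise log-likelihood (y_est) by group
sns.regplot(x = pred_pi["group"], 
                y = pred_pi["y_est"],
                label="mean log-likelihood", 
                color='blue',
                ax=ax2)

# Left panel: Study 1 outrage by group
sns.regplot(x="group", y="outrage", data=df1,ax=ax1)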

<Axes: xlabel='group', ylabel='outrage'>

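### 1.4 Statistical inference  
The posterior summary of the Normal model shows (coefficients are on the log scale):  
- $\beta_0 = 1.242$  
- $\beta_1 = 0.134 \neq 0$, and the 94% HDI of $\beta_1$ excludes 0, indicating that the outrage observers perceive differs from the moral outrage authors report.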
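
Since the model uses $\mu_i = e^{\beta_0 + \beta_1 X_i}$ with group coded 1 (authors) and 2 (observers), the posterior means can be back-transformed to the outrage scale. The sketch below plugs in the point estimates only and ignores posterior uncertainty. The variable names are illustrative.

```python
import math

# Posterior means reported for the Normal model
beta_0, beta_1 = 1.242, 0.134

# Expected outrage for each group under mu = exp(beta_0 + beta_1 * group)
mu_authors = math.exp(beta_0 + beta_1 * 1)    # group 1
mu_observers = math.exp(beta_0 + beta_1 * 2)  # group 2

# exp(beta_1) is the multiplicative change when moving from group 1 to group 2
ratio = math.exp(beta_1)

print(f"authors ~ {mu_authors:.2f}, observers ~ {mu_observers:.2f}, ratio ~ {ratio:.3f}")
```

The back-transformed values sit close to the raw group means in Section 1.1 (about 3.95 and 4.54), and the ratio suggests observers' ratings are roughly 14% higher than authors' self-reports.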

In [28]:
az.summary(linear_model1_trace, var_names=["beta_0","beta_1"])

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta_0 | 1.242 | 0.084 | 1.087 | 1.4 | 0.001 | 0.001 | 7190.0 | 8107.0 | 1.0 |
| beta_1 | 0.134 | 0.051 | 0.037 | 0.229 | 0.001 | 0.0 | 7180.0 | 8214.0 | 1.0 |


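## 2 Study 2: What influences observers' tendency to overperceive outrage?  
Is it related to how much they use social media each day to learn about politics (**political social media use**)?

### 2.1 Data import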

In [4]:
data_source1 = pd.read_csv('/home/mw/input/data6907/study1_overperception.csv')
data_source2 = pd.read_csv('/home/mw/input/data6907/study2_overperception.csv')
data_source3 = pd.read_csv('/home/mw/input/data6907/study3_overperception.csv')
data_source1['study'] = 1
data_source2['study'] = 2
data_source3['study'] = 3

# Combine study1 and study2
merged_data = pd.concat([data_source1, data_source2], ignore_index=True)

# Keep only the columns needed for the analysis
df2_merged = merged_data[["sm_use_politics_slider","overperception","study"]]
df2_2_merged = data_source3[["sm_use_politics_slider","overperception","study"]]

# Drop rows with missing values (copy so that later column assignments do not act on a view)
df2 = df2_merged.dropna().copy()
df2_2 = df2_2_merged.dropna().copy()

In [5]:
# Build a zero-based study index
df2["study_idx"] = pd.factorize(df2.study)[0]
# Build an observation (participant) index
df2["obs_id"] = range(len(df2))
# Use study and observation id as the DataFrame index
df2.set_index(['study','obs_id'],inplace=True,drop=False)

In [8]:
sns.displot(df2, x = 'overperception', hue="study")

<seaborn.axisgrid.FacetGrid at 0x7f35183e56a0>

In [25]:
# Scatter plots with regression lines: political social media use vs. overperception, by study
sns.regplot(data=df2[df2["study"]==1], 
            x="sm_use_politics_slider", 
            y="overperception"
            )
sns.regplot(data=df2[df2["study"]==2], 
            x="sm_use_politics_slider", 
            y="overperception",
            marker="+"
            )

<Axes: xlabel='sm_use_politics_slider', ylabel='overperception'>

### 2.2 Model specification  

**Model2: Hierarchical model with varying intercepts & slopes**  
Is overperception related to how much observers use social media each day to learn about politics (political social media use)?  

![Image Name](https://cdn.kesci.com/upload/s6bks56jiq.png?imageView2/0/w/300/h/950)  

1. **Predictor:** $X_{ij1} \text{→ political social media use}$  
2. **Outcome:** $Y_{ij} \text{→ observers' overperception}$  
3. **Model summary:**  

$$  
\begin{array}{rll}  
Y_{ij} | \beta_{0j}, \beta_{1j}, \sigma_y & \sim N(\mu_{ij}, \sigma_y^2) \;\; \text{ with } \;\;  \mu_{ij} = \beta_{0j} + \beta_{1j} X_{ij1} & \text{(linear model within each study)} \\  
\beta_{0j} | \beta_0, \sigma_0  & \stackrel{ind}{\sim} N(\beta_0, \sigma_0^2) & \text{(variation of intercepts across studies)} \\  
\beta_{1j} | \beta_1, \sigma_1  & \stackrel{ind}{\sim} N(\beta_1, \sigma_1^2) & \text{(variation of slopes across studies)} \\  
\beta_{0}  & \sim N(0, 50^2) & \text{(priors on global parameters)} \\  
\beta_1  & \sim N(0, 5^2) & \\  
\sigma_0 & \sim \text{Exp}(1)    & \\  
\sigma_1 & \sim \text{Exp}(1)    & \\  
\sigma_y & \sim \text{Exp}(1).    & \\  
\end{array}  
$$  
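
The model code in Section 2.2.1 can build the study-level effects either directly (`beta_0j ~ N(beta_0, beta_0_sigma)`) or in a non-centered form (`beta_0 + offset * beta_0_sigma` with `offset ~ N(0, 1)`). Both describe the same distribution; the non-centered form is often easier for NUTS to sample when there are few groups. The sketch below checks that equivalence by simulation with the standard library only. The values of `beta_0` and `sigma_0` are hypothetical and not taken from the fitted model.

```python
import random
from statistics import mean, stdev

random.seed(0)

# Hypothetical global intercept and between-study SD
beta_0, sigma_0 = 0.5, 0.2
n = 20000

# Centered form: draw the study-level intercept directly
centered = [random.gauss(beta_0, sigma_0) for _ in range(n)]

# Non-centered form: draw a standard-normal offset, then shift and scale it
non_centered = [beta_0 + random.gauss(0, 1) * sigma_0 for _ in range(n)]

print(f"centered:     mean={mean(centered):.3f}, sd={stdev(centered):.3f}")
print(f"non-centered: mean={mean(non_centered):.3f}, sd={stdev(non_centered):.3f}")
```

The two sets of draws agree in mean and spread up to simulation noise. The reparameterization changes only the geometry the sampler explores, not the model itself.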


#### 2.2.1 Choosing priors

In [57]:
# Function that builds the model and samples from it
def run_var_both_model(non_centered = False):

    # Coordinates: study and observation index
    coords = {"study": df2.study.unique(),
            "obs_id": df2.obs_id}

    with pm.Model(coords=coords) as model:
        # Global parameters
        beta_0 = pm.Normal("beta_0", mu=0, sigma=50)
        beta_0_sigma = pm.Exponential("beta_0_sigma", 1)
        beta_1 = pm.Normal("beta_1", mu=0, sigma=5) 
        beta_1_sigma = pm.Exponential("beta_1_sigma", 1)
        
        sigma_y = pm.Exponential("sigma_y", 1) 

        # Predictor, outcome, and the study index of each observation
        x = pm.MutableData("x1", df2.sm_use_politics_slider, dims="obs_id")
        y = pm.MutableData('y', df2.overperception, dims = 'obs_id')
        study = pm.MutableData("study", df2.study_idx, dims="obs_id") 

        # Choose between the non-centered and centered parameterizations
        if non_centered:
            beta_0_offset = pm.Normal("beta_0_offset", 0, sigma=1, dims="study")
            beta_0j = pm.Deterministic("beta_0j", beta_0 + beta_0_offset * beta_0_sigma, dims="study")
            beta_1_offset = pm.Normal("beta_1_offset", 0, sigma=1, dims="study")
            beta_1j = pm.Deterministic("beta_1j", beta_1 + beta_1_offset * beta_1_sigma, dims="study")
            
        else:
            beta_0j = pm.Normal("beta_0j", mu=beta_0, sigma=beta_0_sigma, dims="study")
            beta_1j = pm.Normal("beta_1j", mu=beta_1, sigma=beta_1_sigma, dims="study")

        # Linear predictor
        mu = pm.Deterministic("mu", beta_0j[study] + beta_1j[study]*x , dims="obs_id")

        # Likelihood
        likelihood = pm.Normal("y_est", mu=mu, sigma=sigma_y, observed=y, dims="obs_id")

        trace = pm.sample(draws=5000,           # number of MCMC draws per chain
                            tune=1000,                    # number of tuning (warm-up) steps used to adapt the sampler
                            chains=4,                     # number of chains
                            discard_tuned_samples= True,  # tuning draws are discarded after sampling
                            random_seed=84735,
                            target_accept=0.99)
    
    return model, trace

#### 2.2.2 Fitting the model

In [60]:
# Note: the following code may take about 10 minutes to run

var_both_model, var_both_trace = run_var_both_model(non_centered = True)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_0_sigma, beta_1, beta_1_sigma, sigma_y, beta_0_offset, beta_1_offset]


Sampling 4 chains for 1_000 tune and 1_516 draw iterations (4_000 + 6_064 draws total) took 1065 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details
There were 755 divergences after tuning. Increase `target_accept` or reparameterize.
Chain 0 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 1 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.
Chain 2 reached the maximum tree depth. Increase `max_treedepth`, increase `target_accept` or reparameterize.


In [63]:
pm.model_to_graphviz(var_both_model)

### 2.3 Evaluation  

#### 2.3.1 MCMC diagnostics

In [67]:
# Set up the figure axes
figs, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 5))
# Forest plot of the varying intercepts
az.plot_forest(var_both_trace,
               var_names=["~mu", "~sigma", "~offset", "~beta_1"],
               filter_vars="like",
               combined=True,
               ax=ax1)

# Forest plot of the varying slopes
az.plot_forest(var_both_trace,
               var_names=["~mu", "~sigma", "~offset", "~beta_0"],
               filter_vars="like",
               combined=True,
               ax=ax2)
plt.show()

#### 2.3.2 Posterior predictive regression lines

In [109]:
df2

| study (index) | obs_id (index) | sm_use_politics_slider | overperception | study | study_idx | obs_id |
|---|---|---|---|---|---|---|
| 1 | 0 | 0.0 | 1.413793 | 1 | 0 | 0 |
| 1 | 1 | 29.0 | 2.684211 | 1 | 0 | 1 |
| 1 | 2 | 5.0 | -0.142857 | 1 | 0 | 2 |
| 1 | 3 | 5.0 | 1.736842 | 1 | 0 | 3 |
| 1 | 4 | 5.0 | 0.766667 | 1 | 0 | 4 |
| ... | ... | ... | ... | ... | ... | ... |
| 2 | 219 | 73.0 | 1.800000 | 2 | 1 | 219 |
| 2 | 220 | 17.0 | 0.966667 | 2 | 1 | 220 |
| 2 | 221 | 11.0 | -2.033333 | 2 | 1 | 221 |
| 2 | 222 | 2.0 | -0.766667 | 2 | 1 | 222 |


In [143]:
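# Store the observation indices belonging to each study so that posterior predictions can later be extracted by study
def get_group_index(data):
    group_index = {}
    for i, group in enumerate(data["study"].unique()):
        # Use the `data` argument rather than the global df2
        group_index[group] = data[data["study"]==group]["obs_id"].values
    return group_index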

study_index = get_group_index(df2)

In [144]:
study_index

{1: array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
        51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
        68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
        85, 86]),
 2: array([ 87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99,
        100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
        113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
        126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
        139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151,
        152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164,
        165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177,
        178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190,
        191,

In [148]:
# Function that plots the posterior regression line for each study
def plot_partial_regression(data, trace, group_index):
    # Create the figure with one column per study
    fig, ax = plt.subplots(1,len(data["study"].unique()), 
                       sharex=True,
                       sharey=True,
                       figsize=(15,5))
    
    # Draw one panel per study
    # We need the observed data and the posterior mean of mu for each observation
    # Both are stored in the posterior sampling result, i.e. the trace used here
    for i, group in enumerate(data["study"].unique()):
        # Scatter plot of the observed data
        x = trace.constant_data.x1.sel(obs_id = group_index[group])
        y = trace.observed_data.y_est.sel(obs_id = group_index[group])
        mu = trace.posterior.mu.sel(obs_id = group_index[group])
        ax[i].scatter(x, y,
                color=f"C{i}",
                alpha=0.5)
        # Regression line (posterior mean of mu)
        ax[i].plot(x, mu.stack(sample=("chain","draw")).mean(dim="sample"),
                color=f"C{i}",
                alpha=0.5)
        # 95% HDI of mu
        az.plot_hdi(
            x, mu,
            hdi_prob=0.95,
            fill_kwargs={"alpha": 0.25, "linewidth": 0},
            color=f"C{i}",
            ax=ax[i])
    # x-axis label
    fig.text(0.5, 0, 'social media use', ha='center', va='center', fontsize=12)
    # y-axis label
    fig.text(0.08, 0.5, 'overperception', ha='center', va='center', rotation='vertical', fontsize=12)
    # Title
    plt.suptitle("Posterior regression models", fontsize=15)
        
    sns.despine()


In [149]:
plot_partial_regression(data=df2,
                trace=var_both_trace,
                group_index=study_index)

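#### 2.3.3 Between-group and within-group variance

In [72]:
# Extract the between-group and within-group variation
# Note: between_sd sums the squared posterior means of the *_offset terms;
# the group-level SDs themselves are beta_0_sigma and beta_1_sigma
var_both_model_sum = az.summary(var_both_trace,
                        var_names=["_offset", "sigma_"],
                        filter_vars="like")
between_sd = (var_both_model_sum.filter(like='_offset', axis=0)["mean"]**2).sum()
within_sd = var_both_model_sum.loc['sigma_y', 'mean']**2
# Share of the total variation
var = between_sd + within_sd
print("Share explained by between-group variance:", between_sd/var)
print("Share explained by within-group variance:", within_sd/var)
print("Intraclass correlation:", between_sd/var)

Share explained by between-group variance: 0.044319250693031914
Share explained by within-group variance: 0.955680749306968
Intraclass correlation: 0.044319250693031914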
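
For reference, the textbook intraclass correlation for a varying-intercept model is $\sigma_0^2 / (\sigma_0^2 + \sigma_y^2)$, where $\sigma_0$ is the between-group SD of the intercepts and $\sigma_y$ the residual SD. The sketch below shows that formula on hypothetical SDs; it does not reproduce the numbers printed above, which are computed from the offset terms. With varying slopes the ratio also depends on the predictor value, so this is the simplest case only.

```python
def icc(sigma_between, sigma_within):
    """Intraclass correlation from a between-group SD and a within-group (residual) SD."""
    between_var = sigma_between**2
    within_var = sigma_within**2
    return between_var / (between_var + within_var)

# Hypothetical values, for illustration only
example_icc = icc(sigma_between=0.2, sigma_within=1.0)
print(f"ICC = {example_icc:.4f}")
```

In the notebook's notation, `sigma_between` would correspond to `beta_0_sigma` and `sigma_within` to `sigma_y`.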


#### 2.3.4 Evaluating posterior predictions

In [73]:
with var_both_model:
    var_both_ppc = pm.sample_posterior_predictive(var_both_trace, random_seed=84735)

Sampling: [y_est]


In [75]:
# Function that computes the median absolute error
from statistics import median
def MAE(model_ppc):
    # Posterior predictive mean for each observation (averaged over chains and draws)
    pre_x = model_ppc.posterior_predictive["y_est"].stack(sample=("chain", "draw"))
    pre_y_mean = pre_x.mean(axis=1).values

    # Pair each observed overperception value with its posterior predictive mean
    MAE = pd.DataFrame({
        "scontrol_ppc_mean": pre_y_mean,
        "scontrol_original": model_ppc.observed_data.y_est.values
    })

    # Absolute prediction error
    MAE["pre_error"] = abs(MAE["scontrol_original"] -\
                            MAE["scontrol_ppc_mean"])

    # Finally, take the median of the absolute prediction errors
    MAE = median(MAE.pre_error)
    return MAE

var_both_MAE = MAE(var_both_ppc)
print(var_both_MAE)

0.5922109507421386


In [76]:
# Count the observations that fall outside the posterior predictive HDI
def counter_outlier(model_ppc, hdi_prob=0.95):
    # Store the az.summary output (a DataFrame) in `hdi`
    hdi = az.summary(model_ppc, kind="stats", hdi_prob=hdi_prob)
    lower = hdi.iloc[:,2].values   # lower HDI bound
    upper = hdi.iloc[:,3].values   # upper HDI bound

    # Attach the observed overperception scores for comparison
    y_obs = model_ppc.observed_data["y_est"].values

    # Flag observed overperception scores that lie outside the posterior predictive 95% HDI, then count them
    hdi["verify"] = (y_obs <= lower) | (y_obs >= upper)
    hdi["y_obs"] = y_obs
    hdi_num = sum(hdi["verify"])

    return hdi_num

var_both_outlier = counter_outlier(var_both_ppc)
print(var_both_outlier)

13
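
To put the count above in context, it can be compared with what a nominal 95% interval would leave out by chance. The sketch assumes 224 observations, the number reported in the LOO output of this section. The variable names are illustrative.

```python
n_obs = 224       # observations in the hierarchical model (from the LOO output)
n_outside = 13    # observations outside the 95% posterior predictive HDI
hdi_prob = 0.95

# Observed share outside the interval vs. the count expected under nominal coverage
share_outside = n_outside / n_obs
expected_outside = (1 - hdi_prob) * n_obs

print(f"observed: {share_outside:.1%} outside, expected about {expected_outside:.1f} observations")
```

About 5.8% of observations fall outside the interval, close to the nominal 5% (roughly 11 observations), so the predictive intervals look reasonably calibrated on this rough check.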


#### 2.3.5 Model comparison

In [77]:
pm.compute_log_likelihood(var_both_trace, model=var_both_model)

In [78]:
az.loo(var_both_trace)

Computed from 6064 posterior samples and 224 observations log-likelihood matrix.

         Estimate       SE
elpd_loo  -297.47    12.52
p_loo        5.83        -

------

Pareto k diagnostic values:
                         Count   Pct.
(-Inf, 0.5]   (good)      222   99.1%
 (0.5, 0.7]   (ok)          1    0.4%
   (0.7, 1]   (bad)         0    0.0%
   (1, Inf)   (very bad)    1    0.4%

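### 2.4 Statistical inference  
The posterior summary shows (study1/study2):  
- $\beta_0 = 0.535/0.529$: when political social media use is 0, expected overperception is 0.535/0.529.  
- $\beta_1 = 0.013/0.002$: the 94% HDI of $\beta_1$ excludes 0 in study1 but includes 0 in study2, suggesting that in study1 political social media use is related to overperception: the more use, the greater the overperception.

Because the sampler reported divergences and the r_hat values for the intercepts (1.23 and 1.19) are well above 1.01, these estimates should be read with caution.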
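
To make the slopes concrete, the point estimates can be turned into expected overperception at low and high levels of use via $\mu = \beta_{0j} + \beta_{1j} x$. The sketch assumes the `sm_use_politics_slider` variable runs from 0 to 100, which is an assumption not stated in the notebook. It uses posterior means only and ignores uncertainty.

```python
# Posterior means of the study-specific intercepts and slopes
params = {
    "study1": {"beta_0": 0.535, "beta_1": 0.013},
    "study2": {"beta_0": 0.529, "beta_1": 0.002},
}

def expected_overperception(beta_0, beta_1, use):
    """Linear predictor mu = beta_0 + beta_1 * use."""
    return beta_0 + beta_1 * use

# Compare expected overperception at slider values 0 and 100 (assumed range)
predictions = {
    name: (expected_overperception(**p, use=0), expected_overperception(**p, use=100))
    for name, p in params.items()
}
for name, (at_0, at_100) in predictions.items():
    print(f"{name}: use=0 -> {at_0:.3f}, use=100 -> {at_100:.3f}")
```

Under these point estimates, moving across the assumed slider range raises expected overperception by about 1.3 points in study1 but only about 0.2 points in study2.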

In [174]:
az.summary(var_both_trace,
           var_names=["beta_0j","beta_1j"],
           filter_vars="like")

| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|
| beta_0j[1] | 0.535 | 0.089 | 0.358 | 0.716 | 0.001 | 0.001 | 1689.0 | 1984.0 | 1.23 |
| beta_0j[2] | 0.529 | 0.077 | 0.395 | 0.687 | 0.011 | 0.008 | 52.0 | 3586.0 | 1.19 |
| beta_1j[1] | 0.013 | 0.003 | 0.007 | 0.019 | 0.0 | 0.0 | 3380.0 | 3319.0 | 1.01 |
| beta_1j[2] | 0.002 | 0.003 | -0.004 | 0.008 | 0.0 | 0.0 | 2268.0 | 2051.0 | 1.01 |


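## 3 Summary and discussion

#### 3.1 Results  

- Descriptively, authors' self-reported moral outrage is widely dispersed, whereas observers' mean perceived outrage is higher than authors' and more tightly clustered;  
- In the first model, the results support the hypothesis that observers overperceive the moral outrage authors express.  
- In the second model, to explore what influences observers' tendency to overperceive outrage, we examined how much observers use social media each day to learn about politics (**political social media use**). The results show that in study1 political social media use is related to overperception: the more use, the greater the overperception.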