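## Assignment 2

**1. Task:**

* Build a logistic regression model of romantic relationship status (the dependent variable) using multiple predictors (two or more, of your own choosing)

**2. Requirements:**
    Repeat the workflow from class, including:

    1. Model definition (with a basic written explanation; see the format reference below)
    
    2. MCMC sampling, diagnostic plots, and interpretation of the posterior parameters
        * Draw the diagnostic plots with az.plot_trace
        * Explain in words what the posterior parameter estimates mean in the model
        * Plot the posterior regression model (using az.plot_hdi)

    3. Use the fitted model to predict outcomes for a new site and evaluate the predictions
        * Using a 50-50 classification cutoff, compute the accuracy, sensitivity, and specificity of the predictions, with a brief written explanation
    
    4. Evaluate the predictions for the data set used to fit the model
        * Using a 50-50 classification cutoff, compute the accuracy, sensitivity, and specificity of the predictions, with a brief written explanation
    
    5. Compare your model with the models from class (log_model1: predictor is avoidance; log_model2: predictor is sex)

**3. Deadline: December 19**

**4. Submit the assignment on the 和鲸 (Heywhale) platform**


* Note 1: All written explanations only need to be added to the notebook; no separate document is required

* Note 2: Implement the requirements above using pymc syntax

* Note 3: Watch out for missing values during data preprocessing

* Note 4: All members of a group use the same site data, but the assignment is submitted individually; discussion within the group is allowed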
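
Note 3 matters in practice because recoding with `np.where` before dropping missing values silently turns `NaN` into a valid category. A minimal sketch on a small made-up data frame (the values are hypothetical; only the column name follows the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 1 = "yes", 2 = "no", NaN = missing
toy = pd.DataFrame({"romantic": [1.0, 2.0, np.nan, 2.0]})

# Recoding first: NaN == 2 is False, so the missing value is coded as 1 ("yes")
recoded_first = np.where(toy["romantic"] == 2, 0, 1)

# Dropping missing values first keeps only real observations
clean = toy.dropna().copy()
clean["romantic"] = np.where(clean["romantic"] == 2, 0, 1)

print(recoded_first.tolist())      # four values; the NaN row was coded as 1
print(clean["romantic"].tolist())  # three values; the NaN row was removed
```

Calling `dropna()` before any recoding step avoids this problem.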

| Group | Site used to fit the model | Site used for predicting new data |
|---------|------------------------------|-----------------------|
| 1 | METU   | UCSB  |  
| 2 | Oxford | Poland |  
| 3 | Serbia | VCU |  
| 4 | VCU |Oxford |  
| 5 | UCSB |Serbia |  


**Format reference for the written model definition:**

1. Predictors: xx

2. Dependent variable: xx

3. Data model:

$$  
\begin{array}{lcrl}  
\text{data:} & \hspace{.01in} & Y_i|\beta_0,\beta_1 & {\sim} \text{Bern}(\pi_i) \;\; \text{ with } \;\; \pi_i = \frac{e^{\beta_0 + \beta_1 X_{i1}}}{1 + e^{\beta_0 + \beta_1 X_{i1}}} \\  
\text{priors:} & & \beta_{0}  &  \sim N()  \\  
               & & \beta_1  & \sim N(). \\  
\end{array}  
$$  


In [1]:
# Import the pymc modeling package and analysis tools such as arviz
import pymc as pm
import arviz as az
import seaborn as sns
import scipy.stats as st
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
import pandas as pd
import ipywidgets

# Suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")


In [2]:
# Load Data_Sum_HPP_Multi_Site_Share.csv with pd.read_csv
df_raw = pd.read_csv('/home/mw/input/bayes20238001/Data_Sum_HPP_Multi_Site_Share.csv')

# Select the data from the VCU site
df = df_raw[df_raw["Site"] == "VCU"]

In [3]:
# Overview of the data
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 10)
df

Unnamed: 0,age,anxiety,anxiety_r,artgluctot,attachhome,attachphone,AvgHumidity,avgtemp,avoidance,avoidance_r,cigs,didf,eatdrink,eot,exercise,gluctot,health,hiTemp,kamf,langfamily,language,mintemp,monogamous,networksize,nostalgia,onlineid,riskAvd,romantic,scontrol,sex,Site,smoke,socialdiversity,socialembedded,socTherm,soliTherm,stress
1332,1966.0,6.000000,2.132340,0.0,4.555556,1.777778,92.0,36.361111,3.222222,0.099036,,3.363636,1.0,3.2,2.0,0.0,4.0,1.428571,6.000000,1,5,10.000000,7.0,22,36,1.000000,2.000000,2.0,41,2.0,VCU,2.0,9,2,4.4,4.625,34
1333,1969.0,3.611111,0.295121,0.0,4.777778,1.444444,71.0,37.055556,2.222222,-0.982006,10.0,2.454545,1.0,2.2,2.0,4.0,3.0,3.571429,2.571429,1,5,14.444444,7.0,17,42,2.363636,3.000000,1.0,41,2.0,VCU,1.0,8,2,3.8,3.000,43
1334,1971.0,4.611111,1.064189,0.0,4.888889,3.444444,34.0,36.333333,3.166667,0.038978,,2.181818,1.0,1.8,2.0,76.0,3.0,3.142857,4.428571,1,5,8.888889,7.0,29,38,2.818182,2.666667,2.0,49,2.0,VCU,2.0,8,4,3.6,3.250,35
1335,1975.0,2.277778,-0.730304,0.0,4.888889,3.222222,91.0,36.250000,2.111111,-1.102122,,1.272727,1.0,1.4,2.0,14.0,2.0,2.142857,3.428571,1,5,14.444444,7.0,26,31,2.909091,3.000000,1.0,34,1.0,VCU,2.0,8,3,4.8,4.500,44
1336,1978.0,2.611111,-0.473948,0.0,4.444444,3.333333,91.0,36.416667,3.666667,0.579499,,3.090909,1.0,2.6,2.0,6.0,2.0,3.142857,3.714286,1,5,14.444444,7.0,53,22,2.818182,3.333333,2.0,37,1.0,VCU,2.0,7,6,2.0,3.500,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1478,1997.0,3.444444,0.166943,0.0,3.666667,3.111111,99.0,36.000000,4.333333,1.300193,,2.818182,1.0,2.6,2.0,0.0,4.0,2.714286,3.857143,1,5,11.111111,4.0,9,30,2.909091,2.000000,2.0,41,1.0,VCU,2.0,5,1,2.8,3.875,43
1479,1997.0,4.666667,1.106915,0.0,3.666667,3.000000,60.0,36.722222,3.555556,0.459383,,2.727273,1.0,2.2,2.0,12.0,4.0,3.714286,4.285714,1,5,11.111111,6.0,22,33,2.818182,3.333333,2.0,42,2.0,VCU,2.0,5,3,3.2,2.750,46
1480,,2.222222,-0.773030,0.0,2.111111,4.111111,91.0,35.900000,4.722222,1.720599,,3.454545,,3.2,,0.0,3.0,2.285714,5.571429,1,5,14.444444,,18,6,3.909091,2.666667,,45,,VCU,,7,2,2.2,3.625,38
1481,,0.000000,-2.482070,,4.666667,2.888889,92.0,35.944444,2.777778,-0.381427,,2.363636,,2.0,,,4.0,3.142857,3.142857,1,5,10.000000,,0,30,2.181818,3.333333,,42,,VCU,,0,1,3.2,3.750,31


In [4]:
# Select variables
df = df[["romantic", "attachhome", "attachphone"]]

# Drop rows with missing values
df = df.dropna()

# Recode the outcome: 1 = "yes"; 0 = "no"
df["romantic"] =  np.where(df['romantic'] == 2, 0, 1)

# Reset the index
df["index"] = range(len(df))
df = df.set_index("index")
df

Unnamed: 0_level_0,romantic,attachhome,attachphone
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,4.555556,1.777778
1,1,4.777778,1.444444
2,0,4.888889,3.444444
3,1,4.888889,3.222222
4,0,4.444444,3.333333
...,...,...,...
143,0,5.000000,2.888889
144,0,5.000000,3.888889
145,0,5.000000,3.777778
146,0,3.666667,3.111111


In [5]:
# Bar chart of "romantic"
sns.countplot(data=df, x='romantic')
plt.xlabel('romantic')
plt.ylabel('Count')
plt.title('Distribution of romantic')
plt.show()

# Scatter plot of "attachhome" against "romantic"
sns.scatterplot(data=df, x='attachhome', y='romantic', alpha=0.6)
plt.xlabel('attachhome')
plt.ylabel('romantic')
plt.yticks([0, 1], ['no', 'yes'])
plt.show()

# Scatter plot of "attachphone" against "romantic"
sns.scatterplot(data=df, x='attachphone', y='romantic', alpha=0.6)
plt.xlabel('attachphone')
plt.ylabel('romantic')
plt.yticks([0, 1], ['no', 'yes'])
plt.show()


# Model definition
1. Predictors: attachhome, attachphone

2. Dependent variable: romantic

3. Data model:
$$ \begin{array}{lcrl}\text{data:} & \hspace{.01in} & Y_i|\beta_0,\beta_1,\beta_2 & \stackrel{ind}{\sim}\text{Bern}(\pi_i)\;\;\text{ with }\;\;\pi_i=\frac{e^{\beta_0+\beta_1X_{i1}+\beta_2X_{i2}}}{1+e^{\beta_0+\beta_1X_{i1}+\beta_2X_{i2}}} \\ \text{priors:} &  & \beta_0 & \sim N\left(0,0.5^2\right) \\  &  & \beta_1 & \sim N\left(0,0.5^2\right) \\& & \beta_2 & \sim N\left(0,0.5^2\right) \\{}\end{array}  $$
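
As a quick numerical check of the link function in the model above, the sketch below evaluates $\pi_i$ in both algebraically equivalent forms. The coefficient and predictor values are illustrative assumptions, not estimates from the data.

```python
import math

def inv_logit(eta):
    """Logistic sigmoid: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Illustrative (assumed) values, not fitted estimates
beta_0, beta_1, beta_2 = 0.0, 0.5, -0.5
x1, x2 = 4.0, 2.0

eta = beta_0 + beta_1 * x1 + beta_2 * x2      # linear predictor, here 1.0
pi_a = math.exp(eta) / (1.0 + math.exp(eta))  # form used in the model definition
pi_b = inv_logit(eta)                         # equivalent form 1 / (1 + exp(-eta))
print(round(pi_a, 4), round(pi_b, 4))
```

Both expressions give the same probability, which is why the model code can use an inverse-logit function directly.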

In [6]:
coords = {"obs_id": df.index}
with pm.Model() as log_model3:
    # The coordinate is registered with add_coord (instead of being passed as coords)
    # so that the dimension is mutable; new data will be swapped in later for prediction
    log_model3.add_coord('obs_id',df.index, mutable=True)
    attachhome = pm.MutableData("attachhome", df.attachhome, dims="obs_id")
    attachphone = pm.MutableData("attachphone", df.attachphone, dims="obs_id")
    y = pm.MutableData('y', df.romantic, dims = 'obs_id')

    # Priors
    beta_0 = pm.Normal("beta_0", mu=0, sigma=0.5)          # intercept
    beta_1 = pm.Normal("beta_1", mu=0, sigma=0.5)           # slope for attachhome
    beta_2 = pm.Normal("beta_2", mu=0, sigma=0.5)           # slope for attachphone
    # Linear predictor
    mu = pm.Deterministic("mu", beta_0 + beta_1 * attachhome + beta_2 * attachphone , dims="obs_id")
    # Apply the logistic sigmoid function, pm.math.invlogit,
    # which computes 1 / (1 + exp(-mu))
    pi = pm.Deterministic("pi", pm.math.invlogit(mu), dims="obs_id")
    # Likelihood
    likelihood = pm.Bernoulli("y_est",p=pi, observed=y,dims="obs_id")

In [7]:
# Visualize the model (graph of the Bayesian model's variables)
# pymc's built-in `model_to_graphviz` method draws the dependencies among the variables in the model.
pm.model_to_graphviz(log_model3)

# MCMC sampling

In [38]:
with log_model3:
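    # Approximate the posterior distribution with MCMC
    log_model3_trace = pm.sample(
                                draws=5000,                   # number of posterior draws per chain
                                tune=1000,                    # number of tuning (warm-up) steps used to adapt the sampler
                                chains=4,                     # number of chains
                                discard_tuned_samples= True,  # discard the tuning samples once sampling finishes
                                random_seed=84735)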

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1, beta_2]


Sampling 4 chains for 1_000 tune and 8_000 draw iterations (4_000 + 32_000 draws total) took 64 seconds.


In [39]:
az.plot_trace(log_model3_trace,
              var_names=["beta_0","beta_1","beta_2"],
              figsize=(15,8),
              compact=False)
plt.show()

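# Interpreting the posterior parameters

The summary below (posterior means) shows:
- $\beta_0 = 0.229$, so $e^{\beta_0} \approx 1.26$: when X1 and X2 are both 0, the odds of being in a romantic relationship are about 1.26, i.e. a probability of about 0.56. (attachhome and attachphone both have a minimum of 1, so this is an extrapolation beyond the observed range.)
- $\beta_1 = -0.089$, so $e^{\beta_1} \approx 0.91$: holding attachphone constant, each one-unit increase in attachment to home multiplies the odds of being in a relationship by about 0.91.
- $\beta_2 = 0.098$, so $e^{\beta_2} \approx 1.10$: holding attachhome constant, each one-unit increase in attachment to phone multiplies the odds of being in a relationship by about 1.10.
- However, the 94% HDIs of both $\beta_1$ and $\beta_2$ include 0, so the data give no clear evidence that either predictor is related to the probability of being in a relationship.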
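
The conversion from log-odds coefficients to odds and odds ratios can be reproduced by hand. The sketch below plugs in the posterior means reported in the `az.summary` table; note that exponentiating a posterior mean is only a point summary and is not identical to the mean of the exponentiated draws.

```python
import math

# Posterior means taken from the az.summary table
post_mean = {"beta_0": 0.229, "beta_1": -0.089, "beta_2": 0.098}

# exp(beta) turns a log-odds coefficient into odds (intercept) or an odds ratio (slopes)
odds = {name: math.exp(value) for name, value in post_mean.items()}
for name, value in odds.items():
    print(f"exp({name}) = {value:.3f}")

# Convert the baseline odds (both predictors at 0) into a probability
p_baseline = odds["beta_0"] / (1 + odds["beta_0"])
print(f"baseline probability = {p_baseline:.3f}")
```

An odds ratio below 1 (as for `beta_1`) means the odds shrink as the predictor increases; a value above 1 means they grow.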

In [10]:
az.summary(log_model3_trace, var_names=["beta_0","beta_1","beta_2"])

Unnamed: 0,mean,sd,hdi_3%,hdi_97%,mcse_mean,mcse_sd,ess_bulk,ess_tail,r_hat
beta_0,0.229,0.453,-0.61,1.089,0.005,0.004,9406.0,8888.0,1.0
beta_1,-0.089,0.145,-0.352,0.194,0.002,0.001,8665.0,8538.0,1.0
beta_2,0.098,0.173,-0.222,0.425,0.002,0.001,9392.0,8366.0,1.0


In [11]:
# Transform the beta parameters with np.exp
az.plot_posterior(log_model3_trace, var_names=["beta_0","beta_1","beta_2"], transform = np.exp)
plt.show()

# **Plotting the posterior predictive regression line**

In [40]:
log_model3_trace

In [52]:
# Plot the 95% HDI of the relationship probability at each value of the predictor
az.plot_hdi(
    df.attachhome,
    log_model3_trace.posterior.pi,
    hdi_prob=0.95,
    fill_kwargs={"alpha": 0.25, "linewidth": 0},
    color="C1"
)
# Compute the posterior mean probability for each observation and connect the means with sns.lineplot
post_mean = log_model3_trace.posterior.pi.mean(("chain", "draw"))
sns.lineplot(x = df.attachhome, 
             y= post_mean, 
             label="posterior mean", 
             color="C1")
# Scatter plot of the observed data
sns.scatterplot(x = df.attachhome, 
                y= df.romantic,label="observed data", 
                color='#C00000', 
                alpha=0.5)
# Set the legend position
plt.legend(loc="upper right",
           bbox_to_anchor=(1.5, 1),
           fontsize=12)
sns.despine()

In [14]:
# Plot the 95% HDI of the relationship probability at each value of the predictor
az.plot_hdi(
    df.attachphone,
    log_model3_trace.posterior.pi,
    hdi_prob=0.95,
    fill_kwargs={"alpha": 0.25, "linewidth": 0},
    color="C1"
)
# Compute the posterior mean probability for each observation and connect the means with sns.lineplot
post_mean = log_model3_trace.posterior.pi.mean(("chain", "draw"))
sns.lineplot(x = df.attachphone, 
             y= post_mean, 
             label="posterior mean", 
             color="C1")
# Scatter plot of the observed data
sns.scatterplot(x = df.attachphone, 
                y= df.romantic,label="observed data", 
                color='#C00000', 
                alpha=0.5)
# Set the legend position
plt.legend(loc="upper right",
           bbox_to_anchor=(1.5, 1),
           fontsize=12)
sns.despine()

In [47]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.cm import ScalarMappable

fig = plt.figure(figsize=(30, 30))
ax = fig.add_subplot(111, projection='3d')

# Stack chains and draws into one "sample" dimension; values has dims (obs_id, sample)
values = log_model3_trace.posterior["pi"].stack(sample=("chain", "draw"))

for i in range(100):
    ax.scatter(xs=log_model3_trace.constant_data["attachhome"],
               ys=log_model3_trace.constant_data["attachphone"],
               zs=values[:, i], 
               c=values[:, i],
               cmap="jet",
               alpha=0.4)

# Set the axis labels
ax.set_xlabel('attachhome')
ax.set_ylabel('attachphone')
ax.set_zlabel('romantic')

# Create the colorbar
cax = fig.add_axes([0.95, 0.1, 0.03, 0.8])  # position and size of the colorbar
sm = ScalarMappable(cmap="jet")
sm.set_array(values)
fig.colorbar(sm, cax=cax)
cax.set_ylabel('pi value')

plt.show()

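# Predicting and evaluating results for a new site
Using a 50-50 classification cutoff, compute the accuracy, sensitivity, and specificity of the predictions, with a brief written explanation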
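
The three metrics follow directly from the four cells of a confusion matrix. A minimal sketch with a hypothetical helper, `classification_metrics`, applied to made-up labels and predicted probabilities:

```python
def classification_metrics(y_true, p_hat, cutoff=0.5):
    """Accuracy, sensitivity and specificity at a probability cutoff (hypothetical helper)."""
    y_pred = [1 if p >= cutoff else 0 for p in p_hat]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # all correct classifications
        "sensitivity": tp / (tp + fn),                # share of actual 1s predicted as 1
        "specificity": tn / (tn + fp),                # share of actual 0s predicted as 0
    }

# Made-up example data
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
p_hat = [0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.55, 0.9]
metrics = classification_metrics(y_true, p_hat)
print(metrics)
```

Accuracy counts both kinds of correct classification (true positives and true negatives), so a model that predicts 1 for almost everyone can have high sensitivity and very low specificity at the same time.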

In [16]:
# Select the data from the Oxford site
dfox = df_raw[df_raw["Site"] == "Oxford"]
dfox = dfox[["romantic", "attachhome", "attachphone"]]

# Drop rows with missing values
dfox = dfox.dropna()

# Recode the outcome: 1 = "yes"; 0 = "no"
dfox["romantic"] =  np.where(dfox['romantic'] == 2, 0, 1)

# Reset the index
dfox["index"] = range(len(dfox))
dfox = dfox.set_index("index")
dfox

Unnamed: 0_level_0,romantic,attachhome,attachphone
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,1,2.222222,1.000000
1,1,3.333333,1.777778
2,1,4.111111,2.555556
3,1,3.777778,2.555556
4,0,4.000000,3.000000
...,...,...,...
132,0,2.777778,3.666667
133,1,3.555556,2.666667
134,1,3.888889,2.666667
135,0,4.000000,3.000000


In [17]:
pred_coords ={"obs_id":range(len(dfox))}  # one coordinate per row of the new data

with log_model3:
    # Pass in the new data
    pm.set_data({"attachhome":dfox["attachhome"],
                "attachphone":dfox["attachphone"],
                "y": dfox["romantic"]},
                coords=pred_coords
                )   
    # Generate predictions of the dependent variable
    prediction1 = pm.sample_posterior_predictive(log_model3_trace, 
                                                var_names=["y_est"],
                                                predictions=True,
                                                extend_inferencedata=True,
                                                random_seed=84735)

Sampling: [y_est]


In [18]:
prediction1

In [19]:
# Extract the predicted values stored in predictions
y_pred = prediction1.predictions["y_est"].stack(sample=("chain","draw","obs_id")).values

In [20]:
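# stack(sample=("chain", "draw")) merges the 4*5000 posterior predictive values of each observation into one dimension, sample
# For each observation we need the mean of its 20000 values, so dim is set to "sample"
pred_pi1 = prediction1.predictions.y_est.stack(sample = ("chain", "draw")).mean(dim="sample")
# Convert to a data frame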
pred_pi1 = pred_pi1.to_dataframe()

In [21]:
pred_pi1

Unnamed: 0_level_0,y_est
obs_id,Unnamed: 1_level_1
0,0.53395
1,0.53285
2,0.52850
3,0.53650
4,0.54300
...,...
132,0.58590
133,0.53895
134,0.53350
135,0.54355


In [22]:
# Store X and Y from the original data in the data frame
pred_pi1["attachhome"] = dfox.attachhome.values
pred_pi1["attachphone"] = dfox.attachphone.values
pred_pi1["romantic"] = dfox.romantic.values

# Generate the final classification using the 50-50 cutoff
pred_pi1["romantic_pred"] = np.where(pred_pi1["y_est"] >= 0.5, 1, 0)
pred_pi1

In [23]:
confusion_matrix1 = pd.crosstab(pred_pi1["romantic"], pred_pi1["romantic_pred"], 
                              rownames=['Actual'], colnames=['Predicted'])
confusion_matrix1

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,8,61
1,3,65


In [24]:
# Extract the four cells of the confusion matrix
true_positive = confusion_matrix1.at[1,1 ]
false_positive = confusion_matrix1.at[0, 1]
true_negative = confusion_matrix1.at[0, 0]
false_negative = confusion_matrix1.at[1, 0]
# Plug into the formulas (accuracy counts both kinds of correct classification)
accuracy = (true_positive + true_negative) /(true_positive + false_positive + true_negative + false_negative)
sensitivity = (true_positive) /(true_positive + false_negative)
specificity = (true_negative) / (true_negative + false_positive)

print(f"Accuracy: {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")

Accuracy: 0.533
Sensitivity: 0.956
Specificity: 0.116


# Evaluating predictions on the data set used for fitting

In [25]:
coords = {"obs_id": df.index}

with pm.Model() as log_model3:
    attachhome = pm.MutableData("attachhome", df.attachhome, dims="obs_id")
    attachphone = pm.MutableData("attachphone", df.attachphone, dims="obs_id")
    y = pm.MutableData('y', df.romantic, dims = 'obs_id')

    # Priors
    beta_0 = pm.Normal("beta_0", mu=0, sigma=0.5)          # intercept
    beta_1 = pm.Normal("beta_1", mu=0, sigma=0.5)           # slope for attachhome
    beta_2 = pm.Normal("beta_2", mu=0, sigma=0.5)           # slope for attachphone
    # Linear predictor
    mu = pm.Deterministic("mu", beta_0 + beta_1 * attachhome + beta_2 * attachphone , dims="obs_id")
    # Apply the logistic sigmoid function, pm.math.invlogit,
    # which computes 1 / (1 + exp(-mu))
    pi = pm.Deterministic("pi", pm.math.invlogit(mu), dims="obs_id")
    # Likelihood
    likelihood = pm.Bernoulli("y_est",p=pi, observed=y,dims="obs_id")

    log_model3_trace = pm.sample(draws=5000,                 
                                tune=1000,                  
                                chains=4,                     
                                discard_tuned_samples= True, 
                                random_seed=84735)
    log_model3_ppc = pm.sample_posterior_predictive(log_model3_trace)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1, beta_2]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 43 seconds.
Sampling: [y_est]


In [26]:
az.plot_ppc(log_model3_ppc, num_pp_samples=50)

<Axes: xlabel='y_est / y_est'>

In [27]:
log_model3_ppc.posterior_predictive.y_est.stack(sample = ("chain", "draw"))

In [28]:
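# stack(sample=("chain", "draw")) merges the 4*5000 posterior predictive values of each observation into one dimension, sample
# For each observation we need the mean of its 20000 values, so dim is set to "sample"
pred_pi2 = log_model3_ppc.posterior_predictive.y_est.stack(sample = ("chain", "draw")).mean(dim="sample")
# Convert to a data frame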
pred_pi2 = pred_pi2.to_dataframe()

In [29]:
# Store X and Y from the original data in the data frame
pred_pi2["attachhome"] = log_model3_ppc.constant_data.attachhome.values
pred_pi2["attachphone"] = log_model3_ppc.constant_data.attachphone.values
pred_pi2["romantic"] = log_model3_ppc.observed_data.y_est.values

# Generate the final classification using the 50-50 cutoff
pred_pi2["romantic_pred"] = np.where(pred_pi2["y_est"] >= 0.5, 1, 0)
pred_pi2

Unnamed: 0_level_0,y_est,attachhome,attachphone,romantic,romantic_pred
obs_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
0,0.49955,4.555556,1.777778,0,0
1,0.48285,4.777778,1.444444,1,0
2,0.52815,4.888889,3.444444,0,1
3,0.53175,4.888889,3.222222,1,1
4,0.53885,4.444444,3.333333,0,1
...,...,...,...,...,...
143,0.52160,5.000000,2.888889,0,1
144,0.53865,5.000000,3.888889,0,1
145,0.54085,5.000000,3.777778,0,1
146,0.55425,3.666667,3.111111,0,1


In [30]:
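# Build the confusion matrix with `pd.crosstab`; the first two arguments are the columns to cross-tabulate
# Because we want a 2x2 table, we also name the rows and the columns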
confusion_matrix2 = pd.crosstab(pred_pi2["romantic"], pred_pi2["romantic_pred"], 
                              rownames=['Actual'], colnames=['Predicted'])
confusion_matrix2

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3,64
1,2,79


In [31]:
# Extract the four cells of the confusion matrix
true_positive = confusion_matrix2.at[1,1 ]
false_positive = confusion_matrix2.at[0, 1]
true_negative = confusion_matrix2.at[0, 0]
false_negative = confusion_matrix2.at[1, 0]
# Plug into the formulas (accuracy counts both kinds of correct classification)
accuracy = (true_positive + true_negative) /(true_positive + false_positive + true_negative + false_negative)
sensitivity = (true_positive) /(true_positive + false_negative)
specificity = (true_negative) / (true_negative + false_positive)

print(f"Accuracy: {accuracy:.3f}")
print(f"Sensitivity: {sensitivity:.3f}")
print(f"Specificity: {specificity:.3f}")

Accuracy: 0.554
Sensitivity: 0.975
Specificity: 0.045


# Comparison with the models from class

In [32]:
# Select the data from the VCU site
df = df_raw[df_raw["Site"] == "VCU"]

# Select variables
df = df[["romantic", "attachhome", "attachphone","avoidance_r","sex"]]

# Drop rows with missing values before recoding; otherwise np.where would code a missing sex as 1
df = df.dropna()

# Recode sex: 0 = male, 1 = female
df["sex"]=np.where(df["sex"] == 1, 0, 1)

# Recode the outcome: 1 = "yes"; 0 = "no"
df["romantic"] =  np.where(df['romantic'] == 2, 0, 1)

# Reset the index
df["index"] = range(len(df))
df = df.set_index("index")


In [34]:
with pm.Model() as model1:
    model1.add_coord('obs_id',df.index, mutable=True)
    avoidance = pm.MutableData("avoidance", df.avoidance_r, dims="obs_id")
    y = pm.MutableData('y', df.romantic, dims = 'obs_id')
    # Priors
    beta_0 = pm.Normal("beta_0", mu=0, sigma=0.5)          # intercept
    beta_1 = pm.Normal("beta_1", mu=0, sigma=0.5)           # slope for avoidance
    # Linear predictor
    mu = pm.Deterministic("mu", beta_0 + beta_1 * avoidance, dims="obs_id")
    # Apply the logistic sigmoid function, pm.math.invlogit,
    # which computes 1 / (1 + exp(-mu))
    pi = pm.Deterministic("pi", pm.math.invlogit(mu), dims="obs_id")
    # Likelihood
    likelihood = pm.Bernoulli("y_est",p=pi, observed=y,dims="obs_id")

with pm.Model() as model2:
    model2.add_coord('obs_id',df.index, mutable=True)
    sex = pm.MutableData("sex", df.sex, dims="obs_id")
    y = pm.MutableData('y', df.romantic, dims = 'obs_id')
    # Priors
    beta_0 = pm.Normal("beta_0", mu=0, sigma=0.5)          # intercept
    beta_1 = pm.Normal("beta_1", mu=0, sigma=0.5)           # slope for sex
    # Linear predictor
    mu = pm.Deterministic("mu", beta_0 + beta_1 * sex, dims="obs_id")
    # Apply the logistic sigmoid function, pm.math.invlogit,
    # which computes 1 / (1 + exp(-mu))
    pi = pm.Deterministic("pi", pm.math.invlogit(mu), dims="obs_id")
    # Likelihood
    likelihood = pm.Bernoulli("y_est",p=pi, observed=y,dims="obs_id")

with pm.Model() as model3:
    model3.add_coord('obs_id',df.index, mutable=True)
    attachhome = pm.MutableData("attachhome", df.attachhome, dims="obs_id")
    attachphone = pm.MutableData("attachphone", df.attachphone, dims="obs_id")
    y = pm.MutableData('y', df.romantic, dims = 'obs_id')
    # Priors
    beta_0 = pm.Normal("beta_0", mu=0, sigma=0.5)          # intercept
    beta_1 = pm.Normal("beta_1", mu=0, sigma=0.5)           # slope for attachhome
    beta_2 = pm.Normal("beta_2", mu=0, sigma=0.5)           # slope for attachphone
    # Linear predictor
    mu = pm.Deterministic("mu", beta_0 + beta_1 * attachhome + beta_2 * attachphone , dims="obs_id")
    # Apply the logistic sigmoid function, pm.math.invlogit,
    # which computes 1 / (1 + exp(-mu))
    pi = pm.Deterministic("pi", pm.math.invlogit(mu), dims="obs_id")
    # Likelihood
    likelihood = pm.Bernoulli("y_est",p=pi, observed=y,dims="obs_id")

In [35]:
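with model1:
    model1_trace = pm.sample(draws=5000,                   # number of posterior draws per chain
                        tune=1000,                   # number of tuning (warm-up) steps used to adapt the sampler
                        chains=4,                    # number of chains
                        discard_tuned_samples= True, # discard the tuning samples once sampling finishes
                        idata_kwargs={"log_likelihood": True},  # store the log-likelihood for model comparison
                        random_seed=84735) 
with model2:
    model2_trace = pm.sample(draws=5000,                   # number of posterior draws per chain
                        tune=1000,                   # number of tuning (warm-up) steps used to adapt the sampler
                        chains=4,                    # number of chains
                        discard_tuned_samples= True, # discard the tuning samples once sampling finishes
                        idata_kwargs={"log_likelihood": True},  # store the log-likelihood for model comparison
                        random_seed=84735)  
with model3:
    model3_trace = pm.sample(draws=5000,            # number of posterior draws per chain
                      tune=1000,                    # number of tuning (warm-up) steps used to adapt the sampler
                      chains=4,                     # number of chains
                      discard_tuned_samples= True,  # discard the tuning samples once sampling finishes
                      idata_kwargs={"log_likelihood": True},  # store the log-likelihood for model comparison
                      random_seed=84735)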

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 18 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 22 seconds.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta_0, beta_1, beta_2]


Sampling 4 chains for 1_000 tune and 5_000 draw iterations (4_000 + 20_000 draws total) took 43 seconds.


In [36]:
az.loo(model1_trace)
az.loo(model2_trace)
az.loo(model3_trace)

comparison_list = {
    "model1":model1_trace,
    "model2":model2_trace,
    "model3":model3_trace,
}
az.compare(comparison_list)

Unnamed: 0,rank,elpd_loo,p_loo,elpd_diff,weight,se,dse,warning,scale
model2,0,-103.07251,1.409928,0.0,1.0,1.253286,0.0,False,log
model1,1,-103.606955,1.853957,0.534445,0.0,1.211354,0.807676,False,log
model3,2,-103.673285,2.061836,0.600775,1.110223e-16,1.20042,0.875106,False,log


In [None]:
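Model 2 has the highest elpd_loo, so sex may predict a person's relationship status slightly better than the other predictors. However, the elpd differences from model 2 (0.53 for model 1 and 0.60 for model 3) are smaller than their standard errors (0.81 and 0.88), so the three models are practically indistinguishable in predictive performance.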
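
One rough way to judge whether the ranking is meaningful is to compare each `elpd_diff` with its standard error `dse`. The sketch below re-enters the values from the comparison table; the threshold of about two standard errors is a common heuristic assumed here, not a formal test.

```python
# elpd_diff and dse relative to the top-ranked model (model2), copied from the az.compare table
comparison = {
    "model1": {"elpd_diff": 0.534445, "dse": 0.807676},
    "model3": {"elpd_diff": 0.600775, "dse": 0.875106},
}

ratios = {}
for name, row in comparison.items():
    ratios[name] = row["elpd_diff"] / row["dse"]
    clearly_worse = ratios[name] > 2  # assumed heuristic threshold
    print(f"{name}: elpd_diff / dse = {ratios[name]:.2f}, clearly worse than model2: {clearly_worse}")
```

Both ratios are below 1, which suggests that the differences in expected predictive accuracy are within the noise of the estimate.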