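# 1141_Social Data Analysis Final Exam
Student: 劉晏成

This exam simulates a real academic research setting and uses quantitative analysis to answer the research question: "Do the political liberalism attitudes of New Taipei City residents differ by sex, age, years of education, and personal income?"

Variables used:
* Sex (a1)
* Age (age)
* Age group (agegroup)
* Years of education (eduy)
* Income (inc)
* Political liberalism attitude (P_liberal) (constructed from the 9 items g4a to g4i)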

In [1]:
from load import load_sav
import pandas as pd

df = load_sav("../data/final.sav")

In [2]:
variable_value_labels = df.attrs['variable_value_labels']
column_names = df.attrs['column_names']
column_names_to_labels = df.attrs['column_names_to_labels']

def print_labels(key):
    variable_labels = variable_value_labels.get(key, None)
    column_labels = column_names_to_labels.get(key, None)

    print(f"variable_labels: {variable_labels}")
    print(f"column_labels: {column_labels}")

In [3]:
df_final = df[["a1", "age", 'agegroup', 'eduy', 'inc', 'P_liberal']].copy().rename(columns = { 'a1': 'sex' })

df_final

Unnamed: 0,sex,age,agegroup,eduy,inc,P_liberal
0,2.0,30.0,1.0,16.0,25000.0,2.857143
1,1.0,31.0,1.0,16.0,35000.0,2.857143
2,2.0,33.0,1.0,16.0,25000.0,2.714286
3,1.0,24.0,1.0,12.0,35000.0,2.285714
4,2.0,59.0,3.0,12.0,5000.0,2.714286
...,...,...,...,...,...,...
863,2.0,58.0,3.0,6.0,0.0,3.285714
864,2.0,38.0,2.0,12.0,0.0,2.857143
865,2.0,68.0,3.0,6.0,0.0,3.285714
866,1.0,39.0,2.0,12.0,35000.0,3.428571


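## Q1: After controlling for years of education, what is the partial correlation between income and political liberalism attitude? How does it differ from the zero-order correlation before controlling?

First, choose the method for computing the correlation coefficient. Because none of the three variables is ordinal (all are treated as continuous), the Pearson correlation coefficient is used as the main statistic.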

In [4]:
df_1 = df_final[["eduy", "inc", 'P_liberal']].copy()

Check for missing values.

In [5]:
from statistic.missing_value import check_missing_value

check_missing_value(df_1)

Unnamed: 0,# of Missing Values
eduy,0
inc,4
P_liberal,8


The data contain missing values. Because they would affect the correlation coefficients, rows with missing values are dropped.

In [6]:
df_1_cleared = df_1.dropna()
check_missing_value(df_1_cleared)

Unnamed: 0,# of Missing Values
eduy,0
inc,0
P_liberal,0


In [7]:
from statistic.correlation import zero_order_correlation_matrix, partial_correlation_matrix

Inspect the zero-order correlation.

In [8]:
# Use the frame from the previous step, with missing rows already dropped
zero_order_correlation_matrix(df_1_cleared, ['inc', 'P_liberal'], method='pearson')

Unnamed: 0,Variable,Statistic,inc,P_liberal
0,inc,Correlation,1.0,0.107
1,inc,Significance (2-tailed),.,0.002
2,inc,df,0,854
3,P_liberal,Correlation,0.107,1.0
4,P_liberal,Significance (2-tailed),0.002,.
5,P_liberal,df,854,0


Inspect the partial correlation.

In [9]:
# Same cleaned frame, now controlling for eduy
partial_correlation_matrix(df_1_cleared, ['inc', 'P_liberal'], 'eduy', method='pearson')

Unnamed: 0,Control Variable,Variable,Statistic,inc,P_liberal
0,eduy,inc,Correlation,1.0,0.056
1,eduy,inc,Significance (2-tailed),.,0.101
2,eduy,inc,df,0,853
3,eduy,P_liberal,Correlation,0.056,1.0
4,eduy,P_liberal,Significance (2-tailed),0.101,.
5,eduy,P_liberal,df,853,0


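### Answer
First, compare the zero-order and partial correlation coefficients:
1. The zero-order correlation between inc and P_liberal is 0.107
2. The partial correlation between inc and P_liberal, controlling for eduy, is 0.056

Next, test the zero-order and partial correlations  
$H_0$: no correlation ($\rho = 0$)  
$H_1$: a correlation exists ($\rho \neq 0$)  
$\alpha = 0.05$

Two-tailed significance values:
1. Zero-order correlation: 0.002
2. Partial correlation: 0.101

The zero-order correlation, computed without years of education, is statistically significant but weak. After controlling for years of education, the correlation between income and political liberalism attitude is no longer significant.

Before years of education is controlled, the zero-order test rejects the null hypothesis, which indicates a statistically significant correlation between income and political liberalism attitude. After controlling for years of education, the two are no longer significantly correlated. This suggests that years of education acts as a confounding variable that influences both variables. The zero-order correlation may therefore be spurious, so the data do not support a correlation between income and political liberalism attitude once education is taken into account.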
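
To make the comparison concrete, the first-order partial correlation can be written in terms of the three pairwise Pearson correlations, $r_{xy\cdot z} = \dfrac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$, and a correlation can be tested with $t = r\sqrt{df / (1 - r^2)}$. The sketch below is not the code behind `partial_correlation_matrix`, whose internals are not shown in this report. The function names are illustrative, the two correlations with `eduy` are hypothetical placeholders because the report does not list them, and the p-value uses a normal approximation to the t distribution, which is reasonable here only because df exceeds 850.

```python
import math
from statistics import NormalDist


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def approx_two_tailed_p(r, df):
    """Return (t, p) for H0: rho = 0, with t = r * sqrt(df / (1 - r^2)).

    The p-value uses a normal approximation to the t distribution,
    which is only reasonable for large df.
    """
    t = r * math.sqrt(df / (1 - r**2))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Coefficients and df as reported in the two correlation tables.
t_zero, p_zero = approx_two_tailed_p(0.107, 854)
t_part, p_part = approx_two_tailed_p(0.056, 853)
print(f"zero-order: t = {t_zero:.2f}, p = {p_zero:.3f}")
print(f"partial:    t = {t_part:.2f}, p = {p_part:.3f}")

# HYPOTHETICAL correlations with eduy, for illustration only;
# the report does not list r(inc, eduy) or r(P_liberal, eduy).
r_inc_eduy, r_lib_eduy = 0.35, 0.20
print(f"illustrative partial r = {partial_r(0.107, r_inc_eduy, r_lib_eduy):.3f}")
```

With the reported coefficients and degrees of freedom, the approximation gives p-values of roughly 0.002 and 0.101, in line with the significance values in the tables.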

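## Q2: With political liberalism attitude as the dependent variable, years of education as the independent variable, and age as the moderator, run a simultaneous regression and test whether the effect of years of education on political liberalism attitude is moderated by age.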

In [10]:
df_2 = df_final[["eduy", "age", 'P_liberal']].copy()

Check for missing values.

In [11]:
from statistic.missing_value import check_missing_value

check_missing_value(df_2)

Unnamed: 0,# of Missing Values
eduy,0
age,0
P_liberal,8


The data contain missing values. Because they would affect the regression results, rows with missing values are dropped.

In [12]:
df_2_cleared = df_2.dropna()
check_missing_value(df_2_cleared)

Unnamed: 0,# of Missing Values
eduy,0
age,0
P_liberal,0


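(Every analysis in this report is affected by missing values, so missing-value handling is not described again below. Unless noted otherwise, rows with missing values are dropped before each analysis.)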

In [13]:
from statistic.linear_regression import moderator_analysis

reports_2 = moderator_analysis(X = df_2_cleared['eduy'], Y = df_2_cleared['P_liberal'], Z = df_2_cleared['age'])

First, test the interaction by inspecting the regression $Y = b_1X + b_2Z + b_3XZ + a$.

In [14]:
reports_2['interaction_summary']

0,1,2,3
Dep. Variable:,P_liberal,R-squared:,0.049
Model:,OLS,Adj. R-squared:,0.045
Method:,Least Squares,F-statistic:,14.63
Date:,"Sat, 27 Dec 2025",Prob (F-statistic):,2.66e-09
Time:,15:14:41,Log-Likelihood:,-358.01
No. Observations:,860,AIC:,724.0
Df Residuals:,856,BIC:,743.1
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.3899,0.167,14.350,0.000,2.063,2.717
eduy,0.0384,0.011,3.468,0.001,0.017,0.060
age,0.0043,0.003,1.444,0.149,-0.002,0.010
eduy*age,-0.0004,0.000,-1.815,0.070,-0.001,3.15e-05

0,1,2,3
Omnibus:,24.508,Durbin-Watson:,1.973
Prob(Omnibus):,0.0,Jarque-Bera (JB):,25.895
Skew:,0.405,Prob(JB):,2.38e-06
Kurtosis:,3.257,Cond. No.,7890.0


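The interaction term has a p-value of 0.070. Because p > 0.05, the interaction effect is not statistically significant.

Next, drop the interaction term and inspect the regression $Y = b_1X + b_2Z + a$.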

In [15]:
reports_2['main_effects_summary']

0,1,2,3
Dep. Variable:,P_liberal,R-squared:,0.045
Model:,OLS,Adj. R-squared:,0.043
Method:,Least Squares,F-statistic:,20.25
Date:,"Sat, 27 Dec 2025",Prob (F-statistic):,2.55e-09
Time:,15:14:41,Log-Likelihood:,-359.66
No. Observations:,860,AIC:,725.3
Df Residuals:,857,BIC:,739.6
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.6486,0.086,30.700,0.000,2.479,2.818
eduy,0.0196,0.004,4.983,0.000,0.012,0.027
age,-0.0008,0.001,-0.762,0.446,-0.003,0.001

0,1,2,3
Omnibus:,25.432,Durbin-Watson:,1.984
Prob(Omnibus):,0.0,Jarque-Bera (JB):,26.926
Skew:,0.416,Prob(JB):,1.42e-06
Kurtosis:,3.244,Cond. No.,325.0


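### Answer
Years of education and age show no significant interaction effect on political liberalism attitude, so the relationship among years of education, age, and political liberalism attitude is reported from the main-effects model.

The model's R-squared is 0.045 and the model p-value is below 0.05 (Prob (F-statistic) = 2.55e-09), so the model is statistically significant.

Looking at years of education and age separately, only years of education has p < 0.05 (coefficient 0.0196). Under this model, only years of education is significantly associated with political liberalism attitude.

**Extension**

Because age neither moderates the relationship nor is independently associated with the dependent variable, a follow-up analysis could drop age.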
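
The interaction model reports a much larger condition number (7890) than the main-effects model (325). One common reason for such a jump is that a raw product term is strongly correlated with its own components. Mean-centering the predictors before forming the product is a frequently used remedy, and it leaves the interaction coefficient itself unchanged. The sketch below illustrates this on synthetic data with NumPy; the data-generating values are made up, and it does not reproduce `moderator_analysis`, whose internals are not shown here.

```python
import numpy as np


def fit_ols(X, y):
    """Return OLS coefficients for y ~ X, where X already has a constant column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def design(x, z):
    """Design matrix with columns [1, x, z, x*z]."""
    return np.column_stack([np.ones_like(x), x, z, x * z])


# Synthetic data (NOT the survey data): education, age, and an outcome.
rng = np.random.default_rng(0)
n = 500
eduy = rng.integers(6, 19, size=n).astype(float)
age = rng.integers(20, 71, size=n).astype(float)
y = 2.5 + 0.03 * eduy + 0.002 * age - 0.0004 * eduy * age + rng.normal(0, 0.35, n)

X_raw = design(eduy, age)
X_centered = design(eduy - eduy.mean(), age - age.mean())

b_raw = fit_ols(X_raw, y)
b_centered = fit_ols(X_centered, y)

# The interaction coefficient (last entry) is the same in both fits;
# only the lower-order coefficients and the conditioning change.
print("interaction, raw predictors:     ", b_raw[3])
print("interaction, centered predictors:", b_centered[3])
print("condition number, raw:     ", np.linalg.cond(X_raw))
print("condition number, centered:", np.linalg.cond(X_centered))
```

After centering, the coefficient on `eduy` is read as the effect of education at the mean age rather than at age 0, which is usually the more interpretable quantity.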

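## Q3: With political liberalism attitude as the dependent variable and sex, age group, years of education, and income as independent variables, run a simultaneous regression
1. Discuss which independent variable helps most in explaining an individual's political liberalism attitude
2. Check whether this regression model has a multicollinearity problem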

In [16]:
df_3 = df_final[["sex", "agegroup", 'eduy', 'inc', 'P_liberal']].copy()

In [17]:
df_3_cleared = df_3.dropna()

Run a multiple linear regression of the dependent variable P_liberal on the independent variables sex, agegroup, eduy, and inc, using standardized variables.

In [18]:
from statistic.linear_regression import simultaneous_linear_regression_model
from sklearn.preprocessing import StandardScaler


X_1 = df_3_cleared[['sex', 'agegroup', 'eduy', 'inc']]
Y_1 = df_3_cleared['P_liberal']

scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_std = pd.DataFrame(
    scaler_X.fit_transform(X_1), 
    columns=X_1.columns, 
    index=X_1.index
)

Y_std = pd.Series(
    scaler_y.fit_transform(Y_1.values.reshape(-1, 1)).ravel(),
    index=Y_1.index
)

model = simultaneous_linear_regression_model(Y_std, X_std)
model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.049
Model:,OLS,Adj. R-squared:,0.045
Method:,Least Squares,F-statistic:,10.97
Date:,"Sat, 27 Dec 2025",Prob (F-statistic):,1.12e-08
Time:,15:14:41,Log-Likelihood:,-1193.1
No. Observations:,856,AIC:,2396.0
Df Residuals:,851,BIC:,2420.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-8.934e-16,0.033,-2.67e-14,1.000,-0.066,0.066
sex,-0.0214,0.034,-0.625,0.532,-0.089,0.046
agegroup,-0.0340,0.038,-0.889,0.374,-0.109,0.041
eduy,0.1806,0.039,4.574,0.000,0.103,0.258
inc,0.0545,0.035,1.535,0.125,-0.015,0.124

0,1,2,3
Omnibus:,24.537,Durbin-Watson:,1.966
Prob(Omnibus):,0.0,Jarque-Bera (JB):,25.927
Skew:,0.407,Prob(JB):,2.34e-06
Kurtosis:,3.254,Cond. No.,1.82


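Because the model p-value is far below 0.05, the model is statistically significant, which means at least one coefficient is associated with the dependent variable.

Inspecting the individual variables shows that sex, age group, and income all have p-values above 0.05 and are not significantly associated with the dependent variable.

### Answer 1:
Setting aside the three non-significant independent variables, only years of education is significantly associated with political liberalism attitude, with a standardized coefficient of 0.1806. It is therefore the independent variable that helps most in explaining political liberalism attitude.

To assess multicollinearity, first obtain the condition index (CI) of each dimension.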

In [19]:
from statistic.linear_regression import condition_index_analysis
import statsmodels.api as sm


condition_index_analysis(X_1)

Unnamed: 0,Dimension,Eigenvalue,Condition_Index,Variance_Proportion_const,Variance_Proportion_sex,Variance_Proportion_agegroup,Variance_Proportion_eduy,Variance_Proportion_inc
0,1,4.373048,1.0,0.223324,0.20651,0.197398,0.213102,0.159667
1,2,0.377209,3.404877,0.023996,0.111109,0.078671,0.000117,0.786108
2,3,0.16356,5.170752,0.007007,0.08631,0.640681,0.251108,0.014893
3,4,0.071382,7.827044,0.038186,0.56682,0.018941,0.33673,0.039323
4,5,0.014802,17.188443,0.707487,0.029251,0.064309,0.198943,9e-06


Next, obtain the VIF of each independent variable.

In [20]:
from statistic.linear_regression import vif_analysis

vif_analysis(X_1)

Unnamed: 0,Variable,VIF
0,sex,8.08564
1,agegroup,4.575297
2,eduy,8.404758
3,inc,2.738275


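### Answer 2
In the CI table, no dimension has a CI above 100, or even above 30 (the largest is 17.19), so multicollinearity in the overall model is within an acceptable range.

In the VIF table, no independent variable has VIF > 10 (the largest is 8.40, for eduy), which again indicates that multicollinearity is acceptable.

However, the multiple regression above already shows that only years of education is significantly associated with the dependent variable, so the next step should be to examine whether moderation, mediation, or confounding exists among the independent variables and the dependent variable.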
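
For reference, the VIF of a predictor is commonly defined as $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on the remaining predictors with an intercept. The sketch below implements that definition on synthetic data with NumPy. Whether `vif_analysis` follows the same definition is an assumption, since its source is not shown in this report, and the variable names here are made up.

```python
import numpy as np


def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)


# Synthetic predictors (NOT the survey data): x3 is built to be nearly collinear with x1.
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.3 * rng.normal(size=n)

v = vif(np.column_stack([x1, x2, x3]))
for name, value in zip(["x1", "x2", "x3"], v):
    print(f"{name}: VIF = {value:.2f}")
```

In this construction the independent predictor `x2` stays close to 1, while `x1` and `x3` inflate each other, which is the pattern the VIF > 10 rule of thumb is meant to flag.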

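## Q4: Follow the steps below to create the "political liberalism level" variable.
1. Recode P_liberal into a new variable, "political liberalism level" (variable name liberal_3gp): scores of 1 to 2.6 become 1 (value label "low"), scores of 2.7 to 3.0 become 2 ("medium"), and scores of 3.1 and above become 3 ("high");
2. List the frequency distribution of the "political liberalism level" variable.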

In [21]:
df_4 = df_final[["sex", "agegroup", 'eduy', 'inc', 'P_liberal']].copy()

In [22]:
# Right-closed bins: [0, 2.6] -> 1 (low), (2.6, 3.0] -> 2 (medium), (3.0, inf) -> 3 (high)
df_4['liberal_3gp'] = pd.cut(df_4['P_liberal'],
                              bins=[0, 2.6, 3.0, float('inf')],
                              labels=[1, 2, 3],
                              include_lowest=True)
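
The recoding rule is stated with one-decimal boundaries (up to 2.6, 2.7 to 3.0, 3.1 and above), but P_liberal is an item mean and can fall between those boundaries. With the bin edges `[0, 2.6, 3.0, inf]` and right-closed intervals, such in-between scores go to the next group up. The sketch below mirrors that interval logic with the standard library so the edge cases are easy to check. The function name and the sample scores are illustrative, and the scores are assumed not to be missing.

```python
from bisect import bisect_left

# Upper bounds of the "low" and "medium" groups, matching the pd.cut edges.
EDGES = [2.6, 3.0]
LABELS = {1: "low", 2: "medium", 3: "high"}


def liberal_group(score):
    """Map a score to 1/2/3 using right-closed intervals, as pd.cut does with right=True."""
    return bisect_left(EDGES, score) + 1


# Illustrative scores, including values exactly on and just above the edges.
samples = [2.285714, 2.6, 2.666667, 3.0, 3.055556, 3.285714]
groups = [liberal_group(s) for s in samples]
for s, g in zip(samples, groups):
    print(f"{s:.6f} -> {g} ({LABELS[g]})")
```

A score exactly equal to 2.6 or 3.0 stays in the lower group, while anything above the edge, however slightly, moves up.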

In [23]:
df_4_cleared = df_4.dropna()

Result of recoding according to the rules:

In [24]:
df_4_cleared

Unnamed: 0,sex,agegroup,eduy,inc,P_liberal,liberal_3gp
0,2.0,1.0,16.0,25000.0,2.857143,2
1,1.0,1.0,16.0,35000.0,2.857143,2
2,2.0,1.0,16.0,25000.0,2.714286,2
3,1.0,1.0,12.0,35000.0,2.285714,1
4,2.0,3.0,12.0,5000.0,2.714286,2
...,...,...,...,...,...,...
863,2.0,3.0,6.0,0.0,3.285714,3
864,2.0,2.0,12.0,0.0,2.857143,2
865,2.0,3.0,6.0,0.0,3.285714,3
866,1.0,2.0,12.0,35000.0,3.428571,3


List the frequency distribution of liberal_3gp.

In [25]:
df_4_cleared['liberal_3gp'].value_counts().sort_index()

liberal_3gp
1    226
2    407
3    223
Name: count, dtype: int64

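## Q5: Follow the steps below to analyze whether "sex" and "political liberalism level" are significantly associated
1. Build the crosstab of "sex" and "political liberalism level" (cells show percentages and adjusted standardized residuals);
2. Run a chi-square analysis;
3. Report the results (if the chi-square result is significant, also report the residual analysis).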

In [26]:
df_5 = df_4_cleared[["sex", 'liberal_3gp']].copy()

In [27]:
from statistic.categorical_data import crosstab_with_residuals

report_5 = crosstab_with_residuals(row_series = df_5['liberal_3gp'], col_series = df_5['sex'])

First, inspect the crosstab of counts.

In [28]:
report_5['ct']

liberal_3gp,sex=1.0,sex=2.0
1,101,125
2,201,206
3,109,114


Column percentages within each sex

In [29]:
report_5['ct_col_pct']

liberal_3gp,sex=1.0,sex=2.0
1,24.574209,28.089888
2,48.905109,46.292135
3,26.520681,25.617978


In [30]:
report_5['adjusted_residuals']

liberal_3gp,sex=1.0,sex=2.0
1,-1.165796,1.165796
2,0.764808,-0.764808
3,0.300628,-0.300628


Chi-square test result

In [31]:
report_5['chi2_table']

Unnamed: 0,Value
chi2,1.373906
p_value,0.503107
dof,2.0


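### Answer
1. The crosstabs of counts, percentages within sex, and adjusted standardized residuals are shown above.
2. The chi-square test gives p = 0.503 > 0.05, so the null hypothesis cannot be rejected: there is no significant association between the two categorical variables. Because the result is not significant, no residual analysis is reported.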
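
As a cross-check of the reported values, the Pearson chi-square statistic and the adjusted standardized residuals can be recomputed directly from the counts in the crosstab. The sketch below uses only the standard library. It is not the code behind `crosstab_with_residuals`, the function names are illustrative, and the p-value helper relies on a closed form that holds only for even degrees of freedom.

```python
import math


def chi_square_table(observed):
    """Pearson chi-square, dof, and adjusted standardized residuals for an r x c table."""
    row_tot = [sum(row) for row in observed]
    col_tot = [sum(col) for col in zip(*observed)]
    n = sum(row_tot)
    chi2 = 0.0
    adjusted = []
    for i, row in enumerate(observed):
        adj_row = []
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
            se = math.sqrt(e * (1 - row_tot[i] / n) * (1 - col_tot[j] / n))
            adj_row.append((o - e) / se)
        adjusted.append(adj_row)
    dof = (len(row_tot) - 1) * (len(col_tot) - 1)
    return chi2, dof, adjusted


def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-square probability; this closed form is valid for even dof only."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("closed form requires a positive, even dof")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(dof // 2))


# Counts copied from the crosstab (rows: liberal_3gp 1-3, columns: sex 1-2).
observed = [[101, 125], [201, 206], [109, 114]]
chi2, dof, adjusted = chi_square_table(observed)
p_value = chi2_sf_even_dof(chi2, dof)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.3f}")
for row in adjusted:
    print([round(v, 2) for v in row])
```

Run on these counts, the sketch reproduces the reported chi2 of about 1.374 with p of about 0.503, and the same adjusted residuals as the table.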

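## Q6: Follow the steps below to analyze whether "age group" and "political liberalism level" are significantly associated
1. Build the crosstab of "age group" and "political liberalism level" (cells show percentages and adjusted standardized residuals);
2. Run a chi-square analysis and a Gamma analysis;
3. Report the results, including the significance of the association, its direction (positive or negative), and its strength.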

In [32]:
df_6 = df_4_cleared[["agegroup", 'liberal_3gp']].copy()

In [33]:
from statistic.categorical_data import crosstab_with_residuals

report_6 = crosstab_with_residuals(row_series = df_6['liberal_3gp'], col_series = df_6['agegroup'])

Compute the crosstabs of counts, percentages within age group, and adjusted standardized residuals.

Counts

In [34]:
report_6['ct']

liberal_3gp,agegroup=1.0,agegroup=2.0,agegroup=3.0
1,72,85,69
2,152,164,91
3,107,74,42


Column percentages within each age group

In [35]:
report_6['ct_col_pct']

liberal_3gp,agegroup=1.0,agegroup=2.0,agegroup=3.0
1,21.752266,26.315789,34.158416
2,45.92145,50.773994,45.049505
3,32.326284,22.910217,20.792079


Adjusted standardized residuals

In [36]:
report_6['adjusted_residuals']

liberal_3gp,agegroup=1.0,agegroup=2.0,agegroup=3.0
1,-2.450397,-0.044476,2.861155
2,-0.756053,1.471847,-0.813082
3,3.32121,-1.63,-1.948377


Run the chi-square test

In [37]:
report_6['chi2_table']

Unnamed: 0,Value
chi2,16.841638
p_value,0.002075
dof,4.0


Run the Gamma test

In [38]:
from statistic.categorical_data.rank_correlation import rank_correlation_measures

rank_correlation_measures(df_6, 'liberal_3gp', 'agegroup')

Unnamed: 0,Measure,Value,ASE,Approx_T,Approx_Sig
0,Goodman & Kruskal's Gamma,-0.181796,0.094435,-1.925089,0.054218
1,Kendall's Tau-c,-0.113779,0.030114,-3.778322,0.000158


The Gamma test's significance value is above 0.05, so it is not significant. The next step inspects how heavily the data are tied.

In [39]:
df_6.value_counts()

agegroup  liberal_3gp
2.0       2              164
1.0       2              152
          3              107
3.0       2               91
2.0       1               85
          3               74
1.0       1               72
3.0       1               69
          3               42
Name: count, dtype: int64

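### Answer
1. The crosstabs of counts, percentages within age group, and adjusted standardized residuals are shown above.
2. The chi-square test gives p = 0.002 < 0.05, so the null hypothesis is rejected: the two categorical variables are significantly associated. The adjusted residuals show that the youngest group is over-represented in the high level (3.32) and under-represented in the low level (-2.45), while the oldest group is over-represented in the low level (2.86).
3. The Gamma test gives p = 0.054 > 0.05 and is not significant. Inspecting the ties shows that in each of the three age groups roughly half of the respondents (45% to 51%) fall into a single category (liberal_3gp = 2), which suggests that ties are heavy. Kendall's Tau-c is therefore examined, and its p-value (0.000158) is far below 0.05. On this basis, age group and political liberalism level are judged to be significantly associated. Both coefficients are negative (Gamma = -0.182, Tau-c = -0.114), so older age groups tend toward lower political liberalism levels, and the small magnitudes indicate a weak association.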
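
To show where the two ordinal coefficients come from, Gamma and Tau-c can be recomputed from the counts in the crosstab by counting concordant and discordant pairs. The sketch below uses only the standard library and covers the point estimates and the share of tied pairs, not the standard errors or significance tests. It is not the code behind `rank_correlation_measures`, and the function name is illustrative.

```python
def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered r x c table."""
    rows, cols = len(table), len(table[0])
    c = d = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):  # only cells in lower rows
                for l in range(cols):
                    if l > j:  # lower row, column to the right: concordant
                        c += table[i][j] * table[k][l]
                    elif l < j:  # lower row, column to the left: discordant
                        d += table[i][j] * table[k][l]
    return c, d


# Counts copied from the crosstab (rows: liberal_3gp 1-3, columns: agegroup 1-3).
table = [[72, 85, 69], [152, 164, 91], [107, 74, 42]]

c, d = concordant_discordant(table)
n = sum(map(sum, table))
m = min(len(table), len(table[0]))

gamma = (c - d) / (c + d)  # ignores tied pairs entirely
tau_c = 2 * m * (c - d) / (n**2 * (m - 1))  # scales by the full sample size
tied_share = 1 - (c + d) / (n * (n - 1) / 2)

print(f"C = {c}, D = {d}")
print(f"gamma = {gamma:.6f}, tau_c = {tau_c:.6f}")
print(f"share of tied pairs = {tied_share:.3f}")
```

On these counts the sketch reproduces the reported Gamma of about -0.182 and Tau-c of about -0.114, and it shows that roughly 58% of all respondent pairs are tied on at least one variable, which is the situation the ties check in the answer refers to.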