# 1141_Social Data Analysis: Assignment 6
Student: 劉晏成

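Answer the following questions using the file from the "2018 National Taipei University Department of Sociology Sanying Survey" (brilliant_working.sav):

1. Run a simultaneous regression with income (inc) as the dependent variable and sex (sex), years of education (eduy), interpersonal network (net), and sociability (sociability) as independent variables.
2. Discuss which independent variable contributes most to explaining individual income.
3. Check whether this regression model has a multicollinearity problem.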

In [1]:
from load import load_sav
import pandas as pd

df = load_sav("../data/brilliant_working.sav")

In [2]:
variable_value_labels = df.attrs['variable_value_labels']
column_names = df.attrs['column_names']
column_names_to_labels = df.attrs['column_names_to_labels']

In [3]:
df_1 = df[['sex', 'eduy', 'net', 'sociability', 'inc']].copy()

Check for missing values.

In [4]:
from statistic.missing_value import check_missing_value

check_missing_value(df_1)

| Variable | # of Missing Values |
|---|---|
| sex | 0 |
| eduy | 0 |
| net | 0 |
| sociability | 3 |
| inc | 4 |
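
`check_missing_value` is a local helper whose source is not shown here. A minimal sketch of an equivalent check, assuming it simply counts `NaN` entries per column (the function name `count_missing` and the toy data are hypothetical):

```python
import numpy as np
import pandas as pd

def count_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the number of missing values in each column."""
    counts = frame.isna().sum()
    return counts.rename("# of Missing Values").to_frame()

# Toy data standing in for the survey columns (not the real data).
toy = pd.DataFrame({
    "eduy": [12, 16, 9, 14],
    "sociability": [15.0, np.nan, 17.0, 12.0],
    "inc": [30000.0, np.nan, np.nan, 45000.0],
})
missing = count_missing(toy)
print(missing)

# Listwise deletion: keep only rows with no missing value in any column.
toy_cleared = toy.dropna()
print(len(toy_cleared))
```

Listwise deletion with `dropna()` removes a row if any selected column is missing, so the number of dropped rows can be smaller than the sum of the per-column counts when one respondent is missing several items.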


In [5]:
df_1_cleared = df_1.dropna()
check_missing_value(df_1_cleared)

| Variable | # of Missing Values |
|---|---|
| sex | 0 |
| eduy | 0 |
| net | 0 |
| sociability | 0 |
| inc | 0 |


In [6]:
from statistic.linear_regression import simultaneous_linear_regression_report

X_1 = df_1_cleared[['sex', 'eduy', 'net', 'sociability']]
Y_1 = df_1_cleared['inc']
report = simultaneous_linear_regression_report(Y_1, X_1)
report

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | inc | R-squared: | 0.118 |
| Model: | OLS | Adj. R-squared: | 0.114 |
| Method: | Least Squares | F-statistic: | 28.67 |
| Date: | Sat, 22 Nov 2025 | Prob (F-statistic): | 2.19e-22 |
| Time: | 15:51:45 | Log-Likelihood: | -10049.0 |
| No. Observations: | 861 | AIC: | 20110.0 |
| Df Residuals: | 856 | BIC: | 20130.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -1.179e+04 | 7894.761 | -1.494 | 0.136 | -2.73e+04 | 3703.611 |
| sex | 1.224e+04 | 1953.880 | 6.267 | 0.000 | 8409.981 | 1.61e+04 |
| eduy | 2066.2973 | 263.778 | 7.833 | 0.000 | 1548.569 | 2584.025 |
| net | 877.2699 | 414.863 | 2.115 | 0.035 | 63.001 | 1691.539 |
| sociability | 601.7687 | 458.046 | 1.314 | 0.189 | -297.256 | 1500.794 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 230.004 | Durbin-Watson: | 1.972 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 577.731 |
| Skew: | 1.39 | Prob(JB): | 3.53e-126 |
| Kurtosis: | 5.894 | Cond. No. | 179.0 |


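## Q1: Run a simultaneous regression with income (inc) as the dependent variable and sex (sex), years of education (eduy), interpersonal network (net), and sociability (sociability) as independent variables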

### Answer
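First, $R^2$ is 0.118, so the model has weak explanatory power: it accounts for about 11.8% of the variance in income.

Next, consider the p-value of the F statistic.  

$H_0$: all slope coefficients are zero (the overall regression model is not significant)  
$H_1$: at least one slope coefficient is nonzero (the overall regression model is significant)  
$\alpha$: 0.05

The p-value of the F statistic (2.19e-22) is below 0.05, so we reject $H_0$: at least one independent variable has predictive power. Note, however, that the coefficient for sociability does not pass the t test at $\alpha = 0.05$ (p = 0.189), so its effect is not statistically significant.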
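
As a check on the reported value, the overall F statistic can be recovered from $R^2$, the number of predictors $k = 4$, and the sample size $n = 861$. This is a worked calculation from the rounded figures in the report, so the last digits differ slightly from the printed 28.67:

$$F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} = \frac{0.118 / 4}{0.882 / 856} \approx 28.6$$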

## Q2: Discuss which independent variable contributes most to explaining individual income

### Answer
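First, test each independent variable individually:

According to the report, sex, years of education, and interpersonal network pass the t test (p < 0.05) and are statistically significant; sociability does not (p = 0.189).

Among the four independent variables, sex has the largest coefficient (coef = 12240): holding the other variables constant, men (1) earn NT$12,240 more than women (0). Raw coefficients are measured in different units, however, so their sizes are not directly comparable. Judged by the t statistic, years of education (t = 7.833) contributes most, followed by sex (t = 6.267).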
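
Because the predictors are measured in different units, raw coefficients cannot be ranked directly. One common way to compare them is to standardize each slope, $\beta_j = b_j \cdot s_{x_j} / s_y$. A minimal sketch on synthetic data (the function name `standardized_betas` is hypothetical, and this step is not part of the original analysis):

```python
import numpy as np

def standardized_betas(X, y):
    """OLS slopes rescaled to standard-deviation units of x and y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # intercept + predictors
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    slopes = coefs[1:]                              # drop the intercept
    return slopes * X.std(axis=0, ddof=1) / y.std(ddof=1)

# Synthetic example: a 0/1 dummy with a large raw slope and a wide-ranging
# numeric predictor with a smaller raw slope.
rng = np.random.default_rng(1)
dummy = rng.integers(0, 2, size=500)
years = rng.normal(12, 4, size=500)
y = 1000 * dummy + 400 * years + rng.normal(scale=500, size=500)

betas = standardized_betas(np.column_stack([dummy, years]), y)
print(betas)  # standardized slopes for [dummy, years]
```

In this toy case the dummy has the larger raw slope, yet the numeric predictor has the larger standardized slope, which is why the ranking should not rest on raw coefficients alone.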

## Q3: Check whether this regression model has a multicollinearity problem

First, obtain the condition index (CI) for each dimension.

In [7]:
from statistic.linear_regression import condition_index_analysis
import statsmodels.api as sm


condition_index_analysis(X_1)

| Dimension | Eigenvalue | Condition_Index | Variance_Proportion_const | Variance_Proportion_sex | Variance_Proportion_eduy | Variance_Proportion_net | Variance_Proportion_sociability |
|---|---|---|---|---|---|---|---|
| 1 | 4.331362 | 1.0 | 0.223041 | 0.1421597 | 0.213223 | 0.19885 | 0.222726 |
| 2 | 0.447973 | 3.109468 | 0.03536 | 0.857747 | 0.036611 | 0.0308 | 0.039482 |
| 3 | 0.171144 | 5.030736 | 0.038056 | 3.933102e-05 | 0.226633 | 0.717259 | 0.018013 |
| 4 | 0.040901 | 10.290693 | 0.17071 | 2.689475e-07 | 0.518975 | 0.052588 | 0.257727 |
| 5 | 0.008619 | 22.417041 | 0.532834 | 5.371276e-05 | 0.004557 | 0.000503 | 0.462052 |
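
`condition_index_analysis` is a local helper, so its internals are not visible here. The table is consistent with the usual Belsley-style diagnostic: add the intercept, scale each column of the design matrix to unit length, and take $\text{CI}_k = \sqrt{\lambda_{\max}/\lambda_k}$ from the eigenvalues of the scaled cross-product matrix. A sketch under that assumption (the name `condition_indices` is hypothetical):

```python
import numpy as np

def condition_indices(X):
    """Eigenvalues and condition indices of the unit-scaled design matrix."""
    X = np.asarray(X, dtype=float)
    design = np.column_stack([np.ones(X.shape[0]), X])  # add intercept
    design = design / np.linalg.norm(design, axis=0)    # unit column length
    singular = np.linalg.svd(design, compute_uv=False)  # sorted descending
    eigenvalues = singular ** 2                         # eigenvalues of X'X
    return eigenvalues, singular[0] / singular

# Check the formula against the eigenvalues printed in the table above.
reported = np.array([4.331362, 0.447973, 0.171144, 0.040901, 0.008619])
ci_from_table = np.sqrt(reported[0] / reported)
print(ci_from_table.round(2))

# Synthetic predictors (not the survey data) to exercise the function.
rng = np.random.default_rng(2)
eig, ci = condition_indices(rng.normal(size=(100, 3)))
print(eig, ci)
```

The reported eigenvalues sum to 5, the number of columns including the constant, which is what unit-length scaling produces.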


Next, obtain the VIF for each independent variable.

In [8]:
from statistic.linear_regression import vif_analysis

vif_analysis(X_1)

| Variable | VIF |
|---|---|
| sex | 1.940243 |
| eduy | 12.384264 |
| net | 5.625701 |
| sociability | 19.160257 |
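
VIF values depend on whether the auxiliary regressions include an intercept, and `vif_analysis` is a local helper whose handling of the constant is not shown. For comparison, here is a minimal sketch of the intercept-included definition, $\text{VIF}_j = 1/(1 - R_j^2)$, which equals the $j$-th diagonal element of the inverse correlation matrix of the predictors (the function name `vif_with_intercept` is hypothetical):

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column, from the inverse of the predictor correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic predictors (not the survey data): c is nearly a copy of a.
rng = np.random.default_rng(3)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + 0.1 * rng.normal(size=1000)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)  # a and c come out large, b stays near 1
```

If the same calculation on `X_1` gave much smaller values than the table, that would suggest the table partly reflects the scale of the variables rather than correlation among them. This is a possibility to check, not a conclusion.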


### Answer
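The CI table shows that no dimension has a CI above 100, or even above 30 (the largest is 22.42), so multicollinearity in the model as a whole is manageable.

The VIF table, however, shows that years of education and sociability both have VIF > 10, so follow-up analysis could consider dropping one of them to make this multiple linear model simpler.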

## Extension
Try dropping sociability, which has the highest VIF and did not pass the t test in the first model.

In [9]:
X_2 = df_1_cleared[['sex', 'eduy', 'net']]
report_2 = simultaneous_linear_regression_report(Y_1, X_2)
report_2

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | inc | R-squared: | 0.116 |
| Model: | OLS | Adj. R-squared: | 0.113 |
| Method: | Least Squares | F-statistic: | 37.62 |
| Date: | Sat, 22 Nov 2025 | Prob (F-statistic): | 7.60e-23 |
| Time: | 15:57:30 | Log-Likelihood: | -10050.0 |
| No. Observations: | 861 | AIC: | 20110.0 |
| Df Residuals: | 857 | BIC: | 20130.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -3229.5609 | 4457.527 | -0.725 | 0.469 | -1.2e+04 | 5519.388 |
| sex | 1.219e+04 | 1954.306 | 6.239 | 0.000 | 8357.151 | 1.6e+04 |
| eduy | 2102.5651 | 262.441 | 8.012 | 0.000 | 1587.463 | 2617.667 |
| net | 1029.5888 | 398.502 | 2.584 | 0.010 | 247.434 | 1811.743 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 231.315 | Durbin-Watson: | 1.973 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 584.951 |
| Skew: | 1.395 | Prob(JB): | 9.54e-128 |
| Kurtosis: | 5.919 | Cond. No. | 68.8 |


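The F statistic is larger than in the first model (37.62 versus 28.67), $R^2$ is almost unchanged (0.116 versus 0.118), and every independent variable passes the t test ($\alpha$ = 0.05). This model can therefore serve as the main explanatory model.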
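
Since the two models differ by a single predictor, the partial F test for dropping sociability equals the square of its t statistic in the first model (a worked check using the reported values):

$$F_{\text{partial}} = t_{\text{sociability}}^2 = 1.314^2 \approx 1.73$$

This is below the 5% critical value of roughly 3.85 for $F(1, 856)$, in line with the t test's p-value of 0.189: removing sociability does not significantly worsen the fit.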
