### Assignment 1
score_oneway.csv contains test scores for three classes. Do the mean test scores differ between the classes? Test at the 5% significance level.

- Null hypothesis: the mean scores of the three classes are all equal.
- Alternative hypothesis: at least one class has a different mean score.

In [1]:
import pandas as pd
import scipy.stats as ss 
import statsmodels.api as sm 
import statsmodels.formula.api as smf

##### Load the CSV file into a pandas DataFrame

In [2]:
datafile = "data/score_oneway.csv"
df = pd.read_csv(datafile, delimiter=",")
display(df)

Unnamed: 0,Class,Score
0,1,76
1,1,54
2,1,62
3,1,46
4,1,53
...,...,...
82,3,57
83,3,80
84,3,92
85,3,41


##### Find the unique values in the Class column

In [3]:
groups = pd.unique(df["Class"])
print(groups)

[1 2 3]


##### Extract the data for each group

In [4]:
data1 = df[df["Class"]==1]["Score"]
data2 = df[df["Class"]==2]["Score"]
data3 = df[df["Class"]==3]["Score"]
#print(data1)
#print(data2)
#print(data3)

### Method 1: scipy's f_oneway() function

In [5]:
f, p_val = ss.f_oneway(data1, data2, data3)
print(f,p_val)

if p_val > 0.05:
    print("Retain H0")
else:
    print("Reject H0")

0.9554459385671847 0.38878048139735655
Retain H0
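
The per-class selections above are written out by hand for classes 1 to 3. As a minimal sketch, assuming a small made-up DataFrame with the same `Class`/`Score` layout (the name `toy` and its values are hypothetical, not taken from score_oneway.csv), the groups could instead be built with `groupby`, so the same call works for any number of classes:

```python
import pandas as pd
import scipy.stats as ss

# Hypothetical toy data with the same column layout as score_oneway.csv
toy = pd.DataFrame({
    "Class": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Score": [1, 2, 3, 2, 3, 4, 5, 6, 7],
})

# One Series of scores per class, without hard-coding the class labels
toy_groups = [g["Score"] for _, g in toy.groupby("Class")]

# f_oneway accepts any number of samples as positional arguments
toy_f, toy_p = ss.f_oneway(*toy_groups)
print(toy_f, toy_p)

alpha = 0.05  # significance level
print("Retain H0" if toy_p > alpha else "Reject H0")
```

In this toy data the class means (2, 3 and 6) are far apart relative to the spread within each class, so the F-value is large (about 13) and H0 is rejected, unlike in the assignment data.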


### Method 2: Using statsmodels

Step 1. Fit a regression model using statsmodels.formula.api  
Step 2. Perform a one-way ANOVA based on the fitted regression model

In [6]:
model = smf.ols(formula='Score ~ C(Class)', data=df) # set up the regression model; C() treats Class as categorical
results = model.fit() # fit the model by ordinary least squares
print(results.summary()) # print the regression results

# one-way ANOVA via anova_lm()
table = sm.stats.anova_lm(results, typ=2)
print("----------------")
print(table)
print("----------------")

p_val = table.at['C(Class)','PR(>F)']

if p_val > 0.05:
    print("Retain H0")
else:
    print("Reject H0")

                            OLS Regression Results                            
Dep. Variable:                  Score   R-squared:                       0.022
Model:                            OLS   Adj. R-squared:                 -0.001
Method:                 Least Squares   F-statistic:                    0.9554
Date:                Sat, 06 Jul 2024   Prob (F-statistic):              0.389
Time:                        15:15:23   Log-Likelihood:                -387.05
No. Observations:                  87   AIC:                             780.1
Df Residuals:                      84   BIC:                             787.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        60.7931      3.911     15.544

### Method 3: Computing from the ANOVA table


ANOVA table
<table border="" cellpadding="5" width="100%">
<tbody>
<tr valign="top">
<td align="left" width="8%"><strong>Source</strong></td>
<td align="center" width="22%"><strong>SS</strong></td>
<td align="center" width="19%"><strong>df</strong></td>
<td align="center" width="23%"><strong>MS</strong></td>
<td align="center" width="25%"><strong>F</strong></td>
</tr>
<tr valign="top">
<td align="center"><strong>Between</strong></td>
<td align="left">$SS_{between} = \sum_{j=1}^{k}n_j(\bar{x}_{j} - \bar{x})^2$</td>
<td align="left">$df_{between} = k - 1$</td>
<td align="left">$MSG = \frac{SS_{between}}{df_{between}}$</td>
<td align="left">$ F = \frac{MSG}{MSE} $</td>
</tr>
<tr valign="top">
<td align="center"><strong>Within</strong></td>
<td align="left">$SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{j})^2$</td>
<td align="left">$df_{within} = N - k$</td>
<td align="left">$MSE = \frac{SS_{within}}{df_{within}}$</td>
<td align="center"></td>
</tr>
<tr valign="top">
<td align="center"><strong>Total</strong></td>
<td align="left">$SS_{total} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x})^2$</td>
<td align="left">$df_{total} = N - 1$</td>
<td align="center"></td>
<td align="center"></td>
</tr>
</tbody></table>
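
The formulas in the table can be sketched as a small pure-Python function. The function name `anova_table` and the sample values in `toy_samples` are hypothetical, chosen only so that the arithmetic is easy to follow by hand:

```python
def anova_table(groups):
    """Compute one-way ANOVA quantities for a list of samples (lists of numbers)."""
    k = len(groups)                                   # number of levels
    n_total = sum(len(g) for g in groups)             # N: total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # SS_between: group sizes times squared deviation of group mean from grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # SS_within: squared deviations of each value from its own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    # SS_total: squared deviations of each value from the grand mean
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    msg = ss_between / df_between                     # mean square between (MSG)
    mse = ss_within / df_within                       # mean square within (MSE)
    return {
        "SS_between": ss_between, "SS_within": ss_within, "SS_total": ss_total,
        "df_between": df_between, "df_within": df_within, "df_total": n_total - 1,
        "MSG": msg, "MSE": mse, "F": msg / mse,
    }

# Hypothetical toy samples with group means 2, 3 and 6
toy_samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
toy_result = anova_table(toy_samples)
print(toy_result)
```

For these samples, up to floating-point rounding, SS_between is 26, SS_within is 6 and SS_total is 32, giving MSG = 13, MSE = 1 and F = 13. The cells that follow carry out the same computation step by step on the actual scores.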


##### Degrees of freedom

In [7]:
# number of levels (classes)
k = len(pd.unique(df["Class"]))
# total number of observations
N = df.shape[0]
# between-group degrees of freedom
df_between = k - 1
# within-group degrees of freedom
df_within = N - k
# total degrees of freedom
df_total = N - 1

##### Sum of squares between levels

In [8]:
ave_all = df['Score'].mean() # x_bar: grand mean of all scores

n1 = data1.size # number of samples in class 1
n2 = data2.size # number of samples in class 2
n3 = data3.size # number of samples in class 3
c1_ave = data1.mean() # mean of class 1
c2_ave = data2.mean() # mean of class 2
c3_ave = data3.mean() # mean of class 3

# between-group sum of squares: ss_between
c1_ssb = n1 * (c1_ave - ave_all)**2
c2_ssb = n2 * (c2_ave - ave_all)**2 
c3_ssb = n3 * (c3_ave - ave_all)**2
ss_between = c1_ssb + c2_ssb + c3_ssb
print("ss_between: ", ss_between)

ss_between:  847.609195402297


##### Sum of squares within levels

In [9]:
# within-group sum of squares: ss_within
c1_ssw = sum((data1 - c1_ave)**2)
c2_ssw = sum((data2 - c2_ave)**2)
c3_ssw = sum((data3 - c3_ave)**2)
ss_within =  c1_ssw + c2_ssw + c3_ssw
print("ss_within: ", ss_within)

ss_within:  37259.65517241379


##### Total sum of squares

In [10]:
# total sum of squares: ss_total
c1_sst = sum((data1 - ave_all)**2)
c2_sst = sum((data2 - ave_all)**2)
c3_sst = sum((data3 - ave_all)**2)
ss_total = c1_sst + c2_sst + c3_sst
print("ss_total: ", ss_total)

ss_total:  38107.2643678161
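
As a consistency check, the three sums of squares printed above should satisfy the standard ANOVA decomposition, and the degrees of freedom should add up in the same way (here $N = 87$ and $k = 3$). Using the printed values, rounded to three decimals:

$$
\begin{aligned}
SS_{total} &= SS_{between} + SS_{within} \\
38107.264 &\approx 847.609 + 37259.655 \\
df_{total} &= df_{between} + df_{within}: \quad 86 = 2 + 84
\end{aligned}
$$

The ratio $SS_{between} / SS_{total} = 847.609 / 38107.264 \approx 0.022$ also agrees with the R-squared value reported in the OLS summary of the statsmodels approach.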


##### F-test

In [11]:
mean_squared_between = ss_between / df_between
mean_squared_within = ss_within / df_within

# F-value
f_ratio =  mean_squared_between / mean_squared_within

# critical value: upper 5% point of the F distribution (H0 is rejected if f_ratio exceeds it)
print(ss.f.ppf(0.95, df_between, df_within))

# p-value: upper-tail probability at f_ratio (survival function, equivalent to 1 - cdf)
p_val = ss.f.sf(f_ratio, df_between, df_within)

print("F-value: {}".format(f_ratio))
print("p-value: {:.5f}".format(p_val))

if p_val > 0.05:
    print("Retain H0")
else:
    print("Reject H0")

3.1051566079400104
F-value: 0.9554459385671826
p-value: 0.38878
Retain H0
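
Plugging the printed sums of squares into the table formulas reproduces this result by hand, and comparing $F$ with the critical value leads to the same decision as the p-value. Values are rounded to three decimals, and $F_{0.05}(2, 84)$ is notation introduced here for the upper 5% point printed above:

$$
\begin{aligned}
MSG &= \frac{847.609}{2} \approx 423.805 \\
MSE &= \frac{37259.655}{84} \approx 443.567 \\
F &= \frac{MSG}{MSE} \approx \frac{423.805}{443.567} \approx 0.955 < 3.105 \approx F_{0.05}(2, 84)
\end{aligned}
$$

Since $F$ is below the upper 5% point, it lies outside the rejection region and $H_0$ is retained.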


### Interpretation

Because the p-value (0.389) is greater than the significance level of 0.05, the null hypothesis is not rejected.  
Therefore, we cannot conclude that the mean test scores differ between the three classes.