### Exercise 2

A factory manufactures the same product on three lanes.
For quality control, 20 products made on each lane were randomly selected and weighed.
Assume that the population from which each sample is drawn follows a normal distribution.

Test whether the product weight differs between lanes.

- Test method: three groups of data, normally distributed populations -> one-way ANOVA
- Null hypothesis: the mean product weight is the same on every lane
- Alternative hypothesis: the mean product weight differs for at least one lane

In [1]:
import pandas as pd
import scipy.stats as ss 
import statsmodels.api as sm 
import statsmodels.formula.api as smf

##### Load the CSV file into a pandas DataFrame

In [2]:
df = pd.read_csv("data/weights.csv")
display(df.head())
display(df.tail())

Unnamed: 0,lane1,lane2,lane3
0,96.6,86.3,76.0
1,111.4,82.4,88.5
2,84.3,61.4,90.1
3,92.9,82.1,96.1
4,102.4,85.2,77.0


Unnamed: 0,lane1,lane2,lane3
15,93.6,75.3,82.4
16,85.7,98.8,83.0
17,86.9,94.8,108.2
18,86.9,96.2,93.2
19,99.3,92.9,99.8


##### Extract the data for each group

In [3]:
data1 = df["lane1"]
data2 = df["lane2"]
data3 = df["lane3"]
data_all = pd.concat([data1,data2,data3])

print(data_all)

0      96.6
1     111.4
2      84.3
3      92.9
4     102.4
5     113.2
6     101.7
7     103.4
8      97.9
9      84.2
10     92.5
11    113.9
12     93.3
13     89.2
14    102.4
15     93.6
16     85.7
17     86.9
18     86.9
19     99.3
0      86.3
1      82.4
2      61.4
3      82.1
4      85.2
5      99.4
6      92.0
7      89.5
8     103.6
9      80.0
10    103.9
11     86.9
12    112.6
13     69.9
14     96.3
15     75.3
16     98.8
17     94.8
18     96.2
19     92.9
0      76.0
1      88.5
2      90.1
3      96.1
4      77.0
5      81.1
6      94.7
7      84.8
8      96.0
9      86.4
10     93.6
11     93.2
12     87.1
13     96.2
14     80.4
15     82.4
16     83.0
17    108.2
18     93.2
19     99.8
dtype: float64


### Method 1: Computing the test from the ANOVA table


ANOVA table
<table border="" cellpadding="5" width="100%">
<tbody>
<tr valign="top">
<td align="left" width="8%"><strong>Source</strong></td>
<td align="center" width="22%"><strong>SS</strong></td>
<td align="center" width="19%"><strong>df</strong></td>
<td align="center" width="23%"><strong>MS</strong></td>
<td align="center" width="25%"><strong>F</strong></td>
</tr>
<tr valign="top">
<td align="center"><strong>Between</strong></td>
<td align="left">$SS_{between} = \sum_{j=1}^{k}n_j(\bar{x}_{j} - \bar{x})^2$</td>
<td align="left">$df_{between} = k - 1$</td>
<td align="left">$MSG = \frac{SS_{between}}{df_{between}}$</td>
<td align="left">$ F = \frac{MSG}{MSE} $</td>
</tr>
<tr valign="top">
<td align="center"><strong>Within</strong></td>
<td align="left">$SS_{within} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{j})^2$</td>
<td align="left">$df_{within} = N - k$</td>
<td align="left">$MSE = \frac{SS_{within}}{df_{within}}$</td>
<td align="center"></td>
</tr>
<tr valign="top">
<td align="center"><strong>Total</strong></td>
<td align="left">$SS_{total} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x})^2$</td>
<td align="left">$df_{total} = N - 1$</td>
<td align="center"></td>
<td align="center"></td>
</tr>
</tbody></table>
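
The table can also be read as a recipe. The following is a minimal sketch of it, not the notebook's own code: the helper name `one_way_anova_table`, the dictionary keys, and the use of NumPy are assumptions made for illustration, and the input is only the five rows per lane shown in the `df.head()` output, so the numbers differ from the full-data results.

```python
import numpy as np
from scipy import stats


def one_way_anova_table(groups):
    """Compute the one-way ANOVA table entries for a list of 1-D samples.

    Hypothetical helper written for illustration only.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                      # number of levels
    n_total = sum(len(g) for g in groups)  # N
    grand_mean = np.concatenate(groups).mean()

    # SS_between: group size times squared deviation of each group mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # SS_within: squared deviations from each group's own mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between, df_within = k - 1, n_total - k
    msg = ss_between / df_between
    mse = ss_within / df_within
    f_value = msg / mse
    p_value = stats.f.sf(f_value, df_between, df_within)  # upper-tail probability

    return {
        "ss_between": float(ss_between), "ss_within": float(ss_within),
        "df_between": df_between, "df_within": df_within,
        "MSG": float(msg), "MSE": float(mse),
        "F": float(f_value), "p": float(p_value),
    }


# First five rows of each lane, copied from the df.head() output (subset only).
head_rows = [
    [96.6, 111.4, 84.3, 92.9, 102.4],  # lane1
    [86.3, 82.4, 61.4, 82.1, 85.2],    # lane2
    [76.0, 88.5, 90.1, 96.1, 77.0],    # lane3
]

table = one_way_anova_table(head_rows)
for name, value in table.items():
    print(f"{name}: {value}")
```

Keeping each table entry under its own key makes it easy to compare the hand computation with a library routine such as `scipy.stats.f_oneway` on the same input.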


##### Degrees of freedom

In [4]:
# number of levels (lanes)
k = 3
# total number of observations
N = data_all.shape[0]
# between-group degrees of freedom
df_between = k - 1
# within-group degrees of freedom
df_within = N - k
# total degrees of freedom
df_total = N - 1

##### Sum of squares between levels

In [5]:
ave_all = data_all.mean() # x_bar: grand mean of all weights

n1 = len(data1) # sample size of lane1
n2 = len(data2) # sample size of lane2
n3 = len(data3) # sample size of lane3
lane1_ave = data1.mean() # mean of lane1
lane2_ave = data2.mean() # mean of lane2
lane3_ave = data3.mean() # mean of lane3

# between-group sum of squares: ss_between
lane1_ssb = n1 * (lane1_ave - ave_all)**2
lane2_ssb = n2 * (lane2_ave - ave_all)**2
lane3_ssb = n3 * (lane3_ave - ave_all)**2
ss_between = lane1_ssb + lane2_ssb + lane3_ssb
print("ss_between: ", ss_between)

ss_between:  682.1823333333352


##### Sum of squares within levels

In [6]:
# within-group sum of squares: ss_within
# (squared deviations of each weight from its own lane mean)
lane1_ssw = sum((data1 - lane1_ave)**2)
lane2_ssw = sum((data2 - lane2_ave)**2)
lane3_ssw = sum((data3 - lane3_ave)**2)
ss_within = lane1_ssw + lane2_ssw + lane3_ssw
print("ss_within: ", ss_within)

ss_within:  5808.021000000001


##### Total sum of squares

In [7]:
# total sum of squares: ss_total
# (squared deviations of each weight from the grand mean)
lane1_sst = sum((data1 - ave_all)**2)
lane2_sst = sum((data2 - ave_all)**2)
lane3_sst = sum((data3 - ave_all)**2)
ss_total = lane1_sst + lane2_sst + lane3_sst
print("ss_total: ", ss_total)

ss_total:  6490.203333333333
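
As a consistency check, the three sums of squares should satisfy the usual partition of variation. Writing each deviation as $x_{ij} - \bar{x} = (x_{ij} - \bar{x}_{j}) + (\bar{x}_{j} - \bar{x})$ and squaring, the cross term vanishes because $\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{j}) = 0$ within every lane, which gives:

$$
SS_{total} = SS_{between} + SS_{within}
$$

With the printed values rounded to four decimals:

$$
682.1823 + 5808.0210 = 6490.2033
$$

This matches the printed `ss_total` up to floating-point rounding.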


##### F-test

In [8]:
mean_squared_between = ss_between / df_between
mean_squared_within = ss_within / df_within

# F-value
f_ratio = mean_squared_between / mean_squared_within

# critical value: upper 5% point of the F-distribution
# (the rejection region is every F-value above it)
print(ss.f.ppf(0.95, df_between, df_within))

# p-value: upper-tail probability, equivalent to 1 - cdf
p_val = ss.f.sf(f_ratio, df_between, df_within)

print("F-value: {:.5f}".format(f_ratio))
print("p-value: {:.5f}".format(p_val))

if p_val < 0.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

3.1588427192606465
F-value: 3.34747
p-value: 0.04221
Reject H0
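
The arithmetic behind this cell can be written out from the sums of squares computed earlier, with $k = 3$ lanes and $N = 3 \times 20 = 60$ observations (values rounded to four decimals):

$$
\begin{aligned}
MSG &= \frac{SS_{between}}{k - 1} = \frac{682.1823}{2} \approx 341.0912 \\
MSE &= \frac{SS_{within}}{N - k} = \frac{5808.0210}{57} \approx 101.8951 \\
F &= \frac{MSG}{MSE} \approx \frac{341.0912}{101.8951} \approx 3.3475
\end{aligned}
$$

Since $F \approx 3.3475$ exceeds the critical value $3.1588$, the statistic falls in the rejection region, which is the same decision as comparing the p-value with 0.05.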


### Method 2: Using statsmodels

Step 1. Fit a regression model using statsmodels.formula.api  
Step 2. Perform one-way ANOVA based on the fitted regression model

### Method 3: scipy's f_oneway() function

In [9]:
# f_oneway takes one sample per group and returns the F-value and p-value
f, p_val = ss.f_oneway(data1, data2, data3)
print("F-value: {:.5f}".format(f))
print("p-value: {:.5f}".format(p_val))

if p_val < 0.05:
    print("Reject H0")
else:
    print("Fail to reject H0")

F-value: 3.34747
p-value: 0.04221
Reject H0


### Interpretation

The p-value (0.04221) is below the significance level of 0.05, so the null hypothesis is rejected.
We therefore conclude that the mean product weight differs between lanes.
Note that the ANOVA result alone does not identify which lanes differ from each other.
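
The exercise itself stops at the overall test. If one wanted to see which pairs of lanes differ, a common follow-up is a post-hoc pairwise comparison. The sketch below uses `scipy.stats.tukey_hsd`, which is an assumption on two counts: the exercise does not ask for it, and the function requires SciPy 1.8 or later. The weights are copied from the `print(data_all)` output so that the block is self-contained.

```python
import scipy.stats as ss

# Weights copied from the print(data_all) output (20 products per lane).
lane1 = [96.6, 111.4, 84.3, 92.9, 102.4, 113.2, 101.7, 103.4, 97.9, 84.2,
         92.5, 113.9, 93.3, 89.2, 102.4, 93.6, 85.7, 86.9, 86.9, 99.3]
lane2 = [86.3, 82.4, 61.4, 82.1, 85.2, 99.4, 92.0, 89.5, 103.6, 80.0,
         103.9, 86.9, 112.6, 69.9, 96.3, 75.3, 98.8, 94.8, 96.2, 92.9]
lane3 = [76.0, 88.5, 90.1, 96.1, 77.0, 81.1, 94.7, 84.8, 96.0, 86.4,
         93.6, 93.2, 87.1, 96.2, 80.4, 82.4, 83.0, 108.2, 93.2, 99.8]

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate across all comparisons.
res = ss.tukey_hsd(lane1, lane2, lane3)
print(res)

# res.pvalue[i, j] is the adjusted p-value for group i versus group j.
for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(f"lane{i + 1} vs lane{j + 1}: p = {res.pvalue[i, j]:.5f}")
```

Because the adjustment accounts for three simultaneous comparisons, a pairwise p-value can stay above 0.05 even when the overall F-test is significant.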

#### Effect size

In [10]:
# eta squared: share of total variation explained by the lane
# (tends to overestimate the population effect)
eta_squared = ss_between / ss_total
# omega squared: a less biased estimate of the same quantity
omega_squared = (ss_between - (df_between * mean_squared_within)) / (ss_total + mean_squared_within)
print("eta squared: {:.5f}".format(eta_squared))
print("omega squared: {:.5f}".format(omega_squared))

eta squared: 0.10511
omega squared: 0.07257
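
Written out with the quantities computed earlier (rounded to four decimals, with $MSE = 5808.0210 / 57 \approx 101.8951$), the two effect sizes are:

$$
\begin{aligned}
\eta^2 &= \frac{SS_{between}}{SS_{total}} = \frac{682.1823}{6490.2033} \approx 0.1051 \\
\omega^2 &= \frac{SS_{between} - df_{between} \cdot MSE}{SS_{total} + MSE}
= \frac{682.1823 - 2 \times 101.8951}{6490.2033 + 101.8951}
= \frac{478.3921}{6592.0984} \approx 0.0726
\end{aligned}
$$

In this sample, about 10.5% of the total variation in weight is associated with the lane. $\omega^2$ is smaller because it subtracts the between-lane variation that would be expected from sampling noise alone.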
