![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# Lab | Inferential statistics - ANOVA

Note: The following lab is divided into two sections, which correspond to activities 3 and 4.

## Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps needed to set up an ANOVA. After the next lesson, we will ask you to solve this problem using Python. These are the steps you need to work on:

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

### Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changes in the power (in watts) of the plasma beam. Data was collected and provided to you so that you can run a statistical analysis and check whether changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct an ANOVA and check whether the mean etching rate differs between power levels. You can find the data in the `anova_lab_data.xlsx` file in the `files_for_lab` folder.

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level?
- What are the degrees of freedom of the model, of the error term, and in total?

Data was collected randomly and provided to you in the table as shown: [link to the image - Data](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.05/7.05-lab_data.png)

### Setting up the ANOVA: Concepts

- H0 (null hypothesis): μ1 = μ2 = μ3 = … = μk (all the population means are equal)
- H1 (alternative hypothesis): at least one population mean is different from the rest

- <b>SSR</b>: regression sum of squares
- <b>SSE</b>: error sum of squares
- <b>SST</b>: total sum of squares (SST = SSR + SSE)
- <b>dfr</b>: regression degrees of freedom (dfr = k-1)
- <b>dfe</b>: error degrees of freedom (dfe = n-k)
- <b>dft</b>: total degrees of freedom (dft = n-1)
- <b>k</b>: total number of groups
- <b>n</b>: total observations
- <b>MSR</b>: regression mean square (MSR = SSR/dfr)
- <b>MSE</b>: error mean square (MSE = SSE/dfe)
- <b>F</b>: The F test statistic (F = MSR/MSE)
- <b>p</b>: The p-value that corresponds to F<sub>dfr, dfe</sub>

If the p-value is less than your chosen significance level (e.g. 0.05), then you can reject the null hypothesis and conclude that at least one of the population means is different from the others.
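
As a minimal sketch of how the quantities listed above fit together, the following pure-Python function fills in the entries of a one-way ANOVA table from a list of groups. The function name `one_way_anova` and the toy numbers are illustrative assumptions; they are not the lab data.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA quantities for a list of groups (lists of numbers)."""
    k = len(groups)                              # total number of groups
    n = sum(len(g) for g in groups)              # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]

    # SSR: squared distance of each group mean from the grand mean,
    # weighted by the number of observations in that group
    ssr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SSE: squared distance of each observation from its own group mean
    sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    dfr, dfe = k - 1, n - k
    msr, mse = ssr / dfr, sse / dfe
    return {"SSR": ssr, "SSE": sse, "SST": ssr + sse,
            "dfr": dfr, "dfe": dfe, "dft": n - 1,
            "MSR": msr, "MSE": mse, "F": msr / mse}


# Hypothetical toy data (three groups of three observations), not the lab data
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
table = one_way_anova(toy_groups)
print(table)
```

For this toy data the sketch gives SSR = 26, SSE = 6 and F = 13, up to floating-point rounding.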

Note: If you reject the null hypothesis, this indicates that at least one of the population means is different from the others, but the ANOVA table doesn’t specify which population means are different. To determine this, you need to perform post hoc tests, also known as “multiple comparisons” tests.
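
One common post hoc test is Tukey's HSD. The sketch below uses `pairwise_tukeyhsd` from statsmodels on hypothetical toy data (the values and the group labels `a`, `b`, `c` are assumptions for illustration, not the lab data), so treat it as a pattern rather than as the lab solution.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical toy data: three groups of three observations, not the lab data
values = np.array([1, 2, 3, 2, 3, 4, 5, 6, 7], dtype=float)
labels = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)

# Compare every pair of groups while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary compares one pair of groups; the `reject` column indicates which pairs differ at the chosen alpha.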

## Part 2

- In this section, use Python to conduct ANOVA.
- What conclusions can you draw from the experiment and why?

In [1]:
import pandas as pd

In [2]:
df=pd.read_excel("./files_for_lab/anova_lab_data.xlsx")
df.columns = [col.lower().strip().replace(" ","_") for col in df.columns]
df

| | power | etching_rate |
|---|---|---|
| 0 | 160 W | 5.43 |
| 1 | 180 W | 6.24 |
| 2 | 200 W | 8.79 |
| 3 | 160 W | 5.71 |
| 4 | 180 W | 6.71 |
| 5 | 200 W | 9.2 |
| 6 | 160 W | 6.22 |
| 7 | 180 W | 5.98 |
| 8 | 200 W | 7.9 |
| 9 | 160 W | 6.01 |


In [3]:
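# Calculate the mean of each group and the overall (grand) mean.
# The mean of the group means equals the grand mean only because all groups have the same size.
df_means=df.groupby(['power']).mean()
display(df_means)
df_m_m=df_means.mean()
display(df_m_m)
ov_mean=df_m_m['etching_rate']  # select by label; positional df_m_m[0] is deprecated in recent pandas
type(ov_mean)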

| power | etching_rate |
|---|---|
| 160 W | 5.792 |
| 180 W | 6.238 |
| 200 W | 8.318 |


etching_rate    6.782667
dtype: float64

numpy.float64

In [4]:
# Regression (between-group) sum of squares:
# each squared deviation of a group mean is weighted by the size of that group
group_sizes=df.groupby('power')['etching_rate'].count()
SSR=(group_sizes*(df_means['etching_rate']-ov_mean)**2).sum()

In [5]:
df['power'].unique().tolist()

['160 W', '180 W', '200 W']

In [6]:
# Error (within-group) sum of squares:
# each observation is compared with the mean of its own group
SE=[]
for power, mean_j in df_means['etching_rate'].items():
    for x in df[df['power']==power]['etching_rate'].tolist():
        SE.append((x-mean_j)**2)
SSE=sum(SE)
round(SSE, 6)

2.95724

In [7]:
# Same calculation as a single list comprehension
SSE=sum([(x-mean_j)**2 for power, mean_j in df_means['etching_rate'].items() for x in df[df['power']==power]['etching_rate'].tolist()])
round(SSE, 6)

2.95724

In [8]:
SST=SSR+SSE
round(SST, 6)

21.133893

In [9]:
k=df['power'].nunique()  # number of groups
n=len(df)                # total observations
dfr=k-1
dfe=n-k
dft=n-1

In [10]:
MSR=SSR/dfr  # mean squares divide each sum of squares by its own degrees of freedom
MSE=SSE/dfe
F=MSR/MSE

In [11]:
d={'Sum of Squares (SS)':[SSR,SSE,SST],'df':[dfr,dfe,dft],'Mean Squares (MS)':[MSR,MSE,""],'F':[F,"",""]}
results=pd.DataFrame(data=d, index=['Regression','Error','Total'])
results

| | Sum of Squares (SS) | df | Mean Squares (MS) | F |
|---|---|---|---|---|
| Regression | 18.176653 | 2 | 9.088327 | 36.878955 |
| Error | 2.95724 | 12 | 0.246437 | |
| Total | 21.133893 | 14 | | |


In [12]:
# ANOVA LIBRARY

import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols("etching_rate ~ C(power)",df).fit()
sm.stats.anova_lm(model) # p-value < 0.05, so we reject H0: at least one power level has a different mean etching rate

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| C(power) | 2.0 | 18.176653 | 9.088327 | 36.878955 | 8e-06 |
| Residual | 12.0 | 2.95724 | 0.246437 | | |
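
The p-value and the "F table" step can also be checked directly. The sketch below copies the F statistic and the degrees of freedom from the `anova_lm` output above and uses `scipy.stats.f`; the 0.05 significance level is an assumption, and SciPy is not used elsewhere in this lab.

```python
from scipy import stats

# Values copied from the anova_lm output above
F_stat = 36.878955
dfr, dfe = 2, 12
alpha = 0.05  # assumed significance level

# p-value: probability of an F at least this large if H0 is true
p_value = stats.f.sf(F_stat, dfr, dfe)
# Critical value: what an F table would list for alpha and (dfr, dfe)
F_crit = stats.f.ppf(1 - alpha, dfr, dfe)

print(f"p-value = {p_value:.2e}, critical F = {F_crit:.3f}")
print("Reject H0" if F_stat > F_crit else "Fail to reject H0")
```

Comparing F with the critical value and comparing the p-value with alpha are two views of the same decision, so both lead to the same conclusion.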
