#  ANOVA

In statistics, **Analysis of Variance (ANOVA)** is used to analyze the differences among group means. The difference between a t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

The ANOVA test returns two numbers. The first is the **F-value**, the ratio of the variance between groups to the variance within groups. You reject the null hypothesis when the F-value exceeds a critical value, which depends on the total number of subjects and the number of subject groups in your experiment. You can find the critical values of the F distribution in [this table](http://b.link/eda14). **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**
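
The critical value does not have to be read off a printed table. As a minimal sketch, assuming SciPy is available, it can be computed from the inverse CDF of the F distribution. The helper name `critical_f` and the group and subject counts below are illustrative and not part of the challenge.

```python
from scipy import stats

def critical_f(n_groups, n_total, alpha=0.05):
    """Critical F-value for a one-way ANOVA (hypothetical helper)."""
    df_between = n_groups - 1        # numerator degrees of freedom
    df_within = n_total - n_groups   # denominator degrees of freedom
    return stats.f.ppf(1 - alpha, df_between, df_within)

# Illustrative designs: (number of groups, total number of subjects)
for n_groups, n_total in [(3, 15), (3, 30), (5, 50)]:
    print(n_groups, n_total, round(critical_f(n_groups, n_total), 3))
```

An observed F-value larger than the printed critical value for the matching design would lead to rejecting the null hypothesis at that `alpha`.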

The second number is the **p-value**, which already takes the total number of subjects and the number of experiment groups into account. **Typically, if your p-value is less than 0.05, you can reject the null hypothesis.**
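
To illustrate how the p-value folds in the group and subject counts, here is a small sketch using the survival function of SciPy's F distribution. The F-value of 4.0 and the two designs are made-up numbers for illustration only.

```python
from scipy import stats

f_value = 4.0  # illustrative F-value, not taken from this lab

# Same F-value, two hypothetical designs with 3 groups each
p_small = stats.f.sf(f_value, 3 - 1, 15 - 3)   # 15 subjects in total
p_large = stats.f.sf(f_value, 3 - 1, 30 - 3)   # 30 subjects in total

alpha = 0.05
for label, p in [("15 subjects", p_small), ("30 subjects", p_large)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{label}: p = {p:.4f} -> {decision}")
```

The same F-value yields a smaller p-value in the larger design, which is why the p-value can be compared against 0.05 directly without consulting the table.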

In this challenge, we want to understand whether there are significant differences in the `Total` value among the various types of Pokémon, i.e. Grass vs Poison vs Fire vs Dragon... There are many Pokémon types, which makes this a perfect use case for ANOVA. Use Ironhack's database to load the pokemon data (db: pokemon, table: pokemon_stats).

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the data:
data = pd.read_excel('anova_lab_data.xlsx')
data


Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


# Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table



Context
Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing the power (in watts) of the plasma beam. Data was collected and provided to you to conduct a statistical analysis and check whether changing the power of the plasma beam has any effect on the machine's etching rate. You will conduct an ANOVA and check whether the mean etching rate differs between power levels. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level?
- What are the degrees of freedom of the model, the error terms, and the total DoF?
- Data was collected randomly and provided to you in the table as shown: link to the image - Data



Null Hypothesis:

In [6]:
# H0 - The mean etching rate is the same at all power levels (160 W, 180 W, 200 W)

Alternate hypothesis:

In [7]:
# H1 - The mean etching rate differs for at least one power level

Level of significance:

In [9]:
# 0.05 - The conventional level; this is not a critical matter (which would require a lower level of significance)
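
The degrees of freedom asked for in the task list follow directly from the design. A short sketch, assuming the 3 power levels and 15 observations (5 per level) that the notebook's outputs in Part 2 show:

```python
n_groups = 3    # power levels: 160 W, 180 W, 200 W
n_total = 15    # total observations, assuming 5 per power level

df_model = n_groups - 1         # between-groups (model) degrees of freedom
df_error = n_total - n_groups   # within-groups (error) degrees of freedom
df_total = n_total - 1          # total degrees of freedom

print(df_model, df_error, df_total)
```

The model and error degrees of freedom always add up to the total, which is a quick sanity check on the setup.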

# Part 2


In this section, use Python to conduct ANOVA.
What conclusions can you draw from the experiment and why?

In [10]:
processors = pd.read_excel('anova_lab_data.xlsx')
processors

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


Test statistic:

In [11]:
# Number each observation within its power group (0, 1, 2, ...); used as the pivot index below
processors['power_rate'] = processors.groupby('Power ').cumcount()

In [16]:
processors['power_rate']

0     0
1     0
2     0
3     1
4     1
5     1
6     2
7     2
8     2
9     3
10    3
11    3
12    4
13    4
14    4
Name: power_rate, dtype: int64

In [17]:
processors_pivot = processors.pivot(index='power_rate', columns='Power ', values='Etching Rate')
processors_pivot.columns = ['Power_'+str(x) for x in processors_pivot.columns.values]
processors_pivot.head()

Unnamed: 0_level_0,Power_160 W,Power_180 W,Power_200 W
power_rate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55
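
To make the test statistic itself explicit, the sketch below computes the one-way ANOVA F-value by hand from the between-group and within-group sums of squares. The values are copied from the pivot table above; the variable names are illustrative.

```python
import numpy as np

# Etching rates copied from the pivot table above, one array per power level
groups = {
    '160 W': np.array([5.43, 5.71, 6.22, 6.01, 5.59]),
    '180 W': np.array([6.24, 6.71, 5.98, 5.66, 6.60]),
    '200 W': np.array([8.79, 9.20, 7.90, 8.15, 7.55]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_between = len(groups) - 1               # number of groups minus 1
df_within = len(all_values) - len(groups)  # observations minus groups

# F is the ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

The result can be checked against the statistic that `st.f_oneway` reports for the same three columns.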


P-value:

In [15]:
st.f_oneway(processors_pivot['Power_160 W'], processors_pivot['Power_180 W'], processors_pivot['Power_200 W'])

F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)
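
The pivot is one way to get one column per power level. As an alternative sketch, `f_oneway` can also be fed one array per group collected directly from the long-format frame. The frame below is rebuilt from the values shown in the pivot table so the example runs on its own; `long_df` and `samples` are illustrative names.

```python
import pandas as pd
from scipy import stats

# Values copied from the pivot table; 'Power ' keeps the trailing space
# used for the column name in the notebook's own code.
long_df = pd.DataFrame({
    'Power ': ['160 W', '180 W', '200 W'] * 5,
    'Etching Rate': [5.43, 6.24, 8.79, 5.71, 6.71, 9.20, 6.22, 5.98, 7.90,
                     6.01, 5.66, 8.15, 5.59, 6.60, 7.55],
})

# One array of etching rates per power level, no pivot needed
samples = [grp['Etching Rate'].to_numpy() for _, grp in long_df.groupby('Power ')]

result = stats.f_oneway(*samples)
print(result.statistic, result.pvalue)
```

This form also works when the groups have different sizes, where a pivot would leave missing values in the shorter columns.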

The p-value (7.506584272358903e-06) is far below the 0.05 significance level.

So we reject H0: the mean etching rate differs for at least one power level, meaning the power of the plasma beam has an effect on the etching rate.
