#  ANOVA

In statistics, **Analysis of Variance (ANOVA)** is used to analyze the differences among group means. The difference between a t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

From the ANOVA test, you receive two numbers. The first is the **F-value**, which indicates whether your null hypothesis can be rejected: you reject it when the F-value exceeds a critical value. That critical F-value varies according to the total number of subjects and the number of groups in your experiment. In [this table](http://b.link/eda14) you can find the critical values of the F distribution. **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**
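If you would rather not read the table, the critical value can also be computed. The sketch below assumes a significance level of 0.05 and a hypothetical experiment with 3 groups and 30 subjects in total (both numbers are made up for illustration). It uses `scipy.stats.f.ppf`, the inverse CDF of the F distribution.

```python
from scipy.stats import f

alpha = 0.05        # assumed significance level
n_groups = 3        # hypothetical number of groups
n_subjects = 30     # hypothetical total number of subjects

dfn = n_groups - 1             # between-group degrees of freedom
dfd = n_subjects - n_groups    # within-group degrees of freedom

# Value that an F-distributed statistic exceeds with probability alpha.
critical_f = f.ppf(1 - alpha, dfn, dfd)
print(f"Critical F for ({dfn}, {dfd}) degrees of freedom: {critical_f:.3f}")
```

An observed F-value above `critical_f` would lead you to reject the null hypothesis at that significance level.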

The p-value is the other number yielded by ANOVA. It already takes the total number of subjects and the number of groups into account. **Typically, if your p-value is less than 0.05, you can reject the null hypothesis.**
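As a small illustration of this decision rule, here is a sketch on three made-up groups (the numbers are hypothetical and unrelated to the pokemon data). It assumes a threshold of 0.05.

```python
from scipy.stats import f_oneway

# Hypothetical measurements for three groups (made-up numbers).
group_a = [10, 12, 11, 13, 12]
group_b = [20, 22, 21, 23, 22]
group_c = [30, 32, 31, 33, 32]

alpha = 0.05  # assumed significance threshold
f_value, p_value = f_oneway(group_a, group_b, group_c)

# Compare the p-value with the threshold to reach a decision.
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"
print(f"F = {f_value:.2f}, p = {p_value:.3g}: {decision}")
```

The group means here are far apart relative to the spread within each group, so the p-value falls well below the threshold.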

In this challenge, we want to understand whether there are significant differences in the `Total` value among the various pokemon types, e.g. Grass vs Poison vs Fire vs Dragon... There are many pokemon types, which makes this a perfect use case for ANOVA.

In [1]:
# Import libraries

import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the data:
data = pd.read_csv('pokemon.txt')
data

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in the next step.*
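One possible way to aggregate both columns is sketched below on a tiny made-up frame (`toy` and its rows are hypothetical stand-ins for the real data). It stacks the two columns with `pd.concat` and takes the distinct values, so `NaN` shows up as one of them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pokemon data; NaN marks a missing second type.
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, "Flying", np.nan],
})

# Stack both columns into one Series, then take the distinct values.
toy_unique_types = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()
print(toy_unique_types)

# Keep only real type names (this drops NaN, which is a float).
toy_valid_types = [t for t in toy_unique_types if isinstance(t, str)]
print(toy_valid_types)
```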

In [3]:
# Your code here

data['Type 1'].unique()

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

In [4]:
data['Type 2'].unique()

array(['Poison', nan, 'Flying', 'Dragon', 'Ground', 'Fairy', 'Grass',
       'Fighting', 'Psychic', 'Steel', 'Ice', 'Rock', 'Dark', 'Water',
       'Electric', 'Fire', 'Ghost', 'Bug', 'Normal'], dtype=object)

In [5]:
data['Type 2'].isna().sum()

386

In [6]:
def fillnan(row):
    # Single-type pokemon have NaN in "Type 2"; fall back to "Type 1" for them.
    if pd.isna(row["Type 2"]):
        return row["Type 1"]
    return row["Type 2"]

In [8]:
data["Type 2"] = data.apply(fillnan, axis = 1)
data.head(10)

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,Fire,309,39,52,43,60,50,65,1,False
5,5,Charmeleon,Fire,Fire,405,58,64,58,80,65,80,1,False
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
8,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
9,7,Squirtle,Water,Water,314,44,48,65,50,64,43,1,False


In [9]:
# "Type 2" no longer contains NaN after the fill above, so only the 18 real types remain.
unique_types = pd.concat([data["Type 1"], data["Type 2"]]).unique()

In [10]:
unique_types

array(['Grass', 'Fire', 'Water', 'Bug', 'Normal', 'Poison', 'Electric',
       'Ground', 'Fairy', 'Fighting', 'Psychic', 'Rock', 'Ghost', 'Ice',
       'Dragon', 'Dark', 'Steel', 'Flying'], dtype=object)

#### Second, we will create a list named `pokemon_totals` to contain the `Total` values of each unique pokemon type.

Why do we use a list instead of a dictionary to store the pokemon `Total` values? Because ANOVA only tells us whether there is a significant difference among the group means; it does not tell us which group(s) are significantly different. Therefore, we don't need to know which `Total` values belong to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` values to `pokemon_totals`. Be sure to check BOTH `Type 1` and `Type 2` to cover all occurrences of each unique type.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float`, which you can check by using `type()`. The valid pokemon type values are all of the `str` type.
* At the end, the length of your `pokemon_totals` should be 18.
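The hints above can be sketched as follows. This is a minimal illustration on a made-up frame, not the reference solution; `toy`, `toy_types` and `toy_groups` are hypothetical names, and the rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the pokemon data (made-up rows).
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Fire", "Water", "Grass"],
    "Type 2": ["Poison", np.nan, "Flying", np.nan, "Flying"],
    "Total": [318, 309, 534, 314, 405],
})
toy_types = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()

toy_groups = []
for t in toy_types:
    if not isinstance(t, str):
        continue  # skip NaN, which is a float
    # Rows where the type appears in either column.
    mask = (toy["Type 1"] == t) | (toy["Type 2"] == t)
    # Keep the individual Total values of this type as one group.
    toy_groups.append(toy.loc[mask, "Total"])

print(len(toy_groups))
```

Each list item is a Series of individual `Total` values, so every group keeps its own spread, which ANOVA needs in order to compare the variation between groups with the variation within them.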

In [12]:
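pokemon_totals1 = []
for i in unique_types:
    # Sum of `Total` over the pokemon whose "Type 1" is i.
    # Note: .sum() reduces each type to one number rather than a group of values.
    total_values = data.Total[data["Type 1"] == i].sum()
    pokemon_totals1.append(total_values)
pokemon_totals1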

[29480,
 23820,
 48211,
 26146,
 39365,
 11176,
 19510,
 14000,
 7024,
 11244,
 27129,
 19965,
 14066,
 10403,
 17617,
 13818,
 13168,
 1940]

In [13]:
pokemon_totals2 = []
for i in unique_types:
    # Same sum, matched on "Type 2" (equal to "Type 1" for single-type pokemon after fillnan).
    total_values = data.Total[data["Type 2"] == i].sum()
    pokemon_totals2.append(total_values)
pokemon_totals2

[23541,
 17617,
 30266,
 6105,
 26556,
 19170,
 13898,
 20690,
 15687,
 21571,
 33462,
 9770,
 10326,
 13153,
 14018,
 14050,
 13145,
 45057]

In [14]:
# Row 0 holds the "Type 1" sums and row 1 the "Type 2" sums; one column per type.
pokemon_totals = pd.DataFrame([pokemon_totals1, pokemon_totals2], columns=unique_types)

In [15]:
pokemon_totals

Unnamed: 0,Grass,Fire,Water,Bug,Normal,Poison,Electric,Ground,Fairy,Fighting,Psychic,Rock,Ghost,Ice,Dragon,Dark,Steel,Flying
0,29480,23820,48211,26146,39365,11176,19510,14000,7024,11244,27129,19965,14066,10403,17617,13818,13168,1940
1,23541,17617,30266,6105,26556,19170,13898,20690,15687,21571,33462,9770,10326,13153,14018,14050,13145,45057


In [16]:
pokemon_totals.shape

(2, 18)

#### Now we run the ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `stats.f_oneway(*pokemon_totals)`. This unpacks the list and passes each list item as a separate argument to `f_oneway`.
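To see what the `*` does, the sketch below compares the unpacked call with the same call written out by hand. The groups are made-up numbers and `toy_groups` is a hypothetical name.

```python
from scipy.stats import f_oneway

# Hypothetical groups of unequal size (made-up numbers).
toy_groups = [
    [318, 405, 525, 625],
    [309, 405, 534],
    [314, 405, 530, 630, 295],
]

# Unpacking with * is equivalent to writing each group out by hand.
unpacked = f_oneway(*toy_groups)
explicit = f_oneway(toy_groups[0], toy_groups[1], toy_groups[2])

print(unpacked.statistic, unpacked.pvalue)
print(explicit.statistic, explicit.pvalue)
```

Both calls receive three separate arguments, so they return the same statistic and p-value. The unpacked form keeps working when the number of groups changes.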

In [18]:
# Your code here

from scipy.stats import f_oneway

In [20]:
# H0: the means of the various groups are the same
# H1: at least one group mean differs from the others

# Pass each type's column as a separate group (each column holds two summed values).
st.f_oneway(*[pokemon_totals[col] for col in pokemon_totals.columns])

F_onewayResult(statistic=1.391572237351771, pvalue=0.24660701358738799)
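
The reported p-value can be recovered from the F statistic and the degrees of freedom. With 18 groups of 2 values each (the shape of `pokemon_totals` shown earlier), the degrees of freedom are 17 and 18. A minimal check, assuming `scipy.stats.f.sf` (the survival function of the F distribution) matches what `f_oneway` computes internally:

```python
from scipy.stats import f

f_statistic = 1.391572237351771   # statistic reported above
n_groups = 18                     # columns of pokemon_totals
n_values = 2 * 18                 # 2 rows x 18 columns

dfn = n_groups - 1                # between-group degrees of freedom (17)
dfd = n_values - n_groups         # within-group degrees of freedom (18)

# Probability of seeing an F-value at least this large under H0.
p_value = f.sf(f_statistic, dfn, dfd)
print(p_value)
```

With only two values per group, the within-group degrees of freedom are small, which is worth keeping in mind when reading the result.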

#### Interpret the ANOVA test result. Is the difference significant?

In [None]:
# The p-value (about 0.25) is higher than 0.05, so we fail to reject the null hypothesis
# that the group means are the same across the different pokemon types.