# Lab | Inferential statistics - T-test & P-value

## Part 1.1: One tailed t-test 

#### T-tests and P-values
In statistics, a t-test is used to test whether two data samples have a significant difference between their means. This lab uses two types of t-test:

* **Student's t-test** (a.k.a. independent or uncorrelated t-test). This type of t-test compares samples from **two independent populations** (e.g. test scores of students in two different classes). `scipy` provides the [`ttest_ind`](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.ttest_ind.html) function to conduct Student's t-test.

* **Paired t-test** (a.k.a. dependent or correlated t-test). This type of t-test compares two sets of measurements taken on **the same population** (e.g. scores of different tests of students in the same class). `scipy` provides the [`ttest_rel`](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.ttest_rel.html) function to conduct a paired t-test.

Both types of t-test return a test statistic and a number called the **p-value**. If the p-value is below 0.05, we reject the null hypothesis and call the difference significant at the 5% level. If the p-value is between 0.05 and 0.1, the evidence against the null hypothesis is weaker: we may still reject it, but only at a 10% significance level and with less confidence. If the p-value is above 0.1, we do not reject the null hypothesis.
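As a minimal sketch of how the p-value is read in practice, the snippet below runs both tests on small made-up samples and compares each p-value with a 0.05 threshold. The scores and variable names are illustrative assumptions, not lab data.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Made-up scores, for illustration only (not part of the lab data)
class_a = np.array([72, 85, 78, 90, 66, 81, 77, 88])
class_b = np.array([65, 70, 74, 80, 60, 72, 69, 75])

# Independent samples: two different classes
t_ind, p_ind = ttest_ind(class_a, class_b)

# Paired samples: the same students measured twice (test 1 vs test 2)
test_1 = np.array([72, 85, 78, 90, 66, 81, 77, 88])
test_2 = np.array([75, 86, 80, 93, 70, 84, 78, 91])
t_rel, p_rel = ttest_rel(test_1, test_2)

alpha = 0.05  # significance level
for name, p in [("independent", p_ind), ("paired", p_rel)]:
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```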

Read more about the t-test in [this article](http://b.link/test50) and [this Quora](http://b.link/unpaired97). Make sure you understand when to use which type of t-test. 

Paired Vs Unpaired T-Test

    - Paired t-test:
        - determines if there is a significant difference between the means of two DEPENDENT variables/groups.
        - There is a relationship between variables/groups
        - Does NOT assume equal variance between groups
        
    - Unpaired t-test:
        - determines if there is a significant difference between the means of two INDEPENDENT variables/groups.
        - There is NO relationship between variables/groups
        - Assumes equal variance between groups (Welch's variant, `equal_var=False` in `ttest_ind`, drops this assumption)
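The equal-variance point corresponds to the `equal_var` argument of `ttest_ind`. As a hedged sketch on made-up data (the sample values are assumptions chosen to have different sizes and spreads), the Welch statistic can be computed by hand and checked against `equal_var=False`:

```python
import numpy as np
from scipy.stats import ttest_ind

# Made-up samples with different sizes and spreads (illustration only)
a = np.array([10.1, 9.8, 10.4, 10.0, 9.9, 10.2])
b = np.array([11.5, 9.0, 12.3, 8.7, 10.9, 12.8, 9.4, 11.1])

# Welch's t statistic: difference in means over the unpooled standard error
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

welch = ttest_ind(a, b, equal_var=False)   # Welch: no equal-variance assumption
student = ttest_ind(a, b, equal_var=True)  # Student: pooled variance

print("manual Welch t:", t_manual)
print("scipy Welch t: ", welch.statistic)
print("Student t:     ", student.statistic)
```

With unequal sample sizes and variances the two statistics differ, which is why the choice of `equal_var` matters.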

In [114]:
# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as st
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns


### One tailed t-test 
In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on the average than the machine currently used. To test that hypothesis, **the times it takes each machine to pack ten cartons** are recorded. The results, in **seconds**, are shown in the tables in the file files_for_lab/ttest_machine.xlsx.

Assume that there is sufficient evidence to conduct the t test, does the data provide sufficient evidence to show if one machine is better than the other?

##### Import dataset

In [84]:
# Load the .txt file, using whitespace as the delimiter
# (raw string so that '\s' is not treated as an invalid escape sequence)
machine = pd.read_csv('ttest_machine.txt', delimiter=r'\s+')

display(machine, machine.shape)

Unnamed: 0,New_machine,Old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


(10, 2)

In [85]:
# Two unrelated variables/groups

##### Confidence interval

In [86]:
# Converting each column into array
new_machine = machine['New_machine'].values
old_machine = machine['Old_machine'].values
display(new_machine,old_machine)

array([42.1, 41. , 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44. , 44.1])

In [87]:
# Calculating confidence interval for new machine
confidence_level = 0.95
degrees_freedom = len(new_machine) - 1
sample_mean = np.mean(new_machine)

sample_standard_error = st.sem(new_machine) 
confidence_interval = st.t.interval(confidence_level,degrees_freedom,
                                             sample_mean, sample_standard_error)

print( 'Confidence interval for NEW MACHINE is ', confidence_interval, '.' )

Confidence interval for NEW MACHINE is  (41.65108555006849, 42.628914449931514) .


In [88]:
# Calculating confidence interval for old machine
confidence_level = 0.95
degrees_freedom = len(old_machine) - 1
sample_mean = np.mean(old_machine)

sample_standard_error = st.sem(old_machine) 
confidence_interval = st.t.interval(confidence_level,degrees_freedom,
                                             sample_mean, sample_standard_error)

print( 'Confidence interval for OLD MACHINE is ', confidence_interval, '.' )

Confidence interval for OLD MACHINE is  (42.69356181052482, 43.76643818947519) .
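The interval returned by `st.t.interval` is the sample mean plus or minus a critical t value times the standard error. A minimal sketch of that arithmetic, using the new-machine times from this lab; the helper name `mean_confidence_interval` is hypothetical:

```python
import numpy as np
import scipy.stats as st

# New-machine packing times (seconds), copied from the lab data
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])

def mean_confidence_interval(sample, confidence_level=0.95):
    """Return (lower, upper) bounds of a t-based CI for the sample mean."""
    n = len(sample)
    mean = np.mean(sample)
    sem = st.sem(sample)  # standard error of the mean
    # Two-sided critical value of the t distribution with n - 1 degrees of freedom
    t_crit = st.t.ppf((1 + confidence_level) / 2, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem

lower, upper = mean_confidence_interval(new_machine)
print(lower, upper)
```

The bounds should agree with the new-machine interval printed by `st.t.interval` in the lab output.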


In [89]:
# Both samples have the same size; the sample with the larger spread (old machine) has the wider confidence interval

##### Hypothesis Test

EXAMPLE

- Someone thinks (hypothesises) that the mean of cholesterol values in the population is 5.6 (evidence for or against)

- We select a value for alpha of 0.05 (p-value threshold, significance level)


    - Two-sided test:
        - Null hypothesis or H0: mean cholesterol value = 5.6  (much easier to get evidence against)
        - Alternative hyp or H1: mean cholesterol value <> 5.6 (or !=)

    - One-sided test:
        - Null hypothesis or H0: mean cholesterol value >= 5.6
        - Alternative hyp or H1: mean cholesterol value < 5.6
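The two-sided and one-sided versions of this example can be sketched with `ttest_1samp`. The cholesterol readings below are made up for illustration, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Made-up cholesterol readings (illustration only)
cholesterol = np.array([5.1, 5.4, 5.3, 5.8, 5.0, 5.5, 5.2, 5.6, 5.3, 5.4])
hypothesised_mean = 5.6

# Two-sided test: H1 is "mean != 5.6"
stat, p_two_sided = ttest_1samp(cholesterol, hypothesised_mean)

# One-sided test: H1 is "mean < 5.6" (needs SciPy >= 1.6)
_, p_less = ttest_1samp(cholesterol, hypothesised_mean, alternative='less')

print("t statistic:", stat)
print("two-sided p:", p_two_sided)
print("one-sided p:", p_less)
# When the statistic points in the direction of H1 (here stat < 0),
# the one-sided p-value is half the two-sided one.
```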

**Hypothesis**: "A new machine will pack faster on the average than the machine currently used."

**One-sided test** --> one faster than the other


**H0 or null hypothesis** = the new machine is NOT faster than the old one (mean time new >= mean time old)


**H1 or alternative hypothesis** = the new machine is faster than the old one (mean time new < mean time old)



In [90]:
# Mean time in seconds
mean_new_machine = new_machine.mean()
mean_old_machine = old_machine.mean()
print('Mean seconds for NEW MACHINE: ',mean_new_machine, '\nMean seconds for OLD MACHINE: ',mean_old_machine )

Mean seconds for NEW MACHINE:  42.14 
Mean seconds for OLD MACHINE:  43.230000000000004


##### T-test

In [91]:
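# Welch's two-sample t-test: not assuming equal variance makes the test more robust.
# The result is stored for comparison only; it is not what is printed below.
welch_stat, welch_pval = st.ttest_ind(new_machine, old_machine, equal_var=False)

# One-sample t-test of the new machine against the old machine's mean (43.23 s),
# treated as a fixed reference value. The printed values come from this test.
stat, pval = ttest_1samp(new_machine, 43.23)
print('stat is  ', stat)
# Halving the two-tailed p-value is valid because stat < 0, the direction of H1
print('pvalue for the one-tailed test is ', pval/2)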

stat is   -5.043318535038297
pvalue for the one-tailed test is  0.0003483188038379669


- pval one-tailed (half of the two-tailed p-value) / take the sign of stat into account

     - if > 0.05: not enough evidence to reject H0 (Null) --> keep H0
     - **if < 0.05 and stat < 0: enough evidence to reject H0 (Null) --> accept H1 (alternative)**
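The one-tailed comparison of the two machines can also be run directly as a two-sample Welch test, which avoids halving the p-value by hand. A minimal sketch, assuming SciPy 1.6 or later for the `alternative` argument:

```python
import numpy as np
import scipy.stats as st

# Packing times in seconds, copied from the lab data
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

# H1: mean(new) < mean(old); alternative='less' needs SciPy >= 1.6
welch_stat, welch_p = st.ttest_ind(new_machine, old_machine,
                                   equal_var=False, alternative='less')

alpha = 0.05
print("Welch t statistic:", welch_stat)
print("one-sided p-value:", welch_p)
print("reject H0" if welch_p < alpha else "fail to reject H0")
```

Unlike a one-sample test against a fixed reference value, this version also accounts for the sampling variability of the old machine's times.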

In [92]:
43.23-new_machine.mean()

1.0899999999999963

In [93]:
st.sem(new_machine)*stat

-1.0899999999999963

**Assuming that there is sufficient evidence to conduct the t-test, the data do provide sufficient evidence that the new machine is faster than the old one.**

## Part 1.2: Matched Pairs Test

#### Import dataset

In this challenge we will work on the Pokemon dataset you have already used. The goal is to test whether different groups of pokemon (e.g. Legendary vs Normal, Generation 1 vs 2, single-type vs dual-type) have different stats (e.g. HP, Attack, Defense, etc.). Use pokemon.csv

In [94]:
# Your code here:
pokemon = pd.read_csv('pokemon.txt')
pokemon = pokemon.reset_index(drop=True)
pokemon = pokemon.drop(['#'], axis=1)
display(pokemon.head(),pokemon.shape)

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


(800, 12)

**Our goal** is to see whether there is a significant difference between each Pokemon's defense and attack scores. **Our hypothesis is that the defense and attack scores are equal**. Compare the two columns to see if there is a statistically significant difference between them and comment your result.

#### First we want to define a function with which we can test the means of a feature set of two samples. 

In the next cell you'll see the annotation of the Python function, which explains what the function does, its arguments, and its return value. This type of annotation is called a **docstring**, a convention used among Python developers. Docstrings let developers document their code consistently so that others can read it, and they let tools automatically parse the docstrings and generate user-friendly documentation.

Follow the specifications of the docstring and complete the function.
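As a small illustration of the convention, here is a sketch of a docstring on a hypothetical helper (`mean_difference` is not part of the lab) and one way to read it back at runtime:

```python
import inspect

def mean_difference(a, b):
    """
    Return the difference between the means of two samples.

    Args:
        a (list): first sample
        b (list): second sample

    Returns:
        float: mean(a) - mean(b)
    """
    return sum(a) / len(a) - sum(b) / len(b)

# The docstring is stored on the function object and can be read back,
# which is what documentation tools and help() rely on
doc = inspect.getdoc(mean_difference)
print(doc)
```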

In [95]:
from scipy.stats import ttest_ind

def t_test_features(s1, s2, features=['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', 'Total']):
    """
    Test the means of a feature set of two samples using a two-sample t-test
    
    Args:
        s1 (dataframe): sample 1
        s2 (dataframe): sample 2
        features (list): an array of features to test
    
    Returns:
        dict: a dictionary of t-test scores for each feature where the feature name is the key and the p-value is the value
    """
    
    # Calculate the t-test p-value for each feature in the feature list
    results = {}
    for feature in features:
        # Welch's t-test (equal_var=False): does not assume equal variances
        t, p = ttest_ind(s1[feature], s2[feature], equal_var=False)
        results[feature] = p
    
    # Return the dictionary of t-test p-values
    return results


#### Using the `t_test_features` function, conduct a t-test for Legendary vs non-Legendary pokemons.

*Hint: your output should look like below:*

```
{'HP': 1.0026911708035284e-13,
 'Attack': 2.520372449236646e-16,
 'Defense': 4.8269984949193316e-11,
 'Sp. Atk': 1.5514614112239812e-21,
 'Sp. Def': 2.2949327864052826e-15,
 'Speed': 1.049016311882451e-18,
 'Total': 9.357954335957446e-47}
 ```

In [96]:
# Two samples based on 'Legendary' status
s1 = pokemon[pokemon['Legendary']]
s2 = pokemon[~pokemon['Legendary']]

# Call the t_test_features function
results = t_test_features(s1, s2)

results

{'HP': 1.0026911708035284e-13,
 'Attack': 2.520372449236646e-16,
 'Defense': 4.8269984949193316e-11,
 'Sp. Atk': 1.5514614112239812e-21,
 'Sp. Def': 2.2949327864052826e-15,
 'Speed': 1.049016311882451e-18,
 'Total': 9.357954335957446e-47}

#### From the test results above, what conclusion can you make? Do Legendary and non-Legendary pokemons have significantly different stats on each feature?

- H0: For each feature, Legendary and non-Legendary Pokemon have the same mean.
- H1: For each feature, Legendary and non-Legendary Pokemon have different means.



- Legendary Vs Non-Legendary
            - The p-values for all the features are smaller than 0.05, so there are significant differences between Legendary and non-Legendary Pokemon for all of the features --> reject H0 in favor of H1

#### Next, conduct t-test for Generation 1 and Generation 2 pokemons.

In [97]:
pokemon['Generation'].value_counts()

1    166
5    165
3    160
4    121
2    106
6     82
Name: Generation, dtype: int64

In [98]:
# Split the pokemon dataframe into two samples based on 'Generation' status = 1
#s1 = pokemon[(pokemon['Generation']==1)]
#s2 = pokemon[~(pokemon['Generation']==1)]

# or s2 == 2?

# Call the t_test_features function on the two samples
#results = t_test_features(s1, s2)

#results

In [99]:
# Split the pokemon dataframe into two samples based on 'Generation' status = 2
#s1 = pokemon[(pokemon['Generation']==2)]
#s2 = pokemon[~(pokemon['Generation']==2)]

# Call the t_test_features function on the two samples
#results = t_test_features(s1, s2)

#results

In [113]:
# Two samples: Generation 1 vs Generation 2
s1 = pokemon[pokemon['Generation'] == 1]
s2 = pokemon[pokemon['Generation'] == 2]

# Call the t_test_features function
results = t_test_features(s1, s2)

results

{'HP': 0.14551697834219623,
 'Attack': 0.24721958967217725,
 'Defense': 0.5677711011725426,
 'Sp. Atk': 0.12332165977104388,
 'Sp. Def': 0.18829872292645752,
 'Speed': 0.00239265937312135,
 'Total': 0.5631377907941676}

#### What conclusions can you make?

- H0: For each feature, Generation 1 and Generation 2 Pokemon have the same mean.
- H1: For each feature, Generation 1 and Generation 2 Pokemon have different means.



- Generation 1 Vs Generation 2:

        - All of the p-values are greater than 0.05 except for 'Speed', so there is NO significant difference for most of the features of the two samples --> fail to reject H0 for those features
        - For 'Speed' (p-val = 0.0024) the difference is significant --> reject H0 for 'Speed' only
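Reading the dictionary feature by feature can be automated. A minimal sketch, using the Generation 1 vs Generation 2 p-values copied from the output above and the 0.05 threshold used in this lab:

```python
# p-values copied from the Generation 1 vs Generation 2 output
results = {'HP': 0.14551697834219623,
           'Attack': 0.24721958967217725,
           'Defense': 0.5677711011725426,
           'Sp. Atk': 0.12332165977104388,
           'Sp. Def': 0.18829872292645752,
           'Speed': 0.00239265937312135,
           'Total': 0.5631377907941676}

alpha = 0.05  # significance level used throughout the lab

# Features whose p-value falls below alpha (H0 rejected for that feature)
significant = [feature for feature, p in results.items() if p < alpha]
print(significant)  # ['Speed']
```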

#### Compare pokemons who have single type vs those having two types.

In [101]:
pokemon['Type 1'].value_counts(dropna=False)

Water       112
Normal       98
Grass        70
Bug          69
Psychic      57
Fire         52
Electric     44
Rock         44
Dragon       32
Ground       32
Ghost        32
Dark         31
Poison       28
Steel        27
Fighting     27
Ice          24
Fairy        17
Flying        4
Name: Type 1, dtype: int64

In [102]:
pokemon['Type 2'].value_counts(dropna=False)

NaN         386
Flying       97
Ground       35
Poison       34
Psychic      33
Fighting     26
Grass        25
Fairy        23
Steel        22
Dark         20
Dragon       18
Ice          14
Rock         14
Water        14
Ghost        14
Fire         12
Electric      6
Normal        4
Bug           3
Name: Type 2, dtype: int64

In [103]:
# Type 1 and type 2

# All Pokemon that have values for both Type 1 and Type 2
s1 = pokemon.loc[pokemon['Type 2'].notna()]

# All Pokemon that have only Type 1 (there is no nan in Type1)
s2 = pokemon.loc[pokemon['Type 2'].isna()]

# Call the t_test_features function
results = t_test_features(s1, s2)

results

{'HP': 0.11314389855379414,
 'Attack': 0.00014932578145948305,
 'Defense': 2.7978540411514693e-08,
 'Sp. Atk': 0.00013876216585667907,
 'Sp. Def': 0.00010730610934512779,
 'Speed': 0.02421703281819093,
 'Total': 1.1157056505229961e-07}

#### What conclusions can you make?

- H0: For each feature, single-type and dual-type Pokemon have the same mean.
- H1: For each feature, single-type and dual-type Pokemon have different means.



- Single Type Pokemon Vs 2-Type Pokemon

        - There are significant differences (p-val < 0.05) for 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed', and 'Total' features of the two samples --> reject H0 for those features

        - However, the p-val for 'HP' (0.11) is greater than 0.05, so we fail to reject H0 for 'HP': the data do not show a significant difference in that case.

#### Now, we want to compare whether there are significant differences of `Attack` vs `Defense`  and  `Sp. Atk` vs `Sp. Def` of all pokemons. Please write your code below.

*Hint: are you comparing different populations or the same population?*

- Matched pairs (same population measured twice) --> st.ttest_rel( )

In [110]:
# Attack Vs Defense
st.ttest_rel(pokemon['Attack'], pokemon['Defense'])

Ttest_relResult(statistic=4.325566393330478, pvalue=1.7140303479358558e-05)

In [112]:
# Sp.Atk Vs Sp.Def
st.ttest_rel(pokemon['Sp. Atk'], pokemon['Sp. Def'])

Ttest_relResult(statistic=0.853986188453353, pvalue=0.3933685997548122)
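A paired t-test is equivalent to a one-sample t-test on the per-row differences, tested against a mean of zero. The sketch below checks this on the five rows shown in the `head()` output above; five rows are far too few for a real conclusion and serve only to illustrate the mechanics.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Attack and Defense of the first five Pokemon shown in the head() output
attack = np.array([49, 62, 82, 100, 52])
defense = np.array([49, 63, 83, 123, 43])

# Paired test on the two columns
paired = ttest_rel(attack, defense)

# Same test expressed as a one-sample test on the differences (H0: mean difference = 0)
diff = attack - defense
one_sample = ttest_1samp(diff, 0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```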

#### What conclusions can you make?

- H0: The two paired scores have equal means (Attack = Defense; Sp. Atk = Sp. Def).
- H1: The two paired scores do NOT have equal means.



- Attack Vs Defense

        - There is a significant difference between the mean values of 'Attack' and 'Defense' (p-value = 1.7e-05 < 0.05), which is strong evidence against the null hypothesis that the mean 'Attack' and 'Defense' values are equal --> reject H0
        
- Sp. Atk Vs Sp. Def

        - The p-value of 0.39 is well above 0.05, so there is NOT enough evidence against the null hypothesis that the mean 'Sp. Atk' and 'Sp. Def' values are equal --> fail to reject H0

    