In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

### One-tailed t-test
In a packing plant, a machine packs cartons with jars. It is hypothesized that a new machine will pack faster on average than the machine currently in use. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded.

The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. 

Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

### H0 = mean packing time of the new machine >= mean packing time of the old machine
### H1 = mean packing time of the new machine < mean packing time of the old machine (the new machine is faster)

In [15]:
sample_machine_data = pd.read_excel('/Users/devirughani/Desktop/IronHack/Week_7/Day_2/Labs/lab-t-tests-p-values/files_for_lab/machine.csv.xlsx')
sample_machine_data 

Unnamed: 0,New Machine,Old Machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [18]:
sample_machine_data['difference'] = sample_machine_data['New Machine'] - sample_machine_data['Old Machine']
sample_machine_data 

Unnamed: 0,New Machine,Old Machine,difference
0,42.1,42.7,-0.6
1,41.0,43.6,-2.6
2,41.3,43.8,-2.5
3,41.8,43.3,-1.5
4,42.4,42.5,-0.1
5,42.8,43.5,-0.7
6,43.2,43.1,0.1
7,42.3,41.7,0.6
8,41.8,44.0,-2.2
9,42.7,44.1,-1.4


In [22]:
sample_diff_mean, sample_diff_std = sample_machine_data['difference'].mean(), sample_machine_data['difference'].std(ddof=1)
sample_diff_mean, sample_diff_std

(-1.0900000000000012, 1.125907041751967)

In [24]:
t = sample_diff_mean / ( sample_diff_std / np.sqrt(sample_machine_data.shape[0]) )
print("The mean of our samples differences is: {:.2f}".format(sample_diff_mean))
print("The standard deviation of our samples differences is: {:.2f}".format(sample_diff_std))
print("Our t statistic is: {:.2f}".format(t))

The mean of our samples differences is: -1.09
The standard deviation of our samples differences is: 1.13
Our t statistic is: -3.06


In [28]:
tc = st.t.ppf(0.05,df= sample_machine_data.shape[0] - 1)
tc

-1.8331129326536337

In [26]:
st.t.cdf(t,df = sample_machine_data.shape[0] - 1)

0.006770167825816246

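The t statistic (-3.06) is less than the critical value tc (-1.83), so it falls in the rejection region.

The p-value (0.0068, the CDF evaluated at t) is also well below alpha = 0.05.

Hence we reject the null hypothesis that the new machine's mean packing time is greater than or equal to the old machine's, and conclude that the new machine packs faster on average.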
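
As a cross-check, the same paired one-tailed test can be run in a single call. This is a minimal sketch, assuming SciPy 1.6 or later (where `ttest_rel` accepts the `alternative` argument). The two lists are typed in from the table above, and the names `new_machine` and `old_machine` are only illustrative.

```python
import scipy.stats as st

# Packing times in seconds, copied from the table above
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# Paired test on (new - old); alternative='less' matches H1: new < old
result = st.ttest_rel(new_machine, old_machine, alternative='less')
print("t = {:.2f}, p = {:.4f}".format(result.statistic, result.pvalue))
```

The statistic and p-value should agree with the manual calculation (t of about -3.06, p of about 0.0068).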

## Matched Pairs Test
In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our null hypothesis is that the mean defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

### H0  = mean defense score = mean attack score
### H1 = mean defense score != mean attack score

In [34]:
pokemon_data = pd.read_csv('/Users/devirughani/Desktop/IronHack/Week_7/Day_2/Labs/lab-t-tests-p-values/files_for_lab/pokemon.csv')
display(pokemon_data.head())
pokemon_data.shape

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


(800, 13)

In [42]:
pokemon_sample=pokemon_data.sample(30, random_state=1)
pokemon_sample=pokemon_sample[['Attack','Defense']]

In [44]:
pokemon_sample['difference'] = pokemon_sample['Attack']-pokemon_sample['Defense']
pokemon_sample.head()

Unnamed: 0,Attack,Defense,difference
8,104,78,26
510,92,75,17
175,46,34,12
735,50,58,-8
242,105,75,30


In [45]:
sample_diff_mean, sample_diff_std = pokemon_sample['difference'].mean(), pokemon_sample['difference'].std(ddof=1)
sample_diff_mean, sample_diff_std

(2.8, 36.32079294288603)

In [47]:
pokemon_t = sample_diff_mean / ( sample_diff_std / np.sqrt(pokemon_sample.shape[0]) )
print("The mean of our samples differences is: {:.2f}".format(sample_diff_mean))
print("The standard deviation of our samples differences is: {:.2f}".format(sample_diff_std))
print("Our t statistic is: {:.2f}".format(pokemon_t))

The mean of our samples differences is: 2.80
The standard deviation of our samples differences is: 36.32
Our t statistic is: 0.42


In [50]:
tc = st.t.ppf((1-(0.05/2)),df= pokemon_sample.shape[0] - 1)
tc

2.045229642132703

In [52]:
st.t.ppf((0.05/2),df= pokemon_sample.shape[0] - 1)

-2.0452296421327034

In [51]:
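2 * st.t.sf(abs(pokemon_t), df = pokemon_sample.shape[0] - 1)  # two-sided p-value from pokemon_t, not the machine test's t

The t statistic (0.42) lies between the two critical values (-2.05 and 2.05), so the two-sided p-value is greater than 0.05. We therefore fail to reject the null hypothesis: this sample does not show a statistically significant difference between attack and defense scores.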
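
A two-sided test should count both tails when computing the p-value. The sketch below does this using only the summary statistics printed above for the 30 sampled Pokemon (mean and standard deviation of the differences); the variable names are illustrative, and the sampled rows themselves are not needed.

```python
import numpy as np
import scipy.stats as st

# Summary statistics reported above for the 30 sampled Pokemon
n = 30
diff_mean = 2.8
diff_std = 36.32079294288603

t_stat = diff_mean / (diff_std / np.sqrt(n))
# Two-sided p-value: probability of a statistic at least this extreme in either tail
p_two_sided = 2 * st.t.sf(abs(t_stat), df=n - 1)
print("t = {:.2f}, two-sided p = {:.3f}".format(t_stat, p_two_sided))
```

A p-value above 0.05 here is consistent with the t statistic (0.42) lying between the two critical values (about -2.05 and 2.05).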

## ANOVA Test
In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on: 
- Null hypothesis 
- Alternate hypothesis 
- Level of significance 
- Test statistic 
- P-value 
- F table

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate by the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. 

- State the null hypothesis
- State the alternate hypothesis
- What is the significance level
- What are the degrees of freedom of model, error terms, and total DoF



### H0 = the mean etching rate is the same at every power level
### H1 = the mean etching rate differs for at least one power level
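
The remaining setup questions (significance level and degrees of freedom) can be answered with a few lines of arithmetic. This is a sketch under stated assumptions: three power levels with five measurements each, as the group summary in this notebook shows, and alpha = 0.05, the level used in the tests here.

```python
# Assumed from the data shown in this notebook: 3 power levels, 15 observations
K = 3         # number of groups (160 W, 180 W, 200 W)
N = 15        # total number of etching-rate measurements
alpha = 0.05  # significance level (assumed; the value used elsewhere in this notebook)

dof_model = K - 1   # between-groups degrees of freedom
dof_error = N - K   # within-groups (error) degrees of freedom
dof_total = N - 1   # always equals dof_model + dof_error
print(dof_model, dof_error, dof_total)
```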

In [53]:
anova_test=pd.read_excel('/Users/devirughani/Desktop/IronHack/Week_7/Day_2/Labs/lab-t-tests-p-values/files_for_lab/anova_lab_data.xlsx')
anova_test.head()

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71


In [62]:
anova_test.columns = ['Power', 'Etching Rate']
anova_test.head()

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71


In [57]:
anova_test['Power'].unique()

array(['160 W', '180 W', '200 W'], dtype=object)

In [61]:
group_df = anova_test.groupby('Power')['Etching Rate'].agg(etching_rate_mean='mean',Samples='size').reset_index()
group_df

Unnamed: 0,Power,etching_rate_mean,Samples
0,160 W,5.792,5
1,180 W,6.238,5
2,200 W,8.318,5


SST (here the between-group mean square, `S2t`):

This term measures how much **each group mean deviates from the global mean**: each squared deviation is multiplied by the number of members in the group, the results are summed, and the sum is divided by the number of groups minus one.


In [63]:
S2t = 0
for power in anova_test['Power'].unique():
    ng = len(anova_test[anova_test['Power'] == power])  
    S2t  += ( ( anova_test[anova_test['Power'] == power]['Etching Rate'].mean() - anova_test['Etching Rate'].mean() ) ** 2) * ng
S2t /= ( anova_test['Power'].nunique() - 1 )
print("The value of S2t is {:.2f}".format(S2t)) 

The value of S2t is 9.09
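
The same quantity can be recovered from the group summary table alone, which makes the formula easier to see. A minimal sketch, assuming the group means and sizes printed above; the variable names are illustrative.

```python
import numpy as np

# Group means and sizes from the summary table above
group_means = np.array([5.792, 6.238, 8.318])
group_sizes = np.array([5, 5, 5])

# Size-weighted grand mean (equals the mean of all 15 observations)
grand_mean = np.sum(group_sizes * group_means) / np.sum(group_sizes)

# Between-group sum of squares, then divide by K - 1
ss_between = np.sum(group_sizes * (group_means - grand_mean) ** 2)
ms_between = ss_between / (len(group_means) - 1)
print("Between-group mean square: {:.2f}".format(ms_between))
```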


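SSE (here the within-group mean square, `S2E`):

This second term measures **how much every single value of every group deviates from its group mean**: the squared deviations are summed and divided by the number of observations minus the number of groups.

In summary, SST captures the variance of the group means around the global mean, while SSE captures the variance of the individual values around their own group mean.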

In [64]:
S2E = 0
for power in anova_test['Power'].unique():
    for rate in anova_test[anova_test['Power'] == power]['Etching Rate']:
        S2E += ( rate - anova_test[anova_test['Power'] == power]['Etching Rate'].mean() ) ** 2
S2E /= ( len(anova_test) - anova_test['Power'].nunique() ) 

print()
print("The value of S2E is {:.2f}".format(S2E))


The value of S2E is 0.25
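
The nested loop can also be written without explicit iteration. The sketch below uses `groupby(...).transform('mean')` to attach each row's own group mean, on a small hypothetical data set (not the etching data) so the arithmetic can be checked by hand.

```python
import pandas as pd

# Hypothetical toy data, NOT the etching data from this lab
toy = pd.DataFrame({
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

# Broadcast each row's own group mean back onto the rows
group_mean = toy.groupby('group')['value'].transform('mean')

# Within-group sum of squares divided by N - K
ss_within = ((toy['value'] - group_mean) ** 2).sum()
ms_within = ss_within / (len(toy) - toy['group'].nunique())
print("Within-group mean square: {:.2f}".format(ms_within))
```

By hand: group A contributes 1 + 0 + 1 = 2, group B contributes 4 + 0 + 4 = 8, and 10 / (6 - 2) = 2.5.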


In [67]:
F = S2t / S2E
print("The value of F is {:.2f}".format(F))

The value of F is 36.88


The shape of the F distribution depends on two degrees-of-freedom parameters, $d_{1}=K-1$ and $d_{2}=N-K$, where $K$ is the number of groups and $N$ is the total number of observations.

In [68]:
d1 = anova_test['Power'].nunique() - 1
d2 = len(anova_test) - anova_test['Power'].nunique()

print("Number of degrees of freedom d1: ",d1)
print("Number of degrees of freedom d2: ",d2)

Number of degrees of freedom d1:  2
Number of degrees of freedom d2:  12


The probability of getting an F value less than or equal to ours, if H0 is true, can be obtained with the CDF:

In [69]:
st.f.cdf(F,dfn=d1, dfd=d2)

0.9999924934157276

Thus, the probability of getting a value smaller than or equal to F is:

$$P(x \le F=36.88) \approx 0.99999$$

The complementary probability is given by:

In [71]:
1 - st.f.cdf(F,dfn=d1, dfd=d2)

7.5065842723986975e-06

Therefore, the probability of getting a value bigger than F is:

$$P(x > F) = 1 - P(x \le F) \approx 7.5 \times 10^{-6} < 0.05$$

Therefore, we reject H0.

The critical value that leaves an upper-tail area of 0.05 is given by:

In [72]:
Fc = st.f.ppf(1-0.05,dfn=d1, dfd=d2)

print("The critical value which corresponds to an area of 0.05 is: {:.2f}".format(Fc))

The critical value which corresponds to an area of 0.05 is: 3.89


As our F (36.88) is bigger than the critical value (3.89), we reject H0 and conclude that the mean etching rates are not the same at all power levels.
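
For very small upper-tail probabilities, subtracting a CDF value from 1 can lose precision. A possible alternative is the survival function `sf` (and its inverse `isf`), which work in the upper tail directly. The sketch below plugs in the F statistic at full precision and the degrees of freedom from this notebook; the variable names are illustrative.

```python
import scipy.stats as st

F = 36.87895470100505  # F statistic from this notebook (36.88 at full precision)
d1, d2 = 2, 12         # degrees of freedom computed above

# Survival function = 1 - CDF, evaluated directly in the upper tail
p_value = st.f.sf(F, dfn=d1, dfd=d2)
# Inverse survival function: critical value for an upper-tail area of 0.05
F_crit = st.f.isf(0.05, dfn=d1, dfd=d2)
print("p-value: {:.3g}, critical value: {:.2f}".format(p_value, F_crit))
```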

The whole test can be done in a single call with SciPy's `f_oneway`:

In [74]:
print(st.f_oneway(anova_test[anova_test['Power'].str.contains("160 W")]['Etching Rate'],anova_test[anova_test['Power'].str.contains("180 W")]['Etching Rate'],
                  anova_test[anova_test['Power'].str.contains("200 W")]['Etching Rate']))

F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)


As the p-value is < 0.05, we reject H0.
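
When there are many groups, typing one filter per group becomes tedious. One option is to build the samples from a `groupby` and unpack them into `f_oneway`. This is a minimal sketch on hypothetical toy data (not the etching data); only the column names mirror the lab's.

```python
import pandas as pd
import scipy.stats as st

# Hypothetical toy data, NOT the etching data from this lab
toy = pd.DataFrame({
    'Power': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Etching Rate': [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

# One array of values per group, without hard-coding the group labels
samples = [rates.to_numpy() for _, rates in toy.groupby('Power')['Etching Rate']]
result = st.f_oneway(*samples)
print("F = {:.2f}, p = {:.4f}".format(result.statistic, result.pvalue))
```

For this toy data the between-group mean square is 24 and the within-group mean square is 2.5, so F = 9.6.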