In [2]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

# 1. One-tailed t-test

In a packing plant, a machine packs cartons with jars. It is hypothesized that a new machine will pack faster on average than the machine currently used.

To test that hypothesis, the times it takes each machine to pack ten cartons are recorded.

The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming that the conditions for conducting a t-test are met, does the data provide sufficient evidence to show that one machine is better than the other?

In [16]:
# Get the data
data = pd.read_csv('machine.txt', sep='\t')
data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [21]:
data.columns

Index(['New machine', '    Old machine'], dtype='object')

In [24]:
# The samples are independent, as the measurements are made on separate machines.
# H0: the old machine is faster than, or as fast as, the new machine: mean_time_old <= mean_time_new
# H1: the new machine is faster: mean_time_new < mean_time_old
st.ttest_ind(data['New machine'], data['    Old machine'], equal_var=False)

Ttest_indResult(statistic=-3.397230706117603, pvalue=0.0032422494663179747)

In [None]:
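# In this example, a significance level of 0.05 seems good enough.
# ttest_ind returns a two-sided p-value by default. Our H1 is one-sided and the statistic is negative
# (i.e. in the direction of H1), so the one-sided p-value is half of the reported one, about 0.0016.
# Either way, the p-value is well below 0.05, so we reject H0.
# The negative statistic shows that the mean packing time of the new machine is lower: it is significantly faster than the old one.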
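
The hypotheses above are one-sided, while `st.ttest_ind` reports a two-sided p-value unless told otherwise. A minimal sketch of the one-sided version, assuming SciPy 1.6 or later (where the `alternative` parameter is available) and re-entering the values from the table by hand instead of reading the file:

```python
import numpy as np
import scipy.stats as st

# Packing times in seconds, copied from the table shown above.
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

# alternative='less' tests H1: mean(new_machine) < mean(old_machine).
one_sided = st.ttest_ind(new_machine, old_machine, equal_var=False, alternative='less')

# Default call (two-sided), for comparison.
two_sided = st.ttest_ind(new_machine, old_machine, equal_var=False)

print("statistic:", one_sided.statistic)
print("one-sided p-value:", one_sided.pvalue)
print("two-sided p-value / 2:", two_sided.pvalue / 2)
```

Because the statistic is negative, the one-sided p-value equals half of the two-sided one, so the conclusion is unchanged.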

# 2. Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv).

Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores.

Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

In [25]:
# Get the data
pokemon = pd.read_csv('pokemon.csv')
pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [27]:
pokemon.shape

(800, 13)

In [None]:
# There are currently 898 Pokemon in total. Source for the check:
# (https://www.wargamer.com/pokemon-trading-card-game/how-many-pokemon-are-there#:~:text=There%20are%20898%20Pok%C3%A9mon.,Within%20those%2C%2059%20are%20Legendary.)
# So we can consider our dataset to be a sample (a pretty big one, I admit ...)

In [26]:
# Here we want to compare two different measures for each Pokemon, so we are indeed using paired samples.
# H0: the mean difference between Defense and Attack is zero
# H1: the mean difference between Defense and Attack is not zero
st.ttest_rel(pokemon['Defense'], pokemon['Attack'])

Ttest_relResult(statistic=-4.325566393330478, pvalue=1.7140303479358558e-05)

In [None]:
# In this example, a significance level of 0.05 seems good enough.
# The p-value is again very low, so we reject H0: Attack and Defense are significantly different.
# The negative statistic indicates that the mean Defense is lower than the mean Attack.
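
A paired t-test is equivalent to a one-sample t-test on the per-row differences against zero, which is why the sign of the statistic follows the order of the arguments (Defense minus Attack here). The sketch below illustrates this on only the five rows shown by `pokemon.head()`, so its numbers are not the full-dataset result:

```python
import numpy as np
import scipy.stats as st

# Only the five rows displayed by pokemon.head(); NOT the full 800-row dataset.
defense = np.array([49, 63, 83, 123, 43])
attack = np.array([49, 62, 82, 100, 52])

# Paired test on the two columns.
paired = st.ttest_rel(defense, attack)

# Same test expressed as a one-sample test on the differences.
diff = defense - attack
one_sample = st.ttest_1samp(diff, 0)

print("paired:    ", paired.statistic, paired.pvalue)
print("one-sample:", one_sample.statistic, one_sample.pvalue)
```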

# 3. OPTIONAL PART | Inferential statistics - ANOVA

### Part 1

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam.

Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate by the machine.

You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

- State the null hypothesis

The null hypothesis is that there is no effect, meaning that the etching rate does not vary when we apply different levels of power to the plasma beam.
The mean etching rate should thus be equal across all the power groups.

- State the alternate hypothesis

The alternate hypothesis is that changing the power of the plasma beam has an effect on the etching rate: there is at least one group for which the mean etching rate is different.

- What is the significance level

The significance level is the probability of rejecting the null hypothesis when it is actually true (a Type I error).
Setting it at 0.05, for example, means that we accept a 5% risk of wrongly rejecting a true null hypothesis.

- What are the degrees of freedom of model, error terms, and total DoF

In [28]:
plasma_data = pd.read_excel('anova_lab_data.xlsx')
plasma_data

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [36]:
plasma_data['Power '].value_counts()

160 W    5
180 W    5
200 W    5
Name: Power , dtype: int64

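Each of our 3 groups has 5 measures, so we have 15 measures in total. For each group we estimate only the mean etching rate, which means 1 parameter, so each group contributes 4 degrees of freedom (5 measures - 1 parameter) to the error term.

The model (between groups) DoF is 2 (3 groups - 1). The error (within groups) DoF is 12 (4 + 4 + 4, or 15 measures - 3 groups). The total DoF is 14 (15 measures - 1), which is also the sum of the model and error DoF (2 + 12).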
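
As a quick check of the degrees of freedom, the standard one-way ANOVA decomposition (model = groups - 1, error = observations - groups, total = observations - 1) can be written out directly. The variable names below are illustrative only:

```python
# Group layout taken from the value_counts() output above: 3 power levels, 5 measures each.
n_groups = 3
n_per_group = 5
n_total = n_groups * n_per_group

df_model = n_groups - 1        # between-groups degrees of freedom
df_error = n_total - n_groups  # within-groups (error) degrees of freedom
df_total = n_total - 1         # total degrees of freedom

print(df_model, df_error, df_total)
```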

In [None]:
# An error term represents the uncertainty in a statistical model: it is the residual variation that the model does not explain.
# source : https://www.investopedia.com/terms/e/errorterm.asp#:~:text=An%20error%20term%20is%20a,variables%20and%20the%20dependent%20variables.

### Part 2

In [33]:
# Mean etching rate per power level
plasma_data.groupby('Power ').agg({'Etching Rate': 'mean'})

Unnamed: 0_level_0,Etching Rate
Power,Unnamed: 1_level_1
160 W,5.792
180 W,6.238
200 W,8.318


In [38]:
#pivoting the data
plasma_data['power_count'] = plasma_data.groupby('Power ').cumcount()

plasma_data_pivot = plasma_data.pivot(index='power_count', columns='Power ', values='Etching Rate')
plasma_data_pivot.head()

Power,160 W,180 W,200 W
power_count,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


In [40]:
st.f_oneway(plasma_data_pivot['160 W'],plasma_data_pivot['180 W'],plasma_data_pivot['200 W'])

F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)

The p-value is far below 0.05 (and very small generally speaking), which leads us to reject the null hypothesis.
We can say that the mean Etching Rate of at least one power level differs from the others.

The Etching Rate thus varies when we change the power of the plasma beam.
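
To see where the F statistic comes from, here is a minimal sketch that rebuilds it from the between-group and within-group sums of squares and compares it with `st.f_oneway`. The values are re-entered by hand from the pivoted table above, and the variable names are illustrative only:

```python
import numpy as np
import scipy.stats as st

# Etching rates per power level, copied from the pivoted table above.
groups = {
    '160 W': np.array([5.43, 5.71, 6.22, 6.01, 5.59]),
    '180 W': np.array([6.24, 6.71, 5.98, 5.66, 6.60]),
    '200 W': np.array([8.79, 9.20, 7.90, 8.15, 7.55]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Between-groups sum of squares: group size times squared distance of each group mean to the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
# Within-groups sum of squares: squared distance of each value to its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

df_model = len(groups) - 1
df_error = len(all_values) - len(groups)

f_manual = (ss_between / df_model) / (ss_within / df_error)
p_manual = st.f.sf(f_manual, df_model, df_error)  # upper tail of the F distribution

f_scipy = st.f_oneway(*groups.values())
print("manual:", f_manual, p_manual)
print("scipy: ", f_scipy.statistic, f_scipy.pvalue)
```

A significant F only says that at least one group mean differs; it does not say which one, so a post-hoc comparison would be needed to identify the power levels that differ.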