# Lab | Inferential statistics - T-test & P-value

## 1. One-tailed t-test

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

In [24]:
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [25]:
# data = pd.read_csv (r'C:\Users\Elena\Documents\IRONHACK\WEEK7\LABS\lab-t-tests-p-values\files_for_lab\machine.txt',sep=" ",header = None)
# print (data)

In [26]:
new_machine = [42.1,41,41.3,41.8,42.4,42.8,43.2,42.3,41.8,42.7]
old_machine = [42.7,43.6,43.8,43.3,42.5,43.5,43.1,41.7,44,44.1]

In [27]:
data = pd.DataFrame({"new_machine":new_machine,"old_machine":old_machine})
data.head(10)

Unnamed: 0,new_machine,old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [28]:
#One-tailed

**Setting our hypothesis:** the new machine packs faster on average than the machine currently used, i.e. its mean packing time is lower.

* H₀: mu_NEW ≥ mu_OLD vs H₁: mu_NEW < mu_OLD. In one-sample form: H₀: 𝛍 ≥ k vs H₁: 𝛍 < k, with k = 43.23 seconds (the old_machine average packing time).

Is it reasonable to say that the average packing time of new_machine is greater than or equal to 43.23 seconds?
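As a quick check of the value k used above, the sample means can be computed directly from the two lists. This is a minimal sketch that re-declares the data so it runs on its own; the names `mean_new` and `mean_old` are introduced here only for illustration.

```python
import numpy as np

# Packing times in seconds, copied from the cells above
new_machine = [42.1, 41, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44, 44.1]

mean_new = np.mean(new_machine)  # illustrative name, not in the original notebook
mean_old = np.mean(old_machine)  # this is the k used in the hypotheses

print(f"new_machine mean: {mean_new:.2f} s")  # 42.14 s
print(f"old_machine mean: {mean_old:.2f} s")  # 43.23 s
```

A lower mean time for new_machine is what H₁ predicts; the t-test then tells us whether that gap is larger than chance alone would explain.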

In [29]:
st.ttest_ind(data['new_machine'], data['old_machine'], equal_var=False) # if we don't assume equal variance the test will be more robust



Ttest_indResult(statistic=-3.397230706117603, pvalue=0.0032422494663179747)

In [30]:
# One-tailed p-value: halve the two-tailed p-value (valid here because the sign of the statistic matches H₁)
print('p value (single tailed):', st.ttest_ind(data['new_machine'], data['old_machine'], equal_var=False)[1] / 2)

p value (single tailed): 0.0016211247331589873
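Halving the two-tailed p-value is only valid when the statistic has the sign predicted by H₁. Assuming SciPy 1.6 or later, `ttest_ind` accepts an `alternative` argument that runs the one-tailed test directly. The sketch below re-declares the data so it runs on its own and compares both approaches.

```python
import scipy.stats as st

# Packing times in seconds, copied from the cells above
new_machine = [42.1, 41, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44, 44.1]

# Two-tailed Welch test (the default)
two_sided = st.ttest_ind(new_machine, old_machine, equal_var=False)

# One-tailed Welch test: H1 is mean(new_machine) < mean(old_machine)
one_sided = st.ttest_ind(new_machine, old_machine, equal_var=False, alternative='less')

print('two-tailed p / 2 :', two_sided.pvalue / 2)
print('alternative=less :', one_sided.pvalue)
```

Both lines should print the same value. With `alternative='less'` there is no need to inspect the sign of the statistic separately, because a statistic in the wrong direction would produce a large p-value.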


**INTERPRETATION**

* First let's check the **p value** (single tailed): 0.0016 is less than our significance level (5%), so we can reject the null hypothesis.

* Let's check the **statistic**: it is -3.40. It is negative, which is the direction stated by H₁, so it supports rejecting the null hypothesis.

#Negative statistic means mu_NEW - mu_OLD < 0   ---->  mu_NEW < mu_OLD

#The statistic must have the same sign as the hypothesis you are trying to prove.

**We can conclude that the new machine is faster on average than the old one: its mean packing time is significantly lower.**
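The sign of the statistic can be checked by computing Welch's t by hand, as the difference in sample means divided by its standard error. This is a sketch with illustrative variable names (`se`, `t_manual`, `t_scipy`); it re-declares the data so it runs on its own.

```python
import numpy as np
import scipy.stats as st

# Packing times in seconds, copied from the cells above
new = np.array([42.1, 41, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44, 44.1])

# Standard error of the difference in means, using sample variances (ddof=1)
se = np.sqrt(new.var(ddof=1) / len(new) + old.var(ddof=1) / len(old))

# First sample minus second sample: negative when new_machine takes less time
t_manual = (new.mean() - old.mean()) / se

t_scipy = st.ttest_ind(new, old, equal_var=False).statistic
print(t_manual, t_scipy)
```

The order of the arguments fixes the sign: swapping `new` and `old` flips the statistic to positive without changing the two-tailed p-value.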


## 2. Matched Pairs Test

In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). 

Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. 

Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

In [31]:
data = pd.read_csv (r'C:\Users\Elena\Documents\IRONHACK\WEEK7\LABS\lab-t-tests-p-values\files_for_lab\pokemon.csv')
data.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


**Setting our hypothesis:** defense and attack scores are equal

 *  H₀: mu_D = mu_A vs H₁: mu_D ≠ mu_A, i.e. H₀: 𝛍₁=𝛍₂ vs H₁: 𝛍₁≠𝛍₂

The two samples are dependent: each Attack value is paired with the Defense value of the same Pokemon. MATCHED PAIRS

In [32]:
#H₀: 𝛍₁=𝛍₂ vs H₁: 𝛍₁≠𝛍₂
st.ttest_rel(data['Attack'], data['Defense'])

Ttest_relResult(statistic=4.325566393330478, pvalue=1.7140303479358558e-05)
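A matched pairs test is equivalent to a one-sample t-test on the per-row differences against zero. The sketch below illustrates this with only the five rows displayed by `data.head()` above, so its numbers will not match the full-dataset result; the variable names are introduced here for illustration.

```python
import numpy as np
import scipy.stats as st

# Attack and Defense for the five Pokemon shown by data.head() (a small subset, for illustration only)
attack = np.array([49, 62, 82, 100, 52])
defense = np.array([49, 63, 83, 123, 43])

# Paired test on the two columns
paired = st.ttest_rel(attack, defense)

# Same test expressed as a one-sample test on the differences
diff = attack - defense
one_sample = st.ttest_1samp(diff, 0)

print(paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

Because the statistic is built from `attack - defense`, a positive value means Attack is higher on average and a negative value means Defense is higher.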

**INTERPRETATION**

First let's check the **p value**: 1.7e-05 is far below our significance level (5%), so we can reject the null hypothesis.
(It is a two-sided test, so we don't need to check the sign of the statistic to reach our conclusion.)

#Positive statistic (4.33) means mu_attack - mu_defense > 0 ----> mu_attack > mu_defense

**We can conclude that there is a statistically significant difference between Attack and Defense scores: on average, Attack is higher than Defense.**

## OPTIONAL PART | Inferential statistics - ANOVA

## Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:

* Null hypothesis
* Alternate hypothesis
* Level of significance
* Test statistic
* P-value
* F table

### Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

* State the null hypothesis
* State the alternate hypothesis
* What is the significance level
* What are the degrees of freedom of model, error terms, and total DoF

Data was collected randomly and provided to you in the table as shown:

[link to image -Data](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.05/7.05-lab_data.png)

In [33]:
# Check whether changing the power (W) of the plasma beam has any effect on the etching rate of the machine

# Is there any difference in the mean etching rate for different levels of power?

In [34]:
data_pw = pd.read_excel (r'C:\Users\Elena\Documents\IRONHACK\WEEK7\LABS\lab-t-tests-p-values\files_for_lab\anova_lab_data.xlsx')
data_pw.head(20)

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


**STATE THE NULL/ALTERNATE HYPOTHESIS:**

H₁: changing the power of the plasma beam has an effect on the etching rate of the machine.

Does the etching rate depend on the power (W)? Our null hypothesis states that it does not.

We are trying to determine whether we can reasonably say that the mean etching rate is the same at every power level.
 * H₀: the mean etching rate is the same at every power level (power does not affect the etching rate). INDEPENDENT VARIABLES.
 
 * H₁: at least one power level has a different mean etching rate (power affects the etching rate). DEPENDENT VARIABLES.


**STATE SIGNIFICANCE LEVEL:**
5%

**What are the degrees of freedom of model, error terms, and total DoF**
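For a one-way ANOVA with k groups and N observations in total, the model has k - 1 degrees of freedom, the error term has N - k, and the total is N - 1. The sketch below applies these formulas to the 15 etching rates displayed in the pivoted table of Part 2, assuming those five rows per power level are the complete dataset; the `groups` dictionary is a hand-typed copy introduced here for illustration.

```python
# Etching rates per power level, copied from the pivoted table shown in Part 2
# (assumption: these 15 values are the whole dataset)
groups = {
    160: [5.43, 5.71, 6.22, 6.01, 5.59],
    180: [6.24, 6.71, 5.98, 5.66, 6.60],
    200: [8.79, 9.20, 7.90, 8.15, 7.55],
}

k = len(groups)                                  # number of power levels
n_total = sum(len(v) for v in groups.values())   # total number of observations

df_model = k - 1         # between-group (model) degrees of freedom
df_error = n_total - k   # within-group (error) degrees of freedom
df_total = n_total - 1   # total degrees of freedom

print(df_model, df_error, df_total)  # 2 12 14
```

The model and error degrees of freedom always add up to the total, which is a handy sanity check.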

## Part 2

In this section, use Python to conduct ANOVA.

What conclusions can you draw from the experiment and why?

In [36]:
mapping = {data_pw.columns[0]: 'power', data_pw.columns[1]: 'etching_rate'}
machine = data_pw.rename(columns=mapping)
machine.head()

Unnamed: 0,power,etching_rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71


In [37]:
# Strip the "W" unit and surrounding spaces, then convert the power level to a number
machine['power'] = machine['power'].str.replace('W', '', regex=False).str.strip().astype('float64')

In [42]:
machine['power_count'] = machine.groupby('power').cumcount()  # running position within each power group; used as the pivot index

In [41]:
machine

Unnamed: 0,power,etching_rate,power_count
0,160.0,5.43,0
1,180.0,6.24,0
2,200.0,8.79,0
3,160.0,5.71,1
4,180.0,6.71,1
5,200.0,9.2,1
6,160.0,6.22,2
7,180.0,5.98,2
8,200.0,7.9,2
9,160.0,6.01,3


In [46]:
machine_pivot = machine.pivot(index='power_count', columns='power', values='etching_rate')
machine_pivot.columns = ['Power_count'+str(x) for x in machine_pivot.columns.values]
machine_pivot.head()

power_count,Power_count160.0,Power_count180.0,Power_count200.0
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


In [49]:

# f_oneway is SciPy's one-way ANOVA: pass one sample per power level
st.f_oneway(machine_pivot['Power_count160.0'],machine_pivot['Power_count180.0'],machine_pivot['Power_count200.0'])

F_onewayResult(statistic=36.87895470100505, pvalue=7.506584272358903e-06)
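To see what `f_oneway` computes, the F statistic can be rebuilt from the between-group and within-group sums of squares. This is a sketch, not the notebook's own method: it uses a hand-typed copy of the 15 values displayed in the pivoted table above (assumed to be the complete dataset), and all variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

# Etching rates per power level, copied from the pivoted table above
groups = [
    np.array([5.43, 5.71, 6.22, 6.01, 5.59]),  # 160 W
    np.array([6.24, 6.71, 5.98, 5.66, 6.60]),  # 180 W
    np.array([8.79, 9.20, 7.90, 8.15, 7.55]),  # 200 W
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy = st.f_oneway(*groups).statistic
print(f_manual, f_scipy)
```

A large F means the group means differ by much more than the scatter inside each group would suggest.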

**Result: the p-value (7.5e-06) is far below the 5% significance level, so we reject H₀: changing the power of the plasma beam does have an effect on the etching rate of the machine!**
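The "F table" step from Part 1 can be reproduced with SciPy's F distribution instead of a printed table. The sketch below assumes 2 and 12 degrees of freedom (3 power levels, 15 observations) and reuses the F statistic reported by `f_oneway` above; the variable names are illustrative.

```python
import scipy.stats as st

alpha = 0.05
df_model, df_error = 2, 12        # assumption: 3 power levels, 15 observations
f_observed = 36.87895470100505    # statistic reported by f_oneway above

# Critical value: the F above which only 5% of the distribution lies under H0
f_critical = st.f.ppf(1 - alpha, df_model, df_error)

# P-value: probability of an F at least this large under H0
p_value = st.f.sf(f_observed, df_model, df_error)

print('critical F:', f_critical)
print('p-value   :', p_value)
print('reject H0 :', f_observed > f_critical)
```

Comparing the observed F with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.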