# Lab | Inferential statistics - T-test & P-value

Instructions

1. We will work through another simple example of a two-sample t-test (pooled, for when the variances are equal). This time it is a one-sided t-test.

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, do the data provide sufficient evidence that one machine is better than the other?

2. An additional problem (not mandatory): In this case we can't assume that the population variances are equal, so we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

The test statistic can be calculated as: <div>
    <img src="unpooled_variances.png" width="200"/>
</div>

The degrees of freedom are (n1-1)+(n2-1).

## Import libraries

In [126]:
import pandas as pd
import numpy as np
import statistics
import math
from scipy.stats import t
from scipy.stats import ttest_ind

In [127]:
df = pd.read_csv('files_for_lab/machine-new.csv')
df.columns = ['new_machine', 'old_machine']
df

Unnamed: 0,new_machine,old_machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


**Null hypothesis**: The mean packing times of the old and new machines are equal <br>
**Alternative hypothesis**: The new machine packs faster on average than the old machine, that is, its mean packing time is lower <br>
**Significance level**: 0.05 (5%) <br>
**As we want to compare the means of two groups, we will use a t-test** <br>
**One-tailed distribution** - The alternative hypothesis specifies a direction: the new machine is faster than the old one <br>
**Sample sizes**: n1=10, n2=10

In [128]:
old_mean = df['old_machine'].mean()
old_mean

43.230000000000004

In [129]:
new_mean = df['new_machine'].mean()
new_mean

42.14

In [130]:
old_std = df['old_machine'].std()
old_std

0.7498888806572157

In [131]:
new_std = df['new_machine'].std()
new_std

0.6834552736727638

### Summary statistics about old machine

In [132]:
old_mean = 43.23 # Mean packing time of the old machine (seconds)
old_std = 0.74 # Sample standard deviation, shortened to two decimals
n1 = 10 # Sample size for the old machine

### Summary statistics about new machine

In [133]:
new_mean = 42.14
new_std = 0.68
n2 = 10

### Finding T-statistic

In [138]:
pooled_sample_std = math.sqrt(((n1-1)*old_std**2 + (n2-1)*new_std**2)/(n1+n2-2))
statistic = (new_mean-old_mean)/(pooled_sample_std*math.sqrt((1/n1)+(1/n2)))
print("T Statistic is: ", statistic)

T Statistic is:  -3.4297764266251507
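As a cross-check on the manual formula, the same pooled statistic can be obtained from the summary statistics with `scipy.stats.ttest_ind_from_stats`. This is a minimal sketch that reuses the rounded values hard-coded above; passing `alternative='less'` assumes the one-sided alternative "new machine mean time is lower than old machine mean time".

```python
from scipy.stats import ttest_ind_from_stats

# Rounded summary statistics, as hard-coded in the cells above
new_mean, new_std, n2 = 42.14, 0.68, 10
old_mean, old_std, n1 = 43.23, 0.74, 10

# equal_var=True pools the variances; alternative='less' tests new < old
result = ttest_ind_from_stats(
    new_mean, new_std, n2,
    old_mean, old_std, n1,
    equal_var=True,
    alternative='less',
)
print("T statistic:", result.statistic)
print("One-sided p-value:", result.pvalue)
```

The statistic should agree with the hand-computed value, since both use the same pooled standard deviation and the same (new minus old) ordering.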


### Finding T-value

Given the significance level and the degrees of freedom, we can look up the critical t-value in a t-distribution table.

### Degree of freedom

In [141]:
degree_of_freedom = (10-1)+(10-1)  # (n1-1)+(n2-1) = 18

The one-tailed critical t-value with α = 0.05 and 18 degrees of freedom is 1.734.

**The alternative hypothesis says the new machine has a lower mean time, so this is a lower-tailed test: we reject the null hypothesis when the T-statistic is below -1.734. The T-statistic (-3.43) is below -1.734, so we reject the null hypothesis that the two machines have the same mean packing time, in favor of the alternative hypothesis that the new machine is faster than the old one**

## Another method

In [140]:
from scipy.stats import t

p_value = t.cdf(statistic, n1+n2-2)  # lower-tailed test: P(T <= statistic)
critical_value = t.ppf(0.05, n1+n2-2)  # one-sided critical value, alpha is 0.05
print("P value is: ", round(p_value, 4))
print("Critical value of t is: ", round(critical_value, 3))

P value is:  0.0015
Critical value of t is:  -1.734


**We reject the null hypothesis, because the T-statistic (-3.43) is below the critical value (-1.734) and the p-value (0.0015) is less than the significance level (0.05)**
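The summary statistics used above were shortened to two decimals, so it can be worth repeating the test on the raw measurements. The following is a sketch, not the lab's prescribed method: it copies the ten values per machine from the table displayed earlier and calls `scipy.stats.ttest_ind` with pooled variances and a one-sided alternative. The list names are illustrative.

```python
from scipy.stats import ttest_ind

# Packing times in seconds, copied from the table displayed earlier
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# equal_var=True gives the pooled test; alternative='less' tests
# whether the mean of new_machine is lower than the mean of old_machine
raw_result = ttest_ind(new_machine, old_machine, equal_var=True, alternative='less')
print("T statistic:", raw_result.statistic)
print("One-sided p-value:", raw_result.pvalue)
```

A one-sided p-value below 0.05 here supports rejecting the null hypothesis of equal mean packing times.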

# Problem 2

An additional problem (not mandatory): In this case we can't assume that the population variances are equal. Hence in this case we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. Data is provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

The test statistic can be calculated as: link to the image - Test statistic calculation for the unpooled variance case

The degrees of freedom are (n1-1)+(n2-1).
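The unpooled calculation described above can be sketched by hand as follows. This assumes the linked image shows the standard unpooled statistic, (mean1 - mean2) / sqrt(s1²/n1 + s2²/n2). The helper name `unpooled_t_test` is hypothetical. It reports the p-value under the lab's (n1-1)+(n2-1) degrees of freedom and, for comparison, under the Welch-Satterthwaite degrees of freedom that `scipy.stats.ttest_ind(..., equal_var=False)` uses. The example values are only the ten rows shown in the notebook output for this problem, not the full 17 and 13 observations, so the numbers will differ from the full-data result.

```python
import numpy as np
from scipy.stats import t, ttest_ind

def unpooled_t_test(x, y):
    """Two-sided unpooled t-test (hypothetical helper for illustration)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x, y = x[~np.isnan(x)], y[~np.isnan(y)]  # drop missing values
    n1, n2 = len(x), len(y)
    v1, v2 = x.var(ddof=1), y.var(ddof=1)    # sample variances
    stat = (x.mean() - y.mean()) / np.sqrt(v1 / n1 + v2 / n2)
    df_simple = (n1 - 1) + (n2 - 1)          # degrees of freedom as given in the lab
    # Welch-Satterthwaite approximation, as used by scipy for equal_var=False
    df_welch = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    p_simple = 2 * t.sf(abs(stat), df_simple)
    p_welch = 2 * t.sf(abs(stat), df_welch)
    return stat, df_simple, p_simple, df_welch, p_welch

# Only the ten rows displayed in the notebook output, not the full data set
sophomores = [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82]
juniors = [2.56, 2.77, 2.7, 3.0, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65]

stat, df_simple, p_simple, df_welch, p_welch = unpooled_t_test(sophomores, juniors)
print("T statistic:", stat)
print("df (lab formula):", df_simple, "p-value:", p_simple)
print("df (Welch-Satterthwaite):", df_welch, "p-value:", p_welch)
```

The Welch-Satterthwaite degrees of freedom are never larger than (n1-1)+(n2-1), so the p-value under the lab's formula is the less conservative of the two.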

In [92]:
df1 = pd.read_csv('files_for_lab/student_gpa.csv')

In [93]:
df1

Unnamed: 0,Sophomores,Juniors
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98
5,2.6,3.47
6,2.92,3.26
7,3.6,3.2
8,2.28,3.19
9,2.82,2.65


**Null hypothesis**: The mean GPAs of sophomores and juniors at the university are equal <br>
**Alternative hypothesis**: The mean GPAs of sophomores and juniors at the university are different

In [103]:
# Unpooled (Welch) t-test: equal_var=False; nan_policy='omit' ignores missing values
ttest_ind(df1['Sophomores'], df1['Juniors'], axis=0, equal_var=False,
          nan_policy='omit', alternative='two-sided')

Ttest_indResult(statistic=-0.9231495630900278, pvalue=0.3642180675348565)

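**As the p-value (0.364) is greater than the significance level (0.05), we fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean GPAs of sophomores and juniors differ**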