In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently in use. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are given in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, do the data provide sufficient evidence that one machine is better than the other?

In [1]:
import pandas as pd
import numpy as np
from scipy import stats
from scipy.stats import ttest_ind

In [2]:
# Load data
data = pd.read_csv('files_for_lab/machine.txt', delimiter='\t', encoding='utf-16')
data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [3]:
data.columns

Index(['New machine', '    Old machine'], dtype='object')

In [4]:
data.columns = ['new_machine','old_machine']

In [5]:
# Calculate the sample means and sample standard deviations
mean_new = np.mean(data.new_machine)
mean_old = np.mean(data.old_machine)
std_new = np.std(data.new_machine, ddof=1)
std_old = np.std(data.old_machine, ddof=1)

print('mean_new =', round(mean_new,2))
print('mean_old =', round(mean_old,2))
print('std_new =', round(std_new,2))
print('std_old =', round(std_old,2))

mean_new = 42.14
mean_old = 43.23
std_new = 0.68
std_old = 0.75


**Step 1:** Define the null hypothesis: The mean packing time of the new machine is equal to or longer than the mean packing time of the old machine.
<br>H0: μ_new >= μ_old

**Step 2:** Define the alternative hypothesis: The mean packing time of the new machine is shorter than the mean packing time of the old machine.
<br>Ha: μ_new < μ_old 

**Step 3:** Determine if it is a one-tailed or a two-tailed test.
<br>In this case: one-tailed test.

**Step 4:** Choose a test statistic based on the information available. Assuming the data are normally distributed, and with fewer than 30 observations per sample (n = 10), we will use a t-test.

**Step 5:** Level of significance: This defines the rejection (critical) region. It is the probability of rejecting the null hypothesis when it is actually true (a Type I error). A common choice is 0.05.
 
**Step 6:** Calculate the test statistic based on the given information.

**Step 7:** Check the table to determine the critical value.
<br> For a t-test the critical value depends on the degrees of freedom (df). For a one-sample test this is *sample_size - 1*; for the two-sample test with pooled variances used here it is *n_new + n_old - 2*.

**Step 8:** Make conclusions:
* If the test statistic falls in the critical region, we reject the null hypothesis.
* If the test statistic falls outside the critical region, we fail to reject the null hypothesis.
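The decision rule in Step 8 depends on which tail (or tails) the critical region lies in. The following is a minimal sketch of that rule, assuming a t-distributed test statistic. The helper name `reject_null` and its `tail` argument are illustrative and not part of the lab material.

```python
from scipy import stats


def reject_null(t_stat, df, alpha=0.05, tail="two-sided"):
    """Return True if t_stat falls in the critical region (hypothetical helper)."""
    if tail == "left":
        # Ha: mean_1 < mean_2, critical region is the lower tail
        return bool(t_stat < stats.t.ppf(alpha, df))
    if tail == "right":
        # Ha: mean_1 > mean_2, critical region is the upper tail
        return bool(t_stat > stats.t.ppf(1 - alpha, df))
    if tail == "two-sided":
        # Ha: mean_1 != mean_2, alpha is split between both tails
        return bool(abs(t_stat) > stats.t.ppf(1 - alpha / 2, df))
    raise ValueError("tail must be 'left', 'right' or 'two-sided'")


# Example call with made-up numbers: a left-tailed test with 18 degrees of freedom
print(reject_null(-2.1, df=18, alpha=0.05, tail="left"))
```

Note that for a left-tailed test a large positive statistic does not lead to rejection, which is why the sign of the statistic matters and not only its absolute value.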

In [6]:
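# Conduct the t-test (pooled variances by default).
# Note: ttest_ind is two-sided by default, so the p-value printed below is two-sided.
# Because t < 0 agrees with Ha (μ_new < μ_old), the one-tailed p-value is half of it.
t_statistic, p_value = ttest_ind(data.new_machine, data.old_machine)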
print("Test statistic:", t_statistic)
print("p-value:", p_value)

Test statistic: -3.3972307061176026
p-value: 0.0032111425007745158
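Since the hypotheses are one-tailed, the p-value can also be requested directly for the left tail. This is a sketch that assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument; the data are copied from the table above so that the block runs on its own.

```python
from scipy.stats import ttest_ind

# Packing times in seconds, copied from the table above
new_machine = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old_machine = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# Default: two-sided test
t_two, p_two = ttest_ind(new_machine, old_machine)

# Left-tailed test, matching Ha: mean(new) < mean(old)
t_one, p_one = ttest_ind(new_machine, old_machine, alternative="less")

print("two-sided p-value:", p_two)
print("one-sided p-value:", p_one)
```

The test statistic is the same in both calls; only the p-value changes. When the statistic is negative, the left-tailed p-value is half of the two-sided one.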


In [7]:
# Manual calculation (with equal sample sizes this matches the pooled-variance statistic)
ttest = (mean_new-mean_old)/np.sqrt(std_new*std_new/len(data.new_machine)+std_old*std_old/len(data.old_machine))
ttest

-3.397230706117603
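The manual formula above uses the unpooled standard error, while `ttest_ind` pools the variances by default. The two agree here only because both samples have the same size. The sketch below computes both versions by hand to make that visible; the variable names are illustrative, and the data are copied from the table above.

```python
import numpy as np

# Packing times in seconds, copied from the table above
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

n1, n2 = len(new_machine), len(old_machine)
s1_sq = new_machine.var(ddof=1)  # sample variance of the new machine
s2_sq = old_machine.var(ddof=1)  # sample variance of the old machine
mean_diff = new_machine.mean() - old_machine.mean()

# Pooled variance: weighted average of the two sample variances
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
t_pooled = mean_diff / np.sqrt(sp_sq * (1 / n1 + 1 / n2))

# Unpooled standard error, as in the manual calculation above
t_unpooled = mean_diff / np.sqrt(s1_sq / n1 + s2_sq / n2)

print("pooled t:  ", t_pooled)
print("unpooled t:", t_unpooled)
```

With unequal sample sizes the two statistics would generally differ, so the choice between them matters.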

In [8]:
# Compute the one-tailed critical value (df = n_new + n_old - 2 for the pooled test)
alpha = 0.05
df = len(data.new_machine) + len(data.old_machine) - 2
critical_region = stats.t.ppf(1-alpha, df)
critical_region

1.7340636066175354

In [9]:
# Left-tailed test (Ha: μ_new < μ_old): reject H0 only if the t-statistic
# is below the negative critical value
if t_statistic < -critical_region:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Reject the null hypothesis


An additional problem (not mandatory): In this case we cannot assume that the population variances are equal, so we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

In [10]:
# Load data
data1 = pd.read_csv('files_for_lab/student_gpa.txt', delimiter='\t', encoding='utf-8')
print(data1)

    Sophomores    Juniors
0         3.04       2.56
1         1.71       2.77
2         3.30       2.70
3         2.88       3.00
4         2.11       2.98
5         2.60       3.47
6         2.92       3.26
7         3.60       3.20
8         2.28       3.19
9         2.82       2.65
10        3.03       3.00
11        3.13       3.39
12        2.86       2.58
13        3.49        NaN
14        3.11        NaN
15        2.13        NaN
16        3.27        NaN


In [11]:
data1.columns

Index(['Sophomores', '  Juniors'], dtype='object')

In [12]:
# Convert the columns into lists, dropping the NaN padding in the shorter column
# (note the leading spaces in the '  Juniors' column name)
sophomores = data1['Sophomores'].tolist()
juniors = data1['  Juniors'].dropna().tolist()

print("Sophomores:", sophomores)
print("Juniors:", juniors)

Sophomores: [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82, 3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27]
Juniors: [2.56, 2.77, 2.7, 3.0, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65, 3.0, 3.39, 2.58]


In [13]:
# Calculate the sample means and sample standard deviations
ns = len(sophomores)
nj = len(juniors)
mean_s = np.mean(sophomores)
mean_j = np.mean(juniors)
std_s = np.std(sophomores, ddof=1)
std_j = np.std(juniors, ddof=1)

print('n_sophomores =', ns)
print('n_juniors =', nj)
print('mean_sophomores =', round(mean_s,2))
print('mean_juniors =', round(mean_j,2))
print('std_sophomores =', round(std_s,2))
print('std_juniors =', round(std_j,2))

n_sophomores = 17
n_juniors = 13
mean_sophomores = 2.84
mean_juniors = 2.98
std_sophomores = 0.52
std_juniors = 0.31


H0: μ_sophomores = μ_juniors

Ha: μ_sophomores != μ_juniors 

Two-tailed t-test

Alpha = 0.05

In [14]:
# Conduct Welch's t-test (equal_var=False: the variances are not pooled)
t_statistic, p_value = ttest_ind(sophomores, juniors, equal_var=False)
print("Test statistic:", t_statistic)
print("p-value:", p_value)

Test statistic: -0.9231495630900278
p-value: 0.3642180675348571


In [15]:
# Manual calculation (Welch's statistic, using the unpooled standard error)
ttest = (mean_s-mean_j)/np.sqrt((std_s*std_s/ns)+(std_j*std_j/nj))
ttest

-0.9231495630900276

In [16]:
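# Compute critical region values
# Note: ns + nj - 2 is the pooled-variance df. Welch's test (equal_var=False) uses the
# Welch-Satterthwaite df, which is never larger, so this critical value is only approximate here.
alpha = 0.05
df = ns + nj - 2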
critical_region = stats.t.ppf(1-alpha/2, df)
critical_region

2.048407141795244

In [17]:
# Compare the absolute value of t-statistic with the critical region
if abs(t_statistic) > critical_region:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

Fail to reject the null hypothesis
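Because the variances are not pooled in this problem, the degrees of freedom that go with Welch's statistic come from the Welch-Satterthwaite approximation, not from `ns + nj - 2`. The sketch below computes that value, the matching critical value, and the two-sided p-value by hand. The variable names are illustrative, and the data are copied from the lists printed above.

```python
import numpy as np
from scipy import stats

# GPA data copied from the lists printed above
sophomores = [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28, 2.82,
              3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27]
juniors = [2.56, 2.77, 2.7, 3.0, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65,
           3.0, 3.39, 2.58]

ns, nj = len(sophomores), len(juniors)
vs = np.var(sophomores, ddof=1) / ns  # squared standard error, sophomores
vj = np.var(juniors, ddof=1) / nj     # squared standard error, juniors

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_welch = (vs + vj) ** 2 / (vs ** 2 / (ns - 1) + vj ** 2 / (nj - 1))

# Welch's t-statistic, two-tailed critical value and p-value
t_stat = (np.mean(sophomores) - np.mean(juniors)) / np.sqrt(vs + vj)
alpha = 0.05
critical_value = stats.t.ppf(1 - alpha / 2, df_welch)
p_manual = 2 * stats.t.sf(abs(t_stat), df_welch)

print("Welch df:", df_welch)
print("critical value:", critical_value)
print("p-value:", p_manual)
```

The p-value computed this way should agree with the one reported by `ttest_ind(..., equal_var=False)`, and the conclusion is unchanged: the statistic lies well inside the non-rejection region.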
