In [1]:
#Importing libraries
import pandas as pd
import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt

## Data Preprocessing

In [2]:
#Data dictionary 
dat = {'yield': [93.1, 93.6, 91.6, 92.5, 95.1, 94.6, 94.2, 91.9], 'catalyst': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']}

In [3]:
df = pd.DataFrame.from_dict(dat)
cat_A = df[df['catalyst'] =='A']
cat_B = df[df['catalyst'] =='B']

In [4]:
cat_A

Unnamed: 0,yield,catalyst
0,93.1,A
1,93.6,A
2,91.6,A
3,92.5,A


In [5]:
cat_A['yield'].describe() #sample mean X_A = 92.70 sample std deviation sigma_A = 0.86

count     4.000000
mean     92.700000
std       0.860233
min      91.600000
25%      92.275000
50%      92.800000
75%      93.225000
max      93.600000
Name: yield, dtype: float64

In [6]:
cat_B['yield'].describe() #sample mean X_B = 93.95 sample std deviation sigma_B = 1.42

count     4.000000
mean     93.950000
std       1.415392
min      91.900000
25%      93.625000
50%      94.400000
75%      94.725000
max      95.100000
Name: yield, dtype: float64

In [7]:
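yield_A = np.asarray(cat_A['yield'].tolist())
X_A = yield_A.mean() #X_A = cat_A['yield'].describe()['mean']
sigma_A = yield_A.std() #NumPy default ddof=0 (divide by n): 0.745, not the 0.86 (ddof=1) reported by describe()

yield_B = np.asarray(cat_B['yield'].tolist())
X_B = float(yield_B.mean())
sigma_B = yield_B.std() #also ddof=0; not used directly in the tests below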
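
NumPy's `std()` and pandas' `describe()` use different default normalisations, which matters for the tests below. A minimal sketch of the difference, re-creating the catalyst A yields from the data dictionary (the variable names `std_population`, `std_sample` and `std_pandas` are illustrative only):

```python
import numpy as np
import pandas as pd

yield_A = np.array([93.1, 93.6, 91.6, 92.5])

std_population = yield_A.std()          # NumPy default ddof=0: divide by n
std_sample = yield_A.std(ddof=1)        # ddof=1: divide by n - 1
std_pandas = pd.Series(yield_A).std()   # pandas default is ddof=1, as in describe()

print(f"ddof=0: {std_population:.3f}, ddof=1: {std_sample:.3f}, pandas: {std_pandas:.3f}")
```

With only four observations per catalyst the two conventions differ by roughly 15 %, so it is worth stating explicitly which one enters a test statistic.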

##### To be tested: X_B > X_A ? 
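**Hypothesis testing**: the null hypothesis is H_0: mu_B <= mu_A and the alternative hypothesis is H_1: mu_B > mu_A, where mu_A and mu_B are the true mean yields estimated by the sample means X_A and X_B (the same hypotheses are used for all of the following tests)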

## TEST 1: One-sample Gauss-test

One-sample test comparing an **estimated mean** (with **known variance**) with a **known population mean**.

**Assumptions**: X_A and sigma_A are the true values // Estimation: sigma_B = sigma_A

In [8]:
#population mean X_A = 92.70, std deviation sigma_A = sigma_B = 0.745 (np.std with ddof=0; describe() reports 0.86 with ddof=1)
#sample mean X_B = 93.95

#Z-Transformation:
def Z_score(x, mu, sigma, n):
    Z = (x - mu)/(sigma/np.sqrt(n))
    return Z

Z = Z_score(X_B, X_A, sigma_A, 4)
Z #test statistic (z-score)

3.355780276070123

In [9]:
result1 = sp.stats.norm.sf(Z) #one-sided p-value P(Z' >= Z); abs() would give a wrong result for a negative Z
print(result1)

0.00039570709399832055


If the null hypothesis were true (mean yield of B equal to X_A), the probability of observing a sample mean at least as large as X_B would be 0.0004 (0.04 %). At a significance level of α = 0.05 we reject the null hypothesis, because this p-value is less than 0.05.
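
The same decision can be reached by comparing the z-score with the one-sided critical value instead of comparing the p-value with α. A minimal sketch, assuming the same inputs as above (sigma taken with NumPy's default `ddof=0`; `alpha`, `z_crit` and `reject_h0` are names introduced here for illustration):

```python
import numpy as np
from scipy import stats

yield_A = np.array([93.1, 93.6, 91.6, 92.5])
yield_B = np.array([95.1, 94.6, 94.2, 91.9])

alpha = 0.05
mu_0 = yield_A.mean()    # treated as the known population mean
sigma = yield_A.std()    # treated as the known sigma (ddof=0, about 0.745)
n = yield_B.size

z = (yield_B.mean() - mu_0) / (sigma / np.sqrt(n))
z_crit = stats.norm.ppf(1 - alpha)   # one-sided critical value, about 1.645
p_value = stats.norm.sf(z)           # upper-tail probability

reject_h0 = z > z_crit               # equivalent to p_value < alpha
print(f"z = {z:.3f}, z_crit = {z_crit:.3f}, p = {p_value:.5f}, reject H_0: {reject_h0}")
```

Both criteria always agree, because `norm.sf` is strictly decreasing: z exceeds the critical value exactly when the p-value falls below α.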

## TEST 2: One-sample one-sided t-test

One-sample one-sided t-test comparing **estimated mean and variance** with the **true population mean**.

**Assumptions**: X_A is the true value, X_B is estimated // sigma_B and the std error are estimated from the catalyst B sample

In [10]:
#true mean X_A = 92.70
#sample mean X_B = 93.95, sample std deviation sigma_B = 1.42 (ddof=1, computed internally by ttest_1samp)

result2 = sp.stats.ttest_1samp(yield_B, X_A, alternative='greater')
result2.statistic #test statistic

1.7662956527090035

In [11]:
result2.pvalue #p-value

0.08775976170119658

If the null hypothesis were true, the probability of observing a t-statistic at least this large would be 8.8 %.

The result is not significant, as p = 0.088 > 0.05, so the null hypothesis cannot be rejected based on the available data.
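
To make explicit what `ttest_1samp` computes, the statistic and p-value can be reproduced by hand. This is a sketch under the assumptions of Test 2: X_A = 92.7 is taken as the true mean, and the standard deviation is estimated from the catalyst B sample with `ddof=1` (`t_manual` and `p_manual` are illustrative names):

```python
import numpy as np
from scipy import stats

yield_B = np.array([95.1, 94.6, 94.2, 91.9])
mu_0 = 92.7                      # X_A, treated as the true mean

n = yield_B.size
s_B = yield_B.std(ddof=1)        # sample std deviation of B, about 1.42
t_manual = (yield_B.mean() - mu_0) / (s_B / np.sqrt(n))
p_manual = stats.t.sf(t_manual, df=n - 1)   # upper tail, n - 1 = 3 degrees of freedom

result = stats.ttest_1samp(yield_B, mu_0, alternative='greater')
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Compared with Test 1, the larger estimated standard deviation and the heavier tails of the t-distribution with 3 degrees of freedom both push the p-value up.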

## TEST 3: Two-sample one-sided t-test

Two-sample one-sided t-test comparing two **estimated means** using a pooled **estimated variance**.

**Assumptions**: X_A and X_B are estimated // Estimation: sigma_B = sigma_A // std errors estimated from catalysts A and B

In [12]:
#sample mean X_A = 92.70, pooled variance sigma**2 = 0.5*(sigma_B**2 + sigma_A**2) (sample variances with ddof=1, equal sample sizes)
#sample mean X_B = 93.95

result3 = sp.stats.ttest_ind(yield_B, yield_A, equal_var=True, alternative='greater', trim=0)
result3.statistic

1.5093873935364372

In [13]:
result3.pvalue

0.09096874750914657

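If the null hypothesis were true, the probability of observing a t-statistic at least this large would be 9.1 %, so the null hypothesis cannot be rejected at α = 0.05.

The p-value is slightly larger than in Test 2. Two things change: the test statistic is smaller (1.51 instead of 1.77), because the standard error now also accounts for the uncertainty in X_A, and the reference t-distribution has a different number of degrees of freedom (n_A + n_B - 2 = 6 instead of n_B - 1 = 3).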
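
The pooled two-sample statistic can also be reproduced by hand, which separates the effect of the standard error from that of the degrees of freedom. A minimal sketch, assuming equal variances as in Test 3 (the names `var_pooled`, `se` and `p_same_t_df3` are introduced here for illustration):

```python
import numpy as np
from scipy import stats

yield_A = np.array([93.1, 93.6, 91.6, 92.5])
yield_B = np.array([95.1, 94.6, 94.2, 91.9])
n_A, n_B = yield_A.size, yield_B.size

# Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom
var_pooled = ((n_A - 1) * yield_A.var(ddof=1)
              + (n_B - 1) * yield_B.var(ddof=1)) / (n_A + n_B - 2)
se = np.sqrt(var_pooled * (1 / n_A + 1 / n_B))   # std error of the mean difference

t_manual = (yield_B.mean() - yield_A.mean()) / se
dof = n_A + n_B - 2
p_manual = stats.t.sf(t_manual, df=dof)          # upper tail, 6 degrees of freedom

# The same t-value judged against 3 degrees of freedom (as in Test 2)
p_same_t_df3 = stats.t.sf(t_manual, df=3)

result = stats.ttest_ind(yield_B, yield_A, equal_var=True, alternative='greater')
print(t_manual, result.statistic)
print(p_manual, result.pvalue, p_same_t_df3)
```

For a fixed t-value, more degrees of freedom give a smaller upper-tail probability, so the larger p-value in Test 3 comes from the smaller test statistic rather than from the change in degrees of freedom.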