In [1]:
import numpy as np
import pandas as pd

import scipy.stats as stats

In [2]:
df = pd.read_csv('../../data/tests/Cholesterol_R.csv')

### Viewing the Dataset

In [3]:
df

Unnamed: 0,ID,Before,After4weeks,After8weeks,Margarine
0,1,6.42,5.83,5.75,B
1,2,6.76,6.2,6.13,A
2,3,6.56,5.83,5.71,B
3,4,4.8,4.27,4.15,A
4,5,8.43,7.71,7.67,B
5,6,7.49,7.12,7.05,A
6,7,8.05,7.25,7.1,B
7,8,5.05,4.63,4.67,A
8,9,5.77,5.31,5.33,B
9,10,3.91,3.7,3.66,A


### Checking the Means and Standard Deviations

In [4]:
print(df['Before'].mean())
print(df['Before'].std())

6.407777777777778
1.1910872717349006


In [5]:
print(df['After4weeks'].mean())
print(df['After4weeks'].std())

5.841666666666668
1.123352388271505


In [6]:
print(df['After8weeks'].mean())
print(df['After8weeks'].std())

5.778888888888889
1.1019121823068931


### Adding Information to the Dataframe

In [7]:
compare = df[['Before', 'After4weeks']].copy()
compare['Mean'] = compare.mean(axis=1)
compare['Difference'] = df['After4weeks'] - df['Before']
compare

Unnamed: 0,Before,After4weeks,Mean,Difference
0,6.42,5.83,6.125,-0.59
1,6.76,6.2,6.48,-0.56
2,6.56,5.83,6.195,-0.73
3,4.8,4.27,4.535,-0.53
4,8.43,7.71,8.07,-0.72
5,7.49,7.12,7.305,-0.37
6,8.05,7.25,7.65,-0.8
7,5.05,4.63,4.84,-0.42
8,5.77,5.31,5.54,-0.46
9,3.91,3.7,3.805,-0.21


In [8]:
print(compare['Difference'].mean())

-0.5661111111111112


### Check for Normality

From Wikipedia, "The Shapiro–Wilk test tests the null hypothesis that a sample x_1, ..., x_n came from a normally distributed population."

"The null-hypothesis of this test is that the population is normally distributed. Thus, if the p value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. On the other hand, if the p value is greater than the chosen alpha level, then the null hypothesis (that the data came from a normally distributed population) can not be rejected (e.g., for an alpha level of .05, a data set with a p value of less than .05 rejects the null hypothesis that the data are from a normally distributed population)."

In [9]:
stats.shapiro(compare['Difference'])

ShapiroResult(statistic=0.9774226546287537, pvalue=0.9195953607559204)

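Because the p-value (about 0.92) is greater than .05, we fail to reject the null hypothesis of normality. This does not prove that the differences are normally distributed, but it gives no evidence against that assumption, so it is reasonable to proceed with the paired t-test.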
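
The decision rule quoted above can be written out explicitly. The following is only an illustrative sketch: it hard-codes the differences for the ten rows shown in the `compare` table (the statistics printed in this notebook come from the full file, so the numbers will differ), and `alpha = 0.05` is an assumed significance level.

```python
import numpy as np
import scipy.stats as stats

# Differences (After4weeks - Before) for the ten rows displayed in the
# `compare` table; this is only the visible subset of the data.
differences = np.array([-0.59, -0.56, -0.73, -0.53, -0.72,
                        -0.37, -0.80, -0.42, -0.46, -0.21])

alpha = 0.05  # assumed significance level

# Shapiro-Wilk: the null hypothesis is that the sample comes from a
# normally distributed population.
statistic, p_value = stats.shapiro(differences)

if p_value < alpha:
    verdict = "reject normality"
else:
    verdict = "fail to reject normality"

print(f"W = {statistic:.4f}, p = {p_value:.4f}: {verdict}")
```

With samples this small the test has limited power, so a large p-value is weak evidence at best; a Q-Q plot (for example via `stats.probplot`) is a common complement.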

### Conduct the Paired T Test

From the [JMP Statistics Knowledge Portal](https://www.jmp.com/en_us/statistics-knowledge-portal/t-test/paired-t-test.html):

#### What is the paired t-test?
The paired t-test is a method used to test whether the mean difference between pairs of measurements is zero or not.

#### When can I use the test?
You can use the test when your data values are paired measurements. For example, you might have before-and-after measurements for a group of people. Also, the distribution of differences between the paired measurements should be normally distributed.

For this case we use `ttest_rel()` from `scipy.stats`. From the SciPy docs:

"Calculate the t-test on TWO RELATED samples of scores, a and b.

This is a two-sided test for the null hypothesis that 2 related or repeated samples have identical average (expected) values."

In [10]:
stats.ttest_rel(compare['After4weeks'],
                compare['Before'])

Ttest_relResult(statistic=-15.438872730914381, pvalue=1.9575345773928476e-11)

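Because the p-value is far below .05, we reject the null hypothesis that the mean difference is zero. The negative test statistic shows that cholesterol was lower after four weeks than before, so the reduction is statistically significant.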
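
To make the mechanics of the test concrete, the sketch below recomputes the paired t statistic by hand as t = mean(d) / (sd(d) / sqrt(n)) and compares it with `ttest_rel()`. It is an illustration under stated assumptions: the arrays contain only the ten rows displayed in the tables above, so the statistic will not match the value reported for the full dataset.

```python
import numpy as np
import scipy.stats as stats

# The ten rows displayed in the tables above (a subset of the full file).
before = np.array([6.42, 6.76, 6.56, 4.80, 8.43, 7.49, 8.05, 5.05, 5.77, 3.91])
after4 = np.array([5.83, 6.20, 5.83, 4.27, 7.71, 7.12, 7.25, 4.63, 5.31, 3.70])

diff = after4 - before
n = diff.size

# Paired t statistic: mean difference divided by its standard error,
# using the sample standard deviation (ddof=1).
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test via SciPy, and as a one-sample test on the differences.
result = stats.ttest_rel(after4, before)
one_sample = stats.ttest_1samp(diff, popmean=0.0)

print("manual:     ", t_manual, p_manual)
print("ttest_rel:  ", result.statistic, result.pvalue)
print("ttest_1samp:", one_sample.statistic, one_sample.pvalue)
```

All three lines should agree, which shows that a paired t-test is simply a one-sample t-test on the within-subject differences.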

Statistical significance does not by itself imply practical importance, so we should check both. One simple gauge is the relative change: the mean difference divided by the mean cholesterol level before the diet.

In [11]:
(compare['Difference'].mean()) / (compare['Before'].mean())

-0.08834749436448762

After four weeks on the diet, mean cholesterol was about 8.8% lower than the mean level before the diet.
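
A confidence interval for the mean difference is another way to judge practical importance, and the relative change can be computed either as a ratio of means (as above) or as a mean of per-person ratios. The sketch below shows both under stated assumptions: it uses only the ten rows displayed in the tables above, and the 95% level is an assumed choice, so the figures will differ from those for the full dataset.

```python
import numpy as np
import scipy.stats as stats

# The ten rows displayed in the tables above (a subset of the full file).
before = np.array([6.42, 6.76, 6.56, 4.80, 8.43, 7.49, 8.05, 5.05, 5.77, 3.91])
after4 = np.array([5.83, 6.20, 5.83, 4.27, 7.71, 7.12, 7.25, 4.63, 5.31, 3.70])

diff = after4 - before
n = diff.size
mean_diff = diff.mean()

# Standard error of the mean difference (stats.sem uses ddof=1 by default).
sem = stats.sem(diff)

# 95% confidence interval for the mean difference, t distribution, n - 1 df.
low, high = stats.t.interval(0.95, n - 1, loc=mean_diff, scale=sem)

# Two ways to express the relative change.
ratio_of_means = mean_diff / before.mean()   # the calculation used above
mean_of_ratios = (diff / before).mean()      # average per-person change

print(f"mean difference: {mean_diff:.3f}, 95% CI: ({low:.3f}, {high:.3f})")
print(f"ratio of means:  {ratio_of_means:.2%}")
print(f"mean of ratios:  {mean_of_ratios:.2%}")
```

The two percentages are usually close but not identical; the mean of ratios is the one that describes the typical individual's change.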