# Compare Sample Means (parametric)
- Student’s t-test
- Paired Student’s t-test
- Analysis of Variance Test (ANOVA)
- Repeated Measures ANOVA Test

## t-test
- It compares the **means** of **two groups**.
- It is a parametric statistical test.
- It is used to study whether there is a **statistically significant difference** between **two groups**.

## Types of t-test
- One sample t-test
- Paired t-test (dependent)
- Unpaired t-test (independent)

The unpaired t-test also has two variants:

- Student's t-test
  - Assumes equal variances
  - Also called the two-sample t-test
- Welch's t-test
  - Does not assume equal variances
  - Also called the unequal variances t-test

## Selection of t-test
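- One sample t-test (for one sample compared against a reference mean)
- Paired t-test (for dependent samples)
- Student's t-test (for independent samples with equal variances, ideally with similar sample sizes)
- Welch's t-test (for independent samples whose variances and/or sample sizes differ)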
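As a rough sketch of how the selection above maps onto `scipy.stats` calls, the cell below runs each variant on synthetic data. The arrays (`before`, `after`, `group_a`, `group_b`) are made up for illustration and are not the notebook's datasets.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # synthetic data, for illustration only

before = rng.normal(120, 10, size=40)
after = before - rng.normal(3, 5, size=40)   # same subjects measured twice
group_a = rng.normal(170, 8, size=50)
group_b = rng.normal(165, 15, size=35)       # different size and spread

one_sample = stats.ttest_1samp(before, popmean=120)          # one sample vs. a reference mean
paired = stats.ttest_rel(before, after)                      # dependent samples
student = stats.ttest_ind(group_a, group_b, equal_var=True)  # pooled variance
welch = stats.ttest_ind(group_a, group_b, equal_var=False)   # unequal variances

for name, res in [("one-sample", one_sample), ("paired", paired),
                  ("Student", student), ("Welch", welch)]:
    print(f"{name:>10}: t={res.statistic:.3f}, p={res.pvalue:.4f}")
```

The only difference between the Student and Welch calls is the `equal_var` flag, which is why the variance check later in the notebook decides between them.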


In [None]:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns 
sns.set(font_scale=2, palette= "viridis")
from sklearn.preprocessing import scale
import researchpy as rp
from scipy import stats

In [None]:
data = pd.read_csv('../data/pulse_data.csv')
data.shape

In [None]:
data.head()

## One Sample t-test
It compares the mean of one sample against a reference value, which can be:
- A known mean ($\mu$) from a previous study
- A hypothetical mean ($\mu$)

### Interpretation
__Question: Is the average age different from an established age?__

__Hypothesis__ 
- H0: The average age is $\mu$ = 20
- Ha: The average age is $\mu$ $\neq$ 20
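To make the hypotheses concrete, here is a minimal sketch of the one-sample t statistic, $t = (\bar{x} - \mu) / (s / \sqrt{n})$, computed by hand and compared with `scipy.stats.ttest_1samp`. The `ages` array is synthetic and only stands in for a real column.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ages = rng.normal(loc=21, scale=3, size=100)  # hypothetical ages, not the notebook's data
mu0 = 20                                      # reference mean under H0

n = ages.size
# t = (sample mean - mu0) / (sample std / sqrt(n)), with the n-1 (ddof=1) standard deviation
t_manual = (ages.mean() - mu0) / (ages.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n-1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

res = stats.ttest_1samp(ages, popmean=mu0)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4g}")
print(f"scipy : t={res.statistic:.4f}, p={res.pvalue:.4g}")
```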

In [None]:
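# The hypotheses are stated for age, so explore the 'Age' column
data['Age'].describe()

In [None]:
stats.skew(data['Age'])

In [None]:
stats.kurtosis(data['Age'])

In [None]:
sns.histplot(data['Age'], kde=True)
plt.show()

In [None]:
# One-sample t-test of the mean age against the reference value 20
stats.ttest_1samp(data['Age'], 20)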

In [None]:
stat, p = stats.ttest_1samp(data['Age'], 20)
print(f'stat={stat}, p-value={p}')
alpha = 0.05
if p > alpha:
    print('No evidence that the average age differs from 20 (fail to reject H0, result is not significant)')
else:
    print('Ha: The average age is not 20 (reject H0, result is significant)')

## Student's t-test
- The independent t-test is also called the two-sample t-test, Student’s t-test, or unpaired t-test.
- It is a univariate test for a significant difference between the means of two unrelated groups.
- It compares the mean of two independent samples.
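The comparison described above can be sketched by hand. Assuming equal variances, the statistic uses a pooled variance estimate; the cell below computes it on synthetic groups `x` and `y` (illustrative names, not columns of the notebook's data) and checks it against `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.normal(175, 7, size=50)  # synthetic group 1
y = rng.normal(165, 7, size=50)  # synthetic group 2

n1, n2 = x.size, y.size
# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
# Two-sided p-value with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

res = stats.ttest_ind(x, y, equal_var=True)
print(f"manual: t={t_manual:.4f}, p={p_manual:.4g}")
print(f"scipy : t={res.statistic:.4f}, p={res.pvalue:.4g}")
```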

## Assumptions
The assumptions that the data must meet in order for the test results to be valid are:
- The independent variable (IV) is categorical, with two levels (groups) being compared
- The dependent variable (DV) is continuous, measured on an interval or ratio scale
- The distribution of each of the two groups should follow the normal distribution
- The variances of the two groups are equal
    - This can be checked with statistical tests such as Levene’s test, the F-test, and Bartlett’s test.

If any of these assumptions are violated then another test should be used.
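One possible way to turn these checks into code is sketched below. `choose_ttest` is a hypothetical helper written for this illustration (it is not part of SciPy or researchpy), and the 0.05 threshold is an assumption that mirrors the `alpha` used elsewhere in the notebook.

```python
import numpy as np
from scipy import stats

def choose_ttest(a, b, alpha=0.05):
    """Hypothetical helper: check normality and equal variances, then pick a test."""
    # Shapiro-Wilk on each group: p > alpha means no evidence against normality
    looks_normal = all(stats.shapiro(g)[1] > alpha for g in (a, b))
    if not looks_normal:
        return "another test (normality violated)", None
    # Levene's test: p > alpha means no evidence that the variances differ
    equal_var = stats.levene(a, b)[1] > alpha
    result = stats.ttest_ind(a, b, equal_var=equal_var)
    return ("Student's t-test" if equal_var else "Welch's t-test"), result

rng = np.random.default_rng(2)
skewed_a = rng.exponential(size=200)  # clearly non-normal synthetic data
skewed_b = rng.exponential(size=200)
label, result = choose_ttest(skewed_a, skewed_b)
print(label)
```

With strongly skewed input the helper declines to run a t-test; for roughly normal groups it falls back to Welch's t-test whenever Levene's test is significant.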


### Interpretation
__Question: Is there a difference in the height between men and women?__

__Hypothesis__
- H0: the means of the samples are equal.
- Ha: the means of the samples are unequal.

__References__

https://pythonfordatascienceorg.wordpress.com/independent-t-test-python/

In [None]:
data['Gender'].unique()

In [None]:
data['Height'].describe()

In [None]:
data.shape

In [None]:
data.groupby('Gender')['Height'].describe()

In [None]:
plt.figure(figsize=(10,8))
sns.boxplot(data=data, x='Height', y="Gender")
plt.show()

In [None]:
# Subsets of data 
sample_01 = data[(data['Gender'] == 'Male')]

sample_02 = data[(data['Gender'] == 'Female')]

In [None]:
sample_01.shape, sample_02.shape

In [None]:
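# Balance the groups by drawing 50 males at random (seeded for reproducibility).
# Equal sample sizes are not required, but they make the test more robust to unequal variances.
sample_01 = sample_01.sample(50, random_state=42)
sample_01.shape, sample_02.shape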

## The Hypothesis Being Tested

* Null Hypothesis (H0): $\mu_1 = \mu_2$, which means the mean of `sample_01` is equal to the mean of `sample_02`
* Alternative Hypothesis (H1): $\mu_1 \neq \mu_2$, which means the mean of `sample_01` is not equal to the mean of `sample_02`

## Homogeneity of variance
Of the tests listed above, Levene's test is the most common check for homogeneity of variance. It uses an F-test to test the null hypothesis that the variance is equal across groups. A p-value below .05 indicates a violation of the assumption.
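To illustrate the F-test connection, the sketch below shows that Levene's test with `center='mean'` gives the same result as a one-way ANOVA on the absolute deviations from each group's mean. The data are synthetic; note that SciPy's default is `center='median'`, so the default call will generally give slightly different numbers.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0, 1, size=60)
b = rng.normal(0, 3, size=60)   # deliberately larger spread

# Classic Levene form: deviations are measured from the group means
lev = stats.levene(a, b, center='mean')

# Same test by hand: F-test (one-way ANOVA) on the absolute deviations
anova = stats.f_oneway(np.abs(a - a.mean()), np.abs(b - b.mean()))

print(f"levene  : stat={lev.statistic:.4f}, p={lev.pvalue:.4g}")
print(f"f_oneway: stat={anova.statistic:.4f}, p={anova.pvalue:.4g}")
```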

- https://en.wikipedia.org/wiki/Levene%27s_test (background on why Levene's test is used)
- https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.levene.html

## Levene's test 
Levene's test is an inferential statistic used to assess the equality of variances for a variable calculated for two or more groups.

### Interpretation
- H0: The variances are equal between the two groups
- Ha: The variances are not equal between the two groups

In [None]:
stats.levene(sample_01['Height'], sample_02['Height'])

In [None]:
alpha = 0.05
stat, p = stats.levene(sample_01['Height'], sample_02['Height'])
print(f'stat={stat}, p-value={p}')
if p > alpha:
    print('No evidence that the variances differ between the two groups (fail to reject H0, not significant)')
else:
    print('The variances are not equal between the two groups (reject H0, significant)')

__If Levene's test is significant, a viable alternative is to conduct Welch’s t-test.__

## Normal Distribution of Residuals

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(sample_01['Height'], kde=True)
plt.show() 

In [None]:
# Checking for normality by Q-Q plot graph
plt.figure(figsize=(12, 8))
stats.probplot(sample_01['Height'], plot=plt, dist='norm')
plt.show()

__The points should fall along the red line. Points that lie far from it indicate deviations from normality.__
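As a numeric companion to the visual check, `stats.probplot` also returns the correlation coefficient `r` of the least-squares line through the Q-Q points; values close to 1 mean the points hug the line. The sketch below compares a synthetic normal sample with a synthetic skewed one (both made up for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
normal_sample = rng.normal(size=300)       # should follow the line closely
skewed_sample = rng.exponential(size=300)  # should bend away from the line

# Without plot=..., probplot only returns the quantiles and the line fit
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist='norm')
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist='norm')

print(f"r (normal sample) = {r_normal:.4f}")
print(f"r (skewed sample) = {r_skewed:.4f}")
```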

### Checking normality with the Shapiro-Wilk test (`scipy.stats.shapiro`)
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html
- https://stats.stackexchange.com/questions/15696/interpretation-of-shapiro-wilk-test

In [None]:
stat, p_value = stats.shapiro(sample_01['Height'])
print(f'statistic = {stat}, p-value = {p_value}')
alpha = 0.05 
if p_value > alpha: 
    print("The sample has normal distribution(Fail to reject the null hypothesis, the result is not significant)")
else: 
    print("The sample does not have a normal distribution(Reject the null hypothesis, the result is significant)")

In [None]:
stat, p_value = stats.shapiro(sample_02['Height'])
print(f'statistic = {stat}, p-value = {p_value}')
alpha = 0.05 
if p_value > alpha: 
    print("The sample has normal distribution(Fail to reject the null hypothesis, the result is not significant)")
else: 
    print("The sample does not have a normal distribution(Reject the null hypothesis, the result is significant)")

__Note:__ [See here](https://stats.stackexchange.com/questions/15696/interpretation-of-shapiro-wilk-test) for how to interpret the Shapiro-Wilk test.

The first value returned is the W test statistic and the second value is the p-value. Since the p-value is not significant, there is no evidence that the data deviate from a normal distribution.

The data met all the assumptions for the t-test, which indicates that the t-test is an appropriate test and its results can be trusted.

## Independent t-test by using `scipy.stats`

In [None]:
stats.ttest_ind(sample_01['Height'], sample_02['Height'])

In [None]:
stat, p = stats.ttest_ind(sample_01['Height'], sample_02['Height'])
print(f'stat={stat}, p-value={p}')
if p > alpha:
    print('Fail to reject the null hypothesis: no evidence that the means differ between the two groups')
else:
    print('Reject the null hypothesis: the means are not equal between the two groups')

## Independent t-test using `researchpy`

https://researchpy.readthedocs.io/en/latest/ttest_documentation.html

In [None]:
descriptives, results = rp.ttest(sample_01['Height'], sample_02['Height'])

In [None]:
descriptives

In [None]:
results

## Paired t-test
- It compares the means of two related samples (each subject is measured twice).
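Because each subject is measured twice, the paired t-test is equivalent to a one-sample t-test on the per-subject differences against 0. The sketch below checks this on synthetic `before`/`after` readings (illustrative values, not the blood pressure file used in this notebook).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
before = rng.normal(150, 12, size=60)         # synthetic first measurement
after = before - rng.normal(5, 8, size=60)    # synthetic second measurement, same subjects

paired = stats.ttest_rel(after, before)
# Same test expressed as a one-sample t-test on the differences
on_diff = stats.ttest_1samp(after - before, popmean=0)

print(f"ttest_rel  : t={paired.statistic:.4f}, p={paired.pvalue:.4g}")
print(f"ttest_1samp: t={on_diff.statistic:.4f}, p={on_diff.pvalue:.4g}")
```

This is also why the normality check for a paired design is applied to the differences rather than to each measurement separately.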

In [None]:
bp_reading = pd.read_csv('../data/blood_pressure.csv')
bp_reading.head() 

In [None]:
bp_reading.shape

In [None]:
bp_reading.describe().T

In [None]:
bp_reading[['bp_before', 'bp_after']].boxplot(figsize=(12, 8))
plt.show() 

## The Hypothesis Being Tested
* Null Hypothesis (H0): $\mu_1 = \mu_2$, which means the mean of sample 01 is equal to the mean of sample 02
* Alternative Hypothesis (Ha): $\mu_1 \neq \mu_2$, which means the mean of sample 01 is not equal to the mean of sample 02

## Assumption check 

* The subjects are independently and randomly drawn
* The differences (residuals) between the two paired measurements should follow the normal distribution
* The variances of the two groups are equal

In [None]:
stat, p = stats.levene(bp_reading['bp_after'], bp_reading['bp_before'])
print(f'stat={stat}, p-value={p}')
alpha = 0.05
if p > alpha:
    print('Fail to reject the null hypothesis: no evidence that the variances differ between the two groups')
else:
    print('Reject the null hypothesis: the variances are not equal between the two groups')

In [None]:
# Standardized (z-scored) after-minus-before differences, used for the normality checks below
bp_reading['bp_diff'] = scale(bp_reading['bp_after'] - bp_reading['bp_before'])
bp_reading.head() 

In [None]:
bp_reading[['bp_diff']].head()

In [None]:
bp_reading[['bp_diff']].hist(figsize=(12, 8))
plt.show() 

In [None]:
plt.figure(figsize=(15, 8))
stats.probplot(bp_reading['bp_diff'], plot=plt)

plt.title('Blood pressure difference Q-Q plot')
plt.show()

**Note:** The points lie very close to the line, which suggests that the blood pressure differences are approximately normally distributed.

In [None]:
stat, p_value = stats.shapiro(bp_reading['bp_diff'])
print(f'statistic = {stat}, p-value = {p_value}')
alpha = 0.05 
if p_value > alpha: 
    print("The sample has normal distribution(Fail to reject the null hypothesis, the result is not significant)")
else: 
    print("The sample does not have a normal distribution(Reject the null hypothesis, the result is significant)")

## Using Researchpy
- https://researchpy.readthedocs.io/en/latest/ttest_documentation.html

In [None]:
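# paired=True with equal_variances=True requests the paired samples t-test.
# Per the researchpy documentation, paired=True with equal_variances=False runs a Wilcoxon signed-rank test instead.
rp.ttest(bp_reading['bp_after'], bp_reading['bp_before'],
         paired=True, equal_variances=True)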

## Welch's t-test
- It compares the means of two independent samples.
- Unlike Student's t-test, it does not assume:
  - Equal variances between the samples
  - Equal sample sizes

## Welch's t-test Assumptions
Like every test, this inferential statistical test has assumptions. The assumptions that the data must meet in order for the test results to be valid are:

- The independent variable (IV) is categorical, with two levels (groups) being compared
- The dependent variable (DV) is continuous, measured on an interval or ratio scale
- The distribution of each of the two groups should follow the normal distribution

If any of these assumptions are violated then another test should be used.
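For reference, here is a minimal sketch of what Welch's t-test computes: each group keeps its own variance, and the degrees of freedom come from the Welch-Satterthwaite approximation. The groups `x` and `y` are synthetic, with deliberately different sizes and spreads, and the result is checked against `scipy.stats.ttest_ind(..., equal_var=False)`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(200, 10, size=40)  # synthetic group 1
y = rng.normal(190, 25, size=25)  # synthetic group 2: smaller and more variable

n1, n2 = x.size, y.size
v1 = x.var(ddof=1) / n1  # squared standard error of each group mean
v2 = y.var(ddof=1) / n2

t_manual = (x.mean() - y.mean()) / np.sqrt(v1 + v2)
# Welch-Satterthwaite approximation of the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df=df)

res = stats.ttest_ind(x, y, equal_var=False)
print(f"manual: t={t_manual:.4f}, df={df:.2f}, p={p_manual:.4g}")
print(f"scipy : t={res.statistic:.4f}, p={res.pvalue:.4g}")
```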

## Interpretation
- **Null hypothesis (H0):** u1 = u2, which translates to the mean of sample 1 is equal to the mean of sample 2
- **Alternative hypothesis (HA):** u1 ≠ u2, which translates to the mean of sample 1 is not equal to the mean of sample 2

In [None]:
us_mortality = pd.read_csv('../data/USRegionalMortality.csv')
us_mortality.head() 

In [None]:
sample_01 = us_mortality[(us_mortality['Cause'] == "Heart disease") & (us_mortality['Sex'] == 'Male')]

sample_02 = us_mortality[(us_mortality['Cause'] == "Heart disease") & (us_mortality['Sex'] == 'Female')]

In [None]:
sample_01.shape, sample_02.shape

In [None]:
stat, p_value = stats.shapiro(sample_01['Rate'])
print(f'statistic = {stat}, p-value = {p_value}')
alpha = 0.05 
if p_value > alpha: 
    print("The sample has normal distribution(Fail to reject the null hypothesis, the result is not significant)")
else: 
    print("The sample does not have a normal distribution(Reject the null hypothesis, the result is significant)")

In [None]:
stat, p_value = stats.shapiro(sample_02['Rate'])
print(f'statistic = {stat}, p-value = {p_value}')
alpha = 0.05 
if p_value > alpha: 
    print("The sample has normal distribution(Fail to reject the null hypothesis, the result is not significant)")
else: 
    print("The sample does not have a normal distribution(Reject the null hypothesis, the result is significant)")

In [None]:
stats.ttest_ind(sample_01['Rate'], sample_02['Rate'], equal_var=False)

In [None]:
stat, p_value = stats.ttest_ind(sample_01['Rate'], sample_02['Rate'], equal_var=False)
print(f'statistic = {stat}, p-value = {p_value}')
alpha = 0.05 
if p_value > alpha: 
    print("The sample means are equal (Fail to reject the null hypothesis, the result is not significant)")
else: 
    print("The sample means are not equal (Reject the null hypothesis, the result is significant)")

## Using Researchpy
- https://researchpy.readthedocs.io/en/latest/ttest_documentation.html

In [None]:
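# Compare males (sample_01) against females (sample_02); equal_variances=False requests Welch's t-test
des, res = rp.ttest(sample_01['Rate'], sample_02['Rate'],
                    equal_variances=False)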

In [None]:
des

In [None]:
res 