##### Paul Mc Grath - Machine Learning & Stats- Winter 2022 Module- Assessment  
---

# Statistics Exercises

### Exercise 3

Take the code from the Examples section of the scipy stats documentation for independent samples t-tests.  
Add it to your own notebook and explain how it works using Markdown cells and code comments.  
Improve it in any way you think it could be improved.

##### Requirements:

scipy==1.10  
conda==22.9.0  
Note: the 'trim' parameter used below did not work initially. This was fixed by updating scipy with pip.  
Make sure to check for scipy updates.  
Scipy version 1.10 or higher required.

In [1]:
# import required libraries
import numpy as np
import matplotlib.pyplot as plt

<br>

#### scipy.stats.ttest_ind summary:  
The function can be used within Python code once the required libraries are installed.  
For more information refer to: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html  
By default ('equal_var=True') the function performs a standard independent two-sample t-test, which assumes equal population variances.  
If 'equal_var=False', it performs Welch’s t-test, which does not assume equal population variances.

In [2]:
# import stats module from scipy
from scipy import stats
# rng = NumPy random number generator (unseeded, so the values differ on every run)
rng = np.random.default_rng()

In [3]:
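# quick demonstration: draw a 2x4 array from a normal distribution
# with mean 3 and standard deviation 2.5
np.random.normal(3, 2.5, size=(2, 4))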

array([[2.27664714, 2.87179562, 5.65745644, 0.73893539],
       [1.25857885, 5.3840175 , 4.58105606, 3.94078243]])

In [4]:
# Generate two samples using scipy's normal random variates function (norm.rvs)
# Both samples are drawn with mean 5 (loc), standard deviation 10 (scale) and size 500
# Because they come from the same population, the t-test will usually
# return a p-value above 0.05, i.e. the null hypothesis (that both populations
# have the same mean) is not rejected
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)

<br>

#### scipy.stats.ttest_ind  
This is a test for the null hypothesis that 2 independent samples have identical average (expected) values.  
This test assumes that the populations have identical variances by default.
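
To make the equal-variance assumption concrete, the following is a minimal sketch of the pooled (Student) t-statistic computed by hand and compared with `stats.ttest_ind`. The names `rng_demo`, `x`, `y`, `t_manual` and `p_manual`, and the seed, are assumptions made for this illustration and are not part of the scipy example.

```python
import numpy as np
from scipy import stats

# Hypothetical seed, used only so that this sketch is repeatable
rng_demo = np.random.default_rng(12345)
x = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng_demo)
y = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng_demo)

n1, n2 = len(x), len(y)
# Pooled variance: average of the two sample variances, weighted by degrees of freedom
pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
# Standard error of the difference in means under the equal-variance assumption
std_err = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_manual = (x.mean() - y.mean()) / std_err
# Two-sided p-value from Student's t distribution with n1 + n2 - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

result = stats.ttest_ind(x, y)
print(t_manual, p_manual)
print(result.statistic, result.pvalue)
```

The two printed lines should agree to floating-point precision, which shows that the default test is the classic pooled-variance t-test.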

In [5]:
# run scipy.stats.ttest_ind on the two samples
# ttest_ind = t-test for independent samples
stats.ttest_ind(rvs1, rvs2)

Ttest_indResult(statistic=0.964869172283574, pvalue=0.33484408216393535)

In [6]:
# repeat: the data has not changed, so the result is identical
stats.ttest_ind(rvs1, rvs2)

Ttest_indResult(statistic=0.964869172283574, pvalue=0.33484408216393535)

##### As expected, stats.ttest_ind returns a p-value well above 0.05.  
##### The sample means do not differ sufficiently to reject the null hypothesis.
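
A single run cannot show how often a "high" p-value should be expected. As a rough sketch, the simulation below repeats the same experiment many times and counts how often the test rejects at the 5% level even though both samples come from the same population. The seed, the number of trials and the names `rng_sim`, `sim1`, `sim2` are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng_sim = np.random.default_rng(2022)  # hypothetical seed for repeatability
n_trials = 1000

# Each row is one simulated experiment: two samples of size 500 from N(5, 10)
sim1 = rng_sim.normal(loc=5, scale=10, size=(n_trials, 500))
sim2 = rng_sim.normal(loc=5, scale=10, size=(n_trials, 500))

# axis=1 runs one independent t-test per row
p_values = stats.ttest_ind(sim1, sim2, axis=1).pvalue

# Fraction of experiments that reject the (true) null hypothesis at the 5% level
rejection_rate = np.mean(p_values < 0.05)
print(rejection_rate)
```

The rejection rate should land near 0.05: when the null hypothesis is true, a p-value below 0.05 is still expected in roughly one run in twenty.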

##### Added parameter (equal_var=False) to function

If True (the scipy.stats default), scipy.stats.ttest_ind performs a standard independent two-sample test that assumes equal population variances.  
If False, it performs Welch’s t-test, which does not assume equal population variances.

Welch's t-test is an adaptation of Student's t-test, and is more reliable when the two samples have unequal variances and possibly unequal sample sizes.  
These tests are often referred to as "unpaired" or "independent samples" t-tests.  
The test assumes that both groups of data are sampled from populations that follow a normal distribution, but it does not assume that those two populations have the same variance.  
The formula to calculate the degrees of freedom for Welch’s t-test takes into account the difference between the two standard deviations.  
If the two samples have the same standard deviation and the same size, though, then the degrees of freedom for Welch’s t-test will be exactly the same as the degrees of freedom for Student’s t-test.  
Thus, as the two samples are drawn from the same distribution and have the same size, Welch's test is not expected to change the output materially.
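
The degrees-of-freedom formula referred to above is usually called the Welch-Satterthwaite approximation. The sketch below evaluates it for three cases; the helper `welch_df` is a hypothetical name, and the standard deviations passed in are the population values used in this notebook, standing in for sample estimates.

```python
import math

def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximation to the degrees of freedom."""
    v1 = s1 ** 2 / n1  # squared standard error of the first sample mean
    v2 = s2 ** 2 / n2  # squared standard error of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

# Same standard deviation, same size: matches Student's n1 + n2 - 2
df_equal = welch_df(10, 500, 10, 500)
# Different standard deviations, same size: fewer degrees of freedom
df_unequal_var = welch_df(10, 500, 20, 500)
# Different standard deviations and sizes: fewer still
df_unequal_both = welch_df(10, 500, 20, 100)

print(df_equal, df_unequal_var, df_unequal_both)
```

With 500 observations per sample even the reduced degrees of freedom remain large, which is why the p-value barely moves when only the variances differ.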

In [7]:
stats.ttest_ind(rvs1, rvs2, equal_var=False)

Ttest_indResult(statistic=0.964869172283574, pvalue=0.3348442533559467)

##### Result: no material impact from setting equal_var=False (the statistic is identical and the p-value changes only in the seventh decimal place) <br> <br>

#### t-test for populations with unequal variances

By default this test assumes that the populations have identical variances. Here that assumption is deliberately violated.

In [8]:
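# create a new sample with the same mean as the original samples
# but twice the standard deviation (scale=20)
# for reference, the original samples were created with:
# rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
# rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)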
rvs3 = stats.norm.rvs(loc=5, scale=20, size=500, random_state=rng)

In [9]:
# run the independent t-test on the original and new samples, which have the same mean but different variances
stats.ttest_ind(rvs1, rvs3)

Ttest_indResult(statistic=-0.1007517204960184, pvalue=0.9197677894107497)

stats.ttest_ind returns a p-value > 0.05.  
The sample means do not differ sufficiently to reject the null hypothesis.

##### Conclusion: with the default equal_var=True the null hypothesis is not rejected, even though the variances differ: <br><br>


#### equal_var=False

In [10]:
# repeat the test with equal_var=False (Welch's t-test)
stats.ttest_ind(rvs1, rvs3, equal_var=False)

Ttest_indResult(statistic=-0.1007517204960184, pvalue=0.9197755160600811)

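With equal_var=False, stats.ttest_ind returns a marginally higher p-value (0.9197755 against 0.9197678). The statistic is unchanged because the two sample sizes are equal.  

##### Conclusion: with the default equal_var=True, ttest_ind slightly underestimates p for unequal variances:<br><br>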

#### Different sample sizes

In [11]:
# create a new randomly generated sample with a smaller size (n=100) and scale=20
# rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs4 = stats.norm.rvs(loc=5, scale=20, size=100, random_state=rng)

In [12]:
# run the t-test on two samples of different size n, with equal_var=True (default)
stats.ttest_ind(rvs1, rvs4)

Ttest_indResult(statistic=-0.8301693401092002, pvalue=0.40677433898069393)

In [13]:
# run the t-test on the same two samples of different size n, this time with equal_var=False
stats.ttest_ind(rvs1, rvs4, equal_var=False)

Ttest_indResult(statistic=-0.5162484028739178, pvalue=0.6067443079700066)

##### Conclusion: when n1 != n2, the equal-variance t-statistic is no longer equal to the unequal-variance t-statistic: <br> <br>
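
The reason lies in the standard error used in the denominator of the t-statistic. The sketch below compares the pooled and Welch standard errors for equal and unequal sample sizes. The helpers `pooled_se` and `welch_se` are hypothetical names, and the inputs are the population standard deviations used in this notebook rather than sample estimates.

```python
import math

def pooled_se(s1, n1, s2, n2):
    """Standard error of the mean difference, assuming equal variances."""
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def welch_se(s1, n1, s2, n2):
    """Standard error of the mean difference, without assuming equal variances."""
    return math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)

# Equal sample sizes: both formulas give the same standard error,
# so the two t-statistics are identical
se_pooled_equal_n = pooled_se(10, 500, 20, 500)
se_welch_equal_n = welch_se(10, 500, 20, 500)

# Unequal sample sizes: the formulas diverge, and so do the t-statistics
se_pooled_unequal_n = pooled_se(10, 500, 20, 100)
se_welch_unequal_n = welch_se(10, 500, 20, 100)

print(se_pooled_equal_n, se_welch_equal_n)
print(se_pooled_unequal_n, se_welch_unequal_n)
```

When the smaller sample has the larger variance, as here, pooling gives the smaller standard error and therefore the larger absolute t-statistic.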


#### Different sample sizes, variance and mean

In [14]:
# create a new randomly generated sample with a different mean (loc=8),
# size (n=100) and standard deviation (scale=20)
# rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs5 = stats.norm.rvs(loc=8, scale=20, size=100, random_state=rng)

In [15]:
# standard t-test, equal_var=True (default)
stats.ttest_ind(rvs1, rvs5)

Ttest_indResult(statistic=-2.2302152252698484, pvalue=0.026103469749390016)

In [16]:
# Welch's t-test, equal_var=False
stats.ttest_ind(rvs1, rvs5, equal_var=False)

Ttest_indResult(statistic=-1.3981705254651016, pvalue=0.16494044645676803)

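##### Conclusion: the same effect is seen when n1 != n2 and the variances are unequal: the equal-variance t-statistic is no longer equal to the unequal-variance t-statistic. Here the choice also changes the outcome at the 5% level (p ≈ 0.026 with equal_var=True, p ≈ 0.165 with equal_var=False).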


<br>

#### Permutations  
#####  __Re-run the cells below to check for differences in output__

In [17]:
# permutation test with only 10 permutations
stats.ttest_ind(rvs1, rvs5, permutations=10, random_state=rng)

Ttest_indResult(statistic=-2.2302152252698484, pvalue=0.18181818181818182)

In [18]:
# repeat with 10 permutations
stats.ttest_ind(rvs1, rvs5, permutations=10, random_state=rng)

Ttest_indResult(statistic=-2.2302152252698484, pvalue=0.09090909090909091)

In [19]:
# permutation test with 10,000 permutations
stats.ttest_ind(rvs1, rvs5, permutations=10000, random_state=rng)

Ttest_indResult(statistic=-2.2302152252698484, pvalue=0.027197280271972803)

In [20]:
# repeat with 10,000 permutations
stats.ttest_ind(rvs1, rvs5, permutations=10000, random_state=rng)

Ttest_indResult(statistic=-2.2302152252698484, pvalue=0.025197480251974803)

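##### Conclusion: when performing a permutation test, more permutations typically yield more accurate results. With 10 permutations the two runs gave p ≈ 0.18 and p ≈ 0.09; with 10,000 permutations they gave p ≈ 0.027 and p ≈ 0.025, both close to the analytic p ≈ 0.026.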
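
To show what a permutation test does, here is a minimal hand-written sketch; it is not scipy's implementation. It pools the two samples, reshuffles the group labels, and counts how often the shuffled t-statistic is at least as extreme as the observed one. The function `permutation_pvalue`, the seed and the sample names `g1`, `g2` are assumptions for illustration. The add-one correction `(count + 1) / (n_resamples + 1)` is also an assumption, chosen because the p-values printed above (for example 2/11 and 1/11 with 10 permutations) are consistent with it.

```python
import numpy as np
from scipy import stats

def permutation_pvalue(x, y, n_resamples, rng):
    """Two-sided permutation p-value for the pooled t-statistic (a sketch)."""
    x, y = np.asarray(x), np.asarray(y)
    observed = abs(stats.ttest_ind(x, y).statistic)
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_resamples):
        shuffled = rng.permutation(pooled)  # reassign group labels at random
        t_perm = stats.ttest_ind(shuffled[:len(x)], shuffled[len(x):]).statistic
        if abs(t_perm) >= observed:
            count += 1
    # Add-one correction so that the p-value is never exactly zero
    return (count + 1) / (n_resamples + 1)

rng_perm = np.random.default_rng(7)  # hypothetical seed
g1 = rng_perm.normal(5, 10, size=50)
g2 = rng_perm.normal(8, 20, size=30)

p_few = permutation_pvalue(g1, g2, 10, rng_perm)
p_many = permutation_pvalue(g1, g2, 1000, rng_perm)
print(p_few, p_many)
```

With 10 resamples the p-value can only take the values 1/11, 2/11, ..., 11/11, so it jumps between runs. More resamples give a finer grid and a more stable estimate.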

<br>

#### Trimming data where outliers are likely

Take these two samples, one of which has an extreme tail (potential outlier at 763.3?).

In [21]:
a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)
b = (1.1, 2.9, 4.2)

Use the trim keyword to perform a trimmed (Yuen) t-test. For example, using 20% trimming, trim=.2, the test will reduce the impact of one (np.floor(trim*len(a))) element from each tail of sample a. It will have no effect on sample b because np.floor(trim*len(b)) is 0.
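
The `np.floor(trim*len(a))` rule quoted above can be checked for each trim value used in the cells that follow. This sketch uses `math.floor` in place of `np.floor`; the helper name `n_trimmed_per_tail` is an assumption for illustration.

```python
import math

a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)
b = (1.1, 2.9, 4.2)

def n_trimmed_per_tail(sample, trim):
    """Number of elements affected in each tail: floor(trim * n)."""
    return math.floor(trim * len(sample))

# Map each trim value to (count for a, count for b)
counts = {trim: (n_trimmed_per_tail(a, trim), n_trimmed_per_tail(b, trim))
          for trim in (0.1, 0.2, 0.3)}
print(counts)
# {0.1: (0, 0), 0.2: (1, 0), 0.3: (2, 0)}
```

With trim=.1 nothing is trimmed from either sample, so that test returns the same result as the untrimmed one. Sample b is too small to be affected by any of these values.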

In [22]:
# run the t_test without trim (as default)
stats.ttest_ind(a, b)

Ttest_indResult(statistic=1.099305186099593, pvalue=0.30361296704535845)

In [23]:
# run the t_test with trim (20%)
stats.ttest_ind(a, b, trim=.2)

Ttest_indResult(statistic=3.4463884028073513, pvalue=0.01369338726499547)

In [24]:
# run the t_test with trim (30%)
stats.ttest_ind(a, b, trim=.3)

Ttest_indResult(statistic=2.832256715395378, pvalue=0.04723681941400341)

In [25]:
# run the t_test with trim (10%)
stats.ttest_ind(a, b, trim=.1)

Ttest_indResult(statistic=1.099305186099593, pvalue=0.30361296704535845)

As can be seen, the trim parameter is to be used wisely: it can lead to false positives if genuine data is removed, and there is potential for false negatives if outliers are not omitted.  
Trimming is recommended if the underlying distribution is long-tailed or contaminated with outliers.  
This requires judgement by the data scientist.  
Where there are potential outliers, other statistical tools can help to identify them, such as simple sorting, visual checks (e.g. boxplots, histograms) or converting the values to standard scores (z-scores).
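
As one possible check, the sketch below applies the 1.5 × IQR rule, the same rule a boxplot uses to draw its whiskers, to sample `a`. It uses the standard-library `statistics.quantiles` function; the choice of this rule and the variable names are assumptions for illustration, not part of the scipy example.

```python
import statistics

a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)

# Quartiles of the sample (statistics.quantiles sorts the data internally)
q1, _, q3 = statistics.quantiles(a, n=4)
iqr = q3 - q1

# Boxplot-style fences: points beyond them are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in a if v < lower_fence or v > upper_fence]

print(sorted(a))
print(outliers)
```

For this sample only 763.3 falls outside the fences, which supports treating it as a potential outlier.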


To trim a dataset, it is first sorted and then a predetermined percentage of observations is removed from each end.  <br>

Why trim? Trimmed means are robust estimators of central tendency.  
To compute a trimmed mean, we remove a predetermined number of observations on each side of a distribution, and average the remaining observations.  
If you think you’re not familiar with trimmed means, you already know one famous member of this family: the median.  
Indeed, the median is an extreme trimmed mean, in which all observations are removed except one or two.
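
The description above can be sketched in a few lines of standard-library Python. The helper `trimmed_mean` is a hypothetical name written for this illustration. It removes `floor(trim * n)` observations from each end of the sorted sample.

```python
import math
import statistics

def trimmed_mean(sample, trim):
    """Mean after removing floor(trim * n) observations from each tail."""
    ordered = sorted(sample)
    k = math.floor(trim * len(ordered))
    kept = ordered[k:len(ordered) - k]
    return statistics.fmean(kept)

a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)

mean_a = trimmed_mean(a, 0.0)       # no trimming: the ordinary mean
trimmed_20 = trimmed_mean(a, 0.2)   # drops 12 and 763.3
extreme = trimmed_mean(a, 0.45)     # drops three per tail, keeps the middle value
median_a = statistics.median(a)

print(mean_a, trimmed_20, extreme, median_a)
```

The 20% trimmed mean sits much closer to the bulk of the data than the ordinary mean, and the most extreme trim reproduces the median.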

Using trimmed means confers two advantages:

- Trimmed means provide a better estimation of the location of the bulk of the observations than the mean when sampling from asymmetric distributions; 
- The standard error of the trimmed mean is less affected by outliers and asymmetry than the mean, so that tests using trimmed means can have more power than tests using the mean.  

Note: if we use a trimmed mean in an inferential test, we make inferences about the population trimmed mean, not the population mean.  
The same is true for the median or any other measure of central tendency.


## How could it be improved

As can be seen above, more permutations typically yield more accurate results. Seeding the random number generator would make the results reproducible from one run to the next.  
Refer to: NumPy documentation as below
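
A minimal sketch of the reproducibility point, assuming a fixed integer seed is acceptable for the exercise: creating the generator with `np.random.default_rng(seed)` makes the samples, and therefore the test results, identical on every run. The names `SEED` and `run_test` are hypothetical.

```python
import numpy as np
from scipy import stats

SEED = 42  # hypothetical value; any fixed integer works

def run_test(seed):
    """Draw two samples from a freshly seeded generator and run Welch's t-test."""
    gen = np.random.default_rng(seed)
    s1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=gen)
    s2 = stats.norm.rvs(loc=8, scale=20, size=100, random_state=gen)
    return stats.ttest_ind(s1, s2, equal_var=False)

first = run_test(SEED)
second = run_test(SEED)

# Same seed, same data, same result
print(first.statistic == second.statistic, first.pvalue == second.pvalue)
```

An unseeded generator, as used earlier in this notebook, draws fresh values each time the notebook is restarted, so the printed outputs cannot be reproduced exactly.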

##### References

##### Ronald Fisher, a Bad Cup of Tea, and the Birth of Modern Statistics:
https://www.sciencehistory.org/distillations/ronald-fisher-a-bad-cup-of-tea-and-the-birth-of-modern-statistics 
##### Basic Statistics- Trimmed Means  
https://garstats.wordpress.com/2017/11/28/trimmed-means/ 
https://www.scribbr.com/statistics/outliers/  
##### Random Number Generator 
https://numpy.org/doc/stable/reference/random/generator.html

##### End  
---