# Statistics Exercises

### Exercise 3

Take the code from the Examples section of the SciPy stats documentation for the independent-samples t-test, add it to your own notebook, and explain how it works using Markdown cells and code comments. Improve it in any way you think it could be improved.

##### Requirements:

scipy==1.10  
Python==3.17  
Numpy==  
matplotlib==  
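Note: the 'trim' keyword of ttest_ind used below did not work initially because the installed SciPy (1.6.2) was too old.  
Make sure to check for SciPy updates.  
The 'trim' keyword was added in SciPy 1.7; this notebook was run after upgrading to SciPy 1.10.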

In [1]:
# import required libraries

import numpy as np
import matplotlib.pyplot as plt

##### Summary:  
ttest_ind tests the null hypothesis that two independent samples come from populations with identical means. By default ('equal_var=True') it also assumes that the populations have identical variances.  
If True (default), perform a standard independent two-sample test that assumes equal population variances.  
If False, perform Welch’s t-test, which does not assume equal population variances.
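
To make the difference between the two settings concrete, here is a minimal sketch that computes both t-statistics by hand and compares them with `stats.ttest_ind`. The seed, sample sizes and the names `x`, `y`, `t_pooled` and `t_welch` are illustrative choices, not part of the SciPy example.

```python
import numpy as np
from scipy import stats

# Fixed seed (an arbitrary choice) so the numbers are reproducible
rng = np.random.default_rng(12345)
x = rng.normal(loc=5, scale=10, size=50)
y = rng.normal(loc=5, scale=20, size=30)

n1, n2 = len(x), len(y)
v1, v2 = x.var(ddof=1), y.var(ddof=1)  # unbiased sample variances
diff = x.mean() - y.mean()

# equal_var=True: both samples share one pooled variance estimate
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = diff / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# equal_var=False (Welch): each sample keeps its own variance
t_welch = diff / np.sqrt(v1 / n1 + v2 / n2)

res_pooled = stats.ttest_ind(x, y)
res_welch = stats.ttest_ind(x, y, equal_var=False)
print(t_pooled, res_pooled.statistic)
print(t_welch, res_welch.statistic)
```

The two hand-computed values should agree with the statistics SciPy returns, which suggests that the only difference between the two modes is how the standard error of the mean difference is estimated (and, for the p-value, the degrees of freedom).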

In [2]:
# import the stats module from scipy
# Note: the 'trim' keyword of ttest_ind used below did not work initially
# (it needs SciPy 1.7 or higher); check for scipy updates if it fails

from scipy import stats
# rng = random number generator
rng = np.random.default_rng()

Test with samples that have identical means:

In [3]:
rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
stats.ttest_ind(rvs1, rvs2)
# Output varies when re-run because rng is unseeded (new samples are drawn each run)
# Which of the above parameters would reduce the output variability, and why?

Ttest_indResult(statistic=-1.4281763141801487, pvalue=0.1535539742001945)

#### Comments:  
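The output varies when re-run because rng is created without a seed, so each run draws fresh random samples.  

Which of the above parameters would reduce the output variability when run, and why? Passing a fixed seed to np.random.default_rng makes the result repeatable; loc, scale and size change the populations being sampled rather than removing the run-to-run randomness.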
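
One way to answer the question above is to seed the generator. The sketch below wraps the test in a hypothetical helper, `run_test`, and uses an arbitrary seed of 42; neither is part of the SciPy example.

```python
import numpy as np
from scipy import stats

def run_test(seed=None):
    """Draw two samples with identical means and run the t-test."""
    rng = np.random.default_rng(seed)  # seed=None gives a fresh, unpredictable stream
    rvs1 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
    rvs2 = stats.norm.rvs(loc=5, scale=10, size=500, random_state=rng)
    return stats.ttest_ind(rvs1, rvs2)

first = run_test(seed=42)
second = run_test(seed=42)
unseeded = run_test()

print(first.statistic == second.statistic)  # same seed, same samples, same result
print(unseeded.statistic)                   # changes from run to run
```

With a fixed seed, the two seeded calls draw identical samples and so return identical statistics and p-values, while the unseeded call keeps varying.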

##### Added parameter (equal_var=False) to function

In [4]:
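# Added parameter (equal_var=False) to run Welch's t-test
# Only a slight difference from the output above: with equal sample sizes the
# statistic is the same and only the degrees of freedom (hence the p-value) change
stats.ttest_ind(rvs1, rvs2, equal_var=False)
# Uses the same rvs1 and rvs2 as the cell above, so the result depends on that random draw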

Ttest_indResult(statistic=-1.428176314180149, pvalue=0.1535561315214235)
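
The near-identical outputs above can be checked by hand. The sketch below assumes the usual Welch-Satterthwaite formula for the degrees of freedom and uses freshly drawn, seeded samples `x` and `y` (illustrative names) rather than the notebook's `rvs1` and `rvs2`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.normal(5, 10, size=500)
y = rng.normal(5, 10, size=500)

n1, n2 = len(x), len(y)
v1, v2 = x.var(ddof=1), y.var(ddof=1)

# Degrees of freedom: pooled test vs Welch-Satterthwaite approximation
df_pooled = n1 + n2 - 2
df_welch = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

pooled = stats.ttest_ind(x, y)
welch = stats.ttest_ind(x, y, equal_var=False)

# Two-sided p-value recomputed from the Welch statistic and df
p_welch_manual = 2 * stats.t.sf(abs(welch.statistic), df_welch)

print(pooled.statistic, welch.statistic)  # equal up to rounding when n1 == n2
print(df_pooled, df_welch)
print(welch.pvalue, p_welch_manual)
```

When the sample sizes are equal, the two statistics coincide and only the degrees of freedom differ, which is why the p-values differ only in the later decimal places.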

ttest_ind underestimates p for unequal variances:

In [5]:
# create a new set of values
rvs3 = stats.norm.rvs(loc=5, scale=20, size=500, random_state=rng)
stats.ttest_ind(rvs1, rvs3)

Ttest_indResult(statistic=-0.586307140237495, pvalue=0.5578017130663026)

In [6]:
# Welch's t-test on the same data (equal_var=False)
# The statistic is unchanged because n1 == n2; only the p-value shifts slightly
stats.ttest_ind(rvs1, rvs3, equal_var=False)

Ttest_indResult(statistic=-0.586307140237495, pvalue=0.5578516896723974)

When n1 != n2, the equal variance t-statistic is no longer equal to the unequal variance t-statistic:

In [7]:
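rvs5 = stats.norm.rvs(loc=8, scale=20, size=100, random_state=rng)
stats.ttest_ind(rvs1, rvs5)
# Example output quoted from the SciPy docs: Ttest_indResult(statistic=-2.8415950600298774, pvalue=0.0046418707568707885)
stats.ttest_ind(rvs1, rvs5, equal_var=False)
# Example output quoted from the SciPy docs: Ttest_indResult(statistic=-1.8686598649188084, pvalue=0.06434714193919686)
# Only the last expression in a cell is displayed, so the output below is the Welch result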

Ttest_indResult(statistic=0.42722856797285547, pvalue=0.6700609949260787)
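
To see why the choice of test matters when both the sample sizes and the variances differ, the following sketch simulates many pairs of samples with the same mean and counts how often each test rejects at the 5% level. The seed, the number of repetitions and the variable names are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2023)  # arbitrary seed for reproducibility
n_reps = 2000

# Both populations have mean 5, so every rejection is a false positive
big = rng.normal(loc=5, scale=10, size=(n_reps, 500))    # large n, small variance
small = rng.normal(loc=5, scale=20, size=(n_reps, 100))  # small n, large variance

# axis=1 runs one test per row
p_pooled = stats.ttest_ind(big, small, axis=1).pvalue
p_welch = stats.ttest_ind(big, small, axis=1, equal_var=False).pvalue

rate_pooled = np.mean(p_pooled < 0.05)
rate_welch = np.mean(p_welch < 0.05)
print(f"false positive rate, equal_var=True:  {rate_pooled:.3f}")
print(f"false positive rate, equal_var=False: {rate_welch:.3f}")
```

In this setup the pooled test should reject far more often than the nominal 5%, while Welch's test should stay close to it, which is the sense in which the equal-variance test underestimates p.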

When performing a permutation test, more permutations typically yield more accurate results. Use a np.random.Generator to ensure reproducibility:

In [8]:
# stats.ttest_ind(rvs1, rvs5, permutations=10000,random_state=rng)
# Ttest_indResult(statistic=-2.8415950600298774, pvalue=0.0052994700529947)
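
As a rough illustration of what a permutation test does, here is a hand-rolled sketch in plain NumPy. It is not SciPy's implementation: the helper `permutation_pvalue` is hypothetical, it uses the difference in means rather than the t-statistic, and the seeds and sample sizes are arbitrary.

```python
import numpy as np

def permutation_pvalue(x, y, n_resamples=10000, seed=None):
    """Two-sided permutation p-value for the difference in means."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = x.mean() - y.mean()
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_resamples):
        # Shuffle the group labels by permuting the pooled data
        shuffled = rng.permutation(pooled)
        diff = shuffled[:len(x)].mean() - shuffled[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    # Add one to numerator and denominator so the p-value is never exactly 0
    return (count + 1) / (n_resamples + 1)

data_rng = np.random.default_rng(7)
x = data_rng.normal(5, 10, size=50)
y = data_rng.normal(8, 20, size=30)

p1 = permutation_pvalue(x, y, seed=1)
p2 = permutation_pvalue(x, y, seed=1)
print(p1, p2)  # identical because the same seed drives the shuffles
```

The p-value is the share of shuffled datasets whose mean difference is at least as extreme as the observed one. More resamples give a finer-grained estimate, and seeding the generator makes that estimate repeatable.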

Take these two samples, one of which has an extreme tail.

In [9]:
a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)
b = (1.1, 2.9, 4.2)

Use the trim keyword to perform a trimmed (Yuen) t-test. For example, using 20% trimming, trim=.2, the test will reduce the impact of one (np.floor(trim*len(a))) element from each tail of sample a. It will have no effect on sample b because np.floor(trim*len(b)) is 0.
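
The element counts mentioned above can be checked directly. This small sketch only reproduces the arithmetic; the names `n_trim_a` and `n_trim_b` are illustrative.

```python
import numpy as np

a = (56, 128.6, 12, 123.8, 64.34, 78, 763.3)
b = (1.1, 2.9, 4.2)
trim = 0.2

# Number of elements affected in each tail
n_trim_a = int(np.floor(trim * len(a)))  # floor(1.4) is 1
n_trim_b = int(np.floor(trim * len(b)))  # floor(0.6) is 0
print(n_trim_a, n_trim_b)  # 1 0
```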

In [10]:
stats.ttest_ind(a, b)
# Standard (untrimmed) test, for comparison with the trimmed test in the next cell

Ttest_indResult(statistic=1.099305186099593, pvalue=0.30361296704535845)

In [11]:
stats.ttest_ind(a, b, trim=.2)
# Ttest_indResult(statistic=3.4463884028073513,pvalue=0.01369338726499547)

Ttest_indResult(statistic=3.4463884028073513, pvalue=0.01369338726499547)

Why trim? Trimmed means are robust estimators of central tendency. To compute a trimmed mean, we remove a predetermined number of observations on each side of a distribution, and average the remaining observations. If you think you’re not familiar with trimmed means, you already know one famous member of this family: the median. Indeed, the median is an extreme trimmed mean, in which all observations are removed except one or two.

Using trimmed means confers two advantages:

- Trimmed means provide a better estimation of the location of the bulk of the observations than the mean when sampling from asymmetric distributions;
- the standard error of the trimmed mean is less affected by outliers and asymmetry than that of the mean, so that tests using trimmed means can have more power than tests using the mean.

Important point: if we use a trimmed mean in an inferential test (such as the trimmed t-test above), we make inferences about the population trimmed mean, not the population mean. The same is true for the median or any other measure of central tendency. So each robust estimator is a tool to answer a specific question, and this is why different estimators can return different answers.

To trim a dataset, it is first sorted and then a predetermined percentage of the observations is removed from each end.
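
The sort-and-cut procedure can be sketched in a few lines, using sample a from earlier and 20% trimming. The manual version is an illustration only; `scipy.stats.trim_mean` is used as a cross-check.

```python
import numpy as np
from scipy import stats

a = np.array([56, 128.6, 12, 123.8, 64.34, 78, 763.3])
trim = 0.2

# Sort, then drop k observations from each end
k = int(np.floor(trim * len(a)))
a_sorted = np.sort(a)
trimmed = a_sorted[k:len(a) - k]

manual = trimmed.mean()
scipy_tm = stats.trim_mean(a, trim)

print(a.mean())  # about 175.15, pulled up by the extreme value 763.3
print(manual)    # about 90.15 once 12 and 763.3 are removed
print(scipy_tm)  # matches the manual result
```

The ordinary mean is dominated by the single extreme value, whereas the trimmed mean sits near the bulk of the observations.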

##### References
<br>
Ronald Fisher, a Bad Cup of Tea, and the Birth of Modern Statistics:  <br>
https://www.sciencehistory.org/distillations/ronald-fisher-a-bad-cup-of-tea-and-the-birth-of-modern-statistics <br><br>
Basic Statistics- Trimmed Means <br>https://garstats.wordpress.com/2017/11/28/trimmed-means/

pip install --upgrade scipy
Requirement already satisfied: scipy in c:\users\polma\anaconda3\lib\site-packages (1.6.2)
Collecting scipy
  Downloading scipy-1.10.0-cp38-cp38-win_amd64.whl (42.2 MB)
     ---------------------------------------- 42.2/42.2 MB 4.3 MB/s eta 0:00:00
Requirement already satisfied: numpy<1.27.0,>=1.19.5 in c:\users\polma\anaconda3\lib\site-packages (from scipy) (1.20.1)
Installing collected packages: scipy
  Attempting uninstall: scipy
    Found existing installation: scipy 1.6.2
    Uninstalling scipy-1.6.2:
      Successfully uninstalled scipy-1.6.2
Successfully installed scipy-1.10.0