# Code example: MP574 - Basic Statistics

*Disclaimer*: This notebook mainly covers running basic statistics in Python. In 2023, the MP574 class was heavy on statistics, with an end goal of learning the fundamentals of machine / deep learning. In 2024, the class evolved to focus more on the deep learning aspects of the class and their application. The course will continue to evolve with time. However, running statistical tests on a data sample, or generating a sample of data from a known distribution, will continue to be prevalent in both the course and research.

This notebook discusses
* How to generate a distribution of data
* How to perform basic statistical tests on data
* How to convert between score and probability for some distributions # WORK IN PROGRESS

There are multiple Python packages that can generate distributions and perform statistical tests. Here, we will focus primarily on two well-known packages: `numpy` for data generation and `scipy.stats` for statistical testing.

A basic `matplotlib.pyplot` will be used for visualization, `math` will be used for accompanying calculations, and `random` will be used for comparisons.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math as m
from scipy import stats
import random

---

# Generating data distributions with `numpy`

`np`'s module `random` (i.e. `np.random`) contains pre-defined generators for different data samples. Documentation for `np.random` can be found [here](https://numpy.org/doc/1.16/reference/routines.random.html).

Whereas Python's built-in `random` package generates a single value per call, `np.random` is extremely efficient at generating a whole sample of data.

The most basic and default generator is a uniform distribution.

The functions `np.random.random` and `np.random.rand` both sample a uniform distribution over [0, 1), similar to `random.random`.

In [None]:
random_np_value = np.random.random()
random_rand_value = random.random()
print(f'Random value from numpy: {random_np_value}')
print(f'Random value from random: {random_rand_value}')

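As a small sketch of the uniform generators mentioned above: `np.random.random` and `np.random.rand` draw from [0, 1), while `np.random.uniform` (not used elsewhere in this notebook) accepts custom bounds. The bounds 2 and 4 and the sample size of 5 are arbitrary illustrative values.

```python
import numpy as np

# Both calls draw 5 values from the uniform distribution over [0, 1)
sample_random = np.random.random(size=5)
sample_rand = np.random.rand(5)  # rand takes the dimensions as positional arguments

# np.random.uniform draws from [low, high); the bounds here are arbitrary
sample_uniform = np.random.uniform(low=2, high=4, size=5)

print(f'np.random.random:  {sample_random}')
print(f'np.random.rand:    {sample_rand}')
print(f'np.random.uniform: {sample_uniform}')
```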
If we wanted 10 random values from that distribution, we could do so with `random` and a for loop like so:

In [None]:
sample = []
for _ in range(10):
    sample.append(random.random())
print(f'Random sample of data from a for loop: {sample}')

Or, you could use `np.random.random` with a defined *size* for the sample.

In [None]:
sample = np.random.random(size = 10)

print(f'Random sample of data from numpy: {sample}')

When interested in lots of random numbers or a distribution of data, a package like `numpy` enables efficient generation.

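One rough way to see the efficiency difference is to time both approaches. This is only a sketch: the sample size and repetition count are arbitrary, the helper names `loop_sample` and `numpy_sample` are made up for illustration, and the exact timings depend on the machine.

```python
import random
import timeit

import numpy as np

n_values = 100_000  # arbitrary sample size for the comparison


def loop_sample(n):
    """Build a sample one value at a time with the built-in random module."""
    return [random.random() for _ in range(n)]


def numpy_sample(n):
    """Build a sample of the same size with a single numpy call."""
    return np.random.random(size=n)


# Time 5 repetitions of each approach
loop_time = timeit.timeit(lambda: loop_sample(n_values), number=5)
numpy_time = timeit.timeit(lambda: numpy_sample(n_values), number=5)

print(f'for loop: {loop_time:.4f} s')
print(f'numpy:    {numpy_time:.4f} s')
```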
`numpy.random` is home to many different data distribution generators with a full list found [here](https://numpy.org/doc/1.16/reference/routines.random.html#distributions) within the documentation. Below are some key examples of distributions:

---
### Gaussian / Normal Distribution

A Gaussian or normal distribution is defined by its mean ($\mu$, mu) and standard deviation ($\sigma$, sigma). In `numpy.random`, these are passed as the location (`loc`) and scale (`scale`) arguments:

```python
np.random.normal(loc = mu, scale = sigma)
```

Running that by itself will return a single value sampled from that distribution. If you also define the argument `size` you can generate multiple values. Below, we visualize 100,000 samples of a normal distribution from numpy: 

In [None]:
mu = 10
sigma = 5

# Generate a normal distribution of data with location / average = mu & scale / standard deviation = sigma
normal_dist = np.random.normal(loc=mu, scale=sigma, size=100000)

# Calculate a histogram of the data sample for the sake of visualization
histogram, bin_edges = np.histogram(normal_dist, bins=100, range = [mu - (sigma * 5), mu + (sigma * 5)], density=True)
bin_centers = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(len(bin_edges) - 1)]

# Calculate a theoretical distribution based off of the same mu & sigma
x = np.asarray(bin_centers)
y = (1 / (sigma * m.sqrt(2 * m.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Plot the data
fig, ax = plt.subplots()
ax.bar(bin_centers, histogram, label='Sample')
ax.plot(x, y, ls='--', c='r', label='Theoretical')

ax.legend()
ax.set_ylabel('Probability')
ax.set_xlabel('Value')
ax.set_title(f'Normal Distribution Plot (' + r'$\rm\mu$' + f' = {mu}, ' + r'$\rm\sigma$' + f' = {sigma})')

plt.show()

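Besides the visual comparison, the sample can be checked numerically against the parameters it was generated from. The sketch below assumes a seeded `np.random.default_rng` generator (not used in the cells above) so that the numbers are reproducible; the seed value is arbitrary.

```python
import numpy as np

# Assumption: a seeded generator so the numbers are reproducible
rng = np.random.default_rng(seed=0)

mu = 10
sigma = 5
normal_dist = rng.normal(loc=mu, scale=sigma, size=100000)

sample_mean = normal_dist.mean()
sample_std = normal_dist.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Fraction of values within one standard deviation of mu (theory: about 0.683)
within_one_sigma = np.mean(np.abs(normal_dist - mu) < sigma)

print(f'Sample mean: {sample_mean:.3f} (expected {mu})')
print(f'Sample standard deviation: {sample_std:.3f} (expected {sigma})')
print(f'Fraction within one sigma: {within_one_sigma:.3f}')
```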
---

# Running statistical tests using scipy

There are many different statistical tests to examine differences between datasets that you will come across in class and through research. `scipy`'s module `stats` contains many different useful functions that can aid in your analysis!

Here we will use a combination of `np.random` to generate distributions of data and `scipy.stats` to test for their significance.

The output of a `scipy.stats` test typically contains at least two elements: the statistic first and the associated p-value second. Other descriptors may be output as well, depending on the test.

`scipy.stats` has a ton of different functions that can be found [here](https://docs.scipy.org/doc/scipy/reference/stats.html).

---
### T-Test

A t-test examines the difference between the means of two samples, under the assumption that each sample follows a normal (Gaussian) distribution.

There are implementations for both independent and related (paired) datasets:

```python
independent_test = stats.ttest_ind( ... )
related_test = stats.ttest_rel( ... )
```

Typically, the t-test is performed with "two-tails". Here, the argument `alternative` is used to define the type of test that you are running. Default is `"two-sided"`, however there are options for `"less"` and `"greater"`.

Here, the output is a `TtestResult` that can be unpacked like a `tuple` of the relevant elements (the statistic first, then the p-value).

In [None]:
# Two groups of n=100 data distributions randomly generated from gaussian distributions
# using different mu and sigmas.
group_A = np.random.normal(loc = 100, scale = 10, size = 100)
group_B = np.random.normal(loc = 80, scale = 5, size = 100)

# Running the t-test and gathering the direct output
out = stats.ttest_ind(group_A, group_B)
print(f'The output of the stats.ttest_ind is: {out}\n')

# Running the t-test and gathering the specific outputs
test_statistic, p_value = stats.ttest_ind(group_A, group_B)
print(f'The test statistic is: {test_statistic}')
print(f'The p value is: {p_value}')

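To illustrate the difference between the independent and related tests, and the `alternative` argument, here is a minimal sketch on hypothetical paired data (the same 30 subjects measured before and after). The data, the seed, and the variable names are assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

# Assumption: a seeded generator so the example is reproducible
rng = np.random.default_rng(seed=0)

# Hypothetical paired data: each subject shifts by roughly +2 after treatment
before = rng.normal(loc=100, scale=10, size=30)
after = before + rng.normal(loc=2, scale=1, size=30)

# The independent test ignores the pairing; the related test uses per-subject differences
independent_test = stats.ttest_ind(after, before)
related_test = stats.ttest_rel(after, before)

# One-sided version: is the mean of `after` greater than the mean of `before`?
one_sided_test = stats.ttest_rel(after, before, alternative='greater')

print(f'Independent two-sided p-value: {independent_test.pvalue}')
print(f'Related two-sided p-value:     {related_test.pvalue}')
print(f'Related one-sided p-value:     {one_sided_test.pvalue}')
```

Because the subject-to-subject spread is much larger than the shift, the related test is far more sensitive here than the independent one.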

---
# Converting between test statistics and p-value

A common problem in statistical tests is matching a test statistic with its associated p-value and vice versa. There are classic look-up tables that give a rough value, and multiple websites offer this functionality. Of course, this can also be accomplished in Python.

The `scipy.stats` package contains statistical tests (used above); it also includes theoretical distributions, which can be used to convert between a test statistic and a probability.

A Gaussian / normal distribution is provided by `scipy.stats.norm`. This distribution has the methods `.ppf` (percent point function) and `.cdf` (cumulative distribution function).

The percent point function takes in the lower-tail probability `q` (named `p_value` in the cells below) and returns the test statistic associated with it.

The cumulative distribution function does the opposite: it takes in the test statistic and returns the associated lower-tail probability. For an upper-tail p-value, use one minus this value.

In [None]:
p_value = 0.99
z_score = stats.norm.ppf(q = p_value)

print(f'Considering a Gaussian distribution...')
print(f'For given p-value {p_value}, the z_score is: {z_score}')

In [None]:
z_score = 0.86
p_value = stats.norm.cdf(z_score)

print(f'Considering a Gaussian distribution...')
print(f'For given z_score {z_score}, the p-value is {p_value}')

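The two conversions can be combined into the quantities usually needed in practice. The following sketch shows the round trip between `ppf` and `cdf`, the critical z-score for a two-sided test, and tail p-values via the survival function `stats.norm.sf` (equal to `1 - cdf`). The values of `alpha` and `z_observed` are arbitrary illustrative choices.

```python
from scipy import stats

# Round trip: ppf and cdf are inverses of each other
q = 0.99
z_score = stats.norm.ppf(q)
q_back = stats.norm.cdf(z_score)

# Critical z-score for a two-sided test at alpha = 0.05 (2.5% in each tail), about 1.96
alpha = 0.05
z_critical = stats.norm.ppf(1 - alpha / 2)

# p-values for an observed z-score
z_observed = 0.86
p_upper_tail = stats.norm.sf(z_observed)          # same as 1 - cdf(z_observed)
p_two_sided = 2 * stats.norm.sf(abs(z_observed))  # both tails

print(f'Round trip: q = {q} -> z = {z_score:.4f} -> q = {q_back:.4f}')
print(f'Two-sided critical z at alpha = {alpha}: {z_critical:.4f}')
print(f'Upper-tail p-value for z = {z_observed}: {p_upper_tail:.4f}')
print(f'Two-sided p-value for z = {z_observed}: {p_two_sided:.4f}')
```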
These functions work for other distributions as well. When working with specific distributions, make sure to check the documentation to ensure all the required arguments are provided. For instance, the $\rm\chi^2$ distribution requires the degrees of freedom:

In [None]:
p_value = 0.95
df = 3
chi2_value = stats.chi2.ppf(q=p_value, df=df)

print(f'Considering a chi squared distribution with {df} degrees of freedom')
print(f'For given p-value of {p_value}, the chi-squared value is {chi2_value}')

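As a further sketch, the same idea reproduces the p-value of a t-test from its test statistic. This assumes the default equal-variance form of `stats.ttest_ind`, whose statistic follows a Student's t distribution with `n_A + n_B - 2` degrees of freedom; the group parameters and the seed are arbitrary. The last lines do the reverse of the chi-squared cell above, converting a statistic back into an upper-tail probability.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # assumption: seeded for reproducibility

group_A = rng.normal(loc=100, scale=10, size=100)
group_B = rng.normal(loc=98, scale=10, size=100)

result = stats.ttest_ind(group_A, group_B)

# Two-sided p-value from the t distribution with n_A + n_B - 2 degrees of freedom
dof = len(group_A) + len(group_B) - 2
p_manual = 2 * stats.t.sf(abs(result.statistic), df=dof)

print(f'p-value from ttest_ind:      {result.pvalue}')
print(f'p-value from stats.t.sf:     {p_manual}')

# Upper-tail probability for a chi-squared statistic with 3 degrees of freedom
chi2_value = stats.chi2.ppf(q=0.95, df=3)
p_chi2 = stats.chi2.sf(chi2_value, df=3)
print(f'chi-squared value {chi2_value:.4f} has upper-tail probability {p_chi2:.4f}')
```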
---

# Additional Examples:

## More examples of data distributions

### Binomial or Bernoulli Distribution

A discrete distribution of choices.

Note, a single trial (n=1, it either is or isn't) is a Bernoulli distribution. However, as you increase the number of trials, you get a Binomial.

`np.random.binomial` covers both use cases through the parameters `n` for the number of trials and `p` for the probability of success in each trial.

```python
np.random.binomial(n = n, p = p)
```

n = 1, Bernoulli distribution.

i.e. a coin flip:

In [None]:
# There is n = 1 trial (a single coin flip) where each outcome (Heads or Tails) has a 50% chance of occurring
n = 1
p = 0.5

# Generate a binomial distribution of 1000 coin flips
binomial_dist = np.random.binomial(n=n, p=p, size=1000)

# Calculate a histogram of the data sample for the sake of visualization
# (bin edges at -0.5, 0.5, ... so that each bar is centered on an integer outcome)
histogram, bin_edges = np.histogram(binomial_dist, bins=n+1, range = [-0.5, n + 0.5], density=True)
bin_centers = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(len(bin_edges) - 1)]

# Calculate the theoretical probability mass function based off of the same n & p
x = np.arange(n + 1)
y = stats.binom.pmf(x, n=n, p=p)

# Plot the data
fig, ax = plt.subplots()
ax.bar(bin_centers, histogram, label='Sample')
ax.plot(x, y, ls='--', marker='o', c='r', label='Theoretical')

ax.legend()
ax.set_ylabel('Probability')
ax.set_xlabel('Value')
ax.set_title(f'Binomial Distribution Plot (n = {n}, p = {p})')

plt.show()

n = 10, Binomial Distribution with a probability shift

In [None]:
# There are n = 10 trials where each trial has a 20% chance of success
n = 10
p = 0.2

# Generate a binomial distribution of 1000 draws (each draw counts the successes in n trials)
binomial_dist = np.random.binomial(n=n, p=p, size=1000)

# Calculate a histogram of the data sample for the sake of visualization
# (bin edges at -0.5, 0.5, ... so that each bar is centered on an integer outcome)
histogram, bin_edges = np.histogram(binomial_dist, bins=n+1, range = [-0.5, n + 0.5], density=True)
bin_centers = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(len(bin_edges) - 1)]

# Calculate the theoretical probability mass function based off of the same n & p
x = np.arange(n + 1)
y = stats.binom.pmf(x, n=n, p=p)

# Plot the data
fig, ax = plt.subplots()
ax.bar(bin_centers, histogram, label='Sample')
ax.plot(x, y, ls='--', marker='o', c='r', label='Theoretical')

ax.legend()
ax.set_ylabel('Probability')
ax.set_xlabel('Value')
ax.set_title(f'Binomial Distribution Plot (n = {n}, p = {p})')

plt.show()

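A quick numerical check of the binomial parameters: a binomial distribution has mean `n * p` and variance `n * p * (1 - p)`, and for the Bernoulli case (n = 1) the sample mean estimates `p` directly. The sketch below assumes a seeded `np.random.default_rng` generator and a larger sample size than the cells above, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumption: seeded for reproducibility

n = 10
p = 0.2
binomial_dist = rng.binomial(n=n, p=p, size=100000)

theoretical_mean = n * p
theoretical_var = n * p * (1 - p)

sample_mean = binomial_dist.mean()
sample_var = binomial_dist.var(ddof=1)

print(f'Mean:     sample = {sample_mean:.3f}, theory = {theoretical_mean:.3f}')
print(f'Variance: sample = {sample_var:.3f}, theory = {theoretical_var:.3f}')

# Bernoulli special case: with n = 1 the sample mean estimates p directly
coin_flips = rng.binomial(n=1, p=0.5, size=100000)
print(f'Fraction of heads in the coin flips: {coin_flips.mean():.3f}')
```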
### Chi-Squared Distribution

A continuous distribution based off of the degrees of freedom (`df`) in an experiment.

```python
np.random.chisquare(df = df)
```

In [None]:
df = 3

# Generate a Chi Squared distribution with df degrees of freedom
chi_squared_distribution = np.random.chisquare(df=df, size=100000)

# Calculate a histogram of the data sample for the sake of visualization
histogram, bin_edges = np.histogram(chi_squared_distribution, bins=100, range = [0, 15], density=True)
bin_centers = [(bin_edges[i] + bin_edges[i + 1]) / 2 for i in range(len(bin_edges) - 1)]

# Calculate a theoretical distribution based off of the same df
x = np.asarray(bin_centers)
y = (x ** ((df/2) - 1) * np.exp(-x / 2)) / (2 ** (df / 2) * m.gamma(df / 2))

# Plot the data
fig, ax = plt.subplots()
ax.bar(bin_centers, histogram, label='Sample')
ax.plot(x, y, ls='--', c='r', label='Theoretical')

ax.legend()
ax.set_ylabel('Probability')
ax.set_xlabel('Value')
ax.set_title(r'$\rm\chi^2$' + f' Distribution Plot (df = {df})')

plt.show()

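One way to build intuition for `df`: a chi-squared variable with `df` degrees of freedom is the sum of `df` squared independent standard-normal variables, with mean `df` and variance `2 * df`. The following sketch compares such a hand-built sample with `chisquare` draws; the seeded generator and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumption: seeded for reproducibility

df = 3
n_samples = 100000

# Each row holds df standard-normal draws; square them and sum across the row
standard_normals = rng.normal(loc=0, scale=1, size=(n_samples, df))
sum_of_squares = (standard_normals ** 2).sum(axis=1)

# Direct chi-squared draws for comparison
chi_squared_sample = rng.chisquare(df=df, size=n_samples)

print(f'Theory:          mean = {df}, variance = {2 * df}')
print(f'Sum of squares:  mean = {sum_of_squares.mean():.3f}, variance = {sum_of_squares.var():.3f}')
print(f'chisquare draws: mean = {chi_squared_sample.mean():.3f}, variance = {chi_squared_sample.var():.3f}')
```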
---

## Additional Statistical tests

### Wilcoxon Signed-Rank Test

The paired t-test is a staple for determining statistical significance. However, it requires that the data follow a normal / Gaussian distribution and that there are enough data samples.

When only a small number of data samples is available, a Wilcoxon signed-rank test is a strong alternative for determining differences between sample *distributions*. Here the assumption is that the samples are paired, so both must contain the same number of data points.

Note: Wilcoxon tests *distribution* differences, not *mean* differences.

Example 1: Two completely different distributions but close in proximity

Re-run the cell multiple times and see how often the p value falls below 0.05 (significant).

In [None]:
# Example 1:
# Group A is following a normal distribution
# Group B is following a Chi-Squared Distribution

group_A = np.random.normal(loc = 5, scale = 2, size = 10)
group_B = np.random.chisquare(df = 2, size = 10)

# Running the Wilcoxon Signed-Rank test and gathering the specific outputs
print(f'Output of Wilcoxon Signed-Rank test:')
test_statistic, p_value = stats.wilcoxon(group_A, group_B)
print(f'The test statistic is: {test_statistic}')
print(f'The p value is: {p_value}\n')

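Instead of re-running the cell by hand, the repetition can be automated. This is a minimal sketch that repeats Example 1 many times and reports the fraction of runs with p < 0.05; the number of repetitions and the seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # assumption: seeded for reproducibility
n_repeats = 1000                     # arbitrary number of repetitions

p_values = np.empty(n_repeats)
for i in range(n_repeats):
    # Same setup as Example 1: normal vs chi-squared, 10 values each
    group_A = rng.normal(loc=5, scale=2, size=10)
    group_B = rng.chisquare(df=2, size=10)
    p_values[i] = stats.wilcoxon(group_A, group_B).pvalue

fraction_significant = np.mean(p_values < 0.05)
print(f'Fraction of runs with p < 0.05: {fraction_significant:.3f}')
```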
Example 2: Two very similar Gaussian distributions

Re-run the cell multiple times and see how often the p value falls below 0.05 (significant). Note how p-values differ between the Wilcoxon signed-rank test and the equivalent paired t-test.

In [None]:
# Example 2:
# Group A is following a normal distribution
# Group B is following a similar normal distribution

group_A = np.random.normal(loc = 5, scale = 2, size = 5)
group_B = np.random.normal(loc = 2, scale = 3, size = 5)

# Running the Wilcoxon Signed-Rank test and gathering the specific outputs
print(f'Output of the Wilcoxon Signed-Rank test:')
test_statistic, p_value = stats.wilcoxon(group_A, group_B)
print(f'The test statistic is: {test_statistic}')
print(f'The p value is: {p_value}\n')


# Running an equivalent paired t-test to see how the p-values differ.
print(f'Output of an equivalent t-test:')
test_statistic, p_value = stats.ttest_rel(group_A, group_B)
print(f'The test statistic is: {test_statistic}')
print(f'The p value is: {p_value}')

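One thing worth knowing when comparing these p-values: with very few pairs, the exact Wilcoxon signed-rank test has a floor on its p-value. With n pairs there are 2^n equally likely sign patterns under the null hypothesis, so the smallest two-sided p-value is 2 / 2^n, which is 0.0625 for n = 5. The sketch below checks this on made-up differences that are all positive and distinct, assuming a recent SciPy version in which `stats.wilcoxon` uses the exact null distribution by default for small samples without ties or zero differences.

```python
from scipy import stats

for n in (5, 6, 10):
    # Most extreme case: every paired difference is positive and distinct
    differences = [0.5 + i for i in range(n)]

    # Passing a single sample treats it as the set of paired differences
    p_value = stats.wilcoxon(differences).pvalue
    smallest_possible = 2 / 2 ** n

    print(f'n = {n}: p = {p_value:.5f}, smallest possible = {smallest_possible:.5f}')
```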

---

### Mann-Whitney U-Test

A Wilcoxon signed-rank test gets the job done when you are experimenting with a small-n cohort. However, the test still assumes you have an equal number of data points in both cohorts. When that *isn't* the case, you can use a Mann-Whitney U-Test.

A Mann-Whitney U-Test does a very similar thing to the Wilcoxon signed-rank test. However, it treats the two samples as independent, so it can handle differences in sample size.

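To make the U statistic less of a black box, here is a rough sketch that computes it by hand: U for the first group is the number of (a, b) pairs in which the value from group A is larger than the value from group B, with ties counting one half. This assumes SciPy 1.7 or newer, where `stats.mannwhitneyu` reports the U statistic of the first sample; the seed and variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # assumption: seeded for reproducibility

group_A = rng.normal(loc=5, scale=2, size=10)
group_B = rng.chisquare(df=3, size=7)

# Compare every value in group A against every value in group B (10 x 7 pairs)
greater = (group_A[:, None] > group_B[None, :]).sum()
ties = (group_A[:, None] == group_B[None, :]).sum()
u_manual = greater + 0.5 * ties

result = stats.mannwhitneyu(group_A, group_B)

print(f'U computed by hand:    {u_manual}')
print(f'U from mannwhitneyu:   {result.statistic}')
print(f'p-value:               {result.pvalue}')
```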
In [None]:
# Group A is following a normal distribution with n = 10
# Group B is following a Chi-Squared Distribution with n = 7

group_A = np.random.normal(loc = 5, scale = 2, size = 10)
group_B = np.random.chisquare(df = 3, size = 7)

# Running the Mann-Whitney U-Test and gathering the specific outputs
print(f'Output of the Mann-Whitney U-Test:')
test_statistic, p_value = stats.mannwhitneyu(group_A, group_B)
print(f'The test statistic is: {test_statistic}')
print(f'The p value is: {p_value}\n')

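print(f'Output of an equivalent Wilcoxon Signed-Rank test:')
try:  # unequal sample sizes (10 vs 7) are expected to raise an error here
    test_statistic, p_value = stats.wilcoxon(group_A, group_B)
    print(f'The test statistic is: {test_statistic}')
    print(f'The p value is: {p_value}')
except ValueError as error:
    print(f'The Wilcoxon test could not be run: {error}')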

Notice that the Mann-Whitney U-test can handle samples of unequal size, whereas the Wilcoxon signed-rank test requires paired samples of equal size.