## Module 1.1: Distributions and Random Processes

### 1.1.4 Normality Tests

When an analysis assumes that the data is normally distributed, test that assumption first. Properties of normal distributions do not necessarily apply to data that has a different underlying distribution. For example, an **`ANOVA test`** assumes normality in the data.


There are a number of normality tests that provide a way to determine if it is likely the data comes from a normal distribution.

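One informal check for normality is to compute the skew and kurtosis of the data. A **normal distribution** has a **`skew of 0`** and a **`kurtosis of 3`**. Note that `scipy.stats.kurtosis` reports *excess* kurtosis by default (kurtosis minus 3), so on that scale a normal distribution scores 0.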
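
The two kurtosis conventions are easy to mix up, so here is a minimal sketch that compares them on synthetic data. The sample, seed, and variable names are assumptions made for illustration and are not part of the AAPL dataset used below.

```python
import numpy as np
from scipy import stats

# Synthetic normal sample (illustrative only; seed and parameters are arbitrary)
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=100_000)

skew = stats.skew(sample)
# Default is the Fisher definition: excess kurtosis, where a normal distribution gives about 0
excess_kurtosis = stats.kurtosis(sample)
# Pearson definition: a normal distribution gives about 3
pearson_kurtosis = stats.kurtosis(sample, fisher=False)

print(skew, excess_kurtosis, pearson_kurtosis)
```

The location and scale of the sample do not matter here: skew and kurtosis are unchanged by shifting or rescaling the data.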

In [1]:
import pandas as pd
AAPL_df_2020 = pd.read_csv('/Users/brendan/Desktop/Python/June 2022/AAPL_2020.csv')
AAPL_df_2020['Gain'] = AAPL_df_2020['close'].diff()
AAPL_df_2020.dropna(inplace=True)

In [2]:
from scipy import stats
stats.skew(AAPL_df_2020['Gain'])

0.23824193868443833

In [3]:
stats.kurtosis(AAPL_df_2020['Gain'])

2.333720024369075

Based on these values, the AAPL price changes do not appear to be normally distributed. Let's have a look at the histogram again.

In [4]:
import altair as alt

alt.Chart(AAPL_df_2020).mark_bar().encode(
    alt.X("Gain", bin=alt.Bin(maxbins=100)),
    y='count()')

A high kurtosis is associated with a sharp central peak and heavy tails, as seen here. The skew is not that high, but it is positive, indicating a longer tail on the right.

More objective tests are available in the `scipy.stats` package. For instance, the Shapiro-Wilk test is commonly used and is a good test for small to medium datasets, with up to a few thousand data points.

In [5]:
statistic, p = stats.shapiro(AAPL_df_2020['Gain'])

In [6]:
p  # compare the p-value to the acceptable alpha value

6.582437436009059e-06

In [7]:
if p > 0.05:
    print("The data looks like it was drawn from a normal distribution")
else:
    print("The data does not look like it was drawn from a normal distribution")

The data does not look like it was drawn from a normal distribution


### What is a p-value?

The p-value above is the probability of obtaining a result at least as extreme as the one observed, assuming that the test's default assumption (its null hypothesis, described below) is true.

As it is a probability, it has a value between 0 and 1. Values near 0 indicate that the observed data would be unlikely if the null hypothesis were true, and larger values indicate that the data is consistent with it. Often, we apply a threshold: if the p-value is greater than that threshold, we do not reject the null hypothesis and continue as if it were true; if it is smaller, we reject it.

It is very common to use a threshold of 0.05 when performing a test. That is, we accept a 5% chance of rejecting the null hypothesis when it is actually true. While this is an adequate rule of thumb, it is not a one-size-fits-all solution to the problem of choosing a p-value threshold.

In classical statistics this is normally framed in terms of a `Null` and an `Alternative` hypothesis. The null hypothesis is the 'nothing is surprising' hypothesis and the alternative is 'there is something interesting here'. For the Shapiro-Wilk test above, the hypotheses are:

+ The **`Null`** hypothesis: the data was drawn from a normal distribution
+ The **`Alternative`** hypothesis: the data was not drawn from a normal distribution

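These hypotheses are mutually exclusive: exactly one of them holds. Note that the p-value is not the probability that the Null hypothesis is true, so 1 - p is not the probability that the Alternative is true. The p-value only measures how surprising the data would be if the Null hypothesis held.

To reject the Null hypothesis and conclude that something else is going on, we require the p-value to be less than 0.05; that is, data this extreme would occur less than 5% of the time if the Null hypothesis were true.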

This might seem like a high standard to meet, but humans often see patterns in data that are not there. Statistics are used to test these patterns and to ensure that we are not over-confident in our pattern matching.
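
One way to build intuition for the 0.05 threshold is to simulate many datasets that really are normal and count how often the Shapiro-Wilk test rejects them anyway. The sketch below is illustrative only: the seed, sample size, and number of trials are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability
alpha = 0.05
n_trials = 2000

# Each trial draws a fresh sample that truly is normal, then tests it.
# stats.shapiro returns (statistic, p-value); index 1 is the p-value.
p_values = np.array([
    stats.shapiro(rng.normal(size=50))[1]
    for _ in range(n_trials)
])

# Fraction of trials in which a true null hypothesis was rejected
false_rejection_rate = np.mean(p_values < alpha)
print(f"Rejected a true null in {false_rejection_rate:.1%} of trials")
```

The rejection rate should land close to the chosen threshold, which is what the threshold controls: how often a true null hypothesis is rejected by chance.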

Using a different set of data:

In [8]:
import numpy as np
heights = np.array([
    205.61624376, 155.80577135, 202.09636984, 159.19312848,
    160.0263383 , 147.44200373, 160.96891569, 160.76304892,
    167.59165377, 164.31571823, 151.11269914, 176.43856129,
    176.88435091, 138.04177187, 183.87507305, 162.81488426,
    167.96767641, 144.68437342, 180.88771461, 179.18997091,
    189.81672505, 163.68662119, 175.70135072, 167.32793289,
    163.72509862, 207.93257342, 177.41722601, 167.28154916,
    170.26294662, 187.01142671, 178.3108478 , 168.8711774 ,
    202.77222671, 138.55043572, 187.10284379, 155.13494037,
    175.24219374, 188.54739561, 191.42024196, 174.34537673,
    158.36285104, 183.17014557, 166.36310929, 185.3415384 ,
    163.87673308, 173.70401469, 168.78499868, 167.39762991,
    166.89193943, 191.04035344, 148.02108024, 140.82772936,
    168.85378921, 142.13536543, 189.77084606, 173.7849811 ,
    157.61303804, 171.62493617, 173.30529631, 162.92083214,
    169.52974326, 142.01039665, 176.01691215, 170.32439763,
    172.64616031, 158.35076247, 185.96332979, 176.6176222 ,
    204.68516079, 161.43591954, 172.42384543, 179.36900257,
    170.01353653, 194.40269002, 139.96802012, 156.47281846,
    210.21895193, 153.30508193, 157.10282665, 200.07040619,
    174.69616438, 168.97403285, 188.9396949 , 156.19358617,
    179.56494356, 175.04014032, 164.1384659 , 167.90219562,
    184.80752625, 143.56580744, 169.80537836, 186.5894398 ,
    166.39251657, 165.65510886, 195.49137372, 152.21650272,
    163.14001055, 170.27382512, 147.63901378, 190.32910286])

In [9]:
statistic, p = stats.shapiro(heights)

In [10]:
if p > 0.05:
    print("The data looks like it was drawn from a normal distribution")
    print("p={:.3f}".format(p))
else:
    print("The data does not look like it was drawn from a normal distribution")

The data looks like it was drawn from a normal distribution
p=0.278


### Exercise

Two other commonly used tests for normality are available in `scipy.stats`. They are `stats.normaltest` and `stats.kstest`.

In [12]:
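# D'Agostino-Pearson test, based on skew and kurtosis
statistic_chi, p_c = stats.normaltest(heights)
# Kolmogorov-Smirnov test; with no extra arguments, 'norm' means the
# standard normal (mean 0, standard deviation 1), not a normal fitted to the data
statistic_k, p_k = stats.kstest(heights, cdf='norm')
print(p_c, p_k)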

0.6994130645220737 0.0
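
The p-value of 0.0 from `stats.kstest` does not mean the heights are non-normal. With `cdf='norm'` and no further arguments, the data is compared against the standard normal distribution, which has mean 0 and standard deviation 1. A minimal sketch of the difference, using a synthetic stand-in for the height data (the seed and parameters are assumptions for illustration):

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for height-like data (illustrative only)
rng = np.random.default_rng(1)
sample = rng.normal(loc=170.0, scale=15.0, size=100)

# Compared against N(0, 1): rejected because the location and scale are wrong
_, p_standard = stats.kstest(sample, cdf='norm')

# Compared against a normal with the sample's own mean and standard deviation,
# passed through to the 'norm' cdf as (loc, scale)
_, p_fitted = stats.kstest(
    sample, cdf='norm', args=(sample.mean(), sample.std(ddof=1))
)

print(p_standard, p_fitted)
```

Because the mean and standard deviation are estimated from the same data, the second p-value tends to be too generous. The Lilliefors variant of the test corrects for this.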


## Statsmodels

We will now perform a normality test using the **`statsmodels`** package. This package provides higher-level statistical tools than the `scipy` module.

In [13]:
import statsmodels.api as sm


In [14]:
statistic, p_value = sm.stats.diagnostic.kstest_normal(heights)

In [15]:
if p_value > 0.05:
    print("The data looks like it was drawn from a normal distribution")
    print("p={:.3f}".format(p_value))
else:
    print("The data does not look like it was drawn from a normal distribution")

The data looks like it was drawn from a normal distribution
p=0.395


### Exercise

Review the documentation for `statsmodels` and run the Jarque-Bera test for normality on this data.

In [None]:
from statsmodels.stats import stattools

jbstat, pvalue, skew, kurtosis = stattools.jarque_bera(heights)
print(pvalue)
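
The Jarque-Bera statistic ties back to the skew and kurtosis computed at the start of this section: it is n/6 * (S^2 + K^2/4), where S is the sample skew and K is the excess kurtosis, and it is compared against a chi-squared distribution with 2 degrees of freedom. The sketch below computes it by hand on synthetic data and cross-checks it against `scipy.stats.jarque_bera` rather than the `statsmodels` function; the data and seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(7)
sample = rng.normal(loc=170.0, scale=15.0, size=100)

n = sample.size
s = stats.skew(sample)             # sample skew
k_excess = stats.kurtosis(sample)  # excess kurtosis (Fisher definition)

# Jarque-Bera statistic and its p-value from the chi-squared(2) survival function
jb_manual = n / 6 * (s**2 + k_excess**2 / 4)
p_manual = stats.chi2.sf(jb_manual, df=2)

# Cross-check against SciPy's implementation
jb_scipy, p_scipy = stats.jarque_bera(sample)
print(jb_manual, p_manual)
```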