# Module 1.1: Distributions and Random Processes

## 1.1.3:  Moments

Moments describe the shape of a distribution. A normal distribution is fully described by its first two moments: the mean and the variance. The `help` for `stats.norm` confirms this: it takes only two parameters, `loc` (the mean) and `scale` (the standard deviation, i.e. the square root of the variance).
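As a quick check of how the two parameters map onto the first two moments, the sketch below builds a frozen normal distribution and asks SciPy for its mean and variance. The values `3.0` and `2.0` are arbitrary and chosen only for illustration.

```python
from scipy import stats

# loc is the mean; scale is the standard deviation (not the variance)
dist = stats.norm(loc=3.0, scale=2.0)

# moments='mv' requests the mean and the variance
mean, var = dist.stats(moments='mv')
print(float(mean), float(var))  # 3.0 4.0
```

The variance comes back as 4.0 because `scale` is squared to obtain it.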

In [41]:
from scipy import stats
import altair as alt
import random
import numpy as np
import pandas as pd  # needed by the plotting helpers below

In [42]:
#stats.norm?

The mean is the first moment, also called the location: it specifies where the normal distribution is centered.

In [43]:
def plot_histogram_normal(mean, std_dev, color):
    distribution = stats.norm(mean, std_dev)
    normal_values = pd.DataFrame({'value':distribution.rvs(5000)})
    
    chart = alt.Chart(normal_values).mark_bar().encode(
        alt.X('value', bin=alt.Bin(maxbins=100)),
        y='count()',
        color = alt.value(color))
    return chart

chart_1 = plot_histogram_normal(0,1,'red')
chart_2 = plot_histogram_normal(3,1,'blue')
chart_1 + chart_2

The mean is the expected value of the distribution. All other things being equal, if we choose *n* values at random from the distribution, their average (the sample mean) approaches the mean of the distribution as *n* grows.

In [44]:
actual_mean = 57
std_dev = random.random() * 10
N_TRIALS = 100000

distribution = stats.norm(actual_mean, std_dev)
normal_values = distribution.rvs(N_TRIALS)

In [45]:
np.mean(normal_values)

56.99016696403107

In [46]:
error = np.mean(normal_values) - actual_mean
print(f'The actual mean was {actual_mean:.2f}, while the computed mean was {np.mean(normal_values):.2f}')
print(f'This gives an error of {error:.3f}')

The actual mean was 57.00, while the computed mean was 56.99
This gives an error of -0.010
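The error above depends on how many values are drawn. The following sketch repeats the experiment for several sample sizes. The standard deviation of `4` and the seed are assumptions made for reproducibility, not values from the cells above.

```python
import numpy as np
from scipy import stats

TRUE_MEAN = 57
dist = stats.norm(loc=TRUE_MEAN, scale=4)  # scale=4 is an illustrative choice
rng = np.random.default_rng(0)             # fixed seed so the run is repeatable

errors = {}
for n in (100, 10_000, 1_000_000):
    sample = dist.rvs(n, random_state=rng)
    errors[n] = abs(sample.mean() - TRUE_MEAN)
    print(f"n={n:>9,}  |sample mean - true mean| = {errors[n]:.4f}")
```

The absolute error typically shrinks roughly in proportion to 1/sqrt(n), although any single run can deviate from that trend.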


Note that the mean is not the median. For a normal distribution the two coincide, so their sample values are usually about the same, but the median is not a 'moment'.

In [47]:
round(np.median(normal_values),2)

56.99

The second (central) moment of a normal distribution is the `variance`. It is the expected value of the squared difference between a value and the mean. Its square root, the standard deviation, is the scale factor of the distribution.

In [48]:
V = np.var(normal_values)
V

18.355292972091846
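To make the definition concrete, the sketch below computes the variance by hand as the average squared deviation from the mean and compares it with `np.var`. The small array is made-up data chosen so the arithmetic is easy to follow.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()                        # 5.0
manual_var = np.mean((x - mean) ** 2)  # average squared deviation from the mean
print(manual_var, np.var(x))           # 4.0 4.0
```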

Note that squaring makes the unit squared as well, so the variance is not directly comparable to the original values. We cannot say, for example, that the variance is "about 0.5% of the mean": the statement is meaningless because the units are different. For this reason we use the square root of the variance, the `standard deviation`, which is in the same units as X.
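One way to see the unit problem is to express the same measurements in two units. In the sketch below (the heights are hypothetical data), converting centimetres to metres changes the variance by a factor of 100 squared, while the standard deviation changes by a factor of 100, just like the values themselves.

```python
import numpy as np

heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # hypothetical data
heights_m = heights_cm / 100

# Variance scales with the square of the unit conversion factor...
var_ratio = np.var(heights_cm) / np.var(heights_m)
# ...while the standard deviation scales linearly, like the data.
std_ratio = np.std(heights_cm) / np.std(heights_m)

print(round(var_ratio), round(std_ratio))  # 10000 100
```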


In [49]:
chart_3 = plot_histogram_normal(0,1,'green')
chart_4 = plot_histogram_normal(6,2,'orange')
chart_3 + chart_4

A larger standard deviation makes the distribution more spread out, but the shape is the same, simply 'scaled'.

## Further Moments

There are 2 further moments in common use. The third is called the `skew`. It can be visualized as 'pulling' the tail of the distribution to the left (negative skew) or right (positive skew).

A normal distribution is symmetrical and has a skew of 0. This is why skew does not appear in the equation or in the function calls that generate a normal distribution.

The fourth standardized moment is the `kurtosis`, more commonly seen in financial data than in many other datasets. A higher value indicates 'fatter tails' than a normal distribution. The kurtosis of a normal distribution is always 3, which is the baseline when interpreting the kurtosis of other distributions. Note that `stats.kurtosis` reports excess kurtosis by default (`fisher=True`), i.e. kurtosis minus 3, so a normal distribution scores about 0 there.
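The baseline matters when reading SciPy's output. The sketch below draws a normal sample and a fatter-tailed sample and prints the kurtosis under both conventions. The Laplace distribution, the sample size, and the seed are choices made here for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = stats.norm.rvs(size=200_000, random_state=rng)
fat_tailed_sample = stats.laplace.rvs(size=200_000, random_state=rng)

# fisher=False gives the plain (Pearson) kurtosis, where a normal is about 3.
pearson_normal = stats.kurtosis(normal_sample, fisher=False)
# The default fisher=True subtracts 3, so a normal is about 0 ("excess" kurtosis).
excess_normal = stats.kurtosis(normal_sample)
# A fat-tailed distribution has clearly positive excess kurtosis.
excess_laplace = stats.kurtosis(fat_tailed_sample)

print(round(pearson_normal, 2), round(excess_normal, 2), round(excess_laplace, 2))
```

Under the default setting, a clearly positive result indicates fatter tails than a normal distribution.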

In [50]:
def plot_histogram_normal_skewed(mean, std_dev, skew, color):
    distribution = stats.skewnorm(skew, loc=mean, scale = std_dev)
    normal_values = pd.DataFrame({'value': distribution.rvs(5000)})
    
    chart = alt.Chart(normal_values).mark_bar().encode(
        alt.X('value',bin = alt.Bin(maxbins=100)),
        y='count()',
        color=alt.value(color))
    return chart

In [51]:
plot_histogram_normal_skewed(0,1,-4,'blue')
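Note that the first argument of `stats.skewnorm` is a shape parameter, not the skewness itself. As a rough check, the sketch below asks SciPy for the theoretical skewness at a few shape values. The values `-4`, `0` and `4` are illustrative.

```python
from scipy import stats

# `a` is skewnorm's shape parameter; moments='s' requests the resulting skewness.
skews = {a: float(stats.skewnorm(a).stats(moments='s')) for a in (-4, 0, 4)}

for a, s in skews.items():
    print(f"a={a:+d}  skewness={s:.3f}")
```

A negative shape parameter gives a negative skew (longer left tail), zero recovers the symmetric normal distribution, and a positive value mirrors the negative case.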

To see kurtosis in action, we should look at some real data.

In [52]:
import pandas as pd
AAPL_df_2020 = pd.read_csv('/Users/brendan/Desktop/Python/June 2022/AAPL_2020.csv')
AAPL_df_2019 = pd.read_csv('/Users/brendan/Desktop/Python/June 2022/AAPL_2019.csv')
SP500_df_2019 = pd.read_csv('/Users/brendan/Desktop/Python/June 2022/SP500_2019.csv')

In [53]:
AAPL_df_2020.head()

Unnamed: 0,ticker,date,marketcap,ev,evebitda,pe,close
0,AAPL,2020-12-31 00:00:00,2255969.1,2330389.1,28.8,39.3,132.69
1,AAPL,2020-12-30 00:00:00,2273481.0,2347901.0,29.0,39.6,133.72
2,AAPL,2020-12-29 00:00:00,2293033.0,2367453.0,29.2,39.9,134.87
3,AAPL,2020-12-28 00:00:00,2323976.3,2398396.3,29.6,40.5,136.69
4,AAPL,2020-12-24 00:00:00,2243727.8,2318147.8,28.6,39.1,131.97


### Exercises
+ 1. Compute the increase in price each day
+ 2. Plot a histogram of these increases
+ 3. Investigate the `stats.skew` and `stats.kurtosis` functions for the dataset

In [59]:
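# The file is sorted newest-first; reverse it so rows run oldest to newest
AAPL_df_2020 = AAPL_df_2020.iloc[::-1]
# Daily fractional change in the closing price
AAPL_df_2020['% Change'] = AAPL_df_2020['close'].pct_change()
# Drop the first row, whose change is NaN (no previous day)
AAPL_df_2020 = AAPL_df_2020[1:]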


chart = alt.Chart(AAPL_df_2020,title='AAPL 2020').mark_bar().encode(
    alt.X('% Change', bin=alt.Bin(maxbins=100)),
    y='count()',
    color = alt.value('blue'))
chart.display()

skew = round(stats.skew(AAPL_df_2020['% Change']),3)
kurt = round(stats.kurtosis(AAPL_df_2020['% Change']),3)

print('Skew: ' + str(skew))
print('Kurtosis: ' + str(kurt))

Skew: 0.563
Kurtosis: 4.109


## Z Scores

A `z-score` is a common normalization method for data. It removes the scale of the data and instead expresses each value in terms of standard deviations from the mean. It is a transformation of the data from one scale to another, using the mean and standard deviation.

In [61]:
original_data = np.array([10,20,5,105,30,17,19], dtype=np.float32)
m = np.mean(original_data)
s = np.std(original_data)

The transformation to a z-score subtracts the mean and divides by the standard deviation.

In [63]:
zscores = (original_data - m)/s
zscores

array([-0.612737  , -0.29735765, -0.77042663,  2.3833666 ,  0.01802167,
       -0.39197144, -0.3288956 ], dtype=float32)
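A useful sanity check is that z-scores always have a mean of 0 and a standard deviation of 1. The sketch below verifies this on the same seven values (stored as `float64` here to limit rounding noise) and compares the manual calculation with `stats.zscore`.

```python
import numpy as np
from scipy import stats

data = np.array([10, 20, 5, 105, 30, 17, 19], dtype=np.float64)

# Manual z-score: subtract the mean, divide by the (population) std deviation
manual = (data - data.mean()) / data.std()

# stats.zscore uses the same population std deviation (ddof=0) by default
matches_scipy = np.allclose(manual, stats.zscore(data))
print(matches_scipy)                                               # True
print(np.isclose(manual.mean(), 0), np.isclose(manual.std(), 1))   # True True
```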

The z-scores are normalized, which allows us to compare data from different scales, e.g. the stock prices of AAPL and GOOGL, where direct comparisons are initially hard.

In [71]:
# extract GOOGL from the S&P 500 file
GOOGL_2019 = SP500_df_2019[SP500_df_2019['ticker']=='GOOGL']
GOOGL_2019 = GOOGL_2019.iloc[::-1]
GOOGL_2019 = GOOGL_2019.sort_values(by=['date'])
GOOGL_2019['% Change'] = GOOGL_2019['close'].pct_change()
GOOGL_2019 = GOOGL_2019[1:]
GOOGL_2019.head()

Unnamed: 0,ticker,date,marketcap,ev,evebitda,pe,close,% Change
44648,GOOGL,2019-01-03 00:00:00,713179.5,703722.5,17.0,38.0,1025.47,-0.027696
122710,GOOGL,2019-01-04 00:00:00,749761.0,740304.0,17.9,39.9,1078.07,0.051294
43696,GOOGL,2019-01-07 00:00:00,748265.8,738808.8,17.8,39.9,1075.92,-0.001994
64381,GOOGL,2019-01-08 00:00:00,754837.9,745380.9,18.0,40.2,1085.37,0.008783
74716,GOOGL,2019-01-09 00:00:00,752250.8,742793.8,17.9,40.1,1081.65,-0.003427


If we compare the means, we see that GOOGL has a much higher average close.

In [73]:
AAPL_df_2020['close'].mean(), GOOGL_2019['close'].mean()

(95.15378137651832, 1191.7608764940237)

However, we might be more interested in whether price movements swing wildly or are stable.

In [79]:
alt.Chart(AAPL_df_2020, title='AAPL 2020').mark_bar(opacity=0.4).encode(
    x=alt.X('close', bin=alt.Bin(maxbins=30)),
    y=alt.Y('count()', stack=None)) 

In [80]:
alt.Chart(GOOGL_2019, title='GOOGL 2019').mark_bar(opacity=0.4).encode(
    x=alt.X('close', bin=alt.Bin(maxbins=30)),
    y=alt.Y('count()', stack=None))

To truly compare these distributions, we need to convert them to z-scores.

In [88]:
GOOGL_2019_zscore = GOOGL_2019.pivot(columns='ticker', index='date', values='close')
z_score_GOOGL = (GOOGL_2019_zscore - GOOGL_2019_zscore.mean())/GOOGL_2019_zscore.std()

AAPL_2020_zscore = AAPL_df_2020.pivot(columns='ticker', index='date', values='close')
z_score_AAPL = (AAPL_2020_zscore - AAPL_2020_zscore.mean())/AAPL_2020_zscore.std()
len(z_score_GOOGL), len(z_score_AAPL)

(251, 247)
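One detail worth knowing: pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), while `np.std`, used for the array example earlier, defaults to the population version (`ddof=0`). The sketch below shows the difference on a small made-up series. With roughly 250 rows the gap is tiny, but it is larger on short series.

```python
import numpy as np
import pandas as pd

closes = pd.Series([10.0, 20.0, 5.0, 105.0, 30.0, 17.0, 19.0])  # made-up values

pandas_std = closes.std()               # sample std dev, ddof=1 (pandas default)
numpy_std = np.std(closes.to_numpy())   # population std dev, ddof=0 (NumPy default)
pandas_pop_std = closes.std(ddof=0)     # ask pandas for the population version

print(pandas_std, numpy_std, pandas_pop_std)
```

Either convention works for z-scores, as long as the same one is applied to every series being compared.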

In [89]:
alt.Chart(z_score_AAPL.melt(value_name='z_score_close')).mark_bar(opacity=0.4).encode(
    x=alt.X('z_score_close', bin=alt.Bin(maxbins=30)),
    y=alt.Y('count()', stack=None))

In [90]:
alt.Chart(z_score_GOOGL.melt(value_name='z_score_close')).mark_bar(opacity=0.4).encode(
    x=alt.X('z_score_close', bin=alt.Bin(maxbins=30)),
    y=alt.Y('count()', stack=None))

We can now compare the distributions, visually and directly against each other. This specific analysis doesn't tell us much, but we can use z-scores to compare distributions of data from different scales, as we saw above.