# Solutions
## Thought exercises
1. Explore the JupyterLab interface and look at some of the shortcuts available. Don't worry about memorizing them now (eventually they will become second nature and save you a lot of time), just get comfortable using notebooks.
2. Is all data normally distributed? 
> No. Even data that might appear to be normally distributed could belong to a different distribution. There are tests to check for normality. You can read more [here](https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/).
3. When would it make more sense to use the median instead of the mean for the measure of center? 
> When your data has outliers or is heavily skewed, the median is usually the better measure of center because, unlike the mean, it isn't pulled toward extreme values.

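To make the effect of outliers concrete, here is a small sketch with made-up numbers (the `incomes` list is purely illustrative and not part of the exercise data):

```python
from statistics import mean, median

# Hypothetical incomes; the value appended below is an extreme outlier
incomes = [40_000, 45_000, 50_000, 55_000, 60_000]
with_outlier = incomes + [5_000_000]

# The mean is dragged toward the outlier, while the median barely moves
print(mean(incomes), median(incomes))
print(mean(with_outlier), median(with_outlier))
```
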
## Coding exercises
If you need a Python refresher, work through the [`python_101.ipynb`](../../lab_01/python_101.ipynb) notebook in lab 1.

### Exercise 4: Generate the data

In [None]:
import random

random.seed(0)
salaries = [round(random.random()*1000000, -3) for _ in range(100)]

### Exercise 5: Calculating statistics and verifying
#### mean

In [None]:
from statistics import mean

sum(salaries) / len(salaries) == mean(salaries)

#### median

First, we define a function to calculate the median:

In [None]:
import math

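def find_median(x):
    x = sorted(x)  # sort a copy so the caller's list is left untouched
    midpoint = (len(x) + 1) / 2 - 1  # subtract 1 because indexing starts at 0
    if len(x) % 2:
        # x has an odd number of values, so the midpoint is a whole index
        return x[int(midpoint)]
    else:
        # x has an even number of values, so average the two middle values
        return (x[math.floor(midpoint)] + x[math.ceil(midpoint)]) / 2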

Then, we check its output matches the expected output:

In [None]:
from statistics import median

find_median(salaries) == median(salaries)

#### mode

In [None]:
from statistics import mode
from collections import Counter

Counter(salaries).most_common(1)[0][0] == mode(salaries)

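When several values tie for the highest count, both `Counter.most_common(1)` and `mode()` (on Python 3.8+) report only the first one encountered. As a small aside, `statistics.multimode()` returns every mode. The `values` list below is made up for illustration:

```python
from collections import Counter
from statistics import mode, multimode

# Illustrative data with a two-way tie for the most common value
values = [3, 1, 3, 2, 1]

print(mode(values))                          # first mode encountered
print(Counter(values).most_common(1)[0][0])  # also the first one encountered
print(multimode(values))                     # every value tied for the top count
```
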
#### sample variance
Remember to use Bessel's correction.

In [None]:
from statistics import variance

sum([(x - sum(salaries) / len(salaries))**2 for x in salaries]) / (len(salaries) - 1) == variance(salaries)

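As a reminder of what Bessel's correction changes, the sketch below compares the population variance (dividing by $n$) with the sample variance (dividing by $n - 1$) using the standard library. The `data` list is a made-up example:

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only
n = len(data)

population_var = pvariance(data)  # divides the squared deviations by n
sample_var = variance(data)       # divides by n - 1 (Bessel's correction)

# The two differ by a factor of n / (n - 1)
print(population_var, sample_var, population_var * n / (n - 1))
```
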
#### sample standard deviation
Remember to use Bessel's correction.

In [None]:
from statistics import stdev
import math

math.sqrt(sum([(x - sum(salaries) / len(salaries))**2 for x in salaries]) / (len(salaries) - 1)) == stdev(salaries)

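The checks above compare floats with `==`, which works here but can fail when two mathematically equal computations accumulate rounding differently. A more forgiving pattern, sketched with made-up data, is `math.isclose()`:

```python
import math
from statistics import stdev

# Classic example of why exact float equality is fragile
print(0.1 + 0.2 == 0.3)

data = [0.1, 0.2, 0.3, 0.4, 0.7]  # illustrative values only

# Sample standard deviation computed by hand (with Bessel's correction)
avg = sum(data) / len(data)
manual_stdev = math.sqrt(sum((x - avg) ** 2 for x in data) / (len(data) - 1))

# Compare within a relative tolerance instead of requiring bit-for-bit equality
print(math.isclose(manual_stdev, stdev(data)))
```
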
### Exercise 6: Calculating more statistics
#### range

In [None]:
max(salaries) - min(salaries)

#### coefficient of variation

In [None]:
from statistics import mean, stdev

stdev(salaries) / mean(salaries)

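Because the coefficient of variation divides the standard deviation by the mean, it is unitless. A quick sketch with hypothetical numbers shows that expressing the same data in different units leaves it unchanged:

```python
import math
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

in_thousands = [40, 55, 60, 75, 120]           # made-up salaries, in $1,000s
in_dollars = [x * 1000 for x in in_thousands]  # the same salaries, in dollars

print(math.isclose(coefficient_of_variation(in_thousands),
                   coefficient_of_variation(in_dollars)))
```
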
#### interquartile range
First, we define a function to calculate a quantile:

In [None]:
import math

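def quantile(x, pct):
    x = sorted(x)  # sort a copy so the caller's list is left untouched
    index = (len(x) + 1) * pct - 1  # subtract 1 because indexing starts at 0
    if index % 1 == 0:
        # index lands exactly on a value, so grab it
        return x[int(index)]
    else:
        # index falls between two values, so average them
        return (x[math.floor(index)] + x[math.ceil(index)]) / 2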

Then, we check that it calculates the 1<sup>st</sup> quartile correctly:

In [None]:
sum([x < quantile(salaries, 0.25) for x in salaries]) / len(salaries) == 0.25

and the 3<sup>rd</sup> quartile:

In [None]:
sum([x < quantile(salaries, 0.75) for x in salaries]) / len(salaries) == 0.75

Finally, we can calculate the IQR:

In [None]:
q3, q1 = quantile(salaries, 0.75), quantile(salaries, 0.25)
iqr = q3 - q1
iqr

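On Python 3.8+, `statistics.quantiles()` offers a cross-check. Note that it interpolates linearly between neighboring values rather than averaging them, so on the salary data it may not agree exactly with the `quantile()` function above. A sketch on made-up data, with variable names chosen so they don't overwrite `q1` and `q3`:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7]  # illustrative values only

# n=4 splits the data into quartiles and returns the three cut points;
# the default method is 'exclusive', which uses positions based on len(data) + 1
lower, middle, upper = quantiles(data, n=4)
print(lower, middle, upper, upper - lower)
```
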
#### quartile coefficient of dispersion

In [None]:
iqr / (q1 + q3)

### Exercise 7: Scaling data
#### min-max scaling

In [None]:
min_salary, max_salary = min(salaries), max(salaries)
salary_range = max_salary - min_salary

min_max_scaled = [(x - min_salary) / salary_range for x in salaries]
min_max_scaled[:5]

#### standardizing

In [None]:
from statistics import mean, stdev

mean_salary, std_salary = mean(salaries), stdev(salaries)

standardized = [(x - mean_salary) / std_salary for x in salaries]
standardized[:5]

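A quick way to sanity-check both transformations: min-max scaled values should span exactly 0 to 1, and standardized values should have a mean of about 0 and a sample standard deviation of about 1. The helper names and `demo` data below are hypothetical, chosen so they don't overwrite the variables used in this notebook:

```python
import math
from statistics import mean, stdev

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(x - low) / (high - low) for x in values]

def standardize(values):
    """Convert values to Z-scores using the sample standard deviation."""
    center, spread = mean(values), stdev(values)
    return [(x - center) / spread for x in values]

demo = [12.0, 15.0, 20.0, 22.0, 31.0]  # made-up data
demo_scaled = min_max_scale(demo)
demo_z = standardize(demo)

print(min(demo_scaled), max(demo_scaled))
print(math.isclose(mean(demo_z), 0, abs_tol=1e-9), math.isclose(stdev(demo_z), 1))
```
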
### Exercise 8: Calculating covariance and correlation
#### covariance
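We haven't covered NumPy yet, so this cell is only here to check our solution (0.26). `np.cov()` returns the full covariance matrix and divides by $n - 1$ by default, so compare against the off-diagonal entries and expect small floating-point differences from our calculation: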

In [None]:
import numpy as np
np.cov(min_max_scaled, standardized)

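Our method divides by $n - 1$, matching NumPy's default, so aside from floating-point error it gives the same answer:

In [None]:
from statistics import mean

# compute each mean once instead of once per data point
mean_x, mean_y = mean(min_max_scaled), mean(standardized)

running_total = [
    (x - mean_x) * (y - mean_y)
    for x, y in zip(min_max_scaled, standardized)
]

# divide by n - 1 (Bessel's correction) to get the sample covariance
cov = sum(running_total) / (len(running_total) - 1)
cov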

#### Pearson correlation coefficient ($\rho$)

In [None]:
from statistics import stdev
cov / (stdev(min_max_scaled) * stdev(standardized))