# Solutions
## Thought exercises
1. Explore the JupyterLab interface and look at some of the shortcuts available. Don't worry about memorizing them now (eventually they will become second nature and save you a lot of time), just get comfortable using notebooks.
2. Is all data normally distributed? 
> No. Even data that appears to be normally distributed could belong to a different distribution. There are tests to check for normality, but they are beyond the scope of this book. You can read more [here](https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/).
3. When would it make more sense to use the median instead of the mean for the measure of center? 
> When your data has outliers, the median is often the better measure of center: extreme values pull the mean toward them, but they barely move the median.
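
To make the effect of outliers concrete, here is a minimal sketch using made-up salary values (they are illustrative assumptions, not the data generated in Exercise 4). A single extreme value shifts the mean substantially, while the median barely moves:

```python
from statistics import mean, median

# Hypothetical values chosen for illustration only
typical = [40_000, 45_000, 50_000, 55_000, 60_000]
with_outlier = typical + [1_000_000]  # add one extreme salary

# Without the outlier, the mean and the median agree
print(mean(typical), median(typical))

# With the outlier, the mean jumps but the median stays near the bulk of the data
print(mean(with_outlier), median(with_outlier))
```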

## Coding exercises
If you need a Python refresher, work through the [`python_101.ipynb`](../../ch_01/python_101.ipynb) notebook in chapter 1.

### Exercise 4: Generate the data

In [1]:
import random

random.seed(0)
salaries = [round(random.random()*1000000, -3) for _ in range(100)]

In [2]:
salaries

[844000.0,
 758000.0,
 421000.0,
 259000.0,
 511000.0,
 405000.0,
 784000.0,
 303000.0,
 477000.0,
 583000.0,
 908000.0,
 505000.0,
 282000.0,
 756000.0,
 618000.0,
 251000.0,
 910000.0,
 983000.0,
 810000.0,
 902000.0,
 310000.0,
 730000.0,
 899000.0,
 684000.0,
 472000.0,
 101000.0,
 434000.0,
 611000.0,
 913000.0,
 967000.0,
 477000.0,
 865000.0,
 260000.0,
 805000.0,
 549000.0,
 14000.0,
 720000.0,
 399000.0,
 825000.0,
 668000.0,
 1000.0,
 494000.0,
 868000.0,
 244000.0,
 325000.0,
 870000.0,
 191000.0,
 568000.0,
 239000.0,
 968000.0,
 803000.0,
 448000.0,
 80000.0,
 320000.0,
 508000.0,
 933000.0,
 109000.0,
 551000.0,
 707000.0,
 547000.0,
 814000.0,
 540000.0,
 964000.0,
 603000.0,
 588000.0,
 445000.0,
 596000.0,
 385000.0,
 576000.0,
 290000.0,
 189000.0,
 187000.0,
 613000.0,
 657000.0,
 477000.0,
 90000.0,
 758000.0,
 877000.0,
 923000.0,
 842000.0,
 898000.0,
 923000.0,
 541000.0,
 391000.0,
 705000.0,
 276000.0,
 812000.0,
 849000.0,
 895000.0,
 590000.0,
 950000.0,
 580

### Exercise 5: Calculating statistics and verifying
#### mean

In [3]:
from statistics import mean

sum(salaries) / len(salaries) == mean(salaries)

True

#### median

First, we define a function to calculate the median:

In [4]:
import math

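def find_median(x):
    x.sort()  # note: this sorts the caller's list in place
    midpoint = (len(x) + 1) / 2 - 1  # subtract 1 because indexing starts at 0
    if len(x) % 2:
        # x has an odd number of values, so midpoint is a whole number
        return x[int(midpoint)]
    else:
        # x has an even number of values, so average the two middle values
        return (x[math.floor(midpoint)] + x[math.ceil(midpoint)]) / 2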

Then, we check its output matches the expected output:

In [5]:
from statistics import median

find_median(salaries) == median(salaries)

True
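
Because `find_median()` calls `x.sort()`, it reorders `salaries` in place, which is why outputs later in this notebook appear in ascending order. If that side effect is unwanted, one option is to work on a sorted copy instead. The sketch below assumes that preference; `find_median_sorted` is a hypothetical name and not part of the original solution:

```python
from statistics import median


def find_median_sorted(x):
    """Return the median of x without modifying x (hypothetical helper)."""
    ordered = sorted(x)  # sorted() returns a new list and leaves x alone
    n = len(ordered)
    middle = n // 2
    if n % 2:
        # odd number of values: the middle one is the median
        return ordered[middle]
    # even number of values: average the two middle ones
    return (ordered[middle - 1] + ordered[middle]) / 2


values = [3, 1, 4, 2]  # illustrative data, not the salaries
print(find_median_sorted(values) == median(values))
print(values)  # the original order is preserved
```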

#### mode

In [6]:
from statistics import mode
from collections import Counter

Counter(salaries).most_common(1)[0][0] == mode(salaries)

True

In [21]:
Counter(salaries).most_common(1)[0][0]

477000.0

In [27]:
salaries.count(477000.0)

3

In [29]:
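# manual check: find the most frequent value and how often it occurs
# (named mode_value so that it doesn't shadow the imported statistics.mode)
mode_value, mode_count = 0, 0
for v in set(salaries):
    count = salaries.count(v)
    if count > mode_count:
        mode_value, mode_count = v, count

mode_value, mode_count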

(477000.0, 3)

#### sample variance
Remember to use Bessel's correction.

In [7]:
from statistics import variance

sum([(x - sum(salaries) / len(salaries))**2 for x in salaries]) / (len(salaries) - 1) == variance(salaries)

True

#### sample standard deviation
Remember to use Bessel's correction.

In [8]:
from statistics import stdev
import math

math.sqrt(sum([(x - sum(salaries) / len(salaries))**2 for x in salaries]) / (len(salaries) - 1)) == stdev(salaries)

True
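
As a reminder of what Bessel's correction changes, the sketch below computes both versions of the variance on a small made-up dataset (an assumption for illustration, not the salary data). The only difference is the denominator: $n$ for the population variance and $n - 1$ for the sample variance.

```python
import math
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small illustrative sample
n = len(data)
mean_value = sum(data) / n
squared_deviations = sum((x - mean_value) ** 2 for x in data)

population_var = squared_deviations / n   # no correction: divide by n
sample_var = squared_deviations / (n - 1) # Bessel's correction: divide by n - 1

# Both agree with the standard library's implementations
print(math.isclose(population_var, pvariance(data)))
print(math.isclose(sample_var, variance(data)))
```

The sample variance is always the larger of the two, by a factor of $n / (n - 1)$.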

### Exercise 6: Calculating more statistics
#### range

In [31]:
max(salaries) - min(salaries)

995000.0

#### coefficient of variation

In [32]:
from statistics import mean, stdev

stdev(salaries) / mean(salaries)

0.45386998894439035

#### interquartile range
First, we define a function to calculate a quantile:

In [34]:
import math

def quantile(x, pct):
    x.sort()
    index = (len(x) + 1) * pct - 1
    if len(x) % 2:
        # odd, so grab the value at index
        return x[int(index)]
    else:
        return (x[math.floor(index)] + x[math.ceil(index)]) / 2

Then, we check that it calculates the 1<sup>st</sup> quartile correctly:

In [35]:
sum([x < quantile(salaries, 0.25) for x in salaries]) / len(salaries) == 0.25

True

In [37]:
[x < quantile(salaries, 0.25) for x in salaries]

[True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False,
 False]

and the 3<sup>rd</sup> quartile:

In [12]:
sum([x < quantile(salaries, 0.75) for x in salaries]) / len(salaries) == 0.75

True

Finally, we can calculate the IQR:

In [38]:
q3, q1 = quantile(salaries, 0.75), quantile(salaries, 0.25)
iqr = q3 - q1
iqr

417500.0

In [39]:
q3 + 1.5 * iqr

1445750.0

In [40]:
q1 - 1.5 * iqr

-224250.0
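
The two values above, $Q_3 + 1.5 \times IQR$ and $Q_1 - 1.5 \times IQR$, are the bounds commonly used to flag potential outliers (often called Tukey fences). For the salary data, both bounds fall outside the range of generated values, so nothing would be flagged. The sketch below shows the idea on a made-up dataset with one obvious outlier. It uses `statistics.quantiles()` (Python 3.8+) rather than the `quantile()` function defined above, so the quartiles may be calculated slightly differently:

```python
from statistics import quantiles

# Hypothetical data for illustration: one value is far from the rest
data = [10, 12, 13, 15, 16, 18, 19, 21, 95]

q1, _, q3 = quantiles(data, n=4)  # the three quartile cut points
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Anything outside the bounds is flagged as a potential outlier
outliers = [x for x in data if x < lower_bound or x > upper_bound]
print(lower_bound, upper_bound)
print(outliers)
```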

#### quartile coefficient of dispersion

In [41]:
iqr / (q1 + q3)

0.3417928776094965

### Exercise 7: Scaling data
#### min-max scaling

In [42]:
min_salary, max_salary = min(salaries), max(salaries)
salary_range = max_salary - min_salary

min_max_scaled = [(x - min_salary) / salary_range for x in salaries]
min_max_scaled

[0.0,
 0.01306532663316583,
 0.07939698492462312,
 0.0814070351758794,
 0.08944723618090453,
 0.10050251256281408,
 0.10854271356783919,
 0.18693467336683417,
 0.18894472361809045,
 0.19095477386934673,
 0.23919597989949748,
 0.2442211055276382,
 0.25125628140703515,
 0.2592964824120603,
 0.26030150753768844,
 0.27638190954773867,
 0.28241206030150756,
 0.2904522613065327,
 0.3035175879396985,
 0.31055276381909547,
 0.32060301507537686,
 0.3256281407035176,
 0.385929648241206,
 0.39195979899497485,
 0.4,
 0.40603015075376886,
 0.4221105527638191,
 0.43517587939698493,
 0.4462311557788945,
 0.4492462311557789,
 0.45226130653266333,
 0.4733668341708543,
 0.47839195979899496,
 0.47839195979899496,
 0.47839195979899496,
 0.48743718592964824,
 0.49547738693467336,
 0.5065326633165829,
 0.5095477386934674,
 0.5125628140703518,
 0.5417085427135678,
 0.542713567839196,
 0.5487437185929648,
 0.5507537688442211,
 0.5527638190954773,
 0.5698492462311557,
 0.5778894472361809,
 0.5819095477386935,


#### standardizing

In [43]:
from statistics import mean, stdev

mean_salary, std_salary = mean(salaries), stdev(salaries)

standardized = [(x - mean_salary) / std_salary for x in salaries]
standardized

[-2.199512275430514,
 -2.150608309943509,
 -1.9023266390094862,
 -1.8948029520114855,
 -1.8647082040194827,
 -1.8233279255304788,
 -1.7932331775384762,
 -1.4998093846164489,
 -1.4922856976184482,
 -1.4847620106204475,
 -1.304193522668431,
 -1.285384305173429,
 -1.2590514006804265,
 -1.228956652688424,
 -1.2251948091894236,
 -1.165005313205418,
 -1.142434252211416,
 -1.112339504219413,
 -1.0634355387324086,
 -1.037102634239406,
 -0.9994841992494026,
 -0.9806749817544008,
 -0.7549643718143799,
 -0.7323933108203778,
 -0.7022985628283751,
 -0.6797275018343729,
 -0.6195380058503674,
 -0.5706340403633628,
 -0.529253761874359,
 -0.517968231377358,
 -0.5066827008803569,
 -0.4276839874013496,
 -0.40887476990634786,
 -0.40887476990634786,
 -0.40887476990634786,
 -0.37501817841534474,
 -0.34492343042334195,
 -0.3035431519343381,
 -0.2922576214373371,
 -0.28097209094033604,
 -0.17187862946932592,
 -0.16811678597032556,
 -0.14554572497632348,
 -0.1380220379783228,
 -0.1304983509803221,
 -0.06654701
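
A quick way to sanity-check both scalings is to confirm their defining properties: min-max scaled values should span exactly 0 to 1, and standardized values should have a mean of 0 and a standard deviation of 1. The following is a minimal sketch on a small made-up list (an assumption for illustration, not the salary data):

```python
import math
from statistics import mean, stdev

values = [2.0, 4.0, 6.0, 8.0, 10.0]  # illustrative data

# min-max scaling: subtract the minimum, then divide by the range
low, high = min(values), max(values)
min_max = [(x - low) / (high - low) for x in values]

# standardizing: subtract the mean, then divide by the standard deviation
center, spread = mean(values), stdev(values)
z_scores = [(x - center) / spread for x in values]

print(min(min_max), max(min_max))
print(math.isclose(mean(z_scores), 0, abs_tol=1e-9))
print(math.isclose(stdev(z_scores), 1))
```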

### Exercise 8: Calculating covariance and correlation
#### covariance
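We haven't covered NumPy yet, so this cell is only here to check our solution (0.26); expect our calculation to differ slightly. It relies on `min_max_scaled` and `standardized` from Exercise 7, so it raises a `NameError` when run in a fresh kernel, as happened here: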

In [1]:
import numpy as np
np.cov(min_max_scaled, standardized)

NameError: name 'min_max_scaled' is not defined

Our method divides by $n$ (through `mean()`), whereas `np.cov()` divides by $n - 1$ by default, so our answer is close to NumPy's but not identical:

In [18]:
from statistics import mean

running_total = [
    (x - mean(min_max_scaled)) * (y - mean(standardized))
    for x, y in zip(min_max_scaled, standardized)
]

cov = mean(running_total)
cov

0.26449129918250414

#### Pearson correlation coefficient ($\rho$)

In [19]:
from statistics import stdev
cov / (stdev(min_max_scaled) * stdev(standardized))

0.9900000000000001
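
Both scaled series are linear transformations of the same salaries, so their correlation should be 1. The result of 0.99 comes from mixing denominators: the covariance above divides by $n$, while `stdev()` divides by $n - 1$, which leaves a factor of $(n - 1) / n = 99 / 100$. The sketch below illustrates this on a small made-up dataset (an assumption for illustration); using $n - 1$ in the covariance as well brings the correlation to 1, up to floating-point error:

```python
import math
from statistics import mean, stdev

values = [1.0, 3.0, 4.0, 7.0, 10.0]  # illustrative data, not the salaries
n = len(values)

# Scale the same data two ways, as in Exercise 7
low, high = min(values), max(values)
x_scaled = [(v - low) / (high - low) for v in values]
center, spread = mean(values), stdev(values)
y_scaled = [(v - center) / spread for v in values]

x_bar, y_bar = mean(x_scaled), mean(y_scaled)
cross_products = sum(
    (x - x_bar) * (y - y_bar) for x, y in zip(x_scaled, y_scaled)
)

population_cov = cross_products / n    # what mean(running_total) computes
sample_cov = cross_products / (n - 1)  # consistent with stdev()

denominator = stdev(x_scaled) * stdev(y_scaled)
print(population_cov / denominator)  # close to (n - 1) / n
print(sample_cov / denominator)      # close to 1
```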
