# Chapter 3 - Descriptive and Inferential Statistics

Statistics is the heart of many data-driven innovations. Machine learning in itself is a statistical tool, searching for possible hypotheses
to correlate relationships between different variables in data.

## What Is Data?

Data is often treated as the source of not just truth but intelligence. It’s the fuel for artificial intelligence,
and it is widely believed that the more data you have, the more truth you have, so
you can never have enough data. But data is not important in itself. It’s the
analysis of data (and how it is produced) that is the driver of all these innovations and
solutions.

Always ask questions about how the data was obtained, and then scrutinize
how that process could have biased the data.

## Descriptive Statistics

### Mean and Weighted Mean

The mean is the average of a set of values. The operation is simple to do: sum the
values and divide by the number of values. The mean is useful because it shows where
the “center of gravity” exists for an observed set of values.
The mean is calculated the same way for both populations and samples.

### Means

#### Calculating mean in Python

In [1]:
samples = [1, 3, 2, 5, 0, 7, 2, 3]

mean = sum(samples) / len(samples)
print(mean)  # 2.875

2.875


There are two versions of the mean you will see: the sample mean $\bar{x}$ and the population mean μ, expressed here:

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \displaystyle\sum_{i=1}^{n} \frac{x_i}{n}$

$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \displaystyle\sum_{i=1}^{N} \frac{x_i}{N}$

The ***n*** and the ***N*** represent the sample and population size, respectively, but mathematically they
represent the same thing: the number of items.

The same goes for calling the sample mean $\bar{x}$ (“x-bar”) and the population mean μ (“mu”). Both $\bar{x}$ and μ are the same calculation, just with different names depending on whether we are working with a sample or a population.

The mean we commonly use gives equal importance to each value. But we can manipulate the mean and give each item a different weight:

weighted mean = $\frac{(x_1 \cdot w_1) + (x_2 \cdot w_2) + (x_3 \cdot w_3) + \dots + (x_n \cdot w_n)}{w_1 + w_2 + w_3 + \dots + w_n}$

Calculating a weighted mean in Python

In [2]:
# Three exams of .20 weight each and final exam of .40 weight
sample = [90, 80, 63, 87]
weights = [0.2, 0.2, 0.2, 0.4]

weighted_mean = sum(s * w for s, w in zip(sample, weights)) / sum(weights)

print(weighted_mean)

81.4


We weight each exam score by multiplying it by its weight, and instead of dividing by the number of values, we divide by the sum of the weights. Weights don’t have to be percentages, because any numbers used as weights end up being normalized into proportions.

In [1]:
# Calculating a weighted mean in Python (alt example)
sample = [90, 80, 63, 87]
weights = [1.0, 1.0, 1.0, 2.0]

weighted_mean = sum(s * w for s, w in zip(sample, weights)) / sum(weights)

print(weighted_mean)  # 81.4

81.4
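To see why both sets of weights give the same answer, here is a minimal sketch that rescales the second set so it sums to 1. The names `raw_weights` and `normalized_weights` are illustrative and not part of the original notebook.

```python
import math

sample = [90, 80, 63, 87]
raw_weights = [1.0, 1.0, 1.0, 2.0]

# Rescale the weights so they sum to 1 (each weight becomes a proportion)
total = sum(raw_weights)
normalized_weights = [w / total for w in raw_weights]
print(normalized_weights)  # [0.2, 0.2, 0.2, 0.4]

# With normalized weights the denominator is 1, so no division is needed
weighted_mean = sum(s * w for s, w in zip(sample, normalized_weights))
print(math.isclose(weighted_mean, 81.4))  # True
```

Dividing by the sum of the weights in the weighted mean formula performs this rescaling implicitly.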


#### Median

The ***median*** is the middlemost value in a set of ordered values. When the number of values is even, it is the average of the two middle values.

In [7]:
# calculating the median in Python

def median(vals: list[int]) -> float:  # float: even-length input averages the two middle values
    ordered_vals = sorted(vals)
    length = len(vals)
    middle = length // 2 - 1 if length % 2 == 0 else length // 2
    return (
        (ordered_vals[middle] + ordered_vals[middle + 1]) / 2.0
        if length % 2 == 0
        else ordered_vals[middle]
    )


samples = [0, 1, 5, 7, 9, 10, 14]
print(median(samples))

7
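The example above only exercises an odd-length input. As a rough check of the even-length case, this sketch compares a compact median function against `statistics.median` from the standard library. The `even_sample` list is made up for illustration.

```python
import statistics


def median(vals: list[float]) -> float:
    """Middle value of the sorted data; mean of the two middle values for an even count."""
    ordered = sorted(vals)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [0, 1, 5, 7, 9, 10, 14]
even_sample = [0, 1, 5, 7, 9, 10]  # hypothetical even-length data

print(median(odd_sample), statistics.median(odd_sample))    # 7 7
print(median(even_sample), statistics.median(even_sample))  # 6.0 6.0
```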


#### Mode

The mode is the most frequently occurring set of values.

In [14]:
# Calculating the mode in python
from collections import defaultdict
def mode(vals: list[int]) -> list[int]:
    counts = defaultdict(lambda: 0)
    for s in vals:
        counts[s] += 1
    max_count = max(counts.values())
    return [v for v in set(vals) if counts[v] == max_count]


samples = [1,3,2,5,7,0,2,3]
print(mode(samples)) # bimodal -> [2, 3]    
    

[2, 3]
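The same result can be cross-checked with the standard library. This sketch uses `collections.Counter` and `statistics.multimode` (available in Python 3.8 and later). The results are sorted because `multimode` returns the modes in the order they first appear in the data.

```python
import statistics
from collections import Counter

samples = [1, 3, 2, 5, 7, 0, 2, 3]

# Count occurrences, then keep every value tied for the highest count
counts = Counter(samples)
max_count = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == max_count)

print(modes)                                  # [2, 3]
print(sorted(statistics.multimode(samples)))  # [2, 3]
```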


#### Variance and Standard Deviation

variance (population): $\frac{(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_N - \text{mean})^2}{N}$

More formally, the variance is:
$\sigma^2 = \displaystyle\sum_{i=1}^{N} \frac{(x_i - \mu)^2}{N}$

In [15]:
# calculating population variance in Python
def variance(vals: list[int]) -> float:
    mean = sum(vals) / len(vals)
    return sum((v-mean)**2 for v in vals) / len(vals) 

data = [0,1,5,7,9,10,14]
print(variance(data)) # 21.387755102040817    

21.387755102040817


standard deviation (population): $\sqrt{\frac{(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + \dots + (x_N - \text{mean})^2}{N}}$

More formally, the standard deviation is:
$\sigma = \sqrt{\frac{\displaystyle\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$

In [18]:
# Calculating standard deviation in Python
from math import sqrt
def standard_deviation(vals: list[int]) -> float:
    return sqrt(variance(vals))

data = [0,1,5,7,9,10,14]
print(standard_deviation(data)) # 4.624689730353899

4.624689730353899


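Sample Variance and Standard Deviation

For a sample, we divide by $n - 1$ rather than $n$ (Bessel's correction). This compensates for the tendency of a sample to underestimate the variance of the population it was drawn from.

$s^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$

$s = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$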

In [20]:
# Calculating standard deviation for a sample
from math import sqrt
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

def sample_variance(vals: list[int]) -> float:
    mean = sum(vals) / len(vals)  # sample mean
    return sum((v-mean)**2 for v in vals) / (len(vals) - 1)

def sample_standard_deviation(vals: list[int]) -> float:
    return sqrt(sample_variance(vals))

print(sample_variance(data)) # 24.952380952380953
print(sample_standard_deviation(data)) # 4.995235825502231

24.952380952380953
4.995235825502231
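As a sanity check, the population and sample versions can be compared side by side against the standard library's `statistics` module (`pvariance`/`pstdev` divide by the count, `variance`/`stdev` divide by the count minus 1). This is only a sketch, and the variable names are illustrative.

```python
import math
import statistics

# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]
n = len(data)
mean = sum(data) / n
sum_of_squares = sum((v - mean) ** 2 for v in data)

population_variance = sum_of_squares / n      # divide by N
sample_variance = sum_of_squares / (n - 1)    # divide by n - 1

# The hand-rolled values agree with the standard library
print(math.isclose(population_variance, statistics.pvariance(data)))      # True
print(math.isclose(sample_variance, statistics.variance(data)))           # True
print(math.isclose(math.sqrt(sample_variance), statistics.stdev(data)))   # True

# The two variances differ by exactly the factor n / (n - 1)
print(math.isclose(sample_variance / population_variance, n / (n - 1)))   # True
```

The gap between the two estimates shrinks as the sample grows, because `n / (n - 1)` approaches 1.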


#### The Normal Distribution

The most famous distribution of all time. (goat of distributions)

The normal distribution, also known as the Gaussian distribution, is a symmetrical, bell-shaped distribution that has most of its mass around the mean, with its spread defined by the standard deviation.
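To make "most mass around the mean" concrete, this sketch uses `statistics.NormalDist` from the standard library (Python 3.8+) to measure how much probability lies within 1, 2, and 3 standard deviations of the mean. The mean and standard deviation are the golden retriever figures used later in this chapter. `mass_within` is a hypothetical helper, not a library function.

```python
from statistics import NormalDist

dist = NormalDist(mu=64.43, sigma=2.99)


def mass_within(k: float) -> float:
    """Probability of landing within k standard deviations of the mean."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)


for k in (1, 2, 3):
    print(k, round(mass_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These proportions are the same for every normal distribution, whatever its mean and standard deviation.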

#### The Probability Density Function (PDF)

The probability density function (PDF) that creates the normal distribution is as follows:

$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$

In [1]:
# The normal distribution function in Python
import math

def normal_pdf(x: float, mu: float, std_dev: float) -> float:
    return (1 / (std_dev * math.sqrt(2 * math.pi))) * math.e ** (-0.5 * ((x - mu) / std_dev) ** 2)
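A quick way to gain confidence in the function is to evaluate it at a few points. This sketch restates the PDF using `math.exp` and checks it against `statistics.NormalDist.pdf` from the standard library, using the mean and standard deviation from the examples that follow.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, std_dev: float) -> float:
    """Height of the normal curve at x."""
    exponent = -0.5 * ((x - mu) / std_dev) ** 2
    return math.exp(exponent) / (std_dev * math.sqrt(2 * math.pi))


mu, std_dev = 64.43, 2.99

# The curve peaks at the mean, where the exponent is 0
peak = normal_pdf(mu, mu, std_dev)
print(math.isclose(peak, 1 / (std_dev * math.sqrt(2 * math.pi))))  # True

# Agrees with the standard library implementation
print(math.isclose(peak, NormalDist(mu, std_dev).pdf(mu)))  # True

# Symmetric: equal heights at equal distances on either side of the mean
print(math.isclose(normal_pdf(mu - 2, mu, std_dev), normal_pdf(mu + 2, mu, std_dev)))  # True
```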


#### The Cumulative Distribution Function (CDF)

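With the normal distribution, the vertical axis is not the probability but rather the likelihood for the data. To find the probability we need to look at a given range, and then find the area under the curve for that range. The cumulative distribution function (CDF) gives the area under the curve up to a given x-value, so the probability of a range is the difference between two CDF values.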

In [6]:
# The normal distribution CDF in Python
from scipy.stats import norm

mean = 64.43
std_dev = 2.99

x = norm.cdf(64.43, mean, std_dev)
print(x) # should be .5

0.5


In [7]:
# Get the probability of a value between 62 and 66
from scipy.stats import norm

mean = 64.43
std_dev = 2.99

x = norm.cdf(66, mean, std_dev) - norm.cdf(62, mean, std_dev)
print(x)

0.4920450147062894
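The "area under the curve" idea can be checked without SciPy by summing thin rectangles under the PDF. This is a rough numerical sketch using the midpoint rule. `area_under_pdf` and its `steps` parameter are illustrative names, not part of any library.

```python
import math


def normal_pdf(x: float, mu: float, std_dev: float) -> float:
    """Height of the normal curve at x."""
    exponent = -0.5 * ((x - mu) / std_dev) ** 2
    return math.exp(exponent) / (std_dev * math.sqrt(2 * math.pi))


def area_under_pdf(a: float, b: float, mu: float, std_dev: float, steps: int = 10_000) -> float:
    """Approximate the area between a and b with the midpoint rule."""
    width = (b - a) / steps
    heights = (normal_pdf(a + (i + 0.5) * width, mu, std_dev) for i in range(steps))
    return sum(heights) * width


area = area_under_pdf(62, 66, mu=64.43, std_dev=2.99)
print(round(area, 4))  # 0.492, matching the CDF difference above
```

The CDF does this integration for us, which is why subtracting two CDF values gives the probability of a range.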


#### The Inverse CDF (also called the **PPF** or ***quantile function***)

The inverse CDF looks up a probability and returns the corresponding x-value. For example, I want to find the weight that 95% of golden retrievers fall under. This is easy to do with the inverse CDF.

In [9]:
# Using the inverse CDF (called ppf()) in Python

from scipy.stats import norm

x = norm.ppf(.95, loc=64.43, scale=2.99)
print(f"I find that 95% of golden retrievers are {x} or fewer pounds.")

I find that 95% of golden retrievers are 69.3481123445849 or fewer pounds.
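Because the inverse CDF undoes the CDF, feeding its result back into the CDF should return the original probability. This sketch shows the round trip using `statistics.NormalDist` from the standard library, whose `inv_cdf` method plays the same role as SciPy's `ppf`.

```python
from statistics import NormalDist

golden_retriever_weights = NormalDist(mu=64.43, sigma=2.99)

# Probability -> x-value
x = golden_retriever_weights.inv_cdf(0.95)
print(round(x, 2))  # 69.35

# x-value -> probability (back where we started)
p = golden_retriever_weights.cdf(x)
print(round(p, 6))  # 0.95
```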


You can also use the inverse CDF to generate random numbers that follow the normal distribution: draw a uniform random probability between 0.0 and 1.0 and pass it to the inverse CDF.

In [10]:
# Generating random numbers from a normal distribution
import random
from scipy.stats import norm

for _ in range(0, 1000):
    random_p = random.uniform(0.0, 1.0)
    random_weight = norm.ppf(random_p, loc=64.43, scale=2.99)
    print(random_weight)


64.48727988662567
59.48655344383311
63.772389099002076
61.79730948610473
61.74836560160968
64.65995932782118
65.98140771908476
63.556175412020195
64.2353265794966
65.61002518201079
63.098314048038574
67.3405212369525
63.01015978300586
65.92392314188095
70.56731483454213
68.08924327631105
65.65469053145881
59.75066305820398
65.89205216955166
65.29533661050067
59.75689389396095
64.7412682149752
65.43488244204124
62.311104053326936
64.85610516786406
66.19690955145299
62.769941238709585
66.16298748348805
69.08248320477912
65.60582513462181
69.47474739711491
65.54731908696684
68.31159971653354
70.49168406802544
65.87482289770885
65.06669068611151
67.23068186821418
64.314575849647
64.45729708358218
64.20540712900653
62.20000165207339
66.46793990597057
61.010604418965364
63.60125014387583
60.68831669146661
63.16782403519688
63.71835390636501
66.8685445526938
66.02720420898105
58.81831691305514
63.355910993994165
70.48590329849496
63.40532639444895
67.89605095419304
56.9784608930255
64.1887577
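Rather than reading through the raw values, one way to check the generated numbers is to summarize them. This sketch repeats the approach with a seeded generator and `statistics.NormalDist.inv_cdf` (a standard-library stand-in for `norm.ppf`), then compares the sample mean and standard deviation with the parameters used. The seed and sample size are arbitrary choices.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(0)  # seeded so the run is repeatable
dist = NormalDist(mu=64.43, sigma=2.99)


def random_weight() -> float:
    """Draw one value by passing a uniform probability through the inverse CDF."""
    p = rng.random()
    while p == 0.0:  # inv_cdf requires 0 < p < 1
        p = rng.random()
    return dist.inv_cdf(p)


weights = [random_weight() for _ in range(10_000)]

# Both summaries should land close to 64.43 and 2.99
print(round(mean(weights), 2), round(stdev(weights), 2))
```

With more draws, the sample mean and standard deviation settle closer to the distribution's parameters.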

#### Z Scores