### Mean 

The mean is the average of a set of values. The operation is simple: sum the values and divide by the number of values. The mean is useful because it shows where the 'center of gravity' lies for an observed set of values.

In [7]:
sample = [1, 3, 2, 5, 7, 0, 2, 3]
mean = sum(sample) / len(sample)
print(mean)

2.875


The sample mean:
$$\bar{x} = \frac{x_1 + x_2 + x_3 + ... + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The population mean:
$$\mu = \frac{x_1 + x_2 + x_3 + ... + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The weighted mean can be helpful when we want some values to contribute to the mean more than others.

The weighted mean:
$$\bar{x} = \frac{w_1x_1 + w_2x_2 + w_3x_3 + ... + w_nx_n}{w_1 + w_2 + w_3 + ... + w_n}$$

In [8]:
# Three exams with a weight of .20 each and a final exam with a weight of .40
sample = [90, 80, 63, 87]
weights = [.20, .20, .20, .40]
weighted_mean = sum(s * w for s, w in zip(sample, weights)) / sum(weights)
print(weighted_mean)

81.4


In [9]:
# Weights don't have to be percentages: dividing by sum(weights) rescales any set of weights proportionally
sample = [90, 80, 63, 87]
weights = [1.0, 1.0, 1.0, 2.0]
weighted_mean = sum(s * w for s, w in zip(sample, weights)) / sum(weights)
print(weighted_mean)

81.4


### Median

The median is the middlemost value in a set of ordered values. You sort the values, and the median is the centermost value. If you have an even number of values, you average the two centermost values.

The median can be a helpful alternative to the mean when data is skewed by outliers. It is less sensitive to outliers and cuts the data strictly down the middle based on relative order. When your median is very different from your mean, that usually indicates a skewed dataset, often one with outliers.

0, 1, 5, |7|, 9, 10, 14

In [10]:
# Number of pets each person owns
sample = [0, 1, 5, 7, 9, 10, 14]

def median(values):
  ordered = sorted(values)
  n = len(ordered)
  mid = int(n / 2) - 1 if n % 2 == 0 else int(n / 2)

  if n % 2 == 0:
    return (ordered[mid] + ordered[mid + 1]) / 2.0
  else:
    return ordered[mid]
  
print(median(sample))

7
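
The example above has an odd number of values, so the even case and the outlier behavior described earlier are not exercised. The following is a minimal sketch using the standard library's `statistics` module rather than the `median` function defined above; the `with_outlier` list is hypothetical data (the pet counts plus one made-up extreme value).

```python
from statistics import mean, median

# Even number of values: the two centermost values (5 and 7) are averaged.
even_sample = [0, 1, 5, 7, 9, 10]
even_median = median(even_sample)
print(even_median)  # 6.0

# Hypothetical data: the pet counts from above plus one extreme outlier.
with_outlier = [0, 1, 5, 7, 9, 10, 14, 200]
print(mean(with_outlier))    # 30.75 (pulled far upward by the outlier)
print(median(with_outlier))  # 8.0   (barely moves)
```

The single outlier drags the mean well above every typical value, while the median only shifts from 7 to 8.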


### Mode

The mode is the most frequently occurring value (or set of values). It primarily becomes useful when your data is repetitive and you want to find which values occur most frequently.

When no value occurs more than once, there is no mode. When two values occur with an equal amount of frequency, then the dataset is considered bimodal.

In [11]:
from collections import defaultdict

sample = [1, 3, 2, 5, 7, 0, 2, 3]

def mode(values):
  counts = defaultdict(lambda: 0)

  for s in values:
    counts[s] += 1

  max_count = max(counts.values())
  modes = [v for v in set(values) if counts[v] == max_count]
  return modes


print(mode(sample))

[2, 3]
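
Note that a function which simply returns every value with the maximum count will return all values when nothing repeats. If you want to follow the "no mode" convention described above, one possible sketch is the following. The helper name `modes_or_none` is an assumption made for illustration, not part of any library.

```python
from collections import Counter

def modes_or_none(values):
    """Return the sorted modes, or an empty list when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    max_count = max(counts.values())
    if max_count == 1:
        # Every value occurs exactly once: treat this as "no mode".
        return []
    return sorted(v for v, c in counts.items() if c == max_count)

print(modes_or_none([1, 3, 2, 5, 7, 0, 2, 3]))  # [2, 3] (bimodal)
print(modes_or_none([4, 8, 15, 16]))            # []     (no mode)
```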


### Variance and Standard Deviation

population variance = $$\frac{(x_1 - \text{mean})^2 + (x_2 - \text{mean})^2 + ... + (x_N - \text{mean})^2}{N}$$

More formally written as: $$\sigma^2 = \frac{\sum(x_i - \mu)^2}{N}$$

Standard deviation: $$\sigma = \sqrt{\frac{\sum(x_i - \mu)^2}{N}}$$

In [12]:
from math import sqrt

data = [0, 1, 5, 7, 9, 10, 14]

def variance(values):
  n = len(values)
  mean = sum(values) / n
  variance = sum((v - mean)**2 for v in values) / n
  return variance

def std_dev(values):
  return sqrt(variance(values))


print(std_dev(data))

4.624689730353898


### Sample Variance and Standard Deviation

sample variance: $$s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1}$$

sample standard deviation: $$s = \sqrt{\frac{\sum(x_i - \bar{x})^2}{n-1}}$$

In [13]:
# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]

def variance(values, is_sample: bool = False):
  n = len(values)
  mean = sum(values) / n
  _variance = sum((v - mean)**2 for v in values) / (n - (1 if is_sample else 0))
  return _variance

def std_dev(values, is_sample: bool = False):
  return sqrt(variance(values, is_sample))

print('VARIANCE = {}'.format(variance(data, is_sample=True)))
print('STD DEV = {}'.format(std_dev(data, is_sample=True)))

VARIANCE = 24.95238095238095
STD DEV = 4.99523582550223
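
As a sanity check, the hand-written population and sample functions can be compared against Python's standard `statistics` module, which provides `pvariance`/`pstdev` (divide by N) and `variance`/`stdev` (divide by n - 1). This is only a cross-check sketch; the last printed digits may differ slightly from the outputs above because of floating-point rounding.

```python
import math
import statistics

# Number of pets each person owns
data = [0, 1, 5, 7, 9, 10, 14]
n = len(data)

pop_var = statistics.pvariance(data)    # divides by N
sample_var = statistics.variance(data)  # divides by n - 1

print('POPULATION STD DEV = {}'.format(statistics.pstdev(data)))  # about 4.6247
print('SAMPLE VARIANCE = {}'.format(sample_var))                  # about 24.9524
print('SAMPLE STD DEV = {}'.format(statistics.stdev(data)))       # about 4.9952

# The two variances differ only by the factor n / (n - 1).
print(math.isclose(sample_var / pop_var, n / (n - 1)))  # True
```

Because the only difference is the factor n / (n - 1), the sample variance is always the larger of the two, and the gap shrinks as n grows.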


### The Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a symmetrical bell-shaped distribution that has most of its mass around the mean, with a spread determined by its standard deviation. The tails on either side become thinner as you move away from the mean.

##### Properties of a Normal Distribution

- It's symmetrical; both sides are identically mirrored at the mean, which is the center.
- Most mass is at the center around the mean
- It has a spread (being narrow or wide) that is specified by standard deviation
- The tails are the least likely outcomes and approach zero infinitely but never touch zero.
- It resembles a lot of phenomena in nature and daily life, and even generalizes nonnormal problems because of the central limit theorem.
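
Several of these properties can be checked numerically. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8+) with a standard normal distribution (mean 0, standard deviation 1), chosen here only for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: points equally far from the mean have the same density.
left = dist.pdf(dist.mean - 2.0)
right = dist.pdf(dist.mean + 2.0)
print(left, right)

# Most mass is near the mean: about 68% lies within one standard deviation.
within_one_sd = dist.cdf(dist.mean + dist.stdev) - dist.cdf(dist.mean - dist.stdev)
print(within_one_sd)

# Thin tails: the density ten standard deviations out is tiny but still positive.
far_tail = dist.pdf(dist.mean + 10.0 * dist.stdev)
print(far_tail)
```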

#### The Probability Density Function (PDF)

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

In [16]:
import math

# Normal distribution PDF: returns the density (likelihood) at x, which is not itself a probability
def normal_pdf(x: float, mean: float, std_dev: float) -> float:
  coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
  exponent = -0.5 * ((x - mean) / std_dev) ** 2
  return coefficient * math.exp(exponent)
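
The function above is defined but never called, so here is a short usage sketch. It repeats an equivalent definition so the block runs on its own, and compares the result with `statistics.NormalDist.pdf` from the standard library as a cross-check.

```python
import math
from statistics import NormalDist

def normal_pdf(x: float, mean: float, std_dev: float) -> float:
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    exponent = -0.5 * ((x - mean) / std_dev) ** 2
    return coefficient * math.exp(exponent)

# Standard normal at its mean: 1 / sqrt(2 * pi), roughly 0.3989.
peak = normal_pdf(0.0, 0.0, 1.0)
reference = NormalDist(0.0, 1.0).pdf(0.0)
print(peak, reference)

# A density is not a probability: with a small standard deviation it exceeds 1.
narrow_peak = normal_pdf(0.0, 0.0, 0.1)
print(narrow_peak)  # roughly 3.989
```

To get an actual probability you need the area under the curve over a range, which is what the CDF provides.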

#### The Cumulative Distribution Function (CDF)

$$F(x) = P[X \leq x] = \int_{-\infty}^{x} f(t) dt$$

In [19]:
from scipy.stats import norm

mean = 64.43
std_dev = 2.99

x = norm.cdf(64.43, mean, std_dev)
print(x)

0.5


In [21]:
# getting a middle range probability using the CDF

mean = 64.43
std_dev = 2.99

x = norm.cdf(66, mean, std_dev) - norm.cdf(62, mean, std_dev)
print(x)

0.4920450147062894
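
To connect these results back to the integral definition of the CDF, the following sketch approximates the integral of the PDF with a simple midpoint rule and compares it with the closed form based on `math.erf`. The function names, the cutoff of 10 standard deviations below the mean, and the step count are all assumptions chosen for illustration; this is not how `scipy` computes `norm.cdf`.

```python
import math

def normal_pdf(x, mean, std_dev):
    coefficient = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-0.5 * ((x - mean) / std_dev) ** 2)

def normal_cdf_closed_form(x, mean, std_dev):
    # Standard identity relating the normal CDF to the error function.
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))

def normal_cdf_numeric(x, mean, std_dev, steps=10_000):
    # Midpoint rule from 10 standard deviations below the mean up to x.
    # Assumes x is above that cutoff; the mass below it is negligible.
    lower = mean - 10.0 * std_dev
    width = (x - lower) / steps
    area = sum(normal_pdf(lower + (i + 0.5) * width, mean, std_dev)
               for i in range(steps))
    return area * width

mean = 64.43
std_dev = 2.99

closed = normal_cdf_closed_form(66, mean, std_dev) - normal_cdf_closed_form(62, mean, std_dev)
numeric = normal_cdf_numeric(66, mean, std_dev) - normal_cdf_numeric(62, mean, std_dev)
print(closed)   # about 0.4920
print(numeric)  # about 0.4920
```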


#### The Inverse CDF

$$F^{-1}(p) = \inf\{x \in \mathbb{R} : F(x) \geq p\}$$

In [23]:
# Find the weight that 95% of golden retrievers fall under.

x = norm.ppf(.95, loc=64.43, scale=2.99)
print(x)

69.3481123445849
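
The definition above says the inverse CDF returns the smallest x whose cumulative probability reaches p. One way to illustrate that is a bisection search over the CDF, sketched below. The function names, the search window of 10 standard deviations, and the tolerance are assumptions for illustration; `scipy` and the standard library use faster dedicated algorithms.

```python
import math
from statistics import NormalDist

def normal_cdf(x, mean, std_dev):
    return 0.5 * (1.0 + math.erf((x - mean) / (std_dev * math.sqrt(2.0))))

def inverse_cdf_bisect(p, mean, std_dev, tol=1e-10):
    # Search within 10 standard deviations of the mean.
    # Assumes p is not extremely close to 0 or 1.
    lo, hi = mean - 10.0 * std_dev, mean + 10.0 * std_dev
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if normal_cdf(mid, mean, std_dev) >= p:
            hi = mid  # mid already reaches p, so the answer is at or below mid
        else:
            lo = mid  # mid falls short of p, so the answer is above mid
    return hi

x = inverse_cdf_bisect(0.95, 64.43, 2.99)
reference = NormalDist(64.43, 2.99).inv_cdf(0.95)
print(x)          # about 69.348
print(reference)  # about 69.348
```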


In [28]:
# You can also use the inverse CDF to generate random numbers that follow the normal distribution.
# To simulate one thousand realistic golden retriever weights, generate a random value
# between 0.0 and 1.0, pass it to the inverse CDF, and use the returned weight, as shown here.

import random

for i in range(0, 1000):
  random_p = random.uniform(0.0, 1.0)
  random_weight = norm.ppf(random_p, loc=64.43, scale=2.99)
  print(random_weight)

65.36190274234445
66.13919785721752
66.17636148164478
64.1553231328954
63.2205750766737
61.90649204262768
62.624635536254566
60.45314445562548
57.7117322381723
64.53117665228635
66.23357812162635
63.07370697715183
67.62383089380177
67.85760570720105
62.94315272700308
64.09894105023355
65.45218733372745
65.37228339954834
61.49485750635907
60.75157293099403
66.80800344181371
62.72397181077187
62.8510731879396
66.18467475901991
68.33860463372439
70.4285484813041
61.49440005136825
65.83676388762856
64.65513915866036
65.87754809390454
66.79300632746808
60.50991871902245
66.23617544178104
64.21732139487933
64.9041999471367
61.78731523092455
63.78935189443749
59.70384202959469
66.89803735928554
60.17245170799434
61.357717892143626
73.12922732070419
62.659616920175466
64.87925711452608
61.999202018543286
69.11636205710775
63.785790983171076
65.21004063441717
62.65757277521312
65.6595272689457
68.37769170124734
64.37178580147322
65.3559123990179
67.42909317196708
61.2517682218685
59.68207648574
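
Printing a thousand values makes it hard to judge whether the simulation behaves as intended. A possible alternative, sketched below, collects the values and summarizes them. It uses the standard library's `NormalDist.inv_cdf` in place of `norm.ppf`, and the function name `sample_weights` and the fixed seed are assumptions added so the run is reproducible.

```python
import random
from statistics import NormalDist, mean, stdev

def sample_weights(n, mu=64.43, sigma=2.99, seed=0):
    """Draw n weights by passing uniform random numbers through the inverse CDF."""
    rng = random.Random(seed)
    dist = NormalDist(mu, sigma)
    weights = []
    while len(weights) < n:
        p = rng.random()  # uniform on [0.0, 1.0)
        if p == 0.0:
            continue      # inv_cdf is only defined for 0 < p < 1
        weights.append(dist.inv_cdf(p))
    return weights

weights = sample_weights(1000)
print(len(weights))
print(mean(weights))   # should land close to 64.43
print(stdev(weights))  # should land close to 2.99
```

If the sample mean and standard deviation come out near the parameters you passed in, the inverse-CDF sampling is working as described.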

### Z-Scores 

A z-score is a measure of how many standard deviations away from the mean a data point is. It is calculated as:
$$z = \frac{x - \mu}{\sigma}$$

Example: We have two homes from two different neighborhoods. <br>
Neighborhood A has a mean home value of $140,000 and standard deviation of $3,000. <br>
Neighborhood B has a mean home value of $800,000 and standard deviation of $10,000. <br>
House A from neighborhood A is worth $150,000. <br>
House B from neighborhood B is worth $815,000. <br>
Which home is more expensive relative to the average home in its neighborhood?

House A: $$z_a = \frac{150000-140000}{3000} = 3.33$$
House B: $$z_b = \frac{815000-800000}{10000} = 1.5$$

So house A is more expensive relative to the average home in its neighborhood.

In [31]:
def z_score(x, mean, std):
  return (x - mean) / std

def z_to_x(z, mean, std):
  return (z * std) + mean 

mean = 140000
std_dev = 3000
x = 150000

z = z_score(x, mean, std_dev)
back_to_x = z_to_x(z, mean, std_dev)

print('Z-Score: {}'.format(z))
print('Back to X: {}'.format(back_to_x))

Z-Score: 3.3333333333333335
Back to X: 150000.0
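
The cell above only covers House A. A small sketch that computes both z-scores from the worked example and picks the larger one could look like this; the dictionary layout and variable names are illustrative choices.

```python
def z_score(x, mean, std):
    return (x - mean) / std

# (home value, neighborhood mean, neighborhood standard deviation)
homes = {
    'House A': (150_000, 140_000, 3_000),
    'House B': (815_000, 800_000, 10_000),
}

z_scores = {name: z_score(*values) for name, values in homes.items()}
most_extreme = max(z_scores, key=z_scores.get)

for name, z in z_scores.items():
    print('{}: z = {:.2f}'.format(name, z))
print('Most expensive relative to its neighborhood: {}'.format(most_extreme))
```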


#### Coefficient of Variation

The coefficient of variation is a measure of relative variability. It expresses the standard deviation as a fraction of the mean, which lets you compare how spread out two distributions are even when they sit on very different scales. It is calculated as:
$$cv = \frac{\sigma}{\mu}$$

$$cv_A = \frac{3000}{140000} = 0.0214$$
$$cv_B = \frac{10000}{800000} = 0.0125$$

So neighborhood A, although cheaper than neighborhood B, is more spread out relative to its mean, meaning it has more price diversity.
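
The same comparison can be reproduced in code. This is a minimal sketch using the neighborhood figures from the example; the function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(mean, std):
    return std / mean

cv_a = coefficient_of_variation(140_000, 3_000)
cv_b = coefficient_of_variation(800_000, 10_000)

print('CV A: {:.4f}'.format(cv_a))  # 0.0214
print('CV B: {:.4f}'.format(cv_b))  # 0.0125
print('More relative spread: {}'.format('A' if cv_a > cv_b else 'B'))
```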