<a href="https://colab.research.google.com/github/akshay-r13/ds_from_scratch/blob/main/05_statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Statistics

Statistics refers to the mathematics and techniques we use to understand data.

## Describing a single set of data


### Central tendencies

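Measures of central tendency describe where the data is centred. The measures covered in this section are:
1. Mean
2. Median
3. Quantile (a generalization of the median)
4. Mode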

#### Mean

The mean gives us a notion of where the data is centred.

In [None]:
# Compute mean of several data points
def mean(x):
  return sum(x) / len(x)

In [None]:
num_friends = [100, 49, 41, 40, 25]
mean(num_friends)

51.0

> Mean of the data **depends on every data point in the dataset**

#### Median

The median is the middle value of the sorted data or, when the number of data points is even, the average of the two middle values.

In [None]:
# Function to compute median

def median(x):
  n = len(x)
  sorted_x = sorted(x)
  midpoint = n // 2

  # If length of x is odd
  if n % 2 != 0:
    return sorted_x[midpoint]
  else:
    return (sorted_x[midpoint-1] + sorted_x[midpoint]) / 2

In [None]:
median(num_friends)

41

In [None]:
median([1, 3, 4, 10, 12])

4

In [None]:
median([1, 3, 4, 7, 10, 12])

5.5

Finding the median requires sorting the data first.

The mean changes whenever any value in the data changes, but the median may not. If we increase a single data point by a small amount `e`, the median may
1. remain the same,
2. increase by `e`, or
3. increase by a value less than `e`.
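A small sketch of the three cases above, assuming a single value is nudged upward by `e = 0.5`. The `median` function is repeated so the cell runs on its own; the sample lists and the helper name `bumped` are illustrative choices, not part of the original notebook.

```python
def median(x):
  n = len(x)
  sorted_x = sorted(x)
  midpoint = n // 2
  if n % 2 != 0:
    return sorted_x[midpoint]
  return (sorted_x[midpoint - 1] + sorted_x[midpoint]) / 2

def bumped(x, index, e):
  # Hypothetical helper: a copy of x with x[index] increased by e
  return [xi + e if i == index else xi for i, xi in enumerate(x)]

e = 0.5
odd = [1, 3, 4, 10, 12]       # median is 4
even = [1, 3, 4, 7, 10, 12]   # median is 5.5

# Case 1: nudging a value far from the middle leaves the median unchanged
print(median(bumped(odd, 0, e)) - median(odd))    # 0
# Case 2: nudging the middle value of an odd-length list moves the median by e
print(median(bumped(odd, 2, e)) - median(odd))    # 0.5
# Case 3: nudging one of the two middle values of an even-length list moves it by e / 2
print(median(bumped(even, 2, e)) - median(even))  # 0.25
```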

> The mean is **highly sensitive to outliers**. However, the median is not as sensitive.
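To make the outlier claim concrete, here is a minimal sketch that appends one extreme value to the earlier friend counts and compares both statistics. The value `10000` is a made-up outlier chosen only for illustration, and both functions are repeated so the cell is self-contained.

```python
def mean(x):
  return sum(x) / len(x)

def median(x):
  n = len(x)
  sorted_x = sorted(x)
  midpoint = n // 2
  if n % 2 != 0:
    return sorted_x[midpoint]
  return (sorted_x[midpoint - 1] + sorted_x[midpoint]) / 2

values = [100, 49, 41, 40, 25]    # same values as the earlier num_friends
with_outlier = values + [10000]   # hypothetical extreme value

print(mean(values), median(values))              # 51.0 41
# The mean jumps by more than a thousand, while the median only moves to 45.0
print(mean(with_outlier), median(with_outlier))
```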

#### Quantile

A quantile is a generalization of the median. The quantile for a fraction `p` is **the value below which a fraction `p` of the data lies**.

The **median is the 0.5 quantile** (the 50th percentile).

In [None]:
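def quantile(x, p):
  n = len(x)
  # Clamp the index so that p = 1.0 returns the largest value instead of raising IndexError
  p_index = min(int(n * p), n - 1)
  sorted_x = sorted(x)
  return sorted_x[p_index]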

In [None]:
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.5))
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.9))
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.4))
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.3))

6
10
5
4


In [None]:
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9], 0.5))
print(median([1, 2, 3, 4, 5, 6, 7, 8, 9]))

5
5
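One detail worth checking: with an index-based `quantile` like the one above, `p = 0.5` agrees with `median` on odd-length data but not necessarily on even-length data, because `quantile` always returns an existing element while `median` averages the two middle ones. A minimal sketch, with both functions repeated so it runs standalone:

```python
def quantile(x, p):
  n = len(x)
  p_index = min(int(n * p), n - 1)
  return sorted(x)[p_index]

def median(x):
  n = len(x)
  sorted_x = sorted(x)
  midpoint = n // 2
  if n % 2 != 0:
    return sorted_x[midpoint]
  return (sorted_x[midpoint - 1] + sorted_x[midpoint]) / 2

odd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
even = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

print(quantile(odd, 0.5), median(odd))    # 5 5
print(quantile(even, 0.5), median(even))  # 6 5.5
```

Other quantile definitions interpolate between neighbouring values, so results can differ slightly between implementations.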


#### Mode

The mode is the **most frequently occurring value in a dataset**. A dataset can have more than one mode.

In [None]:
from collections import Counter

In [None]:
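def mode(x):
  # Count how often each value occurs
  counts = Counter(x)
  max_count = max(counts.values())
  # Return every value that reaches the highest count (there may be ties)
  return [value for value, count in counts.items() if count == max_count]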



In [None]:
mode([1, 3, 1, 2, 3, 4 ,3 ,5, 3, 4, 1, 1])

[1, 3]
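As a cross-check, Python's standard library (3.8 and later) provides `statistics.multimode`, which also returns all tied modes. The sketch below compares it with a `Counter`-based `mode` like the one above; it is a sanity check rather than a replacement for the from-scratch version.

```python
from collections import Counter
from statistics import multimode

def mode(x):
  counts = Counter(x)
  max_count = max(counts.values())
  return [value for value, count in counts.items() if count == max_count]

data = [1, 3, 1, 2, 3, 4, 3, 5, 3, 4, 1, 1]

# Both list the tied modes in the order they first appear in the data
print(mode(data))       # [1, 3]
print(multimode(data))  # [1, 3]
```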

### Dispersion

Dispersion measures how spread out the data is.
Typically, values close to 0 indicate almost no spread, and large values indicate that the data is widely spread.

Some measures of dispersion are:
1. Range
2. Variance
3. Standard deviation
4. Interquartile range




#### Range

The range is simply the difference between the maximum and minimum values of a dataset.

In [None]:
# Note: this definition shadows Python's built-in range() for the rest of the notebook
def range(x):
  return max(x) - min(x)

In [None]:
range([1, 12, 1, 3, 5 , 5, 10 , 8])

11

Range is very sensitive to outliers. For example:

In [None]:
range([0.001, 0.002, 0.006, 0.009, 100])

99.999

#### Variance

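Variance is a more involved measure of dispersion. For a sample of $n$ values with mean $\bar{x}$, the sample variance is

  $s^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \bar{x})^2} {n - 1}$

The population variance $\sigma^2$ has the same form, but with the population mean $\mu$ in place of $\bar{x}$ and $n$ in place of $n - 1$.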

In [None]:
def de_mean(x):
  mu = mean(x)
  return [xi - mu for xi in x]

In [None]:
de_mean([1, 2,3, 5,4 ,5 ,6, 7 ])

[-3.125, -2.125, -1.125, 0.875, -0.125, 0.875, 1.875, 2.875]

In [None]:
def sum_of_squares(arr):
  return sum([a**2 for a in arr])

def variance(x):
  n = len(x)
  deviations = de_mean(x)
  return sum_of_squares(deviations) / (n-1)

In [None]:
variance([1, 2,3, 5,4 ,5 ,6, 7 ])

4.125

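Why divide by `n - 1` instead of `n`? The deviations are measured from the sample mean, which is itself computed from the same data, so on average they are slightly smaller than deviations from the true population mean. Dividing by `n - 1` (Bessel's correction) compensates for this and gives an unbiased estimate of the population variance.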
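The effect of the `n - 1` divisor can be checked exactly on a tiny example. The sketch below assumes a made-up population of five values and enumerates every ordered sample of size 2 drawn with replacement; the names `sum_sq_dev`, `avg_div_n` and `avg_div_n_minus_1` are illustrative. Averaged over all samples, dividing by `n - 1` recovers the population variance, while dividing by `n` underestimates it.

```python
from itertools import product

def mean(x):
  return sum(x) / len(x)

def sum_sq_dev(x):
  # Sum of squared deviations from the mean of x
  mu = mean(x)
  return sum((xi - mu) ** 2 for xi in x)

population = [1, 2, 3, 4, 5]  # hypothetical population
population_variance = sum_sq_dev(population) / len(population)

n = 2
# Every ordered sample of size n, drawn with replacement (25 samples)
samples = list(product(population, repeat=n))

avg_div_n = mean([sum_sq_dev(s) / n for s in samples])
avg_div_n_minus_1 = mean([sum_sq_dev(s) / (n - 1) for s in samples])

print(population_variance)  # 2.0
print(avg_div_n)            # 1.0, too small on average
print(avg_div_n_minus_1)    # 2.0, matches the population variance
```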

In [None]:
import math

In [None]:
def standard_deviation(x):
  return math.sqrt(variance(x))

In [None]:
standard_deviation([1, 2,3, 5,4 ,5 ,6, 7 ])

2.03100960115899

Standard deviation has the same outlier problem as the range: a single outlier can change its value significantly.

A more robust alternative is to compute the difference between the 75th and 25th percentile values.

In [None]:
print(standard_deviation([1, 2,3, 5,4 ,5 ,6, 7 ]))
print(standard_deviation([1, 2,3, 5,4 ,5 ,6, 7, 100 ]))

2.03100960115899
32.0147535433969


#### Interquartile range

It is the difference between the 75th and 25th percentile values of the given data. It is much less affected by outliers.

In [None]:
def interquantile_range(x):
  percentile_25th = quantile(x, 0.25)
  percentile_75th = quantile(x, 0.75)
  return percentile_75th - percentile_25th

In [None]:
print(interquantile_range([1, 2,3, 5,4 ,5 ,6, 7 ]))
print(interquantile_range([1, 2,3, 5,4 ,5 ,6, 7, 100 ]))

3
3


# Correlation



In [None]:
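# Note: the two lists differ in length (11 vs 10); zip() in dot() pairs values only up to the shorter list
num_friends = [2, 3, 2, 4, 20, 10, 8, 13, 11, 9, 100]
daily_minutes = [10, 10, 10, 20, 25, 10, 30, 100, 20, 30]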

In [None]:
# Function to compute dot product
def dot(x, y):
  return sum([xi * yi for xi, yi in zip(x, y)])

# Function to compute covariance
def covariance(x, y):
  n = len(x)
  return dot(de_mean(x), de_mean(y)) / (n - 1)

In [None]:
covariance(num_friends, daily_minutes)

60.70000000000001

Interpreting covariance:

1. A large positive value indicates a positive relationship (y tends to be large when x is large)
2. A large negative value indicates a negative relationship (y tends to be small when x is large)
3. A value close to 0 indicates no linear relationship between x and y
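The three cases can be illustrated with small hand-made lists. This is a minimal sketch: the helper functions are repeated so the cell runs on its own, and the data (`x`, `y_up`, `y_down`, `y_flat`) is invented purely for illustration.

```python
def mean(x):
  return sum(x) / len(x)

def de_mean(x):
  mu = mean(x)
  return [xi - mu for xi in x]

def dot(x, y):
  return sum(xi * yi for xi, yi in zip(x, y))

def covariance(x, y):
  return dot(de_mean(x), de_mean(y)) / (len(x) - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises with x
y_down = [10, 8, 6, 4, 2]  # falls as x rises
y_flat = [3, 1, 2, 1, 3]   # no linear trend in x

print(covariance(x, y_up))    # 5.0
print(covariance(x, y_down))  # -5.0
print(covariance(x, y_flat))  # 0.0
```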

Problems with covariance:

* Its unit is hard to interpret. In our example the variables are 'number of friends' and 'daily minutes', so the unit of covariance is 'friends × minutes per day', which doesn't mean much.
* It depends on the scale of the variables. If every user had twice as many friends with the same daily minutes, the covariance would be twice as large, even though the relationship is the same.
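The scale problem is easy to see numerically. In this sketch the lists `friends` and `minutes` are hypothetical values, not the notebook's data, and the helpers are repeated so it runs standalone.

```python
def mean(x):
  return sum(x) / len(x)

def de_mean(x):
  mu = mean(x)
  return [xi - mu for xi in x]

def dot(x, y):
  return sum(xi * yi for xi, yi in zip(x, y))

def covariance(x, y):
  return dot(de_mean(x), de_mean(y)) / (len(x) - 1)

friends = [1, 2, 3, 4, 5]       # hypothetical friend counts
minutes = [10, 20, 20, 30, 40]  # hypothetical daily minutes

doubled_friends = [2 * f for f in friends]

print(covariance(friends, minutes))          # 17.5
# Same relationship, but the covariance doubles along with the scale of x
print(covariance(doubled_friends, minutes))  # 35.0
```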

For these reasons we look at `correlation`, which divides the covariance by the standard deviations of both variables.

In [None]:
def correlation(x, y):
  stddev_x = standard_deviation(x)
  stddev_y = standard_deviation(y)
  if stddev_x > 0 and stddev_y > 0:
    return covariance(x, y) / stddev_x / stddev_y
  else:
    return 0 # If deviation is 0 there's no correlation
  

In [None]:
correlation(num_friends, daily_minutes)

0.07944901270627079

In [None]:
# If each user had twice as many friends
correlation([n * 2 for n in num_friends], daily_minutes)

0.07944901270627079

Correlation is unitless and always lies between +1 (perfect positive correlation) and -1 (perfect negative correlation).
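The two extremes can be reproduced with perfectly linear data. The following is a minimal self-contained sketch with invented lists; results are rounded because floating-point arithmetic may leave them a hair away from exactly 1 or -1.

```python
import math

def mean(x):
  return sum(x) / len(x)

def de_mean(x):
  mu = mean(x)
  return [xi - mu for xi in x]

def dot(x, y):
  return sum(xi * yi for xi, yi in zip(x, y))

def covariance(x, y):
  return dot(de_mean(x), de_mean(y)) / (len(x) - 1)

def standard_deviation(x):
  return math.sqrt(covariance(x, x))  # variance is the covariance of x with itself

def correlation(x, y):
  stddev_x = standard_deviation(x)
  stddev_y = standard_deviation(y)
  if stddev_x > 0 and stddev_y > 0:
    return covariance(x, y) / stddev_x / stddev_y
  return 0

x = [1, 2, 3, 4, 5]
y_up = [3, 5, 7, 9, 11]    # y = 2x + 1
y_down = [10, 8, 6, 4, 2]  # y = 12 - 2x

print(round(correlation(x, y_up), 6))    # 1.0
print(round(correlation(x, y_down), 6))  # -1.0
```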

A value such as 0.2 indicates a weak positive correlation.

By that measure, the correlation between `num_friends` and `daily_minutes` seems weak (`0.08`), but on examining the data we see there's an outlier.

One person has 100 friends but spends only 20 minutes a day, and correlation is very sensitive to outliers.

Let's examine the correlation after removing this outlier.

In [None]:
correlation(num_friends[:-1], daily_minutes[:-1])

0.4079645914642422

There seems to be a moderate positive correlation (about `0.41`) between the number of friends and the daily minutes spent.

# Simpson's paradox 

# Correlation does not equal Causation

# Correlation does not indicate how large a relation is