
# Statistics

Statistics refers to the mathematics and techniques we use to understand data.

## Describing a single set of data


### Central tendencies

The measures of central tendency covered here are:
1. Mean
2. Median
3. Quantile
4. Mode

#### Mean

The mean gives us a notion of where the data is centred.

In [1]:
# Compute mean of several data points
def mean(x):
  return sum(x) / len(x)

In [2]:
num_friends = [100, 49, 41, 40, 25]
mean(num_friends)

51.0

> The mean **depends on every data point in the dataset**.

#### Median

The median is the middle value of the sorted data or, when the number of data points is even, the average of the two middle values.

In [5]:
# Function to compute median

def median(x):
  n = len(x)
  sorted_x = sorted(x)
  midpoint = n // 2

  # If length of x is odd
  if n % 2 != 0:
    return sorted_x[midpoint]
  else:
    return (sorted_x[midpoint-1] + sorted_x[midpoint]) / 2

In [7]:
median(num_friends)

41

In [8]:
median([1, 3, 4, 10, 12])

4

In [9]:
median([1, 3, 4, 7, 10, 12])

5.5

Finding the median requires sorting the data first.

The mean changes whenever any value in the data changes, but the median may not. If we increase a single data point by a small amount `e`, the median may
1. remain the same
2. increase by `e`
3. increase by less than `e`

> The mean is **highly sensitive to outliers**. However, the median is not as sensitive.
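The two claims above can be checked numerically. The sketch below uses `mean` and `median` from Python's standard `statistics` module as stand-ins for the notebook's own functions. The inflated value `1000` and the step `e = 0.5` are arbitrary choices made for illustration.

```python
from statistics import mean, median

num_friends = [100, 49, 41, 40, 25]
with_outlier = [1000, 49, 41, 40, 25]  # same data, largest value inflated

# The outlier drags the mean far away but leaves the median where it was
mean_shift = mean(with_outlier) - mean(num_friends)
median_shift = median(with_outlier) - median(num_friends)
print(mean_shift, median_shift)

e = 0.5  # illustrative step size
odd = [1, 3, 4, 10, 12]
even = [1, 3, 4, 7, 10, 12]

# Case 1: increasing the largest value leaves the median unchanged
unchanged = median([1, 3, 4, 10, 12 + e]) - median(odd)
# Case 2: increasing the middle value of odd-length data moves the median by e
by_e = median([1, 3, 4 + e, 10, 12]) - median(odd)
# Case 3: increasing one of the two middle values of even-length data
# moves the median by less than e
by_less = median([1, 3, 4 + e, 7, 10, 12]) - median(even)
print(unchanged, by_e, by_less)
```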

#### Quantile

A quantile is a generalization of the median. The quantile at `p` is **the value below which a fraction `p` of the data lies**.

The **median is the 0.5 quantile** (the 50th percentile).

In [14]:
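def quantile(x, p):
  # Return the value at fraction p (0 <= p <= 1) of the sorted data
  n = len(x)
  # Clamp the index so that p = 1 returns the largest value instead of raising IndexError
  p_index = min(int(n * p), n - 1)
  sorted_x = sorted(x)
  return sorted_x[p_index]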

In [17]:
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.5))
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.9))
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.4))
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.3))

6
10
5
4


In [21]:
print(quantile([1, 2, 3, 4, 5, 6, 7, 8, 9], 0.5))
print(median([1, 2, 3, 4, 5, 6, 7, 8, 9]))

5
5
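Note that the index-based `quantile` always returns an actual data point, so on the ten-value list above it gives 6 at `p = 0.5`, while `median` gives 5.5. One common way to make the two agree is to interpolate linearly between neighbouring sorted values. The function below is a minimal sketch of that idea. The name `quantile_interpolated` is hypothetical and not part of the notebook, and this is only one of several quantile conventions in use.

```python
def quantile_interpolated(x, p):
  """Hypothetical variant of quantile: interpolate between the two nearest sorted values."""
  if not x:
    raise ValueError("x must contain at least one value")
  if not 0 <= p <= 1:
    raise ValueError("p must lie between 0 and 1")
  sorted_x = sorted(x)
  # Fractional position of the quantile among the sorted values
  position = p * (len(sorted_x) - 1)
  lower = int(position)
  upper = min(lower + 1, len(sorted_x) - 1)
  fraction = position - lower
  return sorted_x[lower] + fraction * (sorted_x[upper] - sorted_x[lower])

print(quantile_interpolated([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 0.5))
print(quantile_interpolated([1, 2, 3, 4, 5, 6, 7, 8, 9], 0.5))
```

With this convention the 0.5 quantile matches the median for both even-length and odd-length data.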


#### Mode

The mode is the **most frequently occurring value in a dataset**. A dataset can have more than one mode, so the function below returns a list.

In [23]:
from collections import Counter

In [33]:
def mode(x):
  # Count occurrences of each value, then return every value tied for the highest count
  c = Counter(x)
  max_count = max(c.values())
  return [value for value, count in c.items() if count == max_count]



In [37]:
mode([1, 3, 1, 2, 3, 4, 3, 5, 3, 4, 1, 1])

[1, 3]
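As a cross-check, Python's standard library (3.8 and later) provides `statistics.multimode`, which also returns every value tied for the highest count. This is a sketch for comparison only and is not a replacement for the from-scratch version.

```python
from statistics import multimode

values = [1, 3, 1, 2, 3, 4, 3, 5, 3, 4, 1, 1]

# multimode lists the tied modes in the order they first appear in the data
modes = multimode(values)
print(modes)  # [1, 3]

# An empty input yields an empty list rather than an error
empty_modes = multimode([])
print(empty_modes)  # []
```

One difference worth knowing: the `mode` function above raises a `ValueError` on an empty list, because `max` is called on an empty sequence.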

### Dispersion

Dispersion measures how spread out the data is.
Typically, values close to 0 indicate almost no spread, and large values indicate that the data is widely spread.

Some measures of dispersion are:
1. Range
2. Variance
3. Standard deviation
4. Interquartile range




#### Range

The range is the difference between the largest and smallest values in a dataset.

In [39]:
# Named data_range so that it does not shadow Python's built-in range()
def data_range(x):
  return max(x) - min(x)

In [41]:
data_range([1, 12, 1, 3, 5, 5, 10, 8])

11

Range is very sensitive to outliers. For example:

In [42]:
data_range([0.001, 0.002, 0.006, 0.009, 100])

99.999

#### Variance

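Variance is a more involved measure of dispersion: the average squared deviation from the mean. For a dataset of $n$ values $x_1, \dots, x_n$ with mean $\mu$, it is defined as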

  $\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}$
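The formula translates almost directly into code. The sketch below follows the definition exactly as written, dividing by $n$, which is usually called the population variance. The sample variance divides by $n - 1$ instead. The function name `variance` is an assumption, and `statistics.pvariance` from the standard library is used only as a cross-check.

```python
from statistics import pvariance

def variance(x):
  """Population variance: the mean of squared deviations from the mean."""
  n = len(x)
  if n == 0:
    raise ValueError("variance requires at least one data point")
  mu = sum(x) / n
  return sum((x_i - mu) ** 2 for x_i in x) / n

num_friends = [100, 49, 41, 40, 25]
result = variance(num_friends)
print(result)
print(pvariance(num_friends))  # standard-library population variance, for comparison
```

For `num_friends` the mean is 51, the squared deviations sum to 3302, and dividing by 5 gives a variance of about 660.4.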