# Part 1: Descriptive Statistics and Data Exploration

## What is data?

- Data is information
- Everything around us is or can become data

## Types of data

- Quantitative
    - Discrete
    - Continuous
- Categorical
    - Nominal
    - Ordinal
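
As a rough illustration of how these four types might look in pandas, the sketch below builds a tiny made-up table. The column names and values are hypothetical and not part of any dataset used later.

```python
import pandas as pd

# Hypothetical toy data, one column per data type
people = pd.DataFrame({
    'siblings': [0, 2, 1],                                   # quantitative, discrete
    'height_cm': [170.2, 165.0, 181.5],                      # quantitative, continuous
    'eye_color': pd.Categorical(['brown', 'blue', 'brown']), # categorical, nominal
    'shirt_size': pd.Categorical(['M', 'S', 'L'],
                                 categories=['S', 'M', 'L'],
                                 ordered=True),              # categorical, ordinal
})

print(people.dtypes)

# Ordinal categories have an order, so min/max make sense
smallest_size = people['shirt_size'].min()
print(smallest_size)
```

Only the ordinal column supports comparisons such as `min()`; asking for the smallest eye color has no meaning.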

## Descriptive statistics
- help understand and describe data
- first step of any data analysis

## Measures of Central Tendency - what's "typical"?

Example: we have the ages of five people:  ages = [25, 30, 28, 32, 27]

### Mean - the average

sum up all the values and divide by the number of values:

`sum(ages) / len(ages) = 142 / 5 = 28.4`

- uses all data points equally
- great for symmetrical data without extreme values
- what happens when one person in the group is 95?
 
    -> very sensitive to outliers

In [None]:
import numpy as np
ages = [25, 30, 28, 32, 27]
np.mean(ages)
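
To make the outlier question concrete, here is a small sketch that adds a hypothetical sixth person aged 95 and recomputes the mean. The name `ages_with_outlier` is only for illustration.

```python
import numpy as np

ages = [25, 30, 28, 32, 27]
ages_with_outlier = ages + [95]  # hypothetical sixth person aged 95

mean_before = np.mean(ages)              # 142 / 5
mean_after = np.mean(ages_with_outlier)  # 237 / 6

print(mean_before)  # 28.4
print(mean_after)   # 39.5
```

A single extreme value pulls the mean up by more than 11 years, even though five of the six people are in their twenties or early thirties.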

### Median - the middle value
- sort the data from smallest to largest and take the middle value
- if the number of data points is even, take the average of the two middle values

remember ages = [25, 30, 28, 32, 27]

sorted_ages = [25, 27, 28, 30, 32]

median is 28

- what happens if one person is 90?

  -> not sensitive to outliers

- we can also cut the sorted data at other positions: the values at 25% and 75% are the first and third quartiles; in general, the value at x% is the xth percentile

In [None]:
np.median(ages)
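
The sketch below repeats the outlier experiment for the median and also computes the quartiles of the original ages. It assumes NumPy's default (linear interpolation) percentile method; other methods can give slightly different quartiles on small samples.

```python
import numpy as np

ages = [25, 30, 28, 32, 27]
ages_with_outlier = ages + [90]  # hypothetical sixth person aged 90

median_before = np.median(ages)              # middle of [25, 27, 28, 30, 32]
median_after = np.median(ages_with_outlier)  # average of 28 and 30

print(median_before)  # 28.0
print(median_after)   # 29.0

# First and third quartiles (25th and 75th percentiles) of the original ages
q1, q3 = np.percentile(ages, [25, 75])
print(q1, q3)  # 27.0 30.0
```

The median moves by only one year when the 90-year-old joins the group.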

### Mode - the most frequent

- the value that appears most frequently in the data
- use case: categorical data
    - there is no mean or median (no 'average' favourite color)
    - but there is a most common value
- data can be unimodal (one mode), multimodal (several modes), or have no mode at all
- example:
    - favourite colors: [red, blue, green, red, yellow, blue, red]
    - mode is red

In [None]:
np.mode(ages)  # raises AttributeError: NumPy has no mode function

Whoops! Looks like numpy doesn't implement mode.

That's because numpy focuses on array operations and leaves the mode to other libraries. But we still have pandas!

In [None]:
import pandas as pd

ages = [25, 30, 28, 32, 27]
df = pd.DataFrame({'age': ages})
df.age.mode()
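
A short sketch of how `Series.mode()` behaves in the three cases mentioned above: one mode, several modes, and no repeated value. The `shoe_sizes` list is made up for illustration.

```python
import pandas as pd

# Unimodal: the favourite-colors example
colors = pd.Series(['red', 'blue', 'green', 'red', 'yellow', 'blue', 'red'])
color_modes = colors.mode().to_list()
print(color_modes)  # ['red']

# Multimodal: two values tie for most frequent (hypothetical data)
shoe_sizes = pd.Series([38, 38, 40, 42, 42])
shoe_modes = shoe_sizes.mode().to_list()
print(shoe_modes)  # [38, 42]

# No repeated value: pandas returns every value, sorted
ages = pd.Series([25, 30, 28, 32, 27])
age_modes = ages.mode().to_list()
print(age_modes)  # [25, 27, 28, 30, 32]
```

`mode()` always returns a Series because there can be more than one mode. For the ages, every value appears exactly once, so the result lists all five and is not very informative.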

In [None]:
# using a bigger dataset
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris['data']

In [None]:
df.head()

In [None]:
df.columns = ['s_length', 's_width', 'p_length', 'p_width']
df['target'] = iris['target']

In [None]:
df.groupby('target').s_length.mean()

In [None]:
df.groupby('target').p_length.median()

In [None]:
df.groupby('target').p_width.agg(lambda x: x.mode().to_list())

## Measures of Variability/Spread

### Range - simple, but limited
- difference between min and max in the dataset
- ages = [25, 30, 28, 32, 27]
    - max = 32
    - min = 25
    - range = 7
- if we include 90, range jumps to 65

In [None]:
df.s_length.max() - df.s_length.min()

In [None]:
df.groupby('target').agg(lambda x: x.max() - x.min())

### Interquartile Range (IQR) - robust spread

New terms:
- quartiles: the three cut points that split the sorted data into four equal parts (25% each)
- xth percentile: the value below which x% of the data fall
- 1st quartile is equal to 25th percentile
- 2nd quartile is equal to 50th percentile is equal to median
- 3rd quartile is equal to 75th percentile

#### IQR

- range between the 25th and 75th percentile
- ignores the extreme 25% on both ends
- range of the middle 50% of the data

In [None]:
quartiles = df.s_length.quantile([0.25, 0.75])
iqr = quartiles[0.75] - quartiles[0.25]
iqr

In [None]:
df.groupby('target').s_length.agg(lambda x: x.quantile(0.75) - x.quantile(0.25))
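
To see why the IQR is called robust, the sketch below compares range and IQR on the small ages example, with and without a hypothetical 90-year-old. The helper names `value_range` and `iqr` are made up for this illustration, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def value_range(values):
    """Difference between the largest and smallest value."""
    return np.max(values) - np.min(values)

def iqr(values):
    """Difference between the 75th and 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

ages = [25, 30, 28, 32, 27]
ages_with_outlier = ages + [90]  # hypothetical sixth person aged 90

print(value_range(ages), value_range(ages_with_outlier))  # 7 65
print(iqr(ages), iqr(ages_with_outlier))                  # 3.0 4.25
```

The range grows roughly ninefold, while the IQR barely changes, because the outlier sits outside the middle 50% of the data.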

### Variance and Standard Deviation

- intuition: imagine every data point as a person; variance and std tell us how much they deviate from their home (mean)

- Variance: calculates the average of the squared differences from the mean
    - square the differences so that negative and positive deviations don't cancel each other out
    - gives us a sense of the average squared spread

- Standard Deviation (σ): square root of the variance
    - brings the value back to the original scale of the data
    - easier to interpret
    - small std: data points are close to the mean; large std: data points are widely scattered

- Average height is 170 cm with a standard deviation of 5 cm. How tall are most people?

In [None]:
df.s_length.std()  # pandas returns the sample standard deviation (ddof=1) by default
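
The definitions above can be worked through step by step on the ages example. This is a minimal sketch, with variable names chosen for illustration. It also shows the two common conventions: dividing by n (population variance, NumPy's default) or by n - 1 (sample variance, the pandas default).

```python
import numpy as np

ages = np.array([25, 30, 28, 32, 27])

mean = ages.mean()                 # 28.4
squared_diffs = (ages - mean) ** 2 # squared distance of each point from the mean

population_var = squared_diffs.mean()                 # divide by n, about 5.84
sample_var = squared_diffs.sum() / (len(ages) - 1)    # divide by n - 1, about 7.3

population_std = np.sqrt(population_var)  # back on the original scale (years)
sample_std = np.sqrt(sample_var)

print(round(population_var, 2), round(sample_var, 2))
print(round(population_std, 2), round(sample_std, 2))

# The manual results match NumPy's built-ins with the matching ddof
print(np.isclose(population_var, np.var(ages)))
print(np.isclose(sample_std, np.std(ages, ddof=1)))
```

On small samples the two conventions differ noticeably, so it is worth checking which one a library uses before comparing numbers.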

In [None]:
df.describe()

## Histogram

In [None]:
import matplotlib.pyplot as plt
df.p_width.hist(bins=20)
_ = plt.title('petal width histogram')

In [None]:
_ = df.hist(column='s_width', by='target', bins=20, layout=(1,3), figsize=(15,5))

## Box Plot

In [None]:
df.p_length.plot(kind='box')

In [None]:
import seaborn as sns
sns.boxplot(x='target', y='p_length', data=df)
_ = plt.title('Iris dataset petal length by target')

## Bar Chart

In [None]:
df[df.target==0].s_width.value_counts().sort_index().plot.bar()

## Scatter Plot

In [None]:
df.plot.scatter(x='p_width', y='p_length', c=df.target, colormap='viridis')

## Recap
- quantitative and categorical data
- measures of 'typical' values (mean, median, mode)
- measures of 'spread' (range, variance, std, IQR)
- visualising data with histograms, box plots, bar charts, scatter plots
- key takeaway: always start by understanding your data!

## Questions?