<a href="https://colab.research.google.com/github/costaivo/quotes-app/blob/main/PostRead_Pandas03_DescriptiveStats.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Supplementary Post Read for Basic Descriptive Statistics



## Content


  - **Basic Descriptive Statistics**

    - Measures of central tendency - `mean`, `median`, `mode`
    - Measures of Variability/Dispersion
    - Percentiles, Quartiles, IQR

In [None]:
import pandas as pd

In [None]:
!gdown 1A1tfiarU4O21EoRdsANyhqi39I9eISjr

Downloading...
From: https://drive.google.com/uc?id=1A1tfiarU4O21EoRdsANyhqi39I9eISjr
To: /content/descriptive_stats_data.csv
  0% 0.00/106k [00:00<?, ?B/s]100% 106k/106k [00:00<00:00, 57.0MB/s]


In [None]:
data = pd.read_csv('/content/descriptive_stats_data.csv')

In [None]:
data

Unnamed: 0.1,Unnamed: 0,START_DATE,END_DATE,CATEGORY,START,STOP,MILES,PURPOSE,year
0,0,2016-01-01 21:11:00,2016-01-01 21:17:00,Business,Fort Pierce,Fort Pierce,5.1,Meal/Entertain,2016
1,1,2016-01-02 01:25:00,2016-01-02 01:37:00,Business,Fort Pierce,Fort Pierce,5.0,,2016
2,2,2016-01-02 20:25:00,2016-01-02 20:38:00,Business,Fort Pierce,Fort Pierce,4.8,Errand/Supplies,2016
3,3,2016-01-05 17:31:00,2016-01-05 17:45:00,Business,Fort Pierce,Fort Pierce,4.7,Meeting,2016
4,4,2016-01-06 14:42:00,2016-01-06 15:49:00,Business,Fort Pierce,West Palm Beach,63.7,Customer Visit,2016
...,...,...,...,...,...,...,...,...,...
1145,1150,2016-12-31 01:07:00,2016-12-31 01:14:00,Business,Kar?chi,Kar?chi,0.7,Meeting,2016
1146,1151,2016-12-31 13:24:00,2016-12-31 13:42:00,Business,Kar?chi,Unknown Location,3.9,Temporary Site,2016
1147,1152,2016-12-31 15:03:00,2016-12-31 15:38:00,Business,Unknown Location,Unknown Location,16.2,Meeting,2016
1148,1153,2016-12-31 21:32:00,2016-12-31 21:50:00,Business,Katunayake,Gampaha,6.4,Temporary Site,2016


## Basic Descriptive Statistics



### Measures of central tendency

#### What if you want to estimate the distance of a future ride based on previous data?

#### Which estimate should we use?

- The most basic estimate of location is the mean, or average value.
- It is the sum of all values divided by the number of values.
- Mean:
$\bar{x} = \sum_{i=1}^n \frac{x_i}{n}$


In [None]:
data['MILES'].mean()

10.584608695652161

Let's check the minimum and maximum values of `MILES` in the table.
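A minimal sketch of that check. It uses only the five `MILES` values visible in the first rows of the table above rather than the full file, so its numbers differ from the full-dataset results; `sample_miles` is a name introduced here for illustration.

```python
import pandas as pd

# The five MILES values from the first rows of the table above (not the full dataset)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7], name="MILES")

smallest = sample_miles.min()    # shortest ride in the sample
largest = sample_miles.max()     # longest ride in the sample
average = sample_miles.mean()    # sum of the values divided by their count

print(smallest, largest, round(average, 2))
```

On this small sample, the single 63.7-mile ride lifts the mean to roughly 16.7 miles, far above the four short rides.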

In our case, a few customers have taken very long rides, and those rides pull the average upward.

To avoid this problem, we need an estimator that is robust to extreme values.

#### What could be a robust estimator?
  
One example is **median**.

#### What is median ?

  - It is the middle value of the sorted data (for an even number of values, the average of the two middle values).
  
#### How is the median more robust than the mean?

  - The mean uses every value, so a single extreme value can shift it.

  - The median depends only on the values at the center of the sorted data.

  - So it is barely affected by extreme values (outliers).


  - Example:
    - Suppose you want to estimate the typical income of people in Mumbai.
    - Mukesh Ambani lives in South Bombay, and including his income in the average has a huge impact.
    - As a result, the average income for that area looks much higher than what most residents earn.
    - The right metric here is the median rather than the average.
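The same effect can be sketched with a handful of numbers. The incomes below are hypothetical values made up for illustration (arbitrary units), not real data.

```python
import pandas as pd

# Hypothetical incomes, invented purely to illustrate the effect of one extreme value
incomes = pd.Series([30, 35, 40, 45, 50])

# The same sample with one extremely large income appended
incomes_with_outlier = pd.concat([incomes, pd.Series([10_000])], ignore_index=True)

mean_before, mean_after = incomes.mean(), incomes_with_outlier.mean()
median_before, median_after = incomes.median(), incomes_with_outlier.median()

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

One extreme value moves the mean from 40 to 1700, while the median only moves from 40 to 42.5.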

Let's calculate the median of `MILES` now

In [None]:
data['MILES'].median()

6.0

The median is lower than the mean.

#### What do we understand from these values?

- Only a few people travel long distances by cab.

- Because of their journeys, the overall mean is shifted towards the maximum value.

- Also, the maximum value of `MILES` is much higher than the mean.

- This shows that the data is **skewed** towards the maximum (right-skewed).

- These extreme values can give a misleading picture of the data.

- Such values in any feature/column that are very distant from the others are called **outliers**.
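One rough way to check the direction of skew is to compare the mean with the median, or to compute the sample skewness with `Series.skew()`. The sketch below again uses only the five `MILES` values from the head of the table, so it illustrates the idea rather than reproducing the full-dataset result.

```python
import pandas as pd

# Five MILES values from the head of the table above (illustration only)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])

mean_miles = sample_miles.mean()
median_miles = sample_miles.median()
skewness = sample_miles.skew()   # positive values indicate a longer right tail

print(mean_miles > median_miles, skewness > 0)
```

A mean well above the median together with a positive skewness points to a few large values on the right side of the distribution.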


#### But why do outliers occur in data?

- Because of errors while entering the data
  - E.g. a change in units: values are entered in kilometers instead of meters.

- Because of anomalies in a few samples

Now we know what outliers are and why they are a problem.

#### So how should we treat outliers?

1. Remove them
  - Common when only very few outliers exist

2. Handle them using algorithms
  - E.g. SVM, KNN, etc.
  - You will study more about this later

Note: If there are too many outliers in a dataset, the dataset itself may not be reliable enough to analyze.

#### Let's say you want to know the most common purpose for booking a cab

- The mode will help us answer this question.

- `Mean`, `Median` and `Mode` are together called **Measures of Central Tendency**

#### And what is `Mode`?
  - The feature value that occurs most frequently
  - In case of numeric features: the feature value that is most likely to occur

Let's check it for the `PURPOSE` feature in our dataset

In [None]:
data['PURPOSE'].value_counts()

Meeting            186
Meal/Entertain     160
Errand/Supplies    128
Customer Visit     101
Temporary Site      50
Between Offices     18
Moving               4
Airport/Travel       3
Charity ($)          1
Commute              1
Name: PURPOSE, dtype: int64

#### What can we see here?
  - The "Meeting" category has the highest number of rides.
  - Hence "Meeting" is the **mode** for this feature.
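pandas can also return the mode directly with `Series.mode()`. The sketch below uses a short, hypothetical list of purposes (made up for illustration, including one missing value) instead of the real column.

```python
import pandas as pd

# Hypothetical PURPOSE values, invented for illustration
purposes = pd.Series(
    ["Meeting", "Meal/Entertain", "Meeting", "Customer Visit", "Meeting", None]
)

counts = purposes.value_counts()   # frequency of each value; missing values are skipped by default
most_common = purposes.mode()      # returns a Series because several values can tie

print(counts)
print(most_common.iloc[0])
```

`mode()` returns a Series rather than a single value, so `.iloc[0]` picks the first entry when there is only one winner.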

### Measures of Variability


#### So is the median enough for understanding the distance of a future journey?
  - No, because it does not tell us how the data is spread.

This spread is known as **variability in the dataset**.

#### When to use variability?

- To understand the diversity in the sample/customers.

#### How can we calculate variability?

- There are different ways to measure dispersion.

- One simple approach is the mean absolute deviation.

#### What is mean absolute deviation?

- Mathematically, 
  - Mean absolute deviation : $ \sum_{i=1}^n \frac{|x_i- \bar{x}|}{n}$\
where $\bar{x}$ is the sample mean.
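The formula translates almost directly into pandas. This is a minimal sketch on the five `MILES` values from the head of the table, not on the full file; the variable names are introduced here for illustration.

```python
import pandas as pd

# Five MILES values from the head of the table above (illustration only)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])

# |x_i - mean| for every ride
deviations = (sample_miles - sample_miles.mean()).abs()

# Average of those absolute deviations
mean_abs_dev = deviations.mean()

print(round(mean_abs_dev, 3))
```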

If we want a metric that is more sensitive to large deviations, we can square the deviations instead of taking their absolute values.

#### What should we use in this case?

- **Variance** : $ s^2 = \frac{\sum_{i=1}^n(x_i- \bar{x})^2}{n}$

- **Standard deviation** : $ s = \sqrt{s^2} $

- Standard deviation is much easier to interpret than variance, since it is on the same scale as the original data.
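As a sketch, the two formulas can be computed by hand and compared with pandas. The example assumes the five `MILES` values from the head of the table; the `ddof` argument of `Series.std()` controls whether the sum is divided by `n` or by `n - 1`.

```python
import numpy as np
import pandas as pd

# Five MILES values from the head of the table above (illustration only)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])

n = len(sample_miles)
squared_dev = (sample_miles - sample_miles.mean()) ** 2

variance_n = squared_dev.sum() / n      # divides by n, as in the formula above
std_n = np.sqrt(variance_n)

std_ddof0 = sample_miles.std(ddof=0)    # pandas with ddof=0 also divides by n
std_default = sample_miles.std()        # pandas default (ddof=1) divides by n - 1

print(round(std_n, 3), round(std_ddof0, 3), round(std_default, 3))
```

The hand-computed value matches `std(ddof=0)`, while the pandas default is slightly larger because it divides by `n - 1`.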

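So let's calculate the standard deviation of `MILES`. Note that pandas' `std()` divides by $n-1$ instead of $n$ by default (`ddof=1`); with over a thousand rows the difference is negligible.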

In [None]:
data['MILES'].std()

21.62324116259871

#### What can we see here?

- The standard deviation (about 21.6 miles) is roughly twice the mean (about 10.6 miles)
- The ride distances are very widely spread



#### But what's the problem with these deviation estimates?
  - None of them is robust to outliers

#### Why are they not robust?
  - They use the mean for their calculations

#### What can be a robust deviation estimator then?

- A robust estimate of variability is the median absolute deviation (MAD)
- MAD = $ Median(|x_1 - m|, |x_2 - m|, ... , |x_N - m|)$\
where m = median 

Let's calculate this now

In [None]:
from scipy import stats

In [None]:
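# median_absolute_deviation is gone from newer SciPy; median_abs_deviation with scale='normal' applies the same ~1.4826 scaling
print(stats.median_abs_deviation(data['MILES'], scale='normal'))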

5.485619999999999
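The MAD formula can also be checked by hand. The sketch below, again on the five `MILES` values from the head of the table (not the full file), computes the unscaled MAD and then multiplies it by 1.4826, a conventional factor that makes the MAD comparable to the standard deviation for normally distributed data. The SciPy result printed above appears to include this factor (5.4856 / 1.4826 = 3.7), which would put the unscaled MAD of the full column at about 3.7 miles; that reading is an inference from the numbers, not something the notebook states.

```python
import pandas as pd

# Five MILES values from the head of the table above (illustration only)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])

m = sample_miles.median()                    # center used by the MAD
raw_mad = (sample_miles - m).abs().median()  # median of |x_i - m|, exactly as in the formula
scaled_mad = 1.4826 * raw_mad                # rescaled to be comparable with the standard deviation

print(round(raw_mad, 3), round(scaled_mad, 3), round(sample_miles.std(), 3))
```

Even with the 63.7-mile ride in the sample, the MAD stays near 0.2 miles, while the standard deviation is above 26 miles.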


What if we want an estimate that is almost completely unaffected by outliers?

One such estimate is known as **Estimate of Percentiles**.

#### What is Estimate of Percentiles?

- It estimates dispersion by looking at the spread of the sorted data.

- $p^{th}$ percentile: a value such that at least p percent of the values are less than or equal to it.
  - E.g. the median is the same as the 50th percentile.

Let's calculate the 30th percentile of `MILES`

In [None]:
import numpy as np

In [None]:
# About 30% of the rides are 3.2 miles or shorter
np.percentile(data['MILES'], 30)

3.2

This is much lower than the maximum value of `MILES`.
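For reference, pandas offers the same calculation through `Series.quantile`, which takes a fraction between 0 and 1 instead of a percentage. A small sketch on the five `MILES` values from the head of the table (illustration only):

```python
import numpy as np
import pandas as pd

# Five MILES values from the head of the table above (illustration only)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])

p30_numpy = np.percentile(sample_miles, 30)   # percentile given on a 0-100 scale
p30_pandas = sample_miles.quantile(0.30)      # same point given on a 0-1 scale

p50 = np.percentile(sample_miles, 50)         # the 50th percentile is the median

print(round(p30_numpy, 2), round(p30_pandas, 2), p50 == sample_miles.median())
```

Both functions interpolate linearly between neighbouring sorted values by default, so they agree with each other.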


The 25th, 50th and 75th percentiles are known as quartiles (Q1, Q2 and Q3 respectively).

The **difference Q3 - Q1** is also used for measuring variability.

It is called the **interquartile range (IQR)**.

#### Why is IQR used as a measure of variability?
  - The range from Q1 to Q3 contains the middle 50% of the data
  - The 25th and 75th percentiles are not affected by extreme values

Let's check the IQR of `MILES` now

In [None]:
stats.iqr(data['MILES'])

7.5
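To see why the IQR is considered robust, one can compare it with the standard deviation on a sample with and without an extreme value. This is a sketch on the five `MILES` values from the head of the table; the helper `iqr_of` is a name introduced here for illustration.

```python
import pandas as pd

def iqr_of(values: pd.Series) -> float:
    """Interquartile range computed as Q3 - Q1."""
    return values.quantile(0.75) - values.quantile(0.25)

# Five MILES values from the head of the table above (illustration only)
with_outlier = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])
without_outlier = with_outlier[with_outlier < 60]   # drop the single long ride

iqr_with, iqr_without = iqr_of(with_outlier), iqr_of(without_outlier)
std_with, std_without = with_outlier.std(), without_outlier.std()

print(f"IQR: {iqr_without:.3f} -> {iqr_with:.3f}")
print(f"std: {std_without:.3f} -> {std_with:.3f}")
```

Adding the long ride barely changes the IQR, while the standard deviation grows by more than a hundredfold.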


#### How to identify outliers using this?
- According to the 1.5 × IQR rule, outliers are:
  - Points below Q1 - 1.5 × IQR
  - Points above Q3 + 1.5 × IQR
- We will study more about this in later lectures.
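A minimal sketch of the rule, assuming the five `MILES` values from the head of the table rather than the full file; `lower_fence` and `upper_fence` are names introduced here for illustration.

```python
import pandas as pd

# Five MILES values from the head of the table above (illustration only)
sample_miles = pd.Series([5.1, 5.0, 4.8, 4.7, 63.7])

q1 = sample_miles.quantile(0.25)
q3 = sample_miles.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # points below this are flagged
upper_fence = q3 + 1.5 * iqr   # points above this are flagged

is_outlier = (sample_miles < lower_fence) | (sample_miles > upper_fence)
outliers = sample_miles[is_outlier]
inliers = sample_miles[~is_outlier]

print(outliers.tolist(), inliers.tolist())
```

On this sample only the 63.7-mile ride falls outside the fences; the same mask could be applied to `data['MILES']` to flag long rides in the full dataset.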