### MAD (Median Absolute Deviation)

Median: the middle value of the sorted data when the number of points is odd, or the average of the two middle values when it is even.
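A quick check of both cases, using Python's standard-library `statistics.median`. The two sample lists are made up for illustration and are not part of the sales data used later.

```python
from statistics import median

odd_values = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is taken
even_values = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> the two middle values are averaged

odd_median = median(odd_values)
even_median = median(even_values)
print(odd_median, even_median)
```

Note that the input does not need to be sorted first; `median` handles the ordering internally.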

**Why Modified z-score**
1. An individual extreme value can still fall within 3 standard deviations of the mean, which is the usual cutoff.
2. This happens because the mean and standard deviation are themselves pulled away from the main cluster of values toward the extreme value, since its large magnitude inflates both.
3. As a result, the outlier goes undetected, and an ML model trained on the data may not learn the pattern properly.

**Solution** 
The modified z-score is based on the median, which reflects only the main cluster of values, so it detects such outliers better than the traditional z-score.


In [3]:
import pandas as pd 

data = {
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Tablet', 
                'Smartphone', 'Headphones', 'Charger', 'USB Drive', 'Speaker'],
    'Sales': [100, 105, 98, 102, 101, 500, 99, 103, 100, 97]  # 500 is an outlier detectable by Modified Z-score
}

df = pd.DataFrame(data)
df 

Unnamed: 0,Product,Sales
0,Laptop,100
1,Mouse,105
2,Keyboard,98
3,Monitor,102
4,Tablet,101
5,Smartphone,500
6,Headphones,99
7,Charger,103
8,USB Drive,100
9,Speaker,97


### Formulae
normal z_score = `(X - mean(X)) / std(X)`

MAD = `median(|X - median(X)|)`

modified_z_score = `0.6745 * (X - median(X)) / MAD`
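The notebook does not explain the constant 0.6745. It is the 75th percentile of the standard normal distribution: for normally distributed data, MAD is roughly 0.6745 times the standard deviation, so the scaling puts the modified z-score on about the same scale as the ordinary z-score. A minimal check, assuming Python 3.8+ for `statistics.NormalDist`:

```python
from statistics import NormalDist

# 75th percentile (0.75 quantile) of the standard normal distribution
k = NormalDist().inv_cdf(0.75)
print(round(k, 4))
```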

In [4]:
df.describe()

Unnamed: 0,Sales
count,10.0
mean,140.5
std,126.337511
min,97.0
25%,99.25
50%,100.5
75%,102.75
max,500.0


### Z_score 

In [5]:
df['z_score'] = (df.Sales - df.Sales.mean()) / df.Sales.std() 
df

Unnamed: 0,Product,Sales,z_score
0,Laptop,100,-0.32057
1,Mouse,105,-0.280993
2,Keyboard,98,-0.3364
3,Monitor,102,-0.304739
4,Tablet,101,-0.312655
5,Smartphone,500,2.845552
6,Headphones,99,-0.328485
7,Charger,103,-0.296824
8,USB Drive,100,-0.32057
9,Speaker,97,-0.344316


In [8]:
# Check against the common z-score cutoffs of +3 and -3
df[ (df.z_score > 3) | (df.z_score < -3)]

Unnamed: 0,Product,Sales,z_score


### Modified Z_score

In [12]:
import numpy as np 

median = df.Sales.median() 
individual_mdn = np.abs(df.Sales - median) 
individual_mdn

0      0.5
1      4.5
2      2.5
3      1.5
4      0.5
5    399.5
6      1.5
7      2.5
8      0.5
9      3.5
Name: Sales, dtype: float64

##### Note: Observe how the Smartphone sales value (idx=5), which is 500, stands out here with a much larger deviation (399.5), because the deviations are measured from the median.

In [13]:
MAD = individual_mdn.median() 
MAD

np.float64(2.0)

In [14]:
def calc_modified_z_score(x, median=median, MAD=MAD): 
    return 0.6745 * (x - median) / MAD

In [15]:
df['modified_Z_score'] = df.Sales.apply(calc_modified_z_score) 
df

Unnamed: 0,Product,Sales,z_score,modified_Z_score
0,Laptop,100,-0.32057,-0.168625
1,Mouse,105,-0.280993,1.517625
2,Keyboard,98,-0.3364,-0.843125
3,Monitor,102,-0.304739,0.505875
4,Tablet,101,-0.312655,0.168625
5,Smartphone,500,2.845552,134.731375
6,Headphones,99,-0.328485,-0.505875
7,Charger,103,-0.296824,0.843125
8,USB Drive,100,-0.32057,-0.168625
9,Speaker,97,-0.344316,-1.180375


In [16]:
# Find outliers with modified_z_score 
df[ (df.modified_Z_score > 3) | (df.modified_Z_score < -3)]

Unnamed: 0,Product,Sales,z_score,modified_Z_score
5,Smartphone,500,2.845552,134.731375


### Outcome
We finally get the expected output.

The normal z-score struggles to detect outliers when the dataset is small and skewed, because it uses the mean and standard deviation, which are themselves affected by extreme values.

The modified z-score, which uses the median absolute deviation (MAD), is robust to outliers, so it correctly flags 500 as an outlier.
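The steps of the walkthrough can be collected into one reusable function. This is a minimal sketch, not a definitive implementation: the name `modified_z_scores` and the zero-MAD guard are additions that do not appear in the notebook. The guard is there because MAD is 0 whenever more than half of the values are identical, which would make the division undefined.

```python
import numpy as np

def modified_z_scores(values, k=0.6745):
    """Return the modified z-score of every value, based on median and MAD."""
    x = np.asarray(values, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))  # median absolute deviation
    if mad == 0:
        # Assumption of this sketch: fail loudly instead of dividing by zero
        raise ValueError("MAD is zero; the modified z-score is undefined for this data")
    return k * (x - med) / mad

sales = [100, 105, 98, 102, 101, 500, 99, 103, 100, 97]
scores = modified_z_scores(sales)

# Same cutoff of 3 as used in the notebook
outlier_positions = np.flatnonzero(np.abs(scores) > 3)
print(outlier_positions, scores[outlier_positions])
```

Applied to the `Sales` column, this flags only position 5 (Smartphone), matching the result obtained step by step.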