
# STEP 7: Handle Outliers (Complete Pandas Guide)

This notebook covers the **practical, interview-relevant ways** to
detect, analyze, and handle outliers in a Pandas DataFrame.

Focus: **statistical methods (Z-score), quantile-based methods (IQR, capping), visualization, and business rules**.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## 1. Sample Dataset with Outliers

In [None]:

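np.random.seed(42)  # fixed seed so the sample is reproducible

# Both columns must have the same length: 50 random values + 2 injected outliers = 52 rows
df = pd.DataFrame({
    "transaction_amount": np.append(np.random.randint(100, 5000, 50), [50000, 75000]),
    "salary": np.append(np.random.randint(30000, 120000, 50), [300000, 450000]),
})
df.head()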


## 2. Detect Outliers using Describe

In [None]:

df.describe()
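
As a minimal sketch of what to look for in the `describe()` output, the cell below asks for tail percentiles and compares the mean with the median. The toy values and the name `mean_to_median` are illustrative assumptions, not part of the dataset above.

```python
import pandas as pd

# Hypothetical toy column: five ordinary amounts plus one extreme value.
amounts = pd.Series([100, 200, 300, 400, 500, 50000], name="transaction_amount")

# Request tail percentiles in addition to the default quartiles.
summary = amounts.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99])

# A mean far above the median hints at a heavy right tail.
mean_to_median = summary["mean"] / summary["50%"]

print(summary)
print(f"mean / median = {mean_to_median:.1f}")
```

A large gap between `max` and the 75th percentile, or a mean well above the median, is a cue to inspect the column more closely before choosing a method.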


## 3. Visual Detection using Box Plot

In [None]:

plt.figure()
plt.boxplot(df['transaction_amount'])
plt.title("Transaction Amount Box Plot")
plt.show()


## 4. Detect Outliers using Quantiles (IQR Method)

In [None]:

Q1 = df['transaction_amount'].quantile(0.25)
Q3 = df['transaction_amount'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df[(df['transaction_amount'] < lower_bound) |
                  (df['transaction_amount'] > upper_bound)]
outliers_iqr
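
The same IQR logic can be wrapped in a small helper so it is reusable across columns. This is a sketch under assumptions: the function name `iqr_bounds`, the multiplier parameter `k`, and the toy series are hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data: evenly spaced values plus one extreme point.
toy = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200])

lower, upper = iqr_bounds(toy)
flagged = toy[(toy < lower) | (toy > upper)]

print(lower, upper)      # fences
print(flagged.tolist())  # values outside the fences
```

Here Q1 is 14 and Q3 is 22, so the fences are 2 and 34 and only the value 200 is flagged. A larger `k` (3.0 is a common choice for "extreme" outliers) widens the fences.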


## 5. Remove Outliers using IQR

In [None]:

df_iqr_removed = df[(df['transaction_amount'] >= lower_bound) &
                    (df['transaction_amount'] <= upper_bound)]
df_iqr_removed.shape
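
When several numeric columns need the same treatment, the fences can be computed for all of them at once. The sketch below drops a row if any column falls outside its own fences and reports how many rows were lost; the toy frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame: each column contains exactly one extreme value.
toy = pd.DataFrame({
    "transaction_amount": [10, 12, 14, 16, 18, 20, 22, 24, 200],
    "salary": [500, 30, 32, 34, 36, 38, 40, 42, 44],
})

# Column-wise quartiles; each result is a Series indexed by column name.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparing a DataFrame with a Series aligns on column names.
within = (toy >= lower) & (toy <= upper)

# Keep only rows where every column is inside its fences.
kept = toy[within.all(axis=1)]
rows_dropped = len(toy) - len(kept)

print(kept)
print(f"rows dropped: {rows_dropped}")
```

Reporting the number of dropped rows makes the data loss explicit, which matters when deciding between removal and capping.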


## 6. Quantile-based Capping (Winsorization)

In [None]:

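# Cap values below the 1st percentile and above the 99th percentile.
# Note: with only 52 rows, the 99th percentile falls between the two injected
# outliers (50000 and 75000), so this cap lowers only the largest value.
# Tighter quantiles (e.g. 0.05 / 0.95) cap more aggressively.
low_q = df['transaction_amount'].quantile(0.01)
high_q = df['transaction_amount'].quantile(0.99)

df['transaction_capped'] = df['transaction_amount'].clip(lower=low_q, upper=high_q)
df[['transaction_amount', 'transaction_capped']].tail()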
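
On small samples a high percentile can itself land among the extreme values, so an alternative is to cap at the IQR fences instead. This is a sketch of that variant, not the percentile capping shown in the cell above; the toy series and variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data with one extreme value.
toy = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200], dtype="float64")

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# clip() replaces values outside the fences with the fence value itself,
# so no rows are lost.
capped = toy.clip(lower=lower_fence, upper=upper_fence)
n_changed = int((capped != toy).sum())

print(capped.tolist())
print(f"values capped: {n_changed} of {len(toy)}")
```

The row count is unchanged and only the single extreme value is pulled in to the upper fence (34 for this toy data).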


## 7. Z-Score Method

In [None]:

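# Note: extreme values inflate both the mean and the standard deviation,
# so a Z-score cutoff can understate how unusual they are.
mean = df['salary'].mean()
std = df['salary'].std()  # pandas default: sample standard deviation (ddof=1)

df['z_score'] = (df['salary'] - mean) / std
outliers_z = df[df['z_score'].abs() > 3]
outliers_z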
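
Because the mean and standard deviation are themselves pulled by extreme values, the classic Z-score can miss outliers in small samples. The sketch below contrasts it with a median/MAD-based "modified Z-score", which is a different technique from the one in the cell above. The constant 0.6745 and the 3.5 cutoff are the commonly cited convention, and the toy data is an assumption.

```python
import pandas as pd

# Hypothetical toy data: nine values, one clearly extreme.
toy = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200], dtype="float64")

# Classic Z-score: uses mean and sample standard deviation.
classic_z = (toy - toy.mean()) / toy.std()

# Modified Z-score: uses median and median absolute deviation (MAD).
# Caveat: if MAD is 0 (more than half the values identical), this divides by zero.
median = toy.median()
mad = (toy - median).abs().median()
robust_z = 0.6745 * (toy - median) / mad

flagged_classic = toy[classic_z.abs() > 3]
flagged_robust = toy[robust_z.abs() > 3.5]

print("classic:", flagged_classic.tolist())
print("robust: ", flagged_robust.tolist())
```

With nine points, the largest possible classic Z-score is (n - 1) / sqrt(n), about 2.67, so the cutoff of 3 can never be reached and the value 200 goes unflagged. The median-based score flags it.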


## 8. Remove Outliers using Z-Score

In [None]:

df_z_removed = df[np.abs(df['z_score']) <= 3]
df_z_removed.shape


## 9. Business Rule Based Outlier Handling

In [None]:

# Example: salary should not exceed 200000
df_business = df[df['salary'] <= 200000]
df_business.shape
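
Instead of dropping rule violations outright, one option is to flag them so they can be reviewed. This is a minimal sketch: the constant `MAX_SALARY` mirrors the 200000 example above, while the column name `salary_violation` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumed business limit, mirroring the example rule above.
MAX_SALARY = 200_000

# Hypothetical toy rows.
toy = pd.DataFrame({"salary": [45_000, 82_000, 300_000, 119_000, 450_000]})

# Flag violations in a boolean column rather than deleting them.
toy["salary_violation"] = toy["salary"] > MAX_SALARY

valid = toy.loc[~toy["salary_violation"]]
review_queue = toy.loc[toy["salary_violation"]]

print(f"valid rows: {len(valid)}, rows to review: {len(review_queue)}")
```

Keeping the flagged rows in a separate frame preserves an audit trail, which can help when a "violation" turns out to be a legitimate record.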


## 10. Compare Before & After (Box Plot)

In [None]:

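plt.figure()
plt.boxplot([df['transaction_amount'], df_iqr_removed['transaction_amount']])
# Label the boxes via xticks; boxplot's `labels=` argument is deprecated
# in Matplotlib 3.9+ (renamed to `tick_labels=`).
plt.xticks([1, 2], ['Original', 'After IQR Removal'])
plt.title("Outlier Handling Comparison")
plt.show()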



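## ✅ Best Practices & Interview Notes
- Always detect and inspect outliers before removing them
- Prefer IQR for skewed data such as financial amounts, because quartiles are barely affected by extreme values
- Prefer Z-score for approximately normal distributions with a reasonably large sample
- Use capping instead of deletion when losing rows is costly
- Business rules take precedence over statistical thresholds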
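
The distribution-based choice between IQR and Z-score can be sketched as a simple heuristic on sample skewness. The helper name `choose_method` and the threshold of 1.0 are assumptions for illustration, not a standard rule.

```python
import pandas as pd

def choose_method(series, skew_threshold=1.0):
    """Suggest 'iqr' for strongly skewed data, otherwise 'zscore'.

    The threshold is an illustrative assumption, not a fixed convention.
    """
    return "iqr" if abs(series.skew()) > skew_threshold else "zscore"

# Hypothetical toy columns.
skewed = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200])
symmetric = pd.Series([1, 2, 3, 4, 5])

print(choose_method(skewed))
print(choose_method(symmetric))
```

A heuristic like this is only a starting point; a histogram or box plot of the column remains the more reliable check.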



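## ✔ Summary
- Outliers distort the mean, the standard deviation, and models fitted to the data
- IQR and quantile-based methods are robust choices because they do not depend on the mean
- Visualization (box plots) helps detect outliers early
- Choose the method based on the data distribution and the business context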
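
To illustrate how a single outlier can distort a model, the sketch below fits a straight line with `np.polyfit` to clean data and to the same data with one corrupted point. The data is synthetic and chosen only for illustration.

```python
import numpy as np

# Synthetic data on the line y = 2x + 1.
x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Same data with the last point replaced by an extreme value.
y_outlier = y_clean.copy()
y_outlier[-1] = 200.0

# Degree-1 polyfit returns [slope, intercept].
slope_clean, _ = np.polyfit(x, y_clean, 1)
slope_outlier, _ = np.polyfit(x, y_outlier, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_outlier:.2f}")
```

One corrupted point out of ten is enough to move the fitted slope far from 2, which is why outlier handling usually comes before model fitting.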
