# Part 1.4: Outlier Detection & Treatment

Outliers are data points that differ significantly from other observations. They can skew statistical measures such as the mean and standard deviation, and they can degrade the performance of models that are sensitive to extreme values, such as linear regression.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = {'values': [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100]} # 100 is an outlier
df = pd.DataFrame(data)
print("Original DataFrame:")
df
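
To make the skew concrete, the sketch below compares summary statistics for the same eleven values with and without the extreme point. Dropping `100` by hand is only for illustration; the names `trimmed` and `summary` are not part of the original notebook.

```python
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100], name="values")

# Drop the known extreme value by hand (illustration only, not a detection method)
trimmed = values[values != 100]

# Side-by-side summary: the mean and std react strongly, the median does not
summary = pd.DataFrame({
    "with_outlier": values.agg(["mean", "median", "std"]),
    "without_outlier": trimmed.agg(["mean", "median", "std"]),
})
print(summary.round(2))
```

A single point moves the mean from 12.2 to roughly 20.2, while the median stays at 12 in both cases.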

### Detection Method 1: Visualization (Box Plots)
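A box plot makes outliers easy to spot: with the default settings, points that lie more than 1.5 * IQR beyond the quartiles fall outside the whiskers and are drawn as individual markers.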

In [None]:
sns.boxplot(x=df['values'])
plt.title('Box Plot of Values')
plt.show()
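
The numbers behind the plot can also be inspected directly. The sketch below uses matplotlib's `cbook.boxplot_stats` helper and assumes the default whisker length of 1.5 * IQR; it should match what the plot draws with default settings, but treat it as a cross-check rather than a description of seaborn's internals.

```python
from matplotlib import cbook

values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100]

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(values, whis=1.5)[0]

print("Whisker ends:", stats["whislo"], stats["whishi"])
print("Points drawn as outliers (fliers):", list(stats["fliers"]))
```

The whiskers end at the most extreme observations that are still inside the 1.5 * IQR range, which is why they stop at actual data values rather than at the computed bounds.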

### Detection Method 2: Interquartile Range (IQR)
The IQR rule is a common statistical method for flagging outliers. With `IQR = Q3 - Q1`, a data point is considered an outlier if it falls outside these boundaries:
- `Lower Bound = Q1 - 1.5 * IQR`
- `Upper Bound = Q3 + 1.5 * IQR`

In [None]:
Q1 = df['values'].quantile(0.25)
Q3 = df['values'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1: {Q1}, Q3: {Q3}, IQR: {IQR}")
print(f"Lower Bound: {lower_bound}")
print(f"Upper Bound: {upper_bound}")

outliers = df[(df['values'] < lower_bound) | (df['values'] > upper_bound)]
print("\nDetected Outliers:")
print(outliers)
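
When the same rule has to be applied to several columns, it can help to wrap it in a small function. The sketch below is one possible way to do that; `iqr_bounds` and its parameter `k` are hypothetical names introduced here, not part of pandas.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100], name="values")

low, high = iqr_bounds(values)
# between() is inclusive on both ends, so only points strictly outside are flagged
is_outlier = ~values.between(low, high)

print(f"Bounds: [{low}, {high}]")
print("Flagged values:", values[is_outlier].tolist())
```

Raising `k` (for example to 3.0) makes the rule more conservative, flagging only more extreme points.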

### Treatment Strategy 1: Removal
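The simplest way to handle outliers is to remove them. Do this with caution: removal discards data, and an extreme value may be a genuine observation rather than a measurement or entry error.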

In [None]:
# Drop the rows flagged in the detection step instead of repeating the condition
df_no_outliers = df.drop(outliers.index)
print("DataFrame after removing outliers:")
print(df_no_outliers)
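
One cautious variant is to flag outliers in a separate column first, so that the removal can be reviewed and reversed. The sketch below assumes the same IQR bounds as above; the `is_outlier` column and `n_removed` variable are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({"values": [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100]})

q1, q3 = df["values"].quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag first, filter second. Note: missing values would also be flagged here,
# because between() returns False for NaN.
df["is_outlier"] = ~df["values"].between(lower_bound, upper_bound)

df_no_outliers = df.loc[~df["is_outlier"], ["values"]].reset_index(drop=True)
n_removed = int(df["is_outlier"].sum())

print(f"Removed {n_removed} of {len(df)} rows")
print(df[df["is_outlier"]])
```

Keeping the flag in the original frame leaves a record of which rows were dropped and why.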

### Treatment Strategy 2: Capping (Winsorizing)
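Capping replaces each outlier with the nearest boundary value (the lower or upper bound) instead of dropping the row, so the dataset keeps its size. Strictly speaking, winsorizing caps values at fixed percentiles, but the term is often used loosely for this kind of capping.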

In [None]:
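df_capped = df.copy()
# Values above the upper bound become upper_bound, values below the lower
# bound become lower_bound, and everything else is left unchanged.
df_capped['values'] = np.where(
    df_capped['values'] > upper_bound,
    upper_bound,
    np.where(
        df_capped['values'] < lower_bound,
        lower_bound,
        df_capped['values']
    )
)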
print("DataFrame after capping outliers:")
print(df_capped)
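
The same capping can be written more compactly with pandas' `Series.clip`. This is a sketch of an equivalent approach, assuming the same IQR bounds; the column is cast to float first because the bounds are not whole numbers.

```python
import pandas as pd

df = pd.DataFrame({"values": [10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100]})

q1, q3 = df["values"].quantile([0.25, 0.75])
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

df_capped = df.copy()
# clip() limits every value to the interval [lower_bound, upper_bound]
df_capped["values"] = (
    df_capped["values"].astype(float).clip(lower=lower_bound, upper=upper_bound)
)

print(df_capped)
```

Here the value `100` is replaced by the upper bound, and all other rows are unchanged.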

### Treatment Strategy 3: Transformation
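Applying a monotonic transformation such as a logarithm or square root compresses large values, which reduces the impact of outliers without dropping rows or changing the ordering of the data. Note that the logarithm is only defined for strictly positive values.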

In [None]:
df_transformed = df.copy()
df_transformed['values_log'] = np.log(df_transformed['values'])
print("DataFrame after log transformation:")
print(df_transformed)

sns.boxplot(x=df_transformed['values_log'])
plt.title('Box Plot of Log-Transformed Values')
plt.show()
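
To compare how strongly different transformations pull in the extreme value, the sketch below looks at the ratio of the maximum to the median under each one, then re-applies the IQR rule on the log scale. The ratio is just a rough yardstick chosen for this example; `np.log1p` is included as an option for data that may contain zeros.

```python
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 10, 100], name="values")

transformed = pd.DataFrame({
    "raw": values,
    "sqrt": np.sqrt(values),
    "log": np.log(values),      # requires strictly positive input
    "log1p": np.log1p(values),  # log(1 + x), also defined at x = 0
})

# How far the largest value sits from the bulk of the data, per transformation
max_to_median = transformed.max() / transformed.median()
print(max_to_median.round(2))

# Re-apply the IQR rule on the log scale
q1, q3 = transformed["log"].quantile([0.25, 0.75])
upper_log = q3 + 1.5 * (q3 - q1)
still_flagged = transformed["log"] > upper_log
print("Still flagged after log transform:", values[still_flagged].tolist())
```

On this data the value `100` still lies above the upper bound on the log scale, so a transformation reduces an outlier's leverage but does not necessarily make it disappear.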