# 6. Detecting and Handling Outliers

Outliers are data points that deviate markedly from the rest of the data. They can arise from measurement errors, data entry mistakes, or genuine variability. Because they can distort statistics such as the mean and standard deviation, identifying and handling them matters for accurate analysis and robust models.

## Identifying Outliers

### Using Summary Statistics
Statistical methods can highlight potential outliers:
1. **`describe()`**:
   - Provides summary statistics (count, mean, standard deviation, min, quartiles, max) for numeric columns.
   - Extreme min or max values may indicate outliers.
2. **`quantile()`**:
   - Calculates specific percentiles.
   - Commonly used thresholds:
     - Values below the 1st percentile or above the 99th percentile.
     - Interquartile Range (IQR): Data points outside `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`.

#### Example:


In [None]:
import pandas as pd
import numpy as np

# Example DataFrame with potential outliers
data = {'Values': [10, 12, 15, 14, 100, 18, 14, 12, 13, 150]}
df = pd.DataFrame(data)
print('Summary statistics:')
print(df['Values'].describe())

# Calculate IQR
Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f'Lower bound: {lower_bound}, Upper bound: {upper_bound}')

# Identify outliers
outliers = df[(df['Values'] < lower_bound) | (df['Values'] > upper_bound)]
print('Outliers:')
print(outliers)
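
The percentile thresholds listed above can be sketched in the same way. This is a minimal illustration, assuming the same ten sample values; the names `values`, `p01`, `p99`, and `flagged` are chosen here for illustration only. Percentile cutoffs flag a roughly fixed share of the data by construction, so on a small sample they may mark an ordinary value and miss a real outlier.

```python
import pandas as pd

# Same sample values as the IQR example (repeated so this cell runs on its own)
values = pd.Series([10, 12, 15, 14, 100, 18, 14, 12, 13, 150], name='Values')

# 1st and 99th percentiles (pandas interpolates linearly by default)
p01 = values.quantile(0.01)
p99 = values.quantile(0.99)
print(f'1st percentile: {p01}, 99th percentile: {p99}')

# Flag values strictly outside the percentile range
flagged = values[(values < p01) | (values > p99)]
print('Flagged by percentile thresholds:')
print(flagged)
```

On this sample the percentile rule flags 10 and 150 but not 100, whereas the IQR rule flags 100 and 150. Which rule fits better depends on the sample size and on how the data are distributed.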

### Visual Methods

Visualizations make outliers quick to spot:
1. **Boxplots**:
   - Display the data distribution and draw values beyond the whiskers as individual points, marking them as potential outliers. By default, matplotlib's whiskers extend up to `1.5 * IQR` past the quartiles.
2. **Histograms**:
   - Show the frequency distribution of the data; outliers appear as isolated bars separated from the main mass.

#### Example:


In [None]:
import matplotlib.pyplot as plt

# Boxplot to visualize outliers
plt.figure(figsize=(8, 4))
plt.boxplot(df['Values'], vert=False)
plt.title('Boxplot of Values')
plt.show()

# Histogram to visualize distribution
plt.figure(figsize=(8, 4))
plt.hist(df['Values'], bins=10, edgecolor='black')
plt.title('Histogram of Values')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
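
What "isolated bars" means can also be checked numerically. The sketch below, assuming the same sample values, uses `np.histogram` to compute the bin counts that a 10-bin histogram would draw. The names `counts`, `edges`, `main_bin`, and `is_isolated` are illustrative, and the isolation rule (an occupied bin separated from the fullest bin by at least one empty bin) is an ad hoc heuristic rather than a standard definition.

```python
import numpy as np

values = np.array([10, 12, 15, 14, 100, 18, 14, 12, 13, 150])

# Same binning that a 10-bin histogram of these values would use
counts, edges = np.histogram(values, bins=10)

for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f'{left:6.1f} to {right:6.1f}: {count}')

# Index of the fullest bin, taken here as the main mass of the data
main_bin = int(np.argmax(counts))

def is_isolated(i):
    """Heuristic (assumption): occupied bin with an empty bin between it and the main bin."""
    lo, hi = sorted((i, main_bin))
    return counts[i] > 0 and bool((counts[lo + 1:hi] == 0).any())

isolated = [i for i in range(len(counts)) if i != main_bin and is_isolated(i)]
print('Isolated bins:', isolated)
```

Here eight of the ten values fall into the first bin, while 100 and 150 each occupy a bin of their own with empty bins in between, which is the pattern the histogram shows visually.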

## Handling Techniques

### 1. Removing or Capping Outliers
- **Removing Outliers**:
  - Outliers can be removed entirely if they are likely due to errors or are not relevant.
- **Capping Outliers**:
  - Replace extreme values with the nearest acceptable boundary (e.g., upper or lower bound).

#### Example:


In [None]:
# Remove outliers
df_no_outliers = df[(df['Values'] >= lower_bound) & (df['Values'] <= upper_bound)]
print('DataFrame after removing outliers:')
print(df_no_outliers)

# Cap outliers in a copy so the original values stay available for later steps
df_capped = df.copy()
df_capped['Values'] = df['Values'].clip(lower=lower_bound, upper=upper_bound)
print('DataFrame after capping outliers:')
print(df_capped)
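
To see why the choice between removing and capping matters, it can help to compare summary statistics before and after each treatment. The following sketch assumes the same sample values and IQR bounds as the earlier example; `raw`, `removed`, and `capped` are illustrative names, and each treatment produces a separate Series so nothing is overwritten.

```python
import pandas as pd

raw = pd.Series([10, 12, 15, 14, 100, 18, 14, 12, 13, 150], name='Values')

# IQR bounds, computed as in the earlier example
q1, q3 = raw.quantile(0.25), raw.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

removed = raw[raw.between(lower, upper)]     # drop values outside the bounds
capped = raw.clip(lower=lower, upper=upper)  # pull extremes back to the bounds

for label, s in [('raw', raw), ('removed', removed), ('capped', capped)]:
    print(f'{label:>8}: n={len(s)}, mean={s.mean():.2f}, median={s.median():.2f}')
```

On this sample the mean falls from 35.8 to 13.5 after removal and to 15.75 after capping. Removal leaves eight rows, while capping keeps all ten, which may matter when the rows carry other columns worth preserving.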

### 2. Transforming Data

Transformations can reduce the impact of outliers without removing them:
- **Logarithmic Scaling**:
  - Reduces the influence of large values by compressing the range of the data. Requires positive values, or non-negative values if 1 is added first.
- **Square Root Transformation**:
  - Similar to log scaling but less aggressive. Requires non-negative values.

#### Example:


In [None]:
# Logarithmic scaling
df['Log_Values'] = np.log1p(df['Values'])  # log(1 + x), so a value of 0 maps to 0 instead of -inf
print('DataFrame after logarithmic scaling:')
print(df)

# Square root transformation
df['Sqrt_Values'] = np.sqrt(df['Values'])
print('DataFrame after square root transformation:')
print(df)
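
One rough way to see how strongly each transformation compresses the data is to compare the ratio of the largest value to the median before and after transforming. The sketch below assumes the original, uncapped sample values. The max-to-median ratio is an ad hoc illustration rather than a standard outlier measure, and `ratio` is a hypothetical helper. The last lines show that both transformations can be inverted to return to the original scale.

```python
import numpy as np
import pandas as pd

raw = pd.Series([10, 12, 15, 14, 100, 18, 14, 12, 13, 150], name='Values')
log_values = np.log1p(raw)   # log(1 + x)
sqrt_values = np.sqrt(raw)

def ratio(s):
    """Hypothetical helper: largest value divided by the median."""
    return s.max() / s.median()

for label, s in [('raw', raw), ('sqrt', sqrt_values), ('log1p', log_values)]:
    print(f'{label:>6}: max/median = {ratio(s):.2f}')

# Both transformations are invertible, so results can be mapped back
restored_from_log = np.expm1(log_values)   # inverse of log1p
restored_from_sqrt = sqrt_values ** 2      # inverse of sqrt
print(np.allclose(restored_from_log, raw), np.allclose(restored_from_sqrt, raw))
```

On this sample the ratio drops from roughly 10.7 for the raw values to about 3.3 after the square root and about 1.9 after the log, consistent with the log being the more aggressive of the two.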