## **Outlier Treatment**

Outlier treatment is important because outliers can significantly affect the performance and reliability of a model. Here’s why handling them properly matters:

1. Improves Model Accuracy: A few extreme values can pull fitted parameters and summary statistics away from the bulk of the data. Removing or adjusting them usually makes predictions on typical inputs more reliable.

2. Enhances Model Training: Several common models are sensitive to outliers. Linear regression fitted with squared error is pulled toward extreme points, K-nearest neighbors relies on distances that a few extreme values can dominate, and neural networks can train unstably on unscaled extremes. In each case the model may fit the extremes at the expense of generalizing to new data.

3. Improves Statistical Metrics: Outliers distort the mean and standard deviation far more than they distort the median or interquartile range, which affects how the data is interpreted. Proper treatment gives a clearer, more representative picture of the data.
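
To make the third point concrete, here is a minimal sketch on made-up numbers (not taken from the dataset used in this notebook). It compares the mean, standard deviation, and median of a small sample with and without a single extreme value.

```python
import numpy as np

# Toy sample: nine typical values plus one extreme value (made-up numbers).
values = np.array([10, 11, 9, 10, 12, 10, 11, 9, 10, 1000], dtype=float)
clean = values[:-1]  # the same sample without the extreme value

mean_with, mean_without = values.mean(), clean.mean()
std_with, std_without = values.std(ddof=1), clean.std(ddof=1)
median_with, median_without = np.median(values), np.median(clean)

print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")
print(f"std:    {std_without:.2f} -> {std_with:.2f}")
print(f"median: {median_without:.2f} -> {median_with:.2f}")
```

The mean and standard deviation change by an order of magnitude or more, while the median does not move, which is why a single extreme value can dominate mean-based summaries.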

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import math 
from scipy import stats

**Data**

In [2]:
# Load the dataset saved by the previous step (missing values already handled)
data = pd.read_csv('data_versions/01_missing_values_removed.csv')

print(f"Successfully read {len(data.columns)} features")

Successfully read 87 features


**Functions**

In [23]:
# Functions 

def remove_outliers_75(df, column):
    """
    Remove rows whose value in the specified column is above the 75th percentile.
    Note: this drops rows, so up to about a quarter of the data is discarded.
    
    Parameters:
    df: pandas DataFrame
    column: str, name of the column to process
    
    Returns:
    pandas DataFrame with outliers removed
    """
    # Calculate 75th percentile
    q75 = df[column].quantile(0.75)
    
    # Keep only the rows whose value is <= the 75th percentile
    return df[df[column] <= q75]


def cap_outliers_75(df, columns: list):
    """
    Cap values above the 75th percentile in specified columns.
    Values above 75th percentile are replaced with the 75th percentile value.
    
    Parameters:
    df: pandas DataFrame
    columns: list of str, names of columns to process
    
    Returns:
    pandas DataFrame with outliers capped in specified columns
    """
    # Create a copy of the DataFrame to avoid modifying the original
    df_capped = df.copy()
    
    # Process each specified column
    for column in columns:
        # Calculate 75th percentile
        q75 = df[column].quantile(0.75)
        
        # Cap values above 75th percentile
        df_capped[column] = df[column].clip(upper=q75)
    
    return df_capped


def cap_columns_with_a_value(df, cap_dict: dict):
    """
    Cap values in specified columns at custom values.
    
    Parameters:
    df: pandas DataFrame
    cap_dict: dict, mapping of column names to their cap values
             e.g., {'column1': 100, 'column2': 50}
    
    Returns:
    pandas DataFrame with specified columns capped at their respective values
    """
    # Create a copy of the DataFrame to avoid modifying the original
    df_capped = df.copy()
    
    # Process each column with its specified cap value
    for column, cap_value in cap_dict.items():
        df_capped[column] = df[column].clip(upper=cap_value)
    
    return df_capped



**Outlier Removal**

In [13]:
# Before we proceed, we'll remove irrelevant columns
irrelevant_columns = ['Fwd URG Flags', 'Bwd URG Flags', 'URG Flag Count', 'CWR Flag Count', 'ECE Flag Count']

# Drop
data = data.drop(columns=irrelevant_columns)

In [14]:
# print 
print(f"New number of features: {len(data.columns)}")

New number of features: 82


In [15]:
# print 
print(f"Number of rows: {len(data)}")

Number of rows: 3189766


**Cap and Floor treatment**

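- Clips each listed column at its 75th percentile: values above it are replaced with the percentile value, so outliers are handled without dropping any rows. Only an upper cap is applied here; no lower bound (floor) is set.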
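
As a rough illustration of what this cap does, the sketch below applies it to a made-up column and compares it with the commonly used Tukey upper fence (Q3 + 1.5 × IQR). The fence is shown only for comparison and is not what this notebook uses; the series and the name `packets` are hypothetical.

```python
import pandas as pd

# Toy column with one extreme value (made-up numbers, hypothetical name).
s = pd.Series([1, 2, 2, 3, 3, 4, 5, 6, 8, 500], dtype=float, name="packets")

q25, q75 = s.quantile(0.25), s.quantile(0.75)
iqr = q75 - q25
tukey_upper = q75 + 1.5 * iqr  # comparison threshold, not used in this notebook

# Cap at the 75th percentile (the approach used in this notebook)
capped_q75 = s.clip(upper=q75)
# Cap at the Tukey upper fence (alternative shown for comparison only)
capped_tukey = s.clip(upper=tukey_upper)

# How many values does each cap change?
changed_q75 = int((s > q75).sum())
changed_tukey = int((s > tukey_upper).sum())

print(f"75th percentile cap = {q75}, values changed: {changed_q75}")
print(f"Tukey upper fence   = {tukey_upper}, values changed: {changed_tukey}")
```

Capping at the 75th percentile changes every value in the top quarter of the column, not only the extreme ones, so it is a fairly aggressive choice. The row count is unchanged in both cases.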

In [17]:
# Apply
columns_to_cap = ['Total Fwd Packet', 'Total Bwd packets', 'Total Length of Fwd Packet', 'Total Length of Bwd Packet', 'Flow Bytes/s', 'Flow IAT Mean', 'Fwd Header Length', 'Bwd Header Length', 'Packet Length Variance']

# Cap outliers
data_capped = cap_outliers_75(data, columns_to_cap)

In [24]:
# print rows 
print(f"Number of rows after capping: {len(data_capped)}")

Number of rows after capping: 3189766


**Some features need more careful treatment. For these columns we cap at a specific maximum value instead of the 75th percentile, so that we don't flatten large values that carry important information.**

In [26]:
# Cap each column at a specified maximum value

cap_dict = {
    'Flow Packets/s': 200000,
    'Down/Up Ratio': 4,
    'Average Packet Size': 2500,
    'Fwd Segment Size Avg': 2500,
    'Bwd Segment Size Avg': 3000,
    'Fwd Bytes/Bulk Avg': 250000,
    'Subflow Fwd Bytes': 1700, 
    'Subflow Bwd Bytes': 2500,
    'Bwd Init Win Bytes': 12000,
    'Active Mean': 10000000,
    'Idle Mean': 40000000,
    'Total TCP Flow Time': 10000000000
}

# apply
data_capped_v2 = cap_columns_with_a_value(data_capped, cap_dict)

In [27]:
# print rows
print(f"Number of rows after capping: {len(data_capped_v2)}")

Number of rows after capping: 3189766
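
Since the row count never changes when capping, it can be more informative to check how many values each cap actually modified. The following is a small sketch of such a check on a toy frame; `count_capped` is a hypothetical helper, not part of this notebook, and the toy values are made up.

```python
import pandas as pd

def count_capped(before: pd.DataFrame, cap_dict: dict) -> pd.Series:
    """Count, per column, how many values exceed their cap (hypothetical helper)."""
    return pd.Series(
        {col: int((before[col] > cap).sum()) for col, cap in cap_dict.items()}
    )

# Toy data reusing two column names and caps from cap_dict above (values made up).
toy = pd.DataFrame({
    "Down/Up Ratio": [0, 1, 2, 9, 30],
    "Average Packet Size": [60, 1500, 2400, 2600, 40],
})
toy_caps = {"Down/Up Ratio": 4, "Average Packet Size": 2500}

# Count affected values before capping
counts = count_capped(toy, toy_caps)

# Apply the caps the same way cap_columns_with_a_value does
toy_capped = toy.copy()
for col, cap in toy_caps.items():
    toy_capped[col] = toy[col].clip(upper=cap)

print(counts)
print(toy_capped.max())
```

In the notebook itself, the same helper could be called as `count_capped(data_capped, cap_dict)` before applying the caps.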


Hurray! The outliers are treated!

Let's save the data.

**Save the data to data_versions**

In [28]:
data_capped_v2.to_csv('data_versions/02_outliers_removed.csv', index=False)