
# Experiment - 6: Data Preprocessing Techniques (Handling Missing Values & Outliers)

In this experiment, we will implement data preprocessing techniques to handle **missing values** and **outliers**. 
Preprocessing data is a crucial step in machine learning, ensuring that the dataset is clean and reliable for model training.

### Objectives:
1. Handle missing values by imputing each column's mean, median, or mode.
2. Handle outliers by capping extreme values with the IQR (Interquartile Range) method.

The code below implements both techniques and demonstrates them on a sample dataset.


In [1]:
import pandas as pd
import numpy as np

In [2]:

# Function to handle missing values in a dataset
def handle_missing_values(data, strategy='mean'):
    """
    Handles missing values in the dataset based on the provided strategy.
    
    Parameters:
    data (pd.DataFrame): The input data with potential missing values.
    strategy (str): The method for handling missing values. One of:
        - 'mean': Replaces missing values with the column mean (numeric columns only).
        - 'median': Replaces missing values with the column median (numeric columns only).
        - 'mode': Replaces missing values with the most frequent value (mode) of each column.
        Columns that contain no values at all remain missing under every strategy.
        
    Returns:
    pd.DataFrame: Data with missing values handled.
    """
    if strategy == 'mean':
        # Fill missing values with mean of the column
        return data.fillna(data.mean(numeric_only=True))
    elif strategy == 'median':
        # Fill missing values with median of the column
        return data.fillna(data.median(numeric_only=True))
    elif strategy == 'mode':
        # Fill missing values with mode of the column
        return data.fillna(data.mode().iloc[0])
    else:
        raise ValueError("Invalid strategy! Use 'mean', 'median', or 'mode'.")
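
To make the difference between the three strategies concrete, here is a minimal sketch on a toy DataFrame. The data and the column names `score` and `city` are hypothetical and are not part of the experiment's dataset. It uses the same pandas calls as `handle_missing_values` above, inlined so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric column and one text column, each with a gap
toy = pd.DataFrame({
    "score": [1.0, 2.0, np.nan, 2.0, 10.0],
    "city": ["Pune", None, "Pune", "Delhi", "Pune"],
})

# Same expressions the function uses for each strategy
mean_filled = toy.fillna(toy.mean(numeric_only=True))
median_filled = toy.fillna(toy.median(numeric_only=True))
mode_filled = toy.fillna(toy.mode().iloc[0])

print(mean_filled.loc[2, "score"])    # 3.75 (pulled upward by the extreme value 10.0)
print(median_filled.loc[2, "score"])  # 2.0  (less sensitive to the extreme value)
print(mode_filled.loc[1, "city"])     # Pune (mode also covers the text column)
print(mean_filled["city"].isna().sum())  # 1 (mean leaves the text gap unfilled)
```

The mean and median strategies only touch numeric columns, so a text column with gaps typically needs the mode strategy or separate handling.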


In [5]:

# Function to detect and handle outliers using the IQR method
def handle_outliers(data):
    """
    Identifies and caps outliers using the Interquartile Range (IQR) method.
    
    Outliers are defined as values outside of the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] range, 
    where Q1 is the 25th percentile and Q3 is the 75th percentile. 
    These outliers are capped to the nearest valid bound (lower or upper).
    
    Parameters:
    data (pd.DataFrame): The input data to be checked for outliers.
    
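    Returns:
    pd.DataFrame: A copy of the data with outliers capped; the input is left unchanged.
    """
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # Select only numeric columns for outlier detection
    numeric_columns = data.select_dtypes(include=[np.number])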
    
    # Calculate IQR for each numeric column
    Q1 = numeric_columns.quantile(0.25)
    Q3 = numeric_columns.quantile(0.75)
    IQR = Q3 - Q1
    
    # Define bounds for capping outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Cap outliers to the respective bounds
    numeric_columns_capped = numeric_columns.clip(lower=lower_bound, upper=upper_bound, axis=1)
    
    # Replace original numeric columns with the capped ones in the dataset
    data[numeric_columns.columns] = numeric_columns_capped
    
    return data
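
The IQR capping rule is easiest to follow on a small worked example. The sketch below uses a hypothetical series of six readings (not taken from the experiment's dataset) and applies the same bounds, `Q1 - 1.5*IQR` and `Q3 + 1.5*IQR`, with pandas' default linear interpolation for quantiles.

```python
import pandas as pd

# Hypothetical readings with one obvious outlier (100.0)
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="reading")

q1 = values.quantile(0.25)   # 11.25
q3 = values.quantile(0.75)   # 12.75
iqr = q3 - q1                # 1.5

lower = q1 - 1.5 * iqr       # 9.0
upper = q3 + 1.5 * iqr       # 15.0

# Values outside [lower, upper] are pulled back to the nearest bound
capped = values.clip(lower=lower, upper=upper)
print(capped.tolist())       # [10.0, 12.0, 11.0, 13.0, 12.0, 15.0]
```

Only the extreme reading changes: 100.0 is capped to the upper bound of 15.0, while values already inside the range are left as they are.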


In [6]:

# Example usage of the missing value and outlier handling functions
if __name__ == "__main__":
    # Path to the dataset (modify this path as per your dataset location)
    file_path = 'sample.csv'
    
    # Load the dataset
    dataset = pd.read_csv(file_path)
    
    # Display initial missing values count before handling
    print("Missing values before handling:")
    print(dataset.isnull().sum())
    
    # Handle missing values using the mean strategy (only numeric columns are filled)
    dataset_with_no_missing_values = handle_missing_values(dataset, strategy='mean')
    
    # Display missing values count after handling
    print("\nMissing values after handling:")
    print(dataset_with_no_missing_values.isnull().sum())
    
    # Detect and handle outliers in the dataset
    dataset_without_outliers = handle_outliers(dataset_with_no_missing_values)
    
    # Show some statistics after handling outliers to verify changes
    print("\nDataset statistics after handling outliers:")
    print(dataset_without_outliers.describe())


Missing values before handling:
Series_reference        0
Period                  0
Data_value             41
Suppressed          44004
STATUS                  0
UNITS                   0
Magnitude               0
Subject                 0
Group                   0
Series_title_1          0
Series_title_2      22434
Series_title_3      44004
Series_title_4      44004
Series_title_5      44004
dtype: int64

Missing values after handling:
Series_reference        0
Period                  0
Data_value              0
Suppressed          44004
STATUS                  0
UNITS                   0
Magnitude               0
Subject                 0
Group                   0
Series_title_1          0
Series_title_2      22434
Series_title_3      44004
Series_title_4      44004
Series_title_5      44004
dtype: int64
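
In the output above, only `Data_value` was filled. The columns whose missing count equals the row count appear to be entirely empty, so their mean is itself undefined, and `Series_title_2` is presumably a text column that the mean strategy skips. The sketch below reproduces this pattern on a hypothetical three-row frame and shows one possible follow-up: dropping fully empty columns, then filling the remaining gaps with the mode. Whether dropping those columns is appropriate depends on the dataset; this is an illustration, not part of the original experiment.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the pattern seen in the output above
df = pd.DataFrame({
    "Data_value": [1.0, np.nan, 3.0],
    "Suppressed": [np.nan, np.nan, np.nan],  # entirely empty column
    "Series_title_2": ["A", None, "A"],      # text column with a gap
})

# Mean strategy: fills Data_value, leaves the other two columns untouched
filled = df.fillna(df.mean(numeric_only=True))
print(filled.isnull().sum())

# Drop columns that contain no values at all
cleaned = filled.dropna(axis=1, how="all")

# Fill the remaining (non-numeric) gaps with each column's mode
cleaned = cleaned.fillna(cleaned.mode().iloc[0])
print(cleaned.isnull().sum())
```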

Dataset statistics after handling outliers:
             Period    Data_value  Suppressed  Magnitude  Series_title_3  \
count  44004.000000  44004.000000         0.0    44004.0     