## **Handling Missing Values**

Handling missing values is crucial because ignoring them can bias results and make both data analysis and machine learning models unreliable.

**Method:**
- Look for patterns in the missing data, such as columns that tend to be missing together.
- Handle the missing values either by imputation or by dropping the affected rows.
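
To make the two options in the method concrete, here is a minimal sketch on a toy DataFrame. The data and the choice of median imputation are assumptions for illustration only; they are not taken from the dataset used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the notebook's dataset)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute each column with its own median
# (median is just one possible strategy, chosen here as an assumption)
imputed = toy.fillna(toy.median())

print(f"rows: {len(toy)} original, {len(dropped)} after dropping")
print(f"missing values left after imputation: {int(imputed.isna().sum().sum())}")
```

Dropping keeps only fully observed rows, while imputation keeps every row at the cost of introducing estimated values.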

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

**Functions**

In [5]:
def analyze_missing_data(df):
    """
    Analyze missing data patterns in a DataFrame.
    
    Parameters:
    df (pandas.DataFrame): Input DataFrame to analyze
    
    Returns:
    dict: Dictionary containing missing value statistics and patterns
    """
    # Calculate basic missing value statistics
    missing_stats = {
        'total_missing': df.isnull().sum(),
        'percent_missing': (df.isnull().sum() / len(df) * 100).round(2),
        'missing_pattern_count': df.isnull().sum(axis=1).value_counts().sort_index()
    }
    
    # Create missing value patterns
    missing_patterns = df.isnull().astype(int)
    pattern_counts = missing_patterns.groupby(list(missing_patterns.columns)).size().reset_index()
    pattern_counts.columns = list(df.columns) + ['count']
    pattern_counts = pattern_counts.sort_values('count', ascending=False)
    
    # Identify columns that tend to be missing together
    correlation_matrix = missing_patterns.corr()
    
    # Group row indices by completeness: no missing values, at least one
    # missing value (this includes fully empty rows), and all values missing
    rows_by_missing = {
        'complete_rows': df.dropna().index,
        'partial_missing': df[df.isnull().any(axis=1)].index,
        'all_missing': df[df.isnull().all(axis=1)].index
    }
    
    return {
        'missing_stats': missing_stats,
        'pattern_counts': pattern_counts,
        'correlation_matrix': correlation_matrix,
        'rows_by_missing': rows_by_missing
    }

def print_missing_analysis(analysis_results):
    """
    Print formatted analysis results.
    
    Parameters:
    analysis_results (dict): Results from analyze_missing_data function
    """
    print("=== Missing Value Analysis ===\n")
    
    print("Missing Value Counts and Percentages:")
    missing_summary = pd.DataFrame({
        'Missing Values': analysis_results['missing_stats']['total_missing'],
        'Percentage': analysis_results['missing_stats']['percent_missing']
    })
    print(missing_summary[missing_summary['Missing Values'] > 0])
    print("\nMissing Value Patterns (Top 5):")
    print(analysis_results['pattern_counts'].head())
    
    print("\nHighly Correlated Missing Patterns (correlation > 0.5):")
    corr_matrix = analysis_results['correlation_matrix']
    high_corr = np.where(np.triu(corr_matrix, 1) > 0.5)
    for i, j in zip(*high_corr):
        print(f"{corr_matrix.index[i]} & {corr_matrix.columns[j]}: {corr_matrix.iloc[i, j]:.2f}")
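
As a rough illustration of what the pattern counts and the correlation matrix capture, the sketch below applies the same two steps to a toy DataFrame. The column names `x`, `y`, and `z` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): x and y are missing in the same rows, z is not
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 5.0],
    "y": [1.0, np.nan, 3.0, np.nan, 5.0],
    "z": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

# 1 where a value is missing, 0 where it is present
indicators = toy.isnull().astype(int)

# Count how often each distinct combination of missing columns occurs
pattern_counts = indicators.groupby(list(indicators.columns)).size()

# Correlation between indicators: values near 1 mean the two columns
# tend to be missing in the same rows
corr = indicators.corr()

print(pattern_counts)
print(corr.round(2))
```

Here `x` and `y` have identical missingness indicators, so their correlation is 1, which is the kind of pair `print_missing_analysis` reports under the 0.5 threshold.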

**Data**

In [26]:
# Read the data 
data = pd.read_csv('dataset.csv')

print(f"Successfully read {len(data.columns)} features")

Successfully read 87 features


In [6]:
# Let's try out the analysis
analysis_results = analyze_missing_data(data)

In [14]:
print_missing_analysis(analysis_results)

=== Missing Value Analysis ===

Missing Value Counts and Percentages:
               Missing Values  Percentage
Flow Bytes/s            41051        1.27
Flow IAT Mean           41709        1.29
Flow IAT Std            41709        1.29
Flow IAT Max            41709        1.29
Flow IAT Min            41709        1.29

Missing Value Patterns (Top 5):
   Flow ID  Src IP  Src Port  Dst IP  Dst Port  Protocol  Timestamp  \
0        0       0         0       0         0         0          0   
2        0       0         0       0         0         0          0   
1        0       0         0       0         0         0          0   

   Flow Duration  Total Fwd Packet  Total Bwd packets  ...  Active Std  \
0              0                 0                  0  ...           0   
2              0                 0                  0  ...           0   
1              0                 0                  0  ...           0   

   Active Max  Active Min  Idle Mean  Idle Std  Idle Max  Idle 

**Handle Missing Values**

In [25]:
# List the columns that contain missing values, with their counts
print(data[data.columns[data.isna().any()].tolist()].isna().sum())

Series([], dtype: float64)


In [27]:
print(f"Number of rows before handling missing values: {len(data)}")
# Rows with missing values are a very small fraction of the dataset, so we simply drop them
data = data.dropna()

print(f"Number of rows after handling missing values: {len(data)}")

Number of rows before handling missing values: 3231475
Number of rows after handling missing values: 3189766
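
The row counts above can be checked with a little arithmetic. The sketch below uses only the two numbers printed by the cell; the variable names are introduced here for illustration.

```python
# Row counts copied from the output above
rows_before = 3_231_475
rows_after = 3_189_766

rows_dropped = rows_before - rows_after
percent_dropped = round(rows_dropped / rows_before * 100, 2)

print(rows_dropped, percent_dropped)  # 41709 1.29
```

The 41709 dropped rows equal the missing count reported for the `Flow IAT` columns. This would be consistent with the 41051 rows missing `Flow Bytes/s` being a subset of those rows, although the output shown here does not confirm that directly.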


In [28]:
data.isna().sum().sum()

0

**Save it to 01_missing_values_removed.csv**

In [29]:
# Save data to data_versions 
data.to_csv('data_versions/01_missing_values_removed.csv', index=False)
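
Note that `to_csv` does not create missing directories, so the save above assumes `data_versions/` already exists. A minimal sketch of a safer save followed by a round-trip check is shown below; it uses a temporary directory and a toy DataFrame, both of which are assumptions so that the example does not touch real files.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical)
cleaned = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]}).dropna()

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data_versions")
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing

    out_path = os.path.join(out_dir, "01_missing_values_removed.csv")
    cleaned.to_csv(out_path, index=False)

    # Reload and confirm that no missing values were written
    reloaded = pd.read_csv(out_path)
    remaining_missing = int(reloaded.isna().sum().sum())

print(f"rows saved: {len(reloaded)}, missing values after reload: {remaining_missing}")
```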