# CICIDS2017 Dataset - Exploratory Data Analysis

This notebook performs comprehensive exploratory data analysis on the CICIDS2017 Monday dataset (benign traffic baseline).

**Objectives:**
1. Load and inspect dataset structure
2. Analyze feature distributions and statistics
3. Identify data quality issues (missing values, inf values, duplicates)
4. Explore feature correlations
5. Visualize key patterns
6. Document preprocessing requirements

**Note:** If the CICIDS2017 Monday data has not been downloaded yet, this notebook falls back to generated sample data so that development can continue. Findings from the sample data are not representative of the real dataset.

## 1. Setup and Imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings

warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 10

# Random seed for reproducibility
np.random.seed(42)

print("Libraries imported successfully")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

## 2. Data Loading

Attempt to load the CICIDS2017 Monday dataset. If the file is not available, generate sample data instead.

In [None]:
# Define data path
DATA_PATH = Path('../data/raw/Monday-WorkingHours.pcap_ISCX.csv')

def load_or_generate_data():
    """Load real data if available, otherwise generate sample for development."""
    
    if DATA_PATH.exists():
        print(f"Loading CICIDS2017 Monday dataset from: {DATA_PATH}")
        print("This may take a minute for large files...\n")
        
        # Load with proper encoding handling
        try:
            df = pd.read_csv(DATA_PATH, encoding='utf-8')
        except UnicodeDecodeError:
            df = pd.read_csv(DATA_PATH, encoding='latin1')
        
        # Clean column names (remove leading/trailing spaces)
        df.columns = df.columns.str.strip()
        
        print(f"Successfully loaded {len(df):,} rows")
        return df, True
    
    else:
        print("CICIDS2017 data not found!")
        print(f"Please download Monday-WorkingHours.pcap_ISCX.csv to: {DATA_PATH.absolute()}")
        print("See ../data/DOWNLOAD_INSTRUCTIONS.md for download steps.\n")
        print("Generating sample data for development purposes...\n")
        
        # Generate sample data with realistic features
        n_samples = 10000
        
        # Common CICIDS2017 features (subset for sample)
        data = {
            'Flow Duration': np.random.exponential(5000, n_samples),
            'Total Fwd Packets': np.random.poisson(10, n_samples),
            'Total Backward Packets': np.random.poisson(8, n_samples),
            'Total Length of Fwd Packets': np.random.exponential(1000, n_samples),
            'Total Length of Bwd Packets': np.random.exponential(800, n_samples),
            'Fwd Packet Length Max': np.random.exponential(500, n_samples),
            'Fwd Packet Length Min': np.random.exponential(50, n_samples),
            'Fwd Packet Length Mean': np.random.exponential(200, n_samples),
            'Fwd Packet Length Std': np.random.exponential(100, n_samples),
            'Bwd Packet Length Max': np.random.exponential(450, n_samples),
            'Bwd Packet Length Min': np.random.exponential(40, n_samples),
            'Bwd Packet Length Mean': np.random.exponential(180, n_samples),
            'Bwd Packet Length Std': np.random.exponential(90, n_samples),
            'Flow Bytes/s': np.random.exponential(10000, n_samples),
            'Flow Packets/s': np.random.exponential(100, n_samples),
            'Flow IAT Mean': np.random.exponential(1000, n_samples),
            'Flow IAT Std': np.random.exponential(500, n_samples),
            'Flow IAT Max': np.random.exponential(2000, n_samples),
            'Flow IAT Min': np.random.exponential(100, n_samples),
            'Fwd IAT Total': np.random.exponential(5000, n_samples),
            'Fwd IAT Mean': np.random.exponential(1000, n_samples),
            'Fwd IAT Std': np.random.exponential(500, n_samples),
            'Fwd IAT Max': np.random.exponential(2000, n_samples),
            'Fwd IAT Min': np.random.exponential(100, n_samples),
            'Bwd IAT Total': np.random.exponential(4500, n_samples),
            'Bwd IAT Mean': np.random.exponential(900, n_samples),
            'Bwd IAT Std': np.random.exponential(450, n_samples),
            'Bwd IAT Max': np.random.exponential(1800, n_samples),
            'Bwd IAT Min': np.random.exponential(90, n_samples),
            'Fwd PSH Flags': np.random.poisson(1, n_samples),
            'Bwd PSH Flags': np.random.poisson(1, n_samples),
            'Fwd URG Flags': np.random.poisson(0.1, n_samples),
            'Bwd URG Flags': np.random.poisson(0.1, n_samples),
            'Fwd Header Length': np.random.poisson(40, n_samples),
            'Bwd Header Length': np.random.poisson(35, n_samples),
            'Fwd Packets/s': np.random.exponential(50, n_samples),
            'Bwd Packets/s': np.random.exponential(40, n_samples),
            'Min Packet Length': np.random.exponential(40, n_samples),
            'Max Packet Length': np.random.exponential(500, n_samples),
            'Packet Length Mean': np.random.exponential(200, n_samples),
            'Packet Length Std': np.random.exponential(100, n_samples),
            'Packet Length Variance': np.random.exponential(10000, n_samples),
            'FIN Flag Count': np.random.poisson(1, n_samples),
            'SYN Flag Count': np.random.poisson(1, n_samples),
            'RST Flag Count': np.random.poisson(0.2, n_samples),
            'PSH Flag Count': np.random.poisson(2, n_samples),
            'ACK Flag Count': np.random.poisson(10, n_samples),
            'URG Flag Count': np.random.poisson(0.1, n_samples),
            'CWE Flag Count': np.random.poisson(0.1, n_samples),
            'ECE Flag Count': np.random.poisson(0.1, n_samples),
            'Down/Up Ratio': np.random.exponential(1, n_samples),
            'Average Packet Size': np.random.exponential(200, n_samples),
            'Avg Fwd Segment Size': np.random.exponential(180, n_samples),
            'Avg Bwd Segment Size': np.random.exponential(160, n_samples),
            'Fwd Header Length.1': np.random.poisson(40, n_samples),  # Duplicate column in real data
            'Subflow Fwd Packets': np.random.poisson(10, n_samples),
            'Subflow Fwd Bytes': np.random.exponential(1000, n_samples),
            'Subflow Bwd Packets': np.random.poisson(8, n_samples),
            'Subflow Bwd Bytes': np.random.exponential(800, n_samples),
            'Init_Win_bytes_forward': np.random.exponential(8000, n_samples),
            'Init_Win_bytes_backward': np.random.exponential(7000, n_samples),
            'act_data_pkt_fwd': np.random.poisson(5, n_samples),
            'min_seg_size_forward': np.random.exponential(20, n_samples),
            'Active Mean': np.random.exponential(1000, n_samples),
            'Active Std': np.random.exponential(500, n_samples),
            'Active Max': np.random.exponential(2000, n_samples),
            'Active Min': np.random.exponential(100, n_samples),
            'Idle Mean': np.random.exponential(5000, n_samples),
            'Idle Std': np.random.exponential(2500, n_samples),
            'Idle Max': np.random.exponential(10000, n_samples),
            'Idle Min': np.random.exponential(500, n_samples),
            'Label': ['BENIGN'] * n_samples  # Monday data is all benign
        }
        
        df = pd.DataFrame(data)
        
        # Introduce some realistic data quality issues
        # Add some infinite values (division by zero in flow features)
        inf_indices = np.random.choice(n_samples, size=50, replace=False)
        df.loc[inf_indices, 'Flow Bytes/s'] = np.inf
        df.loc[inf_indices[:25], 'Flow Packets/s'] = np.inf
        
        # Add some NaN values
        nan_indices = np.random.choice(n_samples, size=30, replace=False)
        df.loc[nan_indices, 'Fwd IAT Mean'] = np.nan
        
        # Add some duplicates
        df = pd.concat([df, df.iloc[:100]], ignore_index=True)
        
        print(f"Generated {len(df):,} sample rows with {len(df.columns)} features")
        print("Note: This is synthetic data for development only!\n")
        return df, False

# Load data
df, is_real_data = load_or_generate_data()

## 3. Dataset Overview

In [None]:
print("=" * 80)
print("DATASET OVERVIEW")
print("=" * 80)
print(f"Data Source: {'CICIDS2017 Real Data' if is_real_data else 'Sample/Synthetic Data'}")
print(f"Number of Rows: {len(df):,}")
print(f"Number of Columns: {len(df.columns)}")
print(f"Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print()

In [None]:
# Display first few rows
print("First 5 rows:")
df.head()

In [None]:
# Data types and non-null counts
print("\nData Types and Non-Null Counts:")
df.info()

In [None]:
# Column names
print("\nColumn Names:")
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

## 4. Data Quality Analysis

In [None]:
print("=" * 80)
print("DATA QUALITY ISSUES")
print("=" * 80)

# Separate features and label
label_col = 'Label'
feature_cols = [col for col in df.columns if col != label_col]

print(f"\nTotal Features: {len(feature_cols)}")
print(f"Label Column: {label_col}")

### 4.1 Missing Values

In [None]:
# Check for missing values
missing_values = df[feature_cols].isnull().sum()
missing_percent = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percent
}).sort_values('Missing Count', ascending=False)

missing_df_filtered = missing_df[missing_df['Missing Count'] > 0]

print(f"\nColumns with Missing Values: {len(missing_df_filtered)}")
if len(missing_df_filtered) > 0:
    print(missing_df_filtered.head(20))
else:
    print("No missing values detected!")

### 4.2 Infinite Values

In [None]:
# Check for infinite values
inf_counts = {}
for col in feature_cols:
    # Only numeric columns can hold +/-inf; this check also covers dtypes such as float16 or uint8
    if pd.api.types.is_numeric_dtype(df[col]):
        inf_count = int(np.isinf(df[col]).sum())
        if inf_count > 0:
            inf_counts[col] = inf_count

print(f"\nColumns with Infinite Values: {len(inf_counts)}")
if inf_counts:
    inf_df = pd.DataFrame.from_dict(inf_counts, orient='index', columns=['Inf Count'])
    inf_df['Percentage'] = (inf_df['Inf Count'] / len(df)) * 100
    print(inf_df.sort_values('Inf Count', ascending=False).head(20))
else:
    print("No infinite values detected!")
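
The per-column loop above can also be written as one vectorized pass. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are invented for illustration, they are not taken from the dataset); it assumes all numeric columns can be cast to `float64`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the flow features
toy = pd.DataFrame({
    "Flow Bytes/s": [1.0, np.inf, 3.0, -np.inf],
    "Flow Packets/s": [np.inf, 2.0, 3.0, 4.0],
    "Total Fwd Packets": [1, 2, 3, 4],
    "Label": ["BENIGN"] * 4,
})

# Only numeric columns can hold +/-inf, so select them once
numeric = toy.select_dtypes(include=[np.number])

# Build a boolean mask of infinite entries and count them per column
inf_mask = np.isinf(numeric.to_numpy(dtype="float64"))
inf_per_col = pd.Series(inf_mask.sum(axis=0), index=numeric.columns)

# Keep only the columns that actually contain infinite values
inf_per_col = inf_per_col[inf_per_col > 0].sort_values(ascending=False)
print(inf_per_col)
```

On the full dataset this avoids a Python-level loop over roughly 80 columns, at the cost of materializing one boolean array the size of the numeric data.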

### 4.3 Duplicate Rows

In [None]:
# Check for duplicates
duplicate_count = df.duplicated().sum()
duplicate_percent = (duplicate_count / len(df)) * 100

print(f"\nDuplicate Rows: {duplicate_count:,} ({duplicate_percent:.2f}%)")
if duplicate_count > 0:
    print("\nNote: Duplicates will be removed during preprocessing")

### 4.4 Label Distribution

In [None]:
# Check label distribution
print("\nLabel Distribution:")
label_counts = df[label_col].value_counts()
print(label_counts)
print(f"\nUnique Labels: {df[label_col].nunique()}")

# Visualize label distribution
plt.figure(figsize=(10, 4))
label_counts.plot(kind='bar', color='steelblue')
plt.title('Label Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Label')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

print("\nNote: Monday data should be 100% BENIGN traffic (baseline for anomaly detection)")

## 5. Feature Statistics

In [None]:
# Get numeric columns only
numeric_cols = df[feature_cols].select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric Features: {len(numeric_cols)}")

# Summary statistics
print("\nSummary Statistics (first 10 features):")
df[numeric_cols[:10]].describe().T

In [None]:
# Check for constant (zero-variance) features
constant_features = []
for col in numeric_cols:
    # nunique() ignores NaN, so <= 1 also catches all-NaN columns
    if df[col].nunique() <= 1:
        constant_features.append(col)

print(f"\nConstant Features (zero variance): {len(constant_features)}")
if constant_features:
    print(constant_features)
    print("\nNote: These should be removed during feature selection")
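
The check above only finds exactly constant columns. Near-constant columns (for example a flag that is almost always 0) can be flagged by the share of the most frequent value. This is a sketch only: the helper name `near_constant_columns` and the 0.95 threshold are assumptions chosen for illustration, not values from this project.

```python
import pandas as pd

def near_constant_columns(frame, threshold=0.95):
    """Return columns whose most frequent value covers >= threshold of the rows."""
    flagged = []
    for col in frame.columns:
        # value_counts(normalize=True) returns relative frequencies, largest first
        top_share = frame[col].value_counts(normalize=True, dropna=False).iloc[0]
        if top_share >= threshold:
            flagged.append(col)
    return flagged

# Hypothetical toy data: one constant, one near-constant, one varying column
toy = pd.DataFrame({
    "Bwd URG Flags": [0] * 100,
    "Fwd URG Flags": [0] * 99 + [1],
    "Flow Duration": list(range(100)),
})

flagged = near_constant_columns(toy, threshold=0.95)
print(flagged)
```

A suitable threshold depends on the downstream model; a rare but informative flag may be worth keeping even if it is near-constant in benign traffic.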

## 6. Feature Distributions

Visualize distributions of key features to understand data characteristics.

In [None]:
# Select key features to visualize
key_features = [
    'Flow Duration', 'Total Fwd Packets', 'Total Backward Packets',
    'Flow Bytes/s', 'Flow Packets/s', 'Flow IAT Mean',
    'Packet Length Mean', 'Average Packet Size'
]

# Filter to features that exist in the dataset
key_features = [f for f in key_features if f in df.columns]

# Plot distributions
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for i, feature in enumerate(key_features[:8]):
    # Remove inf and nan for visualization
    data = df[feature].replace([np.inf, -np.inf], np.nan).dropna()
    
    axes[i].hist(data, bins=50, color='steelblue', alpha=0.7, edgecolor='black')
    axes[i].set_title(feature, fontsize=10, fontweight='bold')
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

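# Hide unused subplots when fewer than 8 key features exist in the dataset
for ax in axes[len(key_features[:8]):]:
    ax.set_visible(False)

plt.tight_layout()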
plt.suptitle('Feature Distributions (Key Features)', fontsize=14, fontweight='bold', y=1.02)
plt.show()

print("Note: Many network features show heavy right-skew (exponential/power-law distributions)")
print("Preprocessing will apply log transformation or standardization")
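
As a rough illustration of the log transformation mentioned above, the sketch below applies `np.log1p` to a synthetic right-skewed feature and compares skewness before and after. The data is generated here for illustration and is not taken from CICIDS2017. `log1p` is used as one common choice because it maps zero to zero; it assumes non-negative inputs.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed feature, similar in shape to the generated sample data
values = pd.Series(rng.exponential(5000, size=10_000), name="Flow Duration")

# log1p computes log(1 + x), so zeros map to 0 instead of -inf
transformed = np.log1p(values)

print(f"Skewness before: {values.skew():.2f}")
print(f"Skewness after:  {transformed.skew():.2f}")
```

The transform reduces the magnitude of the skew but does not necessarily remove it, so it is worth re-plotting the histograms after transforming the real features.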

## 7. Feature Correlations

In [None]:
# Calculate correlation matrix for subset of features
subset_features = numeric_cols[:20]  # First 20 numeric features

# Remove inf/nan for correlation calculation
corr_data = df[subset_features].replace([np.inf, -np.inf], np.nan).dropna()
corr_matrix = corr_data.corr()

# Plot correlation heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, 
            linewidths=0.5, cbar_kws={'label': 'Correlation'})
plt.title('Feature Correlation Matrix (First 20 Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Find highly correlated feature pairs
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.9:
            high_corr_pairs.append((
                corr_matrix.columns[i],
                corr_matrix.columns[j],
                corr_matrix.iloc[i, j]
            ))

print(f"\nHighly Correlated Feature Pairs (|r| > 0.9): {len(high_corr_pairs)}")
if high_corr_pairs:
    for feat1, feat2, corr in sorted(high_corr_pairs, key=lambda x: abs(x[2]), reverse=True)[:10]:
        print(f"  {feat1} <-> {feat2}: {corr:.3f}")
    print("\nNote: Consider removing redundant features during feature selection")
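
One common way to act on the highly correlated pairs is to drop the later column of each pair, using the strict upper triangle of the absolute correlation matrix. The sketch below assumes this simple rule; the helper name `correlated_columns_to_drop` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def correlated_columns_to_drop(frame, threshold=0.9):
    """Return columns that correlate above threshold with an earlier column."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Hypothetical toy data: the second column is a linear copy of the first
toy = pd.DataFrame({
    "Total Fwd Packets": base,
    "Subflow Fwd Packets": base * 2.0 + 1.0,
    "Flow IAT Mean": rng.normal(size=500),
})

to_drop = correlated_columns_to_drop(toy, threshold=0.9)
print(to_drop)
```

This rule keeps whichever column appears first, which is arbitrary. Domain knowledge about which feature is easier to interpret may justify a different choice.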

## 8. Data Quality Summary

In [None]:
print("=" * 80)
print("DATA QUALITY SUMMARY")
print("=" * 80)

summary = {
    'Total Rows': f"{len(df):,}",
    'Total Features': len(feature_cols),
    'Numeric Features': len(numeric_cols),
    'Features with Missing Values': len(missing_df_filtered),
    'Features with Infinite Values': len(inf_counts),
    'Duplicate Rows': f"{duplicate_count:,} ({duplicate_percent:.2f}%)",
    'Constant Features': len(constant_features),
    'Highly Correlated Pairs': len(high_corr_pairs),
    'Unique Labels': df[label_col].nunique(),
    'Memory Usage (MB)': f"{df.memory_usage(deep=True).sum() / 1024**2:.2f}"
}

for key, value in summary.items():
    print(f"{key:.<35} {value}")

print("\n" + "=" * 80)

## 9. Preprocessing Requirements

Based on the EDA findings, document required preprocessing steps.

In [None]:
print("=" * 80)
print("PREPROCESSING REQUIREMENTS")
print("=" * 80)
print()
print("Required Preprocessing Steps:")
print()

steps = [
    ("1. Remove Duplicates", f"Drop {duplicate_count:,} duplicate rows"),
    ("2. Handle Infinite Values", f"Replace inf in {len(inf_counts)} columns with NaN or max/min values"),
    ("3. Handle Missing Values", f"Impute or remove missing values in {len(missing_df_filtered)} columns"),
    ("4. Remove Constant Features", f"Drop {len(constant_features)} zero-variance features"),
    ("5. Handle Highly Correlated Features", f"Review {len(high_corr_pairs)} feature pairs with |r| > 0.9 and consider removing redundant features"),
    ("6. Feature Scaling", "Apply StandardScaler or MinMaxScaler to normalize feature ranges"),
    ("7. Encode Labels", "Convert labels to binary (BENIGN=0 for anomaly detection)"),
    ("8. Train/Test Split", "Split data maintaining temporal order if timestamps available"),
]

for step, description in steps:
    print(f"{step}")
    print(f"   → {description}")
    print()

print("=" * 80)
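
Steps 1 to 4 of the list above could be combined into a single cleaning helper. The following is a minimal sketch under stated assumptions, not the project's preprocessing pipeline: the function name `clean_flows` is hypothetical, median imputation is just one option, and correlation filtering, scaling, label encoding and splitting are left out.

```python
import numpy as np
import pandas as pd

def clean_flows(frame, label_col="Label"):
    """Hypothetical helper: dedupe, neutralize inf/NaN, drop constant columns."""
    out = frame.copy()
    out.columns = out.columns.str.strip()

    # Step 1: remove exact duplicate rows
    out = out.drop_duplicates().reset_index(drop=True)

    features = [c for c in out.columns if c != label_col]
    numeric = out[features].select_dtypes(include=[np.number]).columns.tolist()

    # Steps 2 and 3: treat +/-inf as missing, then impute with the column median
    out[numeric] = out[numeric].replace([np.inf, -np.inf], np.nan)
    out[numeric] = out[numeric].fillna(out[numeric].median())

    # Step 4: drop zero-variance columns
    constant = [c for c in numeric if out[c].nunique() <= 1]
    out = out.drop(columns=constant)
    return out, constant

# Hypothetical toy data with one duplicate row, one inf, one NaN, one constant column
toy = pd.DataFrame({
    " Flow Bytes/s": [10.0, np.inf, 30.0, 10.0],
    "Fwd IAT Mean": [1.0, 2.0, np.nan, 1.0],
    "Bwd URG Flags": [0, 0, 0, 0],
    "Label": ["BENIGN"] * 4,
})

cleaned, dropped = clean_flows(toy)
print(cleaned)
print("Dropped constant columns:", dropped)
```

Duplicates are removed before imputation so that repeated rows do not bias the medians. If a separate test set is used, the medians should be computed on the training data only and then reused.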

## 10. Key Insights

In [None]:
print("=" * 80)
print("KEY INSIGHTS")
print("=" * 80)
print()

insights = [
    "Dataset Characteristics:",
    f"  - {len(df):,} network flow records with {len(feature_cols)} features",
    f"  - Monday data should be 100% benign traffic (baseline for anomaly detection)",
    f"  - Memory footprint: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB",
    "",
    "Data Quality Issues:",
    f"  - Infinite values present in {len(inf_counts)} features (likely from division by zero)",
    f"  - Missing values in {len(missing_df_filtered)} features",
    f"  - {duplicate_count:,} duplicate rows need removal",
    f"  - {len(constant_features)} constant features provide no information",
    "",
    "Feature Characteristics:",
    "  - Most features show heavy right-skew (typical for network traffic)",
    "  - Strong correlations between related features (e.g., packet counts and byte totals)",
    "  - Wide range of scales (ms to MB/s) requires normalization",
    "",
    "Recommendations:",
    "  - Use robust preprocessing pipeline to handle inf/nan values",
    "  - Apply feature scaling (StandardScaler recommended)",
    "  - Consider feature selection to reduce dimensionality",
    "  - Monday (benign) data perfect for training unsupervised anomaly detectors",
    "  - Use Tuesday-Friday data for testing anomaly detection performance",
]

for insight in insights:
    print(insight)

print()
print("=" * 80)

## 11. Next Steps

With EDA complete, proceed to:

1. **Build Preprocessing Pipeline** (`src/data/preprocessing.py`)
   - Implement cleaning functions
   - Handle inf/nan values
   - Feature scaling
   - Train/test splitting

2. **Implement Isolation Forest** (first anomaly detection algorithm)
   - Train on Monday (benign) data
   - Evaluate on test set
   - Extract feature importance

3. **Download Additional Days**
   - Tuesday-Friday data for testing attack detection
   - Evaluate per-attack-type performance

4. **Update Progress Journal**
   - Document EDA findings in `claude.md`
   - Record preprocessing decisions
   - Note any issues or blockers

## 12. Save EDA Results

In [None]:
import json

# Save summary statistics
results_dir = Path('../results')
results_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent directories

# Save data quality report (plain Python types so the dict is JSON-serializable)
report = {
    'total_rows': len(df),
    'total_features': len(feature_cols),
    'numeric_features': len(numeric_cols),
    'missing_value_cols': len(missing_df_filtered),
    'infinite_value_cols': len(inf_counts),
    'duplicate_rows': int(duplicate_count),
    'constant_features': len(constant_features),
    'high_corr_pairs': len(high_corr_pairs),
    'memory_mb': round(float(df.memory_usage(deep=True).sum()) / 1024**2, 2),
    'is_real_data': is_real_data
}

with open(results_dir / 'eda_summary.json', 'w') as f:
    json.dump(report, f, indent=2)

print("EDA summary saved to: ../results/eda_summary.json")
print("\nEDA Complete!")