# 🛡️ Network Intrusion Detection System (NIDS)

## Project Overview

This notebook documents the development of a **machine-learning-based Network Intrusion Detection System** that analyzes network traffic and identifies malicious activity, reaching a **weighted F1-score of about 95%** on the held-out test set.

### What We'll Build:
1. **Data Loading & Exploration** - Understanding the CICIDS2017 dataset
2. **Feature Engineering** - Preprocessing, scaling, and handling class imbalance
3. **Model Training** - Random Forest classifier for attack detection
4. **Model Evaluation** - Performance metrics and feature importance
5. **Interactive Dashboard** - Real-time visualization with Streamlit

### Technologies Used:
- **Python 3.8+**
- **Scikit-learn** - Machine Learning
- **Pandas & NumPy** - Data manipulation
- **Matplotlib & Plotly** - Visualization
- **Streamlit** - Interactive dashboard

---

## 📦 1. Setup & Imports

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix
)

# For handling class imbalance
from imblearn.under_sampling import RandomUnderSampler

# Model persistence
import joblib
import os
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', 60)
pd.set_option('display.width', 1000)

# Plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("✅ All libraries imported successfully!")

## 📂 2. Data Loading

We're using the **CICIDS2017: Cleaned & Preprocessed** dataset from Kaggle.

**Dataset Info:**
- Source: Canadian Institute for Cybersecurity
- Contains labeled network traffic (benign + attacks)
- Pre-cleaned version with 2.5M+ samples

In [None]:
# Load the dataset
DATA_FILE = "data/cicids2017_cleaned.csv"

print("📂 Loading dataset...")
df = pd.read_csv(DATA_FILE, low_memory=False)

print(f"✅ Dataset loaded successfully!")
print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")

In [None]:
# Display first few rows
print("📋 First 5 rows of the dataset:")
df.head()

In [None]:
# Dataset info
print("📊 Dataset Information:")
print(f"   Total Samples: {len(df):,}")
print(f"   Total Features: {len(df.columns) - 1}")  # Excluding label
print(f"   Label Column: 'Attack Type'")
print(f"\n   Memory Usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print(f"\n   Missing Values: {df.isnull().sum().sum()}")
print(f"   Duplicate Rows: {df.duplicated().sum()}")

In [None]:
# Display all column names
print("📝 All Columns:")
for i, col in enumerate(df.columns, 1):
    print(f"   {i:2}. {col}")

## 📊 3. Exploratory Data Analysis (EDA)

In [None]:
# Attack type distribution
print("🎯 Attack Type Distribution:")
attack_counts = df['Attack Type'].value_counts()
attack_percentages = (attack_counts / len(df) * 100).round(2)

distribution = pd.DataFrame({
    'Attack Type': attack_counts.index,
    'Count': attack_counts.values,
    'Percentage (%)': attack_percentages.values
})
print(distribution.to_string(index=False))

In [None]:
# Visualize attack distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
colors = ['#00C851', '#ff4444', '#CC0000', '#ffbb33', '#ff8800', '#aa66cc', '#0099CC']
axes[0].pie(attack_counts.values, labels=attack_counts.index, autopct='%1.1f%%', 
            colors=colors, explode=[0.05]*len(attack_counts))
axes[0].set_title('Attack Type Distribution', fontsize=14, fontweight='bold')

# Bar chart
bars = axes[1].bar(attack_counts.index, attack_counts.values, color=colors)
axes[1].set_xlabel('Attack Type', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].set_title('Attack Type Counts', fontsize=14, fontweight='bold')
axes[1].tick_params(axis='x', rotation=45)

# Add value labels on bars
for bar, count in zip(bars, attack_counts.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1000, 
                 f'{count:,}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.savefig('attack_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print("\n⚠️  Class Imbalance Detected!")
print(f"   Normal Traffic represents {attack_percentages['Normal Traffic']:.1f}% of data")
print("   We'll need to handle this in preprocessing!")

In [None]:
# Statistical summary of numeric features
print("📈 Statistical Summary (first 10 features):")
df.describe().iloc[:, :10]

In [None]:
# Correlation heatmap for a subset of features
# Select the first 15 numeric columns (by column order, not by importance) for readability
numeric_cols = df.select_dtypes(include=[np.number]).columns[:15]

plt.figure(figsize=(12, 10))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            fmt='.2f', square=True, linewidths=0.5)
plt.title('Feature Correlation Heatmap (First 15 Numeric Features)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

## ⚙️ 4. Data Preprocessing & Feature Engineering

Steps:
1. Clean data (remove duplicates, handle missing/infinite values)
2. Encode labels (text → numbers)
3. Separate features (X) from the target (y)
4. Split into training and testing sets
5. Scale features using StandardScaler
6. Handle class imbalance using undersampling

In [None]:
# For faster execution, we'll use a sample of the data
# You can set SAMPLE_SIZE = None to use the full dataset
SAMPLE_SIZE = 200000  # 200K samples for demonstration

if SAMPLE_SIZE:
    print(f"üìä Using sample of {SAMPLE_SIZE:,} rows for faster training...")
    df_sample = df.sample(n=SAMPLE_SIZE, random_state=42)
else:
    print("üìä Using full dataset...")
    df_sample = df.copy()

print(f"   Working with: {len(df_sample):,} samples")

In [None]:
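# Step 1: Data Cleaning
print("🧹 Step 1: Data Cleaning")
print("-" * 40)

# Remove duplicates (explicit copy avoids pandas' SettingWithCopyWarning
# when columns are added to df_clean later)
initial_rows = len(df_sample)
df_clean = df_sample.drop_duplicates().copy()
print(f"   Removed {initial_rows - len(df_clean):,} duplicate rows")

# Count infinite values, then convert them to NaN so they can be dropped
numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
inf_count = np.isinf(df_clean[numeric_cols]).sum().sum()
print(f"   Infinite values found: {inf_count}")
df_clean[numeric_cols] = df_clean[numeric_cols].replace([np.inf, -np.inf], np.nan)

# Count NaN values (now including any former infinities) and drop those rows
nan_count = df_clean.isna().sum().sum()
print(f"   NaN values found: {nan_count}")
df_clean = df_clean.dropna()

print(f"\n   ✅ Clean data: {len(df_clean):,} rows")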

In [None]:
# Step 2: Label Encoding
print("🏷️  Step 2: Label Encoding")
print("-" * 40)

LABEL_COLUMN = "Attack Type"

label_encoder = LabelEncoder()
df_clean['Label_Encoded'] = label_encoder.fit_transform(df_clean[LABEL_COLUMN])

# Create label mapping
label_mapping = dict(zip(
    label_encoder.transform(label_encoder.classes_),
    label_encoder.classes_
))

print("   Label Mapping:")
for code, name in sorted(label_mapping.items()):
    count = (df_clean['Label_Encoded'] == code).sum()
    print(f"      {code} → {name} ({count:,} samples)")

In [None]:
# Step 3: Prepare Features (X) and Target (y)
print("📦 Step 3: Preparing Features")
print("-" * 40)

# Exclude label columns from features
exclude_cols = [LABEL_COLUMN, 'Label_Encoded']
feature_cols = [col for col in df_clean.columns if col not in exclude_cols]

X = df_clean[feature_cols].copy()
y = df_clean['Label_Encoded'].copy()

print(f"   Features (X): {X.shape}")
print(f"   Target (y): {y.shape}")
print(f"   Feature columns: {len(feature_cols)}")

In [None]:
# Step 4: Train/Test Split
print("✂️  Step 4: Train/Test Split")
print("-" * 40)

TEST_SIZE = 0.2
RANDOM_STATE = 42

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y  # Maintain class proportions
)

print(f"   Training set: {len(X_train):,} samples ({100-TEST_SIZE*100:.0f}%)")
print(f"   Testing set: {len(X_test):,} samples ({TEST_SIZE*100:.0f}%)")
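
The `stratify=y` argument keeps each class's share roughly the same in both splits, which matters when some attack types are rare. As a rough illustration of the idea, here is a standard-library sketch that splits each class separately. It is not scikit-learn's exact allocation logic, and the label counts are made up for the example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting every class by the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 80 / 15 / 5 samples per class
labels = [0] * 80 + [1] * 15 + [2] * 5
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

Without stratification, a purely random 20% draw could leave the rarest class out of the test set entirely, which would make its per-class metrics meaningless.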

In [None]:
# Step 5: Feature Scaling
print("⚖️  Step 5: Feature Scaling")
print("-" * 40)

scaler = StandardScaler()

# Fit on training data ONLY, then transform both
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"   Scaler fitted on training data")
print(f"   Training data scaled: {X_train_scaled.shape}")
print(f"   Testing data scaled: {X_test_scaled.shape}")
print(f"\n   Feature means after scaling: {X_train_scaled.mean():.6f} (should be ~0)")
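
To make the "fit on training data only" rule concrete, the following standard-library sketch standardizes a single column by hand. It assumes the same convention as `StandardScaler` (population standard deviation, constant columns left unscaled). The duration values are hypothetical.

```python
import statistics

def fit_standardizer(train_column):
    """Learn mean and population std from the training values only."""
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)
    # Guard against constant columns to avoid division by zero
    return mean, (std if std > 0 else 1.0)

def standardize(values, mean, std):
    """Apply previously learned statistics to any set of values."""
    return [(v - mean) / std for v in values]

# Hypothetical flow durations, for illustration only
train_durations = [10.0, 20.0, 30.0, 40.0]
test_durations = [25.0, 100.0]

mean, std = fit_standardizer(train_durations)
train_z = standardize(train_durations, mean, std)
test_z = standardize(test_durations, mean, std)  # reuses the training statistics
print(train_z, test_z)
```

The test values never influence `mean` or `std`, so nothing about the test set leaks into preprocessing. As a side effect, the scaled test data will generally not have mean 0 exactly.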

In [None]:
# Step 6: Handle Class Imbalance
print("🎯 Step 6: Handling Class Imbalance")
print("-" * 40)

print("   Before resampling:")
unique, counts = np.unique(y_train, return_counts=True)
for label, count in zip(unique, counts):
    print(f"      Class {label} ({label_mapping[label]}): {count:,}")

# Apply undersampling
undersampler = RandomUnderSampler(random_state=RANDOM_STATE)
X_train_balanced, y_train_balanced = undersampler.fit_resample(X_train_scaled, y_train)

print("\n   After resampling:")
unique, counts = np.unique(y_train_balanced, return_counts=True)
for label, count in zip(unique, counts):
    print(f"      Class {label} ({label_mapping[label]}): {count:,}")

print(f"\n   Total: {len(y_train):,} → {len(y_train_balanced):,}")
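
With its default settings, `RandomUnderSampler` shrinks every class to the size of the rarest one. The sketch below reproduces that idea with the standard library only, to show how much data this can discard. The class counts are hypothetical and the helper is illustrative, not imbalanced-learn's implementation.

```python
import random
from collections import Counter, defaultdict

def random_undersample(rows, labels, seed=42):
    """Randomly keep as many rows per class as the rarest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label in sorted(by_class):
        for row in rng.sample(by_class[label], target):  # sample without replacement
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Hypothetical imbalanced training labels: 90 / 8 / 2 samples per class
labels = [0] * 90 + [1] * 8 + [2] * 2
rows = [[float(i)] for i in range(len(labels))]

bal_rows, bal_labels = random_undersample(rows, labels)
counts = Counter(bal_labels)
print(counts)
```

The balanced set ends up with (number of classes) × (rarest class count) rows, so a very rare attack type can reduce the training set drastically. It is worth checking the "After resampling" counts above before trusting the model.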

## 🤖 5. Model Training

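We'll use a **Random Forest classifier** because it:
- Works well on tabular data with many features
- Captures non-linear relationships without manual feature transforms
- Provides feature importance rankings
- Is less prone to overfitting than a single decision tree, since it averages many trees trained on bootstrapped samples
- Trains and predicts in parallel across CPU cores (`n_jobs=-1`)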

In [None]:
# Train Random Forest Classifier
print("🌲 Training Random Forest Classifier")
print("=" * 50)

# Model parameters
N_ESTIMATORS = 100  # Number of trees
MAX_DEPTH = None    # No limit on tree depth
N_JOBS = -1         # Use all CPU cores

print(f"   Parameters:")
print(f"      - n_estimators: {N_ESTIMATORS}")
print(f"      - max_depth: {MAX_DEPTH}")
print(f"      - n_jobs: {N_JOBS}")
print(f"   Training data: {X_train_balanced.shape[0]:,} samples")

# Create and train model
model = RandomForestClassifier(
    n_estimators=N_ESTIMATORS,
    max_depth=MAX_DEPTH,
    random_state=RANDOM_STATE,
    n_jobs=N_JOBS,
    class_weight='balanced'  # No effect after undersampling (classes are already equal); kept as a safeguard
)

print("\n   Training in progress...")
import time
start_time = time.time()

model.fit(X_train_balanced, y_train_balanced)

training_time = time.time() - start_time
print(f"   ✅ Training complete in {training_time:.2f} seconds")

# Training accuracy
train_accuracy = model.score(X_train_balanced, y_train_balanced)
print(f"   Training accuracy: {train_accuracy * 100:.2f}%")

## 📊 6. Model Evaluation

In [None]:
# Make predictions on test set
print("📊 Model Evaluation")
print("=" * 50)

y_pred = model.predict(X_test_scaled)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted', zero_division=0)
recall = recall_score(y_test, y_pred, average='weighted', zero_division=0)
f1 = f1_score(y_test, y_pred, average='weighted', zero_division=0)

print("\n📈 OVERALL METRICS:")
print(f"   Accuracy:  {accuracy * 100:.2f}%")
print(f"   Precision: {precision * 100:.2f}%")
print(f"   Recall:    {recall * 100:.2f}%")
print(f"   F1-Score:  {f1 * 100:.2f}%")
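
The metrics above use `average='weighted'`, which weights each class's score by its number of test samples. On imbalanced traffic this can hide weak performance on rare attacks. The sketch below computes weighted and macro F1 by hand from per-class counts. All counts and the "DoS" and "Bots" class names are hypothetical, chosen only to illustrate the gap.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical per-class (TP, FP, FN) counts, for illustration only
per_class = {
    "Normal Traffic": (900, 10, 10),
    "DoS": (80, 5, 5),
    "Bots": (1, 4, 4),
}

f1_scores = {name: f1_from_counts(*c) for name, c in per_class.items()}
support = {name: tp + fn for name, (tp, fp, fn) in per_class.items()}
total = sum(support.values())

# Weighted: each class counts in proportion to its support
weighted_f1 = sum(f1_scores[n] * support[n] for n in per_class) / total
# Macro: every class counts equally, however rare
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f"weighted F1: {weighted_f1:.3f}, macro F1: {macro_f1:.3f}")
```

In this made-up case the weighted score looks excellent while the rare class is detected only one time in five. Reporting macro F1 (or the per-class table later in this section) alongside the weighted score gives a fairer picture.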

In [None]:
# Detailed classification report
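print("\n📋 CLASSIFICATION REPORT:")
print("=" * 70)

# Pass labels explicitly so names stay aligned even if a class is absent from y_test
class_labels = sorted(label_mapping.keys())
class_names = [label_mapping[i] for i in class_labels]
report = classification_report(y_test, y_pred, labels=class_labels,
                               target_names=class_names, zero_division=0)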
print(report)

In [None]:
# Confusion Matrix Visualization
print("🎯 Confusion Matrix:")

cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_names, yticklabels=class_names)
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('Confusion Matrix - Attack Classification', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150, bbox_inches='tight')
plt.show()

In [None]:
# Per-class performance visualization
print("📊 Per-Class Performance:")

# Calculate per-class metrics
from sklearn.metrics import precision_recall_fscore_support

# labels keeps the output arrays aligned with class_names
precision_per_class, recall_per_class, f1_per_class, support = precision_recall_fscore_support(
    y_test, y_pred, labels=sorted(label_mapping.keys()), zero_division=0
)

metrics_df = pd.DataFrame({
    'Attack Type': class_names,
    'Precision': precision_per_class,
    'Recall': recall_per_class,
    'F1-Score': f1_per_class,
    'Support': support
})

# Visualization
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(class_names))
width = 0.25

bars1 = ax.bar(x - width, metrics_df['Precision'], width, label='Precision', color='#3498db')
bars2 = ax.bar(x, metrics_df['Recall'], width, label='Recall', color='#e74c3c')
bars3 = ax.bar(x + width, metrics_df['F1-Score'], width, label='F1-Score', color='#2ecc71')

ax.set_xlabel('Attack Type', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Per-Class Performance Metrics', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(class_names, rotation=45, ha='right')
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.savefig('per_class_performance.png', dpi=150, bbox_inches='tight')
plt.show()

print(metrics_df.to_string(index=False))

## 🔍 7. Feature Importance Analysis

Understanding which features are most important for detecting attacks.
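
Random Forest's `feature_importances_` are impurity-based: a feature scores highly when splitting on it makes the resulting groups purer. The sketch below computes the Gini impurity decrease for a single split by hand. The feature names and values are hypothetical and only meant to show the intuition.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, higher for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(values, labels, threshold):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_child = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted_child

labels = ["benign", "benign", "benign", "attack", "attack", "attack"]
packet_rate = [1.0, 2.0, 3.0, 50.0, 60.0, 70.0]     # hypothetical: separates the classes
header_len = [20.0, 40.0, 20.0, 40.0, 20.0, 40.0]   # hypothetical: nearly uninformative

useful = impurity_decrease(packet_rate, labels, threshold=10.0)
noisy = impurity_decrease(header_len, labels, threshold=30.0)
print(f"packet_rate: {useful:.3f}, header_len: {noisy:.3f}")
```

The forest aggregates such decreases over all splits and trees, then normalizes them. Impurity-based scores can favor features with many distinct values, so treat the ranking as a guide rather than a definitive measure.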

In [None]:
# Get feature importance
print("🔍 Feature Importance Analysis")
print("=" * 50)

feature_importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\nTop 15 Most Important Features:")
print(feature_importance.head(15).to_string(index=False))

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 8))

top_15 = feature_importance.head(15).sort_values('Importance', ascending=True)

colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(top_15)))
plt.barh(top_15['Feature'], top_15['Importance'], color=colors)
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Top 15 Most Important Features for Attack Detection', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()

## 💾 8. Save Model and Preprocessing Objects

In [None]:
# Create models directory if it doesn't exist
MODELS_DIR = "models"
os.makedirs(MODELS_DIR, exist_ok=True)

print("💾 Saving Model and Preprocessing Objects")
print("=" * 50)

# Save model
model_path = os.path.join(MODELS_DIR, 'random_forest_model.joblib')
joblib.dump(model, model_path)
print(f"   ✅ Model saved: {model_path}")

# Save scaler
scaler_path = os.path.join(MODELS_DIR, 'scaler.joblib')
joblib.dump(scaler, scaler_path)
print(f"   ✅ Scaler saved: {scaler_path}")

# Save label encoder
encoder_path = os.path.join(MODELS_DIR, 'label_encoder.joblib')
joblib.dump(label_encoder, encoder_path)
print(f"   ✅ Label encoder saved: {encoder_path}")

# Save label mapping
mapping_path = os.path.join(MODELS_DIR, 'label_mapping.joblib')
joblib.dump(label_mapping, mapping_path)
print(f"   ✅ Label mapping saved: {mapping_path}")

# Save feature names
features_path = os.path.join(MODELS_DIR, 'feature_names.joblib')
joblib.dump(feature_cols, features_path)
print(f"   ✅ Feature names saved: {features_path}")

# Display file sizes
print("\n📁 Saved Files:")
for filename in os.listdir(MODELS_DIR):
    filepath = os.path.join(MODELS_DIR, filename)
    size_mb = os.path.getsize(filepath) / (1024 * 1024)
    print(f"   - {filename}: {size_mb:.2f} MB")

## 🎯 9. Making Predictions on New Data

In [None]:
# Example: Make prediction on a single sample
print("🎯 Example Prediction")
print("=" * 50)

# Get a random sample from test set
sample_idx = np.random.randint(0, len(X_test))
sample = X_test_scaled[sample_idx].reshape(1, -1)
actual_label = y_test.iloc[sample_idx]

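# Predict
predicted_label = model.predict(sample)[0]
probabilities = model.predict_proba(sample)[0]

# predict_proba columns follow model.classes_, so pair them explicitly
# instead of assuming column i belongs to class code i
proba_by_class = dict(zip(model.classes_, probabilities))

print(f"\n   Actual: {label_mapping[actual_label]}")
print(f"   Predicted: {label_mapping[predicted_label]}")
print(f"   Confidence: {proba_by_class[predicted_label] * 100:.2f}%")

print("\n   All Probabilities:")
for class_code, prob in proba_by_class.items():
    print(f"      {label_mapping[class_code]}: {prob * 100:.2f}%")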
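
The example above reuses an already-scaled test row. Genuinely new traffic would first need its features arranged in the same column order used during training, which is why the feature names are saved alongside the model. The sketch below shows one way to do that alignment. The helper `to_feature_row`, the feature names, and the choice to fill missing values with `0.0` are all assumptions for illustration.

```python
def to_feature_row(flow, feature_names, default=0.0):
    """Order a flow's features like the training columns; report anything missing."""
    missing = [name for name in feature_names if name not in flow]
    # Extra keys in `flow` are ignored; missing ones fall back to `default`
    row = [float(flow.get(name, default)) for name in feature_names]
    return row, missing

# Hypothetical saved feature order and incoming flow record
feature_names = ["Flow Duration", "Total Fwd Packets", "Flow Bytes/s"]
flow = {"Flow Bytes/s": 1500.0, "Flow Duration": 120, "Extra Field": 1}

row, missing = to_feature_row(flow, feature_names)
print(row, missing)
```

The resulting `row` would then go through the saved scaler and model, for example `model.predict(scaler.transform([row]))`. Silently filling missing features can distort predictions, so rejecting or logging flows with a non-empty `missing` list may be the safer choice.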

## 📊 10. Summary & Conclusions

In [None]:
print("=" * 60)
print("🏆 PROJECT SUMMARY")
print("=" * 60)

print("\n📊 Dataset:")
print(f"   - Source: CICIDS2017 (Cleaned & Preprocessed)")
print(f"   - Total Samples: {len(df):,}")
print(f"   - Features: {len(feature_cols)}")
print(f"   - Classes: {len(label_mapping)} traffic classes (benign + attacks)")

print("\n🤖 Model:")
print(f"   - Algorithm: Random Forest Classifier")
print(f"   - Trees: {N_ESTIMATORS}")
print(f"   - Training Time: {training_time:.2f} seconds")

print("\n📈 Performance:")
print(f"   - Accuracy:  {accuracy * 100:.2f}%")
print(f"   - Precision: {precision * 100:.2f}%")
print(f"   - Recall:    {recall * 100:.2f}%")
print(f"   - F1-Score:  {f1 * 100:.2f}%")

print("\n🔝 Top 5 Most Important Features:")
# feature_importance is already sorted, so the rank is just the row position
for rank, (_, row) in enumerate(feature_importance.head(5).iterrows(), start=1):
    print(f"   {rank}. {row['Feature']}: {row['Importance']:.4f}")

print("\n✅ Project Complete!")
print("   Run 'streamlit run dashboard/app.py' to launch the interactive dashboard.")