# Centralized Baseline Models (Leakage-Safe)

This notebook builds reproducible baseline models (Logistic Regression and Random Forest) for binary intrusion detection on a 10% stratified sample of CICIDS2017.

**Key Features:**
- ✅ Proper train/validation/test split (no leakage)
- ✅ Imputation, scaling, and PCA in a single sklearn Pipeline (federated-ready); inf values are converted to NaN just before it runs
- ✅ PCA fitted on training data only
- ✅ Threshold tuning on validation set
- ✅ Single evaluation on test set

**Data Split Strategy:**
- Training: Monday-Thursday files
- Validation: 20% holdout from training
- Test: Friday files (untouched until final evaluation)

In [2]:
import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

In [3]:
# Project-relative paths; assumes the notebook runs from a direct subdirectory of the project root
PROJECT_ROOT = Path.cwd().parent
DATA_PATH = PROJECT_ROOT / "data" / "processed" / "cicids_10pct_stratified.csv"

print(f"Loading data from: {DATA_PATH}")
df = pd.read_csv(DATA_PATH)
print(f"Dataset shape: {df.shape}")

Loading data from: d:\Coding\VanetUAV\data\processed\cicids_10pct_stratified.csv
Dataset shape: (283074, 84)


## 1. Clean Data Split (Training Files vs Test Files)

**Critical:** We must split by source file FIRST, before any preprocessing, to avoid temporal leakage.

In [4]:
# Define temporal split by day (Monday-Thursday = train, Friday = test)
train_files = [
    "Monday-WorkingHours.pcap_ISCX.csv",
    "Tuesday-WorkingHours.pcap_ISCX.csv", 
    "Wednesday-workingHours.pcap_ISCX.csv",
    "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
    "Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv"
]

test_files = [
    "Friday-WorkingHours-Morning.pcap_ISCX.csv",
    "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv", 
    "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"
]

# Create clean binary target
df["target"] = (df["Label"] != "BENIGN").astype(int)

# Split data by files
df_train_raw = df[df["source_file"].isin(train_files)].copy()
df_test = df[df["source_file"].isin(test_files)].copy()

print(f"Training data shape: {df_train_raw.shape}")
print(f"Test data shape: {df_test.shape}")
print(f"Training attack rate: {df_train_raw['target'].mean():.3f}")
print(f"Test attack rate: {df_test['target'].mean():.3f}")

Training data shape: (212883, 85)
Test data shape: (70191, 85)
Training attack rate: 0.126
Test attack rate: 0.412
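
Because `isin` matches file names exactly, a typo or a case mismatch in either list would silently drop rows rather than raise an error. In this run the two subsets add up to the full 283,074 rows, so nothing was lost. A minimal sketch of a reusable check follows; the helper name `check_file_split` and the toy file names are hypothetical, not part of the notebook.

```python
import pandas as pd

def check_file_split(df, train_files, test_files, file_col="source_file"):
    """Raise if a file is in both lists; return row counts of unassigned files."""
    overlap = set(train_files) & set(test_files)
    if overlap:
        raise ValueError(f"Files in both train and test: {sorted(overlap)}")
    assigned = set(train_files) | set(test_files)
    # Rows whose source file appears in neither list would be dropped by the split
    return df.loc[~df[file_col].isin(assigned), file_col].value_counts()

# Toy frame with hypothetical file names
toy = pd.DataFrame({"source_file": ["a.csv", "a.csv", "b.csv", "c.csv"]})
unassigned = check_file_split(toy, train_files=["a.csv"], test_files=["b.csv"])
print(unassigned)
```

In the notebook, the equivalent call would be `check_file_split(df, train_files, test_files)`, and an empty result means every row landed in one of the two subsets.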


## 2. Feature Preparation (Training Data Only)

**Critical:** All preprocessing steps must be fitted on training data only.

In [5]:
# Define metadata columns to exclude
META_COLS = ["Label", "source_file", "day", "attack_group", "target"]
if "label_bin" in df.columns:
    META_COLS.append("label_bin")
if "label_binary" in df.columns:
    META_COLS.append("label_binary")

# Extract feature matrix from training data
X_train_raw = df_train_raw.drop(columns=META_COLS)
y_train_raw = df_train_raw["target"]

# Extract test features (for final evaluation only)
X_test = df_test.drop(columns=META_COLS)
y_test = df_test["target"]

print(f"Feature matrix shape: {X_train_raw.shape}")
print(f"Feature columns: {X_train_raw.dtypes.value_counts()}")

Feature matrix shape: (212883, 78)
Feature columns: int64      54
float64    24
Name: count, dtype: int64


In [None]:
# Clean features: keep only numeric columns
numeric_features = X_train_raw.select_dtypes(include=[np.number]).columns.tolist()
X_train_raw = X_train_raw[numeric_features]
X_test = X_test[numeric_features]

print(f"Numeric features: {len(numeric_features)}")
print(f"Training shape after numeric filter: {X_train_raw.shape}")

## 3. Train/Validation Split

Split the Monday-Thursday data 80/20, stratified on the target, into train and validation sets. The validation set is used only for threshold tuning.

In [6]:
# Create train/validation split from training data
X_train, X_val, y_train, y_val = train_test_split(
    X_train_raw, y_train_raw, 
    test_size=0.2, 
    random_state=RANDOM_STATE,
    stratify=y_train_raw
)

print(f"Final split:")
print(f"  Train: {X_train.shape[0]} samples, attack rate: {y_train.mean():.3f}")
print(f"  Val:   {X_val.shape[0]} samples, attack rate: {y_val.mean():.3f}") 
print(f"  Test:  {X_test.shape[0]} samples, attack rate: {y_test.mean():.3f}")

Final split:
  Train: 170306 samples, attack rate: 0.126
  Val:   42577 samples, attack rate: 0.126
  Test:  70191 samples, attack rate: 0.412


## 4. Preprocessing Pipeline (Leakage-Safe)

**This pipeline will be reusable for federated learning clients.**

In [7]:
# Create preprocessing pipeline: median imputation, scaling, and optional PCA.
# Note: SimpleImputer fills NaN only; inf values must be replaced with NaN
# before fit/transform (done in the next cell).
def create_preprocessing_pipeline(n_components=25):
    """Create a complete preprocessing pipeline.
    
    Args:
        n_components: Number of PCA components (None = no PCA)
    """
    steps = [
        ('imputer', SimpleImputer(strategy='median')),  # Fill NaN with the training median
        ('scaler', StandardScaler()),
    ]
    
    if n_components is not None:
        steps.append(('pca', PCA(n_components=n_components, random_state=RANDOM_STATE)))
    
    return Pipeline(steps)

# Build both variants; only the PCA pipeline is fitted and used below
pipeline_no_pca = create_preprocessing_pipeline(n_components=None)
pipeline_pca = create_preprocessing_pipeline(n_components=25)

print("Created preprocessing pipelines:")
print(f"  - Without PCA: {[step[0] for step in pipeline_no_pca.steps]}")
print(f"  - With PCA (25): {[step[0] for step in pipeline_pca.steps]}")

Created preprocessing pipelines:
  - Without PCA: ['imputer', 'scaler']
  - With PCA (25): ['imputer', 'scaler', 'pca']
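
`SimpleImputer` rejects infinite values, which is why the notebook replaces them with NaN before fitting. If the goal is to ship one self-contained object to federated clients, that replacement could live inside the pipeline instead. The following is a sketch under that assumption; `inf_to_nan` and `create_pipeline_with_inf_handling` are hypothetical names, and PCA is omitted for brevity.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def inf_to_nan(X):
    """Replace +/-inf with NaN so the imputer can fill them."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isfinite(X), X, np.nan)

def create_pipeline_with_inf_handling():
    """Hypothetical variant of the notebook pipeline with inf handling built in."""
    return Pipeline([
        ("inf_to_nan", FunctionTransformer(inf_to_nan)),
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ])

# Toy matrix containing one inf and one NaN
X_toy = np.array([[1.0, 10.0], [np.inf, 20.0], [3.0, np.nan], [5.0, 40.0]])
X_out = create_pipeline_with_inf_handling().fit_transform(X_toy)
print(np.isfinite(X_out).all())
```

A module-level function is used rather than a lambda so that the pipeline can still be serialized with `joblib`. The side that loads it needs the same function importable.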


In [8]:
# Replace inf with NaN for proper imputation
X_train_clean = X_train.replace([np.inf, -np.inf], np.nan)
X_val_clean = X_val.replace([np.inf, -np.inf], np.nan)
X_test_clean = X_test.replace([np.inf, -np.inf], np.nan)

# Fit preprocessing on training data only
X_train_processed = pipeline_pca.fit_transform(X_train_clean)
X_val_processed = pipeline_pca.transform(X_val_clean)
X_test_processed = pipeline_pca.transform(X_test_clean)

print(f"Processed shapes:")
print(f"  Train: {X_train_processed.shape}")
print(f"  Val:   {X_val_processed.shape}")
print(f"  Test:  {X_test_processed.shape}")

# Check PCA explained variance
pca_explained_var = pipeline_pca.named_steps['pca'].explained_variance_ratio_.cumsum()
print(f"PCA explained variance (25 components): {pca_explained_var[-1]:.3f}")

Processed shapes:
  Train: (170306, 25)
  Val:   (42577, 25)
  Test:  (70191, 25)
PCA explained variance (25 components): 0.958
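
One way to confirm that the fitted statistics come from the training rows alone is to compare the scaler's stored mean against the training mean and against a pooled mean. The sketch below uses synthetic data with a deliberately shifted "test" set; all variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_tr = rng.normal(loc=0.0, size=(200, 3))
X_te = rng.normal(loc=5.0, size=(50, 3))  # deliberately shifted distribution

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
pipe.fit(X_tr)  # fit on training rows only

scaler_mean = pipe.named_steps["scaler"].mean_
# Matches the training mean, not the mean of train and test pooled together
train_only = np.allclose(scaler_mean, X_tr.mean(axis=0))
pooled = np.allclose(scaler_mean, np.vstack([X_tr, X_te]).mean(axis=0))
print(train_only, pooled)
```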


## 5. Baseline Models

Train Logistic Regression and Random Forest on the 25 PCA components produced by the pipeline. Note that only the Random Forest uses `class_weight='balanced'`.

In [9]:
# Train Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE, n_jobs=-1)
lr.fit(X_train_processed, y_train)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=20, 
    min_samples_split=10,
    min_samples_leaf=5,
    class_weight='balanced',
    random_state=RANDOM_STATE,
    n_jobs=-1
)
rf.fit(X_train_processed, y_train)

print("Models trained successfully")



Models trained successfully


## 6. Threshold Tuning (Validation Set Only)

**Critical:** We tune thresholds on the validation set, never on the test set.

In [10]:
# Get validation predictions
y_val_prob_lr = lr.predict_proba(X_val_processed)[:, 1]
y_val_prob_rf = rf.predict_proba(X_val_processed)[:, 1]

def eval_threshold(y_true, y_prob, thresh, model_name=""):
    """Evaluate model at specific threshold"""
    y_pred = (y_prob >= thresh).astype(int)
    report = classification_report(y_true, y_pred, output_dict=True)
    auc = roc_auc_score(y_true, y_prob)
    
    attack_metrics = report['1']  # Class 1 = attack
    return {
        'model': model_name,
        'threshold': thresh,
        'accuracy': report['accuracy'],
        'attack_precision': attack_metrics['precision'],
        'attack_recall': attack_metrics['recall'], 
        'attack_f1': attack_metrics['f1-score'],
        'auc': auc
    }

# Test multiple thresholds on validation set
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
val_results = []

for thresh in thresholds:
    val_results.append(eval_threshold(y_val, y_val_prob_lr, thresh, "LogisticRegression"))
    val_results.append(eval_threshold(y_val, y_val_prob_rf, thresh, "RandomForest"))

val_df = pd.DataFrame(val_results)
print("Validation Results:")
print(val_df.round(3))

Validation Results:
                model  threshold  accuracy  attack_precision  attack_recall  \
0  LogisticRegression        0.1     0.874             0.499          0.922   
1        RandomForest        0.1     0.991             0.936          0.999   
2  LogisticRegression        0.2     0.934             0.679          0.908   
3        RandomForest        0.2     0.996             0.968          0.997   
4  LogisticRegression        0.3     0.959             0.872          0.790   
5        RandomForest        0.3     0.997             0.981          0.996   
6  LogisticRegression        0.4     0.960             0.934          0.732   
7        RandomForest        0.4     0.998             0.988          0.994   
8  LogisticRegression        0.5     0.952             0.960          0.647   
9        RandomForest        0.5     0.998             0.992          0.992   

   attack_f1    auc  
0      0.648  0.965  
1      0.966  1.000  
2      0.777  0.965  
3      0.983  1.000  


In [11]:
# Select best thresholds based on F1 score
best_lr_thresh = val_df[val_df['model'] == 'LogisticRegression'].sort_values('attack_f1', ascending=False).iloc[0]
best_rf_thresh = val_df[val_df['model'] == 'RandomForest'].sort_values('attack_f1', ascending=False).iloc[0]

print("Best thresholds (by validation F1):")
print(f"Logistic Regression: {best_lr_thresh['threshold']:.1f} (F1: {best_lr_thresh['attack_f1']:.3f})")
print(f"Random Forest: {best_rf_thresh['threshold']:.1f} (F1: {best_rf_thresh['attack_f1']:.3f})")

Best thresholds (by validation F1):
Logistic Regression: 0.3 (F1: 0.829)
Random Forest: 0.5 (F1: 0.992)
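
The five-point grid above is coarse. A finer search can evaluate every distinct predicted probability as a candidate threshold with `precision_recall_curve`, still on validation data only. This is a sketch of that alternative, not the procedure used in this run; `best_f1_threshold` is a hypothetical helper and the toy arrays are made up.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Return (threshold, F1) maximizing F1 of the positive class."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # precision/recall carry one extra trailing point with no threshold; drop it
    p, r = precision[:-1], recall[:-1]
    f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against 0/0
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])

# Toy validation labels and scores
y_true_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])
y_prob_toy = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])
thr, f1 = best_f1_threshold(y_true_toy, y_prob_toy)
print(f"threshold={thr:.2f}, F1={f1:.3f}")
```

In the notebook this would be called as `best_f1_threshold(y_val, y_val_prob_lr)`. The returned threshold is applied with `>=`, matching `eval_threshold`.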


## 7. Final Test Evaluation (Once Only)

**This is the only time we touch the test set.**

In [12]:
# Final test predictions with optimal thresholds
y_test_prob_lr = lr.predict_proba(X_test_processed)[:, 1]
y_test_prob_rf = rf.predict_proba(X_test_processed)[:, 1]

# Apply best thresholds
final_results = [
    eval_threshold(y_test, y_test_prob_lr, best_lr_thresh['threshold'], "LogisticRegression"),
    eval_threshold(y_test, y_test_prob_rf, best_rf_thresh['threshold'], "RandomForest")
]

final_df = pd.DataFrame(final_results)
print("\n=== FINAL TEST RESULTS ===")
print(final_df.round(3))

# Detailed classification reports
print("\nLogistic Regression (Test Set):")
y_test_pred_lr = (y_test_prob_lr >= best_lr_thresh['threshold']).astype(int)
print(classification_report(y_test, y_test_pred_lr))

print("\nRandom Forest (Test Set):")
y_test_pred_rf = (y_test_prob_rf >= best_rf_thresh['threshold']).astype(int)
print(classification_report(y_test, y_test_pred_rf))


=== FINAL TEST RESULTS ===
                model  threshold  accuracy  attack_precision  attack_recall  \
0  LogisticRegression        0.3     0.705             0.931          0.307   
1        RandomForest        0.5     0.692             0.997          0.252   

   attack_f1    auc  
0      0.462  0.849  
1      0.402  0.816  

Logistic Regression (Test Set):
              precision    recall  f1-score   support

           0       0.67      0.98      0.80     41269
           1       0.93      0.31      0.46     28922

    accuracy                           0.70     70191
   macro avg       0.80      0.65      0.63     70191
weighted avg       0.78      0.70      0.66     70191


Random Forest (Test Set):
              precision    recall  f1-score   support

           0       0.66      1.00      0.79     41269
           1       1.00      0.25      0.40     28922

    accuracy                           0.69     70191
   macro avg       0.83      0.63      0.60     70191
weighted 
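
Attack recall on the test set is far below the validation figures. A per-label breakdown can show which attack types are being missed. The sketch below assumes the original `Label` strings are still available alongside the predictions (they are kept in `df_test`); `recall_by_label` is a hypothetical helper and the toy labels are illustrative only.

```python
import pandas as pd

def recall_by_label(labels, y_pred, benign_label="BENIGN"):
    """Share of flows flagged as attack, per original attack label."""
    frame = pd.DataFrame({"label": list(labels), "pred": list(y_pred)})
    attacks = frame[frame["label"] != benign_label]
    # Every remaining row is a true attack, so the mean prediction is the recall
    return attacks.groupby("label")["pred"].agg(recall="mean", support="size")

# Toy labels and binary predictions
toy_labels = ["BENIGN", "DDoS", "DDoS", "PortScan", "PortScan", "PortScan", "Bot"]
toy_pred = [0, 1, 1, 0, 1, 0, 0]
breakdown = recall_by_label(toy_labels, toy_pred)
print(breakdown)
```

With the notebook's variables, the call would be `recall_by_label(df_test["Label"], y_test_pred_rf)`.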

## 8. Save Results & Pipeline

Export results and the preprocessing pipeline for federated learning.

In [13]:
# Save final results
results_path = PROJECT_ROOT / "data" / "processed" / "centralized_baseline_results.csv"
final_df.to_csv(results_path, index=False)
print(f"Results saved to: {results_path}")

# Save preprocessing pipeline for federated learning
import joblib
pipeline_path = PROJECT_ROOT / "models" / "preprocessing_pipeline.joblib"
pipeline_path.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(pipeline_pca, pipeline_path)
print(f"Preprocessing pipeline saved to: {pipeline_path}")

print("\n✅ LEAKAGE-SAFE BASELINE COMPLETE")
print("✅ Pipeline ready for federated learning")
print("✅ Realistic performance metrics achieved")

Results saved to: d:\Coding\VanetUAV\data\processed\centralized_baseline_results.csv
Preprocessing pipeline saved to: d:\Coding\VanetUAV\models\preprocessing_pipeline.joblib

✅ LEAKAGE-SAFE BASELINE COMPLETE
✅ Pipeline ready for federated learning
✅ Realistic performance metrics achieved
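
On the consuming side, a client would load the saved object and call `transform` only, never `fit`, so that every client shares the centrally fitted statistics. The pipeline also expects the same columns in the same order as at fit time, so it may help to store the feature list next to it. The sketch below illustrates this with synthetic data, a temporary directory, and hypothetical feature names; the bundle layout is an assumption, not what the notebook currently saves.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names
X_fit = pd.DataFrame(rng.normal(size=(100, 4)), columns=feature_names)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=42)),
]).fit(X_fit)

with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "preprocessing_bundle.joblib"
    # Store the column order together with the fitted pipeline
    joblib.dump({"pipeline": pipe, "feature_names": feature_names}, bundle_path)
    bundle = joblib.load(bundle_path)

# Client data: shuffled column order and one inf value
X_client = pd.DataFrame(rng.normal(size=(5, 4)), columns=feature_names)
X_client = X_client[["f2", "f0", "f3", "f1"]]
X_client.iloc[0, 0] = np.inf

# Restore the training column order, replace inf, then transform (no fitting)
X_client_clean = X_client[bundle["feature_names"]].replace([np.inf, -np.inf], np.nan)
Z_client = bundle["pipeline"].transform(X_client_clean)
print(Z_client.shape)
```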


---

## Summary

**Data Split:** 
- Training: Mon-Thu files → Train/Val split (80/20)
- Test: Friday files (untouched until final evaluation)

**Pipeline:**
- Imputation → StandardScaler → PCA(25) 
- Fitted on training data only
- Reusable for federated clients

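**Models:** 
- Logistic Regression + Random Forest
- Thresholds tuned on validation set (0.3 for Logistic Regression, 0.5 for Random Forest)
- Single final evaluation on the Friday test set: attack F1 falls from 0.829 / 0.992 on validation to 0.462 / 0.402 on test, with attack recall of 0.307 / 0.252
- The validation set is a random 20% of the Monday-Thursday flows, so it shares the training distribution; Friday differs in attack rate (0.412 vs 0.126) and contains the PortScan and DDoS captures, which are not among the training files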

**Next Steps:** Use the saved pipeline in federated learning experiments with the same preprocessing applied consistently across all clients.