# 02_baseline_cic.ipynb — Baseline XGBoost IDS on CICIDS2017

This notebook trains and evaluates a baseline intrusion detection system (IDS)
on the CICIDS2017 dataset using an XGBoost classifier.
It reproduces the functionality of `scripts/train_xgb.py` and records the key evaluation metrics.


In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, average_precision_score, accuracy_score,
    precision_score, recall_score, f1_score, confusion_matrix, roc_curve
)
import xgboost as xgb
import matplotlib.pyplot as plt
from pathlib import Path


In [2]:
# Path to processed dataset
data_path = "../data/processed/cicids2017_clean.parquet"
df = pd.read_parquet(data_path)

print("Shape:", df.shape)
print(df['Label'].value_counts())


Shape: (2830743, 79)
Label
0    2273097
1     557646
Name: count, dtype: int64
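
The counts above imply a class imbalance of roughly 4:1. As a minimal sketch, the snippet below quantifies it from the printed numbers, assuming `0` marks benign traffic and `1` marks attacks. The negative-to-positive ratio is a common starting point for XGBoost's `scale_pos_weight` parameter, which this baseline does not set; it is shown only for illustration.

```python
# Class counts copied from the value_counts() output above.
# Assumption: Label 0 = benign, Label 1 = attack.
n_benign = 2_273_097
n_attack = 557_646

total = n_benign + n_attack
attack_fraction = n_attack / total

# Ratio of negatives to positives; often used as an initial value for
# XGBoost's scale_pos_weight (not used by this baseline).
neg_pos_ratio = n_benign / n_attack

print(f"Total rows:      {total:,}")
print(f"Attack fraction: {attack_fraction:.4f}")
print(f"Benign/attack:   {neg_pos_ratio:.3f}")
```

Because about one row in five is an attack, accuracy alone is a weak indicator here, which is why the evaluation also reports precision, recall, F1, and PR-AUC.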


In [None]:
# Map +/-inf to NaN first, then fill every missing value with 0
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0)
print("Cleaned infinities and NaN values")
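
Before overwriting non-finite values, it can help to see how many there are and where they occur. The following is a small sketch on a toy frame; the column names and values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the flow features; names and values are illustrative.
toy = pd.DataFrame({
    "flow_bytes_per_s": [1.5, np.inf, 3.0, np.nan],
    "flow_packets_per_s": [10.0, 20.0, -np.inf, 40.0],
    "flow_duration": [100.0, 200.0, 300.0, 400.0],
})

# Count non-finite entries (inf, -inf, NaN) per column before cleaning.
non_finite = pd.Series(
    (~np.isfinite(toy.to_numpy())).sum(axis=0), index=toy.columns
)
print(non_finite[non_finite > 0])

# Same two-step cleaning as the cell above.
cleaned = toy.replace([np.inf, -np.inf], np.nan).fillna(0)
print(cleaned)
```

Running the same count on `df` before the cleaning cell would show which features are affected by the zero fill.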


In [None]:
label_col = "Label"
X = df.drop(columns=[label_col])
y = df[label_col].astype(int)

# Drop constant columns
nunique = X.nunique()
const_cols = list(nunique[nunique <= 1].index)
if const_cols:
    print("Dropping constant columns:", const_cols)
    X = X.drop(columns=const_cols)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
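
The `stratify=y` argument keeps the class proportions nearly identical in the train and test partitions. The sketch below checks this on synthetic data with the same split settings; `X_demo` and `y_demo` are hypothetical stand-ins, not the CICIDS2017 features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 10,000 rows, about 20% positives, 5 random features.
y_demo = (rng.random(10_000) < 0.2).astype(int)
X_demo = rng.normal(size=(10_000, 5))

# Same split settings as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"Positive rate, full:  {y_demo.mean():.4f}")
print(f"Positive rate, train: {y_tr.mean():.4f}")
print(f"Positive rate, test:  {y_te.mean():.4f}")
```

The same check on the real split amounts to comparing `y_train.mean()` with `y_test.mean()`.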


In [None]:
model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)

t0 = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - t0

print(f"Training done in {train_time:.2f} seconds")


In [None]:
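# Hard labels use the classifier's default 0.5 probability threshold
y_pred = model.predict(X_test)
# Probability of the positive class (Label == 1), used for the ranking metrics
y_score = model.predict_proba(X_test)[:, 1]

# Fix the label order so ravel() always yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()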
metrics = {
    "roc_auc": roc_auc_score(y_test, y_score),
    "pr_auc": average_precision_score(y_test, y_score),
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "train_time_s": train_time,
    "tn": tn, "fp": fp, "fn": fn, "tp": tp
}
metrics
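
To persist the metrics, note that the confusion-matrix counts are NumPy integers, which the standard `json` module cannot serialize directly. The sketch below shows one way to handle that. The helper `to_builtin`, the file name `baseline_metrics.json`, and the values in `demo_metrics` are hypothetical and only illustrate the pattern.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def to_builtin(value):
    """Convert NumPy scalars to plain Python numbers so json can serialize them."""
    return value.item() if isinstance(value, np.generic) else value


# Stand-in dict with the same kinds of values as `metrics`; the numbers are made up.
demo_metrics = {
    "roc_auc": np.float64(0.5),
    "train_time_s": 1.25,
    "tn": np.int64(3), "fp": np.int64(1), "fn": np.int64(2), "tp": np.int64(4),
}

# Hypothetical output location; a temp directory keeps the sketch side-effect free.
out_path = Path(tempfile.mkdtemp()) / "baseline_metrics.json"
serializable = {k: to_builtin(v) for k, v in demo_metrics.items()}
out_path.write_text(json.dumps(serializable, indent=2))

# Read the file back to confirm it round-trips.
reloaded = json.loads(out_path.read_text())
print(reloaded)
```

In the notebook, the same comprehension can be applied to `metrics` with an output path of your choosing.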


In [None]:
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.figure(figsize=(6,4))
plt.plot(fpr, tpr, label=f'ROC AUC={metrics["roc_auc"]:.4f}')
plt.plot([0,1], [0,1], '--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve — Baseline XGBoost (CICIDS2017)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()
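
The ROC curve can also be used to pick an operating threshold other than the default 0.5, for example one that caps the false positive rate. The sketch below does this on synthetic scores; the 1% target and the score distributions are assumptions chosen for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)

# Synthetic scores: positives tend to score higher than negatives.
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(500, dtype=int)])
scores = np.concatenate([rng.beta(2, 5, 2000), rng.beta(5, 2, 500)])

fpr_demo, tpr_demo, thresholds = roc_curve(y_true, scores)

# Illustrative alert budget: flag at most 1% of negative samples.
target_fpr = 0.01

# fpr_demo is non-decreasing, so take the last ROC point within the budget.
idx = np.searchsorted(fpr_demo, target_fpr, side="right") - 1
chosen_threshold = thresholds[idx]

print(f"Threshold: {chosen_threshold:.4f}")
print(f"FPR: {fpr_demo[idx]:.4f}, TPR: {tpr_demo[idx]:.4f}")
```

Applied to the real model, `y_true` and `scores` would be `y_test` and `y_score`, and predictions at the chosen operating point would be `y_score >= chosen_threshold`.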


In [None]:
fi = model.get_booster().get_score(importance_type="gain")
fi_items = sorted(fi.items(), key=lambda x: x[1], reverse=True)[:20]

if fi_items:
    names, gains = zip(*fi_items)
    plt.figure(figsize=(6,8))
    plt.barh(range(len(names)), list(reversed(gains)))
    plt.yticks(range(len(names)), list(reversed(names)))
    plt.title("Top 20 Feature Importances — XGBoost (Gain)")
    plt.tight_layout()
    plt.show()


### Summary

- Dataset: CICIDS2017, cleaned version (2,830,743 rows, of which 557,646 are labeled 1)
- Model: XGBoost (baseline)
- Performance on the stratified 20% hold-out test set:
  - ROC-AUC: ~0.9999
  - Accuracy: ~0.9989
  - Precision: ~0.997
  - Recall: ~0.996
  - F1: ~0.997

These results establish the baseline IDS performance on the CICIDS2017 dataset.
The next step is adversarial robustness testing with the Adversarial Robustness Toolbox (ART).
