# 01 — Scikit-learn MLP on a Cybersecurity Dataset

Purpose: apply scikit-learn's `MLPClassifier` to a real cybersecurity dataset (intrusion detection). You will load the data, preprocess numeric and categorical features, train an MLP, and evaluate it with accuracy, F1, and a confusion matrix. The problem is binary classification: normal vs attack.

## Learning goals
- Load a real dataset from a URL (UCI/Kaggle-style).
- Handle mixed feature types: use numeric columns and optionally encode categoricals.
- Use train/validation split and StandardScaler (fit on train only).
- Train scikit-learn MLPClassifier and interpret solver, hidden_layer_sizes, max_iter.
- Evaluate with accuracy, F1, and confusion matrix; recognize class imbalance.

## Prerequisites
- Basic Python, NumPy, pandas.
- Notions of classification, train/val split, and scaling.
- Optional: Kaggle account if you want to try Kaggle cybersecurity datasets (e.g. "Cybersecurity Intrusion Detection Dataset").

## Key ideas
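- Real cybersecurity data is often imbalanced. Production traffic is usually mostly normal with few attacks; in the KDD Cup 99 10% sample used here the imbalance runs the other way, with attacks making up roughly 80% of records, so normal traffic is the minority class.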
- Scaling inputs is important for MLP; always fit the scaler on training data only.
- An MLP is a simple but effective baseline for tabular problems of moderate complexity.
- Validation metrics (F1, recall on the minority class) matter more than accuracy when classes are imbalanced.
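
The last point can be illustrated with a small sketch. The labels below are invented for illustration (95 normal, 5 attack), and the "classifier" simply predicts the majority class every time. Accuracy looks good while F1 and recall on the attack class expose the failure.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Toy labels (made up for illustration): 95 "normal" (0) and 5 "attack" (1).
y_true = np.array([0] * 95 + [1] * 5)

# A useless classifier that always predicts the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, zero_division=0)    # F1 for the positive class (attack)
rec = recall_score(y_true, y_pred, zero_division=0)

print(f"accuracy={acc:.2f}  F1={f1:.2f}  recall={rec:.2f}")
# accuracy=0.95  F1=0.00  recall=0.00
```

No attack is ever detected, yet accuracy alone would suggest a strong model.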

## Minimal theory
- MLP: multi-layer perceptron (fully connected layers, activation, output).
- Scikit-learn `MLPClassifier`: uses the Adam solver by default (`solver="adam"`) and supports early stopping (`early_stopping=True`).
- For intrusion detection: we map all attack types to a single "attack" class (binary).
- KDD Cup 99 (UCI) is a classic benchmark; similar datasets exist on Kaggle (e.g. network intrusion, CICIDS).
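
To make "fully connected layers, activation, output" concrete, here is a minimal NumPy sketch of an MLP forward pass for binary classification. It is not scikit-learn's implementation: the layer sizes, random weights, and the helper name `mlp_forward` are assumptions chosen for illustration, and no training happens here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, weights, biases):
    """Forward pass: ReLU on hidden layers, sigmoid on the single output unit."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                      # fully connected layer + activation
    logits = a @ weights[-1] + biases[-1]        # output layer (one unit)
    return sigmoid(logits).ravel()               # probability of class 1 ("attack")

rng = np.random.default_rng(0)
sizes = [5, 8, 4, 1]  # 5 inputs, hidden layers of 8 and 4 units, 1 output (illustrative)
weights = [rng.normal(scale=0.5, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

X_demo = rng.normal(size=(3, 5))                 # 3 synthetic samples
proba = mlp_forward(X_demo, weights, biases)
labels = (proba >= 0.5).astype(int)              # threshold at 0.5 for a hard label
print(proba.shape, labels)
```

Training adjusts `weights` and `biases` by gradient descent on the log-loss; `MLPClassifier` handles that part.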

In [2]:
!pip install pandas



In [4]:
!pip install scikit-learn


In [None]:
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import gzip
import io

SEED = 42
np.random.seed(SEED)

### Dataset note

- **UCI KDD Cup 99** (used here): classic intrusion detection benchmark; downloaded and cached by scikit-learn's `fetch_kddcup99`, so the notebook runs without the Kaggle API.
- **Kaggle alternatives**: e.g. [Cybersecurity Intrusion Detection Dataset](https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset) or [Network Intrusion Detection](https://www.kaggle.com/datasets). Download the CSV and replace the `load_kdd` step with `pd.read_csv("path/to/file.csv")` and adjust column names and target column as needed.
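
A minimal sketch of that swap is shown below. The CSV content, the column names, and the `TARGET_COL` value are invented for illustration (they are not the schema of any particular Kaggle dataset); an in-memory string stands in for the downloaded file so the example runs as-is.

```python
import io
import numpy as np
import pandas as pd

# Stand-in for a downloaded file; in practice use pd.read_csv("path/to/file.csv").
# Column names and values below are invented for illustration.
csv_text = """session_id,packet_size,login_attempts,protocol,label
s1,512,1,TCP,normal
s2,64,9,UDP,attack
s3,1500,2,TCP,normal
s4,80,12,TCP,attack
"""
df = pd.read_csv(io.StringIO(csv_text))

TARGET_COL = "label"  # hypothetical; adjust to your dataset's target column

# Binary target: 0 = normal, 1 = anything else (attack).
y_csv = (df[TARGET_COL] != "normal").astype(np.int64).to_numpy()

# Keep numeric feature columns only, mirroring the simplification used in this notebook.
X_csv = (
    df.drop(columns=[TARGET_COL])
      .select_dtypes(include="number")
      .to_numpy(dtype=np.float64)
)

print(X_csv.shape, y_csv)
# (4, 2) [0 1 0 1]
```

From here, `X_csv` and `y_csv` can take the place of `X` and `y` in the cells that follow.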

## Load KDD Cup 99 (10% sample) from UCI

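We use the classic intrusion detection dataset, loaded with `fetch_kddcup99`, which downloads the 10% sample on first use and caches it locally. It returns 41 feature columns, ordered as in the KDD Cup 99 description, and a separate label array with byte-string values such as `b"normal."`, `b"smurf."`, and `b"neptune."`. We map everything that is not `normal.` to "attack".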

In [6]:
from sklearn.datasets import fetch_kddcup99

def load_kdd(max_rows=80_000):
    data = fetch_kddcup99(percent10=True, random_state=SEED)
    X = data.data
    y = data.target
    if max_rows and X.shape[0] > max_rows:
        rng = np.random.RandomState(SEED)
        idx = rng.choice(X.shape[0], size=max_rows, replace=False)
        X, y = X[idx], y[idx]
    return X, y

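X_raw, y_raw = load_kdd(max_rows=80_000)

# Labels are byte strings (b"normal."); fall back to str in case they were decoded.
normal_label = b"normal." if isinstance(y_raw[0], bytes) else "normal."
y = (y_raw != normal_label).astype(np.int64)

# Columns 1-3 (protocol_type, service, flag) are categorical and cannot be cast
# to float, so drop them and keep the 38 numeric columns.
CATEGORICAL_COLS = [1, 2, 3]
X = np.delete(X_raw, CATEGORICAL_COLS, axis=1).astype(np.float64)
print("Shape:", X.shape, y.shape)
print("Class distribution:", np.bincount(y))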


### Binary target and numeric features only

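We keep only the numeric columns, leaving out the categorical `protocol_type`, `service`, and `flag`, and binarize the label to `normal` (0) vs `attack` (1). This keeps the notebook simple and avoids heavy categorical encoding; an MLP typically still performs well on this numeric subset.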
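
If you later want to keep the categorical columns instead, one common option is to one-hot encode them while scaling the numeric ones. The sketch below uses a tiny invented DataFrame with KDD-style column names; it is not part of this notebook's pipeline, and the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented frames with KDD-style columns (values are made up).
train_df = pd.DataFrame({
    "duration": [0.0, 2.0, 5.0, 1.0],
    "src_bytes": [181.0, 239.0, 0.0, 1032.0],
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
})
val_df = pd.DataFrame({
    "duration": [3.0],
    "src_bytes": [50.0],
    "protocol_type": ["gre"],  # category never seen during fit
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["duration", "src_bytes"]),
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol_type"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train_enc = preprocess.fit_transform(train_df)  # fit on training rows only
X_val_enc = preprocess.transform(val_df)

print(X_train_enc.shape, X_val_enc.shape)
# (4, 5) (1, 5)
```

The five output columns are the two scaled numeric features plus one indicator per protocol seen in training (`icmp`, `tcp`, `udp`).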

In [None]:
# X and y are already defined in the load cell (numeric features only, binary target 0=normal, 1=attack)
print("Features shape:", X.shape)
print("Class distribution:", np.bincount(y))

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=SEED, stratify=y)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)

print("Train:", X_train_s.shape, y_train.shape)
print("Val:", X_val_s.shape, y_val.shape)
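
As a sanity check on the "fit on train only" rule, the sketch below repeats the same split-then-scale steps on synthetic data (random features and an invented 80/20 label split). After scaling, the training features have mean zero by construction, while the validation features only come close, because they reuse the training statistics. Stratification keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))  # synthetic features
y_demo = np.array([0] * 160 + [1] * 40)                   # synthetic 80/20 labels

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_s = scaler_demo.transform(X_tr)
X_va_s = scaler_demo.transform(X_va)       # reuses the training mean and std

print("train mean per feature:", X_tr_s.mean(axis=0).round(3))
print("val mean per feature:  ", X_va_s.mean(axis=0).round(3))
print("class-1 fraction train/val:", y_tr.mean(), y_va.mean())
```

Fitting the scaler on all rows before splitting would leak validation statistics into training and make validation metrics slightly optimistic.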

### Train MLPClassifier

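We use a small MLP with two hidden layers (64 and 32 units). With `early_stopping=True`, `MLPClassifier` sets aside `validation_fraction` of the training data internally and stops once the score on that held-out part has not improved for `n_iter_no_change` consecutive epochs; otherwise training runs until convergence or `max_iter`. For a quick run, `max_iter=100` is enough; increase it if training hits the limit before converging.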

In [None]:
mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    solver="adam",
    alpha=1e-4,
    max_iter=100,
    random_state=SEED,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=10,
)
mlp.fit(X_train_s, y_train)
print("Iterations used:", mlp.n_iter_)
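
To see what early stopping recorded, you can inspect the fitted model's `validation_scores_` and `best_validation_score_` attributes, which are populated when `early_stopping=True`. The sketch below trains a separate small model on synthetic data from `make_classification` so it runs on its own; the dataset and layer size are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem, used only to demonstrate the attributes.
X_syn, y_syn = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)
X_syn = StandardScaler().fit_transform(X_syn)

clf = MLPClassifier(
    hidden_layer_sizes=(16,),
    max_iter=200,
    early_stopping=True,      # hold out part of the training data internally
    validation_fraction=0.1,
    n_iter_no_change=10,
    random_state=0,
)
clf.fit(X_syn, y_syn)

print("epochs run:", clf.n_iter_)
print("best internal validation accuracy:", round(clf.best_validation_score_, 4))
print("validation scores recorded:", len(clf.validation_scores_))
```

If `n_iter_` equals `max_iter`, training hit the iteration cap rather than stopping early, which is a hint to raise `max_iter`.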

In [None]:
y_pred_train = mlp.predict(X_train_s)
y_pred_val = mlp.predict(X_val_s)

acc_train = accuracy_score(y_train, y_pred_train)
acc_val = accuracy_score(y_val, y_pred_val)
f1_train = f1_score(y_train, y_pred_train, zero_division=0)
f1_val = f1_score(y_val, y_pred_val, zero_division=0)

print("Train — Accuracy: {:.4f}, F1: {:.4f}".format(acc_train, f1_train))
print("Val   — Accuracy: {:.4f}, F1: {:.4f}".format(acc_val, f1_val))
print("\nConfusion matrix (val):")
print(confusion_matrix(y_val, y_pred_val))
print("\nClassification report (val):")
print(classification_report(y_val, y_pred_val, target_names=["normal", "attack"]))
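
To read the confusion matrix, it helps to unpack it into its four cells. In scikit-learn, rows are true classes and columns are predicted classes, so for labels ordered `[0, 1]` the flattened matrix is `TN, FP, FN, TP`. The labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented labels for illustration: 0 = normal, 1 = attack.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# labels=[0, 1] fixes the row/column order explicitly.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

precision = tp / (tp + fp)          # of flagged connections, how many were attacks
recall = tp / (tp + fn)             # of real attacks, how many were caught
false_alarm_rate = fp / (fp + tn)   # of normal connections, how many were flagged

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
# TN=3 FP=1 FN=2 TP=4
print(f"precision={precision:.2f} recall={recall:.2f} false_alarm_rate={false_alarm_rate:.2f}")
# precision=0.80 recall=0.67 false_alarm_rate=0.25
```

In intrusion detection, missed attacks (FN) and false alarms (FP) usually carry different costs, so it is worth looking at both rather than a single summary score.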

### Optional: loss curve

Plot the training loss per iteration to see convergence.

In [None]:
if hasattr(mlp, "loss_curve_") and mlp.loss_curve_ is not None:
    plt.figure(figsize=(6, 4))
    plt.plot(mlp.loss_curve_, color="C0")
    plt.xlabel("Iteration")
    plt.ylabel("Loss")
    plt.title("MLP training loss")
    plt.tight_layout()
    plt.show()