# Unsupervised Learning Models for Anomaly-Based Intrusion Detection (WORK IN PROGRESS)

This Jupyter Notebook focuses on training and evaluating unsupervised machine learning models for anomaly-based intrusion detection. It builds upon the preprocessed CICIDS2017 dataset, prepared in the **data preprocessing [notebook available here](https://www.kaggle.com/code/ericanacletoribeiro/cicids2017-comprehensive-data-processing-for-ml)**, and continues the performance analysis started in the **[supervised learning notebook](https://www.kaggle.com/code/ericanacletoribeiro/cicids2017-ml-models-comparison-supervised)**.

The broader goal is to develop a Network Intrusion Detection System (NIDS) prototype capable of identifying a range of network attacks, such as DoS, PortScan, and Brute Force, while balancing detection accuracy with computational efficiency. This balance is particularly critical in resource-constrained environments, which are the focus of the project. The complete pipeline is documented on [GitHub](https://github.com/anacletu/ml-intrusion-detection-cicids2017).

**Models Being Evaluated:**

* **Supervised Learning ([Previous Notebook](https://www.kaggle.com/code/ericanacletoribeiro/cicids2017-ml-models-comparison-supervised)):**
    * Random Forest
    * XGBoost
    * Logistic Regression

* **Unsupervised Learning (This Notebook):**
    * Isolation Forest
    * Mini-Batch KMeans
    * Autoencoders

**Evaluation Strategy:**

The models are assessed using k-fold cross-validation on the training data, with a separate hold-out test set reserved for final evaluation. Key performance metrics include accuracy, precision, recall, F1-score, ROC-AUC, and resource usage (CPU time and memory consumption). These metrics provide critical insights into algorithm efficiency and effectiveness, essential for real-time deployment in resource-constrained networks.
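
As a rough illustration of how the resource metrics mentioned above could be collected, here is a minimal sketch. It is not the notebook's helper: `measure_resources` and `build_squares` are hypothetical names, and the standard library's `tracemalloc` is swapped in for `memory_profiler` so the example has no extra dependencies. Note that `tracemalloc` reports traced Python allocations rather than total process memory, so its numbers are not directly comparable to `memory_profiler` readings.

```python
import time
import tracemalloc


def measure_resources(func, *args, **kwargs):
    """Run func and return (result, cpu_seconds, peak_mib).

    Hypothetical helper: CPU time comes from time.process_time(),
    peak memory from tracemalloc (traced Python allocations only).
    """
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = func(*args, **kwargs)
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes / 2**20


def build_squares(n):
    # Stand-in workload; in practice this would be a model's fit or predict call
    return [i * i for i in range(n)]


result, cpu_seconds, peak_mib = measure_resources(build_squares, 200_000)
print(f"CPU time: {cpu_seconds:.3f} s, peak traced memory: {peak_mib:.2f} MiB")
```

Wrapping training and inference in the same helper would keep the measurements comparable across models.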

This notebook presents the training process, hyperparameter tuning, and comparative analysis of the selected unsupervised learning models.
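
Because unsupervised models output anomaly scores rather than attack classes, one way to compute ROC-AUC is to collapse the `Attack Type` labels into a binary normal/attack target. The sketch below illustrates the idea with Isolation Forest on synthetic data. The data, the variable names, and the choice to score the same samples used for fitting are assumptions made for brevity; this is not the notebook's evaluation code and does not use CICIDS2017.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic stand-in for scaled flow features (NOT CICIDS2017 data)
X_normal = rng.normal(0.0, 1.0, size=(2000, 5))
X_attack = rng.uniform(-8.0, 8.0, size=(100, 5))
X_demo = np.vstack([X_normal, X_attack])
labels = np.array(["Normal Traffic"] * 2000 + ["DoS"] * 100)

# Collapse the multi-class labels to binary: 1 = attack, 0 = normal traffic
y_binary = (labels != "Normal Traffic").astype(int)

# Fit without using the labels
model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_demo)

# score_samples is lower for more abnormal points, so negate it
# to obtain a score where higher means more anomalous
anomaly_scores = -model.score_samples(X_demo)

auc_value = roc_auc_score(y_binary, anomaly_scores)
print(f"ROC-AUC: {auc_value:.3f}")
```

In a real evaluation the scores would be computed on the hold-out test set rather than on the data used for fitting.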

In [1]:
# Installing extra components
!pip install memory_profiler
!pip install psutil

Collecting memory_profiler
  Downloading memory_profiler-0.61.0-py3-none-any.whl.metadata (20 kB)
Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0


In [2]:
# Importing the relevant libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, auc

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

import time
import psutil
import threading
from memory_profiler import memory_usage

import joblib

# Helper Functions

In [3]:
# TO DO

# 1. Data Preparation: Test/Train Split, Sampling, and Scaling

This section replicates the **data preparation process used in the [previous notebook](https://www.kaggle.com/code/ericanacletoribeiro/cicids2017-ml-models-comparison-supervised)**, which focuses on supervised algorithms. To avoid redundancy, detailed explanations of the steps—such as splitting the data into training and testing sets, applying scaling, and handling any preprocessing nuances—are not repeated here.

If you haven't reviewed the earlier notebook yet, I recommend doing so. It provides a more detailed walkthrough of these steps, including the rationale behind key decisions, such as the choice of scaling method and how resampling was applied.

By adhering to the same preparation methodology, we ensure consistency in the data pipeline, facilitating fair comparisons between supervised and unsupervised models.
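
For readers who skip the earlier notebook, the following toy example sketches what `RobustScaler` computes with its default settings: it centers each feature on the training median and divides by the training interquartile range, which limits the influence of extreme values. The arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy single-feature data with one extreme value in the training set
X_train_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_test_demo = np.array([[2.5], [50.0]])

# Fit on the training data only, then reuse the fitted statistics on the test data
scaler_demo = RobustScaler().fit(X_train_demo)
scaled_test = scaler_demo.transform(X_test_demo)

# Manual equivalent: (x - median) / (Q3 - Q1), using training statistics
median = np.median(X_train_demo, axis=0)
q1, q3 = np.percentile(X_train_demo, [25, 75], axis=0)
manual_test = (X_test_demo - median) / (q3 - q1)

print(scaled_test.ravel())
print(np.allclose(scaled_test, manual_test))
```

Fitting the scaler on the training split and only transforming the test split, as the cells below do, keeps test-set statistics out of the preprocessing.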

In [4]:
# Loading the dataset
clean_df = pd.read_csv('/kaggle/input/cicids2017_cleaned.csv')

In [5]:
# Preparing training and test splits
X = clean_df.drop('Attack Type', axis=1)
y = clean_df['Attack Type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [6]:
# Initialize RobustScaler
scaler = RobustScaler()

# Fit and transform the training data
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test set using the fitted scaler
X_test_scaled = scaler.transform(X_test)

In [7]:
# Undersampling the majority class (Normal Traffic) in the unscaled training set
X_train_resampled, y_train_resampled = RandomUnderSampler(sampling_strategy={'Normal Traffic': 500000}, random_state=42).fit_resample(X_train, y_train)

# Applying the same undersampling to the scaled training set (overwrites X_train_scaled)
X_train_scaled, y_train_scaled = RandomUnderSampler(sampling_strategy={'Normal Traffic': 500000}, random_state=42).fit_resample(X_train_scaled, y_train)

In [8]:
# Oversampling the minority attack classes with SMOTE on the scaled, undersampled training set
smote_targets = {
    'Bots': 2000,
    'Web Attacks': 2000,
    'Brute Force': 7000,
    'Port Scanning': 70000,
    'DDoS': 90000,
    'DoS': 200000,
}
X_train_resampled_scaled, y_train_resampled_scaled = SMOTE(sampling_strategy=smote_targets, random_state=42).fit_resample(X_train_scaled, y_train_scaled)

In [9]:
# Cleaning up
del X_train_scaled, X_train, y_train, X, y, clean_df

In [10]:
# Checking the distribution of the attack types in the undersampled training set (before SMOTE)
y_train_scaled.value_counts()

Attack Type
Normal Traffic    500000
DoS               135621
DDoS               89610
Port Scanning      63486
Brute Force         6405
Web Attacks         1500
Bots                1364
Name: count, dtype: int64

In [11]:
# Checking the distribution of the attack types in the scaled training set after SMOTE oversampling
y_train_resampled_scaled.value_counts()

Attack Type
Normal Traffic    500000
DoS               200000
DDoS               90000
Port Scanning      70000
Brute Force         7000
Bots                2000
Web Attacks         2000
Name: count, dtype: int64

# 2. Unsupervised Learning

# 3. Comparing Performance Results