# Tutorial 1: Intrusion Detection on CAN (Vehicle Network) Traffic

## Abstract
Modern vehicles rely on the Controller Area Network (CAN) bus for communication among Electronic Control Units (ECUs). However, this protocol lacks authentication and encryption mechanisms, making it vulnerable to cyberattacks such as Denial-of-Service (DoS), spoofing, and fuzzy attacks.
This notebook presents a simple intrusion detection system (IDS) for CAN traffic. Instead of recorded vehicle data, we generate a synthetic dataset containing normal and attack messages and train a lightweight binary classifier (a Random Forest) to distinguish between benign and malicious CAN frames.

## Introduction
Autonomous and connected vehicles generate massive streams of network messages between critical subsystems (e.g., engine, brakes, sensors).
The CAN protocol, though efficient and real-time, was never designed with security in mind. Attackers can inject or flood malicious CAN messages to manipulate vehicle behavior or cause system malfunctions.

To counter such threats, intrusion detection systems (IDS) are deployed to monitor network traffic and detect anomalies.

In this case study, we:

  - Generate a synthetic CAN dataset with labeled attack messages.
  - Preprocess the data to extract numerical features.
  - Train a Random Forest classifier to detect intrusions.
  - Evaluate performance using precision, recall, F1-score, and ROC-AUC.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score

## Synthetic CAN Dataset Generation and Feature Extraction

### Function 1: `generate_synthetic_can_dataset()`
This function generates a **synthetic dataset** that simulates normal and attack messages transmitted over a vehicle's Controller Area Network (CAN). It produces random message attributes commonly found in real CAN traffic, such as CAN IDs, DLC (Data Length Code), and eight data bytes.

**Steps:**
1. **Random CAN ID Generation:**  
   - CAN IDs are sampled uniformly from `0x100` up to, but not including, `0x7FF` (Python's `range()` excludes the upper bound) and stored as hexadecimal strings.

2. **Payload and DLC Generation:**  
   - DLC (message length) is randomly assigned between 1 and 8 bytes.  
   - Each message contains 8 random data bytes (`DATA0`–`DATA7`) with values from 0–255, regardless of its DLC.

3. **Attack Labeling:**  
   - A fraction of the dataset (`attack_ratio`) is marked as attack messages (`Flag = 'T'`), while the rest are normal (`Flag = 'R'`).
   - Attack messages are simulated by replacing all of their data bytes with high values (200–255), mimicking abnormal or corrupted payloads.

4. **Output:**  
   Returns a `pandas.DataFrame` with columns `CAN_ID` (hexadecimal string), `DLC`, `Flag`, and `DATA0`–`DATA7`.
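
The generator fills all eight data columns independently of `DLC`, whereas on a real bus only the first `DLC` bytes carry payload. The following is a minimal sketch of how unused bytes could be masked. The helper name `mask_unused_bytes` and the choice of zero as padding value are assumptions for illustration and are not part of the notebook.

```python
import numpy as np

def mask_unused_bytes(data_bytes, dlc, pad_value=0):
    """Hypothetical helper: overwrite bytes at positions >= DLC with pad_value."""
    data_bytes = np.asarray(data_bytes)
    dlc = np.asarray(dlc)
    # positions has shape (1, 8) and dlc[:, None] has shape (n, 1),
    # so the comparison broadcasts to an (n, 8) boolean mask.
    positions = np.arange(data_bytes.shape[1])[None, :]
    used = positions < dlc[:, None]
    return np.where(used, data_bytes, pad_value)

payload = np.array([[10, 20, 30, 40, 50, 60, 70, 80],
                    [1, 2, 3, 4, 5, 6, 7, 8]])
masked = mask_unused_bytes(payload, dlc=[3, 8])
print(masked)
```

In this example the first frame keeps only its first three bytes, while the second frame (DLC of 8) is left unchanged.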


In [2]:
def generate_synthetic_can_dataset(n_samples=2000, attack_ratio=0.2, random_state=42):
    """
    Generate a synthetic CAN dataset with normal and attack messages.
    Simulates CAN_IDs, DLC, and DATA bytes (DATA0...DATA7).
    """
    np.random.seed(random_state)

    # Generate random CAN IDs as integers in [0x100, 0x7FF); range() excludes the upper bound
    can_ids = np.random.choice(range(0x100, 0x7FF), size=n_samples)
    dlc = np.random.randint(1, 9, size=n_samples)  # DLC between 1 and 8 bytes

    # Generate 8 random data bytes (0-255) per message, independent of DLC
    data_bytes = np.random.randint(0, 256, size=(n_samples, 8))

    # Generate labels (0 = normal, 1 = attack)
    y = np.zeros(n_samples, dtype=int)
    attack_indices = np.random.choice(n_samples, int(n_samples * attack_ratio), replace=False)
    y[attack_indices] = 1

    # Inject attack pattern: overwrite all 8 data bytes with high values (200-255)
    data_bytes[attack_indices] = np.random.randint(200, 256, size=(len(attack_indices), 8))

    df = pd.DataFrame({
        "CAN_ID": [hex(cid) for cid in can_ids],
        "DLC": dlc,
        "Flag": ["T" if label == 1 else "R" for label in y]
    })

    for i in range(8):
        df[f"DATA{i}"] = data_bytes[:, i]
    return df

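def featurize_can(df):
    """Convert CAN messages to numerical features.

    Returns (X, y, feat_names). Note: adds a `can_id_int` column to `df` in place.
    """
    # Keep only the low byte of the CAN ID, so distinct IDs can share a feature value
    df["can_id_int"] = df["CAN_ID"].apply(lambda x: int(x, 16) % 256)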
    features = [df["can_id_int"].values, df["DLC"].values]
    feat_names = ["can_id_int", "DLC"]

    for i in range(8):
        col = f"DATA{i}"
        features.append(df[col].values)
        feat_names.append(col)

    X = np.stack(features, axis=1)
    y = (df["Flag"] == "T").astype(int).values
    return X, y, feat_names
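
Because `featurize_can()` applies `% 256`, only the low byte of each identifier reaches the model, and different CAN IDs can collapse onto the same feature value. A small illustration follows; the example IDs are arbitrary and chosen only to show a collision.

```python
ids = ["0x166", "0x566", "0x6b3"]

# Full integer value of each hexadecimal CAN ID string
full_values = [int(x, 16) for x in ids]

# Value after the same reduction used in featurize_can()
low_bytes = [int(x, 16) % 256 for x in ids]

print(full_values)
print(low_bytes)
```

Here `0x166` and `0x566` differ as identifiers but share the low byte `0x66`. If the identity of the sender matters for detection, passing the full integer value would be an alternative worth considering.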



# Theoretical Background
## Intrusion Detection in Vehicle CAN Networks
The Controller Area Network (CAN) is the communication backbone of modern vehicles, allowing electronic control units (ECUs) to exchange messages efficiently. However, since CAN lacks built-in authentication or encryption, it is vulnerable to malicious message injection, spoofing, or denial-of-service (DoS) attacks. Machine learning techniques can be used to detect such anomalies by learning the statistical patterns of normal and abnormal CAN traffic.  

A typical machine learning–based intrusion detection system (IDS) for CAN traffic involves three stages:  
1. **Data Collection:** CAN messages are collected, either from real vehicles or generated synthetically, representing both normal and attack conditions.  
2. **Feature Extraction (Featurization):** Raw CAN frames are converted into numerical features (e.g., message frequency, ID entropy, payload variance) suitable for training classifiers.  
3. **Classification:** A supervised model (e.g., Random Forest, SVM, Neural Network) is trained to distinguish between normal and malicious traffic.  
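
The window-level features named in stage 2 are not computed by this notebook, which works on single frames. As a rough sketch of what such features could look like, the code below computes ID entropy and payload variance over fixed-size windows of consecutive frames. The helper `window_features` and the window size are assumptions for illustration. Message frequency is omitted because the synthetic frames carry no timestamps.

```python
import numpy as np
import pandas as pd

def window_features(df, window=4):
    """Hypothetical helper: per-window ID entropy, unique-ID count, payload variance."""
    data_cols = [c for c in df.columns if c.startswith("DATA")]
    rows = []
    # Step through non-overlapping windows of `window` consecutive frames
    for start in range(0, len(df) - window + 1, window):
        chunk = df.iloc[start:start + window]
        # Shannon entropy (in bits) of the CAN ID distribution in this window
        p = chunk["CAN_ID"].value_counts(normalize=True).to_numpy()
        entropy = float((p * np.log2(1.0 / p)).sum())
        rows.append({
            "start": start,
            "id_entropy": entropy,
            "n_unique_ids": int(chunk["CAN_ID"].nunique()),
            "payload_var": float(chunk[data_cols].to_numpy().var()),
        })
    return pd.DataFrame(rows)

frames = pd.DataFrame({
    "CAN_ID": ["0x100", "0x100", "0x200", "0x200",
               "0x300", "0x300", "0x300", "0x300"],
    "DATA0": [1, 2, 3, 4, 5, 5, 5, 5],
    "DATA1": [1, 2, 3, 4, 5, 5, 5, 5],
})
feats = window_features(frames, window=4)
print(feats)
```

The second window contains a single repeated ID with a constant payload, so both its entropy and its payload variance are zero. A flooding attack with one identifier would tend to produce this kind of pattern.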

---

## Problem Statement  
The goal of this code is to develop and evaluate a simple machine learning pipeline for **CAN intrusion detection**. Specifically, it seeks to classify CAN messages as either *normal* or *attack* using features derived from CAN traffic data.  

---

## Methodology  
1. **Synthetic Dataset Generation:**  
   The code begins by generating a synthetic CAN dataset (`generate_synthetic_can_dataset()`), simulating both normal and anomalous vehicle communication data.  

2. **Feature Engineering:**  
   The dataset is processed using a `featurize_can()` function to convert raw CAN frames into numerical vectors (`X`) and corresponding labels (`y`). Each feature vector holds ten per-frame values: the low byte of the CAN ID, the DLC, and the eight data bytes. No statistical or temporal features are computed.  

3. **Data Splitting:**  
   The data is split into training (70%) and testing (30%) sets using `train_test_split()` with `stratify=y`, so both sets keep the same ratio of normal to attack messages.  

4. **Model Training:**  
   A **Random Forest Classifier** with 50 trees and a maximum depth of 10 is trained on the training set. This ensemble model captures complex nonlinear patterns and is robust against overfitting for moderately sized datasets.  

5. **Model Evaluation:**  
   Predictions are generated on the test set, and performance is assessed using a **classification report** (precision, recall, F1-score) and the **ROC-AUC score**, which measures the model's ability to distinguish between normal and attack traffic.  
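
To make the effect of stratified splitting in step 3 concrete, the sketch below splits a toy label vector with the same 20% attack share and the same `test_size` as the notebook. The dummy feature matrix is a placeholder and carries no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 80 + [1] * 20)        # 20% attack, mirroring attack_ratio=0.2
X = np.arange(len(y)).reshape(-1, 1)     # dummy single-feature matrix

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# With stratify=y, both subsets keep the original class ratio
print("train attack share:", y_tr.mean())
print("test attack share:", y_te.mean())
```

Without `stratify`, the attack share in a small test set can drift away from 20%, which makes precision and recall harder to compare across runs.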

---

## Computational Challenge  
While Random Forests are efficient for medium-scale data, their effectiveness depends heavily on the quality and diversity of features extracted from CAN traffic. For real-world deployment in vehicles, computational efficiency, low false alarm rate, and adaptability to new attack types remain significant challenges.  
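
Two of these concerns can be measured directly. The sketch below computes a false alarm rate (false positives divided by all benign frames) from a confusion matrix and times prediction for a forest with the same hyperparameters as the notebook. The labels, predictions, and random training data are made up for illustration, and timings on a desktop CPU say little about an embedded ECU.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions: 8 benign frames, 2 attack frames
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
false_alarm_rate = fp / (fp + tn)   # share of benign frames flagged as attacks
print("false alarm rate:", false_alarm_rate)

# Rough latency estimate on random data with 10 features per frame
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(1000, 10))
y = rng.integers(0, 2, size=1000)
clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0).fit(X, y)

start = time.perf_counter()
clf.predict(X)
elapsed = time.perf_counter() - start
print(f"mean prediction time per frame: {elapsed / len(X) * 1e6:.1f} microseconds")
```

Batch prediction amortizes overhead across frames, so per-frame latency for one-at-a-time inference would likely be higher than this estimate.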

In [3]:
print("🚗 Generating synthetic CAN dataset...")
df = generate_synthetic_can_dataset()
print(df.head())

X, y, feat_names = featurize_can(df)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]
print("Classification Report:")
print(classification_report(y_test, y_pred, digits=4))
print("ROC AUC:", roc_auc_score(y_test, y_proba))

🚗 Generating synthetic CAN dataset...
  CAN_ID  DLC Flag  DATA0  DATA1  DATA2  DATA3  DATA4  DATA5  DATA6  DATA7
0  0x566    8    R    107    107    200    212    215    225    213    130
1  0x6b3    3    R     75    219     29    217    239     20    155    247
2  0x45c    8    T    235    211    219    222    224    229    228    230
3  0x60e    8    T    242    211    202    217    255    237    206    232
4  0x56a    4    R    127    149     16     68    107    138     93     14
Classification Report:
              precision    recall  f1-score   support

           0     1.0000    1.0000    1.0000       480
           1     1.0000    1.0000    1.0000       120

    accuracy                         1.0000       600
   macro avg     1.0000    1.0000    1.0000       600
weighted avg     1.0000    1.0000    1.0000       600

ROC AUC: 1.0
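
The perfect scores above reflect how the synthetic data is built rather than real-world detection performance. Benign bytes are drawn uniformly from 0–255, so the chance that all eight bytes of a benign frame fall into the attack range 200–255 is very small, and the two classes are almost perfectly separable. The following back-of-the-envelope check is a sketch. The threshold rule `min byte >= 200` is an illustration of that separability, not the trained model.

```python
import numpy as np

# Probability that one uniformly random byte falls in 200..255
p_single = 56 / 256
# Probability that all eight independent bytes do
p_all_eight = p_single ** 8
print(f"P(benign frame looks like an attack) = {p_all_eight:.2e}")

def looks_like_attack(frames):
    """Flag frames whose smallest data byte is at least 200."""
    return frames.min(axis=1) >= 200

rng = np.random.default_rng(0)
benign = rng.integers(0, 256, size=(1600, 8))    # same byte range as normal frames
attack = rng.integers(200, 256, size=(400, 8))   # same byte range as attack frames

print("benign frames flagged:", int(looks_like_attack(benign).sum()))
print("attack frames flagged:", int(looks_like_attack(attack).sum()))
```

A single hand-written rule already catches every synthetic attack frame, so these results should not be read as an estimate of accuracy on recorded CAN traffic.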
