## EDA Notebook On Flooding Attacks in CAN networks (Assignment 2)


### Importing necessary libraries and loading the dataset
This cell imports the `pandas` library and loads the dataset from a file located at `../data/dataset/Spark/Flooding_dataset_Spark.txt`. It also displays the shape, column names, and the first few rows of the dataset.


In [None]:
import pandas as pd

df = pd.read_csv("../data/dataset/Spark/Flooding_dataset_Spark.txt")

print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

### Displaying dataset information
This cell provides an overview of the dataset, including the data types of each column and the number of non-null values.

In [None]:
df.info()

### Generating descriptive statistics
This cell calculates and displays summary statistics for the numerical columns in the dataset, such as mean, standard deviation, minimum, and maximum values.

In [None]:
df.describe()

### Checking for missing values
This cell calculates the number of missing values in each column of the dataset.

In [None]:
df.isnull().sum()

### Some Basic Data Cleaning
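This cell builds a single `flag` column. The `R`/`T` marker is scattered across different columns from row to row, so the cell reads the last non-null value of each row, encodes `T` as 1 and anything else as 0, and overwrites the original marker cell with `"No data"`. It then drops the leftover `R` and `04C1` columns and fills the remaining nulls with `"No data"`.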

In [None]:
df["flag"] = df.apply(lambda row: row[row.last_valid_index()] if row.last_valid_index() is not None else "No data", axis=1)
df["flag"] = df["flag"].apply(lambda x: 1 if x == "T" else 0)
cols = [c for c in df.columns if c != "flag"]
df[cols] = df[cols].apply(
    lambda row: row.mask(row.index == row.last_valid_index(), "No data"),
    axis=1
)

df.drop(columns=["R"], inplace=True, errors="ignore")
df.drop(columns=["04C1"], inplace=True, errors="ignore")

df.fillna("No data", inplace=True)

df.head()
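
The flag extraction above relies on `Series.last_valid_index()`, which returns the label of the last non-null entry in a row. The sketch below illustrates the idea on a hypothetical three-row frame (the column names and values are invented for illustration and are not taken from the dataset). It also compares the row-wise lookup with a forward fill along the columns, which may be faster than `apply` on large captures.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: shorter payloads leave the trailing columns empty,
# so the R/T marker lands in a different column on each row.
toy = pd.DataFrame(
    [
        ["0.001", "2", "00", "11", "R"],
        ["0.002", "1", "00", "T", np.nan],
        ["0.003", "0", "T", np.nan, np.nan],
    ],
    columns=["ts", "dlc", "c0", "c1", "c2"],
)

# Row-wise: look up the last non-null cell of each row.
last_by_apply = toy.apply(lambda row: row[row.last_valid_index()], axis=1)

# Column-wise: forward-fill across each row, then read the final column.
last_by_ffill = toy.ffill(axis=1).iloc[:, -1]

# Encode the marker as a binary label: T -> 1, anything else -> 0.
flag = last_by_apply.eq("T").astype(int)
print(flag.tolist())
```

Both approaches should agree on this toy input; whether the forward fill is actually faster on the real file would need to be measured.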

### More Data Cleaning
This cell renames the columns to descriptive names: `Timestamp`, `DLC`, and `Data[0]` through `Data[7]`.

In [None]:
df.rename(columns={
    "1513920093.615172": "Timestamp",
    "8": "DLC",
    "00": "Data[0]",
    "CC": "Data[1]",
    "80": "Data[2]",
    "5E": "Data[3]",
    "52": "Data[4]",
    "08": "Data[5]",
    "00.1": "Data[6]",
    "00.2": "Data[7]"
}, inplace=True)
df.head()
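
The keys in the rename above (`1513920093.615172`, `8`, `00`, ...) look like values from a CAN frame rather than column labels, which suggests the file may have no header row and that `read_csv` consumed the first frame as the header. The sketch below shows that default behavior on a hypothetical two-line capture and one possible alternative using `header=None` with explicit names. The sample text and the shortened column list are assumptions for illustration; adopting this in the notebook would also require adjusting the cleaning and rename cells.

```python
import io
import pandas as pd

# Hypothetical two-frame capture with no header row.
raw = "1.000,8,00,CC\n1.001,8,01,CD\n"

# Default: the first line becomes the column labels, so one frame is lost.
default = pd.read_csv(io.StringIO(raw))

# Alternative: declare that there is no header and supply names up front.
# Reading the data bytes as strings keeps leading zeros such as "00".
named = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["Timestamp", "DLC", "Data[0]", "Data[1]"],
    dtype={"Data[0]": str, "Data[1]": str},
)

print(default.shape, list(default.columns))
print(named.shape, list(named.columns))
```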

### Visualizing the distribution of DLC
This cell uses `matplotlib` and `seaborn` to create a histogram with a kernel density estimate (KDE) for the values in column `DLC`. The plot is saved as `distribution.png`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))
sns.histplot(df["DLC"], bins=50, kde=True, ax=ax)
ax.set_title("Distribution of DLC")
ax.set_xlabel("Value")
ax.set_ylabel("Count")
fig.savefig("../figures/distribution.png", dpi=150, bbox_inches="tight")
plt.show()

### Generating a correlation heatmap
This cell calculates the correlation matrix for numeric columns in the dataset and visualizes it using a heatmap. The plot is saved as `correlation_heatmap.png`.


In [None]:
fig, ax = plt.subplots(figsize=(10, 8))
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
fig.savefig("../figures/correlation_heatmap.png", dpi=150, bbox_inches="tight")
plt.show()

### Creating a new categorical column and visualizing its distribution
The `flag` column created during cleaning serves as the class label. This cell visualizes its distribution with a count plot, which is saved as `class_distribution.png`.


In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

sns.countplot(data=df, x="flag", ax=ax)

ax.set_xticks([0, 1])  # fix tick positions before relabeling, as in the violin plot cell
ax.set_xticklabels(["Normal (R)", "Flooding (T)"])
ax.set_title("Message Count: Normal vs Flooding")
ax.set_xlabel("CAN ID Category")
ax.set_ylabel("Number of Messages")

fig.savefig("../figures/class_distribution.png", dpi=150, bbox_inches="tight")
plt.show()

### Visualizing data size distribution for different categories
This cell creates a violin plot to visualize the distribution of data sizes (`DLC`) for the two categories in the `flag` column. The plot is saved as `violin_plot.png`.

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

sns.violinplot(data=df, x="flag", y="DLC", ax=ax)

ax.set_xticks([0, 1])
ax.set_xticklabels(["Normal", "Flooding (0x000)"])
ax.set_title("Data Size Distribution: Flooding vs Normal Messages")
ax.set_xlabel("CAN ID Category")
ax.set_ylabel("Data Size")

fig.savefig("../figures/violin_plot.png", dpi=150, bbox_inches="tight")
plt.show()

### Extracting A More Specific Time Feature
This cell groups timestamps into 100 ms windows and plots the number of messages in each window, showing how message flow changes over time. The plot is saved as `message_over_time.png`.

In [None]:
# Extract message over time graph
df["time_bucket"] = df["Timestamp"].apply(lambda x: int(x * 10) / 10)

message_counts = df.groupby("time_bucket").size().reset_index(name="message_count")

fig, ax = plt.subplots(figsize=(10, 8))

sns.lineplot(data=message_counts, x="time_bucket", y="message_count", ax=ax)

ax.set_title("Message Count Over Time (100ms Windows)")
ax.set_xlabel("Time Window (100ms)")
ax.set_ylabel("Number of Messages")

fig.savefig("../figures/message_over_time.png", dpi=150, bbox_inches="tight")
plt.show()
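
As a small illustration of the bucketing step, the sketch below assigns hypothetical timestamps (in seconds) to integer 100 ms window indices and counts the messages per window. Using an integer index instead of a float key is a design choice made here, not something the notebook does. The reindex step shows one way to make empty windows appear as zero counts, which a `groupby(...).size()` would otherwise leave out of the line plot.

```python
import pandas as pd

# Hypothetical timestamps in seconds (not taken from the dataset).
ts = pd.Series([0.01, 0.05, 0.09, 0.10, 0.19, 0.31])

# Integer index of the 100 ms window each message falls into.
bucket = (ts * 10).astype(int)

# Messages per window; windows with no traffic are simply missing here.
counts = bucket.value_counts().sort_index()

# Fill in the missing windows with explicit zeros.
full = counts.reindex(range(bucket.min(), bucket.max() + 1), fill_value=0)
print(full.tolist())
```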


### Looking At High Traffic
This cell labels each 100 ms window as high traffic (more than 300 messages) or not, then merges that label back onto every frame. The goal is to check whether messages in high-traffic windows are more likely to be flooding messages.

In [None]:
high_traffic = df.groupby("time_bucket").size().reset_index(name="message_count")
high_traffic["high_traffic"] = (high_traffic["message_count"] > 300).astype(int)
print(high_traffic.head())

df = df.merge(high_traffic[["time_bucket", "high_traffic"]], on="time_bucket", how="left")
df.head()
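
The group-then-merge pattern above can also be written with `groupby(...).transform("size")`, which broadcasts each window's message count back to its rows without a separate merge. The sketch below is a minimal illustration on hypothetical data; the threshold of 2 is chosen only to suit the toy frame, whereas the notebook uses 300.

```python
import pandas as pd

# Hypothetical frames already assigned to 100 ms windows.
frames = pd.DataFrame({"time_bucket": [0.0, 0.0, 0.0, 0.1, 0.2, 0.2]})

threshold = 2  # illustrative only; the notebook uses 300

# Per-row count of messages sharing the same window.
frames["message_count"] = (
    frames.groupby("time_bucket")["time_bucket"].transform("size")
)

# 1 if the row's window exceeds the threshold, else 0.
frames["high_traffic"] = (frames["message_count"] > threshold).astype(int)
print(frames)
```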

### Extracting new features from string data
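This cell counts the distinct byte values in each message's data columns (`Data[0]` through `Data[7]`), ignoring `"No data"` placeholders. The resulting `unique_count` feature is used to look for differences in payload variability between flooding and normal messages.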

In [203]:
data_cols = ["Data[0]", "Data[1]", "Data[2]", "Data[3]", "Data[4]", "Data[5]", 
             "Data[6]", "Data[7]"]

df["unique_count"] = df[data_cols].apply(lambda row: row[row != "No data"].nunique(), axis=1)
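
To make the `unique_count` feature concrete, here is a plain-Python sketch of the same per-row computation. The helper name `unique_byte_count` and the sample payloads are hypothetical and chosen only for illustration.

```python
def unique_byte_count(payload, missing="No data"):
    """Count distinct byte strings in a payload, skipping placeholders."""
    return len({byte for byte in payload if byte != missing})


# A payload made of one repeated byte has a single distinct value.
constant = ["00"] * 8

# A varied payload has more distinct values.
varied = ["00", "CC", "80", "5E", "52", "08", "00", "00"]

# Short payloads are padded with the placeholder, which is ignored.
short = ["1A", "1A", "No data", "No data"]

print(unique_byte_count(constant), unique_byte_count(varied), unique_byte_count(short))
```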


### Creating a parallel coordinates plot
This cell creates a parallel coordinates plot to visualize the relationships between the features `DLC`, `high_traffic`, `unique_count`, and `flag`. The plot is saved as `parallel_coordinates.png`.

In [None]:
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

plot_df = df[["unique_count", "DLC", "high_traffic", "flag"]].copy()
plot_df = plot_df.sample(1000, random_state=42)
plot_df["flag"] = plot_df["flag"].map({0: "Normal", 1: "Flooding"})

fig, ax = plt.subplots(figsize=(10, 6))

parallel_coordinates(plot_df, "flag",
                     color=["steelblue", "crimson"],
                     alpha=0.2,
                     ax=ax)

ax.set_title("Parallel Coordinates: Flooding vs Normal Messages")

fig.savefig("../figures/parallel_coordinates.png", dpi=150, bbox_inches="tight")
plt.show()

### Relooking at correlation
This cell rebuilds the correlation heatmap with the newly extracted features to assess their relevance for training.

In [None]:
#Relooking at correlation
corr_df = df[["DLC", "unique_count","high_traffic", "flag"]].copy()
fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(corr_df.corr(), annot=True, fmt=".2f", cmap="coolwarm", ax=ax)
ax.set_title("Feature Correlation Heatmap")
# Saved under a separate name so the earlier correlation heatmap is not overwritten
fig.savefig("../figures/correlation_heatmap_features.png", dpi=150, bbox_inches="tight")
plt.show()

### Splitting the dataset into training and testing sets
This cell splits the dataset into training and testing sets using `train_test_split` from `sklearn`. The target variable is `flag`.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop(columns=["flag"])
y = df["flag"]

X = X.select_dtypes(include="number")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
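
The `stratify=y` argument used above keeps the class ratio the same in both splits. The sketch below illustrates this on a hypothetical label vector with an 80/20 class balance; the ratio is invented and says nothing about the balance of the actual dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 normal (0) and 20 flooding (1) messages.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, both splits keep the 20% share of class 1.
print(y_tr.mean(), y_te.mean())
```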

### Some Basic Preprocessing

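This cell fills any remaining missing values with the column median and rescales every feature to the [0, 1] range with `MinMaxScaler`. Scaling mainly benefits the logistic regression model. The imputer and scaler are fitted on the training set only and then applied to the test set.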

In [207]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Fill NaNs
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

# Scale
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
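
Because the scaler above is fitted on the training set only, test values are rescaled with the training minimum and maximum, i.e. x' = (x - min_train) / (max_train - min_train). The sketch below uses hypothetical one-column arrays to show that this matches the manual formula, and that test values outside the training range can fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [20.0]])

# Fit on the training data only, then apply to the test data.
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)

# The same transformation written out by hand.
lo, hi = train.min(axis=0), train.max(axis=0)
manual = (test - lo) / (hi - lo)

print(test_scaled.ravel(), manual.ravel())
```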


### Training machine learning models
This cell trains two machine learning models: Logistic Regression and Random Forest Classifier, using the training data.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

### Evaluating model performance
This cell evaluates the performance of the trained models (Logistic Regression and Random Forest) on the test set. It prints a classification report and plots a confusion matrix for each model, and saves the confusion matrix plots as images.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

for name, model in [("Logistic Regression", lr), ("Random Forest", rf)]:
    y_pred = model.predict(X_test)
    print(f"\n{'='*40}")
    print(f"{name}")
    print(f"{'='*40}")
    print(classification_report(y_test, y_pred))
    
    fig, ax = plt.subplots(figsize=(6, 5))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", ax=ax)
    ax.set_title(f"Confusion Matrix - {name}")
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    fig.savefig(f"../figures/confusion_matrix_{name.lower().replace(' ', '_')}.png", 
                dpi=150, bbox_inches="tight")
    plt.show()
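
When reading the confusion matrices above, note that scikit-learn places true labels on the rows and predicted labels on the columns, so for a binary problem the flattened matrix is ordered TN, FP, FN, TP. The sketch below shows this on hypothetical labels and relates the cells to the precision and recall figures that appear in the classification report.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 0 = normal, 1 = flooding.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision and recall for the flooding class, from the cells and from sklearn.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tn, fp, fn, tp)
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```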

### Check Model Sanity
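This cell fits a baseline that always predicts the most frequent class. If the trained models do not score clearly above this baseline, they are not learning meaningful patterns from the data.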

In [None]:
# Sanity check
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print(f"Dummy classifier score: {dummy.score(X_test, y_test):.2f}")
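
A most-frequent baseline can score high accuracy on imbalanced data while detecting no attacks at all, so it can help to compare recall on the flooding class as well as accuracy. The sketch below assumes a hypothetical 90/10 class split, which is not the actual balance of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import recall_score

# Hypothetical labels: 90 normal (0) and 10 flooding (1) messages.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are ignored by the dummy classifier

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# Accuracy equals the share of the majority class.
accuracy = baseline.score(X, y)

# Recall on the flooding class is zero: the baseline never predicts it.
flood_recall = recall_score(y, baseline.predict(X))

print(f"accuracy={accuracy:.2f}, flooding recall={flood_recall:.2f}")
```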