# LAB: Internet of Vehicles (IoV) network packet analysis
The CICIoV 2024 dataset as proposed by *E. C. P. Neto, H. Taslimasa, S. Dadkhah, S. Iqbal, P. Xiong, T. Rahmanb, and A. A. Ghorbani, "CICIoV2024: Advancing Realistic IDS Approaches against DoS and Spoofing Attack in IoV CAN bus," Internet of Things, 101209, 2024.* contains a variety of labeled network traffic data capturing both normal and malicious activites within IoV environments. 

![IoV Dataset 2024](https://www.unb.ca/cic/_assets/images/inline-iov-data-2024-1.jpg)

### Imports


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, HistGradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.svm import SVC
import xgboost as xgb
import warnings
warnings.filterwarnings("ignore")

### Load and concatenate dataset
- Load the decimal benign dataset
- Show the first ten rows, but transpose the output.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ID,65,1068,535,131,936,359,369,516,609,1071
DATA_0,96,132,127,15,1,0,16,192,0,125
DATA_1,0,13,255,224,0,128,108,0,0,4
DATA_2,0,160,127,0,39,0,0,125,9,0
DATA_3,0,0,255,0,16,0,0,0,0,2
DATA_4,0,0,127,0,0,0,0,0,0,113
DATA_5,0,0,255,0,0,1,0,0,0,0
DATA_6,0,0,127,0,0,227,0,0,0,0
DATA_7,0,0,255,0,0,0,0,0,0,0
label,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN


- What is the shape of the dataset?

(1223737, 12)

- Read in the decimal DoS dataset
- Show the first ten rows, but transpose the output.

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
ID,291,291,291,291,291,291,291,291,291,291
DATA_0,0,14,14,14,14,14,14,14,14,14
DATA_1,0,11,11,11,11,11,11,11,11,11
DATA_2,0,4,4,4,4,4,4,4,4,4
DATA_3,0,4,4,4,4,4,4,4,4,4
DATA_4,0,3,3,3,3,3,3,3,3,3
DATA_5,0,3,3,3,3,3,3,3,3,3
DATA_6,0,8,8,8,8,8,8,8,8,8
DATA_7,0,12,12,12,12,12,12,12,12,12
label,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK


- Read in the spoofing datasets
- Concatenate all datasets into one big DataFrame

- Show the last 10 rows, transpose the output

In [15]:
#df.shape
df.head(-10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19957,19958,19959,19960,19961,19962,19963,19964,19965,19966
ID,65,1068,535,131,936,359,369,516,609,1071,...,128,128,128,128,128,128,128,128,128,128
DATA_0,96,132,127,15,1,0,16,192,0,125,...,132,132,132,132,132,132,132,132,132,132
DATA_1,0,13,255,224,0,128,108,0,0,4,...,3,3,3,3,3,3,3,3,3,3
DATA_2,0,160,127,0,39,0,0,125,9,0,...,2,2,2,2,2,2,2,2,2,2
DATA_3,0,0,255,0,16,0,0,0,0,2,...,35,35,35,35,35,35,35,35,35,35
DATA_4,0,0,127,0,0,0,0,0,0,113,...,24,24,24,24,24,24,24,24,24,24
DATA_5,0,0,255,0,0,1,0,0,0,0,...,5,5,5,5,5,5,5,5,5,5
DATA_6,0,0,127,0,0,227,0,0,0,0,...,138,138,138,138,138,138,138,138,138,138
DATA_7,0,0,255,0,0,0,0,0,0,0,...,34,34,34,34,34,34,34,34,34,34
label,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,BENIGN,...,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK,ATTACK


### Two-class classification

#### Data Preprocessing and cleaning

- Drop specific_class and category
- Count the amount of values in the 'label' column
- Create a feature column list and a variable containing the class column
- Use a LabelEncoder to transform labels to numerical

label
BENIGN    244747
ATTACK     36896
Name: count, dtype: int64

array(['ATTACK', 'BENIGN'], dtype=object)

- Perform train-test-split, but make sure you also have a separate validation dataset

In [217]:
X_train.dtypes

DATA_0    int64
DATA_1    int64
DATA_2    int64
DATA_3    int64
DATA_4    int64
DATA_5    int64
DATA_6    int64
DATA_7    int64
dtype: object

In [219]:
y_train.unique()

array([1, 0])

#### ML models

Compare the performance of a variety of ML models on the selected features. (Similar to the Malware Detection lab) Track the accuracy, MSE and FP/FN ratio. What classifier scores best?
- SVM (use the Support Vector Classifier, SVC)
- XGBoost Classifier
- Logistic Regression
- AdaBoost Classifier
- K-NN Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- Histogram-Based Gradient Boosting Classifier

### Multiclass classification (Class)

#### Data Preprocessing and cleaning

- Drop label and category
- Create a feature column list and a variable containing the class column
- Use a LabelEncoder to transform labels to numerical

specific_class
BENIGN            244747
DoS                14933
RPM                10980
SPEED               4990
STEERING_WHEEL      3995
GAS                 1998
Name: count, dtype: int64

array(['BENIGN', 'DoS', 'GAS', 'RPM', 'SPEED', 'STEERING_WHEEL'],
      dtype=object)

- Perform train-test-split, but make sure you also have a separate validation dataset

In [244]:
y_train.unique()

array([0, 3, 5, 1, 2, 4])

#### ML models

Compare the performance of a variety of ML models on the selected features. (Similar to the Malware Detection lab) Track the accuracy, MSE and FP/FN ratio. What classifier scores best?
- XGBoost Classifier
- Logistic Regression
- AdaBoost Classifier
- K-NN Classifier
- Random Forest Classifier
- Gradient Boosting Classifier
- Histogram-Based Gradient Boosting Classifier

- Plot the Confusion Matrix with color_palette "coolwarm", make sure the class labels are shown on the X and Y axis
- Transform the numbers inside the matrix to a percentage