<a href="https://colab.research.google.com/github/akin-oladejo/can-anomaly-detection/blob/main/notebooks/Analysis%20of%20normal%20Renault%20CAN%20Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of the Renault CAN Data

## The Task
The Controller Area Network (CAN) is currently the most widely used in-vehicle networking protocol. It is a bi-directional, multi-master serial bus that uses differential signalling over twisted-pair cabling to stay reliable in electromagnetically noisy environments. Many devices in a modern vehicle communicate with each other over CAN. Some of these devices are connected to the internet, which exposes the bus to external attacks such as replay, spoofing and Denial of Service (DoS). This project applies detection methods drawn from several research articles to identify attacks on the CAN bus.  

The methods are:
1. Frequency-based detection
2. Anomaly detection using the latent representation of normal data (i.e. autoencoders)
3. Anomaly detection using vehicle state transition (i.e. hidden markov models)

4. *Bonus: Anomaly detection using a triplet-loss network*



In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder
import joblib

In [None]:
# config
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## The Dataset
The dataset used in this project is the **Automotive Controller Area Network (CAN) Bus Intrusion Dataset v2**. It can be found [here](https://figshare.com/articles/dataset/Automotive_Controller_Area_Network_CAN_Bus_Intrusion_Dataset/12696950). This dataset contains automotive Controller Area Network (CAN) bus data from three systems: two cars (Opel Astra and Renault Clio) and from a CAN bus prototype made by the authors of the dataset. Its purpose is to evaluate CAN bus Network Intrusion Detection Systems (NIDS). For each vehicle/system, there is a collection of log files captured from its CAN bus: normal (attack-free) data for training and testing detection algorithms, and different CAN bus attacks (Diagnostic, Fuzzing attacks, Replay attack, Suspension attack and Denial-of-Service attack).

For this project, I used data from the Renault Clio. The description of the different logs and attacks can be found in `../RenaultClio/README.md`

In [None]:
file_path = '/content/drive/MyDrive/datasets/RenaultClio'

train = pd.read_csv(f'{file_path}/training.log', delimiter=' ', header=None)
# test = pd.read_csv(f'{file_path}/testing.log', delimiter=' ', header=None)
# diag_atk = pd.read_csv(f'{file_path}/diagnostic.log', delimiter=' ', header=None)
# dos_atk = pd.read_csv(f'{file_path}/dosattack.log', delimiter=' ', header=None)
# fuzzing_canid_atk = pd.read_csv(f'{file_path}/fuzzing_canid.log', delimiter=' ', header=None)
# fuzzing_payload_atk = pd.read_csv(f'{file_path}/fuzzing_payload.log', delimiter=' ', header=None)
# replay_atk = pd.read_csv(f'{file_path}/replay.log', delimiter=' ', header=None)
# suspension_atk = pd.read_csv(f'{file_path}/suspension.log', delimiter=' ', header=None)

In [None]:
print(train.shape) # print the number of rows and columns
train.head() # print first 5 rows

(270596, 3)


Unnamed: 0,0,1,2
0,(1508687283.891357),slcan0,12E#C680027FD0FFFF00
1,(1508687283.891365),slcan0,090#1A000000
2,(1508687283.891368),slcan0,0C6#7512800A8008BAAC
3,(1508687283.891375),slcan0,242#0000FFEFFE000D
4,(1508687283.891377),slcan0,29C#00000000FFFFFFFF


In [None]:
train.isnull().sum() # count the missing values in each column

0    0
1    0
2    0
dtype: int64

Great, there is no missing data. At this point it is a good idea to name the columns for easier reference.

In [None]:
train.columns = ['timestamp', 'device', 'id_and_message']
train.head(3)

Unnamed: 0,timestamp,device,id_and_message
0,(1508687283.891357),slcan0,12E#C680027FD0FFFF00
1,(1508687283.891365),slcan0,090#1A000000
2,(1508687283.891368),slcan0,0C6#7512800A8008BAAC
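As an alternative to renaming after loading, the column names can be supplied when the log is read. The snippet below is a minimal sketch that uses three lines copied from the log above in place of the real file; `sample_log` and `sample` are illustrative names, not part of the notebook.

```python
import io
import pandas as pd

# Three lines copied from the training log shown above.
sample_log = io.StringIO(
    "(1508687283.891357) slcan0 12E#C680027FD0FFFF00\n"
    "(1508687283.891365) slcan0 090#1A000000\n"
    "(1508687283.891368) slcan0 0C6#7512800A8008BAAC\n"
)

columns = ['timestamp', 'device', 'id_and_message']
# dtype=str keeps every field as text, so nothing is reinterpreted on load.
sample = pd.read_csv(sample_log, sep=' ', header=None, names=columns, dtype=str)
print(sample)
```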


#### The `Timestamp` Feature
The first column holds the timestamp as Unix epoch time in seconds, wrapped in parentheses. Converting it to datetime makes it easier to work with:

In [None]:
timestamp = train['timestamp'].apply(lambda x: x.strip('()')) # strip parentheses
timestamp = pd.to_datetime(timestamp, unit='s') # convert to datetime
timestamp.head(3)

0   2017-10-22 15:48:03.891356945
1   2017-10-22 15:48:03.891365051
2   2017-10-22 15:48:03.891367912
Name: timestamp, dtype: datetime64[ns]
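The converted values carry extra digits (`.891356945` for a logged `.891357`) because the epoch seconds pass through a 64-bit float, which cannot represent every microsecond exactly at this magnitude. If exact microseconds matter, one option is to parse the seconds and the fractional part as integers. This is a sketch only, and it assumes the fractional part never has more than six meaningful digits; `raw` and `exact` are illustrative names.

```python
import pandas as pd

raw = pd.Series(['(1508687283.891357)', '(1508687283.891365)'])

# Split "seconds.fraction" so that no floating-point parsing is involved.
parts = raw.str.strip('()').str.split('.', expand=True)
seconds = parts[0].astype('int64')
# Pad or truncate the fraction to exactly six digits (microseconds).
micros = parts[1].str.ljust(6, '0').str[:6].astype('int64')

total_micros = seconds * 1_000_000 + micros
exact = pd.Series(total_micros.to_numpy().astype('datetime64[us]'))
print(exact)
```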

#### The `Device` Feature: What is slcan0?
`slcan0` is the name of the network interface that the CAN data was read from. The `slcan` prefix denotes a serial-line CAN (SLCAN) interface, that is, a serial or USB-serial CAN adapter exposed through Linux SocketCAN, and `0` is its index. The name is the same throughout the Renault logs because only one such interface, with index 0, was used. In the logs of the authors' custom CAN prototype the interface is named `can0`, which indicates a native CAN hardware interface. For more information on CAN interfaces, take a look at [this resource](https://elinux.org/Bringing_CAN_interface_up#Introduction), or section 6.4 in [this documentation](https://www.kernel.org/doc/Documentation/networking/can.txt).  

Because the `device` column only ever takes the value `slcan0`, it carries no information about the traffic and will not be included in the data used to train the models.
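Before dropping a column on these grounds, it is cheap to confirm that it really is constant. The following is a small sketch on a two-row stand-in for the log; `sample` and `constant_cols` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    'timestamp': ['(1508687283.891357)', '(1508687283.891365)'],
    'device': ['slcan0', 'slcan0'],
    'id_and_message': ['12E#C680027FD0FFFF00', '090#1A000000'],
})

# A column with a single distinct value cannot help separate normal from anomalous traffic.
constant_cols = [col for col in sample.columns if sample[col].nunique() == 1]
print(constant_cols)  # → ['device']

sample = sample.drop(columns=constant_cols)
```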

In [None]:
train.drop(columns='device', inplace=True) # drop slcan column

In [None]:
train.head(2)

Unnamed: 0,timestamp,id_and_message
0,(1508687283.891357),12E#C680027FD0FFFF00
1,(1508687283.891365),090#1A000000


#### Extracting CAN ID and packet data
Let's take a look at a single example:

In [None]:
train.iloc[0]

timestamp          (1508687283.891357)
id_and_message    12E#C680027FD0FFFF00
Name: 0, dtype: object

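From the README provided by the dataset's authors, we see that the second column contains both the CAN identifier and the message, separated by `#`. The messages differ in length, which is expected: a classical CAN frame carries between 0 and 8 data bytes. The payloads in the training log are at most 8 bytes long (the split further down yields 8 byte columns), so the data is consistent with classical CAN and does not require [CAN-FD](https://www.can-cia.org/can-knowledge/can/can-fd/), which extends the payload to as many as 64 bytes. Let's extract the CAN ID and data.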

In [None]:
id_and_message = train['id_and_message'].str.split('#', expand=True)
id_and_message.head(2)

Unnamed: 0,0,1
0,12E,C680027FD0FFFF00
1,090,1A000000
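The split above keeps the identifier and the payload as hexadecimal strings. For inspecting a single frame, it can help to decode both parts into native types. The helper below is a hypothetical sketch (`CanFrame` and `parse_frame` are not part of the dataset tooling) and assumes well-formed `ID#DATA` input.

```python
from typing import NamedTuple


class CanFrame(NamedTuple):  # hypothetical helper type, for illustration only
    can_id: int
    data: bytes


def parse_frame(text: str) -> CanFrame:
    """Parse an 'ID#DATA' string such as '12E#C680027FD0FFFF00'."""
    id_hex, _, data_hex = text.partition('#')
    return CanFrame(can_id=int(id_hex, 16), data=bytes.fromhex(data_hex))


frame = parse_frame('090#1A000000')
# The payload length is simply the number of decoded bytes.
print(hex(frame.can_id), len(frame.data), list(frame.data))  # → 0x90 4 [26, 0, 0, 0]
```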


##### CAN-ID
The first column of the `id_and_message` dataframe contains the CAN ID values.

In [None]:
id = id_and_message[0] # extract CAN ID
id.head()

0    12E
1    090
2    0C6
3    242
4    29C
Name: 0, dtype: object

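The documentation states that there are 55 distinct `id` values in the Renault's CAN, making `id` categorical. As a preprocessing step, the `id` column will be ordinal-encoded, so each distinct ID is mapped to a numeric code. A one-hot alternative is kept, commented out, after the encoding cell.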

In [None]:
id.nunique() # show the number of unique can_id values

55

In [None]:
id.unique()

array(['12E', '090', '0C6', '242', '29C', '352', '1F6', '186', '18A',
       '4AC', '211', '45C', '214', '5DF', '29A', '2B7', '217', '2C6',
       '354', '392', '5E9', '68B', '653', '564', '3B7', '500', '4F8',
       '218', '4FA', '671', '66A', '511', '5DA', '648', '65C', '350',
       '55D', '575', '1A0', '563', '5D7', '62C', '5DE', '673', '552',
       '303', '3FA', '666', '634', '657', '646', '433', '69F', '665',
       '6FB'], dtype=object)

In [None]:
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-55) # IDs unseen during fitting are encoded as -55
id = id.to_numpy().reshape(-1, 1) # reshape to the 2-D (n_samples, 1) array the encoder expects
id = enc.fit_transform(id) # map each CAN ID to its ordinal code

joblib.dump(enc, 'enc.bin', compress=True) # save ordinal encoder

['enc.bin']
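The `unknown_value` setting matters once the encoder is applied to attack logs, which may contain IDs that never occur in normal traffic. The sketch below fits a separate demo encoder on three IDs from the training data and then transforms `7FF`, a made-up ID chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

known_ids = np.array(['12E', '090', '0C6']).reshape(-1, 1)
enc_demo = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-55)
enc_demo.fit(known_ids)

# '7FF' is a made-up ID that was not seen during fitting.
codes = enc_demo.transform(np.array(['12E', '7FF']).reshape(-1, 1))
print(enc_demo.categories_[0])  # categories are stored in sorted order
print(codes.ravel())            # the known ID gets its code, the unseen one gets -55
```

Because the categories are sorted, `090`, `0C6` and `12E` receive the codes 0, 1 and 2, which matches the last column of the final dataframe in this notebook.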

In [None]:
# oh_encoder = OneHotEncoder(handle_unknown='ignore') # instantiate a one-hot encoder
# id_array = id.array.reshape(-1, 1) # prep the CAN data
# id_columns = oh_encoder.fit_transform(id_array).toarray() # transform the categorical data

# # print data type info and shape
# print(f'{type(id_columns)=}, {id_columns.dtype=}')
# print(id_columns.shape)

# joblib.dump(oh_encoder, 'oh_encoder.bin', compress=True) # save one-hot encoder

##### CAN Message
The second column of the `id_and_message` dataframe holds the transmitted message. Each pair of hex characters is one byte, so the message is split into one column per byte.

In [None]:
message = id_and_message[1].apply(lambda x: ' '.join(x[i:i+2] for i in range(0, len(x), 2))) # space out
message = message.str.split(' ', expand=True)
message.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,C6,80,02,7F,D0,FF,FF,00
1,1A,00,00,00,,,,
2,75,12,80,0A,80,08,BA,AC
3,00,00,FF,EF,FE,00,0D,
4,00,00,00,00,FF,FF,FF,FF


We see that each `message` is in hexadecimal form, with `None` values wherever the message is shorter than 8 bytes (the variable message length was noted earlier). Next, the missing bytes will be replaced with a dummy hex value of '00' and every byte will be converted to a float.

In [None]:
message.fillna('00', inplace=True) # fill missing values with 00
message = message.apply(lambda x : x.astype(str).map(lambda x : float(int(x, base=16)))) # convert hex values to float
print(message.shape)
message.head(3)

(270596, 8)


Unnamed: 0,0,1,2,3,4,5,6,7
0,198.0,128.0,2.0,127.0,208.0,255.0,255.0,0.0
1,26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,117.0,18.0,128.0,10.0,128.0,8.0,186.0,172.0
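The same padding and conversion can be expressed for a single payload, which is handy for spot-checking individual rows. This is a sketch under the assumption that payloads are at most 8 bytes; `payload_to_bytes` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def payload_to_bytes(data_hex: str, width: int = 8) -> np.ndarray:
    """Convert a hex payload to `width` float values, right-padded with zeros."""
    raw = bytes.fromhex(data_hex)
    padded = raw.ljust(width, b'\x00')  # same effect as filling missing bytes with '00'
    return np.frombuffer(padded, dtype=np.uint8).astype(float)


print(payload_to_bytes('1A000000'))
print(payload_to_bytes('C680027FD0FFFF00'))
```

Note that after padding, a short message is indistinguishable from a full-length message that ends in zero bytes.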


Let's create a new feature called `interval` that holds the time, in seconds, between each message and the one that follows it. It may serve as a helper feature and improve training. The last message has no successor, so its interval is set to 0. Because `interval` is a float computed from the epoch timestamps, it will be scaled with the `StandardScaler` together with the message bytes.

In [None]:
timestamp_flt = train['timestamp'].apply(lambda x: x.strip('()')) # strip parentheses on timestamp
timesteps = timestamp_flt.astype('float').values
# shift the timesteps by one element and pad it at the end with the last value
shifted_timesteps = np.append(timesteps[1:], timesteps[-1])
interval = shifted_timesteps - timesteps

In [None]:
interval

array([8.10623169e-06, 2.86102295e-06, 7.15255737e-06, ...,
       2.50816345e-04, 1.98125839e-04, 0.00000000e+00])
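The shift-and-subtract step can also be written with `np.diff`, whose `append` argument takes care of the padding. The sketch below compares the two forms on the first four timestamps of the training log; the variable names are illustrative.

```python
import numpy as np

# First four timestamps from the training log.
timesteps = np.array([1508687283.891357, 1508687283.891365,
                      1508687283.891368, 1508687283.891375])

# Form used in the notebook: shift by one and pad with the last value.
shifted = np.append(timesteps[1:], timesteps[-1])
interval_shift = shifted - timesteps

# np.diff with `append` repeats the last timestamp, so the final gap is 0.
interval_diff = np.diff(timesteps, append=timesteps[-1])

print(np.array_equal(interval_shift, interval_diff))  # → True
```

The intervals are only approximately equal to the logged differences (about 8.106e-06 s rather than 8e-06 s for the first pair) because the epoch values are stored as 64-bit floats.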

Numeric data in the dataset should be scaled before training. Scaling stabilizes model training and puts the features on a comparable scale, so that no feature dominates from the start simply because of its magnitude.

In [None]:
ss = StandardScaler()
scaled_num_cols = ss.fit_transform(pd.concat([message, pd.DataFrame(interval)], axis=1))
scaled_num_cols[:5]

array([[ 1.58171022,  0.42526157, -0.78101472,  0.44817721,  1.20661597,
         2.10680795,  1.73201824, -0.51253627, -0.54538308],
       [-0.57784532, -0.96487851, -0.80346713, -0.91896883, -0.94667383,
        -0.61326702, -0.72375158, -0.51253627, -0.54944963],
       [ 0.56471023, -0.76939006,  0.63348665, -0.81131954,  0.37842758,
        -0.52793134,  1.06751582,  1.38737938, -0.54612245],
       [-0.90428976, -0.96487851,  2.05921423,  1.65384932,  1.68282429,
        -0.61326702, -0.59855547, -0.51253627, -0.550189  ],
       [-0.90428976, -0.96487851, -0.80346713, -0.91896883,  1.69317665,
         2.10680795,  1.73201824,  2.30419914, -0.55000416]])
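`StandardScaler` subtracts each column's mean and divides by its standard deviation. The sketch below checks that against a manual computation on toy values, which are illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (illustrative values only).
X = np.array([[198.0, 0.0081],
              [26.0, 0.0029],
              [117.0, 0.0072]])

scaled_sklearn = StandardScaler().fit_transform(X)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0).
scaled_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled_sklearn, scaled_manual))  # → True
```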

In [None]:
joblib.dump(ss, 'scl.bin', compress=True) # save the standard scaler

['scl.bin']
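The scaler is saved so that other logs can be scaled with the statistics learned from normal traffic. A minimal sketch of that round trip follows, assuming the later notebooks reload the file with `joblib.load`. The arrays are illustrative stand-ins and a temporary directory is used in place of the working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

train_X = np.array([[198.0], [26.0], [117.0]])  # illustrative training values
new_X = np.array([[0.0], [255.0]])              # stands in for rows from another log

scaler = StandardScaler().fit(train_X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scl.bin')
    joblib.dump(scaler, path, compress=True)
    restored = joblib.load(path)

# transform (not fit_transform) applies the training mean and scale unchanged.
new_scaled = restored.transform(new_X)
print(new_scaled.ravel())
```

Calling `fit_transform` on the new data instead would recompute the mean and scale from that data, which would hide shifts relative to normal traffic.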

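Now that the CAN ID and message features have been processed, let's join them with the converted timestamp to form the final training dataframe, which replaces the original columns. In the result, columns 0 to 7 are the scaled message bytes, column 8 is the scaled `interval` and the last column is the ordinal-encoded CAN ID.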

In [None]:
train = pd.concat([timestamp, pd.DataFrame(scaled_num_cols), pd.DataFrame(id)], axis=1)
print(train.shape)
train.head()

(270596, 11)


Unnamed: 0,timestamp,0,1,2,3,4,5,6,7,8,0.1
0,2017-10-22 15:48:03.891356945,1.58171,0.425262,-0.781015,0.448177,1.206616,2.106808,1.732018,-0.512536,-0.545383,2.0
1,2017-10-22 15:48:03.891365051,-0.577845,-0.964879,-0.803467,-0.918969,-0.946674,-0.613267,-0.723752,-0.512536,-0.54945,0.0
2,2017-10-22 15:48:03.891367912,0.56471,-0.76939,0.633487,-0.81132,0.378428,-0.527931,1.067516,1.387379,-0.546122,1.0
3,2017-10-22 15:48:03.891375065,-0.90429,-0.964879,2.059214,1.653849,1.682824,-0.613267,-0.598555,-0.512536,-0.550189,11.0
4,2017-10-22 15:48:03.891376972,-0.90429,-0.964879,-0.803467,-0.918969,1.693177,2.106808,1.732018,2.304199,-0.550004,13.0


Having understood the data and how it will be presented for training, the next step is to create anomaly detection models using the different approaches. That work is done in a separate notebook.