<a href="https://colab.research.google.com/github/akin-oladejo/can-anomaly-detection/blob/main/notebooks/Analysis%20of%20normal%20Renault%20CAN%20Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of the Renault CAN Data

## The Task
The Controller Area Network (CAN) is currently the most widely used in-vehicle networking protocol. It is a bidirectional, multi-master serial bus that uses differential signaling over twisted-pair cabling to remain reliable in electromagnetically noisy environments. Many devices in modern vehicles communicate with each other over CAN. Some of these devices are connected to the internet, which exposes the bus to external attacks such as replay, spoofing and Denial of Service (DoS). This project applies detection methods drawn from several research articles to identify attacks on the CAN bus.  

The methods are:
1. Frequency-based detection
2. Anomaly detection using the latent representation of normal data (i.e. autoencoders)
3. Anomaly detection using vehicle state transitions (i.e. Hidden Markov Models)

4. *Bonus: Anomaly detection using a triplet-loss network*



In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
import joblib
import json
import warnings

In [2]:
# config
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## The Dataset
The dataset used in this project is the **Automotive Controller Area Network (CAN) Bus Intrusion Dataset v2**. It can be found [here](https://figshare.com/articles/dataset/Automotive_Controller_Area_Network_CAN_Bus_Intrusion_Dataset/12696950). This dataset contains automotive Controller Area Network (CAN) bus data from three systems: two cars (Opel Astra and Renault Clio) and from a CAN bus prototype made by the authors of the dataset. Its purpose is to evaluate CAN bus Network Intrusion Detection Systems (NIDS). For each vehicle/system, there is a collection of log files captured from its CAN bus: normal (attack-free) data for training and testing detection algorithms, and different CAN bus attacks (Diagnostic, Fuzzing attacks, Replay attack, Suspension attack and Denial-of-Service attack).

For this project, I used data from the Renault Clio. The description of the different logs and attacks can be found in `../RenaultClio/README.md`

In [3]:
# store the locations of the datasets
data_path = 'data'
with open(f'../{data_path}/metadata.json', 'r') as f:
    dataset = json.load(f)

train_path = dataset['train']['path']
train = pd.read_csv(train_path,
                    delimiter=' ',
                    # header=None,
                    names=['timestamp', 'device', 'id_and_message'])

Note that the ambient logs used as the training dataset were captured on the dyno. This keeps the data uniform, since the attack data was captured on the dyno too.

In [4]:
print(train.shape) # print the number of rows and columns
train.head() # print first 5 rows

(1424208, 3)


Unnamed: 0,timestamp,device,id_and_message
0,(1090000000.000000),can0,107#0000000000000000
1,(1090000000.001017),can0,FFF#0000000000000000
2,(1090000000.001018),can0,655#8800010178000010
3,(1090000000.001019),can0,0BA#04A7F484000606A0
4,(1090000000.010104),can0,32D#8000042758010000


In [5]:
train.isnull().sum() # print the number of missing values in each column

timestamp         0
device            0
id_and_message    0
dtype: int64

Great, there is no missing data. 

#### The `Timestamp` Feature
The first column holds the timestamp as Unix epoch seconds, wrapped in parentheses. Converting it to datetime makes it easier to work with:

In [6]:
timestamp = train['timestamp'].apply(lambda x: x.strip('()')) # strip parentheses
train['timestamp'] = pd.to_datetime(timestamp, unit='s') # convert to datetime

In [7]:
timestamp.head(3)

0    1090000000.000000
1    1090000000.001017
2    1090000000.001018
Name: timestamp, dtype: object
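
As a quick sanity check, the same epoch-to-datetime conversion can be reproduced for a single value with the standard library. This is only a minimal sketch: the helper name `parse_epoch` is made up for illustration and is not part of the notebook, and the result is expressed in UTC.

```python
from datetime import datetime, timezone

def parse_epoch(raw: str) -> datetime:
    """Turn a '(seconds.micros)' string from the log into a UTC datetime."""
    seconds = float(raw.strip('()'))  # same parenthesis stripping as in the cell above
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

first = parse_epoch('(1090000000.000000)')
print(first.isoformat())  # 2004-07-16T17:46:40+00:00
```

The date agrees with the `2004-07-16 17:46:40` value that pandas reports for the first row.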

#### The `Device` Feature: What is slcan0?
`slcan0` is the name of the network interface that read the CAN data: the `slcan` prefix denotes a serial-line CAN (SLCAN) adapter and `0` is its index. It is constant throughout the Renault logs because only one such device, with index 0, was used. In the logs of the authors' custom CAN prototype the interface is named `can0`, which indicates a native CAN device. For more information on CAN interfaces, take a look at [this resource](https://elinux.org/Bringing_CAN_interface_up#Introduction), or section 6.4 in [this documentation](https://www.kernel.org/doc/Documentation/networking/can.txt).  

Because the `device` column is constant, it carries no information about the CAN traffic and will not be included in the data used to train the models.

In [8]:
train.drop(columns='device', inplace=True) # drop slcan column

In [9]:
train.head(2)

Unnamed: 0,timestamp,id_and_message
0,2004-07-16 17:46:40,107#0000000000000000
1,2004-07-16 17:46:40,FFF#0000000000000000


#### Extracting CAN ID and packet data
Let's take a look at a single example:

In [10]:
train.iloc[0]

timestamp          2004-07-16 17:46:40
id_and_message    107#0000000000000000
Name: 0, dtype: object

From the documentation in the README provided by the dataset's authors, we see that the second feature contains both the CAN arbitration identifier and the message, separated by `#`. Let's extract the arbitration ID and the data.
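
To make the layout concrete, here is a minimal sketch that parses one raw log line by hand. The line format is assumed from the three columns printed by `train.head()` above, and `parse_candump_line` is a hypothetical helper rather than something the notebook defines.

```python
def parse_candump_line(line: str):
    """Split one log line into (timestamp, interface, arbitration ID, payload bytes)."""
    raw_ts, interface, frame = line.split()    # the three space-separated fields
    arb_id, _, payload = frame.partition('#')  # ID before '#', payload after it
    data = [payload[i:i + 2] for i in range(0, len(payload), 2)]  # two hex digits per byte
    return float(raw_ts.strip('()')), interface, arb_id, data

ts, iface, arb_id, data = parse_candump_line('(1090000000.001018) can0 655#8800010178000010')
print(arb_id, data)  # 655 ['88', '00', '01', '01', '78', '00', '00', '10']
```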

In [11]:
id_and_message = train['id_and_message'].str.split('#', expand=True)
id_and_message.head(2)

Unnamed: 0,0,1
0,107,0000000000000000
1,FFF,0000000000000000


##### CAN-ID
The first column of the `id_and_message` dataframe contains the arbitration ID values.

In [12]:
id = id_and_message[0] # extract CAN ID
id.head()

0    107
1    FFF
2    655
3    0BA
4    32D
Name: 0, dtype: object

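The documentation states that there are 55 distinct `id` values in the Renault's CAN, making `id` a categorical feature. As a preprocessing step, each `id` will be converted from its hexadecimal string to a number and standardized, with a reserved value for IDs that were not seen during training.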

In [13]:
# show the number of unique can_id values
print('Number of arbitration IDs present in the training data: ', id.nunique())
unique_ids = id.unique()
unique_ids

Number of arbitration IDs present in the training data:  106


array(['107', 'FFF', '655', '0BA', '32D', '03C', '207', '4C9', '1D6',
       '533', '5FD', '273', '1C4', '2A4', '371', '67D', '51B', '0CC',
       '0FD', '1A4', '1E5', '3B9', '21D', '0F1', '041', '297', '5AF',
       '419', '1AA', '30A', '006', '576', '0F8', '585', '5B3', '2B4',
       '2AB', '6D7', '12C', '26E', '025', '193', '20E', '522', '366',
       '580', '434', '6E0', '498', '464', '230', '2B7', '671', '2E1',
       '354', '5E1', '162', '0A7', '2A3', '28B', '6FC', '239', '0D7',
       '497', '00E', '2E2', '345', '430', '130', '2D2', '55C', '407',
       '075', '03A', '03D', '4EE', '69D', '1CA', '5E8', '277', '4FD',
       '3A2', '66C', '618', '4E7', '153', '295', '662', '684', '636',
       '2D7', '19C', '0D0', '033', '274', '0C0', '3E4', '65C', '577',
       '69E', '125', '4CB', '280', '0F4', '3C1', '2C1'], dtype=object)

In [14]:
class ArbitrationEncoder():
    """
    Convert arbitration IDs into floats and normalize them. During transformation,
    previously unseen IDs are mapped to an unknown_value parameter that is either
    passed during instantiation or calculated as the negative of the number of unique valid IDs

    Example
    =======
    >>> enc = ArbitrationEncoder(unknown_value = -150)
    >>> enc.fit(df['arbitration_id'])
    >>> enc.transform(test_df['arbitration_id'])
    """
    def __init__(self, unknown_value:int|None=None):
        self.valid_ids = set()
        self.unknown_value = unknown_value # None means "derive it in fit()"

    @np.vectorize
    def make_float(self, val):
        """Convert a hex string to float"""
        return float(int(val, base=16)) # parse as base 16, then cast to float

    def fit(self, train_ids):
        """
        Determine the unique elements, standard deviation and mean of the input array.
        Also, calculate the default value for unseen IDs if not passed at instantiation
        """
        self.valid_ids = set(train_ids) # get unique train IDs
        # `make_float` is a plain np.vectorize object, not a bound method, so `self` is passed explicitly
        self.converted_valid_ids = self.make_float(self, train_ids)
        self.std = np.std(self.converted_valid_ids) # std over every row, not just the unique IDs
        self.mean = np.mean(self.converted_valid_ids) # mean over every row
        if self.unknown_value is None: # an explicitly passed 0 is kept
            self.unknown_value = -(len(self.valid_ids))

    @np.vectorize
    def convert_and_scale(self, val):
        """
        Normalize array (compute the z-score). Note that all previously unseen values have the
        same transformation since `unknown_value` is a constant.
        """
        if not val in self.valid_ids:
            return (self.unknown_value - self.mean)/self.std
        else:
            converted_id = self.make_float(self, val) # convert from hex to float
            return (converted_id - self.mean)/self.std #return z score
        
    def transform(self, arr):
        """
        Transform array after running the `fit()` method
        """
        return self.convert_and_scale(self, arr)

    def fit_transform(self, arr):
        self.fit(arr)
        return self.transform(arr)

In [15]:
ex = np.array(['A', 'B', 'C', 'D'])
enc = ArbitrationEncoder(unknown_value = -(ex.shape[0]))
enc.fit(ex)
print("transformation with seen ID's:")
print(enc.transform(ex))

print('\ntransformation with unseen ID:')
enc.transform(['A', 'BB', 'C', 'D'])

transformation with seen ID's:
[-1.34164079 -0.4472136   0.4472136   1.34164079]

transformation with unseen ID:


array([ -1.34164079, -13.86362146,   0.4472136 ,   1.34164079])

In [16]:
ex2 = enc.fit_transform(ex)
ex2

array([-1.34164079, -0.4472136 ,  0.4472136 ,  1.34164079])
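
The numbers printed above can be checked by hand. The sketch below redoes the arithmetic with the standard library only; it is an illustrative re-implementation, not the class itself, and `z_score` is a made-up name. The hex IDs `A` to `D` become 10 to 13, whose mean is 11.5 and whose population standard deviation is about 1.118, and the unseen ID `BB` falls back to the sentinel value -4.

```python
from statistics import fmean, pstdev

seen = ['A', 'B', 'C', 'D']
values = [float(int(v, 16)) for v in seen]  # 10.0, 11.0, 12.0, 13.0
mean, std = fmean(values), pstdev(values)   # population std, matching np.std's default

def z_score(raw_id, unknown_value=-4):
    """z-score of a seen ID, or of the sentinel value for an unseen one."""
    x = float(int(raw_id, 16)) if raw_id in seen else float(unknown_value)
    return (x - mean) / std

print([round(z_score(v), 4) for v in ['A', 'BB', 'C', 'D']])
# [-1.3416, -13.8636, 0.4472, 1.3416]
```

Because the sentinel is far below every real ID, an unseen ID lands many standard deviations away from the training data.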

Now, let's use this class to encode the arbitration IDs:

In [17]:
enc = ArbitrationEncoder()
id = enc.fit_transform(id)

In [18]:
# enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-106)
# id = id.array.reshape(-1, 1) # prep the CAN data
# id = enc.fit_transform(id) # transform the categorical data

In [19]:
# joblib.dump(enc, '../bin/id_scl.bin', compress=True) # save ordinal encoder

##### CAN Message
The second column of the `id_and_message` dataframe contains the transmitted message.

In [20]:
message = id_and_message[1].apply(lambda x: ' '.join(x[i:i+2] for i in range(0, len(x), 2))) # space out
message = message.str.split(' ', expand=True)
message.head()

Unnamed: 0,0,1,2,3,4,5,6,7
0,00,00,00,00,00,00,00,00
1,00,00,00,00,00,00,00,00
2,88,00,01,01,78,00,00,10
3,04,A7,F4,84,00,06,06,A0
4,80,00,04,27,58,01,00,00


`message` is currently in hexadecimal form. A message shorter than 8 bytes would leave `None` values in its trailing columns. Next, any such missing values will be replaced with a dummy hex value of '00' and all values will be converted to float.

In [21]:
message.isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
dtype: int64

In [22]:
message.fillna('00', inplace=True) # fill missing values with 00, though that has been handled in this dataset
message = message.apply(lambda x : x.astype(str).map(lambda x : float(int(x, base=16)))) # convert hex values to float
print(message.shape)
message.head(3)

(1424208, 8)


Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,136.0,0.0,1.0,1.0,120.0,0.0,0.0,16.0
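
For a single payload, the split, pad and convert steps can be sketched as follows. `payload_to_floats` is a hypothetical helper written for illustration, and the 8-byte width is taken from the shape of the `message` dataframe above.

```python
def payload_to_floats(payload: str, width: int = 8):
    """Split a hex payload into bytes, pad to `width` bytes with '00', convert to floats."""
    pairs = [payload[i:i + 2] for i in range(0, len(payload), 2)]
    pairs += ['00'] * (width - len(pairs))  # pad short frames, mirroring fillna('00')
    return [float(int(p, 16)) for p in pairs]

print(payload_to_floats('8800010178000010'))  # [136.0, 0.0, 1.0, 1.0, 120.0, 0.0, 0.0, 16.0]
print(payload_to_floats('04A7'))              # [4.0, 167.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The first call reproduces row 2 of the table above; the second shows how a two-byte frame would be padded.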


Let's create a new feature called `interval` that holds the time elapsed between each message and the next one. It may serve as a helper feature and improve training. Because it is a float (the difference between two epoch timestamps, in seconds), it will also be scaled using the `StandardScaler`.

In [23]:
timestamp_flt = timestamp.apply(lambda x: x.strip('()')) # strip parentheses on timestamp
timesteps = timestamp_flt.astype('float').values
# shift the timesteps by one element and pad it at the end with the last value
shifted_timesteps = np.append(timesteps[1:], timesteps[-1])
interval = shifted_timesteps - timesteps

In [24]:
interval

array([1.01709366e-03, 9.53674316e-07, 9.53674316e-07, ...,
       1.02901459e-03, 3.07083130e-03, 0.00000000e+00])
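
The values of about 9.54e-07 in this output are float rounding artifacts: 64-bit floats near 1.09e9 are spaced roughly 2.4e-7 apart, so a gap of one microsecond is only approximated. If exact gaps were needed, one alternative (not what this notebook does) would be to parse the timestamps as whole microseconds first. A minimal sketch, with `intervals_us` as a made-up name:

```python
from decimal import Decimal

def intervals_us(raw_timestamps):
    """Gaps between consecutive messages in whole microseconds; the last gap is padded with 0."""
    micros = [int(Decimal(t.strip('()')) * 1_000_000) for t in raw_timestamps]
    gaps = [later - earlier for earlier, later in zip(micros, micros[1:])]
    return gaps + [0]  # same end padding as the shifted array above

print(intervals_us(['(1090000000.000000)', '(1090000000.001017)',
                    '(1090000000.001018)', '(1090000000.001019)']))
# [1017, 1, 1, 0]
```

Since the feature is standardized afterwards, the rounding error in the float version is unlikely to matter in practice.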

Numeric features should be scaled before training. Scaling stabilizes model training and puts the features on a common scale, so that no feature dominates simply because of its magnitude.

In [25]:
ss = StandardScaler()
scaled_num_cols = ss.fit_transform(pd.concat([message, pd.DataFrame(interval)], axis=1))
scaled_num_cols[:5]

array([[-0.68802256, -0.75911542, -0.73223633, -0.71579007, -0.74874214,
        -0.71050963, -0.82555032, -0.92925816,  0.66402706],
       [-0.68802256, -0.75911542, -0.73223633, -0.71579007, -0.74874214,
        -0.71050963, -0.82555032, -0.92925816, -0.55135791],
       [ 1.42715829, -0.75911542, -0.72044567, -0.70300528,  0.77049696,
        -0.71050963, -0.82555032, -0.72528706, -0.55135791],
       [-0.62581136,  1.448487  ,  2.14468385,  0.9718028 , -0.74874214,
        -0.62685191, -0.75482356,  1.11045283, 10.31381871],
       [ 1.30273589, -0.75911542, -0.6850737 , -0.21718308,  0.36536653,
        -0.69656667, -0.82555032, -0.92925816, 11.39517483]])

In [26]:
# joblib.dump(ss, '../bin/msg_scl.bin', compress=True) # save the standard scaler

Now that the arbitration ID and message features have been processed, let's join them to the timestamps in a single dataframe that replaces the original `id_and_message` column.

In [27]:
train = pd.concat([train['timestamp'], pd.DataFrame(scaled_num_cols), pd.DataFrame(id)], axis=1)
print(train.shape)
train.head()

(1424208, 11)


Unnamed: 0,timestamp,0,1,2,3,4,5,6,7,8,0.1
0,2004-07-16 17:46:40,-0.688023,-0.759115,-0.732236,-0.71579,-0.748742,-0.71051,-0.82555,-0.929258,0.664027,-0.721516
1,2004-07-16 17:46:40,-0.688023,-0.759115,-0.732236,-0.71579,-0.748742,-0.71051,-0.82555,-0.929258,-0.551358,3.240762
2,2004-07-16 17:46:40,1.427158,-0.759115,-0.720446,-0.703005,0.770497,-0.71051,-0.82555,-0.725287,-0.551358,0.682652
3,2004-07-16 17:46:40,-0.625811,1.448487,2.144684,0.971803,-0.748742,-0.626852,-0.754824,1.110453,10.313819,-0.801134
4,2004-07-16 17:46:40,1.302736,-0.759115,-0.685074,-0.217183,0.365367,-0.696567,-0.82555,-0.929258,11.395175,-0.152817


Having understood the data and how it will be presented for training, let's go ahead to create anomaly detection models using different approaches. That notebook can be found here