# Anomaly Detection for Cyber-Security

In this notebook, I explain an ensemble approach to detecting anomalies (i.e., novelty detection) in an example dataset. The dataset comes from a MIL-STD-1553 bus. More information about the bus can be found here:
[MIL-STD-1553](https://en.wikipedia.org/wiki/MIL-STD-1553).


### Challenges:

- Normal-only training data: no labelled anomalies are available.
- Diverse data types, including text, high-cardinality categorical, numerical, and boolean.

In [2]:
#loading pandas library
import pandas as pd

#loading the input data
df = pd.read_csv('data/data.csv')

#shape of the data
print('Features data-frame has the dimension of', df.shape, 'with the following data types:\n')

#default data types
print(df.dtypes)

#printing the first few rows
df.head()

Features data-frame has the dimension of (9886, 6) with the following data types:

gap        float64
addr         int64
rxtx          bool
subaddr      int64
count        int64
data        object
dtype: object


Unnamed: 0,gap,addr,rxtx,subaddr,count,data
0,93134.6,7,False,1,10,48b93dc901e443c4fab151cb1cc4cd58d469de23
1,11.7,7,False,2,5,cb1f36e5d457f78a14a6
2,11.7,7,False,6,25,efe76a30f55010989d264c7d213c4225ee1bdf5f91b71c...
3,11.7,7,False,30,2,88830b3d
4,11.7,7,True,3,32,8c370935ffb387bc22895e9c028648e3676553f7ccb258...


## Preprocessing

From the training dataset, it appears that `addr` and `subaddr` are best represented as categorical rather than numeric values, since each takes only a small, finite set of values. `rxtx` is a boolean feature that can be converted to `0` or `1`.


In [3]:
df['addr'].unique()

array([7, 4, 1])

In [4]:
df['subaddr'].unique()

array([ 1,  2,  6, 30,  3, 20, 16, 19, 26])

Therefore, I perform the following type conversion:

In [5]:
df['addr'] = df['addr'].astype('object')
df['subaddr'] = df['subaddr'].astype('object')
df['rxtx'] = df['rxtx'] * 1

Additionally, since each `subaddr` is nested within an `addr`, I create a new `fulladdr` feature that combines the `addr` and `subaddr` information:

In [6]:
df['fulladdr'] = df['addr'].astype('str')+ '-' + df['subaddr'].astype('str')

The new `df` includes the following datatypes:

In [7]:
df.dtypes

gap         float64
addr         object
rxtx          int64
subaddr      object
count         int64
data         object
fulladdr     object
dtype: object

### Handling the Address Columns

In [8]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

addr_subaddr_fulladdr = pd.DataFrame(encoder.fit_transform(df[['addr','subaddr', 'fulladdr']]).toarray())
addr_subaddr_fulladdr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,19,20,21,22,23,24,25,26,27,28
0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
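One practical question with one-hot encoding is what happens when new data contains an address that never occurred in training. The sketch below is an illustration on a toy frame, not the notebook's own configuration: it assumes the encoder is created with `handle_unknown="ignore"`, in which case an unseen category is encoded as an all-zero block instead of raising an error. The address value `"9"` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training frame using address values that appear in the notebook.
train = pd.DataFrame({"addr": ["7", "4", "1"], "subaddr": ["1", "2", "6"]})

# handle_unknown="ignore" encodes unseen categories as zeros
# instead of raising an error at transform time.
toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(train)

# Hypothetical new message whose addr ("9") never occurred in training.
new = pd.DataFrame({"addr": ["9"], "subaddr": ["2"]})
encoded = toy_encoder.transform(new).toarray()

print(toy_encoder.categories_)  # learned categories per column
print(encoded)                  # addr block is all zeros, subaddr "2" is set
```

Whether an all-zero address block should itself count as suspicious is a design choice; here it is simply passed on to the downstream models.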


### Handling the `data` Column

In [9]:
# lambda to convert the data column to sentences
f1 = lambda s: ' '.join([s[i:i + 4] for i in range(0, len(s), 4)])

# lambda to convert the data column to words
f2 = lambda s: [s[i:i + 4] for i in range(0, len(s), 4)]

sentences = df['data'].apply(f1)
words  = df['data'].apply(f2)
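To make the splitting step concrete, here is a small worked example on one payload from the preview above. Each chunk is 4 hex characters, i.e. 16 bits, which is the size of a MIL-STD-1553 data word; in the preview rows the number of chunks also matches the `count` column. The helper name `split_words` is introduced only for this illustration.

```python
def split_words(payload, width=4):
    """Split a hex payload into fixed-width chunks (4 hex chars = 16 bits)."""
    return [payload[i:i + width] for i in range(0, len(payload), width)]


payload = "88830b3d"  # `data` value of row 3 in the preview (count = 2)
example_words = split_words(payload)
example_sentence = " ".join(example_words)

print(example_words)     # ['8883', '0b3d']
print(example_sentence)  # 8883 0b3d
```

The space-separated form is what lets the text vectorizers treat every 16-bit word as a token.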

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(min_df=2)
sentences_count = pd.DataFrame(count.fit_transform(sentences).toarray())

print(sentences_count.shape)

(9886, 30998)


In [11]:
from sklearn.decomposition import PCA
pca=PCA(n_components=10)
sentences_count_pca = pca.fit_transform(sentences_count)
sentences_count_pca =pd.DataFrame(sentences_count_pca)
print(sentences_count_pca.shape)
sentences_count_pca.head()

(9886, 10)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.039008,-0.015669,0.022084,0.003606,0.024607,0.009855,0.038612,-0.001895,-0.00635,-0.008358
1,-0.005312,-0.002406,0.008544,-0.002722,0.009519,0.005489,-0.005242,0.000975,-0.002882,0.001247
2,0.030437,-0.035597,-0.023934,0.01509,0.010429,0.042983,0.027306,-0.003119,-0.044408,-0.115033
3,-0.007886,-0.008575,-0.001842,-0.00547,0.001463,0.002024,0.002743,0.000367,0.000164,-0.003985
4,0.253842,0.19001,-0.083564,-0.037267,-0.201679,-0.118436,0.045518,0.177863,-0.045455,-0.204556
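Converting the count matrix to a dense array before PCA is memory-hungry: 9886 × 30998 entries at 8 bytes each is roughly 2.4 GB. A possible alternative, sketched below on a toy corpus, is to keep the matrix sparse and reduce it with `TruncatedSVD`. This is a swapped-in technique, not what the notebook does: `TruncatedSVD` does not center the columns, so its components are similar in spirit to PCA but not identical. The hex words in the corpus are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in corpus of space-separated hex words (hypothetical values).
toy_sentences = [
    "8883 0b3d",
    "8883 cb1f 36e5",
    "cb1f 36e5 d457",
    "0b3d d457 8883",
]

vectorizer = CountVectorizer()
counts_sparse = vectorizer.fit_transform(toy_sentences)  # SciPy sparse matrix

# TruncatedSVD accepts sparse input directly, so no .toarray() is needed.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts_sparse)

print(counts_sparse.shape, reduced.shape)
```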


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
sentences_tfidf = pd.DataFrame(tfidf.fit_transform(sentences).toarray())

print(sentences_tfidf.shape)
sentences_tfidf.head()

(9886, 52250)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,52240,52241,52242,52243,52244,52245,52246,52247,52248,52249
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Merging all the processed data 

In [13]:
xdata = pd.concat([df[['rxtx', 'gap', 'count']], addr_subaddr_fulladdr, sentences_count_pca], axis=1)
print(xdata.shape)
xdata.head()

(9886, 42)


Unnamed: 0,rxtx,gap,count,0,1,2,3,4,5,6,...,0.1,1.1,2.1,3.1,4.1,5.1,6.1,7,8,9
0,0,93134.6,10,0.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.039008,-0.015669,0.022084,0.003606,0.024607,0.009855,0.038612,-0.001895,-0.00635,-0.008358
1,0,11.7,5,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,-0.005312,-0.002406,0.008544,-0.002722,0.009519,0.005489,-0.005242,0.000975,-0.002882,0.001247
2,0,11.7,25,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.030437,-0.035597,-0.023934,0.01509,0.010429,0.042983,0.027306,-0.003119,-0.044408,-0.115033
3,0,11.7,2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.007886,-0.008575,-0.001842,-0.00547,0.001463,0.002024,0.002743,0.000367,0.000164,-0.003985
4,1,11.7,32,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.253842,0.19001,-0.083564,-0.037267,-0.201679,-0.118436,0.045518,0.177863,-0.045455,-0.204556
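In the merged frame, the one-hot block and the PCA block both carry integer column labels starting at 0, so labels repeat (the preview shows the repeats as `0.1`, `1.1`, ...), and they are mixed with string labels such as `rxtx`. Newer scikit-learn releases may reject frames whose column names mix strings and integers. A minimal sketch of one way to avoid both issues is to prefix each block before concatenating; the prefixes `addr_` and `pca_` and the toy values are assumptions made for this example.

```python
import pandas as pd

flags = pd.DataFrame({"rxtx": [0, 1], "gap": [11.7, 11.7]})
onehot = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])          # integer labels 0, 1
text_pca = pd.DataFrame([[0.03, -0.01], [0.25, 0.19]])   # also labelled 0, 1

# add_prefix turns the integer labels into unique string labels.
merged = pd.concat(
    [flags, onehot.add_prefix("addr_"), text_pca.add_prefix("pca_")],
    axis=1,
)

print(list(merged.columns))
```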


# Ensemble Model

We use an ensemble approach that combines the results of six anomaly detection models. The final output is the average of the outputs of all six models.

The models include:

1. [One-Class SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html)
2. [Isolation Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html)
3. [Local Outlier Factor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html)
4. [Elliptic Envelope](https://scikit-learn.org/stable/modules/generated/sklearn.covariance.EllipticEnvelope.html)

**Note**: The One-Class SVM approach has been implemented with three different kernel functions (`rbf`, `sigmoid` and `linear`), which brings the total to six models.

### Importing the related modules

In [14]:
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope


### Creating the models

In [15]:
ocsvm_rbf = OneClassSVM(gamma = 'scale', kernel = 'rbf', nu = 0.001) 
ocsvm_sigmoid = OneClassSVM(gamma = 'auto', kernel = 'sigmoid', nu = 0.01) 
ocsvm_linear = OneClassSVM(kernel = 'linear', nu = 0.001) 
ifo = IsolationForest(contamination = 0.001)  
lof = LocalOutlierFactor(contamination = 0.001, novelty = True)
ee = EllipticEnvelope(contamination= 0.001)

### Fitting the data

In [16]:
ocsvm_rbf.fit(xdata)
ocsvm_sigmoid.fit(xdata)
ocsvm_linear.fit(xdata)
ifo.fit(xdata)
lof.fit(xdata)
ee.fit(xdata)



EllipticEnvelope(contamination=0.001)

### Predictions

Because this is one-class classification without any validation data, here we only look at the predictions of the training data. 

In [17]:
ocsvm_preds1 = ocsvm_rbf.predict(xdata)
ocsvm_preds2 = ocsvm_sigmoid.predict(xdata)
ocsvm_preds3 = ocsvm_linear.predict(xdata)
if_preds = ifo.predict(xdata)
lof_preds = lof.predict(xdata)
ee_preds = ee.predict(xdata)

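Below we count how many training points each of the six models labels as normal (+1) and as anomalous (-1). Because the training data is assumed to contain only normal traffic, every -1 here is a false alarm.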

In [18]:
from collections import Counter

print('OC-SVM rbf:', dict(Counter(ocsvm_preds1)))
print('OC-SVM sigmoid:', dict(Counter(ocsvm_preds2)))
print('OC-SVM linear:', dict(Counter(ocsvm_preds3)))
print('Isolation Forest:', dict(Counter(if_preds)))
print('Local Outlier Factor:', dict(Counter(lof_preds)))
print('Elliptic Envelope:', dict(Counter(ee_preds)))

OC-SVM rbf: {1: 9869, -1: 17}
OC-SVM sigmoid: {1: 9500, -1: 386}
OC-SVM linear: {1: 9846, -1: 40}
Isolation Forest: {1: 9876, -1: 10}
Local Outlier Factor: {1: 9878, -1: 8}
Elliptic Envelope: {1: 9876, -1: 10}
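The counts for Isolation Forest and Elliptic Envelope line up with the `contamination` setting, which fixes the fraction of training points that fall below the decision threshold. A quick check of the arithmetic (the other models set their thresholds differently, so their counts need not match):

```python
n_samples = 9886        # rows in the training data
contamination = 0.001   # value passed to IsolationForest and EllipticEnvelope

# Approximate number of training points expected to be flagged as -1.
expected_flags = round(n_samples * contamination)
print(expected_flags)  # 10
```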


We give equal weights to the output of each model and compute, for each data point, the fraction of models that predict an anomaly.

In [19]:
predictions = ((ocsvm_preds1 == -1) * 1 + 
         (ocsvm_preds2 == -1) * 1 + 
         (ocsvm_preds3 == -1) * 1 + 
         (if_preds == -1) * 1 +
         (lof_preds == -1) * 1 +
         (ee_preds == -1) * 1) / 6

print(dict(Counter(predictions)))

{0.0: 9458, 0.16666666666666666: 386, 0.3333333333333333: 41, 0.5: 1}


Finally, with a threshold of 0.5, we declare a data point an "anomaly" if at least half of the models (three of six) flag it.

In [20]:
Anomalies = predictions >= 0.5 

print('Anomalies:', dict(Counter(Anomalies)))

Anomalies: {False: 9885, True: 1}
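The voting and thresholding steps can be wrapped in a small helper. The sketch below is an illustration on synthetic data with three detectors rather than the notebook's six; the function name `anomaly_vote_fraction`, the random data and the hyperparameter values are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def anomaly_vote_fraction(models, X):
    """Return, per row, the fraction of models that predict -1 (anomaly)."""
    votes = np.stack([model.predict(X) == -1 for model in models])
    return votes.mean(axis=0)


# Synthetic stand-in for the feature matrix: 200 "normal" points near the origin.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))

models = [
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.01),
    IsolationForest(contamination=0.01, random_state=0),
    LocalOutlierFactor(contamination=0.01, novelty=True),
]
for model in models:
    model.fit(X_train)

# One point that resembles the training data and one far away from it.
X_new = np.array([[0.0, 0.0, 0.0], [25.0, 25.0, 25.0]])
fractions = anomaly_vote_fraction(models, X_new)
is_anomaly = fractions >= 0.5  # same threshold as above

print(fractions, is_anomaly)
```

Keeping the vote fraction separate from the threshold makes it easy to try stricter or looser cut-offs without refitting anything.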


## Evaluation
Evaluating one-class classification is tricky because the model must detect anomalies it has never seen while being trained only on normal examples.

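However, a paper by Lee & Liu (2003) suggests that the following metric may be suitable for these situations:

R² / Pr(f(X) = 1)

where R = TP / (TP + FN) is the recall and Pr(f(X) = 1) is the probability that the model f predicts the positive class. Both quantities can be estimated without labelled examples of the other class: the recall from held-out positive examples, and the denominator from the model's predictions on unlabelled data.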
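As a rough sketch of how this metric could be computed, the snippet below takes the denominator to be the fraction of samples that the model labels positive, and treats the labelled class (normal, +1) as the positive class. Both readings are assumptions on my part, as are the function name `lee_liu_score` and the made-up prediction vectors.

```python
import numpy as np


def lee_liu_score(preds_on_known_normal, preds_on_unlabeled):
    """Estimate recall**2 / Pr(prediction = +1), with +1 (normal) as positive."""
    recall = np.mean(np.asarray(preds_on_known_normal) == 1)
    frac_predicted_positive = np.mean(np.asarray(preds_on_unlabeled) == 1)
    if frac_predicted_positive == 0:
        return 0.0  # nothing predicted positive, so recall is 0 as well
    return recall ** 2 / frac_predicted_positive


# Hypothetical predictions: 9 of 10 held-out normal rows are accepted,
# and 6 of 10 unlabelled rows are accepted.
known_normal = [1, 1, 1, 1, 1, 1, 1, 1, 1, -1]
unlabeled = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]

score = lee_liu_score(known_normal, unlabeled)
print(round(score, 3))  # 0.9**2 / 0.6 = 1.35
```

A higher score means the model keeps accepting known-normal rows while accepting fewer rows overall, so the value is mainly useful for comparing models on the same data.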


Another approach is to produce pseudo-anomalies by adding noise to the training dataset. However, choosing realistic noise requires a technical understanding of the data.
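A minimal sketch of the pseudo-anomaly idea on synthetic data follows. It is not a recipe for the MIL-STD-1553 features: the Gaussian stand-in data, the noise factor of 10 and the choice of Isolation Forest as the detector are all arbitrary assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for the normal-only training matrix.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

# Pseudo-anomalies: copies of normal rows perturbed by large Gaussian noise.
# The factor 10 is arbitrary; realistic values need domain knowledge.
noise_scale = 10.0 * X_normal.std(axis=0)
X_pseudo = X_normal[:50] + rng.normal(size=(50, 4)) * noise_scale

detector = IsolationForest(contamination=0.01, random_state=0).fit(X_normal)

false_alarm_rate = np.mean(detector.predict(X_normal) == -1)
detection_rate = np.mean(detector.predict(X_pseudo) == -1)

print(f"false alarms on normal data: {false_alarm_rate:.3f}")
print(f"pseudo-anomalies detected:   {detection_rate:.3f}")
```

For mixed features like the ones in this notebook, noise would have to respect each column's type, for example by swapping categories or flipping bits in the payload instead of adding Gaussian values.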

### Module and Tutorial

I have turned the code and the workflow into a standalone Python module, `AnomalyMIL1553.py`, which can be imported directly into the workspace to build a model, fit the data and predict the labels.



In [21]:
from AnomalyMIL1553 import AnomalyMIL1553
from collections import Counter

mil1553 = AnomalyMIL1553(training_file = 'data/data.csv')
mil1553.model()
mil1553.fit()
preds = mil1553.predict()

print(Counter(preds))


2020-12-21 13:02:27.068734 loading data/data.csv ...
2020-12-21 13:02:27.088406 raw data shape: (9886, 6)
2020-12-21 13:02:27.100564 applying OneHotEncoder ...
2020-12-21 13:02:27.149725 applying TfidfVectorizer ...
2020-12-21 13:02:27.733547 applying PCA ...
2020-12-21 13:04:29.392703 concatenating ...
2020-12-21 13:04:29.465219 data loaded.
2020-12-21 13:04:29.667317 the model is being created ...
2020-12-21 13:04:29.675978 the model is ready.
2020-12-21 13:04:29.676288 the model is being fitted ...




2020-12-21 13:04:33.195438 the model is fitted.
2020-12-21 13:04:33.195808 the prediction is running...
2020-12-21 13:04:34.600426 the prediction is ready.
Counter({0.0: 9414, 0.16666666666666666: 422, 0.3333333333333333: 49, 0.5: 1})


### Predict for New Data

After fitting the model on the original training data, we can apply it to any new data with the same structure.

The following commands demonstrate this:

In [22]:
test_data = mil1553.load_data(data_file='data/data_test.csv', initialize= False)
mil1553.predict(test_data)

2020-12-21 13:04:34.620517 loading data/data_test.csv ...
2020-12-21 13:04:34.648075 raw data shape: (16, 6)
2020-12-21 13:04:34.661706 applying OneHotEncoder ...
2020-12-21 13:04:34.683137 applying TfidfVectorizer ...
2020-12-21 13:04:34.704352 applying PCA ...
2020-12-21 13:04:35.068360 concatenating ...
2020-12-21 13:04:35.072003 data loaded.
2020-12-21 13:04:35.072965 the prediction is running...
2020-12-21 13:04:35.164667 the prediction is ready.


array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

### References

All code, notebooks, technical documents and scientific articles can be found in the GitHub repository:
https://github.com/bnasr/MIL-STD-1553