# NSL-KDD Network anomaly detection

## Introduction
The task at hand is to recognize network attacks using data from the NSL-KDD dataset.
The given training dataset does not contain labels, so we will first attempt to detect anomalies
in the training data and label them as attacks. We can then use the labeled training
dataset to build a classifier, which we will evaluate against the test dataset, which itself contains labels.

We begin with the necessary imports and a random seed for reproducible results.

In [1]:
import pandas as pd

from sklearn import metrics
from sklearn.cluster import KMeans, DBSCAN
from sklearn.ensemble import IsolationForest, RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

RANDOM_SEED = 42

## Data Exploration

The first step is some basic exploration to recognize patterns and possible problems in our data.

In [2]:
train_df = pd.read_csv("../input/NSL-KDDTrain.csv")
test_df = pd.read_csv("../input/NSL-KDDTest.csv")

In [3]:
train_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,150,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0
1,0,udp,other,SF,146,0,0,0,0,0,...,255,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0
2,0,tcp,private,S0,0,0,0,0,0,0,...,255,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0
3,0,tcp,http,SF,232,8153,0,0,0,0,...,30,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
test_df.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,target
0,0,tcp,private,REJ,0,0,0,0,0,0,...,10,0.04,0.06,0.0,0.0,0.0,0.0,1.0,1.0,attack
1,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0,attack
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0,normal
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0,attack
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71,attack


Our dataset contains 41 features, plus the target in the test dataset. At first glance, no feature
can be dismissed as not useful in the context of classifying a network attack. Our features can be roughly divided
into 4 categories:
- Categorical features that will have to be encoded, such as protocol_type, service and flag
- Numerical counting features like src_bytes and dst_bytes. Since these express different characteristics, their
value ranges differ widely, so a normalization step will be required
- Rate features that express fractions. These can be easily identified by the "_rate" suffix and
are already in the [0, 1] range
- Binary features such as logged_in or is_host_login
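
As a rough illustration of the four categories above, the following sketch groups the columns of a small toy frame. The helper `group_columns` and the toy values are assumptions made for this example and are not part of the notebook. Note that the binary check is a heuristic: a counting feature that happens to take only the values 0 and 1 would also be reported as binary.

```python
import pandas as pd

# Toy frame with a handful of NSL-KDD column names (values are made up).
toy_df = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp"],
    "src_bytes": [491, 146, 0],
    "serror_rate": [0.0, 0.0, 1.0],
    "logged_in": [0, 0, 1],
})

def group_columns(df):
    """Split columns into categorical, rate, binary and count groups (hypothetical helper)."""
    categorical = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    rate = [c for c in df.columns if c.endswith("_rate")]
    numeric = [c for c in df.columns if c not in categorical and c not in rate]
    # Heuristic: a numeric column whose only values are 0 and 1 is treated as binary.
    binary = [c for c in numeric if set(df[c].unique()) <= {0, 1}]
    count = [c for c in numeric if c not in binary]
    return {"categorical": categorical, "rate": rate, "binary": binary, "count": count}

groups = group_columns(toy_df)
print(groups)
```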



In [5]:
train_df.isnull().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h

In [6]:
test_df.isnull().sum()

duration                       0
protocol_type                  0
service                        0
flag                           0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
num_outbound_cmds              0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_h

Our dataset seems clean without any missing values.


In [7]:
train_df.info()
train_df.nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125973 entries, 0 to 125972
Data columns (total 41 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   duration                     125973 non-null  int64  
 1   protocol_type                125973 non-null  object 
 2   service                      125973 non-null  object 
 3   flag                         125973 non-null  object 
 4   src_bytes                    125973 non-null  int64  
 5   dst_bytes                    125973 non-null  int64  
 6   land                         125973 non-null  int64  
 7   wrong_fragment               125973 non-null  int64  
 8   urgent                       125973 non-null  int64  
 9   hot                          125973 non-null  int64  
 10  num_failed_logins            125973 non-null  int64  
 11  logged_in                    125973 non-null  int64  
 12  num_compromised              125973 non-null  int64  
 13 

duration                       2981
protocol_type                     3
service                          70
flag                             11
src_bytes                      3341
dst_bytes                      9326
land                              2
wrong_fragment                    3
urgent                            4
hot                              28
num_failed_logins                 6
logged_in                         2
num_compromised                  88
root_shell                        2
su_attempted                      3
num_root                         82
num_file_creations               35
num_shells                        3
num_access_files                 10
num_outbound_cmds                 1
is_host_login                     2
is_guest_login                    2
count                           512
srv_count                       509
serror_rate                      89
srv_serror_rate                  86
rerror_rate                      82
srv_rerror_rate             

In [8]:
test_df.info()
test_df.nunique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22544 entries, 0 to 22543
Data columns (total 42 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   duration                     22544 non-null  int64  
 1   protocol_type                22544 non-null  object 
 2   service                      22544 non-null  object 
 3   flag                         22544 non-null  object 
 4   src_bytes                    22544 non-null  int64  
 5   dst_bytes                    22544 non-null  int64  
 6   land                         22544 non-null  int64  
 7   wrong_fragment               22544 non-null  int64  
 8   urgent                       22544 non-null  int64  
 9   hot                          22544 non-null  int64  
 10  num_failed_logins            22544 non-null  int64  
 11  logged_in                    22544 non-null  int64  
 12  num_compromised              22544 non-null  int64  
 13  root_shell      

duration                        624
protocol_type                     3
service                          64
flag                             11
src_bytes                      1149
dst_bytes                      3650
land                              2
wrong_fragment                    3
urgent                            4
hot                              16
num_failed_logins                 5
logged_in                         2
num_compromised                  23
root_shell                        2
su_attempted                      3
num_root                         20
num_file_creations                9
num_shells                        4
num_access_files                  5
num_outbound_cmds                 1
is_host_login                     2
is_guest_login                    2
count                           495
srv_count                       457
serror_rate                      88
srv_serror_rate                  82
rerror_rate                      90
srv_rerror_rate             

It can be observed that the *service* feature takes a different number of distinct values in the train and test
datasets (70 vs 64), which will need to be reconciled after encoding. Another interesting point is that features
such as *src_bytes* or *dst_bytes* vary significantly in range and will almost certainly need to be scaled.
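
A quick way to see which categorical values cause such a mismatch is a set difference. The sketch below uses two small toy frames with made-up values; `train_toy` and `test_toy` are illustrative names, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
train_toy = pd.DataFrame({"service": ["http", "ftp_data", "private", "aol"]})
test_toy = pd.DataFrame({"service": ["http", "private", "telnet"]})

# Values that appear in only one of the two frames.
only_in_train = sorted(set(train_toy["service"]) - set(test_toy["service"]))
only_in_test = sorted(set(test_toy["service"]) - set(train_toy["service"]))
print(only_in_train, only_in_test)
```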


## Preprocessing

We will now perform some basic data preprocessing, in accordance with the issues our data exploration made apparent.

The first step is to one-hot encode our categorical features to transform them into numerical ones. Additionally, on the test
dataset we will transform our target into a binary feature, denoting whether a row is an attack or not.

In [9]:
categorical_labels = ["protocol_type", "service", "flag"]
train_df = pd.get_dummies(train_df, columns=categorical_labels)
test_df = pd.get_dummies(test_df, columns=categorical_labels)

train_df.head()

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0,491,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,146,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,0,232,8153,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
4,0,199,420,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0


In [10]:
test_df['target'] = test_df['target'].apply(lambda row: 0 if row == 'normal' else 1)
test_df.head()

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,2,12983,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,1,0,15,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


We can see a feature mismatch: 122 features vs 117. This is due to the additional values that our categorical features
take in the train dataset, which were transformed into extra one-hot features. We will remedy this later by keeping only the features common to both datasets.
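
For reference, a common alternative to dropping the mismatched columns is to align the encoded test frame onto the train columns. This is not what the notebook does; it is a minimal sketch on toy frames with made-up values, using `DataFrame.reindex`.

```python
import pandas as pd

# Toy stand-ins for the raw train and test frames (values are made up).
train_toy = pd.DataFrame({"service": ["http", "ftp_data", "private"], "src_bytes": [232, 491, 0]})
test_toy = pd.DataFrame({"service": ["http", "telnet"], "src_bytes": [20, 0]})

train_enc = pd.get_dummies(train_toy, columns=["service"])
test_enc = pd.get_dummies(test_toy, columns=["service"])

# Reindex the test frame onto the train columns: categories missing from the
# test set become all-zero columns, and categories unseen in training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

The trade-off is that rows with a category unseen during training lose that information, whereas keeping only common features discards the train-only categories instead.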

Another necessary step is some basic feature selection to remove uninformative features. Using a variance threshold,
we will remove any feature whose variance does not exceed 0.01. Note that the threshold is applied to the raw, unscaled features.

In [11]:
selector = VarianceThreshold(threshold=0.01)
selector.fit(train_df)
train_df = train_df.loc[:, selector.get_support()]
train_df.shape

(125973, 47)
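
To make the effect of the 0.01 threshold on binary columns concrete: a 0/1 column that equals 1 in a fraction p of the rows has variance p(1 - p), so it is removed once p falls to roughly 1% or below. The toy frame and its column names below are assumptions made for this sketch.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

n_rows = 1000
toy_df = pd.DataFrame({
    "flag_common": [1] * 300 + [0] * 700,   # p = 0.3   -> variance 0.21
    "flag_rare": [1] * 5 + [0] * 995,       # p = 0.005 -> variance ~0.005
    "constant": [0] * n_rows,               # variance 0
})

selector = VarianceThreshold(threshold=0.01)
selector.fit(toy_df)

# Only columns whose variance exceeds the threshold are kept.
kept = toy_df.columns[selector.get_support()].tolist()
print(kept)
```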

In [12]:

selector = VarianceThreshold(threshold=0.01)
target = test_df['target']
test_df.drop(['target'], axis=1, inplace=True)
selector.fit(test_df)
test_df = test_df.loc[:, selector.get_support()]
test_df['target'] = target.values
test_df.head()

Unnamed: 0,duration,src_bytes,dst_bytes,wrong_fragment,hot,num_failed_logins,logged_in,num_compromised,num_root,num_file_creations,...,service_private,service_smtp,service_telnet,flag_REJ,flag_RSTO,flag_RSTR,flag_S0,flag_S3,flag_SF,target
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,1
2,2,12983,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,0,20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,1
4,1,0,15,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,1


The number of features has been significantly reduced. Since the feature selection was fitted separately on each dataset,
there is now a feature mismatch between the train and test datasets. We will keep only their common features moving forward, except for *target* of course.

In [13]:
train_test_diff = list(set(train_df.columns).symmetric_difference(set(test_df.columns)))
train_test_diff.remove("target")
train_df.drop(train_test_diff, axis=1, errors='ignore', inplace=True)
test_df.drop(train_test_diff, axis=1, errors='ignore', inplace=True)

train_test_diff = list(set(train_df.columns).symmetric_difference(set(test_df.columns)))
train_test_diff

['target']

Our final preprocessing step is to normalize our features, as already noted. Since there are many rate features,
it makes sense to scale all features to the [0, 1] range to achieve uniformity in values.

In [14]:
scaler = MinMaxScaler(feature_range=(0,1))
scaled_features = scaler.fit_transform(train_df.values)
scaled_train_df = pd.DataFrame(scaled_features, index=train_df.index, columns=train_df.columns)

scaled_features = scaler.fit_transform(test_df.values)
scaled_test_df = pd.DataFrame(scaled_features, index=test_df.index, columns=test_df.columns)
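
The cell above fits the scaler a second time on the test data, so the same raw value can map to different scaled values in the two datasets. A frequently used alternative is to fit on the training data only and reuse those statistics for the test data. The sketch below shows that variant on toy frames with made-up values; it assumes both frames hold the same feature columns, so a target column would have to be separated first.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins with identical feature columns (values are made up).
train_toy = pd.DataFrame({"src_bytes": [0, 100, 200], "dst_bytes": [0, 50, 1000]})
test_toy = pd.DataFrame({"src_bytes": [50, 400], "dst_bytes": [500, 0]})

scaler = MinMaxScaler(feature_range=(0, 1))
# Learn min and max from the training data only.
scaled_train = pd.DataFrame(scaler.fit_transform(train_toy), columns=train_toy.columns)
# Apply the same min and max to the test data; values may fall outside [0, 1].
scaled_test = pd.DataFrame(scaler.transform(test_toy), columns=test_toy.columns)
print(scaled_test)
```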

## Clustering

The next step is to identify anomalies in the training dataset using various clustering and outlier detection algorithms.
To compare the candidate algorithms, the first approach was to use the silhouette score
to evaluate each clustering. Silhouette is a measure of cluster cohesion and separation that ranges from -1 to 1;
higher values indicate a more successful clustering.

As will also be discussed though, the internal evaluation of unsupervised anomaly detection is
a rather difficult task, in which evaluation measures like silhouette can be deceiving due to the
small number of out-of-cluster data points.
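
For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the points of its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes the score for two different labelings of the same synthetic data, so that the two numbers can be compared. The data and both labelings are assumptions made for illustration, and no particular outcome is claimed.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# One dense "normal" blob plus a few scattered points standing in for attacks.
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
scattered = rng.uniform(low=-15.0, high=15.0, size=(10, 2))
X = np.vstack([normal, scattered])

# Labeling A: the outlier-style split we are after (0 = normal, 1 = scattered).
labels_outliers = np.array([0] * 300 + [1] * 10)
# Labeling B: an arbitrary split of all points along the first axis.
labels_split = (X[:, 0] > 0).astype(int)

score_outliers = silhouette_score(X, labels_outliers)
score_split = silhouette_score(X, labels_split)
print(score_outliers, score_split)
```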

### Kmeans

Our first clustering attempt is the all-purpose k-means. We already know that we want
to split our data into 2 clusters, attack or not, so we set k = 2. While the algorithm achieves a respectable
silhouette score, it is a false assumption that the "attack" data points would lie close together.
On the contrary, we expect attack data points to be far apart, so they will not form a single
cluster under the k-means approach.

In [15]:
kmeans = KMeans(n_clusters=2, random_state=RANDOM_SEED)
kmeans.fit(scaled_train_df)
silhouette_score = metrics.silhouette_score(scaled_train_df, kmeans.labels_)

print(f"KMeans Silhouette {silhouette_score}")

KMeans Silhouette 0.4320329343545968


### DBSCAN

The next attempt uses DBSCAN with a modest epsilon distance. The goal is to create a single
cluster for all the normal data points and to recognize all attack data points as outliers (noise) with respect to this
cluster.

In [16]:
dbscan = DBSCAN(eps=0.147, min_samples=5, metric='euclidean', n_jobs=-1)
dbscan.fit(scaled_train_df)

silhouette_score = metrics.silhouette_score(scaled_train_df, dbscan.labels_)

print(f"DBSCAN Silhouette {silhouette_score}")

DBSCAN Silhouette 0.09976925649345238
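
DBSCAN marks points that do not belong to any cluster with the label -1, which is what makes it usable as an outlier detector. The following sketch shows this on synthetic data; the data and the parameter values are assumptions chosen for the toy example, not the notebook's settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# A tight blob of "normal" points and one far-away point.
dense = rng.normal(loc=0.0, scale=0.05, size=(50, 2))
far_point = np.array([[5.0, 5.0]])
X = np.vstack([dense, far_point])

labels = DBSCAN(eps=0.5, min_samples=5).fit(X).labels_

# Noise points carry the label -1; map them to 1 (outlier), everything else to 0.
is_outlier = (labels == -1).astype(int)
print(is_outlier.sum())
```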


### Isolation Forest

Isolation forests recursively partition the data with random splits (a random feature and a random split value)
until each instance is isolated. The number of splits needed to isolate an observation is a measure of the
observation's normality: the faster an instance is isolated, the more likely it is to be an anomaly.
Isolation forests generally perform well in anomaly detection.

In [17]:
isolation = IsolationForest(random_state=RANDOM_SEED, n_jobs=-1)
isolation_labels = isolation.fit_predict(scaled_train_df)

silhouette_score = metrics.silhouette_score(scaled_train_df, isolation_labels)
print(f"Isolation Forest Silhouette {silhouette_score}")

Isolation Forest Silhouette 0.19238295189697238


### Local Outlier Factor

Local Outlier Factor attempts to identify outliers with respect to their local neighbourhood rather
than the global clusters. It can yield better results where more than one "normal" cluster is present,
rather than a single dense, global cluster.

In [18]:
lof = LocalOutlierFactor(n_jobs=-1)
lof_labels = lof.fit_predict(scaled_train_df)

silhouette_score = metrics.silhouette_score(scaled_train_df, lof_labels)
print(f"Local Outlier Factor Silhouette {silhouette_score}")

Local Outlier Factor Silhouette -0.035818439745056455


As already noted, the silhouette score failed to accurately evaluate the clustering attempts for the
subsequent classification task. As a result, classification experiments were performed with the labels from each of
the clustering attempts. Isolation forest was found to perform the best, and was thus selected.
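
The selection procedure described above can be sketched as a loop: each detector produces pseudo-labels for the training data, a classifier is trained on them, and the result is scored on labeled test data. The code below is a minimal sketch on synthetic data, not the experiment that was actually run; `make_toy`, the two detectors chosen and their parameters are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_toy(n_normal, n_attack):
    """Synthetic stand-in data: a dense normal blob plus scattered attack points."""
    normal = rng.normal(size=(n_normal, 2))
    attack = rng.uniform(-8, 8, size=(n_attack, 2))
    X = np.vstack([normal, attack])
    y = np.array([0] * n_normal + [1] * n_attack)
    return X, y

x_train, _ = make_toy(500, 50)       # training labels are deliberately ignored
x_test, y_test = make_toy(200, 20)

# Each detector yields pseudo-labels: 1 = suspected attack, 0 = normal.
pseudo_labels = {
    "dbscan": (DBSCAN(eps=0.5, min_samples=5).fit(x_train).labels_ == -1).astype(int),
    "isolation_forest": (IsolationForest(random_state=0).fit_predict(x_train) == -1).astype(int),
}

results = {}
for name, y_pseudo in pseudo_labels.items():
    # Train on pseudo-labels, evaluate against the true test labels.
    clf = LogisticRegression(max_iter=1000).fit(x_train, y_pseudo)
    results[name] = accuracy_score(y_test, clf.predict(x_test))

print(results)
```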

## Anomaly Prediction
We have created target labels for the train dataset from our clustering attempts. Three different
classifiers are used to perform the classification task: a simple baseline logistic regression, a random forest classifier
and a tree boosting method. The accuracy improves with ensemble methods like
Random Forests and reaches roughly 63% with the XGBoost algorithm.

In [19]:
# IsolationForest returns -1 for outliers and 1 for inliers; map them to 1 (attack) and 0 (normal)
anomalies = pd.Series(isolation_labels).replace([-1,1],[1,0])
scaled_train_df['target'] = anomalies
scaled_train_df.head()

Unnamed: 0,duration,src_bytes,dst_bytes,wrong_fragment,hot,logged_in,num_compromised,num_root,num_file_creations,count,...,service_other,service_private,service_smtp,service_telnet,flag_REJ,flag_RSTO,flag_RSTR,flag_S0,flag_SF,target
0,0.0,3.558064e-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.003914,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
1,0.0,1.057999e-07,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02544,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.240705,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0
3,0.0,1.681203e-07,6.223962e-06,0.0,0.0,1.0,0.0,0.0,0.0,0.009785,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
4,0.0,1.442067e-07,3.20626e-07,0.0,0.0,1.0,0.0,0.0,0.0,0.058708,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


In [20]:
y_train = scaled_train_df['target']
x_train = scaled_train_df.drop(['target'], axis=1)

y_test = scaled_test_df['target']
x_test = scaled_test_df.drop(['target'], axis=1)

### Logistic Regression


In [21]:
log_reg = LogisticRegression(max_iter=10000, n_jobs=-1)

log_reg.fit(x_train, y_train)
y_predicted = log_reg.predict(x_test)

log_reg_acc = metrics.accuracy_score(y_test, y_predicted)
print(f"Logistic Regression Accuracy: {log_reg_acc}")


Logistic Regression Accuracy: 0.5644073811213627


### Random Forest

In [22]:
rf = RandomForestClassifier(criterion='gini', n_estimators=100, n_jobs=-1)

rf.fit(x_train, y_train)
y_predicted = rf.predict(x_test)

rf_acc = metrics.accuracy_score(y_test, y_predicted)
print(f"Random Forest Accuracy: {rf_acc}")


Random Forest Accuracy: 0.5885379701916252


### XGBoost

In [23]:
xgb = XGBClassifier(
        scale_pos_weight=100,
        eta=0.01,
        max_depth=10,
        min_child_weight=5,
        objective="binary:logistic",
        eval_metric="auc"
    )

xgb.fit(x_train, y_train)
y_predicted = xgb.predict(x_test)

xgb_acc = metrics.accuracy_score(y_test, y_predicted)
print(f"XGBoost Accuracy: {xgb_acc}")



XGBoost Accuracy: 0.629347054648687


## Conclusions
The dataset seems fairly extensive, so it seems that a classifier trained on hand-labeled data could
perform exceptionally well. Consequently, the unsupervised anomaly detection seems to be the critical part of the task at
hand, with the choice of classification method being secondary.

Some steps that could be improved when revisiting the matter in the future:
- The data points could have been reduced with PCA to 2 dimensions and plotted in a scatter plot. This would probably
provide better insight into which clustering performs best, provided that the PCA projection represents the data points faithfully
- The variance threshold feature selection could perhaps be replaced with a more suitable method. One-hot encoded features
may very well display very low variance; this does not mean they should be dropped
- In general, some additional data and feature plots could help better illustrate the task at hand
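
The PCA idea from the list above could be sketched as follows. The random matrix `X` and the random `labels` are stand-ins for the scaled features and the cluster labels, assumed only for this example. The explained variance ratio gives a rough indication of how faithfully two components represent the data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # stand-in for the scaled feature matrix
labels = rng.integers(0, 2, size=200)    # stand-in for cluster / anomaly labels

pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)

# Fraction of the total variance retained by the two components.
explained = pca.explained_variance_ratio_.sum()
print(X_2d.shape, round(float(explained), 3))

# X_2d[:, 0] and X_2d[:, 1] can then be drawn as a scatter plot, colored by `labels`.
```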