# Flow Classification Task
This notebook walks through our solution to the Bonus Task of Lab 3. The process is outlined as follows:

1. Preprocessing
2. Splitting the Training and Test Set
3. Fitting a Random Forest Classifier to the Training Set
4. Packet-Level Evaluation 
5. Host-Level Evaluation

In [1]:
# import necessary modules
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score, recall_score, precision_score, auc, precision_recall_curve, accuracy_score, f1_score, confusion_matrix, average_precision_score



## 1. Preprocessing
We start with the same preprocessing as in the other tasks: renaming the columns, converting the "Start" feature to datetime, and splitting the IP address and port out of the Source IP and Destination IP fields.

In [2]:
# load the NetFlow file and rename the columns
# sep=r'\s+' is equivalent to delim_whitespace=True, which newer pandas versions deprecate
dataset = pd.read_csv('dataset/capture-scenario10.pcap.netflow.labeled', sep=r'\s+', skiprows=1, header=None)
dataset.columns = ["Date", "Start", "Duration", "Protocol", "Source_IP", "->", "Destination_IP", "Flags", "Tos", "Packets",
                   "Bytes", "Flows", "Label"]

# convert to datetime
dataset['Start'] = dataset['Date'] + ' ' + dataset['Start']
dataset['Start'] = pd.to_datetime(dataset['Start'])

# split port information
dataset['Source_Port'] = dataset['Source_IP'].apply(lambda x: x.split(":")[1] if len(x.split(":")) > 1 else None)
dataset['Source_IP'] = dataset['Source_IP'].apply(lambda x: x.split(":")[0])
dataset['Destination_Port'] = dataset['Destination_IP'].apply(lambda x: x.split(":")[1] if len(x.split(":")) > 1 else None)
dataset['Destination_IP'] = dataset['Destination_IP'].apply(lambda x: x.split(":")[0])

dataset.head()

| | Date | Start | Duration | Protocol | Source_IP | -> | Destination_IP | Flags | Tos | Packets | Bytes | Flows | Label | Source_Port | Destination_Port |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-08-18 | 2011-08-18 10:19:13.328 | 0.002 | TCP | 147.32.86.166 | -> | 212.24.150.110 | FRPA_ | 0 | 4 | 321 | 1 | Background | 33426 | 25443 |
| 1 | 2011-08-18 | 2011-08-18 10:19:13.328 | 4.995 | UDP | 82.39.2.249 | -> | 147.32.84.59 | INT | 0 | 617 | 40095 | 1 | Background | 41915 | 43087 |
| 2 | 2011-08-18 | 2011-08-18 10:19:13.329 | 4.996 | UDP | 147.32.84.59 | -> | 82.39.2.249 | INT | 0 | 1290 | 1909200 | 1 | Background | 43087 | 41915 |
| 3 | 2011-08-18 | 2011-08-18 10:19:13.330 | 0.0 | TCP | 147.32.86.166 | -> | 147.32.192.34 | A_ | 0 | 1 | 66 | 1 | Background | 42020 | 993 |
| 4 | 2011-08-18 | 2011-08-18 10:19:13.330 | 0.0 | TCP | 212.24.150.110 | -> | 147.32.86.166 | FPA_ | 0 | 2 | 169 | 1 | Background | 25443 | 33426 |
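
The port-splitting lambdas in the cell above cut each `IP:port` string at the colon. As a minimal sketch of the same logic on plain strings (the helper name `split_endpoint` is hypothetical and not part of the notebook), the following shows what happens for entries with and without a port:

```python
def split_endpoint(endpoint):
    """Split an 'IP:port' string into (ip, port); port is None when absent."""
    parts = endpoint.split(":")
    ip = parts[0]
    port = parts[1] if len(parts) > 1 else None
    return ip, port


print(split_endpoint("147.32.86.166:33426"))  # ('147.32.86.166', '33426')
print(split_endpoint("147.32.84.59"))         # ('147.32.84.59', None)
```

Note that the port stays a string, and that an address which itself contains colons would be cut at the first one. Whether such addresses occur in this capture is not checked here.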


To make the classification binary, we remove the rows labeled "Background", so that only "Botnet" and "LEGITIMATE" rows remain.

In [3]:
# remove background flows; copy so that later column assignments do not act on a slice of the original frame
dataset = dataset[dataset['Label'] != 'Background'].copy()

Then, we extract the following features: Duration, Protocol, Flags, Packets, and Bytes. As Protocol and Flags are categorical, we encode each of them as integers using sklearn's LabelEncoder.

In [4]:
# encode discrete features
discrete_features = ['Protocol', 'Flags']
dataset[discrete_features] = dataset[discrete_features].apply(LabelEncoder().fit_transform)

# feature to be extracted
features = ['Duration', 'Protocol', 'Flags', 'Packets', 'Bytes']

# extracting features and separating the labels
X = dataset[features]
y = dataset['Label'].values
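
To illustrate what the encoding step does, here is a minimal pure-Python sketch that maps each distinct value to an integer by its position in the sorted list of distinct values, which is the order `LabelEncoder` uses for its classes. The helper `encode_labels` and the toy protocol list are assumptions for illustration only; the notebook itself relies on `LabelEncoder`.

```python
def encode_labels(values):
    """Map each distinct value to an integer code, using sorted order of the distinct values."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Toy input, not taken from the dataset
codes, mapping = encode_labels(["TCP", "UDP", "TCP", "ICMP"])
print(mapping)  # {'ICMP': 0, 'TCP': 1, 'UDP': 2}
print(codes)    # [1, 2, 1, 0]
```

The codes carry no numeric meaning; they only give each category a distinct value that the classifier can split on.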

## 2. Split into training set and test set
We split the data into a training set and a test set with a 60:40 ratio. As the resulting label distribution is balanced in both sets, we did not apply any balancing technique.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=42)
print("Training Data")
print("Botnet: ", len([label for label in y_train if label == "Botnet"]))
print("Legitimate: ", len([label for label in y_train if label == "LEGITIMATE"]))
    
print("Test Data")
print("Botnet: ", len([label for label in y_test if label == "Botnet"]))
print("Legitimate: ", len([label for label in y_test if label == "LEGITIMATE"]))

Training Data
Botnet:  194065
Legitimate:  193149
Test Data
Botnet:  129376
Legitimate:  128768
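
As a rough check of the balance claim, the class fractions can be computed directly. The sketch below uses a hypothetical helper `class_fractions` on toy labels and then applies the same arithmetic to the counts printed above.

```python
from collections import Counter


def class_fractions(labels):
    """Return the fraction of samples per class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy labels for illustration; in the notebook this would be y_train or y_test
print(class_fractions(["Botnet", "LEGITIMATE", "Botnet", "LEGITIMATE"]))

# Same arithmetic applied to the counts printed above
train_botnet = 194065 / (194065 + 193149)
test_botnet = 129376 / (129376 + 128768)
print(round(train_botnet, 4), round(test_botnet, 4))  # 0.5012 0.5012
```

Both sets contain roughly 50% Botnet rows, which is consistent with skipping a balancing step.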


## 3. Fitting a RandomForestClassifier
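We fit a Random Forest with 10 trees and no maximum depth for each tree onto the training set. These were the default parameters in the scikit-learn version used for this notebook; from version 0.22 onward, the default number of trees is 100.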

In [6]:
# instantiate learning model
clf = RandomForestClassifier(n_estimators=10, max_depth=None) # set explicitly; the n_estimators default is 100 from scikit-learn 0.22 onward

# fitting the model
clf.fit(X_train, y_train)

# predict the response
pred = clf.predict(X_test)

## 4. Packet-Level Evaluation
We evaluate the classifier on the test set row by row. Each row is one NetFlow record, which this notebook refers to as a packet.

In [7]:
# evaluate performance
print("Random Forest")
print("Precision: ", precision_score(y_test, pred, pos_label="Botnet"))
print("Recall: ", recall_score(y_test, pred, pos_label="Botnet"))
print("F1 Score: ", f1_score(y_test, pred, pos_label="Botnet"))
print("Accuracy: ", accuracy_score(y_test, pred))

Random Forest
Precision:  0.9989165763813651
Recall:  0.9905855800148404
F1 Score:  0.9947336352664035
Accuracy:  0.9947432440808232


In [8]:
tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=['LEGITIMATE','Botnet']).ravel()
print("TP: ", tp)
print("FP: ", fp)
print("TN: ", tn)
print("FN: ", fn)

TP:  128158
FP:  139
TN:  128629
FN:  1218
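
The scores reported above follow directly from these four counts. As a minimal sketch (the helper name `binary_metrics` is hypothetical), the standard formulas reproduce them:

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, accuracy


# Counts taken from the confusion matrix printed above
precision, recall, f1, accuracy = binary_metrics(tp=128158, fp=139, tn=128629, fn=1218)
print(f"Precision: {precision:.4f}")  # Precision: 0.9989
print(f"Recall: {recall:.4f}")        # Recall: 0.9906
print(f"F1 Score: {f1:.4f}")          # F1 Score: 0.9947
print(f"Accuracy: {accuracy:.4f}")    # Accuracy: 0.9947
```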


In [9]:
# feature importance
print(features)
print(clf.feature_importances_)

['Duration', 'Protocol', 'Flags', 'Packets', 'Bytes']
[0.04276232 0.71289616 0.06268871 0.11194954 0.06970327]


## 5. Host-Level Evaluation
This time, we evaluate per host, using the known infected and normal hosts listed at https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-51/README.html#ip-addresses. The counts are defined as follows:

* TP: A True Positive is counted for a known infected host when at least one packet with that Source_IP is detected as Botnet
* TN: A True Negative is counted for a known normal host when no packet with that Source_IP is detected as Botnet
* FP: A False Positive is counted for a known normal host when at least one packet with that Source_IP is detected as Botnet
* FN: A False Negative is counted for a known infected host when no packet with that Source_IP is detected as Botnet
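
The four criteria can also be expressed with set operations. The sketch below is an illustration on toy data only: the helper `host_level_counts` and the `10.0.0.x` addresses are hypothetical and do not come from the dataset.

```python
def host_level_counts(rows, infected_hosts, normal_hosts):
    """Count host-level TP, FP, TN, FN from (source_ip, predicted_label) pairs."""
    # A host is flagged if at least one of its rows is predicted as Botnet
    flagged = {ip for ip, label in rows if label == "Botnet"}
    infected, normal = set(infected_hosts), set(normal_hosts)
    tp = len(infected & flagged)  # infected hosts flagged at least once
    fn = len(infected - flagged)  # infected hosts never flagged
    fp = len(normal & flagged)    # normal hosts flagged at least once
    tn = len(normal - flagged)    # normal hosts never flagged
    return tp, fp, tn, fn


rows = [
    ("10.0.0.1", "Botnet"),      # infected host, flagged once
    ("10.0.0.1", "LEGITIMATE"),
    ("10.0.0.2", "LEGITIMATE"),  # infected host, never flagged
    ("10.0.0.3", "Botnet"),      # normal host, flagged
    ("10.0.0.4", "LEGITIMATE"),  # normal host, never flagged
]
print(host_level_counts(rows, ["10.0.0.1", "10.0.0.2"], ["10.0.0.3", "10.0.0.4"]))  # (1, 1, 1, 1)
```

Under these definitions a single Botnet prediction is enough to flag a host, so one misclassified row from a normal host already produces a host-level false positive.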

In [10]:
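# collect predictions, true labels, and source IPs for the test rows
temp = X_test.copy()
temp["Pred_Label"] = pred
temp["True_Label"] = y_test
# the assignment aligns on the DataFrame index, so each test row gets its own Source_IP
temp["Source_IP"] = dataset["Source_IP"]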
temp.head()

| | Duration | Protocol | Flags | Packets | Bytes | Pred_Label | True_Label | Source_IP |
|---|---|---|---|---|---|---|---|---|
| 2080251 | 0.0 | 0 | 57 | 1 | 1066 | Botnet | Botnet | 147.32.84.208 |
| 4002973 | 4.996 | 2 | 18 | 500 | 468705 | Botnet | Botnet | 147.32.84.209 |
| 4805991 | 0.0 | 0 | 5 | 1 | 1066 | Botnet | Botnet | 147.32.84.205 |
| 620518 | 0.263 | 1 | 32 | 4 | 2519 | LEGITIMATE | LEGITIMATE | 147.32.84.25 |
| 1959757 | 0.0 | 0 | 57 | 1 | 1066 | Botnet | Botnet | 147.32.84.207 |


In [11]:
infected_hosts = ["147.32.84.165", "147.32.84.191", "147.32.84.192", "147.32.84.193", "147.32.84.204", "147.32.84.205", "147.32.84.206", "147.32.84.207", "147.32.84.208", "147.32.84.209"]
normal_hosts = ["147.32.84.170", "147.32.84.134", "147.32.84.164", "147.32.87.36", "147.32.80.9"]

In [12]:
TP,FP,TN,FN = 0,0,0,0

for host in infected_hosts:
    if not temp[(temp["Source_IP"] == host) & (temp["Pred_Label"] == "Botnet")].empty:
        TP = TP + 1    
    else:
        FN = FN + 1

for host in normal_hosts:
    if temp[(temp["Source_IP"] == host) & (temp["Pred_Label"] == "Botnet")].empty:
        TN = TN + 1 
    else:
        FP = FP + 1 

print("TP:",TP)
print("FP:",FP)
print("TN:",TN)
print("FN:",FN)

TP: 10
FP: 4
TN: 1
FN: 0
