## Data Collection

This dataset has 4 CSV files of the data records and each CSV file contains attack and normal records.
<table>
<tr>
<th> file name </th>
<th> file name size</th>
<th> number of records </th>
<th> number of features </th>
</tr>

<tr>
<td> UNSWNB15_1.csv </td>
<td> 165.02 MB </td>
<td> 700000 </td>
<td> 49 </td>
</tr>

<tr>
<td> UNSWNB15_2.csv </td>
<td> 161.349 MB </td>
<td> 700000 </td>
<td> 49 </td>
</tr>

<tr>
<td> UNSWNB15_3.csv </td>
<td> 150.965 MB </td>
<td> 700000 </td>
<td> 49 </td>
</tr>

<tr>
<td> UNSWNB15_4.csv </td>
<td> 95.302 MB </td>
<td> 440044 </td>
<td> 49 </td>
</tr>
</table>


## Features in the Dataset

This dataset has 49 features.
<br>
There are 3 different datatypes:
- Categorical: proto, state, service, attack_cat
- Binary: is_sm_ips_ports, is_ftp_login
- Numerical: Rest of the features

IMPORT MODUL

## ML Problem Formulation


*Binary classification of attack category*

The dataset has "label" with 0 and 1 where 0 represents non-attack and 1 represent attack. So with the features available we will try to predict a given datapoint whether it belongs to attack or non-attack category.

# UNSW-NB15: Data cleaning and preprocessing


In [5]:
import pandas as pd  # for csv files and dataframe
import pickle  # To load data int disk
import warnings
warnings.filterwarnings("ignore")
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from sklearn.metrics import accuracy_score, confusion_matrix

import pickle
import time
%matplotlib inline

<H1> Reading data

In [6]:
# Creating a empty dict, where I will save all parameters required for test data transformation

saved_dict = {}

In [7]:
# Reading datasets
dfs = []
for i in range(1,5):
    path = 'Dataset/UNSW-NB15_{}.csv'  # There are 4 input csv files
    dfs.append(pd.read_csv(path.format(i), header = None))
df = pd.concat(dfs).reset_index(drop=True)  # Concat all to a single df

In [8]:
# Load features from NUSW-NB15_features.csv
df_features = pd.read_csv('Dataset/NUSW-NB15_features.csv',encoding="ISO-8859-1")

# Apply features to the dataset
df.columns = df_features['Name'].values


In [9]:
# Making column names lower case, removing spaces
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

In [10]:
# train_test_split
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

In [11]:

df.shape


(2540047, 49)

In [12]:
df.head()

Unnamed: 0,srcip,sport,dstip,dsport,proto,state,dur,sbytes,dbytes,sttl,...,ct_ftp_cmd,ct_srv_src,ct_srv_dst,ct_dst_ltm,ct_src__ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,attack_cat,label
0,59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,...,0,3,7,1,3,1,1,1,,0
1,59.166.0.0,33661,149.171.126.9,1024,udp,CON,0.036133,528,304,31,...,0,2,4,2,3,1,1,2,,0
2,59.166.0.6,1464,149.171.126.7,53,udp,CON,0.001119,146,178,31,...,0,12,8,1,2,2,1,1,,0
3,59.166.0.5,3593,149.171.126.5,53,udp,CON,0.001209,132,164,31,...,0,6,9,1,1,1,1,1,,0
4,59.166.0.3,49664,149.171.126.0,53,udp,CON,0.001169,146,178,31,...,0,7,9,1,1,1,1,1,,0


<H1> pre-processing & Data cleaning

In [13]:
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
del df

In [14]:
# print (X_train)
# print (y_train)
# print (X_test)
# print (y_test)
# Print Null values in X_train


In [15]:
print("Total Null values in X_train before filter:", X_train.isnull().sum().sum())


Total Null values in X_train before filter: 4497854


In [16]:
# Print Null values in X_train
print(X_train.isnull().sum())

srcip                     0
sport                     0
dstip                     0
dsport                    0
proto                     0
state                     0
dur                       0
sbytes                    0
dbytes                    0
sttl                      0
dttl                      0
sloss                     0
dloss                     0
service                   0
sload                     0
dload                     0
spkts                     0
dpkts                     0
swin                      0
dwin                      0
stcpb                     0
dtcpb                     0
smeansz                   0
dmeansz                   0
trans_depth               0
res_bdy_len               0
sjit                      0
djit                      0
stime                     0
ltime                     0
sintpkt                   0
dintpkt                   0
tcprtt                    0
synack                    0
ackdat                    0
is_sm_ips_ports     

In [17]:
X_train['attack_cat'].value_counts()

attack_cat
Generic             193696
Exploits             40112
 Fuzzers             17358
DoS                  14723
 Reconnaissance      11025
 Fuzzers              4562
Analysis              2400
Backdoor              1617
Reconnaissance        1599
 Shellcode            1167
Backdoors              486
Shellcode              192
Worms                  156
Name: count, dtype: int64

In [18]:
# Fill null values in attack_cat column with 'normal'
X_train['attack_cat'] = X_train['attack_cat'].fillna('normal')
X_test['attack_cat'] = X_test['attack_cat'].fillna('normal')

# Fill null values in ct_flw_http_mthd column with 0
X_train['ct_flw_http_mthd'] = X_train['ct_flw_http_mthd'].fillna(0)
X_test['ct_flw_http_mthd'] = X_test['ct_flw_http_mthd'].fillna(0)

# Fill null values in is_ftp_login column with 0
X_train['is_ftp_login'] = X_train['is_ftp_login'].fillna(0)
X_test['is_ftp_login'] = X_test['is_ftp_login'].fillna(0)

In [19]:
# check null values in X_train
print("Total Null values in X_train after filter:", X_train.isnull().sum().sum())


Total Null values in X_train after filter: 0


In [20]:
print("X_train shape before high corr filter:", X_train.shape)

X_train shape before high corr filter: (2286042, 48)


In [21]:
on_numeric_columns = X_train.select_dtypes(exclude='number').columns
non_numeric_columns = X_test.select_dtypes(exclude='number').columns
print(f"Non-numeric columns: {non_numeric_columns}")

Non-numeric columns: Index(['srcip', 'sport', 'dstip', 'dsport', 'proto', 'state', 'service',
       'ct_ftp_cmd', 'attack_cat'],
      dtype='object')


<H1>Encode

In [22]:
X_train['dsport'] = X_train['dsport'].astype(str)
X_train['dsport'].fillna('missing', inplace=True)  # Replace NaN with a placeholder value
X_train = X_train.dropna(subset=['dsport'])
X_train['ct_ftp_cmd'] = X_train['ct_ftp_cmd'].astype(str)
X_train['srcip'] = X_train['srcip'].astype(str)

X_test['ct_ftp_cmd'] = X_test['ct_ftp_cmd'].astype(str)
X_test['srcip'] = X_test['srcip'].astype(str)
X_test['dsport'] = X_test['dsport'].astype(str)
X_test['dsport'].fillna('missing', inplace=True)  # Replace NaN with a placeholder value
X_test = X_test.dropna(subset=['dsport'])

# Assuming 'X_train' is your DataFrame
le = LabelEncoder()


# Now apply the LabelEncoder
X_train['ct_ftp_cmd'] = le.fit_transform(X_train['ct_ftp_cmd'])
X_test['ct_ftp_cmd'] = le.fit_transform(X_test['ct_ftp_cmd'])
X_train['srcip'] = le.fit_transform(X_train['srcip'])
X_test['srcip'] = le.fit_transform(X_test['srcip'])
# Convert the 'sport' column to strings
X_train['sport'] = X_train['sport'].astype(str)
X_test['sport'] = X_test['sport'].astype(str)
# Now, you can use LabelEncoder on the 'sport' column
X_train['sport'] = le.fit_transform(X_train['sport'])
X_train['dsport'] = le.fit_transform(X_train['dsport'])
X_test['sport'] = le.fit_transform(X_test['sport'])
X_test['dsport'] = le.fit_transform(X_test['dsport'])

# Repeat for other columns...

X_train['dstip'] = le.fit_transform(X_train['dstip'])
X_train['dsport'] = le.fit_transform(X_train['dsport'])
X_train['proto'] = le.fit_transform(X_train['proto'])
X_train['state'] = le.fit_transform(X_train['state'])
X_train['service'] = le.fit_transform(X_train['service'])
X_train['ct_ftp_cmd'] = le.fit_transform(X_train['ct_ftp_cmd'])
X_train['attack_cat'] = le.fit_transform(X_train['attack_cat'])
X_train['dstip'] = le.fit_transform(X_train['dstip'])

X_test['dstip'] = le.fit_transform(X_test['dstip'])
X_test['dsport'] = le.fit_transform(X_test['dsport'])
X_test['proto'] = le.fit_transform(X_test['proto'])
X_test['state'] = le.fit_transform(X_test['state'])
X_test['service'] = le.fit_transform(X_test['service'])
X_test['ct_ftp_cmd'] = le.fit_transform(X_test['ct_ftp_cmd'])
X_test['attack_cat'] = le.fit_transform(X_test['attack_cat'])



In [23]:
non_numeric_columns = X_train.select_dtypes(exclude='number').columns
print(f"Non-numeric columns: {non_numeric_columns}")
non_numeric_columns = X_test.select_dtypes(exclude='number').columns
print(f"Non-numeric columns: {non_numeric_columns}")

Non-numeric columns: Index([], dtype='object')
Non-numeric columns: Index([], dtype='object')


In [24]:
# Finding dan Remove high correlation features
corr_mat = X_train.corr(method='pearson')
columns = corr_mat.columns
for i in range(corr_mat.shape[0]):
    for j in range(i+1, corr_mat.shape[0]):
        if corr_mat.iloc[i, j] >= 0.9:
            print(f"High corr: {columns[i]:20s} {columns[j]:20s} {corr_mat.iloc[i, j]}")

            # Dropping high correlation features
            if columns[j] in X_train.columns:
                X_train = X_train.drop(columns=[columns[j]])
            
            if columns[j] in X_test.columns:
                X_test = X_test.drop(columns=[columns[j]])

High corr: sbytes               sloss                0.9544266892460995
High corr: dbytes               dloss                0.9913052585818809
High corr: dbytes               dpkts                0.9706287326945696
High corr: sttl                 ct_state_ttl         0.9059574874443663
High corr: dloss                dpkts                0.9920944071571861
High corr: swin                 dwin                 0.9972098797520872
High corr: stime                ltime                0.9999999997825213
High corr: tcprtt               synack               0.9297052338631611
High corr: tcprtt               ackdat               0.9186177806539728
High corr: ct_srv_src           ct_srv_dst           0.9567382933002961
High corr: ct_srv_src           ct_dst_src_ltm       0.942191554375676
High corr: ct_srv_dst           ct_dst_src_ltm       0.9509948411308352
High corr: ct_dst_ltm           ct_src__ltm          0.9385080233831757
High corr: ct_dst_ltm           ct_src_dport_ltm     0.9601365848

In [25]:
print("X_train shape after high corr filter:", X_train.shape)


X_train shape after high corr filter: (2286042, 35)


In [26]:
print(f"Data after scaling:\n{X_train}")

Data after scaling:
         srcip  sport  dstip  dsport  proto  state       dur  sbytes  dbytes  \
282001      37  34830      9   47340    120      2  0.001097     146     178   
336847      36  61619     26   26636    114      5  0.004232     528    8824   
1362417     28  18499     10    3845    114      5  0.777348     588     354   
218834      35  19412     24   48764    120      2  0.001681     528     304   
1203820     12    453     28   47340    120      6  0.000010     264       0   
...        ...    ...    ...     ...    ...    ...       ...     ...     ...   
2249467     14    453     29   47340    120      6  0.000009     264       0   
963395      34  25616      8   46128    114      5  0.426193    2054    2478   
2215104     31  48738     19   37974    114      5  0.667889    1058     766   
1484405     41   6632      8   14868    114      5  0.041776    3302   37162   
305711      37  31728     21   32567    114      5  0.028309    4014   57706   

         sttl  ... 

<H1> TRAINING, Save Model, And Clasification

<H5>Random Forest 

In [27]:
# Train Random Forest classifier
start_time = time.time()
random_forest_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
random_forest_clf.fit(X_train, y_train)
end_time = time.time()
print("Training time taken (Random Forest): ", end_time - start_time)

# Save the trained model to a pickle file
model_filename = 'random_forest_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(random_forest_clf, model_file)
print(f"Random Forest model saved as {model_filename}")



Training time taken (Random Forest):  93.5866277217865
Random Forest model saved as random_forest_model.pkl


Test

In [28]:
# Test and get accuracy, precision, recall
y_pred = random_forest_clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average='weighted'))
print("Recall:", metrics.recall_score(y_test, y_pred, average='weighted'))
cm = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Accuracy: 0.9981220842109407
Precision: 0.9981495053341348
Recall: 0.9981220842109407
Confusion Matrix:
[[221338    477]
 [     0  32190]]


DecisionTree

In [29]:
# Train Decision Tree classifier
start_time = time.time()
decision_tree_clf = DecisionTreeClassifier()
decision_tree_clf.fit(X_train, y_train)
end_time = time.time()
print("Training time taken (Decision Tree): ", end_time - start_time)

# Save the trained model to a pickle file
model_filename = 'decision_tree_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(decision_tree_clf, model_file)
print(f"Decision Tree model saved as {model_filename}")

# Test and get accuracy, precision, recall
y_pred = decision_tree_clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average='weighted'))
print("Recall:", metrics.recall_score(y_test, y_pred, average='weighted'))
cm = metrics.confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Training time taken (Decision Tree):  3.848851203918457
Decision Tree model saved as decision_tree_model.pkl
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
Confusion Matrix:
[[221815      0]
 [     0  32190]]


neural Network

In [33]:
le = LabelEncoder()
y_train_encoded = le.fit_transform(y_train)
y_test_encoded = le.transform(y_test)

# Create a neural network classifier
neural_network_clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)

# Measure the training time
start_time = time.time()
neural_network_clf.fit(X_train, y_train_encoded)
end_time = time.time()
print("Training time taken (Neural Network): ", end_time - start_time)

# Save the trained model to a pickle file
model_filename = 'neural_network_model.pkl'
with open(model_filename, 'wb') as model_file:
    pickle.dump(neural_network_clf, model_file)
print(f"Neural Network model saved as {model_filename}")

# Test and get accuracy, precision, recall on the test set
y_pred_encoded = neural_network_clf.predict(X_test)
y_pred = le.inverse_transform(y_pred_encoded)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("Test Precision:", precision_score(y_test, y_pred, average='weighted'))
print("Test Recall:", recall_score(y_test, y_pred, average='weighted'))

# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Training time taken (Neural Network):  140.66628170013428
Neural Network model saved as neural_network_model.pkl
Test Accuracy: 0.8617310682860574
Test Precision: 0.8990055654718714
Test Recall: 0.8617310682860574
Confusion Matrix:
[[194583  27232]
 [  7889  24301]]


<H1>LOAD MODEL

Random Forest

In [31]:
with open('random_forest_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
# Now, you can use the loaded_model for predictions
loaded_y_pred = loaded_model.predict(X_test)

Decision Tree

In [32]:
# Load the saved model
with open('decision_tree_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
# Now, you can use the loaded_model for predictions
loaded_y_pred = loaded_model.predict(X_test)

Neural Network

In [None]:
# Specify the filename of the saved model
model_filename = 'neural_network_model.pkl'

# Load the saved model
with open(model_filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)