## Experimental results

Based on the accuracy and kappa values in the tables below, the following findings can be derived from the experiment:

1. The decision tree (DT) and random forest (RF) models perform almost identically. Their accuracies agree to five decimal places at every look-back step, and their kappa scores differ by less than 0.005. This suggests that both models are robust to changes in the amount of historical data used for training.

2. The K-Nearest Neighbors (KNN) and Multi-Layer Perceptron (MLP) models score lower than DT and RF on both metrics. The accuracy gap is tiny (0.99994 for KNN and 0.99976 to 0.99985 for MLP, against 0.99999 for DT and RF), but the kappa gap is large (0.7850 for KNN and 0.13 to 0.16 for MLP, against 0.96 to 0.97). Because the training data contain only 370 normal records among roughly 2.9 million, accuracy is close to 1 for every model, and kappa is the more informative metric. The low kappa values suggest that KNN and MLP may not be well suited for cyberattack prediction or may need further optimization.

3. The kappa of the MLP model trends downward as the look-back step increases, from 0.1562 for the basic approach to 0.1273 at step 4, although the decline is not monotonic (0.1472 at step 2, 0.1538 at step 3). This suggests that the model may not learn effectively from a larger amount of historical data.

4. The performance of the KNN model remains constant across all look-back steps (kappa 0.7850, accuracy 0.99994). This suggests that the model does not benefit from additional historical data.

5. The RF model reaches the highest kappa overall (0.9711 at steps 3 and 4), suggesting that with longer look-back it is the best at capturing the agreement between predicted and actual labels. For the basic approach and steps 1 and 2, DT is marginally ahead (by 0.0003).

These results suggest that the DT and RF models are well suited for cyberattack prediction and that the amount of historical data used for training can be varied without significantly affecting their performance. For the KNN and MLP models, further optimization may be required.
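To illustrate why kappa can separate models whose accuracies are nearly identical, the following minimal sketch computes both metrics from a binary confusion matrix. The two confusion matrices are hypothetical (10,000 samples, only 10 of them normal) and are not taken from the experiment; the function name `accuracy_and_kappa` is an illustrative choice as well.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Return (accuracy, Cohen's kappa) for a binary confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # observed agreement, identical to accuracy
    # Agreement expected by chance, derived from the row and column totals
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined in this degenerate case
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical data: 9,990 attack records (positive class) and 10 normal records
acc_a, kappa_a = accuracy_and_kappa(tp=9990, fp=1, fn=0, tn=9)  # 1 normal record misclassified
acc_b, kappa_b = accuracy_and_kappa(tp=9990, fp=9, fn=0, tn=1)  # 9 normal records misclassified

print(f"A: accuracy={acc_a:.4f}, kappa={kappa_a:.4f}")
print(f"B: accuracy={acc_b:.4f}, kappa={kappa_b:.4f}")
```

Both hypothetical classifiers reach an accuracy above 0.999, yet their kappa values are roughly 0.95 and 0.18, which is the same kind of gap seen between DT/RF and MLP in the results.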

In [20]:
import pandas as pd
from IPython.display import display  # built in inside notebooks, imported explicitly for scripts

# create a dictionary to store the data
data = {
    'Model': ['DT', 'RF', 'KNN', 'MLP'],
    'Basic Approach': [0.9999904593808138, 0.9999904593808138, 0.9999373045024907, 0.9998528018754131],
    'Step 1': [0.9999890964203548, 0.9999890964203548, 0.9999373044170401, 0.9998364463053221],
    'Step 2': [0.9999890963906327, 0.9999890963906327, 0.999937304246138, 0.9998105497872434],
    'Step 3': [0.9999904593027931, 0.999991822259537, 0.9999373039897833, 0.9998200897098128],
    'Step 4': [0.9999904592507786, 0.9999918222149531, 0.9999373036479736, 0.9997573923769414],
}

# create a pandas DataFrame for accuracy table
acc_df = pd.DataFrame(data)
acc_df.set_index('Model', inplace=True)

# create a dictionary to store the kappa score data
data_kappa = {
    'Model': ['DT', 'RF', 'KNN', 'MLP'],
    'Basic Approach': [0.9665, 0.9662, 0.7850, 0.1562],
    'Step 1': [0.9615, 0.9612, 0.7850, 0.1549],
    'Step 2': [0.9615, 0.9612, 0.7850, 0.1472],
    'Step 3': [0.9665, 0.9711, 0.7850, 0.1538],
    'Step 4': [0.9665, 0.9711, 0.7850, 0.1273],
}

# create a pandas DataFrame for kappa score table
kappa_df = pd.DataFrame(data_kappa)
kappa_df.set_index('Model', inplace=True)

# format the tables
acc_df = acc_df.round(5)
kappa_df = kappa_df.round(4)

acc_df.columns = pd.MultiIndex.from_product([['Accuracy'], acc_df.columns])
kappa_df.columns = pd.MultiIndex.from_product([['Kappa Score'], kappa_df.columns])

# create a color gradient for the tables
acc_df_styled = acc_df.style.background_gradient(cmap='Greens', axis=None)
kappa_df_styled = kappa_df.style.background_gradient(cmap='Purples', axis=None)

# display the tables
print("Accuracy Table:\n")
display(acc_df_styled)
print("\nKappa Score Table:\n")
display(kappa_df_styled)


Accuracy Table:



| Model | Basic Approach | Step 1 | Step 2 | Step 3 | Step 4 |
|-------|----------------|---------|---------|---------|---------|
| DT    | 0.99999 | 0.99999 | 0.99999 | 0.99999 | 0.99999 |
| RF    | 0.99999 | 0.99999 | 0.99999 | 0.99999 | 0.99999 |
| KNN   | 0.99994 | 0.99994 | 0.99994 | 0.99994 | 0.99994 |
| MLP   | 0.99985 | 0.99984 | 0.99981 | 0.99982 | 0.99976 |



Kappa Score Table:



| Model | Basic Approach | Step 1 | Step 2 | Step 3 | Step 4 |
|-------|----------------|--------|--------|--------|--------|
| DT    | 0.9665 | 0.9615 | 0.9615 | 0.9665 | 0.9665 |
| RF    | 0.9662 | 0.9612 | 0.9612 | 0.9711 | 0.9711 |
| KNN   | 0.7850 | 0.7850 | 0.7850 | 0.7850 | 0.7850 |
| MLP   | 0.1562 | 0.1549 | 0.1472 | 0.1538 | 0.1273 |
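The kappa table can also be summarized programmatically. The sketch below is one possible way to find the best model at each step and how much each model's kappa varies across steps; it uses plain Python so that it runs without pandas, and the variable names (`kappa`, `best_per_step`, `spread`) are illustrative.

```python
# Kappa scores copied from the table above, in column order
steps = ['Basic Approach', 'Step 1', 'Step 2', 'Step 3', 'Step 4']
kappa = {
    'DT':  [0.9665, 0.9615, 0.9615, 0.9665, 0.9665],
    'RF':  [0.9662, 0.9612, 0.9612, 0.9711, 0.9711],
    'KNN': [0.7850, 0.7850, 0.7850, 0.7850, 0.7850],
    'MLP': [0.1562, 0.1549, 0.1472, 0.1538, 0.1273],
}

# Model with the highest kappa at each look-back step
best_per_step = {}
for i, step in enumerate(steps):
    best_per_step[step] = max(kappa, key=lambda model: kappa[model][i])

# Range (max minus min) of each model's kappa across all steps
spread = {model: round(max(values) - min(values), 4) for model, values in kappa.items()}

print(best_per_step)
print(spread)
```

With these values, DT has the highest kappa for the basic approach and steps 1 and 2, RF has the highest at steps 3 and 4, and KNN shows no variation at all.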


In [22]:
import pandas as pd

# Load preprocessed training and testing datasets
train_data = pd.read_csv('Training.csv')
test_data = pd.read_csv('Testing.csv')
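# Identifier and label columns that are not intended as model inputs
unwanted_features = ['pkSeqID', 'proto', 'saddr', 'sport', 'daddr', 'dport', 'category', 'subcategory']
# The drops are left commented out: later cells still use 'category' and 'subcategory'
# train_data.drop(columns=unwanted_features, inplace=True)
# test_data.drop(columns=unwanted_features, inplace=True)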
print(train_data.head())

   pkSeqID proto            saddr  sport          daddr dport     seq  \
0  3142762   udp  192.168.100.150   6551  192.168.100.3    80  251984   
1  2432264   tcp  192.168.100.150   5532  192.168.100.3    80  256724   
2  1976315   tcp  192.168.100.147  27165  192.168.100.3    80   62921   
3  1240757   udp  192.168.100.150  48719  192.168.100.3    80   99168   
4  3257991   udp  192.168.100.147  22461  192.168.100.3    80  105063   

     stddev  N_IN_Conn_P_SrcIP       min  state_number      mean  \
0  1.900363                100  0.000000             4  2.687519   
1  0.078003                 38  3.856930             3  3.934927   
2  0.268666                100  2.974100             3  3.341429   
3  1.823185                 63  0.000000             4  3.222832   
4  0.822418                100  2.979995             4  3.983222   

   N_IN_Conn_P_DstIP  drate     srate       max  attack category subcategory  
0                100    0.0  0.494549  4.031619       1     DDoS         

In [24]:
print(train_data['attack'].value_counts())

1    2934447
0        370
Name: attack, dtype: int64
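The class counts above explain why every model reaches an accuracy close to 1. As a rough sketch, the following computes the accuracy of a trivial classifier that always predicts the majority class, using the training counts printed above. This is only an approximation of a baseline: the class distribution of `Testing.csv` is not shown here and may differ.

```python
# Class counts copied from the value_counts() output above (1 = attack, 0 = normal)
class_counts = {1: 2934447, 0: 370}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = class_counts[majority_label] / total

# Number of majority-class records per minority-class record
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

print(f"total records:     {total}")
print(f"baseline accuracy: {baseline_accuracy:.6f}")
print(f"imbalance ratio:   {imbalance_ratio:.1f} : 1")
```

A baseline this high means that accuracy alone cannot distinguish a useful model from a trivial one on this data, which is why the kappa score carries most of the information in the comparison.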


In [23]:

# group the training data by 'category' and 'subcategory' and count the rows in each group
counts = train_data.groupby(['category','subcategory']).size()

# print the count of each combination
print(counts)


category        subcategory      
DDoS            HTTP                    786
                TCP                  782228
                UDP                  758301
DoS             HTTP                   1184
                TCP                  492615
                UDP                  826349
Normal          Normal                  370
Reconnaissance  OS_Fingerprint        14293
                Service_Scan          58626
Theft           Data_Exfiltration         6
                Keylogging               59
dtype: int64
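As a consistency check, the group counts can be rolled up per category and compared with the `attack` label counts. The sketch below hard-codes the numbers printed above so that it runs without the CSV files; the names `group_counts` and `per_category` are illustrative.

```python
from collections import defaultdict

# (category, subcategory) counts copied from the groupby output above
group_counts = {
    ('DDoS', 'HTTP'): 786,
    ('DDoS', 'TCP'): 782228,
    ('DDoS', 'UDP'): 758301,
    ('DoS', 'HTTP'): 1184,
    ('DoS', 'TCP'): 492615,
    ('DoS', 'UDP'): 826349,
    ('Normal', 'Normal'): 370,
    ('Reconnaissance', 'OS_Fingerprint'): 14293,
    ('Reconnaissance', 'Service_Scan'): 58626,
    ('Theft', 'Data_Exfiltration'): 6,
    ('Theft', 'Keylogging'): 59,
}

# Sum the subcategory counts within each category
per_category = defaultdict(int)
for (category, _subcategory), count in group_counts.items():
    per_category[category] += count

# Every category except 'Normal' is an attack
attack_total = sum(n for category, n in per_category.items() if category != 'Normal')

for category, n in per_category.items():
    print(f"{category:<15} {n:>9}")
print(f"attack total    {attack_total:>9}")
```

The attack categories sum to 2,934,447 and the normal category to 370, matching the `attack` value counts. The roll-up also shows that DDoS and DoS dominate the data, while Theft has only 65 records.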
