## Partitioning UNSW-NB15-Train-Basic

The dataset can be partitioned in a balanced or unbalanced way, across 3 or 5 nodes, with each attack type either spread over several nodes or confined to a single node.

### Partitions with 3 balanced nodes 

Generic attacks appear on all 3 nodes, Exploits on 2 nodes, and DoS and Reconnaissance on 1 node each.


- UNSW-NB15-Train-Basic-Part1 (145200): 
    - Normal: 72600 (50.0%)
    - Generic: 47369 (32.6%)
    - Exploits: 14728 (10.1%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 10503 (7.2%)

- UNSW-NB15-Train-Basic-Part2 (145119): 
    - Normal: 72600 (50.0%)
    - Generic: 60213 (41.5%)
    - Exploits: 0 (0.0%)
    - DoS: 12306 (8.5%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part3 (145200): 
    - Normal: 72600 (50.0%)
    - Generic: 53792 (37.0%)
    - Exploits: 18808 (13.0%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)
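The sizes and percentages listed above can be re-derived from the per-class counts. The following is a minimal sketch, not part of the original notebook: the `partitions` dictionary and the `class_shares` helper are hypothetical names, and the counts are copied by hand from the list above.

```python
# Per-class sample counts for each partition, copied from the list above.
partitions = {
    "Part1": {"Normal": 72600, "Generic": 47369, "Exploits": 14728,
              "DoS": 0, "Reconnaissance": 10503},
    "Part2": {"Normal": 72600, "Generic": 60213, "Exploits": 0,
              "DoS": 12306, "Reconnaissance": 0},
    "Part3": {"Normal": 72600, "Generic": 53792, "Exploits": 18808,
              "DoS": 0, "Reconnaissance": 0},
}


def class_shares(counts):
    """Return each class's share of one partition, as a percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


# Total size of each partition and the share of every class inside it.
totals = {name: sum(counts.values()) for name, counts in partitions.items()}
shares = {name: class_shares(counts) for name, counts in partitions.items()}

for name in partitions:
    rounded = {k: round(v, 1) for k, v in shares[name].items()}
    print(name, totals[name], rounded)
```

Because each share is rounded independently to one decimal, the rounded percentages of a partition need not add up to exactly 100.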

In [1]:
import numpy as np  # for array
import pandas as pd  # for csv files and dataframe
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # plotting
from scipy import stats

import pickle  # to save and load data on disk

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer
from sklearn.metrics import auc, f1_score, roc_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_validate, cross_val_predict

In [3]:
# Get UNSW-NB15-Train-Basic dataset 
complete = pd.read_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic.csv')

In [4]:
complete

| Unnamed: 0 | proto | srcip | sport | dstip | dsport | spkts | dpkts | sbytes | dbytes | state | stime | ltime | dur | label | attack_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | udp | 59.166.0.7 | 45584 | 149.171.126.0 | 53 | 2 | 2 | 130 | 162 | CON | 1424257612 | 1424257612 | 0.003362 | 0 | normal |
| 1 | tcp | 59.166.0.2 | 18633 | 149.171.126.3 | 8908 | 38 | 40 | 2438 | 19440 | FIN | 1424258064 | 1424258064 | 0.013975 | 0 | normal |
| 2 | tcp | 59.166.0.7 | 48428 | 149.171.126.4 | 143 | 122 | 126 | 7824 | 14814 | FIN | 1421951117 | 1421951118 | 0.757193 | 0 | normal |
| 3 | unas | 175.45.176.1 | 0 | 149.171.126.17 | 0 | 2 | 0 | 200 | 0 | INT | 1424244417 | 1424244417 | 0.000004 | 1 | dos |
| 4 | tcp | 175.45.176.1 | 65485 | 149.171.126.17 | 179 | 10 | 6 | 876 | 268 | FIN | 1421928270 | 1421928271 | 0.436004 | 0 | normal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 435514 | udp | 175.45.176.2 | 31072 | 149.171.126.19 | 500 | 2 | 0 | 136 | 0 | INT | 1424223755 | 1424223755 | 0.000001 | 1 | dos |
| 435515 | tcp | 59.166.0.5 | 2758 | 149.171.126.8 | 51824 | 44 | 46 | 2766 | 28558 | FIN | 1424261274 | 1424261274 | 0.014994 | 0 | normal |
| 435516 | udp | 175.45.176.0 | 47439 | 149.171.126.10 | 53 | 2 | 0 | 114 | 0 | INT | 1424259155 | 1424259155 | 0.000002 | 1 | generic |
| 435517 | udp | 175.45.176.1 | 47439 | 149.171.126.18 | 53 | 2 | 0 | 114 | 0 | INT | 1424238658 | 1424238658 | 0.000002 | 1 | generic |


Count the samples of each traffic type, including every attack category:

In [12]:
normal = complete[complete['label'] == 0]
normal.shape

(217800, 15)

In [14]:
generic = complete[complete['attack_cat'] == "generic"]
generic.shape

(161374, 15)

In [16]:
exploits = complete[complete['attack_cat'] == "exploits"]
exploits.shape

(33536, 15)

In [17]:
dos = complete[complete['attack_cat'] == "dos"]
dos.shape

(12306, 15)

In [18]:
reconnaissance = complete[complete['attack_cat'] == "reconnaissance"]
reconnaissance.shape

(10503, 15)
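The per-category counts above can also be obtained in a single call with `Series.value_counts()`. The sketch below runs on a small made-up DataFrame (`toy`) rather than on the real dataset, and it assumes, as the sample rows suggest, that every row with `label == 0` carries `attack_cat == "normal"`.

```python
import pandas as pd

# Toy stand-in for the `complete` DataFrame; only the two label columns matter.
toy = pd.DataFrame({
    "label": [0, 0, 1, 1, 1, 0],
    "attack_cat": ["normal", "normal", "dos", "generic", "generic", "normal"],
})

# One call returns the number of samples in every traffic category.
counts = toy["attack_cat"].value_counts()
print(counts)

# Consistency check: rows labelled as attacks should match the non-normal rows.
n_attacks = int((toy["label"] == 1).sum())
n_non_normal = int(counts.drop("normal").sum())
print(n_attacks == n_non_normal)
```

On the real data, `complete['attack_cat'].value_counts()` would be the equivalent call.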

Split each traffic type into its per-node partitions:

In [24]:
# Split the normal samples into three partitions of 72600 rows each
normal1 = normal.iloc[:72600]
normal2 = normal.iloc[72600:72600*2]
normal3 = normal.iloc[72600*2:]


In [26]:
# Split the generic samples into three partitions (47369, 60213 and 53792 rows)
generic1 = generic.iloc[:47369]
generic2 = generic.iloc[47369:(47369+60213)]
generic3 = generic.iloc[(47369+60213):]

In [30]:
# Split the exploits samples into two partitions (14728 and 18808 rows)
exploits1 = exploits.iloc[:14728]
exploits2 = exploits.iloc[14728:]

In [32]:
# Take all DoS and Reconnaissance samples; each category goes to a single node
dos = complete[complete['attack_cat'] == "dos"]
recon = complete[complete['attack_cat'] == "reconnaissance"]
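The slicing in the cells above follows one pattern: cut a DataFrame into consecutive row blocks of given sizes. A minimal sketch of that pattern as a reusable function is shown here; `split_by_sizes` is a hypothetical helper that does not appear in the original notebook, and it is demonstrated on toy data.

```python
import numpy as np
import pandas as pd


def split_by_sizes(df, sizes):
    """Split df into consecutive row blocks whose lengths are given by sizes."""
    if sum(sizes) != len(df):
        raise ValueError("sizes must add up to the number of rows in df")
    # Cumulative sums give the slice boundaries, e.g. [3, 4, 3] -> [0, 3, 7, 10].
    bounds = np.cumsum([0] + list(sizes))
    return [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


# Toy data: ten rows split into blocks of 3, 4 and 3 rows.
toy = pd.DataFrame({"value": range(10)})
blocks = split_by_sizes(toy, [3, 4, 3])
print([len(b) for b in blocks])
```

With the DataFrames defined earlier, a call such as `split_by_sizes(generic, [47369, 60213, 53792])` should yield the same three generic partitions, since those sizes add up to the 161374 generic rows.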

Concatenate the DataFrames to obtain the three final partitions:

In [33]:
# Build UNSW-NB15-Train-Basic-PartN (N = 1, 2, 3) and export each partition to CSV
part1 = pd.concat([normal1, generic1, exploits1, recon])
part2 = pd.concat([normal2, generic2, dos])
part3 = pd.concat([normal3, generic3, exploits2])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-Part3.csv', index=False)
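Before relying on the exported files, it can be worth confirming that the partitions are pairwise disjoint and together cover the whole dataset. The sketch below is one possible check, under the assumption that the source DataFrame has a unique index; `check_partitions` is a hypothetical helper and the data is a toy example.

```python
import pandas as pd


def check_partitions(complete, parts):
    """Return True if parts are pairwise disjoint and together cover complete."""
    combined = pd.concat(parts)
    # A repeated index label means the same row landed in two partitions.
    no_overlap = not combined.index.duplicated().any()
    # Every row of the source must appear in some partition.
    full_cover = set(combined.index) == set(complete.index)
    return no_overlap and full_cover


# Toy data: six rows split into three partitions of two rows each.
toy = pd.DataFrame({"attack_cat": ["normal", "normal", "generic",
                                   "generic", "dos", "exploits"]})
good = [toy.iloc[:2], toy.iloc[2:4], toy.iloc[4:]]
bad = [toy.iloc[:3], toy.iloc[2:4], toy.iloc[4:]]  # row 2 appears twice

print(check_partitions(toy, good))
print(check_partitions(toy, bad))
```

The check has to run on the in-memory DataFrames, because `to_csv(..., index=False)` drops the original index from the exported files.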