## Partitioning UNSW-NB15-Train-Basic into 3 nodes 

The dataset is split across 3 nodes. A partition can be balanced (all nodes hold roughly the same number of samples) or unbalanced, and each attack type may appear in all nodes or only in a subset of them.

In [3]:
import numpy as np  # for array
import pandas as pd  # for csv files and dataframe
import matplotlib.pyplot as plt  # for plotting
import seaborn as sns  # plotting
from scipy import stats

import pickle  # to save and load data on disk

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer
from sklearn.metrics import auc, f1_score, roc_curve
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_validate, cross_val_predict

In [4]:
# Get UNSW-NB15-Train-Basic dataset 
complete = pd.read_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic.csv')

In [3]:
complete

Unnamed: 0,proto,srcip,sport,dstip,dsport,spkts,dpkts,sbytes,dbytes,state,stime,ltime,dur,label,attack_cat
0,udp,59.166.0.7,45584,149.171.126.0,53,2,2,130,162,CON,1424257612,1424257612,0.003362,0,normal
1,tcp,59.166.0.2,18633,149.171.126.3,8908,38,40,2438,19440,FIN,1424258064,1424258064,0.013975,0,normal
2,tcp,59.166.0.7,48428,149.171.126.4,143,122,126,7824,14814,FIN,1421951117,1421951118,0.757193,0,normal
3,unas,175.45.176.1,0,149.171.126.17,0,2,0,200,0,INT,1424244417,1424244417,0.000004,1,dos
4,tcp,175.45.176.1,65485,149.171.126.17,179,10,6,876,268,FIN,1421928270,1421928271,0.436004,0,normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
435514,udp,175.45.176.2,31072,149.171.126.19,500,2,0,136,0,INT,1424223755,1424223755,0.000001,1,dos
435515,tcp,59.166.0.5,2758,149.171.126.8,51824,44,46,2766,28558,FIN,1424261274,1424261274,0.014994,0,normal
435516,udp,175.45.176.0,47439,149.171.126.10,53,2,0,114,0,INT,1424259155,1424259155,0.000002,1,generic
435517,udp,175.45.176.1,47439,149.171.126.18,53,2,0,114,0,INT,1424238658,1424238658,0.000002,1,generic


### id = 3A : Partition with 3 balanced nodes 

Normal (3 nodes), Generic (3 nodes), Exploits (2 nodes), DoS and Reconnaissance (1 node).


- UNSW-NB15-Train-Basic-Part1 (145200): 
    - Normal: 72600 (50.0%)
    - Generic: 47369 (32.6%)
    - Exploits: 14728 (10.1%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 10503 (7.2%)

- UNSW-NB15-Train-Basic-Part2 (145119): 
    - Normal: 72600 (50.0%)
    - Generic: 60213 (41.5%)
    - Exploits: 0 (0.0%)
    - DoS: 12306 (8.5%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part3 (145200): 
    - Normal: 72600 (50.0%)
    - Generic: 53792 (37.0%)
    - Exploits: 18808 (13.0%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)

Get the number of samples for each type of traffic (including attack subtypes):

In [4]:
normal = complete[complete['label'] == 0]
normal.shape

(217800, 15)

In [5]:
generic = complete[complete['attack_cat'] == "generic"]
generic.shape

(161374, 15)

In [6]:
exploits = complete[complete['attack_cat'] == "exploits"]
exploits.shape

(33536, 15)

In [7]:
dos = complete[complete['attack_cat'] == "dos"]
dos.shape

(12306, 15)

In [8]:
reconnaissance = complete[complete['attack_cat'] == "reconnaissance"]
reconnaissance.shape

(10503, 15)
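
The per-category counts above can also be read off in a single call. The following is a minimal sketch on a tiny stand-in DataFrame (`toy` and its rows are made up for illustration; on the real data the same calls would be applied to `complete`):

```python
import pandas as pd

# Tiny stand-in for `complete`; the rows are invented for illustration only.
toy = pd.DataFrame({
    "label": [0, 0, 0, 1, 1, 1],
    "attack_cat": ["normal", "normal", "normal", "generic", "generic", "dos"],
})

# One call returns the number of samples of every traffic type.
counts = toy["attack_cat"].value_counts()
print(counts)

# Benign vs. attack totals come from the binary `label` column.
label_counts = toy["label"].value_counts()
print(label_counts)
```

This avoids one filter per category, at the cost of not keeping the filtered DataFrames around for later slicing.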

Construct the partitions for each subtype: 

In [9]:
# Split the normal samples into three partitions (72600 each)
normal1 = complete[complete['label'] == 0].iloc[:72600]
normal2 = complete[complete['label'] == 0].iloc[72600:72600*2]
normal3 = complete[complete['label'] == 0].iloc[72600*2:]


In [10]:
# Split the generic samples into three partitions (47369, 60213, 53792)
generic1 = complete[complete['attack_cat'] == "generic"].iloc[:47369]
generic2 = complete[complete['attack_cat'] == "generic"].iloc[47369:(47369+60213)]
generic3 = complete[complete['attack_cat'] == "generic"].iloc[(47369+60213):]

In [11]:
# Split the exploits samples into two partitions (14728, 18808)
exploits1 = complete[complete['attack_cat'] == "exploits"].iloc[:14728]
exploits2 = complete[complete['attack_cat'] == "exploits"].iloc[14728:]

In [12]:
# Get all dos samples and reconnaissance samples
dos = complete[complete['attack_cat'] == "dos"]
recon = complete[complete['attack_cat'] == "reconnaissance"]

Concatenate the DataFrames to obtain the three final partitions:

In [13]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal1, generic1, exploits1, recon])
part2 = pd.concat([normal2, generic2, dos])
part3 = pd.concat([normal3, generic3, exploits2])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3A-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3A-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3A-Part3.csv', index=False)
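
The slicing pattern used above (consecutive `iloc` ranges, with the last slice taking whatever remains) can be wrapped in a small helper so the offsets do not have to be written by hand. This is only a sketch: `split_by_sizes` is a hypothetical helper, not part of the original notebook, and it is shown on a toy DataFrame.

```python
import pandas as pd

def split_by_sizes(df, sizes):
    """Cut df into consecutive row blocks; the last block takes the remainder.

    `sizes` lists the lengths of all blocks except the last one.
    """
    parts, start = [], 0
    for size in sizes:
        parts.append(df.iloc[start:start + size])
        start += size
    parts.append(df.iloc[start:])
    return parts

# Toy data: 10 rows split into blocks of 3, 4 and the remaining 3.
toy = pd.DataFrame({"attack_cat": ["generic"] * 10})
g1, g2, g3 = split_by_sizes(toy, [3, 4])
print(len(g1), len(g2), len(g3))  # 3 4 3
```

With the real data, the generic split for 3A would presumably be `split_by_sizes(generic, [47369, 60213])`.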

### id = 3B : Partition with 3 balanced nodes 

Normal (2 nodes), Generic (2 nodes), Exploits (1 node), DoS and Reconnaissance (1 node).


- UNSW-NB15-Train-Basic-Part1 (145200): 
    - Normal: 145200 (100.0%)
    - Generic: 0 (0.0%)
    - Exploits: 0 (0.0%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part2 (145119): 
    - Normal: 72600 (50.0%)
    - Generic: 60213 (41.5%)
    - Exploits: 0 (0.0%)
    - DoS: 12306 (8.5%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part3 (145200): 
    - Normal: 0 (0.0%)
    - Generic: 101161 (69.7%)
    - Exploits: 33536 (23.1%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 10503 (7.2%)

In [4]:
# Create the partitions 

# Split the normal samples into two partitions (145200, 72600)
normal1 = complete[complete['label'] == 0].iloc[:145200]
normal2 = complete[complete['label'] == 0].iloc[145200:]

# Split the generic samples into two partitions (60213, 101161)
generic1 = complete[complete['attack_cat'] == "generic"].iloc[:60213]
generic2 = complete[complete['attack_cat'] == "generic"].iloc[60213:]

# Grab the samples with exploit attacks 
exploits1 = complete[complete['attack_cat'] == "exploits"]

# Get all dos samples and reconnaissance samples
dos = complete[complete['attack_cat'] == "dos"]
recon = complete[complete['attack_cat'] == "reconnaissance"]

In [5]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal1])
part2 = pd.concat([normal2, generic1, dos])
part3 = pd.concat([generic2, exploits1, recon])  # generic2 holds the remaining 101161 generic samples
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3B-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3B-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3B-Part3.csv', index=False)
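
Because every partition is assembled from hand-written slices, it is easy to reuse a slice in two nodes by accident. A minimal sketch of a sanity check follows; it assumes the slices keep the original (unique) row index of the source DataFrame, and `overlapping_rows` is a hypothetical helper shown on toy data.

```python
import pandas as pd

def overlapping_rows(parts):
    """Return the index labels that occur in more than one partition."""
    all_labels = pd.concat(parts).index
    return all_labels[all_labels.duplicated()].unique()

toy = pd.DataFrame({"attack_cat": ["normal"] * 4 + ["generic"] * 4})
normal = toy[toy["attack_cat"] == "normal"]
generic = toy[toy["attack_cat"] == "generic"]

# Disjoint partitions: every row is used exactly once.
good = [normal.iloc[:2],
        pd.concat([normal.iloc[2:], generic.iloc[:2]]),
        generic.iloc[2:]]

# Faulty partitions: the first generic slice is used in two nodes.
bad = [normal.iloc[:2],
       pd.concat([normal.iloc[2:], generic.iloc[:2]]),
       generic.iloc[:2]]

print(len(overlapping_rows(good)))  # 0
print(len(overlapping_rows(bad)))   # 2
```

Running such a check on `[part1, part2, part3]` before exporting would flag any row that landed in more than one node.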

### id = 3C : Partition with 3 balanced nodes 

All types of traffic represented in all nodes: 

NEED TO CHANGE 
- UNSW-NB15-Train-Basic-Part1 (145173): 
    - Normal: 72600 (50.0%)
    - Generic: 53791 (37.1%)
    - Exploits: 11179 (7.7%)
    - DoS: 4102 (2.8%)
    - Reconnaissance: 3501 (2.4%)

- UNSW-NB15-Train-Basic-Part2 (145173): 
    - Normal: 72600 (50.0%)
    - Generic: 53791 (37.1%)
    - Exploits: 11179 (7.7%)
    - DoS: 4102 (2.8%)
    - Reconnaissance: 3501 (2.4%)

- UNSW-NB15-Train-Basic-Part3 (145173): 
    - Normal: 72600 (50.0%)
    - Generic: 53792 (37.1%)
    - Exploits: 11178 (7.7%)
    - DoS: 4102 (2.8%)
    - Reconnaissance: 3501 (2.4%)

In [5]:
# Create the partitions 

# Separate into three partitions normal samples (72600)
normal1 = complete[complete['label'] == 0].iloc[:72600]
normal2 = complete[complete['label'] == 0].iloc[72600:72600*2]
normal3 = complete[complete['label'] == 0].iloc[72600*2:]

# Separate into three partitions generic samples (53791, 53791, 53792)
generic1 = complete[complete['attack_cat'] == "generic"].iloc[:53791]
generic2 = complete[complete['attack_cat'] == "generic"].iloc[53791:53791*2]
generic3 = complete[complete['attack_cat'] == "generic"].iloc[53791*2:]

# Separate into three partitions exploits samples (11179, 11179, 11178) 
exploits1 = complete[complete['attack_cat'] == "exploits"].iloc[:11179]
exploits2 = complete[complete['attack_cat'] == "exploits"].iloc[11179:11179*2]
exploits3 = complete[complete['attack_cat'] == "exploits"].iloc[11179*2:]

# Separate into three partitions dos samples (4102)
dos1 = complete[complete['attack_cat'] == "dos"].iloc[:4102]
dos2 = complete[complete['attack_cat'] == "dos"].iloc[4102:4102*2]
dos3 = complete[complete['attack_cat'] == "dos"].iloc[4102*2:]

# Separate into three partitions reconnaissance samples (3501)
recon1 = complete[complete['attack_cat'] == "reconnaissance"].iloc[:3501]
recon2 = complete[complete['attack_cat'] == "reconnaissance"].iloc[3501:3501*2]
recon3 = complete[complete['attack_cat'] == "reconnaissance"].iloc[3501*2:]

In [6]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal1, generic1, exploits1, dos1, recon1])
part2 = pd.concat([normal2, generic2, exploits2, dos2, recon2])
part3 = pd.concat([normal3, generic3, exploits3, dos3, recon3])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3C-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3C-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3C-Part3.csv', index=False)
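
The composition lists given for each partition can be recomputed from the data rather than typed by hand. The sketch below uses a hypothetical helper, `composition`, on a toy partition; note that categories with zero samples simply do not appear in its output.

```python
import pandas as pd

def composition(part):
    """Count and percentage of each attack_cat value in one partition."""
    counts = part["attack_cat"].value_counts()
    percent = (100 * counts / len(part)).round(1)
    return pd.DataFrame({"count": counts, "percent": percent})

# Toy partition: 5 normal, 3 generic and 2 dos rows.
toy_part = pd.DataFrame(
    {"attack_cat": ["normal"] * 5 + ["generic"] * 3 + ["dos"] * 2}
)
summary = composition(toy_part)
print(summary)
```

Calling `composition(part1)` on a real partition should reproduce the counts and percentages listed for that node.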

### id = 3D : Partition with 3 balanced nodes 

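To obtain balanced nodes (same number of samples) in which one node contains only Normal traffic and another only Generic traffic, the number of samples per node must be restricted. Each node holds 56345 samples, which matches the combined total of Exploits, DoS and Reconnaissance (33536 + 12306 + 10503).
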
- UNSW-NB15-Train-Basic-Part1 (56345): 
    - Normal: 56345 (100.0%)
    - Generic: 0 (0.0%)
    - Exploits: 0 (0.0%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part2 (56345): 
    - Normal: 0 (0.0%)
    - Generic: 56345 (100.0%)
    - Exploits: 0 (0.0%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part3 (56345): 
    - Normal: 0 (0.0%)
    - Generic: 0 (0.0%)
    - Exploits: 33536 (59.52%)
    - DoS: 12306 (21.84%)
    - Reconnaissance: 10503 (18.64%)

In [3]:
# Create the partitions 

# Grab 56345 random samples from normal traffic 
normal = complete[complete['attack_cat'] == "normal"].sample(56345)

# Grab 56345 random samples from generic attacks
generic = complete[complete['attack_cat'] == "generic"].sample(56345)

# Grab the samples with exploit attacks 
exploits1 = complete[complete['attack_cat'] == "exploits"]

# Get all dos samples and reconnaissance samples
dos = complete[complete['attack_cat'] == "dos"]
recon = complete[complete['attack_cat'] == "reconnaissance"]

In [4]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal])
part2 = pd.concat([generic])
part3 = pd.concat([exploits1, dos, recon])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3D-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3D-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3D-Part3.csv', index=False)
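
`sample()` draws different rows on every run unless it is seeded, so the exported 3D files are not reproducible as written. A minimal sketch of seeded sampling follows; the seed value and the toy DataFrame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the normal traffic rows.
toy = pd.DataFrame({"attack_cat": ["normal"] * 100, "sport": range(100)})

SEED = 42  # arbitrary, hypothetical value; any fixed integer works

# Two draws with the same seed select exactly the same rows.
first = toy.sample(10, random_state=SEED)
second = toy.sample(10, random_state=SEED)

print(first.index.equals(second.index))
```

Passing the same `random_state` to both `sample(56345)` calls above would make the 3D partition repeatable across runs.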

### id = 3E : Partition with 3 unbalanced nodes 

All nodes hold the same number of samples (145173), but the attack types are distributed unevenly: Normal, Generic and Exploits appear in all 3 nodes, while DoS and Reconnaissance appear in 2 nodes each.

- UNSW-NB15-Train-Basic-Part1 (145173): 
    - Normal: 74050 (51.0%)
    - Generic: 53791 (37.1%)
    - Exploits: 11179 (7.7%)
    - DoS: 6153 (4.2%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part2 (145173): 
    - Normal: 68799 (47.4%)
    - Generic: 53791 (37.1%)
    - Exploits: 11179 (7.7%)
    - DoS: 6153 (4.2%)
    - Reconnaissance: 5251 (3.6%)

- UNSW-NB15-Train-Basic-Part3 (145173): 
    - Normal: 74951 (51.6%)
    - Generic: 53792 (37.1%)
    - Exploits: 11178 (7.7%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 5252 (3.6%)

In [7]:
# Create the partitions 

# Separate into three partitions normal samples (74050, 68799, 74951)
normal1 = complete[complete['label'] == 0].iloc[:74050]
normal2 = complete[complete['label'] == 0].iloc[74050:(74050+68799)]
normal3 = complete[complete['label'] == 0].iloc[(74050+68799):]

# Separate into three partitions generic samples (53791, 53791, 53792)
generic1 = complete[complete['attack_cat'] == "generic"].iloc[:53791]
generic2 = complete[complete['attack_cat'] == "generic"].iloc[53791:53791*2]
generic3 = complete[complete['attack_cat'] == "generic"].iloc[53791*2:]

# Separate into three partitions exploits samples (11179, 11179, 11178) 
exploits1 = complete[complete['attack_cat'] == "exploits"].iloc[:11179]
exploits2 = complete[complete['attack_cat'] == "exploits"].iloc[11179:11179*2]
exploits3 = complete[complete['attack_cat'] == "exploits"].iloc[11179*2:]

# Separate into two partitions dos samples (6153)
dos1 = complete[complete['attack_cat'] == "dos"].iloc[:6153]
dos2 = complete[complete['attack_cat'] == "dos"].iloc[6153:]


# Separate into two partitions reconnaissance samples (5251, 5252)
recon1 = complete[complete['attack_cat'] == "reconnaissance"].iloc[:5251]
recon2 = complete[complete['attack_cat'] == "reconnaissance"].iloc[5251:]

In [8]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal1, generic1, exploits1, dos1])
part2 = pd.concat([normal2, generic2, exploits2, dos2, recon1])
part3 = pd.concat([normal3, generic3, exploits3, recon2])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3E-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3E-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3E-Part3.csv', index=False)

### id = 3F : Partition with 3 unbalanced nodes

Normal (3 nodes), Generic (3 nodes), Exploits (2 nodes), DoS (1 node), Reconnaissance (1 node)

- UNSW-NB15-Train-Basic-Part1 (100000): 
    - Normal: 50000 (50.0%)
    - Generic: 17191 (17.2%)
    - Exploits: 10000 (10.0%)
    - DoS: 12306 (12.3%)
    - Reconnaissance: 10503 (10.5%)

- UNSW-NB15-Train-Basic-Part2 (135600): 
    - Normal: 67800 (50.0%)
    - Generic: 67800 (50.0%)
    - Exploits: 0 (0.0%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)

- UNSW-NB15-Train-Basic-Part3 (199919): 
    - Normal: 100000 (50.0%)
    - Generic: 76383 (38.2%)
    - Exploits: 23536 (11.8%)
    - DoS: 0 (0.0%)
    - Reconnaissance: 0 (0.0%)
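
Before slicing, it is worth confirming that the planned sizes add up to the samples available in each category. The check below is a sketch that uses only the numbers listed in this notebook; the dictionary names are hypothetical.

```python
# Samples available per category, taken from the counts computed earlier.
available = {"normal": 217800, "generic": 161374, "exploits": 33536,
             "dos": 12306, "reconnaissance": 10503}

# Planned samples per node for partition 3F (Part1, Part2, Part3).
plan_3f = {"normal": [50000, 67800, 100000],
           "generic": [17191, 67800, 76383],
           "exploits": [10000, 0, 23536],
           "dos": [12306, 0, 0],
           "reconnaissance": [10503, 0, 0]}

# Every category must be used up exactly once across the three nodes.
for category, sizes in plan_3f.items():
    assert sum(sizes) == available[category], category

# Total number of samples per node.
node_sizes = [sum(sizes[i] for sizes in plan_3f.values()) for i in range(3)]
print(node_sizes)  # [100000, 135600, 199919]
```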

In [5]:
# Create the partitions 

# Separate into three partitions normal samples 
normal1 = complete[complete['label'] == 0].iloc[:50000]
normal2 = complete[complete['label'] == 0].iloc[50000:(50000+67800)]
normal3 = complete[complete['label'] == 0].iloc[(50000+67800):]

# Separate into three partitions generic samples
generic1 = complete[complete['attack_cat'] == "generic"].iloc[:17191]
generic2 = complete[complete['attack_cat'] == "generic"].iloc[17191:(17191+67800)]
generic3 = complete[complete['attack_cat'] == "generic"].iloc[(17191+67800):]

# Separate into two partitions exploits samples 
exploits1 = complete[complete['attack_cat'] == "exploits"].iloc[:10000]
exploits2 = complete[complete['attack_cat'] == "exploits"].iloc[10000:]

# Gather all DoS samples
dos1 = complete[complete['attack_cat'] == "dos"]

# Gather all Reconnaissance samples
recon1 = complete[complete['attack_cat'] == "reconnaissance"]

In [6]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal1, generic1, exploits1, dos1, recon1])
part2 = pd.concat([normal2, generic2])
part3 = pd.concat([normal3, generic3, exploits2])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3F-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3F-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3F-Part3.csv', index=False)

### id = 3G : Partition with 3 unbalanced nodes 

- UNSW-NB15-Train-Basic-Part1 (122573): 
    - Normal: 50000 (40.8%)
    - Generic: 53791 (43.9%)
    - Exploits: 11179 (9.1%)
    - DoS: 4102 (3.3%)
    - Reconnaissance: 3501 (2.9%)

- UNSW-NB15-Train-Basic-Part2 (172573): 
    - Normal: 100000 (57.9%)
    - Generic: 53791 (31.2%)
    - Exploits: 11179 (6.5%)
    - DoS: 4102 (2.4%)
    - Reconnaissance: 3501 (2.0%)

- UNSW-NB15-Train-Basic-Part3 (140373): 
    - Normal: 67800 (48.3%)
    - Generic: 53792 (38.3%)
    - Exploits: 11178 (8.0%)
    - DoS: 4102 (2.9%)
    - Reconnaissance: 3501 (2.5%)

In [7]:
# Create the partitions 

# Separate into three partitions normal samples 
normal1 = complete[complete['label'] == 0].iloc[:50000]
normal2 = complete[complete['label'] == 0].iloc[50000:(50000+100000)]
normal3 = complete[complete['label'] == 0].iloc[(50000+100000):]

# Separate into three partitions generic samples 
generic1 = complete[complete['attack_cat'] == "generic"].iloc[:53791]
generic2 = complete[complete['attack_cat'] == "generic"].iloc[53791:53791*2]
generic3 = complete[complete['attack_cat'] == "generic"].iloc[53791*2:]

# Separate into three partitions exploits samples
exploits1 = complete[complete['attack_cat'] == "exploits"].iloc[:11179]
exploits2 = complete[complete['attack_cat'] == "exploits"].iloc[11179:(11179*2)]
exploits3 = complete[complete['attack_cat'] == "exploits"].iloc[(11179*2):]

# Separate into three partitions dos samples
dos1 = complete[complete['attack_cat'] == "dos"].iloc[:4102]
dos2 = complete[complete['attack_cat'] == "dos"].iloc[4102:(4102*2)]
dos3 = complete[complete['attack_cat'] == "dos"].iloc[(4102*2):]


# Separate into three partitions reconnaissance samples
recon1 = complete[complete['attack_cat'] == "reconnaissance"].iloc[:3501]
recon2 = complete[complete['attack_cat'] == "reconnaissance"].iloc[3501:(3501*2)]
recon3 = complete[complete['attack_cat'] == "reconnaissance"].iloc[3501*2:]

In [8]:
# Create UNSW-NB15-Train-Basic-PartN, N = 1,2,3, dataset and export to csv
part1 = pd.concat([normal1, generic1, exploits1, dos1, recon1])
part2 = pd.concat([normal2, generic2, exploits2, dos2, recon2])
part3 = pd.concat([normal3, generic3, exploits3, recon3, dos3])
part1.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3G-Part1.csv', index=False)
part2.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3G-Part2.csv', index=False)
part3.to_csv('C:/Users/UX430/Documents/thesis/datasets/UNSW-NB15/UNSW-NB15-Train-Basic-3G-Part3.csv', index=False)
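
All the partitions above follow the same recipe: filter by category, cut consecutive slices of given sizes, and concatenate one slice per category for each node. A possible generalisation is sketched below. `build_partitions` and the plan format are hypothetical (they do not appear in the original notebook), and the sketch assumes every node receives at least one row.

```python
import pandas as pd

def build_partitions(df, plan, n_nodes):
    """Assemble n_nodes partitions from consecutive slices of each category.

    `plan` maps an attack_cat value to a list with one size per node.
    """
    nodes = [[] for _ in range(n_nodes)]
    for category, sizes in plan.items():
        rows = df[df["attack_cat"] == category]
        if sum(sizes) > len(rows):
            raise ValueError(f"plan asks for too many '{category}' rows")
        start = 0
        for node, size in zip(nodes, sizes):
            if size:  # skip categories that are absent from this node
                node.append(rows.iloc[start:start + size])
            start += size
    return [pd.concat(chunks) for chunks in nodes]

# Toy data and plan: DoS only in the third node, no generic in the third node.
toy = pd.DataFrame(
    {"attack_cat": ["normal"] * 6 + ["generic"] * 4 + ["dos"] * 2}
)
toy_plan = {"normal": [2, 2, 2], "generic": [1, 3, 0], "dos": [0, 0, 2]}
toy_parts = build_partitions(toy, toy_plan, n_nodes=3)
print([len(p) for p in toy_parts])  # [3, 5, 4]
```

Describing each partition id as a plan dictionary keeps the sizes in one place, so the tables and the slicing code cannot drift apart.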