# Data without attacks

In this notebook we take the data we have and create data sets with specific attacks removed. This lets us train the models on data that excludes an attack type and then check whether they can still detect that attack out of sample, having never seen it during training.

In [1]:
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy as sp
import requests
from io import StringIO

In [2]:
df_KD = pd.read_csv('KD99_corrected.csv')

In [3]:
df_KD.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.


In [4]:
df_KD.shape

(494021, 42)

This is our modified KD_99 data, which includes the header row and the other data cleansing we initially performed.

In [5]:
df_KD.groupby('label').size().sort_values(ascending=False)

label
smurf.              280790
neptune.            107201
normal.              97278
back.                 2203
satan.                1589
ipsweep.              1247
portsweep.            1040
warezclient.          1020
teardrop.              979
pod.                   264
nmap.                  231
guess_passwd.           53
buffer_overflow.        30
land.                   21
warezmaster.            20
imap.                   12
rootkit.                10
loadmodule.              9
ftp_write.               8
multihop.                7
phf.                     4
perl.                    3
spy.                     2
dtype: int64

The attack types within this data set are summarised as follows:
 - dos (denial of service)
         - back
         - land
         - neptune
         - pod
         - smurf
         - teardrop
 - u2r (user to root)
         - load module
         - perl
         - rootkit
 - r2l (remote to local)
         - ftp write
         - guess password
         - imap
         - multihop
         - phf
         - spy
         - warezclient
         - warezmaster
 - probe 
         - ipsweep
         - nmap
         - portsweep
         - satan
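The grouping above can be expressed as a lookup table so that each event can be tagged with its category. The following is a minimal sketch rather than part of the original analysis: the names `ATTACK_CATEGORY` and `add_category` and the toy frame are illustrative assumptions. Labels that do not appear in the list (for example `buffer_overflow.`, which is present in the counts but not in the list) are mapped to `"other"` instead of being guessed.

```python
import pandas as pd

# Category lookup transcribed from the list above. Keys keep the trailing
# full stop used by the label column. This mapping is an illustrative
# assumption, not something defined elsewhere in the notebook.
ATTACK_CATEGORY = {
    "back.": "dos", "land.": "dos", "neptune.": "dos",
    "pod.": "dos", "smurf.": "dos", "teardrop.": "dos",
    "loadmodule.": "u2r", "perl.": "u2r", "rootkit.": "u2r",
    "ftp_write.": "r2l", "guess_passwd.": "r2l", "imap.": "r2l",
    "multihop.": "r2l", "phf.": "r2l", "spy.": "r2l",
    "warezclient.": "r2l", "warezmaster.": "r2l",
    "ipsweep.": "probe", "nmap.": "probe",
    "portsweep.": "probe", "satan.": "probe",
    "normal.": "normal",
}


def add_category(df, label_col="label"):
    """Return a copy of df with a 'category' column derived from the label."""
    out = df.copy()
    # Labels missing from the lookup become "other" rather than raising.
    out["category"] = out[label_col].map(ATTACK_CATEGORY).fillna("other")
    return out


# Toy frame standing in for df_KD so the sketch runs without the CSV file.
toy = pd.DataFrame(
    {"label": ["smurf.", "normal.", "teardrop.", "nmap.", "buffer_overflow."]}
)
toy_with_category = add_category(toy)
category_counts = toy_with_category.groupby("category").size()
print(category_counts)
```

Applied to the full data, the same `groupby("category").size()` call would give the number of events per attack family, which is the basis for the choices described next.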

The dos attack we will remove is teardrop. With 979 events it sits in the middle of the dos attacks by size, so it is large enough to test against but not so large that removing it takes out a large portion of the training data.

The u2r attack we will remove is rootkit. Every u2r attack listed above has 10 or fewer events, which means they will be hard to validate and may give insignificant results, but rootkit is the largest of them (10 events), so we will use that.

The r2l attack we will remove is guess password (`guess_passwd.`). It is the second largest r2l attack, so removing it does take some r2l data away from training, but I feel it is the best choice because its 53 events are enough to test against and get a meaningful result.

Finally, the probe attack we will remove is portsweep. All four probe attacks are fairly common, so we take the third largest: with 1040 events it still gives a large test set (larger than the test sets for the other three attack types) while removing a smaller portion of the training data than satan or ipsweep would.
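To make the trade-off concrete, the arithmetic behind these choices can be worked through directly. This is a small sketch using only the event counts printed by the `groupby` cell earlier in the notebook; the variable names are illustrative assumptions.

```python
# Event counts copied from the groupby output earlier in this notebook.
TOTAL_EVENTS = 494021
held_out_counts = {
    "teardrop.": 979,
    "rootkit.": 10,
    "guess_passwd.": 53,
    "portsweep.": 1040,
}

# For each chosen attack: size of the held-out test set, rows left for
# training, and the percentage of the full data set that is removed.
held_out_summary = {}
for label, count in held_out_counts.items():
    remaining = TOTAL_EVENTS - count
    removed_pct = 100 * count / TOTAL_EVENTS
    held_out_summary[label] = (count, remaining, removed_pct)
    print(f"{label:<15} test rows={count:>5}  training rows={remaining}  removed={removed_pct:.4f}%")
```

Each of the four attacks removes well under 1% of the rows, so the training sets stay almost the same size as the original data.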

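We now create the eight data sets used for training and testing by partitioning out each of these four attacks: for each attack, one data set containing only that attack and one containing everything else.

## Creating Teardrop Data Sets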
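The sections that follow repeat the same filter for each attack by hand. As an alternative, the partitioning could be wrapped in one helper and looped over the four labels. This is only a sketch under stated assumptions: `split_out_attack` is a hypothetical helper, and a toy frame stands in for `df_KD` so the block runs without the CSV file.

```python
import pandas as pd


def split_out_attack(df, attack_label, label_col="label"):
    """Return (rows with attack_label, rows with every other label)."""
    is_attack = df[label_col] == attack_label
    return df[is_attack], df[~is_attack]


# Toy frame standing in for df_KD (values are illustrative only).
toy = pd.DataFrame(
    {
        "src_bytes": [181, 28, 86, 104, 0, 239],
        "label": [
            "normal.", "teardrop.", "rootkit.",
            "guess_passwd.", "portsweep.", "normal.",
        ],
    }
)

# One (attack-only, attack-removed) pair per chosen attack: eight frames.
splits = {}
for attack in ["teardrop.", "rootkit.", "guess_passwd.", "portsweep."]:
    only, without = split_out_attack(toy, attack)
    splits[attack] = (only, without)
    print(attack, only.shape, without.shape)
```

Keeping the explicit per-attack cells, as this notebook does, has the advantage that each intermediate shape is visible; the helper mainly reduces the chance of a copy-and-paste slip in a label or variable name.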

In [6]:
df_teardrop = df_KD[df_KD['label'] == 'teardrop.']
df_teardrop.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
19286,0,udp,private,SF,28,0,0,1,0,0,...,1,0.01,0.05,0.01,0.0,0.0,0.0,0.0,0.0,teardrop.
19287,0,udp,private,SF,28,0,0,3,0,0,...,2,0.03,0.05,0.03,0.0,0.0,0.0,0.0,0.0,teardrop.
19288,0,udp,private,SF,28,0,0,3,0,0,...,3,0.04,0.05,0.04,0.0,0.0,0.0,0.0,0.0,teardrop.
19289,0,udp,private,SF,28,0,0,3,0,0,...,4,0.05,0.05,0.05,0.0,0.0,0.0,0.0,0.0,teardrop.
19290,0,udp,private,SF,28,0,0,3,0,0,...,5,0.06,0.05,0.06,0.0,0.0,0.0,0.0,0.0,teardrop.


In [7]:
df_teardrop.shape

(979, 42)

We can see from the above that this has worked successfully and has extracted all 979 events that were attributed to teardrop attacks.

In [8]:
df_nonteardrop = df_KD[df_KD['label'] != 'teardrop.']

In [9]:
df_nonteardrop.shape

(493042, 42)

The shape above also confirms that we have dropped all 979 teardrop events from the original data set (494021 - 979 = 493042).
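The checks made by eye above (row counts add up, no attack rows left behind) can also be written as assertions. The following is a hedged sketch, not part of the original notebook: `check_partition` is a hypothetical helper and the toy frame stands in for `df_KD`.

```python
import pandas as pd


def check_partition(full, held_out, remainder, attack_label, label_col="label"):
    """Raise AssertionError if the two parts do not cleanly partition full."""
    assert len(held_out) + len(remainder) == len(full), "row counts do not add up"
    assert (held_out[label_col] == attack_label).all(), "held-out set has other labels"
    assert (remainder[label_col] != attack_label).all(), "attack rows left in remainder"
    assert held_out.index.intersection(remainder.index).empty, "a row is in both sets"
    return True


# Toy frame standing in for df_KD (values are illustrative only).
toy = pd.DataFrame({"label": ["normal.", "teardrop.", "smurf.", "teardrop."]})
toy_attack = toy[toy["label"] == "teardrop."]
toy_rest = toy[toy["label"] != "teardrop."]

partition_ok = check_partition(toy, toy_attack, toy_rest, "teardrop.")
print(partition_ok)
```

With the real frames, the equivalent call would be `check_partition(df_KD, df_teardrop, df_nonteardrop, 'teardrop.')`, and the same check applies to the other three attacks.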

In [10]:
df_nonteardrop.to_csv('KD99_noteardrop.csv', index = False, header = True)
df_teardrop.to_csv('KD99_teardrop.csv', index = False, header = True)

## Creating Rootkit Data Sets

In [11]:
df_rootkit = df_KD[df_KD['label'] == 'rootkit.']
df_rootkit.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
141509,60,tcp,telnet,SF,86,183,0,0,0,0,...,1,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,rootkit.
141510,60,tcp,telnet,SF,90,233,0,0,0,0,...,2,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,rootkit.
141511,708,tcp,telnet,SF,1727,24080,0,0,0,0,...,3,0.01,0.02,0.0,0.0,0.0,0.0,0.0,0.0,rootkit.
141512,21,tcp,ftp,SF,89,345,0,0,0,1,...,1,0.0,0.02,0.0,0.0,0.0,0.0,0.0,0.0,rootkit.
141513,98,tcp,telnet,SF,621,8356,0,0,1,1,...,4,0.02,0.02,0.0,0.0,0.0,0.0,0.0,0.0,rootkit.


In [12]:
df_rootkit.shape

(10, 42)

In [13]:
df_nonrootkit = df_KD[df_KD['label'] != 'rootkit.']

In [14]:
df_nonrootkit.shape

(494011, 42)

In [15]:
df_nonrootkit.to_csv('KD99_nonrootkit.csv', index = False, header = True)
df_rootkit.to_csv('KD99_rootkit.csv', index = False, header = True)

## Creating Guess Password Data Sets

In [16]:
df_guesspassword = df_KD[df_KD['label'] == 'guess_passwd.']
df_guesspassword.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
15699,23,tcp,telnet,SF,104,276,0,0,0,0,...,2,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,guess_passwd.
22750,60,tcp,telnet,S3,125,179,0,0,0,1,...,1,1.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,guess_passwd.
22751,0,tcp,telnet,RSTO,125,179,0,0,0,1,...,2,1.0,0.0,0.5,0.0,0.5,0.5,0.5,0.5,guess_passwd.
22752,0,tcp,telnet,RSTO,125,179,0,0,0,1,...,3,1.0,0.0,0.33,0.0,0.33,0.33,0.67,0.67,guess_passwd.
22753,0,tcp,telnet,RSTO,125,179,0,0,0,1,...,4,1.0,0.0,0.25,0.0,0.25,0.25,0.75,0.75,guess_passwd.


In [17]:
df_guesspassword.shape

(53, 42)

In [18]:
df_nonguesspassword = df_KD[df_KD['label'] != 'guess_passwd.']

In [19]:
df_nonguesspassword.shape

(493968, 42)

In [20]:
df_nonguesspassword.to_csv('KD99_nonguesspassword.csv', index = False, header = True)
df_guesspassword.to_csv('KD99_guesspassword.csv', index = False, header = True)

## Creating PortSweep Data Sets

In [21]:
df_portsweep = df_KD[df_KD['label'] == 'portsweep.']
df_portsweep.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
22814,1,tcp,private,RSTR,0,0,0,0,0,0,...,2,0.01,0.04,0.04,0.0,0.01,0.0,0.32,1.0,portsweep.
22815,1,tcp,private,RSTR,0,0,0,0,0,0,...,2,0.01,0.06,0.1,0.0,0.01,0.0,0.36,1.0,portsweep.
22816,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.01,0.09,0.14,0.0,0.01,0.0,0.39,1.0,portsweep.
22817,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.0,0.11,0.18,0.0,0.01,0.0,0.42,1.0,portsweep.
22818,1,tcp,private,RSTR,0,0,0,0,0,0,...,2,0.01,0.12,0.22,0.0,0.01,0.0,0.44,1.0,portsweep.


In [22]:
df_portsweep.shape

(1040, 42)

In [23]:
df_nonportsweep = df_KD[df_KD['label'] != 'portsweep.']

In [24]:
df_nonportsweep.shape

(492981, 42)

In [25]:
df_nonportsweep.to_csv('KD99_nonportsweep.csv', index = False, header = True)
df_portsweep.to_csv('KD99_portsweep.csv', index = False, header = True)
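
Once a model has been trained on one of the attack-removed files, the matching attack-only file can be used to measure how many unseen attack events it still flags. The sketch below is an assumption about how that scoring might look, not the method used later in this project: `detection_rate` and `toy_predict` are hypothetical names, and `toy_predict` is a trivial hand-written rule standing in for a trained model.

```python
import pandas as pd


def detection_rate(predicted_labels):
    """Share of held-out attack rows that were not predicted as 'normal.'."""
    predicted = pd.Series(predicted_labels)
    return float((predicted != "normal.").mean())


def toy_predict(df):
    """Stand-in for a trained model: flag any row with a wrong fragment."""
    return df["wrong_fragment"].apply(
        lambda n: "neptune." if n > 0 else "normal."
    )


# Toy held-out frame standing in for the contents of KD99_teardrop.csv.
toy_held_out = pd.DataFrame(
    {
        "wrong_fragment": [1, 3, 3, 0],
        "label": ["teardrop."] * 4,
    }
)

rate = detection_rate(toy_predict(toy_held_out))
print(f"detected {rate:.0%} of held-out attack rows")
```

The predicted label does not need to match the held-out attack name, since the model has never seen that label; what matters here is whether the event is flagged as something other than normal traffic.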