# Data without the Teardrop attack

In this notebook we take the data we have and remove the teardrop attack. This lets us train the models on the reduced data set as well, and then check whether they can predict out-of-sample attacks that they have no knowledge of.

In [1]:
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy as sp
import requests
from io import StringIO

In [2]:
df_KD = pd.read_csv('KD99_corrected.csv')

In [3]:
df_KD.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.


In [4]:
df_KD.shape

(494021, 42)

This is our KD_99 data, modified to include the header row and the other data cleansing we initially performed.

In [5]:
df_KD.groupby('label').size().sort_values(ascending=False)

label
smurf.              280790
neptune.            107201
normal.              97278
back.                 2203
satan.                1589
ipsweep.              1247
portsweep.            1040
warezclient.          1020
teardrop.              979
pod.                   264
nmap.                  231
guess_passwd.           53
buffer_overflow.        30
land.                   21
warezmaster.            20
imap.                   12
rootkit.                10
loadmodule.              9
ftp_write.               8
multihop.                7
phf.                     4
perl.                    3
spy.                     2
dtype: int64

The teardrop attack is a denial-of-service (DoS) attack in this data set. With 979 events, it is large enough that removing it takes out a meaningful amount of data. The data set also contains other DoS attacks (pod, for example), so the training set will still hold similar, but not identical, attacks to learn from.
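To make the "similar but not the same" point concrete, the sketch below counts how many denial-of-service events would remain once teardrop is removed. The `DOS_LABELS` set is an assumption based on the usual KDD Cup 99 attack grouping, not something defined in this notebook, and `label_counts` copies a handful of rows from the label table above.

```python
import pandas as pd

# Assumption: the usual KDD Cup 99 grouping of denial-of-service labels.
DOS_LABELS = {"back.", "land.", "neptune.", "pod.", "smurf.", "teardrop."}

# Counts copied from the label table above (DoS labels plus normal traffic).
label_counts = pd.Series({
    "smurf.": 280790,
    "neptune.": 107201,
    "normal.": 97278,
    "back.": 2203,
    "teardrop.": 979,
    "pod.": 264,
    "land.": 21,
})

# Keep only the DoS labels, then drop the one we are holding out.
dos_counts = label_counts[label_counts.index.isin(DOS_LABELS)]
remaining_dos = dos_counts.drop("teardrop.")

print(remaining_dos.sort_values(ascending=False))
print("DoS events left for training:", remaining_dos.sum())
```

Under that grouping, the training data keeps five other DoS labels, so the models still see DoS-style traffic even though they never see teardrop itself.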

We now create the two data sets to use for training and testing by partitioning out the teardrop data.

In [6]:
df_teardrop = df_KD[df_KD['label'] == 'teardrop.']
df_teardrop.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
19286,0,udp,private,SF,28,0,0,1,0,0,...,1,0.01,0.05,0.01,0.0,0.0,0.0,0.0,0.0,teardrop.
19287,0,udp,private,SF,28,0,0,3,0,0,...,2,0.03,0.05,0.03,0.0,0.0,0.0,0.0,0.0,teardrop.
19288,0,udp,private,SF,28,0,0,3,0,0,...,3,0.04,0.05,0.04,0.0,0.0,0.0,0.0,0.0,teardrop.
19289,0,udp,private,SF,28,0,0,3,0,0,...,4,0.05,0.05,0.05,0.0,0.0,0.0,0.0,0.0,teardrop.
19290,0,udp,private,SF,28,0,0,3,0,0,...,5,0.06,0.05,0.06,0.0,0.0,0.0,0.0,0.0,teardrop.


In [7]:
df_teardrop.shape

(979, 42)

The shape confirms that the filter worked: it extracted all 979 events attributed to teardrop attacks, matching the count in the label table.

In [8]:
df_KD = df_KD[df_KD['label'] != 'teardrop.']

In [9]:
df_KD.shape

(493042, 42)

The new shape also confirms that we have dropped all events attributed to the teardrop attack from the original data set: 494021 - 979 = 493042 rows remain.
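The same checks can be written as assertions rather than read off the shapes by eye. This is a minimal sketch on a toy frame: `df_all`, `df_rest`, and the five rows in it are illustrative stand-ins, not the notebook's real data.

```python
import pandas as pd

# Toy stand-in for df_KD; the real notebook loads KD99_corrected.csv instead.
df_all = pd.DataFrame({
    "src_bytes": [181, 28, 239, 28, 1480],
    "label": ["normal.", "teardrop.", "normal.", "teardrop.", "pod."],
})

# One boolean mask drives both halves of the partition.
is_teardrop = df_all["label"] == "teardrop."
df_teardrop = df_all[is_teardrop]
df_rest = df_all[~is_teardrop]

# The two parts should account for every row, with no overlap.
assert len(df_teardrop) + len(df_rest) == len(df_all)
assert df_teardrop.index.intersection(df_rest.index).empty

# No teardrop rows may remain in the data used for training.
assert "teardrop." not in set(df_rest["label"])
```

Building both halves from a single mask and its negation guarantees that they are complementary, which is harder to get wrong than writing the `==` and `!=` filters separately.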

In [10]:
df_KD.to_csv('KD99_noteardrop.csv', index=False, header=True)
df_teardrop.to_csv('KD99_teardrop.csv', index=False, header=True)
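As a hedged illustration of what `index=False` does on export, the sketch below writes a toy frame to an in-memory buffer and reads it back. The two rows and the `buffer` and `df_back` names are made up for the example; the real files are written to disk as above.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for df_teardrop, keeping row labels inherited from the full data set.
df_teardrop = pd.DataFrame(
    {
        "protocol_type": ["udp", "udp"],
        "src_bytes": [28, 28],
        "label": ["teardrop.", "teardrop."],
    },
    index=[19286, 19287],
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = StringIO()
df_teardrop.to_csv(buffer, index=False, header=True)
buffer.seek(0)
df_back = pd.read_csv(buffer)

# Columns and shape survive the round trip.
assert list(df_back.columns) == list(df_teardrop.columns)
assert df_back.shape == df_teardrop.shape

# The original row labels are not written, so the reloaded frame is renumbered from 0.
assert list(df_back.index) == [0, 1]
```

Because the index is not written, reloading the saved files will not produce an extra unnamed index column, but the original row positions (such as 19286) are lost.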