# Assessment 1 Data

The goal of this assessment is to build a model that classifies network traffic as normal or non-normal, evaluated under a given performance metric. The analysis uses the [KDD Cup 1999 data set (10% subset)](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html).

To begin the assessment, we first need to import the data into our notebook.

In [10]:
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy as sp
import requests
from io import StringIO

## Importing the Data Set

In [2]:
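url_KD99 = 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz' # URL of the gzipped data file
# Read the data straight from the URL; no header option is set here, so the first record is used as the column names
df_KD = pd.read_csv(url_KD99, low_memory=False)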

In [3]:
df_KD.head()

Unnamed: 0,0,tcp,http,SF,181,5450,0.1,0.2,0.3,0.4,...,9.1,1.00.1,0.00.6,0.11.1,0.00.7,0.00.8,0.00.9,0.00.10,0.00.11,normal.
0,0,tcp,http,SF,239,486,0,0,0,0,...,19,1.0,0.0,0.05,0.0,0.0,0.0,0.0,0.0,normal.
1,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
2,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1.0,0.0,0.03,0.0,0.0,0.0,0.0,0.0,normal.
3,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.
4,0,tcp,http,SF,217,2032,0,0,0,0,...,59,1.0,0.0,0.02,0.0,0.0,0.0,0.0,0.0,normal.
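
The odd column labels in the output above (for example `0.1`, `0.2`) can be reproduced on a tiny sample. The sketch below is illustrative only: the three-field `sample` string is hypothetical and is not taken from the KDD file. It shows that, by default, `pd.read_csv` uses the first line as column names and adds a numeric suffix to names that repeat.

```python
from io import StringIO

import pandas as pd

# Hypothetical headerless sample with three fields (not real KDD records).
sample = "0,tcp,0\n1,udp,0\n"

# Default behaviour: the first line is taken as the header row.
df_default = pd.read_csv(StringIO(sample))

# The labels are the first record's values; the repeated "0" gets a suffix.
print(list(df_default.columns))

# Only one data row remains, because the first record became the header.
print(len(df_default))
```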


The imported data frame is not yet usable: the file has no header row, so the first record was read as the column names and is missing from the data. Using reference [2], I found the column names that belong to this data set. The next cell restores the first record as a data row and applies those names.

In [4]:
# The first record was consumed as the header, so re-insert it as a row.
# Numeric values are used instead of strings so the numeric columns do not end up mixing numbers and strings.
df_KD.loc[-1] = [0,'tcp','http','SF',181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.0,0.0,0.0,0.0,1.0,0.0,0.0,9,9,1.0,0.0,0.11,0.0,0.0,0.0,0.0,0.0,'normal.']
df_KD.index = df_KD.index + 2 # shift the index so the restored row gets index 1
df_KD.sort_index(inplace=True) # sort so the restored row comes first

# adding column names
df_KD.columns = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot','num_failed_logins','logged_in','lnum_compromised','lroot_shell','lsu_attempted','lnum_root','lnum_file_creations','lnum_shells','lnum_access_files','lnum_outbound_cmds','is_host_login','is_guest_login','count','srv_count','serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate','dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate','dst_host_srv_rerror_rate','label']
df_KD.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
1,0,tcp,http,SF,181,5450,0,0,0,0,...,9,1,0,0.11,0,0,0,0,0,normal.
2,0,tcp,http,SF,239,486,0,0,0,0,...,19,1,0,0.05,0,0,0,0,0,normal.
3,0,tcp,http,SF,235,1337,0,0,0,0,...,29,1,0,0.03,0,0,0,0,0,normal.
4,0,tcp,http,SF,219,1337,0,0,0,0,...,39,1,0,0.03,0,0,0,0,0,normal.
5,0,tcp,http,SF,217,2032,0,0,0,0,...,49,1,0,0.02,0,0,0,0,0,normal.
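
An alternative to restoring the first record by hand is to tell `pd.read_csv` that the file has no header and pass the column names directly. The following is a minimal sketch, not the method used in this notebook. It assumes the same column names as the cell above and, to stay runnable offline, reads a one-line `sample` string (the first record of the file) rather than the full download. The names `columns`, `sample`, and `df_sample` are introduced here for illustration.

```python
from io import StringIO

import pandas as pd

# Column names as used in this notebook.
columns = ['duration','protocol_type','service','flag','src_bytes','dst_bytes',
           'land','wrong_fragment','urgent','hot','num_failed_logins','logged_in',
           'lnum_compromised','lroot_shell','lsu_attempted','lnum_root',
           'lnum_file_creations','lnum_shells','lnum_access_files',
           'lnum_outbound_cmds','is_host_login','is_guest_login','count',
           'srv_count','serror_rate','srv_serror_rate','rerror_rate',
           'srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate',
           'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate',
           'dst_host_diff_srv_rate','dst_host_same_src_port_rate',
           'dst_host_srv_diff_host_rate','dst_host_serror_rate',
           'dst_host_srv_serror_rate','dst_host_rerror_rate',
           'dst_host_srv_rerror_rate','label']

# First record of the file; it stands in for the full download in this sketch.
sample = ("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,"
          "0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,"
          "0.00,0.00,0.00,0.00,0.00,normal.\n")

# header=None keeps the first line as data; names supplies the column labels.
df_sample = pd.read_csv(StringIO(sample), header=None, names=columns)

print(df_sample.shape)
print(df_sample[['duration', 'src_bytes', 'dst_bytes', 'label']])
```

With the real file, the same call would take `url_KD99` in place of the `StringIO` object, and the first record would stay in the data without any manual re-insertion.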


## Exporting the Data Set

The code below exports the corrected data set to a CSV file so that other members of the team can use it.

In [16]:
df_KD.to_csv('KD99_corrected.csv', index = False, header = True)
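
Before sharing the file, it may be worth checking that it reads back with the expected columns and shape. The sketch below shows one way to do this, under some assumptions: it uses a small stand-in frame `df_demo` (hypothetical, with only four of the columns) and a temporary directory, so it does not touch the real export. In the notebook, `df_KD` and the real path would take their place.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame; in the notebook this would be df_KD.
df_demo = pd.DataFrame({
    'duration': [0, 0],
    'protocol_type': ['tcp', 'tcp'],
    'src_bytes': [181, 239],
    'label': ['normal.', 'normal.'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'KD99_corrected.csv')
    # Same export options as the cell above.
    df_demo.to_csv(path, index=False, header=True)
    # Read the file back while the temporary directory still exists.
    df_back = pd.read_csv(path)

# Compare column names and shape of the original and re-imported frames.
same_columns = list(df_back.columns) == list(df_demo.columns)
same_shape = df_back.shape == df_demo.shape
print(same_columns, same_shape)
```

Because the export uses `index = False`, the index is not written to the file, so teammates get a fresh default index when they load it.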

## References

1. [Source of data](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
2. [Source of headers for the data table](http://kdd.ics.uci.edu/databases/kddcup99/kddcup.names)