# CIDDS-001 dataset
The CIDDS-001 dataset was produced by simulating a network of computers (on virtual machines).\
The data is captured as flow records rather than as pcap files. Each row contains a source and a destination IP address, which may be anonymized.\
In a first phase, attacks were simulated on the network by running scripts from a device within the network. This produced *'internal'* traffic, which is saved separately. In a second phase, the researchers opened up the virtual network to the internet and manually labeled attacks in the wild.

More information can be found here: https://www.hs-coburg.de/forschung/forschungsprojekte-oeffentlich/informationstechnologie/cidds-coburg-intrusion-detection-data-sets.html#c6119

We will be working on the **first week of the internal traffic of dataset CIDDS-001**.

In [1]:
import pandas as pd
from sklearn.utils import resample
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import time

## Assignment
We will be performing **multi-class classification** on the CIDDS-001 dataset.

An overview of the steps:
1. Exploring & resampling data
2. Preprocessing the dataset
3. Training & evaluating algorithms

### Preprocessing data

1. First explore the dataset and read the technical report to identify column(s) that you don't need. Also identify columns that contain information you couldn't use when making predictions. One of the columns might contain data _written down in different ways_.\
**Remove some column(s). Correct format of column(s)**


2. The dataset is imbalanced in the number of rows present for each label. The following techniques are useful:
    * Removing duplicates
    * Downsampling
    * Stratified sampling\
**Apply these techniques to get a balanced _training_ and _test_ dataset.**
> You can use the `resample` function for downsampling. See the documentation:\
> https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html \
> You can use the `stratify` option of `train_test_split` for stratified sampling. See the documentation:\
> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

3. Identify all the _categorical_ and _numerical_ columns and think about which encoding technique you can use for each column.\
**Now apply your encoding techniques**

#### 1. First explore the dataset and read the technical report to identify column(s) that you don't need. Also identify columns that contain information you couldn't use when making predictions. One of the columns might contain data _written down in different ways_.\
**Remove some column(s). Correct format of column(s)**

In [2]:
df = pd.read_csv('CIDDS-001-internal-week1.csv')
df.head(5)

Unnamed: 0,Date first seen,Duration,Proto,Src IP Addr,Src Pt,Dst IP Addr,Dst Pt,Packets,Bytes,Flows,Flags,Tos,class,attackType,attackID,attackDescription
0,2017-03-15 00:01:16.632,0.0,TCP,192.168.100.5,445,192.168.220.16,58844.0,1,108.0,1,.AP...,0,normal,---,---,---
1,2017-03-15 00:01:16.552,0.0,TCP,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,1,.AP...,0,normal,---,---,---
2,2017-03-15 00:01:16.551,0.004,TCP,192.168.220.15,48888,192.168.100.5,445.0,2,174.0,1,.AP...,0,normal,---,---,---
3,2017-03-15 00:01:16.631,0.004,TCP,192.168.220.16,58844,192.168.100.5,445.0,2,174.0,1,.AP...,0,normal,---,---,---
4,2017-03-15 00:01:16.552,0.0,TCP,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,1,.AP...,0,normal,---,---,---


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8451520 entries, 0 to 8451519
Data columns (total 16 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Date first seen    object 
 1   Duration           float64
 2   Proto              object 
 3   Src IP Addr        object 
 4   Src Pt             int64  
 5   Dst IP Addr        object 
 6   Dst Pt             float64
 7   Packets            int64  
 8   Bytes              float64
 9   Flows              int64  
 10  Flags              object 
 11  Tos                int64  
 12  class              object 
 13  attackType         object 
 14  attackID           object 
 15  attackDescription  object 
dtypes: float64(3), int64(4), object(9)
memory usage: 1.0+ GB


In [4]:
df.shape

(8451520, 16)

In [5]:
df.columns.tolist()

['Date first seen',
 'Duration',
 'Proto',
 'Src IP Addr',
 'Src Pt',
 'Dst IP Addr',
 'Dst Pt',
 'Packets',
 'Bytes',
 'Flows',
 'Flags',
 'Tos',
 'class',
 'attackType',
 'attackID',
 'attackDescription']

In [6]:
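# Remove columns that are not needed, plus the attack* columns,
# which describe the label and cannot be used when making predictions.

df.drop(['Flows', 'Tos', 'Date first seen', 'attackType', 'attackID', 'attackDescription'], axis=1, inplace=True)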

In [7]:
df.head(5)

Unnamed: 0,Duration,Proto,Src IP Addr,Src Pt,Dst IP Addr,Dst Pt,Packets,Bytes,Flags,class
0,0.0,TCP,192.168.100.5,445,192.168.220.16,58844.0,1,108.0,.AP...,normal
1,0.0,TCP,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,.AP...,normal
2,0.004,TCP,192.168.220.15,48888,192.168.100.5,445.0,2,174.0,.AP...,normal
3,0.004,TCP,192.168.220.16,58844,192.168.100.5,445.0,2,174.0,.AP...,normal
4,0.0,TCP,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,.AP...,normal
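
The task also asks to correct a column whose values are written in different ways. In this file `Bytes` already loads as `float64`, so the sketch below only applies under an assumption: that a raw export stores large byte counts as text with an `M` suffix (for example `'1.2 M'`), and that `M` stands for a factor of one million. The helper name `bytes_to_float` is hypothetical.

```python
import pandas as pd

def bytes_to_float(value):
    """Convert a byte count such as 108, '108' or ' 1.2 M' to a float."""
    text = str(value).strip()
    if text.endswith("M"):
        # Assumption: the 'M' suffix means "times one million".
        return float(text[:-1]) * 1_000_000
    return float(text)

# Toy values in mixed notation, standing in for a raw 'Bytes' column.
sample = pd.Series(["108", "  1.2 M", 174.0])
cleaned = sample.map(bytes_to_float)
print(cleaned.tolist())
```

On the real frame this would be applied as `df['Bytes'] = df['Bytes'].map(bytes_to_float)`, but only if the column turns out to have dtype `object`.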


#### 2.  Identify all the _categorical_ and _numerical_ columns and think about which encoding technique you can use for each column.\
**Now apply your encoding techniques**

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8451520 entries, 0 to 8451519
Data columns (total 10 columns):
 #   Column       Dtype  
---  ------       -----  
 0   Duration     float64
 1   Proto        object 
 2   Src IP Addr  object 
 3   Src Pt       int64  
 4   Dst IP Addr  object 
 5   Dst Pt       float64
 6   Packets      int64  
 7   Bytes        float64
 8   Flags        object 
 9   class        object 
dtypes: float64(3), int64(2), object(5)
memory usage: 644.8+ MB


- Duration     float64
- Proto        object - labelencoder --> onehotencoding
- Src IP Addr  object - ?
- Src Pt       int64  
- Dst IP Addr  object - ?
- Dst Pt       float64
- Packets      int64  
- Bytes        float64
- Flags        object  - ?
- class        object  - labelencoder 

In [9]:
df['Proto'].unique()


array(['TCP  ', 'UDP  ', 'IGMP ', 'ICMP '], dtype=object)
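
The protocol names above carry trailing spaces (`'TCP  '`, `'ICMP '`). A minimal sketch of stripping them before encoding, using toy values rather than the full frame:

```python
import pandas as pd

# Toy values copied from the output of df['Proto'].unique() above.
proto = pd.Series(["TCP  ", "UDP  ", "IGMP ", "ICMP ", "TCP  "])

# Series.str.strip removes leading and trailing whitespace from every value.
proto_clean = proto.str.strip()
print(sorted(proto_clean.unique()))  # ['ICMP', 'IGMP', 'TCP', 'UDP']
```

On the real frame this would be `df['Proto'] = df['Proto'].str.strip()`, which also keeps the generated one-hot column names free of padding.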

In [10]:
df['Src IP Addr'].unique()


array(['192.168.100.5', '192.168.220.15', '192.168.220.16', ...,
       '15019_18', '15020_29', '10037_218'], dtype=object)

In [11]:
df['Dst IP Addr'].unique()

array(['192.168.220.16', '192.168.220.15', '192.168.100.5', ...,
       '15019_18', '15020_29', '10037_218'], dtype=object)
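
The address columns mix dotted internal addresses (`192.168.100.5`), anonymized values (`15019_18`) and named hosts (`EXT_SERVER`), so one-hot encoding every distinct value would create a very wide table. One possible approach, not the one used in this notebook, is to reduce each address to a coarse category first. The function `ip_category` and its category names are hypothetical.

```python
import ipaddress

def ip_category(value: str) -> str:
    """Map an address string to 'private', 'public', 'anonymized' or 'named'."""
    value = value.strip()
    try:
        address = ipaddress.ip_address(value)
    except ValueError:
        # Not a dotted address. Assumption: digits joined by '_' are
        # anonymized public addresses, anything else is a named host.
        if value.replace("_", "").isdigit():
            return "anonymized"
        return "named"
    return "private" if address.is_private else "public"

for example in ["192.168.100.5", "15019_18", "EXT_SERVER"]:
    print(example, "->", ip_category(example))
```

The resulting category column has only a handful of values and can then be one-hot encoded in the same way as `Proto`.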

In [12]:
df['Flags'].unique()


array(['.AP...', '.A....', '.AP.S.', '......', '.A.R..', '....S.',
       '.A..S.', '...RS.', '.AP.SF', '.AP..F', '.A...F', '...R..',
       '.APR.F', '.APRS.', '.APR..', '.A..SF', '.A.R.F', '.APRSF',
       '.A.RS.', '.A.RSF'], dtype=object)
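
Every `Flags` value above is a six-character string in which a dot means "flag not set". One way to encode it is a binary column per position. The sketch below assumes the positions follow the usual TCP flag order URG, ACK, PSH, RST, SYN, FIN, and the name `expand_flags` is hypothetical.

```python
import pandas as pd

# Assumed meaning of the six positions, left to right.
FLAG_NAMES = ["URG", "ACK", "PSH", "RST", "SYN", "FIN"]

def expand_flags(flags: pd.Series) -> pd.DataFrame:
    """Turn strings like '.AP.SF' into six 0/1 indicator columns."""
    columns = {
        f"Flag_{name}": (flags.str[position] != ".").astype(int)
        for position, name in enumerate(FLAG_NAMES)
    }
    return pd.DataFrame(columns, index=flags.index)

toy_flags = pd.Series([".AP...", "....S.", ".A.RSF"])
print(expand_flags(toy_flags))
```

Compared with one-hot encoding the 20 distinct strings, this keeps the number of columns at six and lets a model use each flag independently.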

In [13]:
df['class'].unique()


array(['normal', 'victim', 'attacker'], dtype=object)

In [14]:
#means = df.groupby('Src IP Addr')['class'].mean()  # needs a numeric 'class' column, so encode it first
#display(means)


In [15]:
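# One-hot encode the protocol column.
# Note: sparse_output and get_feature_names_out require scikit-learn >= 1.2;
# older releases use OneHotEncoder(sparse=False) and get_feature_names instead.

temp = df[['Proto']]
encoder = OneHotEncoder(sparse_output=False)

df_onehot_enc = pd.DataFrame(encoder.fit_transform(temp), index=df.index)
df_onehot_enc.columns = encoder.get_feature_names_out(['Proto'])

# Append the indicator columns and drop the original text column.
df = pd.concat([df, df_onehot_enc], axis=1)
df.drop(['Proto'], axis=1, inplace=True)
df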


Unnamed: 0,Duration,Src IP Addr,Src Pt,Dst IP Addr,Dst Pt,Packets,Bytes,Flags,class,Proto_ICMP,Proto_IGMP,Proto_TCP,Proto_UDP
0,0.000,192.168.100.5,445,192.168.220.16,58844.0,1,108.0,.AP...,normal,0.0,0.0,1.0,0.0
1,0.000,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,.AP...,normal,0.0,0.0,1.0,0.0
2,0.004,192.168.220.15,48888,192.168.100.5,445.0,2,174.0,.AP...,normal,0.0,0.0,1.0,0.0
3,0.004,192.168.220.16,58844,192.168.100.5,445.0,2,174.0,.AP...,normal,0.0,0.0,1.0,0.0
4,0.000,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,.AP...,normal,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8451515,0.248,192.168.200.8,62605,EXT_SERVER,8082.0,2,319.0,.AP...,normal,0.0,0.0,1.0,0.0
8451516,0.000,10179_174,443,192.168.210.5,51433.0,1,54.0,.A....,normal,0.0,0.0,1.0,0.0
8451517,0.000,192.168.210.5,51433,10179_174,443.0,1,55.0,.A....,normal,0.0,0.0,1.0,0.0
8451518,0.000,192.168.100.5,445,192.168.220.6,56281.0,1,108.0,.AP...,normal,0.0,0.0,1.0,0.0


In [16]:
# labelencoder class

label_encoder = LabelEncoder()
df['class'] = label_encoder.fit_transform(df['class'])

In [17]:
df['class'].value_counts()

1    7010897
0     746230
2     694393
Name: class, dtype: int64
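
`LabelEncoder` assigns integer codes in sorted order of the labels, so the counts above read as 0 = attacker, 1 = normal, 2 = victim. A small check on toy labels (the name `demo_encoder` is only for illustration, so the fitted `label_encoder` is not overwritten):

```python
from sklearn.preprocessing import LabelEncoder

demo_encoder = LabelEncoder()
demo_encoder.fit(["normal", "victim", "attacker", "normal"])

# classes_ holds the original labels; the position of each label is its code.
mapping = {label: code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)  # {'attacker': 0, 'normal': 1, 'victim': 2}
```

On the fitted encoder from the cell above, `label_encoder.inverse_transform([0, 1, 2])` recovers the original names for reports.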

In [18]:
df

Unnamed: 0,Duration,Src IP Addr,Src Pt,Dst IP Addr,Dst Pt,Packets,Bytes,Flags,class,Proto_ICMP,Proto_IGMP,Proto_TCP,Proto_UDP
0,0.000,192.168.100.5,445,192.168.220.16,58844.0,1,108.0,.AP...,1,0.0,0.0,1.0,0.0
1,0.000,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,.AP...,1,0.0,0.0,1.0,0.0
2,0.004,192.168.220.15,48888,192.168.100.5,445.0,2,174.0,.AP...,1,0.0,0.0,1.0,0.0
3,0.004,192.168.220.16,58844,192.168.100.5,445.0,2,174.0,.AP...,1,0.0,0.0,1.0,0.0
4,0.000,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,.AP...,1,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8451515,0.248,192.168.200.8,62605,EXT_SERVER,8082.0,2,319.0,.AP...,1,0.0,0.0,1.0,0.0
8451516,0.000,10179_174,443,192.168.210.5,51433.0,1,54.0,.A....,1,0.0,0.0,1.0,0.0
8451517,0.000,192.168.210.5,51433,10179_174,443.0,1,55.0,.A....,1,0.0,0.0,1.0,0.0
8451518,0.000,192.168.100.5,445,192.168.220.6,56281.0,1,108.0,.AP...,1,0.0,0.0,1.0,0.0


#### 3. The dataset is imbalanced in the number of rows present for each label. The following techniques are useful:
    * Removing duplicates
    * Downsampling
    * Stratified sampling\
**Apply these techniques to get a balanced _training_ and _test_ dataset.**
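
The three techniques can be combined as in the sketch below. It runs on a small toy frame instead of the 8.4 million rows, and the helper `balance_by_downsampling`, the seed and the 25% test size are illustrative choices, not part of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def balance_by_downsampling(frame, label_col, random_state=42):
    """Drop duplicate rows, then downsample every class to the smallest one."""
    frame = frame.drop_duplicates()
    smallest = frame[label_col].value_counts().min()
    parts = [
        resample(group, replace=False, n_samples=smallest, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    # Shuffle so the classes are not stored in consecutive blocks.
    return pd.concat(parts).sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data: 40 rows of class 1, 12 of class 0 and 8 of class 2.
toy = pd.DataFrame({"Packets": range(60), "class": [1] * 40 + [0] * 12 + [2] * 8})
balanced = balance_by_downsampling(toy, "class")

# stratify keeps the class proportions identical in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    balanced.drop(columns="class"),
    balanced["class"],
    test_size=0.25,
    stratify=balanced["class"],
    random_state=42,
)
print(balanced["class"].value_counts())
print(y_test.value_counts())
```

Downsampling discards most of the majority class, which is acceptable here because the smallest class still has several hundred thousand rows before duplicates are removed.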

### Train/evaluate algorithms
We want to compare at least 5 different algorithms with relevant scores for each class.
Do this in the following way:
* Show scores for each unique value of the label column.
* Compare scores on train and test data.

After doing this analysis answer the question below.
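
A possible shape for this comparison is sketched below. It uses synthetic three-class data from `make_classification` in place of the prepared CIDDS-001 features, and the five models are only examples. Substitute your own `X_train`, `X_test`, `y_train` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded, balanced dataset (3 classes).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "k-nearest neighbours": KNeighborsClassifier(),
    "Gaussian naive Bayes": GaussianNB(),
}

reports = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Per-class precision, recall and F1 on both splits.
    reports[name] = {
        "train": classification_report(y_train, model.predict(X_train), output_dict=True),
        "test": classification_report(y_test, model.predict(X_test), output_dict=True),
    }
    print(f"=== {name} (test) ===")
    print(classification_report(y_test, model.predict(X_test)))
```

A large gap between the train and test scores of the same model points to overfitting, which is what the train/test comparison is meant to reveal.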

In [19]:
#code

#### Question: What is your best performing algorithm? Explain.

#### Answer: ...