# CIDDS-01 dataset
The CIDDS-01 dataset was produced by simulating a network of computers (on virtual machines).\
The data is captured in Flow format and not from Pcap files. Each data row will contain an anonimized from/to IP.\
In a first phase attacks were simulated on the network by running scripts from a device within the network. This produced *'internal'* traffic which is saved seperately. In a second phase the researchers opened up the virtual network to the internet and manually labeled attacks in the wild.

More information can be found here: https://www.hs-coburg.de/forschung/forschungsprojekte-oeffentlich/informationstechnologie/cidds-coburg-intrusion-detection-data-sets.html#c6119

We will be working on the **first week of the internal traffic of dataset CIDDS-01**.

In [1]:
import pandas as pd
import numpy as np
from sklearn.utils import resample

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import time

## Assignment
We will be performing **multi-class classification** on the CIDDS-001-DATASET.

An overview of the steps:
1. Exploring & resampling data
2. preprocessing dataset
3. Training & evaluating algorithms

### Preprocessing data

#### 1. First explore the dataset and read the technical report to identify column(s) that you don't need. Also identify columns that contain information you couldn't use when making predictions. One of the columns might contain data _written down in different ways_.\
**Remove some column(s). Correct format of column(s)**

In [2]:
df = pd.read_csv('CIDDS-001-internal-week1.csv')
df.head(5)

Unnamed: 0,Date first seen,Duration,Proto,Src IP Addr,Src Pt,Dst IP Addr,Dst Pt,Packets,Bytes,Flows,Flags,Tos,class,attackType,attackID,attackDescription
0,2017-03-15 00:01:16.632,0.0,TCP,192.168.100.5,445,192.168.220.16,58844.0,1,108.0,1,.AP...,0,normal,---,---,---
1,2017-03-15 00:01:16.552,0.0,TCP,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,1,.AP...,0,normal,---,---,---
2,2017-03-15 00:01:16.551,0.004,TCP,192.168.220.15,48888,192.168.100.5,445.0,2,174.0,1,.AP...,0,normal,---,---,---
3,2017-03-15 00:01:16.631,0.004,TCP,192.168.220.16,58844,192.168.100.5,445.0,2,174.0,1,.AP...,0,normal,---,---,---
4,2017-03-15 00:01:16.552,0.0,TCP,192.168.100.5,445,192.168.220.15,48888.0,1,108.0,1,.AP...,0,normal,---,---,---


In [3]:
print(df.shape)

(8451520, 16)


The dataset has 8451520 rows and 16 columns.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8451520 entries, 0 to 8451519
Data columns (total 16 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Date first seen    object 
 1   Duration           float64
 2   Proto              object 
 3   Src IP Addr        object 
 4   Src Pt             int64  
 5   Dst IP Addr        object 
 6   Dst Pt             float64
 7   Packets            int64  
 8   Bytes              float64
 9   Flows              int64  
 10  Flags              object 
 11  Tos                int64  
 12  class              object 
 13  attackType         object 
 14  attackID           object 
 15  attackDescription  object 
dtypes: float64(3), int64(4), object(9)
memory usage: 1.0+ GB


- The attributes of the dataset consists of 3 float, 4 integer and 9 object data type

In [5]:
df.describe()

Unnamed: 0,Duration,Src Pt,Dst Pt,Packets,Bytes,Flows,Tos
count,8451520.0,8451520.0,8451520.0,8451520.0,8451520.0,8451520.0,8451520.0
mean,0.1141597,24570.2,24248.92,15.03053,18965.62,1.0,9.091827
std,0.7683694,24897.69,24886.17,976.8317,2110158.0,0.0,14.90426
min,0.0,0.0,0.0,1.0,42.0,1.0,0.0
25%,0.0,80.0,80.0,1.0,66.0,1.0,0.0
50%,0.0,32775.0,8082.0,2.0,152.0,1.0,0.0
75%,0.025,49745.0,49612.0,4.0,479.0,1.0,32.0
max,224.412,65535.0,65535.0,208768.0,541274900.0,1.0,192.0


In [6]:
# check for missing values
df.isnull().sum()

Date first seen      0
Duration             0
Proto                0
Src IP Addr          0
Src Pt               0
Dst IP Addr          0
Dst Pt               0
Packets              0
Bytes                0
Flows                0
Flags                0
Tos                  0
class                0
attackType           0
attackID             0
attackDescription    0
dtype: int64

In [7]:
df.columns.tolist()

['Date first seen',
 'Duration',
 'Proto',
 'Src IP Addr',
 'Src Pt',
 'Dst IP Addr',
 'Dst Pt',
 'Packets',
 'Bytes',
 'Flows',
 'Flags',
 'Tos',
 'class',
 'attackType',
 'attackID',
 'attackDescription']

Overview columns:
1. Date first seen 
2. Duration
3. Proto 
4. Src IP Addr 
5. Src Pt
6. Dst IP Addr 
7. Dst Pt
8. Packets
9. Bytes
10. Flows 
11. Flags 
12. Tos
13. class
14. attackType
15. attackID 
16. attackDescription 

In [8]:
df.drop(['Flows','Date first seen', 'attackType','attackID', 'attackDescription','Src IP Addr','Dst IP Addr'], axis=1, inplace=True)

The following columns have been removed:
- Flows --> all values consists of 1s 
- Date first seen --> not relevant
- attackType --> additional information and is correlated with the class attribute
- attackID --> additional information and is correlated with the class attribute
- attackDescription --> additional information and is correlated with the class attribute
- Src IP Addr --> IPs were anonynimized, so they do not convey information 
- Dst IP Addr --> IPs were anonynimized, so they do not convey information 

In [9]:
df.head(5)

Unnamed: 0,Duration,Proto,Src Pt,Dst Pt,Packets,Bytes,Flags,Tos,class
0,0.0,TCP,445,58844.0,1,108.0,.AP...,0,normal
1,0.0,TCP,445,48888.0,1,108.0,.AP...,0,normal
2,0.004,TCP,48888,445.0,2,174.0,.AP...,0,normal
3,0.004,TCP,58844,445.0,2,174.0,.AP...,0,normal
4,0.0,TCP,445,48888.0,1,108.0,.AP...,0,normal


In [10]:
df.shape

(8451520, 9)

#### 2. The dataset is imbalanced in the amount of rows present for each label. Following techniques are usefull.
    * Removing duplicates
    * Downsampling
    * Stratified sampling\
    **Apply these techniques to get a balanced _training_ en _test_ dataset.**
> you can use the resample function for downsampling. See documentation:\
https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html \
> you can use the stratified option of train_test_split for stratified sampling. See documentation:\
> https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [11]:
X = df.drop('class', axis=1)
y = df['class']

# make a train and test set - 70/30 and apply stratified sampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [12]:
# combine them back for resampling
train_data = pd.concat([X_train, y_train], axis=1)

In [13]:
train_data['class'].value_counts()

normal      4907628
attacker     522361
victim       486075
Name: class, dtype: int64

In [14]:
# seperate majority and minority classes
df_majority = train_data[train_data['class'] == 'normal']
df_middle = train_data[train_data['class'] == 'attacker']
df_minority = train_data[train_data['class'] == 'victim']

In [15]:
# downsample majority
df_majority_downsampled = resample(df_majority, replace=False, n_samples=486075, random_state=42)
#df_majority_downsampled['class'].value_counts()

# downsample middle
df_middle_downsampled = resample(df_middle, replace=False, n_samples=486075, random_state=42)
#df_middle_downsampled['class'].value_counts()

In [16]:
# Combine minority class, middle class, with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled,df_middle_downsampled,df_minority])
df_downsampled['class'].value_counts()  

victim      486075
normal      486075
attacker    486075
Name: class, dtype: int64

In [17]:
X_train = df_downsampled.drop('class', axis=1)
y_train = df_downsampled['class']

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(1458225, 8)
(1458225,)
(2535456, 8)
(2535456,)


#### 3. Identify all the _categorical_ and _numerical_ columns and think about which encoding technique you can use for each column.\
**Now apply your encoding techniques**

#### Training set

In [18]:
# combine them back for encoding
train_df = pd.concat([X_train, y_train], axis=1)
train_df

Unnamed: 0,Duration,Proto,Src Pt,Dst Pt,Packets,Bytes,Flags,Tos,class
6822045,0.000,TCP,60921,80.0,1,55.0,.A....,0,normal
3863019,0.000,UDP,53,58597.0,2,252.0,......,0,normal
5257122,0.039,TCP,52269,443.0,2,290.0,.AP...,0,normal
7797663,0.000,TCP,80,54692.0,1,66.0,.A....,32,normal
1113067,0.000,TCP,58772,443.0,1,55.0,.A....,0,normal
...,...,...,...,...,...,...,...,...,...
1060607,0.004,TCP,80,56740.0,3,206.0,.A..SF,0,victim
6146534,0.004,TCP,80,60571.0,3,206.0,.A..SF,0,victim
37594,0.000,TCP,50300,51357.0,1,54.0,.A.R..,0,victim
890189,0.003,TCP,80,37989.0,3,206.0,.A..SF,0,victim


In [19]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1458225 entries, 6822045 to 548360
Data columns (total 9 columns):
 #   Column    Non-Null Count    Dtype  
---  ------    --------------    -----  
 0   Duration  1458225 non-null  float64
 1   Proto     1458225 non-null  object 
 2   Src Pt    1458225 non-null  int64  
 3   Dst Pt    1458225 non-null  float64
 4   Packets   1458225 non-null  int64  
 5   Bytes     1458225 non-null  float64
 6   Flags     1458225 non-null  object 
 7   Tos       1458225 non-null  int64  
 8   class     1458225 non-null  object 
dtypes: float64(3), int64(3), object(3)
memory usage: 111.3+ MB


After removing the columns the attributes consists of 3 float, 3 integer and 3 object data types.

In [20]:
train_df['Proto'].value_counts()

TCP      1378586
UDP        73013
ICMP        6605
IGMP          21
Name: Proto, dtype: int64

In [21]:
train_df['Flags'].value_counts()

.A..SF    439028
.AP.SF    422101
.A....    185035
......     79639
.AP...     72788
....S.     70227
.AP.S.     59475
.A.R..     44284
.A...F     42567
.AP..F     17978
.A..S.     15948
...R..      3199
.APR.F      2164
...RS.      2030
.A.R.F       660
.APR..       394
.APRSF       349
.APRS.       284
.A.RS.        53
.A.RSF        22
Name: Flags, dtype: int64

In [22]:
# onehotencoding proto and flags

# selecting categorical data attributes
cat_col = ['Proto','Flags']

In [23]:
# creating a dataframe with only categorical attributes
categorical = train_df[cat_col]
categorical.head()

Unnamed: 0,Proto,Flags
6822045,TCP,.A....
3863019,UDP,......
5257122,TCP,.AP...
7797663,TCP,.A....
1113067,TCP,.A....


In [24]:
# one-hot-encoding categorical attributes using pandas.get_dummies() function
categorical = pd.get_dummies(categorical,columns=cat_col)
categorical

Unnamed: 0,Proto_ICMP,Proto_IGMP,Proto_TCP,Proto_UDP,Flags_......,Flags_....S.,Flags_...R..,Flags_...RS.,Flags_.A....,Flags_.A...F,...,Flags_.A.RS.,Flags_.A.RSF,Flags_.AP...,Flags_.AP..F,Flags_.AP.S.,Flags_.AP.SF,Flags_.APR..,Flags_.APR.F,Flags_.APRS.,Flags_.APRSF
6822045,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3863019,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5257122,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
7797663,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1113067,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1060607,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6146534,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
37594,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
890189,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:
train_df = pd.concat([train_df,categorical], axis=1)
train_df = train_df.drop(['Proto', 'Flags'], axis=1)
train_df

Unnamed: 0,Duration,Src Pt,Dst Pt,Packets,Bytes,Tos,class,Proto_ICMP,Proto_IGMP,Proto_TCP,...,Flags_.A.RS.,Flags_.A.RSF,Flags_.AP...,Flags_.AP..F,Flags_.AP.S.,Flags_.AP.SF,Flags_.APR..,Flags_.APR.F,Flags_.APRS.,Flags_.APRSF
6822045,0.000,60921,80.0,1,55.0,0,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3863019,0.000,53,58597.0,2,252.0,0,normal,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5257122,0.039,52269,443.0,2,290.0,0,normal,0,0,1,...,0,0,1,0,0,0,0,0,0,0
7797663,0.000,80,54692.0,1,66.0,32,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1113067,0.000,58772,443.0,1,55.0,0,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1060607,0.004,80,56740.0,3,206.0,0,victim,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6146534,0.004,80,60571.0,3,206.0,0,victim,0,0,1,...,0,0,0,0,0,0,0,0,0,0
37594,0.000,50300,51357.0,1,54.0,0,victim,0,0,1,...,0,0,0,0,0,0,0,0,0,0
890189,0.003,80,37989.0,3,206.0,0,victim,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [26]:
train_df['class'].value_counts()

victim      486075
normal      486075
attacker    486075
Name: class, dtype: int64

In [27]:
# labelencoder class 
le = LabelEncoder()

train_df['class'] = le.fit_transform(train_df['class'])

#### Test set

In [28]:
# combine them back for encoding - test set
test_df = pd.concat([X_test, y_test], axis=1)
test_df

Unnamed: 0,Duration,Proto,Src Pt,Dst Pt,Packets,Bytes,Flags,Tos,class
6371433,0.000,TCP,57941,443.0,1,55.0,.A....,0,normal
82757,0.041,UDP,37410,53.0,2,156.0,......,0,normal
1510237,0.027,TCP,60803,8082.0,2,338.0,.AP...,0,normal
5175140,0.000,TCP,443,64601.0,1,66.0,.A....,32,normal
630887,0.000,UDP,49277,53.0,2,150.0,......,0,normal
...,...,...,...,...,...,...,...,...,...
973592,0.000,TCP,443,60914.0,1,66.0,.A....,32,normal
6542900,0.000,TCP,46206,80.0,1,66.0,.A....,0,normal
4909270,0.871,TCP,51739,445.0,3,240.0,.AP...,0,normal
2116069,0.000,TCP,80,33607.0,1,66.0,.A....,32,normal


In [29]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2535456 entries, 6371433 to 5154676
Data columns (total 9 columns):
 #   Column    Dtype  
---  ------    -----  
 0   Duration  float64
 1   Proto     object 
 2   Src Pt    int64  
 3   Dst Pt    float64
 4   Packets   int64  
 5   Bytes     float64
 6   Flags     object 
 7   Tos       int64  
 8   class     object 
dtypes: float64(3), int64(3), object(3)
memory usage: 193.4+ MB


After removing the columns the attributes consists of 3 float, 3 integer and 3 object data types.

In [30]:
# onehotencoding proto and flags

# selecting categorical data attributes
cat_col = ['Proto','Flags']

In [31]:
# creating a dataframe with only categorical attributes
categorical = test_df[cat_col]
categorical.head()

Unnamed: 0,Proto,Flags
6371433,TCP,.A....
82757,UDP,......
1510237,TCP,.AP...
5175140,TCP,.A....
630887,UDP,......


In [32]:
# one-hot-encoding categorical attributes using pandas.get_dummies() function
categorical = pd.get_dummies(categorical,columns=cat_col)
categorical

Unnamed: 0,Proto_ICMP,Proto_IGMP,Proto_TCP,Proto_UDP,Flags_......,Flags_....S.,Flags_...R..,Flags_...RS.,Flags_.A....,Flags_.A...F,...,Flags_.A.RS.,Flags_.A.RSF,Flags_.AP...,Flags_.AP..F,Flags_.AP.S.,Flags_.AP.SF,Flags_.APR..,Flags_.APR.F,Flags_.APRS.,Flags_.APRSF
6371433,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
82757,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1510237,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
5175140,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
630887,0,0,0,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
973592,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
6542900,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4909270,0,0,1,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2116069,0,0,1,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
test_df = pd.concat([test_df,categorical], axis=1)
test_df = test_df.drop(['Proto', 'Flags'], axis=1)
test_df

Unnamed: 0,Duration,Src Pt,Dst Pt,Packets,Bytes,Tos,class,Proto_ICMP,Proto_IGMP,Proto_TCP,...,Flags_.A.RS.,Flags_.A.RSF,Flags_.AP...,Flags_.AP..F,Flags_.AP.S.,Flags_.AP.SF,Flags_.APR..,Flags_.APR.F,Flags_.APRS.,Flags_.APRSF
6371433,0.000,57941,443.0,1,55.0,0,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
82757,0.041,37410,53.0,2,156.0,0,normal,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1510237,0.027,60803,8082.0,2,338.0,0,normal,0,0,1,...,0,0,1,0,0,0,0,0,0,0
5175140,0.000,443,64601.0,1,66.0,32,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
630887,0.000,49277,53.0,2,150.0,0,normal,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
973592,0.000,443,60914.0,1,66.0,32,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6542900,0.000,46206,80.0,1,66.0,0,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4909270,0.871,51739,445.0,3,240.0,0,normal,0,0,1,...,0,0,1,0,0,0,0,0,0,0
2116069,0.000,80,33607.0,1,66.0,32,normal,0,0,1,...,0,0,0,0,0,0,0,0,0,0


In [34]:
# labelencoder test
le = LabelEncoder()

test_df['class'] = le.fit_transform(test_df['class'])

In [35]:
test_df['class'].value_counts()

1    2103269
0     223869
2     208318
Name: class, dtype: int64

In [36]:
test_df

Unnamed: 0,Duration,Src Pt,Dst Pt,Packets,Bytes,Tos,class,Proto_ICMP,Proto_IGMP,Proto_TCP,...,Flags_.A.RS.,Flags_.A.RSF,Flags_.AP...,Flags_.AP..F,Flags_.AP.S.,Flags_.AP.SF,Flags_.APR..,Flags_.APR.F,Flags_.APRS.,Flags_.APRSF
6371433,0.000,57941,443.0,1,55.0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
82757,0.041,37410,53.0,2,156.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1510237,0.027,60803,8082.0,2,338.0,0,1,0,0,1,...,0,0,1,0,0,0,0,0,0,0
5175140,0.000,443,64601.0,1,66.0,32,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
630887,0.000,49277,53.0,2,150.0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
973592,0.000,443,60914.0,1,66.0,32,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
6542900,0.000,46206,80.0,1,66.0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4909270,0.871,51739,445.0,3,240.0,0,1,0,0,1,...,0,0,1,0,0,0,0,0,0,0
2116069,0.000,80,33607.0,1,66.0,32,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0


### train test split

In [37]:
# shuffle
from sklearn.utils import shuffle

train_df = shuffle(train_df)
test_df = shuffle(test_df)

In [38]:
X_train = train_df.drop('class', axis=1)
y_train = train_df['class']

X_test = test_df.drop('class', axis=1)
y_test = test_df['class']

In [39]:
num_cols = ['Duration','Src Pt','Dst Pt','Packets','Bytes','Tos']

for i in num_cols:
    
    # fit on training data column
    scale = StandardScaler().fit(X_train[[i]])
    
    # transform the training data column
    X_train[i] = scale.transform(X_train[[i]])
    
    # transform the testing data column
    X_test[i] = scale.transform(X_test[[i]])

### Train/evaluate algorithms
We want to compare at least 5 different algorithms with relevant scores for each class.
Do this in the following way:
* Show scores for each unique value of the label column.
* Compare scores on train and test data.

After doing this analysis answer the question below.

### 1. Logistic regression

In [40]:
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(solver='lbfgs', C=10)

start = time.time()
model1.fit(X_train, y_train)
end = time.time()

duration = end - start
    
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 47 seconds to train the model


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [41]:
y_pred = model1.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.91      0.99      0.95    223869
           1       1.00      0.99      0.99   2103269
           2       0.97      1.00      0.98    208318

    accuracy                           0.99   2535456
   macro avg       0.96      0.99      0.98   2535456
weighted avg       0.99      0.99      0.99   2535456

[[ 222674    1141      54]
 [  22391 2074639    6239]
 [     68     625  207625]]
98.79635063672964


In [42]:
print("score on train: "+ str(model1.score(X_train, y_train) * 100))
print("score on test: " + str(model1.score(X_test, y_test) * 100))

score on train: 99.25306451336385
score on test: 98.79635063672964


The logistic regression algorithm took 47 seconds to train a model and has a remarkable result of 98.79% accuracy. Moreover, there is no overfitting as the training and test score are almost similar.

### 2. Naive Bayes

In [43]:
from sklearn.naive_bayes import GaussianNB

model2 = GaussianNB()

start = time.time()
model2.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 1 seconds to train the model


In [44]:
y_pred = model2.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.89      0.99      0.94    223869
           1       1.00      0.97      0.98   2103269
           2       0.84      1.00      0.91    208318

    accuracy                           0.97   2535456
   macro avg       0.91      0.99      0.94   2535456
weighted avg       0.98      0.97      0.97   2535456

[[ 222595    1078     196]
 [  28430 2035882   38957]
 [     68     275  207975]]
97.27843827698055


In [45]:
print("score on train: "+ str(model2.score(X_train, y_train) * 100))
print("score on test: " + str(model2.score(X_test, y_test) * 100))

score on train: 98.6945087349346
score on test: 97.27843827698055


The Naive Bayes algorithm is very fast as it only took 1 second to train the model. Furthermore, it provides a very good result of 97.27% accuracy. The score on the training and test are almost similar, which indicates that there is no overfitting.

### 3. Random Forest

In [46]:
from sklearn.ensemble import RandomForestClassifier

model3 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')

start = time.time()
model3.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 17 seconds to train the model


In [47]:
y_pred = model3.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    223869
           1       1.00      1.00      1.00   2103269
           2       1.00      1.00      1.00    208318

    accuracy                           1.00   2535456
   macro avg       1.00      1.00      1.00   2535456
weighted avg       1.00      1.00      1.00   2535456

[[ 223498     371       0]
 [    467 2102325     477]
 [      0      62  208256]]
99.9456902427019


In [48]:
print("score on train: "+ str(model3.score(X_train, y_train) * 100))
print("score on test: " + str(model3.score(X_test, y_test) * 100))

score on train: 99.94877333744793
score on test: 99.9456902427019


The Random Forest algorithm took 17 seconds to train a model. Moreover, it has an outstanding result with an accuracy score of 99.94% and shows no overfitting issues.

### 4. Decision tree

In [49]:
from sklearn.tree import DecisionTreeClassifier

model4 = DecisionTreeClassifier(criterion='entropy')

start = time.time()
model4.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 6 seconds to train the model


In [50]:
y_pred = model4.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    223869
           1       1.00      1.00      1.00   2103269
           2       1.00      1.00      1.00    208318

    accuracy                           1.00   2535456
   macro avg       1.00      1.00      1.00   2535456
weighted avg       1.00      1.00      1.00   2535456

[[ 223507     362       0]
 [    676 2102058     535]
 [      0      50  208268]]
99.93598784597327


In [51]:
print("score on train: "+ str(model4.score(X_train, y_train) * 100))
print("score on test: " + str(model4.score(X_test, y_test) * 100))

score on train: 99.95185928097516
score on test: 99.93598784597327


The Decision Tree algorithm took 6 seconds to train a model. Furthermore, it has an outstanding result with an accuracy score of 99.93% and shows no overfitting issues.

### 5. XGBOOST

In [52]:
# xgboost
from xgboost import XGBClassifier

model5= XGBClassifier()
model5.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")



It took about 115 seconds to train the model


In [53]:
y_pred = model5.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    223869
           1       1.00      1.00      1.00   2103269
           2       1.00      1.00      1.00    208318

    accuracy                           1.00   2535456
   macro avg       1.00      1.00      1.00   2535456
weighted avg       1.00      1.00      1.00   2535456

[[ 223473     396       0]
 [    110 2103013     146]
 [      0      53  208265]]
99.97219435083866


In [54]:
print("score on train: "+ str(model5.score(X_train, y_train) * 100))
print("score on test: " + str(model5.score(X_test, y_test) * 100))

score on train: 99.9394469303434
score on test: 99.97219435083866


The XGBOOST algorithm took 115 seconds to train a model. Furthermore, it has an excellent result with an accuracy score of 99.97% and shows no overfitting issues.

#### Conclusion

In total we have trained 5 different algorithms which are Logistic Regression, Naive Bayes, Random Forest, Decision Tree and XGBOOST.

The results are remarkably good as all algorithms are showing results of above 97% accuracy score. On top we have the XGBOOST with 99.97% accuracy, followed by Random Forest with 99.94%, Decision Tree with 99.93%, Logistic Regression with 98.79% and at last Naive Bayes with 97.27%.

In terms of training time, we got on the first place naive bayes with 1 second, secondly decision tree with 6 seconds, followed by random forest with 17 seconds, logistic regression 47 seconds and at last XGBOOST with 115 seconds.

Therefore, I'll choose the random forest algorithm for training the model as it has a high accuracy rate of 99.94% and a reasonable training time with 17 seconds.

#### Question: What is your best performing algorithm? Explain.

#### Answer: The best performing algorithm is the Random Forest with 99.94% accuracy. The reason is because it has the power to handle a large data set with higher dimensionality. It selects features randomly during the training process. Therefore, it does not depend highly on any specific set of features and it can generalize over the data in a better way. On the other hand, the training time is reasonable and there is also no overfitting issues.