# Introduction
The official description of the KDD'99 dataset:
> The 1998 DARPA Intrusion Detection Evaluation Program was prepared and managed by MIT Lincoln Labs. The objective was to survey and evaluate research in intrusion detection.  A standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment, was provided.  The 1999 KDD intrusion detection contest uses a version of this dataset.

This dataset can be found on the UCI machine learning repository: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

The goal of the 1999 contest was to build an intrusion detection system using machine learning that can distinguish normal connections from attacks on the network. This is a binary classification task: there are only two classes to predict, encoded as 0 for "normal" connections and 1 for "abnormal" connections (attacks).

There are different ways to derive features from the TCP connections and audit logs of the original DARPA'98 dataset. In this lab you will only use the features that contain numerical values.

For this lab you should be familiar with the following Python libraries:
- Pandas: a data manipulation library
- Scikit-learn: a machine learning library
- Seaborn or matplotlib: a plotting library

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
import time
from tqdm.notebook import tqdm

#all columns of pandas dataframes will be printed out with the following option
pd.set_option('display.max_columns', None)

## 1. Data cleaning and exploration.

#### 1) Load in dataset  **KDD_data.txt** using pandas.

In [2]:
df = pd.read_csv('KDD_data.txt')
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.00,0,0,0.0,0.0,0.00,0.00,0.0,0.00,0.0,0.0,normal.
1,0,tcp,http,SF,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.00,1,1,1.0,0.0,1.00,0.00,0.0,0.00,0.0,0.0,normal.
2,0,tcp,http,SF,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.00,2,2,1.0,0.0,0.50,0.00,0.0,0.00,0.0,0.0,normal.
3,0,tcp,http,SF,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.00,3,3,1.0,0.0,0.33,0.00,0.0,0.00,0.0,0.0,normal.
4,0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.00,4,4,1.0,0.0,0.25,0.00,0.0,0.00,0.0,0.0,normal.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4898426,0,tcp,http,SF,212,2288,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,32,0.0,0.0,0.0,0.0,1.0,0.0,0.16,3,255,1.0,0.0,0.33,0.05,0.0,0.01,0.0,0.0,normal.
4898427,0,tcp,http,SF,219,236,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,33,0.0,0.0,0.0,0.0,1.0,0.0,0.15,4,255,1.0,0.0,0.25,0.05,0.0,0.01,0.0,0.0,normal.
4898428,0,tcp,http,SF,218,3610,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,5,34,0.0,0.0,0.0,0.0,1.0,0.0,0.15,5,255,1.0,0.0,0.20,0.05,0.0,0.01,0.0,0.0,normal.
4898429,0,tcp,http,SF,219,1234,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,35,0.0,0.0,0.0,0.0,1.0,0.0,0.14,6,255,1.0,0.0,0.17,0.05,0.0,0.01,0.0,0.0,normal.


#### 2) Look at the 'label' column. Make label column binary: **0 for normal connections, 1 for abnormal connections or attacks.**

In [3]:
df['label'].value_counts()

smurf.              2807886
neptune.            1072017
normal.              972781
satan.                15892
ipsweep.              12481
portsweep.            10413
nmap.                  2316
back.                  2203
warezclient.           1020
teardrop.               979
pod.                    264
guess_passwd.            53
buffer_overflow.         30
land.                    21
warezmaster.             20
imap.                    12
rootkit.                 10
loadmodule.               9
ftp_write.                8
multihop.                 7
phf.                      4
perl.                     3
spy.                      2
Name: label, dtype: int64

In [4]:
# convert label column to binary

label = {'normal.': 0, 'smurf.': 1, 'neptune.': 1, 'satan.': 1, 'ipsweep.': 1, 'portsweep.': 1,
         'nmap.': 1, 'back.': 1, 'warezclient.': 1, 'teardrop.': 1, 'pod.': 1, 'guess_passwd.': 1,
         'buffer_overflow.': 1, 'land.': 1, 'warezmaster.': 1, 'imap.': 1, 'rootkit.': 1,
         'loadmodule.': 1, 'ftp_write.': 1, 'multihop.': 1, 'phf.': 1, 'perl.': 1, 'spy.': 1}

df['label'] = [label[item] for item in df['label']]
df['label'].unique()

array([0, 1], dtype=int64)

In [5]:
# check value count of label
df['label'].value_counts()

1    3925650
0     972781
Name: label, dtype: int64
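
The dictionary lookup above can also be written as a single vectorised comparison. This is a minimal sketch, assuming every label other than `normal.` counts as an attack; the toy `labels` Series is illustrative and is not the lab data.

```python
import pandas as pd

# Toy stand-in for df['label']; the values mimic the KDD label strings.
labels = pd.Series(['normal.', 'smurf.', 'neptune.', 'normal.', 'satan.'])

# 'normal.' becomes 0, every other label becomes 1 (attack).
binary = (labels != 'normal.').astype(int)

print(binary.tolist())
print(binary.value_counts().to_dict())
```

The advantage over an explicit dictionary is that a label missing from the mapping cannot raise a `KeyError`; the drawback is that a misspelled `normal.` would silently be counted as an attack.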

#### 3) Explore the content of the dataset to determine which columns contain numbers or text.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898431 entries, 0 to 4898430
Data columns (total 42 columns):
 #   Column                       Dtype  
---  ------                       -----  
 0   duration                     int64  
 1   protocol_type                object 
 2   service                      object 
 3   flag                         object 
 4   src_bytes                    int64  
 5   dst_bytes                    int64  
 6   land                         int64  
 7   wrong_fragment               int64  
 8   urgent                       int64  
 9   hot                          int64  
 10  num_failed_logins            int64  
 11  logged_in                    int64  
 12  num_compromised              int64  
 13  root_shell                   int64  
 14  su_attempted                 int64  
 15  num_root                     int64  
 16  num_file_creations           int64  
 17  num_shells                   int64  
 18  num_access_files             int64  
 19  

- The dataset has 15 float, 24 integer and 3 object columns.
- Float and integer columns are numeric.
- Text columns have the `object` dtype in pandas.
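
The numeric/text distinction can also be queried programmatically instead of being read off `df.info()`. A minimal sketch on a toy frame, assuming the column names below only as illustrations; `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Toy frame with one integer, one text and one float column.
toy = pd.DataFrame({
    'duration': [0, 2],
    'protocol_type': ['tcp', 'udp'],
    'serror_rate': [0.0, 0.5],
})

# Numeric columns are the integer and float ones.
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
# Everything that is not numeric is treated as text here.
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```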

#### 4) Remove _ALL COLUMNS CONTAINING TEXT_. 

In [7]:
df = df.drop(['protocol_type', 'service', 'flag'], axis=1)
df.head(5)

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
1,0,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1,1,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0
2,0,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.0,2,2,1.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0
3,0,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3,3,1.0,0.0,0.33,0.0,0.0,0.0,0.0,0.0,0
4,0,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,4,4,1.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0


#### 5) Some columns containing numbers do not help an ML algorithm in predicting the labels. Find out what these column(s) are and remove them from your dataset. Hint: use statistics of the columns to make your decision, no security knowledge is needed.

In [8]:
df.describe()

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
count,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0,4898431.0
mean,48.34243,1834.621,1093.623,5.716116e-06,0.0006487792,7.961733e-06,0.01243766,3.205108e-05,0.143529,0.008088304,6.81851e-05,3.674646e-05,0.01293496,0.001188748,7.430951e-05,0.001021143,0.0,4.08294e-07,0.0008351654,334.9734,295.2671,0.1779703,0.178037,0.05766509,0.0577301,0.7898842,0.02117961,0.0282608,232.9811,189.2142,0.7537132,0.03071111,0.605052,0.006464107,0.1780911,0.1778859,0.0579278,0.05765941,0.8014097
std,723.3298,941431.1,645012.3,0.002390833,0.04285434,0.007215084,0.4689782,0.007299408,0.3506116,3.856481,0.008257146,0.008082432,3.938075,0.1241857,0.00873759,0.03551048,0.0,0.0006389788,0.02888716,211.9908,245.9927,0.3818756,0.3822541,0.2322529,0.2326604,0.3892958,0.08271458,0.1405596,64.02094,105.9128,0.411186,0.1085432,0.4809877,0.04125978,0.3818382,0.3821774,0.2309428,0.2309777,0.3989389
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,45.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,121.0,10.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,255.0,49.0,0.41,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,0.0,520.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,510.0,510.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,255.0,255.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.0,1032.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,511.0,511.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,255.0,255.0,1.0,0.04,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,58329.0,1379964000.0,1309937000.0,1.0,3.0,14.0,77.0,5.0,1.0,7479.0,1.0,2.0,7468.0,43.0,2.0,9.0,0.0,1.0,1.0,511.0,511.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,255.0,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


- Based on the statistics, `num_outbound_cmds` is irrelevant: its mean, standard deviation, minimum and maximum are all 0, so the column contains only zeros.
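
A constant column can also be detected without reading the `describe()` table by eye. A minimal sketch on a toy frame (the values are made up for illustration): a column with a single distinct value has zero variance and cannot help a classifier.

```python
import pandas as pd

toy = pd.DataFrame({
    'src_bytes': [215, 162, 236],
    'num_outbound_cmds': [0, 0, 0],  # constant, like in the lab data
    'logged_in': [1, 1, 0],
})

# Collect every column that holds exactly one distinct value.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```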

In [9]:
df = df.drop('num_outbound_cmds', axis=1)
df

Unnamed: 0,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,0,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.00,0,0,0.0,0.0,0.00,0.00,0.0,0.00,0.0,0.0,0
1,0,162,4528,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.00,1,1,1.0,0.0,1.00,0.00,0.0,0.00,0.0,0.0,0
2,0,236,1228,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,1,0.0,0.0,0.0,0.0,1.0,0.0,0.00,2,2,1.0,0.0,0.50,0.00,0.0,0.00,0.0,0.0,0
3,0,233,2032,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,2,2,0.0,0.0,0.0,0.0,1.0,0.0,0.00,3,3,1.0,0.0,0.33,0.00,0.0,0.00,0.0,0.0,0
4,0,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,3,0.0,0.0,0.0,0.0,1.0,0.0,0.00,4,4,1.0,0.0,0.25,0.00,0.0,0.00,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4898426,0,212,2288,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,3,32,0.0,0.0,0.0,0.0,1.0,0.0,0.16,3,255,1.0,0.0,0.33,0.05,0.0,0.01,0.0,0.0,0
4898427,0,219,236,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,4,33,0.0,0.0,0.0,0.0,1.0,0.0,0.15,4,255,1.0,0.0,0.25,0.05,0.0,0.01,0.0,0.0,0
4898428,0,218,3610,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,5,34,0.0,0.0,0.0,0.0,1.0,0.0,0.15,5,255,1.0,0.0,0.20,0.05,0.0,0.01,0.0,0.0,0
4898429,0,219,1234,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,6,35,0.0,0.0,0.0,0.0,1.0,0.0,0.14,6,255,1.0,0.0,0.17,0.05,0.0,0.01,0.0,0.0,0


In [10]:
# check for missing values
df.isnull().sum()

duration                       0
src_bytes                      0
dst_bytes                      0
land                           0
wrong_fragment                 0
urgent                         0
hot                            0
num_failed_logins              0
logged_in                      0
num_compromised                0
root_shell                     0
su_attempted                   0
num_root                       0
num_file_creations             0
num_shells                     0
num_access_files               0
is_host_login                  0
is_guest_login                 0
count                          0
srv_count                      0
serror_rate                    0
srv_serror_rate                0
rerror_rate                    0
srv_rerror_rate                0
same_srv_rate                  0
diff_srv_rate                  0
srv_diff_host_rate             0
dst_host_count                 0
dst_host_srv_count             0
dst_host_same_srv_rate         0
dst_host_d

In [11]:
df.shape

(4898431, 38)

In [12]:
# check for highly correlated variables and remove them

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
# np.bool was removed in NumPy 1.24; the built-in bool works in all versions
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(to_drop)

['num_root', 'srv_serror_rate', 'srv_rerror_rate', 'dst_host_same_srv_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']


In [13]:
# Drop features 
df.drop(to_drop, axis=1, inplace=True)

In [14]:
df.shape

(4898431, 30)

Removed columns are:
- num_outbound_cmds
- num_root
- srv_serror_rate
- srv_rerror_rate
- dst_host_same_srv_rate
- dst_host_serror_rate
- dst_host_srv_serror_rate
- dst_host_rerror_rate
- dst_host_srv_rerror_rate
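
The correlation filter used above can be wrapped in a small reusable function. This is a sketch under stated assumptions: the helper name `correlated_columns` and the toy data are hypothetical, and the function is meant to be called on the feature columns only, so that the label itself can never end up in the drop list.

```python
import numpy as np
import pandas as pd

def correlated_columns(features: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_copy': a * 2.0 + 1.0,    # linear function of 'a', so correlation is 1
    'b': rng.normal(size=200),  # independent noise
})

to_drop = correlated_columns(toy)
print(to_drop)
```

Of each highly correlated pair, only the later column is dropped, which matches the behaviour of the cell above.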

## Preprocessing
Make a training set and a test set. The training set should contain 70% of the data and the test set 30%.\
Use the *shuffle function from the sklearn library*, as seen in the lesson.\
There are other functions for splitting datasets, but use the shuffle function for now.


In [15]:
from sklearn.utils import shuffle

X = df.drop(['label'], axis=1)
y = df['label']

X, y = shuffle(X, y)

X_train = X[:round(len(X)*0.7)]
X_test = X[round(len(X)*0.7):]
y_train = y[:round(len(X)*0.7)]
y_test = y[round(len(X)*0.7):]
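
For comparison, the same 70/30 split can be expressed with `train_test_split`, which is already imported at the top of the notebook. A minimal sketch on toy arrays; `stratify` and `random_state=42` are illustrative choices that are not part of the original lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 20% of class 0 and 80% of class 1.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 4 + [1] * 16)

# stratify keeps the class ratio roughly equal in both parts;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

print(X_tr.shape, X_te.shape)
```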

In [16]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

## Model training & evaluation
Train 4 different algorithms and evaluate their performance. \
Note that the dataset is quite large, so training all your algorithms may take some time. Depending on your device, some algorithms might use too much RAM or time out. Lightweight algorithms are advised.


Your analysis should include:
* Accuracy scores
* Time to train
* Comparison between performance on the train and test sets (overfitting analysis)
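
The three measurements listed above can be collected by one helper so that every model is evaluated the same way. A minimal sketch: the function name `fit_and_report` and the synthetic data from `make_classification` are assumptions made for illustration, not part of the lab.

```python
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_and_report(model, X_train, y_train, X_test, y_test):
    """Fit the model and return training time plus train/test accuracy."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    train_accuracy = model.score(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    return {
        'train_seconds': train_seconds,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        # A large positive gap hints at overfitting.
        'gap': train_accuracy - test_accuracy,
    }

# Small synthetic problem standing in for the KDD data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

report = fit_and_report(DecisionTreeClassifier(random_state=0), X_tr, y_tr, X_te, y_te)
print(report)
```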

### 1. Logistic regression

In [17]:
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(solver='lbfgs', C=10)

start = time.time()
model1.fit(X_train, y_train)
end = time.time()

duration = end - start
    
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 29 seconds to train the model


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [18]:
y_pred = model1.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.99      0.99      0.99    291892
           1       1.00      1.00      1.00   1177637

    accuracy                           1.00   1469529
   macro avg       1.00      1.00      1.00   1469529
weighted avg       1.00      1.00      1.00   1469529

[[ 290421    1471]
 [   2449 1175188]]
99.73324786377131


In [19]:
print("score on train: "+ str(model1.score(X_train, y_train) * 100))
print("score on test: " + str(model1.score(X_test, y_test) * 100))

score on train: 99.73892517196467
score on test: 99.73324786377131


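The logistic regression model took 29 seconds to train and reaches 99.73% accuracy on the test set. The training score (99.74%) and the test score (99.73%) are nearly identical, so there is no sign of overfitting. Note that the lbfgs solver stopped at its iteration limit before converging, as the warning in the training output shows.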
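
The convergence warning printed during training suggests raising `max_iter`. A minimal sketch on synthetic data, assuming `max_iter=1000` as an illustrative value; whether that is enough on the full KDD data has not been checked here.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled training set.
X_toy, y_toy = make_classification(n_samples=1000, n_features=20, random_state=0)
X_toy = StandardScaler().fit_transform(X_toy)

# Same solver and C as in the lab, with a higher iteration budget.
model = LogisticRegression(solver='lbfgs', C=10, max_iter=1000)

# Turn the convergence warning into an error so a failure cannot go unnoticed.
with warnings.catch_warnings():
    warnings.simplefilter('error', ConvergenceWarning)
    model.fit(X_toy, y_toy)

# n_iter_ holds the number of iterations the solver actually used.
print(model.n_iter_)
```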

### 2. Naive Bayes

In [20]:
from sklearn.naive_bayes import GaussianNB

model2 = GaussianNB()

start = time.time()
model2.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 2 seconds to train the model


In [21]:
y_pred = model2.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.97      0.87      0.92    291892
           1       0.97      0.99      0.98   1177637

    accuracy                           0.97   1469529
   macro avg       0.97      0.93      0.95   1469529
weighted avg       0.97      0.97      0.97   1469529

[[ 253210   38682]
 [   7447 1170190]]
96.86096701732325


In [22]:
print("score on train: "+ str(model2.score(X_train, y_train) * 100))
print("score on test: " + str(model2.score(X_test, y_test) * 100))

score on train: 96.86608132865857
score on test: 96.86096701732325


Naive Bayes is very fast: training took only 2 seconds. It reaches 96.86% accuracy on the test set, and the training score (96.87%) is nearly identical, which indicates that there is no overfitting.
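
The per-class numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes them for the Naive Bayes matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported for Naive Bayes on the test set.
cm = np.array([[253210, 38682],
               [7447, 1170190]])

# With class 0 = normal and class 1 = attack:
# tn = normal predicted normal, fp = normal predicted attack,
# fn = attack predicted normal, tp = attack predicted attack.
tn, fp, fn, tp = cm.ravel()

recall_normal = tn / (tn + fp)     # share of normal traffic recognised as normal
precision_normal = tn / (tn + fn)  # share of "normal" predictions that are correct
accuracy = (tn + tp) / cm.sum()

print(round(recall_normal, 2), round(precision_normal, 2), round(accuracy * 100, 2))
```

The 38,682 normal connections flagged as attacks are what pull the recall for class 0 down to 0.87, even though overall accuracy stays near 97%.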

### 3. Random Forest

In [23]:
from sklearn.ensemble import RandomForestClassifier

model3 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')

start = time.time()
model3.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 29 seconds to train the model


In [24]:
y_pred = model3.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    291892
           1       1.00      1.00      1.00   1177637

    accuracy                           1.00   1469529
   macro avg       1.00      1.00      1.00   1469529
weighted avg       1.00      1.00      1.00   1469529

[[ 291871      21]
 [     90 1177547]]
99.99244655940781


In [25]:
print("score on train: "+ str(model3.score(X_train, y_train) * 100))
print("score on test: " + str(model3.score(X_test, y_test) * 100))

score on train: 99.99932923133994
score on test: 99.99244655940781


The random forest took 29 seconds to train. It reaches 99.99% accuracy on the test set, and the gap to the training score is tiny (99.9993% vs. 99.9924%), so there is no sign of overfitting.

### 4. Decision tree

In [26]:
from sklearn.tree import DecisionTreeClassifier

model4 = DecisionTreeClassifier(criterion='entropy')

start = time.time()
model4.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 15 seconds to train the model


In [27]:
y_pred = model4.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    291892
           1       1.00      1.00      1.00   1177637

    accuracy                           1.00   1469529
   macro avg       1.00      1.00      1.00   1469529
weighted avg       1.00      1.00      1.00   1469529

[[ 291843      49]
 [     55 1177582]]
99.99292290250821


In [28]:
print("score on train: "+ str(model4.score(X_train, y_train) * 100))
print("score on test: " + str(model4.score(X_test, y_test) * 100))

score on train: 99.99962086988779
score on test: 99.99292290250821


The decision tree took 15 seconds to train. It reaches 99.99% accuracy on the test set, and the gap to the training score is tiny (99.9996% vs. 99.9929%), so there is no sign of overfitting.

#### Conclusion

In total we have trained 4 different algorithms which are Logistic Regression, Naive Bayes, Random Forest and Decision Tree.

The results are remarkably good: all algorithms score above 96% accuracy. Logistic regression, decision tree and random forest all exceed 99.7%, while Naive Bayes does a little worse with 96.86%.

In terms of training time, Naive Bayes comes first with 2 seconds, followed by the decision tree with 15 seconds, and then logistic regression and random forest with 29 seconds each.

Therefore, I'll choose the decision tree: it has a high accuracy of 99.99% and a reasonable training time of 15 seconds.
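
The comparison can be summarised in one table. The sketch below only restates the training times and test accuracies reported in the sections above (accuracies rounded to three decimals); the ranking rule, accuracy first and training time as tie-breaker, is an illustrative assumption.

```python
import pandas as pd

# Figures copied from the outputs of the four model sections.
results = pd.DataFrame({
    'model': ['Logistic regression', 'Naive Bayes', 'Random forest', 'Decision tree'],
    'train_seconds': [29, 2, 29, 15],
    'test_accuracy': [99.733, 96.861, 99.992, 99.993],
})

# Highest accuracy first; shorter training time breaks ties.
ranked = results.sort_values(
    ['test_accuracy', 'train_seconds'], ascending=[False, True]
)

print(ranked.to_string(index=False))
```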