# CIC'18

CIC'18 is a fairly recent dataset made available by the Canadian Institute for Cybersecurity (CIC), a Canadian research institute. They wrote software called CICFlowMeter-V3 to extract features from the pcap files in NetFlow format. Most of these features are statistics about the traffic between a source and a destination IP during certain time windows. More information can be found here: https://www.unb.ca/cic/datasets/ids-2018.html

The research group created two profiles: 
* **B-profile** for **benign** connections on the network
* **M-profile** for **malicious** connections on the network

The profiles describe the kind of behaviour seen on the network (benign, (D)DoS attacks, web attacks, infiltration attacks, ...).\
Multiple tools are used within one M-profile.

The full collection of pcap files is very large (~50 GB).
The extracted features are ~5 GB in size.

To work smoothly with the data, a smaller sample of the original dataset was taken, which is ~0.5 GB in size.

# Assignment
_____________________________________________________
You will perform **binary classification** on this problem while holding out some of the tools as zero-day attacks. 
As seen in this week's lesson, attackers use unknown flaws in software to manipulate IT systems.\
We can simulate zero-day attacks by leaving certain exploit tools out of the train set.
___________________________________________________________________________
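
The hold-out idea can be sketched on a toy frame. Only the `Label` column name and the label spellings come from the dataset; the toy rows, the fixed row split and the name `zero_day_labels` are assumptions for illustration.

```python
import pandas as pd

# Toy flows; only the 'Label' column name and label spellings come from the dataset.
toy = pd.DataFrame({
    "Flow Duration": [10, 20, 30, 40, 50, 60],
    "Label": ["Benign", "DoS attacks-Hulk", "DoS attacks-Slowloris",
              "Benign", "DoS attacks-Slowloris", "DoS attacks-Hulk"],
})

# Hypothetical choice of tools to treat as zero-day attacks.
zero_day_labels = ["DoS attacks-Slowloris"]

# Rows 0-3 act as the train split, rows 4-5 as the test split.
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]

# Drop the zero-day tools from the train split only; the test split keeps them.
toy_train = toy_train[~toy_train["Label"].isin(zero_day_labels)]

print(sorted(toy_train["Label"].unique()))
print(sorted(toy_test["Label"].unique()))
```

The model never sees the held-out tool during training, yet it is still scored on it at test time.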

In [65]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split
import time

#### * **Exploring & cleaning dataset**: Using your knowledge of previous labs, explore the dataset, remove rows and columns that are not needed and come up with a strategy for preprocessing.\
Upsampling or downsampling the data is **NOT** needed.


In [66]:
df = pd.read_csv('CIC18.txt')
df

Unnamed: 0,Dst Port,Protocol,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,80,6,14766,3,4,315,935,315,0,105,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
1,80,6,30518,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
2,80,6,478,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
3,80,6,15018,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
4,80,6,17512,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1603862,53,17,2071,1,1,43,99,43,43,43,...,8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1603863,3389,6,2833891,9,7,1128,1581,661,0,125,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1603864,3389,6,3678344,10,7,1148,1581,677,0,114,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1603865,20231,6,86190790,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,86190790.0,0.0,86190790.0,86190790.0,Benign


In [67]:
print(df.shape)

(1603867, 79)


The dataset has 1603867 rows and 79 columns.

In [68]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1603867 entries, 0 to 1603866
Data columns (total 79 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Dst Port           1603867 non-null  int64  
 1   Protocol           1603867 non-null  int64  
 2   Flow Duration      1603867 non-null  int64  
 3   Tot Fwd Pkts       1603867 non-null  int64  
 4   Tot Bwd Pkts       1603867 non-null  int64  
 5   TotLen Fwd Pkts    1603867 non-null  int64  
 6   TotLen Bwd Pkts    1603867 non-null  int64  
 7   Fwd Pkt Len Max    1603867 non-null  int64  
 8   Fwd Pkt Len Min    1603867 non-null  int64  
 9   Fwd Pkt Len Mean   1603867 non-null  int64  
 10  Fwd Pkt Len Std    1603867 non-null  int64  
 11  Bwd Pkt Len Max    1603867 non-null  int64  
 12  Bwd Pkt Len Min    1603867 non-null  int64  
 13  Bwd Pkt Len Mean   1603867 non-null  int64  
 14  Bwd Pkt Len Std    1603867 non-null  int64  
 15  Flow Byts/s        1603867 non-n

- The attributes of the dataset consist of 36 float, 42 integer and 1 object columns.
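
A count like this can be read directly from the frame instead of from the `info()` listing. The snippet below is a minimal sketch on a toy frame with hypothetical column names; on the real data the same call would be applied to `df`.

```python
import pandas as pd

# Toy frame with two integer columns, one float column and one text column.
toy = pd.DataFrame({
    "a": [1, 2],
    "b": [0.5, 1.5],
    "c": [3, 4],
    "Label": ["Benign", "Bot"],
})

# Number of columns per dtype (dtype names converted to plain strings).
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)
```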

In [69]:
df['Label'].value_counts()

Benign                      733458
DDOS attack-HOIC            274404
DoS attacks-Hulk            184764
Bot                         114476
FTP-BruteForce               77344
SSH-Bruteforce               75035
Infilteration                64773
DoS attacks-SlowHTTPTest     55956
DoS attacks-GoldenEye        16603
DoS attacks-Slowloris         4396
DDOS attack-LOIC-UDP          1730
Brute Force -Web               611
Brute Force -XSS               230
SQL Injection                   87
Name: Label, dtype: int64

In [70]:
df.describe()

Unnamed: 0,Dst Port,Protocol,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,...,Fwd Act Data Pkts,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min
count,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,...,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0,1603867.0
mean,7481.739,7.445545,7492208.0,131.0898,4.439233,4378.411,2834.549,177.5638,5.99667,43.15188,...,128.081,22.50744,72181.24,32434.72,110536.9,51245.91,3021648.0,104020.7,3124266.0,2925615.0
std,17005.26,3.819363,24825380.0,3962.794,154.1345,127168.6,214261.3,292.7001,20.75356,61.60575,...,3962.309,9.00213,1320985.0,768087.5,1747283.0,1124720.0,13612810.0,1411382.0,13864860.0,13509580.0
min,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,53.0,6.0,486.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,80.0,6.0,8039.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,3389.0,6.0,285954.0,3.0,4.0,326.0,314.0,299.0,0.0,77.0,...,1.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,65534.0,17.0,120000000.0,309629.0,121091.0,9908128.0,156199200.0,1968.0,1460.0,1460.0,...,309628.0,56.0,114000000.0,67800000.0,114000000.0,114000000.0,120000000.0,75900000.0,120000000.0,120000000.0


In [71]:
one_value_features = [f for f in df.columns if df[f].nunique()<=1]
print(one_value_features)

['Bwd PSH Flags', 'Bwd URG Flags', 'Fwd Byts/b Avg', 'Fwd Pkts/b Avg', 'Fwd Blk Rate Avg', 'Bwd Byts/b Avg', 'Bwd Pkts/b Avg', 'Bwd Blk Rate Avg']


In [72]:
# remove columns that contain only 1 value
# note: drop() returns a new DataFrame; the result is only displayed here and not assigned,
# so df itself still contains these columns afterwards
df.drop(one_value_features, axis=1)

Unnamed: 0,Dst Port,Protocol,Flow Duration,Tot Fwd Pkts,Tot Bwd Pkts,TotLen Fwd Pkts,TotLen Bwd Pkts,Fwd Pkt Len Max,Fwd Pkt Len Min,Fwd Pkt Len Mean,...,Fwd Seg Size Min,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,80,6,14766,3,4,315,935,315,0,105,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
1,80,6,30518,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
2,80,6,478,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
3,80,6,15018,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
4,80,6,17512,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,DDOS attack-HOIC
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1603862,53,17,2071,1,1,43,99,43,43,43,...,8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1603863,3389,6,2833891,9,7,1128,1581,661,0,125,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1603864,3389,6,3678344,10,7,1148,1581,677,0,114,...,20,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Benign
1603865,20231,6,86190790,2,0,0,0,0,0,0,...,20,0.0,0.0,0.0,0.0,86190790.0,0.0,86190790.0,86190790.0,Benign


In [73]:
# check for missing values
df.isnull().sum()

Dst Port         0
Protocol         0
Flow Duration    0
Tot Fwd Pkts     0
Tot Bwd Pkts     0
                ..
Idle Mean        0
Idle Std         0
Idle Max         0
Idle Min         0
Label            0
Length: 79, dtype: int64

In [74]:
# check for highly correlated variables and remove them

# Create correlation matrix (numeric columns only; 'Label' is a text column)
corr_matrix = df.select_dtypes(include=np.number).corr().abs()

# Select upper triangle of correlation matrix (np.bool was removed from NumPy, use bool)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
print(to_drop)

# Drop features 
df.drop(to_drop, axis=1, inplace=True)

['TotLen Fwd Pkts', 'TotLen Bwd Pkts', 'Fwd Pkt Len Std', 'Bwd Pkt Len Std', 'Flow IAT Min', 'Fwd IAT Tot', 'Fwd IAT Mean', 'Fwd IAT Max', 'Fwd IAT Min', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s', 'Pkt Len Std', 'SYN Flag Cnt', 'CWE Flag Count', 'ECE Flag Cnt', 'Pkt Size Avg', 'Fwd Seg Size Avg', 'Bwd Seg Size Avg', 'Subflow Fwd Pkts', 'Subflow Fwd Byts', 'Subflow Bwd Pkts', 'Subflow Bwd Byts', 'Fwd Act Data Pkts', 'Idle Max', 'Idle Min']
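
To see why only the upper triangle is inspected, here is a small sketch on toy data (the column names are hypothetical). Each correlated pair is looked at once, so only the later column of a pair is dropped and the earlier one is kept.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_times_two": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with x
    "noise": [3.0, -1.0, 4.0, -1.0, 5.0],       # weakly correlated with x
})

corr = toy.corr().abs()

# Boolean mask that is True strictly above the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it correlates strongly with any earlier column.
to_drop_toy = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(to_drop_toy)
```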


In [75]:
df.shape

(1603867, 53)

## Preprocessing

#### * **Preprocessing**: Execute your strategy for preprocessing. You have to do this in two ways: 
1. All attacks are present in both train and test set. 
2. Some attacks are 'zero-day', so only present in the test set. For the same **M-profile** choose some exploits as zero-days.

In [76]:
# copy dataset for the 2 methods
df_1 = df.copy()
df_2 = df.copy()

#### 1. All attacks are present in both train and test set

In [77]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1603867 entries, 0 to 1603866
Data columns (total 53 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   Dst Port           1603867 non-null  int64  
 1   Protocol           1603867 non-null  int64  
 2   Flow Duration      1603867 non-null  int64  
 3   Tot Fwd Pkts       1603867 non-null  int64  
 4   Tot Bwd Pkts       1603867 non-null  int64  
 5   Fwd Pkt Len Max    1603867 non-null  int64  
 6   Fwd Pkt Len Min    1603867 non-null  int64  
 7   Fwd Pkt Len Mean   1603867 non-null  int64  
 8   Bwd Pkt Len Max    1603867 non-null  int64  
 9   Bwd Pkt Len Min    1603867 non-null  int64  
 10  Bwd Pkt Len Mean   1603867 non-null  int64  
 11  Flow Byts/s        1603867 non-null  float64
 12  Flow Pkts/s        1603867 non-null  float64
 13  Flow IAT Mean      1603867 non-null  float64
 14  Flow IAT Std       1603867 non-null  float64
 15  Flow IAT Max       1603867 non-n

In [78]:
df_1['Label'].value_counts()

Benign                      733458
DDOS attack-HOIC            274404
DoS attacks-Hulk            184764
Bot                         114476
FTP-BruteForce               77344
SSH-Bruteforce               75035
Infilteration                64773
DoS attacks-SlowHTTPTest     55956
DoS attacks-GoldenEye        16603
DoS attacks-Slowloris         4396
DDOS attack-LOIC-UDP          1730
Brute Force -Web               611
Brute Force -XSS               230
SQL Injection                   87
Name: Label, dtype: int64

In [79]:
train, test = train_test_split(df_1, test_size=0.3, random_state=42)

In [80]:
clean_train = train.copy()
clean_test = test.copy()

In [81]:
# convert label column to binary - train set
label = {'Benign': 0, 'DDOS attack-HOIC': 1, 'DoS attacks-Hulk': 1, 'Bot': 1, 'FTP-BruteForce': 1,
         'SSH-Bruteforce': 1, 'Infilteration': 1, 'DoS attacks-SlowHTTPTest': 1, 'DoS attacks-GoldenEye': 1,
         'DoS attacks-Slowloris': 1, 'SQL Injection': 1, 'Brute Force -XSS': 1, 'Brute Force -Web': 1,
         'DDOS attack-LOIC-UDP': 1}

# work on an explicit copy so pandas does not raise a SettingWithCopyWarning
train = train.copy()
train['Label'] = train['Label'].map(label)
# check value count of label
train['Label'].value_counts()


1    609879
0    512827
Name: Label, dtype: int64

In [82]:
# convert label column to binary - test set
label = {'Benign': 0, 'DDOS attack-HOIC': 1, 'DoS attacks-Hulk': 1, 'Bot': 1, 'FTP-BruteForce': 1,
         'SSH-Bruteforce': 1, 'Infilteration': 1, 'DoS attacks-SlowHTTPTest': 1, 'DoS attacks-GoldenEye': 1,
         'DoS attacks-Slowloris': 1, 'SQL Injection': 1, 'Brute Force -XSS': 1, 'Brute Force -Web': 1,
         'DDOS attack-LOIC-UDP': 1}

# work on an explicit copy so pandas does not raise a SettingWithCopyWarning
test = test.copy()
test['Label'] = test['Label'].map(label)
# check value count of label
test['Label'].value_counts()


1    260530
0    220631
Name: Label, dtype: int64
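
Because every non-benign label maps to 1, the same conversion can also be written without listing each attack. This is an alternative sketch, not the approach used in the cells above; the toy series is an assumption.

```python
import pandas as pd

# Toy labels using spellings from the dataset.
labels = pd.Series(["Benign", "Bot", "SQL Injection", "Benign"], name="Label")

# Everything that is not 'Benign' counts as malicious (1).
binary = (labels != "Benign").astype(int)
print(binary.tolist())
```

One design consequence: a label that is missing from a lookup dictionary would become NaN with `map`, while this comparison treats any unknown label as malicious.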

#### Split features and target

In [83]:
X_train = train.drop('Label', axis=1)
y_train = train['Label']

X_test = test.drop('Label', axis=1)
y_test = test['Label']

In [84]:
# original label
y_train_og = clean_train['Label']
y_test_og = clean_test['Label']

In [85]:
# Find all numeric columns (every remaining feature is numeric, so the scaler below is fitted on all columns)
num_cols = X_train.columns[X_train.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

# fit the scaler on the training data only
scale = StandardScaler().fit(X_train)

# transform the training data
X_train = scale.transform(X_train)

# transform the test data with the statistics learned from the training data
X_test = scale.transform(X_test)
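
The reason for fitting on the train set only is that the test set must not influence the mean and standard deviation used for scaling. A minimal sketch with toy arrays (assumed values):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[4.0, 400.0]])

# Mean and standard deviation are learned from the train rows only.
scaler = StandardScaler().fit(toy_train)

toy_train_scaled = scaler.transform(toy_train)
toy_test_scaled = scaler.transform(toy_test)

print(toy_train_scaled.mean(axis=0))  # approximately zero per column
print(toy_test_scaled)                # test rows may fall outside the train range
```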


#### 2. Some attacks are 'zero-day', so only present in the test set. For the same **M-profile** choose some exploits as zero-days.

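For this method, I'll remove DoS attacks-Slowloris and DoS attacks-SlowHTTPTest (both from the DoS M-profile) and FTP-BruteForce (from the brute-force M-profile) from the train set, so that they only appear in the test set as zero-day attacks.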

In [86]:
df_2['Label'].value_counts()

Benign                      733458
DDOS attack-HOIC            274404
DoS attacks-Hulk            184764
Bot                         114476
FTP-BruteForce               77344
SSH-Bruteforce               75035
Infilteration                64773
DoS attacks-SlowHTTPTest     55956
DoS attacks-GoldenEye        16603
DoS attacks-Slowloris         4396
DDOS attack-LOIC-UDP          1730
Brute Force -Web               611
Brute Force -XSS               230
SQL Injection                   87
Name: Label, dtype: int64

Attacks (label spellings as they appear in the dataset):
- Bruteforce attack:
    - FTP-BruteForce
    - SSH-Bruteforce
- DoS attack:
    - DoS attacks-Hulk
    - DoS attacks-GoldenEye
    - DoS attacks-Slowloris
    - DoS attacks-SlowHTTPTest
- DDOS attacks:
    - DDOS attack-HOIC
    - DDOS attack-LOIC-UDP
- Web attack:
    - Brute Force -Web
    - Brute Force -XSS
- Infiltration attack:
    - Infilteration
- Botnet attack:
    - Bot
- SQL Injection
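
The grouping above can be expressed as a lookup table, which makes it easy to pick zero-day tools per attack family. This is a sketch: the dictionary name `attack_family` and the short family names are assumptions, while the keys use the label spellings from the dataset.

```python
import pandas as pd

# Hypothetical lookup table from dataset label to attack family.
attack_family = {
    "FTP-BruteForce": "Bruteforce",
    "SSH-Bruteforce": "Bruteforce",
    "DoS attacks-Hulk": "DoS",
    "DoS attacks-GoldenEye": "DoS",
    "DoS attacks-Slowloris": "DoS",
    "DoS attacks-SlowHTTPTest": "DoS",
    "DDOS attack-HOIC": "DDoS",
    "DDOS attack-LOIC-UDP": "DDoS",
    "Brute Force -Web": "Web",
    "Brute Force -XSS": "Web",
    "Infilteration": "Infiltration",
    "Bot": "Botnet",
    "SQL Injection": "SQL Injection",
}

labels = pd.Series(["Benign", "DoS attacks-Hulk", "FTP-BruteForce", "Bot"])

# Labels missing from the table (here: Benign) are kept unchanged.
families = labels.map(attack_family).fillna(labels)
print(families.tolist())
```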


In [87]:
train, test = train_test_split(df_2, test_size=0.3, random_state=42)


In [88]:
# DoS M-profile
# remove 'DoS attacks-Slowloris' from the train set
train = train[train['Label'] != 'DoS attacks-Slowloris']
# remove 'DoS attacks-SlowHTTPTest' from the train set
train = train[train['Label'] != 'DoS attacks-SlowHTTPTest']

# Bruteforce M-profile
# remove 'FTP-BruteForce' from the train set
train = train[train['Label'] != 'FTP-BruteForce']

In [89]:
train['Label'].value_counts()

Benign                   512827
DDOS attack-HOIC         192285
DoS attacks-Hulk         129670
Bot                       80157
SSH-Bruteforce            52539
Infilteration             45159
DoS attacks-GoldenEye     11648
DDOS attack-LOIC-UDP       1235
Brute Force -Web            431
Brute Force -XSS            154
SQL Injection                69
Name: Label, dtype: int64

In [90]:
test['Label'].value_counts()

Benign                      220631
DDOS attack-HOIC             82119
DoS attacks-Hulk             55094
Bot                          34319
FTP-BruteForce               23168
SSH-Bruteforce               22496
Infilteration                19614
DoS attacks-SlowHTTPTest     16734
DoS attacks-GoldenEye         4955
DoS attacks-Slowloris         1262
DDOS attack-LOIC-UDP           495
Brute Force -Web               180
Brute Force -XSS                76
SQL Injection                   18
Name: Label, dtype: int64
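
A quick sanity check on the two listings above is to compare the label sets directly. The sets below are copied from the two `value_counts()` outputs; on the real data they could be built with `set(train['Label'])` and `set(test['Label'])`.

```python
# Labels present in the train set (from the train value_counts output).
train_labels = {
    "Benign", "DDOS attack-HOIC", "DoS attacks-Hulk", "Bot", "SSH-Bruteforce",
    "Infilteration", "DoS attacks-GoldenEye", "DDOS attack-LOIC-UDP",
    "Brute Force -Web", "Brute Force -XSS", "SQL Injection",
}

# Labels present in the test set (from the test value_counts output).
test_labels = train_labels | {
    "FTP-BruteForce", "DoS attacks-SlowHTTPTest", "DoS attacks-Slowloris",
}

# Zero-day attacks are the labels the model never saw during training.
zero_day = sorted(test_labels - train_labels)
print(zero_day)
```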

In [91]:
clean_train = train.copy()
clean_test = test.copy()

In [92]:
# train set
# convert label column to binary
label = {'Benign': 0, 'DDOS attack-HOIC': 1, 'DoS attacks-Hulk': 1, 'Bot': 1,
         'SSH-Bruteforce': 1, 'Infilteration': 1, 'DoS attacks-GoldenEye': 1,
         'SQL Injection': 1, 'Brute Force -XSS': 1, 'Brute Force -Web': 1,
         'DDOS attack-LOIC-UDP': 1}

train['Label'] = [label[item] for item in train['Label']]
train['Label'].unique()
# check value count of label
train['Label'].value_counts()

1    513347
0    512827
Name: Label, dtype: int64

In [93]:
# test set
# convert label column to binary
label = {'Benign': 0, 'DDOS attack-HOIC': 1, 'DoS attacks-Hulk': 1, 'Bot': 1, 'FTP-BruteForce': 1,
         'SSH-Bruteforce': 1, 'Infilteration': 1, 'DoS attacks-SlowHTTPTest': 1, 'DoS attacks-GoldenEye': 1,
         'DoS attacks-Slowloris': 1, 'SQL Injection': 1, 'Brute Force -XSS': 1, 'Brute Force -Web': 1,
         'DDOS attack-LOIC-UDP': 1}

# work on an explicit copy so pandas does not raise a SettingWithCopyWarning
test = test.copy()
test['Label'] = test['Label'].map(label)
# check value count of label
test['Label'].value_counts()


1    260530
0    220631
Name: Label, dtype: int64

In [94]:
X_train2 = train.drop('Label', axis=1)
y_train2 = train['Label']

X_test2 = test.drop('Label', axis=1)
y_test2 = test['Label']

In [95]:
# original label
y_train2_og = clean_train['Label']
y_test2_og = clean_test['Label']

In [96]:
# Find all numeric columns (every remaining feature is numeric, so the scaler below is fitted on all columns)
num_cols = X_train2.columns[X_train2.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

# fit the scaler on the training data only
scale = StandardScaler().fit(X_train2)

# transform the training data
X_train2 = scale.transform(X_train2)

# transform the test data with the statistics learned from the training data
X_test2 = scale.transform(X_test2)

## Training and evaluation

* **Training & evaluating algorithms**: Train and evaluate your algorithms for both situations (with and without zero-days).
    In this evaluation include:
    1. Comparison between test set with/without zero-days.
    2. Scores for the binary labels and on the original labels.
    3. How much overfitting there is.
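
One simple way to quantify the third point is the difference between train and test accuracy. The helper below is a sketch; the name `overfit_report` and the synthetic data are assumptions, and any fitted scikit-learn classifier could be passed in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def overfit_report(model, X_tr, y_tr, X_te, y_te):
    """Return train accuracy, test accuracy and their difference (the gap)."""
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    return {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}


# Synthetic stand-in for the scaled CIC'18 features.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

report = overfit_report(GaussianNB().fit(X_tr, y_tr), X_tr, y_tr, X_te, y_te)
print(report)
```

A gap close to zero suggests little overfitting; a large positive gap means the model fits the train set much better than unseen data.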

## Without zero-days (all attacks present in both train and test set)

### 1. Logistic regression

#### binary labels

In [97]:
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(solver='lbfgs', C=10)

start = time.time()
model1.fit(X_train, y_train)
end = time.time()

duration = end - start
    
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 11 seconds to train the model


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
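
The warning means that lbfgs stopped at the default iteration limit before converging. A possible remedy is to raise `max_iter`, as the message suggests. The sketch below uses synthetic data, and `max_iter=1000` is an assumed value that was not tuned or run on CIC'18; `solver` and `C` are taken from the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the scaled features.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Same solver and C as above; max_iter=1000 is an assumption.
model = LogisticRegression(solver="lbfgs", C=10, max_iter=1000)
model.fit(X_toy, y_toy)

# n_iter_ holds the number of iterations actually used.
print(model.n_iter_)
```

If `n_iter_` stays below `max_iter`, the solver converged before hitting the limit.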


In [98]:
y_pred = model1.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.93      0.91      0.92    220631
           1       0.93      0.94      0.93    260530

    accuracy                           0.93    481161
   macro avg       0.93      0.92      0.93    481161
weighted avg       0.93      0.93      0.93    481161

[[201091  19540]
 [ 16115 244415]]
92.58979842505939


#### original labels

In [99]:
from sklearn.linear_model import LogisticRegression

model1_og = LogisticRegression(solver='lbfgs', C=10)

start = time.time()
model1_og.fit(X_train, y_train_og)
end = time.time()

duration = end - start
    
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 106 seconds to train the model


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [100]:
y_pred = model1_og.predict(X_test)

print(classification_report(y_test_og, y_pred))

cf = confusion_matrix(y_test_og, y_pred)
print(cf)
print(accuracy_score(y_test_og, y_pred) * 100) 

                          precision    recall  f1-score   support

                  Benign       0.92      0.96      0.94    220631
                     Bot       0.80      1.00      0.89     34319
        Brute Force -Web       0.95      0.21      0.34       180
        Brute Force -XSS       1.00      0.46      0.63        76
        DDOS attack-HOIC       0.99      1.00      0.99     82119
    DDOS attack-LOIC-UDP       1.00      0.99      1.00       495
   DoS attacks-GoldenEye       0.97      0.83      0.89      4955
        DoS attacks-Hulk       0.98      1.00      0.99     55094
DoS attacks-SlowHTTPTest       0.64      0.55      0.59     16734
   DoS attacks-Slowloris       0.92      0.75      0.83      1262
          FTP-BruteForce       0.71      0.79      0.74     23168
           Infilteration       0.54      0.05      0.08     19614
           SQL Injection       0.00      0.00      0.00        18
          SSH-Bruteforce       1.00      1.00      1.00     22496

        

In [101]:
print("score on train: "+ str(model1.score(X_train, y_train) * 100))
print("score on test: " + str(model1.score(X_test, y_test) * 100))

score on train: 92.70975660591463
score on test: 92.58979842505939


In [102]:
print("score on train: "+ str(model1_og.score(X_train, y_train_og) * 100))
print("score on test: " + str(model1_og.score(X_test, y_test_og) * 100))

score on train: 91.3285401520968
score on test: 91.24056189092632


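Logistic regression reaches 92.59% accuracy on the binary labels and 91.24% on the original labels. There is no sign of overfitting, as the train and test scores are nearly identical (92.71% vs. 92.59% and 91.33% vs. 91.24%). On the original labels, however, Infilteration (recall 0.05) and SQL Injection (recall 0.00) are barely detected.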
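
The two accuracies are not directly comparable: confusing one attack type with another counts as an error on the original labels but not on the binary labels. A sketch with toy predictions (assumed values) that collapses multi-class predictions to binary:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy ground truth and predictions on the original labels.
y_true_og = np.array(["Benign", "Bot", "Infilteration", "Benign", "FTP-BruteForce"])
y_pred_og = np.array(["Benign", "Bot", "Benign", "Benign", "DoS attacks-SlowHTTPTest"])

# Accuracy on the original labels: 3 of 5 are exactly right.
multi_acc = accuracy_score(y_true_og, y_pred_og)

# Collapse both to benign (0) vs. malicious (1) and score again: 4 of 5 are right.
y_true_bin = (y_true_og != "Benign").astype(int)
y_pred_bin = (y_pred_og != "Benign").astype(int)
binary_acc = accuracy_score(y_true_bin, y_pred_bin)

print(multi_acc, binary_acc)
```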

### 2. Naive Bayes

#### binary labels

In [103]:
from sklearn.naive_bayes import GaussianNB

model2 = GaussianNB()

start = time.time()
model2.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 1 seconds to train the model


In [104]:
y_pred = model2.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.92      0.82      0.86    220631
           1       0.86      0.94      0.90    260530

    accuracy                           0.88    481161
   macro avg       0.89      0.88      0.88    481161
weighted avg       0.89      0.88      0.88    481161

[[180171  40460]
 [ 16127 244403]]
88.23948740650219


#### original labels

In [105]:
from sklearn.naive_bayes import GaussianNB

model2_og = GaussianNB()

start = time.time()
model2_og.fit(X_train, y_train_og)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 2 seconds to train the model


In [106]:
y_pred = model2_og.predict(X_test)

print(classification_report(y_test_og, y_pred))

cf = confusion_matrix(y_test_og, y_pred)
print(cf)
print(accuracy_score(y_test_og, y_pred) * 100) 

                          precision    recall  f1-score   support

                  Benign       0.95      0.43      0.59    220631
                     Bot       0.73      1.00      0.84     34319
        Brute Force -Web       0.00      0.24      0.00       180
        Brute Force -XSS       0.00      0.46      0.01        76
        DDOS attack-HOIC       0.85      1.00      0.92     82119
    DDOS attack-LOIC-UDP       1.00      1.00      1.00       495
   DoS attacks-GoldenEye       0.71      0.82      0.76      4955
        DoS attacks-Hulk       0.98      0.96      0.97     55094
DoS attacks-SlowHTTPTest       0.52      0.97      0.68     16734
   DoS attacks-Slowloris       0.67      0.78      0.72      1262
          FTP-BruteForce       0.95      0.35      0.52     23168
           Infilteration       0.15      0.42      0.22     19614
           SQL Injection       0.00      0.83      0.01        18
          SSH-Bruteforce       1.00      1.00      1.00     22496

        

In [107]:
print("score on train: "+ str(model2.score(X_train, y_train) * 100))
print("score on test: " + str(model2.score(X_test, y_test) * 100))

score on train: 88.2674538124852
score on test: 88.23948740650219


In [108]:
print("score on train: "+ str(model2_og.score(X_train, y_train_og) * 100))
print("score on test: " + str(model2_og.score(X_test, y_test_og) * 100))

score on train: 67.66134678179327
score on test: 67.57966668121482


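Naive Bayes reaches 88.24% accuracy on the binary labels but only 67.58% on the original labels, largely because just 43% of the benign flows are recognised as benign (Benign recall 0.43). There is no sign of overfitting, as the train and test scores are nearly identical.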
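
Per-class recall shows where the accuracy is lost. The sketch below uses toy labels (assumed values) to illustrate how a low recall on the largest class pulls the overall accuracy down.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = ["Benign", "Benign", "Benign", "Bot", "Bot"]
y_pred = ["Benign", "Bot", "Bot", "Bot", "Bot"]

classes = ["Benign", "Bot"]

# average=None returns one recall value per class, in the order of `classes`.
recalls = recall_score(y_true, y_pred, labels=classes, average=None)
accuracy = accuracy_score(y_true, y_pred)

for name, value in zip(classes, recalls):
    print(name, round(float(value), 2))
print("accuracy", accuracy)
```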

### 3. Random Forest

#### binary labels

In [109]:
from sklearn.ensemble import RandomForestClassifier

model3 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')

start = time.time()
model3.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 31 seconds to train the model


In [110]:
y_pred = model3.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.95      0.98      0.96    220631
           1       0.98      0.95      0.97    260530

    accuracy                           0.96    481161
   macro avg       0.96      0.97      0.96    481161
weighted avg       0.97      0.96      0.96    481161

[[215556   5075]
 [ 11864 248666]]
96.47955673880469


#### original labels

In [111]:
from sklearn.ensemble import RandomForestClassifier

model3_og = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')

start = time.time()
model3_og.fit(X_train, y_train_og)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 36 seconds to train the model


In [112]:
y_pred = model3_og.predict(X_test)

print(classification_report(y_test_og, y_pred))

cf = confusion_matrix(y_test_og, y_pred)
print(cf)
print(accuracy_score(y_test_og, y_pred) * 100) 

                          precision    recall  f1-score   support

                  Benign       0.95      0.98      0.96    220631
                     Bot       1.00      1.00      1.00     34319
        Brute Force -Web       0.92      0.74      0.82       180
        Brute Force -XSS       0.97      0.91      0.94        76
        DDOS attack-HOIC       1.00      1.00      1.00     82119
    DDOS attack-LOIC-UDP       1.00      1.00      1.00       495
   DoS attacks-GoldenEye       1.00      1.00      1.00      4955
        DoS attacks-Hulk       1.00      1.00      1.00     55094
DoS attacks-SlowHTTPTest       0.76      0.51      0.61     16734
   DoS attacks-Slowloris       1.00      1.00      1.00      1262
          FTP-BruteForce       0.71      0.88      0.79     23168
           Infilteration       0.61      0.40      0.48     19614
           SQL Injection       0.91      0.56      0.69        18
          SSH-Bruteforce       1.00      1.00      1.00     22496

        

In [113]:
print("score on train: "+ str(model3.score(X_train, y_train) * 100))
print("score on test: " + str(model3.score(X_test, y_test) * 100))

score on train: 98.53372120572973
score on test: 96.47955673880469


In [114]:
print("score on train: "+ str(model3_og.score(X_train, y_train_og) * 100))
print("score on test: " + str(model3_og.score(X_test, y_test_og) * 100))

score on train: 96.21245455177045
score on test: 94.20817564183298


Random Forest reaches 96.48% accuracy on the binary labels and 94.21% on the original labels. There is mild overfitting: the train scores (98.53% and 96.21%) are about 2 percentage points above the test scores.

### 4. Decision tree

#### binary labels

In [115]:
from sklearn.tree import DecisionTreeClassifier

model4 = DecisionTreeClassifier(criterion='entropy')

start = time.time()
model4.fit(X_train, y_train)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 25 seconds to train the model


In [116]:
y_pred = model4.predict(X_test)

print(classification_report(y_test, y_pred))

cf = confusion_matrix(y_test, y_pred)
print(cf)
print(accuracy_score(y_test, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.95      0.96      0.96    220631
           1       0.97      0.96      0.96    260530

    accuracy                           0.96    481161
   macro avg       0.96      0.96      0.96    481161
weighted avg       0.96      0.96      0.96    481161

[[211740   8891]
 [ 10699 249831]]
95.92859770430272


#### original labels

In [117]:
from sklearn.tree import DecisionTreeClassifier

model4_og = DecisionTreeClassifier(criterion='entropy')

start = time.time()
model4_og.fit(X_train, y_train_og)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 30 seconds to train the model


In [118]:
y_pred = model4_og.predict(X_test)

print(classification_report(y_test_og, y_pred))

cf = confusion_matrix(y_test_og, y_pred)
print(cf)
print(accuracy_score(y_test_og, y_pred) * 100) 

                          precision    recall  f1-score   support

                  Benign       0.95      0.96      0.96    220631
                     Bot       1.00      1.00      1.00     34319
        Brute Force -Web       0.81      0.83      0.82       180
        Brute Force -XSS       0.97      0.97      0.97        76
        DDOS attack-HOIC       1.00      1.00      1.00     82119
    DDOS attack-LOIC-UDP       1.00      1.00      1.00       495
   DoS attacks-GoldenEye       1.00      1.00      1.00      4955
        DoS attacks-Hulk       1.00      1.00      1.00     55094
DoS attacks-SlowHTTPTest       0.77      0.51      0.62     16734
   DoS attacks-Slowloris       1.00      1.00      1.00      1262
          FTP-BruteForce       0.72      0.89      0.79     23168
           Infilteration       0.50      0.45      0.48     19614
           SQL Injection       0.48      0.67      0.56        18
          SSH-Bruteforce       1.00      1.00      1.00     22496

        

In [119]:
print("score on train: "+ str(model4.score(X_train, y_train) * 100))
print("score on test: " + str(model4.score(X_test, y_test) * 100))

score on train: 99.34239239836609
score on test: 95.92859770430272


In [120]:
print("score on train: "+ str(model4_og.score(X_train, y_train_og) * 100))
print("score on test: " + str(model4_og.score(X_test, y_test_og) * 100))

score on train: 97.10592087331858
score on test: 93.69213215534926


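The decision tree also performs very well, with 95.93% accuracy on the binary labels and 93.69% on the original labels. There is some overfitting: the train scores (99.34% and 97.11%) are about 3.4 percentage points above the test scores.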
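
A common way to reduce this kind of overfitting is to limit the depth of the tree. The sketch below compares an unrestricted tree with a depth-limited one on synthetic data; `max_depth=5` is an assumed value, and this was not run on CIC'18.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with some label noise so that a full tree can overfit.
X_toy, y_toy = make_classification(n_samples=1000, n_features=20,
                                   flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

gaps = {}
for depth in (None, 5):  # None = grow until all leaves are pure
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Gap between train and test accuracy for this depth setting.
    gaps[depth] = tree.score(X_tr, y_tr) - tree.score(X_te, y_te)

print(gaps)
```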

## Without zero-days

### 1. Logistic regression

#### binary labels

In [121]:
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(solver='lbfgs', C=10)

start = time.time()
model1.fit(X_train2, y_train2)
end = time.time()

duration = end - start
    
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 9 seconds to train the model


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [122]:
y_pred = model1.predict(X_test2)

print(classification_report(y_test2, y_pred))

cf = confusion_matrix(y_test2, y_pred)
print(cf)
print(accuracy_score(y_test2, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.92      0.91      0.92    220631
           1       0.93      0.93      0.93    260530

    accuracy                           0.92    481161
   macro avg       0.92      0.92      0.92    481161
weighted avg       0.92      0.92      0.92    481161

[[201531  19100]
 [ 17061 243469]]
92.48463611971877
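
The figures in the report above can be recomputed by hand from the printed confusion matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels); the helper name `binary_metrics` is only illustrative and is not part of the notebook.

```python
# Confusion matrix printed above for the binary logistic regression model.
# Assumed layout (scikit-learn convention): rows = true labels, columns = predictions.
cf = [[201531, 19100],
      [17061, 243469]]

def binary_metrics(cf):
    """Return accuracy plus per-class precision and recall for a 2x2 matrix."""
    total = sum(sum(row) for row in cf)
    metrics = {"accuracy": (cf[0][0] + cf[1][1]) / total}
    for cls in (0, 1):
        predicted_as_cls = cf[0][cls] + cf[1][cls]  # column sum
        actually_cls = sum(cf[cls])                 # row sum
        metrics[f"precision_{cls}"] = cf[cls][cls] / predicted_as_cls
        metrics[f"recall_{cls}"] = cf[cls][cls] / actually_cls
    return metrics

for name, value in binary_metrics(cf).items():
    print(f"{name}: {value:.4f}")
```

Read this way, the 19100 benign flows flagged as malicious are the false alarms, and the 17061 malicious flows predicted as benign are the missed attacks.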


#### original labels

In [123]:
from sklearn.linear_model import LogisticRegression

model1_og = LogisticRegression(solver='lbfgs', C=10)

start = time.time()
model1_og.fit(X_train2, y_train2_og)
end = time.time()

duration = end - start
    
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 80 seconds to train the model


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [124]:
y_pred = model1_og.predict(X_test2)

print(classification_report(y_test2_og, y_pred))

cf = confusion_matrix(y_test2_og, y_pred)
print(cf)
print(accuracy_score(y_test2_og, y_pred) * 100) 

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


                          precision    recall  f1-score   support

                  Benign       0.91      0.97      0.94    220631
                     Bot       0.87      1.00      0.93     34319
        Brute Force -Web       1.00      0.21      0.34       180
        Brute Force -XSS       1.00      0.43      0.61        76
        DDOS attack-HOIC       0.99      1.00      0.99     82119
    DDOS attack-LOIC-UDP       1.00      0.99      1.00       495
   DoS attacks-GoldenEye       0.87      0.83      0.85      4955
        DoS attacks-Hulk       0.98      1.00      0.99     55094
DoS attacks-SlowHTTPTest       0.00      0.00      0.00     16734
   DoS attacks-Slowloris       0.00      0.00      0.00      1262
          FTP-BruteForce       0.00      0.00      0.00     23168
           Infilteration       0.06      0.09      0.07     19614
           SQL Injection       0.00      0.00      0.00        18
          SSH-Bruteforce       0.74      1.00      0.85     22496

        

In [125]:
print("score on train: "+ str(model1.score(X_train2, y_train2) * 100))
print("score on test: " + str(model1.score(X_test2, y_test2) * 100))

score on train: 92.09948800106025
score on test: 92.48463611971877


In [126]:
print("score on train: "+ str(model1_og.score(X_train2, y_train2_og) * 100))
print("score on test: " + str(model1_og.score(X_test2, y_test2_og) * 100))

score on train: 94.24980558852593
score on test: 86.08324448573347


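Logistic regression reaches 92.48% test accuracy on the binary labels and 86.08% on the original labels. Only the model trained on the original labels shows a clear gap between train (94.25%) and test (86.08%) accuracy. Note that in both runs the lbfgs solver stopped at its iteration limit, so these scores come from models that did not fully converge.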

### 2. Naive Bayes

#### binary labels

In [127]:
from sklearn.naive_bayes import GaussianNB

model2 = GaussianNB()

start = time.time()
model2.fit(X_train2, y_train2)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 1 seconds to train the model


In [128]:
y_pred = model2.predict(X_test2)

print(classification_report(y_test2, y_pred))

cf = confusion_matrix(y_test2, y_pred)
print(cf)
print(accuracy_score(y_test2, y_pred) * 100)

              precision    recall  f1-score   support

           0       0.88      0.73      0.80    220631
           1       0.80      0.91      0.85    260530

    accuracy                           0.83    481161
   macro avg       0.84      0.82      0.82    481161
weighted avg       0.84      0.83      0.83    481161

[[160895  59736]
 [ 22371 238159]]
82.93564939801854


#### original labels

In [129]:
from sklearn.naive_bayes import GaussianNB

model2_og = GaussianNB()

start = time.time()
model2_og.fit(X_train2, y_train2_og)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 2 seconds to train the model


In [130]:
y_pred = model2_og.predict(X_test2)

print(classification_report(y_test2_og, y_pred))

cf = confusion_matrix(y_test2_og, y_pred)
print(cf)
print(accuracy_score(y_test2_og, y_pred) * 100)

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


                          precision    recall  f1-score   support

                  Benign       0.91      0.43      0.59    220631
                     Bot       0.73      1.00      0.84     34319
        Brute Force -Web       0.00      0.24      0.00       180
        Brute Force -XSS       0.00      0.46      0.01        76
        DDOS attack-HOIC       0.85      1.00      0.92     82119
    DDOS attack-LOIC-UDP       1.00      1.00      1.00       495
   DoS attacks-GoldenEye       0.71      0.82      0.76      4955
        DoS attacks-Hulk       0.98      0.96      0.97     55094
DoS attacks-SlowHTTPTest       0.00      0.00      0.00     16734
   DoS attacks-Slowloris       0.00      0.00      0.00      1262
          FTP-BruteForce       0.00      0.00      0.00     23168
           Infilteration       0.09      0.42      0.15     19614
           SQL Injection       0.00      0.83      0.01        18
          SSH-Bruteforce       1.00      1.00      1.00     22496

        

In [131]:
print("score on train: "+ str(model2.score(X_train2, y_train2) * 100))
print("score on test: " + str(model2.score(X_test2, y_test2) * 100))

score on train: 83.46255118527657
score on test: 82.93564939801854


In [132]:
print("score on train: "+ str(model2_og.score(X_train2, y_train2_og) * 100))
print("score on test: " + str(model2_og.score(X_test2, y_test2_og) * 100))

score on train: 68.27068313950656
score on test: 62.3088737449627


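Naive Bayes reaches 82.94% test accuracy on the binary labels but only 62.31% on the original labels. The binary model generalises well (83.46% on train, 82.94% on test), while the model with the original labels loses about six points between train (68.27%) and test.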

### 3. Random Forest

#### binary labels

In [133]:
from sklearn.ensemble import RandomForestClassifier

model3 = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')

start = time.time()
model3.fit(X_train2, y_train2)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 30 seconds to train the model


In [134]:
y_pred = model3.predict(X_test2)

print(classification_report(y_test2, y_pred))

cf = confusion_matrix(y_test2, y_pred)
print(cf)
print(accuracy_score(y_test2, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.92      0.98      0.95    220631
           1       0.98      0.93      0.96    260530

    accuracy                           0.95    481161
   macro avg       0.95      0.96      0.95    481161
weighted avg       0.95      0.95      0.95    481161

[[215815   4816]
 [ 17558 242972]]
95.34999719428632


#### original labels

In [135]:
from sklearn.ensemble import RandomForestClassifier

model3_og = RandomForestClassifier(n_estimators = 10, criterion = 'entropy')

start = time.time()
model3_og.fit(X_train2, y_train2_og)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 32 seconds to train the model


In [136]:
y_pred = model3_og.predict(X_test2)

print(classification_report(y_test2_og, y_pred))

cf = confusion_matrix(y_test2_og, y_pred)
print(cf)
print(accuracy_score(y_test2_og, y_pred) * 100) 

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


                          precision    recall  f1-score   support

                  Benign       0.93      0.98      0.95    220631
                     Bot       1.00      1.00      1.00     34319
        Brute Force -Web       0.91      0.74      0.82       180
        Brute Force -XSS       0.97      0.83      0.89        76
        DDOS attack-HOIC       1.00      1.00      1.00     82119
    DDOS attack-LOIC-UDP       1.00      1.00      1.00       495
   DoS attacks-GoldenEye       0.90      1.00      0.95      4955
        DoS attacks-Hulk       1.00      1.00      1.00     55094
DoS attacks-SlowHTTPTest       0.00      0.00      0.00     16734
   DoS attacks-Slowloris       0.00      0.00      0.00      1262
          FTP-BruteForce       0.00      0.00      0.00     23168
           Infilteration       0.58      0.40      0.48     19614
           SQL Injection       0.87      0.72      0.79        18
          SSH-Bruteforce       0.39      1.00      0.56     22496

        

In [137]:
print("score on train: "+ str(model3.score(X_train2, y_train2) * 100))
print("score on test: " + str(model3.score(X_test2, y_test2) * 100))

score on train: 98.34521241037095
score on test: 95.34999719428632


In [138]:
print("score on train: "+ str(model3_og.score(X_train2, y_train2_og) * 100))
print("score on test: " + str(model3_og.score(X_test2, y_test2_og) * 100))

score on train: 98.37483701594466
score on test: 87.92005170826397


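The Random Forest algorithm reaches 95.35% test accuracy on the binary labels and 87.92% on the original labels. Both models score about 98.4% on the training set, so both overfit, and the gap is much larger for the original labels.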

### 4. Decision tree

#### binary labels

In [139]:
from sklearn.tree import DecisionTreeClassifier

model4 = DecisionTreeClassifier(criterion='entropy')

start = time.time()
model4.fit(X_train2, y_train2)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 24 seconds to train the model


In [140]:
y_pred = model4.predict(X_test2)

print(classification_report(y_test2, y_pred))

cf = confusion_matrix(y_test2, y_pred)
print(cf)
print(accuracy_score(y_test2, y_pred) * 100) 

              precision    recall  f1-score   support

           0       0.95      0.96      0.95    220631
           1       0.97      0.95      0.96    260530

    accuracy                           0.96    481161
   macro avg       0.96      0.96      0.96    481161
weighted avg       0.96      0.96      0.96    481161

[[211734   8897]
 [ 11790 248740]]
95.7006074889694


#### original labels

In [141]:
from sklearn.tree import DecisionTreeClassifier

model4_og = DecisionTreeClassifier(criterion='entropy')

start = time.time()
model4_og.fit(X_train2, y_train2_og)
end = time.time()
duration = end - start
print("It took about "+str(int(duration))+ " seconds to train the model")

It took about 29 seconds to train the model


In [142]:
y_pred = model4_og.predict(X_test2)

print(classification_report(y_test2_og, y_pred))

cf = confusion_matrix(y_test2_og, y_pred)
print(cf)
print(accuracy_score(y_test2_og, y_pred) * 100) 

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


                          precision    recall  f1-score   support

                  Benign       0.94      0.96      0.95    220631
                     Bot       1.00      1.00      1.00     34319
        Brute Force -Web       0.83      0.79      0.81       180
        Brute Force -XSS       0.97      0.97      0.97        76
        DDOS attack-HOIC       1.00      1.00      1.00     82119
    DDOS attack-LOIC-UDP       1.00      1.00      1.00       495
   DoS attacks-GoldenEye       0.85      1.00      0.92      4955
        DoS attacks-Hulk       1.00      1.00      1.00     55094
DoS attacks-SlowHTTPTest       0.00      0.00      0.00     16734
   DoS attacks-Slowloris       0.00      0.00      0.00      1262
          FTP-BruteForce       0.00      0.00      0.00     23168
           Infilteration       0.42      0.45      0.44     19614
           SQL Injection       0.48      0.67      0.56        18
          SSH-Bruteforce       0.39      1.00      0.56     22496

        

In [143]:
print("score on train: "+ str(model4.score(X_train2, y_train2) * 100))
print("score on test: " + str(model4.score(X_test2, y_test2) * 100))

score on train: 99.2805313718726
score on test: 95.7006074889694


In [144]:
print("score on train: "+ str(model4_og.score(X_train2, y_train2_og) * 100))
print("score on test: " + str(model4_og.score(X_test2, y_test2_og) * 100))

score on train: 99.27994667570997
score on test: 87.3830173268407


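The Decision Tree algorithm reaches 95.70% test accuracy on the binary labels and 87.38% on the original labels. Both models score 99.28% on the training set, so both overfit, and the gap is much larger for the original labels.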
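
The overfitting remarks in this section can be made concrete by computing the train/test gap for each run. This is a small sketch using the scores printed in the cells above, rounded to two decimals; the dictionary and the helper `generalization_gap` are illustrative names, not part of the notebook. A gap on its own cannot tell overfitting apart from a test set that contains classes the model never saw.

```python
# (train accuracy, test accuracy) in %, copied from the cells of this section.
scores = {
    "Logistic Regression (binary)":   (92.10, 92.48),
    "Logistic Regression (original)": (94.25, 86.08),
    "Naive Bayes (binary)":           (83.46, 82.94),
    "Naive Bayes (original)":         (68.27, 62.31),
    "Random Forest (binary)":         (98.35, 95.35),
    "Random Forest (original)":       (98.37, 87.92),
    "Decision Tree (binary)":         (99.28, 95.70),
    "Decision Tree (original)":       (99.28, 87.38),
}

def generalization_gap(train_acc, test_acc):
    """Percentage points lost when moving from the train set to the test set."""
    return train_acc - test_acc

gaps = {name: round(generalization_gap(train, test), 2)
        for name, (train, test) in scores.items()}

# Largest gap first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:32s} gap = {gap:+.2f} points")
```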

#### Comparison between test set with/without zero-days

Binary labels:

Algorithms |with zero days |without zero days
-----|-----|----- 
Logistic Regression|train: 92.70 test: 92.58|train: 92.09 test: 92.48
Naive Bayes|train: 88.26 test: 88.24|train: 83.46 test: 82.93
Random Forest|train: 98.53 test: 96.47|train: 98.34 test: 95.35
Decision Tree|train: 99.34 test: 95.93|train: 99.28 test: 95.70

Original labels:

Algorithms |with zero days |without zero days
-----|-----|----- 
Logistic Regression|train: 91.32 test: 91.24|train: 94.25 test: 86.08
Naive Bayes|train: 67.66 test: 67.57|train: 68.27 test: 62.31
Random Forest|train: 96.21 test: 94.20|train: 98.37 test: 87.92
Decision Tree|train: 97.11 test: 93.69|train: 99.28 test: 87.38
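
To read the two tables side by side, the drop in test accuracy between the columns can be computed per algorithm. The sketch below only reuses the test values from the tables; the variable names and the helper `accuracy_drop` are illustrative.

```python
# Test accuracy in % taken from the two comparison tables above,
# stored as (with zero days, without zero days) per algorithm.
binary_test = {
    "Logistic Regression": (92.58, 92.48),
    "Naive Bayes": (88.24, 82.93),
    "Random Forest": (96.47, 95.35),
    "Decision Tree": (95.93, 95.70),
}
original_test = {
    "Logistic Regression": (91.24, 86.08),
    "Naive Bayes": (67.57, 62.31),
    "Random Forest": (94.20, 87.92),
    "Decision Tree": (93.69, 87.38),
}

def accuracy_drop(table):
    """Percentage points lost when moving from the first to the second column."""
    return {name: round(first - second, 2)
            for name, (first, second) in table.items()}

binary_drop = accuracy_drop(binary_test)
original_drop = accuracy_drop(original_test)

for name in binary_test:
    print(f"{name:20s} binary: {binary_drop[name]:5.2f}   original: {original_drop[name]:5.2f}")
```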

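### Question: Can you explain the results you get with and without zero-days in your test set?
### Answer: For the binary labels, the models in the "without zero days" column score lower on the test set than those in the "with zero days" column. For Logistic Regression, Random Forest and Decision Tree the difference is small (about 0.1 to 1.1 percentage points); only Naive Bayes loses about 5 points.

### For the original labels the same pattern holds, but the impact is larger: every model loses roughly 5 to 6 percentage points of test accuracy.

### The reason is that some attack tools of the M-profiles were removed from the training set, so the models never saw them. In the classification reports, DoS attacks-SlowHTTPTest, DoS attacks-Slowloris and FTP-BruteForce have a precision and recall of 0.00 for every model. With the original labels such a flow can never be classified correctly, because its label does not exist in the training data. With the binary labels it can still be flagged as malicious when it resembles a known attack, which is why the binary scores drop much less.

### On the other hand, the gap between train and test accuracy is larger with the original labels than with the binary labels, most clearly in the "without zero days" column. Part of that gap comes from the unseen attack classes rather than from overfitting alone.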
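
How much the unseen tools can cost with the original labels can be estimated from the per-class support in the classification reports. The sketch below assumes that the three classes with 0.00 precision and recall in every run without zero-days are the held-out tools; that set is inferred from the reports, not stated in the notebook. A model that never saw a label cannot predict it, so the share of those flows gives an upper bound on the multi-class test accuracy.

```python
# Per-class test support copied from the classification reports above.
support = {
    "Benign": 220631,
    "Bot": 34319,
    "Brute Force -Web": 180,
    "Brute Force -XSS": 76,
    "DDOS attack-HOIC": 82119,
    "DDOS attack-LOIC-UDP": 495,
    "DoS attacks-GoldenEye": 4955,
    "DoS attacks-Hulk": 55094,
    "DoS attacks-SlowHTTPTest": 16734,
    "DoS attacks-Slowloris": 1262,
    "FTP-BruteForce": 23168,
    "Infilteration": 19614,
    "SQL Injection": 18,
    "SSH-Bruteforce": 22496,
}

# Assumption: these are the tools left out of the training set, inferred from
# their 0.00 precision and recall in every run without zero-days.
held_out = {"DoS attacks-SlowHTTPTest", "DoS attacks-Slowloris", "FTP-BruteForce"}

total = sum(support.values())
unseen = sum(count for label, count in support.items() if label in held_out)
unseen_share = unseen / total
ceiling = 1 - unseen_share  # best possible multi-class accuracy on this test set

print(f"test flows from unseen tools: {unseen} of {total} ({unseen_share:.2%})")
print(f"upper bound on multi-class accuracy: {ceiling:.2%}")
```

Under that assumption, about one test flow in twelve carries a label the multi-class models cannot output, which accounts for most of the 5 to 6 point drop seen with the original labels.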