# Supervised Learning

## BCCC-DoHBrw-2020

### Introduction

#### About Dataset

The dataset used in this project comes from [Kaggle's BCCC-CIRA-CIC-DoHBrw-2020 -- DOH Dataset](https://www.kaggle.com/datasets/supplejade/bccc-cira-cic-dohbrw-2020-dns-over-http?resource=download). It has about 500k instances, each described by 29 attributes. One of these attributes is the target Label, which is either Benign or Malicious. The other 28 attributes are numeric values such as FlowBytesSent, PacketLengthVariance, PacketLengthMedian, ResponseTimeTimeVariance, ResponseTimeTimeSkewFromMode and many more.

#### About the Problem

The goal of this project is to predict whether an HTTPS network flow is benign or malicious, based on the provided attributes. This is a binary classification problem whose target variable is Label.

#### About the Solution

To address this problem, a supervised learning approach will be employed ... ***TODO: ADD THE PROCEDURE HERE LATER (ROUGHLY)***

| Name | UP_Number |
|-|-|
| Guilherme Coutinho | up202108872|
| Xavier Outeiro | up202108895 |
| Miguel Figueiredo | up201706105 |
| Group | T04-G46 |


In [9]:
import warnings 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import math
from sklearn.preprocessing import LabelEncoder, StandardScaler, RobustScaler, MinMaxScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve, KFold
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve, average_precision_score, ConfusionMatrixDisplay

warnings.filterwarnings('ignore')

In [10]:
#Creates a dataframe from the csv file related to the dataset
df=pd.read_csv('../data/cyber_data.csv')
#Displays the first 5 rows 
df.head()

Unnamed: 0,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,PacketLengthVariance,PacketLengthStandardDeviation,PacketLengthMean,PacketLengthMedian,PacketLengthMode,PacketLengthSkewFromMedian,...,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation,Label
0,353,80.890348,393,90.056393,469.209877,21.661253,82.888889,66.0,66,2.339046,...,0.534524,1.754601e-09,4.2e-05,4.8e-05,1.9e-05,1.7e-05,2.028699,0.73749,0.869641,Benign
1,1807,53.056709,4828,141.758602,145520.370987,381.471324,228.793103,76.0,68,1.201609,...,1.591559,5.348911e-05,0.007314,0.011523,0.015273,1.6e-05,-1.538407,1.573304,0.634722,Malicious
2,15000,479.536009,27719,886.150575,25949.480963,161.088426,135.186709,87.0,87,0.897396,...,0.67095,0.3356292,0.579335,0.139518,0.001976,3e-06,0.712241,0.240819,4.152404,Malicious
3,1755,58.193065,4617,153.125406,154088.445853,392.541007,245.112953,75.5,54,1.296268,...,2.029971,124.5935,11.162146,5.011613,0.015778,0.012884,1.342708,0.447829,2.227258,Benign
4,618,3.491495,315,1.779646,319.41,17.872045,93.3,105.0,105,-1.963961,...,0.816362,2.975575e-09,5.4e-05,0.015507,0.015471,0.015466,1.986391,0.751146,0.003502,Benign


# Preprocessing Data

### First analysis

Here we try to get a grasp of the data in our dataset and look for any outliers that should be filtered out.


In [13]:
# Gives summary statistics for every column of the dataframe
df.describe()

Unnamed: 0,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,PacketLengthVariance,PacketLengthStandardDeviation,PacketLengthMean,PacketLengthMedian,PacketLengthMode,PacketLengthSkewFromMedian,...,PacketTimeSkewFromMode,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation
count,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,...,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0,499106.0
mean,40200.93,47339.15,42501.56,31668.29,92635.85,220.169083,173.159814,95.472702,70.749953,0.442934,...,1.246052,0.972995,1.711825,0.319586,0.442021,0.394671,0.205894,-0.968984,-0.0608,1.114611
std,143961.7,421275.1,139392.7,256680.5,153493.8,210.141825,85.50649,32.99974,14.830277,1.551998,...,0.713803,0.520212,11.106412,1.244031,2.071807,2.351233,1.724873,3.163775,3.19877,1.73582
min,55.0,1.464903,54.0,1.576245,0.0,0.0,56.0,54.0,54.0,-10.0,...,-5.265523,0.077182,0.0,0.0,5e-06,2e-06,-1e-06,-10.0,-10.0,0.0
25%,618.0,54.10781,476.0,141.8128,469.2099,21.661253,92.0,76.0,66.0,0.199848,...,0.627988,0.577836,2.1e-05,0.004454,0.010538,0.012238,1.6e-05,-1.797168,0.393686,0.552687
50%,1807.0,364.0969,4827.0,461.1146,18267.89,135.142971,152.488283,87.0,68.0,0.986397,...,1.224949,0.748184,7.9e-05,0.00882,0.015148,0.015407,3.4e-05,0.0,0.908002,0.800649
75%,5542.0,3810.26,7888.0,4215.498,141598.9,376.296309,228.758621,105.0,68.0,1.201609,...,1.716577,1.516263,0.000357,0.01864,0.024903,0.0163,0.015161,0.936908,1.305393,1.208105
max,8015359.0,23043480.0,7723184.0,7600000.0,1578115.0,1256.230616,689.8,317.0,553.0,2.932375,...,12.956406,5.616085,647.24533,25.441017,28.017596,28.017596,28.017596,2.970716,5.428781,66.309747


In [12]:
# Checking if there are values missing 
df.isna().any()

FlowBytesSent                             False
FlowSentRate                              False
FlowBytesReceived                         False
FlowReceivedRate                          False
PacketLengthVariance                      False
PacketLengthStandardDeviation             False
PacketLengthMean                          False
PacketLengthMedian                        False
PacketLengthMode                          False
PacketLengthSkewFromMedian                False
PacketLengthSkewFromMode                  False
PacketLengthCoefficientofVariation        False
PacketTimeVariance                        False
PacketTimeStandardDeviation               False
PacketTimeMean                            False
PacketTimeMedian                          False
PacketTimeMode                            False
PacketTimeSkewFromMedian                  False
PacketTimeSkewFromMode                    False
PacketTimeCoefficientofVariation          False
ResponseTimeTimeVariance                
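
The check above only reports whether each column contains a missing value. When a column does have gaps, a per-column count is usually more informative. The following is a minimal sketch on a toy frame; the values are made up for illustration and `toy`, `missing_per_column` and `total_missing` are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "FlowBytesSent": [353, 1807, np.nan, 1755],
    "PacketLengthMean": [82.9, 228.8, 135.2, 245.1],
    "Label": ["Benign", "Malicious", "Malicious", None],
})

# Number of missing entries per column, then the total over the whole frame.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total missing:", total_missing)
```

On the real dataframe the same two calls would be applied to `df`; a total of zero is consistent with every column reporting `False`.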

### Filtering out outliers 
From this first analysis, we were not able to identify any outliers that should be removed, since every column might be useful for our analysis.
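
One common heuristic for quantifying possible outliers is the interquartile range (IQR) rule, which flags values further than `k` IQRs from the quartiles. The sketch below is not the procedure used in this notebook; it is an assumption about how such a check could look. The helper `iqr_outlier_counts`, the threshold `k=1.5` and the toy values are all hypothetical.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy frame with illustrative values; column names are borrowed from the dataset.
toy = pd.DataFrame({
    "FlowBytesSent": [353, 618, 1755, 1807, 8015359],
    "PacketLengthMode": [66, 68, 54, 68, 87],
})

counts = iqr_outlier_counts(toy)
print(counts)
```

Network flow statistics are often heavy-tailed, so a value flagged by this rule is not necessarily an error; the counts are only a starting point for deciding whether any filtering is justified.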

In [15]:
df.shape

(499106, 29)

### Encoding the target variable
We need to encode the target variable **Label**, because it's of object type. The rest of the columns are already numeric, so we don't need to encode them.

In [16]:
encoder = LabelEncoder()
# Transforms the Label column into numeric values
df['Label'] = encoder.fit_transform(df['Label'])

df.head()

Unnamed: 0,FlowBytesSent,FlowSentRate,FlowBytesReceived,FlowReceivedRate,PacketLengthVariance,PacketLengthStandardDeviation,PacketLengthMean,PacketLengthMedian,PacketLengthMode,PacketLengthSkewFromMedian,...,PacketTimeCoefficientofVariation,ResponseTimeTimeVariance,ResponseTimeTimeStandardDeviation,ResponseTimeTimeMean,ResponseTimeTimeMedian,ResponseTimeTimeMode,ResponseTimeTimeSkewFromMedian,ResponseTimeTimeSkewFromMode,ResponseTimeTimeCoefficientofVariation,Label
0,353,80.890348,393,90.056393,469.209877,21.661253,82.888889,66.0,66,2.339046,...,0.534524,1.754601e-09,4.2e-05,4.8e-05,1.9e-05,1.7e-05,2.028699,0.73749,0.869641,0
1,1807,53.056709,4828,141.758602,145520.370987,381.471324,228.793103,76.0,68,1.201609,...,1.591559,5.348911e-05,0.007314,0.011523,0.015273,1.6e-05,-1.538407,1.573304,0.634722,1
2,15000,479.536009,27719,886.150575,25949.480963,161.088426,135.186709,87.0,87,0.897396,...,0.67095,0.3356292,0.579335,0.139518,0.001976,3e-06,0.712241,0.240819,4.152404,1
3,1755,58.193065,4617,153.125406,154088.445853,392.541007,245.112953,75.5,54,1.296268,...,2.029971,124.5935,11.162146,5.011613,0.015778,0.012884,1.342708,0.447829,2.227258,0
4,618,3.491495,315,1.779646,319.41,17.872045,93.3,105.0,105,-1.963961,...,0.816362,2.975575e-09,5.4e-05,0.015507,0.015471,0.015466,1.986391,0.751146,0.003502,0


Now the Label values will be:
Benign = 0, Malicious = 1
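
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why Benign becomes 0 and Malicious becomes 1. The mapping can be read back from the fitted encoder instead of being assumed. The sketch below uses a short made-up list of labels rather than the real column, and the name `mapping` is hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

# A few illustrative labels standing in for df['Label'].
labels = ["Benign", "Malicious", "Malicious", "Benign"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original names; the position of each name is its code.
mapping = {str(name): code for code, name in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform converts codes back to the original names.
print(encoder.inverse_transform([0, 1]))
```

Keeping the fitted `encoder` around makes it possible to translate model predictions back into the original class names later on.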