# Binary Classification

## Introduction

A **distributed denial-of-service (DDoS)** attack targets critical systems such as servers, websites or network resources by temporarily or permanently disrupting their connection to the network, causing a denial of service for users. A typical DDoS attack floods the target with excess traffic to render the service offline. DDoS attacks are difficult to prevent entirely and pose several information and business risks.

**Binary classification** is the task of assigning each item in a dataset to one of two groups. In machine learning, binary classification is a form of supervised learning, which means a model learns to map an input to an output from example input-output pairs. For example, a food item would be the input and edible/inedible would be the possible outputs. A binary classifier would attempt to predict whether a food is edible or inedible by learning from examples of edible and inedible food. <br/>
Some commonly used methods for binary classification include decision trees, random forests, Bayesian networks, support vector machines, neural networks and logistic regression. A binary classifier has a dependent variable with two possible values, such as pass/fail, win/loss or safe/unsafe, and one or more independent variables, which are typically real-valued.

***

## Outline

In this example, we use a dataset that contains multiple types of DDoS attacks - Netbios, Portmap and Syn - which take place over two different days. The dataset will be split into two sets: **training** and **testing**. The training set is used to train a logistic regression model, which then classifies each connection in the testing set as DDoS or BENIGN based on the independent variables. The dependent variable is a label with two values, NETBIOS/PORTMAP (DDoS attack) or BENIGN (harmless connection), and the independent variables are real network measurements such as flow duration, packets sent per second, lengths of forward packets and more. <br/>
In simple terms, our model will look at the day 1 data, learn which network data is labelled as NETBIOS/PORTMAP or BENIGN, and attempt to predict whether connections from day 2 are DDoS connections or benign connections without looking at their correct labels. The predicted labels are then compared to the actual labels to obtain a classification report containing accuracy and error rates.


### Contents
- [Part 0: Getting Started](#part0)
- [Part 1: Load The Data](#part1)
- [Part 2: Clean The Data](#part2)
- [Part 3: Explore The Data](#part3.0) <br/>
  [3.1: Overview of data and features](#part3.1) <br/>
  [3.2: Numerical and categorical features](#part3.2) <br/>
- [Part 4: Machine Learning](#part4.0) <br/>
  [4.1: Feature Selection](#part4.1) <br/>
  [4.2: Training](#part4.2) <br/>
  [4.3: Prediction](#part4.3) <br/>
- [Part 5: Evaluation](#part5.0) <br/>
  [5.1: Classification report](#part5.1) <br/>
  [5.2: Confusion matrix](#part5.2)





 ***

<a id='part0'></a>
# Part 0: Getting Started

### Imports and Formatting
**scikit-learn (sklearn)** is a commonly used Python machine learning library that features various classification, regression and clustering algorithms, along with data pre-processing, post-processing and evaluation methods

**Pandas** is a Python data processing and manipulation library that can read various file formats and transform the data

**NumPy** is a mathematical library that supports multi-dimensional arrays and matrices and high-level math functions

In [None]:
import sklearn
import pandas 
import numpy
from scipy import stats
import seaborn as sns
import ipywidgets as widgets
import matplotlib.pyplot as plt
from ipywidgets import interactive
from IPython.display import display
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix 

# Set display format of pandas output to "%.3f" (makes things look pretty)
# Rounds any floating point decimal value to 3 places after decimal point  
pandas.set_option('display.float_format', lambda x: '%.3f' % x)


 ***

<a id='part1'></a>
# Part 1: Load the Data

To accommodate long run-times for large datasets such as the one used in this example, we read only the columns that are required. 20 preselected columns are used for this machine learning example. Add the following line into the **TODO** section of the code cell below to read in the required columns/features.

```python
use_columns = ["Flow ID", "Source IP", "Source Port", "Destination IP", "Destination Port", "Timestamp", "Flow Duration", "Total Fwd Packets", "Total Backward Packets", "Fwd IAT Mean", "Fwd IAT Std", "Fwd IAT Min", "Fwd Packets/s", "Bwd Packets/s","SYN Flag Count", "RST Flag Count", "PSH Flag Count", "ACK Flag Count", "URG Flag Count", "Label"]
```

In [None]:
# Put column names into a list
header_names = list(pandas.read_csv('netbios_day1.csv', nrows=0))

# Remove leading and trailing whitespaces
for i in range(0, len(header_names)):
    header_names[i] = header_names[i].strip()

# List of columns to use 
# TODO    
use_columns = ["Flow ID", "Source IP", "Source Port", "Destination IP", "Destination Port", "Timestamp", "Flow Duration", "Total Fwd Packets", "Total Backward Packets", "Fwd IAT Mean", "Fwd IAT Std", "Fwd IAT Min", "Fwd Packets/s", "Bwd Packets/s","SYN Flag Count", "RST Flag Count", "PSH Flag Count", "ACK Flag Count", "URG Flag Count", "Label"]

### Read the data

Pandas provides easy ways to read multiple file types. In our case, the dataset is a comma-separated values (csv) file. An easy-to-use function for reading csv files is

```python
#Usage: pandas.read_csv(filename, nrows, skiprows, names, usecols, ...)
pandas.read_csv()
```
which takes in parameters like:<br/>

- ```filename``` name of the file 
- ```nrows``` to specify number of rows to read
- ```skiprows``` to skip any number of rows
- ```names``` to specify header names
- ```usecols``` to specify which column names to read, and more
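
The parameters listed above can be tried out on a tiny in-memory file before running them on the full dataset. The sketch below is only an illustration: the CSV text, `sample_headers` and `sample_frame` are made-up names and values, and the leading spaces in the header are an assumption that mirrors why the notebook strips whitespace from column names.

```python
import io
import pandas

# Hypothetical three-row CSV standing in for the real dataset.
# The header names carry leading spaces, so they are stripped below.
csv_text = (
    "Flow ID, Flow Duration, Total Fwd Packets, Label\n"
    "a-1,10,2,BENIGN\n"
    "a-2,25,4,NetBIOS\n"
    "a-3,40,6,NetBIOS\n"
)

# nrows=0 reads only the header row; iterating a dataframe yields column names
sample_headers = [name.strip() for name in pandas.read_csv(io.StringIO(csv_text), nrows=0)]

# Skip the original header row, apply the cleaned names, keep two columns
sample_frame = pandas.read_csv(
    io.StringIO(csv_text),
    names=sample_headers,
    skiprows=1,
    usecols=["Flow Duration", "Label"],
)
print(sample_frame)
```

Passing `usecols` keeps memory use down because the unused columns are never loaded into the dataframe.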


We read in 3 different files: 

- **'netbios_day1.csv'** contains training data from day 1
- **'netbios_day2.csv'** contains netbios testing data from day 2
- **'portmap.csv'** contains portmap testing data from day 2

The following code reads data from 'netbios_day1.csv' and puts it into a pandas dataframe, which is a two-dimensional tabular data structure:

```python
dataframe_train = pandas.read_csv('netbios_day1.csv', names = header_names, skiprows=1, usecols=use_columns)
```

Use the same template to read in the other two test files ('netbios_day2.csv' and 'portmap.csv') in the **TODO** section

In [None]:
# Example: read data into dataframe_train from 'netbios_day1.csv'
# Use header_names extracted previously
# Skip first row which contains header names and use custom column names
dataframe_train = pandas.read_csv('netbios_day1.csv', names = header_names, skiprows=1, usecols=use_columns)

# TODO
dataframe_test = pandas.read_csv('netbios_day2.csv', names = header_names, skiprows=1, usecols=use_columns)
dataframe_portmap = pandas.read_csv('portmap.csv', names = header_names, skiprows=1, usecols=use_columns)

***

<a id='part2'></a>
# Part 2: Clean the Data

Cleaning data is one of the key steps in data preprocessing. There are several kinds of values that we would like to remove from our data before running a machine learning algorithm on it. Data that is not clean could distort insights and render your results useless or lead to false conclusions.


### Infinity

Removing rows containing infinity is important as these could skew our results

```python
#Usage: dataframe.replace([values, to, replace], value)
dataframe.replace()
```

can be used to replace all infinity values with null values which we will remove in the next section

In [None]:
dataframe_train = dataframe_train.replace([numpy.inf, -numpy.inf], numpy.nan)
dataframe_test = dataframe_test.replace([numpy.inf, -numpy.inf], numpy.nan)
dataframe_portmap = dataframe_portmap.replace([numpy.inf, -numpy.inf], numpy.nan)

### Null Values

Null values are values that are missing or not present in the database. These values are commonly indicated by a NaN. They can easily be removed with the function

```python
dataframe.dropna()
```

which drops all rows with null values

In [None]:
dataframe_train = dataframe_train.dropna()
dataframe_test = dataframe_test.dropna()
dataframe_portmap = dataframe_portmap.dropna()
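
The two cleaning steps so far (infinity to NaN, then dropping NaN rows) can be checked on a toy dataframe. This is a minimal sketch with made-up values; `toy` is a hypothetical name and is not part of the notebook's data.

```python
import numpy
import pandas

# Toy data: row 1 contains infinity, row 2 contains a missing value
toy = pandas.DataFrame({
    "Flow Duration": [10.0, 0.0, 30.0, 40.0],
    "Fwd Packets/s": [5.0, numpy.inf, numpy.nan, 8.0],
})

# Step 1: turn +/- infinity into NaN so both problems look the same
toy = toy.replace([numpy.inf, -numpy.inf], numpy.nan)

# Step 2: drop every row that contains at least one NaN
toy = toy.dropna()
print(toy)
```

Only rows 0 and 3 survive. Note that `dropna()` keeps the original index labels, so the index is no longer consecutive afterwards.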

### Zero Values

We want to eliminate any columns with all zero values as these do not contribute to our algorithm and may increase run-time

```python
#Usage: dataframe.loc[label to look for or boolean array] 
dataframe.loc[]
```


```(dataframe != 0)``` returns a boolean dataframe indicating which entries are non-zero  <br />
```(dataframe != 0).any(axis=0)``` returns a boolean series indicating which columns have at least one non-zero entry  <br />
```dataframe.loc``` can then be used to select those columns with non-zero entries 

In [None]:
dataframe_train = dataframe_train.loc[:, (dataframe_train != 0).any(axis=0)]
dataframe_test = dataframe_test.loc[:, (dataframe_test != 0).any(axis=0)]
dataframe_portmap = dataframe_portmap.loc[:, (dataframe_portmap != 0).any(axis=0)]
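
To see what each intermediate expression produces, the same three steps can be run on a small example. The values below are made up for illustration, and `toy`, `nonzero_mask` and `keep_columns` are hypothetical names.

```python
import pandas

toy = pandas.DataFrame({
    "SYN Flag Count": [0, 1, 0],
    "PSH Flag Count": [0, 0, 0],   # all zeros, so this column should be dropped
    "ACK Flag Count": [1, 1, 0],
})

nonzero_mask = (toy != 0)                 # boolean dataframe, one flag per entry
keep_columns = nonzero_mask.any(axis=0)   # boolean series, one flag per column
toy = toy.loc[:, keep_columns]            # keep only columns flagged True
print(list(toy.columns))
```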

***

<a id='part3.0'></a>
# Part 3: Explore the Data

<a id='part3.1'></a>
### 3.1 Overview of data and features

We explore our data to get an overview of our training set and to answer questions such as <br/>
How many rows and columns does the dataset contain? <br/>
What are the names of the features (columns)? <br/>
Which features are identifying, numerical and categorical? <br/>

```python
dataframe.shape
```

```python
dataframe.head()
```

```python
dataframe.info()
```

These three calls answer the questions above by showing the number of rows and columns, the first 5 rows of the table, and details about the features.



#### Shape, head, info

Shape prints out the number of rows and number of columns of the dataframe. The shape of a dataframe can be obtained by calling dataframe.shape

Print the shape of the three dataframes in the **TODO** sections below


In [None]:
# Find shape of dataframe_train
# TODO: Print training dataframe shape
print()
print("*" * 50)

# TODO: Print netbios test dataframe shape
print()
print("*" * 50)

#TODO: Print the portmap test dataframe shape
print()


Head prints out the first five rows of the dataframe. The head of a dataframe can be found by calling dataframe.head()

Print the head of the dataframes in the **TODO** sections of the code cells below

In [None]:
print("Training data:")
dataframe_train.head()

In [None]:
print("Netbios testing data: ")
# TODO: print head() of Netbios testing data


In [None]:
print("Portmap testing data: ")
# TODO: print head() of portmap testing data


Info provides the data types of our features. This information is important since only the numerical data and the Label are required for our machine learning analysis. Furthermore, we would like to leave out certain identifying features such as Source Port and Destination Port.

Print out dataframe.info() in the **TODO** sections below

In [None]:
print(" Training Data : ")
# TODO: print info() of training data

We can obtain counts of values within each column with 

```python
dataframe['column name'].value_counts()
```

In [None]:
print("Training Data Label Counts:")
# TODO: print counts of 'Label' values of training data
print(dataframe_train['Label'].value_counts())
print("\n")

print("Testing Data Label Counts:")
# TODO: print counts of 'Label' values of Netbios testing data
print()
print("\n")

print("Portmap Label Counts:")
# TODO: print counts of 'Label' values of Portmap testing data
print()
print("\n")

#### Summary

From this exploratory analysis we can conclude that:

**Training Data:**
- Our training data has 19 columns 
- 12 numerical features, 6 categorical features and the label 
- 4,094,986 total entries or number of rows
- 4,093,279 labels are DDOS attacks and 1707 labels are BENIGN.

**Netbios Testing Data:**
- Our netbios testing data also has 19 columns 
- 3,455,899 total entries or number of rows
- 3,454,578 labels are NETBIOS attacks and 1321 are BENIGN.

**Portmap Testing Data:**
- Our portmap testing data also has 19 columns 
- 191,694 total entries or number of rows
- 186,960 labels are PORTMAP attacks and 4734 are BENIGN. 

<a id='part3.2'></a>
### 3.2 Numerical and Categorical Features

We can separate our numerical and categorical features by searching for a specific data type in our dataframe. Categorical columns are stored with the data type "object", so all columns with a numerical data type are candidates for our training.

In [None]:
numerical_feats = dataframe_train.dtypes[dataframe_train.dtypes != "object"].index
print("Number of Numerical features: ", len(numerical_feats))

categorical_feats = dataframe_train.dtypes[dataframe_train.dtypes == "object"].index
print("Number of Categorical features: ", len(categorical_feats))

Certain features, such as Source Port and Destination Port, are stored as numbers but are really identifying features and should therefore be left out of our training data.

In [None]:
print("Numerical Features: ")
print(dataframe_train[numerical_feats].columns)
print("*"*100)
print("Categorical Features: ")
print(dataframe_train[categorical_feats].columns)
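
As a sketch of the same idea on a toy frame, pandas' `select_dtypes` picks out numeric columns without comparing against the "object" data type, and the identifying columns can then be filtered out by name. The frame and the names `numeric_columns`, `identifying` and `training_candidates` are hypothetical and only serve as illustration.

```python
import pandas

toy = pandas.DataFrame({
    "Source IP": ["10.0.0.1", "10.0.0.2"],
    "Source Port": [443, 53],
    "Flow Duration": [120, 98],
    "Label": ["BENIGN", "NetBIOS"],
})

# All columns with a numeric data type
numeric_columns = list(toy.select_dtypes(include="number").columns)

# Numeric columns that identify a connection rather than describe its behaviour
identifying = ["Source Port", "Destination Port"]
training_candidates = [c for c in numeric_columns if c not in identifying]
print(training_candidates)
```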



Next, we create an indicator variable from our Label feature. The indicator variable is the dependent variable, i.e. the variable that we attempt to predict. For ease of use we map our labels such that: <br/>
NETBIOS/Portmap ---> 1 <br/>
BENIGN          ---> 0

In [None]:
dataframe_train['Label'] = [1.0 if x == "DrDoS_NetBIOS" else 0.0 for x in dataframe_train['Label']]
dataframe_test['Label'] = [1.0 if x == "NetBIOS" else 0.0 for x in dataframe_test['Label']]
dataframe_portmap['Label'] = [1.0 if x == "Portmap" else 0.0 for x in dataframe_portmap['Label']]
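
The same mapping can also be written as a vectorised comparison instead of a list comprehension. This is an equivalent sketch on a toy column, not a replacement for the cell above; `toy` is a hypothetical name.

```python
import pandas

toy = pandas.DataFrame({"Label": ["BENIGN", "DrDoS_NetBIOS", "DrDoS_NetBIOS", "BENIGN"]})

# The comparison gives a boolean series; astype(float) turns True/False into 1.0/0.0
toy["Label"] = (toy["Label"] == "DrDoS_NetBIOS").astype(float)
print(toy["Label"].tolist())
```

Either way, any label other than the attack name (including a misspelled one) silently becomes 0, so it is worth checking `value_counts()` before and after the mapping.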

***

<a id='part4.0'></a>
# Part 4: Machine Learning

<a id='part4.1'></a>
### 4.1 Feature Selection

Feature Selection is important to machine learning as it plays a crucial role in delivering accurate outputs. Irrelevant features can decrease accuracy of the models or even prevent algorithms from running to completion.



### Correlation

Correlation is a numerical measure of the relationship between two variables. A correlation close to 1 or -1 means two variables have a strong linear relationship with each other, while a value close to 0 means a weak one. In our dataset, it is useful to map the correlation between our numerical features and our Label in order to pick out the features with the highest correlation.

The code below generates a heatmap of the correlation between each pair of numerical columns of the dataframe.

To find the correlation between our independent and dependent variables, look at each feature on the y-axis against the Label column on the x-axis.

In [None]:
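# Correlate numeric columns only; text columns such as Flow ID cannot be correlated
corr_matrx = dataframe_train.select_dtypes(include='number').corr()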
fig, ax = plt.subplots(figsize=(16,16))
sns.heatmap(corr_matrx, annot=True, ax=ax, fmt='0.3f')
plt.show()
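
Reading the Label column off a large heatmap can be tedious, so the same information can be listed as a ranked table. The sketch below uses made-up numbers and a hypothetical name `corr_with_label`; it assumes the Label has already been mapped to 0/1 as in section 3.2.

```python
import pandas

toy = pandas.DataFrame({
    "Flow Duration": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "Fwd Packets/s": [9.0, 1.0, 8.0, 2.0, 7.0, 3.0],
    "Label":         [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

corr_with_label = (
    toy.corr()["Label"]              # the Label column of the correlation matrix
    .drop("Label")                   # a variable always correlates 1.0 with itself
    .abs()                           # strength matters here, not direction
    .sort_values(ascending=False)    # strongest relationship first
)
print(corr_with_label)
```

Taking the absolute value matters because a strongly negative correlation is as informative for the model as a strongly positive one.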

### Selecting features

```python
# Usage: dataframe[ [columns to select] ]
dataframe[ [] ]
```

can be used to select a subset of columns from our dataframe.

We can create a dataframe with our required features by passing a list of the required features inside the square brackets. Copy the list below, which contains the numerical features, and paste it into the dataframe_train[] call in the code cell. The list of features can be changed manually, for example by excluding features with low correlation, to see how the accuracy changes.

``` python
["Flow Duration", "Total Fwd Packets", "Total Backward Packets", "Fwd IAT Mean", "Fwd IAT Std", "Fwd IAT Min", "Fwd Packets/s", "Bwd Packets/s","SYN Flag Count", "RST Flag Count", "ACK Flag Count", "URG Flag Count"]
```

In [None]:
# TODO: Include a list of required features within dataframe_train[]
dataframe_features = dataframe_train[ ["Flow Duration", "Total Fwd Packets", "Total Backward Packets", "Fwd IAT Mean", "Fwd IAT Std", "Fwd IAT Min", "Fwd Packets/s", "Bwd Packets/s","SYN Flag Count", "RST Flag Count", "ACK Flag Count", "URG Flag Count"] ]

We then set up two arrays that contain the dependent and independent variables. Our dependent variable is the Label, i.e., NETBIOS/PORTMAP (1) or BENIGN (0), and our independent variables are the features we selected in the code cells above.

Copy the following code into the code cell below and run it to set up Y_train which stores the label values in an array and X_train which stores the numerical features in an array

```python
Y_train = dataframe_train['Label'].values
X_train = dataframe_features.values
```

In [None]:
# TODO: Create X_train and Y_train variables 
Y_train = dataframe_train['Label'].values
X_train = dataframe_features.values

<a id='part4.2'></a>
### 4.2 Training

Training in machine learning involves a sequence of setting up dependent and independent variables, running an algorithm on the data and building a mathematical model based on the sample data.

We set up our required variables in the previous section. Now we choose a classifier to build our model. In this example we use the binary logistic regression model. Binary logistic regression is frequently used when the dependent variable is categorical and has only two possible outcomes, in our case DDoS or benign.

Our X_train and Y_train are then passed into our machine learning algorithm, in this case logistic regression, to build our model. Copy the following code into the code cell below to build the model.

```python
logmodel = LogisticRegression()
logmodel.fit(X_train, Y_train)
```

In [None]:
# TODO: Create logistic regression model
logmodel = LogisticRegression()
logmodel.fit(X_train, Y_train)
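
The imports in Part 0 include `StandardScaler`, although the cells in this notebook never call it. As an optional variation, the sketch below shows how scaling and logistic regression can be chained with scikit-learn's `make_pipeline`. It runs on synthetic data (all values and the names `X_demo`, `labels` and `scaled_model` are hypothetical), and whether scaling changes the results on the DDoS dataset is not something this notebook reports.

```python
import numpy
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two informative features on very different scales
rng = numpy.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
feature_small = labels + rng.normal(0, 0.3, size=n)         # centred on 0 or 1
feature_large = 1e6 * labels + rng.normal(0, 3e5, size=n)   # centred on 0 or 1e6
X_demo = numpy.column_stack([feature_small, feature_large])

# The scaler is fitted on the training data and reused automatically at predict time
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, labels)
print(scaled_model.score(X_demo, labels))
```

Wrapping both steps in one pipeline means the test data is scaled with the statistics learned from the training data, which avoids fitting the scaler on the test set by accident.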

<a id='part4.3'></a>
### 4.3 Prediction

We can now use our trained logistic regression model to see how well it can predict if our testing data is a DDOS attack or benign.

Once again, we split our testing data into X_test and Y_test. X_test contains our numerical features which we will use to predict new labels. Y_test contains the correct labels that will then be compared against the labels predicted using X_test and the logistic regression model. 

The following code can be copied into the code cell and run to create the variables and predictions for the netbios data.

```python
Y_test = dataframe_test['Label'].values
X_test = dataframe_test.drop(categorical_feats, axis=1)
X_test = X_test.drop(['Source Port', 'Destination Port'], axis=1).values

predictions = logmodel.predict(X_test)
```

The following code creates the variables and predictions for portmap data:

```python
Y_test_portmap = dataframe_portmap['Label'].values
X_test_portmap = dataframe_portmap.drop(categorical_feats, axis=1)
X_test_portmap = X_test_portmap.drop(['Source Port', 'Destination Port'], axis=1).values
predictions_portmap = logmodel.predict(X_test_portmap)
```

In [None]:
Y_test = dataframe_test['Label'].values
X_test = dataframe_test.drop(categorical_feats, axis=1)
X_test = X_test.drop(['Source Port', 'Destination Port'], axis=1).values
predictions = logmodel.predict(X_test)

In [None]:
Y_test_portmap = dataframe_portmap['Label'].values
X_test_portmap = dataframe_portmap.drop(categorical_feats, axis=1)
X_test_portmap = X_test_portmap.drop(['Source Port', 'Destination Port'], axis=1).values
predictions_portmap = logmodel.predict(X_test_portmap)
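
The cells above build the test arrays by dropping columns, which assumes that each test dataframe ends up with exactly the training features in the same order. Because the zero-column cleanup in Part 2 is applied to each file separately, that may not always hold. A more defensive sketch is to select the test columns by the same feature list that was used for training; the frames and the names `feature_columns`, `X_train_demo` and `X_test_demo` below are hypothetical.

```python
import pandas

# Shortened feature list for illustration
feature_columns = ["Flow Duration", "Total Fwd Packets"]

toy_train = pandas.DataFrame({
    "Flow Duration": [10, 20], "Total Fwd Packets": [1, 2], "Label": [0.0, 1.0],
})
# The test frame has an extra column and a different column order
toy_test = pandas.DataFrame({
    "Total Fwd Packets": [3, 4], "PSH Flag Count": [1, 0],
    "Flow Duration": [30, 40], "Label": [1.0, 0.0],
})

# Selecting by the same list guarantees the same columns in the same order
X_train_demo = toy_train[feature_columns].values
X_test_demo = toy_test[feature_columns].values
print(X_train_demo.shape, X_test_demo.shape)
```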

***

<a id='part5.0'></a>
# Part 5: Evaluation

Now that we have completed the machine learning, we can evaluate how well our logistic regression model has performed against the testing data. We use two reports to find out our accuracy - classification report and confusion matrix. 



<a id='part5.1'></a>
### 5.1 Classification Report

A classification report displays the precision, recall, F1, and support scores for the model.

- **Precision:** Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. In our case, precision answers the question: Of all the network data that we labelled as NETBIOS/PORTMAP, how many were actually NETBIOS/PORTMAP?
- **Recall:** Recall or sensitivity is the ratio of correctly predicted positive observations to all observations in the actual class. Recall answers the question: Of all the network data that truly is NETBIOS/PORTMAP, how many did we label correctly?
- **F1 Score:** F1 score is the harmonic mean of precision and recall
- **Support:** Number of samples of the true dataset in each class
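
As a worked example, the three scores can be computed by hand for the BENIGN class from the counts shown in the Netbios confusion matrix in section 5.2.1. The variable names are hypothetical; the counts are the ones reported in that section.

```python
# Counts taken from the Netbios confusion matrix in section 5.2.1
benign_as_benign = 287     # actual BENIGN, predicted BENIGN
benign_as_attack = 1034    # actual BENIGN, predicted NETBIOS
attack_as_benign = 33      # actual NETBIOS, predicted BENIGN

# Scores for the BENIGN class
precision = benign_as_benign / (benign_as_benign + attack_as_benign)
recall = benign_as_benign / (benign_as_benign + benign_as_attack)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.90 recall=0.22 f1=0.35
```

These match the BENIGN row of the Netbios classification report: most predictions of BENIGN are right, but most of the truly BENIGN connections are not found.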

#### 5.1.1 Netbios

In [None]:
print(classification_report(Y_test,predictions))

According to this classification report, the precision is 90% for BENIGN (0) and 100% for NETBIOS DDoS (1): 90% of the connections predicted as BENIGN were actually BENIGN. The recall and F1 score for BENIGN are 22% and 35% respectively.

For our BENIGN labels, the classifier has high precision and low recall, which means the classifier is very selective. Most of the connections it predicts as BENIGN are truly BENIGN, but it misses many BENIGN connections and labels them as attacks instead.

#### 5.1.2 Portmap

In [None]:
print(classification_report(Y_test_portmap,predictions_portmap))

For the Portmap data, the precision is 91% for BENIGN (0) and 98% for PORTMAP (1). BENIGN had a recall of 24% and an F1 score of 37%, which means the classifier identified only about a quarter of the truly BENIGN connections.

As with the Netbios test, the classifier has high precision but low recall for BENIGN on the Portmap test.

<a id='part5.2'></a>
### 5.2 Confusion Matrix

To better understand our precision, recall and F1 scores we create a confusion matrix. The confusion matrix shows the actual and predicted labels from a classifier. In scikit-learn's output, rows are actual labels and columns are predicted labels, both ordered 0 then 1. With the attack label (1) as the positive class, the grid contains 4 values - True Negatives (top left), False Positives (top right), False Negatives (bottom left) and True Positives (bottom right)

**True Positive** refers to number of labels that are positive (1) and correctly predicted as positive (1)

**False Positive** refers to number of labels that are actually negative (0) but incorrectly predicted as positive (1)

**False Negative** refers to number of labels that are actually positive (1) but incorrectly predicted as negative (0)

**True Negative** refers to number of labels that are negative (0) and correctly predicted as negative (0)

The percent of true positive and true negative scores together make up the accuracy of the model and the percent of false positives and false negatives together make up the error or misclassification rate.
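
The position of each value in scikit-learn's grid is easy to mix up, so it helps to check it on a handful of labels where the answer can be counted by hand. In this sketch `y_true`, `y_pred` and `cm_demo` are made-up names and values; for labels 0 and 1, rows are actual labels, columns are predicted labels, and `ravel()` returns the four cells in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Three actual negatives (0) followed by four actual positives (1)
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 0]

cm_demo = confusion_matrix(y_true, y_pred)
print(cm_demo)

# Flatten row by row: [actual 0 -> pred 0, actual 0 -> pred 1,
#                      actual 1 -> pred 0, actual 1 -> pred 1]
tn, fp, fn, tp = cm_demo.ravel()
accuracy_demo = (tn + tp) / cm_demo.sum() * 100
print(tn, fp, fn, tp)
```

Here one negative is classified correctly (TN = 1), two negatives are flagged as positive (FP = 2), one positive is missed (FN = 1) and three positives are caught (TP = 3).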


#### 5.2.1 Netbios

In [None]:
cm=confusion_matrix(Y_test,predictions)
print(cm)

accuracy = (cm[0,0]+cm[1,1])/cm.sum()*100
error = (cm[0,1]+cm[1,0])/cm.sum()*100

print('Accuracy of the model: {0:.2f}%'.format(accuracy))
print('Error/ Misclassification rate: {0:.2f}%'.format(error))

From the grid we can identify that our scores are as follows:
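- True Negative (BENIGN predicted as BENIGN): 287
- False Positive (BENIGN predicted as NETBIOS): 1034
- False Negative (NETBIOS predicted as BENIGN): 33
- True Positive (NETBIOS predicted as NETBIOS): 3454545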

Our calculated accuracy is 99.97% and error rate is 0.03%

#### 5.2.2 Portmap

In [None]:
cm = confusion_matrix(Y_test_portmap,predictions_portmap)
print (cm)
accuracy = (cm[0,0]+cm[1,1])/cm.sum()*100
error = (cm[0,1]+cm[1,0])/cm.sum()*100


print('Accuracy of the model: {0:.2f}%'.format(accuracy))
print('Error/ Misclassification rate: {0:.2f}%'.format(error))

From the grid we can identify that our scores are as follows:
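- True Negative (BENIGN predicted as BENIGN): 1113
- False Positive (BENIGN predicted as PORTMAP): 3621
- False Negative (PORTMAP predicted as BENIGN): 107
- True Positive (PORTMAP predicted as PORTMAP): 186853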

Our calculated accuracy is 98.06% and error rate is 1.94%

***

# Resources


- Pandas Documentation: https://pandas.pydata.org/docs/
- scikit-learn Documentation: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- Seaborn Documentation: https://seaborn.pydata.org/introduction.html