# 1. Introduction

The purpose of this tutorial is to introduce you to some commonly used machine learning techniques. Because the focus of the course this semester is cybersecurity, I've chosen a phishing website prediction dataset. The tutorial covers data preprocessing and modelling techniques. The accompanying presentations give a cursory introduction to ML and cybersecurity that may be useful when prototyping your ideas.

# 2. Import Packages

In [52]:
import numpy as np
import pandas as pd # this library is used for data processing
import seaborn as sns # used for data visualization

from matplotlib import pyplot as plt
%matplotlib inline
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings("ignore")

# 3. Loading Dataset

The first step in any machine learning project is to load your dataset. We use the pandas library to do this as it provides us with dataframe objects that handle large amounts of data well.

In [53]:
phishing_df = pd.read_csv('../data/dataset.csv')
phishing_df.head()

Unnamed: 0,index,having_IPhaving_IP_Address,URLURL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,1,-1,1,1,1,-1,-1,-1,-1,-1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,2,1,1,1,1,1,-1,0,1,-1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,3,1,0,1,1,1,-1,-1,-1,-1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,4,1,0,1,1,1,-1,-1,-1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,5,1,0,-1,1,1,-1,1,1,-1,...,-1,1,-1,-1,0,-1,1,1,1,1


As the preview above shows, each website is described by a number of attributes that we can use as features to answer our question: is it a phishing website or not? In the rows shown, every attribute is coded as -1, 0, or 1.
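Beyond `head()`, two quick checks are usually worth running on a freshly loaded dataframe: its shape and the balance of the target column. The snippet below is a minimal sketch on a hypothetical `toy_df` that only mimics the coding of the real data (the values are made up for illustration); the same calls should work on `phishing_df`.

```python
import pandas as pd

# Hypothetical stand-in for phishing_df: two feature columns and the
# 'Result' target, coded as -1/0/1 like the preview above.
toy_df = pd.DataFrame({
    "SSLfinal_State": [1, -1, 0, 1, -1, 1],
    "web_traffic":    [1, 0, -1, 1, -1, 0],
    "Result":         [1, -1, -1, 1, -1, 1],
})

# Number of rows (websites) and columns (features plus target).
n_rows, n_cols = toy_df.shape

# How many examples fall into each class of the target.
class_counts = toy_df["Result"].value_counts().sort_index()

print(n_rows, n_cols)
print(class_counts)
```

A heavily imbalanced target would be a reason to look beyond plain accuracy when evaluating models later on.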

# 4. Dataset Statistics 

It is often important to understand the summary statistics of your data to get a better sense of what preprocessing you might need. Here we see, for each feature, the number of examples (count), the mean, the standard deviation, the minimum and maximum values, and the quartiles.

In [54]:
phishing_df.describe()

Unnamed: 0,index,having_IPhaving_IP_Address,URLURL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
count,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,...,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0,11055.0
mean,5528.0,0.313795,-0.633198,0.738761,0.700588,0.741474,-0.734962,0.063953,0.250927,-0.336771,...,0.613388,0.816915,0.061239,0.377114,0.287291,-0.483673,0.721574,0.344007,0.719584,0.113885
std,3191.447947,0.949534,0.766095,0.673998,0.713598,0.671011,0.678139,0.817518,0.911892,0.941629,...,0.789818,0.576784,0.998168,0.926209,0.827733,0.875289,0.692369,0.569944,0.694437,0.993539
min,1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0
25%,2764.5,-1.0,-1.0,1.0,1.0,1.0,-1.0,-1.0,-1.0,-1.0,...,1.0,1.0,-1.0,-1.0,0.0,-1.0,1.0,0.0,1.0,-1.0
50%,5528.0,1.0,-1.0,1.0,1.0,1.0,-1.0,0.0,1.0,-1.0,...,1.0,1.0,1.0,1.0,1.0,-1.0,1.0,0.0,1.0,1.0
75%,8291.5,1.0,-1.0,1.0,1.0,1.0,-1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
max,11055.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
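Because every feature in the summary has a minimum of -1 and a maximum of 1, the mean and standard deviation say less than a direct look at which codes each column takes. The following is a minimal sketch on a hypothetical `toy_df` (made-up values, column names borrowed from the dataset); it skips `index`, which the summary shows running from 1 to 11055 and so appears to be a row identifier rather than a feature.

```python
import pandas as pd

# Hypothetical stand-in for phishing_df with an 'index' id column.
toy_df = pd.DataFrame({
    "index":            [1, 2, 3, 4, 5, 6],
    "SSLfinal_State":   [1, -1, 0, 1, -1, 1],
    "having_At_Symbol": [1, 1, 1, -1, 1, 1],
    "Result":           [1, -1, -1, 1, -1, 1],
})

# Leave out the row identifier and the target.
feature_cols = [c for c in toy_df.columns if c not in ("index", "Result")]

# Distinct codes observed in each feature column.
unique_values = {
    col: sorted(toy_df[col].unique().tolist()) for col in feature_cols
}

# Share of rows taking each code for a single feature.
shares = toy_df["SSLfinal_State"].value_counts(normalize=True).sort_index()

print(unique_values)
print(shares)
```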


# 5. Data Preprocessing

This is a crucial part of any machine learning project because preprocessing can drastically improve your model's performance. We also need to remove problems in the data, such as null values, that may cause errors later.

## a. Cleaning Null Values

In [55]:
pd.isnull(phishing_df).sum()

index                          0
having_IPhaving_IP_Address     0
URLURL_Length                  0
Shortining_Service             0
having_At_Symbol               0
double_slash_redirecting       0
Prefix_Suffix                  0
having_Sub_Domain              0
SSLfinal_State                 0
Domain_registeration_length    0
Favicon                        0
port                           0
HTTPS_token                    0
Request_URL                    0
URL_of_Anchor                  0
Links_in_tags                  0
SFH                            0
Submitting_to_email            0
Abnormal_URL                   0
Redirect                       0
on_mouseover                   0
RightClick                     0
popUpWidnow                    0
Iframe                         0
age_of_domain                  0
DNSRecord                      0
web_traffic                    0
Page_Rank                      0
Google_Index                   0
Links_pointing_to_page         0
Statistica

Based on our check, there are no null values in the data, so no further steps are needed to deal with them!
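This dataset needs no cleaning, but it can help to see what the step might look like when nulls are present. The sketch below uses a hypothetical `toy_df` with a few missing entries. Dropping unlabelled rows and filling feature gaps with the most common value is only one possible strategy, chosen here as an assumption because the features are categorical codes.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
toy_df = pd.DataFrame({
    "SSLfinal_State": [1, np.nan, 1, -1, 1],
    "web_traffic":    [0, 1, np.nan, 1, 1],
    "Result":         [1, -1, -1, 1, np.nan],
})

# Same check as above: count nulls per column.
null_counts = toy_df.isnull().sum()

# Rows without a label cannot be used for supervised training, so drop them.
labelled = toy_df.dropna(subset=["Result"])

# Fill the remaining feature gaps with each column's most frequent value.
feature_cols = ["SSLfinal_State", "web_traffic"]
modes = labelled[feature_cols].mode().iloc[0]
cleaned = labelled.fillna(modes)

print(null_counts)
print(cleaned)
```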

# 6. Training Models

Now that we have preprocessed our data, we are ready to train and evaluate models. The first thing we need to do is split our dataset into a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate its performance on unseen data. The split and the first two models below use the scikit-learn library; the neural network uses Keras.

In [56]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, accuracy_score, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow import keras

## Dataset Split

In [57]:
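# Separate the feature columns from the target column ('Result').
# X_train and y_train hold the full dataset here; the next cell splits them.
# Note: 'index' looks like a row identifier and would normally be dropped too;
# it is kept so the inputs still have the 31 columns used by the models below.
training_df = phishing_df
X_train = training_df.drop(columns=['Result'])
y_train = training_df['Result']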

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size = 0.2, random_state = 420)
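To make the effect of `test_size = 0.2` concrete, here is a minimal sketch on synthetic data (the arrays `X` and `y` are made up). It also passes `stratify=y`, an option the cell above does not use, which keeps the class proportions the same in both parts of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows of -1/0/1 codes, 60 positive and 40 negative labels.
rng = np.random.default_rng(0)
X = rng.choice([-1, 0, 1], size=(100, 5))
y = np.array([1] * 60 + [-1] * 40)

# Hold out 20% of the rows for testing, preserving the 60/40 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=420, stratify=y
)

print(X_tr.shape, X_te.shape)
print((y_te == 1).sum(), (y_te == -1).sum())
```

Fixing `random_state` makes the split reproducible, so different models are compared on exactly the same held-out rows.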

## Logistic Regression

In [59]:
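# Fit a logistic regression classifier with scikit-learn's default settings.
LR_Model = LogisticRegression()
LR_Model.fit(X_train, y_train)

# Accuracy: fraction of test websites whose predicted label matches the true label.
LR_Predict = LR_Model.predict(X_test)
LR_Accuracy = accuracy_score(y_test, LR_Predict)
print("Accuracy: " + str(LR_Accuracy))

# AUC is computed from the predicted probability of the positive class (Result = 1).
LR_AUC = roc_auc_score(y_test, LR_Model.predict_proba(X_test)[:,1])
print("AUC: " + str(LR_AUC))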

Accuracy: 0.9298959746720941
AUC: 0.9815121083894551
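The two metrics measure different things: accuracy scores the hard predictions, while AUC scores how well the predicted probabilities rank positive examples above negative ones. The tiny hand-made example below (labels and probabilities are invented for illustration) is a sketch of how each is computed.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Four made-up examples with -1/1 labels, as in the 'Result' column,
# and a made-up predicted probability of the positive class for each.
y_true = np.array([-1, -1, 1, 1])
p_pos = np.array([0.1, 0.4, 0.35, 0.8])

# Hard predictions: label 1 when the probability reaches 0.5, else -1.
y_pred = np.where(p_pos >= 0.5, 1, -1)

# Three of the four hard predictions are correct.
acc = accuracy_score(y_true, y_pred)

# Of the four (positive, negative) pairs, three are ranked correctly.
auc = roc_auc_score(y_true, p_pos)

print(acc, auc)  # 0.75 0.75
```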


## Random Forest

In [60]:
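# Fit a random forest with scikit-learn's default settings.
# No random_state is set, so the scores can vary slightly between runs.
RFC_Model = RandomForestClassifier()
RFC_Model.fit(X_train, y_train)

# Accuracy on the held-out test set.
RFC_Predict = RFC_Model.predict(X_test)
RFC_Accuracy = accuracy_score(y_test, RFC_Predict)
print("Accuracy: " + str(RFC_Accuracy))

# AUC from the predicted probability of the positive class (Result = 1).
RFC_AUC = roc_auc_score(y_test, RFC_Model.predict_proba(X_test)[:,1])
print("AUC: " + str(RFC_AUC))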

Accuracy: 0.9769335142469471
AUC: 0.9972249538113117
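`GridSearchCV` is imported earlier but not used in this tutorial. As a rough sketch of how it could tune the random forest, the snippet below searches a small, arbitrarily chosen grid on synthetic data from `make_classification`. The grid values, the `roc_auc` scoring choice, and the 3-fold setting are all assumptions for illustration, not settings taken from the tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the training set.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# Hypothetical grid: every combination is scored with 3-fold cross-validation.
param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=3,
)
search.fit(X_demo, y_demo)

# Best combination found and its mean cross-validated AUC.
print(search.best_params_)
print(search.best_score_)
```

Because the search uses cross-validation on the training data only, the test set stays untouched for the final comparison.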


## Neural Network

In [66]:
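# Binary cross-entropy expects targets in {0, 1}, but 'Result' is coded as -1/1,
# so remap the labels before training.
y_train_nn = (y_train == 1).astype(int)

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(31, )),  # 31 input columns, including 'index'
    keras.layers.Dense(15, activation = 'relu'),
    # Sigmoid keeps the output in (0, 1) so it can be read as a probability.
    keras.layers.Dense(1, activation = 'sigmoid')
])

model.compile(optimizer='sgd',
             loss=tf.keras.losses.BinaryCrossentropy(),
             metrics=['accuracy'])

model.fit(X_train, y_train_nn, epochs=10)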

Train on 8844 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7fcc8dc71320>
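One detail worth checking with this loss: binary cross-entropy assumes targets in {0, 1} and predictions in (0, 1), whereas `Result` is coded as -1/1. The NumPy sketch below uses a hand-written `binary_cross_entropy` helper (hypothetical, not the Keras implementation) and made-up predictions to show what happens when the labels are not remapped.

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-7):
    """Mean binary cross-entropy for targets y_true and probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

# Labels as stored in the dataset, and the same labels remapped to 0/1.
labels_pm1 = np.array([1, -1, 1, -1])
labels_01 = (labels_pm1 == 1).astype(int)

# Made-up predicted probabilities of the positive class.
p = np.array([0.9, 0.1, 0.8, 0.2])

good = binary_cross_entropy(labels_01, p)   # well-defined, non-negative loss
bad = binary_cross_entropy(labels_pm1, p)   # -1 targets make the loss negative

print(good, bad)
```

With 0/1 targets the loss is small and non-negative for these good predictions; with -1 targets the formula produces a negative value, which is not a meaningful training signal.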

Sometimes more complex models are not beneficial. This is a good example where the number of features and the number of examples are low enough that both logistic regression and random forest outperform the neural network.

# 7. Conclusion

You have now gone through the process of applying machine learning models to website phishing data. Keep these concepts of data preprocessing and model selection in mind when determining the best way to solve your own problems.