Firewall Device used at Firat University. In this report, I will compare and contrast five supervised learning algorithms in terms of learning curve, model complexity, and execution time. I will also tune the parameters of each algorithm to improve performance. Exploratory data analysis will also be done on both datasets.

1. Importing libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score,
                             f1_score,
                             classification_report,
                             roc_auc_score,
                             confusion_matrix,
                             ConfusionMatrixDisplay)
import scikitplot as skplot
import matplotlib.pyplot as plt
import seaborn as sns



2. Reading dataset

In [None]:
df = pd.read_csv("dataset.csv")

Take a look at the dataset

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.isnull().sum()


No null values

3. Exploratory Data Analysis

I will perform some exploratory data analysis to gain a better understanding of the independent variables in the dataset and their relationship with the firewall action. I will begin by looking at the dataset.

In [None]:
df.info()

Dataset Features:
The log records were taken from the Palo Alto 5020 Firewall device used at Firat University. The log consists of 65532 records, captured over approximately 30 seconds. The attributes retained focus on port, byte, packet and time information. The target feature is Action, which takes the values “allow”, “deny”, “drop” and “reset-both”.


Source Port: The port from which the client application originates traffic
Destination Port: The port on which the destination application is listening
NAT Source Port: Network address translation source port
NAT Destination Port: Network address translation destination port
Action: The action the firewall performs based on its analysis of the traffic. The classes are allow, deny, drop and reset-both
Bytes: Total traffic in bytes
Bytes Sent: Total traffic sent in bytes
Bytes Received: Total traffic received in bytes
Packets: Total number of packets involved
Elapsed Time (sec): Elapsed time for the flow in seconds
pkts_sent: Total packets sent
pkts_received: Total packets received

There are 4 classes in the Action attribute, which is used as the target. They are described below:
Allow: Allows the traffic
Deny: Blocks the traffic and enforces the default deny action defined for the application that is being denied
Drop: Silently drops the traffic; for an application, it overrides the default deny action. A TCP reset is not sent to the host/application
Reset-Both: Sends a TCP reset to both the client-side and server-side devices


  


3.1 Target Distribution

In [None]:
# target distribution
print('Absolute Frequencies:')
print(df.Action.value_counts())
print()

print('Percentages:')
print(df.Action.value_counts(normalize=True)*100)

df.Action.value_counts().plot(kind='bar')
plt.title('Target (Action)')
plt.grid()
plt.show()


3.2 Feature Exploration

3.2.1 Numerical features

In [None]:
# list of numerical features
features_num = ['Bytes', 'Bytes Sent', 'Bytes Received',
                'Packets', 'Elapsed Time (sec)',
                'pkts_sent', 'pkts_received']

# define log transformation for numerical features

def num_transfo(x):
    return np.log10(1+x)


# plot distribution of numerical features
for f in features_num:
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(11, 7), sharex=True)
    ax1.hist(num_transfo(df[f]), bins=20)
    ax1.grid()
    ax1.set_title('Feature: ' + f + ' - transfo [log_10(1+x)]')
    ax2.boxplot(num_transfo(df[f]), vert=False)
    ax2.grid()
    ax2.set_title('Feature: ' + f + ' - transfo [log_10(1+x)]')
    plt.show()
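
The log10(1+x) transform used above compresses heavy-tailed counts such as bytes and packets while keeping zeros at zero. The following is a minimal sketch of that behaviour on made-up values (they are not taken from the dataset). The helper names `num_transfo_stable` and `num_transfo_inverse` are my own additions for illustration; the first relies on `np.log1p`, which is an alternative formulation rather than what the notebook uses.

```python
import numpy as np


def num_transfo(x):
    """log10(1 + x), the same transform as in the notebook."""
    return np.log10(1 + x)


def num_transfo_stable(x):
    """Equivalent form built on np.log1p (hypothetical helper)."""
    return np.log1p(x) / np.log(10)


def num_transfo_inverse(z):
    """Map transformed values back to the original scale (hypothetical helper)."""
    return 10.0 ** z - 1


# Made-up values spanning several orders of magnitude
values = np.array([0, 9, 99, 999_999])
transformed = num_transfo(values)
print(transformed)
```

Because the transform is monotonic and invertible, plots on the transformed scale preserve the ordering of the raw values.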



3.2.2 Categorical Features

In [None]:
features_cat = ['Source Port', 'Destination Port',
                'NAT Source Port', 'NAT Destination Port']

# show only top 10 levels for each feature
for f in features_cat:
    print('Feature:', f)
    print(df[f].value_counts()[0:10])
    print()
    df[f].value_counts()[0:10].plot(kind='bar')
    plt.title(f)
    plt.grid()
    plt.show()


3.3 Target vs Features

3.3.1 Target vs Numerical Features

In [None]:
# add transformations of numerical features
for f in features_num:
    new_feature = f + '_transfo'
    df[new_feature] = num_transfo(df[f])

features_num_transfo = [f+'_transfo' for f in features_num]

# plot features distribution by target level
for f in features_num_transfo:  # use transformed features for plot
    plt.figure(figsize=(10, 6))
    sns.violinplot(x=f, y='Action', data=df)
    my_title = 'Distribution by Action for ' + f
    plt.title(my_title)
    plt.grid()

3.3.2 Target vs Categorical Features: heatmap for top 10 levels only

In [None]:
# visualize crosstable target vs feature (using top 10 levels only)
for f in features_cat:
    top10_levels = df[f].value_counts()[0:10].index.to_list()
    df_temp = df[df[f].isin(top10_levels)]
    ctab = pd.crosstab(df_temp.Action, df_temp[f])
    print('Feature:' + f + ' - Top 10 levels only')
    plt.figure(figsize=(12, 5))
    sns.heatmap(ctab, annot=True, fmt='d',
                cmap='Blues',
                linecolor='black',
                linewidths=0.1)
    plt.show()


3.4 Source/Destination plots split by target

3.4.1  Source Port/Destination Port plots split by target

In [None]:
# source/destination plot by Action
xx = 'Source Port'
yy = 'Destination Port'

fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(10, 10))

df_temp = df[df.Action == 'allow']
axs[0, 0].scatter(df_temp[xx], df_temp[yy], alpha=0.05)
axs[0, 0].set_title('Action = allow')
axs[0, 0].set_xlabel(xx)
axs[0, 0].set_ylabel(yy)
axs[0, 0].grid()

df_temp = df[df.Action == 'deny']
axs[0, 1].scatter(df_temp[xx], df_temp[yy], alpha=0.05)
axs[0, 1].set_title('Action = deny')
axs[0, 1].set_xlabel(xx)
axs[0, 1].set_ylabel(yy)
axs[0, 1].grid()

df_temp = df[df.Action == 'drop']
axs[1, 0].scatter(df_temp[xx], df_temp[yy], alpha=0.5)
axs[1, 0].set_title('Action = drop')
axs[1, 0].set_xlabel(xx)
axs[1, 0].set_ylabel(yy)
axs[1, 0].grid()

df_temp = df[df.Action == 'reset-both']
axs[1, 1].scatter(df_temp[xx], df_temp[yy], alpha=0.5)
axs[1, 1].set_title('Action = reset-both')
axs[1, 1].set_xlabel(xx)
axs[1, 1].set_ylabel(yy)
axs[1, 1].grid()

plt.show()


3.4.2  NAT Source Port/NAT Destination Port plots split by target

In [None]:
# source/destination plot by Action - NAT (Network Address Translation) version
xx = 'NAT Source Port'
yy = 'NAT Destination Port'

fig, axs = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(10, 10))

df_temp = df[df.Action == 'allow']
axs[0, 0].scatter(df_temp[xx], df_temp[yy], alpha=0.05)
axs[0, 0].set_title('Action = allow')
axs[0, 0].set_xlabel(xx)
axs[0, 0].set_ylabel(yy)
axs[0, 0].grid()

df_temp = df[df.Action == 'deny']
axs[0, 1].scatter(df_temp[xx], df_temp[yy], alpha=0.5)
axs[0, 1].set_title('Action = deny')
axs[0, 1].set_xlabel(xx)
axs[0, 1].set_ylabel(yy)
axs[0, 1].grid()

df_temp = df[df.Action == 'drop']
axs[1, 0].scatter(df_temp[xx], df_temp[yy], alpha=0.5)
axs[1, 0].set_title('Action = drop')
axs[1, 0].set_xlabel(xx)
axs[1, 0].set_ylabel(yy)
axs[1, 0].grid()

df_temp = df[df.Action == 'reset-both']
axs[1, 1].scatter(df_temp[xx], df_temp[yy], alpha=0.5)
axs[1, 1].set_title('Action = reset-both')
axs[1, 1].set_xlabel(xx)
axs[1, 1].set_ylabel(yy)
axs[1, 1].grid()

plt.show()


4. Fitting model

4.1 Train Test Split

In [None]:
X = df.drop('Action', axis=1)
y = df.Action

X1=X
y1 = y

In [None]:
X.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)


X_train1 = X_train
X_test1 = X_test
y_train1 = y_train
y_test1 = y_test
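
The split above passes `stratify=y`, which keeps the class proportions the same in the training and test sets. This matters when some Action classes are rare. A small sketch of the effect, using hypothetical labels (90 "allow", 10 "deny") rather than the real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, for illustration only
y_demo = np.array(['allow'] * 90 + ['deny'] * 10)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# With stratification, the minority class keeps its 10% share in both parts
train_share = np.mean(y_tr == 'deny')
test_share = np.mean(y_te == 'deny')
print(train_share, test_share)
```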


In [None]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))


In [None]:
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))


4.2 Models

4.2.1 k-nearest neighbors

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
# predicting on test data
y_pred_knn = knn.predict(X_test)
# predicting on training data
X_pred_knn = knn.predict(X_train)


Comparison of training and test accuracy as a function of n_neighbors

In [None]:
training_accuracy = []
test_accuracy = []
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    training_accuracy.append(knn.score(X_train, y_train))
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
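
The plot above varies model complexity (n_neighbors) at a fixed training-set size. A learning curve, as mentioned in the introduction, does the opposite: it fixes the model and varies the amount of training data. Below is a rough sketch using scikit-learn's `learning_curve`. The data is a synthetic stand-in generated with `make_classification`, which is an assumption on my part; in the notebook, `X_train` and `y_train` would take its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption; not the firewall dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    n_classes=3, random_state=0)

train_sizes, train_scores, valid_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=3), X_demo, y_demo,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Average the per-fold scores for each training-set size
mean_train = train_scores.mean(axis=1)
mean_valid = valid_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_valid):
    print(f"{int(n):4d} samples: train={tr:.3f}  validation={va:.3f}")
```

A persistent gap between the two curves suggests overfitting, while two low curves that have converged suggest the model is too simple.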

Determining the best K value

In [None]:
from sklearn.model_selection import GridSearchCV
params = {'n_neighbors':[2,3,4,5,6,7,8,9, 10]}
knn = KNeighborsClassifier()
model = GridSearchCV(knn, params, cv=5)
model.fit(X_train, y_train)
model.best_params_
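
k-nearest neighbors relies on distances, so features measured on very different scales (for example byte counts versus elapsed seconds) can dominate the neighbor search. One possible variation, which is not what the notebook does, is to standardize the features inside a pipeline so that the scaler is refit on each cross-validation fold. The sketch below uses synthetic data as a stand-in; the step names `scale` and `knn` are arbitrary labels I chose.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption; not the firewall dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    n_classes=3, random_state=0)

# Scaling happens inside the pipeline, so each CV fold is scaled
# using only its own training portion
pipe = Pipeline([('scale', StandardScaler()),
                 ('knn', KNeighborsClassifier())])
param_grid = {'knn__n_neighbors': list(range(2, 11))}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```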


Using K value of 3 

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# predicting on test data
y_pred_knn = knn.predict(X_test)
# predicting on training data
X_pred_knn = knn.predict(X_train)


In [None]:
print('---------------------------------------------------------')
print('****************** KNN Classification ******************')
print('Classes: ', knn.classes_)
print('Effective Metric: ', knn.effective_metric_)
print('Effective Metric Params: ', knn.effective_metric_params_)
print('No. of Samples Fit: ', knn.n_samples_fit_)
# print('Outputs 2D: ', clf.outputs_2d_)
# print('--------------------------------------------------------')
print("")

print('*************** Evaluation on Test Data ***************')
scoreC_te = knn.score(X_test, y_test)
print('Accuracy Score: ', scoreC_te)
# Look at classification report to evaluate the model
print(classification_report(y_test, y_pred_knn))
# print('--------------------------------------------------------')
print("")

print('*************** Evaluation on Training Data ***************')
scoreC_tr = knn.score(X_train, y_train)
print('Accuracy Score: ', scoreC_tr)
# Look at classification report to evaluate the model
print(classification_report(y_train, X_pred_knn))
print('---------------------------------------------------------')




Confusion matrix

In [None]:
skplot.metrics.plot_confusion_matrix(y_test,y_pred_knn)

4.2.2 Decision Tree

Encoding the categorical target Action

In [None]:
from sklearn import preprocessing

cat =  ['Action']
le = preprocessing.LabelEncoder()
for i in cat:
    df[i] = le.fit_transform(df[i])

df.head()
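
`LabelEncoder` assigns integer codes in sorted order of the class names, and the mapping can be reversed with `inverse_transform`, which is useful for reading reports and confusion matrices after encoding. A short sketch on a hand-written list of actions (the list itself is illustrative, though the four class names come from the dataset description):

```python
from sklearn.preprocessing import LabelEncoder

le_demo = LabelEncoder()
actions = ['allow', 'deny', 'drop', 'reset-both', 'allow']
codes = le_demo.fit_transform(actions)

# classes_ holds the original names; the index of each name is its code
print(dict(zip(le_demo.classes_, range(len(le_demo.classes_)))))

# Recover the names from integer codes
decoded = le_demo.inverse_transform([3, 2])
print(decoded)
```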


Train/test split

In [None]:
X = df.drop('Action', axis=1)
y = df.Action
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)


In [None]:
### DECISION TREE CLASSIFIER
from sklearn.tree import DecisionTreeClassifier
dtree_p = DecisionTreeClassifier()
dtree_p.fit(X_train, y_train)
y_pred_dt = dtree_p.predict(X_test)
print(classification_report(y_test,y_pred_dt))

skplot.metrics.plot_confusion_matrix(y_test,y_pred_dt)

Tree before pruning

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # standard library; avoids the extra six dependency
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(dtree_p, out_file=dot_data)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(
    '/Users/Felix Delali Adigbli/OneDrive - Northeastern University/Spring 2023/Info6105/assignment2/tree.png')
Image(graph.create_png())


Fitting after pruning

In [None]:
dtree_p = DecisionTreeClassifier(criterion = "gini", splitter = 'random', max_leaf_nodes = 10, min_samples_leaf = 5, max_depth= 5)
dtree_p.fit(X_train,y_train)

In [None]:
dtree_p.fit(X_train, y_train)
y_pred_dt = dtree_p.predict(X_test)
print(classification_report(y_test,y_pred_dt))

skplot.metrics.plot_confusion_matrix(y_test,y_pred_dt)
print("Accuracy on training set after parameter tuning: {:.2f}".format(
    dtree_p.score(X_train, y_train)))
print("Accuracy on test set after parameter tuning: {:.2f}".format(
    dtree_p.score(X_test, y_test)))
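
The settings above limit the tree while it is being grown (depth, leaf count, samples per leaf), which is often called pre-pruning. scikit-learn also offers post-pruning through cost-complexity pruning (`ccp_alpha`). The following sketch shows that alternative on synthetic stand-in data; picking the middle value of the pruning path is an arbitrary choice for illustration, and in practice the value would be selected by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; not the firewall dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Grow an unrestricted tree and compute its pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full_tree.cost_complexity_pruning_path(X_tr, y_tr)

# Arbitrary pick: the middle alpha on the path (clipped at zero)
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned_tree = DecisionTreeClassifier(
    random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)

print("leaves before:", full_tree.get_n_leaves(),
      "after:", pruned_tree.get_n_leaves())
print("test accuracy before: {:.2f} after: {:.2f}".format(
    full_tree.score(X_te, y_te), pruned_tree.score(X_te, y_te)))
```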


The tree after model fitting


In [None]:
dot_data = StringIO()
export_graphviz(dtree_p, out_file=dot_data)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png(
    '/Users/Felix Delali Adigbli/OneDrive - Northeastern University/Spring 2023/Info6105/assignment2/treeafter1.png')
Image(graph.create_png())

In [None]:
X = X1
y = y1

X_train = X_train1
X_test = X_test1
y_train = y_train1
y_test = y_test1


4.2.3 Boosting

I will use gradient boosted regression trees (gradient boosting machines) to boost decision trees.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbrt = GradientBoostingClassifier(random_state=0)
gbrt.fit(X_train, y_train)


# predictions from the boosted model (kept separate from the decision tree's)
y_pred_gbrt = gbrt.predict(X_test)
print(classification_report(y_test, y_pred_gbrt))

skplot.metrics.plot_confusion_matrix(y_test, y_pred_gbrt)
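
Gradient boosting adds trees one at a time, so the number of trees acts as the model-complexity knob. A rough sketch of how test accuracy can be tracked after each added tree with `staged_predict`, on synthetic stand-in data; the choice of 50 trees is an assumption made only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption; not the firewall dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

gbrt_demo = GradientBoostingClassifier(n_estimators=50, random_state=0)
gbrt_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees
staged_acc = [accuracy_score(y_te, y_stage)
              for y_stage in gbrt_demo.staged_predict(X_te)]
best_n = staged_acc.index(max(staged_acc)) + 1
print("best number of trees on this split:", best_n)
```

Choosing the tree count on the test split, as done here for brevity, would leak information in a real comparison; a validation split or cross-validation is the safer route.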


4.2.4 SVM

In [None]:
from sklearn.svm import SVC # "Support vector classifier"

SVM_model = SVC(kernel='linear', C=1.0)
SVM_model.fit(X_train, y_train)
y_pred_svm = SVM_model.predict(X_test)
print(classification_report(y_test, y_pred_svm))

skplot.metrics.plot_confusion_matrix(y_test, y_pred_svm)


In [None]:
SVM_model = SVC(kernel="rbf", gamma=0.7, C=1.0)
SVM_model.fit(X_train, y_train)
y_pred_svm = SVM_model.predict(X_test)
print(classification_report(y_test, y_pred_svm))

skplot.metrics.plot_confusion_matrix(y_test, y_pred_svm)
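
Support vector machines are sensitive to feature scale, and training time is one of the comparison criteria in this report. The sketch below combines both points: it standardizes the features in a pipeline and times the fit with `time.perf_counter`. It runs on synthetic stand-in data, and `gamma='scale'` is my substitution for illustration, not the notebook's setting.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption; not the firewall dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

# Scale first, then fit an RBF-kernel SVM
svm_scaled = make_pipeline(
    StandardScaler(), SVC(kernel='rbf', gamma='scale', C=1.0))

start = time.perf_counter()
svm_scaled.fit(X_tr, y_tr)
fit_seconds = time.perf_counter() - start

test_acc = svm_scaled.score(X_te, y_te)
print("fit time: {:.3f}s, test accuracy: {:.2f}".format(fit_seconds, test_acc))
```

The same timing pattern can be wrapped around any of the other models to compare execution time on equal terms.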


4.2.5 Neural Network

I will use multilayer perceptrons (MLPs), also known as (vanilla) feed-forward neural networks.
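
To make "feed-forward" concrete: an MLP with one hidden layer multiplies the inputs by a weight matrix, applies a nonlinearity, and then maps the hidden activations to class probabilities. Below is a minimal NumPy sketch of that forward pass with random, untrained weights. The layer sizes are illustrative assumptions, and the ReLU and softmax choices mirror what `MLPClassifier` uses by default for multiclass problems.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 11, 100, 4  # illustrative sizes

# Random, untrained weights; a real model learns these during fit
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)


def forward(X):
    """One hidden layer with ReLU, followed by a softmax output."""
    hidden = np.maximum(0, X @ W1 + b1)
    logits = hidden @ W2 + b2
    # Subtract the row-wise max for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


X_batch = rng.normal(size=(5, n_features))
probs = forward(X_batch)
print(probs.shape)
```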


In [None]:
from sklearn.neural_network import MLPClassifier
mlp_model = MLPClassifier(random_state=42)
mlp_model.fit(X_train, y_train)
y_pred_mlp = mlp_model.predict(X_test)
print(classification_report(y_test, y_pred_mlp))

skplot.metrics.plot_confusion_matrix(y_test, y_pred_mlp)

print("Accuracy on training set: {:.2f}".format(
    mlp_model.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp_model.score(X_test, y_test)))


In [None]:
mlp_model = MLPClassifier(max_iter=1000, alpha=1, random_state=0)
mlp_model.fit(X_train, y_train)
y_pred_mlp = mlp_model.predict(X_test)
print(classification_report(y_test, y_pred_mlp))

skplot.metrics.plot_confusion_matrix(y_test, y_pred_mlp)

print("Accuracy on training set: {:.2f}".format(
    mlp_model.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp_model.score(X_test, y_test)))


In [None]:

mlp_model = MLPClassifier(solver='lbfgs', random_state=0, max_iter=1000,
                          hidden_layer_sizes=[100, 100])
mlp_model.fit(X_train, y_train)
y_pred_mlp = mlp_model.predict(X_test)
print(classification_report(y_test, y_pred_mlp))

skplot.metrics.plot_confusion_matrix(y_test, y_pred_mlp)
print("Accuracy on training set: {:.2f}".format(
    mlp_model.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(mlp_model.score(X_test, y_test)))
