# Author: Babatunde John Olanipekun.  
## About the dataset and approach.  
- This is a classification task.
- This is an internet traffic dataset for a website.  
- It contains the attributes of device entities that were screened ('allow', 'deny', 'drop') for access.

### The approach will be to utilize numpy, base python and sklearn to preprocess and fit the dataset.


In [64]:
import os

In [65]:
#Know where we are
os.getcwd()

'c:\\Users\\olani\\OneDrive\\Documents\\Data Science\\SMU-Data Science\\Machine Learning 2'

In [66]:
import pandas as pd
import numpy as np
import copy


In [67]:
#We read in the .csv dataset
import csv
rows = []
csv.field_size_limit(100)
with open("log2.csv", 'r') as file:
    csvreader = csv.reader(file)
    header = next(csvreader)
    for row in csvreader:
        rows.append(row)
print(header)
print(rows[:3])

['Source Port', 'Destination Port', 'NAT Source Port', 'NAT Destination Port', 'Action', 'Bytes', 'Bytes Sent', 'Bytes Received', 'Packets', 'Elapsed Time (sec)', 'pkts_sent', 'pkts_received']
[['57222', '53', '54587', '53', 'allow', '177', '94', '83', '2', '30', '1', '1'], ['56258', '3389', '56258', '3389', 'allow', '4768', '1600', '3168', '19', '17', '10', '9'], ['6881', '50321', '43265', '50321', 'allow', '238', '118', '120', '2', '1199', '1', '1']]


In [68]:
#work in numpy now. Change the list of lists to numpy arrays and stack them
#For better information we take a look at the columns and the values.
ip_df = np.vstack((header, rows))
ip_df

array([['Source Port', 'Destination Port', 'NAT Source Port', ...,
        'Elapsed Time (sec)', 'pkts_sent', 'pkts_received'],
       ['57222', '53', '54587', ..., '30', '1', '1'],
       ['56258', '3389', '56258', ..., '17', '10', '9'],
       ...,
       ['54871', '445', '0', ..., '0', '1', '0'],
       ['54870', '445', '0', ..., '0', '1', '0'],
       ['54867', '445', '0', ..., '0', '1', '0']], dtype='<U20')

In [69]:
import copy
# good practice to have a checkpoint. 
ip_df_ = copy.deepcopy(ip_df)

In [70]:

values, counts = np.unique(ip_df_[1:,4], return_counts=True)
cat_count=dict(zip(values, counts))
cat_count, sum(list(cat_count.values()))

({'allow': 37640, 'deny': 14987, 'drop': 12851, 'reset-both': 54}, 65532)

In [71]:
# We get the actual datasets by excluding the column labels.
ip_df_values = ip_df_[1:]

### Minimally represented class
- There are only 54 instances of the class 'reset-both', so we drop it, as it is unlikely to add much to the classifiers' decisions. We remove the rows labelled 'reset-both' from the NumPy array.  
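A minimal sketch of the removal step, assuming the label sits in column 4 as it does in this dataset. The toy array `toy` is made up for illustration. Restricting the comparison to the label column avoids accidentally matching the string in some other column.

```python
import numpy as np

# Toy stand-in for the real array: column 4 holds the 'Action' label.
toy = np.array([
    ['57222', '53', '54587', '53', 'allow'],
    ['1024', '445', '0', '0', 'reset-both'],
    ['54871', '445', '0', '0', 'drop'],
])

# Boolean mask that is True for the rows we want to keep.
keep = toy[:, 4] != 'reset-both'
filtered = toy[keep]

print(filtered.shape)  # (2, 5)
```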


In [72]:
# Keep every row whose 'Action' label (column 4) is not 'reset-both'.
ip_df_values_new = ip_df_values[ip_df_values[:, 4] != 'reset-both']
print(ip_df_values_new)

[['57222' '53' '54587' ... '30' '1' '1']
 ['56258' '3389' '56258' ... '17' '10' '9']
 ['6881' '50321' '43265' ... '1199' '1' '1']
 ...
 ['54871' '445' '0' ... '0' '1' '0']
 ['54870' '445' '0' ... '0' '1' '0']
 ['54867' '445' '0' ... '0' '1' '0']]


# Sanity check.
- Remember that we have 54 rows for 'reset-both' that we want to remove because the class is so sparse in the dataset.  
- So I check the difference in the number of rows before and after deletion to confirm the removal.

In [73]:
ip_df_values_new.shape, ip_df_values.shape, ip_df_values.shape[0]-ip_df_values_new.shape[0]

((65478, 12), (65532, 12), 54)

In [74]:
columns = ip_df_[0]
columns

array(['Source Port', 'Destination Port', 'NAT Source Port',
       'NAT Destination Port', 'Action', 'Bytes', 'Bytes Sent',
       'Bytes Received', 'Packets', 'Elapsed Time (sec)', 'pkts_sent',
       'pkts_received'], dtype='<U20')

## Train-test split.  
- Here we split the dataset.   
- We will stratify our dataset based on the proportion of the classes.  
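As a small illustration of what stratification does, the sketch below splits a made-up label vector and prints the class shares in each part. The arrays `labels` and `features` are hypothetical; only `train_test_split` and its `stratify` argument come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60% 'allow', 20% 'deny', 20% 'drop'.
labels = np.array(['allow'] * 60 + ['deny'] * 20 + ['drop'] * 20)
features = np.arange(len(labels)).reshape(-1, 1)

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=42, stratify=labels
)

# Class shares in each split should closely match those of the full data.
for name, y in [('full', labels), ('train', y_tr), ('test', y_te)]:
    classes, counts = np.unique(y, return_counts=True)
    print(name, dict(zip(classes.tolist(), np.round(counts / len(y), 2).tolist())))
```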

In [75]:
## Train test split.

from sklearn.model_selection import train_test_split
train, test, y_train, y_test = train_test_split(ip_df_values_new, ip_df_values_new[:,4],
 test_size=0.3,
  random_state=42, stratify = ip_df_values_new[:,4])

In [76]:
print(train.shape, test.shape, y_train.shape, y_test.shape)

(45834, 12) (19644, 12) (45834,) (19644,)


In [77]:
## See a snapshot of the dataset.  
print(train[:2])

[['58399' '56205' '0' '0' 'deny' '66' '66' '0' '1' '0' '1' '0']
 ['39735' '53' '51086' '53' 'allow' '223' '78' '145' '2' '30' '1' '1']]


## Assess imbalance in our dataset.  
Note that, because the split is stratified, the class proportions are the same in all subsets of the dataset.  

In [78]:
values, counts = np.unique(y_train, return_counts=True)
print(values, counts)

['allow' 'deny' 'drop'] [26348 10491  8995]


## There is significant imbalance.  
The 'allow' class has the highest proportion, while the 'deny' and 'drop' classes have similar but much lower proportions, as we can see below.
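One common way to compensate for this kind of imbalance is to weight classes inversely to their frequency, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The sketch below applies that formula to the training counts printed above. The variable names are illustrative only.

```python
import numpy as np

# Training-set class counts reported above for 'allow', 'deny', 'drop'.
train_counts = np.array([26348, 10491, 8995])
n_samples = train_counts.sum()
n_classes = len(train_counts)

proportions = train_counts / n_samples
# "Balanced" heuristic: n_samples / (n_classes * count_per_class).
balanced_weights = n_samples / (n_classes * train_counts)

print(np.round(proportions, 3))
print(np.round(balanced_weights, 3))
```

With these weights, each class contributes the same total weight to the loss, so the rarer 'deny' and 'drop' rows count for more than an 'allow' row.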

In [79]:
count_series = np.array(counts)/len(y_train)
count_series

array([0.57485709, 0.22889122, 0.19625169])

In [80]:
#remove target column from train and test sets.
train = np.delete(arr=train, obj=4, axis=1)
test = np.delete(arr=test, obj=4, axis=1)

In [81]:
print(train.shape, test.shape, y_train.shape, y_test.shape)

(45834, 11) (19644, 11) (45834,) (19644,)


In [82]:
print(train[:2])

[['58399' '56205' '0' '0' '66' '66' '0' '1' '0' '1' '0']
 ['39735' '53' '51086' '53' '223' '78' '145' '2' '30' '1' '1']]


# Feature engineering.  
- One-hot encoding was tried first, BUT:  
    - During predict there was an error, because the number of categorical levels in the test set was not equal to that in the training set.
- So I reverted to encoders that do not create a sparse matrix: CatBoost encoding for the port columns and label encoding for the target.
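The width mismatch described above typically appears when an encoder is fitted separately on each split. The sketch below reproduces it on made-up port values and shows one possible remedy, assuming the encoder is fitted once on the training data with `handle_unknown='ignore'`. The arrays `train_ports` and `test_ports` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical port column: the test set contains a port ('8080') never seen in training.
train_ports = np.array([['53'], ['443'], ['445']])
test_ports = np.array([['53'], ['8080']])

# Fitting separately yields different widths, which breaks predict().
width_train = OneHotEncoder().fit_transform(train_ports).shape[1]
width_test = OneHotEncoder().fit_transform(test_ports).shape[1]
print(width_train, width_test)  # 3 2

# Fitting once on the training data keeps the width fixed;
# unseen levels are encoded as all-zero rows.
encoder = OneHotEncoder(handle_unknown='ignore').fit(train_ports)
encoded_test = encoder.transform(test_ports).toarray()
print(encoded_test.shape)  # (2, 3)
```

With tens of thousands of distinct port values, the one-hot matrix would still be very wide, which is a separate argument for a target-based encoding.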

In [83]:
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

## The numerical pipeline involves scaling all the numeric variables.  
- The categorical attributes (i.e. those whose names end in 'Port') will be encoded using CatBoost encoding.  
    - CatBoost encoding encodes in place and does not create separate columns for the levels of a categorical attribute.
- The categorical target will be encoded using label encoding.  
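To make the idea of an in-place target-based encoding concrete, here is a simplified sketch of an ordered target statistic, the principle behind CatBoost-style encoding. It is not the `category_encoders` implementation; the function name `ordered_target_encode` and the `prior_weight` parameter are assumptions made for illustration.

```python
import numpy as np

def ordered_target_encode(categories, targets, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    prior = float(np.mean(targets))  # global target mean used as the prior
    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i, (cat, y) in enumerate(zip(categories, targets)):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Blend the running category mean with the prior.
        encoded[i] = (s + prior * prior_weight) / (n + prior_weight)
        # Only now add the current row, so it never sees its own target.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded

ports = ['53', '445', '53', '53', '445']
target = [0, 2, 0, 1, 2]
print(ordered_target_encode(ports, target))
```

Each input column yields exactly one output column, so the encoded matrix has the same width for the training and test sets regardless of how many distinct ports each contains.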


In [84]:
#Standard default scaler for numeric variables.
numeric_pipeline = StandardScaler() 
# One-hot encoder for low-cardinality variables
oh_pipeline = OneHotEncoder(handle_unknown='ignore')


In [85]:
print('column names are: \n', columns)

column names are: 
 ['Source Port' 'Destination Port' 'NAT Source Port' 'NAT Destination Port'
 'Action' 'Bytes' 'Bytes Sent' 'Bytes Received' 'Packets'
 'Elapsed Time (sec)' 'pkts_sent' 'pkts_received']


In [86]:
# ColumnTransformer expects column selectors, not data:
# columns 0-3 (the '* Port' columns) are categorical, columns 4-10 are numeric.
fullpipeline_simple = ColumnTransformer(transformers=
                                       [('numeric_pipeline', numeric_pipeline, slice(4, 11)),
                                        ('oh_pipeline', oh_pipeline, slice(0, 4))
                                         ],
                                       remainder='drop')

In [87]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [88]:
num_pipeline = make_column_transformer(
    (numeric_pipeline, slice(4, 11))  # numeric columns, selected by position
    )




In [89]:
cat_pipeline = make_column_transformer(
    (oh_pipeline, slice(0, 4))  # the four '* Port' columns, selected by position
    )

**We fit the LabelEncoder() on the training set's target variable and use it to transform both the training set and the test set.**    
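For intuition, the fit and transform steps of a label encoder can be sketched in plain NumPy. This sketch assumes every test label also occurs in the training labels; `y_tr` and `y_te` are made-up arrays.

```python
import numpy as np

y_tr = np.array(['deny', 'allow', 'drop', 'allow'])
y_te = np.array(['drop', 'allow'])

# "Fit": learn the sorted class list from the training labels only.
classes = np.unique(y_tr)

# "Transform": map each label to its index in the learned class list.
y_tr_codes = np.searchsorted(classes, y_tr)
y_te_codes = np.searchsorted(classes, y_te)

print(classes)      # ['allow' 'deny' 'drop']
print(y_tr_codes)   # [1 0 2 0]
print(y_te_codes)   # [2 0]
```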

In [90]:
# Encode for string labels
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder().fit(y_train)
y_train_ = label_encoder.transform(y_train)
y_test_ =label_encoder.transform(y_test)

In [91]:
label_encoder

LabelEncoder()

In [92]:
#import category_encoders as ce


## This displays the respective numerical codes assigned to each categorical level

In [93]:
label_encoder.classes_, label_encoder.inverse_transform(y_train_)

(array(['allow', 'deny', 'drop'], dtype='<U20'),
 array(['deny', 'allow', 'allow', ..., 'allow', 'allow', 'deny'],
       dtype='<U20'))

In [94]:
cat_labels=dict(zip(y_train_, label_encoder.inverse_transform(y_train_)))
cat_labels

{1: 'deny', 0: 'allow', 2: 'drop'}

In [95]:
import category_encoders as ce
ohe = OneHotEncoder()
cat_enc = ce.CatBoostEncoder()
#array_hot_encoded = ohe.fit_transform(train[:,0:4])
array_scaler = StandardScaler().fit_transform(train[:,4:])
array_hot_encoded = cat_enc.fit_transform(train[:,0:4], y_train_
)
print(array_hot_encoded.shape, array_scaler.shape)

(45834, 4) (45834, 7)


In [96]:
type(array_hot_encoded), type(array_scaler)

(pandas.core.frame.DataFrame, numpy.ndarray)

In [97]:
# Both blocks are dense, so NumPy's hstack is sufficient (scipy.sparse.hstack is not needed).
x_train=np.hstack((array_hot_encoded,array_scaler))
x_train.shape

(45834, 11)

## Sanity check

In [98]:
print('final df col shape: {} must equal to sum of before {}'.format(x_train.shape[1], array_hot_encoded.shape[1]+array_scaler.shape[1])) 

final df col shape: 11 must equal to sum of before 11


## Column transformation for test set

In [99]:
# Reuse the CatBoost encoder that was fitted on the training set.
array_hot_encoded_test = cat_enc.transform(test[:,0:4])
# Scale with statistics learned from the training set rather than refitting on the test set.
array_scaler_test = StandardScaler().fit(train[:,4:]).transform(test[:,4:])
print(array_hot_encoded_test.shape, array_scaler_test.shape)

(19644, 4) (19644, 7)


In [100]:
#from scipy.sparse import hstack
x_test=np.hstack((array_hot_encoded_test,array_scaler_test))
x_test.shape

(19644, 11)

In [101]:
print('final df col shape: {} must equal to sum of before {}'.format(x_test.shape[1], array_hot_encoded_test.shape[1]+array_scaler_test.shape[1])) 

final df col shape: 11 must equal to sum of before 11
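When a held-out set is scaled, the usual convention is to reuse the mean and standard deviation learned from the training rows, so that no information from the test set enters the preprocessing. A NumPy-only sketch of that idea follows; the small arrays are made up for illustration.

```python
import numpy as np

# Hypothetical numeric features stored as strings, as in this notebook.
train_num = np.array([['177', '2'], ['4768', '19'], ['238', '2']]).astype(float)
test_num = np.array([['66', '1'], ['223', '2']]).astype(float)

# Learn mean and standard deviation from the training rows only.
mean = train_num.mean(axis=0)
std = train_num.std(axis=0)

# Apply the same statistics to both splits.
train_scaled = (train_num - mean) / std
test_scaled = (test_num - mean) / std

print(train_scaled.shape, test_scaled.shape)
```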


# Classifier.  
We have been advised that a linear Support Vector Machine (SVM) can be trained using the Stochastic Gradient Descent (SGD) classifier if we set the 'loss' parameter to 'hinge'.  
- And that SGD is much faster than SVM.  
- Therefore we will adopt SGD as our classifier but also compare its speed with that of the SVM classifier.  
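To illustrate why hinge loss plus SGD amounts to a linear SVM, here is a minimal binary sketch in NumPy. It is not scikit-learn's implementation: the function `sgd_hinge`, the fixed learning rate `lr` and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sgd_hinge(X, y, alpha=1e-4, lr=0.1, epochs=50, seed=0):
    """Binary linear classifier fitted by SGD on the L2-regularised hinge loss (labels in {-1, +1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            margin = y[i] * (X[i] @ w + b)
            # The L2 penalty shrinks w on every step.
            w *= (1.0 - lr * alpha)
            # The hinge term contributes only when the margin is violated.
            if margin < 1.0:
                w += lr * y[i] * X[i]
                b += lr * y[i]
    return w, b

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = sgd_hinge(X, y)
pred = np.sign(X @ w + b)
print(pred)
```

Each update touches a single row, which is why this style of training scales to large datasets more easily than solvers that work on the whole problem at once.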

In [102]:
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

In [117]:
%%time
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import make_scorer
import sklearn.metrics


Wall time: 1e+03 µs


In [104]:
#clf = a_clf.set_params(**params)

## Grid search.  
- We would like to perform a parameter search for the best combination.  
- SVM-class algorithms can be quite slow. 
- We include the 'hinge' loss (linear SVM) among the losses tried with SGD and see what the metrics suggest.  
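The cost of an exhaustive grid search is the product of the number of values per parameter, multiplied by the number of cross-validation folds. The sketch below counts this for a grid of the same shape as the one used in this notebook, using only the standard library. The names `grid` and `cv_folds` are illustrative.

```python
from itertools import product

# Same shape as the parameter grid used in this notebook.
grid = {
    'loss': ['hinge', 'log'],
    'alpha': [0.00001, 0.0001, 0.0009, 0.001],
    'max_iter': [500, 1500, 2000],
    'early_stopping': [True, False],
    'eta0': [0.0, 0.002],
    'penalty': ['l2', 'l1'],
}
cv_folds = 5

# Every combination of one value per parameter is a candidate.
candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 192 960
```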

In [105]:
#####Start with a grid-search parameter space##############
params = { 'loss': ['hinge', 'log'],
    'alpha': [0.00001, 0.0001, 0.0009, 0.001],
    'max_iter': [500, 1500, 2000],
    'early_stopping': [True, False],    
    'eta0':[0.0, 0.002],
    'penalty':['l2', 'l1']   
                  
                 }

##Scoring parameters for multiclass

scoring = {'accuracy': make_scorer(sklearn.metrics.accuracy_score),
           'precision': make_scorer(sklearn.metrics.precision_score, average = 'macro'),
           'recall': make_scorer(sklearn.metrics.recall_score, average = 'macro'),
           'f1_macro': make_scorer(sklearn.metrics.f1_score, average = 'macro'),
           'f1_weighted': make_scorer(sklearn.metrics.f1_score, average = 'weighted')
           }


# Base estimator for the search (same settings as the `clf` created in a later cell),
# defined here so that this cell runs top to bottom.
clf = SGDClassifier(class_weight='balanced', random_state=152)
clf1 = GridSearchCV(estimator=clf,
                    param_grid=params,
                    scoring='f1_weighted',
                    #n_iter=25,
                    verbose=3,
                    cv=5
                    )



In [106]:
#Fit the gridsearch parameters.  
search = clf1.fit(x_train, y_train_)

Fitting 5 folds for each of 192 candidates, totalling 960 fits
[CV 1/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l2; total time=   0.2s
[CV 2/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l2; total time=   0.2s
[CV 3/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l2; total time=   0.2s
[CV 4/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l2; total time=   0.1s
[CV 5/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l2; total time=   0.2s
[CV 1/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l1; total time=   0.2s
[CV 2/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l1; total time=   0.2s
[CV 3/5] END alpha=1e-05, early_stopping=True, eta0=0.0, loss=hinge, max_iter=500, penalty=l1; total time=   0.2s
[CV 4/5] END alpha=1e-05,

In [107]:
#search

## Best parameters.  
The search selected the 'log' loss rather than 'hinge', so the best estimator for this dataset is a logistic regression trained with SGD, not a linear Support Vector Machine.

In [108]:
# We retrieve the best resulting parameters.  
params1=search.best_params_
params1

{'alpha': 1e-05,
 'early_stopping': False,
 'eta0': 0.002,
 'loss': 'log',
 'max_iter': 2000,
 'penalty': 'l1'}

In [109]:
import timeit
#%%time
# Note: this times only the construction of the estimator, not the fit.
%timeit sgd = SGDClassifier(validation_fraction=0.15, class_weight='balanced', random_state=123)

15.6 µs ± 2.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [125]:
clf = SGDClassifier(class_weight='balanced', random_state=152)

In [127]:
from sklearn.metrics import classification_report
clf_ = clf.set_params(**params1)
clf_.fit(x_train, y_train_)

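#Predict using the previously unseen test set.
clf_pred = clf_.predict(x_test)
# Note: classification_report expects (y_true, y_pred). With the order used here,
# precision and recall are swapped and 'support' counts predicted labels.
print(classification_report(clf_pred, y_test_))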

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11262
           1       1.00      0.99      0.99      4514
           2       1.00      1.00      1.00      3868

    accuracy                           1.00     19644
   macro avg       1.00      1.00      1.00     19644
weighted avg       1.00      1.00      1.00     19644



# Compare the SGD method with the LinearSVC method 

In [111]:
sgd_extra = SGDClassifier(loss='hinge', class_weight='balanced')
%timeit sgd_extra.fit(x_train, y_train_)

140 ms ± 9.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [112]:
from sklearn.svm import SVC, LinearSVC
clf_svm = LinearSVC(loss='hinge')
%timeit clf_svm.fit(x_train, y_train_)



1.4 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)




# Findings.  
- Fitting LinearSVC is much slower than fitting the SGD classifier with hinge loss (about 1.4 seconds versus 140 milliseconds per fit, roughly a factor of ten).

## Evaluation metrics.  
Given the near-100% scores across all metrics and classes in the classification report, I do not trust the result of this model. There could be some data leakage in the original dataset, but in the interest of time I will hand this work in as is,  
especially given that the key requirement here is to implement a Support Vector Machine and out-of-core methods.  
- If we accept the result, then this model is near-perfect across all metrics.  
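One inexpensive way to start probing the leakage suspicion is to count how many test rows also appear verbatim in the training set. This is only a sketch on made-up rows and covers just one possible source of leakage; in the notebook the inputs would be the `train` and `test` feature arrays.

```python
import numpy as np

# Hypothetical feature rows standing in for the real train and test arrays.
train_rows = np.array([['54871', '445', '0'], ['57222', '53', '54587'], ['54870', '445', '0']])
test_rows = np.array([['57222', '53', '54587'], ['6881', '50321', '43265']])

# Store each training row as a tuple so membership tests are exact.
seen = {tuple(row.tolist()) for row in train_rows}
overlap = sum(tuple(row.tolist()) in seen for row in test_rows)

print(f"{overlap} of {len(test_rows)} test rows also occur in the training set")
```

A high overlap would suggest that the test score partly measures memorisation rather than generalisation.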
