# Intrusion Detection with Logistic Regression
In this laboratory, we will use logistic regression to classify network traffic flows as benign or malicious. The logistic regression model returns a value between 0 and 1, which is interpreted as the probability that the input flow is malicious. We apply a threshold of 0.5 to this value to decide whether the flow is malicious.
We will train the model on a dataset of benign traffic and DDoS attack traffic.
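
The decision rule described above can be sketched in a few lines of plain Python. This is only an illustration: the weights, bias and the 3-feature flow are hypothetical values, not parameters learned from the dataset.

```python
import math

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_malicious_probability(features, weights, bias):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def classify(probability, threshold=0.5):
    """Return 1 (malicious) if the probability exceeds the threshold, else 0 (benign)."""
    return 1 if probability > threshold else 0

# Hypothetical weights, bias and flow features, for illustration only
weights = [0.8, -1.2, 0.5]
bias = -0.1
flow = [1.0, 0.2, 0.4]

p = predict_malicious_probability(flow, weights, bias)
label = classify(p)
print(f"probability={p:.3f}, label={label}")
```

Raising the threshold above 0.5 would make the classifier flag fewer flows as malicious, trading missed attacks for fewer false alarms.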

The dataset contains benign traffic and various DDoS attacks from the CIC-DDoS2019 dataset (https://www.unb.ca/cic/datasets/ddos-2019.html).
The network traffic has been pre-processed so that packets are grouped into bi-directional traffic flows using the 5-tuple (source IP, destination IP, source port, destination port, protocol). Each flow is represented by 21 packet-header features computed from at most 10 packets:

| Features           | Logistic Regression model           |
|---------------------|--------------------|
| timestamp (mean IAT)  <br> packet_length (mean) <br> IP_flags_df (sum) <br> IP_flags_mf (sum) <br> IP_flags_rb (sum) <br> IP_frag_off (sum) <br> protocols (mean) <br> TCP_length (mean) <br> TCP_flags_ack (sum) <br> TCP_flags_cwr (sum) <br> TCP_flags_ece (sum) <br> TCP_flags_fin (sum) <br> TCP_flags_push (sum) <br> TCP_flags_res (sum) <br> TCP_flags_reset (sum) <br> TCP_flags_syn (sum) <br> TCP_flags_urg (sum) <br> TCP_window_size (mean) <br> UDP_length (mean) <br> ICMP_type (mean) <br> Packets (counter) <br>| <img src="./logistic_regression_CIC2019.png" width="100%">  |
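
To make the flow grouping more concrete, here is a minimal sketch of how packets could be assigned to bi-directional flows with a direction-independent 5-tuple key. It is not the pre-processing tool used to build the dataset: the packet dictionaries, field names and helper functions are hypothetical, and the real tool may apply further rules (for example time windows).

```python
from collections import defaultdict

MAX_PACKETS_PER_FLOW = 10  # the features are computed from at most 10 packets

def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Direction-independent key: the two endpoints are sorted so that
    packets A->B and B->A map to the same flow."""
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (low, high, protocol)

def group_into_flows(packets):
    """Group packets by flow key, keeping at most MAX_PACKETS_PER_FLOW per flow."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["dst_ip"],
                       pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        if len(flows[key]) < MAX_PACKETS_PER_FLOW:
            flows[key].append(pkt)
    return dict(flows)

# Hypothetical packets: a request, its reply, and an unrelated UDP packet
packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 5000, "dst_port": 80, "protocol": 6, "length": 60},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 80, "dst_port": 5000, "protocol": 6, "length": 1500},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 53, "dst_port": 4000, "protocol": 17, "length": 120},
]

flows = group_into_flows(packets)
for key, pkts in flows.items():
    mean_length = sum(p["length"] for p in pkts) / len(pkts)
    print(key, "packets:", len(pkts), "mean packet_length:", mean_length)
```

The request and its reply end up in the same flow, which is what allows per-flow statistics such as the mean packet length or the packet counter to cover both directions.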

In [13]:
# Author: Roberto Doriguzzi-Corin
# Project: Course on Network Intrusion and Anomaly Detection with Machine Learning
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Import necessary libraries

import numpy as np
import glob
import h5py
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import SGD, Adam
from util_functions import *
DATASET_FOLDER = "./DOS2019"

In [14]:
# Load training, validation and test sets
feature_names = get_feature_names()
target_names = ['benign', 'dns',  'syn', 'udplag', 'webddos'] 
target_names_full = ['benign', 'dns', 'ldap', 'mssql', 'netbios', 'ntp', 'portmap', 'snmp', 'ssdp', 'syn', 'tftp', 'udp', 'udplag', 'webddos'] 
X_train, y_train = load_dataset(DATASET_FOLDER + "/*" + '-train.hdf5')
y_train = np.array([1 if y[0] == 0 else 0 for y in y_train]) # from one-hot-encoding to binary

X_val, y_val = load_dataset(DATASET_FOLDER + "/*" + '-val.hdf5')
y_val = np.array([1 if y[0] == 0 else 0 for y in y_val]) # from one-hot-encoding to binary

X_test, y_test = load_dataset(DATASET_FOLDER + "/*" + '-test.hdf5')
y_test = np.array([1 if y[0] == 0 else 0 for y in y_test]) # from one-hot-encoding to binary

In [15]:
# Logistic Regression model: a single dense unit with a sigmoid activation
def LogRegression(model_name, input_shape):
    activation_function = 'sigmoid'
    model = Sequential(name=model_name)
    model.add(Dense(1, input_shape=input_shape, activation=activation_function, name='fc1'))

    model.summary()  # summary() prints the table itself and returns None
    return model

# Optimisation algorithm
Here we can compare the performance of different optimizers, for example by switching between the SGD and Adam lines in the function that compiles the model.
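
To build some intuition before running the Keras model, the two update rules can be compared on a tiny problem in plain Python. The sketch below is an assumption-laden toy, not the laboratory code: the 1-feature dataset is synthetic, the gradient is computed on the full batch rather than on mini-batches, and the function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in math.exp
    return 1.0 / (1.0 + math.exp(-z))

def make_toy_data(n=200, seed=0):
    """Synthetic 1-feature dataset: the label is 1 when x is (noisily) positive."""
    rng = random.Random(seed)
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [1 if x + rng.gauss(0, 0.5) > 0 else 0 for x in xs]
    return xs, ys

def loss_and_grads(w, b, xs, ys):
    """Mean binary cross-entropy and its gradients with respect to w and b."""
    eps = 1e-12
    loss = gw = gb = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        gw += (p - y) * x
        gb += p - y
    n = len(xs)
    return loss / n, gw / n, gb / n

def train_gd(xs, ys, lr, steps):
    """SGD update rule without momentum, applied here to the full batch."""
    w = b = 0.0
    for _ in range(steps):
        _, gw, gb = loss_and_grads(w, b, xs, ys)
        w -= lr * gw
        b -= lr * gb
    return loss_and_grads(w, b, xs, ys)[0]

def train_adam(xs, ys, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update rule with bias-corrected first and second moment estimates."""
    params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        _, gw, gb = loss_and_grads(params[0], params[1], xs, ys)
        for i, g in enumerate((gw, gb)):
            m[i] = beta1 * m[i] + (1 - beta1) * g
            v[i] = beta2 * v[i] + (1 - beta2) * g * g
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return loss_and_grads(params[0], params[1], xs, ys)[0]

xs, ys = make_toy_data()
initial_loss = loss_and_grads(0.0, 0.0, xs, ys)[0]
gd_loss = train_gd(xs, ys, lr=0.01, steps=200)
adam_loss = train_adam(xs, ys, lr=0.01, steps=200)
print(f"initial={initial_loss:.4f}  gd={gd_loss:.4f}  adam={adam_loss:.4f}")
```

Results on this toy problem do not necessarily carry over to the DDoS dataset; the comparison that matters is the validation loss and accuracy reported by `model.fit` for each optimizer.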

In [17]:
def compileModel(model, lr):
    # Uncomment the SGD line (and comment out Adam) to compare the two optimisers
    #optimizer = SGD(learning_rate=lr, momentum=0.0)  # plain stochastic gradient descent
    optimizer = Adam(learning_rate=lr)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])  # here we specify the loss function

# Model training

In [None]:
model = LogRegression('log_reg', X_train.shape[1:4])
compileModel(model,0.001)

# Train the model
model.fit(X_train, y_train, epochs=200, batch_size=10, validation_data=(X_val, y_val))

# Prediction on unseen data

In [None]:
# Flows whose predicted probability exceeds the 0.5 threshold are labelled malicious (True)
y_pred = np.squeeze(model.predict(X_test, batch_size=32) > 0.5)

print("F1 Score: ", f1_score(y_test,y_pred))
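
For reference, the F1 score printed above is the harmonic mean of precision and recall on the malicious class. A minimal sketch of the computation on a small, made-up set of labels (the helper names are hypothetical and the values are not results from the model):

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 = malicious."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels for illustration only
y_true_toy = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred_toy = [1, 0, 1, 0, 1, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = f1_from_counts(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Looking at the false positives and false negatives separately is often more informative than the single F1 value, since in intrusion detection a missed attack and a false alarm usually have different costs.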