---

# AERO 5 - Hands on Machine Learning for cybersecurity (2023/2024)


# 3 – Malicious URLs detection

---

This work is done by (write the members of the group below):

  

In this lab session we discuss how machine learning is used to detect malicious URLs. We first clean the data in our datasets, using pandas and a custom vectorizer (more on this later). We then train a model with logistic regression, followed by a support vector machine. The `scikit-learn` documentation is comprehensive and should be consulted whenever necessary. In particular, see:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

https://scikit-learn.org/stable/modules/svm.html

## 1. Exercise 1

Logistic regression is a binary classification technique. A key difference from linear regression is that the output being modeled is a binary value (0 or 1) rather than a continuous numeric value. In this exercise, we apply a logistic regression model to `phishing_dataset.csv`, a preprocessed dataset of phishing websites.
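
For reference, the model described above can be written compactly. This is the standard textbook formulation; the symbols ($\theta$ for the weights, $n$ for the number of samples, $X$ for the sample matrix with a bias column) are notation assumed here, not taken from the lab text.

$$
h_\theta(x) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}
$$

$$
J(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\, y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr) \Bigr]
$$

$$
\nabla_\theta J(\theta) = \frac{1}{n}\, X^\top \bigl(h_\theta(X) - y\bigr)
$$

A sample is assigned to class 1 when $h_\theta(x) \ge 0.5$ and to class 0 otherwise.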

1. Start by importing the packages that allow you to create matrices, perform mathematical operations, and plot graphs, so that you can easily inspect the dataset and the model built from it. Then load the data.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("phishing_dataset.csv", header=None)
df
# =======================================================

2. Could you propose an implementation of the logistic regression method without using the `scikit-learn` package?

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================

# Author : Maxime Gosselin
# Title  : Logistic Regression
# Date   : November 2023
# Link   : https://github.com/bixente-r/ML_project


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize,fmin_tnc
import seaborn as sn
from scipy.interpolate import BSpline
from scipy.signal import savgol_filter
import statistics as st
from time import sleep
from tqdm import tqdm



def convert_X(X):
    for i in X[0:0]:
        dx = {}
        k = 0
        if isinstance((X[i][0]), str) == True:
            for j in range(X.shape[0]):
                if X[i][j] not in dx.keys():
                    dx[X[i][j]] = k
                    k += 1
            for l in range(X.shape[0]):
                for j in dx.keys():
                    if X[i][l] == j:
                        X[i][l] = dx[j]


def convert_y(y):
    dy = {}
    k = 0

    if isinstance((y[0]), str) == True:
        for i in range(0,y.shape[0]):
            if y[i] not in dy.keys():
                dy[y[i]] = k
                k += 1

        for i in range(0,y.shape[0]):
            for j in dy.keys():
                if y[i] == j:
                    y[i] = dy[j]

def init_frame(path, id_col=0):

    df = pd.read_csv(path, header=None) # import the data
    df.head()
    df = df.dropna().reset_index(drop=True) # remove rows with a Nan value from dataset and reset the index
    
    target_class = df.columns[-1]

    nb_row = df.shape[0]
    print(nb_row)
    
    # print(df)
    #df_0 = df.loc[df[target_class] == -1].sample(n=nb_row,random_state=42)
    #df_1 = df.loc[df[target_class] == 1]
    #df = pd.concat([df_0, df_1])
    
    df = df.sample(frac=1).reset_index(drop=True) # shuffle the rows 

    X = df.iloc[:,id_col:-1] # Separate input from output 
    y = df.iloc[:,-1]  # Check rows and columns from .csv
                        # parameters might change


    corr_matrix = df.corr()

    convert_X(X)
    convert_y(y)
    # print(X)
    # print(y)
    X = np.c_[np.ones((X.shape[0], 1)), X] # add a bias column
    X = np.array([[float(i) for i in e] for e in X]) # convert to float (in case it's not)
    row_nb = X.shape[0]
    traning_nb = round(0.8*row_nb)
    X1 = X[:traning_nb] # training set
    X2 = X[traning_nb:] # validation set


    y = y.to_numpy() # convert to numpy type
    y = y.reshape(len(y),1) # convert to matrix
    y = np.array([[float(i) for i in e] for e in y]) # convert to float (in case it's not)
    y1 = y[:traning_nb] # training output
    y2 = y[traning_nb:] # validation output
    return X1,y1, X2,y2

def sigmoid(x, theta):
    z = np.dot(x, theta)
    return 1/(1+np.exp(-z))


def hypothesis(theta, x):
    return sigmoid(x, theta)


def cost_function(theta, x, y, n):
    h = hypothesis(theta, x)
    return -(1/n)*np.sum(y*np.log(h) + (1-y)*np.log(1-h))


def gradient(theta, X, y, n):
    h = hypothesis(theta, X)
    error = h - y
    return (1/n) * np.dot(X.T, error)

def predict(h):
    
    h1 = []
    for i in h:
        if i>=0.5:
            h1.append(1)
        else:
            h1.append(0)
    return h1

def accuracy(TP, TN, FP, FN):
    return round(100 * (TP + TN) / (TP + TN + FP + FN),4)

def precision_1(TP,FP):
    # Precision for class 1; falls back to 0 when nothing was predicted positive
    try:
        s = round(100 * TP / (TP + FP),4)
    except ZeroDivisionError:
        s = 0
    return s

def precision_0(TN,FN):
    return round(100 * TN / (TN + FN),4)

def recall(TP,FN):
    return round(100 * TP / (TP + FN),4)

def specificity(TN,FP):
    return round(100 * TN / (TN + FP),4)

def f1_score(precision,recall):
    # Inputs are percentages; the result is scaled back to [0, 1]
    try:
        s = round((1/100) * 2 * (precision * recall) / (precision + recall),4)
    except ZeroDivisionError:
        s = 0
    return s

def confusion_matrix(TP,FP,TN,FN):

    col = ["Positive", "Negative"]
    ind = ["Positive", "Negative"]

    matrix=np.array([[TP,FN],[FP,TN]])
    df = pd.DataFrame(matrix,columns=col,index=ind)
    fig = sn.heatmap(df,annot=True,cbar=True,fmt='g',cmap="flare")
    plt.xlabel("Predicted Class")
    plt.ylabel("Actual Class")
    plt.title("Confusion Matrix")
    plt.show()


def train(Xt, yt, Xv, yv, nt, nv, epoch, alpha, graph=True, disp=True):
    """
    Function that trains the logistic regression model

    PARAMETERS : 

        - Xt : Training set            ¤     - Xv : Validation set
        - yt : Training set targets    ¤     - yv : Validation set targets
        - nt : training set size       ¤     - nv : validation set size
        - epoch                        ¤     - alpha : param of gradient descent
        - graph : display the graphs
    
    OUTPUT : 

        theta : vector of the optimal weights

    PROCESS : 

        for each epoch :
            we compute the new weights (theta) from the training set 
            we compute the loss for the training set
            we compute the accuracy for the training set 
            we use the weights from training set to compute loss and accuracy of the validation set
        
        we return the final weights for the prediction
    """

    ###########################################################
    #####                SETTING VARIABLES                #####
    ###########################################################
    training_err = []
    validation_err = []

    training_acc = []
    training_pre_0 = []
    training_pre_1 = []
    training_rec = []
    training_spe = []
    training_f1 = []

    validation_acc = []
    validation_pre_0 = []
    validation_pre_1 = []
    validation_rec = []
    validation_spe = []
    validation_f1 = []

    last_t_loss = None
    theta = np.zeros((Xt.shape[1], 1))
  

    ###########################################################
    #####                 BEGIN TRAINING                  #####
    ###########################################################
    for e in tqdm(range(1, epoch + 1)):

        grad = gradient(theta,Xt,yt, nt)   # gradient computation
        theta = theta - alpha * grad   # standard gradient-descent update (step is not scaled by the loss)
        
        loss_t = cost_function(theta, Xt, yt, nt)  # compute loss with the current weight (training set)
        training_err.append(loss_t)

        out_t = hypothesis(theta, Xt)    # sigmoid application with the current weight (training set)
        yt_pred = predict(out_t)          # prediction h > 0.5 or h < 0.5
        
        TP, TN, FN, FP = 0,0,0,0

        for i in range(nt):
            
            if yt_pred[i] == yt[i] and yt[i] == 1:
                TP += 1
            if yt_pred[i] != yt[i] and yt_pred[i] == 1:
                FP += 1
            if yt_pred[i] == yt[i] and yt[i] == 0:
                TN += 1
            if yt_pred[i] != yt[i] and yt_pred[i] == 0:
                FN += 1
        

        acc_t = accuracy(TP, TN, FP, FN)   
        pre_0_t = precision_0(TN, FN)
        pre_1_t = precision_1(TP, FP)
        rec_t = recall(TP, FN)
        spe_t = specificity(TN, FP)
        f1_t = f1_score(pre_1_t, rec_t)

        training_acc.append(acc_t)
        training_pre_0.append(pre_0_t)
        training_pre_1.append(pre_1_t)
        training_rec.append(rec_t)
        training_spe.append(spe_t)
        training_f1.append(f1_t)
        
        if e == epoch:
            a,b,c,d = TP, FP, TN, FN

        loss_v = cost_function(theta, Xv, yv, nv)  # compute loss with the current weight (validation set)
        validation_err.append(loss_v)
        
        
        out_v = hypothesis(theta,Xv)  # sigmoid application with the current weight (validation set)
        yv_pred = predict(out_v)     # prediction h > 0.5 or h < 0.5
       
        TP, TN, FN, FP = 0,0,0,0

        for i in range(nv): 
            if yv_pred[i] == yv[i] and yv[i] == 1:
                TP += 1
            if yv_pred[i] != yv[i] and yv_pred[i] == 1:
                FP += 1
            if yv_pred[i] == yv[i] and yv[i] == 0:
                TN += 1
            if yv_pred[i] != yv[i] and yv_pred[i] == 0:
                FN += 1
        
        acc_v = accuracy(TP, TN, FP, FN)   
        pre_0_v = precision_0(TN, FN)
        pre_1_v = precision_1(TP, FP)
        rec_v = recall(TP, FN)
        spe_v = specificity(TN, FP)
        f1_v = f1_score(pre_1_v, rec_v)

        validation_acc.append(acc_v)
        validation_pre_0.append(pre_0_v)
        validation_pre_1.append(pre_1_v)
        validation_rec.append(rec_v)
        validation_spe.append(spe_v)
        validation_f1.append(f1_v)        
        

        if (e % (epoch / 10) == 0 or e == epoch) and disp == True:
            print(f'\n¤¤¤¤¤¤¤¤¤¤¤¤ EPOCH {e} ¤¤¤¤¤¤¤¤¤¤¤¤')
            if last_t_loss and last_t_loss < loss_t:
                print('   >>>>> LOSS INCREASING <<<<<   ')
            else:
                print(f' - Training loss : {round(loss_t,4)}')
                print(f' - Validation loss : {round(loss_v,4)}')
                print(f' - Loss difference : {round(100 * (loss_t - loss_v)/loss_t,4)} % ')
                print(f' - Training accuracy : {training_acc[-1]} %')
                print(f' - Validation accuracy : {validation_acc[-1]} %')
                print(f' - Training 1 precision : {training_pre_1[-1]} %')
                print(f' - Validation 1 precision : {validation_pre_1[-1]} %')
                print(f' - Training 0 precision : {training_pre_0[-1]} %')
                print(f' - Validation 0 precision : {validation_pre_0[-1]} %')
                print(f' - Training recall : {training_rec[-1]} %')
                print(f' - Validation recall : {validation_rec[-1]} %')
                print(f' - Training specificity : {training_spe[-1]} %')
                print(f' - Validation specificity : {validation_spe[-1]} %')
                print(f' - Training f1-score : {training_f1[-1]}')
                print(f' - Validation f1-score : {validation_f1[-1]}')
                print(f'¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤')
            
            last_t_loss = loss_t
        
    confusion_matrix(a, b, c, d)


    if disp == True:
        print(f'\n¤¤¤¤¤¤¤¤¤¤¤¤  OPTIMAL THETA ¤¤¤¤¤¤¤¤¤¤¤¤\n')
        for i in range(theta.shape[0]):
            print(f'w{i} = {theta[i][0]}')




    if graph == True:
        ###########################################################
        #####                    PLOT LOSS                    #####
        ###########################################################
        plt.figure()
        plt.title('Loss')
        plt.plot(training_err, 'dodgerblue', label='Training loss')
        plt.plot(validation_err, 'darkorange', label='Validation loss')
        plt.xlabel('epoch')
        plt.ylabel('loss')
        plt.legend()
        plt.show()

        ###########################################################
        #####                  PLOT ACCURACY                  #####
        ###########################################################
        plt.figure()
        plt.title('Accuracy')
        plt.plot(training_acc, 'dodgerblue', label='Training accuracy')
        plt.plot(validation_acc, 'darkorange', label='Validation accuracy')
        plt.xlabel('epoch')
        plt.ylabel('accuracy')
        plt.legend()

        plt.figure()
        plt.title('f1-score')
        plt.plot(training_f1, 'dodgerblue', label='Training f1-score')
        plt.plot(validation_f1, 'darkorange', label='Validation f1-score')
        plt.xlabel('epoch')
        plt.ylabel('f1-score')
        plt.legend()

        plt.figure()
        plt.title('Recall')
        plt.plot(training_rec, 'dodgerblue', label='Training recall')
        plt.plot(validation_rec, 'darkorange', label='Validation recall')
        plt.xlabel('epoch')
        plt.ylabel('recall')
        plt.legend()

        plt.show()


    return theta, training_err, validation_err

# =======================================================

3. Load your dataset, analyse it then put the attributes and the targets in variables "samples" and "targets" respectively.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================
df = pd.read_csv('phishing_dataset.csv')

# Display basic information about the dataset
print(df.info())

# Display the first few rows of the dataset
print(df.head())

# Extract features (attributes) - excluding the last column
samples = df.iloc[:, :-1]

# Extract the target variable - the last column
target_column_name = df.columns[-1]
targets = df[target_column_name]


# =======================================================

4. Split data into training and testing sets.

In [None]:

# Check whether the series contains zeros
if 0 in targets.values:
    print("There are zeros in the series.")
else:
    print("There are no zeros in the series.")

# If the series contains only 1 and -1, report it
if set(targets.unique()) == {1, -1}:
    print("The series contains only 1 and -1.")


# Replace every -1 with 0
targets = targets.replace(-1, 0)


plt.hist(targets);

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================

training_set, training_output, validation_set, validation_output = init_frame("./phishing_dataset.csv") 

training_output[training_output == -1] = 0
validation_output[validation_output == -1] = 0

# =======================================================

5. Now apply logistic regression to train and test the model, then report the mean accuracy on the test data by first computing the model score and then the error matrix (the confusion matrix). Comment on the result obtained.

In [None]:
# EDIT THIS CELL
# ====================== Your code here =================
n_training = training_set.shape[0]
n_validation = validation_set.shape[0]

alpha = 0.001
epoch = 5000
w = train(training_set,training_output,validation_set, validation_output, n_training, n_validation, epoch, alpha,True,True)
# =======================================================

In [None]:
n_training = training_set.shape[0]
n_validation = validation_set.shape[0]

alpha = 0.01
epoch = 2000
w = train(training_set,training_output,validation_set, validation_output, n_training, n_validation, epoch, alpha,True,True)

Overall, logistic regression gives good results, but we could try a different model because the loss remains somewhat high for a binary classification task. However, more training iterations would lead to overfitting: the training and validation losses start to diverge at the end of training (see the loss graph).

Moreover, it is important to keep both false positive and false negative predictions as low as possible: false alerts are a nuisance, while missed alerts can be dangerous.

## 2. Exercise 2

## Logistic regression

I. Data collection:

   The first task is gathering data. Some websites that list malicious links can be found while browsing. The second task is finding clean URLs. This time, we use a dataset that is already available and does not need to be crawled. It contains around 500,000 URLs, of which around 90,000 are malicious and the rest are legitimate (clean).

II. Preparing data:

   Since URLs differ from ordinary text documents, they need some cleansing before we can use them. We therefore tokenize them, splitting on slashes and dots and removing the `com` token.
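
The tokenization described above could be sketched as follows. This is only a minimal illustration: the function name `tokenize_url`, the lowercasing step, and the removal of duplicate tokens are assumptions made here, not part of the lab statement.

```python
import re

def tokenize_url(url):
    """Split a raw URL on '/' and '.', then drop empty tokens and 'com'."""
    # Lowercasing is an assumption: it makes 'Google' and 'google' the same token
    tokens = re.split(r"[/.]", url.lower())
    kept = [t for t in tokens if t and t != "com"]
    # dict.fromkeys removes duplicates while preserving the original order
    return list(dict.fromkeys(kept))

print(tokenize_url("google.com/search=VAD3R"))
print(tokenize_url("wikipedia.co.uk"))
```

Many published variants also split on dashes or strip the scheme (`http:`); whether that helps depends on the dataset.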

1. Import the `numpy`, `pandas` and `random` libraries. Then write a sanitization function to extract the relevant data from raw URLs.

In [None]:
#EDIT THIS CELL
def url_cleanse(web_url):
# ====================== Your code here =================



# =======================================================

This gives us the URL dataset values needed to train and test the model. The dataset has two columns: one for URLs and one for labels.

We read the dataset into a dataframe and then into an array that the vectorizer can understand.


In [None]:
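url_csv = pd.read_csv('data_url.csv', sep=',', on_bad_lines='skip')  # skip malformed rows (pandas >= 1.3)
url_df = pd.DataFrame(url_csv)              # to convert into data frames
url_df = np.array(url_df)                   # to convert into array
np.random.shuffle(url_df)                   # shuffle rows in place (random.shuffle can duplicate rows of a 2-D array)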
y = [d[1] for d in url_df]                  # all labels
urls = [d[0] for d in url_df]               # all urls corresponding to a label {G/B}

The data is now in a form that our vectorizer can process, and it can then be passed to the term frequency-inverse document frequency (TF-IDF) text feature extraction approach.

2. Use the `TF-IDF` text feature extraction approach from the `scikit-learn` Python module, together with a custom vectorizer function, to transform the data. For more details, consult
https://scikit-learn.org/stable/modules/feature_extraction.html

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================
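
One possible way to combine TF-IDF with a custom tokenizer is sketched below on a tiny made-up corpus. The `tokenize_url` helper and the three toy URLs are assumptions for illustration only; the real lab data comes from `data_url.csv`.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_url(url):
    """Hypothetical tokenizer: split on '/' and '.', drop empty tokens and 'com'."""
    return [t for t in re.split(r"[/.]", url.lower()) if t and t != "com"]

# Toy corpus, not taken from the lab dataset
toy_urls = [
    "google.com/search",
    "wikipedia.org/wiki/Phishing",
    "evil-login.example.com/verify.php",
]

# token_pattern=None tells scikit-learn that only the custom tokenizer is used
vectorizer = TfidfVectorizer(tokenizer=tokenize_url, token_pattern=None)
X = vectorizer.fit_transform(toy_urls)

print(X.shape)                         # (number of URLs, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # tokens learned from the corpus
```

The result `X` is a sparse matrix with one row per URL, which is the input format the classifiers in the next steps expect.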

3. Split data into training and testing sets then use the logistic regression from `scikit-learn` python module to train and test the model.

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================

4. Give the mean accuracy on the given test data by computing the score. Comment on the result.

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================

5. Use the model to classify the URLs ['hackthebox.eu','google.com/search=VAD3R','wikipedia.co.uk']. What do you notice?

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================
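
Steps 3 to 5 above can be sketched end to end on a toy corpus. The URLs, the `good`/`bad` label names, and the split parameters below are assumptions for illustration; with so few samples the reported accuracy is not meaningful, and the real lab should use `data_url.csv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 4 clean URLs followed by 4 malicious-looking ones
urls = [
    "google.com/search", "wikipedia.org/wiki/Phishing",
    "github.com/scikit-learn", "python.org/downloads",
    "paypa1-secure.example/login.php", "free-prizes.example/win.exe",
    "bank-verify.example/update.php", "account-alert.example/confirm.exe",
]
labels = ["good"] * 4 + ["bad"] * 4

# Default word tokenizer for brevity; a custom tokenizer can be plugged in instead
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(urls)

# Hold out 25% of the samples, keeping the class balance in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, stratify=labels
)

model = LogisticRegression()
model.fit(X_train, y_train)

# score() returns the mean accuracy on the held-out samples
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.2f}")

# New URLs must go through the same fitted vectorizer before prediction
new_urls = ["hackthebox.eu", "google.com/search=VAD3R", "wikipedia.co.uk"]
predictions = model.predict(vectorizer.transform(new_urls))
print(dict(zip(new_urls, predictions)))
```

Note that tokens never seen during fitting are ignored by `transform`, so a URL made only of unknown tokens is classified from the model intercept alone.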

## Support Vector Machine

Now, we will use the SVM approach to detect the malicious URLs.

1. Import the required libraries available in `scikit-learn`.

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================

2. Use the SVM classifier from `scikit-learn` python module to train and test the model.

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================

3. Once the model is trained with the SVM classifier, load the model and the vectorizer to predict the class of the URL ['google.com/search=VAD3R']. Comment on the result.

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================
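
The SVM steps above could look like the following sketch. The toy URLs, the `good`/`bad` labels, the linear kernel, and the use of `pickle` for saving and reloading are all assumptions made here; the lab does not prescribe a persistence mechanism, and the model is fitted on every toy sample only to keep the example short.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy corpus, not taken from the lab dataset
urls = [
    "google.com/search", "wikipedia.org/wiki/Phishing",
    "github.com/scikit-learn", "python.org/downloads",
    "paypa1-secure.example/login.php", "free-prizes.example/win.exe",
    "bank-verify.example/update.php", "account-alert.example/confirm.exe",
]
labels = ["good"] * 4 + ["bad"] * 4

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(urls)

# A linear kernel is a common choice for sparse, high-dimensional text features
svm_model = SVC(kernel="linear")
svm_model.fit(X, labels)

# Serialize and restore both objects (in memory here; pickle.dump / pickle.load
# with an open file work the same way)
blob = pickle.dumps((vectorizer, svm_model))
loaded_vectorizer, loaded_model = pickle.loads(blob)

# The reloaded vectorizer must be the one fitted together with the model
query = ["google.com/search=VAD3R"]
prediction = loaded_model.predict(loaded_vectorizer.transform(query))
print(prediction[0])
```

Saving the vectorizer alongside the model matters because the model only understands the feature columns that this particular vectorizer produced.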

## 3. Exercise 3

Choose one of the previous datasets and apply a classification method of your choice to predict malicious URLs. Describe the different steps and the obtained results.

In [None]:
#EDIT THIS CELL
# ====================== Your code here =================



# =======================================================