
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
taruntiwarihp_phishing_site_urls_path = kagglehub.dataset_download('taruntiwarihp/phishing-site-urls')

print('Data source import complete.')


# Phishing Detection: Feature Extraction and Construction of a Machine Learning model

In this brief project, we'll develop an ML model to predict whether a URL is used for phishing. We'll start from a raw dataset with just two columns:

- **URL:** URL string
- **Label:** Binary variable ('bad' if the URL is malicious and 'good' otherwise)

In [None]:
#Importing necessary libraries
!pip install tldextract
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import warnings
from scipy.stats import randint, uniform
import random
from sklearn.model_selection import KFold, cross_val_score
import re
import tldextract
from urllib.parse import urlparse
from collections import Counter
from scipy.stats import entropy
warnings.filterwarnings("ignore")

We import the raw dataset.

In [None]:
X = pd.read_csv('/kaggle/input/phishing-site-urls/phishing_site_urls.csv')

X['URL'] = X['URL'].str.strip()  # Remove surrounding whitespace (the result must be assigned back)

X

## Study of Special Characters in 'bad' URLs

Besides dots (.) and slashes (/), other non-alphanumeric characters appear in URLs, and some of them are often associated with malicious websites. From now on we'll use the term ***special character*** to refer to any non-alphanumeric character other than a dot or a slash. To study the frequencies of special characters in bad URLs we'll construct a dataframe with four columns:

- **Special Character:** this column contains the special characters that appear in the bad URLs of the dataset.
- **Frequency in bad URLs:** number of bad URLs in which the special character appears.
- **Bad probability:** fraction of the URLs containing the special character that are bad.
- **Score:** the danger score of the special character, calculated (with the natural logarithm) as follows

$$\text{Danger score} = \text{Bad probability}\cdot \log(\text{Frequency in bad URLs})$$
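
To make the formula concrete, here is a minimal sketch that computes the three quantities for a single character. The URLs and the helper name `danger_score` are invented for illustration and are not part of the dataset or the notebook's code.

```python
import math

# Toy (url, label) pairs, invented purely for illustration.
toy_urls = [
    ("paypal.example.com/login@verify", "bad"),
    ("secure-update.example.net/a@b", "bad"),
    ("mail.example.org/user@inbox", "good"),
    ("docs.example.org/guide", "good"),
]

def danger_score(char, rows):
    """Return (frequency in bad URLs, bad probability, score) for one character."""
    bad_hits = sum(1 for url, label in rows if label == "bad" and char in url)
    all_hits = sum(1 for url, _ in rows if char in url)
    if bad_hits == 0:
        # log(0) is undefined; characters never seen in bad URLs get a zero score here.
        return 0, 0.0, 0.0
    bad_probability = bad_hits / all_hits
    return bad_hits, bad_probability, bad_probability * math.log(bad_hits)

freq, prob, score = danger_score("@", toy_urls)
print(freq, round(prob, 3), round(score, 3))
```

Note that a character appearing in exactly one bad URL gets a score of 0 regardless of its bad probability, because log(1) = 0; the logarithm rewards characters that are both mostly bad and frequent.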

In [None]:
special_df = pd.DataFrame()

special_chars = set()

def find_special_char(x):
    special_chars_in_x = re.findall(r'[^a-zA-Z0-9]', x)
    special_chars.update(special_chars_in_x)
    return None

X_bad = X[X['Label'] == 'bad']
X_bad['URL'].apply(find_special_char)

special_chars = list(special_chars)

special_chars.remove('.')
special_chars.remove('/')

special_df['Special Character'] = special_chars
special_df['Frequency in bad URLs'] = special_df['Special Character'].apply(lambda x: X_bad[X_bad['URL'].str.contains(re.escape(x), regex=True)].shape[0])
special_df['Bad probability'] = special_df['Frequency in bad URLs']/special_df['Special Character'].apply(lambda x: X[X['URL'].str.contains(re.escape(x), regex=True)].shape[0])
special_df['Score'] = special_df['Bad probability']*special_df['Frequency in bad URLs'].apply(math.log)

special_df.sort_values(by='Score', ignore_index=True, ascending=False, inplace=True)
special_df

From now on we'll call the top ten characters of this list ***dangerous characters***.

In [None]:
dangerous_chars = list(special_df['Special Character'].head(10))
print(dangerous_chars)
plt.bar(special_df['Special Character'].head(10), special_df['Score'].head(10), color = 'green')
plt.xlabel('Special Character')
plt.ylabel('Score')
plt.show()

## Study of TLDs in 'bad' URLs

Similarly to what we did before, we want to study the TLDs of our 'bad' URLs and extract the top 10 most dangerous of them. The approach will be basically the same as before.
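
One caveat when reusing substring-based counting for TLDs: a string such as `com` can also occur elsewhere in a URL, so substring counts and exact-suffix counts may disagree. The sketch below illustrates the difference on made-up URLs. `naive_suffix` is a hypothetical stand-in for `tldextract` that only takes the last label of the host, so it would not handle multi-part suffixes such as `co.uk`.

```python
from collections import Counter

def naive_suffix(url):
    """Hypothetical stand-in for tldextract: last dot-separated label of the host."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host.rsplit(".", 1)[-1] if "." in host else ""

# Toy (url, label) pairs, invented purely for illustration.
toy_urls = [
    ("login.example.ru/com/index", "bad"),
    ("shop.example.com/cart", "good"),
    ("news.example.com/world", "good"),
]

# Substring matching counts every URL that contains "com" anywhere.
substring_hits = sum("com" in url for url, _ in toy_urls)

# Exact matching counts only URLs whose suffix is "com".
suffix_counts = Counter(naive_suffix(url) for url, _ in toy_urls)

print(substring_hits, suffix_counts["com"])
```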

In [None]:
TLD_df = pd.DataFrame()

TLD_list = pd.Series(X_bad['URL'].apply(lambda x: tldextract.extract(x).suffix)).unique()

TLD_df['TLD'] = TLD_list

TLD_df['Frequency in bad URLs'] = TLD_df['TLD'].apply(lambda x: X_bad[X_bad['URL'].str.contains(re.escape(x), regex=True)].shape[0])
TLD_df['Bad probability'] = TLD_df['Frequency in bad URLs']/TLD_df['TLD'].apply(lambda x: X[X['URL'].str.contains(re.escape(x), regex=True)].shape[0])
TLD_df['Score'] = TLD_df['Bad probability']*TLD_df['Frequency in bad URLs'].apply(math.log)

TLD_df.sort_values(by='Score', ignore_index=True, ascending=False, inplace=True)
TLD_df

In [None]:
dangerous_TLDs = list(TLD_df['TLD'].head(10))
print(dangerous_TLDs)
plt.bar(TLD_df['TLD'].head(10), TLD_df['Score'].head(10), color = 'green')
plt.xlabel('Dangerous TLD')
plt.ylabel('Score')
plt.show()

These are the top 10 most dangerous TLDs in our dataframe (we'll call them ***dangerous TLDs***). So a URL that contains one of these TLDs is more likely to be "bad".

## Feature Extraction

We'll extract the following features from the raw data:

- **URL length:** total length of the URL string.
- **Number of dots:** number of dots in the URL.
- **Number of slashes:**  number of slashes in the URL.
- **Percentage of numerical characters:** percentage of numerical characters in the URL.
- **Dangerous characters:** binary variable (True if there is a dangerous character in the URL and False otherwise)
- **Dangerous TLD:** binary variable (True if the TLD of the URL is dangerous and False otherwise)
- **Entropy:** Shannon entropy of the character distribution of the URL.
- **IP address:** binary variable (True if there is an IP address in the URL and False otherwise).
- **Domain name length:** length of the main domain part (e.g. www.alligator.it has a domain name length of 9).
- **Suspicious keywords:** binary variable (True if there are suspicious keywords in the domain or path, such as "login", "secure", "verify" or "bank", and False otherwise). We'll use the following list of suspicious words:

    [secure, account, update, login, verify, signin, bank, notify, click, inconvenient]

    A reference for this list is *D. Ranganayakulu, C. Chellappan, Detecting Malicious URLs in E-mail – An Implementation, AASRI Procedia, 2013*.

- **Repetitions:**  binary variable (True if the domain contains a substring of three identical characters and False otherwise).
- **Redirections:** Recall that a double slash // in a URL can correspond to a redirection. Redirections are not necessarily malicious, but they may be if they aren't at the beginning of the URL. So Redirections is a binary variable (True if the last // is in a position higher than 7 and False otherwise, where by position of // we mean the position of the character that precedes //). The bound 7 is chosen to avoid flagging the (generally safe) // of https:// and similar cases.
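
Three of the string-based features in the list above (Entropy, Repetitions and Redirections) can be sketched with the standard library alone. The function names are illustrative, not the notebook's own. `has_repetition` expects the already-extracted domain, which the notebook obtains with `tldextract`.

```python
import math
import re
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy (in bits) of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # One probability per distinct character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def has_repetition(domain):
    """True if some character occurs three or more times in a row."""
    return re.search(r"(.)\1{2,}", domain) is not None

def has_late_redirect(url, bound=7):
    """True if the last '//' starts at a 0-based index greater than bound."""
    return url.rfind("//") > bound  # rfind returns -1 when '//' is absent

print(shannon_entropy("aabb"))
print(has_repetition("goooogle"), has_repetition("google"))
print(has_late_redirect("https://example.com"))
print(has_late_redirect("http://example.com//evil.example"))
```

In `https://` the double slash starts at index 6, which is why a bound of 7 leaves ordinary scheme separators unflagged.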


In [None]:

#1 URL length

X['URL length'] = X['URL'].apply(len)

#2 Numbers of dots

X['Number of dots'] = X['URL'].apply(lambda x: x.count('.'))

#3 Number of slashes

X['Number of slashes'] = X['URL'].apply(lambda x: x.count('/'))

#4 Percentage of numerical characters

X['Percentage of numerical characters'] = X['URL'].apply(lambda x: sum(c.isdigit() for c in x))/X['URL length']

#5 Dangerous characters

X['Dangerous characters'] = X['URL'].apply(lambda x: any(char in x for char in dangerous_chars))

#6 Dangerous TLD

X['Dangerous TLD'] = X['URL'].apply(lambda x: tldextract.extract(x).suffix in dangerous_TLDs)

#7 Entropy

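def urlentropy(url):
    # Shannon entropy (in bits) of the character distribution of the URL
    frequencies = Counter(url)
    # One probability per distinct character (iterating over the URL itself would repeat entries)
    prob = [count / len(url) for count in frequencies.values()]
    return entropy(prob, base=2)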


X['Entropy'] = X['URL'].apply(urlentropy)

#8 IP Address

ip_pattern = r'[0-9]+(?:\.[0-9]+){3}'
X['IP Address'] = X['URL'].apply(lambda x: bool(re.search(ip_pattern, x)))

#9 Domain name length

X['Domain name length'] = X['URL'].apply(lambda x: len(tldextract.extract(x).domain))

#10 Suspicious keywords

sus_words = ['secure', 'account', 'update', 'login', 'verify' ,'signin', 'bank',
            'notify', 'click', 'inconvenient']

X['Suspicious keywords'] = X['URL'].apply(lambda x: sum([word in x for word in sus_words]) != 0)


#11 Repetitions

X['Repetitions'] = X['URL'].apply(lambda x: True if re.search(r'(.)\1{2,}', tldextract.extract(x).domain) else False)

#12 Redirections

def redirection(url):
  pos = url.rfind('//') #If the // is not found, it returns -1
  return pos>7

X['Redirections'] = X['URL'].apply(redirection)

#We print the new dataset

X

## Exploratory Data Analysis and Feature Engineering

We standardize numerical features (except the **Percentage of numerical characters**).
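
As a reminder of what standardization does to a single column, here is a minimal standard-library sketch. The values are made up, and `standardize` is an illustrative helper, not part of scikit-learn. Like `StandardScaler`, it divides by the population standard deviation.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

url_lengths = [20, 35, 50, 95]  # made-up URL lengths
z = standardize(url_lengths)
print([round(v, 3) for v in z])
```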

In [None]:
scaler = StandardScaler()

num_columns = ['URL length', 'Number of dots', 'Number of slashes', 'Domain name length', 'Entropy']

X[num_columns] = scaler.fit_transform(X[num_columns])

We turn the boolean features and the target Label into numerical data (Label becomes 1 for 'good' and 0 for 'bad'). Moreover, we drop the **URL** feature because it's no longer needed.

In [None]:
X['IP Address'] = X['IP Address'].astype(int)
X['Suspicious keywords'] = X['Suspicious keywords'].astype(int)
X['Repetitions'] = X['Repetitions'].astype(int)
X['Redirections'] = X['Redirections'].astype(int)
X['Dangerous characters'] = X['Dangerous characters'].astype(int)
X['Dangerous TLD'] = X['Dangerous TLD'].astype(int)
X['Label'] = (X['Label'] == 'good').astype(int)

X.drop(columns=['URL'], inplace=True)

X

To inspect the correlations among the features (and with the target), we plot heat maps of the full correlation matrix and of the column of correlations with Label (our target).

In [None]:
corr_matrix = X.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, annot_kws={"size": 6})
plt.show()
sns.heatmap(corr_matrix[['Label']].sort_values(by='Label').T, annot=True, cmap='coolwarm', linewidths=0.5, annot_kws={"size": 8})
plt.show()
print(corr_matrix[['Label']].sort_values(by='Label'))

**Entropy** and **URL length** are highly correlated, so we use PCA to replace them with a single component.
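
For two standardized features, the result of a one-component PCA can be worked out by hand. The following is a sketch under the assumptions that both columns have exactly unit variance and that their correlation $\rho$ is positive; the symbols $z_E$ and $z_L$ (standardized Entropy and URL length) are notation introduced here, not in the notebook.

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad \lambda_1 = 1 + \rho,\quad v_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \lambda_2 = 1 - \rho,\quad v_2 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}$$

$$\text{PC}_1 = \pm\,\frac{z_E + z_L}{\sqrt{2}}, \qquad \frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1 + \rho}{2}$$

So the new feature is, up to sign, a scaled sum of the two standardized columns, and it retains a fraction $(1+\rho)/2$ of their combined variance.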

In [None]:
pca = PCA(n_components=1)
X['Entropy and length (PCA)'] = pca.fit_transform(X[['Entropy', 'URL length']])
X.drop(columns=['Entropy', 'URL length'], inplace=True)

X

## Train-Test Split

Since we are working with a binary classification problem, it's important to check whether both classes ('bad' and 'good') are approximately equally represented in the dataset.

In [None]:
X['Label'].value_counts(normalize=True)

To correct the class imbalance in the dataset, we'll undersample the 'good' records (we'll keep the unsampled 'good' records for the testing step).
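
The idea of undersampling while keeping the left-out records can be sketched without pandas. The helper `undersample` and the toy rows below are illustrative assumptions; the sketch supposes exactly two classes.

```python
import random

def undersample(rows, label_key="Label", seed=22):
    """Sample the majority class down to the minority size.

    Returns (balanced_rows, left_out_majority_rows). Assumes exactly two classes.
    """
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    minority, majority = sorted(by_label.values(), key=len)
    rng = random.Random(seed)
    kept_idx = set(rng.sample(range(len(majority)), len(minority)))
    kept = [r for i, r in enumerate(majority) if i in kept_idx]
    left_out = [r for i, r in enumerate(majority) if i not in kept_idx]
    return minority + kept, left_out

# Toy data: 3 'bad' rows (Label 0) and 7 'good' rows (Label 1).
rows = [{"id": i, "Label": 0} for i in range(3)] + [{"id": i, "Label": 1} for i in range(3, 10)]
balanced, left_out = undersample(rows)
print(len(balanced), len(left_out))
```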

In [None]:
n_samples = (X['Label'] == 0).sum()  # number of 'bad' URLs (Label 0), the minority class
X_good = X[X['Label'] == 1]
X_bad = X[X['Label'] == 0]
X_goodsample = X_good.sample(n=n_samples, random_state=22)
X_goodmissing = X_good.drop(X_goodsample.index)

X = pd.concat([X_bad, X_goodsample], ignore_index=True)

X

We divide features from target.

In [None]:
y = X['Label']
X.drop(columns=['Label'], inplace=True)

We split the data into a training set and a test set. We also add the 'good' URLs we set aside earlier to the test set, so the test set contains more 'good' than 'bad' URLs.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)

y_goodmissing = X_goodmissing['Label']
X_goodmissing.drop(columns=['Label'], inplace=True)

# Merging X_test and X_goodmissing

X_test = pd.concat([X_test, X_goodmissing], axis=0)

# Merging y_test and y_goodmissing

y_test = pd.concat([y_test, y_goodmissing], axis=0)

## Construction of a ML model

We try two models: XGBClassifier and RandomForestClassifier. To get a more reliable estimate of how well each model generalizes than a single split would give, we'll perform a cross-validation with 3 folds.
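
For intuition, shuffled k-fold splitting can be sketched as follows. `kfold_indices` is an illustrative helper. It mirrors the idea behind `KFold(n_splits=3, shuffle=True)`, but it will not reproduce scikit-learn's exact shuffle.

```python
import random

def kfold_indices(n_samples, n_splits=3, seed=22):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(10))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once, and the reported CV score is the mean of the per-fold scores.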

In [None]:
kf = KFold(n_splits=3, shuffle=True, random_state=22)

xgb_model = XGBClassifier(random_state=22)
print(cross_val_score(xgb_model, X_train, y_train, cv=kf, scoring='accuracy').mean())

rf_model = RandomForestClassifier(random_state=22)
print(cross_val_score(rf_model, X_train, y_train, cv=kf, scoring='accuracy').mean())

The CV score of the RandomForestClassifier is slightly higher. We fit the two models and analyze their feature importances.

In [None]:
rf_model.fit(X_train, y_train)
importances = rf_model.feature_importances_
feature_names = X.columns
indices = np.argsort(importances)[::-1]

plt.title('Feature Importance (RandomForestClassifier)')
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()


In [None]:
xgb_model.fit(X_train, y_train)
importances = xgb_model.feature_importances_
feature_names = X.columns
indices = np.argsort(importances)[::-1]

plt.title('Feature Importance (XGBClassifier)')
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()

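The two models, while having similar performances, give completely different feature importances. It's not entirely clear to us why this happens. One plausible contributor is that the two scores are not computed in the same way: scikit-learn's random forest reports impurity-based importances averaged over its trees, while XGBoost's `feature_importances_` depends on its `importance_type` setting. Correlated features can also share importance differently in the two models.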

We test the two models.

In [None]:
from sklearn.metrics import accuracy_score

rf_pred = rf_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)

print(accuracy_score(y_test, rf_pred))
print(accuracy_score(y_test, xgb_pred))

The accuracy of the two models on the test set is basically the same (86.4%).
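
Because the unsampled 'good' URLs were appended to the test set, a single accuracy figure is dominated by the 'good' class. As a hedged illustration of how accuracy relates to per-class recall, here is a small sketch on made-up labels (1 = 'good', 0 = 'bad'); `accuracy_and_recalls` is an illustrative helper, and the numbers are not results from this notebook.

```python
def accuracy_and_recalls(y_true, y_pred):
    """Return overall accuracy and a dict mapping each class to its recall."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    recalls = {}
    for cls in sorted(set(y_true)):
        total = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls[cls] = hits / total
    return accuracy, recalls

# Made-up labels: 8 'good' (1) and 2 'bad' (0); one 'bad' URL is misclassified.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 8 + [1, 0]
acc, rec = accuracy_and_recalls(y_true, y_pred)
print(acc, rec)
```

In this toy case the accuracy is 0.9 even though only half of the 'bad' URLs are caught, which is why per-class figures are worth checking alongside accuracy.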

## Hyperparameter Tuning

For computational reasons, we'll tune only the XGBClassifier model to improve its CV score (and hopefully its accuracy). We'll use Optuna, whose default sampler (TPE) is a Bayesian optimization method.

In [None]:
import optuna


def objective(trial):
    # Search space for the XGBClassifier hyperparameters
    n_estimators = trial.suggest_int('n_estimators', 100, 400)
    max_depth = trial.suggest_int('max_depth', 3, 7)
    learning_rate = trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
    subsample = trial.suggest_float('subsample', 0.6, 1.0)
    reg_alpha = trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True)

    model = XGBClassifier(
        random_state=22,
        n_estimators=n_estimators,
        max_depth=max_depth,
        learning_rate=learning_rate,
        subsample=subsample,
        reg_alpha=reg_alpha,
        eval_metric='logloss'  # binary classification
    )

    # Cross-validate on the training split only, so the test set stays unseen during tuning
    mean_score = cross_val_score(model, X_train, y_train, cv=kf, scoring='accuracy').mean()
    return mean_score


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)

print("Best hyperparameters:", study.best_params)
print("Best accuracy:", study.best_value)

best_xgb_model = XGBClassifier(random_state=22, **study.best_params)

Let's see how the tuned XGBClassifier performs on the test set.

In [None]:
best_xgb_model.fit(X_train, y_train)

best_xgb_pred = best_xgb_model.predict(X_test)

print(accuracy_score(y_test, best_xgb_pred))


## Conclusions

We managed to develop an ML model that predicts whether a URL is used for phishing with an accuracy of **87%** (using only features extracted from the URL string). The importance of the features in the prediction of the result is not entirely clear (to us).