# URL Classification model using Machine Learning
### For Computer Science B.Sc. Ariel University

**By Yuval Yurzdichansky, Matan Aviv and Roy Simanovich**

## Introduction
In this project we build a machine learning model that classifies URLs as either malicious (phishing, malware, etc.) or benign. We train the model on a dataset of URLs labeled as malicious or benign, so that it can then classify new, unseen URLs.

## Description
This notebook is the interface to the trained models. You can choose one of the following models for classification:
1. Random Forest with 100 estimators - sklearn
2. Decision Tree - sklearn
3. K-Nearest Neighbors (KNN) - sklearn
4. Logistic Regression with 10,000 iterations - sklearn
5. Gaussian Naive Bayes - sklearn
6. Recurrent Neural Network (RNN) - tensorflow and keras

After choosing a model, you can input a URL and the model will classify it as either malicious or benign.
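
The choice described above could be wired to the loading step with a small lookup table. The following is only a minimal sketch, assuming each scikit-learn model was pickled to its own file: only `random_forest_model.pkl` appears in this notebook, while the other file names and the helpers `model_path` and `load_model` are hypothetical. The RNN is left out because a Keras model would normally be saved in its own format rather than pickled.

```python
import pickle

# Menu number -> pickle file. Only the first file name appears in this notebook;
# every other name is a hypothetical placeholder.
MODEL_FILES = {
    1: "random_forest_model.pkl",
    2: "decision_tree_model.pkl",        # hypothetical
    3: "knn_model.pkl",                  # hypothetical
    4: "logistic_regression_model.pkl",  # hypothetical
    5: "naive_bayes_model.pkl",          # hypothetical
}

def model_path(choice):
    """Return the pickle path for a menu choice, or raise ValueError."""
    try:
        return MODEL_FILES[choice]
    except KeyError:
        raise ValueError(f"Unknown model choice: {choice!r}") from None

def load_model(choice):
    """Load the pickled model that belongs to the given menu choice."""
    with open(model_path(choice), "rb") as model_file:
        return pickle.load(model_file)

print(model_path(1))
```

Only unpickle files that you created yourself, since loading a pickle can execute arbitrary code.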

### Install and import libraries

In [47]:
# Install the required libraries via pip
%pip install scikit-learn
%pip install pandas
%pip install numpy
%pip install tensorflow

Note: you may need to restart the kernel to use updated packages.


In [48]:
# Importing the libraries for the project #

## General-purpose libraries (Python standard library, pandas and NumPy)
### Data Manipulation and Analysis
import pandas as pd
### Numerical Computing
import numpy as np
### Collection of Data Structures
import collections as col
### Regular Expressions
import re as regex
### URL Handling (urllib.parse is imported explicitly; "import urllib" alone does not load it)
import urllib.parse
import urllib as urlhndl
### Mathematical Operations
import math
### Socket Programming
import socket

## SKLearn Libraries for Machine Learning
### Splitting the data into training and testing
from sklearn.model_selection import train_test_split
### Metrics for Model Evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
### Resampling the Data
from sklearn.utils import resample
### TF-IDF Vectorizer for Text Data
from sklearn.feature_extraction.text import TfidfVectorizer
### Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
### Naive Bayes Classifier
from sklearn.naive_bayes import GaussianNB
### Logistic Regression Classifier
from sklearn.linear_model import LogisticRegression
### K-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier

## Tensorflow Libraries for Deep Learning
import tensorflow as tf

## Pickle for saving models to disk and loading them back
import pickle as pkl

### Feature extraction functions
We will use the following feature extraction functions to extract features from the URLs:
1. `get_url_length`: This function returns the length of the URL.
2. `get_dot_count`: This function returns the number of dots in the URL.
3. `get_hyphen_count_in_domain`: This function returns the number of hyphens in the domain of the URL.
4. `contains_security_sensitive_words`: This function returns 1 if the URL contains security sensitive words (e.g., "login", "signin", "account", etc.) and 0 otherwise.
5. `get_directory_length`: This function returns the length of the directory part of the URL.
6. `get_sub_directory_count`: This function returns the number of subdirectories in the URL.
7. `contains_ip`: This function returns 1 if the domain of the URL is an IPv4 address and 0 otherwise.
8. `get_token_count_in_path`: This function returns the number of tokens in the path part of the URL.
9. `get_largest_token_length`: This function returns the length of the largest token in the path part of the URL.
10. `get_average_token_length`: This function returns the average length of the tokens in the path part of the URL.
11. `get_file_length`: This function returns the length of the file part of the URL.
12. `get_dot_count_in_file`: This function returns the number of dots in the file part of the URL.
13. `get_delimiter_count_in_file`: This function returns the number of delimiters in the file part of the URL.
14. `get_arguments_length`: This function returns the length of the arguments part of the URL.
15. `get_number_of_arguments`: This function returns the number of arguments in the URL.
16. `get_length_of_largest_argument_value`: This function returns the length of the largest argument value in the URL.
17. `get_max_delimiters_in_arguments`: This function returns the maximum number of delimiters in the arguments part of the URL.
18. `get_domain_features`: This function returns four features of the domain part of the URL: its length, the number of dot-separated tokens, the length of the largest token, and the average token length.
19. `get_special_character_count`: This function returns the number of special characters in the URL.
20. `get_entropy`: This function returns the entropy of the URL.
21. `check_url_shortened`: This function returns 1 if the URL is shortened and 0 otherwise.
22. `get_port_number`: This function returns the port number of the URL, if it exists, and -1 otherwise. It is not among the features extracted in this notebook.
23. `get_subdomain_count`: This function returns the number of subdomains in the URL.
24. `get_suspicious_tld`: This function returns 1 if the top-level domain (TLD) of the URL is suspicious (e.g., "xyz", "top", "loan", etc.) and 0 otherwise.
25. `get_numeric_ratio`: This function returns the ratio of numeric characters in the URL.
26. `get_word_count`: This function returns the number of words in the URL.
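27. `get_url_is_internationalized`: This function returns 1 if the URL is internationalized and 0 otherwise. It is not among the features extracted in this notebook.
28. `is_https`: This function returns 1 if the URL starts with "https" and 0 otherwise.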
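
Most of the features above are computed from the parts that `urllib.parse.urlparse` splits a URL into: the domain (`netloc`), the path, and the arguments (`query`). As a rough illustration, the sketch below takes one made-up URL (not from the dataset) and derives a few of the listed features from it by hand.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical example URL, used only for illustration
url = "https://login.example-shop.com/account/reset/index.php?user=alice&token=a-b_c"
parts = urlparse(url)

domain = parts.netloc               # 'login.example-shop.com'
path = parts.path                   # '/account/reset/index.php'
filename = path.split("/")[-1]      # 'index.php'
arguments = parse_qs(parts.query)   # {'user': ['alice'], 'token': ['a-b_c']}

summary = {
    "hyphen_count_domain": domain.count("-"),
    "directory_length": len(path),
    "sub_directory_count": path.count("/") - 1,
    "file_length": len(filename),
    "number_of_arguments": len(arguments),
    "length_of_largest_argument_value": max(
        len(value) for values in arguments.values() for value in values
    ),
}
print(summary)
```

Without a scheme, `urlparse` would place the whole string in `path` and leave `netloc` empty, which is why the functions below call `ensure_scheme` before parsing.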

In [49]:
# Prepend "https://" to benign URLs (label 0); other URLs are returned unchanged
def add_https(url, label):
    if label == 0:
        return "https://" + url
    else:
        return url

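# Prepend "https://" when the URL has no scheme, so that urlparse can split it into domain, path and query
def ensure_scheme(url):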
    if not urlhndl.parse.urlparse(url).scheme:
        url = 'https://' + url
    return url

# Feature extraction functions
def get_url_length(url):
    return len(url)

def get_dot_count(url):
    return url.count('.')

def contains_security_sensitive_words(url):
    security_sensitive_words = [
    'login', 'password', 'admin', 'root', 'secret', 'private', 'secure', 'confidential', 
    'bank', 'creditcard', 'account', 'authentication', 'authorization', 'session', 'token', 
    'apikey', 'ssl', 'https', 'secure', 'encrypted', 'auth', 'signin', 'signup', 'verification', 
    'resetpassword', 'change-password', 'forgot-password', 'otp', '2fa', 'phishing', 'malware', 
    'virus', 'trojan', 'exploit', 'hacker', 'attack', 'security', 'vulnerable', 'injection', 
    'xss', 'csrf', 'dos', 'ddos', 'bruteforce', 'firewall', 'vpn', 'proxy', 'tor', 'security-question', 
    'privacy-policy']
    return int(any(word in url for word in security_sensitive_words))

def get_directory_length(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    return len(path)

def get_sub_directory_count(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    return path.count('/') - 1

def get_token_count_in_path(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    tokens = path.split('/')
    return len(tokens) - 1

def get_largest_token_length(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    tokens = path.split('/')
    if tokens:
        return max(len(token) for token in tokens)
    return 0

def get_average_token_length(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    tokens = [token for token in path.split('/') if token]
    if tokens:
        return np.mean([len(token) for token in tokens])
    return 0

def get_file_length(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    filename = path.split('/')[-1]
    return len(filename)

def get_dot_count_in_file(url):
    url = ensure_scheme(url)
    path = urlhndl.parse.urlparse(url).path
    filename = path.split('/')[-1]
    return filename.count('.')

def get_delimiter_count_in_file(url):
    path = urlhndl.parse.urlparse(url).path
    filename = path.split('/')[-1]
    delimiters = ['.', '_', '-']
    return sum(filename.count(delimiter) for delimiter in delimiters)

def get_arguments_length(url):
    url = ensure_scheme(url)
    query = urlhndl.parse.urlparse(url).query
    return len(query)

def get_number_of_arguments(url):
    url = ensure_scheme(url)
    query = urlhndl.parse.urlparse(url).query
    return len(urlhndl.parse.parse_qs(query))

def get_length_of_largest_argument_value(url):
    url = ensure_scheme(url)
    query = urlhndl.parse.urlparse(url).query
    params = urlhndl.parse.parse_qs(query)
    if params:
        return max(len(max(values, key=len)) for values in params.values())
    return 0

def get_max_delimiters_in_arguments(url):
    url = ensure_scheme(url)
    query = urlhndl.parse.urlparse(url).query
    params = urlhndl.parse.parse_qs(query)
    delimiters = ['&', '=', '-', '_']
    if params:
        return max(sum(value.count(delimiter) for delimiter in delimiters) for values in params.values() for value in values)
    return 0


def get_hyphen_count_in_domain(url):
    url = ensure_scheme(url)
    domain = urlhndl.parse.urlparse(url).netloc
    return domain.count('-')

def contains_ip(url):
    url = ensure_scheme(url)
    domain = urlhndl.parse.urlparse(url).netloc
    try:
        socket.inet_aton(domain)
        return 1
    # inet_aton raises OSError for strings that are not valid IPv4 addresses
    except (OSError, ValueError):
        return 0
    
def get_domain_features(url):
    url = ensure_scheme(url)
    domain = urlhndl.parse.urlparse(url).netloc
    tokens = domain.split('.')
    
    # Domain Length
    domain_length = len(domain)
    
    # Count of Tokens in the Domain
    token_count = len(tokens)
    
    # Length of Largest Token in the Domain
    largest_token_length = max(len(token) for token in tokens) if tokens else 0
    
    # Average Domain Token Length
    average_token_length = sum(len(token) for token in tokens) / len(tokens) if tokens else 0
    
    return domain_length, token_count, largest_token_length, average_token_length

# New feature extraction functions
def get_special_character_count(url):
    special_characters = ['@', '=', '+', '*', '?', '&', '%', '$', '#', '!']
    return sum(url.count(char) for char in special_characters)

def get_entropy(url):
    # Count the frequency of each character in the string
    freq = col.Counter(url)
    # Calculate the probabilities
    probs = [count / len(url) for count in freq.values()]
    # Calculate the Shannon entropy
    entropy = -sum(p * math.log(p, 2) for p in probs if p > 0)
    return entropy

def check_url_shortened(url):
    shortened_services = ['bit.ly', 'tinyurl.com', 'goo.gl', 'ow.ly', 't.co']
    url = ensure_scheme(url)
    domain = urlhndl.parse.urlparse(url).netloc
    return int(domain in shortened_services)

def get_subdomain_count(url):
    url = ensure_scheme(url)
    domain_parts = urlhndl.parse.urlparse(url).netloc.split('.')
    # Count as subdomains any parts beyond the second-level domain and TLD
    return max(0, len(domain_parts) - 2)

def get_suspicious_tld(url):
    suspicious_tlds = ['xyz', 'top', 'loan', 'win', 'club']
    url = ensure_scheme(url)
    tld = urlhndl.parse.urlparse(url).netloc.split('.')[-1]
    return int(tld in suspicious_tlds)

def get_numeric_ratio(url):
    numeric_chars = sum(c.isdigit() for c in url)
    return numeric_chars / len(url) if len(url) > 0 else 0

def get_word_count(url):
    words = regex.findall(r'\w+', url)
    return len(words)

# Note: the 'timeout' parameter is unused; the check only inspects the URL string
def is_https(url, timeout=0.5):
    return int(url.startswith("https"))
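
The feature list mentions `get_port_number` and `get_url_is_internationalized`, which the code above does not implement. The sketch below shows one way they might look; it is not the authors' implementation. It assumes the URL already has a scheme (for example after `ensure_scheme`) and treats a host as internationalized when it contains non-ASCII characters or a Punycode label starting with `xn--`.

```python
from urllib.parse import urlparse

def get_port_number(url):
    """Return the explicit port in the URL, or -1 if none is given (sketch)."""
    try:
        port = urlparse(url).port
    except ValueError:
        # Raised when the port is not a number or is out of range
        return -1
    return port if port is not None else -1

def get_url_is_internationalized(url):
    """Return 1 if the host has non-ASCII characters or a Punycode label (sketch)."""
    host = urlparse(url).hostname or ""
    has_non_ascii = any(ord(ch) > 127 for ch in host)
    has_punycode = any(label.startswith("xn--") for label in host.split("."))
    return int(has_non_ascii or has_punycode)

print(get_port_number("https://example.com:8443/a"))
print(get_url_is_internationalized("https://xn--procder-eya.com/"))
```

Because these two features are not part of the feature table built in this notebook, adding them would change the input shape and would require retraining the saved model.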

### Interface

#### Load the model

In [50]:
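# Load the trained Random Forest model and the fitted TF-IDF vectorizer
with open('random_forest_model.pkl', 'rb') as model_file:
    rf_model = pkl.load(model_file)
with open('tfidf_vectorizer.pkl', 'rb') as vectorizer_file:
    tf_idf_vectorizer = pkl.load(vectorizer_file)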

#### Input a URL

In [51]:
# Request the URL from the user
url = input("Enter a URL: ")

# Preprocess the URL
url = ensure_scheme(url)
data = {'URL': [url]}
df = pd.DataFrame(data)

#### Prepare the URL for classification

In [52]:
# Extracting the features
features = df['URL'].apply(lambda x: pd.Series({
    'url_length': get_url_length(x),
    'dot_count': get_dot_count(x),
    'hyphen_count_domain': get_hyphen_count_in_domain(x),
    'security_sensitive_words': contains_security_sensitive_words(x),
    'directory_length': get_directory_length(x),
    'sub_directory_count': get_sub_directory_count(x),
    'token_count_path': get_token_count_in_path(x),
    'largest_token_length': get_largest_token_length(x),
    'average_token_length': get_average_token_length(x),
    'file_length': get_file_length(x),
    'contains_ip': contains_ip(x),
    'dot_count_in_file': get_dot_count_in_file(x),
    'delimiter_count_in_file': get_delimiter_count_in_file(x),
    'arguments_length': get_arguments_length(x),
    'number_of_arguments': get_number_of_arguments(x),
    'length_of_largest_argument_value': get_length_of_largest_argument_value(x),
    'max_delimiters_in_arguments': get_max_delimiters_in_arguments(x),
    'special_character_count': get_special_character_count(x),
    'entropy': get_entropy(x),
    'url_shortened': check_url_shortened(x),
    'subdomain_count': get_subdomain_count(x),
    'suspicious_tld': get_suspicious_tld(x),
    'numeric_ratio': get_numeric_ratio(x),
    'domain_length': get_domain_features(x)[0],
    'domain_token_count': get_domain_features(x)[1],
    'largest_domain_token_length': get_domain_features(x)[2],
    'average_domain_token_length': get_domain_features(x)[3],
    'word_count': get_word_count(x),
    'is_https': is_https(x)
}))

# Concatenate original DF with features
df = pd.concat([df, features], axis=1)

# TF-IDF Vectorization
data = tf_idf_vectorizer.transform(df['URL'])

# Convert the sparse matrix to a dataframe
data = pd.DataFrame(data.toarray(), columns=tf_idf_vectorizer.get_feature_names_out())

# Add the features to the dataframe
data = pd.concat([df, data], axis=1)

# Drop the URL column
data = data.drop('URL', axis=1)

#### Make a prediction

In [53]:
# Probability that the URL is malicious (class 1), rounded to 4 decimal places
malicious_probability = round(rf_model.predict_proba(data)[0][1], 4)

# Classify the URL as malicious only if the probability exceeds the 0.8 threshold
prediction = 1 if malicious_probability > 0.8 else 0

# Print the prediction
if prediction == 0:
    print("The model predicts that the URL is benign.")
else:
    print("The model predicts that the URL is malicious.")

# Print the probability of the prediction
print("The probability of the URL being malicious is", malicious_probability)

# Print the URL
print(url)

The model predicts that the URL is malicious.
The probability of the URL being malicious is 0.87
https://xn--procder-eya.com/
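
The decision rule used above can be isolated in a few lines. This is only an illustrative sketch: the `classify` helper is a hypothetical name, and the 0.8 threshold is taken from the prediction cell. The comparison is strict, so a probability of exactly 0.8 is still reported as benign.

```python
def classify(malicious_probability, threshold=0.8):
    """Return 1 (malicious) if the probability exceeds the threshold, else 0 (benign)."""
    return int(malicious_probability > threshold)

for probability in (0.55, 0.80, 0.87):
    label = "malicious" if classify(probability) else "benign"
    print(f"p={probability:.2f} -> {label}")
```

A threshold above 0.5 is stricter than simply picking the more probable class: fewer benign URLs are flagged by mistake, at the cost of missing some malicious ones.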


