# URL Classification model using Machine Learning
### For Computer Science B.Sc. Ariel University

**By Yuval Yurzdichinsky, Matan Aviv and Roy Simanovich**

## Introduction
In this project we will build a machine learning model that will classify URLs based on if they are malicious (phishing, malware, etc.) or benign. We will use a dataset of URLs that are labeled as either malicious or benign and use it to train a model that will be able to classify new URLs.

### Install and import libraries

In [2]:
# Some importent libraries installtion via pip
%pip install scikit-learn
%pip install pandas
%pip install numpy
%pip install gensim nltk
%pip install matplotlib
%pip install python-whois
%pip install tldextract

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.
Collecting python-whois
  Downloading python-whois-0.8.0.tar.gz (109 kB)
     ---------------------------------------- 0.0/109.6 kB ? eta -:--:--
     --- ------------------------------------ 10.2/109.6 kB ? eta -:--:--
     ---------- -------------------------- 30.7/109.6 kB 262.6 kB/s eta 0:00:01
     -------------------- ---------------- 61.4/109.6 kB 409.6 kB/s eta 0:00:01
     --------------------------------- -- 102.4/109.6 kB 590.8 kB/s eta 0:00:01
     ------------------------------------ 109.6/109.6 kB 528.8 kB/s eta 0:00:00
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting future (from pyth

In [1]:
# Importing the libraries

## Regular Libraries that's come with python
import pandas as pd
import numpy as np
import socket as sock
import math as m
import re
import whois as who
import tldextract as tld
from collections import Counter
from datetime import datetime

## URL Parsing
from urllib.parse import urlparse, parse_qs

## SKLearn Libraries for Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from sklearn.utils import resample

## SKLean metrics for model evaluation
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Tensorflow Libraries for Deep Learning
import tensorflow as tf

Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466
        
  import pandas as pd


### Feature extraction functions
We will use the following feature extraction functions to extract features from the URLs:
1. `get_url_length`: This function returns the length of the URL.
2. `get_dot_count`: This function returns the number of dots in the URL.
3. `get_hyphen_count_in_domain`: This function returns the number of hyphens in the domain of the URL.
4. `contains_security_sensitive_words`: This function returns 1 if the URL contains security sensitive words (e.g., "login", "signin", "account", etc.) and 0 otherwise.
5. `get_directory_length`: This function returns the length of the directory part of the URL.
6. `get_sub_directory_count`: This function returns the number of subdirectories in the URL.
7. `contains_ip`: This function returns 1 if the URL contains an IP address and 0 otherwise.
8. `get_token_count_in_path`: This function returns the number of tokens in the path part of the URL.
9. `get_largest_token_length`: This function returns the length of the largest token in the path part of the URL.
10. `get_avarage_token_length`: This function returns the average length of the tokens in the path part of the URL.
11. `get_file_length`: This function returns the length of the file part of the URL.
12. `get_dot_count_in_file`: This function returns the number of dots in the file part of the URL.
13. `get_delimiter_count_in_file`: This function returns the number of delimiters in the file part of the URL.
14. `get_arguments_length`: This function returns the length of the arguments part of the URL.
15. `get_number_of_arguments`: This function returns the number of arguments in the URL.
16. `get_length_of_largest_argument_value`: This function returns the length of the largest argument value in the URL.
17. `get_max_delimiters_in_arguments`: This function returns the maximum number of delimiters in the arguments part of the URL.
18. `get_domain_features`: This function returns the following features of the domain part of the URL: length, number of dots, number of hyphens, number of digits, number of special characters, entropy, whether the domain is a subdomain, and whether the domain is an IP address.
19. `get_special_character_count`: This function returns the number of special characters in the URL.
20. `get_entropy`: This function returns the entropy of the URL.
21. `check_url_shortened`: This function returns 1 if the URL is shortened and 0 otherwise.
22. `get_port_number`: This function returns the port number of the URL, if it exists, and -1 otherwise.
23. `get_subdomain_count`: This function returns the number of subdomains in the URL.
24. `get_suspicious_tld`: This function returns 1 if the top-level domain (TLD) of the URL is suspicious (e.g., "tk", "ml", "ga", etc.) and 0 otherwise.
25. `get_numeric_ratio`: This function returns the ratio of numeric characters in the URL.
26. `get_word_count`: This function returns the number of words in the URL.
27. `get_url_is_internationalized`: This function returns 1 if the URL is internationalized and 0 otherwise.

In [2]:
# Helper function to ensure that the URL has a scheme (http:// or https://)
def ensure_scheme(url):
    if not urlparse(url).scheme:
        url = 'http://' + url
    return url

# Feature extraction functions

## Getting the length of the URL
def get_url_length(url):
    return len(url)

## Getting the dot count in the URL
def get_dot_count(url):
    return url.count('.')

## Check if the URL contains security sensitive words
def contains_security_sensitive_words(url):
    security_sensitive_words = ['login', 'signin', 'auth', 'bank', 'update', 'account', 'verification', 'authenticate','authentication','verify','user']
    return int(any(word in url for word in security_sensitive_words))

## Get the directory length
def get_directory_length(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    return len(path)

## Get the subdirectory count
def get_sub_directory_count(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    return path.count('/') - 1

## Get the token count in the path
def get_token_count_in_path(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    tokens = path.split('/')
    return len(tokens) - 1

## Get the length of the largest token in the path
def get_largest_token_length(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    tokens = path.split('/')
    if tokens:
        return max(len(token) for token in tokens)
    return 0

## Get the average token length in the path
def get_average_token_length(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    tokens = [token for token in path.split('/') if token]
    if tokens:
        return np.mean([len(token) for token in tokens])
    return 0

## Get the file length
def get_file_length(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    filename = path.split('/')[-1]
    return len(filename)

## Get the dot count in the file
def get_dot_count_in_file(url):
    url = ensure_scheme(url)
    path = urlparse(url).path
    filename = path.split('/')[-1]
    return filename.count('.')

## Get the delimiter count in the file
def get_delimiter_count_in_file(url):
    path = urlparse(url).path
    filename = path.split('/')[-1]
    delimiters = ['.', '_', '-']
    return sum(filename.count(delimiter) for delimiter in delimiters)

## Get the arguments length
def get_arguments_length(url):
    url = ensure_scheme(url)
    query = urlparse(url).query
    return len(query)

## Get the number of arguments
def get_number_of_arguments(url):
    url = ensure_scheme(url)
    query = urlparse(url).query
    return len(parse_qs(query))

## Get the length of the largest argument value
def get_length_of_largest_argument_value(url):
    url = ensure_scheme(url)
    query = urlparse(url).query
    params = parse_qs(query)
    if params:
        return max(len(max(values, key=len)) for values in params.values())
    return 0

## Get the average argument value length
def get_max_delimiters_in_arguments(url):
    url = ensure_scheme(url)
    query = urlparse(url).query
    params = parse_qs(query)
    delimiters = ['&', '=', '-', '_']
    if params:
        return max(sum(value.count(delimiter) for delimiter in delimiters) for values in params.values() for value in values)
    return 0

## Get the hyphen count in the domain
def get_hyphen_count_in_domain(url):
    url = ensure_scheme(url)
    domain = urlparse(url).netloc
    return domain.count('-')

## Get the digit count in the domain
def contains_ip(url):
    url = ensure_scheme(url)
    domain = urlparse(url).netloc
    try:
        sock.inet_aton(domain)
        return 1
    except:
        return 0

## Get the domain length, token count, largest token length, and average token length
def get_domain_features(url):
    url = ensure_scheme(url)
    domain = urlparse(ensure_scheme(url)).netloc
    tokens = domain.split('.')
    
    # Domain Length
    domain_length = len(domain)
    
    # Count of Tokens in the Domain
    token_count = len(tokens)
    
    # Length of Largest Token in the Domain
    largest_token_length = max(len(token) for token in tokens) if tokens else 0
    
    # Average Domain Token Length
    average_token_length = sum(len(token) for token in tokens) / len(tokens) if tokens else 0
    
    return domain_length, token_count, largest_token_length, average_token_length

## Check if the URL contains special characters
def get_special_character_count(url):
    special_characters = ['@', '=', '+', '*', '?', '&', '%', '$', '#', '!']
    return sum(url.count(char) for char in special_characters)

## Get the entropy of the URL
def get_entropy(url):
    # Count the frequency of each character in the string
    freq = Counter(url)
    # Calculate the probabilities
    probs = [count / len(url) for count in freq.values()]
    # Calculate the Shannon entropy
    entropy = -sum(p * m.log(p, 2) for p in probs if p > 0)
    return entropy

## Check if the URL is shortened
def check_url_shortened(url):
    shortened_services = ['bit.ly', 'tinyurl.com', 'goo.gl', 'ow.ly', 't.co']
    url = ensure_scheme(url)
    domain = urlparse(url).netloc
    return int(domain in shortened_services)

## Get the port number, if specified
def get_port_number(url):
    url = ensure_scheme(url)
    port = urlparse(url).port
    return port if port else -1  # Return -1 if no port specified

## Get the subdomain count
def get_subdomain_count(url):
    url = ensure_scheme(url)
    domain_parts = urlparse(url).netloc.split('.')
    # Count as subdomains any parts beyond the second-level domain and TLD
    return max(0, len(domain_parts) - 2)

## Check if the URL TLD is suspicious (e.g., .xyz, .top, .loan, .win, .club)
def get_suspicious_tld(url):
    suspicious_tlds = ['xyz', 'top', 'loan', 'win', 'club']
    url = ensure_scheme(url)
    tld = urlparse(url).netloc.split('.')[-1]
    return int(tld in suspicious_tlds)

## Get the ratio of numeric characters in the URL
def get_numeric_ratio(url):
    numeric_chars = sum(c.isdigit() for c in url)
    return numeric_chars / len(url) if len(url) > 0 else 0

## Get the word count in the URL
def get_word_count(url):
    words = re.findall(r'\w+', url)
    return len(words)

## Check if the URL is internationalized (i.e., contains non-ASCII characters)
def get_url_is_internationalized(url):
    try:
        url.encode('ascii')
        return 0
    except UnicodeEncodeError:
        return 1

### Now load the dataset and do some preprocessing on it
The dataset is a CSV file with two columns: `url` and `label`. The `url` column contains the URL and the `label` column contains the label of the URL. The label is 1 if the URL is malicious and 0 if the URL is benign. Since the dataset is huge (more than 1 million rows), we will only use a small subset of it for this project (150,000 rows for each class, 300,000 rows in total).

In [3]:
# Load the dataset
df = pd.read_csv('url_dataset_updated.csv')

# Column name for the URL
url_column_name = 'URL'

# Remove 'http://' and 'https://' from all URLs
df[url_column_name] = df[url_column_name].str.replace('http://', '', regex=False)
df[url_column_name] = df[url_column_name].str.replace('https://', '', regex=False)

# Separate the dataset into malicious and benign
malicious_df = df[df['Label'] == 1]
benign_df = df[df['Label'] == 0]

# Randomly sample 150,000 entries from each
malicious_sampled_df = resample(malicious_df, n_samples=150000, random_state=42)
benign_sampled_df = resample(benign_df, n_samples=150000, random_state=42)

# Combine the sampled data
balanced_df = pd.concat([malicious_sampled_df, benign_sampled_df])

# Shuffle the combined dataset to mix malicious and benign URLs
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Drop all duplicates from balanced_df
balanced_df = balanced_df.drop_duplicates()

# Reset the index after dropping duplicates
balanced_df = balanced_df.reset_index(drop=True)

# Now, balanced_df contains the balanced dataset ready for further processing

## Start testing with the options
**NOTE:** We'll split the truncated dataset into a training set and a test set (80% training, 20% test) and use the training set to train the model and the test set to evaluate the model.

### TF-IDF Vectorizer with random forest classifier, no feature extraction
We will start by using the TF-IDF vectorizer to convert the URLs into numerical features and then use a random forest classifier to classify the URLs. We will not use any feature extraction functions for now. This is for reference only, as we will use feature extraction functions and better methods later.
* **Warning:** This will take a long time to run, as the dataset is huge, and TF-IDF vectorization is a slow process. Don't run this unless you have a powerful computer.

In [12]:
# For simplicity, we'll use TF-IDF on the URLs themselves. Advanced features can be added based on URL structure and content.
vectorizer = TfidfVectorizer()

# Prepare the data
X = balanced_df['URL']
y = balanced_df['Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training
# Create a pipeline that first transforms the data using TfidfVectorizer then applies RandomForestClassifier
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=20, random_state=42))

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9046580631323061
Confusion Matrix:
 [[27659   514]
 [ 4820 22953]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.98      0.91     28173
           1       0.98      0.83      0.90     27773

    accuracy                           0.90     55946
   macro avg       0.91      0.90      0.90     55946
weighted avg       0.91      0.90      0.90     55946



### Random forest classifier, with feature extraction
Next thing we will do is use the feature extraction functions to extract features from the URLs and then use a random forest classifier to classify the URLs. We will use the same training and test sets as before.

In [4]:
# Apply feature extraction
features = balanced_df['URL'].apply(lambda x: pd.Series({
    'url_length': get_url_length(x),
    'dot_count': get_dot_count(x),
    'hyphen_count_domain': get_hyphen_count_in_domain(x),
    'security_sensitive_words': contains_security_sensitive_words(x),
    'directory_length': get_directory_length(x),
    'sub_directory_count': get_sub_directory_count(x),
    'token_count_path': get_token_count_in_path(x),
    'largest_token_length': get_largest_token_length(x),
    'average_token_length': get_average_token_length(x),
    'file_length': get_file_length(x),
    'contains_ip': contains_ip(x),
    'dot_count_in_file': get_dot_count_in_file(x),
    'delimiter_count_in_file': get_delimiter_count_in_file(x),
    'arguments_length': get_arguments_length(x),
    'number_of_arguments': get_number_of_arguments(x),
    'length_of_largest_argument_value': get_length_of_largest_argument_value(x),
    'max_delimiters_in_arguments': get_max_delimiters_in_arguments(x),
    'special_character_count': get_special_character_count(x),
    'entropy': get_entropy(x),
    'url_shortened': check_url_shortened(x),
    'port_number': get_port_number(x),
    'subdomain_count': get_subdomain_count(x),
    'suspicious_tld': get_suspicious_tld(x),
    'numeric_ratio': get_numeric_ratio(x),
    'url_is_internationalized': get_url_is_internationalized(x),
    'domain_length': get_domain_features(x)[0],
    'domain_token_count': get_domain_features(x)[1],
    'largest_domain_token_length': get_domain_features(x)[2],
    'average_domain_token_length': get_domain_features(x)[3],
    'word_count': get_word_count(x)
}))


# Concatenate original DF with features
balanced_df_with_features = pd.concat([balanced_df, features], axis=1)

# Define X and y
X = balanced_df_with_features.drop(['Label', 'URL'], axis=1)
y = balanced_df_with_features['Label']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Training
# Since your features are already numerical, directly use RandomForestClassifier without TfidfVectorizer
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.9091266578486398
Confusion Matrix:
 [[26480  1693]
 [ 3391 24382]]
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.94      0.91     28173
           1       0.94      0.88      0.91     27773

    accuracy                           0.91     55946
   macro avg       0.91      0.91      0.91     55946
weighted avg       0.91      0.91      0.91     55946



## TensorFlow neural network, with feature extraction
Now let's use a neural network to classify the URLs. We will use the feature extraction functions to extract features from the URLs and then use a neural network to classify the URLs. We will use the same training and test sets as before. As the same, the training set is 80% and the test set is 20%.

### Preprocessing for Deep Learning Model

In [5]:
# Tokenization and sequence padding parameters
max_len = 100  # Adjust based on the length of the longest URL in your dataset
max_words = 60000  # Adjust based on the size of your vocabulary

# Tokenize the URLs
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(balanced_df_with_features['URL'])
sequences = tokenizer.texts_to_sequences(balanced_df_with_features['URL'])

# Pad the sequences
data = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len)

# Labels
labels = np.asarray(balanced_df_with_features['Label'])
labels = tf.keras.utils.to_categorical(labels)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

### Using RNN model with feature extraction

In [7]:
# RNN model definition
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(max_words, 32, input_length=max_len))
model.add(tf.keras.layers.SimpleRNN(32))
model.add(tf.keras.layers.Dense(2, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

# Define early stopping criteria
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=3, restore_best_weights=True)

# Train the model with early stopping, using the CPU since the model is too simple for GPU acceleration
with tf.device('/device:CPU:0'):
    history = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.2, callbacks=[early_stopping])

# Predictions
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test_classes, y_pred_classes))
print("Confusion Matrix:\n", confusion_matrix(y_test_classes, y_pred_classes))
print("Classification Report:\n", classification_report(y_test_classes, y_pred_classes))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Accuracy: 0.9522217852929611
Confusion Matrix:
 [[27459   714]
 [ 1959 25814]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.97      0.95     28173
           1       0.97      0.93      0.95     27773

    accuracy                           0.95     55946
   macro avg       0.95      0.95      0.95     55946
weighted avg       0.95      0.95      0.95     55946



#### Optimize the TensorFlow neural network using RNN and Adam optimizer
Now let's optimize the neural network using RNN and the Adam optimizer. We will use the feature extraction functions to extract features from the URLs and then use a neural network with RNN and the Adam optimizer to classify the URLs. We will use the same training and test sets as before. Since the model now is much more complex, we'll use TensorFlow GPU acceleration to speed up the training process, if available.

In [8]:
# Tokenization and sequence padding parameters
max_len = 100  # Adjust based on the length of the longest URL in your dataset
max_words = 60000  # Adjust based on the size of your vocabulary

# Tokenize the URLs
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=True)
tokenizer.fit_on_texts(balanced_df['URL'])
sequences = tokenizer.texts_to_sequences(balanced_df['URL'])

# Pad the sequences
data = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len)

# Labels
labels = np.asarray(balanced_df['Label'])
labels = tf.keras.utils.to_categorical(labels)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)


# Model definition
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(max_words, 32, input_length=max_len))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))) # Add return_sequences if stacking LSTM layers
model.add(tf.keras.layers.Dropout(0.5))  # Adjust dropout rate as needed
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)))  # Second LSTM layer, without return_sequences
model.add(tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.001)))  # Add L2 regularization
model.add(tf.keras.layers.Dropout(0.5))  # Adjust dropout rate as needed
model.add(tf.keras.layers.Dense(2, activation='softmax'))

model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['acc'])

# Early stopping
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=3, restore_best_weights=True)

# Train the model with early stopping
try:
    with tf.device('/device:GPU:0'): # Use GPU for training, as the model is more complex and very slow to train on CPU
        print("Training the model on GPU\n")
        history = model.fit(X_train, y_train, epochs=30, batch_size=128, validation_split=0.2, callbacks=[early_stopping])

# If you don't have a GPU, you can train the model on the CPU
except Exception as e:
    print("Could not train the model on GPU. Training on CPU.\n")
    history = model.fit(X_train, y_train, epochs=30, batch_size=128, validation_split=0.2, callbacks=[early_stopping])

# Predictions
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)

# Evaluation
print("Accuracy:", accuracy_score(y_test_classes, y_pred_classes))
print("Confusion Matrix:\n", confusion_matrix(y_test_classes, y_pred_classes))
print("Classification Report:\n", classification_report(y_test_classes, y_pred_classes))

Training the model on GPU

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
 240/1399 [====>.........................] - ETA: 1:14 - loss: 0.0958 - acc: 0.9636

In [10]:
# Save the trained model
model.save("RNN_and_Adam_trained_model.h5")

print("Model saved successfully.")

Model saved successfully.


#### Another optimization for the RNN model

In [None]:
# Tokenization and sequence padding parameters
max_len = 100  # Adjust based on the length of the longest URL in your dataset
max_words = 60000  # Adjust based on the size of your vocabulary

# Tokenize the URLs
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words, char_level=True)
tokenizer.fit_on_texts(balanced_df['URL'])
sequences = tokenizer.texts_to_sequences(balanced_df['URL'])

# Pad the sequences
data = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_len)

# Labels
labels = np.asarray(balanced_df['Label'])
labels = tf.keras.utils.to_categorical(labels)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

# RNN model definition
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Embedding(max_words, 32, input_length=max_len)) # Embedding layer
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))) # Bidirectional LSTM layer

# Multi-layer dense network
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))

# Dropout layer
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.Dense(2, activation='softmax'))

# Use Adam optimizer with a lower learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['acc'])

# Define early stopping criteria
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=3, restore_best_weights=True)

# Train the model with early stopping
try:
    # Use GPU for training, as the model is more complex and very slow to train on CPU
    with tf.device('/device:GPU:0'):
        print("Training the model on GPU\n")
        history = model.fit(X_train, y_train, epochs=30, batch_size=128, validation_split=0.2, callbacks=[early_stopping])

# If you don't have a GPU, you can train the model on the CPU
except Exception as e:
    print("Could not train the model on GPU. Training on CPU.\n")
    history = model.fit(X_train, y_train, epochs=30, batch_size=128, validation_split=0.2, callbacks=[early_stopping])

# Predictions
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_test_classes = np.argmax(y_test, axis=1)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test_classes, y_pred_classes))
print("Confusion Matrix:\n", confusion_matrix(y_test_classes, y_pred_classes))
print("Classification Report:\n", classification_report(y_test_classes, y_pred_classes))