We completed Exploratory Data Analysis (EDA) in a previous notebook and extracted the top 10 features. Now let's train a model using these features and compare the results.
Load Data:

Load the dataset that we used in the EDA process. Ensure it includes the top 10 features and the target variable.
Data Preprocessing:

If there are any missing values or categorical variables, we should handle them appropriately (impute missing values, encode categorical variables, etc.).
Feature Selection:

Ensure that the dataset includes only the top 10 features we extracted during EDA.
Train-Test Split:

Split the dataset into training and testing sets. This allows us to train the model on one subset of the data and evaluate its performance on another.
Model Training:

Choose a machine learning model based on the nature of our problem (classification or regression). Common choices include decision trees, random forests, support vector machines, or neural networks.
Train the model using the training dataset.
Model Evaluation:

Evaluate the model's performance on the testing dataset. Common metrics include accuracy, precision, recall, F1 score (for classification), or mean squared error (for regression).
Comparison:

If we have the results from a previous model or baseline, we should compare the performance metrics to see if the model trained on the top 10 features improves over the previous one.
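
As a rough illustration of the split, train, and evaluate steps described above, the sketch below runs them on synthetic data. The feature matrix, the weight vector `true_w`, and all variable names are made up for this example and are not part of the URL dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 rows, 10 numeric features (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 10))
true_w = rng.normal(size=10)
y_demo = (X_demo @ true_w > 0).astype(int)

# Train-test split: hold out 25% of the rows, keeping class proportions similar
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Model training
demo_model = LogisticRegression(max_iter=1000)
demo_model.fit(X_tr, y_tr)

# Model evaluation on the held-out rows
demo_accuracy = accuracy_score(y_te, demo_model.predict(X_te))
print("Held-out accuracy:", demo_accuracy)
```

The same three calls (`train_test_split`, `fit`, `accuracy_score`) apply once the real feature table is in place.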

In [8]:
# importing libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns


In [9]:
# importing dataset
df = pd.read_csv('/content/urldata.csv')

In [None]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,url,label,result
0,0,https://www.google.com,benign,0.0
1,1,https://www.youtube.com,benign,0.0
2,2,https://www.facebook.com,benign,0.0
3,3,https://www.baidu.com,benign,0.0
4,4,https://www.wikipedia.org,benign,0.0
5,5,https://www.reddit.com,benign,0.0
6,6,https://www.yahoo.com,benign,0.0
7,7,https://www.google.co.in,benign,0.0
8,8,https://www.qq.com,benign,0.0
9,9,https://www.amazon.com,benign,0.0


In [None]:
# understanding the dataset
df.describe(include='all')

Unnamed: 0.1,Unnamed: 0,url,label,result
count,450176.0,450176,450176,450176.0
unique,,450176,2,
top,,https://www.google.com,benign,
freq,,1,345738,
mean,225087.5,,,0.231994
std,129954.761729,,,0.422105
min,0.0,,,0.0
25%,112543.75,,,0.0
50%,225087.5,,,0.0
75%,337631.25,,,0.0


In [10]:
# Removing the unnamed column as it is not necessary.
urldata = df.drop('Unnamed: 0',axis=1)

In [None]:
urldata.head(10)

Unnamed: 0,url,label,result
0,https://www.google.com,benign,0
1,https://www.youtube.com,benign,0
2,https://www.facebook.com,benign,0
3,https://www.baidu.com,benign,0
4,https://www.wikipedia.org,benign,0
5,https://www.reddit.com,benign,0
6,https://www.yahoo.com,benign,0
7,https://www.google.co.in,benign,0
8,https://www.qq.com,benign,0
9,https://www.amazon.com,benign,0


In [11]:
urldata.shape

(450176, 3)

The dataset has:
- 450176 rows
- 3 columns

In [12]:
label_counts = urldata['label'].value_counts()
print(label_counts)


benign       345738
malicious    104438
Name: label, dtype: int64
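
The counts above show a class imbalance of roughly 77% benign to 23% malicious. As a reference point for the comparison step, the short calculation below derives the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
benign = 345738
malicious = 104438
total = benign + malicious

malicious_share = malicious / total
majority_baseline = benign / total  # accuracy of always predicting "benign"

print(f"Malicious share: {malicious_share:.6f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.6f}")
```

The malicious share agrees with the mean of `result` reported by `describe()`. Any model scoring below the majority-class baseline on the full data is doing worse than guessing "benign" every time.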


In [None]:
# Basic data check
urldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     450176 non-null  object
 1   label   450176 non-null  object
 2   result  450176 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 10.3+ MB


In [None]:
# checking for missing values
urldata.isnull().sum()

url       0
label     0
result    0
dtype: int64

Observation: There are no missing or null values.

## Feature Engineering

The following features will be extracted from the URL for classification.

1. Length Features:
   - Length Of Url
   - Length of Hostname
   - Length Of Path
   - Length Of First Directory
   - Length Of Top Level Domain
   
2. Count Features:
   - Count Of '-'
   - Count Of '@'
   - Count Of '?'
   - Count Of '%'
   - Count Of '.'
   - Count Of '='
   - Count Of 'http'
   - Count of 'https'
   - Count Of 'www'
   - Count Of Digits
   - Count Of Letters
   - Count Of Number Of Directories
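
The features listed above can be sketched as a single function over one URL string. This is a minimal sketch, not the notebook's own implementation: the dictionary keys such as `path_length`, `fd_length`, and `count?` are assumed names, the "first directory" is taken to be the first non-empty path segment, and the top-level-domain length is left out because it needs the third-party `tld` package.

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Return length and count features for a single URL (illustrative sketch)."""
    parsed = urlparse(url)
    path = parsed.path
    # First directory: the first non-empty path segment, if there is one
    segments = [s for s in path.split('/') if s]
    first_dir = segments[0] if segments else ''
    return {
        # Length features
        'url_length': len(url),
        'hostname_length': len(parsed.netloc),
        'path_length': len(path),
        'fd_length': len(first_dir),
        # Count features
        'count-': url.count('-'),
        'count@': url.count('@'),
        'count?': url.count('?'),
        'count%': url.count('%'),
        'count.': url.count('.'),
        'count=': url.count('='),
        'count-http': url.count('http'),
        'count-https': url.count('https'),
        'count-www': url.count('www'),
        'count-digits': sum(ch.isnumeric() for ch in url),
        'count-letters': sum(ch.isalpha() for ch in url),
        'count_dir': path.count('/'),
    }


features = extract_url_features('http://ecct-it.com/docmmmnn/aptgd/index.php')
print(features)
```

Applied to a DataFrame, a function like this could be used with `urldata['url'].apply(extract_url_features)` followed by `pd.DataFrame(list(...))`, although the cells below build the columns one at a time.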


#### 1. Length features

In [None]:
!pip install tld

Collecting tld
  Downloading tld-0.13-py2.py3-none-any.whl (263 kB)
Installing collected packages: tld
Successfully installed tld-0.13


In [6]:
from urllib.parse import urlparse
from tld import get_tld
import os.path

In [13]:
#Length of URL
urldata['url_length'] = urldata['url'].apply(lambda i: len(str(i)))
#hostname length
urldata['hostname_length'] = urldata['url'].apply(lambda i: len(urlparse(i).netloc))
urldata.head()

Unnamed: 0,url,label,result,url_length,hostname_length
0,https://www.google.com,benign,0,22,14
1,https://www.youtube.com,benign,0,23,15
2,https://www.facebook.com,benign,0,24,16
3,https://www.baidu.com,benign,0,21,13
4,https://www.wikipedia.org,benign,0,25,17


#### 2. Count features

In [14]:
urldata['count-'] = urldata['url'].apply(lambda i: i.count('-'))
urldata['count@'] = urldata['url'].apply(lambda i: i.count('@'))
urldata['count-https'] = urldata['url'].apply(lambda i : i.count('https'))
urldata['count-www'] = urldata['url'].apply(lambda i: i.count('www'))
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits
urldata['count-digits']= urldata['url'].apply(lambda i: digit_count(i))
def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters
urldata['count-letters']= urldata['url'].apply(lambda i: letter_count(i))
def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')
urldata['count_dir'] = urldata['url'].apply(lambda i: no_of_dir(i))

In [None]:
urldata.columns

Index(['url', 'count-', 'count@', 'count-https', 'count-www', 'count-digits',
       'count-letters', 'count_dir', 'label', 'url_length', 'hostname_length'],
      dtype='object')

Data after extracting Count Features

In [15]:
label = urldata['result']
# Drop the text 'label' column, then move the numeric 'result' column to the end and rename it 'label'
urldata = urldata.drop("label", axis=1)
urldata = urldata.drop("result", axis=1)
urldata['label']=label

In [17]:
urldata.tail()

Unnamed: 0,url,url_length,hostname_length,count-,count@,count-https,count-www,count-digits,count-letters,count_dir,label
450171,http://ecct-it.com/docmmmnn/aptgd/index.php,43,11,1,0,0,0,0,34,3,1
450172,http://faboleena.com/js/infortis/jquery/plugin...,159,13,0,0,0,0,21,118,12,1
450173,http://faboleena.com/js/infortis/jquery/plugin...,147,13,0,0,0,0,20,109,12,1
450174,http://atualizapj.com/,22,14,0,0,0,0,0,17,1,1
450175,http://writeassociate.com/test/Portal/inicio/I...,143,18,1,0,0,1,9,118,7,1


In [None]:
# Saving the file for later use after feature engineering and EDA
# This file contains the processed data with additional features and insights.

urldata.to_csv('/content/complete_data.csv',index=False)

# Training Data

In [18]:
from sklearn.model_selection import train_test_split

# Separate data into Class 0 and Class 1
class_0_data = urldata[urldata['label'] == 0]
class_1_data = urldata[urldata['label'] == 1]
train_class_0 = class_0_data.head(1000)
train_class_1 = class_1_data.head(1000)

# The remaining records will be used for testing
test_class_0 = class_0_data.tail(len(class_0_data) - 1000)
test_class_1 = class_1_data.tail(len(class_1_data) - 1000)

# print("class 0 set size:", len(class_0_data))
# print("class 1 set size:", len(class_1_data))

# Concatenate the training and testing sets for both classes
training_data = pd.concat([train_class_0, train_class_1])
test_data = pd.concat([test_class_0, test_class_1])

# Shuffle the data to mix both classes
training_data = training_data.sample(frac=1, random_state=42).reset_index(drop=True)
test_data = test_data.sample(frac=1, random_state=42).reset_index(drop=True)

# 'training_data' now holds 1,000 rows from each class (a balanced 50/50 sample)
# 'test_data' holds the remaining rows for testing later
print("Initial Train set size:", len(training_data))
print("Test set size for analysis and further training:", len(test_data))

Initial Train set size: 2000
Test set size for analysis and further training: 448176


In [19]:
label_counts = training_data['label'].value_counts()
print(label_counts)
label_counts = test_data['label'].value_counts()
print(label_counts)

1    1000
0    1000
Name: label, dtype: int64
0    344738
1    103438
Name: label, dtype: int64
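
Because `head(1000)` takes rows in file order, the training sample depends on how the CSV happens to be sorted (the first benign rows shown earlier are all bare homepages such as https://www.google.com), so it may not be representative. One possible alternative, sketched below on a synthetic table, is to draw the same number of rows per class at random. The `demo` frame and its column values are invented for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for urldata: 770 rows of class 0 and 230 rows of class 1
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'url_length': rng.integers(10, 200, size=1000),
    'label': np.repeat([0, 1], [770, 230]),
})

# Draw 100 random rows per class instead of the first 100 rows per class
balanced_train = demo.groupby('label').sample(n=100, random_state=42)

# Everything not drawn for training stays in the test set
remaining_test = demo.drop(balanced_train.index)

# Shuffle the training rows so the two classes are mixed
balanced_train = balanced_train.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced_train['label'].value_counts())
print(len(remaining_test))
```

The training set stays balanced as in the cell above, but each class is now sampled from the whole file rather than from its first rows.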


In [None]:
training_data.head()

Unnamed: 0,url,url_length,hostname_length,count-,count@,count-https,count-www,count-digits,count-letters,count_dir,label
0,http://www.collex.com.vn/online/onlinebanking/...,55,17,0,0,0,1,0,45,3,1
1,https://www.sciencedirect.com,29,21,0,0,1,1,0,24,0,0
2,http://coughcrops.co.za/invoice/b80876e0cd2fb0...,65,16,0,0,0,0,23,34,3,1
3,https://www.yy08047.com,23,15,0,0,1,1,5,13,0,0
4,https://pigce.edu.in/id/6a0d38012680659defdcd6...,57,12,0,0,1,0,20,29,3,1


In [20]:
training_data = training_data.drop("url", axis=1)

In [21]:
Y_train = training_data['label']
X_train = training_data.drop('label', axis=1)

In [22]:
X_train.head()

Unnamed: 0,url_length,hostname_length,count-,count@,count-https,count-www,count-digits,count-letters,count_dir
0,55,17,0,0,0,1,0,45,3
1,29,21,0,0,1,1,0,24,0
2,65,16,0,0,0,0,23,34,3
3,23,15,0,0,1,1,5,13,0
4,57,12,0,0,1,0,20,29,3


# Training with Logistic Regression

In [25]:
# Importing
from sklearn.linear_model import LogisticRegression

In [26]:
log_model = LogisticRegression(warm_start=True)
log_model.fit(X_train,Y_train)

# Test Data

In [27]:
y_test = test_data['label']
x_test = test_data.drop('url',axis=1)
x_test = x_test.drop('label',axis=1)

In [28]:
test_data

Unnamed: 0,url,url_length,hostname_length,count-,count@,count-https,count-www,count-digits,count-letters,count_dir,label
0,https://www.chris.pirillo.com/russian-films-be...,65,21,5,0,1,1,0,52,2,0
1,https://www.uk.ask.com/wiki/Hodgson,35,14,0,0,1,1,0,27,2,0
2,https://www.oakridgefuneralcare.com/stories/20...,76,27,0,0,1,1,12,52,5,0
3,https://www.somethingnoir.com/,30,21,0,0,1,1,0,24,1,0
4,http://kghugheslaw.com/appr/new.php?cmd=login_...,193,15,0,0,0,0,72,107,2,1
...,...,...,...,...,...,...,...,...,...,...,...
448171,https://www.namesdatabase.com/schools/US/CA/Oa...,82,21,0,0,1,1,6,63,5,0
448172,http://busiclean.com/msds/ugo/trustpass.html,44,13,0,0,0,0,0,36,3,1
448173,https://www.amazon.com/Blood-Simple-John-Getz/...,59,14,3,0,1,1,6,42,3,0
448174,https://www.bassinusa.com/forum/ubbthreads.php...,72,17,0,0,1,1,5,55,2,0


In [29]:
x_test.head()

Unnamed: 0,url_length,hostname_length,count-,count@,count-https,count-www,count-digits,count-letters,count_dir
0,65,21,5,0,1,1,0,52,2
1,35,14,0,0,1,1,0,27,2
2,76,27,0,0,1,1,12,52,5
3,30,21,0,0,1,1,0,24,1
4,193,15,0,0,0,0,72,107,2


# Predictions

In [34]:
log_predictions = log_model.predict(x_test)
log_prob = log_model.predict_proba(x_test)
# print(x_test)

In [35]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
accuracy = accuracy_score(y_test, log_predictions)
print("Accuracy:", accuracy)


confidence_scores = log_model.decision_function(x_test)
# confidence_scores

# conf_matrix = confusion_matrix(y_test, log_predictions)
# print("Confusion Matrix:\n", conf_matrix)


# precision = precision_score(y_test, log_predictions)
# recall = recall_score(y_test, log_predictions)
# print("Precision:", precision)
# print("Recall:", recall)


# classification_rep = classification_report(y_test, log_predictions)
# print("Classification Report:\n", classification_rep)

Accuracy: 0.3034075898754061


Saving the model to a pickle file

In [None]:
import pickle

# Use a context manager so the file handle is closed after writing
with open('LR_model.pkl', 'wb') as f:
    pickle.dump(log_model, f)

# Logistic Regression with Grid Search CV

In [36]:
from sklearn.model_selection import GridSearchCV

In [37]:
log_params = {
    'penalty': ['l2'],  # Use only 'l2' penalty for lbfgs solver
    'C': np.logspace(-3, 3, 7),  # Reduce the search space
    'solver': ['lbfgs'],  # Use only 'lbfgs' solver
    'warm_start': [True]
}


lr_gs = GridSearchCV(log_model, log_params, cv=3, verbose=1, n_jobs=-1)
lr_gs.fit(X_train,Y_train)
lr_gs_pred = lr_gs.predict(x_test)

accuracy = accuracy_score(y_test, lr_gs_pred)
print("Accuracy:", accuracy)


# Take confidence scores from the tuned model rather than the untuned log_model
confidence_scores = lr_gs.decision_function(x_test)



Fitting 3 folds for each of 7 candidates, totalling 21 fits
Accuracy: 0.2328281746456749


In [38]:
y_test

0         0
1         0
2         0
3         0
4         1
         ..
448171    0
448172    1
448173    0
448174    0
448175    0
Name: label, Length: 448176, dtype: int64

In [39]:
# Take the first 10 records of the test dataset
X_incremental = x_test.iloc[:10]  # features
Y_incremental = y_test.iloc[:10]  # matching labels

# Fit a fresh model on these 10 records only; warm_start has no effect on a
# new estimator, so nothing is carried over from log_model
log_model2 = LogisticRegression(warm_start=True)
log_model2.fit(X_incremental, Y_incremental)


In [40]:
x_test2 = x_test.iloc[10:]
y_test2 = y_test.iloc[10:]

In [41]:
log_predictions = log_model2.predict(x_test2)
log_prob = log_model2.predict_proba(x_test2)
# print(x_test)

# from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
# y_pred = svm.predict(X_test)


accuracy = accuracy_score(y_test2, log_predictions)
print("Accuracy:", accuracy)
confidence_scores = log_model2.decision_function(x_test2)

Accuracy: 0.6225907364681836


In [42]:
# Take the first 100 of the remaining test records
X_incremental2 = x_test2.iloc[:100]  # features
Y_incremental2 = y_test2.iloc[:100]  # matching labels

# Fit a fresh model on these 100 records only (log_model2 is replaced, not updated)
log_model2 = LogisticRegression(warm_start=True)
log_model2.fit(X_incremental2, Y_incremental2)


In [43]:
x_test3 = x_test2.iloc[100:]
y_test3 = y_test2.iloc[100:]

In [44]:
log_predictions = log_model2.predict(x_test3)
log_prob = log_model2.predict_proba(x_test3)
# print(x_test)

# from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
# y_pred = svm.predict(X_test)


accuracy = accuracy_score(y_test3, log_predictions)
print("Accuracy:", accuracy)
confidence_scores = log_model2.decision_function(x_test3)

Accuracy: 0.9829712586984953


# Training with SGD Classifier

In [45]:
from sklearn.linear_model import SGDClassifier

In [46]:
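# loss='log' gives logistic regression; scikit-learn 1.1 renamed it to 'log_loss' and 1.3 removed the old name
SGD_model = SGDClassifier(loss='log', warm_start=True, alpha=0.0001, l1_ratio=0.09, penalty='elasticnet')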

In [48]:
SGD_model.fit(X_train,Y_train)

In [49]:
import pickle

# Use a context manager so the file handle is closed after writing
with open('SGD_classifier.pkl', 'wb') as f:
    pickle.dump(SGD_model, f)

In [50]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
sgd_pred = SGD_model.predict(x_test)


accuracy = accuracy_score(y_test, sgd_pred)
print("Accuracy:", accuracy)


confidence_scores = SGD_model.decision_function(x_test)


conf_matrix = confusion_matrix(y_test, sgd_pred)
print("Confusion Matrix:\n", conf_matrix)


precision = precision_score(y_test, sgd_pred)
recall = recall_score(y_test, sgd_pred)
print("Precision:", precision)
print("Recall:", recall)


# classification_rep = classification_report(y_test, log_predictions)
# print("Classification Report:\n", classification_rep)
# print(x_test)


Accuracy: 0.2773263182321231
Confusion Matrix:
 [[ 22834 321904]
 [  1981 101457]]
Precision: 0.23964654278499908
Recall: 0.9808484309441404
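
To make the link between the confusion matrix and the reported metrics explicit, the sketch below recomputes accuracy, precision, and recall by hand from the four counts printed above. It assumes scikit-learn's layout (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true class (0, 1), columns = predicted class (0, 1)
tn, fp = 22834, 321904
fn, tp = 1981, 101457

cm_accuracy = (tp + tn) / (tp + tn + fp + fn)
cm_precision = tp / (tp + fp)   # of all URLs flagged malicious, how many really are
cm_recall = tp / (tp + fn)      # of all malicious URLs, how many were flagged
cm_f1 = 2 * cm_precision * cm_recall / (cm_precision + cm_recall)

print("Accuracy:", cm_accuracy)
print("Precision:", cm_precision)
print("Recall:", cm_recall)
print("F1:", cm_f1)
```

The 321,904 false positives dominate the matrix: the model flags most benign URLs as malicious, which is why recall is high while precision and accuracy are low.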


In [51]:

# Take the first 10 records of the test dataset
X_incremental = x_test.iloc[:10]  # features
Y_incremental = y_test.iloc[:10]  # matching labels

# Fit a fresh SGDClassifier (default hinge loss) on these 10 records only
SGD_model2 = SGDClassifier(warm_start=True)
SGD_model2.fit(X_incremental, Y_incremental)


In [52]:
# Drop the first 1,000 test rows; this overwrites x_test and y_test for all later cells
x_test = x_test.iloc[1000:]
y_test = y_test.iloc[1000:]

In [53]:

sgd_pred = SGD_model2.predict(x_test)


accuracy = accuracy_score(y_test, sgd_pred)
print("Accuracy:", accuracy)


confidence_scores = SGD_model2.decision_function(x_test)


conf_matrix = confusion_matrix(y_test, sgd_pred)
print("Confusion Matrix:\n", conf_matrix)


precision = precision_score(y_test, sgd_pred)
recall = recall_score(y_test, sgd_pred)
print("Precision:", precision)
print("Recall:", recall)


# classification_rep = classification_report(y_test, log_predictions)
# print("Classification Report:\n", classification_rep)
# print(x_test)


Accuracy: 0.7695515859527345
Confusion Matrix:
 [[343972      6]
 [103045    153]]
Precision: 0.9622641509433962
Recall: 0.0014825868718386015
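
The cells above fit a new estimator on a small slice, so nothing learned from the original training set is retained. If the goal is to keep updating one model as new records arrive, scikit-learn's `partial_fit` is one option. The sketch below shows the idea on synthetic data; `X_stream`, `y_stream`, and `inc_model` are invented names, and the batch sizes are arbitrary.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic stream of 300 rows with 3 features (an assumption for illustration)
rng = np.random.default_rng(0)
X_stream = rng.normal(size=(300, 3))
y_stream = (X_stream[:, 0] + X_stream[:, 1] > 0).astype(int)

inc_model = SGDClassifier(random_state=0)

# The first call must list every class the model will ever see
inc_model.partial_fit(X_stream[:100], y_stream[:100], classes=np.array([0, 1]))
coef_after_first_batch = inc_model.coef_.copy()

# Later calls update the same weights using only the new batch
inc_model.partial_fit(X_stream[100:200], y_stream[100:200])

# Evaluate on rows the model has not been trained on
holdout_accuracy = inc_model.score(X_stream[200:], y_stream[200:])
print("Hold-out accuracy after two batches:", holdout_accuracy)
```

Unlike calling `fit` on a new object, each `partial_fit` call starts from the current coefficients, so earlier batches still influence the model.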


# SGD with Grid Search CV

In [54]:
sgd_params = {
    'loss':['log'],
    'penalty':['elasticnet'],
    'alpha':np.logspace(-4, 4, 10),
    'l1_ratio':[0.05,0.06,0.07,0.08,0.09,0.1,0.12,0.13,0.14,0.15,0.2]
}

In [55]:
sgd_gs = GridSearchCV(SGD_model, sgd_params, cv=5, verbose=1, n_jobs=5)

In [56]:
sgd_gs.fit(X_train,Y_train)

Fitting 5 folds for each of 110 candidates, totalling 550 fits
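
The "candidates" and "fits" figures in the grid search logs follow directly from the parameter grids. As a quick check, the sketch below counts the combinations with `ParameterGrid` and multiplies by the number of folds. The two dictionaries repeat the grids used in this notebook under the illustrative names `log_grid` and `sgd_grid`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as log_params: only C varies
log_grid = {
    'penalty': ['l2'],
    'C': np.logspace(-3, 3, 7),
    'solver': ['lbfgs'],
    'warm_start': [True],
}

# Same grid as sgd_params: alpha and l1_ratio vary
sgd_grid = {
    'loss': ['log'],
    'penalty': ['elasticnet'],
    'alpha': np.logspace(-4, 4, 10),
    'l1_ratio': [0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.12, 0.13, 0.14, 0.15, 0.2],
}

# np.logspace(-3, 3, 7) spans 0.001 to 1000 in powers of ten
c_values = np.logspace(-3, 3, 7)
print(c_values)

log_candidates = len(ParameterGrid(log_grid))
sgd_candidates = len(ParameterGrid(sgd_grid))
log_fits = log_candidates * 3   # cv=3
sgd_fits = sgd_candidates * 5   # cv=5
print(log_candidates, log_fits, sgd_candidates, sgd_fits)
```

The totals line up with the logged 21 fits for logistic regression and 550 fits for the SGD classifier.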


In [57]:
sgd_gs.best_params_

{'alpha': 0.0001, 'l1_ratio': 0.14, 'loss': 'log', 'penalty': 'elasticnet'}

In [58]:
sgd_gs.score(x_test, y_test)

0.31261069467055475

In [59]:
sgd_gs2 = GridSearchCV(SGD_model2, sgd_params, cv=5, verbose=1, n_jobs=5)
sgd_gs2.fit(X_incremental, Y_incremental)
# Note: this scores sgd_gs, not the newly fitted sgd_gs2, so the value matches the previous cell
sgd_gs.score(x_test, y_test)

Fitting 5 folds for each of 110 candidates, totalling 550 fits


0.31261069467055475

# SVM Classifier

In [60]:
svm_model = SGDClassifier(loss='hinge', warm_start=True)
svm_sgd_params = {
    'alpha': [0.0001, 0.001, 0.01],  # Regularization parameter
    'penalty': ['l1', 'l2'],        # Penalty term
}

svm_sgd_gs = GridSearchCV(svm_model, svm_sgd_params, cv=5, verbose=1, n_jobs=-1)
svm_sgd_gs = svm_sgd_gs.fit(X_train,Y_train)
svm_pred = svm_sgd_gs.predict(x_test)


accuracy = accuracy_score(y_test, svm_pred)
print("Accuracy:", accuracy)

# Take confidence scores from the tuned SVM rather than from log_model
confidence_scores = svm_sgd_gs.decision_function(x_test)

conf_matrix = confusion_matrix(y_test, svm_pred)
print("Confusion Matrix:\n", conf_matrix)

precision = precision_score(y_test, svm_pred)
recall = recall_score(y_test, svm_pred)
print("Precision:", precision)
print("Recall:", recall)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
Accuracy: 0.2378459487986833
Confusion Matrix:
 [[  3216 340762]
 [    55 103143]]
Precision: 0.2323537693875942
Recall: 0.99946704393496


In [61]:
from sklearn.ensemble import RandomForestClassifier

# Example usage with warm start
rfc_model = RandomForestClassifier(warm_start=True)
rfc_model.fit(X_train, Y_train)
rfc_model.n_estimators += 10  # Increase the number of trees in a warm start fashion
rfc_model.fit(X_train, Y_train)  # Warm start with the previous solution
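
For a random forest, `warm_start=True` means a second `fit` call keeps the trees already built and only trains the additional ones. The sketch below checks this on synthetic data by counting the fitted trees before and after raising `n_estimators`; `X_toy`, `y_toy`, and `forest` are invented names, and the starting size of 20 trees is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 rows, 4 features (an assumption for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)
forest.fit(X_toy, y_toy)
trees_before = len(forest.estimators_)

# Ask for 10 more trees; the second fit adds them to the existing ensemble
forest.n_estimators += 10
forest.fit(X_toy, y_toy)
trees_after = len(forest.estimators_)

print(trees_before, trees_after)
```

Note that the existing trees are not retrained, so passing new data to the second `fit` would affect only the added trees.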


In [62]:
rfc_predictions = rfc_model.predict(x_test)
rfc_prob = rfc_model.predict_proba(x_test)
# print(x_test)

In [63]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, classification_report
accuracy = accuracy_score(y_test, rfc_predictions)
print("Accuracy:", accuracy)


# Random forests have no decision_function; use the positive-class probability instead
confidence_scores = rfc_prob[:, 1]
# confidence_scores

Accuracy: 0.23406443995205467


In [None]:
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Despite the name, no LSTM layer is added below: the model is a single Dense sigmoid unit
lstm_model = Sequential()

Y_train = Y_train.astype('int64')
Y_incremental = Y_incremental.astype('int64')
lstm_model.add(Dense(1, activation='sigmoid'))
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Initial training
lstm_model.fit(X_train, Y_train, epochs=10)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7b4616ec92d0>

In [None]:

# Use the model to predict labels for the test data
predictions = lstm_model.predict(x_test)
binary_predictions = (predictions > 0.5).astype('int64')

# Calculate accuracy
accuracy = accuracy_score(y_test, binary_predictions)

print(f"Accuracy: {accuracy * 100:.2f}%")


Accuracy: 32.82%


In [None]:
X_incremental = x_test.iloc[:1000]  # first 1,000 remaining test rows (features)
Y_incremental = y_test.iloc[:1000]  # matching labels

# Incremental training (continue from the existing weights)
lstm_model.fit(X_incremental, Y_incremental, epochs=5)


x_test = x_test.iloc[1000:]  # evaluate only on rows not used for the update
y_test = y_test.iloc[1000:]

# Use the model to predict labels for the test data
predictions = lstm_model.predict(x_test)
binary_predictions = (predictions > 0.5).astype('int64')

# Calculate accuracy
accuracy = accuracy_score(y_test, binary_predictions)

print(f"Accuracy: {accuracy * 100:.2f}%")



Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 78.04%


Overall, the initial accuracy is low for every algorithm, roughly 23% to 33%. Plain logistic regression (30.3%) is near the top of that range, so we will proceed with it.