**Phishing Email Detection - URL Dataset Preparation**

This notebook prepares the "PhiUSIIL Phishing URL (Website)" dataset, obtained from the UCI Machine Learning Repository. It can be found [here](https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset).

It contains 235,795 URLs: 134,850 legitimate (label 1) and 100,945 phishing (label 0).

Prasad, A. & Chandra, S. (2024). PhiUSIIL Phishing URL (Website) [Dataset]. UCI Machine Learning Repository. https://doi.org/10.1016/j.cose.2023.103545.

In [None]:
# numpy for numerical computation, pandas for data manipulation
import numpy as np
import pandas as pd
# drive module connects this Colab notebook to the Google Drive the data is retrieved from
from google.colab import drive

In [None]:
import matplotlib.pyplot as plt

In [None]:
from imblearn.over_sampling import RandomOverSampler
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import Counter

**Import the dataset from Google Drive**

In [None]:
# Mount Google Drive so the dataset can be read directly from it
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/CNS Project II/Phishing Email Detection/PhiUSIIL_Phishing_URL_Dataset.csv', encoding = "ISO-8859-1")
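The `ï»¿FILENAME` column name that shows up in the `df.info()` output is the usual sign of a UTF-8 byte-order mark (BOM) being decoded as ISO-8859-1. The sketch below reproduces the effect on a tiny in-memory CSV (a made-up stand-in, not the real file) and shows that the `utf-8-sig` codec strips the BOM. It assumes the real file is valid UTF-8, which has not been verified here; if it is not, `utf-8-sig` would raise a decode error.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV: a UTF-8 BOM followed by a header and one row
raw = b"\xef\xbb\xbfFILENAME,URL,label\n1.txt,https://example.com,1\n"

# ISO-8859-1 maps the three BOM bytes to the characters "ï»¿"
latin1_cols = list(pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1").columns)

# "utf-8-sig" consumes the BOM, so the first header comes out clean
clean_cols = list(pd.read_csv(io.BytesIO(raw), encoding="utf-8-sig").columns)

print(latin1_cols[0])
print(clean_cols[0])
```

Because only the `URL` and `label` columns are kept later on, the garbled header does not affect the saved dataset.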

**Exploratory Data Analysis (EDA)**

In [None]:
# basic info
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235795 entries, 0 to 235794
Data columns (total 56 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   ï»¿FILENAME                 235795 non-null  object 
 1   URL                         235795 non-null  object 
 2   URLLength                   235795 non-null  int64  
 3   Domain                      235795 non-null  object 
 4   DomainLength                235795 non-null  int64  
 5   IsDomainIP                  235795 non-null  int64  
 6   TLD                         235795 non-null  object 
 7   URLSimilarityIndex          235795 non-null  float64
 8   CharContinuationRate        235795 non-null  float64
 9   TLDLegitimateProb           235795 non-null  float64
 10  URLCharProb                 235795 non-null  float64
 11  TLDLength                   235795 non-null  int64  
 12  NoOfSubDomain               235795 non-null  int64  
 13  HasObfuscation

Unnamed: 0,URLLength,DomainLength,IsDomainIP,URLSimilarityIndex,CharContinuationRate,TLDLegitimateProb,URLCharProb,TLDLength,NoOfSubDomain,HasObfuscation,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
count,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,...,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0
mean,34.573095,21.470396,0.002706,78.430778,0.845508,0.260423,0.055747,2.764456,1.164758,0.002057,...,0.237007,0.023474,0.486775,26.075689,6.333111,10.522305,65.071113,2.377629,49.262516,0.571895
std,41.314153,9.150793,0.051946,28.976055,0.216632,0.251628,0.010587,0.599739,0.600969,0.045306,...,0.425247,0.151403,0.499826,79.411815,74.866296,22.312192,176.687539,17.641097,161.02743,0.494805
min,13.0,4.0,0.0,0.155574,0.0,0.0,0.001083,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,23.0,16.0,0.0,57.024793,0.68,0.005977,0.050747,2.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
50%,27.0,20.0,0.0,100.0,1.0,0.079963,0.05797,3.0,1.0,0.0,...,0.0,0.0,0.0,8.0,2.0,6.0,12.0,0.0,10.0,1.0
75%,34.0,24.0,0.0,100.0,1.0,0.522907,0.062875,3.0,1.0,0.0,...,0.0,0.0,1.0,29.0,8.0,15.0,88.0,1.0,57.0,1.0
max,6097.0,110.0,1.0,100.0,1.0,0.522907,0.090824,13.0,10.0,1.0,...,1.0,1.0,1.0,8956.0,35820.0,6957.0,27397.0,4887.0,27516.0,1.0


In [None]:
df.head()

Unnamed: 0,ï»¿FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,URLSimilarityIndex,CharContinuationRate,TLDLegitimateProb,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
0,521848.txt,https://www.southbankmosaics.com,31,www.southbankmosaics.com,24,0,com,100.0,1.0,0.522907,...,0,0,1,34,20,28,119,0,124,1
1,31372.txt,https://www.uni-mainz.de,23,www.uni-mainz.de,16,0,de,100.0,0.666667,0.03265,...,0,0,1,50,9,8,39,0,217,1
2,597387.txt,https://www.voicefmradio.co.uk,29,www.voicefmradio.co.uk,22,0,uk,100.0,0.866667,0.028555,...,0,0,1,10,2,7,42,2,5,1
3,554095.txt,https://www.sfnmjournal.com,26,www.sfnmjournal.com,19,0,com,100.0,1.0,0.522907,...,1,1,1,3,27,15,22,1,31,1
4,151578.txt,https://www.rewildingargentina.org,33,www.rewildingargentina.org,26,0,org,100.0,1.0,0.079963,...,1,0,1,244,15,34,72,1,85,1


In [None]:
# find duplicates
df.duplicated().sum()

0

In [None]:
# Keep only 'URL' and 'label' columns
df = df[['URL', 'label']]

In [None]:
# Data Cleaning
df = df.dropna()  # remove missing values
df = df.drop_duplicates()  # remove duplicates
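The earlier `df.duplicated().sum()` check ran on all 56 columns, so it can report 0 even when the same URL occurs more than once. The sketch below uses a made-up toy frame to show why the second `drop_duplicates()` after column selection is still worthwhile, and adds an optional check for URLs that carry conflicting labels (which `drop_duplicates()` would keep).

```python
import pandas as pd

# Made-up toy frame: two rows share URL and label but differ in FILENAME
toy = pd.DataFrame({
    "FILENAME": ["a.txt", "b.txt", "c.txt"],
    "URL": ["https://example.com", "https://example.com", "https://example.org"],
    "label": [1, 1, 0],
})

# Full-row comparison sees no duplicates because FILENAME differs
full_row_dupes = int(toy.duplicated().sum())

# Restricting to the columns that are kept exposes the repeated URL
subset_dupes = int(toy[["URL", "label"]].duplicated().sum())

# URLs that appear with more than one distinct label
conflicting = int((toy.groupby("URL")["label"].nunique() > 1).sum())

print(full_row_dupes, subset_dupes, conflicting)
```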

In [None]:
# check the data types
df.dtypes

Unnamed: 0,0
URL,object
label,int64


In [None]:
df.head()

Unnamed: 0,URL,label
0,https://www.southbankmosaics.com,1
1,https://www.uni-mainz.de,1
2,https://www.voicefmradio.co.uk,1
3,https://www.sfnmjournal.com,1
4,https://www.rewildingargentina.org,1


In [None]:
# Separate the dataset into phishing and legitimate URLs
phishing_urls = df[df['label'] == 0]
legitimate_urls = df[df['label'] == 1]

In [None]:
# Randomly sample 88,678 URLs from each category
sampled_phishing = phishing_urls.sample(n=88678, random_state=42)
sampled_legitimate = legitimate_urls.sample(n=88678, random_state=42)

In [None]:
# Combine the sampled datasets
reduced_df = pd.concat([sampled_phishing, sampled_legitimate])

In [None]:
# Shuffle the combined dataset
reduced_df = reduced_df.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
# Print the new dataset sizes
print("Reduced dataset size:")
print(f"Number of Phishing URLs: {sum(reduced_df['label'] == 0)}")
print(f"Number of Legitimate URLs: {sum(reduced_df['label'] == 1)}")

Reduced dataset size:
Number of Phishing URLs: 88678
Number of Legitimate URLs: 88678
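The split, sample, concatenate, and shuffle cells above can also be written as a single grouped sample. This is only a sketch: `balanced_sample` is a hypothetical helper, not part of the original notebook, and it defaults to the size of the smallest class when no per-class count is given.

```python
import pandas as pd


def balanced_sample(frame, label_col="label", n_per_class=None, seed=42):
    """Hypothetical helper: draw the same number of rows per class, then shuffle."""
    if n_per_class is None:
        # Fall back to the size of the smallest class
        n_per_class = int(frame[label_col].value_counts().min())
    sampled = frame.groupby(label_col).sample(n=n_per_class, random_state=seed)
    # Shuffle so the classes are interleaved, and rebuild a clean index
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up imbalanced toy data: 4 rows of class 0 and 6 rows of class 1
toy = pd.DataFrame({
    "URL": [f"https://site{i}.example" for i in range(10)],
    "label": [0] * 4 + [1] * 6,
})

balanced = balanced_sample(toy)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

On the real data, `balanced_sample(df, n_per_class=88678)` would request the same per-class count as the cells above, though the rows drawn may differ from the two-step version.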


In [None]:
# Remap the labels: 1 (legitimate) -> 2, 0 (phishing) -> 3
# Restrict the replacement to the label column so URL values are never touched
reduced_df['label'] = reduced_df['label'].replace({1: 2, 0: 3})
reduced_df.head(10)

Unnamed: 0,URL,label
0,https://help-meta-id-165485318.web.app/,3
1,https://www.front-porch-ideas-and-more.com,2
2,https://www.simongarden.ch,2
3,https://www.scottishfriendly.co.uk,2
4,https://st33-erd.web.app/,3
5,http://www.kannadagrahakarakoota.org,3
6,https://www.sfr.hotline-phone.ru/,3
7,https://www.memorialcare.org,2
8,http://www.malconnected.cloud,3
9,https://86u750988760000--querita.repl.co/carga...,3


In [None]:
# Rename the two remaining columns
reduced_df.columns = ["urls", "labels"]
reduced_df.head(10)

Unnamed: 0,urls,labels
0,https://help-meta-id-165485318.web.app/,3
1,https://www.front-porch-ideas-and-more.com,2
2,https://www.simongarden.ch,2
3,https://www.scottishfriendly.co.uk,2
4,https://st33-erd.web.app/,3
5,http://www.kannadagrahakarakoota.org,3
6,https://www.sfr.hotline-phone.ru/,3
7,https://www.memorialcare.org,2
8,http://www.malconnected.cloud,3
9,https://86u750988760000--querita.repl.co/carga...,3
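The remapping and renaming steps above can be combined with an explicit check that every label value is covered by the mapping. The sketch below runs on made-up rows; `LABEL_MAP` is an illustrative name, and the 2/3 codes are taken from the cell above.

```python
import pandas as pd

# 1 (legitimate) -> 2, 0 (phishing) -> 3, as in the remapping cell above
LABEL_MAP = {1: 2, 0: 3}

# Made-up rows standing in for the sampled dataset
toy = pd.DataFrame({
    "URL": ["https://a.example", "http://b.example"],
    "label": [1, 0],
})

# Fail early if the column holds a value the mapping does not cover;
# Series.map would otherwise turn such values into NaN silently
unexpected = set(toy["label"].unique()) - set(LABEL_MAP)
assert not unexpected, f"unmapped label values: {unexpected}"

# Map only the label column, then rename both columns by name
out = (
    toy.assign(label=toy["label"].map(LABEL_MAP))
       .rename(columns={"URL": "urls", "label": "labels"})
)
print(out)
```

Renaming by name rather than by position keeps the step correct even if the column order changes.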


In [None]:
reduced_df.to_csv('/content/drive/MyDrive/CNS Project II/Phishing Email Detection/PED_URLDataset.csv', index=False)
print("Saved reduced dataset")

Saved reduced dataset
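A quick round-trip check can confirm that a saved CSV reads back with the same columns and rows. The sketch below writes made-up rows to a temporary directory rather than to Google Drive; the file name is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Made-up rows in the final "urls"/"labels" layout
toy = pd.DataFrame({
    "urls": ["https://a.example", "http://b.example"],
    "labels": [2, 3],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip_check.csv")  # placeholder file name
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare column names and row contents after the round trip
same_columns = list(reloaded.columns) == list(toy.columns)
same_rows = reloaded.to_dict("records") == toy.to_dict("records")
print(same_columns, same_rows)
```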
