## 1. Business Understanding

## 2. Data Understanding
##### Step 1: Data Collection and Initial Exploration

In [1]:
import pandas as pd

# Load the dataset
file_path = 'data/combined_phishing_data/combined_phishing_data.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 Rows of the Dataset:")
display(df.head())

print("\nData Types and Null Values:")
display(df.info())


  df = pd.read_csv(file_path)


Dataset Shape: (247225, 124)

First 5 Rows of the Dataset:


Unnamed: 0,URL,URLLength,DomainLength,IsDomainIP,nb_dots,nb_hyphens,nb_at,NoOfQMarkInURL,NoOfAmpersandInURL,NoOfEqualsInURL,...,IsResponsive,Crypto,Bank,HasSubmitButton,LargestLineLength,Pay,TLD,NoOfOtherSpecialCharsInURL,ObfuscationRatio,DegitRatioInURL
0,https://www.todayshomeowner.com/how-to-make-ho...,82,23,0,2.0,7.0,0.0,0,0,0,...,,,,,,,,,,
1,http://thapthan.ac.th/information/confirmation...,93,14,1,2.0,0.0,0.0,0,0,0,...,,,,,,,,,,
2,http://app.dialoginsight.com/T/OFC4/L2S/3888/B...,121,21,1,3.0,0.0,0.0,0,0,0,...,,,,,,,,,,
3,https://www.bedslide.com,24,16,0,2.0,0.0,0.0,0,0,0,...,,,,,,,,,,
4,https://tabs.ultimate-guitar.com/s/sex_pistols...,73,24,0,3.0,1.0,0.0,0,0,0,...,,,,,,,,,,



Data Types and Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 247225 entries, 0 to 247224
Columns: 124 entries, URL to DegitRatioInURL
dtypes: float64(107), int64(13), object(4)
memory usage: 233.9+ MB


None

> **Key Insights:**
> There are 247,225 rows and 124 columns.
>> **Column Definitions**
>> - ***url:*** Contains URLs, likely representing the samples. This column is of type object.
>> - ***target:*** Appears to be the target variable, with entries indicating whether each URL is phishing or legitimate. This column is also of type object.
>> - ***url_length and hostname_length:*** Numerical columns (int64) likely representing the length of the full URL and the hostname, respectively.
>> - ***tld:*** Represents the top-level domain of each URL, such as ".com" or ".org". This column has a small number of missing values (504,755 non-null out of 504,983).

##### Step 2: Checking for Missing Values, Unique Values, and Statistical Summaries

In [2]:
# Count missing values in each column
print("Missing Values per Column:")
missing_values = df.isnull().sum()
display(missing_values)

# Count unique values in each column
print("\nUnique Values per Column:")
unique_values = df.nunique()
display(unique_values)

# Statistical summary of numerical columns
print("\nStatistical Summary for Numerical Columns:")
display(df.describe())


Missing Values per Column:


URL                                0
URLLength                          0
DomainLength                       0
IsDomainIP                         0
nb_dots                       235795
                               ...  
Pay                            11430
TLD                            11430
NoOfOtherSpecialCharsInURL     11430
ObfuscationRatio               11430
DegitRatioInURL                11430
Length: 124, dtype: int64


Unique Values per Column:


URL                           246755
URLLength                        515
DomainLength                     109
IsDomainIP                         2
nb_dots                           19
                               ...  
Pay                                2
TLD                              695
NoOfOtherSpecialCharsInURL        74
ObfuscationRatio                 146
DegitRatioInURL                  575
Length: 124, dtype: int64


Statistical Summary for Numerical Columns:


Unnamed: 0,URLLength,DomainLength,IsDomainIP,nb_dots,nb_hyphens,nb_at,NoOfQMarkInURL,NoOfAmpersandInURL,NoOfEqualsInURL,nb_underscore,...,URLSimilarityIndex,IsResponsive,Crypto,Bank,HasSubmitButton,LargestLineLength,Pay,NoOfOtherSpecialCharsInURL,ObfuscationRatio,DegitRatioInURL
count,247225.0,247225.0,247225.0,11430.0,11430.0,11430.0,247225.0,247225.0,247225.0,11430.0,...,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0,235795.0
mean,35.800752,21.452822,0.009542,2.480752,0.99755,0.022222,0.034572,0.031401,0.072917,0.32266,...,78.430778,0.624513,0.023474,0.127089,0.414301,12789.53,0.237007,2.340198,0.000138,0.028616
std,42.431083,9.232625,0.097216,1.369686,2.087087,0.1555,0.205925,0.83625,0.938991,1.093336,...,28.976055,0.484249,0.151403,0.333074,0.492602,152201.1,0.425247,3.527603,0.003817,0.070897
min,12.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.155574,0.0,0.0,0.0,0.0,22.0,0.0,0.0,0.0,0.0
25%,24.0,16.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,57.024793,0.0,0.0,0.0,0.0,200.0,0.0,1.0,0.0,0.0
50%,28.0,20.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,...,100.0,1.0,0.0,0.0,0.0,1090.0,0.0,1.0,0.0,0.0
75%,35.0,24.0,0.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,...,100.0,1.0,0.0,0.0,1.0,8047.0,0.0,3.0,0.0,0.0
max,6097.0,214.0,1.0,24.0,43.0,4.0,4.0,149.0,176.0,18.0,...,100.0,1.0,1.0,1.0,1.0,13975730.0,1.0,499.0,0.348,0.684


> **Key Insights**
> - There is significant number of missing values in the `nb_dots`, `Pay`, `TLD`, `NoOfOtherSpecialCharsInURL`, `ObfuscationRatio`, and `DegitRatioInURL` columns.
> - The `URL` column contains 246,755 unique entries, indicating a diverse set of URLs.

## 3. Data Preparation
##### Step 1: Data Preprocessing

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Handle missing values - fill with a placeholder or drop if appropriate
df['tld'].fillna('unknown', inplace=True)

# Encode the target variable
label_encoder = LabelEncoder()
df['target'] = label_encoder.fit_transform(df['target'])

# Split features and target variable
X = df.drop(columns=['target'])
y = df['target']

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


KeyError: 'tld'

> **Key Insights:**
> - Training and Test Set Shapes:
>> - The training set contains 403,986 samples and 7 features.
>> - The test set contains 100,997 samples and 7 features.

##### Step 2: Exploratory Data Analysis (EDA)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Distribution of the target variable
plt.figure(figsize=(8, 5))
sns.countplot(data=df, x='target')
plt.title('Distribution of Target Variable')
plt.xlabel('Target (0: Legitimate, 1: Phishing)')
plt.ylabel('Count')
plt.show()

# Print the count of each class in the target variable
target_counts = df['target'].value_counts()
print("\nCount of Target Classes:")
print(target_counts)

# 2. Summary statistics for URL length grouped by target variable
url_length_summary = df.groupby('target')['url_length'].describe()
print("\nSummary Statistics for URL Length by Target Variable:")
print(url_length_summary)

# Relationship between URL length and target
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='target', y='url_length')
plt.title('URL Length vs. Target Variable')
plt.xlabel('Target (0: Legitimate, 1: Phishing)')
plt.ylabel('URL Length')
plt.show()

# 3. Summary statistics for hostname length grouped by target variable
hostname_length_summary = df.groupby('target')['hostname_length'].describe()
print("\nSummary Statistics for Hostname Length by Target Variable:")
print(hostname_length_summary)

# Relationship between hostname length and target
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='target', y='hostname_length')
plt.title('Hostname Length vs. Target Variable')
plt.xlabel('Target (0: Legitimate, 1: Phishing)')
plt.ylabel('Hostname Length')
plt.show()



> **Key Insights:** 
> - There are  ***345,738*** instances of phishing URLs, ***54,807*** instances of legitimate URLs & ***104,438*** instances of unknown category.
> - The mean hostname length for legitimate URLs `0` is higher than that of phishing URLs `1`.
> - The standard deviation is also larger for legitimate URLs, indicating more variability in their lengths.