## 1. Business Understanding

## 2. Data Understanding
##### Step 1: Data Collection and Initial Exploration

In [1]:
import pandas as pd

# Load the dataset
file_path = 'data/phishing_dataset/phishing_dataset.csv'
df = pd.read_csv(file_path)

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nFirst 5 Rows of the Dataset:")
display(df.head())

print("\nData Types and Null Values:")
# df.info() prints its summary directly and returns None, so it needs no display() wrapper
df.info()


Dataset Shape: (504983, 8)

First 5 Rows of the Dataset:


|   | url | target | url_length | hostname_length | tld | num_dots | has_at_symbol | https |
|---|---|---|---|---|---|---|---|---|
| 0 | https://docs.google.com/presentation/d/e/2PACX... | Phishing | 178 | 15 | com | 3 | False | True |
| 1 | https://btttelecommunniccatiion.weeblysite.com/ | Phishing | 47 | 38 | com | 2 | False | True |
| 2 | https://kq0hgp.webwave.dev/ | Phishing | 27 | 18 | dev | 2 | False | True |
| 3 | https://brittishtele1bt-69836.getresponsesite.... | Phishing | 50 | 41 | com | 2 | False | True |
| 4 | https://bt-internet-105056.weeblysite.com/ | Phishing | 42 | 33 | com | 2 | False | True |



Data Types and Null Values:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 504983 entries, 0 to 504982
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   url              504983 non-null  object
 1   target           504983 non-null  object
 2   url_length       504983 non-null  int64 
 3   hostname_length  504983 non-null  int64 
 4   tld              504755 non-null  object
 5   num_dots         504983 non-null  int64 
 6   has_at_symbol    504983 non-null  bool  
 7   https            504983 non-null  bool  
dtypes: bool(2), int64(3), object(3)
memory usage: 24.1+ MB



> **Key Insights:**
> The dataset has 504,983 rows and 8 columns.
>> **Column Definitions**
>> - ***url:*** The raw URL string for each sample (dtype `object`).
>> - ***target:*** The class label, indicating whether each URL is phishing or legitimate (dtype `object`). All five sample rows are labeled `Phishing`.
>> - ***url_length and hostname_length:*** Integer columns (`int64`) that likely hold the character length of the full URL and of its hostname, respectively.
>> - ***tld:*** The top-level domain of each URL, stored without a leading dot (for example `com` or `dev`). It is the only column with missing values: 504,755 of 504,983 entries are non-null, so 228 are missing.
>> - ***num_dots:*** An integer column (`int64`) that likely counts the dots (`.`) in the URL.
>> - ***has_at_symbol and https:*** Boolean columns indicating whether the URL contains an "@" symbol and whether it uses HTTPS, respectively.
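
The column definitions above suggest that every feature can be derived from the URL string itself. The notebook does not show how the dataset's authors computed them, so the following is only a minimal sketch under that assumption. The helper name `extract_url_features` is hypothetical, and taking the TLD as the last hostname label is a simplification that may differ from the method used to build the dataset (for instance for multi-part suffixes).

```python
from urllib.parse import urlparse


def extract_url_features(url: str) -> dict:
    """Derive the dataset's feature columns from one URL (assumed definitions)."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""  # hostname is None when the URL has no host part
    # Assumption: the TLD is the last dot-separated label of the hostname.
    tld = hostname.rsplit(".", 1)[-1] if "." in hostname else None
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "tld": tld,
        "num_dots": url.count("."),
        "has_at_symbol": "@" in url,
        "https": parsed.scheme == "https",
    }


# URL taken from sample row 2 of the dataset preview
features = extract_url_features("https://kq0hgp.webwave.dev/")
print(features)
```

For this URL the sketch yields the same values as the dataset row (`url_length` 27, `hostname_length` 18, `tld` "dev", `num_dots` 2), which is consistent with, but does not prove, the assumed definitions.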

##### Step 2: Checking for Missing Values, Unique Values, and Statistical Summaries

In [2]:
# Count missing values in each column
print("Missing Values per Column:")
missing_values = df.isnull().sum()
display(missing_values)

# Count unique values in each column
print("\nUnique Values per Column:")
unique_values = df.nunique()
display(unique_values)

# Statistical summary of numerical columns
print("\nStatistical Summary for Numerical Columns:")
display(df.describe())


Missing Values per Column:


url                  0
target               0
url_length           0
hostname_length      0
tld                228
num_dots             0
has_at_symbol        0
https                0
dtype: int64


Unique Values per Column:


url                504933
target                  3
url_length            714
hostname_length       148
tld                  1550
num_dots               34
has_at_symbol           2
https                   2
dtype: int64


Statistical Summary for Numerical Columns:


|   | url_length | hostname_length | num_dots |
|---|---|---|---|
| count | 504983.0 | 504983.0 | 504983.0 |
| mean | 60.923625 | 20.030528 | 2.582043 |
| std | 66.307073 | 8.636628 | 1.166853 |
| min | 8.0 | 0.0 | 0.0 |
| 25% | 39.0 | 15.0 | 2.0 |
| 50% | 52.0 | 19.0 | 2.0 |
| 75% | 71.0 | 23.0 | 3.0 |
| max | 25523.0 | 240.0 | 40.0 |


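> **Key Insights**
> - The `tld` (top-level domain) column has 228 missing values; every other column is complete.
> - The `url` column holds 504,933 unique values across 504,983 rows, so nearly every URL is distinct and only 50 rows repeat a URL that already appears elsewhere.
> - The `target` column has 3 unique values, one more than a binary phishing/legitimate label needs, so the labels should be inspected before modeling.
> - `url_length` ranges from 8 to 25,523 characters with a median of 52, so a small number of very long URLs stretch the distribution far to the right.
> - `hostname_length` has a minimum of 0, so at least one row has an empty hostname.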
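
The observations above point to a few follow-up checks: inspecting the label values, locating repeated URLs, isolating rows with a missing `tld`, and flagging unusually long URLs. The sketch below shows one way to run them with pandas. It uses a hypothetical toy frame named `sample` with made-up values (including the `Legitimate` label, which the notebook output does not confirm), and the 1.5 × IQR cutoff for long URLs is a common rule of thumb rather than anything prescribed by the source.

```python
import pandas as pd

# Hypothetical toy data that mirrors the notebook's column names; all values are made up.
sample = pd.DataFrame({
    "url": [
        "https://a.example.com/",
        "https://a.example.com/",            # repeated URL
        "http://b.example.org/login",
        "https://c.example.net/" + "x" * 500,  # very long URL
        "http://localhost/",                 # no TLD
    ],
    "target": ["Phishing", "Phishing", "Legitimate", "Phishing", "Legitimate"],
    "tld": ["com", "com", "org", "net", None],
})
sample["url_length"] = sample["url"].str.len()

# 1. Which labels occur, and how often?
label_counts = sample["target"].value_counts(dropna=False)

# 2. Rows whose URL already appeared earlier in the frame
duplicate_rows = sample[sample.duplicated(subset="url", keep="first")]

# 3. Rows with a missing top-level domain
missing_tld_rows = sample[sample["tld"].isna()]

# 4. URLs longer than the upper IQR fence (Q3 + 1.5 * IQR)
q1, q3 = sample["url_length"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
long_urls = sample[sample["url_length"] > upper_fence]

print(label_counts)
print("Duplicate rows:", len(duplicate_rows))
print("Rows with missing tld:", len(missing_tld_rows))
print("Rows above the length fence:", len(long_urls))
```

On the real data, the same statements could be applied to `df` in place of `sample`; whether to drop, impute, or keep the flagged rows is a modeling decision that the checks themselves do not settle.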