Week 3
Deliverables
Data Cleaning Report
A detailed summary of how missing values were handled, including the chosen imputation techniques (e.g., mean, median, mode, or advanced methods).
Documentation of duplicate entries identified and removed, including the percentage of duplicates in the original dataset.
Feature Encoding Summary

Explanation of encoding techniques applied to categorical variables (e.g., one-hot encoding, label encoding, or ordinal encoding).
List of transformed features and their encoded representations.
Normalization/Scaling Report

Description of normalization or scaling techniques used for numerical features (e.g., MinMaxScaler, StandardScaler, or RobustScaler).
Before-and-after comparison of numerical feature distributions to illustrate the effect of scaling.
Data Splitting Report

Details of the data split, including the size and composition of training and testing sets (e.g., number of records in each set and percentage split).
Confirmation of the stratified split (if applicable) to maintain the target variable's distribution across training and testing sets.
Preprocessed Dataset

Final cleaned, encoded, normalized, and split dataset, ready for model training and evaluation.



In [44]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [50]:
# Load the phishing dataset and preview the first five rows
df = pd.read_csv("dataset_phishing.csv")
df.head()

,url,length_url,length_hostname,ip,nb_dots,nb_hyphens,nb_at,nb_qm,nb_and,nb_or,...,domain_in_title,domain_with_copyright,whois_registered_domain,domain_registration_length,domain_age,web_traffic,dns_record,google_index,page_rank,status
0,http://www.crestonwood.com/router.php,37,19,0,3,0,0,0,0,0,...,0,1,0,45,-1,0,1,1,4,legitimate
1,http://shadetreetechnology.com/V4/validation/a...,77,23,1,1,0,0,0,0,0,...,1,0,0,77,5767,0,0,1,2,phishing
2,https://support-appleld.com.secureupdate.duila...,126,50,1,4,1,0,1,2,0,...,1,0,0,14,4004,5828815,0,1,0,phishing
3,http://rgipt.ac.in,18,11,0,2,0,0,0,0,0,...,1,0,0,62,-1,107721,0,0,3,legitimate
4,http://www.iracing.com/tracks/gateway-motorspo...,55,15,0,2,2,0,0,0,0,...,0,1,0,224,8175,8725,0,0,6,legitimate


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11430 entries, 0 to 11429
Data columns (total 89 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   url                         11430 non-null  object 
 1   length_url                  11430 non-null  int64  
 2   length_hostname             11430 non-null  int64  
 3   ip                          11430 non-null  int64  
 4   nb_dots                     11430 non-null  int64  
 5   nb_hyphens                  11430 non-null  int64  
 6   nb_at                       11430 non-null  int64  
 7   nb_qm                       11430 non-null  int64  
 8   nb_and                      11430 non-null  int64  
 9   nb_or                       11430 non-null  int64  
 10  nb_eq                       11430 non-null  int64  
 11  nb_underscore               11430 non-null  int64  
 12  nb_tilde                    11430 non-null  int64  
 13  nb_percent                  114

In [62]:
# Count missing values per column
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

Missing values:
 url                0
length_url         0
length_hostname    0
ip                 0
nb_dots            0
                  ..
web_traffic        0
dns_record         0
google_index       0
page_rank          0
status             0
Length: 89, dtype: int64
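
The columns shown above report no missing values, so no imputation appears to have been needed here. For completeness, the following is a minimal sketch of how median and mode imputation could be applied if gaps did appear. The `toy` frame and the `impute_missing` helper are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: median for numeric columns, mode for all others."""
    out = frame.copy()
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                out[col] = out[col].fillna(out[col].median())
            else:
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


# Toy data standing in for a dataset with gaps (not the phishing dataset)
toy = pd.DataFrame({
    "length_url": [37.0, np.nan, 126.0, 18.0],
    "status": ["legitimate", "phishing", None, "phishing"],
})
imputed = impute_missing(toy)
remaining_missing = int(imputed.isnull().sum().sum())
print("Missing values after imputation:", remaining_missing)
```

The median is usually a safer default than the mean for skewed count features such as URL lengths, because a few extreme values do not pull it.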


In [66]:
# Count fully duplicated rows (every column identical to an earlier row)
duplicate_rows = df.duplicated().sum()
print("Duplicate Rows:\n", duplicate_rows)

Duplicate Rows:
 0
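
With 0 duplicates among 11430 rows, the duplicate share of the original dataset is 0%. The sketch below shows one way the percentage and the removal step could be computed on a dataset that does contain duplicates. The `toy` frame and the `duplicate_percentage` helper are illustrative assumptions.

```python
import pandas as pd


def duplicate_percentage(frame: pd.DataFrame) -> float:
    """Share of rows that exactly repeat an earlier row, as a percentage."""
    if len(frame) == 0:
        return 0.0
    return float(100.0 * frame.duplicated().sum() / len(frame))


# Toy data with one repeated row (not the phishing dataset)
toy = pd.DataFrame({
    "url": ["a.com", "b.com", "a.com", "c.com"],
    "ip": [0, 1, 0, 0],
})
pct = duplicate_percentage(toy)

# Keep the first occurrence of each row and renumber the index
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(f"{pct:.2f}% duplicates, {len(deduplicated)} rows kept")
```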


#Feature Encoding



In [70]:
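from sklearn.preprocessing import LabelEncoder

# Drop the raw URL string: it is a per-row identifier, not a categorical feature,
# and label-encoding it would assign an arbitrary integer to each URL.
encoded_df = df.drop(columns=["url"])
categorical_columns = encoded_df.select_dtypes(include=["object"]).columns

# Fit one LabelEncoder per remaining categorical column and keep it for later decoding
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    encoded_df[col] = le.fit_transform(encoded_df[col])
    label_encoders[col] = le

# Report the class-to-integer mapping for each encoded column
for col, le in label_encoders.items():
    mapping = {str(cls): int(code) for cls, code in zip(le.classes_, le.transform(le.classes_))}
    print(f"Encoding for {col}: {mapping}")

Encoding for status: {'legitimate': 0, 'phishing': 1}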
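
The mapping printed above follows from how `LabelEncoder` works: it sorts the class labels and numbers them from 0, so `legitimate` comes before `phishing`. A small sketch on a hypothetical list of labels shows the encoding and how a stored encoder can turn integer predictions back into readable labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels in the same two classes as the "status" column
labels = ["legitimate", "phishing", "phishing", "legitimate"]

le = LabelEncoder()
codes = le.fit_transform(labels)       # classes are sorted, then numbered from 0
decoded = le.inverse_transform(codes)  # map integer codes back to the original strings

print([str(c) for c in le.classes_])
print([int(c) for c in codes])
print([str(d) for d in decoded])
```

Keeping the fitted encoder, as the `label_encoders` dictionary does, is what makes this reverse lookup possible after model predictions.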


#Normalization

In [76]:
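from sklearn.preprocessing import MinMaxScaler

# Scale every numeric feature to [0, 1]; the encoded target "status" is left untouched.
# Index.difference() returns the column names sorted alphabetically.
numerical_columns = encoded_df.select_dtypes(include=["int64", "float64"]).columns.difference(["status"])
scaler = MinMaxScaler()
normalized_df = encoded_df.copy()
normalized_df[numerical_columns] = scaler.fit_transform(encoded_df[numerical_columns])

# Compare summary statistics before and after scaling
print("Before Normalization:\n", df[numerical_columns].describe())
print("After Normalization:\n", normalized_df[numerical_columns].describe())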

Before Normalization:
        abnormal_subdomain  avg_word_host  avg_word_path  avg_words_raw  \
count        11430.000000   11430.000000   11430.000000   11430.000000   
mean             0.021610       7.678075       5.092425       7.258882   
std              0.145412       3.578435       7.147050       4.145827   
min              0.000000       1.000000       0.000000       2.000000   
25%              0.000000       5.250000       0.000000       5.250000   
50%              0.000000       7.000000       4.857143       6.500000   
75%              0.000000       9.000000       6.714286       8.000000   
max              1.000000      39.000000     250.000000     128.250000   

       brand_in_path  brand_in_subdomain   char_repeat    dns_record  \
count   11430.000000        11430.000000  11430.000000  11430.000000   
mean        0.004899            0.004112      2.927472      0.020122   
std         0.069827            0.063996      4.768936      0.140425   
min         0.000000  
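
To make the effect of Min-Max scaling concrete, the sketch below applies the formula (x - min) / (max - min) by hand and checks it against `MinMaxScaler`. The five input values are the min, quartiles, and max reported for `avg_word_host` in the summary above. Using them as a stand-in for the full column is an assumption made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# min, 25%, 50%, 75%, max of avg_word_host, taken from the describe() output
values = np.array([[1.0], [5.25], [7.0], [9.0], [39.0]])

# Min-Max scaling by hand: (x - min) / (max - min), computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(np.allclose(manual, scaled))
print(manual.ravel().round(4))
```

Because the maximum (39) is far from the bulk of the data, the median of 7 lands at roughly 0.158 after scaling. Features with extreme maxima are compressed toward 0 in the same way, which is one reason to consider RobustScaler for heavily skewed columns.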

#Data Splitting

In [80]:
from sklearn.model_selection import train_test_split

# Separate the features (x) from the target (y)
x = normalized_df.drop(columns=["status"])
y = normalized_df["status"]

# 70/30 split, stratified on the target so both sets keep the same class balance
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y
)

print("Training Set Size:", x_train.shape[0])
print("Testing Set Size:", x_test.shape[0])
print("Training Target Distribution:\n", y_train.value_counts(normalize=True))
print("Testing Target Distribution:\n", y_test.value_counts(normalize=True))

Training Set Size: 8001
Testing Set Size: 3429
Training Target Distribution:
 status
1    0.500062
0    0.499938
Name: proportion, dtype: float64
Testing Target Distribution:
 status
0    0.500146
1    0.499854
Name: proportion, dtype: float64
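
The cells above fit `MinMaxScaler` on the full dataset before splitting, so the test rows influence the scaling parameters. A common alternative is to split first and fit the scaler on the training set only. The sketch below illustrates that ordering on randomly generated toy data. The column names are borrowed from the dataset, but the values are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the phishing features, with a balanced binary target
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "length_url": rng.integers(10, 200, size=100),
    "nb_dots": rng.integers(1, 6, size=100),
    "status": np.tile([0, 1], 50),
})
x = toy.drop(columns=["status"])
y = toy["status"]

# 1) Split first, stratifying on the target
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42, stratify=y
)

# 2) Fit the scaler on the training features only, then apply it to both sets
scaler = MinMaxScaler()
x_train_scaled = pd.DataFrame(
    scaler.fit_transform(x_train), columns=x.columns, index=x_train.index
)
x_test_scaled = pd.DataFrame(
    scaler.transform(x_test), columns=x.columns, index=x_test.index
)

print(x_train_scaled.shape, x_test_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1] when a test row exceeds the training range. That is expected and reflects what the model would see on new data.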
