# Phish Hook 
**(Data Collection, Data Preprocessing & Exploratory Data Analysis)**

# 1.0 Data Collection
The dataset used in this project is “Malicious And Benign URLs” from the Kaggle website, https://www.kaggle.com/datasets/siddharthkumar25/malicious-and-benign-urls 

This dataset was compiled from various sources such as PhishTank. It consists of about 345,000 legitimate (benign) and 104,000 malicious URLs. Each URL is labeled **‘0’ for benign** and **‘1’ for malicious**.

In [1]:
pip install --upgrade pip

Note: you may need to restart the kernel to use updated packages.


In [1]:
pip install -r requirements.txt

Collecting tensorflow==2.11.0
  Using cached tensorflow-2.11.0-cp39-cp39-win_amd64.whl (1.9 kB)
Collecting imblearn==0.0
  Using cached imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Collecting tensorflow-intel==2.11.0
  Using cached tensorflow_intel-2.11.0-cp39-cp39-win_amd64.whl (266.3 MB)
Installing collected packages: imblearn, tensorflow-intel, tensorflow
Successfully installed imblearn-0.0 tensorflow-2.11.0 tensorflow-intel-2.11.0
Note: you may need to restart the kernel to use updated packages.


In [1]:
# Check if GPU is being used (an empty string in the output means no GPU device was found).

import tensorflow as tf
tf.test.gpu_device_name()

''

In [28]:
import pandas as pd

# Loading the downloaded dataset
df = pd.read_csv("urldata.csv")
df.head(10)

Unnamed: 0.1,Unnamed: 0,url,label,result
0,0,https://www.google.com,benign,0
1,1,https://www.youtube.com,benign,0
2,2,https://www.facebook.com,benign,0
3,3,https://www.baidu.com,benign,0
4,4,https://www.wikipedia.org,benign,0
5,5,https://www.reddit.com,benign,0
6,6,https://www.yahoo.com,benign,0
7,7,https://www.google.co.in,benign,0
8,8,https://www.qq.com,benign,0
9,9,https://www.amazon.com,benign,0


In [29]:
# Removing the unnamed column as it is not necessary.
df = df.drop('Unnamed: 0',axis=1)

#Show info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     450176 non-null  object
 1   label   450176 non-null  object
 2   result  450176 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 10.3+ MB


In [30]:
df.shape

(450176, 3)

In [31]:
# Printing number of benign and malicious URLs
df["label"].value_counts()

benign       345738
malicious    104438
Name: label, dtype: int64
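
The counts above show that the two classes are not balanced. As a rough sketch that uses only the two numbers printed above (the variable names are illustrative and not part of the notebook), the class shares and the benign-to-malicious ratio can be worked out like this:

```python
# Class counts copied from the value_counts() output above
counts = {"benign": 345738, "malicious": 104438}

total = sum(counts.values())
# Fraction of the dataset that belongs to each class
shares = {label: n / total for label, n in counts.items()}
# Number of benign URLs per malicious URL
imbalance_ratio = counts["benign"] / counts["malicious"]

print(total)
print({label: round(share, 3) for label, share in shares.items()})
print(round(imbalance_ratio, 2))
```

Roughly 23% of the URLs are malicious, which is worth keeping in mind when choosing evaluation metrics later on.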

# 2.0 Data Preprocessing
No data cleaning is needed, since the `df.info()` output above shows that no column has missing values.

## 2.1 Feature Extraction
The initial dataset consists only of legitimate and malicious URLs, together with their labels and results. In this stage, useful features are extracted from the URLs to enrich the dataset. A total of **19 features** are extracted to make the dataset more suitable for training ML models.

These extracted features are categorized into:
1. length-based
2. count-based 
3. binary features

### 2.1.1 Length Features
Features:
1. Length of URL
2. Length of Hostname
3. Length of Path
4. Length of First Directory
5. Length of Top Level Domain


In [32]:
#Importing dependencies
from urllib.parse import urlparse
import os.path

# Referring to the dataframe by a more descriptive name
# (this is an alias, not a copy: both names point to the same object)
urldata = df

In [33]:
# Length of URL (phishers can use a long URL to hide the suspicious part in the address bar)
urldata['url_length'] = urldata['url'].apply(lambda i: len(str(i)))

#Length of Hostname
urldata['hostname_length'] = urldata['url'].apply(lambda i: len(urlparse(i).netloc))

#Length of Path
urldata['path_length'] = urldata['url'].apply(lambda i: len(urlparse(i).path))
     

In [34]:
# Length of First Directory
def fd_length(url):
    urlpath = urlparse(url).path
    try:
        return len(urlpath.split('/')[1])
    except IndexError:
        # The path has no directory component (e.g. an empty path)
        return 0

urldata['fd_length'] = urldata['url'].apply(fd_length)
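
To make the length features concrete, here is a minimal sketch that computes the same four values for a single URL string without pandas. The helper name `length_features` and the `example.com` URL are illustrative assumptions, not part of the notebook.

```python
from urllib.parse import urlparse

def length_features(url):
    """Compute the URL, hostname, path and first-directory lengths for one URL."""
    parsed = urlparse(url)
    # "/login/verify.php".split("/") -> ["", "login", "verify.php"]
    segments = parsed.path.split("/")
    first_dir = segments[1] if len(segments) > 1 else ""
    return {
        "url_length": len(url),
        "hostname_length": len(parsed.netloc),
        "path_length": len(parsed.path),
        "fd_length": len(first_dir),
    }

print(length_features("https://www.google.com"))
print(length_features("https://example.com/login/verify.php?id=7"))
```

Note that `urlparse` only fills in `netloc` when the URL has a scheme (or starts with `//`); for a bare string such as `www.google.com` the whole string ends up in `path`, so the hostname length would be 0.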

In [35]:
pip install tld




In [36]:
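# Length of Top Level Domain

from tld import get_tld

# With fail_silently=True, get_tld returns None instead of raising when no TLD is found
urldata['tld'] = urldata['url'].apply(lambda i: get_tld(i, fail_silently=True))

def tld_length(tld):
    try:
        return len(tld)
    except TypeError:
        # tld is None: no top level domain could be extracted
        return -1

urldata['tld_length'] = urldata['tld'].apply(tld_length)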

In [37]:
# Removing the tld column as it is not necessary.
urldata = urldata.drop('tld',axis=1)

# printing first few rows
urldata.head(10)

Unnamed: 0,url,label,result,url_length,hostname_length,path_length,fd_length,tld_length
0,https://www.google.com,benign,0,22,14,0,0,3
1,https://www.youtube.com,benign,0,23,15,0,0,3
2,https://www.facebook.com,benign,0,24,16,0,0,3
3,https://www.baidu.com,benign,0,21,13,0,0,3
4,https://www.wikipedia.org,benign,0,25,17,0,0,3
5,https://www.reddit.com,benign,0,22,14,0,0,3
6,https://www.yahoo.com,benign,0,21,13,0,0,3
7,https://www.google.co.in,benign,0,24,16,0,0,5
8,https://www.qq.com,benign,0,18,10,0,0,3
9,https://www.amazon.com,benign,0,22,14,0,0,3


### 2.1.2 Count Features
Features:
1. Count of ‘-’
2. Count of ‘@’
3. Count of ‘?’
4. Count of ‘%’
5. Count of ‘.’
6. Count of ‘=’
7. Count of 'HTTP'
8. Count of 'HTTPS'
9. Count of 'www'
10. Count of Letters
11. Count of Digits
12. Count of Directories


In [48]:
# Count of how many times a special character or token appears in the URL

## Spammers have used the little-used soft hyphen (SHY character) to fool URL filtering devices.
## Note that this line counts the ordinary hyphen ('-'), not the soft hyphen (U+00AD).
urldata['count-'] = urldata['url'].apply(lambda i: i.count('-'))

## Using the “@” symbol in a URL leads the browser to ignore everything preceding the “@” symbol.
urldata['count@'] = urldata['url'].apply(lambda i: i.count('@'))

## A "?" in a URL marks the start of a query string that contains the data to be passed to the server.
urldata['count?'] = urldata['url'].apply(lambda i: i.count('?'))

## Spaces in a URL are percent-encoded as %20, so malicious URLs, which generally contain more spaces, contain more '%' characters.
urldata['count%'] = urldata['url'].apply(lambda i: i.count('%'))

## Domain levels are separated by dots (.). Phishing websites generally use more than two sub-domains in the URL.
urldata['count.'] = urldata['url'].apply(lambda i: i.count('.'))

## Using "=" indicates that variable values are passed from one form page to another, which is risky.
urldata['count='] = urldata['url'].apply(lambda i: i.count('='))

## Phishing websites often have more than one HTTP in their URL, whereas safe sites have only one.
## Note that 'http' also matches inside 'https', so an HTTPS URL is counted here as well.
urldata['count_http'] = urldata['url'].apply(lambda i : i.count('http'))

## Generally, malicious URLs do not use HTTPS, as it requires a certificate and signals that the website is safe for transactions.
urldata['count_https'] = urldata['url'].apply(lambda i : i.count('https'))

## Most malicious URLs have either no 'www' or more than one.
urldata['count_www'] = urldata['url'].apply(lambda i: i.count('www'))
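
As a small worked example of the count features, the sketch below applies the same `str.count` calls to one made-up URL. The helper `count_tokens` and the sample URL are assumptions for illustration only. It also shows why `count_http` is 1 for a plain `https://` URL: the substring `http` occurs inside `https`.

```python
def count_tokens(url, tokens=("-", "@", "?", "%", ".", "=", "http", "https", "www")):
    """Return how often each token occurs in the URL (non-overlapping substring counts)."""
    return {token: url.count(token) for token in tokens}

# Made-up URL that embeds a second URL in its query string
sample = "http://secure-login.example.com/redirect?url=https://www.bank.com"
counts = count_tokens(sample)

for token, n in counts.items():
    print(f"{token!r}: {n}")
```

Here `http` is counted twice (once for `http://` and once inside `https://`), while `https` is counted once.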


In [49]:
# Count of letters in the URL
def letter_count(url):
    letters = 0
    for i in url:
        if i.isalpha():
            letters = letters + 1
    return letters
urldata['count_letters']= urldata['url'].apply(lambda i: letter_count(i))

In [50]:
# Count of digits in the URL
def digit_count(url):
    digits = 0
    for i in url:
        if i.isnumeric():
            digits = digits + 1
    return digits
urldata['count_digits']= urldata['url'].apply(lambda i: digit_count(i))

In [51]:
# Number of directories, taken as the number of '/' characters in the URL path
def no_of_dir(url):
    urldir = urlparse(url).path
    return urldir.count('/')
urldata['count_dir'] = urldata['url'].apply(lambda i: no_of_dir(i))
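
The three helpers above can also be written more compactly with generator expressions. The sketch below is an equivalent formulation for a single URL; `char_counts` and the `example.com` URL are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

def char_counts(url):
    """Count letters, digits and path separators in one URL."""
    return {
        # True counts as 1 when summed, so this counts the matching characters
        "count_letters": sum(ch.isalpha() for ch in url),
        "count_digits": sum(ch.isnumeric() for ch in url),
        # Only slashes in the path are counted, not those in "https://"
        "count_dir": urlparse(url).path.count("/"),
    }

print(char_counts("https://www.google.com"))
print(char_counts("https://example.com/a1/b22/c333.html"))
```

Like the original helper, this uses `str.isnumeric`, which also accepts non-ASCII numeric characters; `ch in "0123456789"` would be a stricter alternative if only ASCII digits should count.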

In [58]:
# Dropping 3 duplicated columns, since the naming convention was changed midway
# (errors="ignore" keeps this cell from failing if the old columns are not present)
urldata.drop(columns=["count-http", "count-https", "count-www"], errors="ignore", inplace=True)

# printing first few rows
urldata.head(10)

Unnamed: 0,url,label,result,url_length,hostname_length,path_length,fd_length,tld_length,count-,count@,...,count.,count=,count_letters,count_digits,count_dir,use_of_ip,short_url,count_http,count_https,count_www
0,https://www.google.com,benign,0,22,14,0,0,3,0,0,...,2,0,17,0,0,1,1,1,1,1
1,https://www.youtube.com,benign,0,23,15,0,0,3,0,0,...,2,0,18,0,0,1,1,1,1,1
2,https://www.facebook.com,benign,0,24,16,0,0,3,0,0,...,2,0,19,0,0,1,1,1,1,1
3,https://www.baidu.com,benign,0,21,13,0,0,3,0,0,...,2,0,16,0,0,1,1,1,1,1
4,https://www.wikipedia.org,benign,0,25,17,0,0,3,0,0,...,2,0,20,0,0,1,1,1,1,1
5,https://www.reddit.com,benign,0,22,14,0,0,3,0,0,...,2,0,17,0,0,1,-1,1,1,1
6,https://www.yahoo.com,benign,0,21,13,0,0,3,0,0,...,2,0,16,0,0,1,1,1,1,1
7,https://www.google.co.in,benign,0,24,16,0,0,5,0,0,...,3,0,18,0,0,1,1,1,1,1
8,https://www.qq.com,benign,0,18,10,0,0,3,0,0,...,2,0,13,0,0,1,1,1,1,1
9,https://www.amazon.com,benign,0,22,14,0,0,3,0,0,...,2,0,17,0,0,1,1,1,1,1


### 2.1.3 Binary Features
Features:
1. Use of IP address
2. Use of URL shortening


In [59]:
# Use of IP address
## Generally, cyber attackers use an IP address in place of the domain name to hide the identity of the website.
## This feature checks whether the URL contains an IP address (-1 if it does, 1 otherwise).

import re

def having_ip_address(url):
    # Raw strings avoid invalid-escape warnings; the three alternatives are joined with '|'
    match = re.search(
        r'(([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.'
        r'([01]?\d\d?|2[0-4]\d|25[0-5])\/)|'  # IPv4
        r'((0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\.(0x[0-9a-fA-F]{1,2})\/)|'  # IPv4 in hexadecimal
        r'(?:[a-fA-F0-9]{1,4}:){7}[a-fA-F0-9]{1,4}', url)  # IPv6
    if match:
        return -1
    else:
        return 1

urldata['use_of_ip'] = urldata['url'].apply(having_ip_address)
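
As an alternative to a hand-written pattern, the standard library can decide whether the host part of a URL is an IP literal. The following is only a sketch of that idea, not the notebook's method: `host_is_ip` is a hypothetical helper, it returns a boolean rather than the -1/1 encoding used above, and it looks only at the hostname, so an IP address that appears elsewhere in the URL is not detected.

```python
import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return True if the URL's hostname is a literal IPv4 or IPv6 address."""
    try:
        host = urlparse(url).hostname  # strips the port and IPv6 brackets
    except ValueError:
        # urlparse rejects some malformed URLs (e.g. unbalanced brackets)
        return False
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True

print(host_is_ip("http://192.168.0.1/login"))
print(host_is_ip("http://[2001:db8::1]/index.html"))
print(host_is_ip("https://www.google.com"))
```

`ipaddress.ip_address` does not accept the hexadecimal dotted form (`0xC0.0xA8...`) that the regular expression above targets, so that case would still need separate handling.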

In [60]:
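# Use of URL shortening service
## This feature identifies whether the URL uses a URL shortening service
## like bit.ly, goo.gl, go2l.ink, etc. (-1 if it does, 1 otherwise).
## Note: the pattern is searched anywhere in the URL, so e.g. 't\.co' also matches inside 'reddit.com'.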

def shortening_service(url):
    match = re.search('bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
                      'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
                      'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
                      'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
                      'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
                      'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
                      'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
                      'tr\.im|link\.zip\.net',
                      url)
    if match:
        return -1
    else:
        return 1
urldata['short_url'] = urldata['url'].apply(lambda i: shortening_service(i))
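
Because `re.search` looks for the pattern anywhere in the URL, short entries such as `t\.co` can match inside unrelated domains. One possible way to tighten this, sketched below under the assumption that only the hostname should be compared, is to look the host up in a set of known shortener domains. `SHORTENER_HOSTS` is an illustrative subset of the list above and `uses_shortener` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse

# Illustrative subset of the shortener domains listed in the pattern above
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "t.co", "tinyurl.com", "ow.ly", "is.gd"}

def uses_shortener(url):
    """Return True if the URL's hostname is exactly a known shortener domain."""
    try:
        host = urlparse(url).hostname or ""  # hostname is already lower-cased
    except ValueError:
        return False
    if host.startswith("www."):
        host = host[4:]
    return host in SHORTENER_HOSTS

url = "https://www.reddit.com"
# Substring search: 't.co' is found inside 'reddit.com'
print(re.search(r"t\.co", url) is not None)
# Hostname lookup: reddit.com is not a shortener
print(uses_shortener(url))
print(uses_shortener("https://bit.ly/3abcd"))
```

The trade-off is that an exact hostname lookup misses shortener links embedded in a query string or path, which the substring search would catch.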

In [61]:
# printing first few rows
urldata.head(10)

Unnamed: 0,url,label,result,url_length,hostname_length,path_length,fd_length,tld_length,count-,count@,...,count.,count=,count_letters,count_digits,count_dir,use_of_ip,short_url,count_http,count_https,count_www
0,https://www.google.com,benign,0,22,14,0,0,3,0,0,...,2,0,17,0,0,1,1,1,1,1
1,https://www.youtube.com,benign,0,23,15,0,0,3,0,0,...,2,0,18,0,0,1,1,1,1,1
2,https://www.facebook.com,benign,0,24,16,0,0,3,0,0,...,2,0,19,0,0,1,1,1,1,1
3,https://www.baidu.com,benign,0,21,13,0,0,3,0,0,...,2,0,16,0,0,1,1,1,1,1
4,https://www.wikipedia.org,benign,0,25,17,0,0,3,0,0,...,2,0,20,0,0,1,1,1,1,1
5,https://www.reddit.com,benign,0,22,14,0,0,3,0,0,...,2,0,17,0,0,1,-1,1,1,1
6,https://www.yahoo.com,benign,0,21,13,0,0,3,0,0,...,2,0,16,0,0,1,1,1,1,1
7,https://www.google.co.in,benign,0,24,16,0,0,5,0,0,...,3,0,18,0,0,1,1,1,1,1
8,https://www.qq.com,benign,0,18,10,0,0,3,0,0,...,2,0,13,0,0,1,1,1,1,1
9,https://www.amazon.com,benign,0,22,14,0,0,3,0,0,...,2,0,17,0,0,1,1,1,1,1


In [62]:
# printing info about current dataset
urldata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 450176 entries, 0 to 450175
Data columns (total 22 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   url              450176 non-null  object
 1   label            450176 non-null  object
 2   result           450176 non-null  int64 
 3   url_length       450176 non-null  int64 
 4   hostname_length  450176 non-null  int64 
 5   path_length      450176 non-null  int64 
 6   fd_length        450176 non-null  int64 
 7   tld_length       450176 non-null  int64 
 8   count-           450176 non-null  int64 
 9   count@           450176 non-null  int64 
 10  count?           450176 non-null  int64 
 11  count%           450176 non-null  int64 
 12  count.           450176 non-null  int64 
 13  count=           450176 non-null  int64 
 14  count_letters    450176 non-null  int64 
 15  count_digits     450176 non-null  int64 
 16  count_dir        450176 non-null  int64 
 17  use_of_ip 

## 2.2 Save new dataset as a .csv file


In [63]:
urldata.to_csv("data/url_processed.csv")