## Contents<a id='3.1_Contents'></a>
* [2 Exploratory Data Analysis](#2_Exploratory_Data_Analysis)
  * [2.1 Introduction](#2.1_Introduction)
  * [2.2 Imports](#2.2-Imports)
  * [2.3 Load the data](#2.3-Load-the-data)
  * [2.4 Removing prefixes](#2.4-Removing-prefixes)
  * [2.5 Extracting the host part](#2.5-Extracting-the-host-part)
  * [2.6 Feature Creation](#2.6-Feature-Creation)
    * [2.6.1 URL and ext_host length](#2.6.1-URL-and-ext_host-length)
    * [2.6.2 URL containing IP address](#2.6.2-URL-containing-IP-address)
    * [2.6.3 Containing java script queries](#2.6.3-Containing-java-script-queries)
    * [2.6.4 Finding number-dot pattern](#2.6.4-Finding-number-dot-pattern)
    * [2.6.5 Capital letters in host](#2.6.5-Capital-letters-in-host)
    * [2.6.6 Capital letters in url](#2.6.6-Capital-letters-in-url)
    * [2.6.7 Calculating Shannon entropy](#2.6.7-Calculating-Shannon-entropy)
      * [2.6.7.1 Calculating Shannon entropy for url](#2.6.7.1-Calculating-Shannon-entropy-for-url)
      * [2.6.7.2 Calculating Shannon entropy for ext_host](#2.6.7.2-Calculating-Shannon-entropy-for-ext_host)
    * [2.6.8 Calculating alphabet based ratios](#2.6.8-Calculating-alphabet-based-ratios)
      * [2.6.8.1 Calculating character to number ratio in url](#2.6.8.1-Calculating-character-to-number-ratio-in-url)
      * [2.6.8.2 Calculating character to special characters ratio in url](#2.6.8.2-Calculating-character-to-special-characters-ratio-in-url)
      * [2.6.8.3 Calculating character to number ratio in host](#2.6.8.3-Calculating-character-to-number-ratio-in-host)
      * [2.6.8.4 Calculating character to special characters ratio in host](#2.6.8.4-Calculating-character-to-special-characters-ratio-in-host)
    * [2.6.9 Detecting obfuscation](#2.6.9-Detecting-obfuscation)
      * [2.6.9.1 Detecting obfuscation in url](#2.6.9.1-Detecting-obfuscation-in-url)
      * [2.6.9.2 Detecting obfuscation in host](#2.6.9.2-Detecting-obfuscation-in-host)
    * [2.6.10 Calculating entropy of random substring](#2.6.10-Calculating-entropy-of-random-substring)
      * [2.6.10.1 Calculating entropy of random substring in url](#2.6.10.1-Calculating-entropy-of-random-substring-in-url)
      * [2.6.10.2 Calculating entropy of random substring in host](#2.6.10.2-Calculating-entropy-of-random-substring-in-host)
    * [2.6.11 The number of individual special characters](#2.6.11-The-number-of-individual-special-chracters)

# 2.1 Introduction

While searching for a comprehensive dataset containing both phishing and legitimate URLs, we came across several datasets with many useful features. One of them, however, had an encoding problem, most probably caused by data corruption during the scraping phase. Another had features whose generation could not be explained.
The dataset we are working with was added to Kaggle a year ago, and its author made a good effort to make it comprehensive.
As we saw previously, our dataset does not have many missing values or inconsistencies that need to be taken care of. However, it also does not contain any features that can be used to predict phishing URLs, so we start with feature engineering.

# 2.2 Imports

In [5]:
import pandas as pd
import numpy as np
import re
from math import log2
from nltk.corpus import words
import nltk

# 2.3 Load the data

In [7]:
df = pd.read_csv('../data/processed/url_new_cleaned.csv', index_col=0)

In [8]:
df.head()

Unnamed: 0,url,status
0,0000111servicehelpdesk.godaddysites.com,0
1,000011accesswebform.godaddysites.com,0
2,00003.online,0
3,0009servicedeskowa.godaddysites.com,0
4,000n38p.wcomhost.com,0


# 2.4 Removing prefixes

This step makes the observations more uniform in terms of content. Because the URLs were scraped from the internet, errors during scraping may have caused the loss of https:// or www at the beginning of some of them. To keep the predictive model robust, we do not want its predictions to be biased by the presence or absence of https:// or www.

In [10]:
# Remove a leading scheme: http:// or https://, or a bare http/https left behind when '://' was lost
df['url'] = df['url'].str.replace(r'^(http://|https://|https|http)', '', regex=True)

In [11]:
# Remove any leading dots left after the scheme was stripped
df['url'] = df['url'].str.lstrip('.')

In [12]:
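# Remove a leading 'www' label, including the numbered variants www1 to www4.
# The optional digit is matched together with 'www', so 'www1' is not cut down to a stray '1'.
df['url'] = df['url'].str.replace(r'^www[1-4]?', '', regex=True)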

In [13]:
# Remove the dot that followed the 'www' label
df['url'] = df['url'].str.lstrip('.')
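
To see what the cleaning steps above do to individual strings, the sketch below replays them with plain `re` on a few made-up URLs. `strip_prefixes` is a hypothetical helper, not part of the notebook; it follows the same order as the cells above (scheme, leading dots, `www` variant, leading dots), but the exact patterns are this sketch's own.

```python
import re

def strip_prefixes(url):
    """Hypothetical helper: strip a scheme and a www-style label from one URL string."""
    url = re.sub(r'^(https?://|https?)', '', url)  # scheme, with or without '://'
    url = url.lstrip('.')                           # dots left behind by a lost scheme
    url = re.sub(r'^www[1-4]?', '', url)            # www, www1 ... www4
    return url.lstrip('.')                          # the dot that followed 'www'

for raw in ['https://www.example.com/login', 'www2.example.org', 'example.net/a']:
    print(raw, '->', strip_prefixes(raw))
```

One side effect worth keeping in mind: a host that itself begins with `http`, such as a bare `httpd.apache.org`, loses those letters as well, because the scheme pattern also accepts `http` without `://`.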

In [14]:
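# Pass both prefixes as one tuple (needs pandas >= 1.5); a second positional argument would be read as `na`
df[(df['url'].str.startswith(('http', 'www'))) & (df.status == 1)]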

Unnamed: 0,url,status
712113,http2.github.io/http2-spec/,1
712118,httpd.apache.org/docs/2.4/programs/httpd.html,1


In [15]:
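# Pass both prefixes as one tuple (needs pandas >= 1.5); a second positional argument would be read as `na`
df[(df['url'].str.startswith(('http', 'www'))) & (df.status == 0)]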

Unnamed: 0,url,status
28727,httpdmcaremoval-9742697748.info-protech.be/con...,0
29135,httpsservices.runescape.com-n.ru/,0
29160,httpdmcaremoval-2883901933.info-protech.be/con...,0
29883,httpsservices.runescape.com-no.ru/,0
29989,httpsservices.runescape.com-ov.ru/,0
...,...,...
731033,https-impots-gouve-fr.com/impots.gouv/Impots-e...,0
737627,https.www.payypal.com.resolution.center.limita...,0
737628,https.www.payypal.com.resolution.center.limita...,0
737629,https.www.payppal.com.resolution.center.limita...,0


# 2.5 Extracting the host part

Some of the URLs contain a path and some do not. We therefore extract the host part (subdomain, domain, and top-level domain), which is present in every URL, and generate features from it so that the model generalizes better to unseen data.

In [17]:
# Extract the host part (containing domain, subdomain, tld, port etc.)
df['ext_host'] = df['url'].str.split('/').str[0]
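
Splitting on `/` keeps everything before the first slash, so a port, or a query string that follows the host directly, stays inside `ext_host`. As a point of comparison only, the sketch below uses the standard library's `urllib.parse.urlsplit`; `split_host` is a hypothetical helper, and prefixing `//` is an assumption made here so that a scheme-less string is parsed as starting with the host.

```python
from urllib.parse import urlsplit

def split_host(url):
    """Hypothetical helper: (netloc, hostname, port) of a URL whose scheme was already removed."""
    # The leading '//' tells urlsplit that the string starts with the authority part
    parts = urlsplit('//' + url)
    # .port raises ValueError when the port is not a valid number
    return parts.netloc, parts.hostname, parts.port

for url in ['example.com:8080/a/b', 'example.com?x=1', 'Example.com/path']:
    print(url.split('/')[0], '|', split_host(url))
```

Note that `hostname` is lowercased by `urlsplit`, which would interact with the capital-letter features created later, so this is not a drop-in replacement for the cell above.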

In [18]:
df.tail()

Unnamed: 0,url,status,ext_host
822005,zzufg.com,0,zzufg.com
822006,zzu.li,0,zzu.li
822007,zzz.co.uk,0,zzz.co.uk
822008,zzzoolight.co.za,0,zzzoolight.co.za
822009,zzzoolight.co.za0-i-fdik.000webhostapp.com,0,zzzoolight.co.za0-i-fdik.000webhostapp.com


In [19]:
def host_end(url):
    '''To find any trailing dots in the extracted host'''
    return url.endswith('.')

In [20]:
df[df['ext_host'].apply(host_end)]

Unnamed: 0,url,status,ext_host
12297,ijhriogjef2831g.club.,0,ijhriogjef2831g.club.
180458,n-change.net.,0,n-change.net.
181355,summa.cash.,0,summa.cash.
226306,ssl-allegro.comuf.com./allegro.html,0,ssl-allegro.comuf.com.
230077,update.visaeurope.com.card.verified.us.delphia...,0,update.visaeurope.com.card.verified.us.delphia...
273138,bkent.net./Doc/simple5.htm,1,bkent.net.
678856,syydettyjendatumm.brockalumni.com.,0,syydettyjendatumm.brockalumni.com.
678857,komunistycznymi.afshinnejad.com.,0,komunistycznymi.afshinnejad.com.
678858,nubeculaminor-blossgestellter.f-oaks.com.,0,nubeculaminor-blossgestellter.f-oaks.com.
678859,asseveravronnakiewietsblom.shopdentalsupply.com.,0,asseveravronnakiewietsblom.shopdentalsupply.com.


In [21]:
def remove_trailing_dots(url):
    '''A function to remove any trailing dots in the extracted domain'''
    return url.rstrip('.')
df['ext_host'] = df['ext_host'].apply(remove_trailing_dots)

In [22]:
df[df['ext_host'].apply(host_end)]

Unnamed: 0,url,status,ext_host


In [23]:
df.ext_host.isnull().sum()

0

# 2.6 Feature Creation

### 2.6.1 URL and ext_host length

In [26]:
# Return the rows where column `url` equals the boolean `tf` and `status` equals `num`
def check_bool(url, tf, num):
    return df[(df[url] == tf) & (df['status'] == num)]

In [27]:
# Return the rows where column `url` equals the number `no` and `status` equals `num`
def check_num(url, no, num):
    return df[(df[url] == no) & (df['status'] == num)]

In [28]:
df['len_url'] = df['url'].apply(len)

In [29]:
df['len_host'] = df['ext_host'].apply(len)

### 2.6.2 URL containing IP address

In [31]:
# Function to check for IP addresses
def contains_ip(url):
    # Regex to match an IPv4 address
    ipv4_pattern = r'(\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b)'
    # Regex to match an IPv6 address (optional)
    ipv6_pattern = r'(\[[0-9a-fA-F:]+\])'
    # Combine both (optional if you want to match both IPv4 and IPv6)
    combined_pattern = rf'{ipv4_pattern}|{ipv6_pattern}'
    return bool(re.search(combined_pattern, url))

# Apply the function to the DataFrame
df['contains_ip'] = df['url'].apply(contains_ip)

In [32]:
df['contains_ip'] = df['contains_ip'].astype(int)
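
The IPv4 regex checks only the shape of the string (four groups of one to three digits), so it also accepts values such as 999.999.999.999. A minimal sketch of a stricter check, assuming the standard library's `ipaddress` module is acceptable here; `contains_valid_ipv4` is a hypothetical name and IPv6 is left out for brevity.

```python
import ipaddress
import re

# Same IPv4 candidate shape as the ipv4_pattern used in contains_ip
IPV4_CANDIDATE = re.compile(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')

def contains_valid_ipv4(url):
    """Hypothetical helper: True if any dotted-quad candidate parses as an IPv4 address."""
    for candidate in IPV4_CANDIDATE.findall(url):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError for octets above 255
            return True
        except ValueError:
            continue
    return False

for url in ['10.0.0.1:8080/admin', '999.999.999.999/x', 'example.com/v1.2.3']:
    print(url, contains_valid_ipv4(url))
```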

### 2.6.3 Containing java script queries

JavaScript-related query parameters in URLs often serve specific functions. Parameters like callback, load, or js are often used for loading dynamic JavaScript resources or APIs. A parameter like query might carry search keywords or other data, and a URL may also point to a .js file directly. While these are used for legitimate purposes, they can also be used to mimic legitimate services and trick users, or be manipulated to inject malicious scripts or redirect to unauthorized resources.

Phishing URLs often include overly complex or meaningless query strings, while legitimate URLs tend to use structured, predictable, and meaningful queries.

High entropy in query parameters might indicate randomization or obfuscation, common in phishing URLs to bypass detection systems.

Shannon entropy is a measure of uncertainty or variability in a system, based on the probabilities of its possible outcomes. It was introduced by Claude Shannon in 1948 as part of information theory, to quantify the information content of messages; it was later adopted in ecology to measure the richness and evenness of plant and animal species.
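
For a string, the entropy in bits per character is H = -Σ p(c) · log2 p(c), where p(c) is the relative frequency of character c. The sketch below is a small worked example of that formula; `shannon_entropy` is a hypothetical name chosen so that it does not clash with the notebook's own functions.

```python
from collections import Counter
from math import log2

def shannon_entropy(text):
    """Entropy in bits per character of the character distribution of `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * log2(n / total) for n in counts.values())

# One repeated character carries no uncertainty; four equally likely characters need 2 bits each
for sample in ['aaaa', 'aabb', 'abcd']:
    print(sample, shannon_entropy(sample))
```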

In [34]:
query_pattern = r'\?([^=]+)=([^&]+)'  # Detects query parameters
fragment_pattern = r'#([^/]+)'  # Detects fragment identifiers
js_param_pattern = r'[?&](js|callback|load|query)=[^&]+'  # Detects JavaScript-like parameters
js_file_pattern = r'\.js$'  # Detects .js in the URL path

# Combine all patterns with | (alternation)
combined_pattern = f'{query_pattern}|{fragment_pattern}|{js_param_pattern}|{js_file_pattern}'

# Function to check for JavaScript-related patterns in the URL
def check_js(url):
    return bool(re.search(combined_pattern, url))

# Apply the function to the DataFrame
df['has_js'] = df['url'].apply(check_js)

In [35]:
df['has_js'] = df['has_js'].astype(int)
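
Because the four patterns are combined with alternation, `has_js` is 1 as soon as any one of them matches, and the generic query pattern matches almost any `?key=value` pair. The sketch below reports which sub-pattern fires for a few made-up URLs; `PATTERNS` and `matched_patterns` are hypothetical names, and the regexes are copied from the cell above.

```python
import re

PATTERNS = {
    'query': r'\?([^=]+)=([^&]+)',
    'fragment': r'#([^/]+)',
    'js_param': r'[?&](js|callback|load|query)=[^&]+',
    'js_file': r'\.js$',
}

def matched_patterns(url):
    """Hypothetical helper: names of the sub-patterns that match `url`."""
    return [name for name, pattern in PATTERNS.items() if re.search(pattern, url)]

for url in ['example.com/search?id=42', 'example.com/api?callback=run',
            'example.com/static/app.js', 'example.com/app.js?v=3']:
    print(url, matched_patterns(url))
```

The last URL shows that `\.js$` only matches when `.js` is the very end of the string, so a script URL with a version parameter is caught by the query pattern instead.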

In [36]:
# Calculating the entropy of all the patterns found in a url

In [37]:
query_pattern = r'\?([^=]+)=([^&]+)'  # Detects query parameters
fragment_pattern = r'#([^/]+)'  # Detects fragment identifiers
js_param_pattern = r'[?&](js|callback|load|query)=[^&]+'  # Detects JavaScript-like parameters
js_file_pattern = r'\.js$'  # Detects .js in the URL path

# Combine all patterns with | (alternation)
combined_pattern = f'{query_pattern}|{fragment_pattern}|{js_param_pattern}|{js_file_pattern}'

# Function to extract all matches
def extract_matches(url):
    matches = re.findall(combined_pattern, url)
    return matches if matches else []

# Function to calculate entropy
def calculate_entropy(matches):
    if not matches:
        return 0  # No matches, no entropy
    combined = ''.join(''.join(match) for match in matches)  # Flatten the list of tuples into a string
    freq = {char: combined.count(char) for char in set(combined)}  # Count character frequency
    total = len(combined)  # Total characters
    entropy = -sum((count / total) * log2(count / total) for count in freq.values())  # Shannon entropy
    return entropy

# Apply the functions to the DataFrame
df['query_js'] = df['url'].apply(extract_matches)  # Store matched patterns as a list
df['js_entropy'] = df['query_js'].apply(calculate_entropy)  # Calculate entropy of the matches
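
When a pattern contains several capture groups, `re.findall` returns one tuple per match with one slot per group, and groups that did not take part in the match are empty strings. The sketch below shows what that means for the combined pattern; `grouped_matches` is a hypothetical name, and the regex is the same four alternatives written out as one string.

```python
import re

# query (2 groups) | fragment (1 group) | js parameter (1 group) | .js file (no group)
COMBINED = r'\?([^=]+)=([^&]+)|#([^/]+)|[?&](js|callback|load|query)=[^&]+|\.js$'

def grouped_matches(url):
    """Hypothetical helper: the raw tuples that re.findall returns for the combined pattern."""
    return re.findall(COMBINED, url)

for url in ['example.com/api?callback=run', 'example.com/static/app.js']:
    print(url, grouped_matches(url))
```

The `.js` alternative has no capture group, so a URL matched only by it yields a tuple of empty strings and an entropy of 0, the same value as a URL with no match at all.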

This is most probably not a very significant feature.

### 2.6.4 Finding number-dot pattern

In [40]:
def find_long_number_dot_patterns(url):
    # Regex to match number-and-dot patterns
    pattern = r'(\d+(\.\d+)+)'
    matches = re.findall(pattern, url)
    # Extract only the patterns and filter based on length
    filtered_matches = [(match[0], len(match[0])) for match in matches if len(match[0]) > 4]
    return filtered_matches

# Apply the function to the 'url' column to extract patterns
df['num_dotnum_pat_4'] = df['url'].apply(find_long_number_dot_patterns)

# Create a column that stores the number of such patterns found in each URL
df['num_pat_4'] = df['num_dotnum_pat_4'].apply(lambda x: len(x))
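
The pattern above has nested capture groups, which is why `re.findall` returns tuples and the code picks `match[0]`. A minimal alternative sketch, assuming only the matched text is needed: `re.finditer` with a non-capturing group gives the full match directly. `long_number_dot_patterns` and `min_len` are hypothetical names; `min_len=5` corresponds to the `> 4` filter above.

```python
import re

NUMBER_DOT = re.compile(r'\d+(?:\.\d+)+')  # digits separated by dots, e.g. 10.25 or 192.168.0.1

def long_number_dot_patterns(url, min_len=5):
    """Hypothetical helper: number-dot runs of at least `min_len` characters."""
    return [m.group(0) for m in NUMBER_DOT.finditer(url) if len(m.group(0)) >= min_len]

for url in ['192.168.0.1/login', 'example.com/v1.2', 'site.com/10.25/x']:
    print(url, long_number_dot_patterns(url))
```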

### 2.6.5 Capital letters in host

In [42]:
def count_capital_letters(url):
    # Find all capital letters using a regex
    return len(re.findall(r'[A-Z]', url))
df['host_len_capital'] = df['ext_host'].apply(count_capital_letters)

In [43]:
def ratio_capital(len_capital, ext_host):
    if len_capital == 0:
        return 0
    else:
        return len_capital / len(ext_host)
df['host_ratio_capital'] = df.apply(lambda x: ratio_capital(x['host_len_capital'], x['ext_host']), axis=1)

### 2.6.6 Capital letters in url

In [45]:
def count_capital_letters(url):
    # Find all capital letters using a regex
    return len(re.findall(r'[A-Z]', url))
df['url_len_capital'] = df['url'].apply(count_capital_letters)

In [46]:
def ratio_capital(len_capital, text):
    # Same helper as in 2.6.5; here `text` is the full url rather than the host
    if len_capital == 0:
        return 0
    else:
        return len_capital / len(text)
df['url_ratio_capital'] = df.apply(lambda x: ratio_capital(x['url_len_capital'], x['url']), axis=1)

### 2.6.7 Calculating Shannon entropy

#### 2.6.7.1 Calculating Shannon entropy for url

In [49]:
# Define the entropy calculation function
def calculate_entropy(url):
    # Calculate the frequency of each character
    probabilities = [float(url.count(char)) / len(url) for char in set(url)]
    # Compute Shannon entropy
    entropy = -sum([p * np.log2(p) for p in probabilities])
    return entropy

# Apply the entropy function to the 'url' column
df['url_entropy'] = df['url'].apply(calculate_entropy)

#### 2.6.7.2 Calculating Shannon entropy for ext_host

In [51]:
# Define the entropy calculation function
def calculate_entropy(url):
    # Calculate the frequency of each character
    probabilities = [float(url.count(char)) / len(url) for char in set(url)]
    # Compute Shannon entropy
    entropy = -sum([p * np.log2(p) for p in probabilities])
    return entropy

# Apply the entropy function to the 'ext_host' column
df['host_entropy'] = df['ext_host'].apply(calculate_entropy)

### 2.6.8 Calculating alphabet based ratios

#### 2.6.8.1 Calculating character to number ratio in url

In [54]:
# Function to calculate character-to-number ratio
def calculate_char_number_ratio(url):
    chars = sum(x.isalpha() for x in url)
    digits = sum(x.isdigit() for x in url)
    # Avoid division by zero, return a ratio of 0 if no digits found
    return chars / digits if digits != 0 else 0

# Apply the function to the 'url' column and create a new column 'url_char_num_ratio'
df['url_char_num_ratio'] = df['url'].apply(calculate_char_number_ratio)
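
The ratio returns 0 both when a string has no digits and when it has no letters, so two opposite cases share one value. The sketch below reproduces the ratio and, purely as an illustrative alternative that the notebook does not use, a bounded digit share; `char_number_ratio` and `digit_fraction` are hypothetical names.

```python
def char_number_ratio(text):
    """Same logic as calculate_char_number_ratio in the cell above."""
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    return letters / digits if digits != 0 else 0

def digit_fraction(text):
    """Hypothetical alternative: share of digits among alphanumeric characters, in [0, 1]."""
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    total = letters + digits
    return digits / total if total else 0.0

for sample in ['paypal', '12345', 'abc123']:
    print(sample, char_number_ratio(sample), digit_fraction(sample))
```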

#### 2.6.8.2 Calculating character to special characters ratio in url

In [56]:
# Function to calculate the ratio of alphanumeric characters to special characters
def calculate_spe_char_ratio(url):
    chars = sum(x.isalpha() for x in url)
    digits = sum(x.isdigit() for x in url)
    char_digit = chars + digits
    special_excluding_digits = len(url) - char_digit
    # Avoid division by zero, return a ratio of 0 if no special characters found
    return char_digit / special_excluding_digits if special_excluding_digits != 0 else 0

# Apply the function to the 'url' column and create a new column 'url_spe_char_ratio'
df['url_spe_char_ratio'] = df['url'].apply(calculate_spe_char_ratio)

#### 2.6.8.3 Calculating character to number ratio in host

In [58]:
df['host_char_num_ratio'] = df['ext_host'].apply(calculate_char_number_ratio)

#### 2.6.8.4 Calculating character to special characters ratio in host

In [60]:
df['host_spe_char_ratio'] = df['ext_host'].apply(calculate_spe_char_ratio)

### 2.6.9 Detecting obfuscation

#### 2.6.9.1 Detecting obfuscation in url

In [63]:
# Function to detect obfuscation
def detect_obfuscation(url):
    obfuscation_patterns = [
        r'%[0-9a-fA-F]{2}',  # URL encoding
        r'[a-fA-F0-9]{4,}',  # Hexadecimal
    ]
    # Check if any of the obfuscation patterns match
    obfuscation_detected = any(re.search(pattern, url) for pattern in obfuscation_patterns)
    return 0 if obfuscation_detected else 1  # 0 for obfuscated, 1 for not obfuscated

df['url_obfuscation_status'] = df['url'].apply(detect_obfuscation)

Code Explanation

The obfuscation_patterns list contains two regex patterns.

1. r'%[0-9a-fA-F]{2}' looks for URL encoding.
   * % matches the literal percent symbol, which is used in URL encoding.
   * [0-9a-fA-F] matches any hexadecimal digit (0-9, a-f, or A-F).
   * {2} specifies that exactly 2 hexadecimal characters should follow %.
   * Example matches:
     * https://example.com/page%20name (contains %20, which is URL encoding for a space).
     * https://example.com/file%3Fquery (contains %3F, which is URL encoding for ?).
2. r'[a-fA-F0-9]{4,}' identifies long sequences of hexadecimal characters (4 or more).
   * [a-fA-F0-9] matches any hexadecimal digit (0-9, a-f, or A-F).
   * {4,} means the preceding pattern must appear at least 4 times, with no upper limit.
   * Example matches:
     * https://example.com/page/deadbeef (contains deadbeef, a hexadecimal string).
     * https://example.com/file/123abc (contains 123abc, another hexadecimal string).
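
To check the two patterns against concrete strings, the sketch below lists what each one actually matches. `obfuscation_hits` is a hypothetical helper; the regexes are the ones described above, and the sample URLs are made up.

```python
import re

URL_ENCODING = re.compile(r'%[0-9a-fA-F]{2}')
HEX_RUN = re.compile(r'[a-fA-F0-9]{4,}')

def obfuscation_hits(url):
    """Hypothetical helper: the substrings matched by each obfuscation pattern."""
    return {
        'url_encoding': URL_ENCODING.findall(url),
        'hex_run': HEX_RUN.findall(url),
    }

for url in ['example.com/page%20name', 'example.com/page/deadbeef',
            'facebook.com', 'example.com/2024/report']:
    print(url, obfuscation_hits(url))
```

The last two URLs illustrate a limitation: ordinary text made only of the letters a to f (the `faceb` in facebook) and any number with four or more digits also count as a hexadecimal run, so the flag may fire on URLs that are not obfuscated.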

#### 2.6.9.2 Detecting obfuscation in host

In [66]:
df['host_obfuscation_status'] = df['ext_host'].apply(detect_obfuscation)

### 2.6.10 Calculating entropy of random substring

#### 2.6.10.1 Calculating entropy of random substring in url

In [69]:
# Function to find random alphanumeric substrings in URL
def find_random_substrings(url):
    # Regex for long random alphanumeric substrings (threshold of 5 or more characters)
    random_pattern = re.compile(r'[a-zA-Z0-9]{5,}')  # You can adjust the threshold if needed
    matches = random_pattern.findall(url)
    return matches

# Apply the function to the 'url' column and create a new column 'url_random_substrings'
df['url_random_substrings'] = df['url'].apply(find_random_substrings)

In [70]:
def cal_substring_entropy(substrings):
    if not substrings:
        return 0
    total_entropy = sum(calculate_entropy(sub) for sub in substrings)
    return total_entropy

In [71]:
df['url_sub_entropy'] = df['url_random_substrings'].apply(cal_substring_entropy)
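
Since the feature is a sum over substrings, it grows both with how random each run looks and with how many runs a URL contains. The sketch below shows the per-run values that go into that sum; `entropy` and `substring_entropies` are hypothetical names, and `min_len=5` mirrors the threshold in the regex above.

```python
import re
from collections import Counter
from math import log2

def entropy(text):
    """Shannon entropy in bits per character."""
    counts = Counter(text)
    return -sum((n / len(text)) * log2(n / len(text)) for n in counts.values())

def substring_entropies(url, min_len=5):
    """Hypothetical helper: (run, entropy) for each alphanumeric run of at least min_len characters."""
    runs = re.findall(r'[a-zA-Z0-9]{%d,}' % min_len, url)
    return [(run, entropy(run)) for run in runs]

pairs = substring_entropies('aaaaa.com/abcdefgh')
print(pairs)
print('sum:', sum(value for _, value in pairs))
```

Dividing the sum by the number of runs would give a mean that is less sensitive to URL length; whether that is preferable is a modelling choice the notebook leaves open.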

#### 2.6.10.2 Calculating entropy of random substring in host

In [73]:
# Apply the function to the 'ext_host' column and create a new column 'host_random_substrings'
df['host_random_substrings'] = df['ext_host'].apply(find_random_substrings)

In [74]:
df['host_sub_entropy'] = df['host_random_substrings'].apply(cal_substring_entropy)

### 2.6.11 The number of individual special characters<a id='2.6.11-The-number-of-individual-special-chracters'></a>

#### 2.6.11.1 Number of hyphens

In [78]:
def count_hypens(url):
    return url.count('-')
df['n_hypens'] = df['url'].apply(count_hypens)

#### 2.6.11.2 Number of underscores

In [80]:
def count_uscore(url):
    return url.count('_')
df['n_uscores'] = df['url'].apply(count_uscore)

#### 2.6.11.3 Number of semicolons

In [82]:
def count_semicolon(url):
    return url.count(';')
df['n_semicolon'] = df['url'].apply(count_semicolon)

#### 2.6.11.4 Number of equal signs

In [84]:
def count_equsign(url):
    return url.count('=')
df['n_equal_sign'] = df['url'].apply(count_equsign)
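
The four counters above differ only in the character they look for. As a compact sketch of the same idea, one pass with `collections.Counter` can produce all four counts; `SPECIAL_CHARS` and `count_special` are hypothetical names, and the dictionary keys reuse the notebook's column names.

```python
from collections import Counter

# Character -> column name, matching the columns created above
SPECIAL_CHARS = {'-': 'n_hypens', '_': 'n_uscores', ';': 'n_semicolon', '=': 'n_equal_sign'}

def count_special(url):
    """Hypothetical helper: count each tracked special character in one pass over the URL."""
    counts = Counter(url)
    return {column: counts[char] for char, column in SPECIAL_CHARS.items()}

print(count_special('my-site.com/a_b;c=d-e'))
```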

### 2.6.12 Applying Similarity Index to top level domain

#### 2.6.12.1 Extracting the top level domain

In [166]:
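# Extracting the top level domain (the last dot-separated label of the host)
def tld_url(url):
    parts = url.split('.')
    # split() always returns at least one element, so test for an actual dot;
    # a host without a dot has no TLD. A port suffix, if present, stays attached.
    return parts[-1] if len(parts) > 1 else None
df['tlds'] = df['ext_host'].apply(tld_url)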
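
Because `ext_host` can still carry a port, the last label may come out as something like `com:8080`. The sketch below is one hedged way to handle that; `last_label` is a hypothetical helper, and treating a purely numeric suffix after the final colon as a port is an assumption of this sketch.

```python
def last_label(host):
    """Hypothetical helper: last dot-separated label of a host, ignoring a numeric port."""
    name, sep, port = host.rpartition(':')
    if sep and port.isdigit():
        host = name                      # drop ':8080' style suffixes
    parts = host.rstrip('.').split('.')  # tolerate a trailing dot
    return parts[-1].lower() if len(parts) > 1 else None

for host in ['example.com', 'example.co.uk:8080', 'localhost', 'Example.COM.']:
    print(host, last_label(host))
```

This still returns only the final label, so `zzz.co.uk` gives `uk` rather than `co.uk`, and an IP address gives its last octet; recognising multi-label suffixes would need a public suffix list, which this sketch does not attempt.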