# **PhishGuard: Phishing Website Detection and Classification Using URL & DNS Features**

## Author Information

- Author_name = "Alisha Minj"
- Affiliation = "UMBC Data Science Master’s Degree Capstone"
- Github_link = "https://github.com/DATA-606-2023-FALL-THURSDAY/Minj_Alisha"
- Linkedin_link = "https://www.linkedin.com/in/alisha-minj"


## **Background**

The digital landscape, though filled with opportunities, is riddled with threats that pose significant challenges to individual and institutional security. One such prevalent threat is phishing. Phishing websites, designed to deceive and extract confidential information from unsuspecting users, leverage the trust that is typically associated with legitimate domains. As the vastness and complexity of the internet grow, manual detection and mitigation of such threats become increasingly untenable. This project, "PhishGuard", endeavors to bridge this gap by employing URL and DNS features to robustly detect and classify domains based on their potential malicious intent.

The creation and deployment of an effective detection model holds paramount importance in today's digital age. It extends benefits not only to individual users but also to businesses, organizations, and governments. By proactively identifying and classifying potential threats, we can significantly reduce the risk of data breaches and other security compromises, fostering a safer digital environment for all.


## Data Sources

Two primary sources drive this project's dataset:
1. **PhishTank**: An open-source service that provides an hourly updated list of phishing URLs. 
    - [PhishTank Data](https://www.phishtank.com/developer_info.php)
    
    
2. **University of New Brunswick**: They offer a collection encompassing benign, spam, phishing, malware, and defacement URLs. 
    - [UNB Dataset](https://www.unb.ca/cic/datasets/url-2016.html)


## **Data Collection and Loading**

Data plays a crucial role in building efficient machine learning models. For this project, we rely on two primary sources to obtain our datasets:


## **Phishing URLs**
     
Source: PhishTank

PhishTank, an open community project, lets users submit and verify phishing URLs, which gives researchers and analysts a comprehensive, regularly updated list.


- **Loading Phishing URLs:**


In [1]:
# Importing necessary libraries
import pandas as pd

In [2]:
# PhishTank provides an up-to-date dataset of live phishing URLs.
# For this project, we're downloading the latest set of valid phishing URLs.
#!wget http://data.phishtank.com/data/online-valid.csv


In [3]:
# URL of the dataset
data_url = "https://raw.githubusercontent.com/DATA-606-2023-FALL-THURSDAY/Minj_Alisha/main/data/PhishingData.csv"

# Loading the data into a pandas DataFrame
phish_data = pd.read_csv(data_url)

# Displaying the first few rows of the dataset
phish_data.head()

Unnamed: 0,phish_id,url,phish_detail_url,submission_time,verified,verification_time,online,target
0,6557033,http://u1047531.cp.regruhosting.ru/acces-inges...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:43+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
1,6557032,http://hoysalacreations.com/wp-content/plugins...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T22:01:37+00:00,yes,2020-05-09T22:03:07+00:00,yes,Other
2,6557011,http://www.accsystemprblemhelp.site/checkpoint...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:54:31+00:00,yes,2020-05-09T21:55:38+00:00,yes,Facebook
3,6557010,http://www.accsystemprblemhelp.site/login_atte...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:53:48+00:00,yes,2020-05-09T21:54:34+00:00,yes,Facebook
4,6557009,https://firebasestorage.googleapis.com/v0/b/so...,http://www.phishtank.com/phish_detail.php?phis...,2020-05-09T21:49:27+00:00,yes,2020-05-09T21:51:24+00:00,yes,Microsoft


The displayed table offers a glimpse of the dataset structure, illustrating features such as the unique identifier (phish_id), the URL in question (url), the detailed link for verification (phish_detail_url), the submission time, verification attributes, and the targeted brand or entity (target).


## **Legitimate URLs**

Source: University of New Brunswick

For the legitimate URLs, we rely on a dataset from the University of New Brunswick, which serves as the counterpart to the phishing URLs. These URLs are labeled as benign in the source collection and are treated as posing no security threat.


- **Loading Legitimate URLs:**

In [4]:
# Loading Benign URLs Data from the URL
# URL of the benign dataset
benign_data_url = "https://raw.githubusercontent.com/DATA-606-2023-FALL-THURSDAY/Minj_Alisha/main/data/BenignURLs.csv"

# Loading the benign data into a pandas DataFrame
legit_data = pd.read_csv(benign_data_url)

# Renaming the column to maintain consistency
legit_data.columns = ['url']

# Displaying the first few rows of the benign dataset
legit_data.head()


Unnamed: 0,url
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


The dataset consists of URLs that have been flagged as non-malicious and serve as a standard for legitimate online entities.


## **Data Overview**

Once the datasets are loaded, a useful first step is to check how many URLs each one contains and how the entries are distributed between phishing and legitimate.


In [5]:
print(f"Total Phishing URLs Loaded: {len(phish_data)}")
#print(f"Total Phishing URLs Selected for Analysis: {len(phishing_subset)}")
print(f"Total Legitimate URLs Loaded: {len(legit_data)}")


Total Phishing URLs Loaded: 14858
Total Legitimate URLs Loaded: 35377


Findings:

We loaded 14,858 phishing URLs and 35,377 legitimate URLs, roughly 2.4 legitimate URLs for every phishing URL. This imbalance needs to be addressed before the modeling phase, for example by sampling an equal number of URLs from each class.


**Data Quality Check**

Before diving into feature extraction, we check the quality and integrity of the data.


- **Basic Statistics**

In [6]:
# Phishing URLs
print("Number of Phishing URLs:", len(phish_data))
print("Unique Phishing URLs:", phish_data['url'].nunique())



Number of Phishing URLs: 14858
Unique Phishing URLs: 14855


In [7]:
# Removing duplicate URLs
phish_data.drop_duplicates(subset='url', keep='first', inplace=True)

# Verifying the removal
print("Number of Phishing URLs after removing duplicates:", len(phish_data))
print("Unique Phishing URLs after removing duplicates:", phish_data['url'].nunique())


Number of Phishing URLs after removing duplicates: 14855
Unique Phishing URLs after removing duplicates: 14855


In [8]:
# Legitimate URLs
print("Number of Legitimate URLs:", len(legit_data))
print("Unique Legitimate URLs:", legit_data['url'].nunique())


Number of Legitimate URLs: 35377
Unique Legitimate URLs: 35377


Comparing the total number of URLs with the number of unique URLs reveals duplicates: the phishing dataset contained three (14,858 rows, 14,855 unique), while the legitimate dataset contained none.


- **Missing Values**

Ensuring the data's completeness is paramount. Missing or NaN values can disrupt the analysis and modeling process.


In [9]:
# Phishing URLs
print("Missing values in Phishing dataset:", phish_data.isnull().sum())

# Legitimate URLs
print("Missing values in Legitimate dataset:", legit_data.isnull().sum())


Missing values in Phishing dataset: phish_id             0
url                  0
phish_detail_url     0
submission_time      0
verified             0
verification_time    0
online               0
target               0
dtype: int64
Missing values in Legitimate dataset: url    0
dtype: int64


The datasets appear clean without any missing values, indicating that each URL is accompanied by its respective attributes without any gaps.


- **Duplicate Values**

Redundant entries need to be addressed to maintain the dataset's integrity.




In [10]:
# Phishing URLs
initial_count = len(phish_data)
# keep="first" retains one copy of each URL; keep=False would drop every copy of a duplicated URL
phish_data.drop_duplicates(subset="url", keep="first", inplace=True)
final_count = len(phish_data)

# Indicate the number of unique values and confirm no duplicates
print(f"{final_count} unique phishing URLs remain. No duplicate values.")


14855 unique phishing URLs remain. No duplicate values.


In [11]:
# Legitimate URLs
initial_count = len(legit_data)
# De-duplicate the legitimate dataset (not phish_data) before counting
legit_data.drop_duplicates(subset="url", keep="first", inplace=True)
final_count = len(legit_data)

# Indicate the number of unique values and confirm no duplicates
print(f"{final_count} unique legitimate URLs remain. No duplicate values.")


35377 unique legitimate URLs remain. No duplicate values.
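
The `keep` argument of `drop_duplicates` decides which copies of a repeated row survive. The sketch below illustrates the difference on hypothetical toy data (the URLs are made up and are not part of the project datasets): `keep="first"` retains one copy of each URL, while `keep=False` removes every copy of any URL that appears more than once.

```python
import pandas as pd

# Hypothetical toy data: "http://a.example" appears twice
toy = pd.DataFrame(
    {"url": ["http://a.example", "http://b.example", "http://a.example"]}
)

# Keep one copy of each URL
kept_first = toy.drop_duplicates(subset="url", keep="first")

# Drop every row whose URL occurs more than once
kept_none = toy.drop_duplicates(subset="url", keep=False)

print(len(kept_first))
print(len(kept_none))
```

For de-duplication that should preserve one instance of each URL, `keep="first"` (or `keep="last"`) is usually the intended setting.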


- **Data Distribution**


In [12]:
# Taking a subset of 14,855 from both datasets to maintain balance
phishing_subset = phish_data.sample(n=14855, random_state=42)
legit_subset = legit_data.sample(n=14855, random_state=42)

print(f"Total URLs: {len(phishing_subset) + len(legit_subset)}")


Total URLs: 29710
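
The hard-coded sample size of 14,855 works only as long as the phishing set keeps exactly that many rows. As a rough sketch, the sample size could instead be derived from the smaller dataset. The helper name `balance`, the toy frames, and the label convention (1 for phishing, 0 for legitimate) are assumptions made for illustration.

```python
import pandas as pd

def balance(phish_df, legit_df, seed=42):
    """Downsample both frames to the size of the smaller one and label them."""
    n = min(len(phish_df), len(legit_df))
    phish_part = phish_df.sample(n=n, random_state=seed).assign(label=1)
    legit_part = legit_df.sample(n=n, random_state=seed).assign(label=0)
    return pd.concat([phish_part, legit_part], ignore_index=True)

# Hypothetical toy frames standing in for phish_data and legit_data
toy_phish = pd.DataFrame({"url": [f"http://p{i}.example" for i in range(3)]})
toy_legit = pd.DataFrame({"url": [f"http://l{i}.example" for i in range(7)]})

balanced = balance(toy_phish, toy_legit)
print(balanced["label"].value_counts().to_dict())
```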


- **Sample URL Preview**

Sampling a few URLs from each dataset provides a firsthand look into the nature of entries.


In [13]:
# Phishing URLs
print("Sample Phishing URLs:")
print(phish_data.sample(5))

# Legitimate URLs
print("\nSample Legitimate URLs:")
print(legit_data.sample(5))


Sample Phishing URLs:
       phish_id                                                url  \
13934   5843400     https://apple.com-recaptcha.edgemasterint.com/   
4084    6539084  http://clsrockies.com/dropbox-loginclient/drop...   
6900    6512675  http://www.testing-adjusting-balancing.com/wp-...   
3520    6542447  https://irannail.com/override/classes/assets/b...   
5106    6531163  http://rohtolab.com/public/templates/mmp/web/-...   

                                        phish_detail_url  \
13934  http://www.phishtank.com/phish_detail.php?phis...   
4084   http://www.phishtank.com/phish_detail.php?phis...   
6900   http://www.phishtank.com/phish_detail.php?phis...   
3520   http://www.phishtank.com/phish_detail.php?phis...   
5106   http://www.phishtank.com/phish_detail.php?phis...   

                 submission_time verified          verification_time online  \
13934  2018-11-14T13:06:19+00:00      yes  2018-12-23T23:41:07+00:00    yes   
4084   2020-04-30T14:03:14+00:00      

The displayed samples provide a taste of what the datasets contain, from the structure of URLs to the targeted entities in the case of phishing URLs.


In [14]:
# importing required packages for this section
from urllib.parse import urlparse,urlencode
import ipaddress
import re

In [15]:
'''

# 1.Domain of the URL (Domain)
def getDomain(url):
  domain = urlparse(url).netloc
  # Strip a leading "www." only; str.replace would also remove it mid-domain
  if domain.startswith("www."):
    domain = domain[len("www."):]
  return domain
     

# 2.Checks for IP address in URL (Have_IP)
def havingIP(url):
  try:
    # Test the host part only; a full URL never parses as an IP address
    ipaddress.ip_address(urlparse(url).hostname)
    ip = 1
  except ValueError:
    ip = 0
  return ip

# 3.Checks the presence of @ in URL (Have_At)
def haveAtSign(url):
  if "@" in url:
    at = 1    
  else:
    at = 0    
  return at

# 4.Finding the length of URL and categorizing (URL_Length)
def getLength(url):
  if len(url) < 54:
    length = 0            
  else:
    length = 1            
  return length

# 5.Gives number of '/' in URL (URL_Depth)
def getDepth(url):
  s = urlparse(url).path.split('/')
  depth = 0
  for j in range(len(s)):
    if len(s[j]) != 0:
      depth = depth+1
  return depth
     

# 6.Checking for redirection '//' in the url (Redirection)
def redirection(url):
  # A '//' found after the scheme separator ("http://" or "https://") signals an embedded redirect
  return 1 if url.rfind('//') > 7 else 0
     

# 7.Existence of “HTTPS” Token in the Domain Part of the URL (https_Domain)
def httpDomain(url):
  domain = urlparse(url).netloc
  if 'https' in domain:
    return 1
  else:
    return 0

#listing shortening services
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"
     


# 8. Checking for Shortening Services in URL (Tiny_URL)
def tinyURL(url):
    match=re.search(shortening_services,url)
    if match:
        return 1
    else:
        return 0

# 9.Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1            # phishing
    else:
        return 0            # legitimate


!pip install python-whois

# importing required packages for this section
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime
     

# 11.DNS Record availability (DNS_Record)
# obtained in the featureExtraction function itself
     

# 12.Web traffic (Web_Traffic)
def web_traffic(url):
  try:
    #Filling the whitespaces in the URL if any
    url = urllib.parse.quote(url)
    rank = BeautifulSoup(urllib.request.urlopen("http://data.alexa.com/data?cli=10&dat=s&url=" + url).read(), "xml").find(
        "REACH")['RANK']
    rank = int(rank)
  except TypeError:
        return 1
  if rank <100000:
    return 1
  else:
    return 0

# 13.Survival time of domain: The difference between termination time and creation time (Domain_Age)  
def domainAge(domain_name):
  creation_date = domain_name.creation_date
  expiration_date = domain_name.expiration_date
  if (isinstance(creation_date,str) or isinstance(expiration_date,str)):
    try:
      creation_date = datetime.strptime(creation_date,'%Y-%m-%d')
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if ((expiration_date is None) or (creation_date is None)):
      return 1
  elif ((type(expiration_date) is list) or (type(creation_date) is list)):
      return 1
  else:
    ageofdomain = abs((expiration_date - creation_date).days)
    if ((ageofdomain/30) < 6):
      age = 1
    else:
      age = 0
  return age
     

# 14.End time of domain: The difference between termination time and current time (Domain_End) 
def domainEnd(domain_name):
  expiration_date = domain_name.expiration_date
  if isinstance(expiration_date,str):
    try:
      expiration_date = datetime.strptime(expiration_date,"%Y-%m-%d")
    except:
      return 1
  if (expiration_date is None):
      return 1
  elif (type(expiration_date) is list):
      return 1
  else:
    today = datetime.now()
    end = abs((expiration_date - today).days)
    if ((end/30) < 6):
      end = 0
    else:
      end = 1
  return end


# importing required packages for this section
import requests

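# 15. IFrame Redirection (iFrame)
def iframe(response):
  if response == "":
      return 1
  else:
      # Look for an <iframe> tag in the page source
      if re.search(r"<iframe", response.text, re.IGNORECASE):
          return 0
      else:
          return 1

# 16.Checks the effect of mouse over on status bar (Mouse_Over)
def mouseOver(response): 
  if response == "" :
    return 1
  else:
    # An empty pattern matches every page, so search for an onmouseover handler instead
    if re.search(r"onmouseover", response.text, re.IGNORECASE):
      return 1
    else:
      return 0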

# 17.Checks the status of the right click attribute (Right_Click)
def rightClick(response):
  if response == "":
    return 1
  else:
    if re.findall(r"event.button ?== ?2", response.text):
      return 0
    else:
      return 1


# 18.Checks the number of forwardings (Web_Forwards)    
def forwarding(response):
  if response == "":
    return 1
  else:
    if len(response.history) <= 2:
      return 0
    else:
      return 1
     

#Function to extract features
def featureExtraction(url,label):

  features = []
  #Address bar based features (9)
  features.append(getDomain(url))
  features.append(havingIP(url))
  features.append(haveAtSign(url))
  features.append(getLength(url))
  features.append(getDepth(url))
  features.append(redirection(url))
  features.append(httpDomain(url))
  features.append(tinyURL(url))
  features.append(prefixSuffix(url))
  
  #Domain based features (4)
  dns = 0
  try:
    domain_name = whois.whois(urlparse(url).netloc)
  except:
    dns = 1

  features.append(dns)
  features.append(web_traffic(url))
  features.append(1 if dns == 1 else domainAge(domain_name))
  features.append(1 if dns == 1 else domainEnd(domain_name))
  
  # HTML & Javascript based features (4)
  try:
    response = requests.get(url)
  except:
    response = ""
  features.append(iframe(response))
  features.append(mouseOver(response))
  features.append(rightClick(response))
  features.append(forwarding(response))
  features.append(label)
  
  return features

legit_subset.shape

for i in range(0, 14855):
    url = legit_subset['url'].iloc[i]
    ...


legit_subset = legit_subset.reset_index(drop=True)
for i in range(0, 14855):
    url = legit_subset['url'][i]
    ...


legi_features = []
label = 0

# Reset the index for the subset dataframe
legit_subset = legit_subset.reset_index(drop=True)

# Loop through each URL in the legit_subset dataframe
for i in range(0, 14855):
    url = legit_subset['url'][i]
    try:
        # Append the features extracted from the URL to the legi_features list
        legi_features.append(featureExtraction(url, label))
    except Exception as e:
        # If there's an error with extracting features for a specific URL, print the error and continue
        print(f"Error processing URL {url}: {e}")
        continue

# You should now have a list of features for each legitimate URL in legi_features


#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection', 
                      'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Web_Traffic', 
                      'Domain_Age', 'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(legi_features, columns= feature_names)
legitimate.head()

# Storing the extracted legitimate URLs features to csv file
legitimate.to_csv('legitimate.csv', index= False)
     
'''


## **Feature Extraction**

Feature extraction is a pivotal step in the data preprocessing pipeline, especially when dealing with URLs. Unlike structured data, URLs come in a textual format which isn't directly consumable by most machine learning models. By extracting meaningful features from these URLs, we can transform this unstructured data into a structured format, making it suitable for modeling.


- **URL Length**

The length of a URL can be a simple yet effective feature. Phishing URLs tend to be longer than legitimate ones as attackers often embed malicious domains within sub-domains to disguise them.

In [16]:
phish_data['url_length'] = phish_data['url'].apply(len)
legit_data['url_length'] = legit_data['url'].apply(len)


In [17]:
print('Phishing Data')
print(phish_data[['url', 'url_length']].head())


Phishing Data
                                                 url  url_length
0  http://u1047531.cp.regruhosting.ru/acces-inges...          66
1  http://hoysalacreations.com/wp-content/plugins...          88
2  http://www.accsystemprblemhelp.site/checkpoint...          50
3  http://www.accsystemprblemhelp.site/login_atte...          67
4  https://firebasestorage.googleapis.com/v0/b/so...         137


In [18]:
print('Legitimate Data')
print(legit_data[['url', 'url_length']].head())


Legitimate Data
                                                 url  url_length
0  http://1337x.to/torrent/1110018/Blackhat-2015-...          83
1  http://1337x.to/torrent/1122940/Blackhat-2015-...          83
2  http://1337x.to/torrent/1124395/Fast-and-Furio...          83
3  http://1337x.to/torrent/1145504/Avengers-Age-o...          83
4  http://1337x.to/torrent/1160078/Avengers-age-o...          83


**Insight**: As observed from our sample, phishing URLs can vary significantly in length, while legitimate URLs seem to exhibit more uniformity.

**Phishing Data**: The lengths of phishing URLs from our sample vary widely. For instance, while the shortest URL in the displayed dataset has a length of 50 characters, the longest reaches up to 137 characters. This indicates that phishing URLs can come in various lengths, and there's no immediate pattern discernible from this small subset.

**Legitimate Data**: The legitimate URLs from the sample seem to have a more consistent length, around 83 characters. This uniformity might be attributed to the specific source or platform from which these URLs are taken (in this case, all URLs seem to be from 1337x.to).
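
Five rows per dataset say little about the overall length distribution. A minimal sketch of how the comparison could be made on the full data is to summarize `url_length` per class. The toy URLs and the `label` column below are hypothetical stand-ins for the real datasets.

```python
import pandas as pd

# Hypothetical toy URLs standing in for the two datasets
toy = pd.DataFrame({
    "url": [
        "http://a.example/x",
        "http://b.example/login/verify/account",
        "http://c.example/",
        "http://d.example/docs",
    ],
    "label": ["phishing", "phishing", "legitimate", "legitimate"],
})
toy["url_length"] = toy["url"].str.len()

# Minimum, median and maximum URL length per class
summary = toy.groupby("label")["url_length"].agg(["min", "median", "max"])
print(summary)
```

Applied to the real data, the same aggregation would show whether phishing URLs are in fact longer on average, which the five-row preview cannot establish.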


- **Presence of Sub-domains**

The number of sub-domains can also be indicative. As mentioned, attackers often use multiple sub-domains to hide malicious intent.


In [19]:
# Code to extract the sub-domain count
print('Phishing Data')
phish_data['sub_domain_count'] = phish_data['url'].apply(lambda x: x.count('.'))
print(phish_data[['url', 'sub_domain_count']].head())


print('Legitimate Data')
legit_data['sub_domain_count'] = legit_data['url'].apply(lambda x: x.count('.'))
print(legit_data[['url', 'sub_domain_count']].head())


Phishing Data
                                                 url  sub_domain_count
0  http://u1047531.cp.regruhosting.ru/acces-inges...                 3
1  http://hoysalacreations.com/wp-content/plugins...                 1
2  http://www.accsystemprblemhelp.site/checkpoint...                 3
3  http://www.accsystemprblemhelp.site/login_atte...                 3
4  https://firebasestorage.googleapis.com/v0/b/so...                 5
Legitimate Data
                                                 url  sub_domain_count
0  http://1337x.to/torrent/1110018/Blackhat-2015-...                 1
1  http://1337x.to/torrent/1122940/Blackhat-2015-...                 1
2  http://1337x.to/torrent/1124395/Fast-and-Furio...                 1
3  http://1337x.to/torrent/1145504/Avengers-Age-o...                 1
4  http://1337x.to/torrent/1160078/Avengers-age-o...                 1


Insight: The varied range of sub-domains in phishing URLs contrasts with the consistent structure of the legitimate URLs in our sample.

In this sample, the count ranges from 1 to 5 for phishing URLs, while every legitimate URL shows a count of 1. Note that `sub_domain_count` counts every dot in the URL, including dots in the path and in file names, so it is a rough proxy for the number of sub-domains rather than an exact count. With that caveat, the disparity suggests that phishing URLs often use several sub-domains to obfuscate malicious intent, while the legitimate URLs from our chosen source have a simpler structure.
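
Counting dots over the whole URL also counts dots in paths and file names. As a sketch of a stricter variant, the count can be restricted to the hostname returned by `urlparse`. The helper name `host_dot_count` and the example URL are hypothetical, and the approach still does not account for multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlparse

def host_dot_count(url):
    """Count dots in the hostname only, ignoring the path and query."""
    host = urlparse(url).hostname or ""
    return host.count(".")

# Hypothetical example: the path contains dots that are not sub-domains
url = "http://login.secure.example.com/files/v1.2/index.html"
whole_url_dots = url.count(".")      # dots anywhere in the URL
hostname_dots = host_dot_count(url)  # dots in the hostname only
print(whole_url_dots, hostname_dots)
```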


- **Presence of HTTPS**

While HTTPS doesn't guarantee a website's legitimacy, the absence of it could be a red flag, given the modern web's emphasis on encrypted connections.

In [20]:
# Displaying the top rows of each dataset with the HTTPS presence feature
print('Phishing Data - HTTPS Presence')
phish_data['https_presence'] = phish_data['url'].str.startswith('https')
print(phish_data[['url', 'https_presence']].head())

print('\nLegitimate Data - HTTPS Presence')
legit_data['https_presence'] = legit_data['url'].str.startswith('https')
print(legit_data[['url', 'https_presence']].head())


Phishing Data - HTTPS Presence
                                                 url  https_presence
0  http://u1047531.cp.regruhosting.ru/acces-inges...           False
1  http://hoysalacreations.com/wp-content/plugins...           False
2  http://www.accsystemprblemhelp.site/checkpoint...           False
3  http://www.accsystemprblemhelp.site/login_atte...           False
4  https://firebasestorage.googleapis.com/v0/b/so...            True

Legitimate Data - HTTPS Presence
                                                 url  https_presence
0  http://1337x.to/torrent/1110018/Blackhat-2015-...           False
1  http://1337x.to/torrent/1122940/Blackhat-2015-...           False
2  http://1337x.to/torrent/1124395/Fast-and-Furio...           False
3  http://1337x.to/torrent/1145504/Avengers-Age-o...           False
4  http://1337x.to/torrent/1160078/Avengers-age-o...           False


Insight: Both phishing and legitimate datasets showcase URLs without HTTPS. This underlines that while HTTPS presence can be a good feature, it isn't solely indicative of a URL's legitimacy.

In the displayed sample, four of the five phishing URLs use plain HTTP and only the last one starts with HTTPS; all five legitimate URLs also use plain HTTP. Five rows per dataset are too few to generalize, but they already show that the presence or absence of HTTPS alone does not separate phishing from legitimate sites, so further analysis is needed.
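
A prefix test such as `startswith('https')` is case-sensitive and also accepts any scheme that merely begins with those letters. A small sketch of a stricter check compares the parsed scheme instead and reports the share of HTTPS URLs in a list. The helper name `uses_https` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def uses_https(url):
    """True only when the URL scheme is exactly https (case-insensitive)."""
    return urlparse(url).scheme.lower() == "https"

# Hypothetical sample URLs
urls = [
    "https://a.example/",
    "http://b.example/",
    "HTTPS://c.example/",
    "httpsfoo://d.example/",
]

https_share = sum(uses_https(u) for u in urls) / len(urls)
print(f"Share of HTTPS URLs: {https_share:.2f}")
```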


- **URL Tokenization**

Tokenizing URLs can help in extracting meaningful terms, which can further aid during the modeling phase, especially if techniques like TF-IDF (Term Frequency-Inverse Document Frequency) are employed.


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer on phishing data and transform the URLs
phish_vectorized = vectorizer.fit_transform(phish_data['url'])

# Use the fitted vectorizer to transform the legitimate URLs
legit_vectorized = vectorizer.transform(legit_data['url'])

# Converting the sparse matrix to dense format for the first five phishing URLs
phish_dense_sample = phish_vectorized[:5].todense()

# Fetching the feature names (terms) from the vectorizer
feature_names = vectorizer.get_feature_names_out()

# Displaying the top terms for the first phishing URL based on the TF-IDF score
df_phish_sample = pd.DataFrame(phish_dense_sample, columns=feature_names)
top_terms_phish = df_phish_sample.iloc[0].nlargest(5)
print("Top terms for the first phishing URL based on TF-IDF score:")
print(top_terms_phish)

# Similarly, for legitimate URLs:
legit_dense_sample = legit_vectorized[:5].todense()
df_legit_sample = pd.DataFrame(legit_dense_sample, columns=feature_names)
top_terms_legit = df_legit_sample.iloc[0].nlargest(5)
print("\nTop terms for the first legitimate URL based on TF-IDF score:")
print(top_terms_legit)


Top terms for the first phishing URL based on TF-IDF score:
3facd       0.397233
u1047531    0.397233
20200104    0.342948
inges       0.342948
t452        0.342948
Name: 0, dtype: float64

Top terms for the first legitimate URL based on TF-IDF score:
2015    0.519832
720p    0.519832
dl      0.481934
to      0.340997
web     0.321426
Name: 0, dtype: float64


Insight: The terms with the highest TF-IDF scores differ between phishing and legitimate URLs, emphasizing distinct structures and contents between the two.

The displayed terms are the five highest-scoring TF-IDF terms for the first URL in each dataset. For the phishing URL, the top terms are opaque tokens such as "u1047531", "3facd", and "20200104", which look like machine-generated identifiers. For the legitimate URL, the top terms are "2015", "720p", "dl", "to", and "web". Because the vectorizer was fitted on the phishing URLs only, any token of a legitimate URL that never occurs in a phishing URL is ignored, so the legitimate scores describe only the vocabulary shared with the phishing set. Even so, the contrast hints at different structures and contents in phishing vs. legitimate URLs.
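
To see which tokens the vectorizer actually scores, it helps to reproduce its tokenization. With default settings, `TfidfVectorizer` lowercases the input and uses the token pattern `(?u)\b\w\w+\b`, that is, runs of two or more word characters. The sketch below mimics that step with the standard `re` module; the helper name `url_tokens` and the example URL are hypothetical.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def url_tokens(url):
    """Lowercase the URL and split it the way the default vectorizer settings would."""
    return TOKEN_PATTERN.findall(url.lower())

# Hypothetical URL used only for illustration
tokens = url_tokens("http://www.example.com/login-update/index.php?id=7")
print(tokens)
```

Single-character tokens such as the trailing `7` are dropped, while generic pieces such as `http`, `www`, and `com` are kept, which is why a custom tokenizer or stop-word list may be worth considering for URLs.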

**URL Tokenization and Analysis**

URLs can be split into components such as scheme, domain, subdomain, and path. By tokenizing URLs, we can extract information about their structure, which might reveal patterns typical of phishing URLs.


In [22]:
from urllib.parse import urlparse

def tokenize_url(url):
    parsed_url = urlparse(url)
    return {
        "scheme": parsed_url.scheme,
        "netloc": parsed_url.netloc,
        "path": parsed_url.path,
        "params": parsed_url.params,
        "query": parsed_url.query,
        "fragment": parsed_url.fragment
    }

phish_data['tokenized_url'] = phish_data['url'].apply(tokenize_url)
legit_data['tokenized_url'] = legit_data['url'].apply(tokenize_url)

# Display some sample outputs:
print(phish_data['tokenized_url'].head())


0    {'scheme': 'http', 'netloc': 'u1047531.cp.regr...
1    {'scheme': 'http', 'netloc': 'hoysalacreations...
2    {'scheme': 'http', 'netloc': 'www.accsystemprb...
3    {'scheme': 'http', 'netloc': 'www.accsystemprb...
4    {'scheme': 'https', 'netloc': 'firebasestorage...
Name: tokenized_url, dtype: object
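
The parsed components become useful for modeling once they are turned into numeric features. As a rough sketch, the function below derives a few structural features from a URL. The function name `structural_features`, the chosen features, and the example URL are assumptions for illustration, not part of the original pipeline.

```python
from urllib.parse import urlparse, parse_qs

def structural_features(url):
    """Derive simple numeric/boolean features from the parsed URL."""
    parsed = urlparse(url)
    # Non-empty path segments, e.g. "/a/b/c.php" -> ["a", "b", "c.php"]
    path_segments = [seg for seg in parsed.path.split("/") if seg]
    return {
        "path_depth": len(path_segments),
        "num_query_params": len(parse_qs(parsed.query)),
        "has_port": parsed.port is not None,
    }

# Hypothetical URL used only for illustration
features = structural_features("http://example.com:8080/a/b/c.php?id=1&next=home")
print(features)
```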


**Lexical Analysis**

Phishing URLs may have certain lexical patterns that distinguish them from legitimate URLs, like longer lengths or more subdomains.


In [23]:
phish_data['url_length'] = phish_data['url'].apply(len)
phish_data['num_special_chars'] = phish_data['url'].apply(lambda x: sum(c in '@&$' for c in x))

# Displaying some sample outputs:
print(phish_data[['url', 'url_length', 'num_special_chars']].head())


                                                 url  url_length  \
0  http://u1047531.cp.regruhosting.ru/acces-inges...          66   
1  http://hoysalacreations.com/wp-content/plugins...          88   
2  http://www.accsystemprblemhelp.site/checkpoint...          50   
3  http://www.accsystemprblemhelp.site/login_atte...          67   
4  https://firebasestorage.googleapis.com/v0/b/so...         137   

   num_special_chars  
0                  0  
1                  0  
2                  0  
3                  1  
4                  1  
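
The same count can be computed without a Python-level loop by using the vectorized string method `Series.str.count` with a character class. This is a sketch on hypothetical toy URLs; the character set `@`, `&`, `$` is taken from the cell above.

```python
import pandas as pd

# Hypothetical toy URLs
toy = pd.DataFrame({
    "url": [
        "http://a.example/?x=1&y=2",
        "http://user@b.example/",
        "http://c.example/",
    ]
})

# Count occurrences of '@', '&' or '$' in each URL
toy["num_special_chars"] = toy["url"].str.count(r"[@&$]")
print(toy)
```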


**Host-based and Domain Analysis**

Information related to domain registration can be insightful. Domains that are very new or have been recently updated can be considered suspicious.


In [24]:
!pip install python-whois




In [25]:
from urllib.parse import urlparse

In [26]:
import whois
import time

def domain_info(domain):
    try:
        time.sleep(1)  # Adding a delay of 1 second between requests
        w = whois.whois(domain)
        return w
    except Exception as e:
        # The whois module has no `WhoisCommandFailed` attribute, so lookup errors are caught generically
        return str(e)

phish_data['domain_info'] = phish_data['url'].apply(lambda x: domain_info(urlparse(x).netloc))


Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed

In [None]:
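# Using WHOIS for domain information (requires the python-whois package)
import whois
from urllib.parse import urlparse

def domain_info(domain):
    """Return the WHOIS record for a domain, or None if the lookup fails."""
    try:
        return whois.whois(domain)
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return None

phish_data['domain_info'] = phish_data['url'].apply(lambda x: domain_info(urlparse(x).netloc))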



Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11002] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11002] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11001] getaddrinfo failed
Error trying to connect to socket: closing socket - [Errno 11002] getaddrinfo failed
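
The idea that very new domains are suspicious can be turned into a numeric feature once a WHOIS lookup succeeds. The sketch below assumes the WHOIS result provides a creation date that may be a single `datetime`, a list of them, or missing. The helper name `domain_age_days` is hypothetical, and timezone information is dropped for simplicity.

```python
from datetime import datetime

def domain_age_days(creation_date, now=None):
    """Days since registration, or None if no usable date is available."""
    # Some WHOIS records carry several creation dates; use the earliest one.
    if isinstance(creation_date, (list, tuple)):
        dates = [d.replace(tzinfo=None) for d in creation_date if isinstance(d, datetime)]
        creation_date = min(dates) if dates else None
    if not isinstance(creation_date, datetime):
        return None
    now = now or datetime.now()
    return (now - creation_date.replace(tzinfo=None)).days

# Fixed reference date so the example is reproducible
reference = datetime(2023, 10, 1)
print(domain_age_days(datetime(2023, 9, 1), now=reference))   # 30
print(domain_age_days(None, now=reference))                   # None
```

Rows where the lookup failed would yield `None`, so the resulting column needs explicit handling of missing values before modeling.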


**Content-based Analysis**

Objective:
Analyze the content a URL points to for indicators of phishing.

In [None]:
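import requests
from bs4 import BeautifulSoup

def get_content(url):
    """Fetch a page and return its visible text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=5)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.get_text()
    except requests.RequestException:
        # Covers connection errors, timeouts and invalid URLs
        return None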


In [None]:
# This can take time and might not be practical for a huge dataset.
# phish_data['content'] = phish_data['url'].apply(get_content)
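
Once page content is available, indicators can be derived from it. The sketch below counts forms and password fields as one possible signal. It is an illustration only: the choice of indicators and the names `FormIndicatorParser` and `content_indicators` are assumptions, and it operates on raw HTML rather than on the extracted text returned by `get_content`. It uses the standard-library HTML parser so that it runs without extra packages.

```python
from html.parser import HTMLParser

class FormIndicatorParser(HTMLParser):
    """Counts <form> tags and password inputs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.num_forms = 0
        self.num_password_inputs = 0

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.num_forms += 1
        elif tag == "input" and (attributes.get("type") or "").lower() == "password":
            self.num_password_inputs += 1

def content_indicators(html):
    """Return simple form-related counts for one HTML string."""
    parser = FormIndicatorParser()
    parser.feed(html)
    return {
        "num_forms": parser.num_forms,
        "num_password_inputs": parser.num_password_inputs,
    }

# Hypothetical page used only to demonstrate the output
sample_html = (
    '<html><body><form action="/login">'
    '<input type="text" name="user"><input type="password" name="pass">'
    '</form></body></html>'
)
indicators = content_indicators(sample_html)
print(indicators)
```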


**Third-party Data Enrichment**

Objective:
Utilize third-party services to gather more detailed insights about URLs or domains.


## **Data Pre-processing**


In [None]:
import numpy as np
import pandas as pd

# Assuming you've already loaded your data into phish_data and legit_data

# Step 1: Randomly select a subset of legitimate URLs to match the phishing count
# (capped at the number of legitimate rows available; fixed seed for reproducibility)
n_samples = min(len(phish_data), len(legit_data))
subset_legit_data = legit_data.sample(n=n_samples, random_state=42).copy()

# Step 2: Assign labels
phish_data['label'] = 1  # phishing
subset_legit_data['label'] = 0  # legitimate

# Step 3: Concatenate
combined_data = pd.concat([phish_data, subset_legit_data], axis=0)

# Step 4: Shuffle
combined_data = combined_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Display the first few rows of the combined data
print("First few rows of the processed dataset:")
display(combined_data.head())

# Display the shape of the dataset
print(f"\nShape of the dataset: {combined_data.shape}")

# Display counts of each label
print("\nCounts of each label:")
display(combined_data['label'].value_counts())

# Display a simple statistic: mean length of URLs (assuming a 'url' column exists)
print(f"\nAverage URL length: {combined_data['url'].apply(len).mean():.2f} characters")


**Align Columns Before Concatenating**
    
Ensure that both datasets have the exact same columns in the same order. If they have different columns, decide on a common set of columns to use, or fill in missing columns in each dataset with placeholder values.


In [None]:
import numpy as np

# Making sure both dataframes have the same columns
missing_cols_legit = set(phish_data.columns) - set(legit_data.columns)
missing_cols_phish = set(legit_data.columns) - set(phish_data.columns)

# Adding missing columns to both dataframes
for col in missing_cols_legit:
    legit_data[col] = np.nan
for col in missing_cols_phish:
    phish_data[col] = np.nan



1. **Combining Datasets**

Before we can pre-process our data, we need to merge the two datasets (phishing and legitimate) into a single dataframe.

1.1 Combine Datasets

We'll begin by concatenating the phishing and legitimate datasets. Afterwards, we'll shuffle them to ensure randomness.


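In [None]:
import pandas as pd

# Label each source explicitly so that no row in the combined frame has a missing label
phish_data['label'] = 1  # phishing
legit_data['label'] = 0  # legitimate

# Concatenate and shuffle (fixed seed for reproducibility)
data = pd.concat([phish_data, legit_data]).sample(frac=1, random_state=42).reset_index(drop=True)
data.head()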


1.2 Scale Features

Standardize the numeric feature columns of the combined dataset.


In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

# Extracting the feature matrix 'X' and target 'y'.
# StandardScaler only accepts numeric input, so text and object columns
# (e.g. 'url', 'tokenized_url', 'domain_info') are excluded here.
X = data.drop('label', axis=1).select_dtypes(include=[np.number])
y = data['label']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)


## **Data Pre-processing**

The purpose of data pre-processing is to convert the raw data into a clean data set, ensuring the quality and integrity of the data before feeding it into the modeling algorithms.

- **Feature Scaling**
        
Given the disparity in magnitude between different features (e.g., URL length versus TF-IDF scores), scaling becomes essential, especially if you intend to use distance-based algorithms.

**Identifying the need for scaling:** Before the actual scaling process, we established that this step was necessary. Given the range and type of the extracted features, such as URL length and TF-IDF scores, some features could disproportionately influence distance-based machine learning algorithms purely because of their magnitude. Scaling was therefore deemed essential.

**Choosing an appropriate scaling technique:** Among the available techniques (Min-Max scaling, Robust scaling, etc.), we chose the Standard Scaler, which standardizes each feature by removing the mean and scaling to unit variance.

**Initialization and application:**

- We initialized the StandardScaler from the sklearn.preprocessing module.
- The scaler was then fit on the phishing dataset, meaning it calculated the mean and standard deviation of each feature from the phishing data.
- Using this fitted scaler, we transformed both the phishing and legitimate datasets to obtain standardized values.

**Validation and verification:**

- After scaling, we displayed the top rows of both datasets to verify the success of the transformation.
- The scaled features for both datasets were observed to lie around a mean of zero, indicating that the scaler worked as intended.

**Observations from scaled data:** The scaled data showed certain patterns and variations between the phishing and legitimate datasets, which could help differentiate the two classes during modeling.

By implementing feature scaling, we have ensured that no single feature unduly influences a model, especially one that is sensitive to feature magnitudes. This paves the way for the next steps in the machine learning pipeline, such as model selection and training.


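In [None]:
# Scale the phishing and legitimate features with a scaler fit on the phishing data
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler_phish = StandardScaler()

# Labels are numeric (1 = phishing, 0 = legitimate); the label itself is not a feature
phish_data_numeric = data[data['label'] == 1].select_dtypes(include=[np.number]).drop(columns='label')
legit_data_numeric = data[data['label'] == 0].select_dtypes(include=[np.number]).drop(columns='label')

# Fit on the phishing features, then apply the same transformation to both sets
phish_data_scaled = scaler_phish.fit_transform(phish_data_numeric)
legit_data_scaled = scaler_phish.transform(legit_data_numeric)

# Display the first few rows of the scaled phishing data
print("Scaled Phishing Data:")
print(pd.DataFrame(phish_data_scaled, columns=phish_data_numeric.columns).head())

# Display the first few rows of the scaled legitimate data
print("\nScaled Legitimate Data:")
print(pd.DataFrame(legit_data_scaled, columns=legit_data_numeric.columns).head())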
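
To make the standardization step concrete, here is a minimal NumPy sketch of what fitting on one dataset and transforming another amounts to. The toy values and the variable names `fit_set` and `other_set` are hypothetical. The formula, z = (x - mean) / std with the population standard deviation, is the one StandardScaler applies.

```python
import numpy as np

# Hypothetical toy feature matrices: columns are [url_length, sub_domain_count]
fit_set = np.array([[66.0, 1.0], [88.0, 0.0], [50.0, 1.0], [137.0, 3.0]])
other_set = np.array([[30.0, 0.0], [45.0, 1.0]])

# "Fitting" estimates the per-column mean and population standard deviation
mean = fit_set.mean(axis=0)
std = fit_set.std(axis=0)

# "Transforming" applies z = (x - mean) / std using the fitted statistics
fit_scaled = (fit_set - mean) / std
other_scaled = (other_set - mean) / std

print(fit_scaled.mean(axis=0))    # approximately 0 for every column
print(fit_scaled.std(axis=0))     # approximately 1 for every column
print(other_scaled.mean(axis=0))  # generally not 0: the statistics came from fit_set
```

Only the dataset used for fitting is guaranteed to end up with zero mean and unit variance. A common alternative, not used in this notebook, is to fit the scaler on the training split alone and reuse it for validation and test data.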


Observation: The scaled data reveals some interesting patterns. For both phishing and legitimate datasets, the url_length, sub_domain_count, and https_presence features have been standardized such that they have values around a mean of zero. Notably, the legitimate data appears more consistent in its scaled values, especially in terms of sub-domains and HTTPS presence. In contrast, the phishing data displays more variability. This difference in consistency and variability can be a potential differentiator when classifying URLs.


- **Data Balancing**

The imbalance between phishing and legitimate URLs can lead to biased models. Let's first check the class distribution and then decide on the method for balancing.

In classification tasks, if one class outnumbers the other class (or classes), a model can be heavily biased towards the majority class, leading to poor performance on the minority class. This phenomenon is commonly seen in fraud detection, spam filtering, and, in our case, phishing URL detection.

Explanation:
Over-sampling is a technique where you generate more samples in the minority class, usually by duplicating or creating synthetic samples using methods like SMOTE (Synthetic Minority Over-sampling Technique).

Under-sampling involves reducing the number of samples in the majority class, but it can result in loss of data.
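
As a minimal sketch of the two steps described above, the following example first checks the class distribution and then applies random under-sampling. The toy data and the function name `random_undersample` are hypothetical. In the notebook, `data['label'].value_counts()` gives the same distribution check on the real dataset.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=42):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        # Sample without replacement so no row is duplicated
        for row in rng.sample(group, target):
            balanced.append((row, label))
    rng.shuffle(balanced)
    new_rows = [row for row, _ in balanced]
    new_labels = [label for _, label in balanced]
    return new_rows, new_labels

# Hypothetical toy data: 8 legitimate (0) and 2 phishing (1) URLs
labels = [0] * 8 + [1] * 2
rows = [f"url_{i}" for i in range(10)]

print(Counter(labels))  # class distribution before balancing
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))  # both classes now have 2 rows
```

The example also shows the cost mentioned above: six of the eight legitimate rows are discarded.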


- **Update Both Libraries**

Begin by updating both imblearn and sklearn to their latest versions. This might resolve any version incompatibility issues.


In [None]:
# Upgrade pip first, then upgrade both libraries together so pip can resolve compatible versions
!pip install --upgrade pip
!pip install -U imbalanced-learn scikit-learn

# If a stale cached package causes problems, purge the cache and reinstall:
# !pip cache purge
# !pip install -U --no-cache-dir imbalanced-learn scikit-learn

In [None]:
import imblearn
import sklearn

print("Imbalanced Learn Version:", imblearn.__version__)
print("Scikit Learn Version:", sklearn.__version__)


from imblearn.over_sampling import SMOTE


# Instantiate SMOTE with a specific random state for reproducibility
smote = SMOTE(random_state=42)



from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



# Resample the training data
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# If the latest releases are still incompatible, pin both libraries to a known-compatible pair
# (replace X.X.X and Y.Y.Y with actual version numbers), then re-run the resampling step above:
# !pip install imbalanced-learn==X.X.X
# !pip install scikit-learn==Y.Y.Y



If there's a significant imbalance, consider techniques like:

a. Oversampling:
Increasing the count of the minority class by replicating samples.

b. Undersampling:
Reducing the count of the majority class by randomly removing samples.

c. SMOTE (Synthetic Minority Over-sampling Technique):
Generating synthetic samples for the minority class.
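
To illustrate how option (c) differs from plain replication, the sketch below generates one synthetic point the way SMOTE does in principle: pick a minority sample, pick one of its nearest minority neighbours, and interpolate between the two. It is a simplified illustration with hypothetical names and toy data, not a replacement for `imblearn.over_sampling.SMOTE`.

```python
import math
import random

def smote_like_sample(minority, k=2, seed=42):
    """Create one synthetic point by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # k nearest minority-class neighbours of the chosen point (excluding itself)
    neighbours = sorted(
        (p for p in minority if p is not base),
        key=lambda p: math.dist(base, p),
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [b + lam * (n - b) for b, n in zip(base, neighbour)]

# Hypothetical minority-class feature vectors
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 9.0]]
synthetic = smote_like_sample(minority)
print(synthetic)
```

Because the new point lies on the line segment between two existing minority samples, it is a new feature vector rather than an exact duplicate.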


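In [None]:
# If using SMOTE: resample only the training split, so that no synthetic
# samples derived from test rows leak into the evaluation
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)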


**Splitting**

Partition the data into training, validation, and testing sets.

Splitting the dataset into training and testing subsets is a crucial step in building and evaluating machine learning models. This ensures that we have a set of unseen data to validate the performance of our model. Typically, we use 70-80% of the data for training and the remaining 20-30% for testing.

Explanation:
The training set is used to train the machine learning model. It's like giving the model a set of example problems to learn from.

The testing set is used to evaluate the model's performance. This set is not shown to the model during training. It's like giving the model a quiz on what it has learned.


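In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Combining the data for splitting. The scaled sets are NumPy arrays (the output of
# StandardScaler), so they are stacked row-wise; pd.concat does not accept plain arrays.
combined_data = np.vstack([phish_data_scaled, legit_data_scaled])

# Labels: 1 for phishing and 0 for legitimate
labels = np.concatenate([np.ones(phish_data_scaled.shape[0]), np.zeros(legit_data_scaled.shape[0])])

# Splitting the data; stratify keeps the class ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    combined_data, labels, test_size=0.2, random_state=42, stratify=labels
)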
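
The section mentions a validation set, while the cell above produces only training and test subsets. A three-way partition can be sketched as follows. The 70/15/15 proportions and the function name `three_way_split` are assumptions for illustration. With scikit-learn, the same result can be obtained by calling `train_test_split` twice.

```python
import random

def three_way_split(n_samples, val_size=0.15, test_size=0.15, seed=42):
    """Return shuffled index lists for the training, validation and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    n_val = int(round(n_samples * val_size))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 70 15 15
```

The index lists can then be used to select rows from the feature matrix and the label array, so that each sample lands in exactly one subset.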


