# Guardian: URL Feature Extraction
*Offensive Hacking Techical and Strategies Course - 4Y1S SLIIT*

## 1.Objective:

A phishing website is a common social engineering method that mimics trustful uniform resource locators (URLs) and webpages. The objective of this notebook is to collect data & extract the selctive features form the URLs.

This assignment is worked on Kaggle notebook

## 2.Collection of Data:

For this assignment, I consider urls of type legitimate(0) and phisihig(1) to avoid getting into the risk of data imbalance, I am considering a margin value of 16,000 phishing URLs & 8,000 URLs each type. and inputed dataset only contain url and lable as good(0) and bad(1).

Original dataset is extracted from: https://www.kaggle.com/datasets/aksingh2411/dataset-of-malicious-and-benign-webpages . 

In [1]:
#importing required packages for this module
import pandas as pd

In [2]:
#loading the Legitimate URLs data to dataframe
data0 = pd.read_csv("/kaggle/input/urldata/legitimate.csv")
data0.head()

Unnamed: 0,url
0,http://www.dutchthewiz.com/freeware/
1,http://www.collectiblejewels.com
2,http://www.deadlinedata.com
3,http://www.mil.fi/maavoimat/kalustoesittely/00...
4,http://www.avclub.com/content/node/24539


In [3]:
data0.shape

(8000, 1)

In [4]:
#loading the Phishing URLs data to dataframe
data1 = pd.read_csv("/kaggle/input/urldata/phishing.csv")
data1.head()

Unnamed: 0,url
0,http://www.blackmistress.com/
1,http://www.blogs-de-sexe.com
2,http://www.muschi-feuchte.de/
3,http://new-playboy.jp/
4,http://www.vintagefemdom.com/


In [5]:
data1.shape

(8000, 1)

No the dataframe (data0) is ready to extract features

## 3. Feature Extraction

In this step, features are extracted from the URLs dataset.

The extracted features are categorized into:
1. Address Bar based Features
2. Domain based Features
3. HTML & Javascript based Features

More details on features can be read through [Guardian:Readme-github]().

In [6]:
!pip install python-whois
!pip install requests
!pip install bs4

[0m

In [7]:
import re
import whois
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def extract_features(url,label):
    features = {}

    #listing shortening services
    shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"


    # Address Bar Based Features (9)
    parsed_url = urlparse(url)
    # Domain of the URL (Domain)
    domain = parsed_url.netloc
    features['Domain'] = domain
    # Checks for IP address in URL (IP_address)
    features['IP Address'] = int(bool(re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', url)))
    # Checks the presence of @ in URL
    features['@ Symbol'] = int('@' in domain)
    # Finding the length of URL and categorizing (URL_Length)
    features['URL Length'] = 0 if len(url) <= 62 else 1
    # Gives number of '/' in URL (URL_Depth)
    features['URL Depth'] = url.count('/')
    # Checking for redirection '//' in the url (Redirection)
    features['Redirection'] = int('//' in domain)
    # Existence of “HTTPS” Token in the Domain Part of the URL (https_Domain)
    features['HTTP/HTTPS in Domain'] = int('http' in domain or 'https' in domain)
    # Checking for Shortening Services in URL (Tiny_URL)
    features['Shortening Service'] = int(bool(re.search(shortening_services, domain)))
    # Checking for Prefix or Suffix Separated by (-) in the Domain (Prefix/Suffix)
    features['Prefix/Suffix - in Domain'] = int('-' in domain)

    # Domain Based Features (4)
    try:
        w = whois.whois(domain)
        # Check for DNS record
        features['DNS Record'] = int(not bool(w.domain_name))
        # Check Web traffic with Alexa (Web_Traffic)
        url = f"https://www.alexa.com/siteinfo/{domain}"
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        rank_element = soup.select_one(".rank-global")
        rank = rank_element.get_text(strip=True) if rank_element else ''
        if rank == '' or int(rank) > 100000:
            features['Web Traffic'] = 1
        else:
            features['Web Traffic'] = 0
        # Checking for Domain age
        if isinstance(w.creation_date, list) and isinstance(w.updated_date, list):
            domain_age = (w.expiration_date[0] - w.creation_date[0]).days
        else:
            domain_age = 1
        features['Domain Age'] = 0 if domain_age and domain_age >= 365 else 1
        # End time of domain: The difference between termination time and current time (Domain_End)
        if isinstance(w.expiration_date, list) and isinstance(w.updated_date, list):
            end_period = (w.expiration_date[0] - w.updated_date[0]).days
        else:
            end_period = 1
        features['End Period of Domain'] = 0 if end_period and end_period <= 183 else 1
    except:
        features['DNS Record'] = 1
        features['Web Traffic'] = 1
        features['Domain Age'] = 1
        features['End Period of Domain'] = 1

    # HTML and JavaScript based Features
    try:
        response = requests.get(url)
        html = response.text
        # IFrame Redirection (iFrame)
        iframe_found = bool(re.findall(r'<iframe|<frame', html, re.IGNORECASE))
        respond_found = bool(re.findall(r'respond', html, re.IGNORECASE))
        features['IFrame Redirection'] = 1 if not iframe_found or respond_found else 0
        # Checks the effect of mouse over on status bar (Mouse_Over)
        if not html or re.search(r'onmouseover', html, re.IGNORECASE): 
            features['Status Bar Customization'] = 1 
        else: 
            features['Status Bar Customization'] = 0
        # Checks the status of the right click attribute (Right_Click)
        features['Disabling Right Click'] = int(bool(re.search(r'event\.button\s*==\s*2', html, re.IGNORECASE)))
        # Checks the number of forwardings (Web_Forwards)
        forwarding_tags = re.findall(r'<meta\s*http-equiv\s*=\s*["\']?refresh', html, re.IGNORECASE)
        features['Website Forwarding'] = int(len(forwarding_tags) > 1)
    except:
        features['IFrame Redirection'] = 1
        features['Status Bar Customization'] = 1
        features['Disabling Right Click'] = 1
        features['Website Forwarding'] = 1
        
    # use to add label in feature extraction dataset preparation
    features['Label']=label
    return features

In [8]:
#Extracting the feautres & storing them in a list
legit_features = []

for i in range(0, 8000):
    url = data0['url'][i]
    legit_features.append(extract_features(url,0))

Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno -2] Name or service not k

In [17]:
feature_names = ['Domain', 'IP Address', '@ Symbol', 'URL Length', 'URL Depth','Redirection', 
                      'HTTP/HTTPS in Domain', 'Shortening Service', 'Prefix/Suffix - in Domain', 'DNS Record', 'Web Traffic', 
                      'Domain Age', 'End Period of Domain', 'IFrame Redirection', 'Status Bar Customization','Disabling Right Click', 'Website Forwarding', 'Label']

data0 = pd.DataFrame(legit_features, columns= feature_names)
data0.head()

Unnamed: 0,Domain,IP Address,@ Symbol,URL Length,URL Depth,Redirection,HTTP/HTTPS in Domain,Shortening Service,Prefix/Suffix - in Domain,DNS Record,Web Traffic,Domain Age,End Period of Domain,IFrame Redirection,Status Bar Customization,Disabling Right Click,Website Forwarding,Label
0,www.dutchthewiz.com,0,0,0,4,0,0,0,0,0,1,1,0,1,0,0,0,0
1,www.collectiblejewels.com,0,0,0,2,0,0,0,0,0,1,0,0,1,0,0,0,0
2,www.deadlinedata.com,0,0,0,2,0,0,0,0,0,1,0,1,1,0,0,0,0
3,www.mil.fi,0,0,0,5,0,0,0,0,0,1,1,0,1,0,0,0,0
4,www.avclub.com,0,0,0,5,0,0,0,0,1,1,1,1,1,0,0,0,0


In [11]:
data0.shape

(8000, 18)

In [14]:
output_file_path = '/kaggle/working/legit_url_with_features.csv'  
data0.to_csv(output_file_path, index=False)

In [16]:
#Extracting the feautres & storing them in a list
phishing_features = []

for i in range(0, 8000):
    url = data1['url'][i]
    phishing_features.append(extract_features(url,1))

Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno -2] Name or service not known
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - timed out
Error trying to connect to socket: closing socket - [Errno 111] Connection refused
Error trying to connect to socket: closing socket - [Errno 113] No route to host
Error trying to connect to socket: closing

In [18]:
feature_names = ['Domain', 'IP Address', '@ Symbol', 'URL Length', 'URL Depth','Redirection', 
                      'HTTP/HTTPS in Domain', 'Shortening Service', 'Prefix/Suffix - in Domain', 'DNS Record', 'Web Traffic', 
                      'Domain Age', 'End Period of Domain', 'IFrame Redirection', 'Status Bar Customization','Disabling Right Click', 'Website Forwarding', 'Label']

data1 = pd.DataFrame(phishing_features, columns= feature_names)
data1.head()

Unnamed: 0,Domain,IP Address,@ Symbol,URL Length,URL Depth,Redirection,HTTP/HTTPS in Domain,Shortening Service,Prefix/Suffix - in Domain,DNS Record,Web Traffic,Domain Age,End Period of Domain,IFrame Redirection,Status Bar Customization,Disabling Right Click,Website Forwarding,Label
0,www.blackmistress.com,0,0,0,3,0,0,0,0,0,1,1,0,1,0,0,0,1
1,www.blogs-de-sexe.com,0,0,0,2,0,0,0,1,1,1,1,1,1,1,1,1,1
2,www.muschi-feuchte.de,0,0,0,3,0,0,0,1,0,1,1,0,1,0,0,0,1
3,new-playboy.jp,0,0,0,3,0,0,0,1,1,1,1,1,1,1,1,1,1
4,www.vintagefemdom.com,0,0,0,3,0,0,0,0,0,1,1,0,1,0,0,0,1


In [19]:
data1.shape

(8000, 18)

In [20]:
output_file_path = '/kaggle/working/phishing_url_with_features.csv'  
data1.to_csv(output_file_path, index=False)

In [21]:
# Combine the datasets into a single dataframe
df_combined = pd.concat([data0, data1])

# Save the dataset into a new CSV file
df_combined.to_csv("/kaggle/working/url_dataset.csv", index=False)

In [22]:
df_combined.head()

Unnamed: 0,Domain,IP Address,@ Symbol,URL Length,URL Depth,Redirection,HTTP/HTTPS in Domain,Shortening Service,Prefix/Suffix - in Domain,DNS Record,Web Traffic,Domain Age,End Period of Domain,IFrame Redirection,Status Bar Customization,Disabling Right Click,Website Forwarding,Label
0,www.dutchthewiz.com,0,0,0,4,0,0,0,0,0,1,1,0,1,0,0,0,0
1,www.collectiblejewels.com,0,0,0,2,0,0,0,0,0,1,0,0,1,0,0,0,0
2,www.deadlinedata.com,0,0,0,2,0,0,0,0,0,1,0,1,1,0,0,0,0
3,www.mil.fi,0,0,0,5,0,0,0,0,0,1,1,0,1,0,0,0,0
4,www.avclub.com,0,0,0,5,0,0,0,0,1,1,1,1,1,0,0,0,0


In [23]:
df_combined.tail()

Unnamed: 0,Domain,IP Address,@ Symbol,URL Length,URL Depth,Redirection,HTTP/HTTPS in Domain,Shortening Service,Prefix/Suffix - in Domain,DNS Record,Web Traffic,Domain Age,End Period of Domain,IFrame Redirection,Status Bar Customization,Disabling Right Click,Website Forwarding,Label
7995,www.1stallamericangirls.com,0,0,0,5,0,0,0,0,1,1,1,1,1,1,1,1,1
7996,www.osanpo.tv,0,0,0,3,0,0,0,0,1,1,1,0,1,0,0,0,1
7997,mitglied.lycos.de,0,0,0,4,0,0,0,0,0,1,1,0,1,0,0,0,1
7998,www.russian-whores.com,0,0,0,3,0,0,0,1,1,1,1,1,1,1,1,1,1
7999,www.sweetbabyescorts.com,0,0,0,2,0,0,0,0,0,1,1,0,1,0,0,0,1


In [24]:
df_combined.shape

(16000, 18)