### *Problem we are solving : To explore frameworks that can resist adversarial phishing websites crafted using Generative Adversarial Networks (GANs). This research will focus on improving robustness through adversarial training combined with multi-modal feature extraction (e.g., URL, HTML content, metadata).*


This notebook contains the feature Engineering part where we create 17 features from each URL

In [1]:
import pandas as pd

In [2]:
# to stay away from warnings
import warnings
warnings.filterwarnings('ignore')

I have retrieved phishing data from the Japan Computer Emergency Response Team Coordination Center (JPCERT/CC) for making it adptable to new evolving phishing attacks

In [3]:
data1 = pd.read_csv("/content/202409.csv")
data2 = pd.read_csv("/content/202410.csv")

In [4]:
data2.drop(columns=['date','description'],inplace= True)

In [5]:
data1.shape
data2.shape

(4729, 1)

In [6]:
#Collecting 5,000 phishing URLs randomly from phishing urls collected last year
phish_set1 = data1.sample(n = 1500, random_state = 12).copy()
phish_set1 = phish_set1.reset_index(drop=True)
phish_set1.head()

Unnamed: 0,date,URL,description
0,2024/09/03 10:17:00,https://ecrbnvhafhugtkrbfbbgypv.kejiadmin.cn/c...,ヤマト運輸
1,2024/09/05 10:51:00,https://tkiqcybvzaodsquksskzgthq.xmcm10.cn/cao...,Amazon
2,2024/09/04 15:17:00,https://cistaturko.wixsite.com/biglobe,BIGLOBE
3,2024/09/06 18:27:00,https://thunderteacher.com/,TS CUBIC CARD
4,2024/09/03 10:55:00,https://meenchanrinwile.iusacom.com,アプラス


In [7]:
phish_set2 = data2.sample(n = 3500, random_state = 12).copy()
phish_set2 = phish_set2.reset_index(drop=True)
phish_set2.head()

Unnamed: 0,URL
0,https://pazc606s.schwenk.cn/caonimazne2qcg.co.jp/
1,https://expy-jp.tokyo/smart_login.php
2,http://xerjqc.cn/KOWETcfd587FTH69cgFfeh345345/
3,https://www.jwwu.cn/
4,https://tmp168.com/?token=SAISON-CARD/login-gilt


In [8]:
#concatenate two sets of randomly sampled datas
phish_set = pd.concat([phish_set1,phish_set2],axis = 0, ignore_index= False)
phish_set.head()

Unnamed: 0,date,URL,description
0,2024/09/03 10:17:00,https://ecrbnvhafhugtkrbfbbgypv.kejiadmin.cn/c...,ヤマト運輸
1,2024/09/05 10:51:00,https://tkiqcybvzaodsquksskzgthq.xmcm10.cn/cao...,Amazon
2,2024/09/04 15:17:00,https://cistaturko.wixsite.com/biglobe,BIGLOBE
3,2024/09/06 18:27:00,https://thunderteacher.com/,TS CUBIC CARD
4,2024/09/03 10:55:00,https://meenchanrinwile.iusacom.com,アプラス


In [9]:
print(type(phish_set))

<class 'pandas.core.frame.DataFrame'>


In [10]:
phish_set1_urls = phish_set1[['URL']]
phish_set2_urls = phish_set2[['URL']]

# Concatenate the URL columns
phish_set_combined_urls = pd.concat([phish_set1_urls, phish_set2_urls], ignore_index=True)

# Display the combined DataFrame
print(phish_set_combined_urls)

# Optional: Save to a CSV file
phish_set_combined_urls.to_csv('combined_urls.csv', index=False)

                                                    URL
0     https://ecrbnvhafhugtkrbfbbgypv.kejiadmin.cn/c...
1     https://tkiqcybvzaodsquksskzgthq.xmcm10.cn/cao...
2                https://cistaturko.wixsite.com/biglobe
3                           https://thunderteacher.com/
4                   https://meenchanrinwile.iusacom.com
...                                                 ...
4995           https://www.nslow.t-catas.xyz/index.html
4996                     https://www.hdsunshine100.com/
4997                               https://www.gbcw.cn/
4998                     https://www.bLuebayantigua.com
4999  http://turdcotutaally.faqserv.com/SFREfgr85tbA...

[5000 rows x 1 columns]


In [11]:
phish_set.shape

(5000, 3)

The Legitimate URLs from the ISCX-URL2016 dataset are benign, safe URLs collected from Alexa's top-ranked websites.

Description: These are legitimate, non-malicious URLs from trusted, popular websites, typically ranked by Alexa.
Source: The URLs were extracted using a Heritrix web crawler, which initially crawled around half a million URLs. Duplicate URLs and domain-only URLs were removed, and then the extracted URLs were verified using VirusTotal to ensure they are benign.
Size: Over 35,300 benign URLs.

Note : Alexa is deprecated for usage but this are legitimate urls so this can be preprocessed anytime

https://www.unb.ca/cic/datasets/url-2016.html



In [12]:
legit = pd.read_csv("/content/Legit_datasets.csv")
legit.columns = ['URLs']
legit.head()

Unnamed: 0,URLs
0,http://1337x.to/torrent/1110018/Blackhat-2015-...
1,http://1337x.to/torrent/1122940/Blackhat-2015-...
2,http://1337x.to/torrent/1124395/Fast-and-Furio...
3,http://1337x.to/torrent/1145504/Avengers-Age-o...
4,http://1337x.to/torrent/1160078/Avengers-age-o...


In [13]:
legit.shape

(35377, 1)


Collecting 5,000 Legitimate URLs randomly

In [14]:
legiurl = legit.sample(n = 5000, random_state = 12).copy()
legiurl = legiurl.reset_index(drop=True)
legiurl.head()

Unnamed: 0,URLs
0,http://graphicriver.net/search?date=this-month...
1,http://ecnavi.jp/redirect/?url=http://www.cros...
2,https://hubpages.com/signin?explain=follow+Hub...
3,http://extratorrent.cc/torrent/4190536/AOMEI+B...
4,http://icicibank.com/Personal-Banking/offers/o...


### **Feature Extraction from URLs:**

**1. Address Bar-Based Features :**
In this step, we extract specific features from the URLs that provide important information about the structure and behavior of the URL. These features fall under the category of "Address Bar-Based Features" and are derived from the components of the URL. The following features are particularly relevant to understanding potential malicious URLs:

* Domain of the URL :
The domain name is a critical component in identifying the origin of the URL. Extracting the domain allows us to analyze whether the domain is known or suspicious. This can help in detecting phishing attacks where malicious domains are used.

* IP Address in the URL :
URLs that directly contain an IP address, instead of a domain name, can indicate suspicious behavior. In many legitimate websites, domains are used instead of raw IP addresses, so detecting IP addresses can be a red flag for phishing or other malicious activities.

* Presence of "@" Symbol in the URL :
The @ symbol in URLs is commonly used in email addresses and is sometimes seen in suspicious URLs. Phishing sites may use this symbol to hide part of the domain or mislead users into clicking the URL.

* Length of the URL :
The length of a URL can be indicative of its nature. Malicious URLs are often longer due to the use of obfuscation techniques, which may include random strings or misleading parameters.

* Depth of the URL :
The depth of the URL refers to the number of subdirectories or folders in the path of the URL. A higher depth could indicate an unusual or suspicious structure, which is often seen in phishing or malicious sites designed to mimic legitimate websites.

* Redirection "//" in the URL :
URLs with multiple consecutive slashes (e.g., //) after the domain part may indicate a redirection or an attempt to obscure the true destination. Phishing sites sometimes use this technique to disguise their real nature.

* HTTP/HTTPS in the Domain Name :
URLs that contain http or https in the domain name (e.g., examplehttp.com) can indicate a misrepresentation of a legitimate website. Such URLs attempt to mimic well-known and trusted domains by inserting these protocols into the domain part.

* Use of URL Shortening Services (e.g., TinyURL) :
URLs shortened by services like TinyURL are commonly used to mask the original destination. While they can be used for legitimate purposes, they are also often associated with phishing attacks where the attacker hides the true URL destination to deceive users.

* Prefix or Suffix "-" in the Domain :
The presence of hyphens (-) in domain names can be indicative of attempts to mimic legitimate domain names by inserting these characters in specific positions. For example, phishing sites may use a hyphen in an attempt to look like a well-known brand.

Each of these features can be extracted from the URL and used to assess the likelihood of a URL being malicious. By examining these attributes, we can build a set of features that can help in classifying URLs as legitimate or phishing. The next step involves coding these features to automatically extract them from a given set of URLs.

In [15]:
# importing required packages for this section
from urllib.parse import urlparse,urlencode
import ipaddress
import re

*Domain of the URL*

In [16]:
def getDomain(url):
    domain = urlparse(url).netloc
    if re.match(r"^www.",domain):
        domain = domain.replace("www.","")
    return domain

*IP Address in the URL*

In [17]:
def haveIP(url):
    try:
        ipaddress.ip_address(url)
        ip = 1
    except:
        ip = 0
    return ip

"@" *Symbol in URL*

In [18]:
def haveAt(url):
    if "@" in url:
        at = 1
    else:
        at = 0
    return at

Length of the URL

In [19]:
def getLength(url):
    if len(url) < 60:
        length = 0
    else:
        length = 1
    return length

*Depth of URL*

In [20]:
urlparse("https://netflix.com/login/auth=0").path.split('/')

['', 'login', 'auth=0']

In [21]:
def getDepth(url):
    s = urlparse(url).path.split('/')
    depth = 0
    for j in range(len(s)):
        if len(s[j]) != 0:
            depth = depth + 1
    return depth

*Redirection "//" in URL*

In [22]:
url = "http://github.com/GregaVrbancic/Phishing-Dataset"
url.rfind('//')
# it indicates the index value

5

In [23]:
def redirection(url):
    pos = url.rfind('//')
    if pos > 6: #checks for http
        if pos > 1: #checks for https
            return 1 # denotes phishing
        else:
            return 0
    else:
        return 0
        # indicates legitimate

"http/https"*in Domain name*

In [24]:
a = urlparse("httpsnetfilx.com")
print(a)

ParseResult(scheme='', netloc='', path='httpsnetfilx.com', params='', query='', fragment='')


In [25]:
def httpDomain(url):
    domain = urlparse(url).netloc
    if 'https' in domain:
        return 1 # returns phishing
    else:
        return 0 # return legit

*Using URL Shortening Services* “TinyURL”

In [26]:
#listing shortening services
shortening_services = r"bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|" \
                      r"yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|" \
                      r"short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|" \
                      r"doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|db\.tt|" \
                      r"qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|q\.gs|is\.gd|" \
                      r"po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|x\.co|" \
                      r"prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|" \
                      r"tr\.im|link\.zip\.net"

In [27]:
def tinyURL(url):
    matches = re.search(shortening_services,url)
    if matches:
        return 1 #phishing
    else:
        return 0 #legitimate

*Prefix or Suffix* "-" in Domain

In [28]:
def prefixSuffix(url):
    if '-' in urlparse(url).netloc:
        return 1 #phishing
    else:
        return 0 #legitimate

### **Domain-Based Features for URL Analysis**
When analyzing URLs for phishing detection or classification, domain-based features play a critical role in identifying suspicious patterns. Some of the features that can be extracted from a domain include:

* DNS Record: This feature checks whether the domain has a valid DNS (Domain Name System) record. Phishing sites may not have valid DNS records, as they may be temporary or fake domains.

* Age of Domain: The age of the domain can be an important factor. Legitimate websites typically have older domains, while phishing websites may often use newly registered domains to evade detection.

* End Period of Domain: This feature checks the expiration date of a domain. Phishing websites often use domains that are either set to expire soon or are very new, while legitimate websites have longer expiration periods.

Each of these features can be implemented using Python, often with the help of external libraries or APIs to fetch domain information.

In [29]:
!pip install python-whois

Collecting python-whois
  Downloading python_whois-0.9.5-py3-none-any.whl.metadata (2.6 kB)
Downloading python_whois-0.9.5-py3-none-any.whl (104 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/104.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m [32m102.4/104.2 kB[0m [31m3.8 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m104.2/104.2 kB[0m [31m2.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: python-whois
Successfully installed python-whois-0.9.5


In [30]:
# importing required packages for this section
import re
from bs4 import BeautifulSoup
import whois
import urllib
import urllib.request
from datetime import datetime

*DNS Record*

In [31]:
!pip install dnspython

Collecting dnspython
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/313.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━[0m [32m204.8/313.6 kB[0m [31m5.7 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m313.6/313.6 kB[0m [31m5.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: dnspython
Successfully installed dnspython-2.7.0


In [32]:
import dns.resolver

In [33]:
a = dns.resolver.resolve("swiggy.com", 'A')
print(a)

<dns.resolver.Answer object at 0x79d19756ae10>


In [34]:
import whois
import socket
from urllib.parse import urlparse

def check_dns_records(url):
    try:
        # Extract the domain from the URL
        domain = urlparse(url).netloc

        # Perform DNS resolution (optional, for early detection of invalid domains)
        socket.gethostbyname(domain)

        # Perform WHOIS query
        domain_info = whois.whois(domain)

        return 0  # Return 0 for legitimate domains

    except socket.gaierror as e:
        return 1

    except socket.timeout as e:

        return 1

    except whois.parser.PywhoisError as e:

        return 1

    except Exception as e:
        return 1

In [35]:
check_dns_records("https://www.nmpncim.cn/?jxcdphxm")

1

*Age of Domain*

In [36]:
whois.whois("abpdhyeebroilppjvu.dxpzsto.cn/caonima=*").expiration_date

datetime.datetime(2025, 11, 5, 5, 33, 13)

In [37]:
domain_name = urlparse(url).netloc

In [38]:
from datetime import datetime
from urllib.parse import urlparse

def domainAge(url):
    domain_name = urlparse(url).netloc
    try:
        domain_info = whois.whois(domain_name)
    except whois.parser.PywhoisError:
        return 1  # Treat missing WHOIS as suspicious

    if domain_info is None:
        return 1  # Invalid WHOIS data or DNS resolution failure (suspicious)

    creation_date = domain_info.creation_date
    expiration_date = domain_info.expiration_date

    # Handle cases where dates are not available or are lists
    if creation_date is None or expiration_date is None:
        return 1  # If dates are missing, it's suspicious

    # If dates are lists, take the first element
    if isinstance(creation_date, list):
        creation_date = creation_date[0]
    if isinstance(expiration_date, list):
        expiration_date = expiration_date[0]

    # If dates are strings, parse them
    if isinstance(creation_date, str):
        try:
            creation_date = datetime.strptime(creation_date, '%Y-%m-%d')
        except ValueError:
            return 1  # Error parsing date (suspicious)

    if isinstance(expiration_date, str):
        try:
            expiration_date = datetime.strptime(expiration_date, '%Y-%m-%d')
        except ValueError:
            return 1  # Error parsing date (suspicious)

    # Calculate domain age
    today = datetime.now()
    age_in_days = (expiration_date - creation_date).days

    # If domain age is less than 6 months (180 days), it's considered suspicious
    if age_in_days < 180:
        return 1  # Domain is less than 6 months old (suspicious)

    return 0  # Domain is legitimate

In [39]:
domainAge("https://tkiqcybvzaodsquksskzgthq.xmcm10.cn")

1

*End Period of Domain*

In [40]:
from datetime import datetime
from urllib.parse import urlparse

def domainEnd(url):
    domain_name = urlparse(url).netloc
    try:
        domain_info = whois.whois(domain_name)
    except whois.parser.PywhoisError:
        return 1  # Treat missing WHOIS as suspicious

    # ... rest of your function (from line 7 onwards) ...
    if domain_info is None:
        return 1  # Invalid WHOIS data or DNS resolution failure (suspicious)

    expiration_date = domain_info.expiration_date

    if expiration_date is None:
        return 1  # No expiration date found, treat as suspicious

    # If expiration_date is a list, take the first element
    if isinstance(expiration_date, list):
        expiration_date = expiration_date[0]

    # Handle string dates and convert to datetime
    if isinstance(expiration_date, str):
        try:
            expiration_date = datetime.strptime(expiration_date, "%Y-%m-%d")
        except ValueError:
            return 1  # Error parsing date (suspicious)

    # If expiration_date is still not a datetime object, return 1 (suspicious)
    if not isinstance(expiration_date, datetime):
        return 1  # Invalid expiration date format (suspicious)

    # Calculate remaining days until expiration
    today = datetime.now()
    remaining_days = (expiration_date - today).days

    if remaining_days <= 0:
        return 1  # Domain expired (suspicious)

    remaining_months = remaining_days / 30  # Convert days to months
    if remaining_months < 6:
        return 1  # Less than 6 months remaining (suspicious)

    return 0  # Domain is legitimate


In [41]:
domainEnd("https://tkiqcybvzaodsquksskzgthq.xmcm10.cn")

1

In [42]:
domainEnd("https://ajeab.com/")

1

## **HTML and JavaScript-based Features**

Phishing websites often manipulate the HTML and JavaScript code of a webpage to deceive users. Below are some key features that can be used to identify phishing websites:

 * **IFrame Redirection**  
   Phishing websites may use hidden or invisible IFrames to redirect users to malicious sites. IFrames can be used to load a fake page or capture user credentials without their knowledge.

*  **Status Bar Customization**  
   Phishers may alter the status bar in a web browser using JavaScript to trick users into thinking they are interacting with a legitimate website. This feature is commonly used to make suspicious links appear safe.

* **Disabling Right Click**  
   Some phishing sites disable the right-click functionality to prevent users from inspecting the page, viewing source code, or accessing other browser tools that could reveal malicious activity.

* **Website Forwarding**  
   Phishing sites may automatically redirect users to another webpage, often a fake login page, to capture sensitive information. This forwarding is usually achieved using JavaScript or HTML meta-refresh tags.

These features are often indicators of a phishing attempt, and identifying them can help flag suspicious websites.

In [43]:
import requests

In [44]:
import requests
from urllib3.exceptions import MaxRetryError, NameResolutionError
from requests.exceptions import ConnectionError

def make_request(url):
    try:
        response = requests.get(url)
        return response
    except (NameResolutionError, MaxRetryError) as e:
        return ""
    except ConnectionError as e:
        return ""
    except requests.exceptions.RequestException as e:
        return ""

*IFrame Redirection*

In [45]:
a = make_request("https://kmpyhnuvzzdowv.topiary.cn/caonima=*")

In [46]:
def iframe(response):
    if response == "":
        return 1
    else:
        if re.findall(r"[|]", response.text):
            return 0
        else:
            return 1

*Status Bar Customization*

In [47]:
def mouseOver(response):
  if response == "" :
    return 1
  else:
    if re.findall("", response.text):
      return 1
    else:
      return 0

Disabling Right Click

In [48]:
def rightClick(response):
  if response == "":
    return 1
  else:
    if re.findall(r"event.button ?== ?2", response.text):
      return 0
    else:
      return 1

Website Forwarding

In [49]:
def forwarding(response):
  if response == "":
    return 1
  else:
    if len(response.history) <= 2:
      return 0
    else:
      return 1

Calling the function to combine all the extracted features into a single dataset.

In [60]:
def featureExtraction(url,label):

  features = []
  #Address bar based features (9)
  features.append(getDomain(url))
  features.append(haveIP(url))
  features.append(haveAt(url))
  features.append(getLength(url))
  features.append(getDepth(url))
  features.append(redirection(url))
  features.append(httpDomain(url))
  features.append(tinyURL(url))
  features.append(prefixSuffix(url))
  #Domain name system based features (3)
  features.append(check_dns_records(url))
  features.append(domainAge(url))
  features.append(domainEnd(url))

  response = make_request(url)
  #Domain based features (4)
  features.append(iframe(response))
  features.append(mouseOver(response))
  features.append(rightClick(response))
  features.append(forwarding(response))
  features.append(label)

  return features




In [51]:
featureExtraction("https://www.photostory.cn/",1)

['photostory.cn', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1]

To check which instance and its indexes for data retrieval

In [52]:
# phish_set.iloc[]

Like the above example, we can retrieve 5000 preprocessed phishing samples

In [53]:
phish_features = []
label = 1

for i in range(0, 5):# change this to 5000; used 5 for smooth execution
  url = phish_set['URL'].iloc[i]
  phish_features.append(featureExtraction(url,label))

Make it into a DataFrame and label the features for the preprocessed data

In [54]:
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection',
                'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Domain_Age',
                'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

phishing = pd.DataFrame(phish_features, columns= feature_names)
phishing.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,ecrbnvhafhugtkrbfbbgypv.kejiadmin.cn,0,0,0,1,0,0,0,0,1,0,0,1,1,1,1,1
1,tkiqcybvzaodsquksskzgthq.xmcm10.cn,0,0,0,1,0,0,0,0,1,1,1,1,1,1,1,1
2,cistaturko.wixsite.com,0,0,0,1,0,0,0,0,1,1,1,0,1,1,0,1
3,thunderteacher.com,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
4,meenchanrinwile.iusacom.com,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1


Save the preprocessed data as csv file

In [55]:
phishing.to_csv('phishing.csv', index= False)

Legitimate URLs:

In [59]:
legi_features = []
label = 0

for i in range(0, 5):# change this to 5000; used 5 for smooth execution
  url = legiurl['URLs'][i]
  legi_features.append(featureExtraction(url,label))

In [57]:
#converting the list to dataframe
feature_names = ['Domain', 'Have_IP', 'Have_At', 'URL_Length', 'URL_Depth','Redirection',
                'https_Domain', 'TinyURL', 'Prefix/Suffix', 'DNS_Record', 'Domain_Age',
                'Domain_End', 'iFrame', 'Mouse_Over','Right_Click', 'Web_Forwards', 'Label']

legitimate = pd.DataFrame(legi_features, columns= feature_names)
legitimate.head()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,graphicriver.net,0,0,1,1,0,0,0,0,0,0,0,0,1,1,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0,0,0,1,0,1,1,0,0
2,hubpages.com,0,0,1,1,0,0,0,0,0,0,1,0,1,1,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0,1,0,0,1,1,1,1,0
4,icicibank.com,0,0,1,3,0,0,0,0,0,0,0,1,1,1,0,0


In [71]:
legitimate.to_csv('legitimate.csv', index= False)

Using this data for classification tasks we need to explore a bit by combining it into one dataframe

In [76]:
url_processed_data = pd.concat([legitimate, phishing]).reset_index(drop=True)
url_processed_data

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
0,graphicriver.net,0,0,1,1,0,0,0,0,0,1,1,0,0,1,0,0
1,ecnavi.jp,0,0,1,1,1,0,0,0,0,1,1,0,0,1,0,0
2,hubpages.com,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0
3,extratorrent.cc,0,0,1,3,0,0,0,0,0,0,1,0,0,1,0,0
4,icicibank.com,0,0,1,3,0,0,0,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,nslow.t-catas.xyz,0,0,0,1,0,0,0,1,0,0,0,1,1,1,1,1
9996,hdsunshine100.com,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
9997,gbcw.cn,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1
9998,bLuebayantigua.com,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1


In [77]:
url_processed_data.tail()

Unnamed: 0,Domain,Have_IP,Have_At,URL_Length,URL_Depth,Redirection,https_Domain,TinyURL,Prefix/Suffix,DNS_Record,Domain_Age,Domain_End,iFrame,Mouse_Over,Right_Click,Web_Forwards,Label
9995,nslow.t-catas.xyz,0,0,0,1,0,0,0,1,0,0,0,1,1,1,1,1
9996,hdsunshine100.com,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
9997,gbcw.cn,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1
9998,bLuebayantigua.com,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
9999,eevee.tv,0,0,0,4,0,0,0,0,0,0,0,0,0,1,0,1


In [78]:
url_processed_data.shape

(10000, 17)

In [79]:
# Storing the data in CSV file
url_processed_data.to_csv('url_processed.csv', index=False)

To extract those 17 features we used this site as reference

References : https://archive.ics.uci.edu/dataset/327/phishing+websites