In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('PhiUSIIL_Phishing_URL_Dataset.csv')
data.shape

(235795, 56)

In [5]:
data.head()

Unnamed: 0,FILENAME,URL,URLLength,Domain,DomainLength,IsDomainIP,TLD,URLSimilarityIndex,CharContinuationRate,TLDLegitimateProb,...,Pay,Crypto,HasCopyrightInfo,NoOfImage,NoOfCSS,NoOfJS,NoOfSelfRef,NoOfEmptyRef,NoOfExternalRef,label
0,521848.txt,https://www.southbankmosaics.com,31,www.southbankmosaics.com,24,0,com,100.0,1.0,0.522907,...,0,0,1,34,20,28,119,0,124,1
1,31372.txt,https://www.uni-mainz.de,23,www.uni-mainz.de,16,0,de,100.0,0.666667,0.03265,...,0,0,1,50,9,8,39,0,217,1
2,597387.txt,https://www.voicefmradio.co.uk,29,www.voicefmradio.co.uk,22,0,uk,100.0,0.866667,0.028555,...,0,0,1,10,2,7,42,2,5,1
3,554095.txt,https://www.sfnmjournal.com,26,www.sfnmjournal.com,19,0,com,100.0,1.0,0.522907,...,1,1,1,3,27,15,22,1,31,1
4,151578.txt,https://www.rewildingargentina.org,33,www.rewildingargentina.org,26,0,org,100.0,1.0,0.079963,...,1,0,1,244,15,34,72,1,85,1


In [6]:
data.columns

Index(['FILENAME', 'URL', 'URLLength', 'Domain', 'DomainLength', 'IsDomainIP',
       'TLD', 'URLSimilarityIndex', 'CharContinuationRate',
       'TLDLegitimateProb', 'URLCharProb', 'TLDLength', 'NoOfSubDomain',
       'HasObfuscation', 'NoOfObfuscatedChar', 'ObfuscationRatio',
       'NoOfLettersInURL', 'LetterRatioInURL', 'NoOfDegitsInURL',
       'DegitRatioInURL', 'NoOfEqualsInURL', 'NoOfQMarkInURL',
       'NoOfAmpersandInURL', 'NoOfOtherSpecialCharsInURL',
       'SpacialCharRatioInURL', 'IsHTTPS', 'LineOfCode', 'LargestLineLength',
       'HasTitle', 'Title', 'DomainTitleMatchScore', 'URLTitleMatchScore',
       'HasFavicon', 'Robots', 'IsResponsive', 'NoOfURLRedirect',
       'NoOfSelfRedirect', 'HasDescription', 'NoOfPopup', 'NoOfiFrame',
       'HasExternalFormSubmit', 'HasSocialNet', 'HasSubmitButton',
       'HasHiddenFields', 'HasPasswordField', 'Bank', 'Pay', 'Crypto',
       'HasCopyrightInfo', 'NoOfImage', 'NoOfCSS', 'NoOfJS', 'NoOfSelfRef',
       'NoOfEmptyRef', 'NoOf

# Dataset Descriptions
## Features (TODO: check the validity of these AI-generated feature descriptions)
- **URL**: The complete URL of the website being analyzed.
- **URLLength**: The total number of characters in the URL.
- **Domain**: The main domain name extracted from the URL.
- **DomainLength**: The length of the domain name in characters.
- **IsDomainIP**: A binary indicator of whether the domain is an IP address (1 for true, 0 for false).
- **TLD**: The top-level domain of the URL (e.g., .com, .org).
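- **URLSimilarityIndex**: A numerical similarity index for the URL. The rows in the `data.head()` output, all with `label` 1, score 100.0, which suggests it measures similarity to known legitimate URLs rather than to phishing URLs.
- **CharContinuationRate**: A measure of how continuous the runs of same-type characters (letters, digits, special characters) in the URL are. It does not appear to count repeated characters: `www.southbankmosaics.com` scores 1.0 in the `data.head()` output, while `www.uni-mainz.de` scores 0.666667.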
- **TLDLegitimateProb**: The probability that the top-level domain is legitimate based on historical data.
- **URLCharProb**: The probability of encountering specific characters in the URL based on historical data.
- **TLDLength**: The length of the top-level domain in characters.
- **NoOfSubDomain**: The number of subdomains present in the URL.
- **HasObfuscation**: A binary indicator of whether the URL contains obfuscation techniques (1 for true, 0 for false).
- **NoOfObfuscatedChar**: The number of characters in the URL that are obfuscated.
- **ObfuscationRatio**: The ratio of obfuscated characters to the total number of characters in the URL.
- **NoOfLettersInURL**: The total number of alphabetic characters in the URL.
- **LetterRatioInURL**: The ratio of letters to the total number of characters in the URL.
- **NoOfDegitsInURL**: The total number of numeric digits in the URL.
- **DegitRatioInURL**: The ratio of digits to the total number of characters in the URL.
- **NoOfEqualsInURL**: The number of equal signs (`=`) present in the URL.
- **NoOfQMarkInURL**: The number of question marks (`?`) present in the URL.
- **NoOfAmpersandInURL**: The number of ampersands (`&`) present in the URL.
- **NoOfOtherSpecialCharsInURL**: The number of other special characters in the URL (excluding `=`, `?`, and `&`).
- **SpacialCharRatioInURL**: The ratio of special characters to the total number of characters in the URL.
- **IsHTTPS**: A binary indicator of whether the URL uses HTTPS (1 for true, 0 for false).
- **LineOfCode**: The total number of lines of code in the webpage.
- **LargestLineLength**: The length of the longest line of code in the webpage.
- **HasTitle**: A binary indicator of whether the webpage has a title (1 for true, 0 for false).
- **Title**: The title of the webpage, if present.
- **DomainTitleMatchScore**: A score indicating the match between the domain name and the webpage title.
- **URLTitleMatchScore**: A score indicating the match between the URL and the webpage title.
- **HasFavicon**: A binary indicator of whether the webpage has a favicon (1 for true, 0 for false).
- **Robots**: A binary indicator of whether the site has a robots.txt file, which tells search engines how to interact with the site (1 for true, 0 for false).
- **IsResponsive**: A binary indicator of whether the webpage is responsive (1 for true, 0 for false).
- **NoOfURLRedirect**: The number of redirects that occur when accessing the URL.
- **NoOfSelfRedirect**: The number of self-referential redirects in the URL.
- **HasDescription**: A binary indicator of whether the webpage has a meta description (1 for true, 0 for false).
- **NoOfPopup**: The number of pop-up elements present on the webpage.
- **NoOfiFrame**: The number of iframe elements present on the webpage.
- **HasExternalFormSubmit**: A binary indicator of whether the webpage has forms that submit data to external sites (1 for true, 0 for false).
- **HasSocialNet**: A binary indicator of whether the webpage has social network links (1 for true, 0 for false).
- **HasSubmitButton**: A binary indicator of whether the webpage has a submit button (1 for true, 0 for false).
- **HasHiddenFields**: A binary indicator of whether the webpage contains hidden form fields (1 for true, 0 for false).
- **HasPasswordField**: A binary indicator of whether the webpage has a password input field (1 for true, 0 for false).
- **Bank**: A binary indicator of whether the webpage is related to banking services (1 for true, 0 for false).
- **Pay**: A binary indicator of whether the webpage is related to payment services (1 for true, 0 for false).
- **Crypto**: A binary indicator of whether the webpage is related to cryptocurrency services (1 for true, 0 for false).
- **HasCopyrightInfo**: A binary indicator of whether the webpage contains copyright information (1 for true, 0 for false).
- **NoOfImage**: The total number of images present on the webpage.
- **NoOfCSS**: The total number of CSS files linked or embedded in the webpage.
- **NoOfJS**: The total number of JavaScript files linked or embedded in the webpage.
- **NoOfSelfRef**: The number of self-referential links within the webpage.
- **NoOfEmptyRef**: The number of empty references (links with no destination) present in the webpage.
- **NoOfExternalRef**: The number of external references (links to other domains) present in the webpage.
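
Since the descriptions above still need checking, a few of them can be tested directly against the data. The following is a minimal sketch, not a complete validation: it rebuilds the first three rows of the `data.head()` output for a handful of columns, then compares the recorded lengths with the actual string lengths and confirms that the indicator columns hold only 0 and 1. The helper names `length_offsets` and `non_binary_columns` are hypothetical and are not part of the dataset or of pandas.

```python
import pandas as pd

# First three rows of the data.head() output above, restricted to a few columns.
sample = pd.DataFrame(
    {
        "URL": [
            "https://www.southbankmosaics.com",
            "https://www.uni-mainz.de",
            "https://www.voicefmradio.co.uk",
        ],
        "URLLength": [31, 23, 29],
        "Domain": [
            "www.southbankmosaics.com",
            "www.uni-mainz.de",
            "www.voicefmradio.co.uk",
        ],
        "DomainLength": [24, 16, 22],
        "IsDomainIP": [0, 0, 0],
        "HasCopyrightInfo": [1, 1, 1],
    }
)


def length_offsets(df, text_col, length_col):
    """Return (recorded length - actual string length) for each row."""
    return df[length_col] - df[text_col].str.len()


def non_binary_columns(df, columns):
    """Return the columns from `columns` that contain values other than 0 and 1."""
    return [col for col in columns if not df[col].isin([0, 1]).all()]


url_offsets = length_offsets(sample, "URL", "URLLength")
domain_offsets = length_offsets(sample, "Domain", "DomainLength")
bad_binary = non_binary_columns(sample, ["IsDomainIP", "HasCopyrightInfo"])

print(url_offsets.tolist())
print(domain_offsets.tolist())
print(bad_binary)
```

On these three rows, `DomainLength` matches the length of `Domain`, but `URLLength` is one less than the length of the stored `URL` string, so the "total number of characters" description may not hold exactly. Running the same helpers on the full `data` frame would show whether the offset is consistent.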

# References
- [Dataset](https://doi.org/10.1016/j.cose.2023.103545)
- [Paper](https://www.sciencedirect.com/science/article/pii/S0167404823004558)