## BACKGROUND
Phishing attacks are one of the most prevalent and damaging forms of cybercrime today. These attacks deceive individuals into divulging sensitive information such as usernames, passwords, and credit card numbers by masquerading as legitimate websites. Despite growing awareness, phishing remains a major cybersecurity threat due to the increasing sophistication of attackers who craft URLs and websites that are difficult to distinguish from legitimate ones. The rapid growth of the internet and the ease of setting up fraudulent websites have made this problem even more critical, especially for organizations and individuals who rely heavily on online platforms.

## Problem Statement

Phishing attacks remain one of the most dangerous threats in the digital landscape, deceiving users into revealing sensitive information by imitating legitimate websites. Traditional anti-phishing solutions, such as blacklist-based systems, are often insufficient because they cannot detect newly created phishing sites that have not yet been reported. This project addresses phishing detection by developing a machine learning model that uses structural, domain-based, and behavioral features of URLs to distinguish phishing sites from legitimate ones. The goal is to achieve high accuracy and to generalize well to unseen sites, so that emerging phishing threats can be detected in real time.

## DATA UNDERSTANDING

### URL Structure Features
1. **`url`:** The actual URL of the webpage.
2. **`length_url`:** The length of the entire URL.
3. **`length_hostname`:** The length of the hostname in the URL (e.g., "example.com").
4. **`ip`:** Indicates whether the URL contains an IP address instead of a domain name (phishing URLs often use raw IPs).
5. **`nb_dots`:** Number of dots (.) in the URL, often high for phishing URLs.
6. **`nb_hyphens`:** Number of hyphens (-) in the URL.
7. **`nb_at`:** Number of "@" symbols in the URL, which can be used to hide part of the URL.
8. **`nb_qm`:** Number of question marks (?), often used in dynamic URLs.
9. **`nb_and`:** Number of ampersands (&), used in URL parameters.
10. **`nb_or`:** Number of | symbols, which may be used in obfuscated URLs.
11. **`nb_eq`:** Number of equals (=) signs, commonly seen in URL query strings.
12. **`nb_underscore`:** Number of underscores (_), often used to separate parts of a URL.
13. **`nb_tilde`:** Number of tildes (~), less common but sometimes used.
14. **`nb_percent`:** Number of percent signs (%), used for encoding special characters.
15. **`nb_slash`:** Number of forward slashes (/), indicating the number of directory levels.
16. **`nb_star`:** Number of asterisks (*), rarely used but can be a sign of malicious intent.
17. **`nb_colon`:** Number of colons (:), commonly seen in URLs with ports or schemes.
18. **`nb_comma`:** Number of commas (,) in the URL.
19. **`nb_semicolumn`:** Number of semicolons (;), sometimes used in query strings.
20. **`nb_dollar`:** Number of dollar signs ($), rarely used in legitimate URLs.
21. **`nb_space`:** Number of spaces in the URL, which is unusual and could indicate phishing.
22. **`nb_www`:** Number of "www" occurrences; phishing URLs may add or omit "www" in deceptive ways.
23. **`nb_com`:** Number of occurrences of ".com", since many phishing URLs mimic legitimate ".com" domains.
24. **`nb_dslash`:** Number of double slashes (//), which can occur in various parts of the URL.
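
The character-count features above can be sketched in a few lines of standard-library Python. This is only an illustration: the helper `count_url_characters` and the mapping `COUNTED_CHARS` are hypothetical names, and the sketch assumes each `nb_*` feature is a plain character count over the full URL string, which may not match how the dataset was actually built.

```python
# Illustrative mapping from feature name to the character it counts.
# Assumption: each nb_* feature is a plain character count over the full URL.
COUNTED_CHARS = {
    "nb_dots": ".",
    "nb_hyphens": "-",
    "nb_at": "@",
    "nb_qm": "?",
    "nb_and": "&",
    "nb_eq": "=",
    "nb_underscore": "_",
    "nb_slash": "/",
    "nb_colon": ":",
}


def count_url_characters(url: str) -> dict:
    """Return length_url plus one count per entry in COUNTED_CHARS."""
    features = {"length_url": len(url)}
    for name, char in COUNTED_CHARS.items():
        features[name] = url.count(char)
    return features


features = count_url_characters("https://www.bedslide.com")
print(features["length_url"], features["nb_dots"], features["nb_hyphens"])
```

For `https://www.bedslide.com` this gives a length of 24, two dots, and no hyphens, which agrees with the row for that URL in the `train.head()` output.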

### Protocol & Host Features
25. **`http_in_path`:** Whether "http" is found in the path part of the URL (could be a trick to deceive users).
26. **`https_token`:** Whether "https" appears in parts of the URL other than at the beginning (phishers might place "https" to fake legitimacy).
27. **`ratio_digits_url`:** Ratio of digits in the entire URL.
28. **`ratio_digits_host`:** Ratio of digits in the hostname.
29. **`punycode`:** Checks if Punycode (encoding for Unicode characters in domain names) is used, as phishing sites sometimes use internationalized domain names.
30. **`port`:** If the URL contains a non-standard port (e.g., other than 80 or 443), it can be suspicious.
31. **`tld_in_path`:** Checks if the top-level domain (TLD, like ".com") appears in the path rather than the domain itself, which is a phishing indicator.
32. **`tld_in_subdomain`:** Checks if the TLD appears in the subdomain, another potential phishing indicator.
33. **`abnormal_subdomain`:** Whether the subdomain structure is abnormal (e.g., very long or complex).
34. **`nb_subdomains`:** Number of subdomains (phishing URLs may use long chains of subdomains).
35. **`prefix_suffix`:** Whether there is a prefix-suffix structure (e.g., using - between domain parts, which is common in phishing).
36. **`random_domain`:** Indicates if the domain name appears to be randomly generated.
37. **`shortening_service`:** If the URL is shortened by a URL shortening service (e.g., bit.ly), which phishers often use.
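
Several of the host-level features above can be approximated with `urllib.parse` and `ipaddress` from the standard library. The function `host_features` below is a hypothetical sketch based only on the descriptions in this list; the rules used to build the dataset are not shown in this notebook, so the values it produces need not match the dataset columns.

```python
import ipaddress
from urllib.parse import urlsplit


def host_features(url: str) -> dict:
    """Approximate a few host-level features from the descriptions above."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # ip: 1 if the hostname parses as an IPv4 or IPv6 literal
    try:
        ipaddress.ip_address(host)
        is_ip = 1
    except ValueError:
        is_ip = 0

    digits_url = sum(ch.isdigit() for ch in url)
    digits_host = sum(ch.isdigit() for ch in host)

    return {
        "length_hostname": len(host),
        "ip": is_ip,
        "ratio_digits_url": digits_url / len(url) if url else 0.0,
        "ratio_digits_host": digits_host / len(host) if host else 0.0,
        # punycode: any hostname label carrying the "xn--" prefix
        "punycode": int(any(label.startswith("xn--") for label in host.split("."))),
        # port: an explicit port other than 80 or 443 (assumed rule)
        "port": int(parts.port is not None and parts.port not in (80, 443)),
    }


ip_example = host_features("http://192.168.0.1:8080/login")
plain_example = host_features("https://www.bedslide.com")
print(ip_example)
print(plain_example)
```

The first URL is flagged as an IP address with a non-standard port, while the second yields a hostname length of 16 and no digits.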

### Path & Word Features
38. **`path_extension`:** Whether the URL path contains a suspicious extension (e.g., ".exe" or ".php").
39. **`nb_redirection`:** Number of redirections (//) within the URL.
40. **`nb_external_redirection`:** Number of redirections to external sites.
41. **`length_words_raw`:** Total length of words in the URL.
42. **`char_repeat`:** Number of repeated characters in the URL.
43. **`shortest_words_raw`:** Length of the shortest word in the URL.
44. **`shortest_word_host`:** Length of the shortest word in the host.
45. **`shortest_word_path`:** Length of the shortest word in the path.
46. **`longest_words_raw`:** Length of the longest word in the URL.
47. **`longest_word_host`:** Length of the longest word in the host.
48. **`longest_word_path`:** Length of the longest word in the path.
49. **`avg_words_raw`:** Average length of words in the URL.
50. **`avg_word_host`:** Average length of words in the host.
51. **`avg_word_path`:** Average length of words in the path.
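
The word-length features for the host and path can be illustrated as follows. This is a rough sketch: it assumes a "word" is a maximal run of ASCII letters and digits, which is a guess rather than the dataset's documented tokenization, and the helpers `word_lengths`, `summarize`, and `word_features` are hypothetical.

```python
import re
from urllib.parse import urlsplit


def word_lengths(text: str) -> list:
    """Split on non-alphanumeric characters and return the word lengths."""
    return [len(word) for word in re.split(r"[^A-Za-z0-9]+", text) if word]


def summarize(lengths: list, suffix: str) -> dict:
    """Shortest, longest and average word length; zeros when there are no words."""
    if not lengths:
        return {f"shortest_{suffix}": 0, f"longest_{suffix}": 0, f"avg_{suffix}": 0.0}
    return {
        f"shortest_{suffix}": min(lengths),
        f"longest_{suffix}": max(lengths),
        f"avg_{suffix}": sum(lengths) / len(lengths),
    }


def word_features(url: str) -> dict:
    parts = urlsplit(url)
    features = {}
    features.update(summarize(word_lengths(parts.hostname or ""), "word_host"))
    features.update(summarize(word_lengths(parts.path), "word_path"))
    return features


stats = word_features("https://www.bedslide.com/how-to-make")
print(stats)
```

Here the host splits into `www`, `bedslide`, and `com`, and the path into `how`, `to`, and `make`, so the shortest path word has length 2 and the longest host word has length 8.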

### Brand & Domain Features
52. **`phish_hints`:** Whether the URL contains hints indicating phishing.
53. **`domain_in_brand`:** If the domain contains a known brand name.
54. **`brand_in_subdomain`:** If the brand name appears in the subdomain.
55. **`brand_in_path`:** If the brand name appears in the path.
56. **`suspecious_tld`:** Whether the TLD is suspicious or frequently used in phishing domains.
57. **`statistical_report`:** Whether the domain is flagged in statistical reports or blacklists.

### Hyperlink & Media Features
58. **`nb_hyperlinks`:** Number of hyperlinks in the webpage.
59. **`ratio_intHyperlinks`:** Ratio of internal hyperlinks to the total.
60. **`ratio_extHyperlinks`:** Ratio of external hyperlinks.
61. **`ratio_nullHyperlinks`:** Ratio of hyperlinks without valid URLs (could be a phishing tactic).
62. **`nb_extCSS`:** Number of external CSS files linked in the page.
63. **`ratio_intRedirection`:** Ratio of internal redirections.
64. **`ratio_extRedirection`:** Ratio of external redirections.
65. **`ratio_intErrors`:** Ratio of internal errors.
66. **`ratio_extErrors`:** Ratio of external errors.
67. **`login_form`:** Whether a login form is present, which can be used to steal credentials.
68. **`external_favicon`:** Whether the favicon (site icon) is hosted externally (which could indicate phishing).
69. **`links_in_tags`:** Number of hyperlinks embedded within HTML tags.
70. **`submit_email`:** Whether a form submits data via email (a common phishing tactic).
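
As an illustration of the hyperlink ratios, the sketch below classifies the `<a href>` targets of a page as internal, external, or null using only the standard library. It rests on assumptions that are not confirmed by the dataset: only anchor tags are counted, "internal" means the same hostname as the page, and empty, `#`, and `javascript:` links are treated as null. `LinkCollector` and `hyperlink_ratios` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (empty string if missing)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")


def hyperlink_ratios(html: str, page_url: str) -> dict:
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    internal = external = null = 0
    for href in collector.hrefs:
        # Assumed definition of a "null" link
        if href in ("", "#") or href.lower().startswith("javascript:"):
            null += 1
        elif urlsplit(urljoin(page_url, href)).hostname == page_host:
            internal += 1
        else:
            external += 1
    total = len(collector.hrefs)
    return {
        "nb_hyperlinks": total,
        "ratio_intHyperlinks": internal / total if total else 0.0,
        "ratio_extHyperlinks": external / total if total else 0.0,
        "ratio_nullHyperlinks": null / total if total else 0.0,
    }


sample_html = (
    '<a href="/about">About</a>'
    '<a href="https://other.example/x">Partner</a>'
    '<a href="#">Top</a>'
    '<a href="contact.html">Contact</a>'
)
ratios = hyperlink_ratios(sample_html, "https://site.example/index.html")
print(ratios)
```

In this toy page two of the four links are internal, one is external, and one is null.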

### JavaScript & Behavior Features
71. **`ratio_intMedia`:** Ratio of internal media resources (e.g., images, videos).
72. **`ratio_extMedia`:** Ratio of external media resources.
73. **`sfh`:** Server Form Handler; checks whether the form action points to a different domain, which can be used for phishing.
74. **`iframe`:** Whether an iframe is present on the page (sometimes used for phishing).
75. **`popup_window`:** Whether the page triggers a popup window (phishing pages may use this).
76. **`safe_anchor`:** Whether the anchor (`<a>`) tags are used safely (i.e., without empty or suspicious links).
77. **`onmouseover`:** Whether the onmouseover event is used for redirection or trickery.
78. **`right_clic`:** Whether the page disables right-click, which can be a tactic to prevent inspection.
79. **`empty_title`:** Whether the page has an empty or suspicious title.
80. **`domain_in_title`:** Whether the domain is part of the page title.
81. **`domain_with_copyright`:** Whether the domain appears in the page's copyright notice (sometimes used to appear legitimate).

### Domain & Web Reputation Features
82. **`whois_registered_domain`:** Whether the domain is registered (checked via WHOIS).
83. **`domain_registration_length`:** The length of time the domain has been registered.
84. **`domain_age`:** The age of the domain.
85. **`web_traffic`:** The amount of web traffic the domain receives (low traffic may indicate phishing).
86. **`dns_record`:** Whether the domain has a valid DNS record.
87. **`google_index`:** Whether the domain or page is indexed by Google (phishing sites may not be indexed).
88. **`page_rank`:** PageRank score, an indicator of a site’s importance (phishing sites typically have low or no rank).

### Label
89. **`status`:** The target variable indicating whether the URL is phishing or legitimate.

In [1]:
import pandas as pd
import numpy as np

Loading the dataset

In [2]:
train = pd.read_parquet("Training.parquet")
train.head()

| | url | length_url | length_hostname | ip | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_or | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.todayshomeowner.com/how-to-make-ho... | 82 | 23 | 0 | 2 | 7 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 240 | 8892 | 67860 | 0 | 1 | 4 | legitimate |
| 1 | http://thapthan.ac.th/information/confirmation... | 93 | 14 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 2996 | 4189860 | 0 | 1 | 2 | phishing |
| 2 | http://app.dialoginsight.com/T/OFC4/L2S/3888/B... | 121 | 21 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 30 | 2527 | 346022 | 0 | 1 | 3 | phishing |
| 3 | https://www.bedslide.com | 24 | 16 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 139 | 7531 | 1059151 | 0 | 0 | 4 | legitimate |
| 4 | https://tabs.ultimate-guitar.com/s/sex_pistols... | 73 | 24 | 0 | 3 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 3002 | 7590 | 635 | 0 | 1 | 5 | legitimate |


In [3]:
train.shape

(7658, 89)

- The dataset has 7658 observations and 89 variables.
- Each row represents one website (URL) that was evaluated for phishing.
- Each column holds a feature extracted for that website; the last column, `status`, is the label.

Checking info about the dataset

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7658 entries, 0 to 7657
Data columns (total 89 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   url                         7658 non-null   object 
 1   length_url                  7658 non-null   int64  
 2   length_hostname             7658 non-null   int64  
 3   ip                          7658 non-null   int64  
 4   nb_dots                     7658 non-null   int64  
 5   nb_hyphens                  7658 non-null   int64  
 6   nb_at                       7658 non-null   int64  
 7   nb_qm                       7658 non-null   int64  
 8   nb_and                      7658 non-null   int64  
 9   nb_or                       7658 non-null   int64  
 10  nb_eq                       7658 non-null   int64  
 11  nb_underscore               7658 non-null   int64  
 12  nb_tilde                    7658 non-null   int64  
 13  nb_percent                  7658 

- The dataset contains both numerical and categorical variables, represented in the output above as `int64` and `object` data types respectively.

Check whether the data has missing values

In [5]:
train.isna().sum().value_counts()

0    89
dtype: int64

None of the variables has missing values: all 89 columns report zero nulls.

### EXPLORATORY DATA ANALYSIS

Checking the `url` column for repeated entries before exploring the target variable (`status`)

In [6]:
train['url'].value_counts()

http://e710z0ear.du.r.appspot.com/c:/users/user/downlo                                                                    2
https://www.documentcloud.org/documents/2462194-the-senate-select-committee-on-intelligence.html                          1
https://elevenpuppy.com/hnd/sbc/sbc/sbcglobal.net.htm                                                                     1
http://www.sagam.sn/images/mailerhome.php                                                                                 1
https://agrocomlimited.com/JonCGoodman/toda/                                                                              1
                                                                                                                         ..
http://https.www.sandbox.paypal.com.ttlart2012ttcysu.aylandirow.tmf.org.ru/                                               1
http://mucins.weebly.com/13-protocols-and-standards.html                                                                  1
http://b

#### Checking the number of values contained in the target variable

In [7]:
# code to check number of values in the target variable
train['status'].value_counts()

legitimate    3829
phishing      3829
Name: status, dtype: int64

- The `status` variable has two classes, `legitimate` and `phishing`, with 3829 observations each, so the training set is perfectly balanced.
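
Class balance is often easier to read as proportions than as raw counts. The following sketch rebuilds a stand-in for `train['status']` from the counts printed above (the stand-in Series is an assumption made so the snippet runs on its own) and normalizes the counts.

```python
import pandas as pd

# Stand-in for train['status'], rebuilt from the counts printed above
status = pd.Series(["legitimate"] * 3829 + ["phishing"] * 3829, name="status")

# normalize=True returns the share of each class instead of the raw count
proportions = status.value_counts(normalize=True)
print(proportions)
```

With the real data, `train['status'].value_counts(normalize=True)` would give the same view directly.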

In [8]:
train.columns

Index(['url', 'length_url', 'length_hostname', 'ip', 'nb_dots', 'nb_hyphens',
       'nb_at', 'nb_qm', 'nb_and', 'nb_or', 'nb_eq', 'nb_underscore',
       'nb_tilde', 'nb_percent', 'nb_slash', 'nb_star', 'nb_colon', 'nb_comma',
       'nb_semicolumn', 'nb_dollar', 'nb_space', 'nb_www', 'nb_com',
       'nb_dslash', 'http_in_path', 'https_token', 'ratio_digits_url',
       'ratio_digits_host', 'punycode', 'port', 'tld_in_path',
       'tld_in_subdomain', 'abnormal_subdomain', 'nb_subdomains',
       'prefix_suffix', 'random_domain', 'shortening_service',
       'path_extension', 'nb_redirection', 'nb_external_redirection',
       'length_words_raw', 'char_repeat', 'shortest_words_raw',
       'shortest_word_host', 'shortest_word_path', 'longest_words_raw',
       'longest_word_host', 'longest_word_path', 'avg_words_raw',
       'avg_word_host', 'avg_word_path', 'phish_hints', 'domain_in_brand',
       'brand_in_subdomain', 'brand_in_path', 'suspecious_tld',
       'statistical_report', 

In [9]:
train.dns_record.value_counts()

0    7496
1     162
Name: dns_record, dtype: int64

In [10]:
test = pd.read_parquet("Testing.parquet")
test

| | url | length_url | length_hostname | ip | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_or | ... | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://clubedemilhagem.com/home.php | 36 | 19 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 344 | 21 | 0 | 0 | 1 | 0 | phishing |
| 1 | http://www.medicalnewstoday.com/articles/18893... | 51 | 24 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 103 | 6106 | 737 | 0 | 1 | 6 | legitimate |
| 2 | https://en.wikipedia.org/wiki/NBC_Nightly_News | 46 | 16 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 901 | 7134 | 12 | 0 | 0 | 7 | legitimate |
| 3 | http://secure.web894.com/customer_center/custo... | 185 | 17 | 1 | 2 | 1 | 0 | 1 | 2 | 0 | ... | 1 | 1 | 0 | 247 | 1944 | 0 | 0 | 1 | 0 | phishing |
| 4 | https://en.wikipedia.org/wiki/Transaction_proc... | 52 | 16 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 901 | 7134 | 12 | 0 | 0 | 7 | legitimate |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3767 | http://www.sublimefrequencies.com/ | 34 | 26 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 373 | 6202 | 7701846 | 0 | 0 | 5 | legitimate |
| 3768 | http://koei.wikia.com/wiki/Dynasty_Warriors:_U... | 54 | 14 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 139 | 6071 | 14420 | 0 | 0 | 5 | legitimate |
| 3769 | https://www.motorzona.ru/ | 25 | 16 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 238 | 5971 | 402341 | 0 | 0 | 3 | legitimate |
| 3770 | https://login.microsoftonline.com/aa687de1-52b... | 550 | 25 | 1 | 5 | 24 | 0 | 1 | 9 | 0 | ... | 1 | 1 | 0 | 349 | 6591 | 30 | 0 | 1 | 4 | legitimate |


### Combine the train and test datasets

In [11]:
combined_data = pd.concat([train, test], axis=0, ignore_index=True)
combined_data.shape

(11430, 89)

#### Drop columns that contain only one unique value

In [12]:
# Collect constant columns: a single unique value carries no signal for the model
drop_columns = [col for col in combined_data.columns
                if combined_data[col].nunique() == 1]

# The raw URL string is dropped as well
drop_columns.append('url')

combined_data.drop(columns=drop_columns, inplace=True)

Check the shape of the combined dataset

In [13]:
combined_data.shape

(11430, 82)

The dataset now has 11430 observations and 82 columns: seven of the original 89 were dropped, namely `url` and six constant columns.
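
The constant-column filter can be checked on a small example. The toy DataFrame below is invented purely for illustration (its values are not taken from the dataset); it contains one constant column, which the filter should pick up alongside `url`.

```python
import pandas as pd

# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "url": ["http://a.example", "http://b.example", "http://c.example"],
    "nb_dots": [1, 2, 3],
    "nb_or": [0, 0, 0],  # constant, so it should be dropped
    "status": ["legitimate", "phishing", "legitimate"],
})

# A column with exactly one unique value cannot help separate the classes
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
reduced = toy.drop(columns=constant_columns + ["url"])

print(constant_columns)
print(list(reduced.columns))
```

Printing `constant_columns` on the real `combined_data` before dropping would also document exactly which features were removed.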

In [14]:
combined_data.describe()

| | length_url | length_hostname | ip | nb_dots | nb_hyphens | nb_at | nb_qm | nb_and | nb_eq | nb_underscore | ... | empty_title | domain_in_title | domain_with_copyright | whois_registered_domain | domain_registration_length | domain_age | web_traffic | dns_record | google_index | page_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | ... | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 | 11430.0 |
| mean | 61.126684 | 21.090289 | 0.150569 | 2.480752 | 0.99755 | 0.022222 | 0.141207 | 0.162292 | 0.293176 | 0.32266 | ... | 0.124759 | 0.775853 | 0.439545 | 0.072878 | 492.532196 | 4062.543745 | 856756.6 | 0.020122 | 0.533946 | 3.185739 |
| std | 55.297318 | 10.777171 | 0.357644 | 1.369686 | 2.087087 | 0.1555 | 0.364456 | 0.821337 | 0.998317 | 1.093336 | ... | 0.33046 | 0.417038 | 0.496353 | 0.259948 | 814.769415 | 3107.7846 | 1995606.0 | 0.140425 | 0.498868 | 2.536955 |
| min | 12.0 | 4.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -12.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 33.0 | 15.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 84.0 | 972.25 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 47.0 | 19.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 242.0 | 3993.0 | 1651.0 | 0.0 | 1.0 | 3.0 |
| 75% | 71.0 | 24.0 | 0.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 1.0 | 0.0 | 449.0 | 7026.75 | 373845.5 | 0.0 | 1.0 | 5.0 |
| max | 1641.0 | 214.0 | 1.0 | 24.0 | 43.0 | 4.0 | 3.0 | 19.0 | 19.0 | 18.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 29829.0 | 12874.0 | 10767990.0 | 1.0 | 1.0 | 10.0 |
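
The summary shows negative minimums for `domain_registration_length` (-1) and `domain_age` (-12), which cannot be real durations. One plausible reading, stated here as an assumption rather than a documented fact, is that they are placeholder values for failed lookups. A quick way to see how many rows are affected is to count negatives per numeric column; the sketch below uses a small invented frame so that it runs on its own.

```python
import pandas as pd

# Toy frame, invented for illustration; the real check would use combined_data
toy = pd.DataFrame({
    "domain_registration_length": [240, -1, 30],
    "domain_age": [8892, -12, 2527],
    "page_rank": [4, 2, 3],
})

# Count negative entries in every numeric column, then keep only affected columns
numeric = toy.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()
negative_counts = negative_counts[negative_counts > 0]

print(negative_counts)
```

How to treat such rows (leave as-is, replace with a missing-value marker, or add an indicator column) is a modelling decision that depends on what the negative values actually encode.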
