# Phishing Website Detection 

## Importing necessary libraries 

In [1]:
import numpy as np 
import pandas as pd 

## Reading the dataset 

In [2]:
df = pd.read_csv(r"C:\Users\cl501_26\Downloads\phiusiil+phishing+url+dataset\PhiUSIIL_Phishing_URL_Dataset.csv")
df.head()

| | FILENAME | URL | URLLength | Domain | DomainLength | IsDomainIP | TLD | URLSimilarityIndex | CharContinuationRate | TLDLegitimateProb | ... | Pay | Crypto | HasCopyrightInfo | NoOfImage | NoOfCSS | NoOfJS | NoOfSelfRef | NoOfEmptyRef | NoOfExternalRef | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 521848.txt | https://www.southbankmosaics.com | 31 | www.southbankmosaics.com | 24 | 0 | com | 100.0 | 1.0 | 0.522907 | ... | 0 | 0 | 1 | 34 | 20 | 28 | 119 | 0 | 124 | 1 |
| 1 | 31372.txt | https://www.uni-mainz.de | 23 | www.uni-mainz.de | 16 | 0 | de | 100.0 | 0.666667 | 0.03265 | ... | 0 | 0 | 1 | 50 | 9 | 8 | 39 | 0 | 217 | 1 |
| 2 | 597387.txt | https://www.voicefmradio.co.uk | 29 | www.voicefmradio.co.uk | 22 | 0 | uk | 100.0 | 0.866667 | 0.028555 | ... | 0 | 0 | 1 | 10 | 2 | 7 | 42 | 2 | 5 | 1 |
| 3 | 554095.txt | https://www.sfnmjournal.com | 26 | www.sfnmjournal.com | 19 | 0 | com | 100.0 | 1.0 | 0.522907 | ... | 1 | 1 | 1 | 3 | 27 | 15 | 22 | 1 | 31 | 1 |
| 4 | 151578.txt | https://www.rewildingargentina.org | 33 | www.rewildingargentina.org | 26 | 0 | org | 100.0 | 1.0 | 0.079963 | ... | 1 | 0 | 1 | 244 | 15 | 34 | 72 | 1 | 85 | 1 |


## 📜 Description About The Columns 

1. **FILENAME** 📂: The name of the file containing the website data.

2. **URL** 🌐: The web address of the website.

3. **URLLength** 📏: The total number of characters in the URL.

4. **Domain** 🌍: The host part of the URL, including any subdomain (e.g., "www.example.com" in https://www.example.com).

5. **DomainLength** 🧮: The length of the domain name in characters.

6. **IsDomainIP** 🌐🔢: Whether the domain is an IP address instead of a regular domain name.

7. **TLD** 🏷️: The top-level domain like .com, .org, etc.

8. **URLSimilarityIndex** 📊: A score showing how similar the URL is to known legitimate URLs.

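9. **CharContinuationRate** 🔁: The share of the URL made up of its longest unbroken runs of letters, digits, and special characters; lower values indicate a more mixed-up character sequence.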

10. **TLDLegitimateProb** 🛡️: The probability that the top-level domain is valid and trustworthy.

11. **URLCharProb** 🔠: The likelihood that characters in the URL are legitimate.

12. **TLDLength** 🧵: The number of characters in the top-level domain.

13. **NoOfSubDomain** 🔗: The number of subdomains in the URL (e.g., "blog" in blog.example.com).

14. **HasObfuscation** 🎭: Whether the URL uses obfuscation to disguise its true nature.

15. **NoOfObfuscatedChar** 🕵️: The number of characters in the URL that are obfuscated.

16. **ObfuscationRatio** 📉: The ratio of obfuscated characters to the total number of characters in the URL.

17. **NoOfLettersInURL** 🔠📈: The count of letters in the URL.

18. **LetterRatioInURL** 📏🔠: The ratio of letters to the total number of characters in the URL.

19. **NoOfDegitsInURL** 🔢📊: The count of digits in the URL (the column name is misspelled in the dataset).

20. **DegitRatioInURL** 📊🔢: The ratio of digits to the total number of characters in the URL (the column name is misspelled in the dataset).

21. **NoOfEqualsInURL** ➗: The number of equals signs (=) in the URL.

22. **NoOfQMarkInURL** ❓: The number of question marks (?) in the URL.

23. **NoOfAmpersandInURL** &️⃣: The number of ampersands (&) in the URL.

24. **NoOfOtherSpecialCharsInURL** ⚠️: The count of other special characters in the URL.

25. **SpacialCharRatioInURL** 📉✨: The ratio of special characters to the total number of characters in the URL (the column name is misspelled in the dataset).

26. **IsHTTPS** 🔒: Whether the website uses HTTPS for secure browsing.

27. **LineOfCode** 🖥️: The total number of lines of code on the webpage.

28. **LargestLineLength** 📏🖥️: The length of the longest line of code.

29. **HasTitle** 🏷️: Indicates if the webpage has a title tag.

30. **Title** 📝: The text of the webpage's title tag.

31. **DomainTitleMatchScore** 🌍🏆: A score indicating how well the domain matches the title of the webpage.

32. **URLTitleMatchScore** 🌐🏅: A score showing how closely the URL matches the title of the webpage.

33. **HasFavicon** 🖼️: Whether the website has a favicon (small icon in the browser tab).

34. **Robots** 🤖: Indicates if the site has a robots.txt file to guide search engine crawlers.

35. **IsResponsive** 📱💻: Whether the website adapts to different screen sizes.

36. **NoOfURLRedirect** 🔄: The number of redirects the URL undergoes.

37. **NoOfSelfRedirect** 🔁: The number of times the URL redirects to itself.

38. **HasDescription** 📝: Indicates if the webpage has a meta description tag.

39. **NoOfPopup** 🛑: The number of pop-up windows or alerts on the webpage.

40. **NoOfiFrame** 🖼️🔢: The number of iframes (embedded content) on the webpage.

41. **HasExternalFormSubmit** 🌐✉️: Indicates if the webpage has forms that submit data to external servers.

42. **HasSocialNet** 🌍📲: Indicates if there are links to social networks on the site.

43. **HasSubmitButton** ✅: Whether the webpage includes a form submit button.

44. **HasHiddenFields** 🕵️‍♂️: Indicates if there are hidden form fields on the webpage.

45. **HasPasswordField** 🔐: Indicates if there is a password input field on the page.

46. **Bank** 🏦: Indicates if the website is related to banking services.

47. **Pay** 💳: Indicates if the website deals with payment services.

48. **Crypto** 💰🔑: Indicates if the website is related to cryptocurrencies.

49. **HasCopyrightInfo** ©️: Indicates if the webpage contains copyright information.

50. **NoOfImage** 🖼️: The number of images on the webpage.

51. **NoOfCSS** 🎨: The number of CSS (Cascading Style Sheets) files linked to the webpage.

52. **NoOfJS** 📜: The number of JavaScript files linked to the webpage.

53. **NoOfSelfRef** 🔗: The number of times the webpage links to itself.

54. **NoOfEmptyRef** 🚫🔗: The number of empty or broken links on the webpage.

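55. **NoOfExternalRef** 🌐🔗: The number of links to external websites.

56. **label**: Indicates whether the website is a phishing site or not (in the sample rows shown above, label is 1 for what appear to be legitimate sites).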
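
To make the URL-level columns more concrete, here is a minimal sketch that computes a few comparable counts and ratios from a raw URL string. The function name `url_lexical_features` and the dictionary keys are hypothetical, and the definitions are illustrative assumptions, not the dataset authors' extraction code. The dataset's own values may be computed differently: in the sample rows above, for instance, URLLength is one less than Python's `len()` of the URL string.

```python
from urllib.parse import urlparse


def url_lexical_features(url: str) -> dict:
    """Compute a few URL-level counts and ratios (illustrative definitions only)."""
    host = urlparse(url).hostname or ""  # host part, e.g. "www.example.com"
    n = len(url)
    letters = sum(ch.isalpha() for ch in url)
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": n,
        "domain": host,
        "domain_length": len(host),
        "is_https": int(url.lower().startswith("https://")),
        "n_letters": letters,
        "letter_ratio": round(letters / n, 3) if n else 0.0,
        "n_digits": digits,
        "digit_ratio": round(digits / n, 3) if n else 0.0,
        "n_equals": url.count("="),
        "n_qmark": url.count("?"),
        "n_ampersand": url.count("&"),
    }


# URL taken from the sample rows above.
features = url_lexical_features("https://www.uni-mainz.de")
print(features)
```

Returning a plain dictionary keeps the sketch easy to turn into a DataFrame row later, for example with `pd.DataFrame([features])`.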


In [10]:
# Displaying the datatypes of columns 
for i in df.columns:
    print(f"The column is :{i}, Datatype: {df[i].dtype}")

The column is :FILENAME, Datatype: object
The column is :URL, Datatype: object
The column is :URLLength, Datatype: int64
The column is :Domain, Datatype: object
The column is :DomainLength, Datatype: int64
The column is :IsDomainIP, Datatype: int64
The column is :TLD, Datatype: object
The column is :URLSimilarityIndex, Datatype: float64
The column is :CharContinuationRate, Datatype: float64
The column is :TLDLegitimateProb, Datatype: float64
The column is :URLCharProb, Datatype: float64
The column is :TLDLength, Datatype: int64
The column is :NoOfSubDomain, Datatype: int64
The column is :HasObfuscation, Datatype: int64
The column is :NoOfObfuscatedChar, Datatype: int64
The column is :ObfuscationRatio, Datatype: float64
The column is :NoOfLettersInURL, Datatype: int64
The column is :LetterRatioInURL, Datatype: float64
The column is :NoOfDegitsInURL, Datatype: int64
The column is :DegitRatioInURL, Datatype: float64
The column is :NoOfEqualsInURL, Datatype: int64
The column is :NoOfQMarkI
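
With this many columns, a per-column printout gets long. As a rough alternative, the sketch below counts columns per dtype and splits the column names into numeric and non-numeric groups. It runs on a small made-up frame called `toy` (an assumption for illustration), so swap in `df` to apply it to the real data.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "URL": ["https://a.example", "https://b.example"],
    "URLLength": [17, 17],
    "URLCharProb": [0.05, 0.06],
    "label": [1, 0],
})

# Count how many columns share each dtype.
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Split column names by kind for later processing.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```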

In [13]:
# Display whether the column is categorical or numerical
def identify_column_type(column):
    dtype = df[column].dtype
    unique_values = df[column].nunique()

    # Criteria for categorical data encoded as numbers
    if pd.api.types.is_numeric_dtype(dtype):
        if unique_values < 10:  # Example threshold for potential categorical
            # Few distinct values: treat as categorical codes (e.g., 0/1 flags)
            return 'Categorical (Encoded as Numeric)'
        else:
            return 'Numerical'
    # isinstance check replaces is_categorical_dtype, which is deprecated in recent pandas
    elif pd.api.types.is_string_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype):
        return 'Categorical'
    else:
        return 'Other'

# Determine column types
print("Column Name, Data Type, Type Category:")
for column in df.columns:
    column_type = identify_column_type(column)
    print(f"Column Name: {column}, Data Type: {df[column].dtype}, Type Category: {column_type}")

Column Name, Data Type, Type Category:
Column Name: FILENAME, Data Type: object, Type Category: Categorical
Column Name: URL, Data Type: object, Type Category: Categorical
Column Name: URLLength, Data Type: int64, Type Category: Numerical
Column Name: Domain, Data Type: object, Type Category: Categorical
Column Name: DomainLength, Data Type: int64, Type Category: Numerical
Column Name: IsDomainIP, Data Type: int64, Type Category: Categorical (Encoded as Numeric)
Column Name: TLD, Data Type: object, Type Category: Categorical
Column Name: URLSimilarityIndex, Data Type: float64, Type Category: Numerical
Column Name: CharContinuationRate, Data Type: float64, Type Category: Numerical
Column Name: TLDLegitimateProb, Data Type: float64, Type Category: Numerical
Column Name: URLCharProb, Data Type: float64, Type Category: Numerical
Column Name: TLDLength, Data Type: int64, Type Category: Numerical
Column Name: NoOfSubDomain, Data Type: int64, Type Category: Numerical
Column Name: HasObfuscati
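
The same heuristic can also be collected into a table instead of printed line by line, which makes it easier to filter afterwards. This is only a sketch: the helper `summarize_column_types`, its `max_levels` parameter, and the `toy` frame are hypothetical names introduced here, and the threshold of 10 simply mirrors the example threshold used above.

```python
import pandas as pd


def summarize_column_types(frame: pd.DataFrame, max_levels: int = 10) -> pd.DataFrame:
    """Return one row per column with its dtype, unique count, and a rough category."""
    rows = []
    for name in frame.columns:
        s = frame[name]
        n_unique = s.nunique()
        if pd.api.types.is_numeric_dtype(s):
            # Numeric columns with few distinct values are treated as encoded categories.
            kind = "categorical (numeric code)" if n_unique < max_levels else "numerical"
        else:
            kind = "categorical"
        rows.append({"column": name, "dtype": str(s.dtype), "n_unique": n_unique, "kind": kind})
    return pd.DataFrame(rows).set_index("column")


# Made-up data for illustration only.
toy = pd.DataFrame({
    "TLD": ["com", "de", "uk", "com"] * 5,
    "IsHTTPS": [1, 0, 1, 1] * 5,
    "URLLength": list(range(20, 40)),
})
summary = summarize_column_types(toy)
print(summary)
```

Keep in mind that a unique-value threshold is a heuristic: a genuinely numerical column with few distinct values would be flagged as categorical too.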

In [17]:
# Check for Missing Values 
df.isnull().sum()


FILENAME                      0
URL                           0
URLLength                     0
Domain                        0
DomainLength                  0
IsDomainIP                    0
TLD                           0
URLSimilarityIndex            0
CharContinuationRate          0
TLDLegitimateProb             0
URLCharProb                   0
TLDLength                     0
NoOfSubDomain                 0
HasObfuscation                0
NoOfObfuscatedChar            0
ObfuscationRatio              0
NoOfLettersInURL              0
LetterRatioInURL              0
NoOfDegitsInURL               0
DegitRatioInURL               0
NoOfEqualsInURL               0
NoOfQMarkInURL                0
NoOfAmpersandInURL            0
NoOfOtherSpecialCharsInURL    0
SpacialCharRatioInURL         0
IsHTTPS                       0
LineOfCode                    0
LargestLineLength             0
HasTitle                      0
Title                         0
DomainTitleMatchScore         0
URLTitle
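
When the frame has dozens of columns, it can be easier to list only the columns that actually contain missing values, along with a single total. The sketch below does that on a small made-up frame named `toy` (an assumption for illustration); replace it with `df` to check the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Title value, for illustration only.
toy = pd.DataFrame({
    "URL": ["https://a.example", "https://b.example", "https://c.example", "https://d.example"],
    "Title": ["Home", "Login", "About", np.nan],
    "label": [1, 0, 1, 0],
})

missing = toy.isnull().sum()

# Keep only the columns that have at least one missing value.
missing_only = missing[missing > 0].sort_values(ascending=False)
print(missing_only)

# Overall count and per-column percentage of missing values.
total_missing = int(missing.sum())
missing_pct = (toy.isnull().mean() * 100).round(1)
print(total_missing)
print(missing_pct)
```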

In [18]:
# Summary statistics for numerical columns
print(df.describe())

           URLLength   DomainLength     IsDomainIP  URLSimilarityIndex  \
count  235795.000000  235795.000000  235795.000000       235795.000000   
mean       34.573095      21.470396       0.002706           78.430778   
std        41.314153       9.150793       0.051946           28.976055   
min        13.000000       4.000000       0.000000            0.155574   
25%        23.000000      16.000000       0.000000           57.024793   
50%        27.000000      20.000000       0.000000          100.000000   
75%        34.000000      24.000000       0.000000          100.000000   
max      6097.000000     110.000000       1.000000          100.000000   

       CharContinuationRate  TLDLegitimateProb    URLCharProb      TLDLength  \
count         235795.000000      235795.000000  235795.000000  235795.000000   
mean               0.845508           0.260423       0.055747       2.764456   
std                0.216632           0.251628       0.010587       0.599739   
min          
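
The summary above shows a URLLength maximum of 6097 against a 75th percentile of 34, which suggests a long right tail. One possible way to flag such values is the common 1.5 × IQR rule, sketched below. This is an assumption about how one might follow up, not a step from the original analysis, and the `toy` values are made up apart from the 6097 maximum taken from the output.

```python
import pandas as pd

# Made-up URL lengths plus the maximum reported by describe() above.
toy = pd.DataFrame({"URLLength": [23, 27, 27, 31, 34, 26, 29, 6097]})

stats = toy["URLLength"].describe()
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as potential outliers.
upper_fence = q3 + 1.5 * iqr
outliers = toy[toy["URLLength"] > upper_fence]

print(stats)
print(outliers)
```

Whether flagged rows should be dropped, capped, or kept is a modelling decision; unusually long URLs may themselves be informative for phishing detection.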