
# **Importing Required Libraries**

The imported libraries cover data analysis, visualization, and machine learning. `pandas` handles data manipulation and analysis through structures such as DataFrames. `numpy` supports numerical computation, particularly on arrays. `matplotlib.pyplot` provides plotting. From scikit-learn, `sklearn.model_selection` supplies `train_test_split` for splitting a dataset into training and testing sets, `sklearn.preprocessing` supplies `StandardScaler` for scaling features, and `sklearn.metrics` supplies `classification_report` and `confusion_matrix` for evaluating model performance. Two classifiers are imported: `KNeighborsClassifier` from `sklearn.neighbors`, a distance-based method, and `RandomForestClassifier` from `sklearn.ensemble`, an ensemble of decision trees.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# **Loading Dataset and Data Cleaning**

In this part of the code, the dataset is loaded with `pandas` by reading a CSV file from the given file path into a DataFrame called `data`. The DataFrame is printed to give an overview of its contents, and `data.isnull().sum()` reports the number of missing values per column so that any gaps can be addressed before further analysis. Data types and statistical summaries are not shown in this cell; `data.info()` and `data.describe()` can be added to display them.
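As a minimal sketch of these inspection steps, the snippet below runs them on a small made-up DataFrame. The column names are borrowed from the dataset, but the values, the name `toy`, and the `dropna()` clean-up step are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are invented for illustration
toy = pd.DataFrame({
    "qty_dot_url": [2, 4, 1, np.nan],
    "length_url": [25, 40, np.nan, 18],
    "phishing": [0, 1, 0, 1],
})

toy.info()                    # dtypes and non-null counts per column
missing = toy.isnull().sum()  # number of missing values per column
print(missing)

# One simple way to handle gaps: drop every row that contains a missing value
cleaned = toy.dropna()
print(len(toy), "rows before,", len(cleaned), "rows after dropna()")
```

Dropping rows is only one option; imputing a value per column is another, and which is appropriate depends on how many rows are affected.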


In [None]:
# Load the dataset
file_path = '/content/dataset_small.csv'
data = pd.read_csv(file_path)
print(data)

# Check for missing values
print(data.isnull().sum())


       qty_dot_url  qty_hyphen_url  qty_underline_url  qty_slash_url  \
0                2               0                  0              0   
1                4               0                  0              2   
2                1               0                  0              1   
3                2               0                  0              3   
4                1               1                  0              4   
...            ...             ...                ...            ...   
58640            1               0                  0              5   
58641            2               0                  0              0   
58642            5               6                  3              6   
58643            2               0                  0              0   
58644            2               0                  0              3   

       qty_questionmark_url  qty_equal_url  qty_at_url  qty_and_url  \
0                         0              0           0          

# **URL Feature Extraction**

In this part of the code, the URL-based columns of the dataset are selected. The `url_features` list names counts of URL characters such as dots, hyphens, underscores, slashes, question marks, equal signs, and other special characters. It also includes the URL length, whether the URL appears in Google's index, and whether the URL is shortened. The statistical summary of these features is displayed using `data[url_features].describe()`, which shows their distribution and central tendencies within the dataset.
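The dataset ships these counts already computed. As a rough illustration of how such values might be derived from a raw URL string, the sketch below counts a handful of characters. The helper `count_url_chars` and the mapping `URL_CHARS` are hypothetical, and the dataset's exact definitions (for example, whether the scheme is counted) are not known here.

```python
# Hypothetical mapping from a few dataset column names to the character each one counts
URL_CHARS = {
    "qty_dot_url": ".",
    "qty_hyphen_url": "-",
    "qty_underline_url": "_",
    "qty_slash_url": "/",
    "qty_questionmark_url": "?",
    "qty_equal_url": "=",
    "qty_at_url": "@",
    "qty_and_url": "&",
}


def count_url_chars(url: str) -> dict:
    """Count selected characters in a URL string and record its length."""
    features = {name: url.count(char) for name, char in URL_CHARS.items()}
    features["length_url"] = len(url)
    return features


example = count_url_chars("example.com/login?user=a&id=1")
print(example)
```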

In [None]:
url_features = ['qty_dot_url', 'qty_hyphen_url', 'qty_underline_url', 'qty_slash_url',
                'qty_questionmark_url', 'qty_equal_url', 'qty_at_url', 'qty_and_url',
                'qty_exclamation_url', 'qty_space_url', 'qty_tilde_url', 'qty_comma_url',
                'qty_plus_url', 'qty_asterisk_url', 'qty_hashtag_url', 'qty_dollar_url',
                'qty_percent_url', 'qty_tld_url', 'length_url', 'url_google_index', 'url_shortened']

# Analyze the URL-based features
print(data[url_features].describe())


        qty_dot_url  qty_hyphen_url  qty_underline_url  qty_slash_url  \
count  58645.000000    58645.000000       58645.000000   58645.000000   
mean       2.284338        0.457123           0.171285       1.937522   
std        1.473209        1.339340           0.801919       2.037525   
min        1.000000        0.000000           0.000000       0.000000   
25%        2.000000        0.000000           0.000000       0.000000   
50%        2.000000        0.000000           0.000000       1.000000   
75%        3.000000        0.000000           0.000000       3.000000   
max       24.000000       35.000000          21.000000      44.000000   

       qty_questionmark_url  qty_equal_url    qty_at_url   qty_and_url  \
count          58645.000000   58645.000000  58645.000000  58645.000000   
mean               0.014102       0.311177      0.033456      0.212959   
std                0.138156       1.159198      0.343272      1.130323   
min                0.000000       0.000000    

# **Domain Feature Extraction**

In this part of the code, the domain-based columns of the dataset are selected and analyzed. The `domain_features` list names counts of characters within the domain, such as dots, hyphens, underscores, slashes, and other special characters, as well as the number of vowels in the domain. Additional features include domain length, whether the domain is an IP address, server-client domain information, SPF records, domain activation and expiration times, and whether the domain is indexed by Google. The statistical summary of these features is displayed using `data[domain_features].describe()`.
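For intuition about what the simpler domain columns capture, the following sketch derives a few of them from a URL with the standard library. The function `domain_features_sketch` is hypothetical and the definitions are assumptions; features such as SPF records or activation times need external lookups and are not covered.

```python
import ipaddress
from urllib.parse import urlsplit


def domain_features_sketch(url: str) -> dict:
    """Derive a few domain-level values from a URL (illustrative only)."""
    # urlsplit only fills in the host when the URL has a scheme or starts with //
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for anything that is not an IP
        is_ip = 1
    except ValueError:
        is_ip = 0
    return {
        "qty_dot_domain": host.count("."),
        "qty_hyphen_domain": host.count("-"),
        "qty_vowels_domain": sum(ch in "aeiou" for ch in host.lower()),
        "domain_length": len(host),
        "domain_in_ip": is_ip,
    }


named = domain_features_sketch("http://my-shop.example.com/path")
numeric = domain_features_sketch("192.168.0.1/login")
print(named)
print(numeric)
```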

In [None]:
domain_features = ['qty_dot_domain', 'qty_hyphen_domain', 'qty_underline_domain', 'qty_slash_domain',
                   'qty_questionmark_domain', 'qty_equal_domain', 'qty_at_domain', 'qty_and_domain',
                   'qty_exclamation_domain', 'qty_space_domain', 'qty_tilde_domain', 'qty_comma_domain',
                   'qty_plus_domain', 'qty_asterisk_domain', 'qty_hashtag_domain', 'qty_dollar_domain',
                   'qty_percent_domain', 'qty_vowels_domain', 'domain_length', 'domain_in_ip',
                   'server_client_domain', 'domain_spf', 'time_domain_activation', 'time_domain_expiration',
                   'domain_google_index']
# Analyze the Domain-based features
print(data[domain_features].describe())


       qty_dot_domain  qty_hyphen_domain  qty_underline_domain  \
count    58645.000000       58645.000000          58645.000000   
mean         1.799540           0.133294              0.000290   
std          0.790989           0.465673              0.019802   
min          0.000000           0.000000              0.000000   
25%          1.000000           0.000000              0.000000   
50%          2.000000           0.000000              0.000000   
75%          2.000000           0.000000              0.000000   
max         21.000000          11.000000              2.000000   

       qty_slash_domain  qty_questionmark_domain  qty_equal_domain  \
count           58645.0                  58645.0           58645.0   
mean                0.0                      0.0               0.0   
std                 0.0                      0.0               0.0   
min                 0.0                      0.0               0.0   
25%                 0.0                      0.0       

# **Page Feature Extraction**

In this part of the code, page-based features are selected from the dataset for analysis. The `page_features` list holds the `_params` columns: counts of dots, hyphens, underscores, slashes, and other special characters within the URL parameters, plus the length of the parameters (`params_length`), whether a top-level domain (TLD) appears in them (`tld_present_params`), and the total number of parameters (`qty_params`). Printing `data[page_features].describe()` shows the distribution and key statistics of these features within the dataset.
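The `describe()` output for these columns has a minimum of -1, which suggests that -1 marks URLs without parameters. Under that assumption, a parameter-level feature extractor might look like the sketch below. The function `params_features_sketch` is hypothetical and the sentinel convention is inferred, not confirmed by the notebook.

```python
from urllib.parse import parse_qsl, urlsplit


def params_features_sketch(url: str) -> dict:
    """Derive a few query-string values from a URL (illustrative only)."""
    query = urlsplit(url).query
    if not query:
        # Assumption: -1 stands for "this URL has no parameters"
        return {
            "qty_params": -1,
            "params_length": -1,
            "qty_equal_params": -1,
            "qty_and_params": -1,
        }
    return {
        "qty_params": len(parse_qsl(query, keep_blank_values=True)),
        "params_length": len(query),
        "qty_equal_params": query.count("="),
        "qty_and_params": query.count("&"),
    }


with_params = params_features_sketch("http://example.com/a.php?user=a&id=1")
without_params = params_features_sketch("http://example.com/")
print(with_params)
print(without_params)
```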

In [None]:
page_features = ['qty_dot_params', 'qty_hyphen_params', 'qty_underline_params', 'qty_slash_params',
                 'qty_questionmark_params', 'qty_equal_params', 'qty_at_params', 'qty_and_params',
                 'qty_exclamation_params', 'qty_space_params', 'qty_tilde_params', 'qty_comma_params',
                 'qty_plus_params', 'qty_asterisk_params', 'qty_hashtag_params', 'qty_dollar_params',
                 'qty_percent_params', 'params_length', 'tld_present_params', 'qty_params']
# Analyze the Page-based features
print(data[page_features].describe())

       qty_dot_params  qty_hyphen_params  qty_underline_params  \
count    58645.000000       58645.000000          58645.000000   
mean        -0.714451          -0.816506             -0.791781   
std          1.193137           0.771199              0.797698   
min         -1.000000          -1.000000             -1.000000   
25%         -1.000000          -1.000000             -1.000000   
50%         -1.000000          -1.000000             -1.000000   
75%         -1.000000          -1.000000             -1.000000   
max         23.000000          35.000000             21.000000   

       qty_slash_params  qty_questionmark_params  qty_equal_params  \
count      58645.000000             58645.000000      58645.000000   
mean          -0.830898                -0.860670         -0.587723   
std            0.663899                 0.386856          1.345035   
min           -1.000000                -1.000000         -1.000000   
25%           -1.000000                -1.000000       

# **Content Feature Extraction**

In this part of the code, the columns that the notebook calls content-based features are selected for analysis. The `content_features` list holds the `_file` columns: counts of dots, hyphens, underscores, slashes, and other special characters, along with `file_length`. Despite the list's name, the column suffix suggests these describe the file portion of the URL rather than page content. Printing `data[content_features].describe()` shows the distribution and key statistics of these features, which helps in judging how they might influence later analysis or model building.
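Because the summary statistics for these columns also bottom out at -1, it can help to measure how often that value occurs before interpreting means and quartiles. The sketch below does this on made-up values, assuming -1 is a "not applicable" sentinel; the name `toy` and the masking step are illustrative additions.

```python
import pandas as pd

# Made-up values; -1 is assumed to mean "no file part in the URL"
toy = pd.DataFrame({
    "qty_dot_file": [-1, 0, 1, -1],
    "file_length": [-1, 0, 9, -1],
})

# Share of rows per column that carry the assumed sentinel
sentinel_share = (toy == -1).mean()
print(sentinel_share)

# Summary statistics with the sentinel masked out as missing
masked = toy.where(toy != -1)
print(masked.describe())
```

Masking the sentinel keeps it from pulling the mean and lower quartiles toward -1 in the summary.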

In [None]:
content_features = ['qty_dot_file', 'qty_hyphen_file', 'qty_underline_file', 'qty_slash_file',
                    'qty_questionmark_file', 'qty_equal_file', 'qty_at_file', 'qty_and_file',
                    'qty_exclamation_file', 'qty_space_file', 'qty_tilde_file', 'qty_comma_file',
                    'qty_plus_file', 'qty_asterisk_file', 'qty_hashtag_file', 'qty_dollar_file',
                    'qty_percent_file', 'file_length']
# Analyze the Content-based features
print(data[content_features].describe())

       qty_dot_file  qty_hyphen_file  qty_underline_file  qty_slash_file  \
count  58645.000000     58645.000000        58645.000000    58645.000000   
mean      -0.045750        -0.211084           -0.260466       -0.298525   
std        0.762056         0.870709            0.606537        0.457615   
min       -1.000000        -1.000000           -1.000000       -1.000000   
25%       -1.000000        -1.000000           -1.000000       -1.000000   
50%        0.000000         0.000000            0.000000        0.000000   
75%        0.000000         0.000000            0.000000        0.000000   
max       12.000000        21.000000           17.000000        0.000000   

       qty_questionmark_file  qty_equal_file   qty_at_file  qty_and_file  \
count           58645.000000    58645.000000  58645.000000  58645.000000   
mean               -0.298525       -0.296035     -0.298082     -0.296377   
std                 0.457615        0.463333      0.458425      0.461529   
min        

# **Feature Selection**

In this step, the code combines all the previously defined feature sets into a single list called `all_features`, which includes URL-based, domain-based, page-based, and content-based features. This comprehensive list can be used for further analysis or modeling. Additionally, there are commented-out lines that allow selecting only specific feature sets (URL, domain, page, or content) if needed, enabling flexibility in choosing the features according to specific requirements or analysis goals.
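When lists are concatenated this way, a quick sanity check can confirm that no feature name appears twice and that every name exists as a column. The sketch below uses shortened stand-in lists and a toy DataFrame named `toy_data`; both are assumptions for illustration.

```python
import pandas as pd

# Shortened stand-ins for the four lists defined in the notebook
url_features = ["qty_dot_url", "length_url"]
domain_features = ["qty_dot_domain", "domain_length"]
page_features = ["qty_dot_params", "qty_params"]
content_features = ["qty_dot_file", "file_length"]

all_features = url_features + domain_features + page_features + content_features

# Toy frame with matching columns plus the target column
toy_data = pd.DataFrame(0, index=range(3), columns=all_features + ["phishing"])

duplicates = {name for name in all_features if all_features.count(name) > 1}
missing = [name for name in all_features if name not in toy_data.columns]
print(len(all_features), duplicates, missing)
```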

In [None]:
# Combine all feature sets
all_features = url_features + domain_features + page_features + content_features
# all_features = url_features
# all_features = domain_features
# all_features = page_features
# all_features = content_features

# **Splitting the Train-Test Data and Data Standardization**

In this step, the dataset is split into training and testing sets using the selected features (`all_features`) and the target variable (`phishing`). The `train_test_split` function from `sklearn.model_selection` allocates 70% of the data for training and 30% for testing, with a fixed random state for reproducibility. After splitting, the features are standardized using `StandardScaler` from `sklearn.preprocessing`. The scaler is fitted on the training set only, so the training features end up with a mean of 0 and a standard deviation of 1, and the test set is transformed with those same training statistics to avoid leaking test information into preprocessing. Standardization matters most for distance-based models such as K-Nearest Neighbors; tree-based models such as Random Forest are largely insensitive to feature scale.
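The effect of fitting the scaler on the training rows only can be checked on synthetic data, as in the sketch below. The random data, the variable names ending in `_s`, and the `stratify=y` argument are assumptions added for illustration; the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))  # synthetic features
y = rng.integers(0, 2, size=1000)                   # synthetic binary target

# stratify=y is an optional addition that keeps class ratios similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned from training rows only
X_test_s = scaler.transform(X_test)        # the same statistics reused on the test rows

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
print(X_test_s.mean(axis=0).round(3))      # close to 0, but generally not exactly 0
```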

In [None]:
X = data[all_features]
y = data['phishing']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the feature set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# **Preparing Model**

In this segment of the code, a dictionary named `models` is initialized to hold the machine learning models to compare. Two models are active: `RandomForestClassifier` from `sklearn.ensemble` and `KNeighborsClassifier` from `sklearn.neighbors`, both with default settings. Other classifiers, such as `LogisticRegression` or `SVC` (Support Vector Machine), are potential alternatives that can be added to the dictionary and evaluated in the same way. This setup allows for easy experimentation and comparison of different models for the task at hand.
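As a sketch of how the dictionary could be extended with other scikit-learn classifiers, the snippet below adds `LogisticRegression` and `SVC`. The hyperparameters shown (`random_state=42`, `max_iter=1000`) are assumptions chosen for reproducibility and convergence, not settings taken from the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Each entry maps a display name to an unfitted estimator
models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
}

for name, model in models.items():
    print(name, "->", type(model).__name__)
```

Because every estimator exposes `fit` and `predict`, a single training loop can handle all entries without special cases.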

In [None]:
# Initialize models
models = {
    "Random Forest": RandomForestClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier()
}

# **Testing and Evaluating the Model**

In this segment of the code, the selected machine learning model(s) from the `models` dictionary are trained and evaluated. For each model, the `fit` method is used to train the model on the training data (`X_train` and `y_train`). Predictions are then made on the test data (`X_test`) using the `predict` method. The performance of each model is evaluated by printing a classification report, which includes metrics such as precision, recall, and F1-score, and a confusion matrix, which provides a detailed breakdown of the model's performance on the test data. This process allows for a comprehensive assessment of the model's accuracy and effectiveness in classifying the target variable (`phishing`).

In [None]:
# Train and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"{name} Model")
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix: ')
    print(confusion_matrix(y_test, y_pred))

Random Forest Model
              precision    recall  f1-score   support

           0       0.94      0.93      0.94      8428
           1       0.94      0.95      0.94      9166

    accuracy                           0.94     17594
   macro avg       0.94      0.94      0.94     17594
weighted avg       0.94      0.94      0.94     17594

Confusion Matrix: 
[[7840  588]
 [ 496 8670]]
K-Nearest Neighbors Model
              precision    recall  f1-score   support

           0       0.91      0.92      0.91      8428
           1       0.93      0.91      0.92      9166

    accuracy                           0.92     17594
   macro avg       0.92      0.92      0.92     17594
weighted avg       0.92      0.92      0.92     17594

Confusion Matrix: 
[[7776  652]
 [ 816 8350]]
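
To connect the two outputs, the class-1 metrics in the report can be recomputed by hand from the confusion matrix. The sketch below does so for the Random Forest figures printed above, assuming label 1 denotes phishing and using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Random Forest confusion matrix from the output above
tn, fp = 7840, 588   # true class 0: predicted 0, predicted 1
fn, tp = 496, 8670   # true class 1: predicted 0, predicted 1

precision = tp / (tp + fp)  # of the URLs flagged as class 1, the share that really are
recall = tp / (tp + fn)     # of the actual class-1 URLs, the share that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.94 recall=0.95 f1=0.94 accuracy=0.94
```

These values match the class-1 row and the accuracy line of the Random Forest report; substituting the K-Nearest Neighbors matrix reproduces its report in the same way.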
