<a href="https://colab.research.google.com/github/bongjoonsiong/Machine-Learning-Models/blob/main/Phishing_Detection_with_Transform_and_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Phishing Detection Leveraging Transformer and PyTorch

Phishing is a widespread technique for online identity theft and cyber attacks, posing significant risks to individuals and organizations alike. Cybercriminals typically use email, SMS, and social media platforms to execute these attacks, sending messages that contain malicious links designed to deceive recipients.

These messages often impersonate trusted entities such as banks, credit card companies, or well-known businesses like Amazon and eBay. By appearing legitimate, they trick individuals into divulging their personal information. Given the increasing sophistication of these attacks, it is crucial to develop advanced methods for detecting and preventing phishing attempts.

This article explores how machine learning, specifically a Transformer model used with the PyTorch framework, can help identify and mitigate phishing threats.

### Using the "bert-base-uncased" Transformer for Text Classification

In [None]:
# Install the required Transformer library and framework

!pip install -q --upgrade simpletransformers
!pip install -q torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu
#!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
import torch
import torchvision
print(torch.__version__)
print(torchvision.__version__)



In [None]:
# Import libraries

from sklearn.metrics import f1_score
from simpletransformers.classification import ClassificationModel, ClassificationArgs

import pandas as pd
import numpy as np



## EDA on the Dataset

In [None]:
# Import Dataset

!git clone https://github.com/GregaVrbancic/Phishing-Dataset.git


In [None]:
data = pd.read_csv("/content/Phishing-Dataset/dataset_small.csv")
data.head()

In [None]:
data.shape

### Checking NULL data

In [None]:
nulldata = data.isnull().sum()
print(nulldata[nulldata > 0])  # columns that contain at least one missing value
print(f'Total number of null values: {nulldata.sum()}')

In [None]:
x = data.drop(columns = 'phishing')
y = data[['phishing']]
x.shape, y.shape


In [None]:
y.head(20)
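
Looking at the first rows of the label column says little about how the two classes are distributed. A minimal sketch of a class-balance check follows; the `toy` frame is made up for illustration, and in the notebook the same calls would be applied to `data['phishing']`.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({"phishing": [0, 1, 1, 0, 1, 1, 0, 1]})

# Absolute counts and relative frequencies per class.
counts = toy["phishing"].value_counts().sort_index()
ratios = toy["phishing"].value_counts(normalize=True).sort_index()

print(counts.to_dict())
print(ratios.round(3).to_dict())
```

A strongly skewed ratio would suggest reading accuracy alongside precision, recall, and F1.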

## Split Dataset into Train and Test Sets

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 99)
x_train.shape, x_test.shape, y_train.shape, y_test.shape


# Print the unique labels to check which values exist
print(y_train['phishing'].unique())

# Assumption: the 'phishing' column already holds 0 (legitimate) and 1 (phishing),
# so no remapping is needed. Cast to int and verify that only 0 and 1 occur.
y_train = y_train.copy()
y_train['phishing'] = y_train['phishing'].astype(int)
assert set(y_train['phishing'].unique()) <= {0, 1}, "unexpected label values"

# Confirm the labels and list the feature columns
print(y_train['phishing'].unique())
print(x_train.columns)
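
The split above is purely random, so the share of phishing rows can differ slightly between the training and test sets. One option, shown here as a sketch on made-up data rather than as the notebook's method, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both partitions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 75/25 class imbalance; the values are illustrative only.
toy = pd.DataFrame({"feature": range(40), "phishing": [0] * 30 + [1] * 10})

# stratify keeps the 75/25 ratio in both the training and the test partition.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=99, stratify=toy["phishing"]
)

print(len(toy_train), len(toy_test))
print(toy_train["phishing"].mean(), toy_test["phishing"].mean())
```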

In [None]:
# Create an instance of ClassificationArgs to hold the training configuration.
model_args = ClassificationArgs()

# Train only the parameters listed in custom_parameter_groups (here the
# classifier head); the pre-trained BERT weights are left unchanged.
model_args.train_custom_parameters_only = True

# Define custom parameter groups, each with its own learning rate and weight decay.
custom_parameter_groups = [
    {
        "params": ["classifier.weight"],  # Parameter name(s) in this group.
        "lr": 1e-3,  # Learning rate for this group.
    },
    {
        "params": ["classifier.bias"],  # Separate group for the classifier bias.
        "lr": 1e-3,  # Learning rate for this group.
        "weight_decay": 0.0,  # No weight decay on the bias.
    },
]

# Assign a copy of the list to model_args.
model_args.custom_parameter_groups = custom_parameter_groups.copy()

# Disable multiprocessing during feature conversion.
model_args.use_multiprocessing = False

# Create a ClassificationModel with the "bert" architecture and the
# "bert-base-uncased" pre-trained weights. num_labels is passed to the
# constructor (two classes: phishing and not phishing). The model runs on
# the CPU here; set use_cuda=True to train on a GPU when one is available.
model = ClassificationModel(
    "bert", "bert-base-uncased", num_labels=2, args=model_args, use_cuda=False
)
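
To make the idea behind `train_custom_parameters_only` more concrete, here is a plain-PyTorch sketch of training only a classifier head with per-group optimizer settings. It is not how simpletransformers implements the option internally; `TinyClassifier` is a hypothetical stand-in for a pre-trained encoder plus head, and the layer sizes are arbitrary.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Toy stand-in for a pre-trained encoder plus a classification head."""

    def __init__(self) -> None:
        super().__init__()
        self.encoder = nn.Linear(8, 4)     # plays the role of the pre-trained body
        self.classifier = nn.Linear(4, 2)  # plays the role of the new head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.relu(self.encoder(x)))


toy_model = TinyClassifier()

# Freeze everything except the classifier head.
for name, param in toy_model.named_parameters():
    param.requires_grad = name.startswith("classifier.")

# One optimizer group per parameter, each with its own settings.
optimizer = torch.optim.AdamW(
    [
        {"params": [toy_model.classifier.weight], "lr": 1e-3},
        {"params": [toy_model.classifier.bias], "lr": 1e-3, "weight_decay": 0.0},
    ]
)

trainable = [n for n, p in toy_model.named_parameters() if p.requires_grad]
print(trainable)
```

Training only the head is cheap, but it also limits how far the model can adapt to the task, which is worth keeping in mind when reading the results.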

In [None]:
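# Train the model. train_model expects a single DataFrame with "text" and "labels"
# columns, so each feature row is joined into one string (an assumed encoding).
train_df = pd.DataFrame({"text": x_train.astype(str).apply(" ".join, axis=1), "labels": y_train["phishing"].astype(int)})
model.train_model(train_df)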
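
`train_model` in simpletransformers takes a single DataFrame with a `text` column and a `labels` column, while the features in this dataset are numeric. One possible bridge, sketched below on made-up values, is to join each row into a space-separated string. The helper name `rows_to_text` and the toy column names are hypothetical, and whether this encoding suits BERT is an assumption rather than something the notebook establishes.

```python
import pandas as pd


def rows_to_text(features: pd.DataFrame) -> pd.Series:
    """Join each row's values into one space-separated string (hypothetical helper)."""
    return features.astype(str).apply(" ".join, axis=1)


# Toy features and labels; names and values are illustrative only.
toy_x = pd.DataFrame({"feature_a": [2, 5], "feature_b": [18, 97]}, index=[10, 11])
toy_y = pd.Series([0, 1], index=[10, 11], name="phishing")

# Build the two-column frame; rows are aligned on the shared index.
toy_train_df = pd.DataFrame({"text": rows_to_text(toy_x), "labels": toy_y.astype(int)})
print(toy_train_df)
```

With many feature columns the resulting string can exceed the model's maximum sequence length, in which case the trailing values would be truncated.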


In [None]:
# Evaluate the model on the test set. eval_model expects a DataFrame with
# "text" and "labels" columns and returns three values:
# 1. 'result' contains the evaluation metrics, including the extra f1 metric passed below.
# 2. 'model_outputs' contains the model's raw outputs for each test example.
# 3. 'wrong_predictions' contains the examples that the model classified incorrectly.

test_df = pd.DataFrame({"text": x_test.astype(str).apply(" ".join, axis=1), "labels": y_test["phishing"].astype(int)})
result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_score)
y_pred = np.argmax(model_outputs, axis=1)


In [None]:
# Print the classification report

from sklearn.metrics import classification_report
print(classification_report(y_test["phishing"], y_pred))
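
The step from raw model outputs to metrics can be checked by hand on a small example. The sketch below uses made-up scores and labels (`toy_outputs` and `toy_labels` are illustrative, not results from this model) to show how `np.argmax` turns per-class scores into predictions and how accuracy and F1 follow from them.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy raw outputs (one row per example, one column per class) and true labels.
toy_outputs = np.array([[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.8]])
toy_labels = np.array([0, 1, 0, 0])

# The predicted class is the column with the largest score.
toy_pred = np.argmax(toy_outputs, axis=1)

toy_accuracy = accuracy_score(toy_labels, toy_pred)
toy_f1 = f1_score(toy_labels, toy_pred)

print(toy_pred.tolist())
print(toy_accuracy, toy_f1)
```

Here three of four predictions are correct (accuracy 0.75), while the single false positive pulls the F1 score for the phishing class down to about 0.67.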

## Conclusion

Integrating machine learning into cybersecurity measures is increasingly essential. This project, "Phishing Detection Leveraging Transformer and PyTorch," shows how a pre-trained Transformer model can be used within the PyTorch framework to flag phishing attempts, reaching an accuracy of 90.1%.

Models of this kind can be retrained as phishing techniques evolve, and they can support real-time detection and the analysis of large datasets to help anticipate and prevent future attacks. As these technologies are refined, combining traditional algorithms with newer models such as Transformers remains a promising direction for strengthening cybersecurity.






