# 🔐 Machine Learning for Malicious URL / QR Detection

---

## Project Overview

URLs are shared constantly on social communication platforms such as WhatsApp, Instagram, and Messenger, where links and QR codes give users quick access to information. The same channels can be misused to spread **malicious content** through deceptive URLs or QR codes, putting user privacy and data security at risk.

Attackers carry out **social media phishing** by using deceptive URLs and QR codes to trick users into clicking or scanning, which redirects them to harmful websites that can deliver malware. This can lead to **malware infections, data breaches, and financial losses**.

This project aims to build a machine learning model that integrates detailed feature extraction (domain_age, tinyURL, web_traffic, DNS record) to improve overall static detection of malicious URLs.

---

## Problem Objectives

1. ✅ To identify URL patterns (lexical, domain-based) indicative of malicious sites
2. ✅ To evaluate and compare the performance of different ML models for malicious URL classification
3. ✅ To develop an ML-based model capable of reliably detecting phishing attempts based on URL patterns

---

## Dataset: LegitPhish

| Metric | Value |
|--------|-------|
| **Total URLs** | 101,219 |
| **Phishing URLs** | 62.9% |
| **Legitimate URLs** | 37.1% |
| **Initial Features** | 18 |

**Data Source:** URLHaus database and other well-known repositories of malicious websites, as well as legitimate URLs collected from reputable sources like Wikipedia and Stack Overflow.

---

## Methodology

### Data Pre-processing
- Data Collection
- Data Cleaning
- Detailed Feature Extraction
- Data Splitting (80/20)

### Machine Learning (Binary Classification)
- Logistic Regression
- Random Forest
- XGBoost
- LightGBM

### Evaluation Metrics
- F1-Score
- Precision
- AUC-ROC
- Recall
- Confusion Matrix
- Accuracy

---

## References

1. Potpelwar, R. S., Kulkarni, U. V., & Waghmare, J. M. (2025). LegitPhish: A large-scale annotated dataset for URL-based phishing detection. Data in Brief, 63, 111972. https://doi.org/10.1016/j.dib.2025.111972
2. Xuan, C. D., Dinh, H., & Victor, T. (2020). Malicious URL Detection based on Machine Learning. International Journal of Advanced Computer Science and Applications, 11(1). https://doi.org/10.14569/ijacsa.2020.0110119
3. Nandu, A., Sosa, J., Pant, Y., Panchal, Y., & Sayyad, S. (2024). Malicious URL Detection Using Machine Learning. 1–6. https://doi.org/10.1109/asiancon62057.2024.10837752

---

## Authors

| Name | Student ID |
|------|------------|
| YIM WJUN JUN | 24201054 |
| RICHIE TEOH | 24088171 |
| ELMER LEE JIA ZHAO | 24082366 |
| ANGELINE TAN JIE LIN | 24084444 |
| NICOLE CHUNG SYN TUNG | 24073625 |

---

# 1. Data Collection

This section loads the LegitPhish dataset containing URLs with pre-extracted features for phishing detection.
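To make the pre-extracted features more concrete, here is a minimal sketch of how a few of them could be computed from a raw URL string. The helper names `shannon_entropy` and `lexical_features` are hypothetical, and the exact definitions used to build the dataset are not given in this notebook, so these formulas are assumptions for illustration only.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Compute a few URL-string features (illustrative definitions only)."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "dot_count": url.count("."),
        "https_flag": int(parsed.scheme == "https"),
        "number_of_digits": sum(ch.isdigit() for ch in url),
        "has_hyphen_in_domain": int("-" in (parsed.hostname or "")),
        "url_entropy": shannon_entropy(url),
    }


# Hypothetical URL used purely as an example input.
features = lexical_features("https://secure-login.example.com/a1?b=2")
print(features)
```

Everything here works on the URL string alone, so it needs no network access; features such as DNS records or redirects require a live lookup instead.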

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import re
import time
import urllib
import urllib.request
from datetime import datetime
import requests
import socket
from urllib.parse import urlparse
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

In [None]:
# Load the initial dataset
github_dataset = 'https://raw.githubusercontent.com/Elmer2408/WQD7006_Machine-learning-of-Malicious-URLs-Detection/main/url_features_extracted1.csv'
df = pd.read_csv(github_dataset)
df.columns = df.columns.str.lower()

print("="*60)
print("INITIAL DATASET LOADED")
print("="*60)
print(f"Total Samples: {df.shape[0]:,}")
print(f"Total Features: {df.shape[1]}")
df.head()

In [None]:
# Feature descriptions
descriptions = {
    "url": "The full URL string (original data).",
    "url_length": "Total number of characters in the URL.",
    "has_ip_address": "Binary flag (1/0): whether the URL contains an IP address.",
    "dot_count": "Number of . characters in the URL.",
    "https_flag": "Binary flag (1/0): whether the URL uses HTTPS.",
    "url_entropy": "Shannon entropy of the URL string; higher values indicate more randomness.",
    "token_count": "Number of tokens/words in the URL.",
    "subdomain_count": "Number of subdomains in the URL.",
    "query_param_count": "Number of query parameters (after ?).",
    "tld_length": "Length of the Top-Level Domain (e.g., 'com' = 3).",
    "path_length": "Length of the path part after the domain.",
    "has_hyphen_in_domain": "Binary flag (1/0): whether the domain contains a hyphen (-).",
    "number_of_digits": "Total number of numeric characters in the URL.",
    "tld_popularity": "Binary flag (1/0): whether the TLD is popular.",
    "suspicious_file_extension": "Binary flag (1/0): indicates if the URL ends with suspicious extensions (e.g., .exe, .zip).",
    "domain_name_length": "Length of the domain name.",
    "percentage_numeric_chars": "Percentage of numeric characters in the URL.",
    "classlabel": "Target label: 1 = Legitimate, 0 = Phishing.",
}

# Create metadata table
metadata = pd.DataFrame({
    "Column Name": df.columns,
    "Data Type": df.dtypes.values,
    "Non-Null Count": df.notnull().sum().values,
    "Null Count": df.isnull().sum().values,
    "Unique Values": df.nunique().values,
    "Sample Values": [df[col].dropna().unique()[:3] for col in df.columns],
    # .get avoids a KeyError if the CSV contains a column not described above
    "Description": [descriptions.get(col, "No description available") for col in df.columns]
})

pd.set_option('display.max_colwidth', None)
print("="*60)
print("DATASET METADATA")
print("="*60)
metadata

---

# 2. Data Cleaning

This section checks for duplicate rows and missing values, removes rows with missing values, and converts the target label to an integer type to prepare the dataset for analysis.
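The cleaning steps can be tried on a small toy frame first. This is a sketch with made-up values, not the real dataset; the `drop_duplicates` call is an optional extra step shown for illustration and is not part of the cleaning cell in this notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "url": ["http://a.test", "http://a.test", "http://b.test", "http://c.test"],
    "url_length": [13, 13, 13, None],
    "classlabel": [1.0, 1.0, 0.0, 1.0],
})

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row

cleaned = (
    toy.dropna(how="any")                    # drop rows with any missing value
       .drop_duplicates()                    # optional: drop repeated rows
       .astype({"classlabel": int})          # float label -> integer label
       .reset_index(drop=True)
)

print(n_duplicates, len(cleaned), cleaned["classlabel"].dtype)
```

Note that `duplicated()` without arguments compares whole rows; passing `subset=["url"]` would instead flag repeated URLs regardless of their feature values.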

In [None]:
# Data Quality Check
print("="*60)
print("DATA QUALITY CHECK")
print("="*60)

# Check for duplicates (duplicated() compares entire rows, not only the URL column)
duplicates = df.duplicated().sum()
print(f"Duplicate rows: {duplicates}")

# Check for missing values
missing_values = df.isnull().sum()
print(f"\nMissing Values per Column:")
print(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found")
print(f"\nTotal Missing in Target (classlabel): {df['classlabel'].isnull().sum()}")

In [None]:
# Clean the dataset
df_cleaned = df.copy()
df_cleaned = df_cleaned.dropna(how='any')  # Remove rows with null values
df_cleaned['classlabel'] = df_cleaned['classlabel'].astype(int)  # Convert float64 to int64
df_cleaned.columns = df_cleaned.columns.str.lower()

print("="*60)
print("AFTER DATA CLEANING")
print("="*60)
print(f"Original Dataset Size: {df.shape[0]:,}")
print(f"Cleaned Dataset Size: {df_cleaned.shape[0]:,}")
print(f"Rows Removed: {df.shape[0] - df_cleaned.shape[0]:,}")

# Updated metadata after cleaning
metadata_cleaned = pd.DataFrame({
    "Column Name": df_cleaned.columns,
    "Data Type": df_cleaned.dtypes.values,
    "Non-Null Count": df_cleaned.notnull().sum().values,
    "Null Count": df_cleaned.isnull().sum().values,
    "Unique Values": df_cleaned.nunique().values,
    "Sample Values": [df_cleaned[col].dropna().unique()[:3] for col in df_cleaned.columns],
    # .get avoids a KeyError if the CSV contains a column not described above
    "Description": [descriptions.get(col, "No description available") for col in df_cleaned.columns]
})

metadata_cleaned

---

# 3. Detailed Feature Extraction

This section defines additional feature extraction functions for HTML, JavaScript, and domain-based features. These functions can be used for real-time URL analysis.

**Note:** For this experiment, we use a pre-extracted dataset to avoid API rate limits and long processing times.
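The HTML-based checks in this section operate on page text, so a caller would typically pass something like the body of an HTTP response as a string. The sketch below applies the same kinds of patterns to a literal HTML snippet, with no network access. The helper name `html_flags` and the sample markup are hypothetical, and it returns booleans rather than the 0/1 encoding used by the notebook functions.

```python
import re


def html_flags(html: str) -> dict:
    """Return simple presence flags for three HTML/JavaScript patterns."""
    return {
        "has_iframe": bool(re.search(r"<iframe\b", html, flags=re.IGNORECASE)),
        "has_onmouseover": bool(re.search(r"onmouseover", html, flags=re.IGNORECASE)),
        # The dot is escaped so it only matches a literal "event.button"
        "blocks_right_click": bool(re.search(r"event\.button\s*==\s*2", html)),
    }


# Made-up page content for illustration.
sample_html = """
<html><body>
  <iframe src="https://example.com/frame"></iframe>
  <a href="#" onmouseover="window.status='ok'">link</a>
  <script>document.onmousedown = function (event) { if (event.button == 2) { return false; } };</script>
</body></html>
"""

print(html_flags(sample_html))
print(html_flags("<p>plain page</p>"))
```

Pattern matching like this is only a heuristic: legitimate pages also use iframes and mouse events, so these flags are best treated as weak signals that a model combines with other features.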

In [None]:
# Feature Extraction Functions

def DNSRecord(url):
    """
    Check if the domain has a valid DNS record.
    Returns: 1 = Legitimate (DNS exists), 0 = Phishing (No DNS)
    """
    try:
        # .hostname strips any port or user:password@ prefix from the netloc
        hostname = urlparse(url).hostname
        if not hostname:
            return 0
        socket.gethostbyname(hostname)
        return 1
    except Exception:
        # Covers socket.gaierror (no DNS record) and malformed URLs
        return 0


def iframe(response):
    """
    Check if the webpage contains iframe tags (potential phishing indicator).
    Returns: 1 = Contains iframe, 0 = No iframe
    """
    if response == "":
        return 0
    else:
        if re.findall(r"<iframe", response):
            return 1
        else:
            return 0


def mouseOver(response):
    """
    Check if the webpage uses onmouseover events (potential phishing technique).
    Returns: 0 = Contains onmouseover (suspicious), 1 = No onmouseover
    """
    if response == "":
        return 0
    else:
        if re.findall(r"onmouseover", response):
            return 0
        else:
            return 1


def rightClick(response):
    """
    Check if the webpage disables right-click (potential phishing technique).
    Returns: 1 = Right-click disabled, 0 = Normal
    """
    if response == "":
        return 0
    else:
        if re.findall(r"event\.button ?== ?2", response):
            return 1
        else:
            return 0


def forwarding(url):
    """
    Check the number of redirects for a URL.
    Returns: 1 = Legitimate (≤2 redirects), 0 = Suspicious (>2 redirects)
    """
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=5)
        redirect_count = len(response.history)
        if redirect_count <= 2:
            return 1
        else:
            return 0
    except Exception:
        return 0


print("✅ Feature extraction functions defined successfully!")
print("\nAvailable functions:")
print("  - DNSRecord(url): Check DNS record existence")
print("  - iframe(response): Detect iframe tags")
print("  - mouseOver(response): Detect onmouseover events")
print("  - rightClick(response): Detect right-click blocking")
print("  - forwarding(url): Flag URLs with more than 2 redirects")

In [None]:
# Load the complete dataset with pre-extracted features
github_dataset = 'https://raw.githubusercontent.com/Elmer2408/WQD7006_Machine-learning-of-Malicious-URLs-Detection/main/urldata_with_features_complete.csv'
df_extract = pd.read_csv(github_dataset)
df_extract.columns = df_extract.columns.str.lower()

print("="*60)
print("COMPLETE DATASET WITH EXTRACTED FEATURES")
print("="*60)
print(f"Total URLs: {df_extract.shape[0]:,}")
print(f"Total Features: {df_extract.shape[1]}")
print(f"\nFeatures: {df_extract.columns.tolist()}")
df_extract.head()

---

# 4. Exploratory Data Analysis (EDA)

This section explores the dataset structure, class distribution, feature correlations, and distributions to understand the data characteristics.
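As a small illustration of two of these checks, the sketch below computes class shares and the correlation of each feature with the target on a toy frame. The rows are made up; only the column names mirror the dataset.

```python
import pandas as pd

# Made-up rows; 1 = legitimate, 0 = phishing, as in the dataset's classlabel.
toy = pd.DataFrame({
    "https_flag":     [1, 1, 1, 0, 0, 0, 0, 1],
    "has_ip_address": [0, 0, 0, 1, 1, 0, 1, 0],
    "classlabel":     [1, 1, 1, 0, 0, 0, 0, 1],
})

# Class shares, ordered by label rather than by frequency.
class_share = toy["classlabel"].value_counts(normalize=True).sort_index()

# Pearson correlation of each feature with the target.
target_corr = toy.corr()["classlabel"].drop("classlabel")

print(class_share)
print(target_corr)
```

Sorting by index matters because `value_counts` orders by frequency by default, so the first entry is not guaranteed to be label 0.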

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Dataset Overview
print("="*60)
print("DATASET OVERVIEW")
print("="*60)
print(f"Total URLs: {df_extract.shape[0]:,}")
print(f"Total Features: {df_extract.shape[1]}")
print(f"\nColumn Names:\n{df_extract.columns.tolist()}")

In [None]:
# Class Distribution
print("="*60)
print("CLASS DISTRIBUTION")
print("="*60)

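# sort_index() orders the counts by label (0, 1) so they line up with the plot labels below
class_counts = df_extract['classlabel'].value_counts().sort_index()
class_percent = df_extract['classlabel'].value_counts(normalize=True).sort_index() * 100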

print(f"\nPhishing URLs (0): {class_counts[0]:,} ({class_percent[0]:.1f}%)")
print(f"Legitimate URLs (1): {class_counts[1]:,} ({class_percent[1]:.1f}%)")

# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie Chart
colors = ['#dc3545', '#28a745']
axes[0].pie(class_counts, labels=['Phishing (0)', 'Legitimate (1)'], autopct='%1.1f%%',
            colors=colors, explode=(0.02, 0.02), startangle=90)
axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')

# Bar Chart
ax = sns.countplot(x='classlabel', data=df_extract, palette=['#dc3545', '#28a745'], ax=axes[1])
axes[1].set_xticklabels(['Phishing (0)', 'Legitimate (1)'])
axes[1].set_title('Class Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Class Label')
axes[1].set_ylabel('Count')
for p in ax.patches:
    ax.annotate(f'{int(p.get_height()):,}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='bottom', fontsize=12)

plt.tight_layout()
plt.show()

In [None]:
# Statistical Summary
print("="*60)
print("STATISTICAL SUMMARY")
print("="*60)
df_extract.describe().T

In [None]:
# Feature Correlation Heatmap
print("="*60)
print("FEATURE CORRELATION HEATMAP")
print("="*60)

plt.figure(figsize=(16, 12))
numeric_df = df_extract.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            fmt='.2f', linewidths=0.5, annot_kws={'size': 8})
plt.title('Feature Correlation Heatmap', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Feature Correlation with Target
print("="*60)
print("FEATURE CORRELATION WITH TARGET")
print("="*60)

target_correlation = correlation_matrix['classlabel'].drop('classlabel').sort_values(ascending=False)
print("\nFeatures sorted by correlation with classlabel:\n")
print(target_correlation.round(4))

# Visualize correlations
plt.figure(figsize=(10, 8))
colors = ['#28a745' if x > 0 else '#dc3545' for x in target_correlation.values]
target_correlation.plot(kind='barh', color=colors)
plt.title('Feature Correlation with Target (ClassLabel)', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient')
plt.ylabel('Features')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()

In [None]:
# Key Features Analysis
# Based on project poster: url_length, has_ip_address, https_flag, subdomain_count, url_entropy

print("="*60)
print("KEY FEATURES ANALYSIS")
print("="*60)

key_features = ['url_length', 'has_ip_address', 'https_flag', 'subdomain_count', 'url_entropy']
print(f"5 Key Features: {key_features}")

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for idx, feature in enumerate(key_features):
    sns.boxplot(x='classlabel', y=feature, data=df_extract, ax=axes[idx],
                palette=['#dc3545', '#28a745'])
    axes[idx].set_xticklabels(['Phishing (0)', 'Legitimate (1)'])
    axes[idx].set_title(f'{feature}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('')

axes[-1].set_visible(False)

plt.suptitle('Key Features Distribution by Class', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Feature Distribution by Class (All Features)
print("="*60)
print("FEATURE DISTRIBUTION BY CLASS")
print("="*60)

feature_cols = [col for col in numeric_df.columns if col not in ['classlabel']]

fig, axes = plt.subplots(nrows=5, ncols=4, figsize=(20, 20))
axes = axes.flatten()

for idx, col in enumerate(feature_cols[:20]):
    sns.boxplot(x='classlabel', y=col, data=df_extract, ax=axes[idx], palette=['#dc3545', '#28a745'])
    axes[idx].set_title(f'{col}', fontsize=10, fontweight='bold')
    axes[idx].set_xlabel('')
    axes[idx].set_xticklabels(['Phishing', 'Legitimate'])

for idx in range(len(feature_cols[:20]), len(axes)):
    axes[idx].set_visible(False)

plt.suptitle('Feature Distribution by Class (0=Phishing, 1=Legitimate)', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---

# 5. Data Splitting

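This section separates the features from the target variable, then splits the data into training (80%) and testing (20%) sets, stratified so that both sets keep the original class proportions. The features are then standardized with a scaler fitted on the training set only.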
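The effect of stratification can be seen on synthetic data. This is a sketch with random features and a 60/40 label split chosen for illustration; the variable names are hypothetical and do not refer to the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))   # made-up features
y_toy = np.array([0] * 60 + [1] * 40)                    # 60% / 40% classes

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so no test-set statistics leak into training.
scaler_toy = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_toy.transform(X_tr)
X_te_scaled = scaler_toy.transform(X_te)

print("train class share:", y_tr.mean(), "test class share:", y_te.mean())
print("train column means after scaling:", X_tr_scaled.mean(axis=0).round(6))
```

With `stratify=y_toy`, both splits keep the 40% share of label 1. The scaled test set will generally not have exactly zero mean, because it is transformed with the training set's statistics.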

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Separate features (X) and target (y)
X = df_extract.drop(columns=['classlabel', 'url'])
y = df_extract['classlabel']

print("="*60)
print("FEATURE AND TARGET SEPARATION")
print("="*60)
print(f"Features Shape (X): {X.shape}")
print(f"Target Shape (y): {y.shape}")
print(f"Number of Features: {X.shape[1]}")
print(f"\nFeature Columns:\n{X.columns.tolist()}")

In [None]:
# Train-Test Split (80% Training, 20% Testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print("="*60)
print("TRAIN-TEST SPLIT (80/20)")
print("="*60)
print(f"Training Set: {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.0f}%)")
print(f"Testing Set: {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.0f}%)")
print(f"\nTraining Class Distribution:")
print(f"  Phishing (0): {(y_train == 0).sum():,}")
print(f"  Legitimate (1): {(y_train == 1).sum():,}")
print(f"\nTesting Class Distribution:")
print(f"  Phishing (0): {(y_test == 0).sum():,}")
print(f"  Legitimate (1): {(y_test == 1).sum():,}")

In [None]:
# Feature Scaling using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)

print("="*60)
print("FEATURE SCALING (STANDARDIZATION)")
print("="*60)
print("StandardScaler applied: mean=0, std=1")
print("\nSample of scaled training data:")
X_train_scaled.head()

---

# 6. Modeling (Binary Classification)

This section trains four machine learning models as specified in the project methodology:
- **Logistic Regression**: A linear model for binary classification
- **Random Forest**: An ensemble of decision trees
- **XGBoost**: Extreme Gradient Boosting algorithm
- **LightGBM**: Light Gradient Boosting Machine

In [None]:
# Install required packages
!pip install lightgbm xgboost --quiet

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Define 4 models as per project methodology
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss', verbosity=0),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
}

print("="*60)
print("MACHINE LEARNING MODELS")
print("="*60)
for i, model_name in enumerate(models.keys(), 1):
    print(f"{i}. {model_name}")

In [None]:
# Train all models and collect results
results = []

print("="*60)
print("MODEL TRAINING IN PROGRESS...")
print("="*60)

for model_name, model in models.items():
    print(f"\n▶ Training {model_name}...")

    # Train the model
    model.fit(X_train_scaled, y_train)

    # Make predictions
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

    # Calculate all evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)

    results.append({
        'Model': model_name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'AUC-ROC': roc_auc
    })

    print(f"  ✓ Accuracy: {accuracy:.4f} | Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f} | AUC: {roc_auc:.4f}")

print("\n" + "="*60)
print("✅ ALL MODELS TRAINED SUCCESSFULLY!")
print("="*60)

---

# 7. Evaluation Metrics

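This section evaluates and compares all trained models. Because scikit-learn treats label 1 (legitimate) as the positive class by default, precision, recall, and F1-score below are reported for the legitimate class. The metrics are: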
- **Accuracy**: Overall correctness of predictions
- **Precision**: Ratio of true positives to all positive predictions
- **Recall**: Ratio of true positives to all actual positives
- **F1-Score**: Harmonic mean of precision and recall
- **AUC-ROC**: Area under the Receiver Operating Characteristic curve
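
The threshold-based metrics above can be worked out by hand from the four confusion-matrix counts. The sketch below does this in plain Python; the function name `classification_metrics` is hypothetical, and the counts are made-up numbers, not results from this notebook.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration only (not results from this notebook).
example = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(example)
```

Accuracy uses all four counts, while precision and recall ignore true negatives, which is why they can differ noticeably from accuracy on imbalanced data.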

In [None]:
# Model Performance Comparison Table
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(by='F1-Score', ascending=False).reset_index(drop=True)

print("="*60)
print("MODEL PERFORMANCE COMPARISON")
print("="*60)
results_df

In [None]:
# Visualize Model Performance Comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Grouped bar chart for all metrics
x = np.arange(len(results_df))
width = 0.15
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC']
colors = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']

for i, metric in enumerate(metrics):
    axes[0].bar(x + i*width, results_df[metric], width, label=metric, color=colors[i])

axes[0].set_xlabel('Model', fontsize=12)
axes[0].set_ylabel('Score', fontsize=12)
axes[0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0].set_xticks(x + width * 2)
axes[0].set_xticklabels(results_df['Model'], rotation=15, ha='right')
axes[0].legend(loc='lower right')
axes[0].set_ylim(0, 1.1)
axes[0].grid(axis='y', alpha=0.3)

# AUC-ROC comparison
colors_models = ['#1f77b4', '#2ca02c', '#ff7f0e', '#d62728']
bars = axes[1].barh(results_df['Model'], results_df['AUC-ROC'], color=colors_models, edgecolor='black')
axes[1].set_title('AUC-ROC Score Comparison', fontsize=14, fontweight='bold')
axes[1].set_xlabel('AUC-ROC Score')
axes[1].set_xlim(0, 1.1)
for i, (bar, v) in enumerate(zip(bars, results_df['AUC-ROC'])):
    axes[1].text(v + 0.01, bar.get_y() + bar.get_height()/2, f'{v:.4f}', va='center', fontsize=11)

plt.tight_layout()
plt.show()

In [None]:
# Best Model Summary
best_model_name = results_df.iloc[0]['Model']
best_model = models[best_model_name]

print("="*60)
print("🏆 BEST PERFORMING MODEL")
print("="*60)
print(f"\n   Model:     {best_model_name}")
print(f"\n   Accuracy:  {results_df.iloc[0]['Accuracy']:.4f}")
print(f"   Precision: {results_df.iloc[0]['Precision']:.4f}")
print(f"   Recall:    {results_df.iloc[0]['Recall']:.4f}")
print(f"   F1-Score:  {results_df.iloc[0]['F1-Score']:.4f}")
print(f"   AUC-ROC:   {results_df.iloc[0]['AUC-ROC']:.4f}")
print("="*60)

---

# 8. Confusion Matrix

The confusion matrix visualizes the performance of each classification model by showing:
- **True Positives (TP)**: Correctly predicted legitimate URLs
- **True Negatives (TN)**: Correctly predicted phishing URLs
- **False Positives (FP)**: Phishing URLs incorrectly predicted as legitimate
- **False Negatives (FN)**: Legitimate URLs incorrectly predicted as phishing
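
The four cells can be counted directly from label lists. This is a minimal sketch in plain Python, assuming label 1 (legitimate) is the positive class as in the definitions above; the function name `confusion_counts` and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels with a chosen positive label."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1      # predicted positive, actually positive
            else:
                fp += 1      # predicted positive, actually negative
        else:
            if truth == positive:
                fn += 1      # predicted negative, actually positive
            else:
                tn += 1      # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Made-up labels: 1 = legitimate, 0 = phishing.
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 1, 0, 0, 0, 1, 0, 1]
print(confusion_counts(y_true_toy, y_pred_toy))
```

For labels ordered 0 then 1, scikit-learn's `confusion_matrix` lays the same counts out as `[[TN, FP], [FN, TP]]`. With this labeling, a false positive is a phishing URL that slips through as legitimate, which is usually the more costly error for a detector.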

In [None]:
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

for idx, (model_name, model) in enumerate(models.items()):
    y_pred = model.predict(X_test_scaled)
    cm = confusion_matrix(y_test, y_pred)

    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Phishing (0)', 'Legitimate (1)'])
    disp.plot(cmap='Blues', ax=axes[idx], colorbar=False)
    axes[idx].set_title(f'{model_name}', fontsize=12, fontweight='bold')

plt.suptitle('Confusion Matrix - All Models', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Detailed Classification Report for Best Model
print("="*60)
print(f"CLASSIFICATION REPORT - {best_model_name}")
print("="*60)

y_pred_best = best_model.predict(X_test_scaled)
print(classification_report(y_test, y_pred_best, target_names=['Phishing (0)', 'Legitimate (1)']))

---

# 9. AUC-ROC Curve

The ROC (Receiver Operating Characteristic) curve plots the True Positive Rate against the False Positive Rate at various threshold settings. The AUC (Area Under the Curve) measures the model's ability to distinguish between classes:
- **AUC = 1.0**: Perfect classifier
- **AUC = 0.5**: Random classifier (no discrimination)
- **AUC > 0.9**: Commonly regarded as excellent (a rule of thumb, not a formal threshold)
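
One way to build intuition for these values: AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python. The function name `auc_by_ranking` is hypothetical, and the pairwise loop is quadratic, so it is meant for illustration rather than large datasets.

```python
def auc_by_ranking(y_true, scores, positive=1):
    """AUC as P(score of a random positive > score of a random negative)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration.
labels = [0, 0, 1, 1]
print(auc_by_ranking(labels, [0.1, 0.4, 0.35, 0.8]))
```

A model that gives every sample the same score gets 0.5 under this definition, matching the random-classifier baseline.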

In [None]:
from sklearn.metrics import roc_curve, auc

plt.figure(figsize=(10, 8))

colors = ['#1f77b4', '#2ca02c', '#ff7f0e', '#d62728']
line_styles = ['-', '--', '-.', ':']

for (model_name, model), color, ls in zip(models.items(), colors, line_styles):
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.4f})',
             linewidth=2.5, color=color, linestyle=ls)

plt.plot([0, 1], [0, 1], 'k--', linewidth=1.5, label='Random Classifier (AUC = 0.5)')
plt.fill_between([0, 1], [0, 1], alpha=0.1, color='gray')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves - Model Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---

# 10. Feature Importance Analysis

This section identifies the most important features for predicting phishing URLs. Feature importance helps understand which URL characteristics are most indicative of malicious intent.
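As a cross-check on model-specific importances, permutation importance is a different, model-agnostic technique: shuffle one feature at a time and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data where only the first column determines the label. It is not the method used in the cells that follow, and all names and data here are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_syn = rng.normal(size=(400, 3))            # synthetic features
y_syn = (X_syn[:, 0] > 0).astype(int)        # only column 0 determines the label

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)

# Shuffle one column at a time and measure the drop in accuracy.
result = permutation_importance(forest, X_syn, y_syn, n_repeats=5, random_state=0)

for name, mean_drop in zip(["f0", "f1", "f2"], result.importances_mean):
    print(f"{name}: {mean_drop:.3f}")
```

For brevity this evaluates on the training data; computing the importances on a held-out set gives a fairer picture of which features the model relies on to generalize.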

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(16, 14))

# Random Forest Feature Importance
rf_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': models['Random Forest'].feature_importances_
}).sort_values(by='Importance', ascending=True)

axes[0, 0].barh(rf_importance['Feature'], rf_importance['Importance'], color='#2ca02c')
axes[0, 0].set_xlabel('Importance Score')
axes[0, 0].set_title('Feature Importance - Random Forest', fontsize=12, fontweight='bold')

# XGBoost Feature Importance
xgb_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': models['XGBoost'].feature_importances_
}).sort_values(by='Importance', ascending=True)

axes[0, 1].barh(xgb_importance['Feature'], xgb_importance['Importance'], color='#ff7f0e')
axes[0, 1].set_xlabel('Importance Score')
axes[0, 1].set_title('Feature Importance - XGBoost', fontsize=12, fontweight='bold')

# LightGBM Feature Importance
lgb_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': models['LightGBM'].feature_importances_
}).sort_values(by='Importance', ascending=True)

axes[1, 0].barh(lgb_importance['Feature'], lgb_importance['Importance'], color='#d62728')
axes[1, 0].set_xlabel('Importance Score')
axes[1, 0].set_title('Feature Importance - LightGBM', fontsize=12, fontweight='bold')

# Logistic Regression Coefficients
lr_coef = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': np.abs(models['Logistic Regression'].coef_[0])
}).sort_values(by='Coefficient', ascending=True)

axes[1, 1].barh(lr_coef['Feature'], lr_coef['Coefficient'], color='#1f77b4')
axes[1, 1].set_xlabel('Absolute Coefficient')
axes[1, 1].set_title('Feature Importance - Logistic Regression', fontsize=12, fontweight='bold')

plt.suptitle('Feature Importance Analysis - All Models', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Top 10 Most Important Features
print("="*60)
print("TOP 10 MOST IMPORTANT FEATURES (Random Forest)")
print("="*60)

top_10_rf = rf_importance.sort_values(by='Importance', ascending=False).head(10)
for i, (idx, row) in enumerate(top_10_rf.iterrows(), 1):
    print(f"  {i:2}. {row['Feature']:25} : {row['Importance']:.4f}")

---

# 11. Summary & Conclusion

This section summarizes the findings from the machine learning experiment for malicious URL detection.

In [None]:
print("="*70)
print("📊 PROJECT SUMMARY: Machine Learning for Malicious URL/QR Detection")
print("="*70)

print(f"\n📁 DATASET: LegitPhish")
print(f"   • Total URLs: {df_extract.shape[0]:,}")
print(f"   • Phishing URLs: {(df_extract['classlabel'] == 0).sum():,} ({(df_extract['classlabel'] == 0).sum()/len(df_extract)*100:.1f}%)")
print(f"   • Legitimate URLs: {(df_extract['classlabel'] == 1).sum():,} ({(df_extract['classlabel'] == 1).sum()/len(df_extract)*100:.1f}%)")
print(f"   • Features Used: {X.shape[1]}")

print(f"\n🤖 MODELS EVALUATED:")
for i, model in enumerate(models.keys(), 1):
    print(f"   {i}. {model}")

print(f"\n📈 BEST MODEL: {best_model_name}")
print(f"   • Accuracy:  {results_df.iloc[0]['Accuracy']:.4f} ({results_df.iloc[0]['Accuracy']*100:.2f}%)")
print(f"   • Precision: {results_df.iloc[0]['Precision']:.4f}")
print(f"   • Recall:    {results_df.iloc[0]['Recall']:.4f}")
print(f"   • F1-Score:  {results_df.iloc[0]['F1-Score']:.4f}")
print(f"   • AUC-ROC:   {results_df.iloc[0]['AUC-ROC']:.4f}")

print(f"\n🎯 TOP 5 PREDICTIVE FEATURES:")
top_5 = rf_importance.sort_values(by='Importance', ascending=False).head(5)
for i, (idx, row) in enumerate(top_5.iterrows(), 1):
    print(f"   {i}. {row['Feature']}")

print("\n" + "="*70)
print("✅ Model successfully developed for detecting phishing URLs!")
print("="*70)

---

## 📌 Key Findings

1. **Dataset Characteristics**: The LegitPhish dataset contains over 100,000 URLs with a class imbalance (62.9% phishing vs 37.1% legitimate).

2. **Model Performance**: All four models achieved high accuracy in detecting malicious URLs, with ensemble methods (Random Forest, XGBoost, LightGBM) generally outperforming Logistic Regression.

3. **Important Features**: URL entropy, URL length, and domain-based features proved to be the most predictive indicators of phishing attempts.

4. **Practical Application**: The trained model can be deployed for real-time URL scanning to protect users from phishing attacks on social media platforms.

---

## 🚀 Future Work

- Implement real-time URL feature extraction for live predictions
- Add QR code scanning and URL extraction functionality
- Deploy the model as a web API or browser extension
- Continuously update the model with new phishing patterns

---

**© 2025 WQD7006 Machine Learning Project - Universiti Malaya**