## **Model Training for Phishing Detection API**

This section trains **three models** to support our phishing detection API that takes a URL and returns:

1. **Predicted PageRank** (regression model)
2. **Predicted Google Index** (binary classification)
3. **Phishing Detection** (classification using predicted PageRank & Google Index)


In [None]:
import pandas as pd
import tldextract
import joblib
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, r2_score

# Load dataset
df = pd.read_csv("C:/Users/student/Downloads/archive (7)/dataset_phishing.csv")  # Change path accordingly

### **Step 1: Feature Engineering from URL**

We extract simple features from the raw URL using `tldextract` and basic string operations, including:
- Length of the URL and of the registered domain plus suffix (stored as `length_hostname`; subdomains are not counted)
- Count of characters like `.`, `-`, `/`, `@`, `?`, `&`, `=`
- Whether the substrings `www` or `https` appear anywhere in the URL
- Whether the domain contains a hyphen (`prefix_suffix`)

These features are used to simulate how much information we can get from the **URL alone**, without any external query.


In [None]:
# Feature engineering using only 'url'
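def extract_url_features(url):
    # Split the URL into subdomain, registered domain, and public suffix
    ext = tldextract.extract(url)
    return {
        "length_url": len(url),
        # Registered domain + suffix only; subdomains are not included
        "length_hostname": len(ext.domain + '.' + ext.suffix),
        "nb_dots": url.count('.'),
        "nb_hyphens": url.count('-'),
        "nb_slash": url.count('/'),
        # Binary flags: 1 if the substring occurs anywhere in the URL (path included)
        "nb_www": int('www' in url),
        "has_https": int('https' in url),
        "nb_at": url.count('@'),
        "nb_qm": url.count('?'),
        "nb_and": url.count('&'),
        "nb_eq": url.count('='),
        # 1 if the registered domain itself contains a hyphen
        "prefix_suffix": int('-' in ext.domain),
    }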

# Apply to whole DataFrame
features_df = pd.DataFrame(df['url'].apply(extract_url_features).tolist())
features_df['page_rank'] = df['page_rank']
features_df['google_index'] = df['google_index']
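# LabelEncoder assigns codes in sorted label order, so with the labels
# 'legitimate' and 'phishing' (assumed here) we get legitimate = 0, phishing = 1
features_df['status'] = LabelEncoder().fit_transform(df['status'])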
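
For comparison, the hostname and scheme features can also be derived with the standard library's `urllib.parse`. This is a minimal sketch, not the notebook's method: `stdlib_url_features` is a hypothetical helper, and it differs from `extract_url_features` in that it measures the full hostname (subdomains included) and checks the URL scheme instead of searching for `https` anywhere in the string.

```python
from urllib.parse import urlsplit


def stdlib_url_features(url: str) -> dict:
    """Hypothetical helper: hostname and scheme features via urllib.parse."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""  # hostname is None when the URL has no network location
    return {
        "length_hostname": len(hostname),            # full hostname, subdomains included
        "has_https": int(parts.scheme == "https"),   # scheme only, not the path or query
        "nb_www": int(hostname.startswith("www.")),  # 'www' as the leading label only
    }


# The word 'https' in the path does not set has_https here.
example = stdlib_url_features("http://www.example.com/https/login?user=a")
print(example)
```

If a variant like this were adopted, the same function would have to be used both for training and inside the API so that the features stay consistent.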

### **Step 2: Train PageRank Regressor**

We train a **Random Forest Regressor** to predict the `page_rank` of the URL using the extracted features.


In [None]:
# 1. Model to predict PageRank (regression)
X_page_rank = features_df.drop(columns=['page_rank', 'google_index', 'status'])
y_page_rank = features_df['page_rank']
model_page_rank = RandomForestRegressor(random_state=42)
model_page_rank.fit(X_page_rank, y_page_rank)
joblib.dump(model_page_rank, 'model_pagerank.pkl')
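
The cell above fits the regressor on every row, so it gives no estimate of how well `page_rank` can be predicted from URL features. A minimal sketch of a hold-out check with `train_test_split` and `r2_score` follows. The synthetic arrays `X_demo` and `y_demo` are stand-ins for `X_page_rank` and `y_page_rank`, and the 80/20 split is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_page_rank / y_page_rank (assumption, not the real dataset).
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 50, size=(400, 5)).astype(float)
y_demo = np.clip(10 - X_demo[:, 0] / 5 + rng.normal(0, 1, 400), 0, 10)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

reg = RandomForestRegressor(random_state=42)
reg.fit(X_train, y_train)

# R^2 on rows the model did not see during fitting.
r2_holdout = r2_score(y_test, reg.predict(X_test))
print(f"Hold-out R^2: {r2_holdout:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_page_rank` and `y_page_rank`; the model saved for the API could still be refit on all rows afterwards.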

### **Step 3: Train Google Index Classifier**

Next, we use the same features to train a **Random Forest Classifier** that predicts whether the URL is indexed by Google (`google_index`: 0 or 1).

In [None]:
# 2. Model to predict Google Index (classification)
y_google_index = features_df['google_index']
model_google_index = RandomForestClassifier(random_state=42)
model_google_index.fit(X_page_rank, y_google_index)
joblib.dump(model_google_index, 'model_googleindex.pkl')
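
Accuracy for the `google_index` classifier is easier to interpret next to a trivial baseline. The sketch below compares a Random Forest against scikit-learn's `DummyClassifier`, which always predicts the most frequent class, on a stratified hold-out split. The data is synthetic and only stands in for the notebook's features and labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the URL features and google_index labels (assumption).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 50, size=(400, 5)).astype(float)
y_demo = (X_demo[:, 0] + rng.normal(0, 5, 400) > 25).astype(int)

# stratify keeps the class ratio the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

acc_model = accuracy_score(y_test, clf.predict(X_test))
acc_baseline = accuracy_score(y_test, baseline.predict(X_test))
print(f"Random Forest: {acc_model:.3f}  majority baseline: {acc_baseline:.3f}")
```

If the two numbers are close on the real data, the URL features carry little information about `google_index`.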

### **Step 4: Train Phishing Detector**

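Finally, we train another **Random Forest Classifier** on the `page_rank` and `google_index` columns to classify whether the URL is **phishing or legitimate**. Training uses the values recorded in the dataset; at inference time the API supplies the values predicted by the first two models instead.


In [7]:
# 3. Model to predict phishing from page_rank & google_index (dataset values at training time)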
X_phish = features_df[['page_rank', 'google_index']]
y_phish = features_df['status']
model_phishing = RandomForestClassifier(random_state=42)
model_phishing.fit(X_phish, y_phish)
joblib.dump(model_phishing, 'model_phishing.pkl')

print("✅ Models saved.")

✅ Models saved.
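
The phishing model is fit on the `page_rank` and `google_index` columns of the dataset, while the API will feed it outputs of the first two models. One possible way to narrow that gap, shown here only as a sketch and not as the notebook's method, is to train on out-of-fold predictions from `cross_val_predict`, so every row's inputs come from a model that never saw that row. All arrays are synthetic stand-ins, and the 5-fold setting is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-ins for the URL features and the three targets (assumption).
rng = np.random.default_rng(1)
n = 300
X_url = rng.integers(0, 50, size=(n, 4)).astype(float)
page_rank = np.clip(10 - X_url[:, 0] / 5 + rng.normal(0, 1, n), 0, 10)
google_index = (X_url[:, 1] > 25).astype(int)
status = ((page_rank < 5) & (google_index == 1)).astype(int)

# Out-of-fold predictions: each row is predicted by a model fit on the other folds.
pred_rank = cross_val_predict(
    RandomForestRegressor(random_state=42), X_url, page_rank, cv=5
)
pred_index = cross_val_predict(
    RandomForestClassifier(random_state=42), X_url, google_index, cv=5
)

# Train the final classifier on predicted, not recorded, inputs.
X_phish_pred = np.column_stack([pred_rank, pred_index])
phish_clf = RandomForestClassifier(random_state=42).fit(X_phish_pred, status)
print(X_phish_pred.shape)
```

The trade-off is that the phishing model then learns from noisier inputs, which may lower its training accuracy while matching more closely what it sees in production.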


### **Output**

All three models are saved as `.pkl` files using `joblib`:
- `model_pagerank.pkl`
- `model_googleindex.pkl`
- `model_phishing.pkl`

These models will be used in the **FastAPI backend** to serve predictions via a single `/predict` endpoint.
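
The chained inference behind such an endpoint can be sketched as a plain function; the FastAPI wiring is left out. Everything here is illustrative: `predict_url` and `simple_features` are hypothetical names, `simple_features` is a three-feature stand-in for `extract_url_features`, and the toy models are trained on four made-up rows so the example runs on its own.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor


def simple_features(url: str) -> list:
    """Stand-in for extract_url_features (assumption: only three features)."""
    return [len(url), url.count("."), url.count("-")]


# Toy training data (made up, not the real dataset).
urls = [
    "https://example.com",
    "http://secure-login.example-bank.co/verify",
    "https://docs.python.org/3/",
    "http://a-b-c.d-e-f.xyz/update/account",
]
X = np.array([simple_features(u) for u in urls], dtype=float)
page_rank = np.array([6.0, 1.0, 8.0, 0.0])
google_index = np.array([0, 1, 0, 1])
status = np.array([0, 1, 0, 1])

# Save three toy models under the same file names the notebook uses.
model_dir = Path(tempfile.mkdtemp())
joblib.dump(RandomForestRegressor(random_state=42).fit(X, page_rank),
            str(model_dir / "model_pagerank.pkl"))
joblib.dump(RandomForestClassifier(random_state=42).fit(X, google_index),
            str(model_dir / "model_googleindex.pkl"))
joblib.dump(RandomForestClassifier(random_state=42).fit(
    np.column_stack([page_rank, google_index]), status),
    str(model_dir / "model_phishing.pkl"))


def predict_url(url: str, model_dir: Path) -> dict:
    """Hypothetical helper: URL -> predicted page_rank, google_index, phishing."""
    rank_model = joblib.load(str(model_dir / "model_pagerank.pkl"))
    index_model = joblib.load(str(model_dir / "model_googleindex.pkl"))
    phish_model = joblib.load(str(model_dir / "model_phishing.pkl"))

    feats = np.array([simple_features(url)], dtype=float)
    rank = float(rank_model.predict(feats)[0])
    index = int(index_model.predict(feats)[0])
    # The phishing model takes the two predicted values, in training column order.
    phishing = int(phish_model.predict(np.array([[rank, index]]))[0])
    return {"page_rank": rank, "google_index": index, "phishing": phishing}


result = predict_url("http://secure-login.example-bank.co/verify", model_dir)
print(result)
```

In a real service the models would typically be loaded once at startup instead of on every call, and the feature columns would need the same names and order as during training.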

### **Author**
Ananya P S