# Phishing + AI — End-to-end Lab

This notebook walks through dataset acquisition (fallback to repo sample), preprocessing, simple URL feature engineering, RandomForest baseline training, evaluation (confusion matrix, precision/recall/F1, PR-AUC), and SHAP-based interpretation.

Notes:
- If you want the full Kaggle dataset, upload `kaggle.json` and add a Kaggle download cell; none is included in this notebook by default.
- Otherwise the notebook falls back to a small sample CSV included in the repo.

In [None]:
# Install dependencies (run once in Colab)
!pip install -q scikit-learn pandas matplotlib seaborn shap joblib


## 1) Data acquisition
Try to load a local `/content/phishing.csv`. If it does not exist, download the sample CSV from the repo raw URL.

In [None]:
import os
import pandas as pd
DATA_LOCAL = '/content/phishing.csv'
if os.path.exists(DATA_LOCAL):
    df = pd.read_csv(DATA_LOCAL)
    print('Loaded local /content/phishing.csv')
else:
    # Fallback: raw sample CSV in this repo (adjust OWNER/repo if needed)
    sample_url = 'https://raw.githubusercontent.com/anesra/phishing-baseline/main/data/sample/phishing_sample.csv'
    try:
        df = pd.read_csv(sample_url)
        print('Loaded sample CSV from repo')
    except Exception:
        print('Could not download sample CSV automatically; please upload a CSV to /content/phishing.csv or provide Kaggle credentials.')
        raise

df.head()

## 2) Quick EDA
Check basic structure, label balance, and missing values.

In [None]:
print('shape:', df.shape)
print('\ncolumns:', df.columns.tolist())
print('\nlabel value counts:')
if 'label' in df.columns:
    print(df['label'].value_counts(dropna=False))
else:
    print('No column named "label" found — please map your label column to "label"')
print('\nmissing per column:')
print(df.isnull().sum())
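
If the label column has a different name or coding, it can be mapped before continuing. The sketch below is only an illustration: the column name `Result` and the -1/1 coding are assumptions, not something the sample CSV is known to contain, so adjust both to your data.

```python
import pandas as pd

# Hypothetical raw frame: the column name 'Result' and the -1/1 coding are assumptions
raw = pd.DataFrame({
    'URL': ['http://example.com', 'http://paypa1-login.example.net', 'https://example.org/docs'],
    'Result': [-1, 1, -1],
})

# Rename to the name the notebook expects, then recode to 1 = phishing, 0 = safe
mapped = raw.rename(columns={'Result': 'label'})
mapped['label'] = mapped['label'].map({1: 1, -1: 0})

# Any value outside the mapping becomes NaN, so fail early instead of training on it
assert mapped['label'].notna().all(), 'unmapped label values found'
print(mapped['label'].value_counts())
```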

## 3) Preprocessing pipeline and simple feature engineering
The sample CSV contains the columns `URL`, `HTTPS_Token`, `URL_Length`, `having_At_Symbol`, `Prefix_Suffix`, `Shortening_Service`, and `label`.
We extract a few numeric features from the URL and build a `ColumnTransformer` pipeline. If your dataset uses different column names, adapt the column lists accordingly.

In [None]:
from urllib.parse import urlparse
import numpy as np
import pandas as pd

def extract_url_features(url):
    try:
        s = str(url)
        p = urlparse(s)
        host = p.netloc or ''
        path = p.path or ''
    except Exception:
        host, path, s = '', '', str(url)
    return {
        'url_length': len(s),
        'host_length': len(host),
        'path_length': len(path),
        'has_https_token': int('https' in s.lower()),
        'has_at_symbol': int('@' in s),
        'num_dashes': s.count('-')
    }

# Apply feature extraction only if there's a URL column
if 'URL' in df.columns:
    features_df = df['URL'].apply(lambda u: pd.Series(extract_url_features(u)))
    # merge into df
    df = pd.concat([df.reset_index(drop=True), features_df.reset_index(drop=True)], axis=1)
else:
    print('No URL column found — ensure your dataset has a URL column or provide feature columns directly')

df.head()
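
One pitfall with `urlparse` is worth checking on your data: when a URL has no scheme (for example `example.com/login`), the host ends up in `path` and `netloc` is empty, so `host_length` would be 0. The helper below is a hypothetical sketch (`parse_host_and_path` is not part of the notebook) that prefixes `//` when no scheme is present.

```python
import re
from urllib.parse import urlparse

# Matches a leading scheme such as "http://" or "ftp://"
_SCHEME_RE = re.compile(r'^[a-zA-Z][a-zA-Z0-9+.\-]*://')

def parse_host_and_path(url):
    """Hypothetical helper: return (host, path), tolerating URLs without a scheme."""
    s = str(url).strip()
    # urlparse only fills netloc when the string has a '//' authority marker
    if not _SCHEME_RE.match(s) and not s.startswith('//'):
        s = '//' + s
    p = urlparse(s)
    return p.netloc, p.path

# Without a scheme, plain urlparse leaves the host empty
print(repr(urlparse('example.com/login').netloc))
print(parse_host_and_path('example.com/login'))
print(parse_host_and_path('https://a-b.example.com/x/y'))
```

If many rows in your dataset lack a scheme, a normalization step like this could be applied inside `extract_url_features` before computing the length features.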

In [None]:
# Build preprocessing pipeline for the example feature set
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ['url_length','host_length','path_length','num_dashes']
bin_cols = ['HTTPS_Token','has_https_token','having_At_Symbol','has_at_symbol','Prefix_Suffix','Shortening_Service']
num_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),('scaler', StandardScaler())])
cat_transformer = Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),('ohe', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, bin_cols)
])
print('Preprocessor prepared with numeric and categorical transformers')

## 4) Train / test split

In [None]:
from sklearn.model_selection import train_test_split
if 'label' not in df.columns:
    raise ValueError('No label column found. Please ensure your dataset has a "label" column with 1=phishing, 0=safe.')
X = df.drop(columns=['label'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print('Train:', X_train.shape, 'Test:', X_test.shape)
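
A quick sanity check after a stratified split is to compare the positive rate in each part. The sketch below uses synthetic labels (about 20% positives, an arbitrary choice) rather than the notebook's `df`, purely to show the check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced demo data (not the phishing dataset)
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.2).astype(int)
X_demo = rng.normal(size=(1000, 3))

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# With stratify=y, both parts should have nearly the same positive rate
print('overall:', y_demo.mean(), 'train:', y_tr.mean(), 'test:', y_te.mean())
```

In the notebook itself, the equivalent check would be comparing `y_train.mean()` and `y_test.mean()`, assuming the labels are coded 0/1.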

## 5) Baseline training (RandomForest)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

clf = Pipeline([
    ('pre', preprocessor),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])
clf.fit(X_train, y_train)
print('Baseline RandomForest trained')

## 6) Evaluation

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve, auc
import seaborn as sns, matplotlib.pyplot as plt

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted'); plt.ylabel('Actual'); plt.show()

y_scores = clf.predict_proba(X_test)[:,1]
precision, recall, _ = precision_recall_curve(y_test, y_scores)
print('PR-AUC:', auc(recall, precision))
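
The trapezoidal `auc(recall, precision)` interpolates linearly between points of the PR curve, which the scikit-learn documentation notes can be too optimistic; `average_precision_score` is a non-interpolated alternative. It also helps to compare either number against the positive-class prevalence, which is roughly what an uninformative classifier achieves. The sketch below uses synthetic scores, not the notebook's model output.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Synthetic demo: about 10% positives, and positives tend to score higher
rng = np.random.default_rng(0)
y_true_demo = (rng.random(2000) < 0.1).astype(int)
scores_demo = np.clip(0.3 * y_true_demo + rng.normal(0.4, 0.15, size=2000), 0, 1)

prec, rec, _ = precision_recall_curve(y_true_demo, scores_demo)
pr_auc_trapz = auc(rec, prec)                            # trapezoidal rule
ap = average_precision_score(y_true_demo, scores_demo)   # step-wise sum
baseline = y_true_demo.mean()                            # prevalence of the positive class

print(f'PR-AUC (trapezoid): {pr_auc_trapz:.4f}')
print(f'Average precision:  {ap:.4f}')
print(f'Prevalence baseline: {baseline:.4f}')
```

In the notebook, the same comparison would use `y_test` and `y_scores` in place of the synthetic arrays.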

## 7) Interpretation with SHAP (sampled for performance)
Compute SHAP values on a small sample for efficiency. If you run into memory issues, use fewer samples.

In [None]:
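import shap
import numpy as np

# For tree models use TreeExplainer
explainer = shap.TreeExplainer(clf.named_steps['rf'])

# Sample rows to explain and transform them with the fitted preprocessor from the pipeline
n_sample = min(200, len(X_test))
X_sample = X_test.sample(n_sample, random_state=42)
pre = clf.named_steps['pre']
X_sample_pre = pre.transform(X_sample)
if hasattr(X_sample_pre, 'toarray'):
    X_sample_pre = X_sample_pre.toarray()  # densify if the transformer returned a sparse matrix
shap_values = explainer.shap_values(X_sample_pre)

# Older SHAP releases return a list with one array per class; newer ones return a
# single array of shape (n_samples, n_features, n_classes). Select class 1 = phishing.
if isinstance(shap_values, list):
    sv_phish = shap_values[1]
else:
    sv_phish = shap_values[:, :, 1] if np.ndim(shap_values) == 3 else shap_values

# Summary plot (class 1 = phishing), labelled with the transformed feature names
shap.summary_plot(sv_phish, X_sample_pre, feature_names=pre.get_feature_names_out(), show=True)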


## 8) Save artifacts (model + metrics)
Save the trained pipeline and its PR-AUC so the run can be reproduced and compared later.

In [None]:
import joblib, json, os
os.makedirs('experiments/run_01/artifacts', exist_ok=True)
joblib.dump(clf, 'experiments/run_01/artifacts/phishing_rf_v1.joblib')
metrics = {'pr_auc': float(auc(recall, precision))}
with open('experiments/run_01/metrics.json','w') as f:
    json.dump(metrics, f, indent=2)
print('Saved model and metrics to experiments/run_01/')
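
Before relying on a saved artifact, it is worth confirming that it loads and predicts the same values. The sketch below is a self-contained illustration on synthetic data with a small stand-in model and a temporary directory; in the notebook the same check would load `phishing_rf_v1.joblib` and compare against `clf` on `X_test`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data (not the phishing pipeline)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Round-trip through joblib in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_rf.joblib')
    joblib.dump(model, path)
    restored = joblib.load(path)

same = bool(np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo)))
print('Round-trip predictions identical:', same)
```

Keep in mind that joblib files are pickles: load them only from trusted sources, and preferably with the same scikit-learn version that wrote them.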

## Next steps / notes
- Tune hyperparameters (RandomizedSearchCV, Optuna) optimizing PR-AUC or F1 according to operational needs.
- Run SHAP on specific false positive / false negative examples to investigate causes.
- If you want the full Kaggle dataset and have a `kaggle.json`, add and run a Kaggle download cell (not included by default here).
- For productionization: containerize the inference code and expose a small REST API (FastAPI) that returns probabilities and top explanation features.
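
As a starting point for the tuning step, a randomized search scored on average precision (a PR-AUC estimate) might look like the sketch below. The data is synthetic (`make_classification`), and the search ranges and `n_iter=5` are arbitrary illustrative choices rather than recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic, imbalanced demo data (not the phishing dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': randint(50, 150),
        'max_depth': [None, 5, 10],
        'min_samples_leaf': randint(1, 5),
    },
    n_iter=5,                       # small budget to keep the demo fast
    scoring='average_precision',    # optimize a PR-AUC style metric
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_demo, y_demo)

print('Best params:', search.best_params_)
print('Best CV average precision:', round(search.best_score_, 4))
```

When searching over the notebook's full pipeline instead of a bare classifier, the parameter names need the step prefix, for example `rf__n_estimators` and `rf__max_depth`.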
