## **Baseline-calibration**
- This notebook builds trustworthy baseline models for phishing detection. The aim is not state-of-the-art performance yet; it is a solid foundation we can reason about, debug, and improve later.

### **Load libraries for data handling, modeling, calibration, metrics, and experiment tracking**

In [1]:
from pathlib import Path
import os, json, numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import f1_score, average_precision_score, brier_score_loss
from xgboost import XGBClassifier
import mlflow
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

python-dotenv could not parse statement starting at line 1
python-dotenv could not parse statement starting at line 3
python-dotenv could not parse statement starting at line 7
python-dotenv could not parse statement starting at line 10
python-dotenv could not parse statement starting at line 11
python-dotenv could not parse statement starting at line 12
python-dotenv could not parse statement starting at line 13
python-dotenv could not parse statement starting at line 14
python-dotenv could not parse statement starting at line 15


True

### **Set working directory to root**

In [2]:
%pwd
os.chdir("../")
%pwd



'd:\\MLops\\NetworkSecurity'

### **Config & paths**

In [3]:
SEED = 42
RAW = Path("data/raw/PhiUSIIL_Phishing_URL_Dataset.csv")
CLEAN = Path("data/processed/phiusiil_clean.csv")
DATA = CLEAN if CLEAN.exists() else RAW
THRESH_PATH = Path(os.getenv("THRESHOLDS_JSON", "configs/dev/thresholds.json"))
MLFLOW_URI = os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000")
EXPERIMENT = os.getenv("MLFLOW_EXPERIMENT", "phiusiil_baselines")
THRESH_PATH.parent.mkdir(parents=True, exist_ok=True)

## **Train, calibrate, evaluate, choose thresholds**

### **Load & split**

#### Intent: Load & Split

This block loads the cleaned phishing dataset when it exists and falls back to the raw file otherwise, identifies the label column, and prepares the features and labels for modeling. The URL column is set aside rather than used as a feature, and only numeric columns are kept for training. Finally, the data is split 80/20 into training and validation sets using stratified sampling, so both splits preserve the original class balance.

The goal is a clean, well-structured dataset so that the modeling steps that follow are reliable and reproducible: the fixed seed makes the split repeatable, and stratification keeps the validation set representative of the class mix.
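
To make the effect of stratified sampling concrete, here is a minimal sketch on hypothetical toy data (the arrays `X_toy` and `y_toy` and the 30/70 class mix are assumptions for illustration, not the PhiUSIIL data). It uses the same `train_test_split` arguments as the cell below and compares the phishing share in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical toy data: 1000 rows, 30% phishing (label 0), 70% legitimate (label 1).
y_toy = np.array([0] * 300 + [1] * 700)
X_toy = rng.normal(size=(1000, 4))

# Same call shape as the notebook: 80/20 split, stratified on the label, fixed seed.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

# Share of phishing rows (label 0) in each split; stratification keeps them aligned.
phish_rate_train = float((y_tr == 0).mean())
phish_rate_val = float((y_va == 0).mean())
print(phish_rate_train, phish_rate_val)
```

Without `stratify`, the two rates would differ by sampling noise, which matters more the rarer one class is.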

In [None]:
df = pd.read_csv(DATA, encoding_errors = "ignore")
label_col = next((c for c in df.columns if c.lower() in {"label","result","y","target"}), None)
assert label_col is not None, "No label column found"

y = df[label_col].astype(int).values            # 1=legit, 0=phish
X = df.drop(columns=[label_col])


if "URL" in X.columns:                         # Set the URL strings aside; they are not a model feature
    urls = X["URL"].astype(str).values
    X = X.drop(columns=["URL"])

else:
    urls = np.array([""] * len(X))            # Placeholder URLs so `urls` always exists


# Keep only numeric feature columns
X = X.select_dtypes(include=["number"]).copy()
# 80/20 split, stratified on the label so both splits keep the class balance
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, stratify=y, random_state=SEED)



### **Define candidates (uncalibrated base)**

In [None]:
logreg_base = Pipeline([
    ("scaler", StandardScaler(with_mean=False)),   # sparse-safe; no harm if dense
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced", random_state=SEED))

])

xgb_base = XGBClassifier(
    n_estimators=300, max_depth=6, learning_rate=0.1, subsample=0.9, colsample_bytree=0.9,
    reg_lambda=1.0, random_state=SEED, n_jobs=0, objective="binary:logistic",
    verbosity=0   # silence XGBoost logging; `verbose` is not an XGBClassifier parameter
)


candidates = {
    "logreg": logreg_base,
    "xgb": xgb_base,
}

### **Fit + calibrate + score**

In [None]:
def fit_calibrated(name, model):
    # isotonic calibration with 5-fold stratified CV
    calib = CalibratedClassifierCV(model,
                                   method="isotonic",
                                   cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
                                   )
    calib.fit(X_train, y_train)

    # we need p_malicious = P(y=0). predict_proba columns follow calib.classes_ ([0, 1] here),
    # so column 1 is P(y=1) (legit)
    p_legit = calib.predict_proba(X_val)[:, 1]
    p_mal = 1.0 - p_legit

    # core metrics
    y_pred = (p_mal < 0.5).astype(int)                                         # back to y-space: p_mal>=0.5 -> 0 (phish), else 1 (legit)
    f1m = f1_score(y_val, y_pred, average="macro")                             # temp decision at 0.5 on p_mal
    prauc = average_precision_score((y_val==0).astype(int), p_mal)             # AP wrt phishing as positive class
    brier = brier_score_loss((y_val==0).astype(int), p_mal)                     # smaller=better
    return calib, {"f1_macro@0.5_on_p_mal": float(f1m),
                   "pr_auc_phish": float(prauc),
                   "brier_phish": float(brier)}, p_mal

results, calibrated, pvals = {}, {}, {}
for name, model in candidates.items():
    cls, metrics, p_mal = fit_calibrated(name, model)
    calibrated[name] = cls
    pvals[name] = p_mal
    results[name] = metrics
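
Because the labels here use 0 for phishing, it is easy to pick the wrong `predict_proba` column. The following is a minimal sketch on synthetic data (the `make_classification` data and all `*_demo` names are assumptions for illustration) that looks up the phishing column through `classes_` instead of hard-coding an index, then scores the result with the Brier loss.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; label 0 plays the role of "phish", label 1 of "legit".
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Isotonic calibration wrapped around an uncalibrated base model, 5-fold CV.
calib_demo = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
calib_demo.fit(X_tr, y_tr)

# predict_proba columns are ordered like classes_, so find the column for class 0.
proba = calib_demo.predict_proba(X_va)
phish_col = int(np.where(calib_demo.classes_ == 0)[0][0])
p_mal_demo = proba[:, phish_col]

# Brier score with phishing as the positive event (smaller is better).
is_phish = (y_va == 0).astype(int)
brier_demo = float(brier_score_loss(is_phish, p_mal_demo))
print(phish_col, round(brier_demo, 4))
```

For a binary problem this gives the same values as `1.0 - predict_proba(...)[:, 1]`; the lookup only guards against a different label encoding.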

### **Pick best by PR-AUC (tie-break F1)**

In [None]:
order = sorted(results.items(), key=lambda kv: (kv[1]["pr_auc_phish"], kv[1]["f1_macro@0.5_on_p_mal"]), reverse=True)
best_name, best_metrics = order[0]
best_model = calibrated[best_name]
p_mal = pvals[best_name]
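
The tie-break relies on Python comparing tuples element by element. A small sketch with hypothetical metric values (the model names and numbers are made up for illustration and are not results from this notebook):

```python
# Hypothetical metric values, for illustration only.
toy_results = {
    "model_a": {"pr_auc_phish": 0.90, "f1_macro@0.5_on_p_mal": 0.80},
    "model_b": {"pr_auc_phish": 0.90, "f1_macro@0.5_on_p_mal": 0.85},
    "model_c": {"pr_auc_phish": 0.88, "f1_macro@0.5_on_p_mal": 0.95},
}

# Sort by (PR-AUC, F1) descending: F1 only matters when PR-AUC is tied.
ranked = sorted(
    toy_results.items(),
    key=lambda kv: (kv[1]["pr_auc_phish"], kv[1]["f1_macro@0.5_on_p_mal"]),
    reverse=True,
)
toy_best = ranked[0][0]
print(toy_best)  # model_b: ties model_a on PR-AUC, wins on F1
```

Note that `model_c` ranks last despite the highest F1, because PR-AUC is compared first.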

### **Find single threshold (t) maximizing F1-macro**

In [None]:
grid = np.linspace(0.05, 0.95, 19)
f1s = []
for t in grid:
    y_hat = (p_mal >= t).astype(int)         # 1=phish prediction if p_mal>=t
    # but our y is 0=phish, 1=legit → map predictions to y-space:
    y_pred = 1 - y_hat
    f1s.append(f1_score(y_val, y_pred, average="macro"))
t_star = float(grid[int(np.argmax(f1s))])
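
The same sweep can be written as a small function and checked on toy data. This is a sketch under stated assumptions: `sweep_f1_macro` is a hypothetical helper name, and the toy labels and probabilities are invented. It keeps the notebook's convention (0=phish, 1=legit) by mapping `p_mal >= t` directly to label 0.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical toy values; y follows the notebook's convention (0=phish, 1=legit).
y_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
p_mal_toy = np.array([0.92, 0.81, 0.37, 0.42, 0.22, 0.12, 0.07, 0.31])

def sweep_f1_macro(y_true, p_mal, grid):
    """Return the macro-F1 obtained at each threshold in `grid`."""
    scores = []
    for t in grid:
        # p_mal >= t means "predict phish", which is label 0 in y-space.
        y_pred = np.where(p_mal >= t, 0, 1)
        scores.append(f1_score(y_true, y_pred, average="macro"))
    return np.array(scores)

toy_grid = np.linspace(0.05, 0.95, 19)
toy_scores = sweep_f1_macro(y_toy, p_mal_toy, toy_grid)
toy_t_star = float(toy_grid[int(np.argmax(toy_scores))])
print(toy_t_star, float(toy_scores.max()))
```

`np.argmax` returns the first maximum, so when several thresholds tie, the lowest one on the grid is chosen.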

### **Expand to gray-zone band around t targeting ~10–15%**

In [None]:
target_lo, target_hi = 0.10, 0.15
band_candidates = np.linspace(0.05, 0.40, 8)     # half-widths
# default band (t_star ± 0.10), used if no candidate lands in the target range;
# store its actual gray-zone rate rather than a 0.0 placeholder
low, high = max(0.0, t_star - 0.10), min(1.0, t_star + 0.10)
chosen = (t_star, float(low), float(high), float(((p_mal >= low) & (p_mal < high)).mean()))
for w in band_candidates:
    low, high = max(0.0, t_star - w), min(1.0, t_star + w)
    gray = ((p_mal >= low) & (p_mal < high)).mean()
    if target_lo <= gray <= target_hi:
        chosen = (t_star, float(low), float(high), float(gray))
        break
t_star, low, high, gray_rate = chosen
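
One way to read the band is as a three-way decision. The notebook does not say how the gray zone is consumed downstream, so the sketch below is an assumption: `triage` is a hypothetical helper, and the `"legit"` / `"review"` / `"phish"` labels and toy probabilities are invented. It uses the same interval convention as the cell above (`low <= p_mal < high` is gray).

```python
import numpy as np

def triage(p_mal, low, high):
    """Three-way decision: p_mal < low -> legit, p_mal >= high -> phish, otherwise review."""
    p_mal = np.asarray(p_mal, dtype=float)
    return np.where(p_mal >= high, "phish", np.where(p_mal < low, "legit", "review"))

# Hypothetical calibrated phishing probabilities.
p_toy = np.array([0.02, 0.10, 0.48, 0.55, 0.97, 0.99, 0.01, 0.03, 0.95, 0.60])
labels = triage(p_toy, low=0.45, high=0.65)

# Fraction of URLs that fall in the gray zone and would need a second look.
review_rate = float((labels == "review").mean())
print(labels.tolist(), review_rate)
```

Widening the band lowers the number of forced decisions near the threshold at the cost of a higher review rate, which is the trade-off the 10–15% target constrains.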

### **Final metrics (forced decision and gray-zone rate)**

In [None]:
y_hat_star = (p_mal >= t_star).astype(int)
y_pred_star = 1 - y_hat_star
final_f1 = f1_score(y_val, y_pred_star, average="macro")
final_pr = average_precision_score((y_val==0).astype(int), p_mal)

summary = {
    "data_file": str(DATA),
    "best_model": best_name,
    "metrics_val": {
        "pr_auc_phish": final_pr,
        "f1_macro@t_star": final_f1,
        "brier_phish": brier_score_loss((y_val==0).astype(int), p_mal),
    },
    "thresholds": {"t_star": t_star, "low": low, "high": high, "gray_zone_rate": gray_rate},
    "class_mapping": {"phish": 0, "legit": 1},
    "seed": SEED,
}
print("Selection:", best_name, best_metrics)
print("Thresholds:", summary["thresholds"])
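
The config cell defines `THRESH_PATH` and creates its parent directory, but this section never writes to it. Assuming the thresholds are meant to be persisted there as JSON (an assumption, not something the notebook confirms), a minimal round-trip sketch looks like this. It uses hypothetical values and a temporary directory so it does not touch the real config.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical values standing in for the notebook's summary["thresholds"].
toy_thresholds = {"t_star": 0.35, "low": 0.25, "high": 0.45, "gray_zone_rate": 0.12}

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the shape of the default path under a throwaway root.
    out_path = Path(tmp) / "configs" / "dev" / "thresholds.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Write, then read back to confirm the file round-trips cleanly.
    out_path.write_text(json.dumps(toy_thresholds, indent=2), encoding="utf-8")
    reloaded = json.loads(out_path.read_text(encoding="utf-8"))

print(reloaded)
```

In the notebook itself, the equivalent would write `summary["thresholds"]` to `THRESH_PATH`; casting values with `float(...)` first avoids surprises from NumPy scalar types.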