# **WEEK 2**

## Dataset 1: Web page Phishing Detection Dataset

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
import kagglehub

# Dataset 1: Web page Phishing Detection
path = kagglehub.dataset_download("shashwatwork/web-page-phishing-detection-dataset")
dataset1 = pd.read_csv(path + "/dataset_phishing.csv")

# Encode target to match Week 1 convention: phishing=1, legitimate=0
dataset1["status"] = dataset1["status"].map({"phishing": 1, "legitimate": 0})

# Features = all numeric except 'url' and target
X = dataset1.drop(columns=[c for c in ["status", "url"] if c in dataset1.columns]).select_dtypes("number")
y = dataset1["status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42, test_size=0.2)

### **Linear regression with lasso, ridge, and elastic net regression**

Because the target `status` is binary, this is a classification problem. Plain linear regression is a poor fit: its predictions are unbounded and can fall outside the [0, 1] range, so they cannot be read as probabilities. We therefore use logistic regression with the corresponding penalties (L2 for Ridge, L1 for Lasso, and a mix of both for Elastic Net). This keeps the benefits of regularization (better generalization and, with L1, feature selection) in a classification setting.
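
As a small illustration of the out-of-range issue, the sketch below fits both models on synthetic one-dimensional data (not the phishing dataset; the data and the names `x_extreme`, `lin_pred`, and `log_prob` are made up for this example) and compares their outputs at extreme inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic binary target driven by one noisy feature (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y = (x[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

lin = LinearRegression().fit(x, y)
log = LogisticRegression().fit(x, y)  # default settings use an L2 penalty

# Inputs far from the bulk of the training data
x_extreme = np.array([[-4.0], [4.0]])
lin_pred = lin.predict(x_extreme)              # unbounded real values
log_prob = log.predict_proba(x_extreme)[:, 1]  # probabilities in [0, 1]

print("Linear regression outputs:", lin_pred)
print("Logistic probabilities:   ", log_prob)
```

The linear model's outputs leave the [0, 1] interval at the extremes, while the logistic model's outputs stay inside it by construction.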

In [14]:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Define models
ridge = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000, random_state=42)
lasso = LogisticRegression(penalty="l1", solver="saga", max_iter=5000, random_state=42)
elastic = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga", max_iter=5000, random_state=42)

# Cross-validation setup
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

# Evaluate models using F1-score (since phishing detection is imbalanced)
ridge_scores = cross_val_score(ridge, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)
lasso_scores = cross_val_score(lasso, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)
elastic_scores = cross_val_score(elastic, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)

print("Ridge F1: %.3f ± %.3f" % (ridge_scores.mean(), ridge_scores.std()))
print("Lasso F1: %.3f ± %.3f" % (lasso_scores.mean(), lasso_scores.std()))
print("Elastic Net F1: %.3f ± %.3f" % (elastic_scores.mean(), elastic_scores.std()))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Ridge F1: 0.844 ± 0.012
Lasso F1: 0.326 ± 0.016
Elastic Net F1: 0.326 ± 0.016


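With the current split and no feature scaling, Ridge is clearly the best of the three regularized logistic regression models. The convergence warnings above show that the solvers hit the iteration limit, so the Lasso and Elastic Net scores (both 0.326) reflect unconverged fits rather than a real weakness of the penalties.

What to change (only two things):

- Scale the features. The `saga` solver used for L1 and Elastic Net converges poorly on unscaled features, and the penalties themselves treat coefficients unevenly when feature scales differ.
- Use stratified CV to keep the class balance in every fold.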

In [15]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# 1) Scale features (fit on train only)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)

# 2) Stratified CV
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

# Same models as before (no other changes)
ridge = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000, random_state=42)
lasso = LogisticRegression(penalty="l1", solver="saga", max_iter=5000, random_state=42)
elastic = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga", max_iter=5000, random_state=42)

# Evaluate (note: use X_train_s instead of X_train)
ridge_scores = cross_val_score(ridge,   X_train_s, y_train, scoring="f1", cv=cv, n_jobs=-1)
lasso_scores = cross_val_score(lasso,   X_train_s, y_train, scoring="f1", cv=cv, n_jobs=-1)
elastic_scores = cross_val_score(elastic, X_train_s, y_train, scoring="f1", cv=cv, n_jobs=-1)

print("Ridge F1:  %.3f ± %.3f" % (ridge_scores.mean(), ridge_scores.std()))
print("Lasso F1:  %.3f ± %.3f" % (lasso_scores.mean(), lasso_scores.std()))
print("Elastic F1: %.3f ± %.3f" % (elastic_scores.mean(), elastic_scores.std()))


Ridge F1:  0.947 ± 0.004
Lasso F1:  0.947 ± 0.004
Elastic F1: 0.947 ± 0.004
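
One caveat about the cell above: the scaler is fitted on the whole training set before cross-validation, so each validation fold has already contributed to the mean and variance used for scaling. A common alternative is to put the scaler inside a pipeline so it is refitted on the training part of every fold. The sketch below shows that pattern on synthetic data; `X_demo`, `y_demo`, `cv_demo`, and `pipe_demo` are stand-ins invented for this example, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, not the phishing data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=42
)
X_demo[:, :5] *= 1000.0  # exaggerate scale differences between features

cv_demo = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# The scaler is part of the estimator, so it is fitted per training fold
pipe_demo = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", max_iter=5000, random_state=42),
)

scores_demo = cross_val_score(pipe_demo, X_demo, y_demo, scoring="f1", cv=cv_demo)
print("Lasso F1 (scaler inside CV): %.3f ± %.3f" % (scores_demo.mean(), scores_demo.std()))
```

On a dataset of this size the difference from scaling up front is usually small, but the pipeline form also makes it harder to forget to transform `X_test` later.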


**Interpretation of Results**

1. **Ridge (L2 penalty)**
   - Reaches a high cross-validated F1 of 0.947 ± 0.004.
   - Handles correlated features well by shrinking coefficients smoothly.
   - A reliable baseline for regularized logistic regression.

2. **Lasso (L1 penalty)**
   - Matches Ridge to three decimals.
   - Once the features were standardized, the solver converged and Lasso no longer collapsed to F1 = 0.326.
   - At the default regularization strength (C = 1.0), the L1 penalty costs no F1 here. How many coefficients it actually sets to zero was not inspected, so this run says little about feature redundancy.
   - Lasso can still provide feature selection if interpretability is needed.

3. **Elastic Net (L1 + L2 penalty)**
   - Also matches to three decimals.
   - Blends the Ridge and Lasso effects, but adds no measurable benefit here because both extremes already reach the same score.

**Conclusion**

After correcting the preprocessing (scaling + stratified CV), all three regularized logistic regression methods reach the same cross-validated F1 (0.947 ± 0.004).

- Ridge is stable for collinear features.
- Lasso is useful if you want a sparse, interpretable model.
- Elastic Net offers stability and sparsity, though it does not outperform the others here.

Once standardized, Dataset 1 gives no performance reason to prefer one penalty over another, so the choice among Ridge, Lasso, and Elastic Net can be based on interpretability needs.
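
The sparsity remarks about Lasso and Elastic Net can be checked by counting non-zero coefficients after fitting. The sketch below does this on synthetic standardized data at two regularization strengths; the data, the helper `nonzero_counts`, and the value C = 0.05 are assumptions made for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized training features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

def nonzero_counts(C):
    """Fit the three penalized models at strength C and count non-zero coefficients."""
    common = dict(C=C, max_iter=5000, random_state=42)
    models = {
        "ridge": LogisticRegression(penalty="l2", solver="lbfgs", **common),
        "lasso": LogisticRegression(penalty="l1", solver="saga", **common),
        "elastic": LogisticRegression(
            penalty="elasticnet", l1_ratio=0.5, solver="saga", **common
        ),
    }
    return {
        name: int(np.count_nonzero(model.fit(X_demo, y_demo).coef_))
        for name, model in models.items()
    }

counts_default = nonzero_counts(C=1.0)   # scikit-learn's default strength
counts_strong = nonzero_counts(C=0.05)   # smaller C means a stronger penalty

print("C=1.0 :", counts_default)
print("C=0.05:", counts_strong)
```

Running the same count on the fitted Dataset 1 models would show whether Lasso is actually dropping features at C = 1.0 or merely shrinking them.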

## Dataset 2: Phishing Email Detection

In [16]:
# Dataset 2: Phishing Email Detection
path = kagglehub.dataset_download("subhajournal/phishingemails")
dataset2 = pd.read_csv(path + "/Phishing_Email.csv")
if "Unnamed: 0" in dataset2.columns:
    dataset2 = dataset2.drop(columns=["Unnamed: 0"])
dataset2 = dataset2.dropna(subset=["Email Text"])

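# Ensure numeric target: 0 = Phishing, 1 = Safe
# Note: this is the reverse of the Dataset 1 encoding (phishing=1), so any later
# scoring="f1" reports F1 for the Safe class (label 1), not the phishing class.
if dataset2["Email Type"].dtype == object:
    dataset2["Email Type"] = dataset2["Email Type"].map({"Phishing Email": 0, "Safe Email": 1})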
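
With this encoding the phishing class is label 0, while scikit-learn's `f1` scorer treats label 1 as the positive class by default. If the phishing-class F1 is the quantity of interest, one option is to pass `pos_label=0`. The sketch below uses a few made-up labels to show the difference; `phishing_f1` is a hypothetical scorer name that could be passed as `scoring=` to `cross_val_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer

# Toy labels using the Dataset 2 encoding: 0 = phishing, 1 = safe (made up)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0])

f1_safe = f1_score(y_true, y_pred)                # default pos_label=1 (safe)
f1_phish = f1_score(y_true, y_pred, pos_label=0)  # phishing as positive class

# Scorer object usable as cross_val_score(..., scoring=phishing_f1)
phishing_f1 = make_scorer(f1_score, pos_label=0)

print("F1 (safe as positive):     %.3f" % f1_safe)
print("F1 (phishing as positive): %.3f" % f1_phish)
```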

### **Linear regression with lasso, ridge, and elastic net regression**

Since the target is binary, we again use logistic regression with penalties (Ridge, Lasso, Elastic Net).

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Models
ridge = LogisticRegression(penalty="l2", solver="lbfgs", max_iter=5000, random_state=42)
lasso = LogisticRegression(penalty="l1", solver="saga", max_iter=5000, random_state=42)
elastic = LogisticRegression(penalty="elasticnet", l1_ratio=0.5, solver="saga", max_iter=5000, random_state=42)

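# Cross-validation
# NOTE: X_train / y_train have not been rebuilt from dataset2 at this point, so they
# still hold the Dataset 1 split from In [13]; this cell re-scores Dataset 1.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)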

ridge_scores = cross_val_score(ridge, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)
lasso_scores = cross_val_score(lasso, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)
elastic_scores = cross_val_score(elastic, X_train, y_train, scoring='f1', cv=cv, n_jobs=-1)

print("Ridge F1: %.3f ± %.3f" % (ridge_scores.mean(), ridge_scores.std()))
print("Lasso F1: %.3f ± %.3f" % (lasso_scores.mean(), lasso_scores.std()))
print("Elastic Net F1: %.3f ± %.3f" % (elastic_scores.mean(), elastic_scores.std()))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

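Ridge F1: 0.844 ± 0.012
Lasso F1: 0.326 ± 0.016
Elastic Net F1: 0.326 ± 0.016

These numbers are identical to the unscaled Dataset 1 run because `X_train` and `y_train` still refer to the Dataset 1 split when this cell executes. They say nothing about the email data; the Dataset 2 split and text features are only created in the next cell.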


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, train_test_split, RepeatedStratifiedKFold
from sklearn.metrics import classification_report

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    dataset2["Email Text"].astype(str), dataset2["Email Type"],
    test_size=0.2, stratify=dataset2["Email Type"], random_state=42
)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

for pen in ["l2", "l1", "elasticnet"]:
    model = LogisticRegression(
        penalty=pen, solver="saga", class_weight="balanced",
        # l1_ratio only applies to the elastic net penalty
        l1_ratio=0.5 if pen == "elasticnet" else None,
        max_iter=5000, random_state=42
    )
    pipe = make_pipeline(
        TfidfVectorizer(max_features=50000, ngram_range=(1,2), min_df=2, max_df=0.9),
        TruncatedSVD(n_components=300, random_state=42),
        model
    )
    scores = cross_val_score(pipe, X_train, y_train, scoring="f1", cv=cv, n_jobs=-1)
    print(f"SVD | {pen.upper()} F1: {scores.mean():.3f} ± {scores.std():.3f}")

# Example final fit & report (pipe is the last pipeline built in the loop: Elastic Net)
pipe.fit(X_train, y_train)
print("\n=== Test Report ===")
print(classification_report(y_test, pipe.predict(X_test)))

SVD | L2 F1: 0.968 ± 0.003
SVD | L1 F1: 0.969 ± 0.002
SVD | ELASTICNET F1: 0.968 ± 0.003

=== Test Report ===
              precision    recall  f1-score   support

           0       0.94      0.98      0.96      1462
           1       0.99      0.96      0.97      2265

    accuracy                           0.97      3727
   macro avg       0.96      0.97      0.96      3727
weighted avg       0.97      0.97      0.97      3727



Regularized logistic regression on the TF-IDF + SVD representation of the email text performed strongly. All three penalties reached nearly identical cross-validated F1-scores: 0.968 ± 0.003 for Ridge, 0.969 ± 0.002 for Lasso, and 0.968 ± 0.003 for Elastic Net. Because the labels are encoded as 0 = phishing and 1 = safe, these cross-validated scores are for the safe class. On the held-out test set, the Elastic Net model reached 97% accuracy, with F1 = 0.96 for phishing and 0.97 for safe. Precision and recall differ somewhat by class (phishing: 0.94 precision, 0.98 recall; safe: 0.99 precision, 0.96 recall). These results suggest that the 300-component SVD representation retains enough discriminative information for this task. Since no model was trained on the full TF-IDF features, they do not show whether SVD improves stability or generalization.
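
To test whether the SVD step helps or merely does no harm, the same cross-validation can be run with and without it. The sketch below shows the shape of such a comparison on a tiny made-up corpus; the texts, the helper `make_text_pipeline`, and the choice of 5 components are all assumptions for illustration, and the scores it prints say nothing about the real email data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus using the Dataset 2 encoding: 0 = phishing, 1 = safe
phishing = [
    "verify your account now click this link",
    "urgent your password expires click here",
    "you won a prize claim your reward now",
    "confirm your bank details to avoid suspension",
    "click here to unlock your account today",
    "urgent action required verify your payment",
]
safe = [
    "meeting moved to thursday afternoon",
    "please find the report attached",
    "lunch tomorrow at noon works for me",
    "the build passed and the release is tagged",
    "thanks for the notes from the workshop",
    "can you review my draft before friday",
]
texts = phishing + safe
labels = [0] * len(phishing) + [1] * len(safe)

def make_text_pipeline(use_svd):
    """Build TF-IDF -> (optional SVD) -> logistic regression."""
    steps = [TfidfVectorizer()]
    if use_svd:
        steps.append(TruncatedSVD(n_components=5, random_state=42))
    steps.append(LogisticRegression(max_iter=5000, random_state=42))
    return make_pipeline(*steps)

cv_text = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
results = {
    name: cross_val_score(make_text_pipeline(use_svd), texts, labels,
                          scoring="f1", cv=cv_text)
    for name, use_svd in [("tfidf_only", False), ("tfidf_svd", True)]
}
for name, scores in results.items():
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```

Applied to the real split, the `tfidf_only` variant would provide the baseline that the SVD claim needs.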