# Baseline Phishing Email Detector

This notebook provides a very simple baseline phishing email detector using classical machine learning.

It is meant as a starting point for contributors to improve, not as a production-ready model.

## 1. Setup

Make sure you have installed the dependencies from the project root:

```bash
pip install -r requirements.txt
```

In [2]:
import pathlib
import sys

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

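# Add this project's src/ folder to the Python path so we can import data.py.
# Search the working directory and its parents for src/data.py instead of
# assuming the notebook sits a fixed number of levels below the project root.
CWD = pathlib.Path().resolve()
PROJECT_ROOT = next(
    (p for p in (CWD, *CWD.parents) if (p / "src" / "data.py").is_file()),
    None,
)
if PROJECT_ROOT is None:
    raise FileNotFoundError(f"Could not find src/data.py in {CWD} or its parents")
SRC_PATH = PROJECT_ROOT / "src"
if str(SRC_PATH) not in sys.path:
    sys.path.append(str(SRC_PATH))

from data import load_example_dataset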

## 2. Load a tiny example dataset

In this placeholder example, we use a very small, hard-coded dataset from `load_example_dataset`.

Contributors are encouraged to replace this with a real phishing dataset loader.

In [None]:
df = load_example_dataset()
df
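
The loader's contents are not shown in this notebook. As a rough sketch, assuming it returns a pandas DataFrame with the `text` and `label` columns that the training cell relies on, a stand-in could look like the following. The function name `load_example_dataset_sketch`, the messages, and the label values `phishing` and `legitimate` are all invented for illustration and may differ from what `data.py` actually contains.

```python
import pandas as pd


def load_example_dataset_sketch() -> pd.DataFrame:
    """Hypothetical stand-in for data.load_example_dataset (not the project's code)."""
    rows = [
        # (text, label) pairs; every message here is invented for illustration.
        ("Verify your account now or it will be suspended.", "phishing"),
        ("Click here to confirm your payment details.", "phishing"),
        ("You have won a prize, send your bank details to claim it.", "phishing"),
        ("Urgent: reset your password using this link.", "phishing"),
        ("Lunch at noon tomorrow?", "legitimate"),
        ("Here are the meeting notes from Monday.", "legitimate"),
        ("Can you review my pull request when you have time?", "legitimate"),
        ("The quarterly report is ready for your comments.", "legitimate"),
    ]
    return pd.DataFrame(rows, columns=["text", "label"])


sketch_df = load_example_dataset_sketch()
# Check the class balance before splitting; stratified splits need
# at least two examples of every class.
print(sketch_df["label"].value_counts())
```

Keeping the two classes balanced matters here because the split in the next section is stratified by label.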

## 3. Train a baseline model

We use a bag-of-words representation (`CountVectorizer`) and logistic regression.

This is similar to the intro notebook but organized for this specific project.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['text'],
    df['label'],
    test_size=0.3,
    random_state=42,
    stratify=df['label'],
)

model = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=1000),
)

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Test texts:', list(X_test))
print('True labels:', list(y_test))
print('Predicted labels:', list(y_pred))
print()
print(classification_report(y_test, y_pred, zero_division=0))

## 4. Try your own messages

You can type your own short email-like messages and see how the model classifies them. Remember that this small model is very fragile and easy to fool.

In [None]:
examples = [
    'Please update your payment information immediately or your account will be closed.',
    'Hi, just checking in about our meeting next week.',
    'Click this link to claim your urgent refund.',
]

preds = model.predict(examples)
for text, label in zip(examples, preds):
    print(f'{label} - {text}')
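
One reason a bag-of-words model is easy to fool: any word it has not seen during training contributes nothing to the prediction. The sketch below is a minimal illustration, assuming a vectorizer fitted on a single invented sentence. It counts how many tokens of a message are still recognized after digits are swapped in for letters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on one invented sentence so the vocabulary is small and easy to inspect.
vectorizer = CountVectorizer()
vectorizer.fit(["please update your payment information immediately"])

original = "update your payment information"
obfuscated = "upd4te your p4yment inf0rmation"  # digits swapped in for letters

known = {}
for text in (original, obfuscated):
    # Summing the count vector gives the number of tokens found in the vocabulary.
    known[text] = int(vectorizer.transform([text]).sum())
    print(f"{known[text]} known tokens in: {text}")
```

The obfuscated message keeps only the untouched word `your`, so the features that signalled phishing in the original are no longer present.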

## 5. Ideas for improvement

Some ideas for contributors:

- Use a larger, realistic phishing dataset.
- Try TF-IDF or other text representations.
- Evaluate with cross-validation.
- Add simple adversarial examples (e.g., obfuscating words).
- Compare different models (SVM, random forest, etc.).

If you implement a significant improvement, please add notes and explanations so that beginners can learn from your work.
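
As one possible starting point for several of the ideas above, the sketch below swaps in `TfidfVectorizer`, evaluates with stratified cross-validation, and compares logistic regression against a linear SVM. The eight messages and their labels (1 for phishing, 0 for legitimate) are invented for illustration. With so little data the scores are not meaningful; the point is only the shape of the experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data: 1 = phishing, 0 = legitimate.
texts = [
    "Verify your account now or it will be suspended.",
    "Click here to confirm your payment details.",
    "You have won a prize, send your bank details to claim it.",
    "Urgent: reset your password using this link.",
    "Lunch at noon tomorrow?",
    "Here are the meeting notes from Monday.",
    "Can you review my pull request when you have time?",
    "The quarterly report is ready for your comments.",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratified folds keep one example of each class in every test fold.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)

candidates = {
    "tfidf + logistic regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
    "tfidf + linear SVM": make_pipeline(TfidfVectorizer(), LinearSVC()),
}

results = {}
for name, pipeline in candidates.items():
    # The vectorizer is refitted inside each fold, so no test text leaks into training.
    scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.2f} over {len(scores)} folds")
```

Wrapping the vectorizer and classifier in a single pipeline is what makes the cross-validation honest: the vocabulary is learned from the training folds only.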