# Nemotron Phishing Detection Workshop

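This notebook walks through the full fine-tuning workflow on the Enron email dataset:
1. Download the dataset
2. Convert it to JSONL and trim it to a smaller sample
3. Evaluate the base Nemotron model
4. Fine-tune Nemotron with LoRA
5. Evaluate the fine-tuned model and compare the results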


## Install dependencies
If you're running in a fresh environment, install the workshop requirements.

In [1]:
!pip install -r ../requirements.txt



## Configure Kaggle API
Set your Kaggle credentials as environment variables before downloading the dataset. Replace the placeholder values below with your own username and API key.

In [None]:
import os
os.environ['KAGGLE_USERNAME'] = 'YYYYYYYYYY'
os.environ['KAGGLE_KEY'] = 'XXXXXXXXXXXXX'

## Download the dataset

In [3]:
!python ../scripts/download_dataset.py --output_dir ../data/raw

Downloading wcukierski/enron-email-dataset to ../data/raw...
Dataset URL: https://www.kaggle.com/datasets/wcukierski/enron-email-dataset
Download completed, but maildir was not found. Check the output directory.


## Convert to JSONL
This step uses a simple keyword heuristic to label each email as phishing or benign.

In [1]:
!python ../scripts/prepare_jsonl.py --input_csv ../data/raw/emails.csv --output_dir ../data/processed

Parsing emails: 517401it [05:59, 1440.15it/s]
Wrote JSONL files to ../data/processed
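
The labeling script itself is not shown in this notebook. As a rough illustration of what a keyword heuristic could look like, the sketch below flags an email as phishing when it contains at least a threshold number of suspicious phrases. The keyword list, the `label_email` name, and the `min_hits` threshold are assumptions made for illustration, not the contents of `prepare_jsonl.py`.

```python
# Hypothetical keyword list; the real script may use different terms.
SUSPICIOUS_KEYWORDS = (
    "verify your account",
    "password",
    "urgent",
    "click here",
    "wire transfer",
)

def label_email(text, min_hits=2):
    """Return 'phishing' if at least `min_hits` keywords appear, else 'benign'."""
    lowered = text.lower()
    hits = sum(1 for kw in SUSPICIOUS_KEYWORDS if kw in lowered)
    return "phishing" if hits >= min_hits else "benign"

print(label_email("URGENT: click here to verify your account"))
print(label_email("Meeting moved to 3pm, see attached agenda"))
```

Labels produced this way are noisy, so any accuracy measured against them reflects agreement with the heuristic rather than with human-verified ground truth.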


## Inspect dataset stats

In [2]:
import json
from pathlib import Path
stats = json.loads(Path('../data/processed/stats.json').read_text())
stats

{'total': 50000,
 'train': 40000,
 'val': 5000,
 'test': 5000,
 'phishing': 6567,
 'benign': 43433}
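
The classes are imbalanced, which matters when reading accuracy numbers later. A quick calculation from the stats above gives the phishing share and the accuracy that a trivial always-benign classifier would reach on the full dataset. The variable names are illustrative.

```python
# Values copied from the stats output above.
full_stats = {"total": 50000, "phishing": 6567, "benign": 43433}

phishing_share = full_stats["phishing"] / full_stats["total"]
# A classifier that always predicts the majority class reaches this accuracy.
majority_baseline = max(full_stats["phishing"], full_stats["benign"]) / full_stats["total"]

print(f"Phishing share: {phishing_share:.2%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.2%}")
```

Any model accuracy should be read against that majority-class baseline rather than against 50%.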

In [3]:
# Trim each split to 10% to target ~1 hour of training on an L4 GPU
import json
import random
from pathlib import Path

src_dir = Path('../data/processed')
dst_dir = Path('../data/processed_small')
dst_dir.mkdir(parents=True, exist_ok=True)
seed = 42
fraction = 0.1

def sample_jsonl(src, dst, fraction, seed):
    lines = Path(src).read_text().splitlines()
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction))
    sample = rng.sample(lines, k)
    Path(dst).write_text('\n'.join(sample) + '\n')
    return k

counts = {}
counts['train'] = sample_jsonl(src_dir / 'train.jsonl', dst_dir / 'train.jsonl', fraction, seed)
counts['val'] = sample_jsonl(src_dir / 'val.jsonl', dst_dir / 'val.jsonl', fraction, seed)
counts['test'] = sample_jsonl(src_dir / 'test.jsonl', dst_dir / 'test.jsonl', fraction, seed)

def count_labels(path):
    stats = {'phishing': 0, 'benign': 0, 'total': 0}
    for line in Path(path).read_text().splitlines():
        if not line:
            continue
        obj = json.loads(line)
        label = obj.get('label', '')
        if label in stats:
            stats[label] += 1
        stats['total'] += 1
    return stats

train_stats = count_labels(dst_dir / 'train.jsonl')
val_stats = count_labels(dst_dir / 'val.jsonl')
test_stats = count_labels(dst_dir / 'test.jsonl')
small_stats = {
    'total': train_stats['total'] + val_stats['total'] + test_stats['total'],
    'train': train_stats['total'],
    'val': val_stats['total'],
    'test': test_stats['total'],
    'phishing': train_stats['phishing'] + val_stats['phishing'] + test_stats['phishing'],
    'benign': train_stats['benign'] + val_stats['benign'] + test_stats['benign'],
}
(dst_dir / 'stats.json').write_text(json.dumps(small_stats, indent=2))
small_stats


{'total': 5000,
 'train': 4000,
 'val': 500,
 'test': 500,
 'phishing': 649,
 'benign': 4351}
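
The sampling above draws lines uniformly at random, so the class ratio in the small splits only approximately matches the original (649 of 5000 here, versus 6567 of 50000). If an exact ratio matters, one option is to sample each label separately. The following is a sketch under that assumption; `stratified_sample` is a hypothetical helper, not part of the workshop scripts, and it assumes each record has a `label` field as in `count_labels`.

```python
import json
import random
from collections import defaultdict

def stratified_sample(lines, fraction, seed):
    """Sample `fraction` of the JSONL lines from each label group separately."""
    groups = defaultdict(list)
    for line in lines:
        if line:
            groups[json.loads(line).get("label", "")].append(line)
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):  # sorted so the result is reproducible
        members = groups[label]
        k = max(1, int(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    rng.shuffle(sample)  # avoid grouping all examples of one label together
    return sample

# Tiny synthetic demo: 20 phishing and 80 benign records.
demo = [json.dumps({"label": "phishing"})] * 20 + [json.dumps({"label": "benign"})] * 80
picked = stratified_sample(demo, fraction=0.1, seed=42)
print(len(picked), sum('"phishing"' in p for p in picked))
```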

## Evaluate the base model (run in terminal)
Serve the base model first, then score the test set and save the results to disk. Run both commands from the repository root, each in its own terminal.


Terminal A (serve):
```bash
python scripts/serve.py --model_name nvidia/Nemotron-Mini-4B-Instruct --port 8000
```

Terminal B (evaluate):
```bash
python scripts/test_model.py --endpoint http://127.0.0.1:8000/predict \
  --test_file data/processed_small/test.jsonl \
  --max_samples 5000 \
  --output_file outputs/eval_base.json
```

Stop the server with Ctrl+C when done.


## Fine-tune the model (run in terminal)
Training is long-running (the trimmed dataset targets roughly one hour on an L4 GPU), so run it from a shell at the repository root instead of from the notebook.


```bash
python scripts/train.py --data_dir data/processed_small --output_dir outputs \
  --model_name nvidia/Nemotron-Mini-4B-Instruct --num_train_epochs 1 --max_seq_length 512
```
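
LoRA keeps the base weights frozen and trains a pair of low-rank matrices per adapted layer, which sharply reduces the number of trainable parameters. For a weight matrix of shape `d_out x d_in` and rank `r`, the adapter adds `r * (d_in + d_out)` parameters. The layer dimensions and rank below are illustrative assumptions, not values read from `train.py` or from the Nemotron configuration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Hypothetical layer size and rank, chosen only to illustrate the ratio.
d_in, d_out, r = 4096, 4096, 16
full = d_in * d_out
lora = lora_param_count(d_in, d_out, r)
print(f"Full matrix: {full:,} params")
print(f"LoRA adapter: {lora:,} params ({lora / full:.2%} of the full matrix)")
```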


## Evaluate the fine-tuned model (run in terminal)
Serve the base model with the LoRA adapter loaded, then score the same test set and save the results to disk.


Terminal A (serve):
```bash
python scripts/serve.py --model_name nvidia/Nemotron-Mini-4B-Instruct \
  --adapter_dir outputs/adapter --port 8000
```

Terminal B (evaluate):
```bash
python scripts/test_model.py --endpoint http://127.0.0.1:8000/predict \
  --test_file data/processed_small/test.jsonl \
  --max_samples 5000 \
  --output_file outputs/eval_tuned.json
```

Stop the server with Ctrl+C when done.


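## Compare results
Load both evaluation files and print the base accuracy, the tuned accuracy, and the absolute gain.

In [None]: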
import json
from pathlib import Path

# Load the evaluation results saved by the two evaluation runs
base = json.loads(Path("../outputs/eval_base.json").read_text())
tuned = json.loads(Path("../outputs/eval_tuned.json").read_text())

def fmt(result):
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    return f"{result['accuracy']:.2%} ({result['correct']}/{result['total']})"

print("Base accuracy:", fmt(base))
print("Tuned accuracy:", fmt(tuned))
# Difference in percentage points (pp)
print("Absolute gain:", f"{(tuned['accuracy'] - base['accuracy']) * 100:.2f} pp")
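
Because only about 13% of the examples are labeled phishing, accuracy alone can hide a model that rarely flags phishing. If per-example predictions are available, precision and recall for the phishing class give a fuller picture. The sketch below assumes two parallel lists of label strings; whether `test_model.py` saves per-example predictions is not shown here, so `phishing_metrics` and its inputs are hypothetical.

```python
def phishing_metrics(y_true, y_pred, positive="phishing"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Made-up labels for illustration only.
y_true = ["phishing", "benign", "phishing", "benign", "benign"]
y_pred = ["phishing", "benign", "benign", "phishing", "benign"]
metrics = phishing_metrics(y_true, y_pred)
print(metrics)
```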
