# 🔧 Model Training for Phishing Detection

**Purpose**: Fine-tune Qwen2.5-1.5B for sequence classification on phishing detection.

This notebook:
- Creates training script with RSLoRA + MLflow
- Configures SageMaker PyTorch ModelTrainer
- Launches training job on ml.g5.xlarge
- Retrieves trained model from S3

## Prerequisites
- **Run `01_data_processing.ipynb` first**
- MLflow app created in SageMaker
- Budget: ~\$1.50-\$2.00 (60-75 minutes on ml.g5.xlarge)

## Next Steps
After training completes → `03_model_deployment.ipynb`

---

## 1. Setup and Installation

In [None]:
!pip install -Uq "sagemaker==2.253.1" "sagemaker-mlflow==0.2.0" mlflow

In [None]:
import boto3
import sagemaker
import os
import json
import time
from utils import get_mlflow_app_arn, find_latest_training_job, download_and_extract_model, upload_directory_to_s3, cleanup_local_files

## 2. Load Variables from Data Processing

In [None]:
%store -r train_s3_uri
%store -r val_s3_uri
%store -r test_s3_uri
%store -r NUM_LABELS
%store -r LABEL_NAMES
%store -r region
%store -r role
%store -r sagemaker_session_bucket

# Verify
try:
    print("✅ Variables loaded:")
    print(f"  Train: {train_s3_uri}")
    print(f"  Val: {val_s3_uri}")
    print(f"  Test: {test_s3_uri}")
except NameError:
    print("❌ Run 01_data_processing.ipynb first!")
    raise

## 3. SageMaker Configuration

In [None]:
sess = sagemaker.Session(boto3.Session(region_name=region))

print(f"SageMaker role: {role}")
print(f"SageMaker bucket: {sess.default_bucket()}")
print(f"Region: {region}")

## 4. MLflow Configuration

Auto-detect the MLflow app. If none is found, create one in SageMaker Console → MLflow.

In [None]:
MLFLOW_APP_ARN = get_mlflow_app_arn(region)
print(f"✅ MLflow ARN: {MLFLOW_APP_ARN}")

## 5. Create Training Code Structure

In [None]:
os.makedirs('sagemaker_code', exist_ok=True)
print("✅ Created sagemaker_code/ directory")

### 5.1 Requirements File

In [None]:
%%writefile sagemaker_code/requirements.txt
transformers==4.55.0
torch>=2.1.0
accelerate==1.10.0
peft==0.17.0
datasets==4.0.0
scikit-learn==1.7.1
mlflow
sagemaker-mlflow==0.2.0
sentencepiece==0.2.0
safetensors>=0.6.2
evaluate==0.4.5
psutil
nvidia-ml-py

### 5.2 Training Script

This training script demonstrates:
- **Sequence classification** for binary phishing detection
- **RSLoRA** (rank-stabilized LoRA) for efficient fine-tuning
- **MLflow integration** for metric tracking
- **Security-focused metrics** (precision, recall, F1, FPR, FNR)

Read through this code to understand how the model is configured and trained.
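
The RSLoRA bullet above refers to a change in how the adapter update is scaled: standard LoRA multiplies the low-rank update by `lora_alpha / r`, while rank-stabilized LoRA uses `lora_alpha / sqrt(r)`. The sketch below only illustrates that arithmetic with this notebook's settings (`lora_r=16`, `lora_alpha=32`). `lora_scaling` is a hypothetical helper written for illustration, not a PEFT function.

```python
import math


def lora_scaling(lora_alpha: float, r: int, use_rslora: bool = False) -> float:
    """Return the multiplier applied to the low-rank update (hypothetical helper)."""
    if r <= 0:
        raise ValueError("r must be positive")
    # Standard LoRA divides by r; rank-stabilized LoRA divides by sqrt(r).
    return lora_alpha / math.sqrt(r) if use_rslora else lora_alpha / r


standard = lora_scaling(32, 16)
rank_stabilized = lora_scaling(32, 16, use_rslora=True)
print(f"standard LoRA scale: {standard}")
print(f"RSLoRA scale:        {rank_stabilized}")
```

With these settings the RSLoRA multiplier is larger than the standard one, so a learning rate tuned for plain LoRA may need revisiting when `--use_rslora` is switched on.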

In [None]:
%%writefile sagemaker_code/train.py
"""
SageMaker Training Script for Qwen2.5-1.5B Phishing Detection
Binary Classification: Safe (0) vs Phishing (1)
"""

import os
import sys
import argparse
import json
import torch
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from peft import (
    LoraConfig,
    get_peft_model,
    TaskType,
)
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    classification_report,
    confusion_matrix,
)
import sagemaker
import mlflow

def parse_args():
    parser = argparse.ArgumentParser()
    
    # Model parameters
    parser.add_argument("--model_id", type=str, default="Qwen/Qwen2.5-1.5B-Instruct")
    parser.add_argument("--num_labels", type=int, default=2)
    parser.add_argument("--max_length", type=int, default=512)
    
    # Training hyperparameters
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=8)
    parser.add_argument("--eval_batch_size", type=int, default=8)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--weight_decay", type=float, default=0.01)
    parser.add_argument("--warmup_ratio", type=float, default=0.03)
    
    # LoRA parameters
    parser.add_argument("--lora_r", type=int, default=16)
    parser.add_argument("--lora_alpha", type=int, default=32)
    parser.add_argument("--lora_dropout", type=float, default=0.05)
    parser.add_argument("--use_rslora", action="store_true")
    parser.add_argument("--use_dora", action="store_true")
    
    # SageMaker specific
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    parser.add_argument("--train_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--validation_dir", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--test_dir", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--output_data_dir", type=str, default=os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"))
    
    return parser.parse_args()


def load_datasets(args):
    """Load train, validation, and test datasets from JSONL files."""
    datasets = {}
    
    for split_name, split_dir in [
        ('train', args.train_dir),
        ('validation', args.validation_dir),
        ('test', args.test_dir)
    ]:
        files = [os.path.join(split_dir, f) for f in os.listdir(split_dir) if f.endswith('.jsonl')]
        datasets[split_name] = load_dataset('json', data_files=files, split='train')
    
    print(f"Loaded datasets:")
    print(f"  Train: {len(datasets['train'])} samples")
    print(f"  Validation: {len(datasets['validation'])} samples")
    print(f"  Test: {len(datasets['test'])} samples")
    
    return datasets


def setup_model_and_tokenizer(args):
    """Initialize model and tokenizer with LoRA configuration."""
    print(f"\nSetting up model: {args.model_id}")
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_id,
        add_prefix_space=True,
        trust_remote_code=True,
    )

    # Qwen2.5 uses <|endoftext|> as pad token; fall back to EOS if the tokenizer defines none
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
        tokenizer.pad_token_id = tokenizer.eos_token_id
    
    # Load model
    model = AutoModelForSequenceClassification.from_pretrained(
        args.model_id,
        num_labels=args.num_labels,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.use_cache = False
    model.config.pretraining_tp = 1
    
    # LoRA configuration
    lora_config = LoraConfig(
        r=args.lora_r,
        lora_alpha=args.lora_alpha,
        target_modules=[
            "q_proj", "k_proj", "v_proj", "o_proj",
            "gate_proj", "up_proj", "down_proj",
        ],
        lora_dropout=args.lora_dropout,
        bias="none",
        task_type=TaskType.SEQ_CLS,   # Configuring the task as sequence classifier
        inference_mode=False,
        use_rslora=args.use_rslora,
        use_dora=args.use_dora,
    )
    
    # Enable gradient checkpointing and apply LoRA
    model.gradient_checkpointing_enable()
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    
    return model, tokenizer


def preprocess_function(examples, tokenizer, max_length):
    """Tokenize text inputs."""
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        max_length=max_length,
        padding=False,
    )
    tokenized['labels'] = examples['label']
    return tokenized


def compute_metrics(eval_pred):
    """Calculate security-focused metrics for phishing detection."""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    
    return {
        'accuracy': accuracy_score(labels, predictions),
        'balanced_accuracy': balanced_accuracy_score(labels, predictions),
        'precision': precision_score(labels, predictions, average='binary', pos_label=1),
        'recall': recall_score(labels, predictions, average='binary', pos_label=1),
        'f1': f1_score(labels, predictions, average='binary', pos_label=1),
        'f1_macro': f1_score(labels, predictions, average='macro'),
        'f1_weighted': f1_score(labels, predictions, average='weighted'),
    }


def print_training_summary(train_result):
    """Print training completion summary."""
    print("\n" + "="*80)
    print("TRAINING COMPLETED")
    print("="*80)
    print(f"Training time: {train_result.metrics['train_runtime']:.2f}s")
    print(f"Training samples/second: {train_result.metrics['train_samples_per_second']:.2f}")
    print(f"Final training loss: {train_result.metrics['train_loss']:.4f}")


def evaluate_and_save_results(trainer, tokenized_test_dataset, args):
    """Evaluate model on test set, calculate metrics, and save results."""
    print("\n7. Evaluating on test set...")
    
    # Get predictions
    predictions = trainer.predict(tokenized_test_dataset)
    pred_labels = np.argmax(predictions.predictions, axis=1)
    true_labels = predictions.label_ids
    
    # Calculate metrics
    test_results = {
        'eval_accuracy': float(accuracy_score(true_labels, pred_labels)),
        'eval_balanced_accuracy': float(balanced_accuracy_score(true_labels, pred_labels)),
        'eval_precision': float(precision_score(true_labels, pred_labels, average='binary', pos_label=1)),
        'eval_recall': float(recall_score(true_labels, pred_labels, average='binary', pos_label=1)),
        'eval_f1': float(f1_score(true_labels, pred_labels, average='binary', pos_label=1)),
        'eval_f1_macro': float(f1_score(true_labels, pred_labels, average='macro')),
        'eval_f1_weighted': float(f1_score(true_labels, pred_labels, average='weighted')),
    }
    
    # Calculate FPR and FNR; labels=[0, 1] keeps the matrix 2x2 even if a class is absent
    tn, fp, fn, tp = confusion_matrix(true_labels, pred_labels, labels=[0, 1]).ravel()
    test_results['false_positive_rate'] = float(fp / (fp + tn)) if (fp + tn) else 0.0
    test_results['false_negative_rate'] = float(fn / (fn + tp)) if (fn + tp) else 0.0
    
    # Print metrics
    print(f"\n🔒 Security Metrics (Test Set):")
    print(f"  Accuracy: {test_results['eval_accuracy']:.4f}")
    print(f"  Balanced Accuracy: {test_results['eval_balanced_accuracy']:.4f}")
    print(f"  Precision (Dangerous): {test_results['eval_precision']:.4f}")
    print(f"  Recall (Dangerous): {test_results['eval_recall']:.4f}")
    print(f"  F1 Score (Dangerous): {test_results['eval_f1']:.4f}")
    print(f"  F1 (Macro): {test_results['eval_f1_macro']:.4f}")
    print(f"  F1 (Weighted): {test_results['eval_f1_weighted']:.4f}")
    
    print(f"\n⚠️  Error Analysis:")
    print(f"  False Positive Rate: {test_results['false_positive_rate']:.4f} (Safe flagged as Dangerous)")
    print(f"  False Negative Rate: {test_results['false_negative_rate']:.4f} (Dangerous missed)")
    
    # Log to MLflow
    mlflow.log_metrics(test_results)
    
    # Save results
    with open(os.path.join(args.output_data_dir, 'test_results.json'), 'w') as f:
        json.dump(test_results, f, indent=2)
    
    # Save classification report
    report = classification_report(
        true_labels,
        pred_labels,
        target_names=['Safe', 'Phishing'],
        digits=4
    )
    
    with open(os.path.join(args.output_data_dir, 'classification_report.txt'), 'w') as f:
        f.write(report)
    
    print(f"\n📊 Classification Report:\n{report}")
    
    return test_results


def save_models(model, tokenizer, model_dir):
    """Save merged model only."""
    print("\n8. Saving merged model...")
    
    try:
        merged_model = model.merge_and_unload()
        merged_model.save_pretrained(model_dir, safe_serialization=True)
        tokenizer.save_pretrained(model_dir)
        
        print(f"✅ Merged model saved to: {model_dir}")
        print(f"   Files: model.safetensors, config.json, tokenizer files")
        
        return model_dir
    except Exception as e:
        print(f"⚠️  Could not merge adapters: {e}")
        raise


def main():
    args = parse_args()
    
    print("="*80)
    print("QWEN2.5-1.5B PHISHING DETECTION - SAGEMAKER TRAINING")
    print("="*80)
    print("Task: Binary Classification (Safe vs Phishing)")

    # Start MLflow run
    run_name = sagemaker.utils.name_from_base(f"phishing-{args.model_id.split('/')[-1]}-lr{args.learning_rate}-r{args.lora_r}")
    mlflow.start_run(run_name=run_name)
    
    # Load datasets
    print("\n1. Loading datasets...")
    datasets = load_datasets(args)
    
    # Setup model and tokenizer
    print("\n2. Setting up model and tokenizer...")
    model, tokenizer = setup_model_and_tokenizer(args)
    
    # Tokenize datasets (drops the raw 'text' column; tokenized inputs and labels remain)
    print("\n3. Tokenizing datasets...")
    tokenized_datasets = {
        split: datasets[split].map(
            lambda x: preprocess_function(x, tokenizer, args.max_length),
            batched=True,
            remove_columns=['text'],
            desc=f"Tokenizing {split}"
        )
        for split in datasets.keys()
    }
    
    # Data collator
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True)
    
    # Training arguments
    print("\n4. Setting up training arguments...")
    training_args = TrainingArguments(
        # Output
        output_dir=args.model_dir,
        
        # Training schedule
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        
        # Optimization
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        optim="adamw_torch_fused",
        adam_beta1=0.9,
        adam_beta2=0.999,
        
        # Learning rate schedule
        lr_scheduler_type="cosine",
        warmup_ratio=args.warmup_ratio,
        
        # Evaluation
        eval_strategy="steps",
        eval_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="eval_f1",
        
        # Logging
        logging_dir=f"{args.model_dir}/logs",
        logging_steps=10,
        logging_first_step=True,
        
        # Performance
        fp16=False,
        bf16=True,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={"use_reentrant": False},
        
        # Reproducibility
        seed=42,
        data_seed=42,
    )
    
    # Create Trainer
    print("\n5. Creating Trainer...")
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets['train'],
        eval_dataset=tokenized_datasets['validation'],
        processing_class=tokenizer,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )
    
    # Train
    print("\n6. Starting training...")
    print("="*80)
    train_result = trainer.train()
    print_training_summary(train_result)
    
    # Evaluate and save results
    test_results = evaluate_and_save_results(trainer, tokenized_datasets['test'], args)
    
    # Save models
    merged_model_dir = save_models(model, tokenizer, args.model_dir)
    
    # Final summary
    print(f"\n📁 Model Artifacts:")
    print(f"   Merged model: {args.model_dir}/")

    mlflow.end_run()
    
    print("\n" + "="*80)
    print("🎉 TRAINING JOB COMPLETED SUCCESSFULLY")
    print("="*80)


if __name__ == "__main__":
    main()

### 5.3 Launch Script

Bash script to install dependencies and run training.
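
The flags this script passes determine the effective batch size and, together with the dataset size, the number of optimizer steps and warmup steps. A rough sketch of that arithmetic follows. The sample count of 10,000 is a made-up placeholder, `training_schedule` is a hypothetical helper, and the rounding approximates what the Trainer does rather than reproducing its exact logic.

```python
import math


def training_schedule(num_samples, per_device_batch, grad_accum,
                      epochs, warmup_ratio, num_devices=1):
    """Estimate batch size and step counts for a run (hypothetical helper)."""
    # Samples consumed per optimizer update
    effective_batch = per_device_batch * grad_accum * num_devices
    # Forward/backward passes per epoch
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    # Optimizer updates per epoch (approximation)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch_size": effective_batch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values from launch_train.sh, with a placeholder dataset size
schedule = training_schedule(
    num_samples=10_000,
    per_device_batch=8,
    grad_accum=4,
    epochs=1,
    warmup_ratio=0.03,
)
print(schedule)
```

Plugging in the real size of the training split gives a feel for how many evaluations `eval_steps=50` will trigger during one epoch.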

In [None]:
%%writefile sagemaker_code/launch_train.sh
#!/bin/bash
set -e

echo "Installing dependencies..."
pip install -q -r requirements.txt

echo "Starting phishing detection training..."
python train.py \
    --model_id "Qwen/Qwen2.5-1.5B-Instruct" \
    --num_labels 2 \
    --max_length 512 \
    --epochs 1 \
    --train_batch_size 8 \
    --eval_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 1e-4 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lora_r 16 \
    --lora_alpha 32 \
    --lora_dropout 0.05 \
    --use_rslora \
    "$@"

In [None]:
# Make script executable
!chmod +x sagemaker_code/launch_train.sh

## 6. Configure Training Job

In [None]:
from sagemaker.modules.configs import Compute, OutputDataConfig, SourceCode, StoppingCondition, InputData
from sagemaker.modules.train import ModelTrainer

# Configuration
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"
job_name = sagemaker.utils.name_from_base("qwen2p5-1p5b-phishing-detection")
training_instance_type = "ml.g5.xlarge"
training_instance_count = 1

print(f"🔒 Training Job Configuration:")
print(f"  Job name: {job_name}")
print(f"  Instance: {training_instance_type}")
print(f"  Model: {MODEL_ID}")

In [None]:
# MLflow environment
training_env = {
    "MLFLOW_EXPERIMENT_NAME": f"{job_name}-exp",
    "MLFLOW_TAGS": json.dumps({
        "source.job": "sm-training-jobs",
        "source.type": "phishing-detection",
        "model": "qwen2.5-1.5b",
    }),
    "MLFLOW_TRACKING_URI": MLFLOW_APP_ARN,
    "MLFLOW_ENABLE_SYSTEM_METRICS_LOGGING": "true",
}

print("✅ MLflow environment configured")

In [None]:
# Get PyTorch training container
pytorch_image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=region,
    version="2.7.1",
    instance_type=training_instance_type,
    image_scope="training",
)

print(f"Using image: {pytorch_image_uri}")

In [None]:
# Configure ModelTrainer
source_code = SourceCode(
    source_dir="./sagemaker_code",
    command="bash launch_train.sh",
)

compute_configs = Compute(
    instance_type=training_instance_type,
    instance_count=training_instance_count,
    keep_alive_period_in_seconds=1800,
    volume_size_in_gb=100,
)

output_path = f"s3://{sess.default_bucket()}/phishing-detection/models/{job_name}"

model_trainer = ModelTrainer(
    training_image=pytorch_image_uri,
    source_code=source_code,
    base_job_name=job_name,
    compute=compute_configs,
    stopping_condition=StoppingCondition(max_runtime_in_seconds=7200),
    output_data_config=OutputDataConfig(s3_output_path=output_path),
    role=role,
    environment=training_env,
)

print(f"✅ ModelTrainer configured")
print(f"Output: {output_path}")

## 7. Launch Training Job

This will start a remote training job. Expected results:
- **Accuracy**: ~99%
- **F1 Score**: ~99%
- **Training time**: 60-75 minutes
- **Cost**: ~\$1.50-\$2.00
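
Accuracy near 99% says little about which kind of error remains, which is why the training script also reports false positive and false negative rates. A minimal sketch of how those rates follow from confusion-matrix counts is shown here. The counts are invented for illustration and are not results from this training job, and `error_rates` is a hypothetical helper.

```python
def error_rates(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive accuracy, FPR and FNR from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Share of safe messages wrongly flagged as phishing
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Share of phishing messages that slipped through
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Invented counts: 1,000 safe and 1,000 phishing test samples
rates = error_rates(tn=985, fp=15, fn=5, tp=995)
for name, value in rates.items():
    print(f"{name}: {value:.4f}")
```

Two models with the same accuracy can split their errors very differently between these two rates, so it is worth checking both in MLflow after the job finishes.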

In [None]:
print("🚀 Launching SageMaker Training Job...")
print(f"\nMonitor at: https://console.aws.amazon.com/sagemaker/home?region={region}#/jobs")
print("\n" + "="*80)

model_trainer.train(
    input_data_config=[
        InputData(channel_name="training", data_source=train_s3_uri),
        InputData(channel_name="validation", data_source=val_s3_uri),
        InputData(channel_name="test", data_source=test_s3_uri),
    ],
    wait=False,  # Set to True to wait for completion
)

print("\n✅ Training job submitted!")
print("\n💡 Next: Monitor in SageMaker Console and view metrics in MLflow")

## 8. Monitor Training Progress

### View Training Metrics in MLflow

While training runs, you can monitor progress in real-time:

**MLflow Experiments Dashboard**
- Navigate to SageMaker Console → MLflow
- View list of experiments and runs
- See training job metadata and status

![List of runs](./images/mlflow-view01.png)

**MLflow Run Details**
- Click on your training run
- View loss curves over training steps
- See evaluation metrics (accuracy, F1, precision, recall)
- Monitor system metrics (GPU utilization, memory)

![Run details](./images/mlflow-view02.png)

**MLflow Parameters**
- View hyperparameters tab
- See LoRA configuration (rank, alpha, dropout)
- Check training arguments (learning rate, batch size, epochs)

![Hyperparameters](./images/mlflow-view03.png)

Training will take approximately 60-75 minutes. Wait for the job to complete before proceeding.
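
Because the job was submitted with `wait=False`, one option is to poll its status from the notebook instead of refreshing the console. The sketch below assumes the caller passes in a function that behaves like `boto3.client("sagemaker").describe_training_job`. `wait_for_training_job` and the fake responses are hypothetical and exist only so the example runs offline.

```python
import time

# Statuses after which a SageMaker training job no longer changes
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe_job, job_name, poll_seconds=60,
                          max_polls=120, sleep=time.sleep):
    """Poll until the job reaches a terminal status and return that status."""
    for _ in range(max_polls):
        status = describe_job(TrainingJobName=job_name)["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")


# Offline demonstration with a stand-in for the real API call
_fake_responses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName):
    return {"TrainingJobStatus": next(_fake_responses)}


final_status = wait_for_training_job(
    fake_describe, "example-job", poll_seconds=0, sleep=lambda seconds: None
)
print(f"Final status: {final_status}")
```

In the notebook, `fake_describe` would be replaced by the real `describe_training_job` method of a SageMaker boto3 client, and `"example-job"` by the full training job name shown in the SageMaker console.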

## 9. Retrieve Training Results

After training completes, retrieve and extract the model artifacts.
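
The helpers used in the next cells come from the local `utils` module, which is not shown in this notebook. As a rough idea of what the extraction half of `download_and_extract_model` might involve, here is a standard-library sketch that unpacks a `model.tar.gz` after checking that no member escapes the target directory. `extract_model_archive` is a hypothetical name, and the real helper may differ.

```python
import os
import tarfile
import tempfile


def extract_model_archive(tar_path: str, extract_dir: str = "extracted_model") -> str:
    """Unpack a gzip-compressed tar archive into extract_dir and return that path."""
    os.makedirs(extract_dir, exist_ok=True)
    root = os.path.realpath(extract_dir)
    with tarfile.open(tar_path, "r:gz") as archive:
        # Refuse archives whose members would land outside extract_dir
        for member in archive.getmembers():
            target = os.path.realpath(os.path.join(root, member.name))
            if os.path.commonpath([root, target]) != root:
                raise ValueError(f"Unsafe path in archive: {member.name}")
        archive.extractall(root)
    return extract_dir


# Self-contained demonstration with a tiny stand-in archive
with tempfile.TemporaryDirectory() as workdir:
    source_file = os.path.join(workdir, "config.json")
    with open(source_file, "w") as f:
        f.write("{}")
    archive_path = os.path.join(workdir, "model.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(source_file, arcname="config.json")
    out_dir = extract_model_archive(archive_path, os.path.join(workdir, "extracted_model"))
    extracted_files = sorted(os.listdir(out_dir))

print(extracted_files)
```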

In [None]:
s3_client = boto3.client('s3')
bucket = sess.default_bucket()
base_prefix = f"phishing-detection/models/{job_name}"

training_job_name, training_job_prefix = find_latest_training_job(
    s3_client,
    bucket,
    base_prefix
)

print(f"\nLatest training job: {training_job_name}")

In [None]:
# Download and extract model
model_tar_key = f"{training_job_prefix}/output/model.tar.gz"

extract_dir = download_and_extract_model(
    s3_client,
    bucket,
    model_tar_key
)

print(f"\n✅ Model extracted to: {extract_dir}/")

In [None]:
# Upload uncompressed model back to S3
s3_model_prefix = f"{training_job_prefix}/uncompressed_model"

upload_directory_to_s3(
    s3_client,
    extract_dir,
    bucket,
    s3_model_prefix
)

model_s3_uri = f"s3://{bucket}/{s3_model_prefix}/"
print(f"\nModel available at: {model_s3_uri}")

In [None]:
# Cleanup local files
cleanup_local_files("model.tar.gz", extract_dir)

print("\n✅ Cleanup complete")

## 10. Store Variables for Deployment

Save model location for the deployment notebook.

In [None]:
%store model_s3_uri
%store training_job_name
%store MLFLOW_APP_ARN

mlflow_experiment_name = f"{job_name}-exp"
%store mlflow_experiment_name

print("\n✅ Variables stored:")
print(f"  Model S3 URI: {model_s3_uri}")
print(f"  Training job: {training_job_name}")
print(f"  MLflow experiment: {mlflow_experiment_name}")

## ✅ Training Complete!

### What We Accomplished:
1. ✅ Created training script with RSLoRA + MLflow
2. ✅ Configured SageMaker ModelTrainer
3. ✅ Launched training job on ml.g5.xlarge
4. ✅ Retrieved and extracted trained model
5. ✅ Stored model path for deployment

### Training Results:
- View complete metrics in MLflow dashboard
- Expected accuracy: ~99%
- Model artifacts saved to S3

### Next Steps:
**Proceed to `03_model_deployment.ipynb`** to deploy the model as a real-time endpoint.

---

**Training Time**: ~60-75 minutes  
**Training Cost**: ~\$1.50-\$2.00