# 🚀 Model Deployment for Phishing Detection

**Purpose**: Deploy the fine-tuned Qwen2.5-1.5B model as a real-time SageMaker endpoint.

This notebook:
- Loads trained model from S3
- Configures vLLM for text classification
- Creates SageMaker endpoint with LMI container
- Tests inference with sample emails

## Prerequisites
- **Run `02_model_training.ipynb` first**
- Trained model artifacts in S3
- Budget: ~$1.41/hour for ml.g5.xlarge endpoint

## Next Steps
After deployment → `04_benchmarking.ipynb`

---

## 1. Setup and Installation

In [None]:
!pip install -Uq "sagemaker==2.253.1"

In [None]:
import boto3
import sagemaker
import json
from botocore.config import Config

## 2. Load Variables from Training

In [None]:
%store -r model_s3_uri
%store -r training_job_name
%store -r NUM_LABELS
%store -r region
%store -r role
%store -r sagemaker_session_bucket

# Verify
try:
    print("✅ Variables loaded:")
    print(f"  Model S3 URI: {model_s3_uri}")
    print(f"  Training job: {training_job_name}")
    print(f"  Number of labels: {NUM_LABELS}")
except NameError:
    print("❌ Run 02_model_training.ipynb first!")
    raise
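
Before deploying, it can help to sanity-check that `model_s3_uri` really is an S3 URI. The sketch below uses a hypothetical helper, `split_s3_uri` (not part of the SageMaker SDK), built only on the standard library. The example URI is a placeholder, not a real artifact location.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an S3 URI into (bucket, key). Hypothetical helper, not a SageMaker API."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration; in the notebook, pass model_s3_uri instead.
bucket, key = split_s3_uri("s3://example-bucket/models/qwen-phishing/")
print(f"bucket={bucket}, key={key}")
```

A failure here points at a stale or missing `%store` value rather than at the deployment itself.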

## 3. SageMaker Configuration

In [None]:
sess = sagemaker.Session(boto3.Session(region_name=region))

print(f"SageMaker role: {role}")
print(f"SageMaker bucket: {sess.default_bucket()}")
print(f"Region: {region}")

## 4. Configure Deployment

We'll use SageMaker LMI containers with vLLM for optimized inference.

### Why vLLM for Text Classification?
- **Fast**: Classification needs one forward pass per email to score the labels, with no token-by-token decoding
- **Efficient**: Lower latency than full text generation
- **Cost-effective**: Can run on smaller instances

In [None]:
# Deployment configuration
inference_instance_type = "ml.g5.xlarge"
timeout = 900  # seconds; applied to container startup and model download
image_lmi_v18 = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.36.0-lmi18.0.0-cu128"

print(f"Instance type: {inference_instance_type}")
print(f"Container: {image_lmi_v18}")

In [None]:
# vLLM configuration for text classification
env_vars = {
    "HF_MODEL_ID": model_s3_uri,
    "NUM_GPUS": "1",
    "OPTION_TENSOR_PARALLELISM": "1",
    "OPTION_TASK": "text-classification",
    "OPTION_ENFORCE_EAGER": "true",
}

print("✅ vLLM environment configured for text classification")
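
SageMaker passes the container environment as a string-to-string map, which is why values such as `"1"` and `"true"` are quoted above. A minimal sketch of a pre-flight check follows, using a hypothetical helper `check_env_strings` and a sample dictionary that only mirrors the shape of `env_vars`:

```python
def check_env_strings(env):
    """Return the keys whose values are not strings. Hypothetical helper."""
    return [key for key, value in env.items() if not isinstance(value, str)]


# Sample values for illustration only; the S3 URI is a placeholder.
sample_env = {
    "HF_MODEL_ID": "s3://example-bucket/models/qwen-phishing/",
    "OPTION_TENSOR_PARALLELISM": 1,  # int on purpose: should be the string "1"
}
print(check_env_strings(sample_env))
```

An empty list means every value is already a string and the dictionary can be passed to the container definition as-is.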

## 5. Create SageMaker Model

This creates a model resource that references our trained model artifacts.

In [None]:
model_name = sagemaker.utils.name_from_base("qwen-phishing")

from sagemaker_core.shapes import ContainerDefinition
from sagemaker_core.resources import Model

boto_session = boto3.session.Session()

model = Model.create(
    model_name=model_name,
    primary_container=ContainerDefinition(
        image=image_lmi_v18,
        environment=env_vars
    ),
    execution_role_arn=role,
    session=boto_session,
    region=region,
)

print(f"✅ Model created: {model_name}")

## 6. Deploy Endpoint

This will create a real-time inference endpoint. Deployment takes ~5-10 minutes.

**Cost**: ~$1.41/hour while endpoint is active
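
To put the hourly rate in perspective, here is a rough cost estimate. It assumes the approximate $1.41/hour figure quoted in this notebook, which varies by region and over time; `estimate_endpoint_cost` is a hypothetical helper for illustration.

```python
HOURLY_RATE_USD = 1.41  # approximate rate quoted in this notebook; varies by region


def estimate_endpoint_cost(hours, instance_count=1, hourly_rate=HOURLY_RATE_USD):
    """Estimate endpoint cost in USD for a number of running hours."""
    return round(hours * instance_count * hourly_rate, 2)


print(estimate_endpoint_cost(8))        # one working day
print(estimate_endpoint_cost(24 * 30))  # left running for 30 days
```

At this rate, eight hours costs about $11, while an endpoint left running for 30 days costs roughly $1,015.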

In [None]:
from sagemaker_core.shapes import ProductionVariant, ProductionVariantRoutingConfig
from sagemaker_core.resources import EndpointConfig, Endpoint

print(f"🚀 Deploying endpoint: {model_name}")
print("This will take ~5-10 minutes...\n")

endpoint = Endpoint.create(
    endpoint_name=model_name,
    endpoint_config_name=EndpointConfig.create(
        endpoint_config_name=model_name,
        production_variants=[
            ProductionVariant(
                variant_name=model_name,
                initial_instance_count=1,
                instance_type=inference_instance_type,
                model_name=model,
                container_startup_health_check_timeout_in_seconds=timeout,
                model_data_download_timeout_in_seconds=timeout,
                routing_config=ProductionVariantRoutingConfig(
                    routing_strategy="LEAST_OUTSTANDING_REQUESTS"
                ),
            )
        ],
    ),
)

endpoint.wait_for_status("InService")

print("\n✅ Endpoint deployed and in service!")
print(f"Endpoint name: {model_name}")

## 7. Test Inference

Create a helper function to invoke the endpoint.

In [None]:
# max_attempts counts retries after the initial request, so 0 disables retries
no_retry_config = Config(retries={'max_attempts': 0})
runtime_client = boto3.client("sagemaker-runtime", region_name=region, config=no_retry_config)

def invoke_classification_endpoint(ep_name, texts):
    """
    Invoke SageMaker LMI classification endpoint.
    
    Args:
        ep_name: SageMaker endpoint name
        texts: Single string or list of strings to classify
    
    Returns:
        dict: Classification results with probabilities
    """
    if isinstance(texts, str):
        texts = [texts]
    
    payload = {
        "inputs": texts
    }
    
    response = runtime_client.invoke_endpoint(
        EndpointName=ep_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )
    
    result = json.loads(response['Body'].read().decode())
    
    return result

print("✅ Inference function ready")

### 7.1 Test Single Email

In [None]:
# Test with a phishing example
test_email = "Urgent! Click here to verify your account immediately!"

result = invoke_classification_endpoint(model_name, test_email)

print(f"Email: {test_email}")
print(f"\nResult:")
print(json.dumps(result, indent=2))

### 7.2 Test Batch Inference

In [None]:
# Test with multiple emails
test_emails = [
    "Urgent! Click here to verify your account!",
    "Meeting scheduled for tomorrow at 2pm",
    "You've won $1,000,000! Claim your prize now!",
    "Please review the attached quarterly report",
]

results = invoke_classification_endpoint(model_name, test_emails)

print("Batch classification results:\n")
for i, (email, result) in enumerate(zip(test_emails, results)):
    print(f"{i+1}. {email[:50]}...")
    print(f"   Result: {result}\n")
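
For lists much longer than the four emails above, one option is to split the input into fixed-size chunks and send one request per chunk, which keeps each payload small. This is a sketch only; `chunked` is a hypothetical helper and the chunk size of 2 is arbitrary.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


emails = [f"email {n}" for n in range(5)]
for batch in chunked(emails, 2):
    # In the notebook, each batch would go to
    # invoke_classification_endpoint(model_name, batch)
    print(batch)
```

The last chunk may be shorter than the others, so downstream code should not assume a fixed batch length.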

## 8. Store Variables for Benchmarking

In [None]:
endpoint_name = model_name

%store endpoint_name
%store model_name

print("\n✅ Variables stored:")
print(f"  Endpoint name: {endpoint_name}")
print(f"  Model name: {model_name}")

## ✅ Deployment Complete!

### What We Accomplished:
1. ✅ Loaded trained model from S3
2. ✅ Configured vLLM for text classification
3. ✅ Created SageMaker model resource
4. ✅ Deployed real-time endpoint
5. ✅ Tested inference (single + batch)
6. ✅ Stored endpoint info for benchmarking

### Your Endpoint is Live!
- **Endpoint name**: Stored in `endpoint_name`
- **Instance type**: ml.g5.xlarge
- **Cost**: ~$1.41/hour while running

### Next Steps:
**Proceed to `04_benchmarking.ipynb`** to evaluate endpoint performance.

⚠️ **Remember**: Delete the endpoint when done to avoid charges!
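
A minimal cleanup sketch, assuming the endpoint, endpoint config, and model all share the same name, as they do in this notebook. `delete_endpoint_resources` is a hypothetical helper; the three client methods it calls are standard boto3 SageMaker operations.

```python
def delete_endpoint_resources(sm_client, name):
    """Delete the endpoint, endpoint config, and model that share `name`.

    `sm_client` is expected to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker", region_name=region).
    """
    # Deleting the endpoint first stops the hourly instance charge.
    sm_client.delete_endpoint(EndpointName=name)
    sm_client.delete_endpoint_config(EndpointConfigName=name)
    sm_client.delete_model(ModelName=name)
    return name
```

In this notebook the call would look like `delete_endpoint_resources(boto3.client("sagemaker", region_name=region), model_name)`. Run it only after benchmarking is finished, since `04_benchmarking.ipynb` needs the live endpoint.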

---

**Deployment Time**: ~5-10 minutes  
**Hourly Cost**: ~$1.41