# PPO Training with Qwen2.5-0.5B on SageMaker

This notebook runs PPO (Proximal Policy Optimization) training for Qwen2.5-0.5B on SageMaker, following the workflow from `qwen-sentiment.py`.

## Key Features:
- Uses TRL 0.11.3, the version this workflow is known to work with
- Uses a BERT sentiment classifier as the reward function
- Follows the `qwen-sentiment.py` pattern
- Runs as a SageMaker training job on a configurable instance type
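
As a rough sketch of the reward idea: assuming the classifier returns a list of `{label, score}` dictionaries for each text (as a Hugging Face `text-classification` pipeline does when asked for all class scores), the reward for a generated review can be taken as the score of the positive class. The function name `positive_score_rewards` and the label string `"POSITIVE"` are illustrative assumptions; the labels produced by the classifier in `qwen-sentiment.py` may differ.

```python
from typing import Dict, List, Union

# One classifier result per text: a list of {"label": ..., "score": ...} dicts.
ClassifierOutput = List[Dict[str, Union[str, float]]]


def positive_score_rewards(
    outputs: List[ClassifierOutput], positive_label: str = "POSITIVE"
) -> List[float]:
    """Return one scalar reward per text: the score of the positive class."""
    wanted = positive_label.upper()
    rewards = []
    for per_text in outputs:
        scores = {str(item["label"]).upper(): float(item["score"]) for item in per_text}
        if wanted not in scores:
            raise KeyError(f"label {positive_label!r} not found in {sorted(scores)}")
        rewards.append(scores[wanted])
    return rewards


# Hypothetical classifier outputs for two generated reviews.
example_outputs = [
    [{"label": "NEGATIVE", "score": 0.1}, {"label": "POSITIVE", "score": 0.9}],
    [{"label": "NEGATIVE", "score": 0.7}, {"label": "POSITIVE", "score": 0.3}],
]
print(positive_score_rewards(example_outputs))
```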

In [None]:
# Install sama_rl in editable mode (run from the repository root), then import it
%pip install -e .
from sama_rl import PPO, create_inference_model

## Configuration

Load the PPO configuration that follows the qwen-sentiment.py pattern:

In [None]:
# Create PPO trainer with config
ppo_trainer = PPO(
    yaml_file="./sama_rl/recipes/PPO/qwen2-0.5b-ppo-config.yaml",
    instance_type="ml.g6.48xlarge",  # 8x NVIDIA L4 GPUs; ample for a 0.5B model
    max_steps=100,  # Override the config value for a short test run
    wandb_api_key=""  # Weights & Biases API key (left empty here)
)

print("PPO trainer configured with qwen-sentiment pattern")
print(f"Model: {ppo_trainer.config.model['name']}")
print(f"Instance: {ppo_trainer.config.sagemaker['instance_type']}")
print(f"Max steps: {ppo_trainer.config.training['max_steps']}")
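
The cell above passes `instance_type` and `max_steps` alongside the YAML file. How `sama_rl` merges them is not shown in this notebook, so the following is only a sketch of one plausible override scheme, with a plain dictionary standing in for the parsed YAML. The section names `model`, `sagemaker`, and `training` come from the attribute accesses in the cell above; `apply_overrides`, the routing table, and all values are hypothetical.

```python
import copy
from typing import Any, Dict

# Hypothetical routing from keyword argument to (section, key) in the config.
OVERRIDE_TARGETS = {
    "instance_type": ("sagemaker", "instance_type"),
    "max_steps": ("training", "max_steps"),
}


def apply_overrides(
    config: Dict[str, Dict[str, Any]], **overrides: Any
) -> Dict[str, Dict[str, Any]]:
    """Return a copy of config with keyword overrides written into their sections."""
    merged = copy.deepcopy(config)  # leave the caller's config untouched
    for name, value in overrides.items():
        if name not in OVERRIDE_TARGETS:
            raise KeyError(f"unknown override: {name}")
        section, key = OVERRIDE_TARGETS[name]
        merged.setdefault(section, {})[key] = value
    return merged


# Placeholder values standing in for the parsed YAML file.
base_config = {
    "model": {"name": "example-model"},
    "sagemaker": {"instance_type": "ml.g4dn.2xlarge"},
    "training": {"max_steps": 1000},
}
merged_config = apply_overrides(base_config, max_steps=100)
print(merged_config["training"]["max_steps"])
print(base_config["training"]["max_steps"])
```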

## Training

Start PPO training. The job follows the `qwen-sentiment.py` workflow:
1. Load IMDB dataset
2. Initialize PPOTrainer with TRL 0.11.3
3. Use BERT sentiment classifier as reward
4. Run PPO training loop
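
For step 4, the core of a PPO update is the clipped surrogate objective. The sketch below computes it for a batch of per-sample log-probabilities and advantages in plain Python. It illustrates the formula only and is not the TRL implementation, which works on tensors and involves further terms such as a value loss and a KL penalty. The clip range of 0.2 is a commonly used default, assumed here.

```python
import math
from typing import Sequence


def ppo_clipped_loss(
    new_logprobs: Sequence[float],
    old_logprobs: Sequence[float],
    advantages: Sequence[float],
    clip_range: float = 0.2,
) -> float:
    """Negative mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    if not (len(new_logprobs) == len(old_logprobs) == len(advantages)):
        raise ValueError("all inputs must have the same length")
    if len(advantages) == 0:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        total += min(ratio * adv, clipped * adv)
    # The objective is maximized, so the loss is its negative.
    return -total / len(advantages)


# With an unchanged policy the ratio is 1 and the loss is minus the mean advantage.
unchanged_loss = ppo_clipped_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, -0.5])
# A large ratio with a positive advantage is capped at 1 + clip_range.
clipped_loss = ppo_clipped_loss([0.0], [-1.0], [1.0])
print(unchanged_loss, clipped_loss)
```

The clipping removes the incentive to move the policy far from the one that generated the samples in a single update.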

In [None]:
# Start training
ppo_trainer.train()

print(f"Training job: {ppo_trainer.training_job_name}")
print("PPO training started with qwen-sentiment pattern!")

## Monitor Training

The training job will:
- Use TRL 0.11.3 (the version this workflow is known to work with)
- Load BERT sentiment classifier
- Generate positive movie reviews
- Optimize with PPO using sentiment rewards

Expected training time: ~30 minutes on ml.g4dn.2xlarge. The configuration cell above requests ml.g6.48xlarge, so the actual duration may differ.
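
To follow the job from the notebook, one option is to poll its status. The sketch below assumes the job is an ordinary SageMaker training job whose name is available as `ppo_trainer.training_job_name`. It takes the describe call as a parameter so it can be tried without AWS credentials. `wait_for_training_job` is a hypothetical helper, not part of `sama_rl`.

```python
import time
from typing import Any, Callable, Dict

# Statuses after which a SageMaker training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(
    describe: Callable[..., Dict[str, Any]],
    job_name: str,
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll describe(TrainingJobName=...) until the job reaches a terminal status."""
    while True:
        status = describe(TrainingJobName=job_name)["TrainingJobStatus"]
        print(f"{job_name}: {status}")
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Offline demonstration with a fake describe function (no AWS calls are made).
_fake_statuses = iter(["InProgress", "InProgress", "Completed"])


def fake_describe(TrainingJobName: str) -> Dict[str, Any]:
    return {"TrainingJobStatus": next(_fake_statuses)}


final_status = wait_for_training_job(
    fake_describe, "example-job", poll_seconds=0.0, sleep=lambda seconds: None
)
```

Against a real job, `describe` would be `boto3.client("sagemaker").describe_training_job` and `job_name` would be `ppo_trainer.training_job_name`.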

In [None]:
# Get model artifacts after training
model_uri = ppo_trainer.get_model_artifacts()
print(f"Trained model artifacts: {model_uri}")
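
If `model_uri` is an S3 URI (an assumption; the return type of `get_model_artifacts()` is not shown in this notebook), it can be split into a bucket and key before downloading or inspecting the archive. The helper name and the URI below are made-up placeholders.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Placeholder URI for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/ppo-job/output/model.tar.gz")
print(bucket)
print(key)
```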

## Deployment

Deploy the trained model for inference:

In [None]:
# Deploy using sama_rl inference
inference_model = create_inference_model(
    model_uri=model_uri,
    instance_type="ml.g4dn.xlarge"
)

print(f"Model deployed: {inference_model}")

## Test the Model

Test the trained model to see whether its completions lean more positive:

In [None]:
# Test the model with movie review prompts
test_prompts = [
    "This movie was",
    "The acting in this film",
    "Overall, I thought the movie"
]

for prompt in test_prompts:
    response = inference_model.predict(prompt, max_new_tokens=20)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print()
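
Reading the completions by eye is a start. A slightly more systematic check is to score them and compare the mean positive score against completions from the untrained model. The sketch below accepts any scoring function, and a keyword-based stand-in keeps it runnable here; in practice the scorer would be the sentiment classifier used for the reward. All names and sample texts are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def mean_score(texts: Sequence[str], scorer: Callable[[str], float]) -> float:
    """Average the scorer's output over a non-empty collection of texts."""
    if not texts:
        raise ValueError("texts must not be empty")
    return mean(scorer(text) for text in texts)


def toy_scorer(text: str) -> float:
    """Stand-in for a sentiment classifier: 1.0 if a positive keyword appears."""
    positive_words = ("great", "wonderful", "loved")
    return 1.0 if any(word in text.lower() for word in positive_words) else 0.0


# Hypothetical completions before and after PPO training.
baseline_texts = ["This movie was boring", "The acting in this film was flat"]
trained_texts = ["This movie was great", "The acting in this film was wonderful"]

improvement = mean_score(trained_texts, toy_scorer) - mean_score(baseline_texts, toy_scorer)
print(f"Mean score change: {improvement:+.2f}")
```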