In Assignment 2, the agent was trained using **Behavioral Cloning** (supervised learning). While effective for feature extraction, the agent suffered from distribution shift: once it drifted into states not covered by the demonstrations, its mistakes compounded.

For the final assignment, I implemented **Proximal Policy Optimization (PPO)**, a widely used policy gradient method. This required building a custom `Gym` environment that wraps the *Cuphead* application and uses computer vision to calculate rewards (HP loss) in real time.

### 1. Dependencies
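We use `stable-baselines3` for the PPO implementation, `mss` for high-speed screen capture, `pydirectinput` for sending keystrokes to the game, `opencv-python` for image preprocessing, and `tensorflow` to load the pre-trained encoder.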

In [None]:
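# Install necessary RL libraries (tensorflow is needed to load the Keras encoder)
# Note: the environment below uses the classic `gym` API (4-tuple `step`);
# Stable-Baselines3 2.x is built around `gymnasium`, so a 1.x release may be required.
!pip install stable-baselines3 gym mss pydirectinput opencv-python tensorflow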

In [None]:
import gym
from gym import spaces
import numpy as np
import cv2
import mss
import pydirectinput
import time
from stable_baselines3 import PPO
from tensorflow.keras.models import load_model

# Global Constants
LATENT_DIM = 512
IMG_WIDTH = 128
IMG_HEIGHT = 72

### 2. The Reward Function (Computer Vision)
Since *Cuphead* provides no API, we must compute the reward visually. We monitor the bottom-left corner of the screen (the HP cards). If the average red intensity of that region drops below a threshold, we treat it as the player having taken damage.

In [None]:
def calculate_visual_reward(screen_frame):
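    """
    Analyzes the HP-card region of a captured frame to determine game state.
    Args:
        screen_frame: full-resolution capture (1280x720, BGRA channel order).
    Returns:
        reward (float): +0.1 for each step survived, -10.0 for taking damage.
        done (bool): True as soon as damage is detected (simplified for training;
            HP reaching 0 and level completion are not detected separately).
    """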
    hp_region = screen_frame[650:700, 20:100]
    
    # Calculate Red Channel Intensity
    # If the card flips (damage taken), the red/pink turns to grey
    avg_red = np.mean(hp_region[:, :, 2]) # mss frames are BGRA, so index 2 is Red
    
    # Threshold determined experimentally
    if avg_red < 50: 
        return -10.0, True # Penalty and End Episode (simplified for training)
    
    # Small positive reward for every frame survived
    return 0.1, False
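
The threshold of 50 is described above as experimentally determined. One way such a value might be calibrated is sketched below, assuming a handful of labeled screenshots are available. The helper names (`mean_red`, `midpoint_threshold`, `make_frame`) are hypothetical, and the frames are synthetic stand-ins rather than real captures, so the resulting number is illustrative only.

```python
import numpy as np

# HP-card region, matching the slice used in calculate_visual_reward
HP_ROWS, HP_COLS = slice(650, 700), slice(20, 100)

def mean_red(frame):
    """Mean red-channel value of the HP region (BGR/BGRA order, index 2 = red)."""
    return float(np.mean(frame[HP_ROWS, HP_COLS, 2]))

def midpoint_threshold(healthy_frames, damaged_frames):
    """Pick a threshold halfway between the dimmest healthy and brightest damaged sample."""
    healthy = [mean_red(f) for f in healthy_frames]
    damaged = [mean_red(f) for f in damaged_frames]
    return (min(healthy) + max(damaged)) / 2.0

def make_frame(red_value):
    """Synthetic 1280x720 BGRA frame whose HP region has a constant red value."""
    frame = np.zeros((720, 1280, 4), dtype=np.uint8)
    frame[HP_ROWS, HP_COLS, 2] = red_value
    return frame

# Assumed sample values; real screenshots would replace these
healthy_frames = [make_frame(v) for v in (180, 200, 220)]
damaged_frames = [make_frame(v) for v in (20, 30, 40)]

threshold = midpoint_threshold(healthy_frames, damaged_frames)
print(f"Suggested threshold: {threshold}")
```

If the healthy and damaged readings overlap on real frames, a single red-channel threshold is unlikely to separate them and a different cue would be needed.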

### 3. The Custom Gym Environment
This class bridges the gap between the RL agent and the game application. It implements the core methods of the standard OpenAI Gym API, `step` and `reset`. A `render` method is not implemented because the game draws its own window.

**Key Architecture Decision:** We reuse the **Autoencoder** trained in Assignment 2. We do *not* retrain the vision layer. The environment captures a frame, passes it through the frozen Autoencoder, and yields the 512-dimensional Latent Vector as the `observation`.
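
To make the shapes in this pipeline concrete, here is a minimal sketch that runs without the game or the trained model. `stub_encoder` is a hypothetical stand-in for the frozen encoder (a fixed random projection, not the Assignment 2 network), and the input frame is random noise rather than a real capture.

```python
import numpy as np

IMG_WIDTH, IMG_HEIGHT, LATENT_DIM = 128, 72, 512

def stub_encoder(batch):
    """Hypothetical stand-in for the frozen encoder: (N, 72, 128, 1) -> (N, 512)."""
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((IMG_HEIGHT * IMG_WIDTH, LATENT_DIM)).astype(np.float32)
    return batch.reshape(batch.shape[0], -1) @ weights

# Fake grayscale frame, already resized to 128x72 (rows=72, cols=128)
gray = np.random.default_rng(1).integers(0, 256, size=(IMG_HEIGHT, IMG_WIDTH), dtype=np.uint8)

frame = gray.astype(np.float32) / 255.0          # normalize to [0, 1]
batch = np.expand_dims(frame, axis=(0, -1))      # add batch and channel axes
observation = stub_encoder(batch)[0]             # drop the batch axis

print(batch.shape, observation.shape, observation.dtype)
```

The resulting vector has the same shape and dtype as the `observation_space` the environment declares, which is what the `MlpPolicy` consumes.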

In [None]:
class CupheadEnv(gym.Env):
    def __init__(self, encoder_path):
        super().__init__()
        
        # 1. Load Pre-trained Vision Model (Frozen)
        print("Loading Autoencoder...")
        self.encoder = load_model(encoder_path)
        
        # 2. Define Action Space (Discrete Buttons)
        # 0: No Op, 1: Left, 2: Right, 3: Jump, 4: Shoot, 5: Dash
        self.action_space = spaces.Discrete(6)
        
        # 3. Define Observation Space (Latent Vector)
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(LATENT_DIM,), dtype=np.float32)
        
        # 4. Screen Capture Setup
        self.sct = mss.mss()
        self.monitor = {"top": 100, "left": 0, "width": 1280, "height": 720}

    def step(self, action):
        # A. Execute Action in Game
        self._perform_input(action)
        
        # Wait for frame update (simulating reaction time)
        time.sleep(0.05)
        
        # B. Capture Environment State
        screen = np.array(self.sct.grab(self.monitor))
        
        # C. Process Vision (Resize -> Normalize -> Autoencode)
        screen_proc = cv2.resize(screen, (IMG_WIDTH, IMG_HEIGHT))
        screen_proc = cv2.cvtColor(screen_proc, cv2.COLOR_BGRA2GRAY)
        screen_proc = screen_proc.astype('float32') / 255.0
        
        # Get Latent Vector (Observation)
        # We expand dims because the model expects a batch of shape (1, 72, 128, 1)
        observation = self.encoder.predict(np.expand_dims(screen_proc, axis=(0, -1)), verbose=0)[0]
        
        # D. Calculate Reward
        reward, done = calculate_visual_reward(screen)
        
        return observation, reward, done, {}

    def reset(self):
        # Sequence to restart the level in Cuphead
        pydirectinput.press('r') 
        time.sleep(2.0)
        
        # Return initial observation
        screen = np.array(self.sct.grab(self.monitor))
        screen_proc = cv2.resize(screen, (IMG_WIDTH, IMG_HEIGHT))
        screen_proc = cv2.cvtColor(screen_proc, cv2.COLOR_BGRA2GRAY).astype('float32') / 255.0
        return self.encoder.predict(np.expand_dims(screen_proc, axis=(0, -1)), verbose=0)[0]

    def _perform_input(self, action):
        # Map RL outputs to Keystrokes
        if action == 1: pydirectinput.press('left')
        elif action == 2: pydirectinput.press('right')
        elif action == 3: pydirectinput.press('z') # Jump
        elif action == 4: pydirectinput.press('x') # Shoot
        elif action == 5: pydirectinput.press('c') # Dash
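
As a possible alternative to the `if`/`elif` chain in `_perform_input`, the action-to-key mapping could be kept in a lookup table. This is only a sketch: `ACTION_KEYS` and `key_for_action` are hypothetical names, and the key bindings are copied from the method above.

```python
# Hypothetical lookup table mirroring the key bindings in `_perform_input`
ACTION_KEYS = {
    0: None,     # No Op
    1: "left",
    2: "right",
    3: "z",      # Jump
    4: "x",      # Shoot
    5: "c",      # Dash
}

def key_for_action(action):
    """Return the key to press for a discrete action, or None for No Op."""
    action = int(action)  # the agent may pass a NumPy integer
    if action not in ACTION_KEYS:
        raise ValueError(f"Unknown action: {action}")
    return ACTION_KEYS[action]

pressed = [key_for_action(a) for a in range(6)]
print(pressed)
```

With a table like this, `_perform_input` would reduce to looking up the key and calling `pydirectinput.press` when the result is not `None`, and an out-of-range action would fail loudly instead of being silently ignored.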

### 4. PPO Agent Initialization and Training

We use the **Proximal Policy Optimization** algorithm. PPO is an Actor-Critic method that optimizes a surrogate objective function:

$$ L^{CLIP}(\theta) = \hat{E}_t [\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)] $$

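Here $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies, and $\hat{A}_t$ is the estimated advantage. Clipping the ratio to the interval $[1-\epsilon, 1+\epsilon]$ removes the incentive to move the policy too far in a single update, providing the stability needed for a difficult game.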
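
The effect of the clip can be seen on a few hand-picked values. The sketch below evaluates the per-sample term inside the expectation with NumPy. It is an illustration of the formula, not the Stable-Baselines3 implementation, and `clipped_surrogate` is a hypothetical helper. The value $\epsilon = 0.2$ matches the library's default `clip_range`.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    return np.minimum(unclipped, clipped)

ratios = np.array([0.5, 1.0, 1.5])

# Positive advantage: gains from raising the ratio are capped at 1 + epsilon
pos = clipped_surrogate(ratios, advantage=1.0)

# Negative advantage: gains from lowering the ratio are capped at 1 - epsilon
neg = clipped_surrogate(ratios, advantage=-1.0)

print(pos)
print(neg)
```

For a positive advantage the objective stops growing once the ratio exceeds 1.2, and for a negative advantage it stops improving once the ratio falls below 0.8, so there is nothing to gain from pushing the policy further in either direction.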

In [None]:
# Initialize Environment
env = CupheadEnv(encoder_path="../models/2_encoder_512.keras")

# Initialize PPO Agent
# 'MlpPolicy': We use a Multi-Layer Perceptron because our input is 
# the 1D Latent Vector, not the raw image pixels.
model = PPO("MlpPolicy", env, verbose=1, learning_rate=0.0003)

print("Starting RL Training Loop...")
# We train for 10,000 timesteps for the demonstration
# In a full deployment, this would be >100,000
model.learn(total_timesteps=10000)

### 5. Saving the Policy
Once trained, the PPO policy is saved. `model.save` writes a small `.zip` archive (here `cuphead_ppo_policy_v1.zip`) containing the "brain", which can later be restored with `PPO.load` to play the game autonomously.

In [None]:
model.save("cuphead_ppo_policy_v1")
print("Model Saved.")