## 1. Introduction and Data Source

### 1.1 Project Objective
Building on my previous work classifying boss fights based on keystroke patterns, I pivoted to a more ambitious goal: training an autonomous agent to play *Cuphead*.

The core challenge is to map high-dimensional visual inputs (raw pixels) to low-level control outputs (keystrokes) in a way that captures the reactive nature of *Cuphead*. I selected **The Root Pack (Botanic Panic)** boss fight, as it offers distinct projectiles and stationary enemies suitable for testing spatial reasoning.

I implemented four models to evaluate the trade-offs between implicit visual learning and explicit spatial coordinates, as well as the computational efficiency of different RNN cells.

| Experiment | Vision Layer (State Representation) | Decision Layer (Architecture) | Hypothesis |
| :--- | :--- | :--- | :--- |
| **A** | **Convolutional Autoencoder** | **LSTM** | The standard "end-to-end" approach. Likely computationally heavy and noisy. |
| **B** | **Convolutional Autoencoder** | **GRU** | Tests whether a simpler gated mechanism (GRU) converges faster than the LSTM on noisy pixel data. |
| **C** | **YOLOv8** | **LSTM** | Tests whether providing explicit object coordinates improves the LSTM's output. |
| **D** | **YOLOv8** | **GRU** | Tests whether providing explicit object coordinates improves the GRU's output. |
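
For a rough sense of the efficiency gap between the two recurrent cells, the sketch below counts trainable parameters for a single layer. The layer width (128 units) and input size (64 features) are assumptions chosen for illustration, and the GRU formula assumes the Keras default `reset_after=True` variant, which keeps two bias vectors per weight block.

```python
def lstm_params(input_dim: int, units: int) -> int:
    # 4 weight blocks (input, forget, output gates and the cell candidate),
    # each with input weights, recurrent weights, and one bias vector
    return 4 * (units * (input_dim + units) + units)


def gru_params(input_dim: int, units: int) -> int:
    # 3 weight blocks (update, reset gates and the candidate state);
    # the reset_after variant keeps separate input and recurrent biases
    return 3 * (units * (input_dim + units) + 2 * units)


lstm = lstm_params(input_dim=64, units=128)
gru = gru_params(input_dim=64, units=128)
print(f"LSTM: {lstm} params, GRU: {gru} params, ratio: {gru / lstm:.3f}")
```

Under these assumptions the GRU layer carries roughly three quarters of the LSTM's parameters, which is the basis for expecting it to train faster per step.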

### 1.2 Dataset
This project's dataset consists of 1280x720 gameplay footage synchronized with keystroke logs. I developed a custom "Session Recorder" (see Appendix A) to gather this dataset.

**Data Splitting Strategy:**
To prevent data leakage, I implemented a strict session-based split: every frame from a given recording belongs to exactly one of the training or testing sets.

| Session Name | Usage | Purpose |
| :--- | :--- | :--- |
| `Train_1.mp4` | **Training** | Primary training data. |
| `Train_2.mp4` | **Training** | Supplementary training data to introduce gameplay variance (different weapons, varied dodging patterns). |
| `Test.mp4` | **Testing** | Used for evaluation. |
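
The split can be sketched as a lookup from session name to subset. The helper name `split_by_session` and the `(session, frame_index)` sample format are hypothetical, used only to illustrate why splitting at the session level keeps near-duplicate neighbouring frames out of the test set.

```python
# Session-to-subset assignment taken from the table above
SESSION_SPLIT = {
    "Train_1.mp4": "train",
    "Train_2.mp4": "train",
    "Test.mp4": "test",
}


def split_by_session(samples):
    """Group (session_name, frame_index) pairs by the subset of their session."""
    splits = {"train": [], "test": []}
    for session, frame_idx in samples:
        splits[SESSION_SPLIT[session]].append((session, frame_idx))
    return splits


samples = [
    ("Train_1.mp4", 0), ("Train_1.mp4", 1),
    ("Train_2.mp4", 0), ("Test.mp4", 0),
]
splits = split_by_session(samples)

# No session contributes frames to both subsets
train_sessions = {s for s, _ in splits["train"]}
test_sessions = {s for s, _ in splits["test"]}
print(train_sessions.isdisjoint(test_sessions))
```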

# add summary

## 2. Pre-processing and Feature Engineering

### 2.1 Autoencoder Pipeline
**1. Dimensionality Reduction**
Frames were downsampled to $128 \times 72$ (10% of the original resolution along each axis, or 1% of the pixel count). While this removes high-frequency detail, it preserves the global structure and contrast required to distinguish the large "Root Pack" bosses from the background.
**2. Grayscale Conversion**
Color channels were discarded. Gameplay mechanics in this specific fight are defined by motion and silhouette, not color coding. This reduced tensor size by a factor of 3.

By applying downsampling and grayscale transformations immediately upon capture, the effective size of the 45-minute dataset (~27,000 frames) was reduced from ~75 GB to **< 1 GB**. This allowed the entire training set to reside in Google Colab's GPU RAM.
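
A minimal NumPy sketch of the two steps and the resulting storage arithmetic follows. The block-averaging downsampler and the BT.601 luminance weights are assumptions for illustration; the actual recorder may use a different resize method. The storage figures assume `uint8` RGB for the raw frames and `float32` for the processed ones.

```python
import numpy as np


def preprocess(frame_rgb: np.ndarray, factor: int = 10) -> np.ndarray:
    """Convert an (H, W, 3) uint8 frame to a downsampled grayscale image in [0, 1]."""
    # Luminance-weighted grayscale (ITU-R BT.601 weights)
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    gray = frame_rgb.astype(np.float32) @ weights
    h, w = gray.shape
    # Average each factor x factor tile into one output pixel
    small = gray.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return (small / 255.0).astype(np.float32)


frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)
small = preprocess(frame)
print(small.shape)  # (72, 128)

n_frames = 27_000
raw_gb = n_frames * 1280 * 720 * 3 / 1e9   # uint8 RGB frames
small_gb = n_frames * 128 * 72 * 4 / 1e9   # float32 grayscale frames
print(f"raw: {raw_gb:.1f} GB, processed: {small_gb:.3f} GB")
```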

The pre-processed frames are passed through the **Encoder** half of the Convolutional Autoencoder. This transforms the high-dimensional image ($128 \times 72 = 9{,}216$ pixels) into a dense **Latent Vector** of size 64. These 64 learned values act as the "features" for the RNN.

### 2.2 YOLO Pipeline
**1. Detector Configuration:**
*   **Annotation:** I annotated 500 frames. Crucially, I distinguished between two threat classes to prevent logic conflicts in the RNN:
    *   `Projectile`: Dirt balls, Tears, and Carrots. These are discrete objects that follow the player or move linearly.
    *   `Beam`: The "Hypnotic Rings" in Phase 3. These behave differently, creating vertical hazard zones that mandate horizontal movement, unlike linear projectiles, which can often be destroyed or jumped over.

# ADD SCREENSHOT (each projectile and beam)

*   **YOLO Training:** A YOLOv8-Nano model was fine-tuned on these specific classes.

**2. Feature Engineering:**
I engineered a fixed-size **Relative State Vector (Size: 11)** with the following components:

1.  **Player State (2):** Absolute $(x, y)$ normalized to $[0, 1]$.
2.  **Boss State (3):** Relative $(\Delta x, \Delta y)$ and a **Phase Identifier** (0=Potato, 1=Onion, 2=Carrot).
3.  **Primary Threat (3):** Relative $(\Delta x, \Delta y)$ and **Type ID** ($0$=Projectile, $1$=Beam) of the nearest threat.
4.  **Secondary Threat (3):** Relative $(\Delta x, \Delta y)$ and **Type ID** ($0$=Projectile, $1$=Beam) of the second nearest threat.
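
One way such a vector could be assembled from per-frame detections is sketched below. The function name `build_state_vector`, the `(class_name, x, y)` detection format, and the zero-padding of missing threat slots are all assumptions for illustration, not the project's confirmed implementation.

```python
import math

# Type IDs as defined in the list above
THREAT_TYPE_ID = {"Projectile": 0.0, "Beam": 1.0}


def build_state_vector(player, boss, phase_id, threats):
    """
    player, boss: (x, y) centres normalized to [0, 1].
    phase_id: 0=Potato, 1=Onion, 2=Carrot.
    threats: list of (class_name, x, y) detections.
    Returns a list of 11 floats.
    """
    px, py = player
    # Player state (2) followed by boss state (3)
    vec = [px, py, boss[0] - px, boss[1] - py, float(phase_id)]

    # Keep the two threats closest to the player by Euclidean distance
    nearest = sorted(threats, key=lambda t: math.hypot(t[1] - px, t[2] - py))[:2]
    for name, tx, ty in nearest:
        vec += [tx - px, ty - py, THREAT_TYPE_ID[name]]

    # Assumption: pad missing threat slots with zeros so the size stays fixed
    vec += [0.0] * (11 - len(vec))
    return vec


vec = build_state_vector(
    player=(0.2, 0.8),
    boss=(0.9, 0.7),
    phase_id=0,
    threats=[("Beam", 0.8, 0.8), ("Projectile", 0.3, 0.8)],
)
print(len(vec), vec)
```

Zero-padding is the simplest choice but is indistinguishable from a `Projectile` sitting exactly on the player; a dedicated "no threat" flag or sentinel offset would remove that ambiguity.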

### 2.3 Temporal Sequencing & Label Engineering
A single frame (or coordinate vector) captures position but loses derivatives like velocity and acceleration. To allow the RNNs (LSTM/GRU) to infer trajectory:
1.  **Sliding Windows:** The per-frame features from both pipelines were stacked into overlapping sequences of length $T=10$ (representing 1 second of history).
    *   *Autoencoder Input Shape:* $(N, 10, 64)$: a sequence of 64-dimensional latent vectors.
    *   *YOLO Input Shape:* $(N, 10, 11)$: a sequence of 11 engineered spatial features.
2.  **Label Encoding:** The target variable $y$ is a multi-hot encoded vector of size 9, representing the state of every tracked key (including Up, Down, Left, Right, Jump, Shoot, and Dash) at the final timestep of the sequence.
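
The windowing and label alignment can be sketched with NumPy as follows. The helper name `make_windows` is hypothetical, and the random arrays stand in for real features and key logs; the feature size of 11 matches the YOLO state vector.

```python
import numpy as np


def make_windows(features: np.ndarray, labels: np.ndarray, T: int = 10):
    """
    features: (num_frames, D) per-frame feature vectors from one session.
    labels:   (num_frames, K) multi-hot key states per frame.
    Returns X of shape (N, T, D) and y of shape (N, K), with N = num_frames - T + 1.
    """
    n = features.shape[0] - T + 1
    # Window i covers frames i .. i + T - 1
    X = np.stack([features[i:i + T] for i in range(n)])
    # Each window is labelled with the key state at its final timestep
    y = labels[T - 1:]
    return X, y


feats = np.random.rand(100, 11).astype(np.float32)
keys = (np.random.rand(100, 9) > 0.8).astype(np.float32)
X, y = make_windows(feats, keys)
print(X.shape, y.shape)  # (91, 10, 11) (91, 9)
```

Windows should be built separately for each recording so that no sequence spans the boundary between two sessions.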

# didn't check from here :3

## 4. Analysis Plan

### 4.1 Evaluation Metrics
Since "winning" the game is a binary and sparse metric, I will use two levels of evaluation:

1.  **Technical Accuracy (Classification Report):** For the LSTM, I will measure **Precision** and **Recall** for specific actions (Jump, Shoot, Dash). This will diagnose if the model suffers from class imbalance (e.g., learning to "Shoot" perfectly but never "Jump").
2.  **Agent Survival (Game Metric):** I will run both agents on the held-out `Test.mp4` scenario (or live gameplay) and measure **Average Survival Time (seconds)**. This determines if the "Spatial Logic" of YOLO actually translates to better gameplay than the "Pattern Matching" of the LSTM.
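
The per-action part of the first metric can be sketched as below. The helper name `per_action_report` and the 0.5 decision threshold are assumptions; the toy arrays only illustrate the calculation and are not results.

```python
import numpy as np


def per_action_report(y_true, y_prob, action_names, threshold=0.5):
    """Compute (precision, recall) per action from multi-hot labels and sigmoid outputs."""
    y_pred = (y_prob >= threshold).astype(int)
    report = {}
    for k, name in enumerate(action_names):
        tp = int(np.sum((y_pred[:, k] == 1) & (y_true[:, k] == 1)))
        fp = int(np.sum((y_pred[:, k] == 1) & (y_true[:, k] == 0)))
        fn = int(np.sum((y_pred[:, k] == 0) & (y_true[:, k] == 1)))
        # Guard against division by zero when an action is never predicted or never pressed
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        report[name] = (precision, recall)
    return report


y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y_prob = np.array([[0.9, 0.2], [0.4, 0.8], [0.7, 0.1], [0.1, 0.3]])
report = per_action_report(y_true, y_prob, ["Jump", "Shoot"])
print(report)
```

A large gap between the recall of a frequent action and a rare one is the signature of the class imbalance described above.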

## 5. Model Selection and Mathematical Underpinnings

### 5.1 The Baseline: Convolutional Autoencoder + LSTM
**Rationale:** This architecture represents a pure "Deep Learning" attempt.
*   **Autoencoder (CAE):** Video data is high-dimensional and must be reduced before it reaches the RNN. A *Convolutional* Autoencoder preserves spatial relationships (unlike a dense AE) while compressing the frame into a compact **Latent Vector**. It learns to "see" the game state without supervision.
*   **LSTM (Long Short-Term Memory):** Cuphead is a game of physics. To dodge a projectile, the agent needs to know its velocity and trajectory, not just its current position. LSTMs handle this by maintaining a memory state over the sequence of 10 frames, allowing them to predict based on motion history.

**Implementation:**
I utilized `Keras` with a `TensorFlow` backend. The Encoder compresses the image, and the LSTM predicts the multi-label action vector (Sigmoid activation).

```python
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
from keras import layers, models

# --- Configuration ---
IMG_HEIGHT = 72   # Downscaled height
IMG_WIDTH = 128   # Downscaled width
LATENT_DIM = 64   # Size of the compressed feature vector
SEQUENCE_LENGTH = 10 # How many frames the LSTM looks back in time
NUM_ACTIONS = 9   # Number of unique keys (Up, Down, Jump, Shoot, etc.)

def build_autoencoder():
    """
    Builds a Convolutional Autoencoder that compresses each game frame
    into a LATENT_DIM-sized vector and reconstructs the frame from it.
    """
    input_img = layers.Input(shape=(IMG_HEIGHT, IMG_WIDTH, 1), name="Input_Frame")

    # --- Encoder (The Vision System) ---
    x = layers.Conv2D(32, (3, 3), activation='relu', padding='same')(input_img)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)  # 72x128 -> 36x64
    x = layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)  # 36x64 -> 18x32
    x = layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
    x = layers.MaxPooling2D((2, 2), padding='same')(x)  # 18x32 -> 9x16

    # The Bottleneck (Latent Representation)
    # Three 2x2 poolings leave a (9, 16, 128) feature map, which holds more
    # values than the input frame, so a Dense layer projects it to LATENT_DIM.
    feature_map_shape = (IMG_HEIGHT // 8, IMG_WIDTH // 8, 128)
    flat_dim = feature_map_shape[0] * feature_map_shape[1] * feature_map_shape[2]
    x = layers.Flatten()(x)
    encoded = layers.Dense(LATENT_DIM, activation='relu', name="Latent_Space")(x)

    # --- Decoder (Reconstruction) ---
    x = layers.Dense(flat_dim, activation='relu')(encoded)
    x = layers.Reshape(feature_map_shape)(x)
    x = layers.Conv2DTranspose(128, (3, 3), activation='relu', padding='same')(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2DTranspose(64, (3, 3), activation='relu', padding='same')(x)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2DTranspose(32, (3, 3), activation='relu', padding='same')(x)
    x = layers.UpSampling2D((2, 2))(x)
    decoded = layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same', name="Reconstruction")(x)

    autoencoder = models.Model(input_img, decoded, name="Autoencoder")
    encoder = models.Model(input_img, encoded, name="Encoder")  # We extract this later

    autoencoder.compile(optimizer='adam', loss='mse')
    return autoencoder, encoder

def build_lstm_brain(latent_dim):
    """
    Builds an LSTM to predict actions based on a sequence of latent vectors.
    """
    # Input shape: (Time_Steps, Latent_Dim), one latent vector per frame
    input_seq = layers.Input(shape=(SEQUENCE_LENGTH, latent_dim), name="Sequence_Input")

    # LSTM Layers
    x = layers.LSTM(128, return_sequences=True)(input_seq)
    x = layers.Dropout(0.3)(x)
    x = layers.LSTM(64, return_sequences=False)(x)  # Only keep the final timestep's output
    x = layers.Dropout(0.3)(x)

    # Output Layer (Multi-label classification)
    # We use sigmoid because multiple keys can be pressed at once (e.g., Jump + Shoot)
    output = layers.Dense(NUM_ACTIONS, activation='sigmoid', name="Action_Prediction")(x)

    brain = models.Model(input_seq, output, name="LSTM_Brain")
    # BinaryAccuracy scores each key independently; the generic 'accuracy' string
    # can resolve to categorical accuracy for a multi-unit output, which is
    # misleading for multi-hot labels.
    brain.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=[keras.metrics.BinaryAccuracy(name="binary_accuracy")],
    )
    return brain

# --- Instantiate Models ---
print("Building Models...")
autoencoder, encoder = build_autoencoder()
autoencoder.summary()

# Get the size of the latent vector to configure the LSTM
latent_dim = encoder.output_shape[-1]
print(f"Latent Vector Size: {latent_dim}")

brain = build_lstm_brain(latent_dim)
brain.summary()
```

### 5.2 The Improvement: YOLOv8 + Finite State Machine (FSM)
**Rationale:** The primary failure mode of the AE+LSTM baseline is the lack of explicit spatial reasoning. The model sees pixels, but doesn't "know" distance.
*   **YOLOv8 (You Only Look Once):** An object detection model that provides the precise $(x, y)$ bounding boxes of the player and projectiles.
*   **FSM:** A logic layer that uses these coordinates to calculate the **Euclidean Distance** between the player and the threat. This allows us to hard-code the "Jump" reflex that the LSTM fails to learn due to data imbalance.
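
A minimal two-state sketch of this logic layer is shown below. The state names, the `JUMP_RADIUS` threshold, and the default "Shoot" action are hypothetical values chosen for illustration; coordinates are assumed to be normalized to $[0, 1]$.

```python
import math

JUMP_RADIUS = 0.15  # hypothetical trigger distance in normalized screen units


def fsm_step(player, threats, jump_radius=JUMP_RADIUS):
    """
    player: (x, y) centre of the player.
    threats: list of (x, y) centres of detected threats.
    Returns (state, keys) for the current frame.
    """
    px, py = player
    # Euclidean distance from the player to every detected threat
    distances = [math.hypot(tx - px, ty - py) for tx, ty in threats]

    # Transition to EVADE when the nearest threat enters the trigger radius
    if distances and min(distances) < jump_radius:
        return "EVADE", ["Jump"]
    # Otherwise stay in ATTACK and keep firing
    return "ATTACK", ["Shoot"]


print(fsm_step((0.2, 0.8), [(0.3, 0.8)]))  # threat at distance ~0.1
print(fsm_step((0.2, 0.8), [(0.9, 0.8)]))  # threat at distance ~0.7
```

Because the rule fires on distance alone, it does not depend on how often jumps appear in the training data, which is the imbalance problem it is meant to sidestep.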

*(Implementation details and evaluation of this model are discussed in Section 7).*

# References

https://blog.keras.io/building-autoencoders-in-keras.html

https://www.youtube.com/watch?v=YCzL96nL7j0&t=594s

https://www.youtube.com/watch?v=qiUEgSCyY5o

https://www.youtube.com/watch?v=wipq--gdIGM&feature=youtu.be