## 2. Data Collection and Pre-processing

### 2.1 The Challenge: Continuous Stream vs. Discrete Phases
The raw data collected in the previous experiment consisted of 15 contiguous gameplay sessions recorded in a single video file. This presented a significant noise challenge: the recording contained loading screens, menu navigations, and three distinct boss phases (Potato, Onion, Carrot), each requiring different visual recognition patterns and reflex strategies.

To train a specialized agent, I needed to isolate the **Potato Phase**: the specific segment where the player must dodge projectiles (dirt clods) and jump over obstacles.
I created a segmentation script and a filtered dataset generator to solve this problem (see appendix X).



## 3. Methodology: Pivoting from Generalization to Micro-Timing

### 3.1 Refinement of Scope
In Assignment 2, the "General Cuphead Agent" achieved high accuracy (96%) but failed functionally because it prioritized the majority class (shooting) and struggled to "see" small, high-velocity threats in the Autoencoder's latent space.

For this final experiment, I pivoted from a broad strategy (playing the whole game) to a focused **micro-timing analysis**. The core research question shifted:
> *Does Explicit Perception (Object Detection) outperform Implicit Perception (Latent Representation) in high-speed reaction tasks?*

To answer this, I restricted the domain to the **Potato Phase**. This phase is mechanically deterministic but visually demanding, making it the perfect "lab environment" to compare how well different vision architectures can track projectile velocity and trigger a jump action.



### 3.2 Model Selection and Architecture Comparison

I designed two distinct pipelines to compete against each other:

#### **Pipeline A: The Baseline (Implicit Perception)**
*   **Vision:** Convolutional Autoencoder (CAE).
*   **Decision:** Gated Recurrent Unit (GRU).
*   **Rationale:** This represents the "End-to-End" Deep Learning approach. The model is not told what a "bullet" is; it must learn to represent valid game states in a latent vector $z$. The GRU is necessary here to infer velocity, as a single static frame's latent vector does not contain motion information. The RNN must "remember" the previous latent states to understand if an object is moving toward the player.
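
To make the "memory" argument concrete, one common formulation of the GRU update is written out below, with the CAE latent vector $z_t$ as the input at frame $t$. The gate names $u_t$ and $r_t$ and the weight symbols are notation chosen here for illustration (the update gate is usually written $z_t$, which would clash with the latent vector), and libraries differ on which of the two terms the update gate multiplies.

$$
\begin{aligned}
u_t &= \sigma(W_u z_t + U_u h_{t-1} + b_u) \\
r_t &= \sigma(W_r z_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h z_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - u_t) \odot h_{t-1} + u_t \odot \tilde{h}_t
\end{aligned}
$$

Because $h_t$ depends on $h_{t-1}$, a change between consecutive latent vectors can influence the decision even though no single $z_t$ encodes motion.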


*TODO: build the CAE + GRU model for the Potato Phase.*

#### **Pipeline B: The Novel Approach (Explicit Perception via YOLO)**
*   **Vision:** **YOLOv8 (You Only Look Once)**.
*   **Decision:** Multi-Layer Perceptron (MLP).
*   **Rationale:** This is the novel technique introduced for this assignment. Instead of relying on the neural network to figure out what pixels matter, we explicitly teach it.
    1.  **Object Detection:** YOLO detects discrete entities: `Cuphead`, `Potato`, and `Projectile`.
    2.  **State Extraction:** We convert bounding boxes into a structured state vector containing **Physics Features** (Distance to projectile, Relative Velocity).
    3.  **Logic:** Because we explicitly calculate velocity from the YOLO detections across frames ($v = \frac{\Delta x}{\Delta t}$), we do **not** need a complex Recurrent Neural Network (GRU). A simple Feed-Forward Network (MLP) should be sufficient to map the state vector $[x, y, v_x, v_y]$ to the action `JUMP`.
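
The state-extraction step above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes detections arrive as `(x1, y1, x2, y2)` pixel boxes, and the `Box` type, the `relative_offset` and `build_state` helpers, and the choice to measure position relative to the player are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in pixels (assumed detection format)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def relative_offset(player: Box, projectile: Box) -> Tuple[float, float]:
    """Projectile centre minus player centre, in pixels."""
    px, py = player.center()
    qx, qy = projectile.center()
    return (qx - px, qy - py)


def build_state(player_prev: Box, proj_prev: Box,
                player_now: Box, proj_now: Box, dt: float) -> List[float]:
    """Return [x, y, v_x, v_y]: relative position now, plus the
    finite-difference relative velocity v = delta_x / delta_t."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    x0, y0 = relative_offset(player_prev, proj_prev)
    x1, y1 = relative_offset(player_now, proj_now)
    return [x1, y1, (x1 - x0) / dt, (y1 - y0) / dt]


# Toy boxes; dt=0.5 s is chosen only to keep the numbers round.
player = Box(100, 300, 140, 360)
state = build_state(player, Box(500, 320, 520, 340),
                    player, Box(480, 320, 500, 340), dt=0.5)
print(state)
```

A real pipeline would also need to handle frames where the projectile is not detected and to match detections across frames when several projectiles are on screen; both are left out of this sketch.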

### 3.3 Hypothesis
I hypothesize that **Pipeline B (YOLO)** will significantly outperform Pipeline A in **Recall** (catching rare Jump events). The Autoencoder in Pipeline A tends to blur small, fast-moving projectiles as "noise" during compression. YOLO, trained specifically on those objects, will force the decision model to acknowledge their existence, creating a distinct "Decision Boundary" that mimics human reaction timing.
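
The reason Recall is the metric of interest, rather than accuracy, can be shown with a small sketch. The labels below are synthetic and invented purely for illustration (they are not experimental results); the function names are hypothetical helpers.

```python
from typing import Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of frames where the predicted action matches the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """Fraction of true `positive` frames that the model also predicted."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Synthetic example: 96 SHOOT frames, 4 JUMP frames, and a model
# that always predicts the majority class.
y_true = ["SHOOT"] * 96 + ["JUMP"] * 4
y_pred = ["SHOOT"] * 100

print(accuracy(y_true, y_pred))        # 0.96
print(recall(y_true, y_pred, "JUMP"))  # 0.0
```

A model that never jumps still scores high accuracy on such data, which is why the comparison between the two pipelines focuses on Recall for the `JUMP` class.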

## Appendix


### 2.2 Custom Segmentation Tooling
Standard video editing tools were insufficient because I needed to preserve the exact millisecond alignment between video frames and the asynchronous keystroke logs (UTC timestamps).

To solve this, I developed a custom Python utility, `mark_potato_segments.py`, which uses OpenCV and the original `_frames.jsonl` log file.
*   **Frame-Accurate Navigation:** The tool allowed me to step through the raw training video frame-by-frame.
*   **Metadata Tagging:** I manually identified the start (the moment the "Fight!" banner disappears) and end (the moment the Potato retreats) of every Potato phase across all 15 sessions.
*   **UTC Extraction:** Instead of cutting the video files (which risks re-encoding artifacts and frame drift), the tool generated a lightweight JSON metadata file (`potato_phase_segments.json`).

**Segment Metadata Structure:**
```json
{
    "id": 1,
    "start_frame": 450,
    "start_utc": 1763997299.05,
    "end_frame": 1200,
    "end_utc": 1763997324.10
}
```
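
A record of this shape could be produced from two marked frame indices roughly as follows. This is a sketch under stated assumptions, not the code in `mark_potato_segments.py`: it assumes each line of `_frames.jsonl` is a JSON object with `frame` and `utc` fields (the real schema may differ), and `load_frame_utc` and `make_segment` are hypothetical helper names.

```python
import json
from typing import Dict, Iterable


def load_frame_utc(jsonl_lines: Iterable[str]) -> Dict[int, float]:
    """Map frame index -> UTC timestamp.

    Assumes lines like {"frame": 450, "utc": 1763997299.05}.
    """
    table: Dict[int, float] = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        rec = json.loads(line)
        table[int(rec["frame"])] = float(rec["utc"])
    return table


def make_segment(seg_id: int, start_frame: int, end_frame: int,
                 frame_utc: Dict[int, float]) -> dict:
    """Build one segment record from manually marked start/end frames."""
    if end_frame <= start_frame:
        raise ValueError("end_frame must come after start_frame")
    return {
        "id": seg_id,
        "start_frame": start_frame,
        "start_utc": frame_utc[start_frame],
        "end_frame": end_frame,
        "end_utc": frame_utc[end_frame],
    }


# Two in-memory log lines stand in for the real file.
log_lines = [
    '{"frame": 450, "utc": 1763997299.05}',
    '{"frame": 1200, "utc": 1763997324.10}',
]
segment = make_segment(1, 450, 1200, load_frame_utc(log_lines))
print(json.dumps(segment, indent=4))
```

Storing both the frame index and the UTC timestamp lets the same record be matched against the video (by frame) and against the keystroke log (by time).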

### 2.3 Dataset Filtering
Using this metadata, I created a filtered dataset generator. During training, the data loader reads the JSON segments and yields only the frames and keystrokes whose timestamps fall within the active Potato Phase combat windows. This reduces the dataset size but removes loading screens, menu navigation, and the other boss phases, so the model learns only from relevant combat frames.
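
The filtering step can be sketched as a small generator. This is an illustration under assumptions, not the actual data loader: events are assumed to be `(utc, payload)` pairs, the window bounds are treated as inclusive, and `in_any_segment` and `filter_by_segments` are hypothetical names.

```python
from typing import Any, Iterable, Iterator, List, Tuple

Event = Tuple[float, Any]  # (UTC timestamp, frame or keystroke payload)


def in_any_segment(utc: float, segments: List[dict]) -> bool:
    """True if the timestamp lies inside at least one segment window.

    Bounds are inclusive here; that choice is an assumption.
    """
    return any(s["start_utc"] <= utc <= s["end_utc"] for s in segments)


def filter_by_segments(events: Iterable[Event],
                       segments: List[dict]) -> Iterator[Event]:
    """Yield only the events whose timestamp falls in a Potato Phase window."""
    for utc, payload in events:
        if in_any_segment(utc, segments):
            yield (utc, payload)


# Toy windows and events; the same function serves frames and keystrokes.
segments = [
    {"start_utc": 10.0, "end_utc": 20.0},
    {"start_utc": 30.0, "end_utc": 40.0},
]
events = [(5.0, "menu"), (10.0, "f1"), (15.0, "f2"),
          (25.0, "loading"), (35.0, "f3"), (41.0, "onion")]
kept = list(filter_by_segments(events, segments))
print(kept)
```

Filtering frames and keystrokes by the same UTC windows keeps the two streams aligned without touching the video file.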