# Colab Preprocessing Guide

This notebook standardizes your videos for fine-tuning (H.264/AAC, YUV420p, target 1080p short-side, optional 30fps) and can optionally extract frames.

Requirements: Colab runtime with internet (for apt) and your `.mp4` files.

## Setup

In [None]:
%%bash
set -euo pipefail
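# Install ffmpeg quietly. Failures are ignored (|| true) and output is discarded,
# so the version check on the last line is what confirms the install worked.
apt-get -y update >/dev/null 2>&1 || true
apt-get -y install ffmpeg >/dev/null 2>&1 || true
ffmpeg -version | head -n 1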
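
Because the install commands above discard their output, a failed install only shows up when `ffmpeg` is invoked. The following is a minimal sketch of the same check from a Python cell; the helper name `ffmpeg_version_line` is hypothetical and not part of the project.

```python
import shutil
import subprocess


def ffmpeg_version_line(binary="ffmpeg"):
    """Return the first line of `<binary> -version`, or None if it is not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True, check=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


print(ffmpeg_version_line() or "ffmpeg not found; re-run the install without output redirection")
```

If this prints the "not found" message, re-running the `apt-get` lines without `>/dev/null 2>&1` should reveal the underlying error.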

## Mount Drive (optional)
Use this if your videos are in Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Prepare workspace paths

In [None]:
from pathlib import Path
WORKDIR = Path('/content')
INPUT_DIR = WORKDIR / 'videos_finetune'
OUTPUT_DIR = WORKDIR / 'data' / 'videos_baseline'
INPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
print('Input dir:', INPUT_DIR)
print('Output dir:', OUTPUT_DIR)

## Upload or copy your `.mp4` files
Upload via the Colab file browser into `/content/videos_finetune`, or copy from Drive (example below).

In [None]:
%%bash
set -euo pipefail
echo 'Place your MP4s into /content/videos_finetune or edit and run the copy example below:'
# cp -v /content/drive/MyDrive/path/to/your_videos/*.mp4 /content/videos_finetune/ || true
ls -lh /content/videos_finetune || true
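
As an alternative to the commented `cp` line, the copy can be sketched in Python. The helper `copy_mp4s` is hypothetical, and the Drive path is the same placeholder used above, so it must be replaced with a real folder before anything is copied.

```python
import shutil
from pathlib import Path


def copy_mp4s(src_dir, dst_dir):
    """Copy every .mp4 file (case-insensitive) directly inside src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        if path.is_file() and path.suffix.lower() == ".mp4":
            copied.append(Path(shutil.copy2(path, dst / path.name)))
    return copied


# Placeholder Drive folder; replace with your own path before running.
DRIVE_VIDEOS = Path("/content/drive/MyDrive/path/to/your_videos")
if DRIVE_VIDEOS.is_dir():
    for p in copy_mp4s(DRIVE_VIDEOS, "/content/videos_finetune"):
        print("Copied:", p.name)
else:
    print("Drive folder not found:", DRIVE_VIDEOS)
```

Unlike the shell glob `*.mp4`, this version also picks up files with an uppercase `.MP4` extension.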

## Verify preprocessing script is available
This notebook expects the project repo files to be present so that it can run `tools/preprocess_videos.py`. If they are missing, upload the `tools/` folder or clone the repo into `/content`.

In [None]:
from pathlib import Path
script = Path('tools/preprocess_videos.py')
print('Found script at:', script.resolve() if script.exists() else 'NOT FOUND')
# If NOT FOUND, upload the script or clone the repo and switch into it, e.g.:
# !git clone https://github.com/<you>/<repo>.git /content/repo
# %cd /content/repo
# (`!cd` runs in a subshell and does not persist; `%cd` changes the notebook's working directory.)
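
Since later cells use relative paths, the working directory matters. The sketch below searches a few candidate directories for the script and switches into the first match. The function name `find_project_root` and the candidate list are assumptions for illustration; `/content/repo` is only the clone target from the example comment.

```python
import os
from pathlib import Path


def find_project_root(candidates, marker="tools/preprocess_videos.py"):
    """Return the first candidate directory that contains the marker file, else None."""
    for root in candidates:
        root = Path(root)
        if (root / marker).is_file():
            return root
    return None


root = find_project_root([Path.cwd(), "/content", "/content/repo"])
if root is None:
    print("tools/preprocess_videos.py NOT FOUND in any candidate directory")
else:
    os.chdir(root)  # later cells use relative paths such as tools/... and scripts/...
    print("Project root:", root)
```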

## Run preprocessing (1080p, 30fps)
Outputs go to `/content/data/videos_baseline` and a manifest to `/content/data/videos_manifest.jsonl`.

In [None]:
%%bash
set -euo pipefail
python tools/preprocess_videos.py \
  --input_dir '/content/videos_finetune' \
  --output_dir '/content/data/videos_baseline' \
  --target_height 1080 \
  --fps 30 \
  --compute_checksums
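
To inspect the manifest after the run, a generic JSON Lines reader is enough. This sketch makes no assumption about which fields the script writes; it reports the record count and the keys of the first record. The helper name `read_jsonl` is hypothetical.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


manifest = Path("/content/data/videos_manifest.jsonl")
if manifest.exists():
    records = read_jsonl(manifest)
    print(f"{len(records)} record(s) in {manifest}")
    if records and isinstance(records[0], dict):
        print("Fields in first record:", sorted(records[0]))
else:
    print("Manifest not found:", manifest)
```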

## Verify one output file

In [None]:
%%bash
set -euo pipefail
VID=$(ls -1 /content/data/videos_baseline/*.mp4 2>/dev/null | head -n1 || true)
if [ -n "$VID" ]; then
  echo "Video: $VID"
  ffprobe -v error -show_streams -select_streams v:0 "$VID" | sed -n '1,50p'
else
  echo 'No outputs found.'
fi
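
For a programmatic check, `ffprobe` can emit JSON that is easy to compare against the expected settings (H.264, yuv420p, 1080 height, 30 fps). The sketch below is one way to do that; the helper names `summarize_video_stream` and `probe` are hypothetical.

```python
import json
import subprocess
from fractions import Fraction
from pathlib import Path


def summarize_video_stream(ffprobe_json):
    """Extract codec, pixel format, size, and frame rate from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        return None
    s = streams[0]
    try:
        # ffprobe reports frame rates as fractions such as "30/1" or "30000/1001".
        fps = float(Fraction(s.get("avg_frame_rate", "0/1")))
    except (ValueError, ZeroDivisionError):
        fps = None
    return {
        "codec": s.get("codec_name"),
        "pix_fmt": s.get("pix_fmt"),
        "width": s.get("width"),
        "height": s.get("height"),
        "fps": fps,
    }


def probe(path):
    """Run ffprobe on the first video stream of `path` and summarize the result."""
    cmd = [
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=codec_name,pix_fmt,width,height,avg_frame_rate",
        "-of", "json", str(path),
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return summarize_video_stream(out)


outputs = sorted(Path("/content/data/videos_baseline").glob("*.mp4"))
print(probe(outputs[0]) if outputs else "No outputs found.")
```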


## Optional: Extract frames for training
Creates a frames dataset (`/content/data/frames/<video_stem>/*.jpg`) and a manifest at `/content/data/dataset.jsonl`.

In [None]:
%%bash
set -euo pipefail
python scripts/prepare_dataset.py \
  --videos_dir '/content/data/videos_baseline' \
  --out_dir '/content/data/frames' \
  --fps 2 --size 384
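
A quick sanity check after extraction is to count the JPEGs per video. This sketch assumes only the `/content/data/frames/<video_stem>/*.jpg` layout described above; the helper name `count_frames` is hypothetical.

```python
from pathlib import Path


def count_frames(frames_root):
    """Map each video subdirectory name to the number of .jpg files it contains."""
    root = Path(frames_root)
    return {
        d.name: sum(1 for _ in d.glob("*.jpg"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


frames_root = Path("/content/data/frames")
if frames_root.is_dir():
    for stem, n in count_frames(frames_root).items():
        print(f"{stem}: {n} frame(s)")
else:
    print("Frames directory not found:", frames_root)
```

At `--fps 2`, a video should yield roughly two frames per second of footage, so a directory with zero or very few frames is worth a second look.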


## Notes
- The preprocessing script encodes video with H.264 (libx264) and audio with AAC (or passes `-an` when there is no audio), uses the yuv420p pixel format, and sets `-movflags +faststart`.
- Adjust `--target_height`, `--fps`, and `--crf` as needed.
- If the `tools/` or `scripts/` paths are not found, make sure you opened this notebook from the project root, or adjust the paths accordingly.
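
To make the settings in these notes concrete, here is a sketch of an `ffmpeg` command line that applies them. It is an illustration under stated assumptions, not the command that `tools/preprocess_videos.py` actually builds: the function name `build_ffmpeg_cmd`, the default CRF of 18, and the `scale=-2:<height>` filter (which scales by height while keeping the width even) are all assumptions.

```python
def build_ffmpeg_cmd(src, dst, target_height=1080, fps=30, crf=18, has_audio=True):
    """Build an ffmpeg argument list for H.264/yuv420p output with +faststart.

    Assumed illustration only; the project's script may choose different options.
    """
    cmd = ["ffmpeg", "-y", "-i", str(src)]
    # Scale to the target height; -2 keeps the aspect ratio with an even width.
    cmd += ["-vf", f"scale=-2:{target_height}"]
    if fps is not None:
        cmd += ["-r", str(fps)]
    cmd += ["-c:v", "libx264", "-crf", str(crf), "-pix_fmt", "yuv420p"]
    # Encode audio as AAC, or drop the audio track entirely with -an.
    cmd += ["-c:a", "aac"] if has_audio else ["-an"]
    cmd += ["-movflags", "+faststart", str(dst)]
    return cmd


print(" ".join(build_ffmpeg_cmd("input.mp4", "output.mp4")))
```

Lower CRF values give higher quality and larger files. Passing `fps=None` omits `-r` and leaves the source frame rate unchanged, which matches the "optional 30fps" behavior described at the top.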