# Child Speech Recognition - Google Colab Training

This notebook trains a Whisper-small model on children's speech data using Google Colab's free GPU.

**Before starting:**
1. Enable GPU: `Runtime` → `Change runtime type` → `T4 GPU`
2. Transfer data to Colab over a bore tunnel (see section 2)
3. Run cells sequentially

**Estimated time:**
- Setup: 5 minutes
- Quick test: 5-10 minutes  
- Full training: 3-6 hours

## 1. Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/ekshubina/childs_speech_recog_chall.git
%cd childs_speech_recog_chall

# Install dependencies (takes ~2 minutes)
!pip install -q -r requirements.txt

# Set PYTHONPATH so 'src' module is importable in subprocesses
import os
os.environ['PYTHONPATH'] = '/content/childs_speech_recog_chall'

In [None]:
# Verify GPU and PyTorch installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Training will be very slow.")
    print("Enable GPU: Runtime → Change runtime type → T4 GPU")

# Verify src module is importable
try:
    from src.utils.config import load_config
    print("✓ src module loaded successfully")
except ImportError as e:
    print(f"❌ ERROR: Cannot import src module: {e}")
    print("Re-run the setup cell above")

### Update Code from Git

Run this cell if you've pushed changes to GitHub and need to update the code in Colab.

In [None]:
%cd /content/childs_speech_recog_chall
!git pull origin main

# Reinstall requirements if dependencies changed
!pip install -q -r requirements.txt

print("✓ Code updated from GitHub")

## 2. Load Data via Bore Tunnel

Transfer data from your Mac directly to Colab — no Google Drive needed.

**On your Mac, open two terminals:**

**Terminal 1** — serve your data directory:
```bash
cd /path/to/childs_speech_recog_chall/data
python3 -m http.server 8080
```

**Terminal 2** — expose it with bore:
```bash
bore local 8080 --to bore.pub
# Prints: listening at bore.pub:XXXXX  ← copy that port number
```

Then run the cleanup cell below (it deletes any existing `data/` directory), set `BORE_PORT` in the download cell after it, and run that cell.

In [None]:
import os
if os.path.exists('data'):
    !rm -r data

In [None]:
# ← Set your bore port number here
BORE_PORT = "XXXXX"
BORE_URL = f"http://bore.pub:{BORE_PORT}"

!mkdir -p data

# Download zip files from your Mac
!wget -q --show-progress "{BORE_URL}/audio_0.zip" -O data/audio_0.zip
!wget -q --show-progress "{BORE_URL}/audio_1.zip" -O data/audio_1.zip
!wget -q --show-progress "{BORE_URL}/audio_2.zip" -O data/audio_2.zip
!wget -q --show-progress "{BORE_URL}/train_word_transcripts.jsonl" -O data/train_word_transcripts.jsonl

print("✓ Downloads complete")
# Extract zip files
# Note: assumes each zip extracts to audio_0/, audio_1/, audio_2/ inside data/
# Run: !unzip -l data/audio_0.zip | head -5   to check structure if unsure
!unzip -q data/audio_0.zip -d data/
!unzip -q data/audio_1.zip -d data/
!unzip -q data/audio_2.zip -d data/

# Free up disk space
# !rm data/audio_*.zip

# Verify
from pathlib import Path

manifest_path = Path('data/train_word_transcripts.jsonl')
if manifest_path.exists():
    with open(manifest_path) as f:
        sample_count = sum(1 for _ in f)
    print(f"✓ Found training manifest with {sample_count:,} samples")
else:
    print("❌ ERROR: Training manifest not found!")

for audio_dir in ['audio_0', 'audio_1', 'audio_2']:
    audio_path = Path(f'data/{audio_dir}')
    if audio_path.exists():
        file_count = len(list(audio_path.glob('**/*.flac')))
        print(f"✓ Found data/{audio_dir}/ with {file_count:,} audio files")
    else:
        print(f"❌ ERROR: data/{audio_dir}/ not found! Check zip structure.")
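
The extraction step assumes each archive unpacks into its own top-level folder (`audio_0/`, `audio_1/`, `audio_2/`). A minimal sketch of checking that from Python before extracting, using only the standard `zipfile` module. The helper name `top_level_entries` is hypothetical and not part of the repository.

```python
import zipfile
from pathlib import PurePosixPath


def top_level_entries(zip_path):
    """Return the sorted top-level names found inside a zip archive.

    Hypothetical helper for illustration; not part of the repository.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Zip member names always use forward slashes, hence PurePosixPath
        return sorted({PurePosixPath(name).parts[0] for name in zf.namelist() if name})


# Example usage in Colab (paths follow the download cell above):
# for i in range(3):
#     print(f"audio_{i}.zip ->", top_level_entries(f"data/audio_{i}.zip"))
```

If the result is anything other than a single `audio_N` entry, the `unzip -d data/` target probably needs adjusting.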

## 3. Mount Google Drive

Checkpoints are written straight to Google Drive through a symlink, so they survive a session disconnect and need no copy step.  
**Run this before the quick test and full training.**


In [None]:
# Mount Google Drive and symlink checkpoint directory
# The trainer writes to a local path, which is a symlink → Drive.
# Checkpoints land on Drive in real time — no copy step needed.
import os, yaml
from pathlib import Path

# Always anchor to repo root — prevents CWD-relative path bugs after runtime restarts
REPO_ROOT = Path("/content/childs_speech_recog_chall")
os.chdir(REPO_ROOT)
print(f"✓ CWD: {os.getcwd()}")

from google.colab import drive
drive.mount('/content/drive')

DRIVE_CHECKPOINT_DIR = "/content/drive/MyDrive/childs_speech_recog_chall/checkpoints"
LOCAL_CHECKPOINT_DIR = str(REPO_ROOT / "checkpoints")  # symlink → Drive

# Create the real directory on Drive
os.makedirs(DRIVE_CHECKPOINT_DIR, exist_ok=True)

# Symlink local path → Drive (so trainer writes straight to Drive)
if not Path(LOCAL_CHECKPOINT_DIR).exists():
    os.symlink(DRIVE_CHECKPOINT_DIR, LOCAL_CHECKPOINT_DIR)
    print(f"✓ Symlink created: {LOCAL_CHECKPOINT_DIR} → {DRIVE_CHECKPOINT_DIR}")
elif Path(LOCAL_CHECKPOINT_DIR).is_symlink():
    print(f"✓ Symlink already exists: {LOCAL_CHECKPOINT_DIR} → {os.readlink(LOCAL_CHECKPOINT_DIR)}")
else:
    # A real directory exists — replace it with a symlink
    import shutil
    shutil.rmtree(LOCAL_CHECKPOINT_DIR)
    os.symlink(DRIVE_CHECKPOINT_DIR, LOCAL_CHECKPOINT_DIR)
    print(f"✓ Replaced local dir with symlink → {DRIVE_CHECKPOINT_DIR}")

# --- Build patched config and write it to an absolute path ---
RUN_NAME       = "baseline_whisper_small"
# Use absolute path so the correct file is always written regardless of CWD
config_patched = str(REPO_ROOT / "configs" / "baseline_whisper_small_local.yaml")

with open(REPO_ROOT / "configs" / "baseline_whisper_small.yaml") as f:
    cfg = yaml.safe_load(f)

local_output = f"{LOCAL_CHECKPOINT_DIR}/{RUN_NAME}"
cfg["training"]["output_dir"]  = local_output
cfg["training"]["logging_dir"] = f"{local_output}/runs"

# Memory optimizations for T4 GPU (15 GB):
# - batch_size=4 + grad_accum=10 → effective batch = 40 (≈ original 12*3=36)
# - freeze_encoder skips storing encoder gradients (roughly a third of Whisper-small's params) → ~4 GB saved
# - eval_batch_size=8 is safe (no gradients needed during eval)
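cfg["training"]["batch_size"]                  = 4
cfg["training"]["gradient_accumulation_steps"] = 10
cfg["training"]["eval_batch_size"]             = 8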
cfg["model"]["freeze_encoder"]                 = True

with open(config_patched, "w") as f:
    yaml.dump(cfg, f, default_flow_style=False, allow_unicode=True)

# --- Verify the file on disk matches what we wrote ---
with open(config_patched) as f:
    verify = yaml.safe_load(f)

assert verify["training"]["batch_size"] == 4, \
    f"❌ batch_size write failed! Got {verify['training']['batch_size']}"
assert verify["training"]["gradient_accumulation_steps"] == 10, \
    f"❌ grad_accum write failed! Got {verify['training']['gradient_accumulation_steps']}"
assert verify["model"]["freeze_encoder"] == True, \
    f"❌ freeze_encoder write failed! Got {verify['model']['freeze_encoder']}"

print(f"✓ Config written and verified: {config_patched}")
print(f"  batch_size            = {verify['training']['batch_size']}")
print(f"  gradient_accumulation = {verify['training']['gradient_accumulation_steps']}")
print(f"  effective_batch       = {verify['training']['batch_size'] * verify['training']['gradient_accumulation_steps']}")
print(f"  eval_batch_size       = {verify['training']['eval_batch_size']}")
print(f"  freeze_encoder        = {verify['model']['freeze_encoder']}")
print(f"✓ Checkpoints → {local_output}  (symlinked to Drive, saved in real time)")
print(f"✓ On Drive: MyDrive/childs_speech_recog_chall/checkpoints/{RUN_NAME}/")
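
The symlink handling in the cell above has three branches: nothing exists yet, a symlink already exists, or a real directory is in the way. A sketch of the same idea as a reusable function, under the assumption that a link pointing somewhere else should be repointed. `ensure_symlink` and its `replace_dir` flag are hypothetical names, and unlike the cell above it refuses to delete a real directory unless explicitly asked.

```python
import os
import shutil
from pathlib import Path


def ensure_symlink(link, target, replace_dir=False):
    """Make `link` a symlink to `target` and return a short status string.

    Hypothetical helper for illustration; not part of the repository.
    """
    link, target = Path(link), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    target = target.resolve()

    if link.is_symlink():
        # is_symlink() is True even for a dangling link,
        # which Path.exists() would report as missing
        if link.resolve() == target:
            return "unchanged"
        link.unlink()
        status = "relinked"
    elif link.is_dir():
        if not replace_dir:
            raise FileExistsError(f"{link} is a real directory; pass replace_dir=True to delete it")
        shutil.rmtree(link)  # destructive: removes any local checkpoints
        status = "replaced"
    else:
        status = "created"

    os.symlink(target, link)
    return status


# Example usage with the notebook's variables:
# ensure_symlink(LOCAL_CHECKPOINT_DIR, DRIVE_CHECKPOINT_DIR)
```

Requiring an explicit flag before `shutil.rmtree` is a design choice to avoid silently losing checkpoints that were written locally before Drive was mounted.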


## 4. Quick Test (Recommended First Step)

Run a quick test on 100 samples to verify everything works before starting full training.  
After it finishes, check the checkpoint listing printed at the end of the cell to confirm that files appear on Drive.


In [None]:
# Quick test — trains on 100 samples to verify the pipeline (~5-10 min)
# Checkpoints are written directly to Drive via symlink — nothing to copy.
!PYTORCH_ALLOC_CONF=expandable_segments:True python scripts/train.py --config {config_patched} --debug

from pathlib import Path
drive_output = f"{DRIVE_CHECKPOINT_DIR}/{RUN_NAME}"
if Path(drive_output).exists():
    print(f"\n✓ Checkpoints on Drive: MyDrive/childs_speech_recog_chall/checkpoints/{RUN_NAME}/")
    print(f"   {[p.name for p in sorted(Path(drive_output).iterdir())]}")
else:
    print("⚠️ No checkpoints found — check logs above for errors")


## 5. Full Training

⚠️ **This will take 3-6 hours** depending on GPU speed. Make sure:
- You have GPU enabled (Runtime → Change runtime type → T4 GPU)
- **Re-run the Drive mount cell (cell 11)** to apply memory optimizations to the config
- Checkpoints are written to Drive in real time via the symlink, so they are safe even if the session disconnects

To **resume** from a previous checkpoint:
```python
!PYTORCH_ALLOC_CONF=expandable_segments:True python scripts/train.py \
    --config {config_patched} \
    --resume {LOCAL_CHECKPOINT_DIR}/{RUN_NAME}/checkpoint-XXXX
```


In [None]:
# Full training — checkpoints are written directly to Drive via symlink.
# To resume from a previous run, set --resume to a checkpoint path on Drive:
#   --resume {LOCAL_CHECKPOINT_DIR}/{RUN_NAME}/checkpoint-XXXX
!PYTORCH_ALLOC_CONF=expandable_segments:True python scripts/train.py --config {config_patched}

from pathlib import Path
drive_output = f"{DRIVE_CHECKPOINT_DIR}/{RUN_NAME}"
if Path(drive_output).exists():
    print(f"\n✓ Checkpoints on Drive: MyDrive/childs_speech_recog_chall/checkpoints/{RUN_NAME}/")
    print(f"   {[p.name for p in sorted(Path(drive_output).iterdir())]}")
else:
    print("⚠️ No checkpoints found — check logs above for errors")


## 6. Monitor Training with TensorBoard


In [None]:
# Load TensorBoard: logs are read through the checkpoints symlink (stored on Drive)
%load_ext tensorboard
%tensorboard --logdir {LOCAL_CHECKPOINT_DIR}/baseline_whisper_small/runs


## 7. Evaluation


In [None]:
# Evaluate on validation set — uses model from Google Drive
MODEL_PATH = f"{DRIVE_CHECKPOINT_DIR}/baseline_whisper_small/final_model"

!python scripts/evaluate.py \
    --model-path {MODEL_PATH} \
    --val-manifest data/val_manifest.jsonl


## 8. Generate Predictions


In [None]:
# Generate predictions on test set
!python scripts/predict.py \
    --model-path {MODEL_PATH} \
    --input-jsonl data/test_manifest.jsonl \
    --output-jsonl predictions.jsonl \
    --batch-size 16

# Copy predictions to Drive
!cp predictions.jsonl /content/drive/MyDrive/predictions.jsonl

print("✓ Predictions saved to Google Drive")
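
Before downloading, it can help to confirm that the predictions file is well-formed JSONL. A minimal sketch using only the standard library. The helper name `count_jsonl_records` is hypothetical, and it checks only that each line parses as JSON, since the fields written by `scripts/predict.py` are not shown here.

```python
import json


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file; raise ValueError on the first malformed line.

    Hypothetical helper for illustration; not part of the repository.
    """
    count = 0
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            count += 1
    return count


# Example usage in Colab:
# print(f"predictions.jsonl holds {count_jsonl_records('predictions.jsonl'):,} records")
```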

## 9. Download Results


In [None]:
# Download predictions file to your local machine
from google.colab import files
files.download('predictions.jsonl')

## Troubleshooting

### No `childs_speech_recog_chall` folder on Google Drive
If the folder is missing, re-run the **Drive mount cell** (cell 11) to remount Drive and recreate the symlink and variables. Checkpoints are saved directly to Drive in real time via the symlink — no copy step needed.

### Session Timeout During Training
If your Colab session disconnects, checkpoints already saved are safe on Drive. Resume from the last checkpoint:
```python
RESUME_CKPT = "/content/drive/MyDrive/childs_speech_recog_chall/checkpoints/baseline_whisper_small/checkpoint-XXXX"
!PYTORCH_ALLOC_CONF=expandable_segments:True python scripts/train.py \
    --config configs/baseline_whisper_small_local.yaml \
    --resume {RESUME_CKPT}
```
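
Rather than typing the checkpoint number by hand, the newest checkpoint can be located programmatically. A sketch, assuming the `XXXX` in `checkpoint-XXXX` is a numeric training step (the Hugging Face `Trainer` naming convention; the repository's trainer may differ). `latest_checkpoint` is a hypothetical helper name.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None.

    Hypothetical helper; assumes the suffix after 'checkpoint-' is a numeric step.
    """
    pattern = re.compile(r"^checkpoint-(\d+)$")
    candidates = []
    for path in Path(run_dir).glob("checkpoint-*"):
        match = pattern.match(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare steps as integers so checkpoint-1000 ranks above checkpoint-500
    return max(candidates, key=lambda item: item[0])[1]


# Example usage with the notebook's variables:
# RESUME_CKPT = latest_checkpoint(f"{DRIVE_CHECKPOINT_DIR}/{RUN_NAME}")
```

Sorting by the integer step matters because a plain string sort would place `checkpoint-500` after `checkpoint-1000`.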

### Out of Memory (OOM)
The Drive mount cell (cell 11) patches the config to use memory-safe settings for the T4 GPU:
- `batch_size=4`, `gradient_accumulation_steps=10` → effective batch = 40
- `freeze_encoder=True` → skips encoder gradients, saves ~4 GB
- `PYTORCH_ALLOC_CONF=expandable_segments:True` → reduces fragmentation

**You must re-run cell 11 to regenerate `configs/baseline_whisper_small_local.yaml` before training.**  
If you still get OOM after re-running cell 11, reduce batch size further by editing cell 11:
```python
cfg["training"]["batch_size"]                  = 2
cfg["training"]["gradient_accumulation_steps"] = 20
```
Then re-run cell 11 and restart training.
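
The adjustment above keeps the effective batch size constant: `batch_size * gradient_accumulation_steps` stays at 40 while per-step memory drops. A small sketch of that arithmetic; `halve_batch_size` is a hypothetical helper, not part of the repository.

```python
def halve_batch_size(batch_size, grad_accum):
    """Halve the per-step batch and double accumulation, keeping their product fixed.

    Hypothetical helper for illustration; not part of the repository.
    """
    if batch_size < 2 or batch_size % 2:
        raise ValueError("batch_size must be an even number >= 2")
    return batch_size // 2, grad_accum * 2


# Example usage inside cell 11, after loading cfg:
# bs, ga = halve_batch_size(4, 10)
# cfg["training"]["batch_size"] = bs
# cfg["training"]["gradient_accumulation_steps"] = ga
```

Note that the assertions in cell 11 check for `batch_size == 4` and `gradient_accumulation_steps == 10`, so they would need updating to match any new values.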
