# Child Speech Recognition - Google Colab Training

This notebook trains a Whisper-small model on children's speech data using Google Colab's free GPU.

**Before starting:**
1. Enable GPU: `Runtime` → `Change runtime type` → `T4 GPU`
2. Transfer data to Colab through a bore tunnel (see section 2 below)
3. Run cells sequentially

**Estimated time:**
- Setup: 5 minutes
- Quick test: 5-10 minutes
- Full training: 3-6 hours

## 1. Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/ekshubina/childs_speech_recog_chall.git
%cd childs_speech_recog_chall

# Install dependencies (takes ~2 minutes)
!pip install -q -r requirements.txt

In [None]:
# Verify GPU and PyTorch installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Training will be very slow.")
    print("Enable GPU: Runtime → Change runtime type → T4 GPU")

## 2. Load Data via Bore Tunnel

Transfer data from your Mac directly to Colab — no Google Drive needed.

**On your Mac, open two terminals:**

**Terminal 1** — serve your data directory:
```bash
cd /path/to/childs_speech_recog_chall/data
python3 -m http.server 8080
```

**Terminal 2** — expose it with bore:
```bash
bore local 8080 --to bore.pub
# Prints: listening at bore.pub:XXXXX  ← copy that port number
```

Then set `BORE_PORT` in the cell below and run it.

In [None]:
# Remove any previous data directory so the download starts from a clean state
import os
if os.path.exists('data'):
    !rm -r data

In [None]:
# ← Set your bore port number here
BORE_PORT = "XXXXX"
BORE_URL = f"http://bore.pub:{BORE_PORT}"

!mkdir -p data

# Download zip files from your Mac
!wget -q --show-progress "{BORE_URL}/audio_0.zip" -O data/audio_0.zip
!wget -q --show-progress "{BORE_URL}/audio_1.zip" -O data/audio_1.zip
!wget -q --show-progress "{BORE_URL}/audio_2.zip" -O data/audio_2.zip
!wget -q --show-progress "{BORE_URL}/train_word_transcripts.jsonl" -O data/train_word_transcripts.jsonl

print("✓ Downloads complete")
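
If `BORE_PORT` is left as the `XXXXX` placeholder, the `wget` calls above fail with an unhelpful error. A minimal sketch of a guard that catches this before any download starts is shown here. The helper name `bore_url` is hypothetical and not part of the repository.

```python
def bore_url(port, host="bore.pub"):
    """Build the tunnel URL, rejecting an unset or malformed port.

    Hypothetical helper: only the host and URL shape come from the cell above.
    """
    port = str(port).strip()
    if not port.isdigit() or not (1 <= int(port) <= 65535):
        raise ValueError(
            f"BORE_PORT must be the port number printed by bore, got {port!r}"
        )
    return f"http://{host}:{port}"
```

It could replace the f-string assignment, for example `BORE_URL = bore_url(BORE_PORT)`.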

In [None]:
# Extract zip files
# Note: assumes each zip extracts to audio_0/, audio_1/, audio_2/ inside data/
# Run: !unzip -l data/audio_0.zip | head -5   to check structure if unsure
!unzip -q data/audio_0.zip -d data/
!unzip -q data/audio_1.zip -d data/
!unzip -q data/audio_2.zip -d data/

# Optional: uncomment to delete the zips and free disk space once extraction is verified
# !rm data/audio_*.zip

# Verify
from pathlib import Path

manifest_path = Path('data/train_word_transcripts.jsonl')
if manifest_path.exists():
    with open(manifest_path) as f:
        sample_count = sum(1 for _ in f)
    print(f"✓ Found training manifest with {sample_count:,} samples")
else:
    print("❌ ERROR: Training manifest not found!")

for audio_dir in ['audio_0', 'audio_1', 'audio_2']:
    audio_path = Path(f'data/{audio_dir}')
    if audio_path.exists():
        file_count = len(list(audio_path.glob('**/*.flac')))
        print(f"✓ Found data/{audio_dir}/ with {file_count:,} audio files")
    else:
        print(f"❌ ERROR: data/{audio_dir}/ not found! Check zip structure.")
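
The check above counts files but does not confirm that each manifest entry points at an audio file that exists. A rough sketch of such a cross-check follows. The field name `audio_path` is an assumption, since the manifest schema is not shown in this notebook; inspect one line of `train_word_transcripts.jsonl` and adjust `path_key` to match.

```python
import json
from pathlib import Path

def find_missing_audio(manifest_path, data_dir, path_key="audio_path"):
    """Return manifest audio paths that do not exist under data_dir.

    path_key is a hypothetical field name; change it to the real manifest key.
    """
    data_dir = Path(data_dir)
    missing = []
    with open(manifest_path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            rel_path = record.get(path_key)
            if rel_path is None:
                raise KeyError(
                    f"line {line_no}: no {path_key!r} field; keys are {sorted(record)}"
                )
            if not (data_dir / rel_path).exists():
                missing.append(rel_path)
    return missing
```

For example, `find_missing_audio('data/train_word_transcripts.jsonl', 'data')` returns an empty list when every referenced file is present, assuming paths in the manifest are relative to `data/`.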

## 3. Quick Test (Recommended First Step)

Run a quick test on 100 samples to verify everything works before starting full training.

In [None]:
# Quick test - trains on 100 samples (takes ~5-10 minutes)
!python scripts/train.py --config configs/baseline_whisper_small.yaml --debug

## 4. Full Training

⚠️ **This will take 3-6 hours** depending on GPU speed. Make sure:
- You have GPU enabled (Runtime → Change runtime type → T4 GPU)
- Your Colab session won't time out (keep the browser open or use Colab Pro)
- You have a plan to download checkpoints, since they are stored on the Colab VM and are lost when the session ends

In [None]:
# Checkpoints are saved locally in Colab's VM storage
# ⚠️ They will be lost if the session ends — download them when training is done
CHECKPOINT_DIR = "/content/childs_speech_recog_chall/checkpoints"

print(f"✓ Checkpoints will be saved to {CHECKPOINT_DIR}")
print("⚠️  Remember to download checkpoints before your session ends!")
print("    Use: files.download() or copy back via bore")

In [None]:
# Start full training
!python scripts/train.py --config configs/baseline_whisper_small.yaml

## 5. Monitor Training with TensorBoard

In [None]:
# Load TensorBoard
%load_ext tensorboard
%tensorboard --logdir logs/baseline_whisper_small

# Alternative: View logs directly
# !tail -n 50 logs/train_*.log

## 6. Evaluation

In [None]:
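# Evaluate on validation set
# Requires CHECKPOINT_DIR from section 4.
# Note: data/val_manifest.jsonl is not fetched by the download cell in section 2,
# so confirm it exists before running this cell.
MODEL_PATH = f"{CHECKPOINT_DIR}/baseline_whisper_small/final_model"

!python scripts/evaluate.py \
    --model-path {MODEL_PATH} \
    --val-manifest data/val_manifest.jsonl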

## 7. Generate Predictions

In [None]:
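# Generate predictions on test set
# Note: data/test_manifest.jsonl is not fetched by the download cell in section 2
!python scripts/predict.py \
    --model-path {MODEL_PATH} \
    --input-jsonl data/test_manifest.jsonl \
    --output-jsonl predictions.jsonl \
    --batch-size 16

# Copy predictions to Drive only if Drive is mounted (this notebook does not mount it)
import os, shutil
drive_dir = '/content/drive/MyDrive'
if os.path.isdir(drive_dir):
    shutil.copy('predictions.jsonl', f'{drive_dir}/predictions.jsonl')
    print("✓ Predictions saved to Google Drive")
else:
    print("Google Drive is not mounted; predictions.jsonl stays on the Colab VM")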

## 8. Download Results

In [None]:
# Download predictions file to your local machine
from google.colab import files
files.download('predictions.jsonl')
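
A quick sanity check on `predictions.jsonl` can catch an empty or truncated file. The sketch below only verifies that each non-blank line parses as JSON; it makes no assumption about the fields in each record. `check_jsonl` is a hypothetical helper, not part of the repository.

```python
import json

def check_jsonl(path):
    """Return (valid_count, bad_line_numbers) for a JSON Lines file."""
    valid, bad = 0, []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                json.loads(line)
                valid += 1
            except json.JSONDecodeError:
                bad.append(line_no)  # remember 1-based line numbers that fail
    return valid, bad
```

Calling `check_jsonl('predictions.jsonl')` and comparing the valid count with the number of lines in the test manifest gives a rough completeness check.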

## Troubleshooting

### Session Timeout
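If your Colab session disconnects during training:
1. Re-run the cells in sections 1–2 to set up the environment and re-download the data
2. Resume from the last checkpoint. This only works if the checkpoint still exists, either because the runtime reconnected to the same VM or because you restored a checkpoint you downloaded earlier: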
```python
!python scripts/train.py \
    --config configs/baseline_whisper_small.yaml \
    --resume checkpoints/baseline_whisper_small/checkpoint-XXXX
```
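
To fill in the `checkpoint-XXXX` placeholder, the newest checkpoint directory has to be picked by its numeric suffix, because a plain alphabetical sort puts `checkpoint-500` after `checkpoint-1000`. A minimal sketch, assuming checkpoints are directories named `checkpoint-<number>` as in the command above (`latest_checkpoint` is a hypothetical helper):

```python
import re
from pathlib import Path

def latest_checkpoint(run_dir):
    """Return the checkpoint-<N> directory with the largest N, or None."""
    pattern = re.compile(r"checkpoint-(\d+)")
    candidates = []
    for path in Path(run_dir).glob("checkpoint-*"):
        match = pattern.fullmatch(path.name)
        if match and path.is_dir():
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by the numeric suffix only, not by the string name
    return max(candidates, key=lambda item: item[0])[1]
```

For example, `latest_checkpoint('checkpoints/baseline_whisper_small')` returns a path that can be passed to `--resume`, or `None` if no checkpoint survived.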

### Bore connection drops / download stalls
Restart bore on your Mac and re-run the download cell with the new port.
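
A stalled download can leave a truncated zip on disk that only fails later, during extraction. The sketch below uses the standard-library `zipfile` module to tell which archives need to be fetched again. `zip_is_intact` is a hypothetical helper; note that `testzip()` reads every member to verify its CRC, so it can take a while on large archives.

```python
import zipfile
from pathlib import Path

def zip_is_intact(path):
    """Return True if path is a readable zip whose members all pass CRC checks."""
    path = Path(path)
    if not path.exists() or path.stat().st_size == 0:
        return False
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() returns the first corrupt member name, or None if all are fine
            return zf.testzip() is None
    except zipfile.BadZipFile:
        # Raised when the archive is cut off or is not a zip at all
        return False
```

Running it over `data/audio_0.zip`, `data/audio_1.zip`, and `data/audio_2.zip` after the download cell shows which files to download again.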

### Wrong zip structure
Check what's inside your zip before extracting:
```python
!unzip -l data/audio_0.zip | head -10
```
If files are at root level (not inside `audio_0/`), extract to the specific folder:
```python
!unzip -q data/audio_0.zip -d data/audio_0/
```
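
The same decision can be made programmatically. The sketch below inspects the archive's entry names and returns the directory to pass to `unzip -d`. It covers only the two layouts described above, and it skips `__MACOSX/` entries, which archives created with the macOS Finder usually contain. `extraction_target` is a hypothetical helper.

```python
import zipfile
from pathlib import Path

def extraction_target(zip_path, expected_dir, data_dir="data"):
    """Pick the unzip -d target based on the archive's top-level layout."""
    with zipfile.ZipFile(zip_path) as zf:
        # Ignore macOS resource-fork entries when judging the layout
        names = [n for n in zf.namelist() if not n.startswith("__MACOSX/")]
    if not names:
        raise ValueError(f"{zip_path} contains no files")
    prefix = expected_dir.rstrip("/") + "/"
    if all(name.startswith(prefix) for name in names):
        return Path(data_dir)  # archive already contains expected_dir/
    return Path(data_dir) / expected_dir  # files at root: extract into expected_dir/
```

For example, `extraction_target('data/audio_0.zip', 'audio_0')` yields `data` for a nested archive and `data/audio_0` for a flat one.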

### Out of Memory (OOM)
If you get CUDA OOM errors, reduce the batch size. The command below only has an effect if the config currently contains `batch_size: 12`:
```python
!sed -i 's/batch_size: 12/batch_size: 8/g' configs/baseline_whisper_small.yaml
```
Then restart the runtime and try again.
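
Because `sed` exits silently when nothing matches, it is easy to believe the batch size changed when it did not. A small sketch of the same edit in Python that reports how many replacements were made is shown here. Like the `sed` command, it matches any key ending in `batch_size:`, and it accepts whatever integer is currently set. `set_batch_size` is a hypothetical helper, and the `batch_size` key name is taken from the command above rather than from the config file itself.

```python
import re

def set_batch_size(config_text, new_size):
    """Replace every `batch_size: <int>` value; return (new_text, replacements)."""
    pattern = re.compile(r"(batch_size:[ \t]*)\d+")
    # A function replacement avoids ambiguity between the group reference and the digits
    return pattern.subn(lambda m: f"{m.group(1)}{new_size}", config_text)

# Example usage against the config file (uncomment in the notebook):
# from pathlib import Path
# cfg = Path("configs/baseline_whisper_small.yaml")
# text, count = set_batch_size(cfg.read_text(), 8)
# if count == 0:
#     print("No batch_size entry found; check the key name in the config")
# else:
#     cfg.write_text(text)
```

A replacement count of zero means the key was not found and the config is unchanged.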

### Slow Training
Verify GPU is enabled:
```python
import torch
print(torch.cuda.is_available())  # Should be True
```

### Save checkpoints before session ends
```python
from google.colab import files
import shutil
shutil.make_archive('checkpoints', 'zip', 'checkpoints/')
files.download('checkpoints.zip')
```