# Child Speech Recognition - Google Colab Training

This notebook trains a Whisper-small model on children's speech data using Google Colab's free GPU.

**Before starting:**
1. Enable GPU: `Runtime` → `Change runtime type` → `T4 GPU`
2. Upload data to Google Drive (see instructions below)
3. Run cells sequentially

**Estimated time:**
- Setup: 5 minutes
- Quick test: 5-10 minutes
- Full training: 3-6 hours

## 1. Setup Environment

In [2]:
# Clone repository
!git clone https://github.com/ekshubina/childs_speech_recog_chall.git
%cd childs_speech_recog_chall

# Install dependencies (takes ~2 minutes)
!pip install -q -r requirements.txt

Cloning into 'childs_speech_recog_chall'...
remote: Enumerating objects: 107, done.
remote: Counting objects: 100% (107/107), done.
remote: Compressing objects: 100% (72/72), done.
remote: Total 107 (delta 28), reused 107 (delta 28), pack-reused 0 (from 0)
Receiving objects: 100% (107/107), 146.65 KiB | 9.17 MiB/s, done.
Resolving deltas: 100% (28/28), done.
/content/childs_speech_recog_chall
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 803.2/803.2 kB 19.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 114.6 MB/s eta 0:00:00
  Building wheel for openai-whisper (pyproject.toml) ... done


In [3]:
# Verify GPU and PyTorch installation
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Training will be very slow.")
    print("Enable GPU: Runtime → Change runtime type → T4 GPU")

PyTorch version: 2.9.0+cu128
CUDA available: True
GPU: Tesla T4
GPU Memory: 15.64 GB


## 2. Mount Google Drive and Link Data

**First-time setup:**
1. Upload your data to Google Drive:
   - `MyDrive/child_speech_data/train_word_transcripts.jsonl`
   - `MyDrive/child_speech_data/audio_0/`
   - `MyDrive/child_speech_data/audio_1/`
   - `MyDrive/child_speech_data/audio_2/`

2. Adjust paths in the cell below if your data is in a different location

In [None]:
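# Mount Google Drive so the files under MyDrive are visible to this runtime
from google.colab import drive
drive.mount('/content/drive')

# Create symlinks to your data in Google Drive
# Adjust these paths if your data is in a different location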
DRIVE_DATA_PATH = "/content/drive/MyDrive/child_speech_data"

# Create data directory if it doesn't exist
!mkdir -p data

# Link manifest file
!ln -sf {DRIVE_DATA_PATH}/train_word_transcripts.jsonl data/train_word_transcripts.jsonl

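# Link audio directories
# (-n replaces an existing link on re-runs instead of nesting a new link inside the linked directory)
!ln -sfn {DRIVE_DATA_PATH}/audio_0 data/audio_0
!ln -sfn {DRIVE_DATA_PATH}/audio_1 data/audio_1
!ln -sfn {DRIVE_DATA_PATH}/audio_2 data/audio_2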

# Verify data is accessible
import json
from pathlib import Path

manifest_path = Path('data/train_word_transcripts.jsonl')
if manifest_path.exists():
    with open(manifest_path) as f:
        sample_count = sum(1 for _ in f)
    print(f"✓ Found training manifest with {sample_count:,} samples")
else:
    print("❌ ERROR: Training manifest not found!")
    print(f"Expected at: {manifest_path.absolute()}")
    print("Make sure data is uploaded to Google Drive and paths are correct.")

for audio_dir in ['audio_0', 'audio_1', 'audio_2']:
    audio_path = Path(f'data/{audio_dir}')
    if audio_path.exists():
        file_count = len(list(audio_path.glob('*.flac')))
        print(f"✓ Found data/{audio_dir}/ with {file_count:,} audio files")
    else:
        print(f"❌ ERROR: data/{audio_dir}/ not found!")

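The check above only counts lines in the manifest. A minimal sketch that also reports lines that fail to parse, assuming only that each manifest line is meant to hold one JSON object (the field names are not shown in this notebook, so none are assumed). `scan_manifest` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

def scan_manifest(path):
    """Return (count of valid JSON-object lines, 1-based numbers of bad lines)."""
    valid, bad_lines = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                bad_lines.append(lineno)  # valid JSON, but not an object
    return valid, bad_lines

# Only runs when the manifest from Section 2 is in place.
manifest = Path("data/train_word_transcripts.jsonl")
if manifest.exists():
    valid, bad = scan_manifest(manifest)
    print(f"{valid:,} valid records, {len(bad)} malformed lines")
```

If malformed lines show up, a partially synced or truncated upload to Drive is a likely cause.
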
## 3. Quick Test (Recommended First Step)

Run a quick test on 100 samples to verify everything works before starting full training.

In [None]:
# Quick test - trains on 100 samples (takes ~5-10 minutes)
!python scripts/train.py --config configs/baseline_whisper_small.yaml --debug

## 4. Full Training

⚠️ **This will take 3-6 hours** depending on GPU speed. Make sure:
- You have GPU enabled (Runtime → Change runtime type → T4 GPU)
- Your Colab session won't time out (keep the browser open or use Colab Pro)
- Checkpoints are saved to Google Drive for persistence

In [None]:
# Setup checkpoint directory in Google Drive for persistence
CHECKPOINT_DIR = "/content/drive/MyDrive/child_speech_checkpoints"
!mkdir -p {CHECKPOINT_DIR}

# Update config to save checkpoints to Drive
# This ensures checkpoints survive if Colab disconnects
!sed -i "s|output_dir: checkpoints/baseline_whisper_small|output_dir: {CHECKPOINT_DIR}/baseline_whisper_small|g" configs/baseline_whisper_small.yaml

print(f"✓ Checkpoints will be saved to {CHECKPOINT_DIR}")

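`sed` exits successfully even when its pattern matches nothing, so if the `output_dir` line in the config differs from the one assumed above, the file is left unchanged without any warning. A sketch of the same edit in Python that raises instead. `redirect_output_dir` is a hypothetical helper, and it assumes `output_dir:` sits on its own line in the YAML file.

```python
import re

def redirect_output_dir(config_text, new_dir):
    """Rewrite the value of the first `output_dir:` line, keeping its indentation."""
    pattern = re.compile(r"^([ \t]*output_dir:[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in new_dir.
    new_text, count = pattern.subn(lambda m: m.group(1) + new_dir, config_text, count=1)
    if count == 0:
        raise ValueError("no 'output_dir:' line found in config")
    return new_text

# Illustrative snippet only; the real config's layout may differ.
sample = "training:\n  output_dir: checkpoints/baseline_whisper_small\n  batch_size: 12\n"
print(redirect_output_dir(sample, "/content/drive/MyDrive/child_speech_checkpoints/baseline_whisper_small"))

# Applying it to the real file might look like this:
# from pathlib import Path
# cfg = Path("configs/baseline_whisper_small.yaml")
# cfg.write_text(redirect_output_dir(cfg.read_text(), f"{CHECKPOINT_DIR}/baseline_whisper_small"))
```
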
In [None]:
# Start full training
!python scripts/train.py --config configs/baseline_whisper_small.yaml

## 5. Monitor Training with TensorBoard

In [None]:
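# Load TensorBoard. To watch metrics live, run this cell before the training cell in Section 4:
# Colab executes one cell at a time, so this cell cannot start while training is running.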
%load_ext tensorboard
%tensorboard --logdir logs/baseline_whisper_small

# Alternative: View logs directly
# !tail -n 50 logs/train_*.log

## 6. Evaluation

In [None]:
# Evaluate on validation set
MODEL_PATH = f"{CHECKPOINT_DIR}/baseline_whisper_small/final_model"

!python scripts/evaluate.py \
    --model-path {MODEL_PATH} \
    --val-manifest data/val_manifest.jsonl

## 7. Generate Predictions

In [None]:
# Generate predictions on test set
!python scripts/predict.py \
    --model-path {MODEL_PATH} \
    --input-jsonl data/test_manifest.jsonl \
    --output-jsonl predictions.jsonl \
    --batch-size 16

# Copy predictions to Drive
!cp predictions.jsonl /content/drive/MyDrive/predictions.jsonl

print("✓ Predictions saved to Google Drive")

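Before downloading, it can be worth confirming that the prediction file holds one record per input record. A small sketch, assuming `scripts/predict.py` writes exactly one JSON line per input line (this notebook does not confirm that). `count_records` is a hypothetical helper.

```python
from pathlib import Path

def count_records(path):
    """Count non-empty lines in a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

# Only runs when both files from the prediction step exist.
inputs, outputs = Path("data/test_manifest.jsonl"), Path("predictions.jsonl")
if inputs.exists() and outputs.exists():
    n_in, n_out = count_records(inputs), count_records(outputs)
    status = "✓" if n_in == n_out else "❌"
    print(f"{status} {n_out:,} predictions for {n_in:,} inputs")
```
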
## 8. Download Results

In [None]:
# Download predictions file to your local machine
from google.colab import files
files.download('predictions.jsonl')

## Troubleshooting

### Session Timeout
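If your Colab session disconnects during training:
1. If the runtime was recycled, re-run Section 1 (the cloned repository and installed packages live on the runtime's local disk and are lost with it)
2. Remount Google Drive (Section 2)
3. Re-run the checkpoint setup cell in Section 4 so that `CHECKPOINT_DIR` is defined and the config points at Drive again
4. Resume from checkpoint: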
```python
!python scripts/train.py \
    --config configs/baseline_whisper_small.yaml \
    --resume {CHECKPOINT_DIR}/baseline_whisper_small/checkpoint-XXXX
```

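`checkpoint-XXXX` above stands for the name of a saved checkpoint directory. A sketch for locating the most recent one, assuming `XXXX` is the integer training step (the Hugging Face `Trainer` naming convention, not confirmed for this repository). `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(run_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        # Compare steps numerically: as strings, "checkpoint-500" sorts after "checkpoint-1000".
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path

# Example, using the run directory from Section 4:
# print(latest_checkpoint("/content/drive/MyDrive/child_speech_checkpoints/baseline_whisper_small"))
```

The returned path can then be passed to `--resume` in place of the placeholder.
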
### Out of Memory (OOM)
If you get CUDA OOM errors:
1. Reduce batch size in config:
```python
!sed -i 's/batch_size: 12/batch_size: 8/g' configs/baseline_whisper_small.yaml
```
2. Restart runtime and try again

### Slow Training
Verify GPU is enabled:
```python
import torch
print(torch.cuda.is_available())  # Should be True
```

### Data Not Found
Check your Google Drive paths:
```python
!ls -lh /content/drive/MyDrive/child_speech_data/
```