# Finnish TTS Model Training on Kaggle

**Project:** Train Finnish TTS model using Fish Speech + LoRA  
**Dataset:** 2000 Finnish samples (cv-15 + parliament)  
**Expected Time:** ~5-6 hours on P100 GPU  
**Cost:** FREE (Kaggle free tier - 30 hours/week)

---

## Prerequisites

1. **Kaggle Account:** Create account at https://kaggle.com
2. **GPU Enabled:** Settings → Enable GPU Accelerator (P100 or T4)
3. **Dataset Upload:** Upload `FinnishSpeaker_2000_partial.tar.gz` (636 MB) as Kaggle Dataset
4. **HuggingFace Token:** For downloading base model
5. **Session Limit:** 12 hours max per session (enough for training)

---

## What This Notebook Does

1. ✅ Install Fish Speech
2. ✅ Download base model (5GB)
3. ✅ Load your dataset (2000 samples, VQ tokens already extracted for 502 of them)
4. ✅ Extract VQ tokens for the remaining 1498 samples (~30-40 min)
5. ✅ Pack dataset to protos
6. ✅ Train with LoRA (2000 steps, ~4-5 hours)
7. ✅ Merge LoRA weights
8. ✅ Download trained model
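
As a rough check that this plan fits Kaggle's limits, the sketch below adds up the two long-running steps using the estimates quoted in this notebook (30-40 minutes for VQ extraction, 4-5 hours for training) and compares the total with the 12-hour session limit and the 30-hour weekly quota. The helper name `fits_budget` is illustrative only, and the shorter steps (install, download, packing, merging) are left out, so treat the result as an approximation.

```python
# Time estimates in hours as (low, high), taken from the step list above.
VQ_EXTRACTION_H = (30 / 60, 40 / 60)
TRAINING_H = (4.0, 5.0)

SESSION_LIMIT_H = 12  # maximum length of a single Kaggle session
WEEKLY_QUOTA_H = 30   # free-tier GPU hours per week


def fits_budget(*steps, limit_h):
    """Return (low, high, ok); ok is True when the worst case fits in limit_h."""
    low = sum(step[0] for step in steps)
    high = sum(step[1] for step in steps)
    return low, high, high <= limit_h


low, high, ok = fits_budget(VQ_EXTRACTION_H, TRAINING_H, limit_h=SESSION_LIMIT_H)
print(f"Long-running steps: {low:.1f}-{high:.1f} h, fits in one session: {ok}")
print(f"Worst-case full runs per weekly quota: {int(WEEKLY_QUOTA_H // high)}")
```

If your own timings differ (for example on a T4), replace the tuples with measured values and rerun the check.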

---

## Step 1: Check GPU

In [None]:
# Verify GPU is available
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

## Step 2: Install Dependencies

In [None]:
# Install system dependencies
!apt-get update -qq
!apt-get install -y -qq sox libsox-dev ffmpeg

# Clone Fish Speech repository
!git clone https://github.com/fishaudio/fish-speech.git
%cd fish-speech

# Install Python dependencies explicitly
!pip install -q torch torchvision torchaudio
!pip install -q hydra-core omegaconf pyrootutils
!pip install -q lightning tensorboard
!pip install -q transformers tokenizers
!pip install -q loralib
!pip install -q nemo_text_processing WeTextProcessing
!pip install -q gradio protobuf fish-audio-preprocess soundfile

# Install Fish Speech itself
!pip install -q -e .

print("\n✅ Installation complete!")

## Step 3: Login to HuggingFace

In [None]:
from huggingface_hub import login

# Enter your HuggingFace token
login()

print('\n⚠️ IMPORTANT: You must request access to the gated model:')
print('Visit: https://huggingface.co/fishaudio/openaudio-s1-mini')
print('Click "Request Access" button (approval is usually instant)')

## Step 4: Download Base Model

In [None]:
# Download pre-trained model (~5 GB, takes 2-3 minutes)
!hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini

# Verify download
!ls -lh checkpoints/openaudio-s1-mini/

## Step 5: Load Dataset from Kaggle Input

**Before running:** Add your dataset as input in Kaggle notebook settings:
1. Click "+ Add Data" → "Your Datasets"
2. Select your uploaded `FinnishSpeaker_2000_partial` dataset
3. It will mount at `/kaggle/input/finnishspeaker-2000-partial/`

In [None]:
# Extract dataset from Kaggle input
!mkdir -p data
!tar -xzf /kaggle/input/finnishspeaker-2000-partial/FinnishSpeaker_2000_partial.tar.gz -C data/

# Verify extraction
!echo "WAV files: $(ls data/FinnishSpeaker/*.wav | wc -l)"
!echo "LAB files: $(ls data/FinnishSpeaker/*.lab | wc -l)"
!echo "NPY files (VQ tokens): $(ls data/FinnishSpeaker/*.npy | wc -l)"
!echo ""
!echo "Sample files:"
!ls data/FinnishSpeaker/ | head -10
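
The file counts above do not show whether every recording has a matching transcript. The sketch below pairs files by name, assuming each sample uses one shared stem for its `.wav`, `.lab`, and `.npy` files (for example `clip.wav`, `clip.lab`, `clip.npy`). The helper `audit_dataset` is hypothetical and not part of Fish Speech.

```python
from pathlib import Path


def audit_dataset(root):
    """Report which WAV files lack a transcript (.lab) or VQ tokens (.npy)."""
    root = Path(root)
    wav_stems = {p.stem for p in root.glob("*.wav")}
    lab_stems = {p.stem for p in root.glob("*.lab")}
    npy_stems = {p.stem for p in root.glob("*.npy")}
    return {
        "wav": len(wav_stems),
        "missing_lab": sorted(wav_stems - lab_stems),
        "missing_npy": sorted(wav_stems - npy_stems),
    }


report = audit_dataset("data/FinnishSpeaker")
print(f"WAV files: {report['wav']}")
print(f"Without transcript (.lab): {len(report['missing_lab'])}")
print(f"Without VQ tokens (.npy): {len(report['missing_npy'])}")
```

For this dataset, the number of samples without VQ tokens should match the 1498 that the next step extracts.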

## Step 6: Extract Remaining VQ Tokens

502 of the 2000 samples already have VQ token files (`.npy`). This step extracts tokens for the remaining 1498 samples.  
**Time:** ~30-40 minutes on P100, ~60 minutes on T4

In [None]:
# Extract VQ tokens for files that don't have them yet
!python tools/vqgan/extract_vq.py \
  data/FinnishSpeaker \
  --num-workers 4 \
  --batch-size 32 \
  --config-name modded_dac_vq \
  --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth

# Verify all tokens extracted
!echo "\nFinal VQ token count: $(ls data/FinnishSpeaker/*.npy | wc -l)"
!echo "Expected: 2000"

## Step 7: Pack Dataset into Protos

In [None]:
# Pack dataset into protocol buffer format
!python tools/llama/build_dataset.py \
  --input "data/FinnishSpeaker" \
  --output "data/protos" \
  --text-extension .lab \
  --num-workers 4

# Verify protos created
!ls -lh data/protos/

## Step 8: Start Training

**Configuration:**
- Samples: 2000
- Steps: 2000
- Batch size: 2, with gradient accumulation over 2 batches (effective batch size 4)
- Checkpoints: Every 250 steps
- Time: ~4-5 hours on P100

Training saves checkpoints automatically. If the session disconnects, you can resume from the last checkpoint.
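
To see what these settings mean in terms of data seen, the sketch below works through the arithmetic. It rests on two assumptions: that `trainer.max_steps` counts optimizer steps (as PyTorch Lightning does), and that each batch item corresponds to one sample. If the Fish Speech data loader packs or resamples sequences, the number of passes is only a rough guide. The function name `training_plan` is illustrative.

```python
def training_plan(num_samples, batch_size, accumulate, max_steps):
    """Relate optimizer steps to samples seen, assuming one sample per batch item."""
    effective_batch = batch_size * accumulate   # samples per optimizer step
    samples_seen = effective_batch * max_steps  # over the whole run
    epochs = samples_seen / num_samples         # approximate passes over the data
    return effective_batch, samples_seen, epochs


effective, seen, epochs = training_plan(
    num_samples=2000, batch_size=2, accumulate=2, max_steps=2000
)
print(f"Effective batch size: {effective}")
print(f"Samples seen: {seen} (~{epochs:.1f} passes over the dataset)")
```

The same helper shows that the out-of-memory fallback listed under Troubleshooting (`data.batch_size=1` with `trainer.accumulate_grad_batches=4`) keeps the effective batch size unchanged.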

In [None]:
# Train with LoRA fine-tuning
!python fish_speech/train.py \
  --config-name text2semantic_finetune \
  project=FinnishSpeaker_2000_finetune \
  +lora@model.model.lora_config=r_8_alpha_16 \
  data.batch_size=2 \
  data.num_workers=4 \
  trainer.max_steps=2000 \
  trainer.val_check_interval=250 \
  trainer.accumulate_grad_batches=2

## Step 9: Monitor Training (Optional)

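A notebook kernel runs one cell at a time, so this cell waits in the queue until the training cell finishes or is interrupted. Run it then to list the saved checkpoints and open TensorBoard: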

In [None]:
# Check training progress
!ls -lht results/FinnishSpeaker_2000_finetune/checkpoints/ | head -5

# View tensorboard logs (optional)
%load_ext tensorboard
%tensorboard --logdir results/FinnishSpeaker_2000_finetune/

## Step 10: Merge LoRA Weights

After training completes, merge the LoRA adapter with the base model.

In [None]:
# Find the final checkpoint (or best checkpoint)
!ls -lh results/FinnishSpeaker_2000_finetune/checkpoints/

# Merge LoRA weights (update step number if needed)
!python tools/llama/merge_lora.py \
  --lora-config r_8_alpha_16 \
  --base-weight checkpoints/openaudio-s1-mini \
  --lora-weight results/FinnishSpeaker_2000_finetune/checkpoints/step_000002000.ckpt \
  --output checkpoints/FinnishSpeaker_2000_finetuned

# Verify merged model
!ls -lh checkpoints/FinnishSpeaker_2000_finetuned/
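
If training stopped early, the final checkpoint will not be `step_000002000.ckpt`. The sketch below picks the checkpoint with the highest step number, assuming the `step_<number>.ckpt` naming used in the merge command above. `latest_checkpoint` is a hypothetical helper, not a Fish Speech function.

```python
import re
from pathlib import Path

STEP_PATTERN = re.compile(r"step_(\d+)\.ckpt$")


def latest_checkpoint(ckpt_dir):
    """Return the step_*.ckpt path with the highest step number, or None."""
    candidates = []
    for path in Path(ckpt_dir).glob("step_*.ckpt"):
        match = STEP_PATTERN.search(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


ckpt = latest_checkpoint("results/FinnishSpeaker_2000_finetune/checkpoints")
print(f"Latest checkpoint: {ckpt}")
```

In a notebook cell, IPython expands `{ckpt}` inside a `!` command, so the result can be passed to `--lora-weight` instead of a hard-coded path.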

## Step 11: Download Trained Model

**Choose one method to download your trained model:**

### Method A: Download Directly (Kaggle Notebook Output)

In [None]:
# Create archive for download
!tar -czf FinnishSpeaker_2000_trained.tar.gz checkpoints/FinnishSpeaker_2000_finetuned/

# The file will appear in Kaggle Output section (right sidebar)
!ls -lh FinnishSpeaker_2000_trained.tar.gz
print("\n✅ Model packaged! Find it in Output tab →")
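
Before downloading a multi-gigabyte archive, it can be worth confirming that it is readable and holds the expected files. The sketch below uses Python's standard `tarfile` module. `summarize_archive` is an illustrative name, and reading a large `.tar.gz` means decompressing the whole stream, so expect it to take a while.

```python
import tarfile
from pathlib import Path


def summarize_archive(archive_path):
    """Return (file_count, total_uncompressed_bytes) for a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        files = [member for member in tar.getmembers() if member.isfile()]
    return len(files), sum(member.size for member in files)


archive = Path("FinnishSpeaker_2000_trained.tar.gz")
if archive.exists():
    count, size = summarize_archive(archive)
    print(f"{count} files, {size / 1e9:.2f} GB uncompressed")
else:
    print(f"{archive} not found; run the packaging cell first")
```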

### Method B: Upload to HuggingFace Hub (Recommended for Sharing)

In [None]:
from huggingface_hub import HfApi

api = HfApi()

# Create the model repo if it does not exist yet (private=True keeps it private)
repo_id = "YOUR_USERNAME/finnish-tts-finetuned"  # Change this!
api.create_repo(repo_id, repo_type="model", private=True, exist_ok=True)

# Upload model files
api.upload_folder(
    folder_path="checkpoints/FinnishSpeaker_2000_finetuned",
    repo_id=repo_id,
    repo_type="model",
)

print(f"\n✅ Model uploaded to: https://huggingface.co/{repo_id}")

### Method C: Save LoRA Adapter Only (8 MB vs 3.2 GB)

More efficient: save only the LoRA weights and combine them with the base model later.

In [None]:
# Package just the LoRA checkpoint
!mkdir -p lora_adapter
!cp results/FinnishSpeaker_2000_finetune/checkpoints/step_000002000.ckpt lora_adapter/
!tar -czf FinnishSpeaker_2000_lora.tar.gz lora_adapter/

!ls -lh FinnishSpeaker_2000_lora.tar.gz
print("\n✅ LoRA adapter packaged (only 8 MB!)")
print("Use with base model: fishaudio/openaudio-s1-mini")

## Step 12: Quick Test (Optional)

Test the model with a short Finnish sentence:

In [None]:
# Quick inference test
test_text = "Hei, kuinka voit?"

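# Generate audio (basic CLI inference). Script paths and flags differ between
# Fish Speech releases, so confirm them with --help on your checkout first.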
!python tools/llama/generate.py \
  --text "$test_text" \
  --checkpoint-path checkpoints/FinnishSpeaker_2000_finetuned \
  --output test_output.wav

# Play audio
from IPython.display import Audio
Audio('test_output.wav')
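
Listening is the real test, but a quick look at the file header catches empty or truncated output. The sketch below uses Python's standard `wave` module, which only reads PCM WAV files. If the generated file uses another encoding (for example 32-bit float), it reports an error instead. `wav_info` is a hypothetical helper.

```python
import wave


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return rate, wav.getnchannels(), frames / rate


try:
    rate, channels, seconds = wav_info("test_output.wav")
    print(f"{rate} Hz, {channels} channel(s), {seconds:.2f} s")
except (FileNotFoundError, EOFError, wave.Error) as err:
    print(f"Could not read test_output.wav: {err}")
```

A duration close to zero for a full sentence suggests that generation did not complete.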

---

## Training Summary

**What you trained:**
- Base model: fishaudio/openaudio-s1-mini
- Dataset: 2000 Finnish samples (cv-15 + parliament)
- Method: LoRA fine-tuning (r=8, alpha=16)
- Steps: 2000
- Time: ~5-6 hours

**Output files:**
- Full model: `checkpoints/FinnishSpeaker_2000_finetuned/` (3.2 GB)
- LoRA only: `lora_adapter/step_000002000.ckpt` (8 MB)
- Checkpoints: `results/FinnishSpeaker_2000_finetune/checkpoints/` (for resuming)

**Next steps:**
1. Download model to your Mac
2. Test with Fish Speech WebUI
3. Add more data and train again (incremental learning)
4. Share with team via HuggingFace

---

## Troubleshooting

**Out of Memory:**
- Reduce `data.batch_size` to 1
- Increase `trainer.accumulate_grad_batches` to 4

**Session Timeout:**
- Training auto-saves checkpoints every 250 steps
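- Restart the notebook and rerun the training command with the same overrides as Step 8, adding `ckpt_path` to resume:
  ```python
  !python fish_speech/train.py \
    --config-name text2semantic_finetune \
    project=FinnishSpeaker_2000_finetune \
    +lora@model.model.lora_config=r_8_alpha_16 \
    data.batch_size=2 \
    data.num_workers=4 \
    trainer.max_steps=2000 \
    trainer.val_check_interval=250 \
    trainer.accumulate_grad_batches=2 \
    ckpt_path=results/FinnishSpeaker_2000_finetune/checkpoints/last.ckpt
  ```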

**HuggingFace Access Denied:**
- Make sure you requested access at https://huggingface.co/fishaudio/openaudio-s1-mini
- Wait a few minutes for approval
- Re-run Step 3 with the correct token

---