# Finnish TTS Model Training on Kaggle

**Project:** Fine-tune a Finnish TTS model using Fish Speech + LoRA  
**Dataset:** 2000 Finnish samples (cv-15 + parliament)  
**Training:** Resume from step 750 and continue to step 2000 to improve quality

---

## Step 1: Check GPU

In [None]:
!nvidia-smi

import torch
print(f"\nPyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

## Step 2: Install Dependencies

In [None]:
# Install system dependencies
!apt-get update -qq
!apt-get install -y -qq sox libsox-dev ffmpeg portaudio19-dev

# Clone Fish Speech
!git clone https://github.com/fishaudio/fish-speech.git /kaggle/working/fish-speech
%cd /kaggle/working/fish-speech

# Install core dependencies
!pip install -q hydra-core omegaconf pyrootutils
!pip install -q lightning tensorboard
!pip install -q transformers tokenizers
!pip install -q loralib
!pip install -q nemo_text_processing WeTextProcessing
!pip install -q descript-audio-codec
!pip install -q git+https://github.com/descriptinc/audiotools

# Install fish-speech (skip pyaudio - not needed for training)
!pip install -q -e . --no-deps
!pip install -q "einx[torch]==0.2.2" "kui>=1.6.0" "modelscope==1.17.1" \
  "opencc-python-reimplemented==0.1.7" ormsgpack "resampy>=0.4.3" silero-vad

print("\n✅ Installation complete!")

## Step 3: Login to HuggingFace

In [None]:
from kaggle_secrets import UserSecretsClient
from huggingface_hub import login

user_secrets = UserSecretsClient()
hf_token = user_secrets.get_secret("HF_TOKEN")
login(token=hf_token)

print("✅ HuggingFace login successful!")

## Step 4: Download Base Model

In [None]:
!hf download fishaudio/openaudio-s1-mini --local-dir checkpoints/openaudio-s1-mini
!ls -lh checkpoints/openaudio-s1-mini/

## Step 5: Load Dataset

Make sure you have added the `finnishspeaker-2000-partial` dataset to this notebook in the notebook settings.

In [None]:
# Copy dataset (Kaggle auto-extracted it)
!mkdir -p data
!cp -r /kaggle/input/finnishspeaker-2000-partial/FinnishSpeaker data/

# Verify
!echo "WAV: $(ls data/FinnishSpeaker/*.wav | wc -l)"
!echo "LAB: $(ls data/FinnishSpeaker/*.lab | wc -l)"
!echo "NPY: $(ls data/FinnishSpeaker/*.npy | wc -l)"
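
The three counts above only match when every sample has all three files. As a minimal sketch (the helper name `find_unpaired` is hypothetical, and it assumes the files of one sample share a base name such as `0001.wav` / `0001.lab` / `0001.npy`), the following lists which samples lack a given file type:

```python
from pathlib import Path


def find_unpaired(data_dir, extensions=(".wav", ".lab", ".npy")):
    """Return {extension: sorted base names that lack a file with that extension}."""
    data_dir = Path(data_dir)
    # Collect the base names present for each extension.
    stems = {ext: {p.stem for p in data_dir.glob(f"*{ext}")} for ext in extensions}
    all_stems = set().union(*stems.values())
    return {ext: sorted(all_stems - found) for ext, found in stems.items()}


data_dir = Path("data/FinnishSpeaker")  # same directory as the cell above
if data_dir.is_dir():
    for ext, missing in find_unpaired(data_dir).items():
        print(f"{ext}: {len(missing)} missing", missing[:5])
```

A non-empty `.npy` list suggests that Step 6 still needs to run; a non-empty `.lab` list points to samples without transcripts.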

## Step 6: Extract Remaining VQ Tokens (if needed)

Run this step only if the NPY count from Step 5 is below 2000. If you already have all 2000 `.npy` files, skip it.

In [None]:
!pip install -q loguru

In [None]:
# Run ONLY if you don't have all 2000 .npy files
!python tools/vqgan/extract_vq.py \
  data/FinnishSpeaker \
  --num-workers 1 \
  --batch-size 4 \
  --config-name modded_dac_vq \
  --checkpoint-path checkpoints/openaudio-s1-mini/codec.pth

!echo "\nFinal NPY count: $(ls data/FinnishSpeaker/*.npy | wc -l)"

## Step 7: Pack Dataset

In [None]:
!pip install -q "protobuf>=3.20.3,<5" --upgrade

In [None]:
!python tools/llama/build_dataset.py \
  --input "data/FinnishSpeaker" \
  --output "data/protos" \
  --text-extension .lab \
  --num-workers 4

!ls -lh data/protos/

## Step 8: Resume Training (750 → 2000 steps)

**This will:**
- Load the checkpoint saved at step 750
- Train for 1250 more steps (750 → 2000)
- Take roughly 1.5 hours
- Likely improve output quality over the 750-step model

In [None]:
# First, check the exact path
!ls -la /kaggle/input/my-750-step-output/fish-speech/results/FinnishSpeaker_2000_finetune/checkpoints/step_000000750.ckpt
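
The training command in this step passes no checkpoint path, and `/kaggle/input` is read-only, so the step-750 checkpoint has to be somewhere the trainer can find it. The sketch below copies it into the project's checkpoint directory. It rests on an assumption that should be checked against the Fish Speech version you cloned: that `fish_speech/train.py` resumes from the newest checkpoint under `results/<project>/checkpoints/`. The helper `stage_checkpoint` is hypothetical and not part of Fish Speech.

```python
import shutil
from pathlib import Path


def stage_checkpoint(src, ckpt_dir):
    """Copy a checkpoint into ckpt_dir (created if needed) and return the new path."""
    src, ckpt_dir = Path(src), Path(ckpt_dir)
    if not src.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {src}")
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    dst = ckpt_dir / src.name
    if not dst.exists():  # do not overwrite a checkpoint that is already staged
        shutil.copy2(src, dst)
    return dst


# Source path taken from the cell above; destination matches the project name
# used in the training command. ASSUMPTION: training auto-resumes from here.
SRC = Path("/kaggle/input/my-750-step-output/fish-speech/results/"
           "FinnishSpeaker_2000_finetune/checkpoints/step_000000750.ckpt")
DST_DIR = Path("results/FinnishSpeaker_2000_finetune/checkpoints")

if SRC.is_file():
    print("Staged:", stage_checkpoint(SRC, DST_DIR))
```

If the training log then starts at step 0 instead of step 750, the assumption does not hold for your version and the checkpoint must be supplied another way.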

In [None]:
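# Training command (currently disabled). Remove the leading "#" from each line below to run it.
# Note: no checkpoint path is passed on the command line.
#!python fish_speech/train.py \
  #--config-name text2semantic_finetune \
  #project=FinnishSpeaker_2000_finetune \
  #+lora@model.model.lora_config=r_8_alpha_16 \
  #data.batch_size=2 \
  #data.num_workers=4 \
  #trainer.max_steps=2000 \
  #trainer.val_check_interval=50 \
  #trainer.accumulate_grad_batches=2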

## Step 9: Monitor Progress (Optional)

In [None]:
!ls -lht results/FinnishSpeaker_2000_finetune/checkpoints/ | head -5
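
Sorting by modification time works while training runs in this session, but timestamps can change when checkpoints are copied between Kaggle datasets. A minimal sketch that picks the checkpoint by step number instead, assuming the `step_<number>.ckpt` naming seen in this notebook (`latest_checkpoint` is a hypothetical helper):

```python
import re
from pathlib import Path

STEP_PATTERN = re.compile(r"step_(\d+)\.ckpt$")


def latest_checkpoint(ckpt_dir):
    """Return (step, path) for the highest-numbered step_*.ckpt, or None if there is none."""
    ckpt_dir = Path(ckpt_dir)
    if not ckpt_dir.is_dir():
        return None
    found = []
    for path in ckpt_dir.glob("step_*.ckpt"):
        match = STEP_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return max(found, key=lambda item: item[0]) if found else None


ckpt_dir = Path("results/FinnishSpeaker_2000_finetune/checkpoints")
result = latest_checkpoint(ckpt_dir)
print(result if result else f"No checkpoints in {ckpt_dir} yet")
```

The returned path can be used to decide which checkpoint to pass to the merge step.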

## Step 10: Merge LoRA Weights

In [None]:
!ls -lh /kaggle/input/my-2000-run/fish-speech/results/FinnishSpeaker_2000_finetune/checkpoints/step_000001050.ckpt

In [None]:
# Check which checkpoint to use
!ls -lh /kaggle/input/my-2000-run/fish-speech/results/FinnishSpeaker_2000_finetune/checkpoints/

# Merge. The command below points at step_000001050.ckpt; change --lora-weight
# to your final checkpoint (e.g. step_000002000.ckpt) if it differs.
!python tools/llama/merge_lora.py \
  --lora-config r_8_alpha_16 \
  --base-weight checkpoints/openaudio-s1-mini \
  --lora-weight /kaggle/input/my-2000-run/fish-speech/results/FinnishSpeaker_2000_finetune/checkpoints/step_000001050.ckpt \
  --output checkpoints/FinnishSpeaker_2000_finetuned

!ls -lh checkpoints/FinnishSpeaker_2000_finetuned/

## Step 11: Download Model

In [None]:
# Create archive
!tar -czf FinnishSpeaker_2000_trained_v2.tar.gz checkpoints/FinnishSpeaker_2000_finetuned/

!ls -lh FinnishSpeaker_2000_trained_v2.tar.gz
print("\n✅ Download from Output tab (right sidebar) →")
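
Before downloading, it can help to confirm that the archive actually contains the merged model files. A minimal sketch using the standard-library `tarfile` module; `list_archive` is a hypothetical helper, and which files should appear depends on what the merge step wrote.

```python
import tarfile
from pathlib import Path


def list_archive(archive_path):
    """Return (name, size_in_bytes) for every regular file in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [(member.name, member.size) for member in tar.getmembers() if member.isfile()]


archive = Path("FinnishSpeaker_2000_trained_v2.tar.gz")  # created in the cell above
if archive.is_file():
    for name, size in list_archive(archive):
        print(f"{size / 1e6:10.1f} MB  {name}")
```

An empty listing, or weight files of only a few bytes, would suggest the merge in Step 10 did not complete.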

In [None]:
# Check training progress
!tail -20 /kaggle/input/my-2000-run/fish-speech/results/FinnishSpeaker_2000_finetune/train.log 2>/dev/null || echo "Log not created yet"

# Check checkpoints
!ls -lh /kaggle/input/my-2000-run/fish-speech/results/FinnishSpeaker_2000_finetune/checkpoints/ 2>/dev/null || echo "No checkpoints yet"

# Check GPU activity
!nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used --format=csv

In [None]:
!tail -50 results/FinnishSpeaker_2000_finetune/train.log | grep -E "(Epoch|step|loss|it/s)"

---

## Summary

**Training:**
- Resumed from step 750 and trained to step 2000
- Total: ~2.6 epochs over 2000 samples
- Should sound **much better** than the 750-step model

**Testing:**
1. Download `FinnishSpeaker_2000_trained_v2.tar.gz`
2. Extract it on your Mac
3. Test with the WebUI
4. Try these settings: `temperature=0.5`, `max_new_tokens=256`