# Triton Spark TTS + RVC Voice Conversion

This notebook runs the full pipeline:
1. **Triton Spark TTS** - Text-to-Speech (runs inside udocker container)
2. **RVC Voice Conversion** - Voice cloning (runs on host Python with CUDA)

## Architecture
- Triton Server runs Spark TTS with TensorRT-LLM optimization (~10x faster)
- RVC runs on host Python with direct CUDA access
- Both processes run on the same GPU, each through its own CUDA context
- Host Python communicates with Triton via gRPC (udocker shares host network)

## Requirements
- Google Colab with GPU runtime (T4 minimum, A100 recommended)
- ~10GB disk space for models

---
## Section 1: Setup Environment

In [None]:
# Cell 1.1: Initialize udocker for Triton container
def udocker_init():
    import os
    if not os.path.exists("/home/user"):
        !pip install udocker > /dev/null
        !udocker --allow-root install > /dev/null
        !useradd -m user > /dev/null
    print('Docker-in-Colab initialized')
    def execute(command: str):
        user_prompt = "\033[1;32muser@pc\033[0m"
        print(f"{user_prompt}$ udocker {command}")
        !su - user -c "udocker $command"
    return execute

udocker = udocker_init()

In [None]:
# Cell 1.2: Clone repository
!git clone https://github.com/VSlobolinskyi/triton-spark-server.git
%cd triton-spark-server
# Switch to feature branch with RVC integration
!git checkout feature/rvc-integration

In [None]:
# Cell 1.3: GPU Configuration
import subprocess

def detect_gpu():
    """Detect GPU and return configuration."""
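    try:
        result = subprocess.run(['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv,noheader'],
                               capture_output=True, text=True, check=True)
        # nvidia-smi prints one line per GPU; use the first one
        gpu_info = result.stdout.strip().splitlines()[0]
        gpu_name, gpu_mem_str = gpu_info.rsplit(',', 1)
        gpu_name = gpu_name.strip()
        gpu_mem = int(gpu_mem_str.replace('MiB', '').strip())
        print(f"Detected GPU: {gpu_name} ({gpu_mem} MiB)")
        return gpu_name, gpu_mem
    except (OSError, subprocess.CalledProcessError, IndexError, ValueError):
        # nvidia-smi missing, failed, or returned nothing parseable
        print("No GPU detected!")
        return None, 0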

gpu_name, gpu_mem = detect_gpu()

# udocker GPU options
gpu_config = {
    'all_options': '--device=/dev/nvidia0 --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia-uvm-tools'
}
print("GPU config ready")

---
## Section 2: Setup Triton Container (Spark TTS)

In [None]:
# Cell 2.1: Pull Triton Server image
!udocker --allow-root pull nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3
!udocker --allow-root create --name=triton_server nvcr.io/nvidia/tritonserver:25.02-trtllm-python-py3
!udocker --allow-root setup --nvidia triton_server

In [None]:
# Cell 2.2: Install dependencies inside container + download Spark model
# Note: Repo is mounted at /workspace via --volume=$PWD:/workspace
import os
pwd = os.getcwd()

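# Raw f-string: the \" and \$ escapes must reach the shell unchanged,
# otherwise the quotes around the snapshot_download arguments are lost
install_cmd = rf'''udocker --allow-root run \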
    --hostauth --hostenv \
    {gpu_config['all_options']} \
    --volume={pwd}:/workspace \
    triton_server \
    /bin/bash -c "apt-get update && apt-get install -y cmake && \
git clone https://github.com/pytorch/audio.git && cd audio && git checkout c670ad8 && USE_FFMPEG=0 PATH=/usr/local/cuda/bin:\$PATH python3 setup.py develop && \
pip install einx==0.3.0 omegaconf==2.3.0 soundfile==0.12.1 soxr==0.5.0.post1 tritonclient librosa 'huggingface-hub>=0.24.0,<1.0' && \
python3 -c 'from huggingface_hub import snapshot_download; snapshot_download(repo_id=\"SparkAudio/Spark-TTS-0.5B\", local_dir=\"/workspace/pretrained_models/Spark-TTS-0.5B\")'"
'''

!{install_cmd}

In [None]:
# Cell 2.3: Build TensorRT-LLM engine for Spark TTS
import os
pwd = os.getcwd()

setup_cmd = f'''udocker --allow-root run \
    --hostauth --hostenv \
    {gpu_config['all_options']} \
    --env=PYTHONPATH=/workspace \
    --volume={pwd}:/workspace \
    triton_server \
    /bin/bash -c "cd /workspace/runtime/triton_trtllm && bash run.sh 1 2 offline"'''

!{setup_cmd}

---
## Section 3: Setup RVC on Host Python

In [None]:
# Cell 3.1: Install RVC dependencies on host
# Colab's Python 3.10 runtime is compatible with these requirements;
# newer Colab runtimes may ship a different version (check with `!python --version`)

# Install fairseq from git (required for HuBERT)
!pip install -q git+https://github.com/One-sixth/fairseq.git

# Install main requirements
!pip install -q -r requirements.txt

print("RVC dependencies installed")

In [None]:
# Cell 3.2: Download RVC assets (HuBERT, RMVPE)
!python tools/download_all_assets.py --rvc-only --skip-pretrained

In [None]:
# Cell 3.3: Download RVC voice model (SilverWolf example)
# You can replace this URL with your own RVC model
RVC_MODEL_URL = "https://huggingface.co/Juneuarie/SilverWolfEN/resolve/main/SilverWolf.zip"
!python tools/download_all_assets.py --rvc-model "{RVC_MODEL_URL}"

In [None]:
# Cell 3.4: Test RVC initialization
import sys
sys.path.insert(0, '.')

from rvc import init_rvc, get_config

init_rvc()
config = get_config()
print("RVC initialized:")
print(f"  Device: {config.device}")
print(f"  Half precision: {config.is_half}")
print(f"  GPU: {config.gpu_name}")

---
## Section 4: Start Triton Server

In [None]:
# Cell 4.1: Start Triton server in background
# Note: udocker shares host network, so Triton will be accessible at localhost:8001
import subprocess
import time
import os

pwd = os.getcwd()

server_cmd = f'''udocker --allow-root run \
    --hostauth --hostenv \
    {gpu_config['all_options']} \
    --env=PYTHONPATH=/workspace \
    --volume={pwd}:/workspace \
    triton_server \
    /bin/bash -c "cd /workspace/runtime/triton_trtllm && tritonserver --model-repository=./model_repo_test"'''

# Run in background
process = subprocess.Popen(
    f"nohup {server_cmd} > server_log.txt 2>&1 &",
    shell=True
)

print("Triton server starting...")
print("Waiting 45 seconds for models to load...")
time.sleep(45)
!tail -30 server_log.txt
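
The fixed 45-second wait can be too short on a cold start and longer than needed on a warm one. A minimal sketch of polling instead, assuming Triton's HTTP endpoint listens on its default port 8000 and serves the standard `/v2/health/ready` route. `triton_http_ready` and `wait_until_ready` are hypothetical helper names, not part of the repository.

```python
import time
import urllib.error
import urllib.request


def triton_http_ready(url="http://localhost:8000/v2/health/ready", timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200.

    The URL assumes Triton's default HTTP port; adjust it if the server
    was started with a different --http-port.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: not ready yet
        return False


def wait_until_ready(check=triton_http_ready, timeout=300.0, interval=5.0):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_ready()` in place of `time.sleep(45)` returns `True` as soon as the server reports ready, or `False` after the timeout, at which point `server_log.txt` is the place to look.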

In [None]:
# Cell 4.2: Test Triton connection from host Python
# This verifies that host can communicate with Triton via gRPC
from rvc.triton_client import TritonSparkClient

client = TritonSparkClient(server_addr="localhost", server_port=8001)

if client.is_server_ready():
    print("Triton server is ready!")
else:
    print("Triton server not ready - check server_log.txt")

if client.is_model_ready():
    print("Spark TTS model is loaded!")
else:
    print("Spark TTS model not loaded - check server_log.txt")

client.close()

---
## Section 5: Upload Reference Audio

In [None]:
# Cell 5.1: Upload reference audio for voice cloning
import os
import soundfile as sf
from scipy import signal
from google.colab import files

# Create directories
!mkdir -p references TEMP/tts TEMP/rvc TEMP/test

# Upload reference audio
print("Please upload a reference audio file (.wav format):")
uploaded = files.upload()
if not uploaded:
    raise RuntimeError("No file uploaded; re-run this cell and choose a .wav file")

# Process uploaded file
reference_audio = list(uploaded.keys())[0]
references_path = os.path.join("references", reference_audio)

with open(references_path, "wb") as f:
    f.write(uploaded[reference_audio])
print(f"Saved to {references_path}")

# Resample to 16kHz for Spark TTS
audio, sr = sf.read(references_path)
print(f"Original: {len(audio)/sr:.2f}s @ {sr}Hz")

if sr != 16000:
    target_sr = 16000
    num_samples = int(len(audio) * (target_sr / sr))
    resampled_audio = signal.resample(audio, num_samples)
    
    base_name = os.path.splitext(reference_audio)[0]
    reference_audio_16k = os.path.join("references", f"{base_name}_16k.wav")
    sf.write(reference_audio_16k, resampled_audio, target_sr)
    print(f"Resampled to 16kHz: {reference_audio_16k}")
else:
    reference_audio_16k = references_path
    print(f"Already 16kHz, using: {reference_audio_16k}")
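
`signal.resample` uses an FFT and treats the clip as periodic, which can introduce artifacts at the start and end of a short reference recording. One alternative is polyphase filtering with `scipy.signal.resample_poly`, which takes integer up/down factors. The sketch below derives those factors from the two sample rates. `polyphase_factors` and `resample_to` are hypothetical helpers, and whether the result sounds better for a given reference clip is something to verify by ear.

```python
from math import gcd


def polyphase_factors(src_sr, dst_sr):
    """Return the smallest integer (up, down) pair with src_sr * up / down == dst_sr."""
    g = gcd(src_sr, dst_sr)
    return dst_sr // g, src_sr // g


def resample_to(audio, src_sr, dst_sr=16000):
    """Resample along the time axis (axis 0) using polyphase filtering."""
    # Imported here so the factor helper above works without SciPy installed
    from scipy.signal import resample_poly

    up, down = polyphase_factors(src_sr, dst_sr)
    return resample_poly(audio, up, down, axis=0)
```

For a 44.1 kHz upload, `polyphase_factors(44100, 16000)` gives `(160, 441)`, and `resample_to(audio, sr)` could stand in for the `signal.resample` call.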

---
## Section 6: Run Full Pipeline

In [None]:
# Cell 6.1: Configuration
# Modify these settings as needed

# Text to synthesize
TARGET_TEXT = "Hello! This is a test of the Triton Spark TTS and RVC voice conversion pipeline. The voice should sound like the reference audio."

# Reference text (can be empty for offline mode)
REFERENCE_TEXT = ""

# RVC model name (from assets/weights/)
RVC_MODEL = "SilverWolf.pth"

# Pitch shift in semitones (0 = no change)
PITCH_SHIFT = 0

# F0 extraction method (rmvpe is best quality)
F0_METHOD = "rmvpe"

print("Configuration:")
print(f"  Text: {TARGET_TEXT[:50]}...")
print(f"  RVC Model: {RVC_MODEL}")
print(f"  Pitch Shift: {PITCH_SHIFT}")
print(f"  F0 Method: {F0_METHOD}")

In [None]:
# Cell 6.2: Run full pipeline test
import subprocess

cmd = [
    "python", "tools/test_pipeline.py",
    "--triton-addr", "localhost",
    "--triton-port", "8001",
    "--prompt-audio", reference_audio_16k,
    "--prompt-text", REFERENCE_TEXT,
    "--text", TARGET_TEXT,
    "--rvc-model", RVC_MODEL,
    "--pitch-shift", str(PITCH_SHIFT),
    "--f0-method", F0_METHOD,
    "--output-dir", "./TEMP/test"
]

result = subprocess.run(cmd)
print(f"\nPipeline finished with code: {result.returncode}")
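
A non-zero return code on its own says little about what went wrong. A minimal sketch of a wrapper that captures the subprocess output and raises with the last lines of stderr on failure. `run_step` is a hypothetical helper, not something the repository provides, and it works with any command list, including `cmd` from the cell above.

```python
import subprocess


def run_step(cmd, tail_lines=20):
    """Run a command, return its stdout, and raise with the stderr tail on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        tail = "\n".join(result.stderr.splitlines()[-tail_lines:])
        raise RuntimeError(
            f"Command failed with code {result.returncode}:\n{tail}"
        )
    return result.stdout
```

Because the output is captured rather than streamed, this suits short steps. For a long-running pipeline the plain `subprocess.run(cmd)` call keeps progress visible in the cell.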

In [None]:
# Cell 6.3: Play output audio
from IPython.display import Audio, display
import os

tts_path = "TEMP/test/tts_output.wav"
rvc_path = "TEMP/test/rvc_output.wav"

if os.path.exists(tts_path):
    print("TTS Output (before RVC):")
    display(Audio(tts_path))
else:
    print(f"TTS output not found at {tts_path}")

if os.path.exists(rvc_path):
    print("\nRVC Output (final voice):")
    display(Audio(rvc_path))
else:
    print(f"RVC output not found at {rvc_path}")

---
## Section 7: Manual Pipeline Control (Optional)

Use these cells for more control over individual steps.

In [None]:
# Cell 7.1: Run TTS only
import time
import soundfile as sf
from rvc.triton_client import TritonSparkClient

client = TritonSparkClient(server_addr="localhost", server_port=8001)

text = "This is a test of Triton Spark TTS without voice conversion."

start = time.time()
wav = client.inference(
    text=text,
    prompt_speech=reference_audio_16k,
    prompt_text="",
)
elapsed = time.time() - start

duration = len(wav) / 16000
print(f"Generated {duration:.2f}s of audio in {elapsed:.2f}s (RTF: {elapsed/duration:.2f})")

sf.write("TEMP/tts/tts_only.wav", wav, 16000)
client.close()

from IPython.display import Audio
Audio("TEMP/tts/tts_only.wav")
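
The RTF printed above is the real-time factor: synthesis time divided by the duration of the generated audio, so values below 1.0 mean the audio was produced faster than it plays back. A small worked sketch of the same arithmetic, with `real_time_factor` as a hypothetical helper name and 16 kHz assumed as the output sample rate, as in the cell above:

```python
def real_time_factor(elapsed_s, num_samples, sample_rate=16000):
    """Return processing time divided by audio duration."""
    duration_s = num_samples / sample_rate
    if duration_s == 0:
        raise ValueError("Audio is empty; RTF is undefined")
    return elapsed_s / duration_s


# 64000 samples at 16 kHz is 4 s of audio; generating it in 2 s gives RTF 0.5
print(real_time_factor(2.0, 64000))
```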

In [None]:
# Cell 7.2: Run RVC only (on existing audio)
import time
import soundfile as sf
from rvc import init_rvc, load_model, convert_audio, is_initialized

if not is_initialized():
    init_rvc()

# Load model
model_info = load_model("SilverWolf.pth")
print(f"Loaded model: version={model_info.get('version')}, sr={model_info.get('tgt_sr')}")

# Convert
input_audio = "TEMP/tts/tts_only.wav"  # Use TTS output from previous cell

start = time.time()
info, (sr, audio) = convert_audio(
    audio_path=input_audio,
    pitch_shift=0,
    f0_method="rmvpe",
)
elapsed = time.time() - start

duration = len(audio) / sr
print(f"Converted {duration:.2f}s of audio in {elapsed:.2f}s")

sf.write("TEMP/rvc/rvc_only.wav", audio, sr)

from IPython.display import Audio
Audio("TEMP/rvc/rvc_only.wav")

---
## Section 8: Download Results

In [None]:
# Cell 8.1: Download output files
from google.colab import files
import os

# Download final RVC output
rvc_path = "TEMP/test/rvc_output.wav"
if os.path.exists(rvc_path):
    files.download(rvc_path)
else:
    print(f"File not found: {rvc_path}")

# Optionally download TTS output too
# files.download("TEMP/test/tts_output.wav")

---
## Troubleshooting

### Check server logs
```python
!tail -50 server_log.txt
```

### Check available RVC models
```python
!ls -la assets/weights/
!ls -la logs/
```

### Memory issues
- Restart runtime (Runtime -> Restart Runtime) and re-run
- Use a GPU with more memory (A100 if available)
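
Before restarting, it can help to see how much GPU memory is actually in use, since Triton and RVC both allocate on the same device. A minimal sketch that queries `nvidia-smi` for used and total memory. `parse_gpu_memory` and `gpu_memory` are hypothetical helper names.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU, into a list of int tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Query nvidia-smi and return [(used_mib, total_mib), ...]."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Running `gpu_memory()` before and after starting the Triton server shows how much headroom is left for the RVC model.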

### Network issues (Triton connection)
- udocker shares host network, so localhost:8001 should work
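- Check whether the server process is running: `!pgrep -af tritonserver` (unlike `!ps aux | grep triton`, this does not match the `grep` command itself)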
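
A quick way to tell a network problem from a model-loading problem is to test whether anything is listening on the gRPC port at all. A minimal sketch using only the standard library, where `port_open` is a hypothetical helper name:

```python
import socket


def port_open(host="localhost", port=8001, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host could not be resolved
        return False


print("gRPC port reachable:", port_open())
```

If this prints `False`, the server has not bound the port yet (or has crashed), and `server_log.txt` is the next thing to check. If it prints `True` but the client still reports the model as not ready, the models are probably still loading.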