# TTS Comparison Test: Live API vs Regular TTS API

This notebook compares two approaches for adding personality to TTS:

## Approach 1: Live API with System Instruction
- **Model**: gemini-live-2.5-flash-preview-native-audio
- **Method**: Conversational model with system instruction
- **Behavior**: May add/modify content while applying persona
- **Best for**: Interactive tutoring, conversational responses

## Approach 2: Regular TTS with Prompt
- **Model**: gemini-2.5-flash-tts  
- **Method**: Pure TTS with prompt parameter for tone control
- **Behavior**: Reads exact text with specified tone/emotion
- **Best for**: Scripted content, precise text-to-speech

The tests use three Indonesian-language prompts to compare:
1. **Latency** - Which is faster?
2. **Content fidelity** - Does it read exactly or add content?
3. **Voice quality** - Which sounds more natural/engaging?

## Setup and Configuration

In [1]:
# Install required packages
%pip install --upgrade --quiet google-genai google-cloud-texttospeech

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import libraries
import time
from datetime import datetime
from IPython.display import Audio, Markdown, display
import numpy as np
from google.api_core.client_options import ClientOptions
from google.cloud import texttospeech_v1beta1 as texttospeech
# Python SDK for Live API
from google import genai
from google.genai import types

In [3]:
# Configuration
PROJECT_ID = "my-project-0004-346516"
LOCATION = "us-central1"
TTS_LOCATION = "global"

# Live API configuration (using Python SDK)
LIVE_MODEL_ID = "gemini-live-2.5-flash-preview-native-audio"
VOICE_NAME = "Kore"  # Chirp 3 HD voice
LANGUAGE_CODE = "id-ID"  # Indonesian

# System instruction for Live API (Kak Caca persona) - OPTION A: Read exactly with emotion
SYSTEM_INSTRUCTION = """You are a text-to-speech system with the voice and personality of Kak Caca, a warm and encouraging Math tutor for students aged 12 to 18 in Bahasa Indonesia.

CRITICAL INSTRUCTIONS:
1. Read ONLY the exact text provided - do not add, remove, or change ANY words
2. Do NOT respond to the content, answer questions, or add commentary
3. Apply the warm, encouraging voice characteristics of Kak Caca:
   - Friendly, caring tone as if speaking to a student
   - Natural emotional expression and emphasis on key words
   - Appropriate pacing for educational content
   - The warmth and encouragement of a supportive young tutor

You are a TTS system with personality - read the exact text with Kak Caca's warm voice."""

# Create Live API client
live_client = genai.Client(vertexai=True, project=PROJECT_ID, location=LOCATION)

# Configure the Live API session with the system instruction
live_session_config = types.LiveConnectConfig(
    response_modalities=["AUDIO"],
    speech_config=types.SpeechConfig(
        voice_config=types.VoiceConfig(
            prebuilt_voice_config=types.PrebuiltVoiceConfig(
                voice_name=VOICE_NAME,
            )
        ),
        language_code=LANGUAGE_CODE,
    ),
    system_instruction=types.Content(
        parts=[types.Part(text=SYSTEM_INSTRUCTION)],
        role='user'
    ),
    input_audio_transcription=types.AudioTranscriptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig(),
)

# Regular TTS configuration (no system instruction; tone is set per request via the prompt parameter)
TTS_MODEL = "gemini-2.5-flash-tts"
TTS_VOICE = "Aoede"

API_ENDPOINT = (
    f"{TTS_LOCATION}-texttospeech.googleapis.com"
    if TTS_LOCATION != "global"
    else "texttospeech.googleapis.com"
)

tts_client = texttospeech.TextToSpeechClient(
    client_options=ClientOptions(api_endpoint=API_ENDPOINT)
)

print("✓ Live API client created with:")
print(f"  - Model: {LIVE_MODEL_ID}")
print(f"  - Voice: {VOICE_NAME}")
print(f"  - Language: {LANGUAGE_CODE}")
print(f"  - System Instruction: Kak Caca TTS (exact text, warm tone)")
print(f"\n✓ Regular TTS client created with:")
print(f"  - Model: {TTS_MODEL}")
print(f"  - Voice: {TTS_VOICE}")
print(f"  - Language: {LANGUAGE_CODE}")
print(f"  - System Instruction: None (neutral plain TTS)")

✓ Live API client created with:
  - Model: gemini-live-2.5-flash-preview-native-audio
  - Voice: Kore
  - Language: id-ID
  - System Instruction: Kak Caca TTS (exact text, warm tone)

✓ Regular TTS client created with:
  - Model: gemini-2.5-flash-tts
  - Voice: Aoede
  - Language: id-ID
  - System Instruction: None (neutral plain TTS)


## Load Test Prompts

In [4]:
# Load test prompts from file
with open('test.txt', 'r', encoding='utf-8') as f:
    test_prompts = [line.strip() for line in f.readlines() if line.strip()]

print(f"Loaded {len(test_prompts)} test prompts\n")
for i, prompt in enumerate(test_prompts, 1):
    print(f"Prompt {i}: {prompt[:100]}..." if len(prompt) > 100 else f"Prompt {i}: {prompt}")
    print()

Loaded 3 test prompts

Prompt 1: Murid sudah tepat dalam mengidentifikasi soal cerita dengan menuliskan panjang, lebar, dan jarak ant...

Prompt 2: Murid sudah tepat melakukan pemfaktoran dengan pohon faktor, dan menuliskan faktor dari kedua bilang...

Prompt 3: "Kesalahan dalam menerjemahkan pecahan campuran. Bisa jadi karena salah melihat soal, atau memang ku...



## Test 1: Live API with System Instruction (Python SDK)

Testing all 3 prompts using the Live API Python SDK with:
- **System instruction**: "Kak Caca" persona for warm, encouraging tutor voice
- **Audio transcription**: Enabled to see what was spoken
- **Timing measurements**: Latency and time-to-first-chunk
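
The timing measurements listed above can be tried out without calling the Live API. The following is a minimal sketch, assuming a stand-in async generator (`fake_audio_stream`, a hypothetical name) in place of the real session stream; `measure_stream` is likewise an illustrative helper and not part of the SDK.

```python
import asyncio
import time


async def fake_audio_stream(n_chunks=3, delay=0.01):
    """Hypothetical stand-in for a streaming response; yields dummy PCM bytes."""
    for _ in range(n_chunks):
        await asyncio.sleep(delay)
        yield b"\x00\x00" * 240


async def measure_stream(stream):
    """Return (time_to_first_chunk, total_latency, chunk_count), times in seconds."""
    start = time.perf_counter()
    first_chunk_at = None
    count = 0
    async for _chunk in stream:
        # Record the arrival time of the first chunk only
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttfc = first_chunk_at - start if first_chunk_at is not None else None
    return ttfc, end - start, count


ttfc, total, count = asyncio.run(measure_stream(fake_audio_stream()))
print(f"first chunk: {ttfc:.3f}s, total: {total:.3f}s, chunks: {count}")
```

Inside a notebook, where an event loop is already running, `await measure_stream(...)` would take the place of `asyncio.run(...)`.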

In [5]:
# Live API test function (Python SDK, session configured with the system instruction)
async def test_live_api(text_input, prompt_num):
    start_time = time.time()
    start_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    
    display(Markdown(f"### Live API (with System Instruction) - Prompt {prompt_num}"))
    display(Markdown(f"**Start Time:** {start_timestamp}"))
    display(Markdown(f"**Input:** {text_input[:200]}..."))
    
    audio_data = []
    output_transcriptions = []
    first_chunk_time = None
    
    try:
        # Connect to Live API using Python SDK with system instruction
        async with live_client.aio.live.connect(
            model=LIVE_MODEL_ID, 
            config=live_session_config
        ) as session:
            # Send the text using proper types
            request_sent_time = time.time()
            await session.send_client_content(
                turns=types.Content(
                    role="user", 
                    parts=[types.Part(text=text_input)]
                )
            )
            
            # Receive audio response
            async for message in session.receive():
                # Mark first chunk time
                if first_chunk_time is None:
                    first_chunk_time = time.time()
                
                # Some messages carry no server content; skip them
                server_content = message.server_content
                if server_content is None:
                    continue
                
                # Collect output transcription
                if (server_content.output_transcription 
                    and server_content.output_transcription.text):
                    output_transcriptions.append(
                        server_content.output_transcription.text
                    )
                
                # Collect audio data (raw 16-bit PCM)
                if (server_content.model_turn 
                    and server_content.model_turn.parts):
                    for part in server_content.model_turn.parts:
                        if part.inline_data:
                            audio_data.append(
                                np.frombuffer(part.inline_data.data, dtype=np.int16)
                            )
        
        end_time = time.time()
        end_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
        
        total_latency = end_time - start_time
        time_to_first_chunk = (
            first_chunk_time - request_sent_time if first_chunk_time else None
        )
        
        display(Markdown(f"**End Time:** {end_timestamp}"))
        display(Markdown(f"**Total Latency:** {total_latency:.3f} seconds"))
        if time_to_first_chunk:
            display(Markdown(f"**Time to First Chunk:** {time_to_first_chunk:.3f} seconds"))
        
        if audio_data:
            full_audio = np.concatenate(audio_data)
            display(Audio(full_audio, rate=24000, autoplay=False))
            print(f"Audio chunks received: {len(audio_data)}")
        
        if output_transcriptions:
            display(Markdown(f"**Output transcription:** {''.join(output_transcriptions)}"))
        
        return {
            'prompt_num': prompt_num,
            'method': 'Live API (with System Instruction)',
            'total_latency': total_latency,
            'time_to_first_chunk': time_to_first_chunk,
            'start_time': start_timestamp,
            'end_time': end_timestamp,
            'success': True
        }
    except Exception as e:
        display(Markdown(f"**Error:** {e}"))
        import traceback
        traceback.print_exc()
        return {
            'prompt_num': prompt_num,
            'method': 'Live API (with System Instruction)',
            'success': False,
            'error': str(e)
        }

In [6]:
# Run Live API tests
live_api_results = []

for i, prompt in enumerate(test_prompts, 1):
    result = await test_live_api(prompt, i)
    live_api_results.append(result)
    display(Markdown("---"))

### Live API (with System Instruction) - Prompt 1

**Start Time:** 2025-10-23 15:40:53.697672

**Input:** Murid sudah tepat dalam mengidentifikasi soal cerita dengan menuliskan panjang, lebar, dan jarak antar pohon."Murid sudah melakukan prosedur yang tepat dengan membagi 60 dan 42 masing-masing dengan 3,...

**End Time:** 2025-10-23 15:41:19.054409

**Total Latency:** 25.357 seconds

**Time to First Chunk:** 0.778 seconds

Audio chunks received: 181


**Output transcription:** "Murid sudah tepat dalam mengidentifikasi soal cerita dengan menuliskan panjang, lebar, dan jarak antar pohon. "Murid sudah melakukan prosedur yang tepat dengan membagi 60 dan 42 masing-masing dengan 3, yaitu jarak antar pohon, dan kemudian menjumlahkannya. Kesalahan terjadi di mana ada pemahaman yang terlewat bahwa persegi panjang memiliki 2 sisi panjang dan 2 sisi lebar, sehingga menjumlahkan 20 dan 14 belum cukup untuk mendapatkan jawaban yang tepat. Rekomendasi untuk perbaikan adalah siswa dapat menggambar ilustrasi kebun dan menggambarkan pohon di sekelilingnya. Kemudian perhatikan bahwa area panjang dan lebar masing-masing memiliki 2 sisi. Jawaban siswa hanya mencakup 1 sisi panjang dan 1 sisi lebar."

---

### Live API (with System Instruction) - Prompt 2

**Start Time:** 2025-10-23 15:41:19.144245

**Input:** Murid sudah tepat melakukan pemfaktoran dengan pohon faktor, dan menuliskan faktor dari kedua bilangan dengan tepat. Kesalahan terjadi pada tahap menentukan KPK. KPK ditentukan berdasarkan faktor terb...

**End Time:** 2025-10-23 15:41:48.457038

**Total Latency:** 29.313 seconds

**Time to First Chunk:** 0.716 seconds

Audio chunks received: 149


**Output transcription:** Murid sudah tepat melakukan pemfaktoran dengan pohon faktor, dan menuliskan faktor dari kedua bilangan dengan tepat. Kesalahan terjadi pada tahap menentukan KPK. KPK ditentukan berdasarkan faktor terbesar dari masing-masing bilangan, kemudian mengalikannya. Sedangkan yang dilakukan siswa adalah mengalikan 2 * 3 * 2, padahal ada faktor yang lebih besar, yaitu 2 ^ 3. Rekomendasi koreksi untuk 6 bisa ditulis menjadi 2 * 3, dan 8 bisa ditulis menjadi 2 ^ 3, sehingga KPK-nya adalah 3 * 2 ^ 3, yaitu 24.

---

### Live API (with System Instruction) - Prompt 3

**Start Time:** 2025-10-23 15:41:48.520536

**Input:** "Kesalahan dalam menerjemahkan pecahan campuran. Bisa jadi karena salah melihat soal, atau memang kurang paham. Seharusnya pecahan campuran menjadi 3/2, bukan 11/2".  Kesalahan dalam mengubah desimal ...

**End Time:** 2025-10-23 15:42:14.143159

**Total Latency:** 25.623 seconds

**Time to First Chunk:** 0.686 seconds

Audio chunks received: 187


---

## Test 2: Regular TTS API (with Prompt)

Testing all 3 prompts using the regular TTS API with:
- **Voice**: Aoede
- **Language**: Indonesian (id-ID)
- **Prompt parameter**: "Kak Caca" tone instruction (no system instruction)

In [8]:
# Regular TTS API Test Function WITH PROMPT for voice control
def test_regular_tts(text_input, prompt_num):
    start_time = time.time()
    start_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    
    display(Markdown(f"### Regular TTS API (WITH Prompt for Kak Caca tone) - Prompt {prompt_num}"))
    display(Markdown(f"**Start Time:** {start_timestamp}"))
    display(Markdown(f"**Input:** {text_input[:200]}..."))
    
    # PROMPT for voice control in Regular TTS
    TTS_PROMPT = "You are Kak Caca, a warm and encouraging math tutor. Read the following with a friendly, caring tone as if speaking to a student."
    
    voice = texttospeech.VoiceSelectionParams(
        name=TTS_VOICE, language_code=LANGUAGE_CODE, model_name=TTS_MODEL
    )
    
    # Perform the text-to-speech request WITH PROMPT
    response = tts_client.synthesize_speech(
        input=texttospeech.SynthesisInput(
            text=text_input,
            prompt=TTS_PROMPT  # ← Add prompt for tone control
        ),
        voice=voice,
        audio_config=texttospeech.AudioConfig(
            audio_encoding=texttospeech.AudioEncoding.MP3
        ),
    )
    
    end_time = time.time()
    end_timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
    
    total_latency = end_time - start_time
    
    display(Markdown(f"**End Time:** {end_timestamp}"))
    display(Markdown(f"**Total Latency:** {total_latency:.3f} seconds"))
    
    # Play the generated audio
    display(Audio(response.audio_content, autoplay=False))
    
    return {
        'prompt_num': prompt_num,
        'method': 'Regular TTS (WITH Prompt)',
        'total_latency': total_latency,
        'start_time': start_timestamp,
        'end_time': end_timestamp,
        'success': True
    }

In [None]:
# Run Regular TTS tests
regular_tts_results = []

for i, prompt in enumerate(test_prompts, 1):
    result = test_regular_tts(prompt, i)
    regular_tts_results.append(result)
    display(Markdown("---"))

### Regular TTS API (WITH Prompt for Kak Caca tone) - Prompt 1

**Start Time:** 2025-10-23 15:42:14.263595

**Input:** Murid sudah tepat dalam mengidentifikasi soal cerita dengan menuliskan panjang, lebar, dan jarak antar pohon."Murid sudah melakukan prosedur yang tepat dengan membagi 60 dan 42 masing-masing dengan 3,...

**End Time:** 2025-10-23 15:42:45.354479

**Total Latency:** 31.091 seconds

---

### Regular TTS API (WITH Prompt for Kak Caca tone) - Prompt 2

**Start Time:** 2025-10-23 15:42:45.368747

**Input:** Murid sudah tepat melakukan pemfaktoran dengan pohon faktor, dan menuliskan faktor dari kedua bilangan dengan tepat. Kesalahan terjadi pada tahap menentukan KPK. KPK ditentukan berdasarkan faktor terb...

**End Time:** 2025-10-23 15:43:20.627409

**Total Latency:** 35.259 seconds

---

### Regular TTS API (WITH Prompt for Kak Caca tone) - Prompt 3

**Start Time:** 2025-10-23 15:43:20.640547

**Input:** "Kesalahan dalam menerjemahkan pecahan campuran. Bisa jadi karena salah melihat soal, atau memang kurang paham. Seharusnya pecahan campuran menjadi 3/2, bukan 11/2".  Kesalahan dalam mengubah desimal ...

## Conclusions

### The Key Difference:

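**Live API (Conversational):**
- ✅ Natural, conversational persona
- ✅ Streaming audio capability (first chunk arrived in roughly 0.7 to 0.8 seconds in this run)
- ❌ May ADD/MODIFY content (not exact TTS); the output transcriptions for prompts 1 and 2 above stayed close to the input, but this is not guaranteed
- ❌ Long latency for the full response (about 25 to 29 seconds per prompt in this run)
- **Use case**: Interactive tutoring where model can add encouragement/guidance

**Regular TTS with Prompt (Pure TTS):**
- ✅ Reads EXACT text provided
- ✅ Prompt parameter controls tone/emotion
- ❌ No streaming: audio is available only once the whole request completes
- ❌ Not faster in this run: 31.1 and 35.3 seconds for prompts 1 and 2, versus 25.4 and 29.3 seconds for the Live API
- **Use case**: Narrating pre-written feedback, scripted educational content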

### Metrics Compared:

1. **Total Latency**: End-to-end time for audio generation
2. **Content Fidelity**: Check transcriptions - does it match input exactly?
3. **Voice Quality**: Listen and compare naturalness and emotional expressiveness
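
The content-fidelity check in item 2 can be partly automated by comparing the input text with the output transcription word by word. The following is a minimal sketch using the standard-library `difflib`; `normalize` and `fidelity_report` are hypothetical helper names, and the sample transcript with an added "Baik!" is invented for illustration. Since transcriptions may render numbers and symbols differently from the input, a mismatch is a prompt to listen to the audio, not proof of a change.

```python
import difflib
import re


def normalize(text):
    """Lowercase, drop quotation marks, and collapse whitespace."""
    text = re.sub(r"[\"'“”‘’]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def fidelity_report(source_text, transcript):
    """Return a similarity ratio plus the words added to or dropped from the source."""
    src_words = normalize(source_text).split()
    out_words = normalize(transcript).split()
    matcher = difflib.SequenceMatcher(a=src_words, b=out_words, autojunk=False)
    added, dropped = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            dropped.extend(src_words[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(out_words[j1:j2])
    return {"ratio": matcher.ratio(), "added": added, "dropped": dropped}


# Hypothetical example: the transcript starts with a word the input lacks.
report = fidelity_report(
    "KPK ditentukan berdasarkan faktor terbesar.",
    "Baik! KPK ditentukan berdasarkan faktor terbesar.",
)
print(report)
```

A ratio of 1.0 with empty `added` and `dropped` lists would indicate that the normalized texts match word for word.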

### Recommendation:

- **For exact scripted content (Option A)**: Use **Regular TTS with prompt** parameter
- **For interactive tutoring**: Use **Live API** (accept that it may modify content)

Since you want **Option A** (exact text with warm tone), the **Regular TTS API with prompt parameter is actually the better choice** - it's designed for TTS and will read exactly what you provide while applying the emotional tone.

In [None]:
# Display comparison table
display(Markdown("## Summary Comparison"))
display(Markdown("\n### Latency Results\n"))

table = "| Prompt | Live API (s) | Regular TTS (s) | Difference (s) | Faster Method |\n"
table += "|--------|--------------|-----------------|----------------|---------------|\n"

# Pair results by prompt; skip prompts where either method failed or did not finish
rows = [
    (live, regular)
    for live, regular in zip(live_api_results, regular_tts_results)
    if live.get('success') and regular.get('success')
]

for live, regular in rows:
    live_latency = live['total_latency']
    regular_latency = regular['total_latency']
    diff = abs(live_latency - regular_latency)
    faster = "Live API" if live_latency < regular_latency else "Regular TTS"
    
    table += f"| {live['prompt_num']} | {live_latency:.3f} | {regular_latency:.3f} | {diff:.3f} | {faster} |\n"

# Add average row (only over prompts that succeeded for both methods)
if rows:
    avg_live = sum(live['total_latency'] for live, _ in rows) / len(rows)
    avg_regular = sum(regular['total_latency'] for _, regular in rows) / len(rows)
    avg_diff = abs(avg_live - avg_regular)
    avg_faster = "Live API" if avg_live < avg_regular else "Regular TTS"
    
    table += f"| **Average** | **{avg_live:.3f}** | **{avg_regular:.3f}** | **{avg_diff:.3f}** | **{avg_faster}** |\n"

display(Markdown(table))

# Display time to first chunk for Live API
display(Markdown("\n### Live API - Time to First Chunk\n"))
ttfc_table = "| Prompt | Time to First Chunk (s) |\n"
ttfc_table += "|--------|-------------------------|\n"

for result in live_api_results:
    # Failed runs have no key, and runs without any chunk store None
    ttfc = result.get('time_to_first_chunk')
    ttfc_str = f"{ttfc:.3f}" if ttfc is not None else "N/A"
    ttfc_table += f"| {result['prompt_num']} | {ttfc_str} |\n"

if all(r.get('time_to_first_chunk') for r in live_api_results):
    avg_ttfc = sum(r['time_to_first_chunk'] for r in live_api_results) / len(live_api_results)
    ttfc_table += f"| **Average** | **{avg_ttfc:.3f}** |\n"

display(Markdown(ttfc_table))

## Conclusions

Key metrics to consider:

1. **Total Latency**: Time from request start to complete audio generation
2. **Time to First Chunk** (Live API only): How quickly the first audio chunk arrives
3. **Voice Quality**: Subjective evaluation (listen to the audio samples above)
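
A single run per prompt gives a noisy picture of latency, so a firmer comparison would repeat each request and summarize the samples. The following is a minimal sketch of such a summary, assuming a hypothetical helper `summarize_latencies`; the input here simply reuses the three Live API totals reported above as sample values and is not a set of repeated runs.

```python
import statistics


def summarize_latencies(samples):
    """Summarize latency measurements (in seconds) for one method."""
    ordered = sorted(samples)
    return {
        "n": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Total latencies reported above for the Live API (prompts 1 to 3).
live_totals = [25.357, 29.313, 25.623]
summary = summarize_latencies(live_totals)
print({key: round(value, 3) for key, value in summary.items()})
```

The median is less sensitive than the mean to a single slow request, which may matter when only a few samples are available.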

The Live API's streaming capability means it can start playing audio before the entire response is generated, which can feel faster to end users even if total latency is similar.
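
One way to put total latency in context is to compare it with the length of the audio produced. The following is a minimal sketch, assuming mono 16-bit PCM at the 24000 Hz rate used for playback above; `audio_duration_seconds` and `real_time_factor` are hypothetical helper names, and the chunk sizes in the example are invented.

```python
SAMPLE_RATE_HZ = 24000  # playback rate used for the Live API audio above
BYTES_PER_SAMPLE = 2    # 16-bit PCM, matching the int16 decoding above


def audio_duration_seconds(chunks, sample_rate=SAMPLE_RATE_HZ):
    """Duration of mono 16-bit PCM audio given as a list of byte chunks."""
    total_bytes = sum(len(chunk) for chunk in chunks)
    return total_bytes / BYTES_PER_SAMPLE / sample_rate


def real_time_factor(total_latency, duration):
    """Latency divided by audio length; a value near 1.0 means real-time pacing."""
    return total_latency / duration if duration > 0 else float("inf")


# Hypothetical example: 100 chunks of 4800 bytes each (0.1 s of audio per chunk).
chunks = [b"\x00" * 4800] * 100
duration = audio_duration_seconds(chunks)
print(duration, real_time_factor(12.0, duration))
```

If the factor for a streamed response is close to 1.0, playback can begin at the first chunk and keep pace with generation, so the perceived wait is closer to the time to first chunk than to the total latency.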