In [1]:
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with Gemini-TTS voices using Text-to-Speech



## Overview

This notebook introduces [Gemini-TTS](https://cloud.google.com/text-to-speech/docs/gemini-tts), the latest evolution of our Text-to-Speech technology, which moves beyond naturalness to give you granular control over generated audio through text-based prompts. With Gemini-TTS, you can synthesize anything from short snippets to long-form narratives and steer style, accent, pace, tone, and emotional expression with natural-language prompts. You can also create conversations between two speakers with the same expressiveness and steerability.


There are currently 30 distinct voice options. See [all available voices](https://cloud.google.com/text-to-speech/docs/gemini-tts#voice_options).

There are 80+ locale options to use for synthesis. See [all available locales](https://cloud.google.com/text-to-speech/docs/gemini-tts#language_availability).

In this tutorial, you learn how to:

- Synthesize speech using real-time (online) processing.
- Use formatting and expressive tags to modify the tone of the speech.
- Synthesize dialogues with two speakers.

## Get started

### Install Text-to-Speech SDK and other required packages

The Gemini-TTS fields require `google-cloud-texttospeech` version 2.31.0 or later.

In [2]:
%%bash
# Detect the operating system
os=$(uname -s)

if [[ "$os" == "Linux" ]]; then
  # Linux installation
  sudo apt update -y -qq
  sudo apt install ffmpeg -y -qq
  echo "ffmpeg installed successfully on Linux."
elif [[ "$os" == "Darwin" ]]; then
  # macOS installation
  if command -v brew &> /dev/null; then
    brew install ffmpeg
    if [[ $? -eq 0 ]]; then
        echo "ffmpeg installed successfully on macOS using Homebrew."
    else
        echo "Error installing ffmpeg on macOS using Homebrew."
    fi
  else
    echo "Homebrew is not installed. Please install Homebrew and try again."
  fi
else
  echo "Unsupported operating system: $os"
fi



E: The repository 'https://deb.debian.org/debian bullseye-backports Release' does not have a Release file.




ffmpeg is already the newest version (7:4.3.9-0+deb11u1).
0 upgraded, 0 newly installed, 0 to remove and 20 not upgraded.
ffmpeg installed successfully on Linux.


In [3]:
%pip install --upgrade --quiet google-cloud-texttospeech
%pip show google-cloud-texttospeech

Note: you may need to restart the kernel to use updated packages.
Name: google-cloud-texttospeech
Version: 2.32.0
Summary: Google Cloud Texttospeech API client library
Home-page: https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-texttospeech
Author: Google LLC
Author-email: googleapis-packages@google.com
License: Apache 2.0
Location: /opt/conda/lib/python3.10/site-packages
Requires: google-api-core, google-auth, grpcio, proto-plus, protobuf
Required-by: 
Note: you may need to restart the kernel to use updated packages.
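
The minimum version stated above can also be checked programmatically. The following is a minimal sketch using only the standard library; `parse_version`, `meets_minimum`, and `installed_version` are hypothetical helper names, and the parser assumes plain dotted numeric versions such as `2.32.0`.

```python
from importlib import metadata
from itertools import takewhile

MIN_VERSION = "2.31.0"  # minimum stated in this notebook for the Gemini-TTS fields


def parse_version(version: str) -> tuple:
    """Turn '2.32.0' into (2, 32, 0), keeping only the leading digits of each part."""
    numbers = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdecimal, piece))
        if not digits:
            break  # stop at the first part that does not start with a digit
        numbers.append(int(digits))
    return tuple(numbers)


def meets_minimum(installed: str, minimum: str = MIN_VERSION) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(package: str = "google-cloud-texttospeech"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("google-cloud-texttospeech is not installed")
elif meets_minimum(found):
    print(f"{found} satisfies the minimum {MIN_VERSION}")
else:
    print(f"{found} is older than the minimum {MIN_VERSION}")
```

If the check reports an older version, rerun the install cell and restart the kernel.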


### Authenticate your notebook environment (Colab only)

If you're running this notebook on Google Colab, run the cell below to authenticate your environment.

In [4]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

### Set Google Cloud project information and initialize SDK

To get started using the Text-to-Speech API, you must have an existing Google Cloud project and [enable the API](https://console.cloud.google.com/flows/enableapi?apiid=texttospeech.googleapis.com).

Learn more about [setting up a project and a development environment](https://cloud.google.com/vertex-ai/docs/start/cloud-environment).

For regional availability, see [documentation](https://cloud.google.com/text-to-speech/docs/gemini-tts#regional_availability).

In [5]:
import os

# fmt: off
PROJECT_ID = "[your-project-id]"  # @param {type: "string", placeholder: "[your-project-id]", isTemplate: true}
# fmt: on
# Fall back to the environment variable if no project ID is provided.
# Using "" as the default avoids storing the literal string "None".
if not PROJECT_ID or PROJECT_ID == "[your-project-id]":
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")

TTS_LOCATION = "global"

In [6]:
# Log in first: set-quota-project fails if application default credentials do not exist yet.
! gcloud config set project {PROJECT_ID}
! gcloud auth application-default login -q
! gcloud auth application-default set-quota-project {PROJECT_ID}


### Import libraries

In [7]:
from IPython.display import Audio, display
from google.api_core.client_options import ClientOptions
from google.cloud import texttospeech_v1beta1 as texttospeech

### Set constants

Set the API endpoint and initialize the Text-to-Speech client.


In [8]:
API_ENDPOINT = (
    f"{TTS_LOCATION}-texttospeech.googleapis.com"
    if TTS_LOCATION != "global"
    else "texttospeech.googleapis.com"
)

client = texttospeech.TextToSpeechClient(
    client_options=ClientOptions(api_endpoint=API_ENDPOINT)
)
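
The endpoint rule used in the cell above can be written as a small function, which makes it easier to see what each location resolves to. This is only a sketch of that same rule; `tts_endpoint` is a hypothetical helper name, and `us` and `eu` are example values, so check the regional availability documentation before using a particular location.

```python
def tts_endpoint(location: str) -> str:
    """Return the API host for a location, following the rule used in this notebook."""
    if location == "global":
        # The global endpoint has no location prefix.
        return "texttospeech.googleapis.com"
    # Other locations prefix the host with the location name.
    return f"{location}-texttospeech.googleapis.com"


for example_location in ("global", "us", "eu"):
    print(example_location, "->", tts_endpoint(example_location))
```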

## Synthesize using Gemini-TTS voices


### Synthesize speech using real-time (online) processing

You define the text you want to convert, select a voice and language, and then ask the API to generate audio of the spoken text.

This example uses `Aoede`, a high-definition voice that offers improved clarity. To try another voice, pick one from the `VOICE` drop-down menu.

The code calls the `synthesize_speech` method, which performs the conversion and returns the audio as MP3-encoded `bytes`.


In [9]:
# fmt: off
MODEL = "gemini-2.5-flash-tts"  # @param ["gemini-2.5-flash-tts", "gemini-2.5-pro-tts"]

VOICE = "Aoede"  # @param ["Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe", "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir", "Gacrux", "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Puck", "Pulcherrima", "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat", "Umbriel", "Vindemiatrix", "Zephyr", "Zubenelgenubi"]

LANGUAGE_CODE = "id-id"  # @param ["am-et", "ar-001", "ar-eg",  "az-az",  "be-by",  "bg-bg", "bn-bd", "ca-es", "ceb-ph", "cs-cz",  "da-dk",  "de-de",  "el-gr", "en-au", "en-gb", "en-in",  "en-us",  "es-es",  "es-419", "es-mx", "es-us", "et-ee", "eu-es",  "fa-ir",  "fi-fi",  "fil-ph", "fr-fr", "fr-ca", "gl-es", "gu-in",  "hi-in",  "hr-hr",  "ht-ht",  "hu-hu", "af-za", "hy-am", "id-id",  "is-is",  "it-it",  "he-il",  "ja-jp", "jv-jv", "ka-ge", "kn-in",  "ko-kr",  "kok-in", "la-va",  "lb-lu", "lo-la", "lt-lt", "lv-lv",  "mai-in", "mg-mg",  "mk-mk",  "ml-in", "mn-mn", "mr-in", "ms-my",  "my-mm",  "nb-no",  "ne-np",  "nl-nl", "nn-no", "or-in", "pa-in",  "pl-pl",  "ps-af",  "pt-br",  "pt-pt", "ro-ro", "ru-ru", "sd-in",  "si-lk",  "sk-sk",  "sl-si",  "sq-al", "sr-rs", "sv-se", "sw-ke",  "ta-in",  "te-in",  "th-th",  "tr-tr", "uk-ua", "ur-pk", "vi-vn",  "cmn-cn", "cmn-tw"]
# fmt: on


voice = texttospeech.VoiceSelectionParams(
    name=VOICE, language_code=LANGUAGE_CODE, model_name=MODEL
)

In [10]:
# @title Capture emotion with prompts

# fmt: off
PROMPT = "You are having a conversation with a friend. Say the following in a happy and casual way"  # @param {type: "string"}
# fmt: on
TEXT = "hahaha, i did NOT expect that. can you believe it!"  # @param {type: "string"}

# Perform the text-to-speech request on the text input with the selected
# voice parameters and audio file type
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text=TEXT, prompt=PROMPT),
    voice=voice,
    # Select the type of audio file you want returned
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

# play the generated audio
display(Audio(response.audio_content))
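
Because `response.audio_content` is plain `bytes`, it can also be written to disk instead of only being played inline. The sketch below uses placeholder bytes in place of a real response so that it runs without calling the API; `save_audio` and the file name are hypothetical.

```python
import tempfile
from pathlib import Path


def save_audio(audio_content: bytes, path) -> Path:
    """Write audio bytes to a file, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(audio_content)
    return target


# Placeholder data; in the notebook you would pass response.audio_content instead.
placeholder_audio = b"\x00" * 16

with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_audio(placeholder_audio, Path(tmp_dir) / "out" / "speech.mp3")
    print(saved.name, saved.stat().st_size, "bytes")
```

Choose a file extension that matches the requested `audio_encoding`, for example `.mp3` for `MP3`.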

In [11]:
# @title Modify pace of the speech

# fmt: off
PROMPT = "Say the following very fast but still be intelligible"  # @param {type: "string"}
TEXT = "Availability and terms may vary. Check our website or your local store for complete details and restrictions."  # @param {type: "string"}
# fmt: on

# Perform the text-to-speech request on the text input with the selected
# voice parameters and audio file type
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text=TEXT, prompt=PROMPT),
    voice=voice,
    # Select the type of audio file you want returned
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
# play the generated audio
display(Audio(response.audio_content))

In [12]:
# @title Modify text with expressive tags

# NOTE: These tags are not strict syntax. Feel free to experiment with different
# expressions and formats.

PROMPT = "Say the following with a sarcastic tone"  # @param {type: "string"}
# fmt: off
TEXT = "So.. [chuckling] tell me about this [coughs] AI thing."  # @param {type: "string"}
# fmt: on

# Perform the text-to-speech request on the text input with the selected
# voice parameters and audio file type
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text=TEXT, prompt=PROMPT),
    voice=voice,
    # Select the type of audio file you want returned
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
# play the generated audio
display(Audio(response.audio_content))
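
When experimenting with expressive tags, it can help to list the tags in a piece of text or to recover the plain text without them. The notebook notes that the tags are not strict syntax, so the square-bracket pattern below is only an assumption based on the example text; `find_tags` and `strip_tags` are hypothetical helpers.

```python
import re

# Assumes a tag is any text enclosed in one pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(text: str) -> list:
    """Return the contents of every bracketed tag, in order of appearance."""
    return TAG_PATTERN.findall(text)


def strip_tags(text: str) -> str:
    """Remove bracketed tags and collapse the extra whitespace they leave behind."""
    without_tags = TAG_PATTERN.sub("", text)
    return re.sub(r"\s{2,}", " ", without_tags).strip()


sample = "So.. [chuckling] tell me about this [coughs] AI thing."
print(find_tags(sample))
print(strip_tags(sample))
```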

## Multi-speaker (Dialog) Speech Synthesis

You can create a dialog between two speakers. With `multi_speaker_voice_config`, you specify the voice for each speaker and assign each one a custom alias that the input text refers to.

There are two ways to structure the multi-speaker input: explicit turn-based markup, or inline dialog text in which each turn starts with the speaker alias.
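
The two input styles carry the same information, so one can be converted to the other. The sketch below assumes the inline style is one `alias: text` turn per line, as in the commented example later in this notebook; `turns_to_inline` and `inline_to_turns` are hypothetical helpers and are not part of the Text-to-Speech SDK.

```python
def turns_to_inline(turns) -> str:
    """Join (speaker_alias, text) pairs into one 'alias: text' line per turn."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)


def inline_to_turns(dialog: str) -> list:
    """Split inline dialog text back into (speaker_alias, text) pairs."""
    turns = []
    for line in dialog.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the first colon only, so colons inside the text are kept.
        speaker, separator, text = line.partition(":")
        if not separator:
            raise ValueError(f"missing speaker alias in line: {line!r}")
        turns.append((speaker.strip(), text.strip()))
    return turns


example_turns = [
    ("Zizu", "Have you tried the new multi-speaker feature on Gemini?"),
    ("Gary", "Yes! I am super excited about it"),
]
inline_text = turns_to_inline(example_turns)
print(inline_text)
print(inline_to_turns(inline_text) == example_turns)
```

Each pair maps to one `MultiSpeakerMarkup.Turn` in the turn-based style, while the joined string is what the inline style passes as `text`.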

In [13]:
# @title Explicit turn-based syntax

SPEAKER_ALIAS_1 = "Zizu"  # @param {type: "string"}
# fmt: off
SPEAKER_1 = "Fenrir"  # @param ["Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe", "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir", "Gacrux", "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Puck", "Pulcherrima", "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat", "Umbriel", "Vindemiatrix", "Zephyr", "Zubenelgenubi"]

SPEAKER_ALIAS_2 = "Gary"  # @param {type: "string"}
SPEAKER_2 = "Orus"  # @param ["Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe", "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir", "Gacrux", "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Puck", "Pulcherrima", "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat", "Umbriel", "Vindemiatrix", "Zephyr", "Zubenelgenubi"]

# LANGUAGE_CODE = "en-gb"  # @param ["am-et", "ar-001", "ar-eg",  "az-az",  "be-by",  "bg-bg", "bn-bd", "ca-es", "ceb-ph", "cs-cz",  "da-dk",  "de-de",  "el-gr", "en-au", "en-gb", "en-in",  "en-us",  "es-es",  "es-419", "es-mx", "es-us", "et-ee", "eu-es",  "fa-ir",  "fi-fi",  "fil-ph", "fr-fr", "fr-ca", "gl-es", "gu-in",  "hi-in",  "hr-hr",  "ht-ht",  "hu-hu", "af-za", "hy-am", "id-id",  "is-is",  "it-it",  "he-il",  "ja-jp", "jv-jv", "ka-ge", "kn-in",  "ko-kr",  "kok-in", "la-va",  "lb-lu", "lo-la", "lt-lt", "lv-lv",  "mai-in", "mg-mg",  "mk-mk",  "ml-in", "mn-mn", "mr-in", "ms-my",  "my-mm",  "nb-no",  "ne-np",  "nl-nl", "nn-no", "or-in", "pa-in",  "pl-pl",  "ps-af",  "pt-br",  "pt-pt", "ro-ro", "ru-ru", "sd-in",  "si-lk",  "sk-sk",  "sl-si",  "sq-al", "sr-rs", "sv-se", "sw-ke",  "ta-in",  "te-in",  "th-th",  "tr-tr", "uk-ua", "ur-pk", "vi-vn",  "cmn-cn", "cmn-tw"]
# fmt: on

PROMPT = "Read the following dialogue between two friends"  # @param {type: "string"}

multi_speaker_voice_config = texttospeech.MultiSpeakerVoiceConfig(
    speaker_voice_configs=[
        texttospeech.MultispeakerPrebuiltVoice(
            speaker_alias=SPEAKER_ALIAS_1, speaker_id=SPEAKER_1
        ),
        texttospeech.MultispeakerPrebuiltVoice(
            speaker_alias=SPEAKER_ALIAS_2, speaker_id=SPEAKER_2
        ),
    ]
)

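# The Indonesian sample turns read: "The student correctly identified the word
# problem by writing down the length, width" and "and the distance between trees".
multi_speaker_markup = texttospeech.MultiSpeakerMarkup(
    turns=[
        texttospeech.MultiSpeakerMarkup.Turn(
            speaker=SPEAKER_ALIAS_1,
            text="Murid sudah tepat dalam mengidentifikasi soal cerita dengan menuliskan panjang, lebar",
        ),
        texttospeech.MultiSpeakerMarkup.Turn(
            speaker=SPEAKER_ALIAS_2, text="dan jarak antar pohon"
        ),
    ]
)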
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(
        multi_speaker_markup=multi_speaker_markup, prompt=PROMPT
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code=LANGUAGE_CODE,
        model_name=MODEL,
        multi_speaker_voice_config=multi_speaker_voice_config,
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16
    ),
)
# play the generated audio
display(Audio(response.audio_content))
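
`LINEAR16` is uncompressed 16-bit PCM. This notebook does not state whether the returned bytes already include a WAV header, so the sketch below checks for one and wraps the samples with the standard `wave` module only if it is missing. The 24 kHz mono format and the helper names `has_wav_header` and `ensure_wav` are assumptions; adjust them to match the actual response.

```python
import io
import wave


def has_wav_header(data: bytes) -> bool:
    """Return True if the bytes start with a RIFF/WAVE container header."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"


def ensure_wav(data: bytes, sample_rate: int = 24000, channels: int = 1) -> bytes:
    """Return WAV bytes, wrapping raw 16-bit PCM only when no header is present."""
    if has_wav_header(data):
        return data
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav_file.setframerate(sample_rate)  # assumed rate, not confirmed by the notebook
        wav_file.writeframes(data)
    return buffer.getvalue()


# One second of silence stands in for response.audio_content.
silent_pcm = b"\x00\x00" * 24000
wav_bytes = ensure_wav(silent_pcm)

with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
    print(reader.getnchannels(), reader.getframerate(), reader.getnframes())
```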

In [14]:
# Inline dialog text: each turn starts with a speaker alias followed by a colon.
# The Indonesian sample is feedback on a student's answer to a word problem about
# planting trees around a rectangular garden. Turns alternate for illustration.
text_spoken = f"""{SPEAKER_ALIAS_1}: Murid sudah tepat dalam mengidentifikasi soal cerita dengan menuliskan panjang, lebar, dan jarak antar pohon.
{SPEAKER_ALIAS_2}: Murid sudah melakukan prosedur yang tepat dengan membagi 60 dan 42 masing-masing dengan 3, yaitu jarak antar pohon, dan kemudian menjumlahkannya.
{SPEAKER_ALIAS_1}: Kesalahan terjadi dimana ada pemahaman yang terlewat bahwa persegi panjang memiliki 2 sisi panjang dan 2 sisi lebar, sehingga menjumlahkan 20 dan 14 belum cukup untuk mendapatkan jawaban yang tepat.
{SPEAKER_ALIAS_2}: Rekomendasi untuk perbaikan adalah siswa dapat menggambar ilustrasi kebun dan menggambarkan pohon di sekelilingnya. Kemudian perhatikan bahwa area panjang dan lebar masing-masing memiliki 2 sisi. Jawaban siswa hanya mencakup 1 sisi panjang dan 1 sisi lebar.
"""

In [15]:
# @title Inline dialog text input

multi_speaker_voice_config = texttospeech.MultiSpeakerVoiceConfig(
    speaker_voice_configs=[
        texttospeech.MultispeakerPrebuiltVoice(
            speaker_alias=SPEAKER_ALIAS_1,
            speaker_id=SPEAKER_1,
        ),
        texttospeech.MultispeakerPrebuiltVoice(
            speaker_alias=SPEAKER_ALIAS_2,
            speaker_id=SPEAKER_2,
        ),
    ]
)
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(
        # Inline format: one "<speaker alias>: <text>" turn per line, for example
        # "Zizu: Have you tried the new multi-speaker feature on Gemini?\nGary: Yes! I am super excited about it"
        text=text_spoken,
        prompt=PROMPT,
    ),
    voice=texttospeech.VoiceSelectionParams(
        language_code=LANGUAGE_CODE,
        model_name=MODEL,
        multi_speaker_voice_config=multi_speaker_voice_config,
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16
    ),
)
# play the generated audio
display(Audio(response.audio_content))

## Further details


Review the [Cloud Text-to-Speech Python SDK documentation](https://cloud.google.com/python/docs/reference/texttospeech/latest) to explore all the fields and options available for customizing API behavior.


To learn more about Gemini-TTS, see the [Gemini-TTS guide](https://cloud.google.com/text-to-speech/docs/gemini-tts).