# 2.1 Running inference on a TTS (Text-to-Speech) model

The TTS system turns the LLM's response into speech, so you can interact with the LLM without a keyboard. This notebook shows how to use the TTS system and Nova's audio player.

Run this cell first so Python can find Nova's scripts. It is not required when importing Nova from outside the root folder.

In [None]:
import sys
from pathlib import Path
module_path = Path().absolute().parent.parent
if str(module_path) not in sys.path:
    sys.path.append(str(module_path))

In [None]:
from nova import *

nova = Nova()

The TTS system mirrors the LLM system in that you need to choose an inference engine, as well as prepare a "conditioning" object.  
By default, Nova comes with two inference engines for the TTS system: one using [Zonos](https://github.com/Zyphra/Zonos) and one using [ElevenLabs](https://elevenlabs.io/).

In [None]:
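# Choose one engine. If both assignments run as-is, the second one wins,
# so comment out the engine you do not want.

# Zonos:
inference_engine = InferenceEngineZonos()

# ElevenLabs:
inference_engine = InferenceEngineElevenlabs()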

Now create a TTSConditioning object. As with the LLM system, each inference engine takes different parameters and expects different value ranges. Below is a set of starting values for each engine.

In [None]:
# Zonos conditioning:
conditioning = TTSConditioning(
    model="Zyphra/Zonos-v0.1-transformer",
    voice="Laura",
    expressivness=100,
    stability=2.0,
    language="en-us",
    speaking_rate=15
)

Parameters:  
- model: Which model to use. You can find the available models [here](https://huggingface.co/collections/Zyphra/zonos-v01-67ac661c85e1898670823b4f)
- voice: The name of the voice to use. Nova comes with one default voice, "Laura", but you can clone other voices and use them. More on voice cloning in [2.2](2.2%20Cloning%20a%20voice.ipynb).
- expressivness: How expressive the voice should sound. Higher values make the voice speak with more emotion and variation, at the cost of stability.
- stability: How stable the voice should be.
- language: The language the voice should speak. Should be the same language the input text is in.
- speaking_rate: How fast the voice should speak.

In [None]:
# ElevenLabs conditioning:
conditioning = TTSConditioning(
    model="eleven_multilingual_v2",
    voice="Xb7hH8MSUJpSbSDYk0k2",
    expressivness=0.5,
    stability=0.5,
    similarity_boost=0.75,
    use_speaker_boost=True
)

Parameters:  
- model: Which model to use. You can find all available models on the ElevenLabs [website](https://elevenlabs.io/app/home).
- voice: The voice ID of your desired voice. Voice IDs can also be found on the ElevenLabs [website](https://elevenlabs.io/app/home).
- expressivness: How expressive the voice should sound. Higher values make the voice speak with more emotion and variation, at the cost of stability.
- stability: How stable the voice should be.
- similarity_boost: How consistent the voice should be. High values can cause artifacts.
- use_speaker_boost: An additional boost to voice consistency at the cost of generation latency.
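Because the two engines take different parameters, it can help to keep both sets of starting values side by side and pick one by name. The sketch below is only an illustration: `STARTING_VALUES` and `starting_values` are hypothetical names, not part of Nova, and the values are copied from the cells above. It assumes the resulting dictionary can be unpacked into `TTSConditioning` as keyword arguments, matching how the cells above construct it.

```python
# Hypothetical helper, not part of Nova: starting values from this notebook,
# keyed by engine name. "expressivness" is spelled as in the cells above.
STARTING_VALUES = {
    "zonos": {
        "model": "Zyphra/Zonos-v0.1-transformer",
        "voice": "Laura",
        "expressivness": 100,
        "stability": 2.0,
        "language": "en-us",
        "speaking_rate": 15,
    },
    "elevenlabs": {
        "model": "eleven_multilingual_v2",
        "voice": "Xb7hH8MSUJpSbSDYk0k2",
        "expressivness": 0.5,
        "stability": 0.5,
        "similarity_boost": 0.75,
        "use_speaker_boost": True,
    },
}


def starting_values(engine):
    """Return a copy of the starting values for the named engine."""
    try:
        # Copy so callers can tweak values without changing the defaults.
        return dict(STARTING_VALUES[engine.lower()])
    except KeyError:
        known = ", ".join(sorted(STARTING_VALUES))
        raise ValueError(f"Unknown engine {engine!r}; expected one of: {known}") from None


# Assumed usage inside this notebook:
# conditioning = TTSConditioning(**starting_values("zonos"))
```

Note that the two engines use very different scales for the same parameter names (for example `expressivness=100` for Zonos versus `0.5` for ElevenLabs), so values should not be carried over from one engine to the other.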

If you are using ElevenLabs, you also need to provide your API key.

In [None]:
nova.edit_secret(Secrets.ELEVENLABS_API, "YOUR-API-KEY")
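To avoid saving a real key inside the notebook, one option is to read it from an environment variable and pass the result to `nova.edit_secret`. This is a minimal sketch: `read_api_key` is a hypothetical helper and `ELEVENLABS_API_KEY` is an assumed variable name, neither of which is defined by Nova.

```python
import os


def read_api_key(env_var="ELEVENLABS_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption for this example; use whatever
    name you set in your own shell or .env tooling.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Assumed usage inside this notebook:
# nova.edit_secret(Secrets.ELEVENLABS_API, read_api_key())
```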

Now we set up the TTS system, similar to how the LLM system is set up.

In [None]:
nova.configure_tts(inference_engine=inference_engine, conditioning=conditioning)

# You need to apply your new configuration.
# Only after applying the configuration will the model be loaded into memory.
nova.apply_config_tts()

Now we can run inference.

In [None]:
speech = nova.run_tts("We choose to go to the Moon, not because it is easy, but because it is hard")

We now have an "AudioData" object that can be passed to the built-in audio player for playback.

In [None]:
nova.play_audio(audio_data=speech)

We can then wait until the audio has finished playing.

In [None]:
nova.wait_for_audio_playback_end()
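The three calls above (`run_tts`, `play_audio`, `wait_for_audio_playback_end`) can be chained to speak several sentences back to back. The sketch below is one way to do that; `speak_all` is a hypothetical helper, not part of Nova, and it assumes `nova` has already been configured as shown in this notebook.

```python
def speak_all(nova, sentences):
    """Synthesize and play each sentence in order.

    Blocks until each clip has finished playing before starting the next.
    Uses only the Nova calls shown earlier in this notebook.
    """
    for sentence in sentences:
        speech = nova.run_tts(sentence)
        nova.play_audio(audio_data=speech)
        nova.wait_for_audio_playback_end()


# Assumed usage inside this notebook:
# speak_all(nova, ["Hello.", "This is a second sentence."])
```

Waiting after every clip keeps the example simple. Whether synthesis of the next sentence can safely overlap with playback depends on how Nova's audio player handles multiple clips, which this notebook does not cover.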