# Lesson 4: Voice Agent Components

<div style="background-color:#fff6ff; padding:13px; border-width:3px; border-color:#efe6ef; border-style:solid; border-radius:6px">
<p> 💻 &nbsp; <b>Access <code>requirements.txt</code>:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Open"</em>.</p>

<p> ⬇ &nbsp; <b>Download Notebooks:</b> 1) click on the <em>"File"</em> option on the top menu of the notebook and then 2) click on <em>"Download as"</em> and select <em>"Notebook (.ipynb)"</em>.</p>

<p> 📒 &nbsp; For more help, please see the <em>"Appendix – Tips, Help, and Download"</em> Lesson.</p>

</div>

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by AI models can vary with each execution due to their dynamic, probabilistic nature. Don't be surprised if your results differ from those shown in the video.</p>

## Step 1: Import LiveKit Agent Modules and Plugins

In [1]:
import logging

from dotenv import load_dotenv
_ = load_dotenv(override=True)

logger = logging.getLogger("dlai-agent")
logger.setLevel(logging.INFO)

from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, jupyter
from livekit.plugins import (
    openai,
    elevenlabs,
    silero,
)
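
Step 1 relies on `load_dotenv` to pull credentials from a `.env` file, and a missing key usually only surfaces later as an error inside the running job. The following is a minimal sketch of an early check. The variable names are assumptions about what the OpenAI and ElevenLabs plugins read by default (they are not stated in this notebook), and `missing_env_vars` is a hypothetical helper, so adjust the list to match your own `.env`.

```python
import os
from typing import Iterable, List, Mapping

# Assumed variable names; they are not taken from this notebook.
REQUIRED_VARS = ("OPENAI_API_KEY", "ELEVEN_API_KEY")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names that are unset or blank in env (hypothetical helper)."""
    return [name for name in names if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All expected environment variables are set.")
```

Running this right after `load_dotenv` makes a configuration problem visible before the worker starts.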

## Step 2: Define Your Custom Agent

In [2]:
class Assistant(Agent):
    def __init__(self) -> None:
        llm = openai.LLM(model="gpt-4o")
        stt = openai.STT()
        tts = elevenlabs.TTS()
        #tts = elevenlabs.TTS(voice_id="CwhRBWXzGAHq8TQ4Fs17")  # example with defined voice
        silero_vad = silero.VAD.load()

        super().__init__(
            instructions="""
                You are a helpful assistant communicating 
                via voice
            """,
            stt=stt,
            llm=llm,
            tts=tts,
            vad=silero_vad,
        )

## Step 3: Create the Entrypoint

In [3]:
async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession()

    await session.start(
        room=ctx.room,
        agent=Assistant()
    )

## Step 4: Set Up the App to Run
- To speak to the agent, click the microphone icon on the left to unmute it. You can ignore the 'Start Audio' button.
- The agent will try to detect the language you are speaking. To help it, start with a longer phrase such as "hello, how are you today" in the language of your choice.

In [4]:
jupyter.run_app(
    WorkerOptions(entrypoint_fnc=entrypoint), 
    jupyter_url="https://jupyter-api-livekit.vercel.app/api/join-token"
)

2025-05-08 10:43:57,649 - DEBUG asyncio - Using selector: EpollSelector
2025-05-08 10:43:57,652 - INFO livekit.agents - starting worker {"version": "1.0.11", "rtc-version": "1.0.6"}
2025-05-08 10:43:57,655 - INFO livekit.agents - see tracing information at http://localhost:38763/debug
2025-05-08 10:43:57,656 - INFO livekit.agents - initializing job runner {"tid": 17205}
2025-05-08 10:43:57,657 - INFO livekit.agents - job runner initialized {"tid": 17205}
2025-05-08 10:43:57,658 - DEBUG asyncio - Using selector: EpollSelector
2025-05-08 10:43:58,090 - ERROR livekit.agents - unhandled exception while running the job task
Traceback (most recent call last):
  File "/home/linux-pc/anaconda3/envs/ai-voice-agent/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/tmp/ipykernel_16019/1781549358.py", line 8, in entrypoi

## Step 5: Try new voices
Update Step 2 with a voice ID. For example:  
`tts = elevenlabs.TTS(voice_id="CwhRBWXzGAHq8TQ4Fs17")`

In [None]:
# Roger: CwhRBWXzGAHq8TQ4Fs17
# Sarah: EXAVITQu4vr4xnSDxMaL
# Laura: FGY2WhTYpPnrIDTdsKH5
# George: JBFqnCBsd6RMkjVDRZzb
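
To make switching voices less error-prone than pasting raw IDs, the list above can be kept in a small lookup table. This is only a sketch: `VOICE_IDS` and `voice_id_for` are hypothetical names introduced for illustration, and the IDs are copied from the cell above.

```python
# Voice IDs copied from the list in Step 5.
VOICE_IDS = {
    "Roger": "CwhRBWXzGAHq8TQ4Fs17",
    "Sarah": "EXAVITQu4vr4xnSDxMaL",
    "Laura": "FGY2WhTYpPnrIDTdsKH5",
    "George": "JBFqnCBsd6RMkjVDRZzb",
}


def voice_id_for(name: str) -> str:
    """Look up a voice ID by name, ignoring case and surrounding spaces."""
    wanted = name.strip().lower()
    for voice_name, voice_id in VOICE_IDS.items():
        if voice_name.lower() == wanted:
            return voice_id
    raise KeyError(f"Unknown voice {name!r}; choose one of {sorted(VOICE_IDS)}")


print(voice_id_for("george"))
```

In Step 2 the result could then be passed along as `elevenlabs.TTS(voice_id=voice_id_for("George"))`.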

## Experiment with ElevenLabs
For more information about using ElevenLabs in your voice projects, see their [website](https://elevenlabs.io/conversational-ai).



# Creating My Own AI Agent Model as a Proof of Concept


In [None]:
# # How to count the number of tokens

# from transformers import GPT2TokenizerFast

# # Load tokenizer
# tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# # Long input string (provided content)
# text = """
# Mozilla Public License Version 2.0
# ==================================

# 1. Definitions
# --------------

# 1.1. "Contributor"
#     means each individual or legal entity that creates, contributes to
#     the creation of, or owns Covered Software.

# ... (truncated for brevity)
# """

# # Since the full text is too long for this placeholder, simulate with a realistic length (you would replace this with the full actual content)
# # Let's use a simulated representative sample
# sample_text = text * 100  # simulate longer input
# tokens = tokenizer.encode(sample_text)
# len(tokens)


In [None]:
# End-to-End Voice Assistant Stack (Proof-of-Concept Outline)

#     Speech-to-Text: Open-source Whisper STT

#     Local LLM: quantized Llama 3.2 1B parameter model

#     Text-to-Speech: Silero TTS 

#     Voice Activity Detection: Silero VAD

# All components run locally, require no paid keys, and can be orchestrated from the command line or a single Python script. Once you’ve quantized the model, you can launch FastChat.
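
The four-stage stack outlined above can be sketched as a chain of plain callables, one per component. This is an illustrative skeleton only: `VoicePipeline` and its field names are hypothetical, and the stand-in lambdas would be replaced by real Whisper, Llama, and Silero wrappers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # (start_sample, end_sample)


@dataclass
class VoicePipeline:
    detect_speech: Callable[[List[float]], List[Segment]]  # VAD
    transcribe: Callable[[List[float]], str]                # STT
    generate_reply: Callable[[str], str]                    # LLM
    synthesize: Callable[[str], List[float]]                # TTS

    def respond(self, audio: List[float]) -> List[float]:
        """Run one turn: VAD -> STT -> LLM -> TTS. Returns [] on silence."""
        segments = self.detect_speech(audio)
        if not segments:
            return []
        # Keep only the samples that the VAD marked as speech.
        speech = [s for start, end in segments for s in audio[start:end]]
        text = self.transcribe(speech)
        reply = self.generate_reply(text)
        return self.synthesize(reply)


# Stand-in components so the skeleton runs without any models installed.
pipeline = VoicePipeline(
    detect_speech=lambda audio: [(0, len(audio))] if any(audio) else [],
    transcribe=lambda speech: "hello",
    generate_reply=lambda text: f"You said: {text}",
    synthesize=lambda reply: [0.0] * len(reply),
)

print(len(pipeline.respond([0.1, 0.2, 0.3])))
```

Keeping each stage behind a simple callable means any one component can be swapped (for example, a different TTS) without touching the others.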

In [None]:
from livekit import agents
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, jupyter


import logging
import torch
import silero_vad # VAD
# import silero_tts # TTS

from IPython.display import Audio
import librosa


# Load model directly: Whisper STT
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

whisper_processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
whisper_model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-large-v3")

# Load model directly: local LLM
from transformers import AutoTokenizer, AutoModelForCausalLM

# Llama 3.2 1B. Note: this call loads the standard checkpoint; no quantization is applied here.
llm_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
llm_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # Local LLM


In [66]:
# Use the GPU as the default device when available, otherwise fall back to the CPU
if torch.cuda.is_available():
    torch.set_default_device('cuda')
else:
    torch.set_default_device('cpu')
device = torch.get_default_device()
print(device)

cuda:0


In [55]:
# VAD Example

from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

model = load_silero_vad()
wav = read_audio('./speech_orig.wav')  # audio as a 1-D tensor
# Detect speech segments; 'start' and 'end' are sample indices into wav
speech_timestamps = get_speech_timestamps(wav, model)
print(speech_timestamps)
print(f"len(wav): {len(wav)}")

# Reload with librosa for playback in the notebook
audio, sr = librosa.load('./speech_orig.wav')
Audio(audio, rate=sr)

[{'start': 1568, 'end': 41952}, {'start': 44064, 'end': 172800}]
len(wav): 172800
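
The timestamps printed above are sample indices rather than seconds. A small sketch of converting them follows, assuming the audio was loaded at 16 kHz (which I believe is the default for `read_audio`, but verify it against your installed version). `to_seconds` and `speech_ratio` are hypothetical helpers.

```python
SAMPLE_RATE = 16_000  # assumption: sampling rate used when the audio was read

# Values copied from the output above.
speech_timestamps = [{'start': 1568, 'end': 41952}, {'start': 44064, 'end': 172800}]
total_samples = 172800


def to_seconds(timestamps, sample_rate=SAMPLE_RATE):
    """Convert sample-index segments to segments measured in seconds."""
    return [
        {"start": ts["start"] / sample_rate, "end": ts["end"] / sample_rate}
        for ts in timestamps
    ]


def speech_ratio(timestamps, total):
    """Fraction of the clip that was detected as speech."""
    speech_samples = sum(ts["end"] - ts["start"] for ts in timestamps)
    return speech_samples / total


for segment in to_seconds(speech_timestamps):
    print(f"{segment['start']:.3f}s -> {segment['end']:.3f}s")
print(f"speech fraction: {speech_ratio(speech_timestamps, total_samples):.3f}")
```

Under the 16 kHz assumption, the second segment ends at sample 172800, which matches `len(wav)`, so the clip ends while speech is still detected.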


In [64]:
# TTS Example

from silero_tts.silero_tts import SileroTTS

# Get available models
models = SileroTTS.get_available_models()
print("Available models:", models)

# Get available languages
languages = SileroTTS.get_available_languages()
print("Available languages:", languages)

# Get the latest model for a specific language
latest_model = SileroTTS.get_latest_model('en')
print("Latest model for English:", latest_model)

# Get available sample rates for a specific model and language
sample_rates = SileroTTS.get_available_sample_rates_static('en', latest_model)
print("Available sample rates for the latest English model:", sample_rates)

# Initialize the TTS object
tts = SileroTTS(model_id='v3_en', language='en', speaker='en_0', sample_rate=48000, device=device)

# Synthesize speech from text and play it back
text = "Hello world, How are you today?"
output = tts.tts(text, 'output.wav')
audio, rate = librosa.load('./output.wav', sr=48000)
Audio(audio, rate=rate)

# # Synthesize speech from a text file
# # tts.from_file('input.txt', 'output.wav')

# # Get available speakers for the current model
# speakers = tts.get_available_speakers()
# print("Available speakers for the current model:", speakers)

# # Change the language
# tts.change_language('en')
# print("Language changed to:", tts.language)
# print("New model ID:", tts.model_id)
# print("New available speakers:", tts.get_available_speakers())

# # Change the model
# tts.change_model('v3_en')
# print("Model changed to:", tts.model_id)
# print("New available speakers:", tts.get_available_speakers())

# # Change the speaker
# tts.change_speaker('en_0')
# print("Speaker changed to:", tts.speaker)

# # Change the sample rate
# tts.change_sample_rate(24000)
# print("Sample rate changed to:", tts.sample_rate)


Available models: {'ru': ['v4_ru', 'v3_1_ru', 'ru_v3', 'aidar_v2', 'aidar_8khz', 'aidar_16khz', 'baya_v2', 'baya_8khz', 'baya_16khz', 'irina_v2', 'irina_8khz', 'irina_16khz', 'kseniya_v2', 'kseniya_8khz', 'kseniya_16khz', 'natasha_v2', 'natasha_8khz', 'natasha_16khz', 'ruslan_v2', 'ruslan_8khz', 'ruslan_16khz'], 'en': ['v3_en', 'v3_en_indic', 'lj_v2', 'lj_8khz', 'lj_16khz'], 'de': ['v3_de', 'thorsten_v2', 'thorsten_8khz', 'thorsten_16khz'], 'es': ['v3_es', 'tux_v2', 'tux_8khz', 'tux_16khz'], 'fr': ['v3_fr', 'gilles_v2', 'gilles_8khz', 'gilles_16khz'], 'ba': ['aigul_v2'], 'xal': ['v3_xal', 'erdni_v2'], 'tt': ['v3_tt', 'dilyara_v2'], 'uz': ['v4_uz', 'v3_uz', 'dilnavoz_v2'], 'ua': ['v4_ua', 'v3_ua', 'mykyta_v2'], 'indic': ['v4_indic', 'v3_indic'], 'cyrillic': ['v4_cyrillic'], 'multi': ['multi_v2']}
Available languages: ['ru', 'en', 'de', 'es', 'fr', 'ba', 'xal', 'tt', 'uz', 'ua', 'indic', 'cyrillic', 'multi']
Latest model for English: v3_en_indic
Available sample rates for the latest Engl

2025-05-08 20:36:06.074 | SUCCESS  | silero_tts.silero_tts:load_models_config:48 - Models config loaded from: /home/linux-pc/anaconda3/envs/ai-voice-agent/lib/python3.10/site-packages/silero_tts/latest_silero_models.yml
2025-05-08 20:36:06.074 | INFO     | silero_tts.silero_tts:init_model:148 - Initializing model
2025-05-08 20:36:06.074 | INFO     | silero_tts.silero_tts:init_model:187 - Loading model
2025-05-08 20:36:06.304 | INFO     | silero_tts.silero_tts:init_model:192 - Model to device takes 0.23 seconds
2025-05-08 20:36:06.304 | SUCCESS  | silero_tts.silero_tts:init_model:199 - Model is loaded
2025-05-08 20:36:06.384 | INFO     | silero_tts.silero_tts:preprocess_text:

In [None]:
# End-to-End Proof-of-Concept
# wss://demo-4dndsnvv.livekit.cloud
from dotenv import load_dotenv
load_dotenv('.env')

True