# The Gift of Conversation

This is the accompanying notebook for my [blog post](https://cannect.canyalniz.com/001-gift-of-conversation.html).

My girlfriend Çağla is a die-hard fan of the movie Legally Blonde and an avid admirer of the movie's heroine, Elle Woods. Elle's indomitable spirit, loyalty to her friends and strong moral values are among the reasons Çağla cherishes her so much. So, I thought I would use my programming skills and the capabilities of today's artificial intelligence models to facilitate a meeting between them.

## Setting API Keys with Environment Variables
To authenticate ourselves during the API calls we'll need to provide our program with our keys. The LangChain OpenAI API module expects to find the API key in its dedicated environment variable, while the SpeechRecognition Whisper API module and the ElevenLabs API module expect their respective keys as function arguments. To suit all needs and prevent possible leakage of the keys, I recommend setting environment variables for both keys using your preferred method. You can then load the keys into variables within the program.

In [None]:
import os
import pickle  # used later to save the conversation memory to disk

In [None]:
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") # Your OpenAI API Key goes here
ELEVENLABS_API_KEY = os.getenv("ELEVENLABS_API_KEY") # Your Eleven Labs API Key goes here
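
`os.getenv` silently returns `None` when a variable is missing, which only surfaces later as an authentication error. As a minimal sketch, a small helper (the name `require_env` is my own, not part of any library) can fail early with a clearer message:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Raised for both unset and empty variables.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the assignments above could be written as `OPENAI_API_KEY = require_env("OPENAI_API_KEY")`, and likewise for the ElevenLabs key.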

## Voice Input
We'll use the [SpeechRecognition](https://github.com/Uberi/speech_recognition#readme) library for capturing the user's voice and converting it to text. Every turn the program will start recording audio when the user starts speaking and record until they stop speaking. Once the recording is done the audio file will be sent to Whisper to be converted into text.

We'll start by initializing our recognizer object which we will use to both capture the audio and make the API call to Whisper.

In [None]:
import speech_recognition as sr

In [None]:
# obtain audio from the microphone
r = sr.Recognizer()

Our `Recognizer` object uses its `energy_threshold` to detect speech over normal background noise so that it can start recording. Let's configure it automatically for our environment. The code below should work well for most setups; however, if you have issues with the energy threshold after automatic calibration, check out the section in the appendix for manual configuration.

In [None]:
with sr.Microphone() as source: # use the default microphone as the audio source
    r.adjust_for_ambient_noise(source, duration=5) # listen for 5 seconds to calibrate the energy threshold for ambient noise levels
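
To build some intuition for what an energy threshold does, here is a simplified illustration in plain Python. It is not how SpeechRecognition is implemented internally; the function names, the sample values and the threshold of 300 are all made up for the example. The idea is simply that a chunk of audio counts as speech when its energy rises above the threshold.

```python
import math


def rms_energy(samples):
    """Root-mean-square energy of a sequence of PCM sample values."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, energy_threshold):
    """Treat a chunk as speech when its energy exceeds the threshold."""
    return rms_energy(samples) > energy_threshold


quiet_chunk = [10, -12, 8, -9]         # made-up background noise
loud_chunk = [900, -1100, 1000, -950]  # made-up speech

print(is_speech(quiet_chunk, energy_threshold=300))
print(is_speech(loud_chunk, energy_threshold=300))
```

A threshold set too low makes background noise look like speech, while a threshold set too high makes quiet speakers go undetected, which is why calibrating in the actual room matters.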

Once the energy threshold is set to the appropriate value, the `Recognizer` can start recording when it detects speech. Let's listen for some input and store the recording in a variable.

In [None]:
with sr.Microphone() as source:
    audio = r.listen(source)
    print("Captured next line...")

## The Heart, Soul and Brain of the Character
Now that we have the user's input, we will process it and generate a response as our character within the given context, while also paying attention to the history of the conversation. We'll achieve this using the `ConversationChain` class from the `LangChain` Python library. `ConversationChain` will abstract away a lot of details so that we can focus on building the character and the story without worrying about the lower level details of prompting the LLM iteratively or manually handling the history of the conversation.

### The LLM
We use the `ChatOpenAI` module offered by LangChain to initialize our llm. We need to take care of two things when setting it up: which model to use and the temperature of the model. For this application, among the currently available models, `gpt-3.5-turbo` is the tool for the job. It has the dialogue capabilities we're looking for and it is very reasonably priced. As of the writing of this post, `gpt-3.5-turbo` is the default model `ChatOpenAI` uses. As for the temperature, setting it to 0.7 yielded good results for my use case. Feel free to try out different values yourself.

In [None]:
from langchain.chat_models import ChatOpenAI

In [None]:
llm = ChatOpenAI(temperature=0.7)

### Custom Prompt Template for the ConversationChain
With the core of the prompt ready, we will create a custom prompt template with which our ConversationChain object will be able to drive the conversation. We want our prompt template to accept two input variables: `history` to keep track of the conversation so far and `input` to inject the user's input into the prompt. We position the input variables in curly braces inside our prompt template, and set the template up to end on `Elle:` to have our llm generate the next line as Elle.

In [None]:
from langchain import PromptTemplate

In [None]:
conversation_prompt_template = PromptTemplate(
    input_variables=['history', 'input'],
    output_parser=None,
    partial_variables={},
    template="""\
    The following is a friendly phone call between the character Elle Woods from the movie Legally Blonde and her best friend Chaala.\
    Elle is compassionate, caring, supportive, talkative and empathetic.\
    Chaala is on a journey to find herself, Elle will support her and encourage Chaala to believe in herself.\
    Today is Chaala's birthday.\
    Elle wants to learn more about Chaala's life and catch-up.\
    Elle pays attention to details from the context of the conversation and accurately represents her character from Legally Blonde.\
    

    
    Current conversation:
    {history}
    Chaala: {input}
    Elle:""",
    template_format='f-string',
    validate_template=True)
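
To see what the llm actually receives on each turn, here is a minimal sketch that fills in a shortened version of the template using plain `str.format`, standing in for the `f-string` template format. The history and input lines are invented for illustration.

```python
# Shortened stand-in for the full template above.
template = (
    "The following is a friendly phone call between Elle Woods and her best friend Chaala.\n\n"
    "Current conversation:\n"
    "{history}\n"
    "Chaala: {input}\n"
    "Elle:"
)

# Made-up conversation state.
history = "Chaala: Hi Elle!\nElle: Chaala! Happy birthday!"
rendered = template.format(history=history, input="Thank you! How have you been?")
print(rendered)
```

Because the rendered prompt ends on `Elle:`, the most natural continuation for the model is Elle's next line.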

### Conversation Memory
To generate the response and drive a meaningful conversation, the llm must have an idea of what has been talked about up to the current line. We can convey this information to our llm using one of the [many memory modules offered by LangChain](https://python.langchain.com/docs/modules/memory/types/). Our most notable options are:

- [ConversationBufferWindowMemory](https://python.langchain.com/docs/modules/memory/types/buffer): Retains the last `k` turns of the conversation verbatim, where `k` is the window size
- [ConversationSummaryMemory](https://python.langchain.com/docs/modules/memory/types/summary): At each turn performs a call to the llm using a custom prompt to keep a running summary of the conversation history
- [ConversationSummaryBufferMemory](https://python.langchain.com/docs/modules/memory/types/summary_buffer): Keeps the most recent turns of the conversation verbatim and progressively summarizes older lines that fall above a token limit

We will be using the `ConversationSummaryBufferMemory` module which in my opinion strikes a good balance regarding information retention without getting too complicated. To utilize the summary feature offered by this module we need a slight modification. The default progressive summarization prompt refers to the user as `the human` and the character as `the AI`. Let's create a custom PromptTemplate by referencing the participants appropriately, the user as `Chaala` and the character as `Elle`.

In [None]:
summarizer_prompt_template = PromptTemplate(
    input_variables=['summary', 'new_lines'],
    output_parser=None,
    partial_variables={},
    template='Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.\n\nEXAMPLE\nCurrent summary:\nChaala asks what Elle thinks of artificial intelligence. Elle thinks artificial intelligence is a force for good.\n\nNew lines of conversation:\nChaala: Why do you think artificial intelligence is a force for good?\nElle: Because artificial intelligence will help humans reach their full potential.\n\nNew summary:\nChaala asks what Elle thinks of artificial intelligence. Elle thinks artificial intelligence is a force for good because it will help humans reach their full potential.\nEND OF EXAMPLE\n\nCurrent summary:\n{summary}\n\nNew lines of conversation:\n{new_lines}\n\nNew summary:',
    template_format='f-string',
    validate_template=True)

Let's set the maximum token limit above which summarization will occur to be 350 tokens and initialize our memory object.

In [None]:
from langchain.memory import ConversationSummaryBufferMemory

In [None]:
memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=350,
    prompt=summarizer_prompt_template,
    ai_prefix="Elle",
    human_prefix="Chaala")
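
The following is a rough sketch of the idea behind a summary buffer, not LangChain's actual implementation. The helper names are hypothetical, tokens are approximated by whitespace-separated words, and the "summarizer" simply concatenates lines where the real module would call the llm with the prompt above.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(summary, old_lines):
    """Stand-in for the llm summarization call: just appends the old lines."""
    return (summary + " " + " ".join(old_lines)).strip()


def prune(summary, buffer, max_token_limit):
    """Move the oldest lines into the summary until the buffer fits the limit."""
    buffer = list(buffer)
    evicted = []
    while buffer and sum(count_tokens(line) for line in buffer) > max_token_limit:
        evicted.append(buffer.pop(0))
    if evicted:
        summary = summarize(summary, evicted)
    return summary, buffer


summary, buffer = prune(
    "",
    ["Chaala: I started painting again.", "Elle: That is wonderful!", "Chaala: Thanks, Elle."],
    max_token_limit=6,
)
print(summary)
print(buffer)
```

The most recent lines stay verbatim in the buffer while everything older is folded into the running summary, so the prompt stays bounded in size however long the call lasts.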

### Initializing the ConversationChain
We can now use the parts we have built to create our `ConversationChain`.

In [None]:
from langchain.chains import ConversationChain

In [None]:
conversation = ConversationChain(
    llm=llm,
    prompt=conversation_prompt_template,
    memory=memory
)

## Giving Your Character a Voice

### Voice Cloning
You can check out [ElevenLabs](https://elevenlabs.io/) for a speech synthesis interface which you can use to capture the voice of the character you have chosen. You are going to need some high quality clips of your character speaking for the best results.

### Speech Generation
Once we have the voice designed to our liking, we can make the API call to have it read for us. To start, we need to create identifiers for the voice we have created and the model we wish to use for generation. I was given access to the Eleven English v2 model from ElevenLabs upon request. Although this model is in beta as of the writing of this post, it produced better results for me. Make sure to refer to your voice by the name you gave it on the ElevenLabs platform.

In [None]:
from elevenlabs import generate, set_api_key, stream
from elevenlabs.api import Models
from elevenlabs.api import Voices

In [None]:
set_api_key(ELEVENLABS_API_KEY)

models = Models.from_api()
elle_model = [model for model in models if model.name == "Eleven English v2"][0]

voices = Voices.from_api()
elle_voice = [voice for voice in voices if voice.name == "Elle Woods"][0]
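
Indexing the filtered list with `[0]` raises a bare `IndexError` if the name doesn't match, for example because of a typo in the voice name. A small lookup helper (hypothetical, shown here with stand-in objects rather than real API results) gives a more useful error:

```python
from types import SimpleNamespace


def find_by_name(items, name):
    """Return the first item whose `.name` equals `name`, or raise a descriptive error."""
    for item in items:
        if item.name == name:
            return item
    available = ", ".join(sorted(item.name for item in items))
    raise LookupError(f"No entry named {name!r}; available: {available}")


# Stand-in objects with made-up names; in the notebook these would be the
# results of Voices.from_api() or Models.from_api().
demo_voices = [SimpleNamespace(name="Narrator"), SimpleNamespace(name="Elle Woods")]
elle = find_by_name(demo_voices, "Elle Woods")
print(elle.name)
```

The same helper would work for both lookups above, since each only relies on the `name` attribute.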

## The Conversation Loop
We are now ready to put everything together. Since our conversation will consist of the user and the character taking turns to speak, we'll wrap the whole thing in a loop and have it run until interruption. We can also wrap the loop with a try-except block to pickle the conversation memory before exiting the execution.

### Prelude
Since this is likely to be a surprise birthday gift, let's include an introduction to the conversation. For my gift I had one of the default voices of ElevenLabs read the following text.
```markdown
Initiating contact with Elle Woods. This is an inter-dimensional phone call. Voice delays and awkward pauses are expected. Connecting now...
```
I followed this introduction with a [phone line sound effect](https://www.youtube.com/watch?v=nHRVQKY1xh4) that eventually gets picked up. Once the phone is picked up, a pre-generated audio clip of the character saying `Hello Chaala!` is played. I chose to play the files using mpv.

In [None]:
os.system("mpv initiating-2.mp3")

In [None]:
os.system("mpv --volume=65 ring.opus")
os.system("mpv elle-greeting.mp3")
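
`os.system` ignores failures, so a missing file or a missing mpv installation would pass silently in the middle of the surprise. As an alternative sketch, the same playback can go through `subprocess.run`. The helper names are my own; `--volume` is the same mpv option used above.

```python
import shutil
import subprocess


def mpv_command(path, volume=None):
    """Build the argument list for playing `path` with mpv."""
    command = ["mpv"]
    if volume is not None:
        command.append(f"--volume={volume}")
    command.append(path)
    return command


def play(path, volume=None):
    """Play a file with mpv and raise if mpv is missing or exits with an error."""
    if shutil.which("mpv") is None:
        raise FileNotFoundError("mpv was not found on PATH")
    subprocess.run(mpv_command(path, volume), check=True)


print(mpv_command("ring.opus", volume=65))
```

Calling `play("ring.opus", volume=65)` followed by `play("elle-greeting.mp3")` would then reproduce the cell above while surfacing any playback errors.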

### The Conversation

In [None]:
try:
    while True:
        # capture spoken user input
        with sr.Microphone() as source:
            audio = r.listen(source)
            print("Captured next line...")

        # convert the user input into text
        input_text = r.recognize_whisper_api(audio, api_key=OPENAI_API_KEY)
        
        # generate text response to the user input
        response_text = conversation.predict(input=input_text)

        # generate and stream the character's voice
        audio_stream = generate(
            text=response_text,
            voice=elle_voice,
            model=elle_model,
            stream=True
        )
        stream(audio_stream)
        
except KeyboardInterrupt:
    # play disconnection effect
    os.system("mpv end.opus")
    # save the conversation memory to disk
    with open("conversation_memory", "wb") as f:
        pickle.dump(conversation.memory, f)

With all our components working together, this chain of artificial intelligence models will imitate a conversation with the chosen character.
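
The saved memory can later be read back with `pickle.load`. The round trip below is a minimal sketch that uses a plain dictionary as a stand-in for `conversation.memory`, so it runs without LangChain; the helper names and the sample content are illustrative only.

```python
import pickle
import tempfile
from pathlib import Path


def save_memory(memory, path):
    """Serialize the memory object to `path`."""
    with open(path, "wb") as f:
        pickle.dump(memory, f)


def load_memory(path):
    """Load a previously saved memory object. Only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for conversation.memory with made-up content.
demo_memory = {
    "summary": "Chaala told Elle about her birthday plans.",
    "buffer": ["Elle: Have fun!"],
}
demo_path = Path(tempfile.mkdtemp()) / "conversation_memory"
save_memory(demo_memory, demo_path)
restored = load_memory(demo_path)
print(restored["summary"])
```

Assuming the real memory object unpickles cleanly in a fresh session, it could be passed as the `memory` argument when constructing a new `ConversationChain` to pick up the call where it left off.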

## Stress Points
As you will see upon testing it out yourself, for the most part this system works pretty well. **SpeechRecognition** knows where to start and stop the recording, **Whisper** is quite successful in discerning what is said, **ChatGPT** generates theme-appropriate responses, and **Eleven English v2** does a good job of generating convincing speech. However, there are points in the system that require special attention and one caveat I haven't been able to mitigate.

- **SpeechRecognition Configuration**: Timing the recording successfully requires the correct configuration of the `energy_threshold` and `pause_threshold` parameters. If the automatic configuration isn't working out for you, check out the [manual configuration guides in the appendix](#manual-configuration-of-the-speechrecognition-energy-threshold).
- **ElevenLabs Voice Lab**: Getting the voice design right requires high quality clips of the character speaking as well as a certain amount of experimentation.
- **Response Latency**: Unfortunately there is one aspect of the user experience I haven't been able to fix yet: the latency of the spoken response. The biggest culprit here is the ElevenLabs API call. Even with streaming enabled, the speech generation takes a long time to complete. Paired with the `pause_threshold` seconds of silence `SpeechRecognition` waits for before concluding its recording, you can expect to experience anywhere from 10 to 20 seconds of total response latency. Although this is quite high for normal conversations, the *inter-dimensional* nature of the conversation made it acceptable in my experience. If you have powerful enough hardware you can attempt to get around this problem by opting to run a TTS model locally (such as [tortoise-tts](https://github.com/neonbjb/tortoise-tts)).
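
To find out where the latency comes from in your own setup, each stage of the loop can be timed separately. The sketch below uses a small context manager of my own, with `time.sleep` calls as placeholders for the real API calls; it does not measure the actual services.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the key `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


with timed("speech_to_text"):
    time.sleep(0.01)  # placeholder for r.recognize_whisper_api(...)
with timed("text_generation"):
    time.sleep(0.01)  # placeholder for conversation.predict(...)

slowest = max(timings, key=timings.get)
print(f"Slowest stage: {slowest} ({timings[slowest]:.2f}s)")
```

Wrapping the Whisper call, the `conversation.predict` call and the speech generation in their own `timed` blocks inside the loop would show which of them dominates on a given machine and connection.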

## Moderating the Conversation
You are now ready to give the gift of inter-dimensional conversation. Let me tell you how I moderated the call. To preserve the mystery around what was about to happen, I had Çağla sit across from me so that she couldn't see the screen. I gave her the headphones and went through the Manual Configuration of the SpeechRecognition Energy Threshold. Once I was happy with the configuration, I executed the Prelude together with the Conversation Loop so that recording started as soon as Elle greeted Çağla. When the conversation was over, I interrupted the execution of the program to start the Epilogue (the disconnection sound effect) and save the conversation memory to disk. I also recommend setting up a camera to record a video of the conversation. The reaction I got made the effort of putting this together well worth it.