# A Voice Assistant for Your Knowledge Base

• Find the link to the “[JarvisBase](https://github.com/peterw/JarvisBase)” GitHub repository for this section at [towardsai.net/book](http://towardsai.net/book).

This tutorial focuses on voice capabilities. In this project, we will create a voice assistant that uses OpenAI’s Whisper to convert voice inputs into text. Once an answer has been generated, it is converted to speech using Eleven Labs, a company known for its high-quality text-to-speech API that captures emotion and tone. Using this API helps the voice assistant communicate with users in a clear and natural voice.

At the heart of this project is a question-answering system backed by the Deep Lake vector database. When a question is asked, the system retrieves relevant documents from this database. These documents and the question are then passed to a large language model (LLM), which uses this information to formulate an appropriate response.
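
As a rough illustration of this flow, the sketch below shows how retrieved passages and a question might be combined into one prompt before it is sent to an LLM. The helper name `build_prompt`, the placeholder passages, and the prompt wording are all hypothetical; LangChain's chains use their own templates.

```python
def build_prompt(question, documents):
    """Combine retrieved passages and the user's question into one prompt.

    Hypothetical helper for illustration only.
    """
    # Number each passage so the model can tell them apart
    context = "\n\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(documents))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder passages standing in for documents returned by the retriever
docs = ["Passage about downloading files.", "Passage about searching the Hub."]
prompt = build_prompt("How do I search for models?", docs)
print(prompt)
```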

The project uses Streamlit to create an interactive user interface (UI) for the assistant. This basic frontend allows users to ask questions by typing or by voice and returns responses in both text and audio formats.

![image](../images/jarvis-base.jpg)

Start by installing the necessary libraries for this project with the pip package manager. The most recent versions of these packages generally work best, but the provided code was written and run with specific versions, which are listed in the requirements file. A link to this file is available at [towardsai.net/book](http://towardsai.net/book).

Set the API keys and tokens. They need to be available as environment variables, as shown below.

In [None]:
import os
from langchain_custom_utils.helper import get_openai_api_key, get_activeloop_api_key, get_eleven_api_key
OPENAI_API_KEY = get_openai_api_key()
ACTIVELOOP_API_KEY = get_activeloop_api_key()
ELEVEN_API_KEY = get_eleven_api_key()
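
The helper module imported above is not shown in this section. If it is unavailable, one possible alternative is to read each key directly from the environment. The sketch below is a minimal version of that idea; the function name `read_key` and the demo value are hypothetical, and the variable names you export are your own choice.

```python
import os

def read_key(name, env=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {"OPENAI_API_KEY": "sk-demo"}
print(read_key("OPENAI_API_KEY", env=demo_env))
```

Failing early with an explicit message makes a missing key easier to diagnose than an authentication error raised later by an API call.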

## Getting Content from Hugging Face Hub

We’ll begin by gathering documents from the Hugging Face Hub. These articles will form the foundation of our voice assistant’s knowledge base. We will use web scraping methods to collect relevant knowledge documents.

Let’s look at and run the script.py file.

The script begins by importing the required modules and setting the path for the Deep Lake dataset. It also creates an instance of `OpenAIEmbeddings`, which will be used later to embed the scraped articles:

In [None]:
import os
import requests
from bs4 import BeautifulSoup
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
import re

# TODO: use your organization id here. (by default, org id is your username)
my_activeloop_org_id = "<YOUR-ACTIVELOOP-ORG-ID>"
my_activeloop_dataset_name = "langchain_course_jarvis_assistant"
# The f-string prefix is required so the placeholders are filled in
dataset_path = f'hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}'

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

Compile a list of relative URLs that lead to knowledge documents from the Hugging Face Hub. To do this, define the function `get_documentation_urls()` and attach these relative URLs to the base URL of the Hugging Face Hub using another function, `construct_full_url()`, effectively establishing full URLs that can be accessed directly.

In [None]:
def get_documentation_urls():
    # Relative URLs of Hugging Face Hub documentation pages. The list is
    # kept short because scraping every page would take too long.
    return [
            '/docs/huggingface_hub/guides/overview',
            '/docs/huggingface_hub/guides/download',
            '/docs/huggingface_hub/guides/upload',
            '/docs/huggingface_hub/guides/hf_file_system',
            '/docs/huggingface_hub/guides/repository',
            '/docs/huggingface_hub/guides/search',
            # You may add additional URLs here or replace all of them
    ]

def construct_full_url(base_url, relative_url):
    # Construct the full URL by appending the relative URL to the base URL
    return base_url + relative_url
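
Plain concatenation works here because the base URL has no trailing slash and every relative URL starts with one. As a hedged alternative, the sketch below compares it with `urllib.parse.urljoin` from the standard library, which is not used in the original script.

```python
from urllib.parse import urljoin

base_url = "https://huggingface.co"
relative_url = "/docs/huggingface_hub/guides/search"

# Plain concatenation, as in construct_full_url()
concatenated = base_url + relative_url

# urljoin gives the same result and also tolerates a trailing slash on the base
joined = urljoin(base_url + "/", relative_url)

print(concatenated)
print(joined)
```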

The script compiles the gathered content from various URLs. This is executed by the `scrape_all_content()` function, which invokes the `scrape_page_content()` function for each URL. Next, the resulting text is stored in a file.

In [None]:
def scrape_page_content(url):
    # Send a GET request to the URL and parse the HTML response using
    # BeautifulSoup
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the desired content from the page (in this case, the body text)
    text=soup.body.text.strip()
    # Remove control characters and characters in the \x7f-\xff range
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Remove extra whitespace and newlines
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def scrape_all_content(base_url, relative_urls, filename):
    # Loop through the list of URLs, scrape content and add it to the
    # content list
    content = []
    for relative_url in relative_urls:
        full_url = construct_full_url(base_url, relative_url)
        scraped_content = scrape_page_content(full_url)
        content.append(scraped_content.rstrip('\n'))

    # Write the scraped content to a file
    with open(filename, 'w', encoding='utf-8') as file:
        for item in content:
            file.write("%s\n" % item)
    
    return content
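
To see what the two regular expressions in `scrape_page_content()` do, the sketch below applies them to a small made-up string. The function name `clean_text` and the sample text are hypothetical. Note that the first pattern only covers characters up to `\xff`, so characters beyond that range, such as an en dash, are kept.

```python
import re

def clean_text(text):
    # Drop control characters and the \x7f-\xff range
    # (tab, newline, and carriage return survive this step)
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Collapse every run of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Made-up sample with a NUL byte, mixed whitespace, \xe9, and an en dash
sample = "  Hub\x00 guides\n\n\tcaf\xe9  overview \u2013 done "
print(clean_text(sample))
```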

## Loading and Splitting Texts

To prepare the gathered text for our vector database, the content is first read from the file using the `load_docs()` function, which separates it into distinct documents. These documents are then divided into smaller segments using the `split_docs()` function.

The command `text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)` initializes a text splitter that segments the text into character-based chunks. It divides the documents into sections of roughly 1,000 characters, with no overlap between consecutive sections.

In [None]:
# Define a function to load documents from a file
def load_docs(root_dir, filename):
    # Create an empty list to hold the documents
    docs = []
    try:
        # Load the file using the TextLoader class and UTF-8 encoding
        loader = TextLoader(os.path.join(
            root_dir, filename), encoding='utf-8')
        # Split the loaded file into separate documents and add them to the list
        # of documents
        docs.extend(loader.load_and_split())
    except Exception as e:
        # If an error occurs during loading, ignore it and return an empty list
        # of documents
        pass
    # Return the list of documents
    return docs
  
def split_docs(docs):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(docs)
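
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sketch. The function `chunk_text` below is hypothetical and cuts at fixed character offsets; it is not LangChain's algorithm, which splits on separators and therefore produces chunks of only roughly the requested size.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Slice text into windows of at most chunk_size characters.

    Simplified illustration: each window starts chunk_size - chunk_overlap
    characters after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "abcdefghij" * 250  # 2,500 characters of dummy text
no_overlap = chunk_text(text, chunk_size=1000, chunk_overlap=0)
with_overlap = chunk_text(text, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in no_overlap])
print([len(c) for c in with_overlap])
```

With an overlap of zero, joining the chunks reproduces the original text exactly. A non-zero overlap repeats the end of each chunk at the start of the next, which can help when a relevant sentence straddles a chunk boundary.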

## Embedding and Storing in Deep Lake

The next phase is embedding the articles and storing them in Deep Lake.

The following code sets up a Deep Lake instance, specifying the dataset path and the **OpenAIEmbeddings** instance to use as the embedding function. **OpenAIEmbeddings** transforms the text segments into embedding vectors, a format compatible with the vector database. With the `.add_documents` method, the texts are processed and stored within the database.

In [None]:
# Define the main function
def main():
    base_url = 'https://huggingface.co'
    # Set the name of the file to which the scraped content will be saved
    filename='content.txt'
    # Set the root directory where the content file will be saved
    root_dir ='./'
    relative_urls = get_documentation_urls()
    # Scrape all the content from the relative URLs and save it to the content
    # file
    content = scrape_all_content(base_url, relative_urls, filename)
    # Load the content from the file
    docs = load_docs(root_dir, filename)
    # Split the content into individual documents
    texts = split_docs(docs)
    # Create a DeepLake database with the given dataset path and embedding
    # function
    db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
    # Add the individual documents to the database
    db.add_documents(texts)
    # Clean up by deleting the content file
    os.remove(filename)

# Call the main function if this script is being run as the main program
main()

These steps are organized within the main function. It establishes the required parameters, activates the outlined functions, and manages the entire procedure, from scraping web content to integrating it into the Deep Lake database. It also removes the content file, ensuring a clean workspace.

## Creating the Voice Assistant

You can find the relevant code in the *chat.py* file within the GitHub repository. To test it out, execute `streamlit run chat.py`.

The libraries imported below are used to build the web application with Streamlit, manage audio input, generate text and audio responses, and access the information stored in Deep Lake:

In [None]:
import os  # needed later to delete the temporary audio file

import openai

import streamlit as st
from audio_recorder_streamlit import audio_recorder
from elevenlabs import generate
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from streamlit_chat import message

# Constants
TEMP_AUDIO_PATH = "temp_audio.wav"
AUDIO_FORMAT = "audio/wav"

# Set the API keys defined earlier
openai.api_key = OPENAI_API_KEY
eleven_api_key = ELEVEN_API_KEY

The `load_embeddings_and_database()` function opens the existing Deep Lake dataset in read-only mode with an OpenAI embedding function:

In [None]:
def load_embeddings_and_database(active_loop_data_set_path):
    embeddings = OpenAIEmbeddings()
    db = DeepLake(
        dataset_path=active_loop_data_set_path,
        read_only=True,
        embedding_function=embeddings
    )
    return db

Next, prepare the code for transcribing audio:

In [None]:
# Transcribe audio using OpenAI Whisper API
def transcribe_audio(audio_file_path, openai_key):
    openai.api_key = openai_key
    try:
        with open(audio_file_path, "rb") as audio_file:
            response = openai.Audio.transcribe("whisper-1", audio_file)
        return response["text"]
    except Exception as e:
        print(f"Error calling Whisper API: {str(e)}")
        return None

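The `transcribe_audio()` function transcribes an audio file into text using the OpenAI Whisper API. It takes the path of the audio file and the OpenAI key as input parameters and returns `None` if the call fails. The next functions record audio in the app, display the transcription, and collect the user's text input: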

In [None]:
# Record audio using audio_recorder and transcribe using transcribe_audio
def record_and_transcribe_audio():
    audio_bytes = audio_recorder()
    transcription = None
    if audio_bytes:
        st.audio(audio_bytes, format=AUDIO_FORMAT)

        with open(TEMP_AUDIO_PATH, "wb") as f:
            f.write(audio_bytes)

        if st.button("Transcribe"):
            transcription = transcribe_audio(TEMP_AUDIO_PATH, openai.api_key)
            os.remove(TEMP_AUDIO_PATH)
            display_transcription(transcription)

    return transcription

# Display the transcription of the audio on the app
def display_transcription(transcription):
    if transcription:
        st.write(f"Transcription: {transcription}")
        with open("audio_transcription.txt", "w+") as f:
            f.write(transcription)
    else:
        st.write("Error transcribing audio.")

# Get user input from Streamlit text input field
def get_user_input(transcription):
    return st.text_input("", value=transcription if transcription else "",
    key="input")

The code above allows users to record audio directly in the app. The recorded audio is transcribed into text using the Whisper API and displayed in the application. The user is notified if an error occurs during the transcription process. The next function searches the knowledge base for an answer:
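
The recording step writes the audio to a fixed path and deletes it only after the "Transcribe" button is clicked. As a possible variation, the sketch below uses a temporary file and removes it even if transcription raises an error. The names `transcribe_bytes` and `fake_transcribe` are hypothetical, and the stand-in transcriber replaces the Whisper call so the example runs offline.

```python
import os
import tempfile

def transcribe_bytes(audio_bytes, transcribe):
    """Write audio to a temporary file, transcribe it, and always clean up.

    `transcribe` is any callable that takes a file path and returns text.
    """
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_bytes)
        path = tmp.name
    try:
        return transcribe(path)
    finally:
        # Runs whether or not transcribe() succeeded
        os.remove(path)

# Stand-in transcriber so the sketch runs without calling the Whisper API
def fake_transcribe(path):
    return f"{os.path.getsize(path)} bytes received"

result = transcribe_bytes(b"\x00\x01\x02\x03", fake_transcribe)
print(result)
```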

In [None]:
# Search the database for a response based on the user's query
def search_db(user_input, db):
    print(user_input)
    retriever = db.as_retriever()
    retriever.search_kwargs['distance_metric'] = 'cos'
    retriever.search_kwargs['fetch_k'] = 100
    retriever.search_kwargs['maximal_marginal_relevance'] = True
    retriever.search_kwargs['k'] = 4
    model = ChatOpenAI(model_name='gpt-3.5-turbo')
    qa = RetrievalQA.from_llm(model, retriever=retriever,
    return_source_documents=True)
    return qa({'query': user_input})

The provided code searches the vector database for responses most relevant to the user’s query. Initially, it transforms the database into a retriever, a mechanism designed to identify the closest embeddings in the vector space. The process involves setting various search parameters, such as the metric for measuring distances within the embedding space, the initial number of documents to retrieve, the decision to employ maximal marginal relevance for balancing the diversity and relevance of outcomes, and the total number of results to be returned. Subsequently, the results are processed through a language model, GPT-3.5 Turbo, in this case, to formulate the most suitable response to the user’s inquiry.
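
To make the roles of `fetch_k`, `k`, and maximal marginal relevance more concrete, here is a simplified pure-Python sketch of the idea. It is not Deep Lake's or LangChain's implementation; the function names, the toy two-dimensional vectors, and the default weighting `lambda_mult=0.5` are assumptions made for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query, candidates, k=4, fetch_k=100, lambda_mult=0.5):
    """Return indices of k candidates chosen by maximal marginal relevance."""
    # Step 1: keep the fetch_k candidates most similar to the query
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine(query, candidates[i]), reverse=True)
    pool = ranked[:fetch_k]
    selected = []
    # Step 2: greedily pick the best trade-off between relevance to the
    # query and similarity to results that were already selected
    while pool and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j])
                              for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

query = [1.0, 0.0]
# Vectors 0 and 1 are near-duplicates; vector 2 is relevant but different
vectors = [[0.9, 0.1], [0.9, 0.12], [0.8, -0.6], [0.0, 1.0]]
print(mmr(query, vectors, k=2, fetch_k=4))
print(mmr(query, vectors, k=2, fetch_k=4, lambda_mult=1.0))
```

With the balanced weighting, the second pick skips the near-duplicate in favor of a less similar but more diverse vector. Setting `lambda_mult=1.0` ignores diversity and reduces the selection to a plain similarity ranking.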

In [None]:
# Display conversation history using Streamlit messages
def display_conversation(history):
    for i in range(len(history["generated"])):
        message(history["past"][i], is_user=True, key=str(i) + "_user")
        message(history["generated"][i], key=str(i))
        #Voice using Eleven API
        voice= "Bella"
        text= history["generated"][i]
        audio = generate(text=text, voice=voice, api_key=eleven_api_key)
        st.audio(audio, format='audio/mp3')

The `display_conversation()` function renders each question and answer as chat messages and uses the Eleven Labs API to generate an audio version of each response.

### User Interaction

The next stage is user interaction. The voice assistant is coded to receive requests through voice recordings or text.

In [None]:
# Main function to run the app
def main():
    # Initialize Streamlit app with a title
    st.write("# JarvisBase 🧙")
   
    # Load embeddings and the DeepLake database
    db = load_embeddings_and_database(dataset_path)

    # Record and transcribe audio
    transcription = record_and_transcribe_audio()

    # Get user input from text input or audio transcription
    user_input = get_user_input(transcription)

    # Initialize session state for generated responses and past messages
    if "generated" not in st.session_state:
        st.session_state["generated"] = ["I am ready to help you"]
    if "past" not in st.session_state:
        st.session_state["past"] = ["Hey there!"]
        
    # Search the database for a response based on user input and update the
    # session state
    if user_input:
        output = search_db(user_input, db)
        print(output['source_documents'])
        st.session_state.past.append(user_input)
        response = str(output["result"])
        st.session_state.generated.append(response)

    #Display conversation history using Streamlit messages
    if st.session_state["generated"]:
        display_conversation(st.session_state)

# Run the main function when the script is executed
main()

The provided code serves as the core functionality of the application. It initializes the Streamlit application and loads the Deep Lake vector database and embeddings. The application offers two modes for user input: textual input or an audio recording, which is transcribed afterward.

The application tracks previous user inputs and responses using a session state to maintain continuity. Upon receiving new input from the user, it searches the database to find the most appropriate response, updating the session state accordingly.
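
The history bookkeeping can be sketched without Streamlit by letting a plain dictionary stand in for `st.session_state`. The helper names `init_history` and `record_turn` and the placeholder answer are hypothetical; the point is that the two lists stay aligned and that a rerun does not reset them.

```python
def init_history(state):
    """Seed the history with a greeting the first time the app runs."""
    state.setdefault("generated", ["I am ready to help you"])
    state.setdefault("past", ["Hey there!"])
    return state

def record_turn(state, user_input, response):
    """Append one question/answer pair so both lists stay aligned."""
    state["past"].append(user_input)
    state["generated"].append(response)

state = init_history({})  # plain dict standing in for st.session_state
init_history(state)       # a rerun must not reset the history
record_turn(state, "How do I search for models?", "<answer from the QA chain>")

for question, answer in zip(state["past"], state["generated"]):
    print(f"user: {question}\nbot: {answer}")
```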

Finally, the application shows the complete conversation history, encompassing user inputs and chatbot responses. Each chatbot response is also presented in audio format, generated with the Eleven Labs API.

To run the whole application, execute the following command in your terminal: `streamlit run chat.py`

When you execute your program with the Streamlit command, it will launch a local web server and provide you with a URL where your application can be opened in a browser.

Your application will run as long as the command in your terminal is active, and it will terminate when you stop the command (`Ctrl+C`) or close the terminal.

### Trying Out the UI

Now, test the Streamlit app!

When you click the microphone icon, your microphone will be active for a few seconds, and you can ask a question. Let’s try “How do I search for models in the Hugging Face Hub?”.

![image](../images/jarvis-flow-1.jpg)

After a few seconds, the app will show an audio player so you can listen to your recorded audio. You may then click on the “Transcribe” button.

This button calls the Whisper API to transcribe your audio. The resulting text is placed in the chat text input:

![image](../images/jarvis-flow-2.jpg)

Here, the Whisper API didn’t transcribe “Hugging Face” correctly and instead wrote “Huggy Face.” But let’s see if our LLM can still understand the query and give an appropriate answer by leveraging the knowledge documents stored in Deep Lake.

After a few more seconds, the chat below will be populated with your audio transcription, along with the chatbot’s textual response and its audio version, generated by calling the Eleven Labs API. As we can see, the LLM understood that “Huggy Face” was a transcription error and was still able to give an appropriate answer.

![image](../images/jarvis-flow-3.jpg)