# 3.3 The transcriptor

The transcriptor is a built-in system that reads an audio stream from the microphone and transcribes speech into text using OpenAI's [Whisper](https://github.com/openai/whisper). It also generates speaker embeddings: high-dimensional vectors that represent a speaker's voice and can be used to differentiate between multiple speakers. The transcriptor used in Nova is a slightly modified version of my [Voice Analysis Toolkit](https://github.com/00Julian00/Voice-Analysis-Toolkit). This notebook shows how to use the transcriptor and how to connect it to the context system.

Run this code so Python can find the scripts. This is not required when importing Nova from outside the root folder.

In [None]:
import sys
from pathlib import Path
module_path = Path().absolute().parent.parent
if str(module_path) not in sys.path:
    sys.path.append(str(module_path))

In [None]:
from nova import *

nova = Nova()

Setting up the transcriptor mirrors how the LLM and TTS systems are set up, except that you only need a conditioning object and no inference engine:

In [None]:
conditioning = TranscriptorConditioning(
    microphone_index=0
)

Make sure you pass the correct microphone index of the device you intend to use. You can find a list of all microphones and their indices by using the "sounddevice" library (which you should already have installed if you installed the requirements):

In [None]:
import sounddevice as sd

sd.query_devices()
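The full device list also contains output-only devices such as speakers. As a minimal sketch, the helper below keeps only devices that can record audio. The name `input_devices` is hypothetical and not part of Nova or sounddevice. It assumes each entry is a dict with the `name` and `max_input_channels` keys that `sd.query_devices()` provides, and that the position in the list equals the device index. The sample data is made up so the cell runs without audio hardware.

```python
from typing import Iterable, List, Mapping, Tuple


def input_devices(devices: Iterable[Mapping]) -> List[Tuple[int, str]]:
    """Return (index, name) for every device that has at least one input channel."""
    return [
        (index, device["name"])
        for index, device in enumerate(devices)
        if device["max_input_channels"] > 0
    ]


# Made-up entries shaped like the dicts returned by sd.query_devices()
sample = [
    {"name": "Built-in Microphone", "max_input_channels": 2, "max_output_channels": 0},
    {"name": "Built-in Speakers", "max_input_channels": 0, "max_output_channels": 2},
    {"name": "USB Headset", "max_input_channels": 1, "max_output_channels": 2},
]

print(input_devices(sample))
```

With real hardware you would call `input_devices(sd.query_devices())` and pass the index of the microphone you want as `microphone_index`.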

As always, configure the transcriptor and apply the configuration:

In [None]:
nova.configure_transcriptor(conditioning=conditioning)

nova.apply_config_transcriptor()

We now want to start the transcriptor. This will give us a "ContextGenerator" object, which is essentially a wrapper for all systems that continuously yield context data.

In [None]:
context_generator = nova.start_transcriptor()

To automatically add the yielded context data to the context, we need to bind the context generator to the context system.

In [None]:
nova.bind_context_source(source=context_generator)
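To illustrate the idea behind binding, here is a rough sketch of how a source that keeps yielding data could be drained into a shared context in the background. This is not Nova's implementation. `fake_transcriptor`, `ContextSketch`, and `bind` are hypothetical names used only for illustration, and the use of a background thread is an assumption.

```python
import threading
from typing import Iterator, List


def fake_transcriptor() -> Iterator[str]:
    """Stand-in for a source that yields context data over time."""
    for sentence in ("Hello Nova.", "What time is it?"):
        yield sentence


class ContextSketch:
    """Toy context that collects everything its bound sources yield."""

    def __init__(self) -> None:
        self.entries: List[str] = []
        self._lock = threading.Lock()

    def bind(self, source: Iterator[str]) -> threading.Thread:
        """Consume the source in a background thread and store each item."""

        def pump() -> None:
            for item in source:
                with self._lock:
                    self.entries.append(item)

        worker = threading.Thread(target=pump, daemon=True)
        worker.start()
        return worker


context = ContextSketch()
worker = context.bind(fake_transcriptor())
worker.join()  # the fake source is finite; a live microphone stream would keep running
print(context.entries)
```

The point of the pattern is that the caller never has to poll the source: once bound, new items show up in the context on their own.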

Now all of the data produced by the transcriptor will automatically be added to the context.  
Additionally, any voices the transcriptor encounters will be stored in a database as a voice embedding. They will initially be represented as "UnknownVoiceX" in the context, but the LLM can rename the voice if it learns your name. Note that this is only possible if default tools are loaded.
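The labeling and renaming described above can be pictured with a small sketch. It is an assumption-heavy illustration, not Nova's actual database code. `VoiceRegistry`, the cosine-similarity matching, the threshold of 0.75, and numbering that starts at 0 are all made up for this example, and the three-dimensional vectors stand in for real embeddings, which have far more dimensions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VoiceRegistry:
    """Toy voice store: matches embeddings and labels new voices (hypothetical)."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold  # assumed value, not taken from Nova
        self.voices: Dict[str, List[float]] = {}
        self._unknown_count = 0

    def identify(self, embedding: Sequence[float]) -> str:
        """Return the best matching name, or register a new unknown voice."""
        best_name, best_score = None, self.threshold
        for name, stored in self.voices.items():
            score = cosine_similarity(embedding, stored)
            if score >= best_score:
                best_name, best_score = name, score
        if best_name is None:
            best_name = f"UnknownVoice{self._unknown_count}"
            self._unknown_count += 1
            self.voices[best_name] = list(embedding)
        return best_name

    def rename(self, old: str, new: str) -> None:
        """Give a stored voice a new name, keeping its embedding."""
        self.voices[new] = self.voices.pop(old)


registry = VoiceRegistry()
first = registry.identify([1.0, 0.0, 0.1])   # new voice
second = registry.identify([0.0, 1.0, 0.0])  # clearly different voice
again = registry.identify([0.9, 0.1, 0.1])   # close to the first voice
registry.rename(first, "Alice")
after_rename = registry.identify([1.0, 0.0, 0.1])
print(first, second, again, after_rename)
```

In Nova the rename step is triggered by the LLM through a tool call; here it is called by hand to keep the sketch self-contained.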

Additional parameters of "TranscriptorConditioning":

- model: Which Whisper model to use. You can find all available models [here](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages).
- device: Choose between "cuda" and "cpu".
- voice_boost: How much the volume of speech is boosted relative to other sounds in the audio data Whisper receives. Can be useful in noisy environments.
- language: Must be a valid language code or "None". Forces Whisper to interpret the speech in the given language, which can improve results if you only speak one language. Set to "None" to let Whisper determine the spoken language automatically.
- vad_threshold: The threshold the voice activity detection (VAD) system must surpass in its evaluation of whether an audio chunk contains speech.
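To get a feel for what "vad_threshold" does, the sketch below applies a threshold to per-chunk speech scores. It assumes the VAD produces one score between 0 and 1 for each audio chunk. Nova's internals may work differently, and `speech_chunks` as well as the scores are invented for illustration.

```python
from typing import List, Sequence


def speech_chunks(scores: Sequence[float], vad_threshold: float) -> List[int]:
    """Return the indices of chunks whose speech score surpasses the threshold."""
    return [index for index, score in enumerate(scores) if score > vad_threshold]


# Invented VAD scores for five consecutive audio chunks
scores = [0.05, 0.40, 0.92, 0.88, 0.10]

print(speech_chunks(scores, vad_threshold=0.5))  # [2, 3]
print(speech_chunks(scores, vad_threshold=0.3))  # [1, 2, 3]
```

A higher threshold rejects more background noise but risks cutting off quiet speech, while a lower one lets more audio through to Whisper.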