# Overview

This basic implementation of verbal backchannels works as follows:
- listen for audio input
- wait for the speaker to pause
- convert the audio chunk to text
- generate a response from the text
- convert the response to audio
- play the audio
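
The procedure above can be sketched as one function per turn. This is a minimal, hardware-free sketch under stated assumptions: `run_turn`, `timed`, and the four stage callables are hypothetical names, the listening step is assumed to have already produced `audio`, and each stage is timed with `time.perf_counter()` so you can see which stage contributes most of the latency.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def timed(timings: Dict[str, float], name: str, stage: Callable[[T], U], arg: T) -> U:
    """Run one stage and record how long it took, in seconds."""
    start = time.perf_counter()
    result = stage(arg)
    timings[name] = time.perf_counter() - start
    return result


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    play: Callable[[bytes], None],
) -> Tuple[str, Dict[str, float]]:
    """One backchannel turn: audio -> text -> response -> audio -> playback."""
    timings: Dict[str, float] = {}
    transcript = timed(timings, "transcribe", transcribe, audio)
    response = timed(timings, "generate", generate, transcript)
    speech = timed(timings, "synthesize", synthesize, response)
    timed(timings, "play", play, speech)
    return response, timings


if __name__ == "__main__":
    # Stub stages stand in for Google speech to text, GPT-3 and pyttsx3.
    response, timings = run_turn(
        b"\x00\x01",
        transcribe=lambda audio: "your backchannels are a bit delayed",
        generate=lambda text: "Mm-hmm, I see.",
        synthesize=lambda text: text.encode("utf-8"),
        play=lambda speech: None,
    )
    print(response)
    for name, seconds in timings.items():
        print("{}: {:.6f} s".format(name, seconds))
```

With pyttsx3, synthesis and playback both happen inside `engine.say` plus `engine.runAndWait`, so in practice those two stages would collapse into one.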

Implementation notes:
- latency between the end of each phrase and the spoken backchannel is very high, which makes it hard to keep a natural conversational pace
- each cell in this notebook is labeled with a description of its function and can be run independently once the imports cell has been run

Experimentation areas:
- speech-to-text systems
  - Google (current)
  - AssemblyAI
  - Whisper (high accuracy, but the larger models can be slow)
- prompts for generating responses
  - current: 'Respond with a verbal backchannel as if you are actively listening to someone say "{}"', where `{}` is replaced with the transcript
- text-to-speech systems
  - pyttsx3 (current)
  - ElevenLabs (very human-like, but not real time)
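
To make these experimentation areas easier to compare, each backend can sit behind a small interface so that swapping Google for Whisper, or pyttsx3 for another engine, touches only one class. This is a sketch under assumptions: `SpeechToText`, `TextToSpeech`, `FixedTranscriber`, `RecordingSpeaker`, and `backchannel` are hypothetical names built on `typing.Protocol`, and the two concrete classes are stand-ins rather than real backends. The prompt is a parameter too, so prompt variants can be swapped the same way.

```python
from typing import Callable, List, Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextToSpeech(Protocol):
    def speak(self, text: str) -> None: ...


PROMPT = 'Respond with a verbal backchannel as if you are actively listening to someone say "{}"'


class FixedTranscriber:
    """Hypothetical stand-in; a real backend might wrap r.recognize_google or Whisper."""

    def __init__(self, text: str) -> None:
        self.text = text

    def transcribe(self, audio: bytes) -> str:
        return self.text


class RecordingSpeaker:
    """Hypothetical stand-in; a real backend might wrap pyttsx3."""

    def __init__(self) -> None:
        self.spoken: List[str] = []

    def speak(self, text: str) -> None:
        self.spoken.append(text)


def backchannel(
    stt: SpeechToText,
    tts: TextToSpeech,
    generate: Callable[[str], str],
    audio: bytes,
    prompt: str = PROMPT,
) -> str:
    """Transcribe, prompt the language model, and speak the reply."""
    transcript = stt.transcribe(audio)
    reply = generate(prompt.format(transcript)).strip()
    tts.speak(reply)
    return reply


speaker = RecordingSpeaker()
reply = backchannel(FixedTranscriber("I just got back from a trip"), speaker, lambda p: " Oh, nice! ", b"")
print(reply)  # Oh, nice!
```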

In [None]:
import speech_recognition as sr
import pyttsx3
import openai
import time
import config  # local config.py that defines api_key

## Speech to Text
First, we use SpeechRecognition to detect spoken phrases and transcribe them.

In [None]:
r = sr.Recognizer()  # recognizer instance
m = sr.Microphone()  # audio input instance
# if this part isn't working, print sr.Microphone.list_microphone_names() and pass the
# index of the microphone you want to use, e.g. sr.Microphone(device_index=1)

# set audio input threshold and calibrate
# (adjust_for_ambient_noise starts from energy_threshold and moves it toward the
# measured ambient noise level, so print r.energy_threshold to see the final value)
r.energy_threshold = 1000
r.dynamic_energy_threshold = False
with m as source:
    r.adjust_for_ambient_noise(source)

# set pause between phrases threshold in seconds, default is 0.8
# r.pause_threshold = 0.8

while True:
    print('listening...')
    # listen() waits for speech to start, records until a pause, then returns the audio chunk
    with m as source:
        audio = r.listen(source)

    print('audio received. transcribing...')
    # use Google speech to text to transcribe the audio
    try:
        print(r.recognize_google(audio))
    except sr.UnknownValueError:
        print('failed to transcribe audio: speech was not understood')
    except sr.RequestError as e:
        print('failed to transcribe audio: request failed ({})'.format(e))
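
`r.listen` decides where a phrase starts and ends from the audio energy, which is what `energy_threshold` and `pause_threshold` control. The sketch below is a simplified illustration of that idea, not SpeechRecognition's actual code (the real recognizer also uses settings such as `phrase_threshold` and `non_speaking_duration`): a phrase starts at the first chunk louder than `energy_threshold` and ends after `pause_threshold` seconds of consecutive quieter chunks. `rms` and `find_phrase` are hypothetical helpers that work on plain lists of numbers.

```python
import math
from typing import List, Optional, Sequence, Tuple


def rms(samples: Sequence[int]) -> float:
    """Root-mean-square amplitude of one chunk of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def find_phrase(
    energies: List[float],
    chunk_seconds: float,
    energy_threshold: float = 1000.0,
    pause_threshold: float = 0.8,
) -> Optional[Tuple[int, int]]:
    """Return (first_chunk, end_chunk) of the first phrase, end exclusive, or None.

    If the audio runs out before a long enough pause, the phrase ends at the last chunk.
    """
    pause_chunks = max(1, round(pause_threshold / chunk_seconds))
    start: Optional[int] = None
    quiet = 0
    for i, energy in enumerate(energies):
        if start is None:
            if energy > energy_threshold:
                start = i  # speech detected: the phrase begins here
            continue
        if energy > energy_threshold:
            quiet = 0  # still speaking, reset the pause counter
        else:
            quiet += 1
            if quiet >= pause_chunks:
                return start, i + 1  # long enough pause: phrase is complete
    if start is None:
        return None
    return start, len(energies)


# 0.1 s chunks: quiet, speech, a short dip, more speech, then a long pause
energies = [10, 10, 2000, 2500, 50, 3000, 40, 30, 20, 10]
print(find_phrase(energies, chunk_seconds=0.1, pause_threshold=0.3))  # (2, 9)
```

The short dip at index 4 does not end the phrase because it is shorter than the pause threshold, which is why lowering `r.pause_threshold` returns audio sooner but risks cutting phrases in two.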

## Generate Backchannel
Then we prompt GPT-3 with the transcribed text to generate a backchannel.

In [None]:
# I laid out the variables this way, but you can just set gpt3_input to the string you want to prompt GPT-3 with.
prompt = 'Respond with a verbal backchannel as if you are actively listening to someone say "{}"'
user_input = 'Your backchannels are a bit delayed'
gpt3_input = prompt.format(user_input)

# Apart from max_tokens, I used the default parameters, but these can be changed.
# More details are available at https://platform.openai.com/docs/api-reference/completions.
# Note: openai.Completion is the legacy interface from openai < 1.0, and OpenAI has since
# retired text-davinci-003 (gpt-3.5-turbo-instruct is a completions-endpoint replacement).
openai.api_key = config.api_key
result = openai.Completion.create(
    model='text-davinci-003',
    prompt=gpt3_input,
    max_tokens=256,
)
print(result)
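
Backchannels are only a few words long, so one latency lever worth trying is a much smaller `max_tokens` than 256. The helpers below are a hedged sketch, not part of the original notebook: `build_request` and `extract_text` are hypothetical names, 16 is an arbitrary example cap, and the response is assumed to have the same `choices[0]['text']` shape that the demo cell reads. Stripping surrounding quotation marks is a guess at tidying replies that echo the quoted prompt.

```python
from typing import Any, Dict

PROMPT = 'Respond with a verbal backchannel as if you are actively listening to someone say "{}"'


def build_request(transcript: str, max_tokens: int = 16) -> Dict[str, Any]:
    """Keyword arguments for a Completion-style request (hypothetical helper)."""
    return {
        "model": "text-davinci-003",
        "prompt": PROMPT.format(transcript),
        # a backchannel is a few words; a small cap bounds how long generation can run
        "max_tokens": max_tokens,
    }


def extract_text(result: Dict[str, Any]) -> str:
    """Return the first completion's text without surrounding whitespace or quotes."""
    choices = result.get("choices") or []
    if not choices:
        return ""
    return choices[0].get("text", "").strip().strip('"').strip()


# A response-shaped dict standing in for a real API result.
fake_result = {"choices": [{"text": '\n\n"Mm-hmm, I see."'}]}
print(build_request("Your backchannels are a bit delayed")["max_tokens"])  # 16
print(extract_text(fake_result))  # Mm-hmm, I see.
```

With the legacy SDK used in this notebook, the request would be sent as `openai.Completion.create(**build_request(transcript))`.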

## Play Text
Finally, we speak the text with pyttsx3 (make sure your volume is turned up).

In [None]:
engine = pyttsx3.init()  # initialize pyttsx3 instance
engine.say('text to say')  # queue text to be spoken
engine.runAndWait()  # block while the queued text is spoken
engine.stop()  # stop speaking and clear anything still queued
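
pyttsx3 also exposes speaking properties through `engine.getProperty` and `engine.setProperty`, including `'rate'` (words per minute) and `'volume'` (0.0 to 1.0). Speaking faster shortens playback, which is part of the end-to-end delay. This is a hedged sketch: `configure_voice` is a hypothetical helper, and `FakeEngine` is a stand-in with arbitrary starting values that records properties so the code runs without an audio device; with the real library you would pass the engine from `pyttsx3.init()` instead.

```python
from typing import Any, Dict, Optional


def configure_voice(engine: Any, rate: Optional[int] = None, volume: Optional[float] = None) -> None:
    """Set speaking rate (words per minute) and volume (clamped to 0.0-1.0)."""
    if rate is not None:
        engine.setProperty("rate", int(rate))
    if volume is not None:
        engine.setProperty("volume", min(1.0, max(0.0, float(volume))))


class FakeEngine:
    """Stand-in for a pyttsx3 engine; the starting values are arbitrary."""

    def __init__(self) -> None:
        self.props: Dict[str, Any] = {"rate": 200, "volume": 1.0}

    def getProperty(self, name: str) -> Any:
        return self.props[name]

    def setProperty(self, name: str, value: Any) -> None:
        self.props[name] = value


engine = FakeEngine()  # with pyttsx3: engine = pyttsx3.init()
configure_voice(engine, rate=engine.getProperty("rate") + 50, volume=0.9)
print(engine.getProperty("rate"), engine.getProperty("volume"))  # 250 0.9
```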

## Demo
Putting everything together, we get this demo.

In [None]:
openai.api_key = config.api_key  # set api key

r = sr.Recognizer()  # recognizer instance
m = sr.Microphone()  # audio input instance
engine = pyttsx3.init()  # audio output instance

# prompt for gpt3 input
prompt = 'Respond with a verbal backchannel as if you are actively listening to someone say "{}"'

# set audio input threshold and calibrate
# (adjust_for_ambient_noise moves energy_threshold from 1000 toward the measured ambient level)
r.energy_threshold = 1000
r.dynamic_energy_threshold = False
with m as source:
    r.adjust_for_ambient_noise(source)

# set pause between phrases threshold, default is 0.8 seconds
# r.pause_threshold = 0.8

while True:
    # listen for audio input
    start = time.time()
    print('listening...')
    with m as source:
        audio = r.listen(source)
    print('audio received after {0:.4f} seconds'.format(time.time() - start))

    # transcribe
    try:
        lap = time.time()
        print('\ntranscribing...')
        transcript = r.recognize_google(audio)
        print(transcript)
        print('transcribed in {0:.4f} seconds'.format(time.time() - lap))
    except sr.UnknownValueError:
        print('failed to transcribe\nplease try again')
        continue
    except sr.RequestError as e:
        print('speech recognition request failed: {}'.format(e))
        continue

    # generate response
    try:
        lap = time.time()
        print('\ngenerating response...')
        gpt3_input = prompt.format(transcript)
        gpt3_output = openai.Completion.create(
            model='text-davinci-003',
            prompt=gpt3_input,
            max_tokens=256,
        )
        print(gpt3_output)
        response = gpt3_output['choices'][0]['text'].strip()
        print('generated in {0:.4f} seconds'.format(time.time() - lap))
    except Exception as e:
        # invalid api keys, network errors and retired models all end up here
        print('failed to generate response: {}\ncheck to make sure your api key is valid'.format(e))
        continue

    # play response; runAndWait() blocks until speech finishes, so time it afterwards
    lap = time.time()
    print('\nplaying response...')
    print(response)
    engine.say(response)
    engine.runAndWait()
    print('played in {0:.4f} seconds'.format(time.time() - lap))
    print('finished in {0:.4f} seconds'.format(time.time() - start))
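
In the demo, the loop cannot listen for the next phrase while it is generating or playing a response. One way to decouple the two, sketched here as an assumption rather than a tested change, is to capture on a background thread and hand phrases to the main thread through a `queue.Queue`. SpeechRecognition's `Recognizer.listen_in_background(source, callback)` provides the background-listening half; in this self-contained sketch a plain thread and the hypothetical `capture`, `respond`, and `speak` callables stand in for it, and all speaking stays on the calling thread so pyttsx3 is only used from one thread.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking the end of capture


def run_pipeline(
    capture: Callable[[], Iterable[str]],
    respond: Callable[[str], str],
    speak: Callable[[str], None],
) -> List[str]:
    """Capture phrases on a background thread; respond and speak on this thread."""
    phrases: "queue.Queue[object]" = queue.Queue()

    def listener() -> None:
        for phrase in capture():  # stands in for repeated r.listen(...) plus transcription
            phrases.put(phrase)
        phrases.put(_DONE)

    threading.Thread(target=listener, daemon=True).start()

    replies: List[str] = []
    while True:
        item = phrases.get()  # blocks until the listener hands over a phrase
        if item is _DONE:
            break
        reply = respond(str(item))
        speak(reply)
        replies.append(reply)
    return replies


spoken: List[str] = []
print(run_pipeline(lambda: ["I moved last week", "it was a lot of work"], lambda text: "Mm-hmm.", spoken.append))
# ['Mm-hmm.', 'Mm-hmm.']
```

This only removes the wait between phrases; it does not shorten transcription or generation for any single phrase. If replies fall behind, dropping older queued phrases would keep each backchannel tied to what was just said.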