# Speakeasy GPT
Speakeasy GPT is a Jupyter notebook that combines several natural language processing tools to provide a low-latency speech interface to ChatGPT and other large language models.

Voice prompts are transcribed by OpenAI's Whisper model, run locally on CPU or GPU. The transcription is sent as a prompt to the OpenAI gpt-3.5-turbo API. The response is synthesized into speech by one of several text-to-speech engines: ElevenLabs' API, Mimic 3, or Coqui TTS.

## Mount Drive

In [None]:
import os
from google.colab import drive
nb_dir = '/content/drive/MyDrive/COLAB/speakeasy-gpt'
drive.mount('/content/drive/',force_remount=True)
if not os.path.exists(nb_dir):
    os.makedirs(nb_dir)
os.chdir(nb_dir)
print('Current path: ' + os.getcwd())

Mounted at /content/drive/
Current path: /content/drive/MyDrive/COLAB/speakeasy-gpt


## Installs

In [None]:
!sudo apt-get install espeak
!pip install -q elevenlabs git+https://github.com/openai/whisper.git openai ffmpeg-python pydub TTS pytictoc
!pip install -q mycroft-mimic3-tts

Reading package lists... Done
Building dependency tree       
Reading state information... Done
espeak is already the newest version (1.48.04+dfsg-8build1).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
  Preparing metadata (setup.py) ... done

## Imports

In [None]:
# Imports
# System
import sys
import ipywidgets as widgets
import warnings
from pytictoc import TicToc
import time

# Audio
from IPython.display import display
import scipy.io.wavfile as wav
from IPython.display import HTML, Audio
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
from scipy.io.wavfile import write as wav_write
import io
import ffmpeg
from IPython.display import Javascript
from google.colab import output
from io import BytesIO

# LLM
import openai

# STT
import whisper

# TTS
from elevenlabs import generate, play
from mimic3_tts import (
    AudioResult,
    Mimic3Settings,
    Mimic3TextToSpeechSystem,
    SSMLSpeaker,
)
from TTS.api import TTS

# Notebook settings
warnings.filterwarnings("ignore", category=UserWarning)
t = TicToc()


## Check CUDA

In [None]:
import torch
torch.cuda.is_available()

True
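
The result of this check can drive device selection instead of being read by eye. A minimal sketch, assuming a hypothetical helper named `pick_device` (not part of the original notebook) that falls back to CPU when PyTorch is missing or CUDA is unavailable:

```python
def pick_device() -> str:
    """Return 'cuda' when a CUDA-capable GPU is usable, otherwise 'cpu'."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string could then be passed on when loading a model, for example `whisper.load_model('base.en', device=device)`.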

## Load Whisper

In [None]:
!wget -O test.wav http://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0061_8k.wav 

whisper_model = whisper.load_model('base.en')
t.tic()
transcription = whisper_model.transcribe("test.wav")['text']
t.toc()
sentences = [sentence + '.' for sentence in transcription.split('.')[:-1]]
for sentence in sentences: print(sentence)

--2023-06-01 04:38:04--  http://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0061_8k.wav
Resolving www.voiptroubleshooter.com (www.voiptroubleshooter.com)... 162.241.218.124
Connecting to www.voiptroubleshooter.com (www.voiptroubleshooter.com)|162.241.218.124|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 857512 (837K) [audio/x-wav]
Saving to: ‘test.wav’


2023-06-01 04:38:05 (6.13 MB/s) - ‘test.wav’ saved [857512/857512]



100%|███████████████████████████████████████| 139M/139M [00:02<00:00, 63.6MiB/s]


Elapsed time is 8.288790 seconds.
 The mute muffled the high tones of the horn.
 The gold ring fits only a pierced ear.
 The old pan was covered with hard fudge.
 Watch the log float in the wide river.
 The node on the stock of wheat grew daily.
 The heap of fallen leaves was set on fire.
 Right fast if you want to finish early.
 His shirt was clean, but one button was gone.
 The barrel of beer was a brew of malt and hops.
 Tin cans are absent from store shelves.
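
Splitting on `'.'` as above drops any text after the last period and ignores question and exclamation marks. A small alternative sketch, assuming a hypothetical helper `split_sentences` built on a regular expression; it is still naive about abbreviations such as "Dr." and is not the method used in the cell above:

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = " The mute muffled the high tones of the horn. The gold ring fits only a pierced ear."
for sentence in split_sentences(sample):
    print(sentence)
```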


## Load Mimic TTS

In [None]:
mimic = Mimic3TextToSpeechSystem(settings=Mimic3Settings(
    voice='en_US/hifi-tts_low',
    speaker='92',
    noise_w=0.2,
    rate=0.7,
    length_scale=1,
    use_deterministic_compute=True
))
_ = mimic.text_to_wav('Initializing...')  # warm-up call so the timed run below excludes model loading

t.tic()
result = mimic.text_to_wav("I'm sorry Dave. I'm afraid I can't do that.")
t.toc()
play(result,notebook=True)

Elapsed time is 0.476969 seconds.
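
A single timed call can be noisy. One way to get a steadier number is to repeat the call and take the median. The sketch below assumes a hypothetical `benchmark` helper and uses a stand-in function in place of a real synthesis call such as `mimic.text_to_wav(...)`:

```python
import time
from statistics import median


def benchmark(fn, *args, repeats: int = 3, **kwargs) -> float:
    """Call fn repeatedly and return the median wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return median(durations)


# Stand-in for a TTS call; it only sleeps briefly and returns bytes.
def fake_tts(text: str) -> bytes:
    time.sleep(0.01)
    return text.encode()


print(f"Median latency: {benchmark(fake_tts, 'hello'):.3f} s")
```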


## Load Coqui TTS

In [None]:
tts = TTS(model_name="tts_models/en/vctk/vits", progress_bar=False, gpu=False)
print('CPU Performance:')
tts.tts_to_file(text="I'm sorry Dave. I'm afraid I can't do that.", speaker='p227', file_path='test.mp3', speed=1)

 > Downloading model to /root/.local/share/tts/tts_models--en--vctk--vits
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
CPU Performance:

'test.mp3'

In [None]:
tts = TTS(model_name="tts_models/en/vctk/vits", progress_bar=False, gpu=True)
print('GPU Performance:')
tts.tts_to_file(text="I'm sorry Dave. I'm afraid I can't do that.", speaker='p227', file_path='test.mp3', speed=1)

 > tts_models/en/vctk/vits is already downloaded.
 > Model's license - apache 2.0
 > Check https://choosealicense.com/licenses/apache-2.0/ for more info.
 > Using model: vits
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > log_func:np.log10
 | > min_level_db:0
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:None
 | > fft_size:1024
 | > power:None
 | > preemphasis:0.0
 | > griffin_lim_iters:None
 | > signal_norm:None
 | > symmetric_norm:None
 | > mel_fmin:0
 | > mel_fmax:None
 | > pitch_fmin:None
 | > pitch_fmax:None
 | > spec_gain:20.0
 | > stft_pad_mode:reflect
 | > max_norm:1.0
 | > clip_norm:True
 | > do_trim_silence:False
 | > trim_db:60
 | > do_sound_norm:False
 | > do_amp_to_db_linear:True
 | > do_amp_to_db_mel:True
 | > do_rms_norm:False
 | > db_level:None
 | > stats_path:None
 | > base:10
 | > hop_length:256
 | > win_length:1024
 > initialization of speaker-embedding layers.
GPU Performance:
 > Text splitted to sentences.

'test.mp3'

## Load ElevenLabs

In [None]:
t.tic()
audio = generate(
  text="This is a test.  One two three. Soundcheck check check.",
  voice="Bella",
  model="eleven_monolingual_v1"
)
t.toc()

play(audio,notebook=True)

Elapsed time is 1.331317 seconds.


## Load ChatGPT

In [None]:
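from getpass import getpass
API_KEY = getpass('OpenAI API key: ')  # prompt at runtime; never hard-code secrets in a notebook
openai.my_api_key = API_KEY  # custom attribute; later cells pass it explicitly to ChatCompletion.create()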

In [None]:
messages = [ {"role": "system", "content": "You are an intelligent assistant.  Answer my questions in less than 20 words.  Keep your responses simple and as short as possible.  Optimize your outputs for speech synthesis.  Instead of using parentheses, use a comma and the appropriate conjunction."} ]
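
Because every turn appends to `messages`, the history sent to the API grows with each exchange. A minimal sketch of one way to bound it, assuming a hypothetical helper `trim_history` that keeps the system prompt plus the most recent messages (the cutoff of 4 is an arbitrary illustration):

```python
def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep system messages plus the most recent `max_turns` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        return system
    return system + rest[-max_turns:]


# Build a toy history: one system prompt and five question/answer pairs.
history = [{"role": "system", "content": "Be brief."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turns=4)
print(len(trimmed))
```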


## Record prompt from microphone

In [None]:
#@title
RECORD = """
const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
const b2text = blob => new Promise(resolve => {
  const reader = new FileReader()
  reader.onloadend = e => resolve(e.target.result)
  reader.readAsDataURL(blob)
})
var record = time => new Promise(async resolve => {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
  const recorder = new MediaRecorder(stream)
  const chunks = []
  recorder.ondataavailable = e => chunks.push(e.data)
  recorder.onstop = async () => {
    const blob = new Blob(chunks)
    const text = await b2text(blob)
    resolve(text)
  }
  recorder.start()
  await sleep(time)
  recorder.stop()
})
"""

def record(sec=5, save_file='output.wav'):
    print('Recording...')
    sys.stdout.flush()
    display(Javascript(RECORD))
    s = output.eval_js('record(%d)' % (sec*1000))
    print('Recording ended!')
    sys.stdout.flush()
    b = b64decode(s.split(',')[1])
    with open(save_file,'wb') as f:
        f.write(b)
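
The `s.split(',')[1]` step works because the browser returns the recording as a base64 data URL, with a header before the comma and the payload after it. A small sketch of that decoding with a basic sanity check, assuming a hypothetical helper `decode_data_url`; the `audio/webm` MIME type in the example is an assumption, since the actual type depends on the browser:

```python
from base64 import b64decode, b64encode


def decode_data_url(data_url: str) -> tuple[str, bytes]:
    """Split a `data:<mime>;base64,<payload>` URL into (header, raw bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64 data URL")
    return header, b64decode(payload)


# Build a fake data URL so the example runs without a microphone.
example = "data:audio/webm;base64," + b64encode(b"fake audio bytes").decode()
header, raw = decode_data_url(example)
print(header, len(raw))
```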

In [None]:
record(5, save_file = 'prompt_audio.wav')

Recording...


<IPython.core.display.Javascript object>

Recording ended!


## Transcribe prompt audio to text prompt

In [None]:
t.tic()
whisper_prompt = whisper_model.transcribe("prompt_audio.wav")['text'].strip()
t.toc()
print(whisper_prompt)

Elapsed time is 0.531142 seconds.
Repeat after me. I'm sorry Dave. I'm afraid I can't do that.


## Prompt ChatGPT

In [None]:
messages.append(
    {"role": "user", "content": whisper_prompt},
)
t.tic()
chat = openai.ChatCompletion.create(
    api_key=openai.my_api_key, model="gpt-3.5-turbo", messages=messages,
)
t.toc()
    
reply = chat.choices[0].message.content
print(f"ChatGPT: {reply}")
messages.append({"role": "assistant", "content": reply})

Elapsed time is 1.510004 seconds.
ChatGPT: I'm sorry Dave. I'm afraid I can't do that.
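
Network calls like this one can fail transiently. A generic sketch of retrying with exponential backoff, assuming a hypothetical `with_retries` wrapper and a stand-in function in place of the real API request; production code would likely catch only the specific error types the client library raises:

```python
import time


def with_retries(fn, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on an exception, wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Stand-in for the API call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_request() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


outcome = with_retries(flaky_request, base_delay=0.01)
print(outcome, calls["count"])
```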


## Generate response audio

In [None]:
audio = generate(
  text=reply,
  voice="Bella",
  model="eleven_monolingual_v1"
)

play(audio,notebook=True)

## Record prompt audio (testing purposes)

In [None]:
record(7, save_file = 'prompt_audio.wav')

Recording...


<IPython.core.display.Javascript object>

Recording ended!


## Full loop -- ElevenLabs TTS

In [None]:
t.tic()

# Transcribe audio
whisper_prompt = whisper_model.transcribe("prompt_audio.wav")['text'].strip()
print(whisper_prompt)

# Append transcription to ChatGPT history
messages.append(
    {"role": "user", "content": whisper_prompt},
)

# Prompt ChatGPT
chat = openai.ChatCompletion.create(
    api_key=openai.my_api_key, model="gpt-3.5-turbo", messages=messages,
)
    
# Receive reply and append to chatGPT history
reply = chat.choices[0].message.content
print(f"ChatGPT: {reply}")
messages.append({"role": "assistant", "content": reply})

# Generate audio for reply
audio = generate(
  text=reply,
  voice="Bella",
  model="eleven_monolingual_v1"
)

t.toc()

# Play reply audio
play(audio,notebook=True)

Why did vans make a Peanuts themed shoe?
ChatGPT: Vans created a Peanuts themed shoe to showcase their creativity and artisanship while paying tribute to the beloved characters of the Peanuts comic strip and offering fans a new way to engage with the brand.
Elapsed time is 7.809531 seconds.


## Full loop -- Coqui TTS

In [None]:
t.tic()

# Transcribe audio
whisper_prompt = whisper_model.transcribe("prompt_audio.wav")['text'].strip()
print(whisper_prompt)

# Append transcription to ChatGPT history
messages.append(
    {"role": "user", "content": whisper_prompt},
)

# Prompt ChatGPT
chat = openai.ChatCompletion.create(
    api_key=openai.my_api_key, model="gpt-3.5-turbo", messages=messages,
)
    
# Receive reply and append to chatGPT history
reply = chat.choices[0].message.content
print(f"ChatGPT: {reply}")
messages.append({"role": "assistant", "content": reply})

# Generate audio for reply
tts.tts_to_file(text=reply, speaker='p227', file_path='test.wav', speed=1)

t.toc()
Audio("test.wav",autoplay=True)

Why did vans make a Peanuts themed shoe?
ChatGPT: Vans created a Peanuts themed shoe to tap into the nostalgia and cultural significance of the Peanuts comic strip, while also providing fans with a fresh and unique style option.
 > Text splitted to sentences.
['Vans created a Peanuts themed shoe to tap into the nostalgia and cultural significance of the Peanuts comic strip, while also providing fans with a fresh and unique style option.']
 > Processing time: 0.34844374656677246
 > Real-time factor: 0.04022778237725838
Elapsed time is 3.925959 seconds.
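
The two Coqui figures in this log can be combined to estimate how long the synthesized clip is. The sketch assumes the real-time factor is defined as processing time divided by audio duration, which is how the values are interpreted here:

```python
processing_time = 0.34844374656677246   # seconds, copied from the log
real_time_factor = 0.04022778237725838  # copied from the log

# Assumption: RTF = processing time / audio duration, so invert it.
audio_duration = processing_time / real_time_factor
print(f"Synthesized audio length: {audio_duration:.2f} s")
```

Under that assumption, synthesis accounts for roughly 0.35 s of the 3.93 s total, so most of the elapsed time is spent on transcription and the chat request.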


## Full loop -- Mimic TTS

In [None]:
t.tic()

# Transcribe audio
whisper_prompt = whisper_model.transcribe("prompt_audio.wav")['text'].strip()
print(whisper_prompt)

# Append transcription to ChatGPT history
messages.append(
    {"role": "user", "content": whisper_prompt},
)

# Prompt ChatGPT
chat = openai.ChatCompletion.create(
    api_key=openai.my_api_key, model="gpt-3.5-turbo", messages=messages,
)
    
# Receive reply and append to chatGPT history
reply = chat.choices[0].message.content
print(f"ChatGPT: {reply}")
messages.append({"role": "assistant", "content": reply})

# Generate audio for reply
result = mimic.text_to_wav(reply)

t.toc()
Audio(result,autoplay=True)

Why did vans make a Peanuts themed shoe?
ChatGPT: Vans made a Peanuts themed shoe as a collaboration with Peanuts Worldwide LLC to create a new and unique product line that would appeal to fans of both brands and offer a fresh and playful addition to the footwear market.
Elapsed time is 7.352780 seconds.
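
The three full loops differ only in the synthesis step. One possible refactor, sketched with a hypothetical `run_turn` function that takes the transcription, chat, and synthesis steps as callables; the lambdas are stubs standing in for Whisper, the chat API, and a TTS engine, not the notebook's actual calls:

```python
from typing import Callable


def run_turn(
    audio_path: str,
    messages: list[dict],
    transcribe: Callable[[str], str],
    chat: Callable[[list[dict]], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one voice turn: speech -> text -> LLM reply -> speech."""
    prompt = transcribe(audio_path).strip()
    messages.append({"role": "user", "content": prompt})
    reply = chat(messages)
    messages.append({"role": "assistant", "content": reply})
    return synthesize(reply)


# Stubs make the example runnable without models or API keys.
history = [{"role": "system", "content": "Be brief."}]
audio = run_turn(
    "prompt_audio.wav",
    history,
    transcribe=lambda path: " Hello there.",
    chat=lambda msgs: f"You said: {msgs[-1]['content']}",
    synthesize=lambda text: text.encode("utf-8"),
)
print(audio)
```

Swapping TTS engines then means passing a different `synthesize` callable, while the transcription and history handling stay in one place.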


In [None]:
# Print prompt file size
print(f"Prompt file size:  {os.path.getsize('prompt_audio.wav') / 1024:.2f} KB")

# Print response file size
print(f"Response file size:  {os.path.getsize('test.wav') / 1024:.2f} KB")

Prompt file size:  42.13 KB
Response file size:  108.57 KB
