<h1 style="text-align: center;text-transform: uppercase;">Conversational Based Agent</h1>

<br>

In this project, you will build an end-to-end voice conversational agent that takes spoken audio as input and synthesizes a spoken response. The chatbot agent runs locally on your computer.

<img style="width:550px; height:300px;" src="assets/intro.png">

This project consists of the following parts:
1. __Speech Recognition:__ <br>In this part, you will create a speech recognizer that converts your voice into text.<br><br>
2. __Chatbot:__ <br>This is the core of your conversational agent. You will build a chatbot that answers your questions. <br><br>
3. __Text to Speech:__ <br>In this part, you will convert the chatbot's text answer into speech. <br><br>
4. __Finalize your Conversational Based Agent:__ <br>In the final step, you will put everything together to create your Conversational Based Agent.

<br>

# 1. Speech Recognition

---

We will use Mozilla's open-source <a href="https://github.com/mozilla/DeepSpeech">DeepSpeech</a>, an implementation of the speech recognition architecture originally proposed by Baidu. It runs speech recognition directly on your computer, with no internet connection or cloud account required.

While DeepSpeech is not the state-of-the-art speech recognizer (newer systems include DeepSpeech2, wav2letter by Facebook, and the RNN Transducer by Google), it is a fast, lightweight implementation that is suitable for real-time transcription with high accuracy. Its code is also well-maintained with new features being added regularly.


In this project, we will not train our own speech recognition model (a fairly challenging project), but will use an open-sourced pre-trained model.
<br>


In [92]:
import deepspeech
import sounddevice as sd
import soundfile as sf
from scipy.io.wavfile import write
from time import sleep
import numpy as np
from tqdm import tqdm
import random
from datetime import datetime
import queue
import pickle
import sys  # used by the audio callbacks to report stream status

First we need to download the DeepSpeech model files that match the version we installed (0.7.4).
Two files are required. Please download them and save them in the `speech_recognizer` folder.

1. https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-models.pbmm
2. https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4/deepspeech-0.7.4-models.scorer
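
If you prefer to fetch the files from Python instead of the browser, the download can be sketched as follows. This is a minimal sketch, not part of the original notebook: the helper names `model_urls` and `download_models` are hypothetical, and it assumes the two release URLs listed above are still reachable. The files are large, so the actual download call is left commented out.

```python
import os
import urllib.request

# Release URL and file names are taken from the two links listed above
BASE_URL = "https://github.com/mozilla/DeepSpeech/releases/download/v0.7.4"
MODEL_FILES = ["deepspeech-0.7.4-models.pbmm", "deepspeech-0.7.4-models.scorer"]


def model_urls(base_url=BASE_URL, files=MODEL_FILES):
    """Return (url, file_name) pairs for the model files (hypothetical helper)."""
    return [(f"{base_url}/{name}", name) for name in files]


def download_models(target_dir="speech_recognizer"):
    """Download each model file into target_dir unless it already exists."""
    os.makedirs(target_dir, exist_ok=True)
    for url, name in model_urls():
        path = os.path.join(target_dir, name)
        if not os.path.exists(path):
            print(f"Downloading {name} ...")
            urllib.request.urlretrieve(url, path)


# download_models()  # uncomment to fetch the files (large downloads)
```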

Once you have these two files, we are ready to perform speech recognition by instantiating the DeepSpeech Model.

In [2]:
ds = deepspeech.Model('speech_recognizer/deepspeech-0.7.4-models.pbmm')
ds.enableExternalScorer('speech_recognizer/deepspeech-0.7.4-models.scorer')
_ = ds.setScorerAlphaBeta(0.75, 1.85)

### 1.1 Speech-recognition on single audio file

In this section, let's set up the basic functionality of running speech recognition on a single audio file. 

1. Record a .wav audio file with a fixed length (say 3 seconds)

2. Perform speech recognition on the recording using the DeepSpeech model

In [51]:
test_file_name = 'audio_files/test_audio.wav'
sample_rate = 16000
seconds = 3

In [299]:
sleep(0.5)
print("Recording...")
audio_array = sd.rec(int(seconds * sample_rate), samplerate = sample_rate, channels = 1)

# Wait until recording is finished
sd.wait() 

# Finished recording print
print("Recording Finished!")

# Save as WAV file 
write(test_file_name, sample_rate, audio_array) 

Recording...
Recording Finished!


The `sd.rec` function returns a NumPy array directly, so we can check its shape.

The number of rows is seconds * sample_rate = 3 * 16000 = 48000, and the number of columns is the number of channels, 1.

In [300]:
audio_array.shape

(48000, 1)

We can check the recording by playing it back from the numpy array

In [302]:
# sd.play only plays the array back; sd.playrec would also start a new recording
sd.play(audio_array, sample_rate)
sd.wait()

Or from the .wav file that it is saved to.

In [303]:
data, fs = sf.read(test_file_name, dtype='float32')
sd.play(data, fs, device=1)  # use the sample rate read from the file
status = sd.wait()

If the playback did not work, choose another output device by checking what is available on your machine.

In [37]:
sd.query_devices()

> 0 Built-in Microphone, Core Audio (2 in, 0 out)
< 1 Built-in Output, Core Audio (0 in, 2 out)
  2 USB PnP Audio Device, Core Audio (0 in, 2 out)
  3 USB PnP Audio Device, Core Audio (1 in, 0 out)
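
Picking an output device can also be done programmatically. The sketch below is an illustration under stated assumptions: `first_output_device` is a hypothetical helper, and `example_devices` is a hand-written list that mirrors the listing above so the code runs without audio hardware. It relies only on the `max_output_channels` key that each device entry carries.

```python
def first_output_device(devices):
    """Return the index of the first device with an output channel, or None."""
    for index, device in enumerate(devices):
        if device["max_output_channels"] > 0:
            return index
    return None


# Hypothetical device list mirroring the listing above
example_devices = [
    {"name": "Built-in Microphone", "max_input_channels": 2, "max_output_channels": 0},
    {"name": "Built-in Output", "max_input_channels": 0, "max_output_channels": 2},
    {"name": "USB PnP Audio Device", "max_input_channels": 0, "max_output_channels": 2},
    {"name": "USB PnP Audio Device", "max_input_channels": 1, "max_output_channels": 0},
]

output_index = first_output_device(example_devices)
print(output_index)
```

On a real machine, the same helper could be given the result of `sd.query_devices()`, and the returned index passed as the `device` argument of `sd.play`.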

sounddevice returns a NumPy array of float32 values (from -1 to 1), but the DeepSpeech recognizer expects 16-bit integers (-32768 to 32767). Let's scale the array and convert it to the correct data type.

In [305]:
audio_array *= 32768
audio_array = audio_array.astype('int16')
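
The conversion above modifies `audio_array` in place and assumes no sample reaches exactly 1.0, which would overflow the int16 range after scaling. A slightly more defensive variant can be sketched as follows; `float_to_int16` is a hypothetical helper name, and the clipping step is an addition that the original cell does not perform.

```python
import numpy as np


def float_to_int16(audio):
    """Scale float audio in [-1, 1] to int16, clipping values that would overflow."""
    scaled = np.asarray(audio, dtype=np.float32) * 32768
    return np.clip(scaled, -32768, 32767).astype(np.int16)


samples = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)
converted = float_to_int16(samples)
print(converted)
```

Because the helper returns a new array, the original float32 recording stays available for playback.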

In [306]:
ds.stt(audio_array[:,0])

'this is a test recording'
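
The cell above transcribes the in-memory array. Transcribing from a saved file instead can be sketched with the standard-library `wave` module, assuming the file holds 16-bit PCM audio. Note that a float32 array saved with `scipy.io.wavfile.write` is stored as 32-bit float, so the recording would need to be saved as int16 first for this to work. `load_wav_int16` is a hypothetical helper, and the demo file is one second of silence written only so the example is self-contained.

```python
import os
import tempfile
import wave

import numpy as np


def load_wav_int16(path):
    """Read a 16-bit PCM WAV file and return (first-channel samples, sample_rate)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
        frames = wav_file.readframes(wav_file.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).reshape(-1, channels)
    return samples[:, 0], sample_rate


# Self-contained check: write one second of silence, then read it back
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav_file:
    wav_file.setnchannels(1)
    wav_file.setsampwidth(2)
    wav_file.setframerate(16000)
    wav_file.writeframes(np.zeros(16000, dtype=np.int16).tobytes())

demo_samples, demo_rate = load_wav_int16(demo_path)
print(demo_samples.shape, demo_rate)
# With the model loaded, the samples could then be passed to ds.stt(demo_samples)
```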

### 1.2 Streaming Speech Recognition in Real-Time

Recording your voice and then running speech recognition on an audio file works fine, but it is not very user-friendly. The interaction is slow and hard to use in a continuous setting.

In this section, let's set up a function that records your voice AND recognizes the text at the same time!

In [307]:
import queue

In [308]:
def callback(indata, frames, time, status):
    """This is called (from a separate thread) for each audio block."""
    if status:
        print(status, file=sys.stderr)
    q.put(indata.copy())
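
The callback above hands each audio block to the main thread through a thread-safe queue. That producer/consumer hand-off can be tried without a microphone. In this sketch a plain thread stands in for the audio thread; `fake_callback`, `producer`, and the `None` sentinel are illustrative assumptions, not part of the notebook.

```python
import queue
import threading

block_queue = queue.Queue()


def fake_callback(block):
    """Stand-in for the audio callback: copy the block and enqueue it."""
    block_queue.put(list(block))


def producer(num_blocks):
    """Simulate the audio thread delivering num_blocks blocks, then a sentinel."""
    for i in range(num_blocks):
        fake_callback([i, i + 1])
    block_queue.put(None)  # sentinel: no more blocks


thread = threading.Thread(target=producer, args=(3,))
thread.start()

received = []
while True:
    block = block_queue.get()  # waits until a block is available
    if block is None:
        break
    received.append(block)
thread.join()
print(received)
```

The consumer loop has the same shape as the `while True: q.get()` loop in the streaming cell, except that the real loop is ended by an interrupt rather than a sentinel.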

In [344]:
q = queue.Queue()
recognizer_stream = ds.createStream()
try:
    # The context manager starts the stream and closes it on exit
    with sd.InputStream(samplerate=sample_rate, device=0, channels=2, callback=callback):
        print('#' * 80)
        print('press Interrupt to stop the recording')
        print('#' * 80)
        print()
        while True:
            audio_chunk = q.get()
            # Convert float32 samples to the int16 format DeepSpeech expects
            audio_chunk = (audio_chunk * 32768).astype('int16')
            recognizer_stream.feedAudioContent(audio_chunk[:, 0])
            text = recognizer_stream.intermediateDecode()
            print(f'\r{text}', end='')
except KeyboardInterrupt:
    pass
finally:
    # Feed any blocks still waiting in the queue before finishing the stream
    audio_chunks = []
    while not q.empty():
        audio_chunks.append(q.get())
    if audio_chunks:
        remaining = np.concatenate(audio_chunks)
        remaining = (remaining * 32768).astype('int16')
        recognizer_stream.feedAudioContent(remaining[:, 0])
    text = recognizer_stream.finishStream()
    print(f'\r{text}')

################################################################################
press Interrupt to stop the recording
################################################################################

the moon is about two hundred and fifty thousand miles from earth on average


In [3]:
def streaming_recognition():
    q = queue.Queue()
    recognizer_stream = ds.createStream()

    def callback(indata, frames, time, status):
        """This is called (from a separate thread) for each audio block."""
        if status:
            print(status, file=sys.stderr)
        q.put(indata.copy())

    try:
        # The context manager starts the stream and closes it on exit
        with sd.InputStream(samplerate=sample_rate, device=0, channels=2, callback=callback):
            while True:
                audio_chunk = q.get()
                # Convert float32 samples to the int16 format DeepSpeech expects
                audio_chunk = (audio_chunk * 32768).astype('int16')
                recognizer_stream.feedAudioContent(audio_chunk[:, 0])
                text = recognizer_stream.intermediateDecode()
                print(f"\r - YOU SAID: {text}", end='')
    except KeyboardInterrupt:
        pass
    finally:
        # Feed any blocks still waiting in the queue before finishing the stream
        audio_chunks = []
        while not q.empty():
            audio_chunks.append(q.get())
        if audio_chunks:
            remaining = np.concatenate(audio_chunks)
            remaining = (remaining * 32768).astype('int16')
            recognizer_stream.feedAudioContent(remaining[:, 0])
        text = recognizer_stream.finishStream()
        print(f"\r - YOU SAID: {text}", end='\r\n')

    return text
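
The `finally` block above empties the queue with an `empty()` check followed by `get()`. An alternative that never blocks, even if another thread takes an item between the check and the read, is to call `get_nowait()` until it raises `queue.Empty`. The helper name `drain_queue` is hypothetical, and this is a sketch of the idea rather than the notebook's own code.

```python
import queue


def drain_queue(q):
    """Remove and return every item currently in the queue without blocking."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


pending = queue.Queue()
for block in ("a", "b", "c"):
    pending.put(block)

leftover = drain_queue(pending)
print(leftover)
```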

In [359]:
streaming_recognition()

 - YOU SAID: this is a test


'this is a test'

#### Congratulations! You are now able to run your own speech-to-text!