# Real-time speech recognition
### Or: Why Voice Assistants don't work properly...

##### By: Aaron Alef (Email: aaron.alef@code.berlin - Slack: [@aaron](https://slack.com/app_redirect?team=T54B2S3T9&channel=U82F166U9))

----
**It can be hard to listen for a long time and it's easy to miss details... So why not make use of our omnipresent smartphones to support us?  
We want to find out where the limits of our microphones are when it comes to the core of our project - audio transcription!  
Using machine learning, specifically neural networks, we search for words in audio streams.  
Result: Real-time transcription is *hard!***

----

## Introduction

According to [this Medium article](https://medium.com/descript/which-automatic-transcription-service-is-the-most-accurate-2018-2e859b23ed19), Google currently achieves the highest transcription accuracy, with a word error rate (WER) of 16%: low enough to make sense of the transcript, yet high enough to annoy potential customers. Google in particular is focussing on the recognition of speech commands, that is, on maximising the potential of its Google Voice Assistant by making it accurate and fast, as mentioned [here](https://www.wired.com/story/google-made-truly-usable-voice-assistant/). Other companies such as Amazon and Microsoft are doing the same.  
However, none of them really focus on actual *speech* recognition.
On the `transcription-gcloud` branch you can try out Google's accuracy yourself: it works, but isn't really satisfying to use.  
So, we asked ourselves, what makes it so hard?  
And how far can we get with our own algorithm, adapted for use on speeches?  
To answer these questions we started working on our own speech-to-text converter, using the open-source [Common Voice](https://voice.mozilla.org/en) data set crowdsourced by Mozilla.  
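
To make the WER figure above more concrete: it is commonly defined as the word-level edit distance (substitutions, deletions and insertions) between the reference sentence and the transcript, divided by the number of reference words. The sketch below is a minimal illustration of that definition. The function name `word_error_rate` and the lowercase/whitespace normalisation are assumptions of mine, not necessarily how the cited article measured it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalisation: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

For example, `word_error_rate("it was only when i got this close", "it was only then i got close")` gives 0.25: one substitution and one deletion across eight reference words.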



##### Now, our goal is to provide a tool that

1. Takes a continuous stream of audio data of unknown length
2. Converts this speech into meaningful sentences
3. Gives these sentences back as a sequential stream of single words


A server would handle the connection to the mobile phone using gRPC, an RPC framework that supports streaming.
But as this is about the language processing part, I will leave out unimportant details about where the audio comes from and focus on its processing only.

## Getting started

#### Technical setup

To reproduce everything, please make sure that all the modules in `requirements.txt` are properly installed.  
Ideally, running the cell below installs them; if it fails, please run the given command manually:

In [5]:
COMMAND = "pip install --user -r ../../requirements.txt"

from os import system
system(COMMAND)

0

#### Process

Because the audio stream is continuous, our tool receives the data in small chunks: an array of a few hundred samples at a time.
A first dive into the matter resulted in the following plan:

1. Extract features from one chunk of data, called a window.
2. Feed these features into a neural network trained on the aforementioned data set; the result being single phonemes.
3. Put these phonemes back together into meaningful words.
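
The first step of this plan, turning incoming chunks into windows, can be sketched as follows. This is only an illustration under my own assumptions: the name `windows_from_chunks`, the fixed `window_size` and the overlap controlled by `hop` are hypothetical and not taken from the project code, which may simply treat each chunk as one window.

```python
from typing import Iterable, Iterator, List


def windows_from_chunks(
    chunks: Iterable[List[float]], window_size: int, hop: int
) -> Iterator[List[float]]:
    """Buffer incoming chunks and yield windows of `window_size` samples.

    Consecutive windows start `hop` samples apart, so they overlap
    whenever `hop` is smaller than `window_size`.
    """
    if not 0 < hop <= window_size:
        raise ValueError("hop must be between 1 and window_size")

    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit every complete window the buffer currently holds.
        while len(buffer) >= window_size:
            yield buffer[:window_size]
            del buffer[:hop]
    # Samples left over at the end of the stream are dropped in this sketch.
```

Because it is a generator, it works on a stream of unknown length: windows are produced as soon as enough samples have arrived, regardless of how the samples were split into chunks.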

The feature extraction part is easy.
Using the Python library `librosa`, we calculated the MFCCs, that is, the mel-frequency cepstral coefficients, of each of these windows. These coefficients then serve as our features.  
Internally, this performs a Fourier transform (splitting the sound into its frequencies), maps the result onto the mel scale, and applies some further processing.
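
To give a feeling for the first two of these steps, here is a deliberately naive, standard-library-only sketch. It is not what `librosa` does internally: the discrete Fourier transform below is the slow textbook version, the mel conversion uses one common formula (the HTK-style one, which as far as I know is not `librosa`'s default), and the later MFCC steps (mel filter bank, logarithm, discrete cosine transform) are left out. The names `dft_magnitudes` and `hz_to_mel` as defined here are my own.

```python
import cmath
import math


def dft_magnitudes(window):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]


def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mel using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


# A 64-sample window containing a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
window = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(64)]

magnitudes = dft_magnitudes(window)
# The strongest frequency bin tells us which frequency dominates the window.
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * sample_rate / len(window)
peak_mel = hz_to_mel(peak_hz)
```

Here the strongest bin corresponds to the 1000 Hz tone, which this mel formula maps to roughly 1000 mel. Above that point the mel scale grows more slowly than the frequency in Hz, mirroring how human hearing resolves high frequencies less finely.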

I will visualise this process on two separate audio files found in this folder, randomly taken out of the data set:  
* **a.mp3**
  * Original name:
    *  77c44851bac797d08b5724fcd0412c0a073f7888adc34a5ab588669e08319729647c4bc28e10a4b0609a7cdd15016f91e5a945314fab8392e4225df53744e51f.mp3
  * Sentence:
    * "One of the first problems you'll run into is recognition errors, particularly with any command that allows raw dictation."
* **b.mp3**
  * Original name:
    * 897cde9ed3b2abb15d79a11aca4965eae80b2b42604797e9bb477ebfd93ca29a5a1fddea18f315746aea497c06a3f3fe5daf3f100ba1b35dc22b0800d3e93ee2.mp3
  * Sentence:
    * "It was only when I got this close to it that the strangeness of it was at all evident to me."


In [6]:
from transcription import visualise

In [8]:
# Create a new visualiser which loads the given file:
a = visualise.Visualise("a.mp3")
b = visualise.Visualise("b.mp3")