# Speech to Text conversion using different APIs

### Introduction and Background

*Speech recognition* primarily involves the recognition and **transcription** of spoken language into text. It can be done in real time, as in Google voice keyboards, automatic transcribers, etc.

In this notebook we will *transcribe* an audio recording to text using Python.

Note: Transcription should not be confused with *translation*, which means representing the meaning of a source-language text in a target language. Nor should it be confused with *transliteration*, which means representing the spelling of a text from one script in another.

Regarding the relevance of speech-to-text methods to DC, what business situations or use cases might be out there?

The example audio we'll use in our demonstration today is a small part of Donald Trump's speech during his presidential campaign. We'll use the 'SpeechRecognition' package for Python, which uses different APIs depending on the function invoked.

Speech recognition engine/API support (in the SpeechRecognition package):

<ul>
    <li>CMU Sphinx (works offline)</li>
    <li>Google Speech Recognition</li>
    <li>Google Cloud Speech API</li>
    <li>Wit.ai</li>
    <li>Microsoft Bing Voice Recognition</li>
    <li>Houndify API</li>
    <li>IBM Speech to Text</li>
    <li>Snowboy Hotword Detection (works offline)</li>
</ul>

Based on different situations, we may choose an offline engine or an online API. Each has its pros and cons. Any guesses as to what these might be? Which one would you prefer a priori?

### Installation

I'm using the *!pip install moduleName* method inside the notebook: pip checks whether the package is already present and, if not, downloads and installs it.

See below.

In [1]:
import sys

!pip install SpeechRecognition

Collecting SpeechRecognition
  Downloading https://files.pythonhosted.org/packages/26/e1/7f5678cd94ec1234269d23756dbdaa4c8cfaed973412f88ae8adf7893a50/SpeechRecognition-3.8.1-py2.py3-none-any.whl (32.8MB)
    100% |████████████████████████████████| 32.8MB 993kB/s
Installing collected packages: SpeechRecognition
Successfully installed SpeechRecognition-3.8.1


### Import the required libraries

In [2]:
import time  # for timing funcs
import speech_recognition as sr
sr.__version__  # which version is installed?

'3.8.1'

In [3]:
# build the path to "trump_speech.wav" in the data/ subfolder of path1 (adjust path1 for your machine)
from os import path
path1 = 'D:/audio py files/'
AUDIO_FILE = path1 + "data/trump_speech.wav"

# use the audio file as the audio source
t1 = time.time()

r = sr.Recognizer()  # creating a Recognizer instance

with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

time.time() - t1

FileNotFoundError: [Errno 2] No such file or directory: 'D:/audio py files/data/trump_speech.wav'

In [6]:
type(audio)

speech_recognition.AudioData
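
Since the run times reported below are judged against the length of the clip, it can help to check the clip's duration first. Here is a minimal sketch using only the standard-library `wave` module. The helper name `wav_duration_seconds` is my own (not part of SpeechRecognition), and it assumes an uncompressed PCM .wav file.

```python
import wave


def wav_duration_seconds(wav_path):
    """Return the length of a PCM .wav file in seconds (hypothetical helper)."""
    with wave.open(wav_path, "rb") as wf:
        frames = wf.getnframes()   # total number of audio frames
        rate = wf.getframerate()   # frames per second (the sample rate)
    return frames / float(rate)


# Usage (assumes AUDIO_FILE from the cell above points to an existing file):
# print(wav_duration_seconds(AUDIO_FILE))
```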

### Sphinx

In [7]:
# check for installation
import sys
!pip install pocketsphinx



The Recognizer class is central to SpeechRecognition. Below, look for the *r.recognize_sphinx()* method.

Each *recognize_xyz()* method will throw a *speech_recognition.RequestError* exception if the API is unreachable, and a *speech_recognition.UnknownValueError* if the speech is unintelligible.

For recognize_sphinx(), this could happen as the result of a missing, corrupt or incompatible Sphinx installation. 

In [8]:
# recognize speech using Sphinx
t1 = time.time()

try:
    print("The text from audio is:\n" + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand the audio")
except sr.RequestError as e:
    print("Sphinx error: {0}".format(e))
    
time.time() - t1    

The text from audio is:
my administration has accomplished more than almost any administration in the history of our country martyrs sojourn to type their reaction of that so united states will not tell you how to live for work or worship we only ask that you are our sovereignty in return i would like to thank chairman kim for his carriage and four of the steps he is taking so much more remains to be that our shared goals must be the de escalation of military conflict along with a political solution that honors the will of the syrian people the united states will respond if chemical weapons are employed by the us side


20.768925428390503

The above took about 20 seconds for a clip of roughly 1 minute, which is a significant amount of time.

So, for large audio files, one option may be to break the audio into pieces and parallelize the transcription. This is easy with offline engines.
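
The splitting idea can be sketched as follows. This is only an illustration under stated assumptions: the helper names (`chunk_offsets`, `sphinx_chunk`, `transcribe_in_chunks`) are mine, the clip length is passed in by the caller, and the cuts fall on fixed time boundaries, so a word that straddles a boundary may be garbled. The `offset` and `duration` arguments of `r.record()` select the slice to read.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_offsets(total_seconds, chunk_seconds):
    """Return (offset, duration) pairs that cover total_seconds in order."""
    chunks = []
    offset = 0
    while offset < total_seconds:
        duration = min(chunk_seconds, total_seconds - offset)
        chunks.append((offset, duration))
        offset += chunk_seconds
    return chunks


def sphinx_chunk(audio_file, offset, duration):
    """Transcribe one slice of the file offline with Sphinx."""
    import speech_recognition as sr  # imported here so chunk_offsets works without it
    r = sr.Recognizer()
    with sr.AudioFile(audio_file) as source:
        audio = r.record(source, offset=offset, duration=duration)
    try:
        return r.recognize_sphinx(audio)
    except sr.UnknownValueError:
        return ""  # nothing intelligible in this slice


def transcribe_in_chunks(audio_file, total_seconds, chunk_seconds=15, workers=4,
                         transcribe=sphinx_chunk):
    """Transcribe slices concurrently and join the pieces in their original order."""
    chunks = chunk_offsets(total_seconds, chunk_seconds)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transcribe, audio_file, off, dur) for off, dur in chunks]
        texts = [f.result() for f in futures]
    return " ".join(t for t in texts if t)


# Usage (assumes AUDIO_FILE from earlier and a clip of about 60 seconds):
# print(transcribe_in_chunks(AUDIO_FILE, total_seconds=60))
```

Threads are used here for simplicity. Offline decoding is CPU-bound, so a `ProcessPoolExecutor` may give a better speed-up; that would be worth measuring rather than assuming.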

#### Assessing Transcription quality

You've heard the clip. How clear was it? How hard or easy was it to follow? 

You've read the transcription above. Compare the two - speech and text. Arrive at some estimate of 'transcription quality'. 
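
One common way to put a number on 'transcription quality' is the word error rate (WER): the word-level edit distance between a trusted reference transcript and the engine's output, divided by the number of reference words. The sketch below is a plain-Python version. The function name is my own, and the reference wording is an assumption (it matches what Houndify returns later in this notebook), not a verified transcript.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r_word in enumerate(ref, start=1):
        curr = [i]
        for j, h_word in enumerate(hyp, start=1):
            cost = 0 if r_word == h_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / float(len(ref))


reference = "we only ask that you honor our sovereignty in return"  # assumed correct wording
hypothesis = "we only ask that you are our sovereignty in return"   # from the Sphinx output above
print(word_error_rate(reference, hypothesis))  # 1 substitution over 10 words -> 0.1
```

A lower WER is better; 0.0 means the hypothesis matches the reference word for word.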

### Google speech recognition

Caution: The default API key for Google speech services provided by SpeechRecognition is for *testing* purposes only, and Google may revoke it at any time.

It is **not** a good idea to use the Google Web Speech API in production. 

Even with a valid API key, you’ll be limited to only 50 requests per day, and apparently there's no way to raise this quota. 

In [11]:
t1 = time.time()
# recognize speech using Google Speech Recognition (default without credentials)   
try:
    print("The text from audio is:\n" + r.recognize_google(audio))
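except sr.UnknownValueError:
    print("Google Speech Recognition could not understand the audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition; {0}".format(e))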
    
time.time() - t1

The text from audio is:
my administration is more than most any administration in the history of America


13.358235597610474

The above output was all I could get. It is insufficient to judge transcription quality.

All said and done, between Sphinx and Google, I'd go with the former, for now.

Next, let's look at another service on offer from __[SoundHound](https://www.houndify.com/)__.

P.S. You might want to check out the website for the cool AI-ish promises being made these days...

### Houndify 

In [12]:
# recognize speech using Houndify (use your own client ID and key here)
HOUNDIFY_CLIENT_ID = "K-PDcIt1-UIYGmwLbe19mg=="  
HOUNDIFY_CLIENT_KEY = "glsGO73hcZPBgdQ8toWXkGSsqS-lqcQyPTyxd71CsF3byCrG-0uRhx4D5cY4ulTOfSCVXGMSyVvordcxiBLFJA=="

t1 = time.time()

try:
    print("The text from audio is:\n" + r.recognize_houndify(audio, client_id=HOUNDIFY_CLIENT_ID, client_key=HOUNDIFY_CLIENT_KEY))
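except sr.UnknownValueError:
    print("Houndify could not understand the audio")
except sr.RequestError as e:
    print("Could not request results from Houndify; {0}".format(e))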
    
time.time() - t1

The text from audio is:
my administration has accomplished more than almost any administration in the history of our country america's center didn't expect that reaction but that's ok united states will not tell you how to live for work or worship we only ask that you honor our sovereignty in return i would like to thank chairman kim for his courage and for the steps he has taken though much work remains to be done our shared goals must be the de escalation of military conflict along with a political solution that honors the will of the syrian people the united states will respond if chemical weapons are deployed by the a-side


28.29644799232483

It took longer (about 28 seconds), but a quick assessment shows the quality is far better. It's API-based rather than offline. Tradeoffs galore.
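
The Houndify cell above keeps the client ID and key inline in the notebook. For anything you plan to share, one alternative is to read them from environment variables. A minimal sketch, assuming you have exported the two variables yourself; the variable names and the helper `load_houndify_credentials` are hypothetical, not something Houndify or SpeechRecognition defines.

```python
import os


def load_houndify_credentials(env=None):
    """Read Houndify credentials from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    client_id = env.get("HOUNDIFY_CLIENT_ID")
    client_key = env.get("HOUNDIFY_CLIENT_KEY")
    if not client_id or not client_key:
        raise RuntimeError("Set HOUNDIFY_CLIENT_ID and HOUNDIFY_CLIENT_KEY first")
    return client_id, client_key


# Usage with the recognizer and audio from the cells above:
# client_id, client_key = load_houndify_credentials()
# text = r.recognize_houndify(audio, client_id=client_id, client_key=client_key)
```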

The other functions which are a part of this package are:

<ul>
    <li>recognize_bing(): Microsoft Bing Voice Recognition</li>
    <li>recognize_google_cloud(): Google Cloud Speech API</li>
    <li>recognize_ibm(): IBM Speech to Text</li>
    <li>recognize_wit(): Wit.ai</li>
</ul>

A noteworthy mention (outside of SpeechRecognition) but within the Python ecosystem is __[APIAI](https://pypi.org/project/apiai/)__.

Among the above, wit and apiai offer built-in features, such as natural language processing for identifying a speaker's intent, which go beyond basic speech recognition.

Others, like sphinx and google-cloud-speech, focus solely on speech-to-text conversion.
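
Because every engine is exposed as a `recognize_*()` method on the same Recognizer object, the timing comparisons done by hand in the cells above can be put in one loop. This is a rough sketch: the helper `compare_engines` is my own, and each engine's keyword arguments (keys, credentials) are left to the caller.

```python
import time


def compare_engines(recognizer, audio, engines):
    """Run each named recognize_* method and collect (text_or_error, seconds)."""
    results = {}
    for name, kwargs in engines.items():
        method = getattr(recognizer, "recognize_" + name)
        start = time.time()
        try:
            outcome = method(audio, **kwargs)
        except Exception as err:  # keep going so one failing engine does not stop the rest
            outcome = "ERROR: {0}".format(err)
        results[name] = (outcome, time.time() - start)
    return results


# Usage with the objects defined earlier in this notebook:
# engines = {"sphinx": {}, "google": {}}
# for name, (text, secs) in compare_engines(r, audio, engines).items():
#     print(name, round(secs, 1), text)
```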


### Audio data collection

So far I've assumed you have ready-made .wav (or other format) audio files available for analysis. Otherwise, one would have to collect speech data.

This however will require a few dependencies. Notably, the **PyAudio** package is needed for capturing microphone input.

Nowadays recording conversations is easy with a mobile phone, so I'm not stressing this aspect much. Desktops with a microphone attached can also take voice recordings directly and save them as .wav files.
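
For completeness, here is a minimal sketch of capturing one utterance from the default microphone with SpeechRecognition and saving it as a .wav file. It assumes PyAudio is installed and a microphone is available; the function name `record_to_wav` and the output file name are my own.

```python
def record_to_wav(out_path, phrase_time_limit=10):
    """Capture one utterance from the default microphone and save it as a .wav file.

    Requires SpeechRecognition and PyAudio, plus a working microphone.
    """
    import speech_recognition as sr  # imported lazily: only needed when recording

    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)  # calibrate for background noise
        audio = r.listen(source, phrase_time_limit=phrase_time_limit)
    with open(out_path, "wb") as f:
        f.write(audio.get_wav_data())       # AudioData -> WAV-formatted bytes
    return out_path


# Usage (hypothetical file name):
# record_to_wav("my_recording.wav")
```

The saved file can then be loaded with `sr.AudioFile` exactly as in the transcription cells earlier.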

# Text to Speech conversion using different APIs

Text to speech does the opposite of speech to text.

Practical applications include audiobooks, voice guides at tourist attractions, etc. We will use the 'gTTS' package for text-to-speech conversion.

gTTS stands for 'Google Text-to-Speech' and uses Google Translate's text-to-speech API for conversion.

### Installation

In [13]:
import sys

!pip install gTTS

Collecting gTTS
  Downloading https://files.pythonhosted.org/packages/e6/37/f55346a736278f0eb0ae9f7edee1a61028735ef0010db68a2e6fcd0ece56/gTTS-2.0.3.tar.gz
Collecting bs4 (from gTTS)
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting gtts_token (from gTTS)
  Downloading https://files.pythonhosted.org/packages/e7/25/ca6e9cd3275bfc3097fe6b06cc31db6d3dfaf32e032e0f73fead9c9a03ce/gTTS-token-1.1.3.tar.gz
Building wheels for collected packages: gTTS, bs4, gtts-token
  Building wheel for gTTS (setup.py): started
  Building wheel for gTTS (setup.py): finished with status 'done'
  Stored in directory: C:\Users\20052\AppData\Local\pip\Cache\wheels\ac\d3\52\db6c154b20dfaab7e0b514eb5eef92cecd057e40e16fdda58b
  Building wheel for bs4 (setup.py): started
  Building wheel for bs4 (setup.py): finished with status 'done'
  Stored in directory: C:\Users\20052\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b

### Google Text to Speech

In [14]:
from gtts import gTTS 
import os 
  
mytext = 'Welcome to Indian School of Business!'

t1 = time.time()
# Language in which you want to convert 
language = 'en'
myobj = gTTS(text=mytext, lang=language, slow=False)  

time.time() - t1   # runtime in seconds

0.7452943325042725

The next cell saves the audio under the filename below in the *path1* folder; play that file to hear the result.

In [15]:
myobj.save(path1 + "welcome.mp3") 

In [16]:
# get the list of supported languages
import gtts.lang
gtts.lang.tts_langs()

{'af': 'Afrikaans',
 'sq': 'Albanian',
 'ar': 'Arabic',
 'hy': 'Armenian',
 'bn': 'Bengali',
 'bs': 'Bosnian',
 'ca': 'Catalan',
 'hr': 'Croatian',
 'cs': 'Czech',
 'da': 'Danish',
 'nl': 'Dutch',
 'en': 'English',
 'eo': 'Esperanto',
 'et': 'Estonian',
 'tl': 'Filipino',
 'fi': 'Finnish',
 'fr': 'French',
 'de': 'German',
 'el': 'Greek',
 'hi': 'Hindi',
 'hu': 'Hungarian',
 'is': 'Icelandic',
 'id': 'Indonesian',
 'it': 'Italian',
 'ja': 'Japanese',
 'jw': 'Javanese',
 'km': 'Khmer',
 'ko': 'Korean',
 'la': 'Latin',
 'lv': 'Latvian',
 'mk': 'Macedonian',
 'ml': 'Malayalam',
 'mr': 'Marathi',
 'my': 'Myanmar (Burmese)',
 'ne': 'Nepali',
 'no': 'Norwegian',
 'pl': 'Polish',
 'pt': 'Portuguese',
 'ro': 'Romanian',
 'ru': 'Russian',
 'sr': 'Serbian',
 'si': 'Sinhala',
 'sk': 'Slovak',
 'es': 'Spanish',
 'su': 'Sundanese',
 'sw': 'Swahili',
 'sv': 'Swedish',
 'ta': 'Tamil',
 'te': 'Telugu',
 'th': 'Thai',
 'tr': 'Turkish',
 'uk': 'Ukrainian',
 'vi': 'Vietnamese',
 'cy': 'Welsh',
 'zh-cn': 

In [17]:
# trying something in Hindi
mytext = 'hum terey bin ab reh nahi sakteyy, terey binaa kyaa wajood meraa'

t1 = time.time()
# Language in which you want to convert 
language = 'hi'
myobj = gTTS(text=mytext, lang=language, slow=False)  

# save as mp3 file
myobj.save(path1 + "bollywood.mp3")

time.time() - t1   # runtime in seconds

1.6135022640228271

Play the above file bollywood.mp3 and see for yourself. 

Now, expecting the machine to 'sing' is an unreasonable standard, but did it at least speak the Hindi prose properly?

Could we improve the pronunciation by changing the input text, among other things?
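
One experiment along these lines is to synthesize the same line written in different ways and compare the results by ear. The sketch below saves one mp3 per variant. The helper `save_variants` is my own, and the Devanagari text is my own rendering of the romanized line (an assumption, not from the original notebook). Whether it actually sounds better is for you to judge.

```python
import os


def save_variants(variants, lang, out_dir, make_tts=None):
    """Synthesize each text variant to its own mp3 so they can be compared by ear."""
    if make_tts is None:
        from gtts import gTTS  # imported lazily; needs the gTTS package and internet access
        make_tts = gTTS
    paths = []
    for label, text in variants.items():
        out_path = os.path.join(out_dir, "{0}.mp3".format(label))
        make_tts(text=text, lang=lang, slow=False).save(out_path)
        paths.append(out_path)
    return paths


variants = {
    "romanized": "hum terey bin ab reh nahi sakteyy, terey binaa kyaa wajood meraa",
    # my own Devanagari rendering of the same line (an assumption)
    "devanagari": "हम तेरे बिन अब रह नहीं सकते, तेरे बिना क्या वजूद मेरा",
}

# Usage (path1 as defined earlier in this notebook):
# save_variants(variants, lang="hi", out_dir=path1)
```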

I've now half a mind to try some of my old favorite *slow* English classics (think Bob Dylan's "Blowin' in the Wind" or Led Zeppelin's "Stairway to Heaven") to test TTS quality...

But I shall log off here for now on this notebook.

Ciao
Sudhir