# Speech Recognition

The first component of speech recognition is, of course, speech. Speech must be converted from physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter. Once digitized, several models can be used to transcribe the audio to text.

Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM). This approach works on the assumption that a speech signal, when viewed on a short enough timescale (say, ten milliseconds), can be reasonably approximated as a stationary process—that is, a process in which statistical properties do not change over time.

In a typical HMM-based system, the speech signal is divided into 10-millisecond fragments. The power spectrum of each fragment, which is essentially a plot of the signal’s power as a function of frequency, is mapped to a vector of real numbers known as cepstral coefficients. The dimension of this vector is usually small, sometimes as low as 10, although more accurate systems may have dimension 32 or more. The output of this feature-extraction step, which is what the HMM receives as input, is a sequence of these vectors.
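
The framing and feature-extraction step can be sketched in plain Python. This is only an illustration under simplifying assumptions: it uses a naive DFT and a plain real cepstrum (the inverse transform of the log power spectrum), whereas real systems typically add windowing and perceptual filtering. The function names, the 8 kHz sample rate, and the synthetic 440 Hz tone are all invented for the example.

```python
import cmath
import math

def split_into_frames(samples, sample_rate, frame_ms=10):
    """Split a list of samples into non-overlapping frames of frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def power_spectrum(frame):
    """Power at each DFT frequency bin (naive O(n^2) DFT, fine for short frames)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
        for k in range(n)
    ]

def cepstral_coefficients(frame, n_coeffs=10):
    """First n_coeffs real-cepstrum values: inverse DFT of the log power spectrum."""
    # The small epsilon avoids log(0) for empty frequency bins.
    log_power = [math.log(p + 1e-12) for p in power_spectrum(frame)]
    n = len(log_power)
    return [
        (sum(log_power[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)) / n).real
        for q in range(n_coeffs)
    ]

# Hypothetical input: 50 ms of a 440 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
signal = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(400)]

frames = split_into_frames(signal, SAMPLE_RATE)
features = [cepstral_coefficients(frame) for frame in frames]
print(len(frames), "frames,", len(features[0]), "coefficients per frame")
```

Each 10-millisecond frame becomes one 10-dimensional vector, so the 50 ms signal is reduced to a sequence of five vectors.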

To decode the speech into text, groups of vectors are matched to one or more phonemes—a fundamental unit of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker. A special algorithm is then applied to determine the most likely word (or words) that produce the given sequence of phonemes.
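
The text does not name the decoding algorithm. For HMMs, one common choice is the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observations. The sketch below is a toy version: the three phoneme states (for the word "cat"), the quantized acoustic symbols "x", "y", "z", and every probability are made up for illustration, and a real recognizer would work with continuous feature vectors and log-probabilities.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the given observations."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)

    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

# Hypothetical left-to-right model for the phonemes of "cat".
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {
    "k": {"x": 0.8, "y": 0.1, "z": 0.1},
    "ae": {"x": 0.1, "y": 0.8, "z": 0.1},
    "t": {"x": 0.1, "y": 0.1, "z": 0.8},
}

decoded = viterbi(["x", "x", "y", "z"], states, start_p, trans_p, emit_p)
print(decoded)  # ['k', 'k', 'ae', 't']
```

The repeated "k" shows how one phoneme can span several consecutive frames.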

One can imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before HMM recognition. Voice activity detectors (VADs) are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal.
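
To make the idea of a voice activity detector concrete, here is a minimal energy-based sketch. It is an assumption-laden simplification: it flags a frame as speech whenever its mean squared amplitude exceeds a fixed threshold, whereas practical VADs use more robust features and adaptive thresholds. The function names, the threshold value, and the synthetic signal are hypothetical.

```python
import math

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def speech_flags(samples, frame_len, threshold):
    """Return one boolean per full frame: True where the energy exceeds threshold."""
    return [
        frame_energy(samples[start:start + frame_len]) > threshold
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def keep_speech(samples, frame_len, threshold):
    """Drop the frames flagged as silence and concatenate the rest."""
    kept = []
    for i, is_speech in enumerate(speech_flags(samples, frame_len, threshold)):
        if is_speech:
            kept.extend(samples[i * frame_len:(i + 1) * frame_len])
    return kept

# Hypothetical signal: silence, a 200 Hz tone, then silence again (8 kHz sampling).
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) for t in range(160)]
samples = silence + tone + silence

flags = speech_flags(samples, frame_len=80, threshold=0.01)
trimmed = keep_speech(samples, frame_len=80, threshold=0.01)
print(flags)
print(len(samples), "->", len(trimmed), "samples")
```

Only the two frames containing the tone survive, so the recognizer would receive a third of the original signal.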

Fortunately, as a Python programmer, you don’t have to worry about any of this. A number of speech recognition services are available for use online through an API, and many of these services offer Python SDKs.

A handful of packages for speech recognition exist on PyPI. A few of them include:

- apiai
- assemblyai
- google-cloud-speech
- pocketsphinx
- SpeechRecognition
- watson-developer-cloud
- wit

To install the SpeechRecognition library, run "!pip install SpeechRecognition" in a notebook cell.

### Import the library and let's get started  

In [2]:
import speech_recognition as sr

### Convert audio file to text 

In [3]:
r = sr.Recognizer()
    
audio_file = sr.AudioFile('male.wav')    
with audio_file as source:
    audio1 = r.record(source)

google = r.recognize_google(audio1)        # Google Web Speech API

print("The file said:", '\n', google)

# To transcribe only the first part of the file, pass a duration (in seconds) to record()
with audio_file as source:
    audio1 = r.record(source, duration=10)

google = r.recognize_google(audio1)        

print("This is said in first 10 seconds:", '\n', google)

# To read from file in segments  
with audio_file as source:
    audio1 = r.record(source, duration=10)
    audio2 = r.record(source, duration=10)
google = r.recognize_google(audio1)
google2 = r.recognize_google(audio2)

print("This is said in first 10 seconds:", '\n', google)
print("This is said in next 10 seconds:", '\n', google2)

# To skip a part of the audio file we use offset
with audio_file as source:
    audio1 = r.record(source, offset=10, duration=10)

google = r.recognize_google(audio1)        

print("This is said after skipping first 10 seconds:", '\n', google)


The file said: 
 summary the sites to break a teacher for the you keep error code coverage work for places to save money baby is taking longer to getting squared away then the bank is expected in the life event company in AVN had Tak Sahi retirement income the top news of the saving ragnarok latest update you naked Bond what a discussion can insert when the title of this type of song is in question or waxing or gasing needed I provide 90% discount by workplace leather lace work on a flat surface and smooth out this time system uses its angles of intent Unity asset store holds a good mechanical isliye bad boss 12 figures with Gauhar in lete Samay beautiful chairs cabinets test for houses
This is said in first 10 seconds: 
 summary the sites to break a teacher for the you keep adequate coverage work for places to save money baby is taking longer to getting squared away
This is said in first 10 seconds: 
 summary the sites to break a teacher for the you keep adequate coverage work for pla
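
The two-segment pattern above can be generalized to a loop that keeps reading fixed-length chunks until the file is exhausted. The helper below is a sketch, not part of the library: `transcribe_in_chunks` is a hypothetical name, and it relies on the assumption that `record()` returns an audio object with empty `frame_data` once the end of the file has been reached. It takes the recognizer as a parameter, so it works with the `r` object created earlier.

```python
def transcribe_in_chunks(recognizer, source, chunk_seconds=10):
    """Transcribe an already-open audio source in consecutive chunks.

    `recognizer` is expected to behave like sr.Recognizer and `source`
    like an opened sr.AudioFile (both come from the cells above).
    """
    transcripts = []
    while True:
        chunk = recognizer.record(source, duration=chunk_seconds)
        # Assumption: an empty chunk means the end of the file was reached.
        if not chunk.frame_data:
            break
        transcripts.append(recognizer.recognize_google(chunk))
    return transcripts

# Possible usage with the objects defined earlier in this notebook:
#
# with audio_file as source:
#     for i, text in enumerate(transcribe_in_chunks(r, source)):
#         print("Chunk", i, ":", text)
```

Because the chunk boundaries fall at fixed times, a word that straddles a boundary may be cut off or transcribed incorrectly, as the truncated output above suggests.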

### Dealing with noisy data 

In [4]:
# Calibrate the recognizer's energy threshold to the ambient noise level
# (by default this reads the first second of the source)
with audio_file as source:
    r.adjust_for_ambient_noise(source)
    audio1 = r.record(source, offset=10, duration=10)

google = r.recognize_google(audio1)        

print("This is said after noise reduction:", '\n', google)

# Shorten the calibration period so that less of the audio is consumed before recording
with audio_file as source:
    r.adjust_for_ambient_noise(source, duration=0.5) # calibrate on the first 0.5 seconds only
    audio1 = r.record(source, offset=10, duration=10)

google = r.recognize_google(audio1)        

print("This is said after noise reduction:", '\n', google)


This is said after noise reduction: 
 is expected in the wife Reliance company Mai when hurt accident vitamin encounter 200 top news of saving drugs
This is said after noise reduction: 
 bankers expected during the wife event company in AVN had accepted vitamin encounter 200 top news of saving drugs
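
The two outputs differ because adjust_for_ambient_noise() reads from the same stream that record() reads afterwards, so the calibration period is consumed before the offset is counted. The small helper below (a hypothetical name, not a library function) works out which part of the file ends up being transcribed. The values are approximate, since the library reads the stream in fixed-size chunks.

```python
def transcribed_window(calibration_s, offset_s, duration_s):
    """Approximate (start, end) in seconds of the audio that record() captures
    after a calibration call on the same open source."""
    start = calibration_s + offset_s
    return start, start + duration_s

print(transcribed_window(1, 10, 10))    # default 1-second calibration -> (11, 21)
print(transcribed_window(0.5, 10, 10))  # 0.5-second calibration -> (10.5, 20.5)
```

With the shorter calibration the recording starts about half a second earlier, which is consistent with the second transcript picking up a word that the first one missed.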


### Possible outputs

In [5]:
with audio_file as source:
    audio1 = r.record(source, offset=5, duration=5)

google = r.recognize_google(audio1, show_all=True)

print("These are the possible outputs", '\n', google)


These are the possible outputs 
 {'alternative': [{'transcript': 'places to save money is taking longer to getting squared', 'confidence': 0.70617819}, {'transcript': 'places to sleep my baby is taking longer to getting squared'}, {'transcript': 'places to save money is taking longer to getting scared'}, {'transcript': 'places to sleep my baby is taking longer to getting scared'}, {'transcript': 'places to save money is taking longer to getting square'}], 'final': True}
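
The dictionary returned with show_all=True can be unpacked with ordinary Python. The sketch below assumes the response has the shape shown in the output above, with a list under 'alternative' in which only some entries carry a 'confidence'. The helper names are hypothetical, and treating the first entry as the preferred one is an assumption based on that output.

```python
def list_alternatives(response):
    """Return (transcript, confidence) pairs; confidence is None when it is absent."""
    if not isinstance(response, dict):
        return []  # assumption: a non-dict response means nothing was recognized
    return [
        (alt["transcript"], alt.get("confidence"))
        for alt in response.get("alternative", [])
    ]

def best_transcript(response):
    """Return the first listed transcript, or None if there are no alternatives."""
    alternatives = list_alternatives(response)
    return alternatives[0][0] if alternatives else None

# Abbreviated copy of the response printed above.
response = {
    "alternative": [
        {"transcript": "places to save money is taking longer to getting squared",
         "confidence": 0.70617819},
        {"transcript": "places to sleep my baby is taking longer to getting squared"},
    ],
    "final": True,
}

for transcript, confidence in list_alternatives(response):
    print(confidence, "|", transcript)
print("Best:", best_transcript(response))
```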


### Working with audio inputs from microphone

In [9]:
# The SpeechRecognition library uses PyAudio to capture input from the microphone
# !pip install pipwin
# !pipwin install pyaudio

In [7]:
sr.Microphone.list_microphone_names()

['Microsoft Sound Mapper - Input',
 'Microphone Array (Realtek(R) Au',
 'Microsoft Sound Mapper - Output',
 'Speaker/Headphone (Realtek(R) A',
 'Primary Sound Capture Driver',
 'Microphone Array (Realtek(R) Audio)',
 'Primary Sound Driver',
 'Speaker/Headphone (Realtek(R) Audio)',
 'Speaker/Headphone (Realtek(R) Audio)',
 'Microphone Array (Realtek(R) Audio)',
 'Speakers (Realtek HD Audio output)',
 'Microphone Array 1 (Realtek HD Audio Mic input with SST)',
 'Microphone Array 2 (Realtek HD Audio Mic input with SST)',
 'Microphone Array 3 (Realtek HD Audio Mic input with SST)']
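
The position of a name in this list is the device index. When several devices are listed, a specific one can be selected by passing that index to sr.Microphone through its device_index argument. The lookup helper below is a hypothetical convenience function, not part of the library, shown here with the first few names from the output above.

```python
def find_microphone_index(device_names, keyword):
    """Return the index of the first device whose name contains keyword
    (case-insensitive), or None if no device matches."""
    keyword = keyword.lower()
    for index, name in enumerate(device_names):
        if keyword in name.lower():
            return index
    return None

# First entries of the list printed above.
names = [
    "Microsoft Sound Mapper - Input",
    "Microphone Array (Realtek(R) Au",
    "Microsoft Sound Mapper - Output",
    "Speaker/Headphone (Realtek(R) A",
]

index = find_microphone_index(names, "microphone array")
print(index)  # 1

# With the library imported as sr, the index could then be used like this:
#
# with sr.Microphone(device_index=index) as source:
#     audio1 = r.listen(source)
```

The list also contains output devices such as speakers, so matching on a name is safer than hard-coding an index.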

In [8]:
# For microphone input, use the listen method instead of record
with sr.Microphone() as source:
    audio1 = r.listen(source)
    
print("You said: ", r.recognize_google(audio1))

You said:  1234 exclamation @ hash
