1\. SpeechRecognition Python library
------------------------------------

00:00 - 00:15

To get started with spoken language recognition, let's check out the SpeechRecognition Python library. We'll start with why we're using the SpeechRecognition library, and then we'll see how we can use Google's Web Speech API to transcribe speech to text.

2\. Why the SpeechRecognition library?
--------------------------------------

00:15 - 00:55

Automatic speech recognition is a tough challenge. And there's no shortage of companies and research institutions working on libraries to help solve it. There's the Sphinx library by Carnegie Mellon University, Kaldi, SpeechRecognition, and more. Some have more robust features than others but they all have the same goal of transcribing audio files to text. We're going to be focused on the SpeechRecognition library because of its low barrier to entry and its compatibility with many available speech recognition APIs we'll see shortly.

#### Some existing Python libraries

• CMU Sphinx

• Kaldi

• SpeechRecognition 

• Wav2letter++ by Facebook

3\. Getting started with SpeechRecognition
------------------------------------------

00:55 - 01:15

We can get started with the SpeechRecognition library by installing it from PyPI with pip: run the pip install SpeechRecognition command in a terminal or shell. It's compatible with Python 2 and 3 but we'll be using Python 3.

#### Install from PyPI:
```
$ pip install SpeechRecognition
```
• Compatible with Python 2 and 3
• We'll use Python 3


4\. Using the Recognizer class
------------------------------

01:15 - 02:23

Now we have SpeechRecognition installed, let's check out where all the magic happens, the Recognizer class. So how do we use it? To access the Recognizer class, we'll first import the speech_recognition module under the abbreviation sr. Then we'll create an instance of the Recognizer class by calling it from sr and assigning it to a variable, recognizer. Finally, we'll set the recognizer's energy threshold to 300. The energy threshold can be thought of as the loudness of audio which is considered speech. Values below the threshold are considered silence, values above are considered speech. A silent room is typically between 0 and 100. SpeechRecognition's documentation recommends 300 as a starting value which covers most speech files. The energy threshold value will adjust automatically as the recognizer listens to an audio file.

```python
# Import the SpeechRecognition library
import speech_recognition as sr
# Create an instance of Recognizer
recognizer = sr.Recognizer()
# Set the energy threshold
recognizer.energy_threshold = 300
```
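To make the threshold idea concrete, here is a minimal sketch in plain Python. It computes the root-mean-square (RMS) energy of a chunk of 16-bit sample values and compares it with a threshold of 300. The helper names `rms_energy` and `is_speech` and the sample values are made up for illustration; this only approximates the idea and is not SpeechRecognition's own code.

```python
import math


def rms_energy(samples):
    """Root-mean-square energy of a sequence of signed 16-bit sample values."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples, energy_threshold=300):
    """Treat a chunk as speech when its RMS energy exceeds the threshold."""
    return rms_energy(samples) > energy_threshold


# Hypothetical sample values, chosen only to land on either side of 300
quiet_chunk = [20, -35, 10, -15]         # low amplitude, like a silent room
loud_chunk = [1200, -900, 1500, -1100]   # high amplitude, like a voice

print(is_speech(quiet_chunk))  # False
print(is_speech(loud_chunk))   # True
```

Raising the threshold makes the check stricter, so quieter speech may be treated as silence; lowering it lets more background noise through.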

5\. Using the Recognizer class to recognize speech
--------------------------------------------------

02:23 - 03:14

Now we've got a recognizer instance ready, it's time to recognize some speech. We chose SpeechRecognition for its flexibility. Here's what I mean. SpeechRecognition has functions built-in to work with many of the best speech recognition APIs. Recognize bing accesses Microsoft's Cognitive Services, recognize google uses Google's free Web Speech API, recognize google cloud accesses Google's Cloud Speech API, and recognize wit uses the wit dot ai platform. They all accept an audio file and return text, which is hopefully the transcribed speech from the audio file. Remember, speech recognition is still far from perfect.

• `Recognizer` class has built-in functions which interact with speech APIs
  - `recognize_bing()`
  - `recognize_google()`
  - `recognize_google_cloud()`
  - `recognize_wit()`

Input: `audio_file`

Output: transcribed speech from `audio_file`
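One way to see which of these methods an object offers is to inspect it for names starting with `recognize_`. The sketch below does that with a hypothetical `DemoRecognizer` stand-in so it runs without the library installed; the helper name `list_recognize_methods` is also made up for illustration.

```python
def list_recognize_methods(obj):
    """Return the sorted names of callable attributes starting with 'recognize_'."""
    return sorted(
        name
        for name in dir(obj)
        if name.startswith("recognize_") and callable(getattr(obj, name))
    )


class DemoRecognizer:
    """Hypothetical stand-in so the sketch runs without the real library."""

    def recognize_google(self, audio_data, language="en-US"):
        return "demo transcript"

    def recognize_wit(self, audio_data, key):
        return "demo transcript"


print(list_recognize_methods(DemoRecognizer()))
# ['recognize_google', 'recognize_wit']
```

Passing a real `sr.Recognizer()` instead should list the methods available in your installed version, which may differ from the four named here.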

6\. SpeechRecognition Example
-----------------------------

03:14 - 04:18

We'll be using the recognize google function since it's free and doesn't require an API key. However, this limits us to 50 requests per day and if our audio files are too long, it may time out. In my experience, I've had no issues with audio files under 5 minutes. So if you have more audio files or long audio files, you may want to look into one of the paid API services. Let's put everything together with an example. We'll start by importing the speech recognition library as sr. Then we'll instantiate the Recognizer class. Finally we call recognize google which takes the required parameter audio data. We can also pass it the language our audio file is in. The default language is US English. We're using a mocked version of recognize google for this course so we don't go over the API limit. Running the function returns the speech detected in the audio file as text.

• Focus on `recognize_google()`

• Recognize speech from an audio file with SpeechRecognition:

```python
# Import SpeechRecognition library  
import speech_recognition as sr
# Instantiate Recognizer class
recognizer = sr.Recognizer()
# Transcribe speech using the Google Web Speech API
recognizer.recognize_google(audio_data=audio_file,
                          language="en-US")
```

```
Learning speech recognition on DataCamp is awesome!
```

7\. Your turn!
--------------

04:18 - 04:26

Now you've seen a starter example of the SpeechRecognition library, it's time to try it out for yourself!

Pick the wrong speech_recognition API
=====================================

Which of the following is **not** a speech recognition API within the `speech_recognition` library?

An instance of the `Recognizer` class has been created and saved to `recognizer`. You can try calling the API on `recognizer` to see what happens.

Instructions
------------

### Possible answers

`recognize_google()`

`recognize_bing()`

`recognize_wit()`

[/] `what_does_this_say()`

Using the SpeechRecognition library
===================================

To save typing `speech_recognition` every time, we'll import it as `sr`.

We'll also set up an instance of the `Recognizer` class to use later.

The `energy_threshold` is a number between 0 and 4000 that sets how loud audio must be before the `Recognizer` class treats it as speech.

`energy_threshold` will dynamically adjust whilst the recognizer class listens to audio.

Instructions
------------

-   Import the `speech_recognition` library as `sr`.
-   Setup an instance of the `Recognizer` class and save it to `recognizer`.
-   Set the `recognizer.energy_threshold` to 300.

```python
# Import the speech_recognition library
import speech_recognition as sr

# Create an instance of the Recognizer class
recognizer = sr.Recognizer()

# Set the energy threshold
recognizer.energy_threshold = 300
```


Using the Recognizer class
==========================

Now you've created an instance of the `Recognizer` class we'll use the `recognize_google()` method on it to access the Google web speech API and turn spoken language into text.

`recognize_google()` requires the `audio_data` argument; without it, the call raises an error.

US English is the default language. If your audio file isn't in US English, you can change the language with the `language` argument. A list of language codes can be seen [here](https://cloud.google.com/speech-to-text/docs/languages).

An audio file containing English speech has been imported as `clean_support_call_audio`. You can [listen to the audio file here](https://assets.datacamp.com/production/repositories/4637/datasets/393a2f76d057c906de27ec57ea655cb1dc999fce/clean-support-call.wav). SpeechRecognition has also been imported as `sr`.

To avoid hitting the API request limit of Google's web API, we've mocked the `Recognizer` class to work with our audio files. This means some functionality will be limited.

Instructions
------------

-   Call the `recognize_google()` method on `recognizer` and pass it `clean_support_call_audio`.
-   Set the language argument to `"en-US"`.

```python
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Transcribe the support call audio
text = recognizer.recognize_google(
    audio_data=clean_support_call_audio,
    language="en-US")

print(text)
```

1\. Reading audio files with SpeechRecognition
----------------------------------------------

00:00 - 00:14

In the last lesson, we transcribed a portion of a customer support audio file. But as you'll remember from earlier lessons, audio files require a bit of preprocessing before they can be worked with.

2\. The AudioFile class
-----------------------

00:14 - 01:08

Luckily, the SpeechRecognition library has a built-in class, AudioFile, along with another handy method in the Recognizer class, record. We can use these to take care of the preprocessing for us. It was done for us in the last lesson but in this lesson we'll go end-to-end. To begin, we import the SpeechRecognition library and instantiate a recognizer instance as before. Then to read in our audio file we access the AudioFile class and pass it our audio file filename and save it to a variable. In this case, our AudioFile variable is called clean support call. Now if we check the type of clean support call, we can see it's an instance of AudioFile.

```python
import speech_recognition as sr

# Setup recognizer instance 
recognizer = sr.Recognizer()

# Read in audio file
clean_support_call = sr.AudioFile("clean-support-call.wav")

# Check type of clean_support_call
type(clean_support_call)
```
`<class 'speech_recognition.AudioFile'>`

3\. From AudioFile to AudioData
-------------------------------

01:08 - 02:08

Let's see what happens if we pass our clean support call variable to the recognize google method. It errors, stating that the audio data parameter must be of type audio data. Our clean support call variable is currently of the type AudioFile. To convert it to the audio data type we can use the recognizer class's built-in record method. Let's see it. We use a context manager, also known as with, to open and read the audio file we've saved to clean support call as source. Then we create clean support call audio using the record method and passing it source. Now before we call recognize google again, let's check the type of clean support call audio. Beautiful, it's an instance of AudioData, just what we needed.

```python
recognizer.recognize_google(audio_data=clean_support_call)
```

`AssertionError: ``audio_data`` must be audio data`

```python
# Convert from AudioFile to AudioData
with clean_support_call as source:
    # Record the audio
    clean_support_call_audio = recognizer.record(source)

# Check the type
type(clean_support_call_audio)
```

`<class 'speech_recognition.AudioData'>`

4\. Transcribing our AudioData
------------------------------

02:08 - 02:23

Now our clean support call audio is in the AudioData format, let's call recognize google and pass it our instance of audio data. Much better. Before you try it out for yourself, there are two parameters of the record method you should know about, duration and offset.

```python
# Transcribe clean support call
recognizer.recognize_google(audio_data=clean_support_call_audio)
```

`hello I'd like to get some help setting up my account please`
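The two steps above (record the `AudioFile` into `AudioData`, then transcribe it) can be wrapped in one small helper. This is a sketch under assumptions: the helper name `transcribe` is made up, and the `FakeAudioFile` and `FakeRecognizer` classes are hypothetical stand-ins that let the example run without the library or a network call.

```python
def transcribe(recognizer, audio_file, language="en-US"):
    """Read an AudioFile-like object into audio data, then transcribe it."""
    # Opening the file in a context manager yields the source to record from
    with audio_file as source:
        audio_data = recognizer.record(source)
    # recognize_google() expects the recorded audio data, not the file object
    return recognizer.recognize_google(audio_data, language=language)


class FakeAudioFile:
    """Hypothetical stand-in for sr.AudioFile."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


class FakeRecognizer:
    """Hypothetical stand-in for sr.Recognizer that skips the network call."""

    def record(self, source):
        return "audio-data"

    def recognize_google(self, audio_data, language="en-US"):
        return f"[{language}] transcript of {audio_data}"


print(transcribe(FakeRecognizer(), FakeAudioFile()))
# [en-US] transcript of audio-data
```

With the real library, the same helper could be called as `transcribe(sr.Recognizer(), sr.AudioFile("clean-support-call.wav"))`.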

5\. Duration and offset
-----------------------

02:23 - 03:41

The record method records up to duration seconds of audio from source starting at offset. They're both set to None by default. This means that by default, record will record from the beginning of the file until there is no more audio. You can change this by setting them to a float value. For example, let's say you only wanted the first 2 seconds of all your audio files, you could set duration to 2. The offset parameter can be used to cut off or skip over a specified number of seconds at the start of an audio file. For example, if you didn't want the first 5 seconds of your audio files, you could set offset to 5. These parameters could be helpful if you knew there were parts of your audio files you didn't need. But remember, altering these parameters may cut off your audio in undesirable locations. The best values will be found by experimentation. We'll see more audio file manipulation later in the course.

```python
# duration and offset both None by default

# Leave duration and offset as default
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source,
                                               duration=None,
                                               offset=None)

# Get first 2-seconds of clean support call  
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source,
                                               duration=2.0)
```

`hello I'd like to get`
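As a rough way to reason about these two parameters, the sketch below works out which span of a file would be kept for a given `offset` and `duration`. The helper `recorded_span` and the 12-second file length are hypothetical; the function only mirrors the description above and does not call the library.

```python
def recorded_span(total_seconds, offset=None, duration=None):
    """Return the (start, end) seconds that would be kept from a file."""
    # offset=None means start at the beginning of the file
    start = 0.0 if offset is None else float(offset)
    start = min(start, float(total_seconds))
    # duration=None means keep going until the audio runs out
    if duration is None:
        end = float(total_seconds)
    else:
        end = min(start + float(duration), float(total_seconds))
    return (start, end)


print(recorded_span(12.0))                         # (0.0, 12.0)
print(recorded_span(12.0, duration=2))             # (0.0, 2.0)
print(recorded_span(12.0, offset=5))               # (5.0, 12.0)
print(recorded_span(12.0, offset=5, duration=10))  # (5.0, 12.0)
```

The last call shows that asking for more audio than remains simply stops at the end of the file, which matches "up to duration seconds".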

6\. Let's practice!
-------------------

03:41 - 03:51

Alright, enough talk, let's see speech transcription with SpeechRecognition in action!

From AudioFile to AudioData
===========================

As you saw earlier, there are some transformation steps we have to take to make our audio data useful. The same goes for SpeechRecognition. 

In this exercise, we'll import the `clean_support_call.wav` [audio file](https://assets.datacamp.com/production/repositories/4637/datasets/393a2f76d057c906de27ec57ea655cb1dc999fce/clean-support-call.wav) and get it ready to be recognized.

We first read our audio file using the `AudioFile` class. But the `recognize_google()` method requires an input of type `AudioData`.

To convert our `AudioFile` to `AudioData`, we'll use the `Recognizer` class's method `record()` along with a context manager. The `record()` method takes an `AudioFile` as input and converts it to `AudioData`, ready to be used with `recognize_google()`.

SpeechRecognition has already been imported as `sr`.

Instructions
------------

-   Pass `clean_support_call.wav` to the `AudioFile` class.
-   Use the context manager to open and read `clean_support_call` as `source`.
-   Record `source` and run the code.

```python
# Instantiate Recognizer
recognizer = sr.Recognizer()

# Convert audio to AudioFile
clean_support_call = sr.AudioFile("clean_support_call.wav")

# Convert AudioFile to AudioData
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source)

# Transcribe AudioData to text
text = recognizer.recognize_google(clean_support_call_audio,
                                   language="en-US")
print(text)
```

Recording the audio we need
===========================

Sometimes you may not want the entire audio file you're working with. The `duration` and `offset` parameters of the `record()` method can help with this.

After exploring your dataset, you find there's one file, imported as `nothing_at_end`, which has [30-seconds of silence at the end](https://assets.datacamp.com/production/repositories/4637/datasets/ca799cf2a7b093c06e1a5ae1dd96a49d48d65efa/30-seconds-of-nothing-16k.wav) and a support call file, imported as `static_at_start`, which has [3-seconds of static at the front](https://assets.datacamp.com/production/repositories/4637/datasets/dbc47d8210fdf8de42b0da73d1c2ba92e883b2d2/static-out-of-warranty.wav).

Setting `duration` and `offset` means the `record()` method will record up to `duration` audio starting at `offset`. They're both measured in seconds.

Instructions 1/2
----------------

-   Let's get the first 10-seconds of `nothing_at_end_audio`. To do this, you can set `duration` to 10.

```python
# Convert AudioFile to AudioData
with nothing_at_end as source:
    nothing_at_end_audio = recognizer.record(source,
                                             duration=10,  # Set duration to 10 to get the first 10 seconds
                                             offset=None)

# Transcribe AudioData to text
text = recognizer.recognize_google(nothing_at_end_audio,
                                   language="en-US")

print(text)
```


Instructions 2/2
----------------

-   Let's remove the first 3-seconds of static of `static_at_start` by setting `offset` to 3.

```python
# Convert AudioFile to AudioData
with static_at_start as source:
    static_at_start_audio = recognizer.record(source,
                                              duration=None,
                                              offset=3)

# Transcribe AudioData to text
text = recognizer.recognize_google(static_at_start_audio,
                                   language="en-US")

print(text)
```

1\. Dealing with different kinds of audio
-----------------------------------------

00:00 - 00:19

The SpeechRecognition library is very powerful out of the box. However, since speech recognition is still an active area of research, there are some limitations to the library. In this lesson, we'll start exploring some of the challenges you might run into and what you can do about them.

2\. What language?
------------------

00:19 - 00:45

Although SpeechRecognition is capable of transcribing audio, it doesn't necessarily know what kind of audio it's transcribing. For example, if you passed recognizer an audio file with Japanese speech but the language tag was set to US English, you'd get back the Japanese audio in English characters. That makes sense.

3\. What language?
------------------

00:45 - 01:25

And if you passed the same audio file with the language tag set to ja for Japanese, you'd get the audio transcribed into Japanese characters. The thing to note is the SpeechRecognition library doesn't automatically detect languages. So you'll have to ensure this parameter is set manually and make sure the API you're using has the capability to transcribe the language your audio files are in. We've seen the language tag in previous lessons. But what happens with non-speech audio?
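Since the language is never detected automatically, one way to compare results is to transcribe the same audio once per language tag. The sketch below assumes a recognizer object with a `recognize_google(audio_data, language=...)` method; the helper name, the `FakeRecognizer` stand-in, and its canned replies are all made up so the example runs offline.

```python
def transcribe_in_languages(recognizer, audio_data, languages):
    """Transcribe the same audio once per language tag and collect the results."""
    return {
        language: recognizer.recognize_google(audio_data, language=language)
        for language in languages
    }


class FakeRecognizer:
    """Hypothetical stand-in for sr.Recognizer with made-up canned replies."""

    replies = {"en-US": "ohayo gozaimasu", "ja": "おはようございます"}

    def recognize_google(self, audio_data, language="en-US"):
        return self.replies[language]


results = transcribe_in_languages(FakeRecognizer(), "audio-data", ["en-US", "ja"])
for language, text in results.items():
    print(language, "->", text)
```

Each real call counts against the API's request limit, so it may be worth keeping the list of language tags short.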

4\. Non-speech audio
--------------------

01:25 - 01:43

If we pass SpeechRecognition an audio file of a leopard roaring, it will raise an UnknownValueError because no speech was detected. Which also makes sense because a leopard roar, although very cool, isn't human speech.

5\. Non-speech audio
--------------------

01:43 - 02:07

We can prevent errors by using the show all parameter. The show all parameter shows a list of all the potential transcriptions the recognize google function came up with. In the case of our leopard roar, the list comes back empty but we avoid raising an error.

6\. Showing all
---------------

02:07 - 02:16

Or in the case of our Japanese file, you can see the different potential transcriptions.
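Once you have the show all output, you usually still want a single transcript. The sketch below picks the highest-confidence candidate. It assumes the response is either empty or a dict with an `"alternative"` list of `{"transcript": ..., "confidence": ...}` entries; that shape is an assumption to check against what your version actually returns. The helper name `best_transcript` and the sample response are made up.

```python
def best_transcript(response):
    """Pick the highest-confidence transcript from a show_all=True style response."""
    # An empty response (for example []) means no speech was recognized
    if not response or not isinstance(response, dict):
        return None
    alternatives = response.get("alternative", [])
    if not alternatives:
        return None
    # Not every alternative carries a confidence score, so default to 0.0
    best = max(alternatives, key=lambda alt: alt.get("confidence", 0.0))
    return best.get("transcript")


# Hypothetical response, shaped per the assumption described above
sample_response = {
    "alternative": [
        {"transcript": "hello world", "confidence": 0.92},
        {"transcript": "hello word"},
    ],
    "final": True,
}

print(best_transcript(sample_response))  # hello world
print(best_transcript([]))               # None
```

Returning `None` for an empty response keeps the no-speech case explicit without raising an error.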

7\. Multiple speakers
---------------------

02:16 - 02:54

Next comes multiple speakers. The free Google Web Speech API transcribes speech and returns it as a single block of text no matter how many speakers there are. A single returned text block can still be useful. However, if your problem requires knowing who said what, you may want to treat the free API we're using as a proof of concept and then use one of the paid versions for more complex tasks. The process of splitting more than one speaker from a single audio file is called speaker diarization. However, it is beyond the scope of this course.

8\. Multiple speakers
---------------------

02:54 - 03:10

To get around the multiple speakers problem, you could ensure your audio files are recorded separately for each speaker. Then transcribe each speaker's audio individually.
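If each speaker is recorded separately, the individual transcripts can be labeled as they come back. This sketch assumes a list of already-recorded audio objects, one per speaker, and a recognizer with a `recognize_google()` method; `label_transcripts` and the `FakeRecognizer` stand-in are hypothetical names used so the example runs offline.

```python
def label_transcripts(recognizer, speaker_audio, language="en-US"):
    """Transcribe each speaker's audio and return 'Speaker N: text' lines."""
    lines = []
    for i, audio_data in enumerate(speaker_audio):
        text = recognizer.recognize_google(audio_data, language=language)
        lines.append(f"Speaker {i}: {text}")
    return lines


class FakeRecognizer:
    """Hypothetical stand-in for sr.Recognizer that echoes its input."""

    def recognize_google(self, audio_data, language="en-US"):
        return f"words from {audio_data}"


for line in label_transcripts(FakeRecognizer(), ["clip_a", "clip_b"]):
    print(line)
# Speaker 0: words from clip_a
# Speaker 1: words from clip_b
```

This labels who said what, but not when; interleaving the speakers in time order would need timestamps that this approach does not provide.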

9\. Noisy audio
---------------

03:10 - 04:04

Finally, there's the problem of background noise. A rule of thumb to remember is if you have trouble understanding what is being said on an audio file due to background noise, chances are, a speech recognition system will too. To try and account for background noise, the Recognizer class has a built-in method, adjust for ambient noise, which takes a parameter, duration. The recognizer then listens for duration seconds at the start of the audio file and adjusts the energy threshold, the loudness above which audio is treated as speech, to a level suitable for the background noise. How much space you have at the start of your audio file will dictate what you can set the duration value to. The SpeechRecognition documentation recommends somewhere between 0.5 and 1 second as a good starting point.
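The general idea of adapting a threshold to background noise can be sketched with simple exponential smoothing: for each chunk heard, nudge the threshold toward a multiple of that chunk's energy. This is a simplified illustration, not the library's actual code; the function name and the `ratio` and `damping` values are assumptions chosen for the example.

```python
def adjust_threshold(energy_threshold, chunk_energies, ratio=1.5, damping=0.85):
    """Nudge the threshold toward `ratio` times each observed noise energy."""
    for energy in chunk_energies:
        target = energy * ratio
        # Exponential smoothing: keep `damping` of the old value, blend in the rest
        energy_threshold = energy_threshold * damping + target * (1 - damping)
    return energy_threshold


# Hypothetical noise energy of 1000 per chunk; more chunks means a longer listen
short_listen = adjust_threshold(300, [1000] * 5)
long_listen = adjust_threshold(300, [1000] * 50)

print(round(short_listen))
print(round(long_listen))
```

Listening to more chunks moves the threshold closer to the noise-based target, which is why a longer duration adapts better but also consumes more of the start of the file.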

10\. Let's practice!
--------------------

04:04 - 04:23

As you can see, speech has a whole lot of variability which makes transcribing it a tough challenge. But now we've talked about some of the ways to deal with different kinds of audio, let's head over to the console and see it all in action!

Different kinds of audio
========================

Now you've seen an example of how the `Recognizer` class works. Let's try a few more. How about speech from a different language?

What do you think will happen when we call the `recognize_google()` function on a Japanese version of `good_morning.wav` ([file](https://assets.datacamp.com/production/repositories/4637/datasets/cd9b801670d0664275cdbd3a24b6b70a8c2e5222/good-morning-japanense.wav)) (`japanese_audio`)? 

The default language is `"en-US"`. Are the results the same with the `"ja"` tag?

How about non-speech audio? Like this [leopard roaring](https://assets.datacamp.com/production/repositories/4637/datasets/5720832b2735089d8e735cac3e0b0ad9b5114864/leopard.wav) (`leopard_audio`).

Or speech where the sounds may not be real words, such as [a baby talking](https://assets.datacamp.com/production/repositories/4637/datasets/e9fd46a06d74431e3baa942c489e1b119d85a233/charlie-bit-me-5.wav) (`charlie_audio`)?

To get more familiar with the `Recognizer` class, we'll look at an example of each of these.

Instructions 1/4
----------------

-   Pass the Japanese version of good morning (`japanese_audio`) to `recognize_google()` using `"en-US"` as the language.

```python
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_audio, language="en-US")

# Print the text
print(text)
```

Instructions 2/4
----------------

-   Pass the same Japanese audio (`japanese_audio`) using `"ja"` as the language parameter. Do you see a difference?

```python
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Pass the Japanese audio to recognize_google
text = recognizer.recognize_google(japanese_audio, language="ja")

# Print the text
print(text)
```

Instructions 3/4
----------------

-   What about non-speech audio? Pass `leopard_audio` to `recognize_google()` with `show_all` as `True`.

```python
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Pass the leopard roar audio to recognize_google
text = recognizer.recognize_google(leopard_audio,
                                   language="en-US",
                                   show_all=True)

# Print the text
print(text)
```

Instructions 4/4
----------------

-   What if your speech files have unintelligible human sounds? Pass `charlie_audio` to `recognize_google()` to find out.

```python
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Pass charlie_audio to recognize_google
text = recognizer.recognize_google(charlie_audio,
                                   language="en-US")

# Print the text
print(text)
```

Multiple Speakers 1
===================

If your goal is to transcribe conversations, there will be more than one speaker. However, as you'll see, the `recognize_google()` function will only transcribe speech into a single block of text.

You can hear in [this audio file](https://assets.datacamp.com/production/repositories/4637/datasets/925c8c31d6e4af9c291c692f13e4f41c7b5e86b2/multiple-speakers-16k.wav) there are three different speakers.

But if you transcribe it on its own, `recognize_google()` returns a single block of text. Which is still useful but it doesn't let you know which speaker said what.

We'll see an alternative to this in the next exercise.

The multiple speakers audio file has been imported and converted to `AudioData` as `multiple_speakers`.

Instructions
------------

-   Create an instance of `Recognizer`.
-   Recognize the `multiple_speakers` variable using the `recognize_google()` function.
-   Set the language to US English (`"en-US"`).

```python
# Create a Recognizer instance
recognizer = sr.Recognizer()

# Recognize the multiple speaker AudioData
text = recognizer.recognize_google(multiple_speakers,
                                   language="en-US")

# Print the text
print(text)
```

Multiple Speakers 2
===================

Distinguishing between multiple speakers in one audio file is called speaker diarization. However, as you've seen, the free function we've been using, `recognize_google()`, doesn't have the ability to tell different speakers apart.

One way around this, without using one of the paid speech to text services, is to ensure your audio files are single speaker.

This means if you were working with phone call data, you would make sure the caller and receiver are recorded separately. Then you could transcribe each file individually.

In this exercise, we'll transcribe each of the speakers in our [multiple speakers audio file](https://assets.datacamp.com/production/repositories/4637/datasets/925c8c31d6e4af9c291c692f13e4f41c7b5e86b2/multiple-speakers-16k.wav) individually.

Instructions
------------

-   Pass `speakers` to the `enumerate()` function to loop through the different speakers.
-   Call `record()` on `recognizer` to convert the `AudioFile`s into `AudioData`.
-   Use `recognize_google()` to transcribe each of the `speaker_audio` objects.

```python
recognizer = sr.Recognizer()

# Multiple speakers on different files
speakers = [sr.AudioFile("speaker_0.wav"),
            sr.AudioFile("speaker_1.wav"),
            sr.AudioFile("speaker_2.wav")]

# Transcribe each speaker individually
for i, speaker in enumerate(speakers):
    with speaker as source:
        # Call record() on recognizer to convert the AudioFiles into AudioData.
        speaker_audio = recognizer.record(source)
    print(f"Text from speaker {i}:")
    # Use recognize_google() to transcribe each of the speaker_audio objects.
    print(recognizer.recognize_google(speaker_audio,
                                      language="en-US"))
```

Working with noisy audio
========================

In this exercise, we'll start by transcribing a clean speech sample to text and then see what happens when we add some background noise.

A clean audio sample has been imported as `clean_support_call`.

[Play clean support call](https://assets.datacamp.com/production/repositories/4637/datasets/393a2f76d057c906de27ec57ea655cb1dc999fce/clean-support-call.wav).

We'll then do the same with the noisy audio file saved as `noisy_support_call`. It has the same speech as `clean_support_call` but with additional background noise.

[Play noisy support call](https://assets.datacamp.com/production/repositories/4637/datasets/f3edd5024944eac2f424b592840475890c86d405/2-noisy-support-call.wav).

To try and negate the background noise, we'll take advantage of `Recognizer`'s `adjust_for_ambient_noise()` function.

Instructions 1/4
----------------

-   Let's transcribe some clean audio. Read in `clean_support_call` as the source and call `recognize_google()` on the file.

```python
recognizer = sr.Recognizer()

# Record the audio from the clean support call
with clean_support_call as source:
    clean_support_call_audio = recognizer.record(source)

# Transcribe the speech from the clean support call
text = recognizer.recognize_google(clean_support_call_audio,  # Use the correct audio data here
                                   language="en-US")

print(text)
```


Instructions 2/4
----------------


-   Let's do the same as before but with a noisy audio file saved as `noisy_support_call` and `show_all` parameter as `True`.

```python
recognizer = sr.Recognizer()

# Record the audio from the noisy support call
with noisy_support_call as source:
    noisy_support_call_audio = recognizer.record(source)

# Transcribe the speech from the noisy support call
text = recognizer.recognize_google(noisy_support_call_audio,
                                   language="en-US",
                                   show_all=True)  # Set show_all to True to return all possible transcriptions

print(text)
```


Instructions 3/4
----------------

-   Set the `duration` parameter of `adjust_for_ambient_noise()` to 1 (second) so `recognizer` adjusts for background noise.

```python
recognizer = sr.Recognizer()

# Record the audio from the noisy support call
with noisy_support_call as source:
    # Adjust the recognizer energy threshold for ambient noise
    recognizer.adjust_for_ambient_noise(source, duration=1)  # Set duration to 1 second
    noisy_support_call_audio = recognizer.record(source)

# Transcribe the speech from the noisy support call
text = recognizer.recognize_google(noisy_support_call_audio,
                                   language="en-US",
                                   show_all=True)

print(text)
```


Instructions 4/4
----------------

-   A `duration` of 1 was too long and it cut off some of the audio. Try setting `duration` to 0.5.

```python
recognizer = sr.Recognizer()

# Record the audio from the noisy support call
with noisy_support_call as source:
    # Adjust the recognizer energy threshold for ambient noise
    recognizer.adjust_for_ambient_noise(source, duration=0.5)  # Set duration to 0.5 seconds
    noisy_support_call_audio = recognizer.record(source)

# Transcribe the speech from the noisy support call
text = recognizer.recognize_google(noisy_support_call_audio,
                                   language="en-US",
                                   show_all=True)

print(text)
```