# Speech Recognition

Speech recognition is the ability of computer software to identify words and phrases in spoken language and convert them to human-readable text. In this tutorial, you will learn how you can convert speech to text in Python using the SpeechRecognition library.

With this library, we do not need to build any machine learning model from scratch: it provides convenient wrappers for several well-known public speech recognition APIs (such as Google Cloud Speech API, IBM Speech to Text, etc.).

## 1 | Install New Dependencies

In [1]:
#Speech Recognition. Might have to install from terminal
%pip install SpeechRecognition pydub

Collecting SpeechRecognition
  Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl.metadata (28 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading SpeechRecognition-3.10.4-py2.py3-none-any.whl (32.8 MB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub, SpeechRecognition
Successfully installed SpeechRecognition-3.10.4 pydub-0.25.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
#NLTK (Natural Language Toolkit), used later for sentiment analysis
%pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting click (from nltk)
  Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.7.24-cp311-cp311-macosx_11_0_arm64.whl.metadata (40 kB)
Collecting tqdm (from nltk)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.7.24-cp311-cp311-macosx_11_0_arm64.whl (278 kB)
Downloading click-8.1.7-py3-none-any.whl (97 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, click, nltk
Successfully installed click-8.1.7 nltk-3.9.1 regex-2024.7.24 tqdm-4.66.5
Note: you may need to restart the kernel to use updated packages.


In [3]:
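#FFmpeg lets pydub read non-WAV formats (mp3, mp4, etc.).
#Note: this PyPI package is not the FFmpeg program itself; pydub needs the
#ffmpeg executable, which is installed through your OS package manager.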
%pip install ffmpeg

Collecting ffmpeg
  Downloading ffmpeg-1.4.tar.gz (5.1 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: ffmpeg
  Building wheel for ffmpeg (setup.py) ... done
  Created wheel for ffmpeg: filename=ffmpeg-1.4-py3-none-any.whl size=6082 sha256=cf439cdd748ba4ff9d3154f9c6c61f9f9da5bf552f4c1309918254f683652f0e
  Stored in directory: /Users/dd/Library/Caches/pip/wheels/56/30/c5/576bdd729f3bc062d62a551be7fefd6ed2f761901568171e4e
Successfully built ffmpeg
Installing collected packages: ffmpeg
Successfully installed ffmpeg-1.4
Note: you may need to restart the kernel to use updated packages.
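
pydub shells out to an FFmpeg executable to decode non-WAV formats, so it can help to confirm that one is on the PATH before going further. A minimal sketch using only the standard library; the helper name `has_executable` is illustrative, and it assumes the program is installed under the name `ffmpeg`:

```python
import shutil


def has_executable(name):
    """Return True if a program called `name` can be found on the PATH."""
    return shutil.which(name) is not None


# Assumption: the FFmpeg program is installed under the name "ffmpeg".
if has_executable("ffmpeg"):
    print("ffmpeg found on PATH")
else:
    print("ffmpeg not found; install it with your OS package manager")
```

If the check fails, WAV files can still be processed, but converting from mp3 or mp4 is likely to raise an error.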


## 2 | Transcribing Small Audio Files

In [4]:
#Suppress project warnings
import warnings
warnings.filterwarnings("ignore")

In [5]:
#With the library installed, import it:
import speech_recognition as sr

#We will use Google Speech Recognition here, as it's straightforward
#and doesn't require any API key.

In [6]:
#Reading from a File
#Make sure you have an audio file in the current directory that 
#contains English speech

filename = "machine-learning_speech-recognition_16-122828-0002.wav"

#This file was taken from the LibriSpeech dataset, but you can use
#any WAV audio file you want; just change the file name.
#Next, let's initialize our speech recognizer:

In [7]:
# initialize the recognizer
r = sr.Recognizer()

In [8]:
#The below code is responsible for loading the audio file, and 
#converting the speech into text using Google Speech Recognition:

# open the file
with sr.AudioFile(filename) as source:
    # read the whole file into memory as audio data
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
    print(text)
    
#here is my result:

I believe you're just talking nonsense


The above code works well for small or medium-sized audio files. In the next section, we will write code for large files.

## 3 | Transcribing Large Audio Files

In [9]:
#Reading Large Audio Files

#If you want to perform speech recognition on a long audio file,
#then the function below handles that quite well:
# importing libraries
import speech_recognition as sr
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# create a speech recognition object
r = sr.Recognizer()

In [10]:
# a function that splits the audio file into chunks
# and applies speech recognition
def get_large_audio_transcription(path):
    """
    Split the large audio file into chunks
    and apply speech recognition to each of these chunks
    """
    # open the audio file using pydub
    sound = AudioSegment.from_wav(path)
    # split the audio where silence lasts 500 milliseconds or more and get chunks
    chunks = split_on_silence(sound,
        # experiment with this value for your target audio file
        min_silence_len = 500,
        # adjust this per requirement
        silence_thresh = sound.dBFS-14,
        # keep 500 milliseconds of silence at each end, adjustable as well
        keep_silence=500,
    )
    folder_name = "audio-chunks"
    # create a directory to store the audio chunks
    if not os.path.isdir(folder_name):
        os.mkdir(folder_name)
    whole_text = ""
    # process each chunk 
    for i, audio_chunk in enumerate(chunks, start=1):
        # export audio chunk and save it in
        # the `folder_name` directory.
        chunk_filename = os.path.join(folder_name, f"chunk{i}.wav")
        audio_chunk.export(chunk_filename, format="wav")
        # recognize the chunk
        with sr.AudioFile(chunk_filename) as source:
            audio_listened = r.record(source)
            # try converting it to text
            try:
                text = r.recognize_google(audio_listened)
            except sr.UnknownValueError:
                # raised when no intelligible speech is found in the chunk
                print(chunk_filename, ": could not understand audio")
            else:
                text = f"{text.capitalize()}. "
                print(chunk_filename, ":", text)
                whole_text += text
    # return the text for all chunks detected
    return whole_text

The above function uses the split_on_silence() function from the pydub.silence module to split audio data into chunks on silence. The min_silence_len parameter is the minimum length of silence, in milliseconds, that triggers a split.

silence_thresh is the threshold below which audio is considered silence; I have set it to the average dBFS minus 14. The keep_silence argument is the amount of silence, in milliseconds, to leave at the beginning and end of each detected chunk.
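
To make the threshold concrete: dBFS measures loudness relative to the loudest value the sample format can hold, so 0 is the maximum and quieter audio is more negative. The sketch below computes it for a list of raw 16-bit samples. It is a simplified illustration with a hypothetical helper name (`dbfs`) and made-up sample values, not pydub's own code.

```python
import math


def dbfs(samples, sample_width=2):
    """Loudness of integer PCM samples in dB relative to full scale."""
    # Largest magnitude a signed sample of this width can hold (32768 for 16-bit).
    full_scale = 2 ** (8 * sample_width - 1)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(rms / full_scale)


# Made-up samples at half of full scale.
half_scale = [16384, -16384, 16384, -16384]
level = dbfs(half_scale)
print(round(level, 2))       # about -6.02
print(round(level - 14, 2))  # a threshold 14 dB below the average level
```

Tying the threshold to the file's own average level means the same offset can work for both quiet and loud recordings.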

These parameters won't be perfect for all sound files; experiment with them to suit your own large audio files.
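
To see how min_silence_len and silence_thresh interact, here is a simplified sketch of the idea in plain Python. It works on a made-up list of loudness values (one per millisecond) rather than real audio, the name `split_on_quiet` is hypothetical, and it does not model keep_silence; pydub's actual implementation differs in its details.

```python
def split_on_quiet(levels_db, silence_thresh, min_silence_len):
    """Split per-millisecond loudness values (dBFS) into loud chunks.

    A run of at least `min_silence_len` consecutive values below
    `silence_thresh` counts as a split point; shorter quiet runs stay
    inside the surrounding chunk.
    """
    chunks = []      # (start, end) index pairs, end exclusive
    start = None     # start index of the chunk being built
    quiet_run = 0    # length of the current run of quiet values
    for i, level in enumerate(levels_db):
        if level < silence_thresh:
            quiet_run += 1
            # Close the chunk once the quiet run is long enough.
            if start is not None and quiet_run == min_silence_len:
                chunks.append((start, i - quiet_run + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet_run = 0
    if start is not None:
        # Final chunk, without any trailing quiet values.
        chunks.append((start, len(levels_db) - quiet_run))
    return chunks


levels = [-10, -10, -50, -10, -50, -50, -50, -10, -10]
print(split_on_quiet(levels, silence_thresh=-30, min_silence_len=3))
# [(0, 4), (7, 9)]
```

The single quiet value at index 2 is too short to count as a pause, so it stays inside the first chunk. Raising min_silence_len therefore produces fewer, longer chunks.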

After that, we iterate over all chunks, convert each chunk of speech into text, and concatenate the results. Here is an example run:

In [11]:
path= "barakgotjokes.wav"
print("\nFull text:", get_large_audio_transcription(path))

audio-chunks/chunk1.wav : What do you call a factory that makes ok products. 
audio-chunks/chunk2.wav : A satisfactory. 
audio-chunks/chunk3.wav : Dear math grow up in solve your own problems. 
audio-chunks/chunk4.wav : What did the janitor say when he jumped out of the clos. 
audio-chunks/chunk5.wav : Surprise. 
audio-chunks/chunk6.wav : What did the ocean say to the beach. 
audio-chunks/chunk7.wav : Nothing you just wave. 
audio-chunks/chunk8.wav : I heard that the national dutch crime agency is better than the national crime agency of the uk. 
audio-chunks/chunk9.wav : Oh sorry well that's supposed to be funny. 

Full text: What do you call a factory that makes ok products. A satisfactory. Dear math grow up in solve your own problems. What did the janitor say when he jumped out of the clos. Surprise. What did the ocean say to the beach. Nothing you just wave. I heard that the national dutch crime agency is better than the national crime agency of the uk. Oh sorry well that's suppose
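
In the run above, words such as "dutch" and "uk" come out in lowercase. That is a side effect of str.capitalize(), which lowercases every character after the first. A small sketch of an alternative that only touches the first letter; `sentence_case` and `join_chunks` are hypothetical helper names, and the sample strings are made up for illustration:

```python
def sentence_case(text):
    """Upper-case the first character and leave the rest untouched."""
    return text[:1].upper() + text[1:]


def join_chunks(chunk_texts):
    """Join per-chunk transcripts, skipping chunks with no recognized text."""
    return " ".join(f"{sentence_case(t)}." for t in chunk_texts if t)


print("agency of the UK".capitalize())    # Agency of the uk
print(sentence_case("agency of the UK"))  # Agency of the UK
print(join_chunks(["a satisfactory", None, "surprise"]))  # A satisfactory. Surprise.
```

Whether to keep the recognizer's original casing is a judgment call; it matters mostly for proper nouns and acronyms.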

## 4 | Sentiment Analysis

Sentiment analysis is the use of natural language processing to classify people's opinions. It classifies text (written, or transcribed from speech) as positive, negative, or neutral, depending on the use case. Applied to feedback about a product, the analyzed sentiment can reveal patterns in what users are saying, so that the necessary steps can be taken to address any problems.

In [49]:
#Imports 
import pydub
import speech_recognition as sr

#install ffmpeg to work with different audio and video formats (Linux)
#sudo snap install ffmpeg
import ffmpeg

#install nltk, the Natural Language Toolkit
#pip install nltk
import nltk

In [50]:
#download the NLTK resources we need: the punkt tokenizer and the VADER sentiment lexicon
nltk.download("punkt")
nltk.download("vader_lexicon")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\atoth\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\atoth\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [51]:
#Let's write a function that reads our audio data and converts it to a .wav file from any other audio format (mp3, mp4, etc.).
#You won't need it for this exercise, but you'll have it for future reference.

import os

from pydub import AudioSegment
import speech_recognition as sr


def convert_to_wav(filename):
  """Takes an audio file of non .wav format and converts to .wav"""
  # Import audio file
  audio = AudioSegment.from_file(filename)

  # Create new filename by swapping the extension for .wav
  # (splitext is safe for names and paths that contain extra dots)
  new_filename = os.path.splitext(filename)[0] + ".wav"

  # Export file as .wav
  print(f"Converting {filename} to {new_filename}...")
  audio.export(new_filename, format='wav')
  return new_filename

In the block of code above, the AudioSegment class is imported from the pydub library (it contains many of the methods you will be using) and the SpeechRecognition library is imported as sr. The function takes the argument filename (the name of the audio file) and uses the from_file method of the AudioSegment class to read the file into the variable audio. The next line separates the file name from its extension and appends .wav using the '+' operator, which concatenates strings; the result is saved in the variable new_filename. Lastly, the audio is exported in the .wav file format.

In [52]:
def transcribe_audio(filename):
    """Takes a .wav format audio file and transcribes it to text."""
    # Setup a recognizer instance
    recognizer = sr.Recognizer()
    
    #Import the audio file and convert to audio data
    audio_file = sr.AudioFile(filename)
    with audio_file as source:
        audio_data = recognizer.record(source)
        
        # Return the transcribed text
        return recognizer.recognize_google(audio_data)

Finally, let's transcribe an audio file and analyze its sentiment:

In [53]:
#from speech_helpers import convert_to_wav, show_pydub_stats, transcribe_audio
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
filename = "barakgotjokes.wav"

trans_text = transcribe_audio(filename)
print(trans_text)
print(sid.polarity_scores(trans_text))

what do you call a factory that makes okay products a satisfactory dear math grow up in solve your own problems what did the janitor say when he jumped out of the closet supplies what did the ocean say to the beach nothing you just waved I heard that the national Dutch crime agency is better than the National Crime agency of the UK oh sorry well that's supposed to be funny
{'neg': 0.128, 'neu': 0.677, 'pos': 0.195, 'compound': 0.5719}


In the code above, the SentimentIntensityAnalyzer is imported from NLTK's VADER module; the helper functions written earlier are already defined in this notebook, so the import from speech_helpers is commented out. An instance of the analyzer is stored in sid, and the name of the audio file is stored in filename. Because this file is already in .wav format, convert_to_wav is not needed here. The audio is transcribed with transcribe_audio, and the polarity_scores method of the SentimentIntensityAnalyzer returns the sentiment scores.

The results show the negative (neg), neutral (neu), positive (pos), and compound scores. The neg, neu, and pos values show the degree to which the text is negative, neutral, and positive, while compound is a single combined score. From the results, we can see that the text is mostly neutral (neu is 0.677). The compound score is scaled between -1 and +1: the closer it is to -1, the more negative the text, and the closer to +1, the more positive. Here it is 0.5719, so the overall sentiment leans positive.
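
To turn the compound score into a single label, a cut-off is needed. The sketch below uses ±0.05, a commonly used convention for VADER that should be treated as an assumption and tuned for your data; `label_sentiment` is an illustrative helper name, not part of NLTK.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a polarity_scores-style dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The scores printed for the transcribed audio above.
scores = {'neg': 0.128, 'neu': 0.677, 'pos': 0.195, 'compound': 0.5719}
print(label_sentiment(scores))  # positive
```

Note that a text can be mostly neutral word by word and still receive a clearly positive compound score, as happens here.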

## 5 | Exploratory Data Analysis

Exploratory data analysis is the act of analyzing a dataset to show its main attributes or characteristics. For this project, we will use pydub, a Python library for audio manipulation with a simple interface, to extract the following from the audio data: channels, sample width, frame rate, and length.

In [54]:
#The function below will generate the above-listed attributes of the audio data:

def show_pydub_stats(filename):
    """Prints basic attributes of an audio file and returns the AudioSegment."""
    # Create AudioSegment instance
    audio_segment = AudioSegment.from_file(filename)
    
    # Print audio attributes and return AudioSegment instance
    print(f"Channels: {audio_segment.channels}")
    print(f"Sample width: {audio_segment.sample_width}")
    print(f"Frame rate (sample rate): {audio_segment.frame_rate}")
    print(f"Frame width: {audio_segment.frame_width}")
    print(f"Length (ms): {len(audio_segment)}")
    return audio_segment

In [55]:
try:
    path = "barakgotjokes.wav"
    show_pydub_stats(path)
except Exception as e:
    # report the problem instead of silently ignoring it
    print(f"Could not read {path}: {e}")

Channels: 2
Sample width: 2
Frame rate (sample rate): 48000
Frame width: 4
Length (ms): 27093
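
For plain WAV files, the same attributes can be cross-checked without pydub. A sketch using Python's built-in wave module; `wav_stats` is an illustrative name, and the function only handles uncompressed WAV input.

```python
import wave


def wav_stats(path):
    """Read basic attributes of an uncompressed WAV file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        frame_rate = wav.getframerate()    # frames per second
        n_frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_width": sample_width,
        "frame_rate": frame_rate,
        # one frame holds one sample for every channel
        "frame_width": channels * sample_width,
        "length_ms": round(1000 * n_frames / frame_rate),
    }
```

This also explains the numbers in the output above: a frame width of 4 is 2 channels times 2 bytes per sample, and the length in milliseconds follows from the frame count divided by the frame rate.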
