# Speech to text tutorial 🎤

What if you want to know what people say in some audio recordings but you are too lazy to listen to them? No problem, we have you covered! In this short tutorial, we will show you how `fab`'s `speech_to_text` interface can help you transcribe your audio recordings. For this demo, we'll download a dataset called "EmoDB". EmoDB has 535 voice recordings in German, spoken by 10 actors and covering 7 different emotions.

### Setting up some libraries, params, and utility variables

Before proceeding, we have to define the necessary variables such as `dataset_url`, `data_folder`, and `dataset_name`. These variables will guide us to the EmoDB dataset's web address, determine where to store the data, and name the downloaded dataset. Also, the `speech_to_text_tool_name` will determine what transcription tool we will use.

In [1]:
# This variable holds the web address from which we'll download the EmoDB dataset.
# It's like a treasure map guiding us to the wonderful voice recordings!
dataset_url = "http://emodb.bilderbar.info/download/download.zip"

# The data_folder variable points to the location where we'll store all the data and audio recordings.
# Think of it as our backstage area, well-organized and ready to showcase the talents of our voices!
data_folder = "./data/"

# The dataset_name variable will be the name we give to the EmoDB dataset once we download it.
# Just a friendly label to recognize it easily when we work with it later on.
dataset_name = "emodb_dataset"

# speech_to_text_tool_name holds the name of our transcription tool, "whisper".
# "whisper" is the speech recognition model that will turn the voices into text.
# It's like the scribe listening behind the curtain!
speech_to_text_tool_name = "whisper"

In [2]:
# Let's import some essential libraries that will assist us in our speech transcription journey.
import os
import torchaudio
import torch
import sys
import pathlib
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image, Audio
import random
import logging

# tool_folder points to the path where we store all our magical tools, including the "speech_to_text" interface.
# It's like the wizard's library, filled with powerful incantations!
tool_folder = "../../tools/"

# Now, we define the target folder and script directory paths for later use.
target_folder = tool_folder + 'speech_to_text'
script_directory = os.getcwd()
target_folder_absolute_path = os.path.join(script_directory, target_folder)

# We add the target_folder to the system path, so we can import the modules from there.
sys.path.insert(0, target_folder_absolute_path)

# Next, we import the Transcriber class from the speech_to_text module.
from speech_to_text import Transcriber

# dataset_path holds the path to our downloaded EmoDB dataset.
dataset_path = data_folder + dataset_name

# audio_folder_path points to the folder containing the original audio recordings in the EmoDB dataset.
audio_folder_path = dataset_path + "/wav/"

# Finally, we export these values as environment variables so that the %%bash cell that downloads the dataset can read them.
os.environ['dataset_url'] = dataset_url
os.environ['data_folder'] = data_folder
os.environ['dataset_name'] = dataset_name
os.environ['dataset_path'] = dataset_path

# Suppress debug messages
logging.getLogger('matplotlib.font_manager').disabled = True
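
The environment variables matter because the `%%bash` download cell runs in a separate shell process, which cannot see Python variables but does inherit the environment. The following is a minimal sketch of that mechanism, assuming a POSIX `sh` is available; the value is repeated here only so the snippet runs on its own.

```python
import os
import subprocess

# Same value as dataset_path in the cell above, repeated so this sketch runs alone.
os.environ["dataset_path"] = "./data/emodb_dataset"

# A child shell inherits the parent's environment and can expand $dataset_path.
result = subprocess.run(
    ["sh", "-c", 'echo "$dataset_path"'],
    capture_output=True,
    text=True,
    check=True,
)
shell_value = result.stdout.strip()
print(shell_value)
```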




### Setting up the dataset

Let's start by downloading the EmoDB dataset. 🎤📥

The following code will fetch the EmoDB dataset from the provided `dataset_url` and save it in the `data_folder`.

In [3]:
%%bash

# This bash script checks if the EmoDB dataset has already been downloaded.
# If the dataset folder exists, it means the dataset is already downloaded.
# Otherwise, it proceeds with the download process.

if [ -d "$dataset_path" ]; then
  # The dataset folder exists, so the dataset is already downloaded.
  echo "$dataset_name already downloaded in $dataset_path."
else
  # The dataset folder does not exist, indicating the dataset needs to be downloaded.
  echo "Downloading..."

  # Create the dataset folder and its parent directories, if they don't exist.
  mkdir -p "$dataset_path"

  # Use the 'wget' command to fetch the EmoDB dataset from the provided URL ($dataset_url).
  # Save the downloaded file as "$dataset_name.zip" in the "$dataset_path" folder.
  wget -O "$dataset_path"/"$dataset_name".zip "$dataset_url"

  # Unzip the downloaded dataset file ($dataset_name.zip) into the "$dataset_path" folder.
  # The '-d' option specifies the destination directory for the extracted files.
  unzip "$dataset_path"/"$dataset_name".zip -d "$dataset_path"

  # Remove the downloaded zip file, as we don't need it anymore.
  rm "$dataset_path"/"$dataset_name".zip
fi


emodb_dataset already downloaded in ./data/emodb_dataset.
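
If `wget` or `unzip` is not installed, the same steps can be sketched in pure Python with the standard library. This is an illustrative alternative, not part of the `fab` tooling; the helper name `download_and_extract` is hypothetical.

```python
import pathlib
import urllib.request
import zipfile


def download_and_extract(url, target_dir):
    """Download a zip archive from `url` and unpack it into `target_dir`.

    Mirrors the bash cell: the download is skipped when `target_dir` exists.
    Returns True if a download happened, False if it was skipped.
    """
    target = pathlib.Path(target_dir)
    if target.is_dir():
        print(f"{target.name} already downloaded in {target}.")
        return False

    print("Downloading...")
    target.mkdir(parents=True)
    archive = target / (target.name + ".zip")
    urllib.request.urlretrieve(url, archive)  # fetch the archive
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)  # unpack into the dataset folder
    archive.unlink()  # the zip file is no longer needed
    return True
```

In this notebook it would be called as `download_and_extract(dataset_url, dataset_path)`.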


Now that we've successfully downloaded the EmoDB dataset and have it at our disposal, it's time to embark on a delightful exploration of its voice recordings! 🎧🔍

What better way than by playing some random audio recordings from the dataset? 🌟🔊


In [4]:
# This function, get_random_file_from_folder, takes a folder_path as input.
# It returns the name of a random file from the specified folder.

def get_random_file_from_folder(folder_path):
    # Use the os.listdir() function to obtain a list of all files in the folder_path.
    files = os.listdir(folder_path)

    # Randomly select a file from the list of files using random.choice().
    random_file = random.choice(files)

    # Return the name of the randomly selected file.
    return random_file
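
Note that `os.listdir()` returns every entry in the folder, including subfolders and non-audio files. A slightly more defensive variant, sketched here under the assumption that the recordings use the `.wav` extension, keeps only regular `.wav` files; the name `get_random_wav_from_folder` is hypothetical.

```python
import os
import random


def get_random_wav_from_folder(folder_path):
    """Return the name of a random .wav file in folder_path."""
    wav_files = sorted(
        name
        for name in os.listdir(folder_path)
        if name.lower().endswith(".wav")
        and os.path.isfile(os.path.join(folder_path, name))
    )
    if not wav_files:
        # Fail early with a clear message instead of an IndexError from random.choice.
        raise FileNotFoundError(f"No .wav files found in {folder_path}")
    return random.choice(wav_files)
```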

In [5]:
# We'll use the previously defined function, get_random_file_from_folder, to obtain a random audio file name.
# The function takes the audio_folder_path as input and returns the name of a random file from the folder.

random_original_file = get_random_file_from_folder(audio_folder_path)

# Now, we play the randomly selected audio file using the Audio() function.
# The Audio() function takes the complete path to the audio file as input.
# The path is created by concatenating the audio_folder_path and the randomly selected file name (random_original_file).

Audio(audio_folder_path + random_original_file)

### Transcribing the speech in the selected audio recording

Now comes the thrilling part of our voice transcription adventure! We will use the powerful `whisper` speech-to-text model.

In [6]:
# The load_audio function loads an audio file using torchaudio and returns only the waveform; the sample rate is discarded.
def load_audio(file_path):
    waveform, sample_rate = torchaudio.load(file_path)
    return waveform

In [8]:
transcriber = Transcriber(
    model_name=speech_to_text_tool_name,  # "whisper", as set in the first cell
    model_checkpoint=None,
    language=None,
    models_save_dir=None,
    extra_params=None,
)

audio = load_audio(audio_folder_path + random_original_file)
raw_response, text = transcriber.transcribe(audio)
#print(raw_response[0])
print(text[0])

Model is multilingual and has 1,541,384,960 parameters.
 Heute Abend könnte ich es ihm sagen.
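
To transcribe more than one recording, the same two calls can be wrapped in a loop. The sketch below assumes, based only on the cell above, that the transcription function returns a `(raw_response, text)` pair whose `text[0]` is the transcript string; `transcribe_folder` is a hypothetical helper, not part of `fab`.

```python
import os


def transcribe_folder(folder_path, load_fn, transcribe_fn, limit=None):
    """Transcribe the .wav files in folder_path and return {file_name: text}.

    load_fn maps a file path to audio; transcribe_fn maps audio to a
    (raw_response, text) pair, as transcriber.transcribe does above.
    """
    names = sorted(n for n in os.listdir(folder_path) if n.lower().endswith(".wav"))
    if limit is not None:
        names = names[:limit]  # keep the demo short on large folders

    transcripts = {}
    for name in names:
        audio = load_fn(os.path.join(folder_path, name))
        _raw_response, text = transcribe_fn(audio)
        transcripts[name] = text[0].strip()  # drop the leading space seen above
    return transcripts
```

In this notebook it might be called as `transcribe_folder(audio_folder_path, load_audio, transcriber.transcribe, limit=5)`.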


## That's all, folks!