## Introduction to Language and Speech Technology - REma (RU)
*Seminar 8*

Last update: 2024/11/01

Aditya Kamlesh Parikh - @aditya.parikh@ru.nl



Welcome to the introductory course on language and speech technology! In this course, we'll explore various tools and libraries related to Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). These technologies form the backbone of many modern applications, such as virtual assistants, speech-to-text systems, and language translation tools. Working with them will give you hands-on experience in language and speech technology.


### Google Colab

We will do our exercise in **Google Colab**, so it is worth getting to know it first if you have not used it before. Google Colab (short for "Colaboratory") is a free, cloud-based platform that lets you write and execute Python code in a web-based environment. It is particularly useful because it provides access to powerful computing resources such as GPUs without any setup.

#### Key Features:
Running Code Cells:

1.  Colab notebooks are made up of individual cells that can contain code or text.
2.  To run a code cell, click on the cell and press Ctrl + Enter.
3.  Press Shift + Enter to run the cell and move to the next one.
4.  You can also click the "Play" button on the left side of the cell to execute it.

Built-in Libraries:
1. Google Colab supports the Python programming language and comes with many libraries pre-installed and ready to use.
2. You can write Python code in cells just like you would in any other code editor.

If a library is not available in Colab by default, you can install it with `!pip install` or `!apt-get install` and then import it as usual.

Get more information about installation: https://colab.research.google.com/notebooks/snippets/importing_libraries.ipynb
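
Before installing anything, it can help to check whether a package is already available in the runtime. The snippet below is a minimal sketch using only the Python standard library; the helper name `is_installed` is our own and not part of Colab.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# Check a standard-library module and two libraries used later in this notebook
for name in ["json", "transformers", "librosa"]:
    status = "available" if is_installed(name) else "missing (install it with !pip install)"
    print(f"{name}: {status}")
```

The check only looks the module up; it does not import it, so it is quick even for large libraries.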

### Saving and Sharing Notebooks:

Google Colab automatically saves your notebook to your Google Drive.
You can also share your notebooks with others by clicking the "Share" button in the top right corner.

Easy Collaboration: You can easily share your Colab notebooks with your peers or instructors and work on projects together.

Now coming back to the exercise.

## Hands-on Exercise
In this exercise, we'll explore the basics of ASR and TTS using models from the Hugging Face 🤗 library. We will see how to download models from the Hugging Face Hub and how to get transcriptions from different ASR models in various languages. We will also get familiar with open-source TTS models, which generate synthetic audio from the text prompts we provide.

### In case you are interested: What is Hugging Face 🤗?
Hugging Face 🤗 is a leading open-source platform in the field of natural language processing (NLP) and machine learning. It provides a vast collection of state-of-the-art models and tools that make it easy to work on tasks like text classification, language translation, text generation, speech recognition, and much more.

Hugging Face 🤗's Transformers library is one of the most popular libraries for building AI models, and it supports a wide range of pre-trained models for different applications. You can get more information about this here: https://huggingface.co/


# Setup
#### Installing Required Libraries
First, we'll install the necessary libraries. Run the following cell to install them.

In [None]:
%%capture
# %%capture at the beginning of a code cell hides the output of that cell unless you specifically request it later.
# It is one of the so-called magic commands; Colab supports many of them. You can learn more in the notebook below.
# https://colab.research.google.com/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/language/IPythonMagic.ipynb
! pip install transformers
! pip install datasets
! pip install librosa
! pip install soundfile
! pip install jiwer

We also recommend that you sign up with Hugging Face. An account lets you not only explore different models and datasets but also share your own models with others. You can sign up here: https://huggingface.co/join

In the code below, `notebook_login()` will ask you for a token, which you can create at https://huggingface.co/settings/tokens

Copy and paste your token into the prompt and you are good to go. Your notebook is then logged in to the Hugging Face Hub.  


In [None]:
from huggingface_hub import notebook_login
notebook_login()

Now we will explore some pretrained and fine-tuned ASR models from 🤗. If you open the 🤗 models page and filter by the task Automatic Speech Recognition, you will see 19,985 speech-related models as of October 2024.   
https://huggingface.co/models?pipeline_tag=automatic-speech-recognition&sort=trending

Now we will try the ASR model with the highest number of downloads.

https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english

This model was created by fine-tuning the pretrained [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model on the [Common Voice English dataset](https://commonvoice.mozilla.org/en/datasets).

# Automatic Speech Recognition (ASR)

First we will download a sample audio file and listen to how it sounds.

In [1]:
!wget --no-check-certificate 'https://upload.wikimedia.org/wikipedia/commons/f/f6/Appuru.wav'



In [3]:
from IPython.display import Audio

# Play the audio file
Audio('Appuru.wav')

The audio contains an utterance from a woman in her mid-20s from Southern California pronouncing the idiom

"**the apple does not fall far from the tree**"

Now let's see what the ASR model recognizes.

In [5]:
# Import required libraries
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor  # Classes for the Wav2Vec2 speech recognition model
import librosa    # Library to handle audio files
import torch      # PyTorch library for tensor operations
from transformers import logging # Controls the verbosity of transformers warnings
logging.set_verbosity_error()


# Load the pre-trained Wav2Vec2 model and processor
# The model 'wav2vec2-large-xlsr-53-english' is specifically trained for English speech recognition
model = Wav2Vec2ForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-english")
processor = Wav2Vec2Processor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-english")

# Now we read the audio file using librosa
# 'sr=16000' specifies the sample rate to ensure the audio is processed correctly by the model
audio, rate = librosa.load("Appuru.wav", sr=16000)

# Process the audio data using the Wav2Vec2 processor
# This converts the audio waveform into input values that the model can understand
# 'padding="longest"' ensures that the audio input is properly padded for batch processing
input_values = processor(audio, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values

# Pass the input values through the model to get the logits (non-normalized prediction values)
# The logits are raw scores from the model, one score per vocabulary token at each time step
# torch.no_grad() turns off gradient tracking, which is not needed for inference and saves memory
with torch.no_grad():
    logits = model(input_values).logits

# Get the predicted ids by finding the highest values in the logits
# This step identifies the most likely tokens for each time step in the audio input
prediction = torch.argmax(logits, dim=-1)

# Decode the predicted tokens into a human-readable transcription
# The batch_decode function converts the token ids back into words to form the final text
transcription = processor.batch_decode(prediction)[0]

transcription



'the apple does not fall far from the tree'

The transcription is exactly the same as the reference text ("the apple does not fall far from the tree").

That's cool !!
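
To make the last two steps of the cell above more concrete, here is a minimal sketch of greedy CTC decoding in plain Python. It assumes a toy sequence of per-frame tokens instead of real model output; the function name `ctc_greedy_decode` and the example frames are ours. The idea is that, after taking the most likely token for every frame, repeated tokens are merged, the blank token is dropped, and the word separator is turned into a space.

```python
from itertools import groupby

BLANK = "<pad>"  # Wav2Vec2 uses its padding token as the CTC blank
WORD_SEP = "|"   # Wav2Vec2 vocabularies mark word boundaries with "|"


def ctc_greedy_decode(frame_tokens):
    """Collapse repeated tokens, drop blanks, and join the rest into text."""
    # Step 1: merge runs of identical consecutive tokens into one token
    collapsed = [token for token, _ in groupby(frame_tokens)]
    # Step 2: remove the blank token, which only separates repeated characters
    chars = [token for token in collapsed if token != BLANK]
    # Step 3: join characters and turn word separators into spaces
    return "".join(chars).replace(WORD_SEP, " ").strip()


# Toy example: one (already arg-maxed) token per audio frame
frames = ["<pad>", "t", "t", "h", "<pad>", "e", "e", "|", "|",
          "t", "r", "<pad>", "e", "<pad>", "e", "e", "<pad>"]
print(ctc_greedy_decode(frames))  # the tree
```

Note how the blank between the two "e" frames at the end is what allows the double letter in "tree" to survive the merging step.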

Let's try recording your voice using the microphone on your device. This time, please speak in English while we record your voice.

If you run the cell below, it will ask for permission to use the microphone in your browser. Please allow access to the microphone when prompted. The recording will last for 7 seconds, and once it's done, your voice will be saved in a file named `recorded_English_audio.wav`.

You don't need to understand the code below; it only handles the recording.

In [6]:
#@title
from IPython.display import Javascript, display
from google.colab import output
import base64
import io

def record():
    js = Javascript("""
    async function recordAudio() {
        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        const mediaRecorder = new MediaRecorder(stream);
        let audioChunks = [];

        mediaRecorder.ondataavailable = event => {
            audioChunks.push(event.data);
        };

        mediaRecorder.start();

        alert('Recording started! Please start speaking in English, around 7 seconds.');  // Alert to inform the user
        await new Promise(resolve => setTimeout(resolve, 7000)); // Record for 7 seconds
        mediaRecorder.stop();

        await new Promise(resolve => mediaRecorder.onstop = resolve);

        const audioBlob = new Blob(audioChunks);
        const reader = new FileReader();
        reader.readAsDataURL(audioBlob);
        reader.onloadend = () => {
            const base64data = reader.result.split(',')[1];
            google.colab.kernel.invokeFunction('notebook.recorded_audio', [base64data], {});
        };
    }

    recordAudio();
    """)
    display(js)

audio_data = None

def save_audio(data):
    global audio_data
    audio_data = base64.b64decode(data)
    with open('recorded_English_audio.wav', 'wb') as f:
        f.write(audio_data)
    print("Audio saved as 'recorded_English_audio.wav'")

output.register_callback('notebook.recorded_audio', save_audio)

record()



Let's listen to your recording.

In [None]:
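from IPython.display import Audio, display

# Play the audio file; display() is needed because this is not the last line of the cell
display(Audio('recorded_English_audio.wav'))

# Transcribe the recording with the Wav2Vec2 model and processor loaded earlier
audio, rate = librosa.load("recorded_English_audio.wav", sr=16000)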
input_values = processor(audio, sampling_rate=16_000, return_tensors="pt", padding="longest").input_values
logits = model(input_values).logits
prediction = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(prediction)[0]

transcription

**Task 1**: *Use this recording and transcribe it with the Wav2vec2.0 model we used before.*

# Whisper ASR
Now we will see how the transcription turns out if we use Whisper ASR models.
You can find the Whisper models by opening Hugging Face 🤗 again: https://huggingface.co/openai

Here is a short introduction to Whisper models:

Whisper is a state-of-the-art ASR model developed by OpenAI. What makes Whisper stand out is that it can handle a variety of tasks, such as multilingual transcription, language identification, and even translation from speech 😯.

Key Features of Whisper:

1. Multilingual Support:
Whisper can transcribe speech in multiple languages, including English, Dutch, Spanish, and many more.
2. High Accuracy:
Whisper is trained on a vast amount of data (680,000 hours of audio for the original models, and about 5M hours including pseudo-labelled audio for large-v3), which helps it achieve high transcription accuracy, even in noisy environments or with different accents.
3. Multiple Model Sizes:
Whisper is available in different model sizes: tiny, base, small, medium, and large. Larger models usually offer higher accuracy but require more computational power.
You can choose a smaller model for faster transcriptions or a larger one for better accuracy.


**Task 2**: *Now let's transcribe the audio you recorded earlier with Whisper.*


How You’ll Use Whisper in this Course:

**Task 3**: *After downloading different Whisper models from Hugging Face:*

*   Transcribe audio files in English and Dutch.
*   Experiment with different model sizes to see how they impact transcription quality and speed.
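
To compare transcription quality between models, a common measure is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcription into the reference, divided by the number of reference words. The `jiwer` package installed during setup offers a ready-made implementation; the sketch below is a plain Python version for illustration, and the function name `word_error_rate` and the example sentences are ours.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("The reference must contain at least one word.")

    # prev[j] holds the edit distance between the processed reference words
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,                        # deletion
                            curr[j - 1] + 1,                    # insertion
                            prev[j - 1] + substitution_cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


reference = "the apple does not fall far from the tree"
print(word_error_rate(reference, "the apple does not fall far from the tree"))  # 0.0
print(round(word_error_rate(reference, "the apple does not fall from a tree"), 3))  # 0.222
```

A WER of 0 means a perfect match. Real evaluations usually also normalize punctuation and casing before comparing, which this sketch handles only by lowercasing.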

We will start with the smallest model of the Whisper family, Whisper Tiny: https://huggingface.co/openai/whisper-tiny

This is English-to-English transcription.


In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
import warnings
import librosa

# Suppress FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Load model and processor
# WhisperProcessor helps to preprocess the audio data into a format suitable for the model.
# WhisperForConditionalGeneration is the pre-trained Whisper model used for speech-to-text transcription.
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Set the forced_decoder_ids to None
# Forced decoder ids pin the language and task tokens at the start of decoding.
# With None, nothing is forced: the model predicts the language and the task (here: transcription) itself.
model.config.forced_decoder_ids = None


# Load the recording made earlier, resampled to the 16 kHz that Whisper expects
audio, rate = librosa.load("recorded_English_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=rate, return_tensors="pt").input_features

# generate token ids
# The model processes the input features and generates token ids (predicted text in token form).
predicted_ids = model.generate(input_features)

# decode token ids to text
# skip_special_tokens=False will keep any special tokens (such as language identifiers or special markers) in the transcription.
transcription_with_special_tokens = processor.batch_decode(predicted_ids, skip_special_tokens=False)

# Decode token ids to text (without special tokens)
# Now we decode the token ids again but skip the special tokens, providing only a cleaner transcription.
transcription_clean = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# Output both transcriptions for comparison
print("Transcription with special tokens:", transcription_with_special_tokens)
print("Clean transcription:", transcription_clean)
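
The difference between the two printed transcriptions comes from Whisper's special tokens, which are written as `<|...|>`. The sketch below illustrates this on a hand-written example string rather than real model output; the helper names and the sample sentence are ours, and the real decoder does this on token ids rather than with a regular expression.

```python
import re

# Whisper special tokens look like <|startoftranscript|>, <|en|>, <|transcribe|>, ...
SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def find_special_tokens(text: str) -> list:
    """Return all special tokens in the order in which they appear."""
    return SPECIAL_TOKEN.findall(text)


def strip_special_tokens(text: str) -> str:
    """Remove all special tokens and surrounding whitespace."""
    return SPECIAL_TOKEN.sub("", text).strip()


# Hand-written example of a decoded sequence with special tokens kept
raw = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> Hello, how are you?<|endoftext|>"
print(find_special_tokens(raw))
print(strip_special_tokens(raw))  # Hello, how are you?
```

Reading the special tokens is a quick way to see which language and task the model chose.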

Now let's try with Dutch to Dutch transcription.

In [None]:
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Set forced decoder ids for transcription in Dutch
# This forces the model to transcribe the audio in a specified language (Dutch in this case).
forced_decoder_ids = processor.get_decoder_prompt_ids(language="dutch", task="transcribe")

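# Load the recorded audio, resampled to 16 kHz
# Note: this file holds the English recording; use a Dutch recording here for true Dutch-to-Dutch transcription
audio, rate = librosa.load("/content/recorded_English_audio.wav", sr=16000)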
input_features = processor(audio, sampling_rate=rate, return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# decode token ids to text
# skip_special_tokens=False will keep any special tokens (such as language identifiers or special markers) in the transcription.
transcription_with_special_tokens = processor.batch_decode(predicted_ids, skip_special_tokens=False)

# Decode token ids to text (without special tokens)
# Now we decode the token ids again but skip the special tokens, providing only a cleaner transcription.
transcription_clean = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# Output both transcriptions for comparison
print("Transcription with special tokens:", transcription_with_special_tokens)
print("Clean transcription:", transcription_clean)

Now we will check the translation. **Dutch - English**
Setting the task to "translate" forces the Whisper model to perform speech translation.
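
Conceptually, the language and task settings end up as a short sequence of special tokens placed at the start of decoding. The sketch below builds that sequence as strings to show the idea; it is an illustration only, the function `build_decoder_prompt` and the small language table are ours, and the real processor works with token ids instead of strings.

```python
# Illustrative subset of Whisper's language tokens
LANGUAGE_CODES = {"english": "en", "dutch": "nl"}


def build_decoder_prompt(language: str, task: str) -> list[str]:
    """Return the special tokens that steer Whisper towards a language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError("task must be 'transcribe' or 'translate'")
    code = LANGUAGE_CODES[language.lower()]
    return ["<|startoftranscript|>", f"<|{code}|>", f"<|{task}|>", "<|notimestamps|>"]


# Dutch speech written down in Dutch
print(build_decoder_prompt("dutch", "transcribe"))
# Dutch speech translated into English
print(build_decoder_prompt("dutch", "translate"))
```

In both cases the language token names the language spoken in the audio. With the translate task, Whisper always produces English text.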

Run the cell below and record your voice in Dutch.

In [None]:
#@title
from IPython.display import Javascript, display
from google.colab import output
import base64
import io

def record():
    js = Javascript("""
    async function recordAudio() {
        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        const mediaRecorder = new MediaRecorder(stream);
        let audioChunks = [];

        mediaRecorder.ondataavailable = event => {
            audioChunks.push(event.data);
        };

        mediaRecorder.start();

        alert('Recording started! Please start speaking in Dutch, around 7 seconds.');  // Alert to inform the user
        await new Promise(resolve => setTimeout(resolve, 7000)); // Record for 7 seconds
        mediaRecorder.stop();

        await new Promise(resolve => mediaRecorder.onstop = resolve);

        const audioBlob = new Blob(audioChunks);
        const reader = new FileReader();
        reader.readAsDataURL(audioBlob);
        reader.onloadend = () => {
            const base64data = reader.result.split(',')[1];
            google.colab.kernel.invokeFunction('notebook.recorded_audio', [base64data], {});
        };
    }

    recordAudio();
    """)
    display(js)

audio_data = None

def save_audio(data):
    global audio_data
    audio_data = base64.b64decode(data)
    with open('recorded_Dutch_audio.wav', 'wb') as f:
        f.write(audio_data)
    print("Audio saved as 'recorded_Dutch_audio.wav'")

output.register_callback('notebook.recorded_audio', save_audio)

record()


In [None]:
from IPython.display import Audio

# Play the audio file
Audio('/content/recorded_Dutch_audio.wav')

In [None]:
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# The 'forced_decoder_ids' tell the model to perform translation instead of transcription.
# Here, 'language="dutch"' specifies the language spoken in the audio (the source language).
# 'task="translate"' makes the model translate the speech into English instead of transcribing it in Dutch.
forced_decoder_ids = processor.get_decoder_prompt_ids(language="dutch", task="translate")

# Load the Dutch recording, resampled to 16 kHz

audio, rate = librosa.load("/content/recorded_Dutch_audio.wav", sr=16000)
input_features = processor(audio, sampling_rate=rate, return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

# decode token ids to text
# skip_special_tokens=False will keep any special tokens (such as language identifiers or special markers) in the transcription.
transcription_with_special_tokens = processor.batch_decode(predicted_ids, skip_special_tokens=False)

# Decode token ids to text (without special tokens)
# Now we decode the token ids again but skip the special tokens, providing only a cleaner transcription.
transcription_clean = processor.batch_decode(predicted_ids, skip_special_tokens=True)

# Output both transcriptions for comparison
print("Transcription with special tokens:", transcription_with_special_tokens)
print("Clean transcription:", transcription_clean)

# Text-to-Speech (TTS)

Now that you know the basics of ASR and its different models, we will also look at a model for Text-to-Speech. In the same manner, we will download a TTS model from Hugging Face 🤗 and synthesize speech in English and Dutch.

If you open 🤗 and filter for TTS models, you will find around 2063 models as of October 2024. https://huggingface.co/models?pipeline_tag=text-to-speech&sort=trending

Let's take a model that supports English and try to generate speech. We will download the model https://huggingface.co/coqui/XTTS-v2

Here you will clone your voice from the recording you made before.

You will also need to change your runtime from CPU to GPU before running the code below. To do that, open the `Runtime` menu, choose `Change runtime type`, and select `T4 GPU`. If needed, please record your voice again to clone it.
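
After switching the runtime, you may want to confirm that a GPU is actually visible. The following is a minimal sketch, assuming PyTorch is the framework in use (it is pre-installed in Colab); the helper name `gpu_available` is ours.

```python
def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot use the GPU from this notebook anyway
        return False
    return torch.cuda.is_available()


if gpu_available():
    print("GPU runtime detected.")
else:
    print("No GPU detected; check Runtime > Change runtime type.")
```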

In [None]:
! pip install TTS

If your recorded speech is gone after changing the runtime, don't worry. Run the cell below and record your speech again.

In [None]:
#@title
from IPython.display import Javascript, display
from google.colab import output
import base64
import io

def record():
    js = Javascript("""
    async function recordAudio() {
        const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
        const mediaRecorder = new MediaRecorder(stream);
        let audioChunks = [];

        mediaRecorder.ondataavailable = event => {
            audioChunks.push(event.data);
        };

        mediaRecorder.start();

        alert('Recording started! Please start speaking in English, around 7 seconds.');  // Alert to inform the user
        await new Promise(resolve => setTimeout(resolve, 7000)); // Record for 7 seconds
        mediaRecorder.stop();

        await new Promise(resolve => mediaRecorder.onstop = resolve);

        const audioBlob = new Blob(audioChunks);
        const reader = new FileReader();
        reader.readAsDataURL(audioBlob);
        reader.onloadend = () => {
            const base64data = reader.result.split(',')[1];
            google.colab.kernel.invokeFunction('notebook.recorded_audio', [base64data], {});
        };
    }

    recordAudio();
    """)
    display(js)

audio_data = None

def save_audio(data):
    global audio_data
    audio_data = base64.b64decode(data)
    with open('recorded_English_audio.wav', 'wb') as f:
        f.write(audio_data)
    print("Audio saved as 'recorded_English_audio.wav'")

output.register_callback('notebook.recorded_audio', save_audio)

record()


In [None]:
from TTS.api import TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

In [None]:
# generate speech by cloning a voice using default settings
tts.tts_to_file(text="how are you students? I hope you find this course interesting",
                file_path="output.wav",
                speaker_wav="/content/recorded_English_audio.wav",
                language="en")

In [None]:
from IPython.display import Audio

# Play the audio file
Audio('/content/output.wav')

**Task 4** : Try to generate more speech using different prompts. Make sure you also clone different voices, including your own. Save at least 5 synthetic speech files.
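
One possible way to organize this task is a small loop that writes one file per prompt. The sketch below assumes a `tts` object like the one created earlier, with the same `tts_to_file` arguments used above; the helper `synthesize_all`, the example prompts, and the output file names are ours.

```python
def synthesize_all(tts, prompts, speaker_wav, language="en", prefix="output"):
    """Synthesize one audio file per prompt and return the list of file paths."""
    paths = []
    for index, text in enumerate(prompts, start=1):
        path = f"{prefix}_{index}.wav"
        # Same call as in the single-file example, with a different text and file name
        tts.tts_to_file(text=text, file_path=path, speaker_wav=speaker_wav, language=language)
        paths.append(path)
    return paths


# Example prompts; replace them with your own
prompts = [
    "Welcome to the seminar on speech technology.",
    "The apple does not fall far from the tree.",
    "Speech synthesis turns written text into audio.",
    "This sentence was generated with a cloned voice.",
    "Thank you for listening, and see you next week.",
]
```

With the model loaded, a call such as `synthesize_all(tts, prompts, "/content/recorded_English_audio.wav")` would write `output_1.wav` to `output_5.wav`. Passing a different `speaker_wav` and `prefix` lets you repeat the run for another voice without overwriting the earlier files.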

# **Insights/Discussion/Conclusion**
Provide your insights from the above ASR/TTS models you used.

*TBD by the student.*