## Deep Learning for Audio

In this practical session, we apply deep learning to audio classification. Audio classification underpins applications such as music genre identification, speech recognition, and environmental sound detection. The task is to assign each audio clip to one of a set of predefined classes based on its content.


The session covers the full pipeline: loading and visualizing audio files, extracting meaningful features such as Mel-Frequency Cepstral Coefficients (MFCCs) and spectrograms, and preparing the dataset for training. We will then build a convolutional neural network (CNN), which is effective at capturing local time-frequency patterns in spectrogram-like representations of audio. Finally, you will apply your model to new data and see the result of your work in action.

Additionally, in this practical, we will focus on a specific application of audio classification: speaker identification.

You can find the dataset at this link: https://drive.google.com/file/d/1-6Soa4dM9nQi-hjoz021H2aEdhw-3MV9/view?usp=sharing



In [None]:
# For audio processing

!pip install torchaudio
!pip install gdown

If you work on Colab:

- Download the dataset using gdown (or upload the dataset to your Drive)
- Execute the following cell (replace the path with the one on your Drive)

In [None]:
from google.cloud import storage

def download_public_file():
    """Downloads a public blob from the bucket."""
    bucket_name = "data_ecoleit"
    source_blob_name = "Copie de train-dataset.tar.gz"
    destination_file_name = "./train-dataset.tar.gz"

    storage_client = storage.Client.create_anonymous_client()

    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Downloaded public blob {} from bucket {} to {}.".format(
            source_blob_name, bucket.name, destination_file_name
        )
    )

In [None]:
download_public_file()

In [1]:
!gdown https://drive.google.com/uc?id=1-6Soa4dM9nQi-hjoz021H2aEdhw-3MV9

Downloading...
From (original): https://drive.google.com/uc?id=1-6Soa4dM9nQi-hjoz021H2aEdhw-3MV9
From (redirected): https://drive.google.com/uc?id=1-6Soa4dM9nQi-hjoz021H2aEdhw-3MV9&confirm=t&uuid=1db2f3c6-44ad-4188-92a5-a977a99bebd6
To: /content/train-dataset.tar.gz
100% 5.48G/5.48G [01:51<00:00, 49.3MB/s]


In [None]:
# untar
!tar xvf train-dataset.tar.gz

#### Some explanations

When you load an audio file, you read a digital representation of the audio signal: a series of discrete numerical values (samples) that correspond to the signal's amplitude over time. The conversion from the analog signal to this digital format took place when the audio was recorded.

**Waveform**: The plots you will generate show the waveform of the audio file. A waveform visually represents the variations in amplitude of the audio signal over time. Stretches where the waveform swings far from zero, in either direction, are the louder parts, while stretches where it stays close to zero are the quieter parts.

**Sample Rate**: This indicates the number of samples of audio carried per second and is measured in Hz. A higher sample rate means the audio is sampled more frequently and can capture more detail, but it also requires more data. LibROSA standardizes this to 22.05 kHz for consistency and processing efficiency, while TorchAudio retains the original sample rate, which can provide a more accurate representation of the original audio at the cost of increased complexity.

**Duration**: The length of the audio clip in seconds. It's derived from the total number of samples divided by the sample rate. Understanding the duration is crucial for processing and analyzing the audio in segments or ensuring it fits within the required timeframe for your application.
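
As a quick check of the duration formula above, here is a minimal sketch that computes the duration from a sample count and a sample rate. The waveform is synthetic (an assumption, so that no audio file is needed), and the helper name `duration_seconds` is hypothetical.

```python
import numpy as np

def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Duration = total number of samples / sample rate."""
    return num_samples / sample_rate

# Synthetic stand-in for a loaded waveform: 40,000 samples at 16 kHz.
sample_rate = 16_000
waveform = np.zeros(40_000, dtype=np.float32)

duration = duration_seconds(len(waveform), sample_rate)
print(f"{len(waveform)} samples at {sample_rate} Hz -> {duration:.2f} s")
```

With a real file, `len(waveform)` and `sample_rate` would come from the values returned by the loading function.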

### Loading Audio with LibROSA

- Install LibROSA if you haven't already: `pip install librosa`

- Use the librosa.load() function to load an audio file. Note that LibROSA automatically resamples the audio to 22.05 kHz by default.

Resampling an audio signal refers to the process of changing the sample rate of the audio data. The sample rate is the number of samples (individual audio data points) captured per second, measured in Hertz (Hz). When you resample audio, you are essentially altering the number of these data points per second in the digital representation of the sound.
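
To make the effect of resampling concrete, the sketch below changes the sample rate of a synthetic tone with plain linear interpolation. This is a deliberately naive method with no anti-aliasing filter, chosen only to show how the number of data points changes; librosa relies on a higher-quality band-limited resampler. The helper name `resample_linear` and the 440 Hz test signal are assumptions for illustration.

```python
import numpy as np

def resample_linear(signal: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Naive resampling by linear interpolation (no anti-aliasing filter)."""
    duration = len(signal) / orig_sr
    n_target = int(round(duration * target_sr))
    # Time stamps (in seconds) of the original and of the new samples.
    t_orig = np.arange(len(signal)) / orig_sr
    t_target = np.arange(n_target) / target_sr
    return np.interp(t_target, t_orig, signal)

orig_sr, target_sr = 44_100, 22_050
t = np.arange(orig_sr) / orig_sr              # 1 second of audio
signal = np.sin(2 * np.pi * 440.0 * t)        # 440 Hz tone

resampled = resample_linear(signal, orig_sr, target_sr)
print(len(signal), "samples ->", len(resampled), "samples")
```

The duration stays the same (1 second); only the number of samples used to describe it changes.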



- Display the sample rate and the duration of the audio.

- Display the waveform of the audio in the notebook

In [None]:
# FIXME

### Loading Audio with TorchAudio

- Use torchaudio.load() to load the same audio file. Unlike LibROSA, TorchAudio will retain the original sampling rate of the audio file.

- Display the sample rate and the duration of the audio

- Display the waveform of the audio in the notebook

In [None]:
# FIXME

**Analog Audio Signal**:

An analog audio signal is a continuous representation of sound waves. Unlike digital audio, which represents sound with discrete numerical values at specific intervals (samples), an analog signal is continuous, with its amplitude varying smoothly over time. This type of signal closely mimics the physical sound waves that travel through the air and are captured by our ears.

In the context of audio recording:

**Recording**: Microphones convert sound waves (pressure changes in the air caused by vibrations) into analog electrical signals. These signals vary in voltage in a way that directly corresponds to the variation in air pressure of the sound waves.

The transition from analog to digital (and vice versa) is a crucial part of modern audio processing. To digitize an analog signal for use in computers and digital devices, the continuous signal must be sampled at regular intervals, a process that involves measuring the amplitude of the signal at these points and encoding these measurements as digital data. This digital representation allows for extensive processing, editing, and transmission in ways that analog signals, constrained by physical media, cannot easily support.
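
The two steps described above, sampling at regular intervals and encoding each measurement as digital data, can be sketched as follows. The "analog" signal is simulated by a function that can be evaluated at any time `t`; the 5 Hz sine, the 100 Hz sample rate, and the 16-bit encoding are assumptions chosen for illustration.

```python
import numpy as np

def analog_signal(t):
    """Stand-in for a continuous signal: a 5 Hz sine wave, defined for any t."""
    return np.sin(2 * np.pi * 5.0 * t)

sample_rate = 100                      # samples per second (Hz)
duration = 1.0                         # seconds
n_samples = int(sample_rate * duration)

# Sampling: measure the amplitude at regular intervals of 1 / sample_rate.
sample_times = np.arange(n_samples) / sample_rate
samples = analog_signal(sample_times)

# Quantization: encode each measurement as a 16-bit signed integer.
quantized = np.round(samples * 32767).astype(np.int16)

print(quantized[:5])
```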

### Time domain vs Frequency domain

What is the Time Domain?

The time domain representation of an audio signal shows how the signal's amplitude varies over time. It's the most intuitive way to understand sound, as it directly represents what happens in the real world: sounds get louder and quieter over time, which shows up as the waveform swinging further from zero or staying closer to it.


What is the Frequency Domain?

Unlike the time domain, the frequency domain represents the audio signal in terms of its constituent frequencies. It shows how much of each frequency is present in the signal, rather than how the signal changes over time. This is crucial for understanding the pitch, tone, and harmonics of sounds.


Fourier Transform:

The Fourier Transform is a mathematical technique that transforms a signal from the time domain to the frequency domain. The Fast Fourier Transform (FFT) is a computationally efficient version of the Fourier Transform and is widely used in digital signal processing.

Extracting and Visualizing Frequencies:

You can use numpy or scipy to perform an FFT on your audio signal and extract its frequency components. Plotting the magnitude of the result against frequency gives the frequency domain representation of the signal.
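
Here is a minimal sketch of an FFT with numpy. It runs on a synthetic signal (two sine tones, an assumption made so that the expected frequencies are known in advance) rather than on a file from the dataset. The `plot_spectrum` helper is hypothetical and only needs matplotlib when it is called.

```python
import numpy as np

sample_rate = 8_000                                  # Hz
t = np.arange(sample_rate) / sample_rate             # 1 second
# Synthetic signal: a strong 440 Hz tone plus a weaker 1000 Hz tone.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# rfft keeps only the non-negative frequencies of a real-valued signal.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
magnitude = np.abs(spectrum)

dominant_freq = freqs[np.argmax(magnitude)]
print(f"Dominant frequency: {dominant_freq:.1f} Hz")

def plot_spectrum(freqs, magnitude):
    """Plot magnitude against frequency (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.plot(freqs, magnitude)
    plt.xlabel("Frequency (Hz)")
    plt.ylabel("Magnitude")
    plt.show()
```

Calling `plot_spectrum(freqs, magnitude)` in the notebook should show two spikes, one per tone. On a real recording, replace `signal` and `sample_rate` with the values returned by the loader.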

You can also use librosa (check the documentation: https://librosa.org/doc/0.10.1/generated/librosa.stft.html#librosa-stft)

Now load an audio file, with librosa for instance, compute its Short-Time Fourier Transform (STFT), and display the frequencies.
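
To see what an STFT computes before using a library, here is a rough numpy-only sketch: the signal is cut into overlapping frames, each frame is multiplied by a window, and an FFT is taken per frame. The helper name `stft_magnitude`, the values `n_fft=512` and `hop_length=128`, and the synthetic two-tone signal are assumptions for illustration. Unlike `librosa.stft` with its default settings, this sketch does not pad the signal to center the frames.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop_length=128):
    """Magnitude STFT: one FFT per windowed frame, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # FFT along each frame, then put frequency on the first axis.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sample_rate = 8_000
t = np.arange(2 * sample_rate) / sample_rate
# Synthetic signal: 500 Hz during the first second, 2000 Hz during the second.
signal = np.where(t < 1.0, np.sin(2 * np.pi * 500 * t), np.sin(2 * np.pi * 2000 * t))

spec = stft_magnitude(signal)
freqs = np.fft.rfftfreq(512, d=1 / sample_rate)
first_peak = freqs[spec[:, 0].argmax()]    # strongest frequency in the first frame
last_peak = freqs[spec[:, -1].argmax()]    # strongest frequency in the last frame
print(spec.shape, first_peak, last_peak)
```

The frequency content changes between the first and the last frame, which is exactly the information a single FFT over the whole signal would lose.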

In [None]:
# FIXME

Now load and play some audio in the notebook

In [None]:
# FIXME

### Let's build a model now

In [None]:
# FIXME

#### Implement the train / test split function

First you need to implement a function that selects some audio file paths for training and some others for validation.

If you check the folder architecture, you can see we have multiple folders inside the `train-dataset` folder. Each folder is for a specific speaker.

We want to have some audio samples for each speaker in both the train and test datasets.

The simplest way to achieve this is to collect the audio paths for both training and test, and then load the audio dynamically during training.

In [None]:
import os
import glob
import numpy as np

In [None]:
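def train_test_split(root="train-dataset", val_ratio=0.3):
    """Split the audio paths of every speaker into train and validation lists."""
    # One sub-folder per speaker
    speaker_folders = glob.glob(os.path.join(root, '*'))

    train_files, val_files = [], []
    for speaker_folder in speaker_folders:
        speaker_files = glob.glob(os.path.join(speaker_folder, "*.flac"))

        nb_files = len(speaker_files)
        nb_sample_val = int(nb_files * val_ratio)
        # Shuffle so that the validation files are picked at random
        speaker_files_random = np.random.permutation(speaker_files)

        # The first nb_sample_val shuffled files go to validation, the rest to training
        val_path = speaker_files_random[:nb_sample_val]
        train_path = speaker_files_random[nb_sample_val:]

        train_files.extend(train_path.tolist())
        val_files.extend(val_path.tolist())

    return train_files, val_files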

In [None]:
train_files, val_files = train_test_split()

#### Implement the torch dataset

There are multiple pathways to architect your model. Specifically, your model can either operate directly on Mel Frequency Cepstral Coefficients (MFCCs) or utilize the spectrogram as its input. Each approach offers unique insights into the audio data, potentially influencing the model's performance based on the task at hand.

You need to manage 4 types of features:

- mfcc -> shape: (256, time duration)
- mfcc with mean pooling on the time axis -> shape: (256,)
- spectrogram -> shape: (201, time duration)
- spectrogram with mean pooling on the time axis -> shape: (201,)
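
Mean pooling on the time axis, as used in the list above, simply averages each feature over all time frames. The sketch below illustrates it on a random tensor standing in for a spectrogram; the number of time frames (241) is an arbitrary assumption.

```python
import torch

# Stand-in for the spectrogram of one clip: 201 frequency bins, 241 time frames.
features = torch.rand(201, 241)

# Mean pooling on the time axis: average over the last dimension.
pooled = features.mean(dim=-1)

print(tuple(features.shape), "->", tuple(pooled.shape))
```

The pooled vector has a fixed size whatever the clip length, which is what makes it usable as input to a simple dense classifier.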

Implement the torch dataset that is going to use the mfcc or the spectrogram. You can start with the mfcc dataset implementation.

IMPORTANT: To simplify the task and account for varying audio lengths, you are encouraged to randomly select a 3-second segment from each audio file for processing. This approach ensures consistency in data input size and potentially streamlines the learning model's training process.

Also do not forget that your dataset must return the feature vector (mfcc for instance) and the label (which is the id / name of the folder)
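
One point to keep in mind about the label: classification losses such as `nn.CrossEntropyLoss` expect class indices between 0 and `n_classes - 1`, while folder names are arbitrary ids. A possible way to bridge the two is a lookup table built from the file paths, sketched below. The helper name `build_label_mapping` and the example paths are hypothetical.

```python
import os

def build_label_mapping(file_paths):
    """Map each speaker folder name to a class index in 0..n_speakers-1."""
    speakers = sorted({os.path.basename(os.path.dirname(p)) for p in file_paths})
    return {speaker: idx for idx, speaker in enumerate(speakers)}

# Hypothetical paths, following the train-dataset/<speaker>/<file>.flac layout.
example_paths = [
    os.path.join("train-dataset", "1034", "a.flac"),
    os.path.join("train-dataset", "19", "b.flac"),
    os.path.join("train-dataset", "1034", "c.flac"),
]
speaker_to_idx = build_label_mapping(example_paths)
labels = [speaker_to_idx[os.path.basename(os.path.dirname(p))] for p in example_paths]
print(speaker_to_idx, labels)
```

Build the mapping once from all files (train and validation together) so that both datasets agree on the index of each speaker.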

In [None]:
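import os

import torch
from torch.utils.data import Dataset
import torchaudio

class AudioDataset(Dataset):
    def __init__(self, file_paths, segment_seconds=3):
        self.file_paths = file_paths
        self.segment_seconds = segment_seconds
        # Build the transform once instead of on every __getitem__ call.
        # The default n_fft=400 gives 201 frequency bins.
        self.spec_features_extraction = torchaudio.transforms.Spectrogram()

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        file_path = self.file_paths[idx]
        waveform, sr = torchaudio.load(file_path)
        # torchaudio returns (channels, samples); average the channels so that
        # both mono and stereo files become a 1D tensor.
        waveform = waveform.mean(dim=0)

        # Extract a fixed-length segment of the audio
        segment_len = self.segment_seconds * sr
        if len(waveform) <= segment_len:
            # case: audio <= 3 sec, pad with zeros at the end
            zeros_tensor = torch.zeros(segment_len - len(waveform))
            w_sample = torch.cat([waveform, zeros_tensor])
        else:
            # case: audio > 3 sec, pick a random start index
            rd_idx = torch.randint(0, len(waveform) - segment_len + 1, (1,)).item()
            w_sample = waveform[rd_idx:rd_idx + segment_len]

        # The label is the name of the speaker folder that contains the file.
        # If these ids are not contiguous integers starting at 0, map them to
        # 0..n_classes-1 before using a loss such as nn.CrossEntropyLoss.
        id_label = int(os.path.basename(os.path.dirname(file_path)))

        features = self.spec_features_extraction(w_sample)

        return features, id_label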

In [None]:
train_dataset = AudioDataset(train_files)
val_dataset = AudioDataset(val_files)

Now display the shape of one sample

In [None]:
# FIXME

Build the DataLoader

In [None]:
# FIXME

Now display the mfcc and the spectrogram

In [None]:
# FIXME

#### Build the models

In this part you need to build multiple models:

- Simple dense classifier for the pooling feature extraction versions

- 1D Convolution model

- 2D Convolution model

For each version, play with the architecture: try adding Dropout, BatchNorm2d, and residual connections.
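
As a possible starting point, here is a rough sketch of two of the three model families: a dense classifier for the pooled features and a small 2D CNN for spectrograms. All layer sizes, the class names, and the number of classes are assumptions to be tuned; the 1D convolution model is left out and would follow the same pattern with `nn.Conv1d`.

```python
import torch
from torch import nn

class DenseClassifier(nn.Module):
    """Dense classifier for pooled features of shape (batch, n_features)."""
    def __init__(self, n_features, n_classes, hidden=256, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

class Conv2dClassifier(nn.Module):
    """2D CNN for spectrograms of shape (batch, 1, n_freq, n_frames)."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            # Global average pooling makes the model independent of input size.
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

n_classes = 20  # hypothetical number of speakers
dense_out = DenseClassifier(201, n_classes)(torch.rand(4, 201))
conv_out = Conv2dClassifier(n_classes)(torch.rand(4, 1, 201, 241))
print(tuple(dense_out.shape), tuple(conv_out.shape))
```

Note that the 2D model expects a channel dimension, so a spectrogram batch of shape (batch, n_freq, n_frames) needs `unsqueeze(1)` before the forward pass.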

In [None]:
# FIXME

Implement a simple function that counts the number of learnable parameters of the model
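
One common way to do this, shown here as a short sketch, is to sum the number of elements of every parameter tensor that requires gradients. The function name `count_parameters` is an assumption.

```python
from torch import nn

def count_parameters(model: nn.Module) -> int:
    """Number of parameters that receive gradients during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Linear(10, 5) has 10 * 5 weights + 5 biases.
n_params = count_parameters(nn.Linear(10, 5))
print(n_params)
```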

In [None]:
# FIXME

#### Let's train the model

You now have all the pieces. Experiment with the optimizer, the loss, and the other training settings.
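
For reference, a minimal training loop might look like the sketch below. It runs on random tensors standing in for pooled features (an assumption, so that it works without the dataset), and the choice of Adam, the learning rate, the batch size, and the helper name `train_one_epoch` are illustrative rather than recommended values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_one_epoch(model, loader, optimizer, loss_fn):
    """Run one pass over the loader and return the mean loss per sample."""
    model.train()
    total_loss, n_samples = 0.0, 0
    for features, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(features), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(labels)
        n_samples += len(labels)
    return total_loss / n_samples

# Synthetic stand-in for pooled features: 64 samples, 201 features, 5 classes.
torch.manual_seed(0)
dataset = TensorDataset(torch.rand(64, 201), torch.randint(0, 5, (64,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)

model = nn.Linear(201, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

losses = [train_one_epoch(model, loader, optimizer, loss_fn) for _ in range(3)]
print(losses)
```

A validation pass would follow the same structure with `model.eval()`, `torch.no_grad()`, and no optimizer step.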

In [None]:
# FIXME

Now display some predictions with the model

In [None]:
# FIXME

Now compare the results of all the models

In [None]:
# FIXME

Have a look at other metrics, for instance, how to compute precision / recall and f1 score in the case of multiclass classification.
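
In the multiclass case, precision, recall, and F1 are usually computed per class from the confusion matrix and then averaged. The sketch below implements the macro average (every class counts equally) with numpy; the function name and the toy labels are assumptions. `sklearn.metrics.precision_recall_fscore_support` with `average="macro"` offers the same computation if scikit-learn is available.

```python
import numpy as np

def macro_precision_recall_f1(y_true, y_pred, n_classes):
    """Per-class precision, recall, and F1, averaged over classes (macro)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)

    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # TP + FP for each class
    actual = cm.sum(axis=1)      # TP + FN for each class

    # A class with an empty denominator gets a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision.mean(), recall.mean(), f1.mean()

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
precision, recall, f1 = macro_precision_recall_f1(y_true, y_pred, n_classes=3)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Macro averaging is a reasonable default here because every speaker matters equally; micro or weighted averages are alternatives when classes are very imbalanced.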

In [None]:
# FIXME

#### Now build an embedding model and visualize embeddings

Follow the steps:

- Create a new model with the same architecture as your best model except it must output the features instead of the class.

- Use the pretrained weights of your best model

- Focus on 20 classes (there are more than 200, which is too many to visualize clearly)

- Apply PCA with 3 components

- Then make a 3D scatter plot of the embeddings with plotly
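
The last two steps can be sketched as follows. The embeddings here are random vectors standing in for the output of the embedding model (an assumption), PCA is written with a plain SVD to keep the example self-contained (`sklearn.decomposition.PCA(n_components=3)` is the usual shortcut), and the helper names are hypothetical. The plotting function only needs plotly when it is called.

```python
import numpy as np

def pca_project(embeddings, n_components=3):
    """Project embeddings onto their first principal components via SVD."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic stand-in for model embeddings: 200 vectors of size 64, 20 classes.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))
labels = rng.integers(0, 20, size=200)

points_3d = pca_project(embeddings, n_components=3)
print(points_3d.shape)

def plot_embeddings_3d(points, labels):
    """3D scatter plot colored by class (requires plotly)."""
    import plotly.express as px
    fig = px.scatter_3d(x=points[:, 0], y=points[:, 1], z=points[:, 2],
                        color=labels.astype(str))
    fig.show()
```

Calling `plot_embeddings_3d(points_3d, labels)` in the notebook displays the interactive plot. With a well-trained model, points from the same speaker should form visible clusters; with the random data used here they will not.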

In [None]:
# FIXME