<a href="https://colab.research.google.com/github/Vaibhavs10/ml-with-audio/blob/master/notebooks/session2/audio_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to Audio data

Hello! This quick notebook will show you how to:
* load audio data
* plot an audio's waveform
* create a spectrogram
* do quick automatic speech recognition

This notebook should take about 10 minutes to run.

In [None]:
!pip install transformers

In [None]:
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

from IPython.display import Audio

In [None]:
!wget https://cdn-media.huggingface.co/speech_samples/LibriSpeech_61-70968-0000.flac

In [None]:
sample = "/content/LibriSpeech_61-70968-0000.flac"

The following cell uses `Audio` from `IPython` to play audio directly in the
notebook.

In [None]:
Audio(sample)

[Librosa](https://librosa.org/doc/latest/index.html) is a widely used Python
library for audio analysis. It lets you easily load audio files, create
spectrograms, add effects, extract features, and much more! Let's plot the
waveform of our audio sample. A quick question to reflect on:

* When is the quietest moment?

In [None]:
# By default, librosa.load resamples to 22,050 Hz and mixes down to mono
y, sr = librosa.load(sample)

plt.plot(y);
plt.title('Signal');
plt.xlabel('Time (samples)');
plt.ylabel('Amplitude');

Librosa also has a `waveshow` function for the same thing (older librosa releases called it `waveplot`) :)

In [None]:
librosa.display.waveshow(y, sr=sr);

When we loaded the sample with librosa, we also got its sample rate.

In [None]:
sr

A sample rate of 22,050 Hz means we have 22,050 samples for each second of audio.
Let's see how many samples we have in total.

In [None]:
len(y)

Alright! So if we have 108,156 samples and divide that by the
sample rate, we should get the duration of the audio in seconds. Let's see if
that confirms our intuition.

In [None]:
len(y)/sr

Nice! It does!
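
The arithmetic above can be written out as a tiny helper. This is only a sketch: `duration_seconds` is a hypothetical name (not a librosa function), and the numbers are the ones quoted in this notebook.

```python
def duration_seconds(n_samples, sample_rate):
    """Length of a clip in seconds: number of samples / samples per second."""
    return n_samples / sample_rate

# Values quoted above: 108,156 samples at 22,050 Hz
clip_seconds = duration_seconds(108_156, 22_050)
print(f"{clip_seconds:.2f} s")
```

The same relation works in reverse: multiplying a duration in seconds by the sample rate tells you how many samples to expect.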

Alright. Let's go to the next thing. What happens if we use a different sample
rate when playing the audio back? Let's listen.

In [None]:
Audio(y, rate=sr*1.5)

In [None]:
Audio(y, rate=sr*0.75)

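The voice is distorted now! Interesting. Playing the same samples at a higher rate makes the audio faster and higher-pitched, while a lower rate makes it slower and lower-pitched.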
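
Here is a rough way to put numbers on what we just heard. It assumes playback simply reads the same samples faster or slower, so the duration scales by `1 / factor` and the pitch shifts by `12 * log2(factor)` semitones. `playback_effect` is a hypothetical helper written for this illustration.

```python
import math

def playback_effect(n_samples, original_sr, rate_factor):
    """Return (new duration in seconds, pitch shift in semitones)."""
    new_duration = n_samples / (original_sr * rate_factor)
    semitones = 12 * math.log2(rate_factor)
    return new_duration, semitones

# Same sample count and sample rate as quoted earlier in the notebook
for factor in (1.5, 0.75):
    seconds, shift = playback_effect(108_156, 22_050, factor)
    print(f"x{factor}: {seconds:.2f} s, {shift:+.2f} semitones")
```

A factor of 1.5 lands about seven semitones up, and 0.75 about five semitones down, which is why the voice no longer sounds like the same speaker.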

## Spectrograms

Cool, it's now time to build a spectrogram. We'll be using the Short-Time Fourier
Transform (STFT). Just as a reminder, the Fourier Transform (FT) decomposes a signal into its constituent
frequencies, but it does not tell us when those frequencies occur. Since the frequencies in our audio change over time,
the STFT divides the long signal into shorter segments of equal length and applies a Fourier transform to each segment.
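
To make the "split into segments, transform each one" idea concrete, here is a minimal NumPy sketch. It is not how `librosa.stft` is implemented: `simple_stft` and the `demo_*` names are hypothetical, the window is a plain Hann window, and frames are not padded or centered (librosa pads and centers frames by default and uses `n_fft=2048`).

```python
import numpy as np

def simple_stft(signal, n_fft=256, hop_length=64):
    """Slice the signal into overlapping frames and FFT each one."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies: 1 + n_fft // 2 bins
    return np.fft.rfft(frames, axis=1).T  # shape: (frequency bins, frames)

# One second of a 1,000 Hz tone sampled at 8,000 Hz
demo_sr = 8000
demo_t = np.arange(demo_sr) / demo_sr
demo_tone = np.sin(2 * np.pi * 1000 * demo_t)

demo_mag = np.abs(simple_stft(demo_tone))
peak_bin = int(demo_mag.mean(axis=1).argmax())
peak_hz = peak_bin * demo_sr / 256  # bin index * (sample rate / n_fft)
print(demo_mag.shape, peak_hz)
```

Each column of the result is the spectrum of one short segment, so reading the columns left to right shows how the frequency content evolves over time.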



Let's build a spectrogram then! Note that there are different types of
spectrograms and many parameters you can play with. Let's compute the Short-Time Fourier Transform using `librosa.stft` ([spec](http://librosa.org/doc/main/generated/librosa.stft.html)) and see what we get out of it.


In [None]:
spec = np.abs(librosa.stft(y))
librosa.display.specshow(spec, sr=sr, x_axis='time')
plt.colorbar(format='%+2.0f amplitude')
plt.title('Almost Spectrogram')

Well, I cannot see anything here. What is going on? The sounds we (humans) hear are concentrated in very small frequency and amplitude ranges, so plotting the raw data on linear scales is not very informative.

![e.jpg](https://miro.medium.com/max/527/1*nRrG3uXi3jj4MHBQXBL22g.png)

What we can do is transform the y axis to a log scale and convert the amplitude to decibels. Putting the data on logarithmic scales gives us a much more informative plot.
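
The decibel conversion itself is short enough to sketch by hand. The version below is a simplification under stated assumptions: `to_decibels` is a hypothetical helper, it measures everything relative to the largest value, and it assumes there are no zero amplitudes. `librosa.amplitude_to_db` additionally floors very small values (`amin`) and limits the dynamic range (`top_db`).

```python
import numpy as np

def to_decibels(amplitudes):
    """Convert amplitudes to dB relative to the largest value (which maps to 0 dB)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 20.0 * np.log10(amplitudes / amplitudes.max())

# Each step is 10x smaller in amplitude than the previous one
demo_amplitudes = [1.0, 0.1, 0.01, 0.001]
demo_db = to_decibels(demo_amplitudes)
print(demo_db)
```

Every factor of 10 in amplitude becomes a fixed 20 dB step, so values that were invisibly small on the linear plot end up evenly spread across the color scale.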

In [None]:
dec_spec = librosa.amplitude_to_db(spec, ref=np.max)
librosa.display.specshow(dec_spec, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')

Librosa allows you to create a mel-spectrogram in two ways using the `librosa.feature.melspectrogram` function:
* By providing the raw data, as we did before (you set the `y` param).
* By providing a pre-computed power spectrogram (you set the `S` param).
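
Before building one, it may help to see what the mel scale does to frequencies. The sketch below uses the HTK-style formula, which is one common definition; as far as I know librosa's default is a slightly different (Slaney-style) variant unless you pass `htk=True`, so treat the exact numbers as illustrative. `hz_to_mel` and `mel_to_hz` here are local helpers, not the librosa functions of the same name.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style conversion from Hz to mel."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz take up less and less room on the mel axis
for freq in (250, 500, 1000, 2000, 4000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel(freq):7.1f} mel")
```

The scale is roughly linear at low frequencies and compresses high ones, which mirrors how we perceive pitch differences.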

In [None]:
# Recent librosa versions require y to be passed as a keyword argument
sg = librosa.feature.melspectrogram(y=y, sr=sr)
db_spec = librosa.power_to_db(sg, ref=np.max)
librosa.display.specshow(db_spec, sr=sr, x_axis='time', y_axis='mel', fmax=8000)
plt.colorbar(format='%+2.0f dB')

In [None]:
# S must be a power spectrogram, so square the STFT magnitudes first
sg = librosa.feature.melspectrogram(S=spec**2, sr=sr)
db_spec = librosa.power_to_db(sg, ref=np.max)
librosa.display.specshow(db_spec, sr=sr, x_axis='time', y_axis='mel', fmax=8000)
plt.colorbar(format='%+2.0f dB')

## Automatic Speech Recognition demo

Let's quickly run ASR on our audio using the `pipeline` API from the `transformers` library. Loading the model the first time can be a bit slow, but it's much faster afterwards.

In [None]:
from transformers import pipeline

In [None]:
pipe = pipeline("automatic-speech-recognition")

In [None]:
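# A raw array is assumed to already be at the model's sample rate (16 kHz for
# the default speech model), so resample from 22,050 Hz before transcribing
pipe(librosa.resample(y, orig_sr=sr, target_sr=16000))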

Some useful resources worth checking if you want to reinforce this content:
* https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53
* https://towardsdatascience.com/getting-to-know-the-mel-spectrogram-31bca3e2d9d0
* https://ch.mathworks.com/matlabcentral/answers/387458-why-my-spectrogram-have-negative-values