# Audio Concepts, Transforms and Visualizations

Now that we have downloaded the dataset and parsed out the yes/no labels, let's learn a little more about audio data and its concepts by visualizing and transforming this dataset.

## Sample Rate

First, let's think about the digital representation of analog sound. How does sound get recorded anyway? Just like with images, we need to take our physical world and convert it to numbers, a digital representation that a computer can work with. For audio, a microphone captures the sound, which is then converted from analog to digital by sampling it at consistent intervals of time. The number of samples taken per second is called the `sample rate`. The higher the `sample rate`, the more of the original signal is preserved, and generally the higher the quality of the sound. Common sample rates are 44.1 kHz (CD audio) and 48 kHz, or 48,000 samples per second. This dataset was sampled at 16 kHz, so our sample rate is 16,000.
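To make the idea of sampling concrete, here is a minimal sketch that uses only the Python standard library. It samples a pure sine tone at evenly spaced points in time and shows that the number of samples is the sample rate multiplied by the duration. The helper name `sample_sine` and the 440 Hz tone are illustrative assumptions and are not part of this dataset.

```python
import math

def sample_sine(frequency_hz, sample_rate, duration_s):
    """Sample a sine wave at evenly spaced points in time (hypothetical helper)."""
    num_samples = int(sample_rate * duration_s)
    # Sample n is taken at time n / sample_rate seconds.
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate)
            for n in range(num_samples)]

# One second of a 440 Hz tone at two different sample rates.
one_second_16k = sample_sine(440, 16_000, 1.0)
one_second_48k = sample_sine(440, 48_000, 1.0)

print(len(one_second_16k))  # 16000
print(len(one_second_48k))  # 48000
```

The same second of sound takes three times as many numbers at 48 kHz as at 16 kHz, which is why the sample rate directly affects how much data the model has to process.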

With any machine learning dataset, we want to make the data as small as possible without losing model accuracy. We are going to keep this sample rate; however, you could experiment with reducing the sample rate of the audio to make the data and model smaller and see whether it affects the quality of the model.
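As a rough illustration of what reducing the sample rate does to the data size, the sketch below keeps every second sample of a one-second clip. This is naive decimation, chosen only for clarity: it applies no low-pass filter, so it can introduce aliasing on real audio. The `decimate` helper and the stand-in sample values are assumptions for illustration, not the resampling method used later in this notebook.

```python
def decimate(samples, factor):
    """Keep every `factor`-th sample (naive, no anti-aliasing filter)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return samples[::factor]

original_rate = 16_000
samples = list(range(original_rate))  # stand-in for one second of audio
factor = 2

reduced = decimate(samples, factor)
new_rate = original_rate // factor

print(new_rate, len(reduced))  # 8000 8000
```

Halving the sample rate halves the number of values per clip. Whether the model still performs well at the lower rate is something to check experimentally.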

Let's take a look at the dataset, visualize it, and transform it.


In [None]:
import torch
import torchaudio  # needed for the transforms used in the cells below
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, models, transforms
import pandas as pd


In [None]:
yes_waveform = trainset_speechcommands_yes[0][0]
yes_sample_rate = trainset_speechcommands_yes[0][1]
print(f'Waveform: {yes_waveform}')
print(f'Sample Rate: {yes_sample_rate}')
print(f'Label: {trainset_speechcommands_yes[0][2]}')
print(f'ID: {trainset_speechcommands_yes[0][3]}')
#print(f'Something: {trainset_speechcommands_yes[0][4]}')

no_waveform = trainset_speechcommands_no[0][0]
no_sample_rate = trainset_speechcommands_no[0][1]
print(f'Waveform: {no_waveform}')
print(f'Sample Rate: {no_sample_rate}')
print(f'Label: {trainset_speechcommands_no[0][2]}')
print(f'ID: {trainset_speechcommands_no[0][3]}')
#print(f'Something: {trainset_speechcommands_no[0][4]}')

## Waveform



In [None]:
def show_waveform(waveform, sample_rate, label):
    print("Waveform: {}\nSample rate: {}\nLabels: {}".format(waveform, sample_rate, label))
    # Resample expects integer frequencies, so use integer division
    new_sample_rate = sample_rate // 10
    print(new_sample_rate)
    # Resample only the first channel, reshaped to (1, num_samples)
    channel = 0
    waveform_transformed = torchaudio.transforms.Resample(sample_rate, new_sample_rate)(waveform[channel,:].view(1,-1))

    print("Shape of transformed waveform: {}".format(waveform_transformed.size()))

    plt.figure()
    plt.title(label)
    plt.plot(waveform_transformed[0,:].numpy())

In [None]:
show_waveform(yes_waveform, yes_sample_rate, 'yes')

## Spectrogram



In [None]:
def show_spectrogram(waveform):
    spectrogram = torchaudio.transforms.Spectrogram()(waveform)
    #print(spectrogram)
    print("Shape of spectrogram: {}".format(spectrogram.size()))

    plt.figure()
    plt.imshow(spectrogram.log2()[0,:,:].numpy(), cmap='gray')
    #plt.imsave(f'test/spectrogram_img.png', spectrogram.log2()[0,:,:].numpy(), cmap='gray')
    

In [None]:
show_spectrogram(yes_waveform)
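The shape printed by `show_spectrogram` can be predicted from the transform's window settings. The sketch below assumes the defaults that `torchaudio.transforms.Spectrogram()` is understood to use (`n_fft=400`, a hop of half the window, and centered padding). Treat those values as assumptions and compare the result with the shape printed above. The helper name `expected_spectrogram_shape` is hypothetical.

```python
def expected_spectrogram_shape(num_samples, n_fft=400, hop_length=None, channels=1):
    """Estimate (channels, frequency bins, time frames) for a centered STFT."""
    if hop_length is None:
        hop_length = n_fft // 2          # assumed default: half the window
    freq_bins = n_fft // 2 + 1           # one-sided spectrum
    frames = 1 + num_samples // hop_length  # centered (padded) framing
    return (channels, freq_bins, frames)

# A one-second clip at 16 kHz has 16,000 samples.
print(expected_spectrogram_shape(16_000))  # (1, 201, 81)
```

Clips shorter than one second produce fewer time frames, which matters later if the spectrograms need to share a common size.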

## Mel Spectrogram

In [None]:
def show_melspectrogram(waveform, sample_rate):
    # Pass the sample rate explicitly instead of relying on the transform's default
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate)(waveform)
    print("Shape of mel spectrogram: {}".format(mel_spectrogram.size()))

    plt.figure()
    plt.imshow(mel_spectrogram.log2()[0,:,:].numpy(), cmap='gray')
    #plt.imsave(f'test/melspectrogram_img.png', mel_spectrogram.log2()[0,:,:].numpy(), cmap='gray')

In [None]:
show_melspectrogram(yes_waveform, yes_sample_rate)
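A mel spectrogram differs from a plain spectrogram in that its frequency axis is warped onto the mel scale, which spaces low frequencies further apart than high ones. The sketch below uses one common convention for that mapping, the HTK-style formula; other conventions exist, so treat the exact constants as an assumption rather than a description of what any particular library computes. The function names are illustrative.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the mel axis as frequency increases.
for hz in (250, 1000, 4000, 8000):
    print(f"{hz:>5} Hz -> {hz_to_mel(hz):7.1f} mel")
```

Doubling the frequency from 4,000 Hz to 8,000 Hz adds far fewer mels than going from 0 Hz to 4,000 Hz, so higher frequencies are compressed.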

## MFCC

In [None]:
def show_mfcc(waveform, sample_rate):
    mfcc = torchaudio.transforms.MFCC(sample_rate=sample_rate)(waveform)
    print("Shape of MFCC: {}".format(mfcc.size()))

    plt.figure()
    # MFCC values can be negative, so plot them directly; log2 would produce NaNs
    plt.imshow(mfcc[0,:,:].numpy(), cmap='gray')
    #plt.imsave(f'test/mfcc_img.png', mfcc[0,:,:].numpy(), cmap='gray')

In [None]:
show_mfcc(yes_waveform, yes_sample_rate)