# Understanding Audio Data

Before we build our model, we first need a better understanding of our audio data. Here we will look at some key concepts and features of audio data, and at how we can visualize and transform it.

# torchaudio

TODO

In [1]:
# import the packages
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchaudio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, models, transforms
import IPython.display as ipd



# What is the audio backend?

The audio backend provides I/O functions to load, save, and play audio files. If you are running this notebook on Windows, you need to switch the audio backend to `soundfile`, because the default `sox` audio backend is not supported on Windows. If you are running on Linux or Mac, you can skip this step.


In [2]:
torchaudio.set_audio_backend('soundfile')
str(torchaudio.get_audio_backend())

'soundfile'

# Download and Create Dataset

First, let's download the PyTorch Speech Commands dataset and parse out the yes/no data before we jump into the key concepts and terms that help us understand and work with audio data.

After the dataset is downloaded, we will list all the classes available in the dataset and loop through it to create the yes and no collections.

In [3]:
# The Speech Commands dataset contains many different words; we are going to grab "yes" and "no"
trainset_speechcommands = torchaudio.datasets.SPEECHCOMMANDS('./data', download=True)
#figure out how to parse out classes in data loader? except folders?
train_speechcommands_loader = torch.utils.data.DataLoader(trainset_speechcommands)


In [8]:
# get current directory and save as default
default_dir = os.getcwd() 
print(default_dir)

os.chdir('./data/SpeechCommands/speech_commands_v0.02/')
labels = [name for name in os.listdir('.') if os.path.isdir(name)]
# back to default directory
os.chdir(default_dir)
print(labels)

C:\code\pytorchfundamentals\4_audio
['backward', 'bed', 'bird', 'cat', 'dog', 'down', 'eight', 'five', 'follow', 'forward', 'four', 'go', 'happy', 'house', 'learn', 'left', 'marvin', 'nine', 'no', 'off', 'on', 'one', 'right', 'seven', 'sheila', 'six', 'stop', 'three', 'tree', 'two', 'up', 'visual', 'wow', 'yes', 'zero', '_background_noise_']


In [10]:
# TODO: remove this and just visualize from the main downloaded set then in the next section
# when the images are created loop thru and only create images for the yes no labels? Need to figure out how to 
# remove this processing time here

trainset_speechcommands_no = []
trainset_speechcommands_yes = []

# split out yes and no data for training; data[2] is the label of each item
for data in trainset_speechcommands:
    if data[2] == 'yes':
        trainset_speechcommands_yes.append(data)
    elif data[2] == 'no':
        trainset_speechcommands_no.append(data)
        
print(len(trainset_speechcommands_yes))
print(len(trainset_speechcommands_no))

# Audio Concepts, Transforms and Visualizations

Now that we have our dataset downloaded and the yes/no labels parsed out, let's learn a little more about audio data and its concepts by visualizing and transforming this dataset.

## Sample Rate

First, let's think about the digital representation of analog sound. How does sound get recorded anyway?! Just like with images, we need to take our physical world and convert it to numbers, a digital representation that a computer can understand. For audio, a microphone captures the sound, which is then converted from analog to digital by sampling it at consistent intervals of time. The number of samples taken per second is called the `sample rate`. The higher the `sample rate`, the more detail of the original sound is captured, at the cost of more data. A common sample rate is 48 kHz, or 48,000 samples per second. This dataset was sampled at 16 kHz, so our sample rate is 16,000.
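To make the idea of sampling at consistent intervals concrete, here is a minimal sketch that uses only the Python standard library. The helper `sample_sine` and the 440 Hz tone are illustrative assumptions, not part of the dataset or of torchaudio. It shows that one second of sound becomes 16,000 numbers at 16 kHz and 48,000 numbers at 48 kHz.

```python
import math

def sample_sine(freq_hz, sample_rate, duration_s):
    """Sample a sine tone once every 1/sample_rate seconds (hypothetical helper)."""
    num_samples = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(num_samples)]

# One second of the same 440 Hz tone at two different sample rates.
tone_16k = sample_sine(440, 16_000, 1.0)
tone_48k = sample_sine(440, 48_000, 1.0)

print(len(tone_16k))  # number of samples in one second at 16 kHz
print(len(tone_48k))  # number of samples in one second at 48 kHz
```

The number of samples is simply the sample rate multiplied by the duration in seconds, which is why a lower sample rate gives a smaller input for the model.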

With any machine learning dataset, we want to make the data as small as possible without losing model accuracy. We are going to keep this sample rate; however, you could experiment with reducing the sample rate of the audio to make the model smaller and see whether it affects the quality of the model.



In [None]:
print(f'Waveform: {trainset_speechcommands_yes[0][0]}')
print(f'Sample Rate: {trainset_speechcommands_yes[0][1]}')
print(f'Label: {trainset_speechcommands_yes[0][2]}')
print(f'Speaker ID: {trainset_speechcommands_yes[0][3]}')
#print(f'Utterance number: {trainset_speechcommands_yes[0][4]}')

print(f'Waveform: {trainset_speechcommands_no[0][0]}')
print(f'Sample Rate: {trainset_speechcommands_no[0][1]}')
print(f'Label: {trainset_speechcommands_no[0][2]}')
print(f'Speaker ID: {trainset_speechcommands_no[0][3]}')
#print(f'Utterance number: {trainset_speechcommands_no[0][4]}')

## WAVE file

You have most likely used a wave file before and understand that it is a format in which we save the digital representation of our analog audio so it can be shared and played. The Speech Commands dataset that we are using for this tutorial is stored in wave files that are all one second or less.



In [None]:
waveform, sample_rate = trainset_speechcommands_yes[0][:2]  # first "yes" clip
ipd.Audio(waveform.numpy(), rate=sample_rate)
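To see what a wave file actually stores, here is a small sketch using Python's built-in `wave` module. It writes a synthetic one-second tone to a temporary file and reads the header back. The file name `tone.wav`, the 440 Hz tone, and the 16-bit mono format are assumptions made for illustration; they are not taken from the dataset files.

```python
import math
import os
import struct
import tempfile
import wave

sample_rate = 16_000

# One second of a 440 Hz tone, scaled to 16-bit signed integers.
samples = [int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / sample_rate))
           for n in range(sample_rate)]

# Hypothetical file name inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tone.wav")

with wave.open(path, "wb") as wav_out:
    wav_out.setnchannels(1)            # mono
    wav_out.setsampwidth(2)            # 2 bytes = 16 bits per sample
    wav_out.setframerate(sample_rate)  # stored in the file header
    wav_out.writeframes(struct.pack(f"<{len(samples)}h", *samples))

with wave.open(path, "rb") as wav_in:
    channels = wav_in.getnchannels()
    rate = wav_in.getframerate()
    num_frames = wav_in.getnframes()

duration_s = num_frames / rate
print(channels, rate, num_frames, duration_s)
```

The header carries the sample rate and sample width, so any player or library can turn the stored numbers back into sound without extra information.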

## Waveform

## Spectrogram

## Mel Spectrogram

## MFCC