# Audio

In this section, we will learn how to use representations of audio data in machine learning.

## Audio data

Audio files can be represented in several ways. The most common is the waveform: a time series giving the amplitude of the sound wave at evenly spaced points in time. For a mono recording, the waveform is a one-dimensional array of numbers. The sampling rate is the number of samples per second, so a clip of N samples recorded at sampling rate sr lasts N / sr seconds.

| Sampling rate | Quality |
|---------------|---------|
| 8 kHz         | Telephone call |
| 44.1 kHz      | Music CD |
| 48 kHz        | DVD |
| 96 kHz        | Studio quality |
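To make the relationship between samples, sampling rate, and duration concrete, here is a minimal sketch that builds a synthetic waveform with NumPy. The 440 Hz tone, the 0.5 amplitude, and the half-second duration are illustrative choices, not values from a real recording.

```python
import numpy as np

sampling_rate = 8000   # samples per second (telephone quality in the table above)
duration = 0.5         # seconds (illustrative)
frequency = 440.0      # Hz (illustrative)

# Time stamp of every sample: 0, 1/sr, 2/sr, ...
t = np.arange(int(sampling_rate * duration)) / sampling_rate

# The waveform is a one-dimensional array of amplitudes.
waveform = 0.5 * np.sin(2 * np.pi * frequency * t)

print(waveform.shape)                  # number of samples
print(len(waveform) / sampling_rate)   # duration in seconds
```

Doubling the sampling rate doubles the number of samples needed to store the same half second of audio.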

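To load an audio file, we can use the `librosa` library. The `librosa.load` function returns the waveform as a NumPy array together with the sampling rate. By default it resamples the audio to 22,050 Hz and mixes it down to mono; pass `sr=None` to keep the file's original sampling rate.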

```python
import librosa
waveform, sampling_rate = librosa.load('audio.wav')
```

A dataset of audio files is available at https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/data.

## Spectrogram

The spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. It is a two-dimensional array of numbers. The x-axis represents time, the y-axis represents frequency, and the color represents the amplitude of the frequency at that time.

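The spectrogram can be computed with the short-time Fourier transform (STFT). The `librosa.stft` function returns a complex-valued matrix; taking its absolute value with `np.abs` gives the amplitude of each frequency bin in each frame. The `librosa.amplitude_to_db` function then converts the amplitude to decibels.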

## Mel spectrogram

The mel spectrogram is a spectrogram where the frequencies are converted to the mel scale. The mel scale is a scale of pitches judged by listeners to be equal in distance from one another. The mel spectrogram is a two-dimensional array of numbers. The x-axis represents time, the y-axis represents mel frequency, and the color represents the amplitude of the frequency at that time.

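The mel spectrogram can be computed using the `librosa.feature.melspectrogram` function. By default it returns power (squared amplitude) values, which the `librosa.power_to_db` function converts to decibels.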

## MFCC

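The mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. The mel-frequency cepstral coefficients (MFCCs) are the coefficients that make up this representation. For a single frame, the MFCCs form a one-dimensional array of numbers; over a whole clip, they form a two-dimensional array with one row per coefficient and one column per frame.

The MFCCs can be computed using the `librosa.feature.mfcc` function, which returns an array of shape `(n_mfcc, n_frames)`.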

## Chromagram

The chromagram is a representation of how the energy of a sound is distributed across the twelve pitch classes (C, C#, D, ..., B) over time. Energy from all octaves is folded together, so a note contributes to the same pitch class whichever octave it is played in. The chromagram is a two-dimensional array of numbers. The x-axis represents time, the y-axis represents pitch class, and the color represents the amplitude of the pitch class at that time.

The chromagram can be computed using the `librosa.feature.chroma_stft` function.

## Chroma vector

The chroma vector is a single column of the chromagram: the energy in each of the twelve pitch classes for one frame. The chroma vector is a one-dimensional array of twelve numbers.

The `librosa.feature.chroma_stft` function returns the full chromagram, so a chroma vector is obtained by selecting one column, or by averaging over time to summarize a whole clip.

## Chroma deviation

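The chroma deviation is usually taken to be the standard deviation of the twelve chroma coefficients within a frame. It is low when energy is spread evenly across the pitch classes and higher when a few pitch classes dominate. Computed for every frame, the chroma deviation is a one-dimensional array with one value per frame.

`librosa` has no dedicated function for the chroma deviation; it can be derived from the output of `librosa.feature.chroma_stft`, for example with `np.std` along the pitch-class axis.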
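The following sketch assumes the chroma deviation is the standard deviation of the twelve chroma coefficients in each frame. It uses a small hand-made chromagram rather than real audio, so the numbers are purely illustrative; with real data, `chroma` would be the output of `librosa.feature.chroma_stft`.

```python
import numpy as np

# Toy chromagram with 12 pitch classes (rows) and 3 frames (columns).
chroma = np.column_stack([
    np.full(12, 1.0),                                            # energy spread evenly
    np.eye(12)[9],                                               # a single pitch class (A)
    np.eye(12)[0] + 0.5 * np.eye(12)[4] + 0.5 * np.eye(12)[7],   # a C major-like frame
])

# Standard deviation across pitch classes, one value per frame.
deviation = chroma.std(axis=0)
print(deviation.shape)
print(deviation)
```

The flat frame has zero deviation, while frames dominated by a few pitch classes have a larger one.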

## Chroma distance

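The chroma distance is not a standardized feature. Here it is taken to mean a distance between chroma vectors, for example between consecutive frames of one recording or between the time-averaged chroma vectors of two recordings. Computed between consecutive frames, the chroma distance is a one-dimensional array of numbers.

`librosa` has no dedicated function for the chroma distance; it can be derived from the output of `librosa.feature.chroma_stft`, for example as a Euclidean or cosine distance.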
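One possible reading of a chroma distance is the cosine distance between two chroma vectors. The helper `chroma_distance` below is hypothetical (it is not part of `librosa`), and the two hand-made vectors stand in for chroma vectors computed from real audio.

```python
import numpy as np

def chroma_distance(a, b):
    """Cosine distance between two chroma vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("chroma vectors must be non-zero")
    return 1.0 - float(np.dot(a, b) / denom)

# Idealized chroma vectors for two chords (index 0 is C, index 9 is A).
a_major = np.zeros(12)
a_major[[9, 1, 4]] = 1.0   # A, C#, E
c_major = np.zeros(12)
c_major[[0, 4, 7]] = 1.0   # C, E, G

print(chroma_distance(a_major, a_major))   # identical vectors
print(chroma_distance(a_major, c_major))   # chords sharing one note (E)
```

Cosine distance ignores overall loudness, which is often desirable when comparing harmonic content; Euclidean distance would be a reasonable alternative when the vectors are already normalized.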

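As a concrete example, the cell below loads one clip from the dataset linked above:

```python
import librosa

# Load one clip of the word "bed"; the path assumes the dataset
# was extracted under data/.
waveform, sampling_rate = librosa.load('data/train/audio/bed/00176480_nohash_0.wav')

# In a notebook, the last expression in a cell is displayed.
waveform
```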
