# Intro to Digital Audio

To work with audio signals on a computer, we need to convert a continuous sound wave into a series of discrete values.

This notebook will walk you through the fundamentals of how audio is represented in Python, how to create a custom dataset of music using yt-dlp/musicdl, and how to visualize the waveform of an audio file.

<a href="https://colab.research.google.com/github/MichiganDataScienceTeam/F25-Shazam/blob/main/notebooks/week1_audio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pip install pandas numpy matplotlib librosa scipy yt-dlp pyaudio

In [None]:
import matplotlib.pyplot as plt
import IPython.display as ipd
import pandas as pd
import numpy as np
import librosa
import scipy.signal  # import the submodule explicitly; scipy.signal is used for the spectrograms below


## What is a sound wave?

A sound wave is created by the vibration of air molecules. Two properties describe it:

- **frequency**: number of times per second that air particles vibrate back and forth, measured in hertz (Hz)
    - We perceive this as pitch
- **amplitude**: maximum distance that air particles are displaced from rest as the sound wave passes
    - We perceive this as loudness, measured in decibels (dB)

We perceive frequency and amplitude **logarithmically**: as frequency or amplitude increases, it takes a larger change in that quantity to produce the same perceived change in pitch or loudness.
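
As a rough illustration of this logarithmic scaling, the sketch below compares pitch distances in semitones and level differences in decibels. The helper names `semitones_between` and `amplitude_ratio_to_db` are made up for this example; the formulas are the standard $12\log_2$ and $20\log_{10}$ ratios.

```python
import math


def semitones_between(f1_hz: float, f2_hz: float) -> float:
    """Pitch distance in semitones (12 semitones per octave). Hypothetical helper."""
    return 12 * math.log2(f2_hz / f1_hz)


def amplitude_ratio_to_db(a1: float, a2: float) -> float:
    """Level difference in decibels between two amplitudes. Hypothetical helper."""
    return 20 * math.log10(a2 / a1)


# Both pairs are one octave apart, even though the gaps in Hz differ (110 vs 440).
print(semitones_between(110, 220))  # 12.0
print(semitones_between(440, 880))  # 12.0

# Doubling the amplitude adds about 6 dB, whatever the starting level.
print(round(amplitude_ratio_to_db(0.1, 0.2), 2))  # 6.02
print(round(amplitude_ratio_to_db(0.5, 1.0), 2))  # 6.02
```

What matters is the ratio between two values, not their difference.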

![metadata](./asset/sample_metadata.png)

Let's break down some common aspects of an audio file:

- **sample rate** - number of samples per second (Hz). The sampling rate determines the time resolution of our representation.
    - Time is digitized using the sample rate
    - The time in seconds between consecutive samples is $1/\mathrm{sr}$, called the sampling interval
- **bit depth** - The number of bits used to store each sample. The larger the bit depth, the more precision we have to store amplitude information (amplitude resolution).
    - Amplitude is digitized using bit depth
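
To make these two quantities concrete, here is a small sketch using assumed example values (44,100 Hz and 16 bits, the settings of CD audio). It computes the sampling interval and the number of distinct amplitude values a sample can take.

```python
sample_rate_hz = 44_100  # assumed example value (CD audio)
bit_depth = 16           # assumed example value (CD audio)

sampling_interval_s = 1 / sample_rate_hz  # time between consecutive samples
n_levels = 2 ** bit_depth                 # distinct amplitude values per sample

print(f"sampling interval: {sampling_interval_s * 1e6:.2f} microseconds")  # 22.68
print(f"amplitude levels:  {n_levels}")  # 65536
```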
    
<br>

When working with MP3 files, another useful piece of information is the bit rate: the data transfer rate, expressed in bits per second.
- $\mathrm{bit\ depth} \neq \mathrm{bit\ rate}$
- For uncompressed audio (e.g. WAV), $\mathrm{bit\ rate} = \mathrm{sr}\times\mathrm{bit\ depth}\times\textrm{number\ of\ channels}$
- MP3 files lose bit depth information during the lossy compression process, so their bit rate is set by the encoder rather than by this formula
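
As a worked example of the formula for uncompressed audio, the sketch below computes the bit rate of CD-quality stereo and compares it with a 320 kbps MP3. The function name `pcm_bit_rate` and the chosen values are assumptions for illustration only.

```python
def pcm_bit_rate(sample_rate_hz: int, bit_depth: int, n_channels: int) -> int:
    """Bits per second of uncompressed (PCM) audio. Hypothetical helper."""
    return sample_rate_hz * bit_depth * n_channels


cd_bps = pcm_bit_rate(44_100, 16, 2)  # assumed example: CD-quality stereo
mp3_bps = 320_000                     # assumed example: a 320 kbps MP3

print(cd_bps)                      # 1411200
print(round(cd_bps / mp3_bps, 2))  # 4.41
```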

## Digital Waveform Representation

![waveform](./asset/digital_waveform.jpg)

Modify the code below to try out different values of `freq_Hz` and `sample_rate_Hz`

In [None]:
# frequency of the audio signal
freq_Hz = 5

# rate at which the audio signal is being sampled
sample_rate_Hz = 10

x_source = np.linspace(0, 1, 1000, endpoint=False)
y_source = np.cos(freq_Hz * 2*np.pi*x_source)

x_samples = np.linspace(0, 1, sample_rate_Hz, endpoint=False)
y_samples = np.cos(freq_Hz * 2*np.pi*x_samples)

plt.plot(x_source, y_source)
plt.scatter(x_samples, y_samples, color="red", s=50)
plt.xlabel("Time (s)")
plt.grid()


## How many samples do we need?

The samples inside a wav file are used to reconstruct the original audio as a continuous waveform to be played back for listening.

**Nyquist-Shannon sampling theorem**: If a signal contains no frequencies higher than $f_\mathrm{max}$, then the signal can be perfectly reconstructed when sampled at a rate $\mathrm{sr} > 2f_\mathrm{max}$. In other words, the maximum reconstructable frequency is strictly less than $\mathrm{sr}/2$.

When the signal contains a frequency higher than $\mathrm{sr}/2$, that frequency instead appears in the reconstructed signal at a lower frequency than the original. This effect is called **aliasing**, and is usually undesirable.
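
One way to predict where an aliased frequency lands is to fold it back into the range $[0, \mathrm{sr}/2]$. The sketch below does this for a cosine; `alias_frequency` is a hypothetical helper written for this example, not a library function.

```python
import numpy as np


def alias_frequency(freq_hz: float, sample_rate_hz: float) -> float:
    """Frequency in [0, sr/2] whose cosine has the same samples as freq_hz."""
    folded = freq_hz % sample_rate_hz            # shift by a whole multiple of sr
    return min(folded, sample_rate_hz - folded)  # reflect around sr/2


print(alias_frequency(7, 5))   # 2
print(alias_frequency(3, 10))  # 3 (below sr/2, so unchanged)
print(alias_frequency(6, 10))  # 4

# Check: a 7 Hz and a 2 Hz cosine sampled at 5 Hz give identical samples.
n = np.arange(5)  # sample indices covering one second at sr = 5 Hz
same = np.allclose(np.cos(2 * np.pi * 7 * n / 5), np.cos(2 * np.pi * 2 * n / 5))
print(same)  # True
```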

**Question**: In order to cover the human hearing range of 20 Hz to 20 kHz, what is the minimum sampling rate required?

In [None]:
# sampling_rate_required = ???

In [None]:
sample_rate_Hz = 5
freq_Hz = 7

# The Nyquist frequency is defined as sr/2
# Any sinusoid with freq > sr/2 has an alias with freq < sr/2
#     alias: sinusoid indistinguishable by sampling alone (same samples)
k: int = 1  # any integer
freq_alias_wave_Hz = np.abs(freq_Hz - k*sample_rate_Hz)

x_source = np.linspace(0, 1, 1000, endpoint=False)
y_source = np.cos(freq_Hz * 2*np.pi*x_source)

x_reconstructed = np.linspace(0, 1, 1000, endpoint=False)
y_reconstructed = np.cos(min(freq_alias_wave_Hz, freq_Hz) * 2*np.pi*x_reconstructed)

x_samples = np.linspace(0, 1, sample_rate_Hz, endpoint=False)
y_samples = np.cos(freq_Hz * 2*np.pi*x_samples)

plt.plot(x_source, y_source)
plt.plot(x_reconstructed, y_reconstructed, color="red", linestyle="--")
plt.scatter(x_samples, y_samples, color="red", s=50)
plt.xlabel("Time (s)")
plt.grid()

## Data collection

To create a database of songs that our Shazam clone will be able to recognize, we can use the [yt-dlp](https://github.com/yt-dlp/yt-dlp) Python package to download audio files directly from YouTube.

In [None]:
import yt_dlp

# add any youtube video url here, 
# copied from browser address bar
youtube_url = ""

yt_audio_path = "yt_sample.wav"

ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': "yt_sample.%(ext)s",  # output file
    # the audio-extraction postprocessor requires ffmpeg to be installed
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'wav',     # save as wav file
    }],
    #'cookiefile': 'cookies.txt',
}

# only download once a url has been filled in above
if youtube_url:
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_url])



## musicdl: helper script for downloading audio

For convenience, we've created a yt-dlp wrapper program that accepts URLs from either YouTube or Spotify and downloads the corresponding audio files to a specified folder.

In [None]:
!git clone --depth 1 https://github.com/dennisfarmer/musicdl.git
%pip install -e ./musicdl
from musicdl.yt import YoutubeDownloader

In [None]:
help(YoutubeDownloader.download)

In [None]:
from pprint import pprint

ydl = YoutubeDownloader(
    audio_directory="./tracks", 
    audio_format="wav"
)

youtube_urls = [
    "https://www.youtube.com/watch?v=TqxfdNm4gZQ"
]

tracks_info = ydl.download(youtube_urls)
pprint(tracks_info)

# write to csv
tracks_csv = "./tracks/tracks_info.csv"
tracks_df = pd.DataFrame(tracks_info)
tracks_df.to_csv(tracks_csv, index=False)

# read from csv
#with open(tracks_csv, "r") as f:
    #tracks_df = pd.read_csv(f)
    #tracks_list = list(tracks_df.to_dict(orient="records"))


## Visualizing the amplitude and sample rate of an audio file

[`librosa.load()` documentation](https://librosa.org/doc/0.11.0/generated/librosa.load.html#librosa.load)

In [None]:
import librosa

#audio_path = yt_audio_path
audio_path = "sample.wav"
ipd.Audio(audio_path)

# Use librosa.load() to read file at audio_path
# default is to convert to mono (1 channel)
# audio, sr = ???
#print(f"sample rate = {sr} Hz")

#plt.figure(figsize=(12, 4))
#plt.plot(audio, lw=0.1)  # interpolated waveform
#plt.scatter(np.arange(len(audio)), audio, s=0.005, color="red")  # individual samples
#plt.title("Audio Waveform")
#plt.xlabel("Sample Index")
#plt.ylabel("Amplitude")
#plt.show()

#print("audio is digitally represented as a numpy.ndarray:")
#audio

## Downsampling

Conceptually, downsampling (reducing the sample rate) performs the following process:
1. applies a low-pass anti-aliasing filter to remove higher frequency components
    - low-pass: keep signals with frequency lower than a specified cutoff frequency
2. keeps every $n$th sample, using a sample step size of $n = \mathrm{sr_{old}}/\mathrm{sr_{new}}$
    - the details of this process are not super important to know
    - if $\mathrm{sr_{old}}/\mathrm{sr_{new}}$ is not an integer, the signal is first upsampled by a factor of $\mathrm{sr_{new}}$ via interpolation, then downsampled with step size $\mathrm{sr_{old}}$ (in practice, both factors are first divided by their greatest common divisor)
    - upsampling by a factor of $\mathrm{sr_{new}}$: $(\mathrm{sr_{new}}-1)$ zeros are inserted between consecutive original samples, then the signal is interpolated with a filter to smooth out the discontinuities (replacing the zeros)
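
The upsample-then-downsample idea can be sketched with `scipy.signal.resample_poly`, which takes integer `up` and `down` factors and applies a low-pass filter internally. This is only an illustration on a synthetic tone under assumed sample rates; it is not necessarily the method librosa uses when resampling.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

sr_old, sr_new = 48_000, 11_025  # assumed example rates

# Reduce the ratio sr_new / sr_old to lowest terms.
g = gcd(sr_old, sr_new)
up, down = sr_new // g, sr_old // g
print(up, down)  # 147 640

# One second of a synthetic 440 Hz tone at the old sample rate.
t = np.arange(sr_old) / sr_old
audio_old = np.sin(2 * np.pi * 440 * t)

# Upsample by `up`, low-pass filter, then keep every `down`-th sample.
audio_new = resample_poly(audio_old, up, down)
print(len(audio_old), len(audio_new))  # 48000 11025
```

One second of audio still lasts one second after resampling; only the number of samples that represent it changes.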

**Question**: In the low pass filter step, what would the cutoff frequency be if we resampled to 11,025 Hz?

In [None]:
# cutoff_frequency = ???

In [None]:
# Resample from 48,000 Hz to 11,025 Hz
# use librosa.load()
sr = 11_025

# Question: how would you plot just the first 10 seconds of an audio file?
# Hint: use sample rate

# YOUR CODE HERE

# Task for today:

1) download the audio from any YouTube video as a .wav file
2) downsample the file to 44.1 kHz
3) crop the audio to any 5 second segment
4) visualize the waveform

In [None]:
# YOUR CODE HERE

---
# Next week: visualizing frequency using a spectrogram

In [None]:
import IPython.display as ipd
audio_path = "log_scale_perception.wav"
ipd.Audio(audio_path)

## Method 1: Librosa

In [None]:
audio, sr = librosa.load(audio_path, sr=None) 

# parameters of the short-time Fourier transform:
# (algorithm that creates the spectrogram)
win_length = 2**11  # number of samples in each window
n_fft = win_length
hop_length = win_length // 4
window = scipy.signal.get_window("triang", Nx=win_length)

S = librosa.stft(audio,
                 n_fft=n_fft, hop_length=hop_length,
                 win_length=win_length, window=window)
S_magnitude = np.abs(S)  # |a+bi| = sqrt(a^2 + b^2)
S_db = librosa.amplitude_to_db(S_magnitude, ref=np.max)

# extent maps the image edges to seconds and Hz; without it,
# imshow labels the axes with frame and frequency-bin indices
extent = [0, len(audio) / sr, 0, sr / 2]
im = plt.imshow(S_db, cmap="inferno", aspect="auto", origin="lower", extent=extent)
plt.colorbar(im, format="%+2.0f dB")
plt.xlabel("Time (sec)")
plt.ylabel("Frequency (Hz)")
plt.show()

## Method 2: Scipy (what we'll use)
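
As a small self-contained check of how the frequency and time vectors returned by `scipy.signal.stft` line up with the rows and columns of the STFT matrix, the sketch below analyzes a synthetic tone. The 440 Hz tone and the 8,000 Hz sample rate are assumed example values.

```python
import numpy as np
import scipy.signal

sr = 8_000     # assumed example sample rate
tone_hz = 440  # assumed example tone
t = np.arange(2 * sr) / sr  # two seconds of sample times
audio = np.sin(2 * np.pi * tone_hz * t)

nperseg = 2**11  # number of samples in each window
freqs, times, Z = scipy.signal.stft(
    audio, fs=sr, window="hann",
    nperseg=nperseg, noverlap=nperseg - nperseg // 4,
)

# Rows correspond to frequencies, columns to times.
print(Z.shape == (len(freqs), len(times)))  # True
print(len(freqs))                           # 1025 (nperseg // 2 + 1)

# The row with the most energy should sit within one bin of the tone.
peak_bin = np.abs(Z).mean(axis=1).argmax()
print(f"bin spacing: {freqs[1] - freqs[0]:.2f} Hz")
print(f"peak near:   {freqs[peak_bin]:.1f} Hz")
```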

In [None]:
audio, sr = librosa.load(audio_path, sr=None) 

# parameters of the short-time Fourier transform:
# (algorithm that creates the spectrogram)
nperseg = win_length = 2**11  # number of samples in each window
nfft = n_fft = win_length
hop_length = win_length // 4
window = scipy.signal.get_window("triang", Nx=win_length)

# scipy.signal.stft also uses the sample rate to output 
# frequency (in Hz) and time (in seconds) vectors,
# corresponding to the rows and columns of the stft matrix 
# in "s_scipy"
fs=sr 
noverlap = nperseg - hop_length

# pass the window array defined above (same triangular window as Method 1)
freq_scipy, time_scipy, s_scipy = scipy.signal.stft(
    audio, 
    fs=fs, window=window, nfft=nfft, 
    nperseg=nperseg, noverlap=noverlap
)

print(f"freq vector shape: {freq_scipy.shape}")
print(f"time vector shape: {time_scipy.shape}")
print(f"stft matrix shape: {s_scipy.shape}")

s_scipy_db = librosa.amplitude_to_db(np.abs(s_scipy), ref=np.max)

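# use the time and frequency vectors from scipy to label the axes in real units
extent = [time_scipy[0], time_scipy[-1], freq_scipy[0], freq_scipy[-1]]
im = plt.imshow(s_scipy_db, cmap="inferno", aspect="auto", origin="lower", extent=extent)
plt.colorbar(im, format="%+2.0f dB")
plt.xlabel("Time (sec)")
plt.ylabel("Frequency (Hz)")
plt.show()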

## Recording from Microphone

In [None]:
import scipy.io.wavfile
import numpy as np
import pyaudio

def record_audio(n_seconds: int = 5) -> str:
    """
    Record n_seconds of mono 16-bit audio from the computer microphone,
    save it as a .wav file, and return the path of that file.
    """
    chunk = 1024
    bit_depth = pyaudio.paInt16
    n_channels = 1
    sample_rate = 48000

    input("Press Enter to begin recording 🎤")
    print("🎤 Listening for music", end="\r")

    outfile = "microphone_sample.wav"

    p = pyaudio.PyAudio()

    stream = p.open(format=bit_depth, channels=n_channels, rate=sample_rate, input=True, frames_per_buffer=chunk)

    frames = []
    for _ in range(0, int(sample_rate / chunk * n_seconds)):
        data = stream.read(chunk)
        frames.append(data)

    stream.stop_stream()
    stream.close()
    p.terminate()


    audio_np = np.frombuffer(b''.join(frames), dtype=np.int16)
    scipy.io.wavfile.write(outfile, sample_rate, audio_np)
    print(f"✅ Recording saved to {outfile}", end="\n")

    return outfile

#audio_path = record_audio()

In [None]:
#audio_path = "microphone_sample.wav"

audio, sr = librosa.load(audio_path, sr=None) 
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
# extent maps the image edges to seconds and Hz instead of frame and bin indices
extent = [0, len(audio) / sr, 0, sr / 2]
im = plt.imshow(S_db, cmap="inferno", aspect="auto", origin="lower", extent=extent)
plt.colorbar(im, format="%+2.0f dB")
plt.xlabel("Time (sec)")
plt.ylabel("Frequency (Hz)")
plt.show()