## Previously on Audio Processing for ML

<ul>
    <li>Time-domain features</li>
    <li>Frequency-domain features</li>
    <li>Time-frequency domain features</li>
</ul>

<img src="../images/signal_domain.png">

<hr>

<h2>Time Domain Feature Pipeline</h2>
In audio processing, a <b>frame is a short segment</b> of a signal used for analysis or feature extraction.

<img src="../images/time_domain_feature_pipeline_2.png">
<hr>

In the image:

<ul>
    <li>A violin produces a <b>continuous</b> analog sound wave.</li>
    <li>The waveform is <b>sampled and quantized</b>, converting it to a digital signal (blue stair-step line).</li>
    <li>This digital signal is divided into <b>overlapping frames</b> — short, fixed-size segments.</li>
</ul>
<hr>
<b>Key Concepts in Framing (recap from earlier):</b>

<ul>
    <li><b>Frame Size:</b> Number of samples per frame (e.g., 128 samples).</li>
    <li><b>Hop Size:</b> Step size to move to the next frame (e.g., 64 samples).</li>
    <li><b>Overlap:</b> How much the frames overlap. For example, a frame size of 128 with a hop size of 64 results in a 50% overlap.</li>
</ul>
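<p>The framing parameters above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper name <code>frame_signal</code> and the choice to drop trailing samples that do not fill a whole frame are assumptions made for this example.</p>

```python
import numpy as np

def frame_signal(signal, frame_size=128, hop_size=64):
    """Split a 1-D signal into overlapping frames.

    Assumption for this sketch: trailing samples that do not fill
    a complete frame are dropped (no padding).
    """
    signal = np.asarray(signal)
    if len(signal) < frame_size:
        return np.empty((0, frame_size), dtype=signal.dtype)
    num_frames = 1 + (len(signal) - frame_size) // hop_size
    # Row i holds samples [i * hop_size, i * hop_size + frame_size)
    starts = hop_size * np.arange(num_frames)
    idx = starts[:, None] + np.arange(frame_size)[None, :]
    return signal[idx]

signal = np.arange(1000)              # stand-in for real audio samples
frames = frame_signal(signal, frame_size=128, hop_size=64)
overlap = 1 - 64 / 128                # fraction of each frame shared with the next

print(frames.shape)   # (14, 128)
print(overlap)        # 0.5
```

<p>With a hop size of 64, each frame starts 64 samples after the previous one, so consecutive frames share half of their samples.</p>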
<hr>

<h3>Why Use Framing?</h3>

<ul>
    <li>Audio signals change over time, so processing the whole signal at once may <b>lose time-specific details.</b></li>
    <li>Frames allow <b>time-localized analysis</b> — useful in speech, music, and general sound analysis (e.g., computing MFCCs, spectrograms).</li>
</ul>

We will cover these features (MFCCs, spectrograms) later.


<hr>

<h2>Perceivable audio chunk for humans</h2>

The <b>human auditory system</b> cannot extract meaningful audio features (tone, pitch, loudness, or even a transient) from a single sample point. It needs a chunk of audio samples: human hearing has a temporal resolution of about 10 milliseconds, which means we can distinguish changes in sound that happen at or above this time scale.

<hr>
<b>A Power-of-2 Number of Samples</b>

We often choose frame sizes (number of samples) as powers of 2 specifically to optimize the performance of the Fast Fourier Transform (FFT).


<h3>Why Powers of 2?</h3>
<ul>
    <li><b>The Fast Fourier Transform (FFT)</b> is an efficient algorithm for computing the <b>Discrete Fourier Transform (DFT)</b>.</li>
    <li>The most widely used <b>FFT algorithm (Cooley–Tukey)</b> is fastest when the number of input samples is a power of 2:</li>
    <center>

$$
N = 2^n \quad \text{(e.g., } N = 256,\, 512,\, 1024,\, 2048 \text{)}
$$

</center>
</ul>
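<p>To see why powers of 2 are convenient, here is a minimal recursive sketch of the radix-2 Cooley–Tukey idea: the input is split into even and odd samples, each half is transformed, and the halves are recombined. The function name <code>fft_radix2</code> is made up for this example. Production libraries such as NumPy use far more optimized routines that also handle other lengths.</p>

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 FFT sketch; input length must be a power of 2."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return x
    if n % 2:
        raise ValueError("length must be a power of 2")
    even = fft_radix2(x[0::2])   # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    # Butterfly step: combine the two half-size DFTs
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
print(np.allclose(fft_radix2(frame), np.fft.fft(frame)))  # True
```

<p>Because the length halves cleanly at every level only when it is a power of 2, a 512-sample frame needs just 9 levels of recursion.</p>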

<hr>
Let:

- $d_f$: frame duration (in seconds)
- $s_r$: sampling rate (e.g., 44100 Hz)
- $K$: number of samples per frame (e.g., 512)

Then:

$$
d_f = \frac{1}{s_r} \cdot K = \frac{1}{44100} \cdot 512 \approx 11.6\ \text{ms}
$$
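<p>The same arithmetic can be checked in a few lines of Python. The helper name <code>frame_duration_ms</code> is made up for this example, and 44100 Hz is used as the default sampling rate to match the formula above.</p>

```python
def frame_duration_ms(num_samples, sample_rate=44100):
    """Frame duration in milliseconds: d_f = K / s_r, scaled to ms."""
    return 1000.0 * num_samples / sample_rate

for k in (256, 512, 1024, 2048):
    print(k, round(frame_duration_ms(k), 1))
# 256 5.8
# 512 11.6
# 1024 23.2
# 2048 46.4
```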

<hr>

| Samples per Frame ($K$) | Power of 2 | Duration @ 44.1 kHz | Use Case                                         |
|-------------------------|------------|---------------------|--------------------------------------------------|
| 256                     | $2^8$      | ~5.8 ms             | Good time resolution                             |
| 512                     | $2^9$      | ~11.6 ms            | Balance of time and frequency                    |
| 1024                    | $2^{10}$   | ~23.2 ms            | Better frequency resolution                      |
| 2048                    | $2^{11}$   | ~46.4 ms            | High frequency resolution, worse time resolution |

<hr>

<h2>Frequency-domain feature pipeline</h2>
<hr>
<img src="../images/frequency_domain.png">
<hr>
<h2>From Time to Frequency Domain</h2>
<hr>
<img src="../images/from_time_to_frequency.png">
<hr>

<h2>What Is Spectral Leakage?</h2>

Spectral leakage happens when a frame does <b>not contain a whole number of signal periods</b>, so the framed segment is <b>non-periodic</b> when it is processed with the Discrete Fourier Transform (DFT) or FFT.

<hr>
<h2>Why Does It Arise?</h2>
<b>
1. Fourier transform assumes periodicity
</b>
<ul>
    <li><b>DFT/FFT</b> assumes that the signal you're analyzing <b>repeats forever.</b></li>
    <li>But when you take a <b>finite frame/window</b>, the signal may <b>not complete a whole cycle.</b></li>
    <li>The <b>edges</b> of the <b>frame</b> are often <b>discontinuous</b>, breaking the illusion of periodicity.</li>
</ul>

<b>
    2. Discontinuities = Artificial high-frequency components
</b>
<ul>
    <li>At the <b>endpoints</b> of the frame, if the signal <b>doesn’t smoothly wrap around</b>, a jump (<b>discontinuity</b>) appears.</li>
    <li>These <b>jumps</b> get interpreted by the Fourier transform as <b>broadband high-frequency content.</b></li>
    <li>These are <b>not real</b> — they’re <b>artifacts.</b></li>
</ul>

<hr>

<h2>Example (Violin waveform):</h2>

Let’s say a violin plays a note of 440 Hz (A4).

If your frame captures 1.2 periods, the waveform ends partway through a cycle, so the end of the frame does not line up with the start of the next repetition.

<ul>
    <li>The FFT sees that as a signal with many frequencies.</li>
    <li>So instead of a sharp peak at 440 Hz, you get "leakage" into nearby bins (e.g., 430, 450 Hz etc.)</li>
</ul>
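<p>A small numerical experiment illustrates this. It is a sketch under assumed settings (44.1 kHz sampling rate, 1024-sample frames, a pure sine instead of a real violin): one tone fits exactly 10 cycles into the frame, while the 440 Hz tone fits about 10.2 cycles. The helper <code>peak_energy_fraction</code> is made up for this example and measures how much of the spectrum's energy sits in the single strongest bin.</p>

```python
import numpy as np

sr = 44100            # assumed sampling rate (Hz)
N = 1024              # assumed frame size (samples)
t = np.arange(N) / sr

def peak_energy_fraction(freq_hz):
    """Fraction of spectral energy that lands in the strongest FFT bin."""
    x = np.sin(2 * np.pi * freq_hz * t)
    power = np.abs(np.fft.rfft(x)) ** 2
    return power.max() / power.sum()

f_aligned = 10 * sr / N           # exactly 10 cycles per frame (about 430.7 Hz)
aligned = peak_energy_fraction(f_aligned)
leaky = peak_energy_fraction(440.0)

print(aligned)   # very close to 1: one sharp peak
print(leaky)     # clearly below 1: energy has leaked into other bins
```

<p>The aligned tone wraps around smoothly at the frame boundary, so nearly all of its energy stays in one bin. The 440 Hz tone does not, and part of its energy spreads to neighbouring bins.</p>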

<hr>

<h2>Solutions to Reduce Leakage:</h2>

1. <b>Windowing functions</b> (e.g., Hamming, Hann, Blackman):
    <ul>
        <li>Taper the frame edges toward zero to reduce sharp discontinuities.</li>
        <li>Reduce spectral leakage, at the cost of a wider main lobe (slightly lower frequency resolution).</li>
    </ul>

2. Choose a frame size that holds a whole number of signal periods (if the period is known)
3. <b>Zero-padding</b> (sometimes helps for better visual inspection but doesn’t prevent leakage)

<hr>
<h2>Spectral Leakage: Why Endpoints Matter</h2>

<img src="../images/spectral_leakage.png">

<hr>

<p><strong>Problem:</strong> When using the FFT (Fast Fourier Transform), the algorithm assumes the input signal is <em>periodic</em> — meaning the end of the frame connects smoothly back to the start.</p>

<p>If the <strong>start and end points are not equal</strong> (i.e., discontinuous), this creates a sudden jump — a sharp change that the FFT interprets as containing many high-frequency components.</p>

<div style="border: 1px solid #ccc; padding: 10px; background-color: #f9f9f9;">
  <p><strong>Consequence:</strong> This artificial jump causes <strong>spectral leakage</strong>, where energy from a true frequency spreads into adjacent frequency bins. This results in a smeared spectrum.</p>
</div>

<h3>Why You Should Care</h3>
<ul>
  <li><strong>Music analysis:</strong> Makes it harder to detect clean harmonics and pitch</li>
  <li><strong>Speech features:</strong> Leads to distorted MFCCs or spectral centroids</li>
  <li><strong>Scientific signals:</strong> Introduces false frequencies that aren’t really there</li>
</ul>

<h3>Solution</h3>
<ul>
  <li>Apply a <strong>windowing function</strong> (e.g., Hann, Hamming, Blackman)</li>
  <li>This fades the signal smoothly to zero at both ends</li>
  <li>Makes the signal appear more periodic → <strong>reduces spectral leakage</strong></li>
</ul>
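<p>The effect of windowing can be measured with a short sketch. The settings are assumptions for illustration (44.1 kHz, 1024-sample frame, pure 440 Hz sine), and the helper <code>far_leakage</code> is made up for this example: it reports the fraction of spectral energy that lies more than a few bins away from the peak.</p>

```python
import numpy as np

sr, N = 44100, 1024                     # assumed sampling rate and frame size
t = np.arange(N) / sr
x = np.sin(2 * np.pi * 440.0 * t)       # about 10.2 cycles: not frame-periodic

def far_leakage(frame, guard_bins=4):
    """Fraction of energy more than `guard_bins` bins away from the peak."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    peak = int(np.argmax(power))
    lo, hi = max(peak - guard_bins, 0), peak + guard_bins + 1
    return 1.0 - power[lo:hi].sum() / power.sum()

rect = far_leakage(x)                   # no window (rectangular)
hann = far_leakage(x * np.hanning(N))   # Hann-windowed frame

print(rect, hann)   # the Hann value is much smaller
```

<p>The Hann window widens the main peak slightly, but energy far from the true frequency drops sharply compared with the unwindowed frame.</p>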

<img src="../images/spectral_leakage_2.png">
<hr>

<h2>Frequency Domain Pipeline</h2>
<img src="../images/frequency_domain_2.png">
<hr>
<h2>Windowing</h2>
<ul>
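  <li>Apply a windowing function to each frame</li>
  <li>Attenuates the samples toward both ends of the frame (they are scaled down, not removed)</li>
  <li>Makes the frame behave like one period of a periodic signal, because both endpoints approach zero</li>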
</ul>
<hr>
<b>Hann Window</b>
<img src="../images/hann_window.png">
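<p>As a sketch, the symmetric Hann window of length K can be written as w(k) = 0.5 · (1 − cos(2πk / (K − 1))) for k = 0, …, K − 1, and applying it is an element-wise multiplication with the frame. The helper name <code>hann_window</code> and the random test frame are assumptions for this example. NumPy's <code>np.hanning</code> returns the same symmetric window.</p>

```python
import numpy as np

def hann_window(K):
    """Symmetric Hann window: w(k) = 0.5 * (1 - cos(2*pi*k / (K - 1)))."""
    k = np.arange(K)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * k / (K - 1)))

K = 512
w = hann_window(K)

# Stand-in frame (random samples) just to show the multiplication
frame = np.random.default_rng(0).standard_normal(K)
windowed = frame * w

print(np.allclose(w, np.hanning(K)))   # True
print(windowed[0], windowed[-1])       # both (numerically) zero
```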
<hr>
<h2>Windowing by Hann function</h2>
<img src="../images/windowing_hann.png">
<hr>
<img src="../images/windowing_hann_2.png">

<hr>
<h3>Problem: Non-Overlapping Frames + Hann Window</h3>

<img src="../images/windowing_hann_3.png">


<p>When applying the <strong>Hann window</strong> to each frame of audio, the window tapers the signal down to <strong>zero at both ends</strong>. If frames are non-overlapping, this causes <strong>gaps of near-zero energy</strong> between frames.</p>

<p>This leads to:</p>
<ul>
  <li>Loss of signal information between frames</li>
  <li>Distortion in reconstructed signals</li>
  <li>Artifacts in spectral analysis</li>
</ul>

<h3>Solution: Use Overlapping Frames</h3>

<p>To avoid gaps caused by windowing, we use overlapping frames — typically with <strong>50% overlap</strong> for Hann windows.</p>

<p>With 50% overlap, the tapered end of one Hann-windowed frame lines up with the rising start of the next, so the summed windows give a nearly constant gain and the original signal can be reconstructed smoothly without gaps.</p>

<p><strong>Result:</strong> Continuous energy, accurate frequency analysis, and smooth reconstruction.</p>
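<p>The gap problem and its fix can be checked numerically by adding up shifted copies of the window. This is a sketch with assumed values (frame size 512, 8 frames). It uses the periodic form of the Hann window, 0.5 − 0.5 · cos(2πk / K), because that form sums to exactly 1 at 50% overlap. The helper <code>window_sum</code> is made up for this example.</p>

```python
import numpy as np

K, hop = 512, 256                       # assumed frame size and 50% hop
k = np.arange(K)
w = 0.5 - 0.5 * np.cos(2 * np.pi * k / K)   # periodic Hann window

def window_sum(hop_size, num_frames=8):
    """Sum of `num_frames` window copies placed `hop_size` samples apart."""
    total = np.zeros(hop_size * (num_frames - 1) + K)
    for i in range(num_frames):
        total[i * hop_size : i * hop_size + K] += w
    return total

overlapped = window_sum(hop)            # 50% overlap
non_overlapped = window_sum(K)          # frames placed back to back

interior = overlapped[hop:-hop]         # skip the first and last half frame
print(np.allclose(interior, 1.0))       # True: constant gain, no gaps
print(non_overlapped.min())             # 0.0: gain drops to zero between frames
```

<p>Only the very first and last half frame are covered by a single window. Everywhere else the overlapping windows add up to a constant, which is what makes smooth reconstruction possible.</p>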

<hr>
<h3>Non-Overlapping Frames</h3>
<img src="../images/non_overlapping_frames.png">
<hr>
<h3>Overlapping Frames</h3>
<img src="../images/overlapping_frames.png">
<hr>
<h3>Frequency-domain feature pipeline</h3>
<img src="../images/frequency_domain_3.png">