# Spoken Digit Recognition: The Complete Master Class

Welcome! This notebook is designed to take you from "zero" to "deep understanding" of this codebase. We will cover:

1.  **Software Engineering (SWE) Perspective**: Why is the code organized like this?
2.  **Audio Signal Processing**: What is sound, really? How does the computer "hear"?
3.  **The Machine Learning Pipeline**: From raw data to vectors.
4.  **Deep Learning Theory**: How Convolutional Neural Networks (CNNs) work.
5.  **The Process**: What we did, why it failed initially, and how we fixed it.

---

## Part 1: The Software Engineering (SWE) Foundation

### Q: Why not just put everything in one big `main.py` file or a single notebook?
**A:** Scalability, Reproducibility, and Sanity.

In a professional **MLOps (Machine Learning Operations)** environment, we separate concerns:

-   **`src/`**: The Core Logic. Code here is reusable. `src/data` handles loading, `src/models` handles architecture.
-   **`notebooks/`**: The Laboratory. This is where we experiment, visualize, and fail fast. It uses the stable code from `src/`.
-   **`requirements.txt`**: The Environment. Lists the Python packages the project depends on, so your machine installs the same dependencies as mine.
-   **`.gitignore`**: Hygiene. Tells Git: "Ignore my local trash (compiled files, datasets, virtual envs). Only track the source code."

This structure allows us to train a model in a script (`python src/models/train_model.py`) OR in a notebook (`03_training_and_opt.ipynb`) without rewriting code.

---

## Part 2: Audio Signal Processing (The "Physics" Layer)

### What is Sound?
Sound is just air vibrating. A microphone measures the resulting air pressure thousands of times per second; the number of measurements per second is called the **Sample Rate**.
-   **16,000 Hz (16kHz)**: Used in this project. We measure the air pressure 16,000 times every second.
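
To make the sample rate concrete, here is a minimal sketch (standard library only) that "records" a synthetic tone at 16 kHz. The tone frequency and clip length are illustrative assumptions, not values taken from this project's dataset.

```python
import math

SAMPLE_RATE = 16_000   # samples per second, as used in this project
DURATION_S = 0.5       # illustrative clip length (assumption)
TONE_HZ = 440.0        # illustrative pure tone (assumption)

# One amplitude value per sampling instant t = n / SAMPLE_RATE.
num_samples = int(SAMPLE_RATE * DURATION_S)
waveform = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
            for n in range(num_samples)]

# Highest frequency a sampled signal can represent (the Nyquist limit).
nyquist_hz = SAMPLE_RATE / 2

print(num_samples, nyquist_hz)
```

Half a second of audio becomes a list of 8,000 numbers, and nothing above 8 kHz can be captured at this sample rate. That is generally considered enough for speech, which is why 16 kHz is a common choice for recognition tasks.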

### The Problem: Raw Audio is Hard to Read
A raw waveform is just a list of amplitudes. It's hard for a Neural Network to find patterns like "pitch" or "tone" in just a list of numbers.

### The Solution: The Spectrogram
We convert the Time-Domain signal (Amplitude vs Time) into a Time-Frequency representation (how much energy each Frequency carries at each moment in Time).
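
The core idea can be sketched on a single short frame of audio. The snippet below is an illustration only: it uses a textbook DFT in plain Python on a synthetic 1,000 Hz tone (both are assumptions for the demo), whereas the notebook relies on torchaudio's much faster FFT-based transforms.

```python
import cmath
import math

SAMPLE_RATE = 16_000
N = 400  # frame length in samples (25 ms at 16 kHz); illustrative choice

# A synthetic frame containing a single 1,000 Hz tone (assumption for the demo).
frame = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(N)]

def dft_magnitudes(x):
    """Return |X[k]| for k = 0 .. len(x)//2 using the textbook DFT sum."""
    n_samples = len(x)
    mags = []
    for k in range(n_samples // 2 + 1):
        total = sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples))
        mags.append(abs(total))
    return mags

mags = dft_magnitudes(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / N  # each bin spans SAMPLE_RATE / N Hz

print(peak_bin, peak_hz)
```

The frame's energy concentrates in the bin that corresponds to the tone's frequency. A spectrogram repeats this analysis on many overlapping frames and stacks the results side by side, which is what produces the Frequency vs Time picture.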

**Let's Visualize This:**

In [None]:
import sys
import os
# Make the project root importable so `src` resolves from inside notebooks/
sys.path.append(os.path.abspath('../'))

import torch
import torchaudio
import matplotlib.pyplot as plt
import numpy as np
from src.data.dataset import SpokenDigitDataset

# Load a sample
dataset = SpokenDigitDataset('../data/processed')
sample_file, label = dataset.file_list[0]
waveform, sr = torchaudio.load(sample_file)

# 1. The Waveform (Time Domain)
plt.figure(figsize=(10, 4))
plt.plot(waveform.t().numpy())
plt.title(f"Raw Waveform (Digit: {label})")
plt.xlabel("Time (samples)")
plt.ylabel("Amplitude")
plt.show()

### The Mel-Spectrogram
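We use `MelSpectrogram`. This is a spectrogram whose frequency axis is warped to the mel scale, which mimics human hearing: we tell low-pitched sounds apart much more finely than high-pitched ones, so low frequencies get more resolution.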
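
As a rough illustration of that adjustment, the sketch below uses one common (HTK-style) definition of the mel scale. It should correspond to torchaudio's `mel_scale="htk"` option, but treat that mapping as an assumption and check the documentation for your installed version.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel formula: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 100 Hz step covers far more mels at low pitch than at high pitch.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(7100) - hz_to_mel(7000)

print(round(low_step, 1), round(high_step, 1))
```

Because a 100 Hz change near the bottom of the range spans many more mels than the same change near 7 kHz, mel filter banks end up packing more bands into the low frequencies.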

**Why this matters for the model:**
Instead of processing a 1D line of data, the model now sees an **Image** (2D array: Frequency x Time, one row per mel band and one column per time frame). This allows us to use **CNNs**, which are great at analyzing images!
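
To get a feel for the size of that image, here is a small sketch of the expected tensor shape. It assumes torchaudio's defaults for `MelSpectrogram` (`n_fft=400`, a hop of half the window, and centered frames), so verify the numbers against your installed version; the helper name `expected_spec_shape` is made up for this example.

```python
N_MELS = 64          # from the notebook's MelSpectrogram call
HOP_LENGTH = 200     # assumed torchaudio default: half of n_fft=400
SAMPLE_RATE = 16_000

def expected_spec_shape(num_samples, channels=1):
    """Shape of the mel-spectrogram tensor, assuming center-padded framing."""
    num_frames = 1 + num_samples // HOP_LENGTH
    return (channels, N_MELS, num_frames)

print(expected_spec_shape(SAMPLE_RATE))  # one second of mono audio
```

Under these assumptions, the layout is (channel, mel band, time frame), which is the same layout a CNN expects for a single-channel grayscale image.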

In [None]:
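# Build a 64-band mel filterbank; this assumes the file is 16 kHz (compare with `sr` above)
mel_transform = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)
# Convert power values to decibels, a log scale closer to how we perceive loudness
db_transform = torchaudio.transforms.AmplitudeToDB()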

spec = mel_transform(waveform)
spec = db_transform(spec)

plt.figure(figsize=(10, 4))
plt.imshow(spec[0].numpy(), aspect='auto', origin='lower')
plt.title("Mel-Spectrogram (What the Model Sees)")
plt.xlabel("Time")
plt.ylabel("Frequency (Mel)")
plt.colorbar(format='%+2.0f dB')
plt.show()

## Part 3: Deep Learning & CNNs (The "Brain")

We treat this audio problem as an **Image Classification** problem. We want to find the shape of "Zero" vs the shape of "One" on that heatmap above.

### Key Components of our `DeeperCNN`:

1.  **Conv2d (Convolution)**: A small filter (like a magnifying glass) slides over the image. It learns to detect edges, curves, or specific frequency patterns. We stack these to learn complex patterns.
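2.  **BatchNorm**: Keeps the math stable. It rescales each layer's activations using statistics from the current batch, so training is less sensitive to inputs with an unusual scale (such as one unexpectedly loud recording).
3.  **ReLU (Activation)**: Allows the network to learn non-linear patterns. Without it, any stack of layers collapses into a single linear model, no matter how deep it is.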
4.  **MaxPool**: Downsamples the image. It says, "I saw a feature here, I don't care exactly which pixel, just that it exists in this region."
5.  **Dropout**: Randomly turns off neurons during training. This forces the network to be redundant and robust (prevents overfitting).
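
The first, third, and fourth components can be traced by hand on a tiny example. This is a plain-Python sketch for intuition, not the project's `DeeperCNN`: the real model uses PyTorch layers whose kernels are learned, while the kernel and toy input below are hand-picked assumptions.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Zero out negative responses."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """Keep the strongest response in each non-overlapping 2x2 region."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

# Toy 5x5 input: a bright vertical stripe in columns 2 and 3.
image = [[0.0, 0.0, 1.0, 1.0, 0.0] for _ in range(5)]
# Hand-picked kernel that responds where values rise from left to right.
kernel = [[-1.0, 1.0],
          [-1.0, 1.0]]

features = conv2d_valid(image, kernel)   # 4x4 map: positive at the rising edge
activated = relu(features)               # the negative (falling-edge) response is dropped
pooled = max_pool_2x2(activated)         # 2x2 summary of where the edge was seen

print(features[0], activated[0], pooled)
```

The convolution responds positively where the stripe begins and negatively where it ends, ReLU discards the negative response, and pooling keeps only the fact that a rising edge exists in the left half of the image.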

---

## Part 4: The Process (What we did)

### Phase 1: The Baseline (`SimpleCNN`)
We started with a small model (30 Epochs). It got ~87% accuracy.
**Issue:** It was "underfitting". It didn't have enough "brain capacity" (neurons) to learn the subtle differences between digits like 1 and 7 in Afaan Oromoo.

### Phase 2: Optimization (`DeeperCNN` + SpecAugment)
1.  **More Layers**: We went from 16->32 channels to 32->64->128->200+ channels. This gave the model more capacity.
2.  **SpecAugment**: We randomly erased blocks of time or frequency during training. This forced the model to learn context. "If I can't hear the start of the word, can I guess it from the end?"
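
The masking idea behind SpecAugment can be sketched in a few lines. The project's actual augmentation code is not shown in this notebook, so the function below (`mask_spectrogram`, a hypothetical name) is only an illustration in plain Python; torchaudio provides `FrequencyMasking` and `TimeMasking` transforms for real use. Filling the masked region with 0.0 is also an assumption.

```python
import random

def mask_spectrogram(spec, max_freq_width, max_time_width, rng):
    """Return a copy of spec (rows = mel bands, columns = time frames) with
    one random band of rows and one random band of columns set to 0.0."""
    n_freq, n_time = len(spec), len(spec[0])
    masked = [row[:] for row in spec]

    # Frequency mask: blank out a random run of consecutive mel bands.
    f_width = rng.randint(0, max_freq_width)
    f_start = rng.randint(0, n_freq - f_width)
    for f in range(f_start, f_start + f_width):
        masked[f] = [0.0] * n_time

    # Time mask: blank out a random run of consecutive frames.
    t_width = rng.randint(0, max_time_width)
    t_start = rng.randint(0, n_time - t_width)
    for row in masked:
        for t in range(t_start, t_start + t_width):
            row[t] = 0.0
    return masked

rng = random.Random(0)                   # seeded so the demo is repeatable
spec = [[1.0] * 10 for _ in range(8)]    # toy 8-band x 10-frame input
augmented = mask_spectrogram(spec, max_freq_width=3, max_time_width=4, rng=rng)

for row in augmented:
    print(row)
```

A fresh mask is drawn every time a sample is loaded during training, so the model never sees exactly the same spectrogram twice. The masks are not applied at evaluation time.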

**Result**: 91.94% Accuracy.

### Phase 3: The Ceiling
We ran a Grid Search (training the same model once for each of several learning rates and comparing the results). Accuracy saturated at ~92%. This suggests we are limited by the **Data**, not the model. To improve further, we would need thousands more samples, or to use **Transfer Learning** (starting from a model that a large lab such as Google or Meta/Facebook has already pre-trained on far more speech data).
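
The shape of such a search is simple. This is a hedged sketch, not the project's script: `grid_search` and `fake_train_and_evaluate` are hypothetical names, and the scoring function is a made-up stand-in so the example runs without training anything. Its numbers are not results from this project.

```python
import math

def grid_search(learning_rates, train_and_evaluate):
    """Try each learning rate and return (best_lr, best_score, all_results)."""
    results = {lr: train_and_evaluate(lr) for lr in learning_rates}
    best_lr = max(results, key=results.get)
    return best_lr, results[best_lr], results

def fake_train_and_evaluate(lr):
    """Stand-in for a real training run: a toy score that peaks near lr=1e-3."""
    return 1.0 - 0.1 * abs(math.log10(lr) + 3.0)

best_lr, best_score, results = grid_search(
    [1e-4, 3e-4, 1e-3, 3e-3, 1e-2], fake_train_and_evaluate
)
print(best_lr, results)
```

In a real run, `train_and_evaluate` would train the model with the given learning rate and return validation accuracy. When every value in `results` lands within a narrow band, as described above, the learning rate is probably not the bottleneck.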

---

## Part 5: Future Steps (What YOU should do)

1.  **Collect Data**: The best code cannot fix bad/small data. Record yourself saying each digit 100 times and add the recordings to the data folders.
2.  **Transfer Learning**: Look into `Wav2Vec 2.0`. It's a Transformer-based model pre-trained on raw speech audio, in the same self-supervised spirit as GPT for text.
3.  **Deploy**: You already have `app.py`. Host it on Streamlit Cloud or HuggingFace Spaces to share directly with users.