# Audio Classification: CNN Baseline Model

**Course:** CSCI 6366 (Neural Networks and Deep Learning)  
**Project:** Audio Classification using CNN  
**Notebook:** Baseline CNN Model Implementation

## 0. What this notebook will do (big picture)

By the end of this new notebook, we want to have:

1. A **fixed-size input representation** for each audio clip:
   → Mel-spectrogram shaped like `(128, 128, 1)` (height × width × channels).

2. A **small CNN model** defined in Keras:

   ```python
   Conv2D → ReLU → MaxPooling2D → Conv2D → ReLU → MaxPooling2D → Flatten → Dense → ReLU → Dense → Softmax
   ```

3. The model **compiled** with:
   * `loss='categorical_crossentropy'`
   * `optimizer='adam'`
   * `metrics=['accuracy']`

We'll focus today on:
* shaping the data,
* building the model,
* understanding every layer.

We won't worry about perfect training yet.


In [None]:
import numpy as np
from pathlib import Path

import librosa

import tensorflow as tf
from tensorflow.keras import layers, models


## 1. First cell: imports

### What and why

* `numpy` → arrays and math.
* `Path` → nice file paths.
* `librosa` → we'll reuse it to build Mel-spectrograms.
* `tensorflow` / `keras`:
  * `layers` → Conv2D, MaxPooling2D, Flatten, Dense, etc.
  * `models` → to create the `Sequential` model.

If importing TensorFlow fails, install it in your venv:

```bash
# Run with the virtual environment activated
pip install "tensorflow>=2.16,<3"
```

(run in the terminal, not in the notebook).


In [None]:
# Where our audio data lives (relative to this notebook in notebooks/)
DATA_DIR = Path("../data").resolve()

# Our three classes
CLASS_NAMES = ["dog", "cat", "bird"]

label_to_index = {label: idx for idx, label in enumerate(CLASS_NAMES)}
label_to_index


## 2. Second cell: basic config and class labels

### Explanation

* `DATA_DIR` points to your `data/` folder from this notebook (`../data`).
* `CLASS_NAMES` is the list of class labels (folder names under `data/`).
* `label_to_index` turns labels into numbers:
  * `"dog" → 0`, `"cat" → 1`, `"bird" → 2`.
* This mapping will be used for training targets.

Run the cell and you should see something like:

```python
{'dog': 0, 'cat': 1, 'bird': 2}
```
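
Later, when the model outputs a class index, the inverse mapping is handy for turning it back into a label. The sketch below is one way to build it; `index_to_label` and `predicted_index` are illustrative names that do not appear elsewhere in this notebook.

```python
CLASS_NAMES = ["dog", "cat", "bird"]
label_to_index = {label: idx for idx, label in enumerate(CLASS_NAMES)}

# Hypothetical helper (not defined in the notebook): index -> label.
index_to_label = {idx: label for label, idx in label_to_index.items()}

# Pretend the model's highest-probability class was index 2.
predicted_index = 2
predicted_label = index_to_label[predicted_index]
print(predicted_label)
```

Because `CLASS_NAMES` is a list, `CLASS_NAMES[predicted_index]` would give the same result; the dictionary just makes the intent explicit.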


In [None]:
def load_mel_spectrogram(
    audio_path: Path,
    sr: int = 16000,
    n_fft: int = 1024,
    hop_length: int = 512,
    n_mels: int = 128,
) -> tuple[np.ndarray, int]:
    """
    Load an audio file and compute its Mel-spectrogram in dB scale.

    Returns:
        S_db: 2D array of shape (n_mels, time_frames), Mel-spectrogram in dB.
        sr: sample rate used.
    """
    # 1. Load waveform, resampled to `sr` if needed
    y, sr = librosa.load(audio_path, sr=sr)

    # 2. Compute Mel-spectrogram (power)
    S = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels,
        power=2.0,
    )

    # 3. Convert to dB scale
    S_db = librosa.power_to_db(S, ref=np.max)

    return S_db, sr


## 3. Third cell: copy in `load_mel_spectrogram` (with slight tweak)

We'll reuse the helper we built, but only need it to return `S_db` and `sr` for now.

### Quick recap of what it does

* **Input**: path to `.wav` file.
* **Inside**:
  * Load waveform `y` at `sr` (16 kHz).
  * Compute Mel-spectrogram `(n_mels × time_frames)`.
  * Convert to dB (log scale).
* **Output**: `S_db` (2D spectrogram), `sr`.

This function is **our bridge** from "raw audio file" → "2D array we'll feed into the CNN".
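
To get a feel for how wide the spectrogram will be, the number of time frames can be estimated from the clip duration. This is a rough sketch that assumes librosa's default centered framing (`center=True`), where the frame count is `1 + n_samples // hop_length`; `expected_time_frames` is a hypothetical helper, not part of the notebook.

```python
def expected_time_frames(duration_s: float, sr: int = 16000, hop_length: int = 512) -> int:
    """Estimate the frame count for a centered STFT (assumes center=True)."""
    n_samples = int(duration_s * sr)
    return 1 + n_samples // hop_length


# Frame counts for a few clip lengths at sr=16000, hop_length=512.
for duration in (1.0, 4.0, 5.0):
    print(f"{duration:.1f} s -> {expected_time_frames(duration)} frames")
```

Under these settings a clip of about 4 seconds yields close to 128 frames, so longer clips will be cropped and shorter ones padded in the next step.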


In [None]:
def pad_or_crop_spectrogram(S_db: np.ndarray, target_shape=(128, 128)) -> np.ndarray:
    """
    Ensure the Mel-spectrogram has shape (target_height, target_width)
    by centrally cropping or right-padding along the time axis
    (padding uses the spectrogram's minimum value, not zeros).

    Assumes S_db shape is (n_mels, time_frames).
    """
    target_height, target_width = target_shape
    n_mels, time_frames = S_db.shape

    # 1. If mel dimension doesn't match target_height, we could pad/crop,
    #    but here we assume n_mels == target_height (128).
    if n_mels != target_height:
        raise ValueError(f"Expected {target_height} mel bands, got {n_mels}")

    # 2. If too many time frames: centrally crop to target_width
    if time_frames > target_width:
        start = (time_frames - target_width) // 2
        end = start + target_width
        S_db = S_db[:, start:end]

    # 3. If too few time frames: pad on the right with the minimum (quietest) value
    elif time_frames < target_width:
        pad_width = target_width - time_frames
        S_db = np.pad(
            S_db,
            pad_width=((0, 0), (0, pad_width)),  # only pad time axis on the right
            mode="constant",
            constant_values=(S_db.min(),),
        )

    # Now S_db has shape (target_height, target_width)
    return S_db


## 4. Fourth cell: make the spectrogram a fixed size (128×128)

Right now different clips might have different `time_frames` (widths), depending on duration.

CNNs want **fixed shape** input. So we'll:
* Keep height = `n_mels = 128`.
* Force width = `128` by:
  * **If too long** → cut the center to 128 columns.
  * **If too short** → pad on the right with the spectrogram's minimum value.

### Explanation

* `S_db.shape` gives `(n_mels, time_frames)`.
* We **expect** `n_mels = 128`, and want `time_frames = 128`.
* If `time_frames > 128`:
  * We compute a `start` index so we crop the **center** portion.
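* If `time_frames < 128`:
  * We pad columns on the right with the spectrogram's **minimum** value (the quietest level, darkest color in a plot). Because the dB values are relative to the maximum and therefore ≤ 0, padding with literal zeros would add columns at the *loudest* level.

Result: every clip becomes a **128×128 matrix**.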
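
The crop and pad branches are easier to see on tiny arrays. The sketch below re-implements just the time-axis logic with a target width of 4; `fix_width` and the toy arrays are illustrative and not part of the notebook.

```python
import numpy as np


def fix_width(S: np.ndarray, target_width: int) -> np.ndarray:
    """Center-crop or right-pad (with the minimum value) along axis 1."""
    n_frames = S.shape[1]
    if n_frames > target_width:
        start = (n_frames - target_width) // 2
        return S[:, start:start + target_width]
    if n_frames < target_width:
        pad = target_width - n_frames
        return np.pad(S, ((0, 0), (0, pad)), mode="constant", constant_values=S.min())
    return S


# 6 frames -> cropped to the 4 central frames.
long_clip = np.arange(12, dtype=float).reshape(2, 6)
cropped = fix_width(long_clip, 4)

# 2 frames -> padded on the right with the minimum value (-80 here).
short_clip = np.array([[-10.0, -30.0], [-20.0, -80.0]])
padded = fix_width(short_clip, 4)

print(cropped)
print(padded)
```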


In [None]:
def load_example_for_model(audio_path: Path, label: str) -> tuple[np.ndarray, np.ndarray]:
    """
    Load one audio file and return:
      X: Mel-spectrogram as float32 array with shape (128, 128, 1)
      y: one-hot encoded label array with shape (num_classes,)
    """
    # 1. Mel-spectrogram in dB
    S_db, sr = load_mel_spectrogram(audio_path)

    # 2. Ensure fixed size 128x128
    S_fixed = pad_or_crop_spectrogram(S_db, target_shape=(128, 128))

    # 3. Normalize (optional but common): scale to [0, 1]
    #    We shift and scale based on min and max of this spectrogram
    S_min = S_fixed.min()
    S_max = S_fixed.max()
    S_norm = (S_fixed - S_min) / (S_max - S_min + 1e-8)  # avoid divide-by-zero

    # 4. Add channel dimension → (128, 128, 1)
    X = S_norm.astype("float32")[..., np.newaxis]

    # 5. Build one-hot label vector
    num_classes = len(CLASS_NAMES)
    y = np.zeros(num_classes, dtype="float32")
    y[label_to_index[label]] = 1.0

    return X, y


## 5. Fifth cell: convert audio file → model-ready input & label

Now let's make a function that:
* Takes:
  * `audio_path`
  * `label` (e.g., `"dog"`)
* Returns:
  * `X`: spectrogram with shape `(128, 128, 1)` (extra channel dimension).
  * `y`: one-hot label like `[1, 0, 0]` for dog.

### Explanation

1. **Get Mel-spectrogram**:
   ```python
   S_db, sr = load_mel_spectrogram(audio_path)
   ```

2. **Make it 128×128**:
   ```python
   S_fixed = pad_or_crop_spectrogram(S_db, target_shape=(128, 128))
   ```

3. **Normalize**:
   ```python
   S_min = S_fixed.min()
   S_max = S_fixed.max()
   S_norm = (S_fixed - S_min) / (S_max - S_min + 1e-8)
   ```
   * This maps values to roughly `[0, 1]`, scaled separately for each clip.
   * Inputs in a small, consistent range generally make gradient-based training more stable.

4. **Add channel dimension**:
   ```python
   X = S_norm.astype("float32")[..., np.newaxis]
   ```
   * `S_norm` shape: `(128, 128)`
   * `[..., np.newaxis]` → `(128, 128, 1)`
     (like a grayscale image with 1 channel).

5. **One-hot label**:
   ```python
   y = np.zeros(num_classes, dtype="float32")
   y[label_to_index[label]] = 1.0
   ```
   * For `"dog"` (index 0): `[1.0, 0.0, 0.0]`
   * For `"cat"`: `[0.0, 1.0, 0.0]`, etc.

This `(X, y)` pair is exactly what we'll feed to the model.
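
A small worked example makes the normalization concrete, including the edge case the `1e-8` term guards against. The arrays here are toy values chosen for illustration, and `min_max_normalize` is a hypothetical name for the same formula used above.

```python
import numpy as np


def min_max_normalize(S: np.ndarray) -> np.ndarray:
    """Scale an array to roughly [0, 1]; the epsilon avoids division by zero."""
    return (S - S.min()) / (S.max() - S.min() + 1e-8)


# Toy 2x2 "spectrogram" in dB: -80 dB maps to 0, 0 dB maps to (almost exactly) 1.
S_db = np.array([[-80.0, -40.0], [0.0, -20.0]])
S_norm = min_max_normalize(S_db)
X = S_norm.astype("float32")[..., np.newaxis]  # add the channel axis

# A completely flat (e.g. silent) clip: max == min, so without the epsilon
# this would be 0 / 0. With it, the result is simply all zeros.
silent = np.full((2, 2), -80.0)
silent_norm = min_max_normalize(silent)

# One-hot label for class index 0 out of 3 classes.
y = np.zeros(3, dtype="float32")
y[0] = 1.0

print(S_norm)
print(X.shape, silent_norm, y)
```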


In [None]:
def load_dataset(max_files_per_class: int = 20):
    X_list = []
    y_list = []

    for label in CLASS_NAMES:
        class_dir = DATA_DIR / label
        wav_files = sorted(class_dir.glob("*.wav"))

        for audio_path in wav_files[:max_files_per_class]:
            X, y = load_example_for_model(audio_path, label)
            X_list.append(X)
            y_list.append(y)

    X = np.stack(X_list, axis=0)
    y = np.stack(y_list, axis=0)
    return X, y

X, y = load_dataset(max_files_per_class=20)
X.shape, y.shape


## 6. Sixth cell: build a tiny dataset (even just a few examples)

We'll keep it simple: load a handful of files from each folder for now.

### Explanation

* Loop over `"dog"`, `"cat"`, `"bird"`.
* For each class:
  * Find `.wav` files inside that folder.
  * Take at most `max_files_per_class`.
  * Convert each to `(X, y)` using our helper.
* `X_list` is a Python list of arrays with shape `(128,128,1)`.
* `np.stack` turns it into a big 4D tensor:
  * `X.shape = (N, 128, 128, 1)`
    (N = total number of samples).
  * `y.shape = (N, 3)` (3 classes).

This gives us a small dataset to test our model pipeline.

> If this is slow, you can reduce `max_files_per_class` to 5 or 10.
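
The stacking step and a quick class-balance check can be tried without any audio files. This sketch uses zero-filled placeholder arrays in place of real spectrograms; `files_per_class`, `X_demo`, `y_demo`, and `samples_per_class` are illustrative names.

```python
import numpy as np

num_classes = 3
files_per_class = 2  # placeholder count, stands in for max_files_per_class

X_list, y_list = [], []
for class_index in range(num_classes):
    for _ in range(files_per_class):
        # Placeholder for one (128, 128, 1) spectrogram.
        X_list.append(np.zeros((128, 128, 1), dtype="float32"))
        # One-hot row for this class.
        y_list.append(np.eye(num_classes, dtype="float32")[class_index])

X_demo = np.stack(X_list, axis=0)
y_demo = np.stack(y_list, axis=0)

# Summing the one-hot rows gives the number of samples per class.
samples_per_class = y_demo.sum(axis=0)
print(X_demo.shape, y_demo.shape, samples_per_class)
```

Running the same `y.sum(axis=0)` check on the real dataset is a cheap way to confirm that every class folder was found. If no `.wav` files are found at all, `np.stack` raises a `ValueError` because it cannot stack an empty list.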


In [None]:
input_shape = (128, 128, 1)
num_classes = len(CLASS_NAMES)

model = models.Sequential([
    # Block 1
    layers.Conv2D(
        filters=32,
        kernel_size=(3, 3),
        activation="relu",
        padding="same",
        input_shape=input_shape,
    ),
    layers.MaxPooling2D(pool_size=(2, 2)),

    # Block 2
    layers.Conv2D(
        filters=64,
        kernel_size=(3, 3),
        activation="relu",
        padding="same",
    ),
    layers.MaxPooling2D(pool_size=(2, 2)),

    # Flatten + Dense
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.summary()


## 7. Seventh cell: build the CNN model

Now the fun part: define the CNN.

### Detailed explanation of each part

#### `input_shape = (128, 128, 1)`

* Height = 128 (mel bands),
* Width = 128 (time frames),
* Channels = 1 (grayscale spectrogram).

#### `Conv2D` (first block)

```python
layers.Conv2D(
    filters=32,
    kernel_size=(3, 3),
    activation="relu",
    padding="same",
    input_shape=input_shape,
),
```

* **filters=32**:
  * Learn 32 different 3×3 filters.
  * Output will have 32 feature maps.
* **kernel_size=(3,3)**:
  * Each filter looks at a 3×3 local patch.
* **activation="relu"**:
  * Apply `ReLU(x) = max(0,x)` after convolution.
* **padding="same"**:
  * Pad edges so that output height/width stay the same as input.
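* **input_shape**:
  * Only needed in the first layer: it tells Keras what input shape to expect.
  * With Keras 3 (bundled with TensorFlow 2.16+), passing `input_shape` to a layer still works but emits a warning; starting the model with `layers.Input(shape=input_shape)` is the recommended alternative.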

So after this layer:
* Input: `(128, 128, 1)`
* Output: `(128, 128, 32)` (same H,W, but now 32 channels).
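
To see what a single filter with `padding="same"` and ReLU does, here is a deliberately slow NumPy sketch on a 5×5 input. It is an illustration of the idea, not how Keras implements `Conv2D`; `conv2d_same_relu` and the hand-picked kernel are assumptions made for the example (a real layer learns its kernels and also adds a bias).

```python
import numpy as np


def conv2d_same_relu(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide one kernel over a 2D image with zero 'same' padding, then apply ReLU."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="constant")
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            patch = padded[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return np.maximum(out, 0.0)  # ReLU


image = np.arange(25, dtype=float).reshape(5, 5)

# Hand-written kernel: right neighbor minus left neighbor.
kernel = np.array([[0.0, 0.0, 0.0],
                   [-1.0, 0.0, 1.0],
                   [0.0, 0.0, 0.0]])

feature_map = conv2d_same_relu(image, kernel)
print(feature_map.shape)
print(feature_map)
```

The output keeps the 5×5 shape of the input, which is the effect of `padding="same"`. Negative responses along the right edge are clipped to zero by ReLU.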

#### First `MaxPooling2D`

```python
layers.MaxPooling2D(pool_size=(2, 2)),
```

* Pool size 2×2:
  * Each 2×2 block in each feature map becomes 1 value.
  * Downsamples height and width by factor 2.
* Output:
  * `(64, 64, 32)`.
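
The same downsampling can be reproduced on a 4×4 array with a reshape trick. This is a NumPy illustration of non-overlapping 2×2 max pooling on one feature map, not the Keras implementation.

```python
import numpy as np

feature_map = np.arange(16).reshape(4, 4)

# Split each axis into (blocks, 2) and take the max inside every 2x2 block.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(feature_map)
print(pooled)
```

Each output value is the largest entry of one 2×2 block, so a 4×4 map becomes 2×2, just as 128×128 becomes 64×64 in the model.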

#### Second Conv block

```python
layers.Conv2D(
    filters=64,
    kernel_size=(3, 3),
    activation="relu",
    padding="same",
),
layers.MaxPooling2D(pool_size=(2, 2)),
```

* Now we have 64 filters.
* Input to this Conv2D: `(64, 64, 32)`.
* Output of Conv2D: `(64, 64, 64)`.
* After MaxPooling2D:
  * `(32, 32, 64)`.

At this stage, we have a small 32×32 spatial map with 64 channels, i.e. fairly abstract features.

#### Flatten + Dense

```python
layers.Flatten(),
layers.Dense(64, activation="relu"),
layers.Dense(num_classes, activation="softmax"),
```

* **Flatten**:
  * Take `(32, 32, 64)` and convert to 1D vector:
    * size = `32 * 32 * 64 = 65536`.
* **Dense(64, relu)**:
  * Fully connected layer with 64 neurons.
  * Combines all extracted features into a compact representation.
* **Dense(num_classes, softmax)**:
  * Last layer with 3 neurons (dog/cat/bird).
  * `softmax` makes outputs sum to 1 and interpretable as probabilities.

`model.summary()` prints all shapes and parameter counts.
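
The parameter counts can be worked out by hand and compared against the summary. The sketch below assumes the architecture exactly as defined above (3×3 kernels, 32 and 64 filters, a 64-unit dense layer, 3 classes); `conv2d_params` and `dense_params` are hypothetical helper names.

```python
def conv2d_params(kernel_h: int, kernel_w: int, in_channels: int, filters: int) -> int:
    """Weights per filter (kernel_h * kernel_w * in_channels) plus one bias, times filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(in_features: int, units: int) -> int:
    """One weight per input feature plus one bias, for every unit."""
    return (in_features + 1) * units


layer_params = {
    "conv2d_1": conv2d_params(3, 3, 1, 32),
    "conv2d_2": conv2d_params(3, 3, 32, 64),
    "dense_1": dense_params(32 * 32 * 64, 64),  # input is the flattened 65536-vector
    "dense_2": dense_params(64, 3),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name}: {count:,}")
print(f"total: {total_params:,}")
```

Almost all of the parameters sit in the first dense layer, because it connects every one of the 65,536 flattened features to 64 units. Pooling and flatten layers have no trainable parameters.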


In [None]:
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)


## 8. Eighth cell: compile the model

### Explanation

* **optimizer="adam"**:
  * Adam is a widely used variant of gradient descent.
  * It adapts the learning rate per parameter; good default choice.

* **loss="categorical_crossentropy"**:
  * Suitable when:
    * We have **multi-class** classification (dog vs cat vs bird).
    * Labels are **one-hot** vectors `[1,0,0]`, `[0,1,0]`, etc.
  * Measures how different the predicted probability distribution is from the true one.

* **metrics=["accuracy"]**:
  * In training logs, show the fraction of examples where:
    * `argmax(predictions) == argmax(true_labels)`.

Now the model is ready to train.
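
To make the loss and the metric concrete, here is a NumPy sketch that computes softmax, categorical cross-entropy, and accuracy for a toy batch of two examples. The logits are made-up numbers, and the formulas are the textbook definitions; Keras additionally clips probabilities for numerical safety, which does not matter for values like these.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy batch: 2 examples, 3 classes (dog, cat, bird).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 1.0, 3.0]])
y_true = np.array([[1.0, 0.0, 0.0],   # dog
                   [0.0, 1.0, 0.0]])  # cat

probs = softmax(logits)

# Cross-entropy: minus the log-probability assigned to the true class.
per_example_loss = -np.sum(y_true * np.log(probs), axis=1)
loss = per_example_loss.mean()

# Accuracy: fraction of examples where the predicted class matches the true class.
accuracy = np.mean(probs.argmax(axis=1) == y_true.argmax(axis=1))

print(probs.sum(axis=1), per_example_loss, loss, accuracy)
```

The first example is classified correctly and gets a small loss; the second puts most of its probability on the wrong class, so its loss is much larger and the batch accuracy is one half.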


In [None]:
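# validation_split takes the last 20% of the arrays *before* shuffling, and
# load_dataset returns samples grouped by class, so shuffle once up front.
rng = np.random.default_rng(42)
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]

history = model.fit(
    X_shuffled,
    y_shuffled,
    epochs=3,
    batch_size=8,
    validation_split=0.2,
)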


## 9. (Optional) Ninth cell: sanity-check a tiny training run

You *don't* have to run long training now, but we can test that everything is wired correctly.

If this runs without shape errors, your entire pipeline:

`.wav → Mel-spectrogram → 128×128×1 → CNN → softmax`

is working 🎉
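
One detail worth knowing when reading the validation numbers: Keras takes the `validation_split` fraction from the end of the arrays, before any shuffling. The NumPy sketch below shows why sample order matters, assuming 20 clips per class stored in loading order (dog, cat, bird); the variable names are illustrative.

```python
import numpy as np

# Class indices in loading order: 20 dog (0), 20 cat (1), 20 bird (2).
labels = np.repeat([0, 1, 2], 20)
n_val = len(labels) * 20 // 100  # 20% of the samples

# Taking the last 20% of class-ordered data yields a single class.
val_unshuffled = labels[-n_val:]

# Shuffling first gives a validation slice that mixes classes.
rng = np.random.default_rng(0)
shuffled = labels[rng.permutation(len(labels))]
val_shuffled = shuffled[-n_val:]

print("ordered: ", np.bincount(val_unshuffled, minlength=3))
print("shuffled:", np.bincount(val_shuffled, minlength=3))
```

With class-ordered data the validation slice contains only bird clips, so validation accuracy would say nothing about dogs or cats.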

---

## What we can do next (after you try this)

Once you run through these cells:

1. If **any** cell errors, send me:
   * The code cell.
   * The full error message.

   I'll debug and explain what went wrong and why.

2. Concept-wise, we can:
   * Take the `model.summary()` output and I'll walk through **every line** so you understand shapes & parameters.
   * Talk more about:
     * What the filters may be learning in the first vs second conv layer.
     * How receptive field grows.
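
As a preview of the receptive-field discussion, the growth can be computed layer by layer with the standard recurrence: each layer adds `(kernel - 1) * jump` input pixels, and the jump is multiplied by the layer's stride. The sketch assumes the baseline model above, with stride-1 convolutions and pooling stride equal to the pool size (the Keras default); `layers_spec` is an illustrative description, not a Keras object.

```python
# (name, kernel size, stride) along one spatial axis.
layers_spec = [
    ("conv2d_1", 3, 1),
    ("max_pool_1", 2, 2),
    ("conv2d_2", 3, 1),
    ("max_pool_2", 2, 2),
]

receptive_field = 1  # one input pixel sees itself
jump = 1             # distance in input pixels between neighboring units
growth = []

for name, kernel, stride in layers_spec:
    receptive_field += (kernel - 1) * jump
    jump *= stride
    growth.append((name, receptive_field))

for name, size in growth:
    print(f"{name}: {size}x{size}")
```

Under these assumptions each position in the final 32×32 map summarizes a 10×10 patch of the input spectrogram, compared with 3×3 after the first convolution.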

Then, once this baseline is solid, we can think about:
* Splitting properly into train/val/test.
* Data augmentation (adding noise, time shifting).
* Improving model depth.

But first: get this notebook working and make sure you understand every step.
