# Audio Classification: CNN Baseline Model

**Course:** CSCI 6366 (Neural Networks and Deep Learning)  
**Project:** Audio Classification using CNN  
**Notebook:** Baseline CNN Model Implementation

## Overview

This notebook implements a baseline Convolutional Neural Network (CNN) for audio classification. We will:
1. Load and preprocess audio files into fixed-size mel-spectrogram representations
2. Build a simple CNN architecture using Keras/TensorFlow
3. Train and evaluate the baseline model on the audio classification task

The goal is to establish a baseline model that can classify audio samples into categories (dog, cat, bird).


### What This Notebook Will Do (Big Picture)

By the end of this notebook, we want to have:

1. A fixed-size input representation for each audio clip: a mel-spectrogram of shape `(128, 128, 1)` (height × width × channels).
2. A small CNN model defined in Keras: `Conv2D (ReLU) → MaxPooling2D → Conv2D (ReLU) → MaxPooling2D → Flatten → Dense (ReLU) → Dense (softmax)`
3. The model compiled with:
   - `loss='categorical_crossentropy'`
   - `optimizer='adam'`
   - `metrics=['accuracy']`

The focus here is on shaping the data, building the model, and understanding every layer. We won't worry about getting the best possible training results yet.

In [None]:
import numpy as np
from pathlib import Path

import librosa

import tensorflow as tf
from tensorflow.keras import layers, models


## Configuration and Data Paths

Set up the data directory path and define the class labels for our audio classification task.

In [None]:
# Where our audio data lives (relative to this notebook in notebooks/)
DATA_DIR = Path("../data").resolve()

# Our three classes
CLASS_NAMES = ["dog", "cat", "bird"]

# Create label-to-index mapping for one-hot encoding
label_to_index = {label: idx for idx, label in enumerate(CLASS_NAMES)}
print("Label to index mapping:", label_to_index)

### Configuration Details

- **`DATA_DIR`**: Points to the `data/` folder relative to this notebook (`../data`)
- **`CLASS_NAMES`**: List of class labels corresponding to folder names under `data/`
- **`label_to_index`**: Mapping that converts string labels to numeric indices:
  - `"dog" → 0`
  - `"cat" → 1`
  - `"bird" → 2`

This mapping will be used to create one-hot encoded training targets.
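As a small illustration of how this mapping turns a string label into a one-hot target and back, here is a minimal sketch. The helper names `one_hot` and `index_to_label` are hypothetical and only used for this example; `CLASS_NAMES` and `label_to_index` are redefined so the snippet runs on its own.

```python
import numpy as np

CLASS_NAMES = ["dog", "cat", "bird"]
label_to_index = {label: idx for idx, label in enumerate(CLASS_NAMES)}

def one_hot(label: str) -> np.ndarray:
    """Hypothetical helper: build a one-hot vector for a class name."""
    vec = np.zeros(len(CLASS_NAMES), dtype="float32")
    vec[label_to_index[label]] = 1.0
    return vec

def index_to_label(vec: np.ndarray) -> str:
    """Hypothetical helper: recover the class name from a one-hot (or probability) vector."""
    return CLASS_NAMES[int(np.argmax(vec))]

cat_vec = one_hot("cat")
print(cat_vec)                  # [0. 1. 0.]
print(index_to_label(cat_vec))  # cat
```

The same `argmax` step is what later turns the model's softmax output into a predicted class name.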


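## Fixed-Size Spectrograms (128×128)

Clips of different durations produce spectrograms with different numbers of time frames (widths), but the CNN needs a fixed input shape. So we will:

- Keep height = `n_mels` = 128.
- Force width = 128:
  - If too long: keep the central 128 columns.
  - If too short: pad on the right with the spectrogram's minimum dB value. Because the dB values are computed relative to the maximum (`ref=np.max`), they are all ≤ 0, so the minimum stands in for silence, whereas a literal zero would be the loudest possible value.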
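To get a feel for which clips end up cropped and which end up padded, the sketch below estimates the number of time frames from a clip's duration. It assumes librosa's default centered framing (`center=True`, which the loading code in this notebook does not override), where the frame count is `1 + n_samples // hop_length`. The function name `num_frames` is hypothetical.

```python
def num_frames(duration_s: float, sr: int = 16000, hop_length: int = 512) -> int:
    """Estimated frame count for a centered STFT: 1 + n_samples // hop_length."""
    n_samples = int(round(duration_s * sr))
    return 1 + n_samples // hop_length

TARGET_WIDTH = 128

for seconds in (1.0, 4.0, 10.0):
    frames = num_frames(seconds)
    if frames > TARGET_WIDTH:
        action = "crop"
    elif frames < TARGET_WIDTH:
        action = "pad"
    else:
        action = "keep"
    print(f"{seconds:>4.1f} s -> {frames} frames ({action})")
```

With `sr=16000` and `hop_length=512`, a 1 s clip gives 32 frames, a 4 s clip gives 126, and a 10 s clip gives 313, so 128 frames corresponds to roughly 4.1 seconds of audio.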

In [None]:
def load_mel_spectrogram(
    audio_path: Path,
    sr: int = 16000,
    n_fft: int = 1024,
    hop_length: int = 512,
    n_mels: int = 128,
) -> tuple[np.ndarray, int]:
    """
    Load an audio file and compute its Mel-spectrogram in dB scale.

    Args:
        audio_path: Path to the audio file
        sr: Sample rate (default: 16000 Hz)
        n_fft: FFT window size (default: 1024)
        hop_length: Number of samples between successive frames (default: 512)
        n_mels: Number of mel filter banks (default: 128)

    Returns:
        S_db: 2D array of shape (n_mels, time_frames), Mel-spectrogram in dB
        sr: Sample rate used
    """
    # 1. Load waveform, resampled to `sr` if needed
    y, sr = librosa.load(audio_path, sr=sr)

    # 2. Compute Mel-spectrogram (power)
    S = librosa.feature.melspectrogram(
        y=y,
        sr=sr,
        n_fft=n_fft,
        hop_length=hop_length,
        n_mels=n_mels,
        power=2.0,
    )

    # 3. Convert to dB scale
    S_db = librosa.power_to_db(S, ref=np.max)

    return S_db, sr


def pad_or_crop_spectrogram(S_db: np.ndarray, target_shape=(128, 128)) -> np.ndarray:
    """
    Ensure the Mel-spectrogram has shape (target_height, target_width)
    by centrally cropping or zero-padding along the time axis.

    Args:
        S_db: Mel-spectrogram array with shape (n_mels, time_frames)
        target_shape: Target (height, width) tuple

    Returns:
        Fixed-size spectrogram with shape (target_height, target_width)
    """
    target_height, target_width = target_shape
    n_mels, time_frames = S_db.shape

    # 1. Validate mel dimension matches target_height
    if n_mels != target_height:
        raise ValueError(f"Expected {target_height} mel bands, got {n_mels}")

    # 2. If too many time frames: centrally crop to target_width
    if time_frames > target_width:
        start = (time_frames - target_width) // 2
        end = start + target_width
        S_db = S_db[:, start:end]

    # 3. If too few time frames: pad with minimum value on the right
    elif time_frames < target_width:
        pad_width = target_width - time_frames
        S_db = np.pad(
            S_db,
            pad_width=((0, 0), (0, pad_width)),  # only pad time axis on the right
            mode="constant",
            constant_values=(S_db.min(),),
        )

    # Now S_db has shape (target_height, target_width)
    return S_db
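A quick way to check the crop and pad behavior without any audio files is to run the same logic on synthetic arrays. The sketch below uses a compact stand-in called `fix_width` (a hypothetical name) that mirrors the time-axis logic of `pad_or_crop_spectrogram`; the input arrays are made up for illustration.

```python
import numpy as np

def fix_width(S: np.ndarray, target_width: int = 128) -> np.ndarray:
    """Hypothetical compact stand-in: center-crop or right-pad the time axis."""
    frames = S.shape[1]
    if frames > target_width:
        start = (frames - target_width) // 2
        return S[:, start:start + target_width]
    if frames < target_width:
        pad = target_width - frames
        return np.pad(S, ((0, 0), (0, pad)), mode="constant", constant_values=S.min())
    return S

# Long clip: each column holds its own index, so the kept range is easy to read off.
long_clip = np.tile(np.arange(313, dtype="float32"), (128, 1))  # shape (128, 313)
cropped = fix_width(long_clip)
print(cropped.shape, cropped[0, 0], cropped[0, -1])  # columns 92..219 are kept

# Short clip: random dB-like values in [-80, 0], padded on the right with the minimum.
rng = np.random.default_rng(0)
short_clip = rng.uniform(-80.0, 0.0, size=(128, 32))
padded = fix_width(short_clip)
print(padded.shape, np.all(padded[:, 32:] == short_clip.min()))
```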

## From Audio File to Model-Ready Input and Label

Next, we define a function that takes:

- `audio_path`: path to the audio file
- `label`: class name (e.g., `"dog"`)

and returns:

- `X`: spectrogram with shape `(128, 128, 1)` (with an extra channel dimension)
- `y`: one-hot label, e.g. `[1, 0, 0]` for dog

In [None]:
def load_example_for_model(audio_path: Path, label: str) -> tuple[np.ndarray, np.ndarray]:
    """
    Load one audio file and convert it to model-ready format.

    Args:
        audio_path: Path to the audio file
        label: String label (e.g., "dog", "cat", "bird")

    Returns:
        X: Mel-spectrogram as float32 array with shape (128, 128, 1)
        y: One-hot encoded label array with shape (num_classes,)
    """
    # 1. Load mel-spectrogram in dB
    S_db, sr = load_mel_spectrogram(audio_path)

    # 2. Ensure fixed size 128x128
    S_fixed = pad_or_crop_spectrogram(S_db, target_shape=(128, 128))

    # 3. Normalize to [0, 1] range for better training stability
    S_min = S_fixed.min()
    S_max = S_fixed.max()
    S_norm = (S_fixed - S_min) / (S_max - S_min + 1e-8)  # avoid divide-by-zero

    # 4. Add channel dimension → (128, 128, 1) for CNN input
    X = S_norm.astype("float32")[..., np.newaxis]

    # 5. Build one-hot label vector
    num_classes = len(CLASS_NAMES)
    y = np.zeros(num_classes, dtype="float32")
    y[label_to_index[label]] = 1.0

    return X, y
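The normalization and channel-dimension steps can be checked in isolation on a synthetic array. This is a minimal sketch that uses random dB-like values instead of a real spectrogram; it also shows why the `1e-8` term matters for a constant (for example, fully silent) input.

```python
import numpy as np

rng = np.random.default_rng(0)
S_fixed = rng.uniform(-80.0, 0.0, size=(128, 128))  # synthetic stand-in for a dB spectrogram

# Per-example min-max scaling, as in load_example_for_model
S_norm = (S_fixed - S_fixed.min()) / (S_fixed.max() - S_fixed.min() + 1e-8)
X = S_norm.astype("float32")[..., np.newaxis]
print(X.shape, X.dtype)  # (128, 128, 1) float32
print(float(X.min()), float(X.max()))

# Edge case: a constant input has max == min; the epsilon avoids a 0/0 division
silent = np.full((128, 128), -80.0)
silent_norm = (silent - silent.min()) / (silent.max() - silent.min() + 1e-8)
print(np.all(silent_norm == 0.0))
```

Note that this scaling is per example, so each clip is stretched to its own `[0, 1]` range regardless of how loud it was overall.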


## Loading the Dataset

For this baseline implementation, we'll load a small subset of the data (a handful of files from each class folder) to validate our pipeline. This allows us to quickly test the model architecture and training loop before scaling up. In the run recorded here, loading 20 files per class produced `X` with shape `(60, 128, 128, 1)` and `y` with shape `(60, 3)`.

In [None]:
def load_dataset(max_files_per_class: int = 20):
    """
    Load audio files from all classes and convert them to model-ready format.

    Args:
        max_files_per_class: Maximum number of files to load per class

    Returns:
        X: Array of shape (N, 128, 128, 1) where N is total number of samples
        y: Array of shape (N, num_classes) with one-hot encoded labels
    """
    X_list = []
    y_list = []

    for label in CLASS_NAMES:
        class_dir = DATA_DIR / label
        wav_files = sorted(class_dir.glob("*.wav"))

        for audio_path in wav_files[:max_files_per_class]:
            X, y = load_example_for_model(audio_path, label)
            X_list.append(X)
            y_list.append(y)

    # Stack individual samples into batch tensors
    X = np.stack(X_list, axis=0)
    y = np.stack(y_list, axis=0)
    return X, y

# Load a small dataset for baseline testing
X, y = load_dataset(max_files_per_class=20)
print(f"Dataset shape - X: {X.shape}, y: {y.shape}")
print(f"Total samples: {X.shape[0]}, Classes: {y.shape[1]}")

### Dataset Structure

The `load_dataset` function:
1. Loops over each class ("dog", "cat", "bird")
2. Finds all `.wav` files in each class directory
3. Takes up to `max_files_per_class` files per class
4. Converts each file to `(X, y)` using `load_example_for_model`
5. Stacks all samples into batch tensors:
   - `X.shape = (N, 128, 128, 1)` where N is the total number of samples
   - `y.shape = (N, 3)` with one-hot encoded labels for 3 classes

This gives us a small dataset to test our model pipeline before scaling up to the full dataset.
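The stacking step and a simple class-balance check can be sketched on placeholder data. The arrays below are synthetic stand-ins (two fake examples per class), not the real dataset; `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np

CLASS_NAMES = ["dog", "cat", "bird"]

# Two placeholder examples per class, in class order (as load_dataset would append them)
X_list = [np.zeros((128, 128, 1), dtype="float32") for _ in range(6)]
y_list = [np.eye(len(CLASS_NAMES), dtype="float32")[i // 2] for i in range(6)]

# np.stack adds a new leading batch axis
X_demo = np.stack(X_list, axis=0)
y_demo = np.stack(y_list, axis=0)
print(X_demo.shape, y_demo.shape)  # (6, 128, 128, 1) (6, 3)

# Count examples per class by converting one-hot rows back to indices
counts = np.bincount(y_demo.argmax(axis=1), minlength=len(CLASS_NAMES))
for name, count in zip(CLASS_NAMES, counts):
    print(f"{name}: {count}")
```

Running the same `bincount` check on the real `y` is a cheap way to confirm that every class folder actually contributed files.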




## Baseline CNN Architecture

We define a simple convolutional neural network suitable for audio classification. The architecture consists of two convolutional blocks followed by fully connected layers.

### Architecture Overview

**Input Shape**: `(128, 128, 1)`
- Height: 128 mel frequency bands
- Width: 128 time frames
- Channels: 1 (grayscale spectrogram)

**Architecture**:
1. **Conv Block 1**: 32 filters, 3×3 kernels → MaxPooling
2. **Conv Block 2**: 64 filters, 3×3 kernels → MaxPooling
3. **Flatten**: Convert 2D feature maps to 1D vector
4. **Dense Layer**: 64 neurons with ReLU activation
5. **Output Layer**: 3 neurons (one per class) with softmax activation


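In [None]:
input_shape = (128, 128, 1)
num_classes = len(CLASS_NAMES)

model = models.Sequential([
    # Declare the input shape explicitly; recent Keras versions warn when
    # `input_shape` is passed to the first layer of a Sequential model
    layers.Input(shape=input_shape),

    # Block 1: First convolutional layer
    layers.Conv2D(
        filters=32,
        kernel_size=(3, 3),
        activation="relu",
        padding="same",
    ),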
    layers.MaxPooling2D(pool_size=(2, 2)),

    # Block 2: Second convolutional layer
    layers.Conv2D(
        filters=64,
        kernel_size=(3, 3),
        activation="relu",
        padding="same",
    ),
    layers.MaxPooling2D(pool_size=(2, 2)),

    # Flatten + Dense layers
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.summary()

### Layer-by-Layer Explanation

**Input Layer**:
- Shape: `(128, 128, 1)` - 128 mel bands × 128 time frames × 1 channel

**Conv Block 1**:
- **Conv2D(32 filters, 3×3)**: Learns 32 different 3×3 filters to detect local patterns
  - Input: `(128, 128, 1)` → Output: `(128, 128, 32)` (same spatial size due to `padding="same"`)
  - ReLU activation: `ReLU(x) = max(0, x)` introduces non-linearity
- **MaxPooling2D(2×2)**: Downsamples by taking maximum of each 2×2 block
  - Input: `(128, 128, 32)` → Output: `(64, 64, 32)`

**Conv Block 2**:
- **Conv2D(64 filters, 3×3)**: Learns 64 more complex features
  - Input: `(64, 64, 32)` → Output: `(64, 64, 64)`
- **MaxPooling2D(2×2)**: Further downsampling
  - Input: `(64, 64, 64)` → Output: `(32, 32, 64)`

**Flatten**:
- Converts `(32, 32, 64)` to 1D vector of size `32 × 32 × 64 = 65,536`

**Dense(64, ReLU)**:
- Fully connected layer with 64 neurons
- Combines all extracted features into a compact representation

**Dense(3, Softmax)**:
- Output layer with 3 neurons (one per class: dog, cat, bird)
- Softmax activation ensures outputs sum to 1 and are interpretable as probabilities
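The shapes above also determine how many trainable parameters each layer has. The following is a back-of-the-envelope sketch in plain Python, using the standard formulas (a convolution has `kernel_h × kernel_w × in_channels + 1` weights per filter, a dense layer has `inputs + 1` per unit). The layer labels in the dictionary are hypothetical names for this example; `model.summary()` remains the authoritative source.

```python
def conv2d_params(kernel: tuple, in_channels: int, filters: int) -> int:
    """Weights plus one bias per filter."""
    kh, kw = kernel
    return (kh * kw * in_channels + 1) * filters

def dense_params(inputs: int, units: int) -> int:
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

flat_size = 32 * 32 * 64  # output of Flatten, from the shape trace above

counts = {
    "conv_block_1": conv2d_params((3, 3), in_channels=1, filters=32),
    "conv_block_2": conv2d_params((3, 3), in_channels=32, filters=64),
    "dense_64": dense_params(flat_size, 64),
    "dense_out": dense_params(64, 3),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>12}: {n:>9,} ({100 * n / total:.1f}%)")
print(f"{'total':>12}: {total:>9,}")
```

Almost all of the roughly 4.2 million parameters sit in the first dense layer, because it connects all 65,536 flattened features to 64 units. With only 60 training examples, that is worth keeping in mind when reading the training results.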


In [None]:
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

## Sanity Check: A Tiny Training Run

A long training run isn't needed yet; a few epochs are enough to test that everything is wired correctly:

In [None]:
history = model.fit(
    X,
    y,
    epochs=3,
    batch_size=8,
    validation_split=0.2,
    verbose=1
)


Epoch 1/3
6/6 ━━━━━━━━━━━━━━━━━━━━ 1s 81ms/step - accuracy: 0.3125 - loss: 1.1382 - val_accuracy: 0.0000e+00 - val_loss: 1.3535
Epoch 2/3
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 43ms/step - accuracy: 0.5417 - loss: 0.9948 - val_accuracy: 0.0000e+00 - val_loss: 1.4463
Epoch 3/3
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step - accuracy: 0.6042 - loss: 0.9258 - val_accuracy: 0.0000e+00 - val_loss: 1.8839
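A likely explanation for the validation accuracy of exactly zero is the combination of ordered data and `validation_split`. Keras takes the validation samples from the end of the arrays before any shuffling, and `load_dataset` appends examples class by class. With 60 samples and `validation_split=0.2`, the last 12 samples would all be bird, while the training portion would contain 20 dog, 20 cat, and only 8 bird. The sketch below reproduces this on synthetic labels and shows one possible remedy, shuffling `X` and `y` together before calling `fit`. The arrays and names (`X_demo`, `y_demo`, `class_counts`) are placeholders, not the notebook's data.

```python
import numpy as np

# Synthetic stand-in for the labels from load_dataset: 20 dog, 20 cat, 20 bird, in order
labels = np.repeat(np.arange(3), 20)
y_demo = np.eye(3, dtype="float32")[labels]
X_demo = np.zeros((60, 128, 128, 1), dtype="float32")  # placeholder inputs

def class_counts(y_onehot: np.ndarray) -> np.ndarray:
    """Number of examples per class in a one-hot label array."""
    return y_onehot.sum(axis=0).astype(int)

n_val = int(len(y_demo) * 0.2)  # size of the slice taken from the end
print("unshuffled validation counts:", class_counts(y_demo[-n_val:]))  # [ 0  0 12]

# Shuffle inputs and labels with the same permutation so pairs stay aligned
rng = np.random.default_rng(42)
perm = rng.permutation(len(y_demo))
X_shuf, y_shuf = X_demo[perm], y_demo[perm]
print("shuffled validation counts:  ", class_counts(y_shuf[-n_val:]))
```

Passing the shuffled arrays to `model.fit` would give a validation set that mixes classes; with so few samples the estimate would still be noisy, and a stratified split is another option.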


### Compilation Parameters

- **`optimizer="adam"`**: Adam (Adaptive Moment Estimation) is a widely used optimizer that adapts the learning rate per parameter. It's a good default choice that often works well without extensive hyperparameter tuning.

- **`loss="categorical_crossentropy"`**: Suitable for multi-class classification where:
  - We have multiple mutually exclusive classes (dog vs cat vs bird)
  - Labels are one-hot encoded vectors: `[1, 0, 0]`, `[0, 1, 0]`, `[0, 0, 1]`
  - Measures the difference between predicted probability distribution and true distribution

- **`metrics=["accuracy"]`**: Tracks the proportion of examples where the predicted class (argmax of predictions) matches the true class (argmax of true labels).

With these settings in place, the model is ready for longer training runs.
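To make the loss and metric concrete, here is a minimal NumPy sketch of categorical cross-entropy and accuracy on two made-up predictions. It is meant to illustrate the formulas, not to reproduce Keras internals exactly; the function name and the `eps` clipping value are choices made for this example.

```python
import numpy as np

def categorical_crossentropy(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-7) -> np.ndarray:
    """Per-example loss: -sum_k y_true[k] * log(y_pred[k])."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

# Two made-up examples: a dog predicted fairly well, a bird predicted poorly
y_true = np.array([[1, 0, 0], [0, 0, 1]], dtype="float32")
y_pred = np.array([[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]], dtype="float32")

losses = categorical_crossentropy(y_true, y_pred)
print(losses)  # approximately [0.357, 1.609], i.e. -ln(0.7) and -ln(0.2)

# Accuracy compares the argmax of the prediction with the argmax of the label
accuracy = np.mean(y_pred.argmax(axis=1) == y_true.argmax(axis=1))
print(accuracy)  # 0.5

# Reference point: a uniform guess over 3 classes costs ln(3)
print(np.log(3))  # approximately 1.0986
```

The `ln(3) ≈ 1.10` reference is useful when reading training logs: a loss near that value means the model is doing no better than guessing uniformly across the three classes.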
