# Audio Classification: Improved CNN Models

**Course:** CSCI 6366 (Neural Networks and Deep Learning)  
**Project:** Audio Classification using CNN  
**Notebook:** Improved CNN Architectures and Regularization Experiments

## Overview

This notebook builds on the baseline CNN from `02_cnn_baseline.ipynb` and explores improvements to reduce overfitting and improve generalization. We will:

1. **Reuse the same data preprocessing pipeline** from the baseline (Mel-spectrograms, 128×128, normalization).
2. **Experiment with improved architectures**:
   - Smaller Dense layers to reduce parameter count
   - Dropout regularization
   - Potentially other architectural tweaks
3. **Compare results** against the baseline model using the same train/val/test splits.
4. **Report metrics** including accuracy, loss curves, and potentially confusion matrices.

The goal is to find a model that generalizes better than the baseline while maintaining reasonable training performance.



## Imports and Setup

**Goal:** Import necessary libraries and set up configuration constants.

- Standard libraries: `numpy`, `pathlib` for data handling
- Audio processing: `librosa` for Mel-spectrogram computation
- Deep learning: `tensorflow` / `keras` for model building
- Data splitting: `sklearn` for stratified train/val/test splits
- Visualization: `matplotlib` for plotting training curves



In [None]:
import numpy as np
from pathlib import Path

import librosa
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras import layers, models
from sklearn.model_selection import train_test_split

# Configuration (same as baseline)
DATA_DIR = Path("../data").resolve()
CLASS_NAMES = ["dog", "cat", "bird"]
label_to_index = {label: idx for idx, label in enumerate(CLASS_NAMES)}


## Data Loading and Preprocessing

**Goal:** Reuse the exact same preprocessing functions from the baseline to ensure fair comparison.

- Load audio files and convert to Mel-spectrograms (128×128)
- Normalize each spectrogram to the [0, 1] range (per-example min-max scaling)
- Create one-hot encoded labels
- Use the same train/val/test split strategy with stratification

This ensures that any performance differences come from model architecture changes, not data differences.
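To make the 128×128 target concrete, the sketch below estimates how much audio 128 time frames cover with the defaults used by `load_mel_spectrogram` in this notebook (`sr=16000`, `hop_length=512`). It assumes librosa's default centered framing, where the frame count is `1 + n_samples // hop_length`. The helper names `n_frames` and `action_for_clip` are hypothetical and exist only for this illustration.

```python
# Illustrative arithmetic only; no audio is loaded here.
SR = 16000          # sample rate used by load_mel_spectrogram
HOP_LENGTH = 512    # hop length used by load_mel_spectrogram
TARGET_FRAMES = 128 # target width of the spectrogram


def n_frames(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Frame count for a centered STFT (assumption: librosa's default center=True)."""
    return 1 + n_samples // hop_length


def action_for_clip(duration_s: float) -> str:
    """Hypothetical helper: say whether a clip of this length is cropped, padded, or kept."""
    frames = n_frames(int(duration_s * SR))
    if frames > TARGET_FRAMES:
        return "crop"
    if frames < TARGET_FRAMES:
        return "pad"
    return "keep"


# Time span between the first and last of 128 frame centers, in seconds.
seconds_covered = (TARGET_FRAMES - 1) * HOP_LENGTH / SR

print(f"128 frames span about {seconds_covered:.3f} s")
print("1 s clip:", action_for_clip(1.0))
print("10 s clip:", action_for_clip(10.0))
```

Under these assumptions, clips much shorter than about four seconds are mostly padding, and longer clips lose everything outside the centered window.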



In [None]:
# Reuse preprocessing functions from baseline (or import if refactored)
def load_mel_spectrogram(
    audio_path: Path,
    sr: int = 16000,
    n_fft: int = 1024,
    hop_length: int = 512,
    n_mels: int = 128,
) -> tuple[np.ndarray, int]:
    """Load an audio file and compute its Mel-spectrogram in dB scale."""
    y, sr = librosa.load(audio_path, sr=sr)
    S = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels, power=2.0
    )
    S_db = librosa.power_to_db(S, ref=np.max)
    return S_db, sr

def pad_or_crop_spectrogram(S_db: np.ndarray, target_shape=(128, 128)) -> np.ndarray:
    """Ensure the Mel-spectrogram has shape (target_height, target_width)."""
    target_height, target_width = target_shape
    n_mels, time_frames = S_db.shape
    
    if n_mels != target_height:
        raise ValueError(f"Expected {target_height} mel bands, got {n_mels}")
    
    if time_frames > target_width:
        # Too long: keep the centered window of target_width frames.
        start = (time_frames - target_width) // 2
        end = start + target_width
        S_db = S_db[:, start:end]
    elif time_frames < target_width:
        # Too short: right-pad the time axis with the quietest dB value.
        pad_width = target_width - time_frames
        S_db = np.pad(
            S_db,
            pad_width=((0, 0), (0, pad_width)),
            mode="constant",
            constant_values=S_db.min(),
        )
    return S_db

def load_example_for_model(audio_path: Path, label: str) -> tuple[np.ndarray, np.ndarray]:
    """Load one audio file and return (X, y) ready for model."""
    S_db, sr = load_mel_spectrogram(audio_path)
    S_fixed = pad_or_crop_spectrogram(S_db, target_shape=(128, 128))
    
    # Min-max normalize this example to [0, 1]; the epsilon avoids division by zero for a constant input
    S_min = S_fixed.min()
    S_max = S_fixed.max()
    S_norm = (S_fixed - S_min) / (S_max - S_min + 1e-8)
    
    # Add channel dimension
    X = S_norm.astype("float32")[..., np.newaxis]
    
    # One-hot label
    num_classes = len(CLASS_NAMES)
    y = np.zeros(num_classes, dtype="float32")
    y[label_to_index[label]] = 1.0
    
    return X, y

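def load_dataset(max_files_per_class: int = 20) -> tuple[np.ndarray, np.ndarray]:
    """Load up to `max_files_per_class` examples per class, as in the baseline.

    Returns X with shape (N, 128, 128, 1) and one-hot y with shape (N, num_classes).
    """
    X_list = []
    y_list = []
    
    for label in CLASS_NAMES:
        class_dir = DATA_DIR / label
        # Sort so the same files are selected on every run.
        wav_files = sorted(class_dir.glob("*.wav"))
        
        for audio_path in wav_files[:max_files_per_class]:
            X, y = load_example_for_model(audio_path, label)
            X_list.append(X)
            y_list.append(y)
    
    if not X_list:
        raise FileNotFoundError(f"No .wav files found under {DATA_DIR}")
    
    X = np.stack(X_list, axis=0)
    y = np.stack(y_list, axis=0)
    return X, y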

# Load data
X, y = load_dataset(max_files_per_class=20)
print(f"Dataset shape: X={X.shape}, y={y.shape}")


### Train/Validation/Test Split

**Goal:** Create the same stratified splits as the baseline for fair comparison.

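- Use `random_state=42` so that both splits are reproducible
- 20% for the test set (held out completely)
- 20% of the remaining data for validation (about 16% of the full dataset)
- The rest for training (about 64% of the full dataset)
- Stratify by class so each split keeps the same class proportions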
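As a quick sanity check on the split sizes, the sketch below works through the arithmetic for a hypothetical dataset of 60 examples (3 classes × 20 files, which assumes every class directory holds at least 20 `.wav` files). It also assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample; the helper `split_sizes` is hypothetical and does not call scikit-learn.

```python
import math


def split_sizes(n_samples: int, test_frac: float = 0.2, val_frac: float = 0.2):
    """Hypothetical helper: expected (train, val, test) counts for the two-stage split.

    Assumption: the held-out share is rounded up, the remainder goes to the other side.
    """
    n_test = math.ceil(test_frac * n_samples)
    n_train_full = n_samples - n_test
    n_val = math.ceil(val_frac * n_train_full)
    n_train = n_train_full - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(60)
print(f"Train: {n_train}, Val: {n_val}, Test: {n_test}")
```

With so few validation and test examples, a single misclassified clip moves accuracy by several percentage points, which is worth keeping in mind when comparing models.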



In [None]:
# Convert to class indices for stratification
y_indices = np.argmax(y, axis=1)

# First split: test set (20%)
X_train_full, X_test, y_train_full, y_test, y_train_full_idx, y_test_idx = train_test_split(
    X, y, y_indices,
    test_size=0.2,
    random_state=42,
    stratify=y_indices,
)

# Second split: train/val from remaining (20% of train_full becomes val)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full,
    test_size=0.2,
    random_state=42,
    stratify=y_train_full_idx,
)

print(f"Train: {X_train.shape[0]}, Val: {X_val.shape[0]}, Test: {X_test.shape[0]}")


## Improved Model 1: Baseline with Dropout

**Goal:** Add dropout regularization to the baseline architecture to reduce overfitting.

- Keep the same architecture as baseline (Conv(32) → Pool → Conv(64) → Pool → Flatten → Dense(64) → Dense(3))
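- Add Dropout(0.5) after the Dense(64) layer, which randomly zeroes half of its activations at each training step
- Because the network cannot rely on any single unit, this should make it harder to memorize the training data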

Expected: Similar or slightly lower training accuracy, but better validation/test accuracy due to reduced overfitting.
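The mechanism can be illustrated with a minimal NumPy sketch of inverted dropout, the scheme in which surviving activations are scaled by `1 / (1 - rate)` during training and the layer is a no-op at inference. This is an illustration of the idea, not the Keras implementation; the function name `inverted_dropout` is hypothetical.

```python
import numpy as np


def inverted_dropout(activations, rate, rng, training=True):
    """Hypothetical helper: zero a random fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        # At inference time the activations pass through unchanged.
        return activations
    keep_prob = 1.0 - rate
    # Boolean mask: True where the unit survives this training step.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale survivors so the expected activation matches inference time.
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
a = np.ones((4, 64), dtype="float32")  # stand-in for Dense(64) outputs on a batch of 4

dropped = inverted_dropout(a, rate=0.5, rng=rng)
print("fraction zeroed:", float((dropped == 0).mean()))
print("surviving value:", float(dropped.max()))
```

With `rate=0.5`, every surviving unit is doubled, so the mean activation stays close to its inference-time value even though roughly half the units are silenced.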



In [None]:
input_shape = (128, 128, 1)
num_classes = len(CLASS_NAMES)

model_dropout = models.Sequential([
    tf.keras.Input(shape=input_shape),
    
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),  # Add dropout here
    layers.Dense(num_classes, activation="softmax"),
])

model_dropout.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

model_dropout.summary()


### Training Model 1 (with Dropout)

**Goal:** Train the dropout model with the same hyperparameters as baseline for fair comparison.

- Optimizer: Adam
- Loss: categorical crossentropy
- Batch size: 8
- Epochs: 10
- Monitor validation accuracy to see if dropout helps



In [None]:
history_dropout = model_dropout.fit(
    X_train,
    y_train,
    epochs=10,
    batch_size=8,
    validation_data=(X_val, y_val),
    verbose=1,
)


### Evaluation and Visualization

**Goal:** Plot training curves and evaluate on test set to compare with baseline.

- Plot training vs validation loss and accuracy
- Evaluate on held-out test set
- Compare accuracies with the baseline (train≈0.89, val≈0.60, test≈0.42)
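Since accuracy alone hides which classes are confused with each other, a confusion matrix can complement these numbers. The sketch below builds one with NumPy from one-hot labels and predicted probabilities. The function name `confusion_matrix_from_onehot` and the toy arrays are hypothetical and are not results from this notebook.

```python
import numpy as np


def confusion_matrix_from_onehot(y_true_onehot, y_prob, num_classes):
    """Hypothetical helper: rows are true classes, columns are predicted classes."""
    y_true = np.argmax(y_true_onehot, axis=1)
    y_pred = np.argmax(y_prob, axis=1)
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # Increment cell (true, predicted) once per example, handling repeated pairs.
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


# Toy data for illustration only: 5 examples, 3 classes.
y_true_demo = np.eye(3)[[0, 0, 1, 2, 2]]
y_prob_demo = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.2, 0.2],
])

cm_demo = confusion_matrix_from_onehot(y_true_demo, y_prob_demo, num_classes=3)
print(cm_demo)
```

In this notebook the same helper could be called as `confusion_matrix_from_onehot(y_test, model_dropout.predict(X_test), len(CLASS_NAMES))` once the model has been trained.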



In [None]:
def plot_training_curves(history, title="Training Curves"):
    """Plot training and validation loss/accuracy."""
    history_dict = history.history
    
    train_loss = history_dict.get("loss", [])
    val_loss = history_dict.get("val_loss", [])
    train_acc = history_dict.get("accuracy", [])
    val_acc = history_dict.get("val_accuracy", [])
    
    epochs = range(1, len(train_loss) + 1)
    
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.plot(epochs, train_loss, label="Train loss")
    if val_loss:
        plt.plot(epochs, val_loss, label="Val loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title(f"{title} - Loss")
    plt.legend()
    
    plt.subplot(1, 2, 2)
    plt.plot(epochs, train_acc, label="Train acc")
    if val_acc:
        plt.plot(epochs, val_acc, label="Val acc")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.title(f"{title} - Accuracy")
    plt.legend()
    
    plt.tight_layout()
    plt.show()

plot_training_curves(history_dropout, title="Model 1: Baseline + Dropout")

# Evaluate on test set
test_loss_dropout, test_acc_dropout = model_dropout.evaluate(X_test, y_test, batch_size=8, verbose=0)
print(f"\nTest Loss: {test_loss_dropout:.4f}")
print(f"Test Accuracy: {test_acc_dropout:.4f}")


#### Interpretation

- What do we see in the plots/numbers?
- Is it better or worse than baseline?
- Any guess why?

*(Fill this in after running the training and seeing the results)*


## Improved Model 2: Smaller Dense Layer

**Goal:** Reduce model capacity by using a smaller Dense layer to combat overfitting.

- Same Conv layers as baseline
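- Reduce Dense(64) to Dense(32), which roughly halves the parameter count
- This should reduce overfitting by making the model less expressive
- Compare the parameter count with the baseline (~4.2M for the baseline versus ~2.1M here; nearly all of it sits in the first Dense layer)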

Expected: Lower training accuracy but potentially better generalization if the baseline was overfitting due to too many parameters.
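The parameter counts can be checked by hand before calling `model.summary()`. The sketch below reproduces the arithmetic for the architecture described above (two 3×3 `same`-padded convolutions with 2×2 pooling on a 128×128×1 input); the helper names are hypothetical.

```python
def conv2d_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Weights plus one bias per output channel."""
    return (kernel * kernel * in_ch + 1) * out_ch


def dense_params(in_units: int, out_units: int) -> int:
    """Weights plus one bias per output unit."""
    return (in_units + 1) * out_units


def total_params(dense_units: int, input_size: int = 128, num_classes: int = 3) -> int:
    """Hypothetical helper: total trainable parameters for the CNN in this notebook."""
    conv1 = conv2d_params(3, 1, 32)   # 320
    conv2 = conv2d_params(3, 32, 64)  # 18,496
    side = input_size // 2 // 2       # two 2x2 poolings: 128 -> 64 -> 32
    flat = side * side * 64           # 65,536 features after Flatten
    hidden = dense_params(flat, dense_units)
    out = dense_params(dense_units, num_classes)
    return conv1 + conv2 + hidden + out


baseline_params = total_params(64)
small_params = total_params(32)
print(f"Dense(64): {baseline_params:,} parameters")
print(f"Dense(32): {small_params:,} parameters")
```

The convolutional layers contribute fewer than 20,000 parameters in either model, so shrinking the Dense layer is where almost all of the reduction comes from.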



In [None]:
model_small = models.Sequential([
    tf.keras.Input(shape=input_shape),
    
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    
    layers.Flatten(),
    layers.Dense(32, activation="relu"),  # Smaller: 64 → 32
    layers.Dense(num_classes, activation="softmax"),
])

model_small.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

model_small.summary()


### Training Model 2 (Smaller Dense)

**Goal:** Train with same hyperparameters to isolate the effect of architecture change.

- Same training setup as Model 1 and baseline
- Monitor whether the smaller capacity helps generalization



In [None]:
history_small = model_small.fit(
    X_train,
    y_train,
    epochs=10,
    batch_size=8,
    validation_data=(X_val, y_val),
    verbose=1,
)


### Evaluation Model 2

**Goal:** Visualize results and get test metrics for comparison.



In [None]:
plot_training_curves(history_small, title="Model 2: Smaller Dense Layer")

test_loss_small, test_acc_small = model_small.evaluate(X_test, y_test, batch_size=8, verbose=0)
print(f"\nTest Loss: {test_loss_small:.4f}")
print(f"Test Accuracy: {test_acc_small:.4f}")


#### Interpretation

- What do we see in the plots/numbers?
- Is it better or worse than baseline and Model 1?
- Any guess why?

*(Fill this in after running the training and seeing the results)*


## Model Comparison Summary

**Goal:** Compare all models side-by-side to identify the best approach.

- Create a table comparing train/val/test accuracies
- Compare parameter counts
- Identify which model generalizes best
- Note any trade-offs (e.g., lower train acc but better test acc)
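One simple way to quantify how well a model generalizes is the gap between training accuracy and validation or test accuracy. The sketch below computes it for the baseline numbers quoted in this notebook (train≈0.89, val≈0.60, test≈0.42); the function name `generalization_gaps` is hypothetical.

```python
def generalization_gaps(train_acc: float, val_acc: float, test_acc: float) -> dict:
    """Hypothetical helper: accuracy gaps, where larger values suggest more overfitting."""
    return {
        "train_minus_val": round(train_acc - val_acc, 4),
        "train_minus_test": round(train_acc - test_acc, 4),
    }


# Baseline accuracies as reported in this notebook.
baseline_gaps = generalization_gaps(0.89, 0.60, 0.42)
print(baseline_gaps)
```

The same function can be applied to the final metrics of the other two models; a smaller gap at a similar or better test accuracy would indicate improved generalization.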



In [None]:
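# Extract final-epoch metrics from the training histories.
# Note: Keras reports training accuracy as a running average over the epoch's batches,
# with dropout active where present, so it is not strictly comparable to val/test accuracy.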
train_acc_dropout = history_dropout.history['accuracy'][-1]
val_acc_dropout = history_dropout.history['val_accuracy'][-1]
train_loss_dropout = history_dropout.history['loss'][-1]
val_loss_dropout = history_dropout.history['val_loss'][-1]

train_acc_small = history_small.history['accuracy'][-1]
val_acc_small = history_small.history['val_accuracy'][-1]
train_loss_small = history_small.history['loss'][-1]
val_loss_small = history_small.history['val_loss'][-1]

# Baseline numbers (from 02_cnn_baseline.ipynb)
baseline_train_acc = 0.89
baseline_val_acc = 0.60
baseline_test_acc = 0.42
baseline_train_loss = 0.48
baseline_val_loss = 0.94
baseline_test_loss = 1.12

print("=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(f"{'Model':<25} {'Train Acc':<12} {'Val Acc':<12} {'Test Acc':<12}")
print("-" * 60)
print(f"{'Baseline':<25} {baseline_train_acc:<12.4f} {baseline_val_acc:<12.4f} {baseline_test_acc:<12.4f}")
print(f"{'Baseline + Dropout':<25} {train_acc_dropout:<12.4f} {val_acc_dropout:<12.4f} {test_acc_dropout:<12.4f}")
print(f"{'Smaller Dense (32)':<25} {train_acc_small:<12.4f} {val_acc_small:<12.4f} {test_acc_small:<12.4f}")
print("=" * 60)


#### Final Interpretation and Conclusions

- Which model performed best overall?
- Did dropout help reduce overfitting?
- Did reducing capacity help or hurt?
- What would you try next?

*(Fill this in after running all experiments and comparing results)*
