# Lesson 3: Convolutional layers

*Teachers:* Fares Schulz, Lina Campanella

In this lesson we will cover:
1. An introduction to convolution
2. Defining a convolutional neural network in PyTorch for audio classification
3. An explanation of how to use different devices (CPU, CUDA, MPS, etc.) for computation

## Convolution

In purely mathematical terms, convolution is an operation that combines two functions through integration and expresses how the shape of one function is modified by the other. In discrete terms, the convolution of a longer signal with a shorter one can be understood as a form of _filtering_. The shorter signal, often called a _kernel_ or _filter_, slides over the longer signal, and at each position we compute a weighted sum of the overlapping elements. Formally, for a discrete 1D signal $f$ and a kernel $g$ with $M$ coefficients, the convolution operator $\star$ is defined as

$$
(f \star g)[n] = \sum_{m=0}^{M-1} f[n - m] \, g[m].
$$
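To make the sum concrete, here is a minimal sketch that evaluates it literally with two loops and compares the result with `np.convolve`. The toy arrays `f` and `g` and the helper `convolve_direct` are illustrative names, not part of the lesson code. The sketch assumes that the kernel is indexed from 0 and that samples outside the signal count as zero.

```python
import numpy as np

def convolve_direct(f, g):
    """Full discrete convolution, computed directly from the sum over m."""
    n_out = len(f) + len(g) - 1
    out = np.zeros(n_out)
    for n in range(n_out):
        for m in range(len(g)):
            if 0 <= n - m < len(f):   # samples outside the signal count as zero
                out[n] += f[n - m] * g[m]
    return out

f = np.array([1.0, 2.0, 3.0, 4.0])   # toy signal
g = np.array([0.5, 0.5])             # two-tap moving average kernel

y_direct = convolve_direct(f, g)
y_numpy = np.convolve(f, g, mode="full")

print(y_direct)                        # [0.5 1.5 2.5 3.5 2. ]
print(np.allclose(y_direct, y_numpy))  # True
```

Each output sample is the average of two neighbouring input samples, which is the simplest form of smoothing.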



## A simple convolution in NumPy

In the context of audio processing, convolution can be used to _filter_ a sound signal, modifying its frequency content. For example, a low-pass finite impulse response (FIR) filter allows low-frequency components of a signal to pass through while attenuating higher frequencies. By convolving an audio signal with the FIR filter coefficients, we effectively smooth the signal, removing rapid fluctuations corresponding to high-frequency noise.
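The effect of a low-pass FIR filter can be checked on a synthetic signal before applying it to real audio. The following sketch is only an illustration under assumed values: the sample rate of 8000 Hz, the two test frequencies, the 500 Hz cutoff and the 101 taps are not taken from the lesson. It measures how strongly a low and a high sine component are attenuated by the convolution.

```python
import numpy as np
from scipy import signal

fs = 8000                                   # assumed sample rate in Hz
t = np.arange(fs) / fs                      # one second of samples
low = np.sin(2 * np.pi * 100 * t)           # 100 Hz component, below the cutoff
high = np.sin(2 * np.pi * 3000 * t)         # 3000 Hz component, above the cutoff

taps = signal.firwin(101, 500, fs=fs)       # 101-tap low-pass filter, 500 Hz cutoff

low_out = np.convolve(low, taps, mode="same")
high_out = np.convolve(high, taps, mode="same")

def rms(x):
    """Root mean square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

core = slice(200, -200)                     # ignore edge effects at both ends
gain_low = rms(low_out[core]) / rms(low[core])
gain_high = rms(high_out[core]) / rms(high[core])

print(f"Gain at 100 Hz:  {gain_low:.3f}")
print(f"Gain at 3000 Hz: {gain_high:.3f}")
```

The low component passes almost unchanged, while the high component is strongly attenuated.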


The behavior of the convolution is determined by several parameters:
- **filter length (M)**: the number of coefficients in the FIR filter, which determines the amount of smoothing  
- **padding**: zeros can be added at the start and end of the signal to control the output length

Convolution is fundamental in signal processing and machine learning, allowing us to extract features, remove noise, or apply effects. In neural networks for audio, convolutional layers operate on raw waveforms or spectrograms, learning patterns in time or frequency automatically.



In the first example we will convolve an audio sample with a low-pass filter using NumPy. We go through the following three steps:
1. Pad the signal to control the output size. `np.convolve` can also do this implicitly: `mode='same'` returns an output as long as the input, and `mode='full'` corresponds to padding `M - 1` zeros on each side.
2. Convolve the (padded) signal with the filter.
3. Downsample the output using a stride to reduce its length and the computational cost, a concept that becomes especially important when working with CNNs.
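The interplay of padding, convolution mode and stride in the three steps above can be checked on a toy example. This sketch uses a made-up signal `x` and kernel `h` (not the lesson's audio) and only prints output lengths.

```python
import numpy as np

x = np.arange(10, dtype=float)      # toy signal, N = 10
h = np.array([0.25, 0.5, 0.25])     # toy kernel, M = 3
N, M = len(x), len(h)

# Output lengths of the three modes: N + M - 1, N and N - M + 1
lengths = {mode: len(np.convolve(x, h, mode=mode)) for mode in ("full", "same", "valid")}
print(lengths)

# Padding M - 1 zeros on both sides and using mode='valid' reproduces mode='full'
x_padded = np.pad(x, M - 1)
y_padded = np.convolve(x_padded, h, mode="valid")
y_full = np.convolve(x, h, mode="full")
print(np.allclose(y_padded, y_full))

# A stride of 2 keeps every second output sample and halves the length
stride = 2
y_strided = y_full[::stride]
print(len(y_full), len(y_strided))
```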

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import sounddevice as sd
from scipy.io import wavfile
from scipy import signal

In [None]:
sample_rate, audio = wavfile.read("recources/_audio/foot_steps.wav")

# In case the audio is stereo or multichannel only the first channel is used
if audio.ndim > 1:
    audio = audio[:,0]
    
# Convert to float (WAV data is often stored as integers) and normalize to [-1, 1]
audio = audio.astype(np.float64)
audio = audio / np.max(np.abs(audio))

sd.play(audio,sample_rate)

In [4]:
# Create a simple FIR filter (low-pass)
M = 11
filter_IR = signal.firwin(M, 200, fs=sample_rate)

down_factor = 2 # downsampling factor

# Note: this keeps only the first len(audio) // down_factor samples of the signal
audio = audio[:len(audio) // down_factor]

In [5]:
# Zero padding
pad = M - 1

# Manual zero padding (same as np.pad)
audio_padded = np.concatenate([np.zeros(int(pad)), audio, np.zeros(int(pad))])

# Convolution in numpy
y_conv = np.convolve(audio_padded, filter_IR, mode='valid')

# Downsampling
y_down = y_conv[::down_factor]

sd.play(y_down,sample_rate/down_factor)

In [None]:
# Plot the data
x = np.arange(len(audio)) / sample_rate 
x_conv = np.arange(len(y_down)) / (sample_rate / down_factor)

plt.figure(figsize=(8, 5), dpi=120)
plt.plot(x[7000:9000], audio[7000:9000], color='C0', lw=1.2, alpha=0.8, label="Original audio")
plt.plot(x_conv[3500:4500], y_down[3500:4500], color='C1', lw=2, alpha=0.8, label="Audio after convolution")

plt.title("Audio Signal vs. Audio Signal after Convolution (Filtered)", fontsize=14, weight='bold')
plt.xlabel("Time (seconds)", fontsize=12)
plt.ylabel("Amplitude", fontsize=12)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## A transposed convolution in NumPy

The transposed convolution is a complementary operation that allows us to increase the length of a signal or upsample it. In neural networks, it is commonly used in generative models or audio synthesis, where we want to reconstruct or expand a signal from a lower-resolution representation.

Conceptually, a transposed convolution can be thought of as “reversing” the effect of a normal convolution. Instead of sliding a filter over an input and producing a smaller (or same-size) output, we insert zeros between the input samples (according to the upsampling factor) and then apply a convolution-like operation to spread the input values over a larger output. This effectively increases the signal length while applying the learned filter weights to produce a smooth, meaningful output.
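The zero-insertion view can be illustrated with a very small example. This is a sketch with made-up values: the three-sample signal `x` and the interpolating kernel `h` are assumptions chosen so that the result is easy to read.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # low-rate toy signal
h = np.array([0.5, 1.0, 0.5])      # toy kernel that interpolates linearly
up = 2                             # upsampling factor

# Step 1: insert zeros between the input samples
x_up = np.zeros(len(x) * up)
x_up[::up] = x
print(x_up)                        # [1. 0. 2. 0. 3. 0.]

# Step 2: convolve, so that each input value is spread over its neighbours
y = np.convolve(x_up, h, mode="full")
print(y)                           # [0.5 1.  1.5 2.  2.5 3.  1.5 0. ]
```

The original samples 1, 2 and 3 reappear at every second position, and the kernel fills the inserted zeros with values in between.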

In audio applications, a simple example is upsampling a low-resolution waveform. Here, the downsampled audio signal from before is upsampled again so that it has the same sample rate as the original audio signal.


In [8]:
# Transposed convolution
up_factor = 2 # upsampling factor

# Upsampling
y_up = np.zeros(len(y_down) * up_factor)
y_up[::up_factor] = y_down  # insert zeros

# Zero padding
pad = M - 1
y_up_padded = np.concatenate([np.zeros(int(pad)), y_up, np.zeros(int(pad))])

# Convolve with same filter
y_trans = np.convolve(y_up_padded, filter_IR, mode='valid')
y_trans = y_trans / np.max(np.abs(y_trans))

sd.play(y_trans,sample_rate)

In [None]:
# Plot the data
x_conv = np.arange(len(y_down)) / (sample_rate / down_factor)  
x_trans = np.arange(len(y_trans)) / sample_rate

plt.close('all')
plt.figure(figsize=(8, 5), dpi=120)
plt.plot(x[7000:9000], audio[7000:9000], color='C0', lw=1.2, alpha=0.8, label="Original Audio")
plt.plot(x_conv[3500:4500], y_down[3500:4500], color='C1', lw=1.2, alpha=0.8, label="convolved Audio")
plt.plot(x_trans[7000:9000], y_trans[7000:9000], color='C3', lw=2, alpha=0.8, label="Audio after transposed convolution")


plt.title("Audio Signal vs. Audio Signal after Transposed Convolution", fontsize=14, weight='bold')
plt.xlabel("Time (seconds)", fontsize=12)
plt.ylabel("Amplitude", fontsize=12)
plt.legend(fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

Of course, the original signal cannot be reconstructed exactly: the low-pass filter and the subsequent downsampling discard information that cannot be recovered.

## Convolution with PyTorch

Now we will do the same using PyTorch. In this example we perform the convolution with the same audio file and the same filter using PyTorch's `torch.nn.functional.conv1d`.
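One detail is worth knowing when comparing with NumPy: `conv1d` computes a cross-correlation, that is, it slides the kernel without flipping it. To obtain the same result as `np.convolve`, the kernel has to be flipped first. The sketch below demonstrates this on toy data. The arrays `x` and `h` are illustrative, and `h` is deliberately not symmetric so that the difference becomes visible.

```python
import numpy as np
import torch
import torch.nn.functional as F

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0], dtype=np.float32)
h = np.array([1.0, 0.0, -1.0], dtype=np.float32)    # antisymmetric on purpose

x_t = torch.from_numpy(x).reshape(1, 1, -1)         # (batch, channels, length)
h_t = torch.from_numpy(h).reshape(1, 1, -1)         # (out_channels, in_channels, length)

y_np = np.convolve(x, h, mode="valid")                   # true convolution
y_corr = F.conv1d(x_t, h_t).flatten().numpy()            # kernel used as-is
y_conv = F.conv1d(x_t, h_t.flip(-1)).flatten().numpy()   # kernel flipped

print(y_np)     # [2. 2. 2.]
print(y_corr)   # [-2. -2. -2.]
print(y_conv)   # [2. 2. 2.]
```

For a symmetric kernel, such as a linear-phase FIR low-pass filter, the flip changes nothing, but it keeps the code correct for arbitrary kernels.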

One of the advantages of PyTorch is that it can run computations on different hardware devices. This allows us to take advantage of faster processors when available.

On machines with an NVIDIA GPU, PyTorch can use **CUDA** to perform operations on the graphics card, which is highly efficient for large numerical computations and deep learning. On Macs with Apple Silicon (M1, M2 and later) or other Macs with a supported GPU, PyTorch can use **MPS** (Metal Performance Shaders) to run the same operations on the GPU. If no compatible GPU is available, the computations run on the **CPU**, which is usually slower but always available.

We can easily check which device is available and move our data and models to that device.


In [None]:
from pathlib import Path
import torch
import torch.nn as nn
import torch.nn.functional as F

# Check which device is available
device = torch.device("cpu")

print(f"MPS is available: {torch.backends.mps.is_available()}")
if torch.backends.mps.is_available():
    device = torch.device("mps")
print(f"Cuda is available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    device = torch.device("cuda")
    
print(f"Device that will be used: {device}")

In [None]:
audio_tensor = torch.tensor(audio, dtype=torch.float32, device=device).unsqueeze(0).unsqueeze(0)
# Use the same filter length M as in the NumPy example so that the results are comparable
filter_IR = torch.tensor(signal.firwin(M, 200, fs=sample_rate), dtype=torch.float32, device=device)
kernel = filter_IR.flip(0).unsqueeze(0).unsqueeze(0)

print(f"kernel is on device: {kernel.device}")

In [None]:
tensor_y_down = F.conv1d(audio_tensor, kernel, stride=down_factor, padding=(M-1,))

print(f"tensor_y_down is on device {tensor_y_down.device}")

In [None]:
tensor_y_down_squeezed = tensor_y_down.squeeze(0).squeeze(0)
y_down_np = tensor_y_down_squeezed.detach().cpu().numpy()

y_dif = np.mean((y_down - y_down_np)**2)
print(rf"The Mean Squared Error of the difference between the convolution with numpy and with pytorch is: y_dif = {np.round(y_dif,5)}")

In [None]:
x = np.arange(len(audio)) / sample_rate
x_conv_np = np.arange(len(y_down_np)) / (sample_rate / down_factor)

plt.figure(figsize=(8, 5), dpi=120)
plt.plot(x[7000:9000], audio[7000:9000], color='C0', lw=1.2, alpha=0.8, label="Original Audio")
plt.plot(x_conv_np[3500:4500], y_down_np[3500:4500], color='C1', lw=2, alpha=0.8, label="Filtered Audio (Low-pass)")

plt.title("Audio Signal vs. Audio Signal after Convolution (Filtered)", fontsize=14, weight='bold')
plt.xlabel("Time (seconds)", fontsize=12)
plt.ylabel("Amplitude", fontsize=12)
plt.legend(frameon=False, fontsize=11)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

## Transposed convolution with PyTorch

Now we will implement the transposed convolution in PyTorch as well.
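Before applying it to the audio, it helps to verify on toy data that `F.conv_transpose1d` matches the zero-insertion approach. The sketch below uses made-up arrays `x` and `h`. It relies on the fact that, for an input of length L, kernel size K and stride s (no padding), `conv_transpose1d` returns (L - 1) * s + K samples, which is one sample fewer than the NumPy version with a stride of 2, because the trailing inserted zero is not processed.

```python
import numpy as np
import torch
import torch.nn.functional as F

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)      # toy low-rate signal
h = np.array([0.5, 1.0, 0.25], dtype=np.float32)     # toy kernel, not symmetric
stride = 2

# NumPy: insert zeros, then compute a full convolution
x_up = np.zeros(len(x) * stride, dtype=np.float32)
x_up[::stride] = x
y_np = np.convolve(x_up, h, mode="full")

# PyTorch: transposed convolution, kernel used without flipping
x_t = torch.from_numpy(x).reshape(1, 1, -1)           # (batch, channels, length)
h_t = torch.from_numpy(h).reshape(1, 1, -1)           # (in_channels, out_channels, length)
y_t = F.conv_transpose1d(x_t, h_t, stride=stride).flatten().numpy()

print(len(y_np), len(y_t))                            # 8 7
print(np.allclose(y_np[:len(y_t)], y_t))              # True
```

Note that no kernel flip is needed here: the transposed convolution already spreads each input value over the output in the same way as a true convolution.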

In [None]:
# Upsampling with pytorch
tensor_y_up = F.conv_transpose1d(tensor_y_down, kernel, stride=up_factor)

tensor_y_up_squeezed = tensor_y_up.squeeze(0).squeeze(0)
y_up_np = tensor_y_up_squeezed.detach().cpu().numpy()

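# y_trans was normalized above, and the two results can differ in length by one
# sample, so normalize the PyTorch output and compare only the overlapping part
y_up_np = y_up_np / np.max(np.abs(y_up_np))
n = min(len(y_trans), len(y_up_np))
y_dif = np.mean((y_trans[:n] - y_up_np[:n])**2)
print(f"The Mean Squared Error of the difference between the transposed convolution with numpy and with pytorch is: y_dif = {np.round(y_dif,5)}")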

# Waveform classification with a convolutional neural network

In this exercise, we build and train a simple 1D Convolutional Neural Network (CNN) to classify different waveform types, such as sine, triangle, and square waves. Each waveform is represented as a one-dimensional signal of length 128.

We will:
1. Generate a synthetic dataset of waveforms.
2. Prepare the data for PyTorch using custom Dataset and DataLoader classes.
3. Define and train a 1D CNN for waveform classification.
4. Evaluate the model performance with a test set and visualize the training process and the result.
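The CNN used in step 3 applies two convolutions (kernel size 5, stride 2), each followed by max pooling with a window of 2, before a fully connected layer with 224 input features. The number 224 follows from the usual output-length formula floor((L + 2p - k) / s) + 1. The helper `conv_out_len` below is an illustrative name and assumes no dilation.

```python
def conv_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1D convolution or pooling layer (no dilation)."""
    return (length + 2 * padding - kernel_size) // stride + 1

length = 128                                             # input waveform length
length = conv_out_len(length, kernel_size=5, stride=2)   # conv1 -> 62
length = conv_out_len(length, kernel_size=2, stride=2)   # pool  -> 31
length = conv_out_len(length, kernel_size=5, stride=2)   # conv2 -> 14
length = conv_out_len(length, kernel_size=2, stride=2)   # pool  -> 7

n_channels = 32                                          # output channels of conv2
n_features = n_channels * length
print(n_features)                                        # 224
```

If the input length, kernel sizes or strides are changed, the input size of the first fully connected layer has to be recomputed in the same way.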

In [18]:
# Define devices:
if torch.backends.mps.is_available():
    device_mps = torch.device("mps")  
if torch.cuda.is_available():
    device_cuda = torch.device("cuda")

device_cpu = torch.device("cpu")

In [19]:
# Creating the data
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split

def generate_waveform(wave_type, length, fs):
    t = np.linspace(0, 1, length, endpoint=False)
    freq = np.random.uniform(1, 10)      # random frequency 1-10Hz
    amp = np.random.uniform(0.5, 1.5)    # random amplitude
    phi = np.random.uniform(0, 2*np.pi)  # random phases

    if wave_type == 'sine':
        y = amp * np.sin(2 * np.pi * freq * t + phi)
    elif wave_type == 'triangle':
        y = amp * signal.sawtooth(2 * np.pi * freq * t + phi, 0.5)
    elif wave_type == 'square':
        y = amp * signal.square(2 * np.pi * freq * t + phi)
    else:
        raise ValueError("Unknown wave type")
    
    # optional noise
    noise = np.random.normal(0, 0.05, length)
    y += noise
    return y

# Parameters
num_samples = 200      # samples per waveform type
length = 128           # number of points per waveform
fs = 128               # sampling frequency

# Generate dataset in memory
wave_types = ['sine', 'triangle', 'square']
X = []
y = []

for idx, wave in enumerate(wave_types):
    for _ in range(num_samples):
        waveform = generate_waveform(wave, length, fs)
        X.append(waveform)
        y.append(idx)  # class label: 0=sine, 1=triangle, 2=square


# Convert to numpy arrays
X = np.array(X)   # shape: (600, 128)
y = np.array(y)   # shape: (600,)

In [None]:
# Class to create our data set with torch.utils.data.Dataset
class WaveformDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32).unsqueeze(1)  # add channel dimension
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
    
dataset = WaveformDataset(X, y)

train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)

In [None]:
import torch.nn as nn
import torch.nn.functional as F

# Our CNN model for waveform classification
class Waveform_Classification_CNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.conv1 = nn.Conv1d(1, 16, kernel_size=5, stride=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=5,  stride=2)
        self.pool = nn.MaxPool1d(2)
        self.fc1 = nn.Linear(224, 64) 
        self.fc2 = nn.Linear(64, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(x)
        x = F.relu(self.conv2(x))
        x = self.pool(x)
        x = x.view(x.size(0), -1)  # flatten
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

In [None]:
import time

device = device_cpu

model = Waveform_Classification_CNN(num_classes=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.ASGD(model.parameters(),lr=0.01)
# optimizer = optim.Adam(model.parameters(),lr=0.01)

n_epochs = 1000
loss_history = []

start_time = time.time()

for epoch in range(n_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)
    loss_history.append(epoch_loss)

end_time = time.time()
elapsed = end_time - start_time
print(f"Training time on {device}: {elapsed:.2f} seconds")

plt.figure(figsize=(6,4))
plt.plot(loss_history, label="Training loss")
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.grid(True)
plt.show()


In [None]:
model.eval()
all_preds = []
all_labels = []
all_inputs = []
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
        all_inputs.extend(inputs.cpu().numpy()) 
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {100 * correct / total:.2f}%")

In [None]:

wave_types = ['sine', 'triangle', 'square']
n_examples = 6
idxs = np.random.choice(len(all_inputs), n_examples, replace=False)

plt.figure(figsize=(12, 8))
for i, idx in enumerate(idxs):
    plt.subplot(2, 3, i+1)
    plt.plot(all_inputs[idx][0], color='black')
    plt.title(f"True: {wave_types[all_labels[idx]]}\nPred: {wave_types[all_preds[idx]]}",
              color="green" if all_labels[idx] == all_preds[idx] else "red")
    plt.tight_layout()
plt.show()

We could also implement this task using a fully connected model (a simple feed-forward network)
instead of a convolutional one.

In [25]:
# Linear model

class Waveform_Classification_Linear(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Input: 1 × 128 waveform → flatten to 128 features
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 64)
        self.fc4 = nn.Linear(64, num_classes)

    def forward(self, x):
        # Flatten from (batch, 1, 128) → (batch, 128)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)   # output logits
        return x

In [None]:
device = device_cpu  
model = Waveform_Classification_Linear(num_classes=3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.ASGD(model.parameters(), lr=0.01)
# optimizer = optim.Adam(model.parameters(), lr=0.01)

n_epochs = 1000
loss_history = []

start_time = time.time()

for epoch in range(n_epochs):
    model.train()
    running_loss = 0.0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item() * inputs.size(0)

    epoch_loss = running_loss / len(train_loader.dataset)
    loss_history.append(epoch_loss)

end_time = time.time()
elapsed = end_time - start_time
print(f"Training time on {device}: {elapsed:.2f} seconds")

plt.figure(figsize=(6, 4))
plt.plot(loss_history, label="Training loss")
plt.title("Training Loss (Linear Model)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
model.eval()
all_preds = []
all_labels = []
all_inputs = []
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, predicted = torch.max(outputs, 1)
        all_preds.extend(predicted.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
        all_inputs.extend(inputs.cpu().numpy())
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Accuracy: {100 * correct / total:.2f}%")

However, such a model will typically not reach the accuracy of the CNN, because it treats
every input sample as an independent feature and has no built-in notion of the local structure of the waveform.
Convolutional layers, in contrast, apply the same filters across the entire sequence and thereby
learn local temporal patterns such as recurring shapes, edges, or transitions, wherever they occur.
A fully connected model has to learn each pattern separately for every position, so a phase shift in the waveform can be enough for it to misclassify the signal.
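The effect of weight sharing can be demonstrated without any training. In this sketch (toy data, random dense weights, all names illustrative), a short pattern is shifted by 7 samples. The response of a shared filter simply shifts along with it, whereas the output of a dense layer changes in an unrelated way.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.zeros(64)
x[20:24] = [1.0, 2.0, 2.0, 1.0]           # a short pattern in the middle
shift = 7
x_shifted = np.roll(x, shift)             # the same pattern, 7 samples later

# Shared filter: the same four weights are applied at every position
h = np.array([1.0, 2.0, 2.0, 1.0])
y = np.convolve(x, h, mode="same")
y_shifted = np.convolve(x_shifted, h, mode="same")
print(np.argmax(y), np.argmax(y_shifted))         # the peak moves by 7 samples
print(np.allclose(np.roll(y, shift), y_shifted))  # True

# Dense layer: every output has its own weight for every input position
W = rng.normal(size=(64, 64))
d = W @ x
d_shifted = W @ x_shifted
print(np.allclose(np.roll(d, shift), d_shifted))  # False
```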

As a result, while a fully connected network can in principle classify the data,
it usually performs worse and needs many more parameters to capture the same patterns.
CNNs are therefore usually better suited for waveform or time-series classification tasks.
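The difference in parameter count can be worked out by hand from the layer sizes of the two models defined in this exercise. The helper functions below are illustrative; they count weights plus biases per layer.

```python
def conv1d_params(c_in, c_out, kernel_size):
    """Weights plus biases of a 1D convolutional layer."""
    return c_in * c_out * kernel_size + c_out

def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# CNN: conv1, conv2, fc1, fc2
cnn_params = (conv1d_params(1, 16, 5) + conv1d_params(16, 32, 5)
              + linear_params(224, 64) + linear_params(64, 3))

# Fully connected model: fc1 to fc4
mlp_params = (linear_params(128, 256) + linear_params(256, 128)
              + linear_params(128, 64) + linear_params(64, 3))

print(cnn_params)   # 17283
print(mlp_params)   # 74371
```

The two convolutional layers together account for only 2688 of the CNN's parameters, because each filter is reused at every position of the sequence.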