
# Convolutions Beyond Images: 1D CNNs, Causality, Audio, and Text

Hands-on notebook aligned with the lecture "Convolutions beyond images".
You will:
- Implement and visualize 1D convolutions, causal masks, and dilations.
- Build a tiny autoregressive forecaster on a synthetic time series.
- Generate simple spectrogram features and run a toy CNN over them.
- Practice padding variable-length sequences for batching.
- Explore basic tokenization and embeddings, with NumPy-first examples.


In [None]:

import numpy as np
import matplotlib.pyplot as plt

np.set_printoptions(suppress=True, linewidth=120)

def seed_all(seed=0):
    np.random.seed(seed)

seed_all(7)



## Part 1 — 1D Convolutions on Sequences

We represent a length-$n$ sequence with $c$ channels as a matrix $X \in \mathbb{R}^{n \times c}$.
A 1D convolution extracts local patches and applies a learned kernel to each patch:
$$
H[i] = \phi\big( W \cdot \mathrm{vec}(\text{patch around } i) + b \big).
$$

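We implement a simple **valid** convolution (no padding, so the output has $n - s + 1$ positions).
For clarity, the code omits the bias $b$ and the nonlinearity $\phi$.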


In [None]:

def conv1d_valid(X, K):
    """
    X: (n, c) sequence
    K: (s, c_in, c_out) kernel, s = kernel size; c_in must equal c
    Returns H: (n - s + 1, c_out)
    """
    n, c = X.shape
    s, c_in, c_out = K.shape
    assert c_in == c
    H = np.zeros((n - s + 1, c_out), dtype=X.dtype)
    for i in range(n - s + 1):
        patch = X[i:i+s, :]  # (s, c)
        # vectorized filter application: sum over s and c
        # output shape (c_out,)
        H[i] = np.tensordot(patch, K, axes=([0,1],[0,1]))
    return H

# quick demo
seed_all(0)
n, c, s, c_out = 16, 2, 3, 4
X = np.random.randn(n, c)
K = np.random.randn(s, c, c_out) * 0.1
H = conv1d_valid(X, K)
H.shape
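
As a cross-check, the loop above can be compared with NumPy's built-in routines. The sketch below is self-contained and uses a hypothetical helper `conv1d_valid_vec` (not part of the notebook) built on `sliding_window_view`. For a single channel it should agree with `np.correlate`, which also shows that this "convolution" does not flip the kernel, as is usual in deep learning.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv1d_valid_vec(X, K):
    """Vectorized valid 1D conv: X (n, c), K (s, c, c_out) -> (n - s + 1, c_out)."""
    s = K.shape[0]
    # windows has shape (n - s + 1, c, s): the window axis is appended last
    windows = sliding_window_view(X, s, axis=0)
    # sum over the kernel offset u and the input channel c
    return np.einsum("ncu,uco->no", windows, K)

rng = np.random.default_rng(0)
x_sig = rng.standard_normal(16)   # single-channel signal
k_sig = rng.standard_normal(3)    # single-channel kernel

H_vec = conv1d_valid_vec(x_sig[:, None], k_sig[:, None, None])[:, 0]
H_corr = np.correlate(x_sig, k_sig, mode="valid")        # cross-correlation, no flip
H_conv = np.convolve(x_sig, k_sig[::-1], mode="valid")   # flipping the kernel undoes convolve's flip

print(H_vec.shape, np.allclose(H_vec, H_corr), np.allclose(H_corr, H_conv))
```

The vectorized form avoids the Python loop over positions, which matters once sequences get long.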



## Part 2 — Causal (Masked) Convolutions

For forecasting, the output at time $i$ must depend only on inputs at times $\le i$.
We enforce **causality** by padding on the left and not using future positions.
Implementation trick: left-pad with $(s-1)$ zeros, then do a valid convolution.


In [None]:

def conv1d_causal(X, K):
    """
    Causal 1D conv with left padding of size (s-1).
    X: (n, c)
    K: (s, c, c_out)
    Returns H: (n, c_out), aligned with X positions 0..n-1.
    """
    s = K.shape[0]
    pad = np.zeros((s-1, X.shape[1]), dtype=X.dtype)
    X_pad = np.concatenate([pad, X], axis=0)  # (n + s - 1, c)
    H_valid = conv1d_valid(X_pad, K)          # (n, c_out)
    return H_valid

# sanity check shapes
Hc = conv1d_causal(X, K)
H.shape, Hc.shape
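
A simple way to test causality is to perturb the inputs from some time `t0` onward and confirm that outputs before `t0` do not move. The sketch below is self-contained. `conv1d_causal_vec` is a hypothetical compact re-implementation of the same left-padding idea, and `t0 = 10` is an arbitrary choice.

```python
import numpy as np

def conv1d_causal_vec(X, K):
    """Causal conv via left zero-padding: X (n, c), K (s, c, c_out) -> (n, c_out)."""
    s, n = K.shape[0], X.shape[0]
    X_pad = np.concatenate([np.zeros((s - 1, X.shape[1]), dtype=X.dtype), X], axis=0)
    # H[i] = sum_u X_pad[i + u] @ K[u]; X_pad[i + u] equals X[i + u - (s - 1)], never later than X[i]
    return sum(X_pad[u:u + n] @ K[u] for u in range(s))

rng = np.random.default_rng(0)
X_chk = rng.standard_normal((16, 2))
K_chk = rng.standard_normal((3, 2, 4))

t0 = 10
X_future = X_chk.copy()
X_future[t0:] += rng.standard_normal(X_future[t0:].shape)  # change inputs at t >= t0 only

H_a = conv1d_causal_vec(X_chk, K_chk)
H_b = conv1d_causal_vec(X_future, K_chk)

past_unchanged = np.allclose(H_a[:t0], H_b[:t0])       # outputs before t0 ignore the edit
future_changed = not np.allclose(H_a[t0:], H_b[t0:])   # outputs from t0 on react to it
print(past_unchanged, future_changed)
```

Running the same test on a symmetric (non-causal) convolution would show outputs before `t0` changing, which is exactly the leakage a forecaster must avoid.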



## Part 3 — Dilated Convolutions

A **dilated** kernel samples inputs spaced $d$ steps apart,
widening the receptive field to $(s-1)d + 1$ positions without increasing the parameter count.


In [None]:

def conv1d_dilated_causal(X, K, dilation=1):
    """
    Dilated causal 1D conv via left padding.
    For kernel size s and dilation d, the receptive field width is (s-1)*d + 1.
    X: (n, c), K: (s, c, c_out)
    Returns (n, c_out)
    """
    s = K.shape[0]
    # effective width
    width = (s - 1) * dilation + 1
    pad = np.zeros((width - 1, X.shape[1]), dtype=X.dtype)
    X_pad = np.concatenate([pad, X], axis=0)  # (n + width - 1, c)
    n = X.shape[0]
    c_out = K.shape[2]
    H = np.zeros((n, c_out), dtype=X.dtype)
    for i in range(n):
        # gather dilated patch ending at i
        idxs = [i + width - 1 - j*dilation for j in range(s)]
        # reverse order so kernel K[0] aligns with oldest element
        idxs = idxs[::-1]
        patch = X_pad[idxs, :]  # (s, c)
        H[i] = np.tensordot(patch, K, axes=([0,1],[0,1]))
    return H

# visualize how dilation widens the receptive field: impulse responses for d = 1, 2, 4
seed_all(1)
X_demo = np.zeros((40,1)); X_demo[10] = 1.0  # impulse
K_demo = np.ones((3,1,1))
H_d1 = conv1d_dilated_causal(X_demo, K_demo, dilation=1)
H_d2 = conv1d_dilated_causal(X_demo, K_demo, dilation=2)
H_d4 = conv1d_dilated_causal(X_demo, K_demo, dilation=4)

plt.figure(figsize=(9,3))
plt.plot(H_d1[:,0], label="d=1")
plt.plot(H_d2[:,0], label="d=2")
plt.plot(H_d4[:,0], label="d=4")
plt.title("Dilated causal conv outputs for an impulse input")
plt.legend(); plt.show()
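
The plot can be backed by a numeric check. With an all-ones kernel of size 3 and an impulse at position 10, the output should be nonzero exactly at positions 10, 10 + d, and 10 + 2d. The sketch below is self-contained and uses a hypothetical vectorized helper, `conv1d_dilated_causal_vec`, that is intended to match the loop version above.

```python
import numpy as np

def conv1d_dilated_causal_vec(X, K, dilation=1):
    """Dilated causal conv: X (n, c), K (s, c, c_out) -> (n, c_out)."""
    s, n = K.shape[0], X.shape[0]
    pad = (s - 1) * dilation
    X_pad = np.concatenate([np.zeros((pad, X.shape[1]), dtype=X.dtype), X], axis=0)
    # tap j reads X_pad[i + j*dilation], i.e. X[i - (s - 1 - j)*dilation]
    return sum(X_pad[j * dilation : j * dilation + n] @ K[j] for j in range(s))

X_imp = np.zeros((40, 1))
X_imp[10] = 1.0                      # unit impulse at t = 10
K_ones = np.ones((3, 1, 1))

support = {}
for d in (1, 2, 4):
    H_d = conv1d_dilated_causal_vec(X_imp, K_ones, dilation=d)
    support[d] = np.flatnonzero(H_d[:, 0]).tolist()
    span = support[d][-1] - support[d][0] + 1
    print(f"d={d}: nonzero at {support[d]}, span {span}")
```

The span of each response equals the single-layer receptive field width $(s-1)d + 1$.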



## Part 4 — Forecasting with a Causal CNN

We build a synthetic time series and train a tiny causal CNN to predict the next value at each step.
Loss: mean squared error between predicted next-step $\hat{x}_{t+1}$ and true $x_{t+1}$.
This is a teacher-forcing setup for one-step prediction.


In [None]:

def make_series(T=400, noise=0.05):
    t = np.linspace(0, 8*np.pi, T)
    x = np.sin(t) + 0.3*np.sin(3*t + 0.5) + noise*np.random.randn(T)
    return x

def make_dataset_1d(x, s, horizon=1):
    """
    Build supervised pairs (window -> future value) for a single-channel series.
    The target lies `horizon` steps after the end of each window (horizon=1 is next-step).
    Returns X: (N, s, 1), Y: (N, 1)
    """
    N = len(x) - s - horizon + 1
    X = np.zeros((N, s, 1))
    Y = np.zeros((N, 1))
    for i in range(N):
        X[i,:,0] = x[i:i+s]
        Y[i,0]  = x[i+s+horizon-1]  # target `horizon` steps after the window
    return X, Y

seed_all(3)
x = make_series(T=600, noise=0.03)
s = 9           # kernel size / receptive field window
X_sup, Y_sup = make_dataset_1d(x, s=s)

# simple 1-layer causal conv regressor: y_hat = X * K + b, applied at last time step only
K_reg = np.random.randn(s, 1, 1)*0.1
b_reg = np.zeros((1,))

def predict_batch(Xb, K, b):
    # apply 1D conv causally then take last output as prediction
    # Here equivalently: weighted sum over the s-window
    # Using conv definition: H[i] = sum_{u,c} X[i+u,c]*K[u,c]
    # since Xb windows are aligned, we directly tensordot
    return np.tensordot(Xb, K, axes=([1,2],[0,1])).reshape(-1,1) + b  # (N,1)

def mse(a, b):
    return np.mean((a-b)**2)

lr = 1e-2
for epoch in range(200):
    # each iteration draws one random mini-batch of 64 windows (sampled with replacement)
    idx = np.random.randint(0, X_sup.shape[0], size=64)
    Xb, Yb = X_sup[idx], Y_sup[idx]
    Yhat = predict_batch(Xb, K_reg, b_reg)
    loss = mse(Yhat, Yb)
    # gradient of the batch-mean squared error w.r.t. the predictions: 2*(Yhat - Yb)/B
    diff = 2*(Yhat - Yb) / len(idx)  # (B,1)
    # dL/dK[u] = sum over the batch of diff * Xb[:, u]; contracting batch and channel axes gives (s,)
    dK = np.tensordot(Xb, diff, axes=([0,2],[0,1]))  # (s,)
    dK = dK.reshape(s,1,1)
    db = np.sum(diff, axis=0)  # (1,)
    K_reg -= lr*dK
    b_reg -= lr*db
    if (epoch+1) % 50 == 0:
        print(f"epoch {epoch+1:3d} | loss {loss:.5f}")

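# Evaluate on the tail (these samples were also used for training, so this is an in-sample check)
X_te, Y_te = make_dataset_1d(x[-300:], s=s)
Y_pred = predict_batch(X_te, K_reg, b_reg)
print(f"tail MSE {mse(Y_pred, Y_te):.5f}")

plt.figure(figsize=(10,3))
plt.plot(range(len(x)), x, label='series', alpha=0.5)
# Y_pred[i] predicts x[len(x) - 300 + s + i], so the first prediction sits at len(x) - len(Y_pred)
base = len(x)-len(Y_pred)
plt.plot(range(base, base+len(Y_pred)), Y_pred[:,0], label='1-step pred')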
plt.legend(); plt.title("Causal CNN (1-layer) one-step forecasts"); plt.show()
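
Because this one-layer model is linear in its weights, the SGD result can be compared with a closed-form least-squares fit. The sketch below is self-contained: it regenerates a series with the same formula but its own random generator, so the numbers will not match the cell above exactly. It also reports a "repeat the last value" persistence baseline, which is our own addition for reference.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(3)
T_ls, win_len = 600, 9
t_ls = np.linspace(0, 8 * np.pi, T_ls)
series = np.sin(t_ls) + 0.3 * np.sin(3 * t_ls + 0.5) + 0.03 * rng.standard_normal(T_ls)

# Row i holds series[i : i + win_len]; the target is the sample right after the window.
windows = sliding_window_view(series, win_len)[:-1]   # (T - win_len, win_len)
targets = series[win_len:]                            # (T - win_len,)

# Append a column of ones so the bias is fitted jointly with the weights.
A = np.hstack([windows, np.ones((len(windows), 1))])
theta, *_ = np.linalg.lstsq(A, targets, rcond=None)
w_ls, b_ls = theta[:-1], theta[-1]

mse_ls = np.mean((A @ theta - targets) ** 2)
mse_persist = np.mean((windows[:, -1] - targets) ** 2)  # predict "same as the last value"
print(f"least squares MSE {mse_ls:.5f} | persistence MSE {mse_persist:.5f}")
```

To reuse the solution with `predict_batch`, the weights would be reshaped as `w_ls.reshape(win_len, 1, 1)`. If SGD stops well above the least-squares error, more iterations or a different learning rate may help.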



### Autoregressive generation

Given a trained one-step model, we can generate multiple steps by feeding predictions back as inputs.
We demonstrate naive AR generation from the last observed window.


In [None]:

def ar_generate(x_prefix, steps, K, b):
    # x_prefix: last s samples
    s = K.shape[0]
    buf = list(x_prefix[-s:])
    out = []
    for _ in range(steps):
        Xw = np.array(buf).reshape(1, s, 1)
        yhat = predict_batch(Xw, K, b)[0,0]
        out.append(yhat)
        buf = buf[1:] + [yhat]
    return np.array(out)

s = K_reg.shape[0]
prefix = x[-(s+100):-100]  # the s samples immediately before the final 100 points
gen = ar_generate(prefix, steps=120, K=K_reg, b=b_reg)

plt.figure(figsize=(10,3))
plt.plot(range(len(x)), x, label='series', alpha=0.5)
start = len(x)-100
plt.plot(range(start, start+len(gen)), gen, label='AR gen')
plt.legend(); plt.title("Autoregressive generation with causal conv regressor")
plt.show()
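
To see what an ideal rollout looks like, consider a case where the exact kernel is known. A noise-free sinusoid satisfies $\sin(\omega(t+1)) = 2\cos(\omega)\sin(\omega t) - \sin(\omega(t-1))$, so a two-tap linear predictor reproduces it when its own outputs are fed back. The sketch below is self-contained; `omega`, `n_obs`, and `steps` are arbitrary illustrative values.

```python
import numpy as np

omega = 0.1
n_obs, steps = 50, 60
x_true = np.sin(omega * np.arange(n_obs + steps))

# Two-tap kernel, oldest sample first: x[t+1] = -1 * x[t-1] + 2*cos(omega) * x[t]
w_exact = np.array([-1.0, 2.0 * np.cos(omega)])

buf = list(x_true[n_obs - 2 : n_obs])    # last two observed samples
gen_exact = []
for _ in range(steps):
    y_next = float(np.dot(w_exact, buf))
    gen_exact.append(y_next)
    buf = buf[1:] + [y_next]             # slide the window, feeding the prediction back
gen_exact = np.array(gen_exact)

max_err = np.max(np.abs(gen_exact - x_true[n_obs:]))
print(f"max rollout error over {steps} steps: {max_err:.2e}")
```

A learned kernel only approximates such a recurrence and was fitted to noisy data, so its rollout error typically compounds as generation proceeds.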



## Part 5 — Simple Spectrogram Features (STFT)

We simulate an audio waveform and compute a short-time Fourier transform to obtain a spectrogram,
then run a tiny 1D convolution across time, treating frequency bins as channels.
(A 2D CNN would instead treat frequency as a second spatial axis.)


In [None]:

def stft_mag(y, win=256, hop=128):
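    # simple magnitude STFT using NumPy FFT; trailing samples that do not fill a window are dropped
    n = len(y)
    hann = np.hanning(win)  # build the window once and reuse it for every frame
    windows = []
    for start in range(0, n - win + 1, hop):
        seg = y[start:start+win] * hann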
        spec = np.fft.rfft(seg)
        windows.append(np.abs(spec))
    S = np.stack(windows, axis=0)  # (frames, freq_bins)
    return S

# fake "audio": sum of tones
seed_all(8)
sr = 8000
t = np.linspace(0, 2.0, int(2.0*sr), endpoint=False)
y = 0.7*np.sin(2*np.pi*220*t) + 0.4*np.sin(2*np.pi*660*t) + 0.1*np.random.randn(len(t))
S = stft_mag(y, win=256, hop=128)  # (frames, freq_bins)

plt.figure(figsize=(6,3))
plt.imshow(20*np.log10(S.T+1e-6), aspect='auto', origin='lower')
plt.title("Spectrogram (dB)"); plt.xlabel("frame"); plt.ylabel("freq bin")
plt.colorbar(); plt.show()

# Treat as sequence of frames with c = freq_bins channels
X_spec = S.astype(np.float32)  # (T, F)
# tiny conv along time with kernel size 3, mapping F->Cout
T, F = X_spec.shape
Cout = 6
K_spec = np.random.randn(3, F, Cout) * 0.01
H_spec = conv1d_valid(X_spec, K_spec)  # (T-2, Cout)
H_spec.shape
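
The spectrogram's vertical axis is in bin indices. To relate bins to physical frequency, `np.fft.rfftfreq` gives the centre frequency of each bin, with spacing `sr / win` (31.25 Hz for the settings used here). The self-contained sketch below rebuilds a comparable signal with its own random generator and checks that the strongest bins sit near the 220 Hz and 660 Hz tones.

```python
import numpy as np

sr, win, hop = 8000, 256, 128
rng = np.random.default_rng(8)
t_demo = np.arange(int(2.0 * sr)) / sr
y_demo = (0.7 * np.sin(2 * np.pi * 220 * t_demo)
          + 0.4 * np.sin(2 * np.pi * 660 * t_demo)
          + 0.1 * rng.standard_normal(len(t_demo)))

hann = np.hanning(win)
frames = np.stack([y_demo[i:i + win] * hann
                   for i in range(0, len(y_demo) - win + 1, hop)])
S_demo = np.abs(np.fft.rfft(frames, axis=1))      # (frames, win // 2 + 1)

freqs = np.fft.rfftfreq(win, d=1.0 / sr)          # centre frequency of each bin, in Hz
mean_spec = S_demo.mean(axis=0)                   # average magnitude per bin

print(f"bin spacing: {freqs[1]:.2f} Hz")
for f0 in (220.0, 660.0):
    k = int(np.argmin(np.abs(freqs - f0)))        # bin closest to the tone
    print(f"{f0:.0f} Hz -> bin {k} ({freqs[k]:.2f} Hz), mean magnitude {mean_spec[k]:.1f}")
print("strongest bin:", int(np.argmax(mean_spec)))
```

Neither tone falls exactly on a bin centre, so some energy leaks into neighbouring bins. The Hann window keeps that leakage confined to a few adjacent bins.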



## Part 6 — Padding Variable-Length Sequences

Mini-batches need same-length tensors. We pad to the max length in the batch.


In [None]:

def pad_batch(seqs):
    """
    seqs: list of arrays with shape (t_i, c)
    returns X: (B, T_max, c), mask: (B, T_max) where 1 indicates valid.
    """
    B = len(seqs)
    c = seqs[0].shape[1]
    T_max = max(s.shape[0] for s in seqs)
    X = np.zeros((B, T_max, c))
    mask = np.zeros((B, T_max), dtype=np.int32)
    for i,s in enumerate(seqs):
        t = s.shape[0]
        X[i, :t, :] = s
        mask[i, :t] = 1
    return X, mask

# demo with three sequences
seqs = [np.random.randn(12,2), np.random.randn(7,2), np.random.randn(10,2)]
X_pad, mask = pad_batch(seqs)
X_pad.shape, mask
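
The mask carries enough information to undo the padding. The self-contained sketch below copies `pad_batch` so it runs on its own, recovers each sequence length from the mask, and slices the original sequences back out. The names `seqs_demo`, `X_b`, and `mask_b` are ours.

```python
import numpy as np

def pad_batch(seqs):
    """Copy of the function above: list of (t_i, c) arrays -> (B, T_max, c) and a (B, T_max) mask."""
    B, c = len(seqs), seqs[0].shape[1]
    T_max = max(seq.shape[0] for seq in seqs)
    X = np.zeros((B, T_max, c))
    mask = np.zeros((B, T_max), dtype=np.int32)
    for i, seq in enumerate(seqs):
        X[i, :seq.shape[0], :] = seq
        mask[i, :seq.shape[0]] = 1
    return X, mask

rng = np.random.default_rng(0)
seqs_demo = [rng.standard_normal((t_len, 2)) for t_len in (12, 7, 10)]
X_b, mask_b = pad_batch(seqs_demo)

lengths = mask_b.sum(axis=1)                                   # valid steps per sequence
recovered = [X_b[i, :lengths[i]] for i in range(len(seqs_demo))]

print("lengths:", lengths.tolist())
print("round trip ok:", all(np.array_equal(a, b) for a, b in zip(seqs_demo, recovered)))
```

The same `lengths` vector is what a masked reduction would divide by, instead of `T_max`.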



## Part 7 — Text Tokenization and Embeddings (NumPy-first)

We simulate two tokenization levels (characters and words) and simple embeddings.


In [None]:

def simple_char_tokenize(s):
    return list(s)

def simple_word_tokenize(s):
    return s.lower().split()

def build_vocab(tokens_list):
    vocab = {}
    for tokens in tokens_list:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # raises KeyError for tokens that are not in vocab
    return np.array([vocab[t] for t in tokens], dtype=np.int32)

def embedding_lookup(idx, E):
    # idx: (n,), E: (V, d)
    return E[idx]  # (n, d)

# demo
texts = ["Check this out", "Check out bowling", "this text"]
tok_char = [simple_char_tokenize(s) for s in texts]
tok_word = [simple_word_tokenize(s) for s in texts]

v_char = build_vocab(tok_char)
v_word = build_vocab(tok_word)

# build embeddings
d_char, d_word = 8, 16
E_char = np.random.randn(len(v_char), d_char)*0.1
E_word = np.random.randn(len(v_word), d_word)*0.1

ex = tok_word[0]
ids = encode(ex, v_word)
emb = embedding_lookup(ids, E_word)
print("tokens:", ex)
print("ids:", ids)
print("embeddings shape:", emb.shape)
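
The `encode` function above fails on any word that was not seen when the vocabulary was built. One common workaround, sketched below under our own assumptions, is to reserve an id for unknown tokens. The `<unk>` token and the helpers `build_vocab_with_unk` and `encode_with_unk` are hypothetical and not part of the notebook.

```python
import numpy as np

UNK = "<unk>"  # hypothetical reserved token for out-of-vocabulary words

def build_vocab_with_unk(tokens_list):
    vocab = {UNK: 0}                          # id 0 is reserved for unknown tokens
    for tokens in tokens_list:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab)) # assign the next free id on first sight
    return vocab

def encode_with_unk(tokens, vocab):
    unk_id = vocab[UNK]
    return np.array([vocab.get(tok, unk_id) for tok in tokens], dtype=np.int32)

texts_demo = ["Check this out", "Check out bowling", "this text"]
v_unk = build_vocab_with_unk([s.lower().split() for s in texts_demo])

ids_demo = encode_with_unk("check this tennis".split(), v_unk)
print(v_unk)
print(ids_demo)   # "tennis" is unseen, so it maps to the <unk> id
```

The embedding matrix then needs one extra row for the reserved id, so its first dimension is `len(v_unk)`.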



### Tiny text CNN skeleton

We will:
- tokenize to word ids,
- look up embeddings,
- apply a small temporal conv and a global average pooling,
- classify into two toy classes.

This is a non-optimized NumPy sketch for educational purposes.


In [None]:

def text_to_ids(text, vocab):
    return encode(simple_word_tokenize(text), vocab)

def global_avg_pool(H):
    # H: (T, C) -> (C,)
    return H.mean(axis=0)

def text_cnn_forward(text, v_word, E_word, K, W_cls, b_cls):
    ids = text_to_ids(text, v_word)
    X = embedding_lookup(ids, E_word)          # (T, d)
    H = conv1d_valid(X, K)                     # (T-s+1, C)
    g = global_avg_pool(H)                     # (C,)
    logits = W_cls @ g + b_cls                 # (num_classes,)
    return logits, {"ids":ids, "H":H, "g":g}

# toy data
classes = ["sport", "other"]
num_classes = 2
s = 3
C = 6
K = np.random.randn(s, d_word, C)*0.05
W_cls = np.random.randn(num_classes, C)*0.05
b_cls = np.zeros((num_classes,))

text = "check this bowling"
logits, cache = text_cnn_forward(text, v_word, E_word, K, W_cls, b_cls)
logits, logits.shape
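
The forward pass returns raw logits. To read them as class probabilities, a softmax can be applied. The sketch below is self-contained and uses placeholder logits (`logits_demo`) rather than the output of the cell above. Subtracting the maximum before exponentiating is a standard trick for numerical stability.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1D vector of logits."""
    z = logits - np.max(logits)   # shifting by the max leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

classes_demo = ["sport", "other"]
logits_demo = np.array([0.3, -0.1])           # placeholder values for illustration
probs = softmax(logits_demo)
pred = classes_demo[int(np.argmax(probs))]

print("probabilities:", probs)
print("predicted class:", pred)
```

With randomly initialized weights the probabilities will be close to uniform; they only become meaningful once the model is trained.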



## Practice — Your Turn

1. **Causal vs non-causal:** Write a "same"-padded variant of `conv1d_valid` whose output is aligned with the input, so that each position uses symmetric context (past and future).
   Then implement a non-causal forecaster and compare its information leakage with `conv1d_causal`.

2. **Multi-layer dilations:** Chain two dilated layers with dilations $d=1$ and $d=2$. Measure effective receptive field.

3. **Spectrogram CNN:** Add one more conv layer on `H_spec` and a small classifier. Try to separate two synthetic audio classes
   built from different tone mixtures.

4. **Padding masks:** Using `mask` from `pad_batch`, implement a masked global average pooling that ignores padded positions.

5. **Embeddings:** Replace the word embeddings with character embeddings and compare the number of tokens, model size,
   and performance on a small toy classification set.

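6. **Autoregressive sampling:** Extend `ar_generate` in the spirit of scheduled sampling: when rolling out over a span where the true values are known, feed back the *true* next value with a small probability instead of the prediction, and observe how this limits error accumulation.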


In [None]:
# Your work here


In [None]:
# Your work here


In [None]:
# Your work here


In [None]:
# Your work here


In [None]:
# Your work here


In [None]:
# Your work here



## Summary

- 1D convolutions apply naturally to sequences and can be made **causal**.
- **Dilations** widen the receptive field without adding parameters; when the dilation doubles at each layer, the receptive field grows exponentially with depth.
- Forecasting uses causal models to avoid information leakage; AR generation reuses model outputs.
- Audio tasks often use **spectrograms** as inputs; text requires **tokenization** and **embeddings**.
- Variable-length sequences need **padding** and masking for mini-batches.

Extend any section with deeper experiments.
