# Recitation 5: Dense embedding and CNN

_Date_: 10/9/2025

## Dataset

In this recitation, we use the _Names_ corpus, in which each instance is a (name, gender) tuple.

The original dataset is stored in a directory with the following structure:
```
$ tree ./data/names

data/names
├── female.txt
├── male.txt
└── README
```

_**TODO**_: Download the data and place it in a new directory called `data` in the root of this recitation project.

In [1]:
from typing import List, Tuple
from dataclasses import dataclass, asdict

import os

In [2]:
@dataclass
class NameInstance:
    """Dataclass for a single instance of 'Names' dataset"""
    name: str
    gender: str

    def __repr__(self) -> str:
        return f"< Name: {self.name} | Gender: {self.gender} >"
    

def load_names(data_dir: str) -> List[NameInstance]:
    """Load instances of 'Names' dataset"""
    raw_data = []
    
    for filename in os.listdir(data_dir):
        if not filename.endswith('.txt'):
            continue

        data_file = os.path.join(data_dir, filename)
        label = os.path.splitext(filename)[0]

        # Each line of a file holds one name; the file name (without extension) is the label
        with open(data_file, "r", encoding="utf-8") as file:
            for line in file:
                raw_data.append(NameInstance(name=line.strip(), gender=label))

    return raw_data


def split_train_test(
    data: List[NameInstance],
    train_size: int,
    seed: int = 42
) -> Tuple[List[NameInstance], List[NameInstance]]:
    """Split the data into train and test set"""
    assert train_size < len(data), 'training size must be less than the whole set'
    import random
    random.seed(seed)

    random.shuffle(data)

    return data[:train_size], data[train_size:]

In [3]:
names = load_names('../data/names')
sample_name = names[1]
print(sample_name)

< Name: Aaron | Gender: male >


In [4]:
train_names, test_names = split_train_test(names, 6000)

## Dense embeddings

### Build vocabulary
Although we could learn the relationship between gender and name by treating each name as a single word-level token, recall that a string can be seen as a list of characters. A character-level representation can capture more features from a single name, for example, the order and frequency of its characters.
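
As a small illustration of what a character-level view exposes, the sketch below tokenizes one name into characters and derives character counts and adjacent character pairs. The name `Hannah` is a hypothetical example, not necessarily an entry of the corpus.

```python
from collections import Counter

name = "Hannah"               # hypothetical example name
chars = list(name.lower())    # character-level tokens, order preserved

# Frequency of each character in the name
char_counts = Counter(chars)

# Adjacent character pairs (bigrams) capture local ordering
bigrams = [a + b for a, b in zip(chars, chars[1:])]

print(chars)        # ['h', 'a', 'n', 'n', 'a', 'h']
print(char_counts)
print(bigrams)      # ['ha', 'an', 'nn', 'na', 'ah']
```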

In [5]:
from typing import Dict

In [6]:
def build_vocab(
    train_instances: List[NameInstance],
    special_tokens=None,
) -> Dict[str, int]:
    """Build vocabulary from names in the train set"""
    vocab_set = set()
    
    if special_tokens:
        vocab_set.update(special_tokens)
    else:
        special_tokens = set()
    
    for inst in train_instances:
        name = inst.name.lower()
        vocab_set.update(set(name))

    idx = 0
    vocab = {}
    
    for char in vocab_set:
        if char.isalpha() or char in special_tokens:
            vocab[char] = idx
            idx += 1
            
    return vocab


def build_label_map(name_instances: List[NameInstance]) -> Dict[str, int]:
    """Build label map from the whole dataset"""
    from collections import Counter

    label_count = Counter([instance.gender.lower() for instance in name_instances])
    
    return {label: idx for idx, label in enumerate(label_count)}

In [7]:
pad_token, unk_token = '@', '#'
V = build_vocab(train_names, special_tokens=[pad_token, unk_token])
label_map = build_label_map(names)

In [8]:
label_map

{'female': 0, 'male': 1}

In [9]:
V

{'f': 0,
 'x': 1,
 'h': 2,
 'y': 3,
 'b': 4,
 'q': 5,
 'p': 6,
 'z': 7,
 'd': 8,
 'l': 9,
 'c': 10,
 'i': 11,
 'e': 12,
 's': 13,
 '#': 14,
 'm': 15,
 'n': 16,
 'g': 17,
 'u': 18,
 'v': 19,
 't': 20,
 'a': 21,
 'j': 22,
 'r': 23,
 'w': 24,
 'k': 25,
 '@': 26,
 'o': 27}

### Dense vectorization
In general, we can regard vectorization as a two-step transformation. We need two operators $I$ and $T$, where $I$ looks up a token's ID and $T$ transforms that ID (the basic unit) into a vector (i.e., a 1-D tensor): $$\vec{v} = T(I(t))$$

Previously, in one-hot encoding, the transformation consisted of two sub-steps:
1. Look up the token's ID $i$ in the vocabulary.
2. Transform the integer ID $i$ into a vector by assigning $1$ at the $i^{th}$ position and $0$ at all other positions.
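
The two sub-steps can be sketched as follows. This is a minimal illustration that assumes a hypothetical four-character vocabulary `toy_vocab` (not the vocabulary built from the Names corpus) and uses `torch.nn.functional.one_hot` for the second step.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary, used only for illustration
toy_vocab = {"a": 0, "b": 1, "c": 2, "d": 3}


def one_hot_encode(token: str) -> torch.Tensor:
    """Map a token to its one-hot vector."""
    idx = toy_vocab[token]  # step 1: look up the token's ID
    # step 2: put 1 at position idx and 0 everywhere else
    return F.one_hot(torch.tensor(idx), num_classes=len(toy_vocab))


vec = one_hot_encode("c")
print(vec)  # tensor([0, 0, 1, 0])
```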

Dense vectorization follows the same two steps; the only difference is the resulting vector $\vec{v}$. It is dense, meaning it has far fewer dimensions than a sparse one-hot vector, so it reflects semantic similarity more easily.

That is to say, $T_{\text{dense}}$ produces a different output than $T_{\text{sparse}}$. We learned about (static) word embeddings in the lecture: each token in the training vocabulary has a corresponding dense vector of dimension $d$.

[`nn.Embedding`](https://docs.pytorch.org/docs/stable/generated/torch.nn.Embedding.html) implements $T_{\text{dense}}$ as a learnable lookup table with weights in $\mathbb{R}^{|V| \times d}$, so we can use it in practice.
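
To make the lookup-table view concrete, the sketch below checks that calling an `nn.Embedding` on token IDs returns the corresponding rows of its weight matrix, which is the same as multiplying one-hot vectors by that matrix. The sizes are hypothetical and chosen only to keep the example small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d = 5, 3                       # hypothetical |V| and embedding dimension
embedding = nn.Embedding(vocab_size, d)    # weight has shape (|V|, d)

ids = torch.tensor([2, 0, 4])              # token IDs, i.e. the output of I
dense = embedding(ids)                     # row lookup, shape (3, d)

# Equivalent computation: one-hot vectors times the weight matrix
one_hot = F.one_hot(ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(dense.shape)                         # torch.Size([3, 3])
print(torch.allclose(dense, via_matmul))   # True
```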

In [10]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

In [11]:
def tokenize(name: str) -> List[str]:
    """Tokenize a name string to a list of chars"""
    return list(name.lower())
    

def dense_map(
    name_instance: NameInstance,
    vocab: Dict[str, int],
    label_map: Dict[str, int],
    max_len: int,
    verbose: bool = False
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Map a name string to a list of indices and its gender index"""
    chars = tokenize(name_instance.name.lower())

    if len(chars) > max_len:
        new_chars = chars[:max_len]
    else:
        new_chars = chars + [pad_token] * (max_len - len(chars))
    
    indices, label = [vocab.get(char, vocab[unk_token]) for char in new_chars], label_map[name_instance.gender.lower()]

    if verbose:
        print(f"Original chars: {chars}\nAfter: {new_chars}")
    
    return torch.tensor(indices, dtype=torch.long), torch.tensor(label, dtype=torch.long)

In [12]:
dense_map(NameInstance('Bohan', 'male'), V, label_map, 10, True)

Original chars: ['b', 'o', 'h', 'a', 'n']
After: ['b', 'o', 'h', 'a', 'n', '@', '@', '@', '@', '@']


(tensor([ 4, 27,  2, 21, 16, 26, 26, 26, 26, 26]), tensor(1))

In [13]:
class NamesDataset(Dataset):
    def __init__(
        self,
        names: List[NameInstance],
        vocab: Dict[str, int],
        label_map: Dict[str, int],
        max_len: int
    ):
        self.names = names
        self.vocab = vocab
        self.label_map = label_map
        self.max_len = max_len

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        # Returns (padded index tensor of shape (max_len,), scalar label tensor)
        return dense_map(self.names[idx], self.vocab, self.label_map, self.max_len)
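
A `Dataset` that returns fixed-length index tensors can be batched directly by a `DataLoader`. The sketch below uses a hypothetical stand-in `ToyDataset` (so that it runs on its own) to show the batch shapes that result; it is not the `NamesDataset` itself.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ToyDataset(Dataset):
    """Hypothetical stand-in yielding (indices, label) pairs of fixed length."""

    def __init__(self, num_items: int, max_len: int):
        self.items = [
            (torch.full((max_len,), i, dtype=torch.long), torch.tensor(i % 2))
            for i in range(num_items)
        ]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int):
        return self.items[idx]


loader = DataLoader(ToyDataset(num_items=10, max_len=6), batch_size=4, shuffle=False)

# The default collate function stacks items along a new batch dimension
batch_shapes = [(x.shape, y.shape) for x, y in loader]
print(batch_shapes)  # two batches of 4 items, then a final batch of 2
```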

## CNN (or ConvNet)

### Introduction

As its name suggests, a convolutional neural network (CNN) is a neural network built mainly around convolution operations, originally applied to images (along with other important operations such as pooling and activation functions).

The convolution can be expressed as
$$Y = W * X$$
where $W$ is called the _kernel_ (or _filter_) and $X$ is the input (e.g., an image) to be transformed.

To better understand convolution, think of it as a series of dot products. The kernel is a sliding magnifier that extracts features from the region it currently covers. Over the course of its motion (i.e., the convolution), the kernel can be configured in two ways:
* What to see => padding & dilation
* How to move => stride
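
The effect of these settings on the output length can be sketched with the formula given in the PyTorch `Conv1d` documentation, $L_{out} = \lfloor (L_{in} + 2 \cdot \text{padding} - \text{dilation} \cdot (\text{kernel\_size} - 1) - 1) / \text{stride} \rfloor + 1$. The helper `conv1d_out_len` and the input sizes below are hypothetical and serve only to compare the formula with what `nn.Conv1d` actually returns.

```python
import torch
import torch.nn as nn


def conv1d_out_len(in_len, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (hypothetical helper)."""
    return (in_len + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


x = torch.randn(1, 3, 10)  # (batch_size, channels, seq_len), hypothetical sizes

configs = [
    dict(kernel_size=2),               # plain bigram window
    dict(kernel_size=3, padding=1),    # padding keeps the length at 10
    dict(kernel_size=3, stride=2),     # stride skips every other position
    dict(kernel_size=3, dilation=2),   # dilation widens the window to 5 positions
]

results = []
for cfg in configs:
    conv = nn.Conv1d(in_channels=3, out_channels=1, **cfg)
    actual = conv(x).size(-1)
    expected = conv1d_out_len(10, **cfg)
    results.append((actual, expected))
    print(cfg, actual, expected)
```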

### Example

```
1.     Embedding          Kernel

    We     : -1, 0, 1     -1, 1
    are    :  0, 1, 2      1, 0
    here   :  1, 2, 3      0, 1
    .      :  1, 1, 1


```

```
2.  Embedding Transpose      Kernel

    We  are   here   .
    -1   0     1     1       -1, 1
     0   1     2     1        1, 0
     1   2     3     1        0, 1
```

```
3. Convolution #1

    Position of window: [We are] here .
    
    We  are     Kernel
    -1   0      -1, 1      1 0
     0   1   x   1, 0   =  0 0  => sum: 3
     1   2       0, 1      0 2
```

```
4. Convolution #2

    Position of window: We [are here] .
    
    are  here      Kernel
     0     1       -1, 1      0 1
     1     2    x   1, 0   =  1 0  => sum: 5
     2     3        0, 1      0 3
```

```
5. Convolution #3

        Position of window: We are [here .]
        
     here   .      Kernel
      1     1      -1, 1     -1 1
      2     1   x   1, 0   =  2 0  => sum: 3
      3     1       0, 1      0 1
```

```
6. Output

    3
    5
    3
```
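
The worked example above can also be reproduced by hand with a sliding window and element-wise products, without any convolution layer. This is a minimal sketch that reuses the embedding and kernel values from the example.

```python
import torch

# Embeddings of "We are here ." with shape (seq_len=4, embed_dim=3)
embeds = torch.tensor([
    [-1., 0., 1.],
    [0., 1., 2.],
    [1., 2., 3.],
    [1., 1., 1.],
])
# Kernel with shape (embed_dim=3, kernel_size=2)
kernel = torch.tensor([
    [-1., 1.],
    [1., 0.],
    [0., 1.],
])

x = embeds.T                    # (embed_dim, seq_len), as in step 2
kernel_size = kernel.size(1)

outputs = []
for start in range(x.size(1) - kernel_size + 1):
    window = x[:, start:start + kernel_size]         # two adjacent tokens
    outputs.append((window * kernel).sum().item())   # element-wise product, then sum

print(outputs)  # [3.0, 5.0, 3.0]
```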

### Implementation

PyTorch provides two commonly used convolution APIs: `conv1d` and `conv2d` (available as the modules `nn.Conv1d` and `nn.Conv2d`).

#### `conv1d`
It computes a cross-correlation (which measures the similarity between two signals) over one-dimensional data, for example, a time series or a sequence of word embeddings.

INPUT: `(batch_size, in_embed_dim, in_seq_len)`

OUTPUT: `(batch_size, out_embed_dim, out_seq_len)`

---

Arguments:
* `in_channels`: the size of each input embedding
* `out_channels`: the size of the output feature vector, i.e., the number of kernels (analogous to the size of the next hidden layer in an FFN)
* `kernel_size`: the number of tokens covered by the kernel at each position
  * e.g. $2$ stands for bigrams, $3$ stands for trigrams
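
The following sketch shows how these arguments determine the shapes involved. The sizes are hypothetical; with the default stride and no padding, the output length is `in_seq_len - kernel_size + 1`.

```python
import torch
import torch.nn as nn

batch_size, embed_dim, seq_len = 4, 8, 10   # hypothetical sizes
x = torch.randn(batch_size, embed_dim, seq_len)

bigram_conv = nn.Conv1d(in_channels=embed_dim, out_channels=16, kernel_size=2)
trigram_conv = nn.Conv1d(in_channels=embed_dim, out_channels=16, kernel_size=3)

print(bigram_conv(x).shape)      # torch.Size([4, 16, 9])
print(trigram_conv(x).shape)     # torch.Size([4, 16, 8])

# One kernel of shape (in_channels, kernel_size) per output channel
print(bigram_conv.weight.shape)  # torch.Size([16, 8, 2])
```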


#### `conv2d`
Similar to `conv1d`, but it operates on two-dimensional data, for example, images or other grid-like data.

INPUT: `(batch_size, in_channels, height, width)`

OUTPUT: `(batch_size, out_channels, out_height, out_width)`
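
For comparison, here is a minimal shape check for `nn.Conv2d`, assuming a hypothetical batch of two 3-channel 28x28 inputs and a non-square kernel so that the height and width change differently.

```python
import torch
import torch.nn as nn

images = torch.randn(2, 3, 28, 28)   # (batch_size, channels, height, width), hypothetical
conv2d = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=(3, 5))

out = conv2d(images)
# height: 28 - 3 + 1 = 26, width: 28 - 5 + 1 = 24
print(out.shape)  # torch.Size([2, 8, 26, 24])
```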

In [14]:
sent_embeds = torch.tensor([[
    [-1, 0, 1],
    [0, 1, 2],
    [1, 2, 3],
    [1, 1, 1]
]], dtype=torch.float32)

sent_embeds, sent_embeds.shape

(tensor([[[-1.,  0.,  1.],
          [ 0.,  1.,  2.],
          [ 1.,  2.,  3.],
          [ 1.,  1.,  1.]]]),
 torch.Size([1, 4, 3]))

In [15]:
sent_embeds_transpose = sent_embeds.transpose(1, 2)

sent_embeds_transpose, sent_embeds_transpose.shape 

(tensor([[[-1.,  0.,  1.,  1.],
          [ 0.,  1.,  2.,  1.],
          [ 1.,  2.,  3.,  1.]]]),
 torch.Size([1, 3, 4]))

In [16]:
kernel = torch.tensor([[
    [-1, 1],
    [1, 0],
    [0, 1]
]], dtype=torch.float32)

conv_op = nn.Conv1d(
    in_channels=3,
    out_channels=1,
    kernel_size=2,
    bias=False
)

conv_op.weight = nn.Parameter(kernel)

conv_op.weight, conv_op.weight.shape

(Parameter containing:
 tensor([[[-1.,  1.],
          [ 1.,  0.],
          [ 0.,  1.]]], requires_grad=True),
 torch.Size([1, 3, 2]))

In [17]:
conv_out = conv_op(sent_embeds_transpose)

conv_out, conv_out.shape

(tensor([[[3., 5., 3.]]], grad_fn=<ConvolutionBackward0>),
 torch.Size([1, 1, 3]))

In [18]:
# Activation function
relu = nn.ReLU()
conv_relu = relu(conv_out)
conv_relu, conv_relu.shape

(tensor([[[3., 5., 3.]]], grad_fn=<ReluBackward0>), torch.Size([1, 1, 3]))

### Pooling

Pooling is another sliding filter that follows the convolution layer (and the activation function) but behaves differently. It applies a fixed operation to the values in each sliding window (say, taking the maximum) and discards the rest. Therefore, like convolution, pooling also changes the shape of a feature representation.


PyTorch provides the following pooling strategies, among others:
* Max pooling
* Average (mean) pooling
* Power-average (LP) pooling
* Adaptive max pooling
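
The sketch below applies each of these strategies to one small hypothetical input through the functional API, so the results can be compared side by side. With `kernel_size=2` and the default stride, the windows are `[3, 5]` and `[3, 1]`.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[3., 5., 3., 1.]]])   # (batch_size=1, channels=1, seq_len=4)

max_out = F.max_pool1d(x, kernel_size=2)                 # maximum of each window
avg_out = F.avg_pool1d(x, kernel_size=2)                 # mean of each window
lp_out = F.lp_pool1d(x, norm_type=2.0, kernel_size=2)    # sqrt of the sum of squares per window
adaptive_out = F.adaptive_max_pool1d(x, output_size=1)   # window chosen to reach the target length

print(max_out)       # tensor([[[5., 3.]]])
print(avg_out)       # tensor([[[4., 2.]]])
print(lp_out)
print(adaptive_out)  # tensor([[[5.]]])
```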

#### `nn.MaxPool1d`
Key arguments:
* `kernel_size`: the size of the sliding window

#### `nn.functional.max_pool1d`
Key arguments:
* `input`
* `kernel_size`

What's the difference?
* `nn.MaxPool1d` is a module that is typically declared in `__init__`, so you need to know the window size (e.g., the "length" of the input) in advance
* `nn.functional.max_pool1d` is a plain function, so the window size can be read from the last dimension of the input at call time
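
A minimal sketch of the two forms side by side, assuming a hypothetical convolution output of length 7. Both pool over the whole sequence and give identical results; the only difference is when the window size is supplied.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv_out = torch.randn(4, 16, 7)   # hypothetical (batch_size, num_hidden, out_seq_len)

# Module form: the window size is fixed when the layer is created
pool_layer = nn.MaxPool1d(kernel_size=7)
module_out = pool_layer(conv_out)

# Functional form: the window size is read from the tensor at call time
functional_out = F.max_pool1d(conv_out, kernel_size=conv_out.size(-1))

print(module_out.shape, functional_out.shape)   # both torch.Size([4, 16, 1])
print(torch.equal(module_out, functional_out))  # True
```

The functional form is convenient when sequence lengths vary across batches, since nothing has to be recomputed in `__init__`.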

In [19]:
import torch.nn.functional as F
conv_pool = F.max_pool1d(conv_relu, kernel_size=conv_relu.size(-1))

conv_pool, conv_pool.shape

(tensor([[[5.]]], grad_fn=<SqueezeBackward1>), torch.Size([1, 1, 1]))

In [20]:
conv_pool = conv_pool.squeeze(-1)

conv_pool, conv_pool.shape

(tensor([[5.]], grad_fn=<SqueezeBackward1>), torch.Size([1, 1]))

### Train CNN using Names corpus

In [21]:
class CNN(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        embed_dim: int,
        kernel_size: int,
        num_hidden: int,
        num_class: int,
        padding_idx: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=padding_idx)
        self.conv = nn.Conv1d(embed_dim, num_hidden, kernel_size)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(num_hidden, num_class)

    def forward(self, data):
        # input data's shape: (`batch_size`, `seq_len`) -> batch of indices
        ### 1. embedding: turn indices into feature vectors
        embeds = self.embedding(data)                # (`batch_size`, `seq_len`, `embed_dim`)
        embeds = torch.transpose(embeds, 1, 2)       # (`batch_size`, `embed_dim`, `seq_len`)

        ### 2. convolution and activation
        conv = self.relu(self.conv(embeds))          # (`batch_size`, `num_hidden`, `seq_len` - `kernel_size` + 1)

        ### 3. max pooling allows you to reduce the last dimension
        conv = F.max_pool1d(conv, conv.size(-1))     # (`batch_size`, `num_hidden`, 1)
        conv = torch.squeeze(conv, -1)               # (`batch_size`, `num_hidden`)

        ### 4. final linear layer to output scores/logits over all possible labels
        logits = self.linear(conv)                   # (`batch_size`, `num_class`)
        return logits
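
The shape comments in `forward` can be verified with a quick trace on random indices. The sketch below rebuilds the same layer sequence outside the class so that it runs on its own; all sizes are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes, chosen only for illustration
batch_size, seq_len, vocab_size = 4, 6, 28
embed_dim, num_hidden, kernel_size, num_class = 32, 16, 3, 2

embedding = nn.Embedding(vocab_size, embed_dim)
conv = nn.Conv1d(embed_dim, num_hidden, kernel_size)
linear = nn.Linear(num_hidden, num_class)

data = torch.randint(0, vocab_size, (batch_size, seq_len))       # batch of indices
embeds = embedding(data).transpose(1, 2)                          # (4, 32, 6)
features = F.relu(conv(embeds))                                   # (4, 16, 6 - 3 + 1)
pooled = F.max_pool1d(features, features.size(-1)).squeeze(-1)    # (4, 16)
logits = linear(pooled)                                           # (4, 2)

for name, t in [("data", data), ("embeds", embeds), ("features", features),
                ("pooled", pooled), ("logits", logits)]:
    print(f"{name:>8}: {tuple(t.shape)}")
```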

In [22]:
# batch size
BATCH_SIZE = 32

# training
LEARNING_RATE = 0.01
NUM_EPOCHS = 20

# model
EMBED_DIM = 32   # arbitrary embedding dimension
NUM_HIDDEN = 16  # arbitrary number of hidden dimensions
KERNEL_SIZE = 3  # kernel size for CNN (3 means trigrams)

# Device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [23]:
cnn = CNN(
    vocab_size=len(V),
    embed_dim=EMBED_DIM,     
    kernel_size=KERNEL_SIZE, 
    num_hidden=NUM_HIDDEN,   
    num_class=len(label_map),
    padding_idx=V[pad_token]
)
cnn.to(device)

CNN(
  (embedding): Embedding(28, 32, padding_idx=26)
  (conv): Conv1d(32, 16, kernel_size=(3,), stride=(1,))
  (relu): ReLU()
  (linear): Linear(in_features=16, out_features=2, bias=True)
)

In [24]:
optimizer = torch.optim.SGD(cnn.parameters(), lr=LEARNING_RATE)
loss_fn = nn.CrossEntropyLoss()

In [25]:
train_ds = NamesDataset(train_names, V, label_map, 6)
test_ds = NamesDataset(test_names, V, label_map, 6)

In [26]:
# training loop
training_dataloader = DataLoader(
    train_ds,
    batch_size=BATCH_SIZE,
    shuffle=True
)

for i in range(NUM_EPOCHS):
    epoch_loss = 0  # total loss in any given epoch

    for x, y in training_dataloader:
        x, y = x.to(device), y.to(device)

        optimizer.zero_grad()      # 1. clear out gradients from the last step

        o = cnn(x)                 # 2. forward pass (this calls `forward` function of `cnn`)
        loss = loss_fn(o, y)       # 3. compute loss
        loss.backward()            # 4. backward pass (i.e., computes gradients)

        optimizer.step()           # 5. update weights

        epoch_loss += loss.item()  # (optional) accumulate loss for the entire epoch
    print('Epoch [{}/{}] | Loss: {:.3f}'.format(i+1, NUM_EPOCHS, epoch_loss))

Epoch [1/20] | Loss: 122.064
Epoch [2/20] | Loss: 117.113
Epoch [3/20] | Loss: 114.083
Epoch [4/20] | Loss: 111.724
Epoch [5/20] | Loss: 109.222
Epoch [6/20] | Loss: 107.089
Epoch [7/20] | Loss: 105.120
Epoch [8/20] | Loss: 103.465
Epoch [9/20] | Loss: 101.934
Epoch [10/20] | Loss: 100.661
Epoch [11/20] | Loss: 99.608
Epoch [12/20] | Loss: 98.312
Epoch [13/20] | Loss: 97.543
Epoch [14/20] | Loss: 96.575
Epoch [15/20] | Loss: 95.615
Epoch [16/20] | Loss: 94.856
Epoch [17/20] | Loss: 94.193
Epoch [18/20] | Loss: 93.532
Epoch [19/20] | Loss: 92.506
Epoch [20/20] | Loss: 92.018


In [27]:
# Evaluation / Inference / Prediction Loop
test_dataloader = DataLoader(
    test_ds,
    batch_size=BATCH_SIZE,
    shuffle=False
)

with torch.no_grad(): # only during inference
    num_correct = 0
    for x, y in test_dataloader:
        x, y = x.to(device), y.to(device)

        logits = cnn(x)
        preds = logits.argmax(dim=-1)

        num_correct += (preds == y).sum()
    acc = num_correct.item() / len(test_ds)
    print("TEST Acc: {:.3f}".format(acc))

TEST Acc: 0.727
