# Transformer-Based Dialogue Act Classification (From Scratch)

## Problem Statement

The objective of this project is to perform **sentence-level dialogue act classification** using a **Transformer encoder implemented from scratch**. Given a single sentence as input, the model predicts the corresponding dialogue act class. Pretrained Transformer models are deliberately avoided to demonstrate complete conceptual understanding of the architecture and training process.

---

## Data Preprocessing

The original dataset is dialogue-level, where each dialogue contains multiple utterances with associated act labels. Because the classifier is trained on individual sentences rather than whole dialogues, I convert the data into **one sentence → one label** format.

Preprocessing involves:
- Parsing dialogues into individual sentences
- Aligning each sentence with its dialogue act
- Lowercasing text
- Tokenizing into word-level tokens
- Storing only token lists and labels (no embeddings)

The output of preprocessing is stored as pickle files containing tokenized sentences and integer labels.
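
The flattening step can be sketched as follows. This is a minimal illustration, not the actual preprocessing script: the input layout (a dict with parallel `utterances` and `acts` lists), the helper name `flatten_dialogues`, and the regex tokenizer are all assumptions made for the example. Only the output column names `sentence` and `act` are taken from the notebook cells.

```python
import re

def flatten_dialogues(dialogues):
    """Turn dialogue-level records into (tokens, label) rows, one per utterance."""
    rows = []
    for dialogue in dialogues:
        utterances = dialogue["utterances"]
        acts = dialogue["acts"]
        # Each utterance must have exactly one act label.
        if len(utterances) != len(acts):
            raise ValueError("utterances and acts are misaligned")
        for text, act in zip(utterances, acts):
            # Lowercase, then split into word and punctuation tokens
            # (a regex stand-in for the real tokenizer).
            tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
            rows.append({"sentence": tokens, "act": act})
    return rows

# Hypothetical toy input, for illustration only.
toy = [{"utterances": ["Can you help me?", "Sure."], "acts": [3, 4]}]
rows = flatten_dialogues(toy)
```

Each row keeps only a token list and a label, so a list of such rows can be loaded into a DataFrame and pickled.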

---

## Word Embeddings

I use **Word2Vec Skip-Gram** to obtain dense static word embeddings. One-hot encoding is avoided due to sparsity and lack of semantic information. Embeddings are trained offline and saved as `KeyedVectors`.

During training and inference:
- Tokens are converted to vectors dynamically inside the dataset
- Sentence embeddings are not precomputed
- Only token lists are stored, so the dataset's memory footprint does not grow with the embedding dimension

---

## Dataset Design

The custom dataset class:
- Loads tokenized sentences and labels
- Converts tokens to Word2Vec vectors at runtime
- Returns a variable-length tensor per sentence

Out-of-vocabulary tokens are skipped. If none of a sentence's tokens is in the vocabulary, a single zero vector is returned so the input is never empty.
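
The lookup rule can be illustrated with a plain dictionary standing in for the `KeyedVectors` object. The helper name `tokens_to_vectors` and the toy vocabulary are hypothetical, chosen only to keep the sketch self-contained.

```python
def tokens_to_vectors(tokens, vectors, dim):
    """Look up each in-vocabulary token; fall back to one zero vector."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        # No known token at all: return a single zero vector of the right size.
        found = [[0.0] * dim]
    return found

# Hypothetical 2-dimensional vocabulary standing in for KeyedVectors.
toy_vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
known = tokens_to_vectors(["hello", "zzz", "world"], toy_vectors, dim=2)
unknown = tokens_to_vectors(["zzz"], toy_vectors, dim=2)
```

One consequence of this rule is that a sentence's tensor can be shorter than its token list, and the zero-vector fallback counts as one valid position for masking and pooling.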

---

## Padding and Attention Masking

Sentences have variable lengths, so I pad them at batch time using `pad_sequence`. An attention mask is created with shape `(B, 1, 1, L)` where:
- Valid tokens have mask value `0`
- Padding tokens have mask value `-1e9`

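This mask is added to the attention scores before softmax and broadcasts over all heads and query positions, so padded key positions receive attention weights that are effectively zero.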
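
A small NumPy sketch of the same idea, kept independent of the PyTorch cells: the helper names `build_additive_mask` and `softmax` are illustrative, and uniform scores are used so the effect of the mask is easy to read.

```python
import numpy as np

def build_additive_mask(lengths, max_len, neg=-1e9):
    """Return a (B, 1, 1, L) mask: 0 for real tokens, `neg` for padding."""
    positions = np.arange(max_len)[None, :]             # (1, L)
    is_pad = positions >= np.asarray(lengths)[:, None]  # (B, L), True at padding
    return (is_pad.astype(np.float32) * neg)[:, None, None, :]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Two sentences of lengths 3 and 1, padded to length 3.
mask = build_additive_mask([3, 1], max_len=3)
scores = np.zeros((2, 1, 3, 3), dtype=np.float32)  # (B, heads, L, L), uniform scores
weights = softmax(scores + mask)
```

For the first sentence every key gets weight 1/3. For the second, all weight falls on the single real token and the two padded keys get zero.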

---

## Positional Encoding

Since self-attention does not encode word order, I add **sinusoidal positional encoding** to input embeddings. Positional encodings are:
- Precomputed once up to the maximum sequence length
- Stored as a non-trainable buffer
- Added only at the encoder input

This preserves sequence order without additional parameters.
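
For reference, the table can also be built without explicit loops. The sketch below is a NumPy variant that assumes an even `d_model`; the function name is illustrative. Column `2i` holds `sin(pos / 10000^(2i / d_model))` and column `2i + 1` holds the matching cosine.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table; assumes d_model is even."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles)  # even columns
    table[:, 1::2] = np.cos(angles)  # odd columns
    return table

pe = sinusoidal_positional_encoding(max_len=50, d_model=8)
```

Position 0 is all zeros in the sine columns and all ones in the cosine columns, which is a quick sanity check on any implementation.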

---

## Transformer Encoder Architecture

Each encoder block consists of:
- Multi-head self-attention with scaled dot-product attention
- Position-wise feed-forward network
- Residual connections and layer normalization

Multiple encoder blocks are stacked to progressively contextualize token representations. The attention mask is applied before softmax to suppress padding tokens.

---

## Sentence Representation and Pooling

The encoder produces token-level contextual embeddings. To obtain a sentence-level representation, I apply **masked mean pooling**:
- Padding positions are zeroed out
- Valid token embeddings are summed
- The sum is divided by the number of valid tokens

This ensures padding does not influence the sentence embedding.
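
The three steps can be checked on a tiny example. This NumPy sketch uses a hypothetical helper `masked_mean_pool` and deliberately puts large values in the padded row to show that they have no effect on the result.

```python
import numpy as np

def masked_mean_pool(encoded, valid):
    """encoded: (B, L, d) token embeddings; valid: (B, L) booleans, True for real tokens."""
    valid = valid[:, :, None].astype(encoded.dtype)  # (B, L, 1)
    summed = (encoded * valid).sum(axis=1)           # (B, d), padding contributes 0
    counts = np.maximum(valid.sum(axis=1), 1.0)      # (B, 1), guard against division by zero
    return summed / counts

encoded = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])  # last row is padding
valid = np.array([[True, True, False]])
pooled = masked_mean_pool(encoded, valid)
```

The pooled vector is the mean of the first two rows only. A plain mean over all three rows would have been dominated by the padded row.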

---

## Classification Head

The pooled sentence embedding is passed through a linear layer mapping `d_model` to `num_classes`, producing logits for dialogue act prediction.

---

## Label Handling and Class Imbalance

Dialogue act labels are converted to a contiguous zero-indexed range `[0, C-1]`, as required by `CrossEntropyLoss`.

To handle class imbalance, I compute inverse-frequency class weights, smooth them by raising them to the power `alpha = 0.75`, normalize them to sum to 1, and pass them to the loss function. This is intended to reduce bias toward dominant classes and improve minority-class learning.
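
The weighting scheme can be sketched in plain Python. The label values below are hypothetical and the helper name is illustrative; only the formula `(1 / count) ** alpha` followed by normalization mirrors the notebook.

```python
from collections import Counter

def smoothed_class_weights(labels, alpha=0.75):
    """Return weights proportional to (1 / count) ** alpha, normalised to sum to 1."""
    counts = Counter(labels)
    classes = sorted(counts)
    raw = [(1.0 / counts[c]) ** alpha for c in classes]
    total = sum(raw)
    return {c: w / total for c, w in zip(classes, raw)}

# Hypothetical 1-indexed labels, shifted to the contiguous range [0, C-1].
raw_labels = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2]
labels = [y - 1 for y in raw_labels]
weights = smoothed_class_weights(labels)
```

With counts of 8 and 2, the minority class gets a weight `4 ** 0.75` (about 2.83) times larger than the majority class. Setting `alpha = 1` would give the full inverse-frequency ratio of 4, and `alpha = 0` would give uniform weights.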

---

## Training and Validation

The model is trained using the Adam optimizer with gradient-norm clipping (`max_norm=1.0`) for stability. The average training loss is reported once per epoch.

After each epoch, I evaluate the model on a validation set and compute:
- Validation loss
- Accuracy
- Macro F1-score

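Macro F1 is emphasized because it weights every class equally regardless of its frequency, and it is the intended criterion for choosing the best checkpoint. The training cell in this notebook prints these metrics per epoch but does not itself save or restore a checkpoint.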
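
Selecting a checkpoint by validation Macro F1 can be sketched as below. This is one possible approach, not code taken from the training cell: the helper `select_best` and the placeholder states are hypothetical, while the F1 values are the ones printed in this notebook's training log. With PyTorch, the stored state would typically be `copy.deepcopy(model.state_dict())`, restored later with `model.load_state_dict(...)`.

```python
import copy

def select_best(history):
    """history: iterable of (epoch, val_macro_f1, state); keep the best state seen."""
    best_f1, best_epoch, best_state = float("-inf"), None, None
    for epoch, val_f1, state in history:
        if val_f1 > best_f1:
            best_f1, best_epoch = val_f1, epoch
            # Copy so later updates to the live weights cannot alter the snapshot.
            best_state = copy.deepcopy(state)
    return best_epoch, best_f1, best_state

# Validation macro F1 per epoch, taken from the training log in this notebook.
val_f1_log = [0.7019, 0.6976, 0.7201, 0.7167, 0.7215, 0.7255, 0.7266, 0.7238]
# Placeholder states standing in for real model weights.
history = [(i + 1, f1, {"epoch": i + 1}) for i, f1 in enumerate(val_f1_log)]
best_epoch, best_f1, best_state = select_best(history)
```

On the logged values this picks epoch 7, one epoch before the end of training, even though validation loss was already rising by then.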

---

## Testing

After training, the model is evaluated on the held-out test set. Reported metrics include:
- Accuracy
- Macro F1-score
- Confusion matrix
- Classification report

This provides an unbiased estimate of generalization performance.
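
Accuracy and Macro F1 can both be recovered from the confusion matrix, which is a useful cross-check on the reported numbers. The sketch below uses the test confusion matrix printed in this notebook's output (rows are true classes, columns are predictions, following the scikit-learn convention); the helper name is illustrative.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(n)) / total
    per_class = []
    for k in range(n):
        tp = cm[k][k]
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        actual = sum(cm[k])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        f1 = 2 * tp / (predicted + actual) if (predicted + actual) else 0.0
        per_class.append((precision, recall, f1))
    macro_f1 = sum(f1 for _, _, f1 in per_class) / n
    return accuracy, macro_f1, per_class

# Test-set confusion matrix as printed in this notebook's output.
cm = [
    [2750, 66, 229, 489],
    [27, 2086, 84, 13],
    [246, 126, 804, 102],
    [192, 23, 70, 433],
]
accuracy, macro_f1, per_class = metrics_from_confusion(cm)
```

The recomputed values agree with the reported test accuracy (0.7846) and Macro F1 (0.7214). The per-class breakdown shows that class 3 has the lowest F1 (about 0.49) and class 1 the highest (about 0.92).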

---

## Inference

A prediction function is implemented to:
- Tokenize a user-provided sentence
- Convert tokens to Word2Vec embeddings
- Perform a forward pass through the model
- Output the predicted dialogue act and class probabilities

This enables direct usage of the trained model on unseen text.

---

## Summary

This project demonstrates a complete Transformer encoder implemented from first principles, including dynamic embeddings, attention masking, positional encoding, sentence pooling, class imbalance handling, and rigorous evaluation. The design prioritizes conceptual clarity, correctness, and reproducibility.

In [2]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [3]:
import pandas as pd
import numpy as np
import gensim
import torch
import torch.nn as nn
import torch.nn.utils.rnn
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader
from gensim.models import KeyedVectors

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
wv = KeyedVectors.load("/content/drive/MyDrive/Transformer-Model/word2vec-256-dim.kv")

df = pd.read_pickle("/content/drive/MyDrive/Transformer-Model/train_clean_tokens_and_labels_act.pkl")
df_val = pd.read_pickle("/content/drive/MyDrive/Transformer-Model/val_clean_tokens_and_labels_act.pkl")
df_test = pd.read_pickle("/content/drive/MyDrive/Transformer-Model/test_clean_tokens_and_labels_act.pkl")

# Longest sentence across train/val/test, plus a 10-token buffer, so the
# positional-encoding table covers every input in these splits.
max_len_train = max(len(s) for s in df["sentence"])
max_len_val = max(len(s) for s in df_val["sentence"])
max_len_test = max(len(s) for s in df_test["sentence"])
max_len = max(max_len_train, max_len_val, max_len_test) + 10

# Shift labels to the zero-indexed range [0, C-1] required by CrossEntropyLoss
# (this assumes the stored labels start at 1).
df["act"] = df["act"] - 1
df_val["act"] = df_val["act"] - 1
df_test["act"] = df_test["act"] - 1

# Assumes every class occurs at least once in the training split.
num_classes = df["act"].nunique()

# Smoothed inverse-frequency class weights: (1 / count) ** alpha, normalised to sum to 1.
alpha = 0.75
class_counts = df["act"].value_counts().sort_index().values
weights = torch.tensor((1.0 / class_counts)**alpha, dtype=torch.float32)
weights = weights / weights.sum()

In [6]:
class SentenceDataset(torch.utils.data.Dataset):
    def __init__(self, df, wv):
        self.sentences = df["sentence"].tolist()
        self.labels = df['act'].tolist()
        self.wv = wv
        self.embedding_dim = wv.vector_size

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        tokens = self.sentences[idx]

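        # Look up each token; out-of-vocabulary tokens are skipped.
        vectors = []
        for token in tokens:
            if token in self.wv:
                vectors.append(torch.tensor(self.wv[token], dtype=torch.float32))

        # Fall back to a single zero vector so the sequence is never empty.
        if len(vectors) == 0:
            vectors.append(torch.zeros(self.embedding_dim))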

        sentence_tensor = torch.stack(vectors)   # (L, d_model)
        label = int(self.labels[idx])

        return sentence_tensor, label

In [7]:
def collate_fn(batch):
    # batch = [(sentence_tensor, label), ...]

    sentences, labels = zip(*batch)  # unzip

    lengths = torch.tensor([x.shape[0] for x in sentences])  # (B,)

    # Pad sentence tensors
    padded = pad_sequence(sentences, batch_first=True).float()  # (B, L, d_model)

    # Additive attention mask: 0 for real tokens, -1e9 for padding
    batch_len = padded.shape[1]  # longest sentence in this batch
    mask = torch.arange(batch_len, device=padded.device).expand(len(sentences), batch_len)
    mask = mask >= lengths.unsqueeze(1)    # (B, L), True at padded positions
    mask = mask.unsqueeze(1).unsqueeze(2)  # (B,1,1,L)
    mask = mask.float() * -1e9

    labels = torch.tensor(labels, dtype=torch.long)

    return padded, mask, labels

In [8]:
class PositionalEncoding(nn.Module):
    def __init__(self, max_len, d_model):
        super().__init__()

        PE = np.zeros((max_len, d_model))

        for pos in range(max_len):
            for i in range(d_model // 2):
                PE[pos, 2*i] = np.sin(pos / (10000 ** (2*i / d_model)))
                PE[pos, 2*i + 1] = np.cos(pos / (10000 ** (2*i / d_model)))

        # convert to tensor
        PE = torch.tensor(PE, dtype=torch.float32)

        # register as buffer (not parameter)
        self.register_buffer("PE", PE)

    def forward(self, x):

        # shape of x is (B, L, d_model)
        L = x.size(1)
        return x + self.PE[:L]

In [9]:
class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.head_dim = d_model // num_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):
        # x has a shape of (batch, sequence_length, model_dimension)
        B = x.shape[0]
        L = x.shape[1]

        Q = self.W_Q(x) # all are of the shape (B, L, d_model)
        K = self.W_K(x)
        V = self.W_V(x)


        # Split Q, K and V into heads by reshaping to
        # (B, L, num_heads, head_dim), where num_heads * head_dim == d_model.
        Q = Q.reshape(B, L, self.num_heads, self.head_dim)
        K = K.reshape(B, L, self.num_heads, self.head_dim)
        V = V.reshape(B, L, self.num_heads, self.head_dim)

        # rearranging the dimensions to (B, num_heads, L, head_dim)
        Q = Q.permute(0, 2, 1, 3)
        K = K.permute(0, 2, 1, 3)
        V = V.permute(0, 2, 1, 3)

        # calculating the scaled attention scores
        scores = Q @ K.transpose(-2, -1) # dim = (B, num_heads, L, L)
        scores = scores / (self.head_dim**0.5)
        if mask is not None:
            scores = scores + mask
        attention_weights = torch.softmax(scores, dim=-1) #dim = (B, num_heads, L, L)
        out = attention_weights @ V

        # making the output dimensions equal to the input dimensions.
        out = out.permute(0, 2, 1, 3)
        out = out.reshape(B, L, self.d_model)

        # applying the output projection
        out = self.W_O(out)

        return out

In [10]:
class FeedForward(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(d_model, 4*d_model),  # expand to 4*d_model, the standard Transformer FFN width
            nn.ReLU(),
            nn.Linear(4*d_model, d_model)   # project back to d_model
        )

    def forward(self, x):
        out = self.network(x)
        return out

In [11]:
class EncoderBlock(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()

        self.attention = MultiHeadSelfAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model)
        self.layer_norm1 = nn.LayerNorm(d_model)
        self.layer_norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):

        attn_out = self.attention(x, mask)

        x = self.layer_norm1(attn_out + x)

        ffn_out = self.feed_forward(x)

        x = self.layer_norm2(ffn_out + x)

        return x

In [12]:
class Encoder(nn.Module):
    def __init__(self, d_model, num_heads, num_layers, max_len):
        super().__init__()
        self.num_layers = num_layers
        self.positional_encoding = PositionalEncoding(max_len, d_model)
        self.blocks = nn.ModuleList([EncoderBlock(d_model, num_heads) for i in range(num_layers)])

    def forward(self, x, mask=None):
        x = self.positional_encoding(x)

        for block in self.blocks:
            x = block(x, mask)
        return x

In [13]:
class SentenceActModel(nn.Module):
    def __init__(self, d_model, num_heads, num_layers, num_classes, max_len):
        super().__init__()

        # Encoder backbone
        self.encoder = Encoder(d_model, num_heads, num_layers, max_len=max_len)

        # Classification head
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x, mask):
        """
        x:    (B, L, d_model)
        mask: (B, 1, 1, L)
        """

        # Encoder
        encoded = self.encoder(x, mask)  # (B, L, d_model)

        # mask for pooling
        # mask == 0 → valid tokens
        valid_mask = (mask == 0).squeeze(1).squeeze(1)  # (B, L)

        # Zero-out padding embeddings
        valid_mask = valid_mask.unsqueeze(-1)           # (B, L, 1)
        encoded = encoded * valid_mask                   # (B, L, d_model)

        # Sum over tokens
        summed = encoded.sum(dim=1)                      # (B, d_model)

        # Count real tokens
        lengths = valid_mask.sum(dim=1)                  # (B, 1)
        lengths = lengths.clamp(min=1)                   # avoid divide-by-zero

        # Mean pooling
        sentence_embedding = summed / lengths            # (B, d_model)

        # Classification
        logits = self.classifier(sentence_embedding)     # (B, num_classes)

        return logits

In [16]:
dataset = SentenceDataset(df, wv)
loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

In [17]:
val_dataset = SentenceDataset(df_val, wv)
test_dataset = SentenceDataset(df_test, wv)

val_loader = DataLoader(val_dataset,batch_size=32,shuffle=False,collate_fn=collate_fn)

test_loader = DataLoader(test_dataset,batch_size=32,shuffle=False,collate_fn=collate_fn)

In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

model = SentenceActModel(
    d_model=256,
    num_heads=8,
    num_layers=6,
    num_classes=num_classes,
    max_len=max_len
).to(device)

criterion = nn.CrossEntropyLoss(weight=weights.to(device))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

Using device: cuda


In [19]:
from sklearn.metrics import accuracy_score, f1_score

def evaluate(model, loader, criterion, device):
    model.eval()

    total_loss = 0.0
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for padded_batch, mask, labels in loader:
            padded_batch = padded_batch.to(device)
            mask = mask.to(device)
            labels = labels.to(device)

            logits = model(padded_batch, mask)
            loss = criterion(logits, labels)
            total_loss += loss.item()

            preds = torch.argmax(logits, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    avg_loss = total_loss / len(loader)
    acc = accuracy_score(all_labels, all_preds)
    macro_f1 = f1_score(all_labels, all_preds, average="macro")

    return avg_loss, acc, macro_f1

In [20]:
from sklearn.metrics import confusion_matrix, classification_report

num_epochs = 8
# =========================
# TRAINING + VALIDATION
# =========================
for epoch in range(num_epochs):

    model.train()
    total_loss = 0.0

    for step, (padded_batch, mask, labels) in enumerate(loader, start=1):
        padded_batch = padded_batch.to(device)
        mask = mask.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        logits = model(padded_batch, mask)
        loss = criterion(logits, labels)

        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()

        total_loss += loss.item()

    avg_train_loss = total_loss / len(loader)

    # ---------- VALIDATION ----------
    val_loss, val_acc, val_f1 = evaluate(
        model, val_loader, criterion, device
    )

    print(
        f"Epoch [{epoch+1}/{num_epochs}] | "
        f"Train Loss: {avg_train_loss:.4f} | "
        f"Val Loss: {val_loss:.4f} | "
        f"Val Acc: {val_acc:.4f} | "
        f"Val Macro F1: {val_f1:.4f}"
    )

# =========================
# FINAL TEST (RUN ONCE)
# =========================
# Uses the weights from the last epoch; no best-checkpoint restore is done above.
test_loss, test_acc, test_f1 = evaluate(
    model, test_loader, criterion, device
)

print("\nFINAL TEST RESULTS")
print(f"Test Loss     : {test_loss:.4f}")
print(f"Test Accuracy : {test_acc:.4f}")
print(f"Test Macro F1 : {test_f1:.4f}")

# =========================
# CONFUSION MATRIX (TEST)
# =========================
model.eval()
y_true, y_pred = [], []

with torch.no_grad():
    for padded_batch, mask, labels in test_loader:
        padded_batch = padded_batch.to(device)
        mask = mask.to(device)

        logits = model(padded_batch, mask)
        preds = torch.argmax(logits, dim=1)

        y_pred.extend(preds.cpu().numpy())
        y_true.extend(labels.numpy())

print("\nConfusion Matrix:")
print(confusion_matrix(y_true, y_pred))

print("\nClassification Report:")
print(classification_report(y_true, y_pred))

Epoch [1/8] | Train Loss: 0.7631 | Val Loss: 0.7050 | Val Acc: 0.7523 | Val Macro F1: 0.7019
Epoch [2/8] | Train Loss: 0.6590 | Val Loss: 0.7038 | Val Acc: 0.7421 | Val Macro F1: 0.6976
Epoch [3/8] | Train Loss: 0.6118 | Val Loss: 0.6593 | Val Acc: 0.7603 | Val Macro F1: 0.7201
Epoch [4/8] | Train Loss: 0.5648 | Val Loss: 0.6613 | Val Acc: 0.7546 | Val Macro F1: 0.7167
Epoch [5/8] | Train Loss: 0.5081 | Val Loss: 0.6672 | Val Acc: 0.7543 | Val Macro F1: 0.7215
Epoch [6/8] | Train Loss: 0.4420 | Val Loss: 0.7260 | Val Acc: 0.7634 | Val Macro F1: 0.7255
Epoch [7/8] | Train Loss: 0.3765 | Val Loss: 0.8225 | Val Acc: 0.7677 | Val Macro F1: 0.7266
Epoch [8/8] | Train Loss: 0.3219 | Val Loss: 0.8691 | Val Acc: 0.7628 | Val Macro F1: 0.7238

FINAL TEST RESULTS
Test Loss     : 0.7728
Test Accuracy : 0.7846
Test Macro F1 : 0.7214

Confusion Matrix:
[[2750   66  229  489]
 [  27 2086   84   13]
 [ 246  126  804  102]
 [ 192   23   70  433]]

Classification Report:
              precision    reca

In [21]:
import nltk
from nltk.tokenize import wordpunct_tokenize
import torch.nn.functional as F

In [23]:

def predict_act(sentence, model, wv, device):
    model.eval()

    tokens = wordpunct_tokenize(sentence.lower())

    vectors = [
        torch.tensor(wv[t], dtype=torch.float32)
        for t in tokens if t in wv
    ] or [torch.zeros(wv.vector_size)]

    x = torch.stack(vectors).unsqueeze(0).to(device)  # (1, L, d_model)
    mask = torch.zeros(1, 1, 1, x.size(1), device=device)

    with torch.no_grad():
        logits = model(x, mask)
        probs = F.softmax(logits, dim=1)

    return probs.argmax(dim=1).item(), probs.squeeze(0).cpu().numpy()

In [26]:
id2act = {
    0: "statement",
    1: "question",
    2: "request",
    3: "agreement"
}

In [27]:
sentence = "Can you please help me with this problem?"

pred_act, probs = predict_act(sentence, model, wv, device)

print("Predicted act id:", pred_act)
print("Predicted act:", id2act[pred_act])
print("Class probabilities:", probs)

Predicted act id: 2
Predicted act: request
Class probabilities: [2.7967757e-04 5.7560736e-03 9.9342334e-01 5.4084347e-04]


In [28]:
tests = [
    "Can you explain this?",
    "Thank you so much!",
    "I think this is correct.",
    "Please send the file."
]

for s in tests:
    act, _ = predict_act(s, model, wv, device)
    print(f"{s} -> {id2act[act]}")

Can you explain this? -> request
Thank you so much! -> agreement
I think this is correct. -> statement
Please send the file. -> request
