<a href="https://colab.research.google.com/github/csch7/CSCI-4170/blob/main/Homework-05/NLP_and_Attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task 3: Natural Language Processing and Attention

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn

Below is my implementation of scaled dot-product attention, assuming queries, keys, and values have an extra dimension with "batch size" entries. Because of this, the transpose operations are replaced with permutes. (It's also worth noting that this could easily be translated from PyTorch functions to numpy functions).

In [2]:
def softmax(scores):
  return torch.exp(scores) / torch.sum(torch.exp(scores), dim = 0)

def scaled_dot_product_attention(queries, keys, values):
  queries = torch.unsqueeze(queries, 2).repeat((1,1,keys.shape[-1]))
  scores = (queries @ keys.permute(1,2,0)) / np.sqrt(keys.shape[-1])
  s = softmax(scores)
  return torch.squeeze(s @ values.permute(1,0,2), 1)



Below, I have (attempted to) integrate scaled dot-product attention into an encoder-decoder seq2seq model, using Bahdanau's method.

In [3]:
class Scaled_Dot_Product_Attention(nn.Module):
  def __init__(self):
    super().__init__()
    self.sm = nn.Softmax(dim=0)

  def forward(self, queries, keys, values):
    batch_size = keys.shape[0]
    queries = queries.repeat((batch_size,1,1))
    scores = (queries @ keys.permute(0,2,1)) / np.sqrt(keys.shape[-1])
    s = softmax(scores)
    return (s @ values)

class Encoder(nn.Module):
  def __init__(self, input_dim, embed_dim, hidden_dim):
    super().__init__()
    self.embed = nn.Embedding(input_dim, embed_dim)
    self.lstm = nn.LSTM(embed_dim, hidden_dim)
    self.fc = nn.Linear(2*hidden_dim, hidden_dim)
    self.tanh = nn.Tanh()

  def forward(self, x):
    em = self.embed(x)
    lstm_out, (hidden,_) = self.lstm(em)
    return lstm_out, hidden

class Decoder(nn.Module):
  def __init__(self, embed_dim, hidden_dim, output_dim, seq_len):
    super().__init__()
    self.seq_len = seq_len
    self.embed = nn.Embedding(output_dim, embed_dim)
    self.gru = nn.GRU(hidden_dim + embed_dim, hidden_dim)
    self.attn = Scaled_Dot_Product_Attention()
    self.fc = nn.Linear(hidden_dim, output_dim)
    self.sm = nn.Softmax(dim=2)

  def forward(self, hidden, encoder_out, targets = None):
    outputs = []
    batch_size = encoder_out.shape[0]
    em = self.embed(torch.full((batch_size, 1), 4667, dtype=torch.long))

    for i in range(self.seq_len):
      attn = self.attn(hidden, encoder_out, encoder_out)
      gru_out, hidden = self.gru(torch.cat((em, attn), dim=2), hidden)
      gru_out = self.fc(gru_out)
      if targets is not None:
        new_targets = targets[:,i].unsqueeze(-1)
      else:
        new_targets = torch.argmax(gru_out.detach(), dim=2)
      em = self.embed(new_targets)
      outputs.append(gru_out)
    outputs = torch.cat(outputs, dim=1)

    return self.sm(outputs), hidden


In [4]:
!pip install datasets

Collecting datasets
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.4.1-py3-none-any.whl (487 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m487.4/487.4 kB[0m [31m9.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m6.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading multiprocess-0.70.16-py311-none-any.whl (143 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m6.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading xx

For this task, I will use a small subset of the Multi30k german to english dataset. Below, I load this dataset, and then clean it by removing any special characters, padding the sequences, adding start / end tokens, and converting sentences into ids.

In [5]:
from datasets import load_dataset

ds = load_dataset('bentrevett/multi30k')

train_dat = ds['train'][:len(ds['train'])//20]
valid_dat = ds['validation'][:len(ds['validation'])//20]
test_dat = ds['test'][:len(ds['test'])//20]
train_lab = train_dat['en']
train_dat = train_dat['de']
valid_lab = valid_dat['en']
valid_dat = valid_dat['de']
test_lab = test_dat['en']
test_dat = test_dat['de']

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

train.jsonl:   0%|          | 0.00/4.60M [00:00<?, ?B/s]

val.jsonl:   0%|          | 0.00/164k [00:00<?, ?B/s]

test.jsonl:   0%|          | 0.00/156k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/29000 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1014 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [6]:
import re

def clean_text(text):
    text = str(text).lower() # Ensure no duplicate word embeddings due to capital letters
    test =  re.sub(r"[^a-z0-9' ]", "", text)         # Remove certain special characters (need to be careful not to remove umlauds or eszetts from German)
    text = re.sub(r"\s+", " ", text).strip()      # Remove extra spaces
    return text

def pad_sentences(dat, max_len):
  for s in range(len(dat)):
    if len(dat[s]) > max_len:
      dat[s] = dat[s][:max_len]
    else:
      dat[s] = dat[s] + ['<PAD>']*(max_len-len(dat[s]))
  return dat

def process_sentences(dat, vocab, max_len):
  dat = [s for s in dat]
  dat = [['<SOS>']+[clean_text(si) for si in s.split()]+['<EOS>'] for s in dat]
  dat = pad_sentences(dat, max_len)
  dat = [[vocab[word] for word in s] for s in dat]
  return dat


max_len = 50

sentences_en = [s for ds in [train_lab, valid_lab, test_lab] for s in ds]
sentences_en = [['<SOS>']+[clean_text(si) for si in s.split()]+['<EOS>'] for s in sentences_en]
vocab = set([w for s in sentences_en for w in s])

sentences_de = [s for ds in [train_dat, valid_dat, test_dat] for s in ds]
sentences_de = [['<SOS>']+[clean_text(si) for si in s.split()]+['<EOS>'] for s in sentences_de]
vocab = set(list(vocab)+[w for s in sentences_de for w in s])
vocab = {word: idx+1 for idx, word in enumerate(vocab)}
vocab['<PAD>'] = 0
print(vocab['<SOS>'])
token_to_value = {vocab[k]: k for k in vocab}

train_dat = process_sentences(train_dat, vocab, max_len)
train_lab = process_sentences(train_lab, vocab, max_len)
valid_dat = process_sentences(valid_dat, vocab, max_len)
valid_lab = process_sentences(valid_lab, vocab, max_len)
test_dat = process_sentences(test_dat, vocab, max_len)
test_lab = process_sentences(test_lab, vocab, max_len)

2552


Below, I have trained my attempt at a seq2seq model. Every other number is the BLEU score on the validation set, which never gets above zero. I'm not sure what is wrong in my implementation, but evidently something is.

In [7]:
import torch.optim as optim
from nltk.translate.bleu_score import sentence_bleu

def one_hot_encode(labels, max_len, vocab_size):
  res = torch.zeros((len(labels), max_len, vocab_size))
  for i in range(len(labels)):
    for j in range(max_len):
      res[i,j,labels[i,j]] = 1
  return res



train_dat = torch.LongTensor(train_dat)
train_lab = torch.LongTensor(train_lab)
train_ohe = one_hot_encode(train_lab, max_len, len(vocab))
valid_dat = torch.LongTensor(valid_dat)
valid_lab = torch.LongTensor(valid_lab)
valid_ohe = one_hot_encode(valid_lab, max_len, len(vocab))
test_dat = torch.LongTensor(test_dat)
test_lab = torch.LongTensor(test_lab)
test_ohe = one_hot_encode(test_lab, max_len, len(vocab))

epochs = 20
batch_size = 32
enc = Encoder(len(vocab), 200, 64)
dec = Decoder(200, 64, len(vocab), max_len)
enc_opt = optim.Adam(enc.parameters(), lr=0.001)
dec_opt = optim.Adam(dec.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

for e in range(epochs):
  enc.train()
  dec.train()
  losses = []
  for b in range((train_dat.shape[0]//batch_size)):
    enc_opt.zero_grad()
    dec_opt.zero_grad()

    enc_out, hidden = enc(train_dat[b*batch_size:(b+1)*batch_size])
    dec_out, _ = dec(hidden[:,-1,:].unsqueeze(1), enc_out, train_lab[b*batch_size:(b+1)*batch_size])
    loss = loss_fn(dec_out, train_ohe[b*batch_size:(b+1)*batch_size])

    losses.append(loss.item())
    loss.backward()
    enc_opt.step()
    dec_opt.step()


  print("Epoch {} | Loss: {:.7f}".format(e, np.mean(losses)))
  dec.eval()
  enc.eval()
  with torch.no_grad():
    enc_out, hidden = enc(valid_dat)
    pred, _ = dec(hidden[:,-1,:].unsqueeze(1), enc_out)
    print("BLEU Score: {}".format(sentence_bleu([token_to_value[int(w)] for i in range(len(valid_lab)) for w in valid_lab[i]], [token_to_value[int(w)] for i in range(len(valid_lab)) for w in torch.argmax(pred[i], dim=0)])))


Epoch 0 | Loss: 0.0306834
BLEU Score: 0
Epoch 1 | Loss: 0.0298110
BLEU Score: 0
Epoch 2 | Loss: 0.0293769
BLEU Score: 0
Epoch 3 | Loss: 0.0293133
BLEU Score: 0
Epoch 4 | Loss: 0.0292857
BLEU Score: 0
Epoch 5 | Loss: 0.0291689
BLEU Score: 0
Epoch 6 | Loss: 0.0291099
BLEU Score: 0
Epoch 7 | Loss: 0.0290890
BLEU Score: 0
Epoch 8 | Loss: 0.0290789
BLEU Score: 0
Epoch 9 | Loss: 0.0290705
BLEU Score: 0
Epoch 10 | Loss: 0.0290645
BLEU Score: 0
Epoch 11 | Loss: 0.0290599
BLEU Score: 0
Epoch 12 | Loss: 0.0290561
BLEU Score: 0
Epoch 13 | Loss: 0.0290527
BLEU Score: 0
Epoch 14 | Loss: 0.0290501
BLEU Score: 0
Epoch 15 | Loss: 0.0290476
BLEU Score: 0
Epoch 16 | Loss: 0.0290451
BLEU Score: 0
Epoch 17 | Loss: 0.0290433
BLEU Score: 0
Epoch 18 | Loss: 0.0290417
BLEU Score: 0
Epoch 19 | Loss: 0.0290406
BLEU Score: 0


Below, I have (tried) to implement a transformer from scratch.

In [8]:
from math import inf

class PositionalEncoding(nn.Module):
  def __init__(self, seq_len, embedding_dim):
    super().__init__()
    self.seq_len = seq_len
    self.embed_dim = embedding_dim

  def forward(self, x):
    embedding = torch.zeros(self.seq_len, x.shape[1], self.embed_dim)
    positions = torch.arange(self.seq_len)
    for p in positions:
      embedding[p, :, ::2] = torch.sin(p/(10000**(2*torch.arange(self.embed_dim)[:self.embed_dim//2]/self.embed_dim)))
      embedding[p, :, 1::2] = torch.cos(p/(10000**(2*torch.arange(self.embed_dim)[:self.embed_dim//2]/self.embed_dim)))
    return embedding


class ScaledDotProductAttention(nn.Module):
  def __init__(self, masking = False):
    super().__init__()
    self.sm = nn.Softmax(dim=0)
    self.mask = masking

  def forward(self, q, k, v):
    scores = q @ k.permute(0,2,1) / np.sqrt(k.shape[-1])
    if self.mask:
      mask = torch.full(scores.shape, -1*(10**10), dtype = int)
      mask = torch.triu(mask, 1)
      scores = scores + mask
    s = self.sm(scores)
    return s @ v

class MultiHeadAttention(nn.Module):
  def __init__(self, num_heads, qk, qv, dim_model, masking = False):
    super().__init__()
    self.i = 0
    self.nh = num_heads
    self.d_model = dim_model
    self.mask = masking
    self.Wq = nn.Parameter(torch.randn((num_heads, dim_model, qk)))
    self.Wk = nn.Parameter(torch.randn((num_heads, dim_model, qk)))
    self.Wv = nn.Parameter(torch.randn((num_heads, dim_model, qv)))
    self.Wo = nn.Parameter(torch.randn((num_heads*qv, dim_model)))
    self.attn = ScaledDotProductAttention(masking)

  def forward(self, Q, K, V):
    # print(self.i, V)
    self.i += 1
    output = self.attn(Q @ self.Wq[0], K @ self.Wk[0], V @ self.Wv[0])
    for i in range(1, self.nh):
      output = torch.cat((output, self.attn(Q @ self.Wq[i], K @ self.Wk[i], V @ self.Wv[i])), dim=2)
    return output @ self.Wo


class FFN(nn.Module):
  def __init__(self, embedding_dim = 64, hidden_dim = 128):
    super().__init__()
    self.w1 = nn.Parameter(torch.randn((embedding_dim, hidden_dim)))
    self.b1 = nn.Parameter(torch.randn(hidden_dim))
    self.w2 = nn.Parameter(torch.randn((hidden_dim, embedding_dim)))
    self.b2 = nn.Parameter(torch.randn(embedding_dim))
    self.relu = nn.ReLU()

  def forward(self, x):
    return self.relu(x @ self.w1 + self.b1) @ self.w2 + self.b2


class Encoder(nn.Module):
  def __init__(self, input_len, vocab_size, d_model = 64, hidden_dim = 128, num_heads = 8, num_layers = 2):
    super().__init__()
    self.d_model = d_model
    self.hidden_dim = hidden_dim
    self.embed = nn.Embedding(vocab_size, d_model)
    self.position = PositionalEncoding(input_len, d_model)
    self.ffn = FFN(d_model, hidden_dim)
    self.layernorm = nn.LayerNorm((input_len, d_model))
    self.attn = MultiHeadAttention(num_heads, int(d_model / num_heads), int(d_model / num_heads), d_model)
    self.L = num_layers

  def forward(self, inputs):
    em = self.embed(inputs)
    pos_en = self.position(inputs)
    out = (em + pos_en).permute(1,0,2)

    for l in range(self.L):
      self_attn = self.attn(out, out, out)
      attn_norm = self.layernorm(self_attn + out)
      ffn_out = self.ffn(attn_norm)
      out = self.layernorm(ffn_out + attn_norm)
    return out


class Decoder(nn.Module):
  def __init__(self, output_len, vocab_size, d_model = 64, hidden_dim = 128, num_heads = 8, num_layers = 2):
    super().__init__()
    self.d_model = d_model
    self.hidden_dim = hidden_dim
    self.embed = nn.Embedding(vocab_size, d_model)
    self.position = PositionalEncoding(output_len, d_model)
    self.ffn = FFN(d_model, hidden_dim)
    self.layernorm = nn.LayerNorm((output_len, d_model))
    self.attn = MultiHeadAttention(num_heads, int(d_model / num_heads), int(d_model / num_heads), d_model)
    self.masked_attn = MultiHeadAttention(num_heads, int(d_model / num_heads), int(d_model / num_heads), d_model, masking=True)
    self.L = num_layers

  def forward(self, outputs, enc_out):
    em = self.embed(outputs)
    pos_en = self.position(outputs)
    out = (em + pos_en).permute(1,0,2)
    for l in range(self.L):
      self_attn = self.masked_attn(out, out, out)
      attn_norm = self.layernorm(self_attn + out)
      enc_attn = self.attn(enc_out, enc_out, attn_norm)
      attn_norm = self.layernorm(enc_attn + attn_norm)
      ffn_out = self.ffn(attn_norm)
      out = self.layernorm(ffn_out + attn_norm)
    return out


class Transformer(nn.Module):
  def __init__(self, input_len, output_len, in_vocab_size, out_vocab_size, d_model = 64, hidden_dim = 128, num_heads = 8, num_layers = 2):
    super().__init__()
    self.encoder = Encoder(input_len, in_vocab_size)
    self.decoder = Decoder(output_len, out_vocab_size)
    self.fc = nn.Linear(d_model, out_vocab_size)
    self.sm = nn.Softmax(dim=0)

  def forward(self, inputs, outputs):
    enc_out = self.encoder(inputs)
    dec_out = self.decoder(outputs, enc_out)
    return self.sm(self.fc(dec_out))

In [9]:
import torch.optim as optim

def one_hot_encode(labels, max_len, vocab_size):
  res = torch.zeros((len(labels), max_len, vocab_size))
  for i in range(len(labels)):
    for j in range(max_len):
      res[i,j,labels[i,j]] = 1
  return res


train_dat = torch.LongTensor(train_dat)
train_lab = torch.LongTensor(train_lab)
train_ohe = one_hot_encode(train_lab, max_len, len(vocab))
valid_dat = torch.LongTensor(valid_dat)
valid_lab = torch.LongTensor(valid_lab)
valid_ohe = one_hot_encode(valid_lab, max_len, len(vocab))
test_dat = torch.LongTensor(test_dat)
test_lab = torch.LongTensor(test_lab)
test_ohe = one_hot_encode(test_lab, max_len, len(vocab))

epochs = 20
batch_size = 32
model = Transformer(max_len-1, max_len-1, len(vocab),len(vocab))
optimizer = optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for e in range(epochs):
  model.train()
  losses = []
  for b in range((train_dat.shape[0]//batch_size)):
    optimizer.zero_grad()

    pred = model(train_dat[b*batch_size:(b+1)*batch_size,1:].T, train_lab[b*batch_size:(b+1)*batch_size,:-1].T)
    # print(pred.shape)
    # print(sentence_bleu([token_to_value_en[int(w)] for i in range(len(train_lab)) for w in train_lab[i]], [token_to_value_en[int(w)] for i in range(len(train_lab)) for w in torch.argmax(pred[i], dim=1)]))
    loss = loss_fn(pred, train_ohe[b*batch_size:(b+1)*batch_size,1:,:])
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
  print("Epoch {} | Loss: {:.7f}".format(e, np.mean(losses)))
  model.eval()
  with torch.no_grad():
    pred = model(valid_dat[:,1:].T, valid_lab[:,:-1].T)
    print(sentence_bleu([token_to_value[int(w)] for i in range(len(valid_lab)) for w in valid_lab[i]], [token_to_value[int(w)] for i in range(len(valid_lab)) for w in torch.argmax(pred[i], dim=1)]))

Epoch 0 | Loss: 0.0299669


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


3.872213578557129e-232
Epoch 1 | Loss: 0.0299714
3.662113500188318e-232
Epoch 2 | Loss: 0.0299796
3.872213578557129e-232
Epoch 3 | Loss: 0.0299825
3.407980617858593e-232
Epoch 4 | Loss: 0.0299836
4.052794798548983e-232
Epoch 5 | Loss: 0.0299812
3.407980617858593e-232
Epoch 6 | Loss: 0.0299957
3.662113500188318e-232
Epoch 7 | Loss: 0.0299929
2.5895052894580338e-232
Epoch 8 | Loss: 0.0299981
3.407980617858593e-232
Epoch 9 | Loss: 0.0299955
3.662113500188318e-232
Epoch 10 | Loss: 0.0300036
2.5895052894580338e-232
Epoch 11 | Loss: 0.0299992
3.662113500188318e-232
Epoch 12 | Loss: 0.0300008
3.662113500188318e-232
Epoch 13 | Loss: 0.0300124
0
Epoch 14 | Loss: 0.0300084
3.662113500188318e-232
Epoch 15 | Loss: 0.0300068
3.407980617858593e-232
Epoch 16 | Loss: 0.0299961
3.872213578557129e-232
Epoch 17 | Loss: 0.0299958
3.662113500188318e-232
Epoch 18 | Loss: 0.0300106
3.407980617858593e-232
Epoch 19 | Loss: 0.0299989
3.407980617858593e-232


From these results, my transformer is obvioulsly not working as intended. I'm not entirely sure what I did wrong, and I debugged for awhile, to no avail. Because of this, I cannot comment on any changes in performance between the two models. However, the transformer does run significantly faster than the Seq2Seq model, as expected -- this is one of the most significant advantages of using transformers vs seq2seq models.