## Benchmarking nn.Linear vs nd.Linear in a Sequence To Sequence Machine Translation Model

### Overview
**This is my application to Ensemble AI's ML research intern and ML engineering intern positions.**

- We will be implementing a Sequence To Sequence model for English to Chinese machine translation. 
- The attention mechanism will be the general form of Luong attention, chosen because its score function uses an additional linear layer.
- We will run 2 experiments benchmarking nn.Linear vs nd.Linear:

1. **Performance Benchmarking**
We will compare final performance between the Seq2Seq models using nn.Linear and nd.Linear, keeping model size the same. 

2. **Parameter size analysis**
We will evaluate whether a Seq2Seq model using nd.Linear will perform similarly to the same model architecture using nn.Linear, but with a lower parameter count; specifically, the hidden dim will be 128 in the nn.Linear implementation, vs 80 in the nd.Linear implementation, representing a ~22% reduction in parameter count.
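To make the parameter-count comparison concrete, a small helper like the one below can report the number of trainable parameters in any module. This is a minimal sketch: `count_parameters` is a hypothetical helper name, not part of the notebook, and it only relies on the standard `parameters()` iterator of `nn.Module`.

```python
import torch.nn as nn

def count_parameters(module: nn.Module) -> int:
    """Return the number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Example: a 4 -> 3 linear layer has 4 * 3 weights plus 3 biases.
layer = nn.Linear(4, 3)
print(count_parameters(layer))  # 15
```

Calling it on the encoder and decoder of each model and summing the results gives the totals that a reduction percentage can be computed from.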

### Imports

In [1]:
import json
import numpy as np
import pickle
from collections import Counter
import string
import re
import torch
import torch.nn as nn
import time

### Dataset & Preprocessing

Full Dataset: (https://www.kaggle.com/datasets/qianhuan/translation?resource)
- 5.1M English-Chinese sentence pairs
- For the sake of the experiment, we will randomly sample 250,000 pairs 
- We also sample 5,000 pairs from the validation set
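When drawing a subset like this, it may help to sample without replacement and with a fixed seed, so that no pair appears twice and the subset can be regenerated. Note that `np.random.choice` draws with replacement unless `replace=False` is passed. The sketch below uses only the standard library; `sample_indices` and the seed value are illustrative assumptions, not part of the notebook.

```python
import random

def sample_indices(population_size, k, seed=0):
    """Draw k distinct indices from range(population_size), reproducibly."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return rng.sample(range(population_size), k)

indices = sample_indices(population_size=1000, k=10, seed=42)
print(len(indices), len(set(indices)))  # both are 10: no duplicates
```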

In [2]:
train_set_path = "dataset/translation2019zh_train.json"
val_set_path = "dataset/translation2019zh_valid.json"
train_set = []
val_set = [] 

with open(train_set_path) as f:
    for line in f:
        train_set.append(json.loads(line))

with open(val_set_path) as f:
    for line in f:
        val_set.append(json.loads(line))

print(len(train_set))
print(train_set[0])

5161434
{'english': 'For greater sharpness, but with a slight increase in graininess, you can use a 1:1 dilution of this developer.', 'chinese': '为了更好的锐度，但是附带的会多一些颗粒度，可以使用这个显影剂的1：1稀释液。'}


In [16]:
### sample 250,000 sentences and save the data 
sampled_indices = np.random.choice(len(train_set), 250000, replace=False)  # replace=False avoids duplicate pairs

train_subset = [train_set[i] for i in sampled_indices]
print(train_subset[0])
with open('dataset/train_set_mini.pkl', 'wb') as f:
    pickle.dump(train_subset, f)

## sample 5000 pairs for validation set
val_sampled_indices = np.random.choice(len(val_set), 5000, replace=False)  # replace=False avoids duplicate pairs
val_subset = [val_set[i] for i in val_sampled_indices]

with open('dataset/val_set_mini.pkl', 'wb') as f:
    pickle.dump(val_subset, f)


{'english': 'His timing when he volleys is so good.', 'chinese': '他截击空中球的时机掌握得很好。'}


In [4]:
## pull the subset dataset
with open('dataset/train_set_mini.pkl', 'rb') as f:
    train_set_mini = pickle.load(f)

with open('dataset/val_set_mini.pkl', 'rb') as f:
    val_set_mini = pickle.load(f)

print(train_set_mini[0])
print(val_set_mini[0])

{'english': 'The present paper deals with the biology of the parasitic copepod Lernaea poly-morpha and the acquired immunity on the part of the hosts after its infeetion on silver carp and big-head.', 'chinese': '本文对鲢、鳙锚头鳋的生物学、病后获得免疫以及药物治疗进行了探讨。'}
{'english': 'Our company is a Japanese THK linear guide set up in Qingdao Co. , Ltd. the only product-related service center.', 'chinese': '我公司是日本THK直线导轨株式会社在青岛地区设立唯一一家产品相关服务中心。'}


### Creating vocabularies:
- Maintain a vocabulary for English and Chinese. 
- Limit each vocabulary to frequent tokens: English words that appear at least 5 times, and Chinese characters that appear at least 2 times. 
- Sequences will be represented as a list of indices, in the order in which they appear in the sentence, e.g. [0, 98, 4532, 12, 1].
- Each list of indices will be passed to the appropriate embedding layer. 
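The steps above can be sketched on a toy corpus. This is a simplified illustration, assuming pre-tokenized input: `build_vocab`, `encode`, and the three-sentence corpus are hypothetical and stand in for the notebook's own functions and data.

```python
from collections import Counter

SPECIAL_TOKENS = ["<SOS>", "<EOS>", "<UNK>"]

def build_vocab(tokenized_sentences, min_count):
    """Map each sufficiently frequent token to an index, after the special tokens."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for tok, n in counts.items():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Wrap tokens in <SOS>/<EOS> and replace out-of-vocabulary tokens with <UNK>."""
    unk = vocab["<UNK>"]
    return [vocab["<SOS>"]] + [vocab.get(t, unk) for t in tokens] + [vocab["<EOS>"]]

corpus = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
vocab = build_vocab(corpus, min_count=2)   # keeps "the" and "cat" only
encoded = encode(["the", "bird", "cat"], vocab)
print(encoded)  # [0, 3, 2, 4, 1]
```

Here "bird" never appears in the corpus, so it is encoded as the `<UNK>` index 2.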

In [5]:
def remove_punctuation(text):
    '''
    Get rid of all punctuation from string text
    '''
    return text.translate(str.maketrans('', '', string.punctuation))

def get_words_from_sentence(s):
    '''
    Gets words from sentence 
    '''
    return s.split(' ')

def clean_en_pair(pair):
    '''
    Cleans the english from the pair 
    '''
    return get_words_from_sentence(remove_punctuation(pair['english']).lower())

def remove_zh_punctuation(text):
    cleaned = re.sub(r'[，。！？【】（）《》“”‘’、]', '', text)
    cleaned = re.sub(r'\s+', '', cleaned)
    return cleaned

In [6]:
def get_en_vocab(train_set):
    '''
    get_en_vocab:
        Gets an English vocab from train_set as a dict 
    '''
    # get only the english sentences, list of strings 
    en_sentences = [clean_en_pair(pair) for pair in train_set]
    en_sentences_flattened = [word for sentence in en_sentences for word in sentence]
    en_sentences_flattened = [word for word in en_sentences_flattened if word != '']
    
    word_counts = Counter(en_sentences_flattened)
    # with word counts, now we limit the vocabulary to words that happen at least 5 times
    en_vocab = {}
    # {word: index}
    idx = 0
    for word in ["<SOS>", "<EOS>", "<UNK>"]:
        en_vocab[word] = idx 
        idx += 1
    for word, occurrences in word_counts.items():
        if occurrences >= 5:
            en_vocab[word] = idx 
            idx += 1
    return en_vocab

def get_zh_vocab(train_set):
    '''
    get_zh_vocab:
        Gets a zh vocab from train_set as a dict 
    '''
    zh_sentences = [list(remove_zh_punctuation(pair['chinese'])) for pair in train_set]
    zh_sentences_flattened = [word for sentence in zh_sentences for word in sentence]

    word_counts = Counter(zh_sentences_flattened)
    zh_vocab = {}

    idx = 0 
    for word in ["<SOS>", "<EOS>", "<UNK>"]:
        zh_vocab[word] = idx 
        idx += 1 
    for word, occurrences in word_counts.items():
        if occurrences >= 2: 
            zh_vocab[word] = idx 
            idx += 1 
    return zh_vocab

en_vocab = get_en_vocab(train_set_mini)
print(len(en_vocab))

zh_vocab = get_zh_vocab(train_set_mini)
print(len(zh_vocab))

14
165


In [7]:
with open('vocab/en_vocab.pkl', 'wb') as f:
    pickle.dump(en_vocab, f)

with open('vocab/zh_vocab.pkl', 'wb') as f:
    pickle.dump(zh_vocab, f)

In [8]:
with open('vocab/en_vocab.pkl', 'rb') as f:
    en_vocab = pickle.load(f)

with open('vocab/zh_vocab.pkl', 'rb') as f:
    zh_vocab = pickle.load(f)

### Model Architecture
- Based mostly on (https://arxiv.org/pdf/1409.3215), though not completely faithful to the paper
- The model consists of an Encoder and a Decoder: the source sequence is passed to the encoder, and the encoder's hidden states are passed to the decoder. 

**Encoder**:
- consists of an LSTM and an embedding layer. 

**Decoder**:
- consists of an LSTM and an embedding layer, and a **linear layer** to output logits. 
- this linear layer is where we can place nd.Linear in place of nn.Linear
- forward() has two settings, inference and teacher-forcing. If a "correct label" sentence is passed to the forward() function, it will do teacher forcing. 
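In the teacher-forcing setting, the decoder input and the training target are the same sequence shifted by one position: the input drops the last token and the target drops the first. A minimal illustration, using a made-up example sentence rather than one from the dataset:

```python
# Illustrative target sentence, tokenized per character with boundary tokens.
tokens = ["<SOS>", "我", "爱", "面", "包", "<EOS>"]

decoder_inputs = tokens[:-1]  # what the decoder is fed at each step
targets = tokens[1:]          # what it should predict at each step

for inp, tgt in zip(decoder_inputs, targets):
    print(f"{inp} -> {tgt}")
```

At every step the decoder receives the correct previous token rather than its own prediction, and the final step is trained to emit `<EOS>`.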

**Attention Layer**:
- A separate general form Luong Attention layer. It's another **linear layer** so another place where nd.Linear will be dropped in. 

**In total that makes 2 places where nn.Linear will be replaced by nd.Linear**
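
For reference, the general form of Luong attention used here can be written as follows. The symbols are notation chosen for this summary: $h_t$ is the decoder hidden state at step $t$, $\bar{h}_s$ is the encoder output at source position $s$, and $W_a$ is the weight matrix of the attention linear layer (no bias).

$$
\mathrm{score}(h_t, \bar{h}_s) = h_t^\top W_a \bar{h}_s, \qquad
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}, \qquad
c_t = \sum_s a_t(s)\, \bar{h}_s
$$

In this notebook the context vector $c_t$ is concatenated with $h_t$ and passed directly to the output linear layer, which is why that layer takes an input of size `hidden_dim * 2`.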

### Model Classes

In [9]:
class Encoder(nn.Module):
    def __init__(self, embedding_dim, vocab_size, hidden_dim):
        super(Encoder, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim) # initialize an LSTM, with embedding_dim, and hidden_dim hyperparameters 
    
    def forward(self, sentence):
        embeds = self.embeddings(sentence)  # remember that sentence has to be in [word_index0, word_index1, word_index2] form
        out, (h_n, c_n) = self.LSTM(embeds.view(len(sentence), 1, -1)) # reshape to (seq_len, batch=1, embedding_dim); the LSTM consumes the whole sequence in one call
        return out, (h_n, c_n)

In [10]:
### LUONG ATTENTION LAYER
class GeneralAttention(nn.Module):
    def __init__(self, hidden_dim, linear_cls=nn.Linear):
        super().__init__()
        self.linear_layer = linear_cls(hidden_dim, hidden_dim, bias=False)

    def forward(self, encoder_outputs):
        return self.linear_layer(encoder_outputs)

In [11]:
class LuongAttnDecoder(nn.Module):
    def __init__(self, embedding_dim, vocab_size, hidden_dim, device, max_response_length, linear_cls=nn.Linear):
        super(LuongAttnDecoder, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim)
        self.linear = linear_cls(hidden_dim * 2, vocab_size) # 2x hidden dim now for result after concat attention input
        self.device = device
        self.general_attn_layer = GeneralAttention(hidden_dim, linear_cls)
        self.max_response_length = max_response_length
    
    def word_to_tensor(self, word):
        '''
        takes a single word and gets the corresponding tensor
        '''
        word_lst = get_words_from_sentence(remove_zh_punctuation(word))
        indices = [zh_vocab[word] for word in word_lst]
        # get tensor 
        return torch.tensor(indices, dtype=torch.long).to(self.device)

    def forward(self, hidden, encoder_out, sentence=None):
        '''
        Runs the forward pass. If sentence is provided, we do teacher-forcing; otherwise we assume inference.
            Params:
                hidden: the (h_n, c_n) state passed from the encoder
                encoder_out: the encoder outputs for every source position, used by the attention layer
                sentence: a sentence to be used for teacher-forcing, as a tensor.
                    Make sure the teacher-forcing sentence is sliced to not include the last token [:-1]
        '''
        # teacher-forcing training: we step through the target one token at a time (i.e. "undo" vectorization)
        all_outputs = []
        if sentence is not None:
            embeds_tensor = self.embeddings(sentence)
            for word_tensor in embeds_tensor:
                out, hidden = self.LSTM(word_tensor.view(1, 1, -1), hidden)
                # pass encoder out to attention layer 
                attn_scores = self.general_attn_layer(encoder_out) @ hidden[0].squeeze()
                # now with attn scores, we want to softmax the scores 
                softmaxed_scores = torch.nn.functional.softmax(attn_scores, dim=0)
                # multiply by encoder_out
                # now that they are softmaxed, we want to multiply by all encoder states to give a weighted tensor, we can broadcast it as well 
                weighted_encoder_hidden_states = softmaxed_scores * encoder_out.squeeze()
                # sum the tensor 
                context = torch.sum(weighted_encoder_hidden_states, dim=0).view(1, 1, -1)
                # concat the context vector with the hidden state 
                combined_tensor = torch.concat([context, hidden[0]], dim=-1)
                logits = self.linear(combined_tensor)
                all_outputs.append(logits)
        else:
            start_token = self.word_to_tensor('<SOS>')
            # run through embedding layer
            prev_char = start_token
            for i in range(self.max_response_length):
                if prev_char.item() == 1:
                    break
                embeds = self.embeddings(prev_char).to(self.device)
                out, hidden = self.LSTM(embeds.view(1, 1, -1), hidden)
                attn_scores = self.general_attn_layer(encoder_out) @ hidden[0].squeeze()
                softmaxed_scores = torch.nn.functional.softmax(attn_scores, dim=0)
                weighted_encoder_hidden_states = softmaxed_scores * encoder_out.squeeze()
                context = torch.sum(weighted_encoder_hidden_states, dim=0).view(1, 1, -1)
                combined_tensor = torch.concat([context, hidden[0]], dim=-1)
                logits = self.linear(combined_tensor)
                all_outputs.append(logits)
                pred_idx = torch.argmax(logits, dim=2).item()
                prev_char = torch.tensor(pred_idx, dtype=torch.long, device=self.device)
        return torch.cat(all_outputs, dim=0)

### Training / Inference utility functions

In [12]:
## functions to take a sentence and turn it into a tensor, adding <sos> and <eos>
def sequence_to_tensor_en(sequence):
    '''
    takes sequence and converts to tensor 
    '''
    # add <SOS> and <EOS>
    words = get_words_from_sentence("<SOS> " + remove_punctuation(sequence).lower() + " <EOS>")
    # drop empty tokens produced by consecutive spaces, matching get_en_vocab
    words = [word for word in words if word != '']
    
    # convert to indices, falling back to the <UNK> token
    word_indices = [ en_vocab[word] if word in en_vocab else en_vocab["<UNK>"] for word in words ]
    return torch.tensor(word_indices, dtype=torch.long)
    

def sequence_to_tensor_zh(sequence):
    '''
    takes sequence and converts to chinese tensor 
    '''
    words = (["<SOS>"] + list(remove_zh_punctuation(sequence)))
    words.append("<EOS>")
    
    word_indices = [ zh_vocab[word] if word in zh_vocab else zh_vocab["<UNK>"] for word in words ]
    return torch.tensor(word_indices, dtype=torch.long)

def zh_tensor_outputs_to_sentence(output_tensor):
    '''
    converts a zh_tensor to a string
    '''
    s = ''
    zh_vocab_lst = list(zh_vocab.keys())
    for word_tensor in output_tensor:
        pred_idx = torch.argmax(word_tensor, dim=-1).item()
        s += zh_vocab_lst[pred_idx]
    return s 

In [None]:
def train(num_epochs, training_data, encoder, decoder, device, save_loss_file, lr=0.001):
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr
    )
    count = 0
    total_loss = 0
    # see random prediction
    predict_en_sequence = "I love bread" 
    predict_en_tensor = sequence_to_tensor_en(predict_en_sequence).to(device)
    out, predict_hidden = encoder(predict_en_tensor)
    out = out.to(device)
    print(zh_tensor_outputs_to_sentence(decoder.forward(predict_hidden, encoder_out=out)))
    start_time = time.time()
    for i in range(num_epochs):
        for pair in training_data:
            english = pair['english']
            zh = pair['chinese']
            # convert to index tensors and move them to the device
            en_tensor = sequence_to_tensor_en(english).to(device)
            zh_tensor = sequence_to_tensor_zh(zh).to(device)

            out, hidden = encoder(en_tensor)
            # teacher forcing: the decoder sees zh_tensor[:-1] and predicts zh_tensor[1:]
            target = zh_tensor[1:]
            predicted = decoder(hidden, encoder_out=out, sentence=zh_tensor[:-1])
            # predicted has shape (target_len, 1, vocab_size); drop only the batch dimension
            loss = nn.functional.cross_entropy(predicted.squeeze(1), target)
            total_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            count += 1
            if count % 10000 == 0:
                print(f"Number of trains {count}")
                # print the average loss over the last 10,000 steps
                print(f"Loss {total_loss / 10000}")
                # append the average loss and the step count to the log file
                with open(save_loss_file, 'a') as f:
                    f.write(f'{total_loss / 10000}, {count} \n')
                total_loss = 0
        
        
        out, predict_hidden = encoder(predict_en_tensor)
        out = out.to(device)
        print(zh_tensor_outputs_to_sentence(decoder.forward(predict_hidden, out)))
    print(f"Total training time: {time.time() - start_time}")

### Setting hyperparameters

In [14]:
MAX_RESPONSE_LENGTH=20
EMBEDDING_DIM=32
HIDDEN_DIM=128
device = torch.device("cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")

### EXPERIMENT 1: Performance benchmarking

- Train 2 Sequence to Sequence models: a baseline model using nn.Linear, and a second model using nd.Linear. Both will have the same size and be trained for the same amount of time. 
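
To compare the two runs afterwards, the loss files written during training (one `<average loss>, <step count>` pair per line) can be read back into lists. The parser below is a sketch under that assumption; `parse_loss_log` is a hypothetical helper, and the sample lines are made-up values used only to show the format.

```python
def parse_loss_log(lines):
    """Parse lines of the form '<avg_loss>, <step> ' into (step, loss) tuples."""
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        loss_str, step_str = line.split(",")
        points.append((int(step_str), float(loss_str)))
    return points

# Made-up sample lines in the same format the training loop writes.
sample = ["4.4231, 10000 \n", "4.2490, 20000 \n"]
print(parse_loss_log(sample))  # [(10000, 4.4231), (20000, 4.249)]
```

With a real file, the same function can be called as `parse_loss_log(open(path))`, once for each experiment's log.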

### Baseline model (nn.Linear)

In [15]:
encoder = Encoder(embedding_dim=EMBEDDING_DIM, vocab_size=len(en_vocab), hidden_dim=HIDDEN_DIM)
decoder = LuongAttnDecoder(embedding_dim=EMBEDDING_DIM, vocab_size=len(zh_vocab), hidden_dim=HIDDEN_DIM, device=device, max_response_length=MAX_RESPONSE_LENGTH, linear_cls=nn.Linear)
print(device)
encoder.to(device)
decoder.to(device)
save_loss_file = "intermediate_steps/experiment1-baseline.txt"
train(3, train_set_mini, encoder, decoder, device, save_loss_file)
torch.save(encoder.state_dict(), './trained_models/experiment1_baseline_encoder.pth')
torch.save(decoder.state_dict(), './trained_models/experiment1_baseline_decoder.pth')

mps
作所作国国国国资资资个所作所作国国国资资
Number of trains 10
Loss 4.423114252090454
Number of trains 20
Loss 4.24901123046875
<UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK>
Number of trains 30
Loss 4.1141905069351195
Number of trains 40
Loss 3.7753936529159544
Number of trains 50
Loss 3.80055718421936
<UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK>
Number of trains 60
Loss 3.8754350900650025
Number of trains 70
Loss 3.710989832878113
<UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK><UNK>
Total training time: 8.561107873916626
