### Seq2Seq model for English to Chinese translation (baseline model)

In [3]:
import json
import numpy as np
import pickle
from collections import Counter
import string
import re
import torch
import torch.nn as nn

### Choosing, loading, and cleaning dataset

- going to use datasets from: https://www.kaggle.com/datasets/qianhuan/translation?resource=download

In [2]:
train_set_path = "dataset/translation2019zh_train.json"

train_set = []
with open(train_set_path) as f:
    for line in f:
        train_set.append(json.loads(line))

print(len(train_set))
print(train_set[0])

5161434
{'english': 'For greater sharpness, but with a slight increase in graininess, you can use a 1:1 dilution of this developer.', 'chinese': '为了更好的锐度，但是附带的会多一些颗粒度，可以使用这个显影剂的1：1稀释液。'}


We want to shrink this dataset for testing purposes.
- right now it has about 5.16M sentence pairs
- let's sample 200,000 of them...

In [52]:
# get 200,000 random indices (np.random.choice samples with replacement by default)
sampled_indices = np.random.choice(len(train_set), 200000)

train_subset = [train_set[i] for i in sampled_indices]
print(train_subset[0])
with open('dataset/train_set_mini.pkl', 'wb') as f:
    pickle.dump(train_subset, f)

{'english': 'In fact, it is less clear-cut than that.', 'chinese': '但事实上并没有这么清晰了当。'}
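Because `np.random.choice` draws with replacement by default, the subset above may contain repeated pairs. A minimal sketch of duplicate-free sampling, assuming a NumPy `Generator`; the helper name `sample_indices`, the seed, and the toy sizes are illustrative and not part of the original run:

```python
import numpy as np

def sample_indices(n_total, n_sample, seed=0):
    """Return n_sample distinct indices in [0, n_total)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no index is drawn twice
    return rng.choice(n_total, size=n_sample, replace=False)

indices = sample_indices(n_total=1000, n_sample=100)
print(len(indices), len(set(indices.tolist())))
```

Fixing the seed also makes the subset reproducible across notebook restarts.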


In [11]:
with open('dataset/train_set_mini.pkl', 'rb') as f:
    train_set_mini = pickle.load(f)

print(train_set_mini[0])

{'english': 'In fact, it is less clear-cut than that.', 'chinese': '但事实上并没有这么清晰了当。'}


Train set made. 
Now to work on the actual architecture

## Preprocessing steps:
- we want to maintain a vocabulary for English and one for Chinese.
    - this is simple: count tokens with a Counter() and keep only those that appear at least a minimum number of times (5 for English, 2 for Chinese in the code below)

- In terms of encoding, we want to use sequence input, so a sentence becomes a list like [3, 100, 8, 9], where each number is the index of the word in the vocabulary
    - we want to do this because an LSTM consumes tokens in order, so it can model relationships between words across the sentence
    - then use nn.Embedding?
        - nn.Embedding creates a trainable matrix representing the vocabulary. Instead of a typical BoW setup where each word is just an index, each word is now a vector, which lets training put some meaning into the word
        - the embedding is a matrix of size (vocab length, dim). It has vocab length rows because each row corresponds to a word in the vocab (row index = index of word in vocab :) )
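A small sketch of the lookup described above, with toy sizes chosen purely for illustration (the real vocabulary and embedding dimension are set later in the notebook):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 10, 4          # toy sizes, not the notebook's values
embedding = nn.Embedding(vocab_size, embedding_dim)

sentence = torch.tensor([3, 7, 8, 9], dtype=torch.long)  # word indices
embedded = embedding(sentence)

print(embedding.weight.shape)   # (vocab_size, embedding_dim)
print(embedded.shape)           # one row per token: (4, embedding_dim)

# Row i of the output is simply row sentence[i] of the weight matrix.
same_rows = torch.equal(embedded[0], embedding.weight[3])
print(same_rows)
```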
        

# now to work on the vocabulary

In [10]:
## helper functions 
def remove_punctuation(text):
    '''
    Get rid of all punctuation from string text
    '''
    return text.translate(str.maketrans('', '', string.punctuation))

def get_words_from_sentence(s):
    '''
    Gets words from sentence 
    '''
    return s.split(' ')

def clean_en_pair(pair):
    '''
    Cleans the english from the pair 
    '''
    return get_words_from_sentence(remove_punctuation(pair['english']).lower())



In [12]:
def get_en_vocab(train_set):
    '''
    get_en_vocab:
        Gets an english vocab from train_set as a dict 
    '''
    # get only the english sentences, list of strings 
    en_sentences = [clean_en_pair(pair) for pair in train_set]
    en_sentences_flattened = [word for sentence in en_sentences for word in sentence]
    print(f"Words pre-clean {len(en_sentences_flattened)}")
    en_sentences_flattened = [word for word in en_sentences_flattened if word != '']
    print(f"Words post-clean {len(en_sentences_flattened)}")
    
    word_counts = Counter(en_sentences_flattened)
    # with word counts, now we limit the vocabulary to words that happen at least 5 times
    en_vocab = {}
    # {word: index}
    idx = 0
    for word in ["<SOS>", "<EOS>", "<UNK>"]:
        en_vocab[word] = idx 
        idx += 1
    for word, occurrences in word_counts.items():
        if occurrences >= 5:
            en_vocab[word] = idx 
            idx += 1
    return en_vocab

def remove_zh_punctuation(text):
    cleaned = re.sub(r'[，。！？【】（）《》“”‘’、]', '', text)
    cleaned = re.sub(r'\s+', '', cleaned)
    return cleaned

def get_zh_vocab(train_set):
    '''
    get_zh_vocab:
        Gets a zh vocab from train_set as a dict 
    '''
    zh_sentences = [list(remove_zh_punctuation(pair['chinese'])) for pair in train_set]
    zh_sentences_flattened = [word for sentence in zh_sentences for word in sentence]
    print(len(zh_sentences_flattened))

    word_counts = Counter(zh_sentences_flattened)
    zh_vocab = {}

    idx = 0 
    for word in ["<SOS>", "<EOS>", "<UNK>"]:
        zh_vocab[word] = idx 
        idx += 1 
    for word, occurrences in word_counts.items():
        if occurrences >= 2: 
            zh_vocab[word] = idx 
            idx += 1 
    return zh_vocab

en_vocab = get_en_vocab(train_set_mini)
print(len(en_vocab))

zh_vocab = get_zh_vocab(train_set_mini)
print(len(zh_vocab))

Words pre-clean 3828342
Words post-clean 3808706
31568
6559724
5613
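`remove_zh_punctuation` lists punctuation marks explicitly, so anything not in the list (for example the fullwidth colon `：` in the first training example) stays in the text and can end up in the vocabulary. One possible alternative, sketched here under the assumption that every Unicode punctuation character should be dropped (including ASCII ones such as hyphens), is to filter by Unicode category. The helper name `strip_punct_and_space` is hypothetical:

```python
import unicodedata

def strip_punct_and_space(text):
    """Drop whitespace and every character whose Unicode category starts with 'P'."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P') and not ch.isspace()
    )

cleaned = strip_punct_and_space('可以使用这个显影剂的1：1稀释液。')
print(cleaned)  # 可以使用这个显影剂的11稀释液
```

The angle brackets in `<SOS>` and `<EOS>` fall in a symbol category rather than a punctuation one, so special tokens would survive this filter.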


In [55]:
with open('vocab/en_vocab.pkl', 'wb') as f:
    pickle.dump(en_vocab, f)

with open('vocab/zh_vocab.pkl', 'wb') as f:
    pickle.dump(zh_vocab, f)

In [56]:
with open('vocab/en_vocab.pkl', 'rb') as f:
    en_vocab = pickle.load(f)

with open('vocab/zh_vocab.pkl', 'rb') as f:
    zh_vocab = pickle.load(f)

### Model architecture building
- 2 LSTMs are the backbone
- also build a higher-level Seq2Seq model as an abstraction over the entire model 
- nn.Embedding() as a layer in both Encoder and Decoder 
    - vocab_size rows by embedding_dim columns
- Encoder will be English, decoder will be Chinese 

### nn.LSTM

- sequence models are central to NLP: they are models where there is some sort of dependence through time between inputs. 
- a recurrent neural network is a network that maintains some kind of state.
- for example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence.
- In the case of an LSTM, for each element in the sequence there is a corresponding hidden state ht, which in principle contains information from arbitrary points earlier in the sequence. 
- we can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

LSTMs in pytorch:
- PyTorch's LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors is important. The first axis is the sequence itself, the second indexes instances in the mini-batch, and the third indexes elements of the input. 
- ignore mini-batching; we will always just have 1 dimension on the second axis.
- If we want to run the sequence model over the sentence "The cow jumped" our input should look like:

[
    q (the)
    q (cow)
    q (jumped)
]
Except remember there is an additional 2nd dimension with size 1 (the mini-batch dimension).

Initializing an LSTM:
```python
lstm = nn.LSTM(3, 3) #input dim is 3, output dim is 3 
```

input_size = 3: this means each input vector at a time step has length 3, so the last dimension of every input must be 3. 
- each sequence = a list of input vectors (one per timestep)
- each input vector = size input_size 

With input_size = 3, the input tensor shape for one batch is:
    (seq_len, batch_size, 3)

- What does this mean for the embedding layer?
    - its weight must have shape (vocab_size, 3), since each token is mapped to a vector by the embedding. That vector is the input at each timestep for the LSTM, and the LSTM accepts vectors of dimension 3.
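To make the shapes concrete, here is a sketch that chains an embedding into an LSTM. The sizes and the stand-in indices for "the cow jumped" are illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim = 10, 3, 5   # toy sizes
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)          # input_size must equal embedding_dim

sentence = torch.tensor([1, 4, 2])                 # stand-in indices for "the cow jumped"
inputs = embedding(sentence).unsqueeze(1)          # (seq_len=3, batch=1, embedding_dim)
out, (h_n, c_n) = lstm(inputs)

print(inputs.shape)   # (3, 1, 3)
print(out.shape)      # hidden state at every time step: (3, 1, 5)
print(h_n.shape)      # final hidden state only: (1, 1, 5)
print(c_n.shape)      # final cell state only: (1, 1, 5)
```

For a single-layer, unidirectional LSTM the last row of `out` is the same tensor as `h_n`.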

- using nn.LSTM example:
    
```python
# embedding_dim: size of each input vector (the embedding dimension)
# hidden_dim: size of the hidden state, essentially a hyperparameter
self.lstm = nn.LSTM(embedding_dim, hidden_dim)
```

**For most LSTM applications we need a linear layer to learn the mapping from hidden state to the tag space. But the encoder doesn't need one, since the linear layer is essentially the classifier that learns to interpret the hidden state.**


### forward:
- in forward we will have
```
lstm_out, _ = self.LSTM(embeds.view(len(sentence), 1, -1))
```
- what exactly is going on here? Well, basically embeds is a tensor of dimensions (number of tokens in sentence, embedding dim), since the embedding layer takes each word index and pulls up the corresponding row of the embedding matrix
- we want to reshape it to (sequence length, batch_size, input_size)
    - sequence length is the length of the sentence
    - batch size is 1, since it's one sentence at a time 
    - input_size is given as -1, which tells view to infer it from the remaining elements; here that works out to embedding_dim 
- **embeds.view** is how a tensor is reshaped in PyTorch (it is a tensor method, not something LSTM-specific)

- forward in the encoder should only return the hidden state and cell state, since that is all the decoder needs 
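A quick check of the reshape, using a hand-built tensor in place of real embeddings (the 4 x 3 size is an arbitrary example):

```python
import torch

embeds = torch.arange(12, dtype=torch.float32).reshape(4, 3)  # 4 tokens, embedding_dim 3
reshaped = embeds.view(len(embeds), 1, -1)   # -1 is inferred as 3

print(reshaped.shape)  # (4, 1, 3)

# Inserting a batch axis of size 1 with unsqueeze gives the same result.
same = torch.equal(reshaped, embeds.unsqueeze(1))
print(same)
```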


## backpropagation
- under the hood, PyTorch autograd tracks operations on tensors with requires_grad=True. All nn.Module layers like nn.Linear and nn.LSTM already register their parameters with requires_grad=True, so as long as everything is connected correctly in the forward pass, PyTorch will compute the gradients during backprop. 
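A small sketch that checks this on a throwaway LSTM: the parameters are trainable from the start, and their `.grad` fields are only filled in once `backward()` runs. The sizes and the dummy loss are illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(3, 5)

# Every registered parameter is trainable by default.
all_trainable = all(p.requires_grad for p in lstm.parameters())
grads_before = [p.grad for p in lstm.parameters()]   # all None before backward

out, _ = lstm(torch.randn(4, 1, 3))
out.sum().backward()                                 # dummy scalar "loss"
grads_after = [p.grad for p in lstm.parameters()]    # now populated

print(all_trainable, grads_before[0] is None, grads_after[0] is not None)
```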

In [7]:
## Encoder: English layer 

class Encoder(nn.Module):
    def __init__(self, embedding_dim, vocab_size, hidden_dim):
        super(Encoder, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim) # initialize an LSTM, with embedding_dim, and hidden_dim hyperparameters 
    
    def forward(self, sentence):
        embeds = self.embeddings(sentence)  # sentence must be a tensor of word indices: [word_index0, word_index1, word_index2]
        _, (h_n, c_n) = self.LSTM(embeds.view(len(sentence), 1, -1)) # whole sentence in one call, batch size 1
        return h_n, c_n
        

In [8]:
## Test example pass through the encoder 
encoder = Encoder(embedding_dim=3, vocab_size= len(en_vocab), hidden_dim=5)
# now remember that for forward we pass a sentence as the list of words mapped to the indices they show up in the vocab, such as [45, 18, 28]
sentence = "I love bread."

input_words = get_words_from_sentence(remove_punctuation(sentence).lower())
# now map the inputs to the vocab 
input_indices = [en_vocab[word] for word in input_words] 
# now that I think about it, we probably want a function that does this, so that we don't get hit with a KeyError and actually use our <unk> token lul
input_indices

NameError: name 'get_words_from_sentence' is not defined
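The comment in the cell above asks for a lookup that falls back to `<UNK>` instead of raising a `KeyError`. A minimal sketch of such a helper; the name `words_to_indices` and the toy vocabulary are hypothetical:

```python
def words_to_indices(words, vocab, unk_token="<UNK>"):
    """Map each word to its vocab index, using the <UNK> index for unseen words."""
    unk_idx = vocab[unk_token]
    return [vocab.get(word, unk_idx) for word in words]

toy_vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2, "i": 3, "love": 4}
indices = words_to_indices(["i", "love", "bread"], toy_vocab)
print(indices)  # [3, 4, 2]
```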

In [59]:
# with the input_indices, we can now pass them through the encoder?
# oh wait, you need a tensor first lul
input_indices_tensor = torch.tensor(input_indices, dtype=torch.long)
output = encoder.forward(input_indices_tensor)
output # this makes sense, we have both hidden and cell states :)

(tensor([[[ 0.0234,  0.0432, -0.0121,  0.1221, -0.2183]]],
        grad_fn=<StackBackward0>),
 tensor([[[ 0.1151,  0.1039, -0.0202,  0.2092, -0.3026]]],
        grad_fn=<StackBackward0>))

### Decoder
- so the decoder is another LSTM: the encoder compresses the input into "a large fixed-dimensional vector representation", and the decoder LSTM extracts the output sequence from that vector 
- we can just pass the encoder's h_n and c_n to the decoder LSTM as its initial state for this! 
- then in forward we run the linear layer to get the logits; a softmax (or log_softmax) turns them into a distribution over the vocabulary. cross_entropy applies log_softmax internally, so the decoder can return raw logits

- What about the Embedding layer?
    - we also need an embedding layer (used in both training and inference)
    - in the training phase we have the "ground-truth" tokens; the embedding layer turns them into vectors so that each token can be fed through the decoder

### Teacher forcing and backpropagation
- At time step t, the input is the actual target token from t - 1. This makes sense: we give it the "correct" input from the step before and have it try to predict the current token. The t-1 token is called the "ground truth" token, and it is passed through the embedding layer trained specifically for the target-language vocabulary
- the output is the predicted token for timestep t, and the loss compares it against the actual token at t. 
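The input/target shift can be seen on a plain token list. This sketch uses the sentence from the test cells, tokenised per character as the Chinese side is in this notebook:

```python
tokens = ["<SOS>", "我", "爱", "面", "包", "<EOS>"]

decoder_inputs = tokens[:-1]   # what the decoder is fed at each step
targets = tokens[1:]           # what it should predict at each step

for inp, tgt in zip(decoder_inputs, targets):
    print(f"input {inp!r} -> target {tgt!r}")
```

This is the same slicing the training loop applies to the index tensor (`zh_tensor[:-1]` as input, `zh_tensor[1:]` as target).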

**forward step in the decoder**
- when making a prediction, the input is either the correct previous token (teacher forcing, during training) or the model's own previous prediction (inference)
- each step goes through nn.LSTM: pass in the embedding of the token from t-1 AND the decoder's previous h_n and c_n. That carried-over state is the recurrent part of the RNN

- at the first step, the state we feed in is the encoder's h_n and c_n :).
- we don't need to manage the state for the remaining steps of a sequence ourselves; when nn.LSTM is given a whole sequence it carries the state forward internally

- teacher forcing can be done one time step at a time, going through every input one by one
    - since the ground-truth inputs are all known in advance, it can also be done in a single vectorized LSTM call, which is what the Decoder below does

- without teacher forcing (inference), YOU USE THE PREVIOUS TIME STEP'S PREDICTION as the next input
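A sketch of the inference-time loop on standalone toy layers, feeding each prediction back in as the next input. Unlike the `Decoder` cell below, this version stops as soon as `<EOS>` is predicted; that early stop, the function name `greedy_decode`, and the toy sizes are assumptions for illustration, not the notebook's behaviour:

```python
import torch
import torch.nn as nn

def greedy_decode(embedding, lstm, linear, hidden, sos_idx, eos_idx, max_len=10):
    """Generate token indices one step at a time, stopping at eos_idx or max_len."""
    prev = torch.tensor([sos_idx], dtype=torch.long)
    result = []
    with torch.no_grad():                              # no gradients needed at inference
        for _ in range(max_len):
            embeds = embedding(prev).view(1, 1, -1)    # (seq_len=1, batch=1, embedding_dim)
            out, hidden = lstm(embeds, hidden)         # carry (h, c) to the next step
            logits = linear(out)                       # (1, 1, vocab_size)
            pred_idx = logits.argmax(dim=-1).item()    # greedy choice
            result.append(pred_idx)
            if pred_idx == eos_idx:
                break
            prev = torch.tensor([pred_idx], dtype=torch.long)
    return result

torch.manual_seed(0)
vocab_size, embedding_dim, hidden_dim = 8, 4, 6        # toy sizes
embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_dim)
linear = nn.Linear(hidden_dim, vocab_size)
hidden = (torch.zeros(1, 1, hidden_dim), torch.zeros(1, 1, hidden_dim))

decoded = greedy_decode(embedding, lstm, linear, hidden, sos_idx=0, eos_idx=1)
print(decoded)
```

With untrained weights the output is arbitrary; the point is only the control flow.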

In [19]:
## Configurations 
MAX_RESPONSE_LENGTH = 10

In [6]:
class Decoder(nn.Module):
    def __init__(self, embedding_dim, vocab_size, hidden_dim, device):
        super(Decoder, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim)
        self.linear = nn.Linear(hidden_dim, vocab_size)
        self.device = device

    def word_to_tensor(self, word):
        '''
        takes a single word and gets the corresponding tensor
        '''
        word_lst = get_words_from_sentence(remove_zh_punctuation(word))
        indices = [zh_vocab[word] for word in word_lst]
        # get tensor 
        return torch.tensor(indices, dtype=torch.long).to(self.device)
    
    def forward(self, hidden, sentence=None):
        '''
        does the forward propagation. If sentence is provided, we use teacher forcing. Otherwise we assume it is inference  
            Params:
                hidden: the (h_n, c_n) state tuple used to initialise the LSTM (the encoder's final state)
                sentence: a sentence to be used for teacher forcing, as a tensor of token indices 
                Make sure the teacher-forcing sentence is sliced to not include the last token [:-1]
        '''
        if sentence is not None:
            # use teacher forcing 
            embeds = self.embeddings(sentence)
            embeds_reshaped = embeds.view(len(embeds), 1, -1)
            out, hidden = self.LSTM(embeds_reshaped, hidden)
            logits = self.linear(out)
            return logits
        else:
            # inference: generate one token at a time, starting from the '<SOS>' token
            # and feeding each prediction back in as the next input
            # (note: this always runs for MAX_RESPONSE_LENGTH steps; it does not stop at <EOS>)
            start_token = self.word_to_tensor('<SOS>')
            # run through embedding layer
            prev_char = start_token
            outputs = []
            for i in range(MAX_RESPONSE_LENGTH):
                embeds = self.embeddings(prev_char).to(self.device)
                out, hidden = self.LSTM(embeds.view(1, 1, -1), hidden)
                logits = self.linear(out)
                outputs.append(logits)
                pred_idx = torch.argmax(logits, dim=2).item()
                prev_char = torch.tensor(pred_idx, dtype=torch.long, device=self.device)
            return torch.cat(outputs, dim=0)

In [15]:
## functions to take a sentence and turn it into a tensor, adding <sos> and <eos>
def sequence_to_tensor_en(sequence):
    '''
    takes sequence and converts to tensor 
    '''
    # add "<SOS> and <EOS>"
    words = get_words_from_sentence("<SOS> " + remove_punctuation(sequence).lower() + " <EOS>")
    
    # convert to indices, reverting to <UNK> token
    word_indices = [ en_vocab[word] if word in en_vocab else en_vocab["<UNK>"] for word in words ]
    return torch.tensor(word_indices, dtype=torch.long)
    

def sequence_to_tensor_zh(sequence):
    '''
    takes sequence and converts to chinese tensor 
    '''
    words = (["<SOS>"] + list(remove_zh_punctuation(sequence)))
    words.append("<EOS>")
    
    word_indices = [ zh_vocab[word] if word in zh_vocab else zh_vocab["<UNK>"] for word in words ]
    return torch.tensor(word_indices, dtype=torch.long)

In [63]:
# a full run through both the Encoder and the decoder 

encoder = Encoder(embedding_dim=3, vocab_size=len(en_vocab), hidden_dim=5)
en_sentence = "I love bread."
zh_sentence = "我爱面包"


h_n, c_n = encoder.forward(sequence_to_tensor_en(en_sentence))

In [64]:
## the training subset now has 200,000 pairs
print("Length of train ", len(train_set_mini))
print("Length of zh dictionary ", len(zh_vocab))
print("Length of english dictionary ", len(en_vocab))

Length of train  200000
Length of zh dictionary  5613
Length of english dictionary  31568


In [17]:
def zh_tensor_outputs_to_sentence(output_tensor):
    s = ''
    zh_vocab_lst = list(zh_vocab.keys())
    for word_tensor in output_tensor:
        pred_idx = torch.argmax(word_tensor, dim=-1).item()
        s += zh_vocab_lst[pred_idx]
    return s 
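`zh_tensor_outputs_to_sentence` relies on the dict's insertion order matching the indices and keeps decoding past `<EOS>`. A possible variant, sketched with a hypothetical helper name and a toy vocabulary, builds an explicit index-to-token map and cuts the sentence at the first `<EOS>`:

```python
import torch

def logits_to_tokens(logits, vocab, eos_token="<EOS>"):
    """Turn (seq_len, 1, vocab_size) logits into a string, stopping at the first EOS."""
    idx_to_token = {idx: token for token, idx in vocab.items()}
    tokens = []
    for step_logits in logits:
        token = idx_to_token[step_logits.argmax(dim=-1).item()]
        if token == eos_token:
            break
        tokens.append(token)
    return ''.join(tokens)

toy_vocab = {"<SOS>": 0, "<EOS>": 1, "<UNK>": 2, "你": 3, "好": 4}
toy_logits = torch.zeros(4, 1, 5)
toy_logits[0, 0, 3] = 1.0   # step 0 predicts 你
toy_logits[1, 0, 4] = 1.0   # step 1 predicts 好
toy_logits[2, 0, 1] = 1.0   # step 2 predicts <EOS>
toy_logits[3, 0, 3] = 1.0   # step 3 would predict 你, but is never reached

sentence = logits_to_tokens(toy_logits, toy_vocab)
print(sentence)  # 你好
```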

## Training loop

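Does MPS save time? Timing the first 1000 training examples on each device:

MPS:
琥ä纷随哗良钮眼黎s
Number of trains 1000
Loss 6.893100890159607
time for 1000 : mps: 13.119836807250977

CPU:
Number of trains 1000
Loss 6.896958403587341
time for 1000 : cpu: 23.142292022705078

MPS is roughly 1.8x faster here (13.1 s vs 23.1 s per 1000 examples).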

In [66]:
import time

def train(num_epochs, training_data, encoder, decoder, device, lr=0.001):
    optimizer = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr
    )
    count = 0
    total_loss = 0
    # see random prediction
    predict_en_sequence = "I love bread" 
    predict_en_tensor = sequence_to_tensor_en(predict_en_sequence).to(device)
    predict_hidden = encoder(predict_en_tensor)
    print(zh_tensor_outputs_to_sentence(decoder.forward(predict_hidden)))
    start_time = time.time()
    for i in range(num_epochs):
        for pair in training_data:
            count += 1
            if count % 1000 == 0:
                print(f"Number of trains {count}")
                # print the loss 
                print(f"Loss {total_loss / 1000}")
                total_loss = 0
            english = pair['english']
            zh = pair['chinese']
            # build the index tensors once and move them to the device
            en_tensor = sequence_to_tensor_en(english).to(device)
            zh_tensor = sequence_to_tensor_zh(zh).to(device)

            h_n, c_n = encoder(en_tensor)
            # teacher forcing: feed every token except the last, predict every token except the first
            target = zh_tensor[1:]
            predicted = decoder((h_n, c_n), zh_tensor[:-1])

            # predicted is (seq_len, 1, vocab_size); squeeze only the batch axis so that
            # a length-1 target still gives a 2-D (seq_len, vocab_size) input
            loss = nn.functional.cross_entropy(predicted.squeeze(1), target)
            total_loss += loss.item()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        predict_hidden = encoder(predict_en_tensor)
        print(zh_tensor_outputs_to_sentence(decoder.forward(predict_hidden)))
    print(f"Total training time: {time.time() - start_time}")


device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
encoder = Encoder(embedding_dim=32, vocab_size=len(en_vocab), hidden_dim=128)
decoder = Decoder(embedding_dim=32, vocab_size=len(zh_vocab), hidden_dim=128, device=device)
encoder.to(device)
decoder.to(device)

train(3, train_set_mini, encoder, decoder, device)

攀祂夺挲蒙痿梧舛阡栎
Number of trains 1000
Loss 6.890591035366058
Number of trains 2000
Loss 6.536398762226105
Number of trains 3000
Loss 6.311073744058609
Number of trains 4000
Loss 6.114545065164566
Number of trains 5000
Loss 6.013179780244827
Number of trains 6000
Loss 5.939472044944763
Number of trains 7000
Loss 5.800854883670807
Number of trains 8000
Loss 5.774311278581619
Number of trains 9000
Loss 5.654200109958649
Number of trains 10000
Loss 5.682991857051849
Number of trains 11000
Loss 5.632449702739716
Number of trains 12000
Loss 5.590144649028778
Number of trains 13000
Loss 5.54178799700737
Number of trains 14000
Loss 5.516321888923645
Number of trains 15000
Loss 5.480177397966385
Number of trains 16000
Loss 5.458978399753571
Number of trains 17000
Loss 5.36113609623909
Number of trains 18000
Loss 5.45415352511406
Number of trains 19000
Loss 5.412057629108429
Number of trains 20000
Loss 5.3729295177459715
Number of trains 21000
Loss 5.381785968780518
Number of trains 22000
Loss 5.2977

In [67]:
# save encoder and decoder 
torch.save(encoder.state_dict(), './trained_models/baseline_encoder.pth')
torch.save(decoder.state_dict(), './trained_models/baseline_decoder.pth')

In [13]:
# Load previous model 
with open('vocab/en_vocab.pkl', 'rb') as f:
    en_vocab = pickle.load(f)

with open('vocab/zh_vocab.pkl', 'rb') as f:
    zh_vocab = pickle.load(f)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
encoder = Encoder(embedding_dim=32, vocab_size=len(en_vocab), hidden_dim=128)
decoder = Decoder(embedding_dim=32, vocab_size=len(zh_vocab), hidden_dim=128, device=device)
encoder.to(device)
decoder.to(device)

encoder.load_state_dict(torch.load('./trained_models/baseline_encoder.pth'))
decoder.load_state_dict(torch.load('./trained_models/baseline_decoder.pth'))


<All keys matched successfully>

In [20]:
def translate_en_to_zh(encoder, decoder, sequence):
    # use the module-level `device` instead of hard-coding "mps", so this also runs on CPU
    hidden = encoder(sequence_to_tensor_en(sequence).to(device))
    print(zh_tensor_outputs_to_sentence(decoder(hidden)))

translate_en_to_zh(encoder, decoder, "What is your name?")

你的话你是什么<EOS>吗<EOS>


### ATTENTION MECHANISM
- we will now work on adding the attention mechanism
- we will try to add Luong attention 
- current reading source:   https://arxiv.org/pdf/1508.04025

### Luong attention in the model
- modifies the decoder
- at each decoding step, instead of relying only on the previous hidden state, the decoder now looks back at all encoder outputs and chooses which parts to pay attention to. 

Key ideas:
- The encoder still processes the whole input and outputs a sequence of hidden states, one per input token (each called hi below).
- The decoder at time step t has its own hidden state st.
- An attention score is computed between st (current decoder hidden state) and each hi (encoder hidden state).
- The scoring function is how you actually calculate the attention score:
    - dot: score(st, hi) = st^T hi
    - general: score(st, hi) = st^T Wa hi (the one used here)
- Softmax the scores over all encoder outputs to get a probability distribution over input positions (adds up to 1).
- Context vector:
    - weighted sum of encoder hidden states using the attention weights 
    - a "summary" of the input that the decoder should focus on at step t 
- Final step: combine the context vector and the decoder hidden state to predict the next word. 

- decoder hidden --> compute scores with all encoder outputs --> softmax --> get context --> combine with decoder hidden --> predict word 
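The pipeline above can be sketched as a single function on plain tensors. This is a minimal, unbatched version of the "general" score, assuming encoder and decoder share the same hidden size; the function name and toy sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def luong_general_attention(decoder_hidden, encoder_outputs, W_a):
    """
    decoder_hidden:  (hidden_dim,)          current decoder state s_t
    encoder_outputs: (src_len, hidden_dim)  all encoder states h_i
    W_a:             nn.Linear(hidden_dim, hidden_dim, bias=False)
    """
    # Row i of W_a(encoder_outputs) is W_a h_i; the dot with s_t gives s_t^T W_a h_i.
    scores = W_a(encoder_outputs) @ decoder_hidden      # (src_len,)
    weights = F.softmax(scores, dim=0)                  # sums to 1 over source positions
    context = weights @ encoder_outputs                 # weighted sum: (hidden_dim,)
    return context, weights

torch.manual_seed(0)
hidden_dim, src_len = 6, 5                              # toy sizes
W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
context, weights = luong_general_attention(
    torch.randn(hidden_dim), torch.randn(src_len, hidden_dim), W_a
)
print(context.shape, weights.shape, weights.sum().item())
```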


### score
- we calculate a score vector
- this score vector is softmaxed, this turns it into a probability vector (adds up to 1)
- whatever we multiply with this softmaxed score vector is either amplified or suppressed: where the attention weight is close to 1 the value is kept, and where it is close to 0 the value is scaled toward 0

- we are going to use Luong Attention general version 

$$
score = h_t^{T} W_a h_s
$$
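A tiny numeric illustration of the amplify-or-suppress effect of the softmaxed scores, using made-up scores and value vectors:

```python
import torch

scores = torch.tensor([4.0, 0.0, -2.0])        # made-up raw attention scores
weights = torch.softmax(scores, dim=0)         # probability vector over 3 positions

values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])            # one made-up value vector per position
context = weights @ values                     # weighted sum of the rows

print(weights)   # the first position dominates
print(context)   # so the context is close to the first row of values
```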

### preimplementation thoughts

1. Encoder stays the same; we just also keep `out`, since it holds the encoder hidden state at every time step.

out, hidden = Encoder(sequence)

2. We change the teacher forcing code to run one time step at a time (not vectorized), so that it mirrors the inference logic exactly. 

The changes below apply to both the teacher forcing and inference blocks:

3. Run:
out, hidden = self.LSTM(embeds.view(1, 1, -1), hidden)

to get the decoder hidden state for the current time step.

4. With the encoder out (all encoder hidden states) and the decoder hidden state for the current time step, we have all we need to create the SCORES tensor.
	a. Pass the encoder OUT through a NEW linear layer (this is W_a; since encoder and decoder share hidden_dim here, it maps hidden_dim to hidden_dim), then take the dot product of each row with the decoder hidden state. It has to be out, because we want to use ALL encoder hidden states; that's the point of attention. 

    b. Now we have a scores tensor with one score per encoder position. I softmax this to get weights that can amplify / minimize whatever they are multiplied by. I multiply these weights with out (all encoder hidden states) and sum over the positions to get a single context tensor.

5. With the context tensor, concatenate it with the CURRENT decoder hidden state: torch.cat((context, hidden), dim=-1). Here hidden means the decoder's h for this step (the same values as out), not the (h, c) tuple.

6. Pass this combined tensor to the already existing linear layer, i.e. the second line of this code where we get the logits. Its input size then has to grow from hidden_dim to 2 * hidden_dim.

                out, hidden = self.LSTM(embeds.view(1, 1, -1), hidden)
                logits = self.linear(out) # this will be self.linear(combined_tensor)

7. everything else stays the same
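The plan above could look roughly like the following single decoder step. This is a sketch of the plan as written, not a tested replacement for the `Decoder` class: the class name `AttnDecoderStep`, the layer name `W_a`, and the toy sizes are assumptions. Note that the Luong paper additionally applies a tanh layer to the concatenated vector before the output projection, which this plan skips.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnDecoderStep(nn.Module):
    def __init__(self, embedding_dim, vocab_size, hidden_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.LSTM = nn.LSTM(embedding_dim, hidden_dim)
        self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)  # new attention layer
        self.linear = nn.Linear(2 * hidden_dim, vocab_size)       # input is [context; hidden]

    def forward(self, prev_token, hidden, encoder_out):
        # prev_token: long tensor holding one index; encoder_out: (src_len, 1, hidden_dim)
        embeds = self.embeddings(prev_token).view(1, 1, -1)
        out, hidden = self.LSTM(embeds, hidden)            # out: (1, 1, hidden_dim)
        dec_h = out.view(-1)                               # (hidden_dim,)
        enc = encoder_out.squeeze(1)                       # (src_len, hidden_dim)
        scores = self.W_a(enc) @ dec_h                     # one score per source position
        weights = F.softmax(scores, dim=0)
        context = weights @ enc                            # (hidden_dim,)
        combined = torch.cat((context, dec_h), dim=0)      # (2 * hidden_dim,)
        logits = self.linear(combined)                     # (vocab_size,)
        return logits, hidden, weights

torch.manual_seed(0)
step = AttnDecoderStep(embedding_dim=4, vocab_size=8, hidden_dim=6)   # toy sizes
encoder_out = torch.randn(5, 1, 6)                                    # stand-in encoder outputs
hidden = (torch.zeros(1, 1, 6), torch.zeros(1, 1, 6))
logits, hidden, weights = step(torch.tensor(0), hidden, encoder_out)
print(logits.shape, weights.shape)
```

Calling this once per target token (with the ground-truth token during training, or the previous argmax at inference) gives the step-by-step loop described in item 2.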

### Training runs
1. 04/25/25: first successful training run, 2 epochs @ 100k examples each. Model outputs non-garbage. Recognizes pronouns, and uses EOS tag. 
2. 04/25/25: Added vectorization to teacher forcing forward code. Use MPS, 3 epochs @200k examples each. Loss drops to around 4.6. Model still learns pronouns, but also common words like "need" --> 想要