<a href="https://colab.research.google.com/github/amrahmani/Pythorch/blob/main/Ch9_Transformer_MachineTranslation_EnglishToPortuguese.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Machine Translation**

**Problem:** Using PyTorch, build a Transformer neural network for machine translation from English to Portuguese. First, create a dataset of 20 greeting sentences and define the vocabulary. Then, train and evaluate the Transformer on this dataset.

**Example dataset**

English: Hello

Portuguese: Ola

English: How are you?

Portuguese: Como esta?

English: Thank you

Portuguese: Obrigado

English: Good morning

Portuguese: bom dia

English: Good night

Portuguese: bom noite


**Tokenization**

In [8]:
import torch
import torch.nn as nn

english_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Hello': 3, 'How': 4, 'are': 5, 'you': 6, '?': 7, 'Thank': 8, 'Good': 9, 'morning': 10, 'night': 11}
portuguese_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Ola': 3, 'como': 4, 'esta': 5, '?': 6, 'Obrigado': 7, 'bom': 8, 'dia': 9, 'noite': 10}
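
The vocabulary maps each token to an integer index. As a minimal sketch of how a raw sentence could be turned into those indices, the hypothetical helper `encode` below (not part of the notebook) splits text into words and punctuation with a regular expression and looks each token up. It assumes every token is already in the vocabulary and that capitalization matches the keys.

```python
import re

english_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Hello': 3, 'How': 4, 'are': 5,
                 'you': 6, '?': 7, 'Thank': 8, 'Good': 9, 'morning': 10, 'night': 11}

def encode(sentence, vocab):
    """Hypothetical helper: split into word and punctuation tokens, then look up each index."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # A KeyError here means the token is missing from the vocabulary
    return [vocab[token] for token in tokens]

print(encode("How are you?", english_vocab))  # [4, 5, 6, 7]
```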

**Build Vocabulary**

In [9]:
english_sentences = [
    [3],  # Hello
    [4, 5, 6, 7],  # How are you?
    [8, 6],  # Thank you
    [9, 10],  # Good morning
    [9, 11]  # Good night
]
portuguese_sentences = [
    [3],  # Ola
    [4, 5, 6],  # Como esta?
    [7],  # Obrigado
    [8, 9],  # bom dia
    [8, 10]  # bom noite
]
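
Before a sentence reaches the model it is wrapped in `<sos>` and `<eos>`. Rather than typing the wrapped lists by hand, they could be derived from the index lists above. This is a sketch only; `add_special_tokens` is a hypothetical helper name, and it assumes `<sos>` is index 0 and `<eos>` is index 1, as in both vocabularies.

```python
SOS, EOS = 0, 1  # indices of <sos> and <eos> in both vocabularies

english_sentences = [[3], [4, 5, 6, 7], [8, 6], [9, 10], [9, 11]]

def add_special_tokens(sentence):
    """Hypothetical helper: wrap a list of token indices in <sos> ... <eos>."""
    return [SOS] + sentence + [EOS]

src_data = [add_special_tokens(s) for s in english_sentences]
print(src_data[1])  # [0, 4, 5, 6, 7, 1]
```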

**Convert Tokens to Indices**

In [10]:
src_data = [
    [0, 3, 1],  # <sos> Hello <eos>
    [0, 4, 5, 6, 7, 1],  # <sos> How are you? <eos>
    [0, 8, 6, 1],  # <sos> Thank you <eos>
    [0, 9, 10, 1],  # <sos> Good morning <eos>
    [0, 9, 11, 1]  # <sos> Good night <eos>
]
tgt_data = [
    [0, 3, 1],  # <sos> Ola <eos>
    [0, 4, 5, 6, 1],  # <sos> Como esta? <eos>
    [0, 7, 1],  # <sos> Obrigado <eos>
    [0, 8, 9, 1],  # <sos> bom dia <eos>
    [0, 8, 10, 1]  # <sos> bom noite <eos>
]

# Convert lists to tensors
src_tensors = [torch.tensor(sentence) for sentence in src_data]
tgt_tensors = [torch.tensor(sentence) for sentence in tgt_data]
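
These tensors have different lengths, so they cannot be batched until they are padded. One possible way to do that, shown here as a sketch rather than the notebook's own approach, is PyTorch's `pad_sequence` utility. The example assumes the `<pad>` index is 2, as in the vocabularies above.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD = 2  # index of <pad> in both vocabularies

src_data = [[0, 3, 1], [0, 4, 5, 6, 7, 1], [0, 8, 6, 1], [0, 9, 10, 1], [0, 9, 11, 1]]
src_tensors = [torch.tensor(sentence) for sentence in src_data]

# Pad on the right to the longest sentence and stack into (batch size, max length)
src_batch = pad_sequence(src_tensors, batch_first=True, padding_value=PAD)
print(src_batch.shape)  # torch.Size([5, 6])
print(src_batch[0])     # tensor([0, 3, 1, 2, 2, 2])
```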

**Transformer Model**

In [11]:
class SimpleTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256, nhead=8, num_encoder_layers=3, num_decoder_layers=3, dim_feedforward=512):
        # d_model is a parameter that specifies the dimensionality of the input and output vectors of the Transformer model.
        #   Each word in the input sequences is represented as a d_model-dimensional vector.
        # nhead is a parameter that specifies the number of heads in the multi-head attention mechanism.
        # num_encoder_layers and num_decoder_layers are parameters that specify the number of encoder and decoder layers in the Transformer model.
        # dim_feedforward is a parameter that specifies the dimensionality of the hidden layer (number of units) within the feedforward network.

        super(SimpleTransformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        # Define the Transformer model with specified parameters
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward)
        # Define the final fully connected layer that maps the output of the transformer to the target vocab size (11 for the Portuguese vocabulary here)
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        # Store the source vocabulary size and the target vocabulary size for reference, the model dimension for scaling embeddings
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size
        self.d_model = d_model

    def forward(self, src, tgt):
        # Apply the embedding layer to the source input and divide it by the square root of d_model
        src = self.encoder_embedding(src) / (self.d_model ** 0.5)
        # Apply the embedding layer to the target input and divide it by the square root of d_model
        tgt = self.decoder_embedding(tgt) / (self.d_model ** 0.5)
        # Permute the source embedding for the transformer (required shape: sequence length, batch size, embedding dimension)
        # Example: if src has shape (32, 10, 256) = (batch size, sequence length, embedding dimension), src.permute(1, 0, 2) returns shape (10, 32, 256).
        src = src.permute(1, 0, 2)
        # Permute the target embedding for the transformer (required shape: sequence length, batch size, embedding dimension)
        tgt = tgt.permute(1, 0, 2)
        # Pass the source embeddings through the encoder part of the transformer, store the output in a variable for reference
        memory = self.transformer.encoder(src)
        # Pass the target embeddings and encoder output through the decoder part of the transformer
        output = self.transformer.decoder(tgt, memory)
        # Apply the final fully connected layer to map transformer output to target vocabulary size
        output = self.fc_out(output)
        # Permute the output back to (batch size, sequence length, vocab size)
        return output.permute(1, 0, 2)

    # Greedily generate an output sequence for a given source sequence (src); no gradients are needed at inference time
    @torch.no_grad()
    def generate(self, src, max_len=10):
        # Apply the embedding layer to the source input and divide it by the square root of d_model
        src = self.encoder_embedding(src) / (self.d_model ** 0.5)
        # Permute the source embedding for the transformer (required shape: sequence length, batch size, embedding dimension)
        src = src.permute(1, 0, 2)
        # Pass the source embeddings through the encoder part of the transformer
        memory = self.transformer.encoder(src)
        # Initialize the target sequence with the start-of-sequence token <sos> (index 0); this assumes a batch of one source sentence
        tgt = torch.tensor([[0]])
        # List to store generated token indices
        generated_indices = []
        for _ in range(max_len):
            # Apply the embedding layer to the current target sequence and divide it by the square root of d_model
            tgt_emb = self.decoder_embedding(tgt) / (self.d_model ** 0.5)
            # Permute the target embedding for the transformer (required shape: sequence length, batch size, embedding dimension)
            tgt_emb = tgt_emb.permute(1, 0, 2)
            # Pass the target embeddings and encoder output through the decoder part of the transformer
            output = self.transformer.decoder(tgt_emb, memory)
            # Apply the final fully connected layer to map transformer output to target vocabulary size
            output = self.fc_out(output)
            # Get the index of the next token by finding the argmax of the output. unsqueeze(0): reshapes the tensor by adding a new dimension at position 0.
            next_token = output.argmax(2)[-1, :].unsqueeze(0)
            # Append the generated token index to the list
            generated_indices.append(next_token.item())
            # Concatenate the next token to the current target sequence
            tgt = torch.cat((tgt, next_token), dim=1)
            # Stop if the end-of-sequence token <eos> (assumed to be 1) is generated
            if next_token.item() == 1:
                break
        # Return the list of generated token indices
        return generated_indices
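
The decoder calls in `forward` and `generate` pass no target mask, so each target position can attend to the tokens that follow it. Transformer decoders are commonly given a causal mask to prevent this. The sketch below is an illustration under assumptions, not part of the model above. It uses a small standalone decoder (hypothetical sizes, dropout disabled) and builds the mask by hand with `torch.triu`. The same kind of mask could be passed as `tgt_mask` to `self.transformer.decoder(...)`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16  # small hypothetical size, chosen only to keep the demo fast

# One decoder layer with dropout disabled so that outputs are deterministic
layer = nn.TransformerDecoderLayer(d_model, nhead=2, dim_feedforward=32, dropout=0.0)
decoder = nn.TransformerDecoder(layer, num_layers=1).eval()

tgt = torch.randn(4, 1, d_model)     # (target length, batch size, d_model)
memory = torch.randn(3, 1, d_model)  # stand-in for the encoder output

# Causal mask: -inf above the diagonal blocks attention to later positions
tgt_len = tgt.size(0)
causal_mask = torch.triu(torch.full((tgt_len, tgt_len), float('-inf')), diagonal=1)

# Perturb only the last target position
tgt_changed = tgt.clone()
tgt_changed[-1] += 1.0

with torch.no_grad():
    out_masked = decoder(tgt, memory, tgt_mask=causal_mask)
    out_masked_changed = decoder(tgt_changed, memory, tgt_mask=causal_mask)
    out_plain = decoder(tgt, memory)
    out_plain_changed = decoder(tgt_changed, memory)

# With the mask, earlier positions are unaffected by the change to the last token
print(torch.allclose(out_masked[:-1], out_masked_changed[:-1], atol=1e-5))
# Without the mask, earlier positions can attend to the changed token
print(torch.allclose(out_plain[:-1], out_plain_changed[:-1], atol=1e-5))
```

If the first comparison prints `True` and the second `False`, the mask is doing its job: only the masked decoder keeps earlier positions independent of later ones.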

**Training**

In [27]:
# Define the vocabulary sizes and initialize the model
src_vocab_size = len(english_vocab)
tgt_vocab_size = len(portuguese_vocab)
model = SimpleTransformer(src_vocab_size, tgt_vocab_size)

criterion = nn.CrossEntropyLoss(ignore_index=2)  # Ignore the <pad> token (index 2) in the loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Pad every sequence on the right to the same length with the <pad> token (index 2)
def pad_sequences(sequences, max_len, pad_value=2):
    # dtype=seq.dtype keeps the padding an integer tensor so torch.cat does not change the index type
    return [torch.cat([seq, torch.full((max_len - len(seq),), pad_value, dtype=seq.dtype)]) for seq in sequences]

max_len = max(max(len(s) for s in src_tensors), max(len(s) for s in tgt_tensors))
src_tensors = pad_sequences(src_tensors, max_len)
tgt_tensors = pad_sequences(tgt_tensors, max_len)

# torch.stack joins the equal-length tensors along a new first dimension, giving one tensor of shape (batch size, max_len).
src_tensor = torch.stack(src_tensors)
tgt_tensor = torch.stack(tgt_tensors)

# Train model
for epoch in range(200):  # Train for 200 epochs
    # Set the model to training mode
    model.train()
    optimizer.zero_grad()

    # Forward pass. tgt_tensor[:, :-1] keeps all rows (:) and all columns except the last one (:-1).
    # This gives the decoder input: every token except the final one. In this padded batch the dropped column is always <pad>, because every target sentence is shorter than max_len.
    output = model(src_tensor, tgt_tensor[:, :-1])

    # Compute loss. The output has shape (batch size, sequence length, target vocab size) and is reshaped to (batch size * sequence length, target vocab size).
    # The loss function expects a 2D tensor with one row of logits per token prediction. The labels are the target shifted left by one token (tgt_tensor[:, 1:]), flattened to 1D.
    loss = criterion(output.reshape(-1, tgt_vocab_size), tgt_tensor[:, 1:].reshape(-1))

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')


Epoch 0, Loss: 2.462172746658325
Epoch 10, Loss: 0.8762478828430176
Epoch 20, Loss: 0.05221805348992348
Epoch 30, Loss: 0.014924539253115654
Epoch 40, Loss: 0.005738255102187395
Epoch 50, Loss: 0.004030696116387844
Epoch 60, Loss: 0.002821733709424734
Epoch 70, Loss: 0.002170687075704336
Epoch 80, Loss: 0.0019442916382104158
Epoch 90, Loss: 0.0016539767384529114
Epoch 100, Loss: 0.0015951860696077347
Epoch 110, Loss: 0.001571864471770823
Epoch 120, Loss: 0.0014183578314259648
Epoch 130, Loss: 0.0013564411783590913
Epoch 140, Loss: 0.0012715071206912398
Epoch 150, Loss: 0.0012826604070141912
Epoch 160, Loss: 0.001058325869962573
Epoch 170, Loss: 0.0010998089564964175
Epoch 180, Loss: 0.0010297000408172607
Epoch 190, Loss: 0.0009575058938935399
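
The training loop relies on two details: the decoder input and the labels are the same target sequence offset by one token, and `ignore_index` removes `<pad>` positions from the loss. The sketch below illustrates both on a single padded sentence. The random `logits` tensor is only a stand-in for the model output, and the vocabulary size of 11 is taken from the Portuguese vocabulary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 2
vocab_size = 11  # size of the Portuguese vocabulary

tgt = torch.tensor([[0, 7, 1, 2, 2, 2]])  # <sos> Obrigado <eos> <pad> <pad> <pad>

decoder_input = tgt[:, :-1]  # [[0, 7, 1, 2, 2]]: what the decoder reads
labels = tgt[:, 1:]          # [[7, 1, 2, 2, 2]]: what it should predict next

# Stand-in for the model output: (batch size, sequence length, vocab size)
logits = torch.randn(1, decoder_input.size(1), vocab_size)

# Loss over the flattened batch, skipping <pad> labels
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss_ignoring_pad = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))

# Same loss computed only on the two non-pad positions
loss_real_tokens = nn.CrossEntropyLoss()(logits[0, :2], labels[0, :2])

print(torch.allclose(loss_ignoring_pad, loss_real_tokens))  # True
```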


**Evaluation**

In [28]:
# Set the model to evaluation mode
model.eval()
# Example source sentence represented by token indices
src_sentence = [0, 8, 6, 1]  # <sos> Thank you <eos>
# Convert the source sentence to a tensor and add a batch dimension
src_tensor = torch.tensor(src_sentence).unsqueeze(0)
# Generate the translation by passing the source tensor to the model's generate method
generated_indices = model.generate(src_tensor)
# Convert the generated token indices to words with an index-to-word lookup built from the Portuguese vocabulary
index_to_word = {index: word for word, index in portuguese_vocab.items()}
translated_sentence = [index_to_word[i] for i in generated_indices]
# Print the translated sentence by joining the words with spaces
print(' '.join(translated_sentence))

Obrigado <eos>
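
The printed translation still contains the `<eos>` marker. A possible post-processing step, sketched here with a hypothetical `decode` helper that is not part of the notebook, stops at the first `<eos>` and joins the remaining words. It assumes `<eos>` is index 1, as in the vocabulary.

```python
portuguese_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Ola': 3, 'como': 4, 'esta': 5,
                    '?': 6, 'Obrigado': 7, 'bom': 8, 'dia': 9, 'noite': 10}

# Inverse mapping from index to word
index_to_word = {index: word for word, index in portuguese_vocab.items()}

def decode(indices, eos_index=1):
    """Hypothetical helper: convert indices to words, stopping at the first <eos>."""
    words = []
    for index in indices:
        if index == eos_index:
            break
        words.append(index_to_word[index])
    return ' '.join(words)

print(decode([7, 1]))        # Obrigado
print(decode([4, 5, 6, 1]))  # como esta ?
```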


**Hands-on activities in the class**

Increase the vocabulary size and dataset size.

Create a mini-dictionary focused on a specific domain (e.g., renting rooms, classroom objects).

Modify hyperparameters to improve translation accuracy.

Translate greetings from English into a target language of your choice, not just Portuguese.