<a href="https://colab.research.google.com/github/amrahmani/Pythorch/blob/main/Ch9_Transformer_MachineTranslation_EnglishToPortuguese.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Machine Translation**

**Problem:** Using PyTorch, build a Transformer neural network for machine translation from English to Portuguese. First, create a dataset of 20 greeting sentences and define the vocabulary. Then, train and evaluate the Transformer on this dataset. The code below uses only the five sentence pairs listed in the example dataset.

**Example dataset**

English: Hello

Portuguese: Ola

English: How are you?

Portuguese: Como esta?

English: Thank you

Portuguese: Obrigado

English: Good morning

Portuguese: bom dia

English: Good night

Portuguese: bom noite


**Tokenization**

In [64]:
import torch
import torch.nn as nn
import torch.optim as optim

english_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Hello': 3, 'How': 4, 'are': 5, 'you': 6, '?': 7, 'Thank': 8, 'Good': 9, 'morning': 10, 'night': 11}
portuguese_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Ola': 3, 'como': 4, 'esta': 5, '?': 6, 'Obrigado': 7, 'bom': 8, 'dia': 9, 'noite': 10}

**Build Vocabulary**

In [65]:
english_sentences = [
    [3],  # Hello
    [4, 5, 6, 7],  # How are you?
    [8, 6],  # Thank you
    [9, 10],  # Good morning
    [9, 11]  # Good night
]
portuguese_sentences = [
    [3],  # Ola
    [4, 5, 6],  # Como esta?
    [7],  # Obrigado
    [8, 9],  # bom dia
    [8, 10]  # bom noite
]

**Convert Tokens to Indices**

In [66]:
src_data = [
    [0, 3, 1],  # <sos> Hello <eos>
    [0, 4, 5, 6, 7, 1],  # <sos> How are you? <eos>
    [0, 8, 6, 1],  # <sos> Thank you <eos>
    [0, 9, 10, 1],  # <sos> Good morning <eos>
    [0, 9, 11, 1]  # <sos> Good night <eos>
]
tgt_data = [
    [0, 3, 1],  # <sos> Ola <eos>
    [0, 4, 5, 6, 1],  # <sos> Como esta? <eos>
    [0, 7, 1],  # <sos> Obrigado <eos>
    [0, 8, 9, 1],  # <sos> bom dia <eos>
    [0, 8, 10, 1]  # <sos> bom noite <eos>
]

# Convert lists to tensors
src_tensors = [torch.tensor(sentence) for sentence in src_data]
tgt_tensors = [torch.tensor(sentence) for sentence in tgt_data]
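
The index lists above are written by hand. As a minimal sketch, the same lists could be derived from the vocabulary with a small helper. The name `encode` and the rule of splitting `?` into its own token are assumptions made for illustration; they are not part of the original notebook.

```python
import torch

# Same English vocabulary as defined earlier in this notebook
english_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Hello': 3, 'How': 4, 'are': 5,
                 'you': 6, '?': 7, 'Thank': 8, 'Good': 9, 'morning': 10, 'night': 11}

def encode(sentence, vocab):
    """Hypothetical helper: map a sentence to indices wrapped in <sos>/<eos>."""
    # Treat the question mark as its own token, then split on whitespace
    words = sentence.replace('?', ' ?').split()
    return [vocab['<sos>']] + [vocab[w] for w in words] + [vocab['<eos>']]

encoded = encode('How are you?', english_vocab)
print(encoded)                # [0, 4, 5, 6, 7, 1]
print(torch.tensor(encoded))  # the same indices as a 1-D tensor
```

A word that is missing from the vocabulary raises a `KeyError` here, which is usually what you want on a dataset this small, because it exposes typos immediately.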

**Transformer Model**

In [67]:
class SimpleTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=256, nhead=8, num_encoder_layers=3, num_decoder_layers=3, dim_feedforward=512):
        super(SimpleTransformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers, dim_feedforward)
        self.fc_out = nn.Linear(d_model, tgt_vocab_size)
        self.src_vocab_size = src_vocab_size
        self.tgt_vocab_size = tgt_vocab_size
        self.d_model = d_model

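    def forward(self, src, tgt):
        # Scale embeddings by sqrt(d_model)
        src = self.encoder_embedding(src) * (self.d_model ** 0.5)
        tgt = self.decoder_embedding(tgt) * (self.d_model ** 0.5)
        # nn.Transformer expects (seq_len, batch, d_model) by default
        src = src.permute(1, 0, 2)
        tgt = tgt.permute(1, 0, 2)
        # Causal mask: each target position may attend only to itself and earlier positions,
        # so the decoder cannot peek at the token it is asked to predict
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(0)).to(tgt.device)
        memory = self.transformer.encoder(src)
        output = self.transformer.decoder(tgt, memory, tgt_mask=tgt_mask)
        output = self.fc_out(output)
        return output.permute(1, 0, 2)  # back to (batch, seq_len, vocab)

    @torch.no_grad()
    def generate(self, src, max_len=10):
        # Greedy decoding for a single source sentence (batch size 1)
        src = self.encoder_embedding(src) * (self.d_model ** 0.5)
        src = src.permute(1, 0, 2)
        memory = self.transformer.encoder(src)
        tgt = torch.tensor([[0]]).to(src.device)  # Start with <sos>
        generated_indices = []
        for _ in range(max_len):
            tgt_emb = self.decoder_embedding(tgt) * (self.d_model ** 0.5)
            tgt_emb = tgt_emb.permute(1, 0, 2)
            # Use the same causal mask as in training
            tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_emb.size(0)).to(tgt_emb.device)
            output = self.transformer.decoder(tgt_emb, memory, tgt_mask=tgt_mask)
            output = self.fc_out(output)
            next_token = output.argmax(2)[-1, :].unsqueeze(0)  # most likely token at the last position
            generated_indices.append(next_token.item())
            tgt = torch.cat((tgt, next_token), dim=1)
            if next_token.item() == 1:  # <eos>
                break
        return generated_indices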
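
A Transformer decoder trained on shifted targets is usually given a causal mask, so that position i cannot attend to positions after i. The sketch below builds such a mask by hand to show what it looks like. `causal_mask` is an illustrative name, not something defined in this notebook.

```python
import torch

def causal_mask(size):
    """Hypothetical helper: additive attention mask that hides later positions."""
    # -inf strictly above the diagonal (blocked), 0 on and below it (allowed)
    return torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

mask = causal_mask(4)
print(mask)
# Row i is the view from target position i: entries to the right of the diagonal are -inf
```

A tensor of this form can be passed as the `tgt_mask` argument of `nn.TransformerDecoder`, where it is added to the attention scores before the softmax.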

**Training**

In [68]:
# Define the vocabulary sizes and initialize the model
src_vocab_size = len(english_vocab)
tgt_vocab_size = len(portuguese_vocab)
model = SimpleTransformer(src_vocab_size, tgt_vocab_size)

criterion = nn.CrossEntropyLoss(ignore_index=2)  # Ignore pad token in loss
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Convert sentences to tensors and pad sequences to the same length
def pad_sequences(sequences, max_len, pad_value=2):
    return [torch.cat([seq, torch.full((max_len - len(seq),), pad_value)]) for seq in sequences]

max_len = max(max(len(s) for s in src_tensors), max(len(s) for s in tgt_tensors))
src_tensors = pad_sequences(src_tensors, max_len)
tgt_tensors = pad_sequences(tgt_tensors, max_len)

src_tensor = torch.stack(src_tensors)
tgt_tensor = torch.stack(tgt_tensors)

src_tensor = src_tensor.to(torch.int64)
tgt_tensor = tgt_tensor.to(torch.int64)

for epoch in range(100):  # Train for 100 epochs
    model.train()
    optimizer.zero_grad()

    # Forward pass
    output = model(src_tensor, tgt_tensor[:, :-1])

    # Compute loss
    loss = criterion(output.reshape(-1, tgt_vocab_size), tgt_tensor[:, 1:].reshape(-1))

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item()}')


Epoch 0, Loss: 2.773191452026367
Epoch 10, Loss: 0.21388719975948334
Epoch 20, Loss: 0.023206619545817375
Epoch 30, Loss: 0.007096690591424704
Epoch 40, Loss: 0.003973889164626598
Epoch 50, Loss: 0.0028921852353960276
Epoch 60, Loss: 0.0021512166131287813
Epoch 70, Loss: 0.00209934264421463
Epoch 80, Loss: 0.0017372026341035962
Epoch 90, Loss: 0.0015720758819952607
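
The training loop feeds `tgt_tensor[:, :-1]` to the decoder and compares the output against `tgt_tensor[:, 1:]`. The sketch below shows that shift on a single padded row and checks that `ignore_index` leaves the `<pad>` positions out of the loss. The random logits are stand-ins for model output and are only there to make the example self-contained.

```python
import torch
import torch.nn as nn

PAD = 2  # <pad> index used in this notebook

# One padded target row: <sos> bom dia <eos> <pad> <pad>
tgt = torch.tensor([[0, 8, 9, 1, PAD, PAD]])

decoder_input = tgt[:, :-1]  # [0, 8, 9, 1, 2]: what the decoder is fed
labels = tgt[:, 1:]          # [8, 9, 1, 2, 2]: what it should predict at each step

# Stand-in logits (random values; 11 is the Portuguese vocabulary size)
torch.manual_seed(0)
logits = torch.randn(1, labels.size(1), 11)

# With ignore_index, only the three non-pad labels contribute to the mean loss
loss_ignoring_pad = nn.CrossEntropyLoss(ignore_index=PAD)(logits.reshape(-1, 11), labels.reshape(-1))
loss_first_three = nn.CrossEntropyLoss()(logits[0, :3], labels[0, :3])

print(decoder_input.tolist(), labels.tolist())
print(torch.allclose(loss_ignoring_pad, loss_first_three))  # True
```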


**Evaluation**

In [69]:
model.eval()
src_sentence = [0, 8, 6, 1]  # <sos> Thank you <eos>
src_tensor = torch.tensor(src_sentence).unsqueeze(0).to(torch.int64)
generated_indices = model.generate(src_tensor)

translated_sentence = [list(portuguese_vocab.keys())[i] for i in generated_indices]
print(' '.join(translated_sentence))


Obrigado <eos>
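
The printed translation includes the `<eos>` marker because the generated indices are mapped straight back to vocabulary keys. As a sketch, an explicit index-to-word mapping with the special tokens filtered out gives cleaner output. `index_to_word` and `decode` are illustrative names, not part of the notebook.

```python
# Same Portuguese vocabulary as defined earlier in this notebook
portuguese_vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2, 'Ola': 3, 'como': 4, 'esta': 5,
                    '?': 6, 'Obrigado': 7, 'bom': 8, 'dia': 9, 'noite': 10}

# Invert the vocabulary once instead of relying on dictionary key order
index_to_word = {index: word for word, index in portuguese_vocab.items()}

def decode(indices, specials=('<sos>', '<eos>', '<pad>')):
    """Hypothetical helper: turn generated indices into text without special tokens."""
    words = [index_to_word[i] for i in indices]
    return ' '.join(w for w in words if w not in specials)

print(decode([7, 1]))     # Obrigado
print(decode([8, 9, 1]))  # bom dia
```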


**Hands-on activities in the class**

Increase the vocabulary size and dataset size.

Create a mini-dictionary focused on a specific domain (e.g., renting rooms, classroom objects).

Modify hyperparameters to improve translation accuracy.

Translate greetings from English into another target language of your choice, not only Portuguese.
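
For the first two activities, writing the vocabulary dictionary by hand becomes error-prone as the dataset grows. One possible starting point is to build it from the raw sentences. The function name `build_vocab` and the whitespace tokenization with `?` as a separate token are assumptions for this sketch.

```python
def build_vocab(sentences):
    """Hypothetical helper: assign an index to each new word in order of appearance."""
    vocab = {'<sos>': 0, '<eos>': 1, '<pad>': 2}  # special tokens keep their fixed indices
    for sentence in sentences:
        for word in sentence.replace('?', ' ?').split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

english = ['Hello', 'How are you?', 'Thank you', 'Good morning', 'Good night']
vocab = build_vocab(english)
print(vocab)
print(len(vocab))  # 12
```

On these five sentences the result matches the hand-written `english_vocab`, so the model code can stay unchanged while sentences are added to the list.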