# Homework 6: Sequence-to-Sequence Model

#### Introduction to Natural Language Processing

* Hyerin, Seo. (hyseo@students.uni-mainz.de)
* Yeonwoo, Nam. (yeonam@students.uni-mainz.de)
* Yevin, Kim. (kyevin@students.uni-mainz.de)

In this homework we're going to look at an encoder-decoder model.

*Total Points: 20P*

# Evaluation

*Task 1: Decide which metric is the better one! Explain your decision. (2P)* -> Evaluation: **XX/2**

*TASK 2: As always, explain the dataset! (1P)* -> Evaluation: **XX/1**

*TASK 3: Explain what an encoder-decoder structure is! (2P)* -> Evaluation: **XX/2**

*TASK 4: How can we use this structure to solve our task? (1P)* -> Evaluation: **XX/1**

*TASK 5: Create your vocabulary. (2P)* -> Evaluation: **XX/2**

*TASK 6: Create the PyTorch Dataset for the task! (2P)* -> Evaluation: **XX/2**

*TASK 7: Create an encoder-decoder LSTM model! (3P)* -> Evaluation: **XX/3**

*Task 8: Create a training loop for one epoch. (2P)* -> Evaluation: **XX/2**

*TASK 9: Compare the validation accuracy between version 1 and version 2. What is the difference? (1P)* -> Evaluation: **XX/1**

*Task 10: Now write the full training loop with multiple epochs and plot the training loss and validation accuracy. (2P)* -> Evaluation: **XX/2**

*Task 11: Now calculate the test accuracy with version1 and version2 accuracy! (1P)* -> Evaluation: **XX/1**

*Task 12: What are the reasons why a vanilla encoder-decoder RNN is bad in this task? Which modifications could help? (1P)* -> Evaluation: **XX/1**

**Total: XX/20**

# Section 1: Good or Bad Metric?

We start with a more fundamental question: How can we decide which metric is better for a specific task?

Assume we have a translation task. Each translated sentence is judged by a human who speaks both languages. The human gives a rating from 1 to 10.

Now you created a new metric called BestMetric. You want to compare your metric vs. other established metrics, such as BLEU. 

The results are the following:

| Translated Sentence | Human Rating | BLEU | BestMetric |
| ------------------- | ------------ | ---- | ---------- |
| Sentence1           | 8            | 0.82 | 0.80       |
| Sentence2           | 7            | 0.78 | 0.81       |
| Sentence3           | 3            | 0.50 | 0.70       |
| Sentence4           | 9            | 0.95 | 0.90       |
| Sentence5           | 1            | 0.25 | 0.65       |



**Task 1: Decide which metric is the better one! Explain your decision. (2P)**

A better evaluation should be closer to the human evaluation criteria, so we should evaluate the correlation with the human rating to see which one has a higher correlation coefficient.

The following code computes the Pearson correlation coefficient of BLEU and of BestMetric with the human rating.

In [1]:
import numpy as np

Human_Rating = [8, 7, 3, 9, 1]
BLEU_Val = [0.82, 0.78, 0.50, 0.95, 0.25]
BestMetric_Val = [0.80, 0.81, 0.70, 0.90, 0.65]

corr_BLEU = np.corrcoef(BLEU_Val, Human_Rating)[0, 1]
corr_BestMetric = np.corrcoef(BestMetric_Val, Human_Rating)[0, 1]

# Print out the correlation coefficients
print(f"Pearson Correlation Coefficient for BLEU: {corr_BLEU}")
print(f"Pearson Correlation Coefficient for BestMetric: {corr_BestMetric}")

Pearson Correlation Coefficient for BLEU: 0.9914784487581396
Pearson Correlation Coefficient for BestMetric: 0.9650800974075744


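BLEU correlates more strongly with the human ratings (about 0.99 versus about 0.97 for BestMetric), so BLEU is the better metric on this sample. With only five sentences, the gap should be read as indicative rather than conclusive.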
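Pearson correlation measures linear agreement. As a complementary check, one can ask whether each metric orders the sentences the same way the human does, which is what Spearman's rank correlation captures. The sketch below is an illustrative addition, not part of the assignment; it assumes there are no tied values (true for this table), and `ranks` and `spearman` are hypothetical helper names.

```python
# Scores copied from the table above
human = [8, 7, 3, 9, 1]
bleu = [0.82, 0.78, 0.50, 0.95, 0.25]
best_metric = [0.80, 0.81, 0.70, 0.90, 0.65]

def ranks(values):
    """Return the 1-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    """Spearman's rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

rho_bleu = spearman(bleu, human)
rho_best = spearman(best_metric, human)
print(f"Spearman for BLEU: {rho_bleu:.2f}")        # 1.00
print(f"Spearman for BestMetric: {rho_best:.2f}")  # 0.90
```

BLEU reproduces the human ranking exactly, while BestMetric swaps Sentence1 and Sentence2, so the rank-based view points in the same direction as the Pearson result.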

# Section 2: Morphological Inflection Generation

In this section, we will implement a sequence to sequence model to generate inflected forms: The task is to generate the inflected form of a given lemma corresponding to a particular linguistic transformation.

<img src="06_example.png" alt="drawing" width="400"/>

First load the data!

In [2]:
import os

data_dir = "morphological"

# Define the file paths
train_file = os.path.join(data_dir, "german-train-high.txt")
dev_file = os.path.join(data_dir, "german-dev.txt")
test_file = os.path.join(data_dir, "german-uncovered-test.txt")

def read_conll_file(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        current_sentence = []
        for line in file:
            line = line.strip()
            if not line:  # Empty line indicates the end of a sentence
                if current_sentence:
                    data.append(current_sentence)
                    current_sentence = []
            else:
                columns = line.split('\t')
                current_sentence.append(columns)
        data += current_sentence
    return data

# Read data
train_data = read_conll_file(train_file)
dev_data = read_conll_file(dev_file)
test_data = read_conll_file(test_file)

print("Train Data:")
print(f"Number of training samples: {len(train_data)}")
print(f"Example sentences:")
for example in train_data[-2:]:  # Displaying the last two sentences for illustration
    print(f"   {example}")

print("\nDev Data:")
print(f"Number of development samples: {len(dev_data)}")

print("\nTest Data:")
print(f"Number of test samples: {len(test_data)}")

Train Data:
Number of training samples: 10000
Example sentences:
   ['optimieren', 'optimieren', 'V;IND;PRS;3;PL']
   ['sabbeln', 'sabbelt', 'V;IND;PRS;2;PL']

Dev Data:
Number of development samples: 1000

Test Data:
Number of test samples: 1000


**Task 2: As always, explain the dataset! (1P) Refer to https://aclanthology.org/K17-2001.pdf**. Be short

Each row of the dataset has three tab-separated columns:

1. Lemma: the base (dictionary) form of a word.
2. Inflected form: the surface form of the word after inflection.
3. Morphosyntactic features: a bundle of tags separated by `;` (e.g. `V;IND;PRS;2;PL`) that describes the inflection.

Following the referenced paper, the system's task is to predict the inflected form (the second column) for each test item, given the lemma and the feature bundle.
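As a small illustration of this row format, the sketch below splits one row into its three parts. The row is taken from the example output printed above; the variable names are only illustrative.

```python
# One row in the three-column, tab-separated format shown in the example output above
line = "sabbeln\tsabbelt\tV;IND;PRS;2;PL"

lemma, inflected_form, feature_bundle = line.split("\t")
features = feature_bundle.split(";")  # individual morphosyntactic features

print(lemma)           # sabbeln
print(inflected_form)  # sabbelt
print(features)        # ['V', 'IND', 'PRS', '2', 'PL']
```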

We want to solve the task with an encoder-decoder RNN.

**Task 3: Explain what an encoder-decoder structure is! Be short. (2P)**

An encoder-decoder architecture consists of two main components.
- Encoder: reads the input sequence step by step (e.g. with an LSTM or GRU) and compresses it into a fixed-size vector, the context (here the final hidden state). (Encoding)
- Decoder: generates the output sequence token by token, conditioned on the context vector and on the tokens generated so far. (Decoding)

Thus, an encoder-decoder works by encoding an input sequence into a vector of fixed size and then decoding that vector into an output sequence, which may have a different length than the input.

**Task 4: How can we use this structure to solve our task? (1P)**

To achieve our goal in this task (generating inflected forms from lemmas), the encoder-decoder structure can be used as follows.

The encoder reads the characters of the lemma followed by the morphosyntactic features and produces a fixed-size context vector that summarizes both the word and the requested inflection. The decoder starts from this context vector and generates the inflected form character by character until it emits the end token.

Before we can create our dataset and a model, we need to encode all characters AND the _morphosyntactic features_ as indices (our vocabulary). Later, we will then create an embedding for each index - like we did for the last homeworks.

**Task 5: Create your vocabulary. (2P)**

In [3]:
'''

Your code here.

Create a vocabulary over your LOWER-CASED text (character-level) and the UPPER-CASED morphosyntactic features.

For the text (the lemma and Inflected form), we encode them on character-level,
i.e. each character has an index, e.g. "a" -> 25.

For the morphosyntactic features, we encode them directly, i.e. each inflection has an
index, e.g. "IND" -> 13.

PRE-append special tokens to your vocabulary. 
Sort the rest of the vocab alphabetically (already written):
# This is my order: Special tokens + inflection + characters created from the text
characters = sorted(characters)
tags = sorted(tags) 
vocab = special_tokens+tags+characters

'''

# We also have special tokens. <sep> is a separate token:
special_tokens = ['<s>', '</s>', '<sep>']

# To keep it simple, we collect our vocabulary from the train, dev and test dataset
# (named all_data so that the built-in all() is not shadowed)
all_data = train_data + dev_data + test_data

characters = set() # characters holds all lower-cased unique characters that appear in the text
tags = set() # Holds all upper-cased unique morphosyntactic feature (inflection)

for data in all_data:
    characters.update(list(data[0].lower())) #Lemma
    characters.update(list(data[1].lower()))  #Inflected form

    tag_components = data[2].upper().split(';')
    tags.update(tag_components)

# This is my order: Special tokens + inflection + characters created from the text (each group sorted)
characters = sorted(list(characters))
tags = sorted(list(tags))

vocab = special_tokens + tags + characters

# Dictionary mapping from word (character and the inflection) to index
word_to_index = {word: idx for idx, word in enumerate(vocab)}
index_to_word = {idx: word for idx, word in enumerate(vocab)}

In [4]:
assert word_to_index["GEN"] == 8
assert word_to_index["a"] == 24

Now that we have our vocabulary, we can create our PyTorch dataset for this task.

Remember, our model has to learn to do the following: Given ['Reflektion', 'N;ACC;PL'], output: 'Reflektionen'.

How do we construct the input and output?

The input for the model is the lemma and the morphosyntactic features concatenated:
- '\<s\>' + lemma (as indices) + '\<sep\>' + morphosyntactic features (as indices) + '\</s\>'

The output for the model is the inflected form
- '\<s\>' + inflected form (as indices) + '\</s\>'
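To make the two templates above concrete, the sketch below builds both sequences as readable tokens for the `Reflektion` example. It deliberately stops before the index lookup, since the indices depend on the vocabulary; `source_tokens` and `target_tokens` are illustrative names.

```python
lemma, inflected_form, feature_bundle = "Reflektion", "Reflektionen", "N;ACC;PL"

# Input: <s> + lemma characters + <sep> + morphosyntactic features + </s>
source_tokens = (
    ["<s>"] + list(lemma.lower()) + ["<sep>"] + feature_bundle.upper().split(";") + ["</s>"]
)

# Output: <s> + inflected-form characters + </s>
target_tokens = ["<s>"] + list(inflected_form.lower()) + ["</s>"]

print(source_tokens)
print(target_tokens)
```

The input has 16 tokens and the output 14, which matches the lengths of the index tensors checked in the assertions further below.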


**Task 6: Create the PyTorch Dataset for the task! (2P)**

In [5]:
import torch
from torch.utils.data import Dataset
from torch.utils.data import DataLoader


'''
Your Code Here.

Create the PyTorch Dataset for the task!

The input is the lemma and the morphosyntactic features concatenated:
'<s>' + lemma (as indices) + '<sep>' + morphosyntactic features (as indices) + '</s>'

The output is the inflected form
'<s>' + inflected form (as indices) + '</s>'

'''

class InflectionDataset(Dataset):
    def __init__(self, data):
        self.data = data
        
    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        root_form, inflected_form, morphological_info = self.data[idx]

        # Start, end, and separator tokens
        sos_id = [word_to_index['<s>']]
        eos_id = [word_to_index['</s>']]
        sep_id = [word_to_index['<sep>']]

        # Convert characters and morphological_info to numerical representations using word_to_index
        input_root_morph = sos_id + [word_to_index[char] for char in root_form.lower()] + sep_id + [word_to_index[tag] for tag in morphological_info.split(';')] + eos_id
        output_inflected = sos_id + [word_to_index[char] for char in inflected_form.lower()] + eos_id

        # Convert to Input Tensor and Output tensor
        input_root_morph = torch.tensor(input_root_morph, dtype=torch.long)
        output_inflected = torch.tensor(output_inflected, dtype=torch.long)

        return input_root_morph, output_inflected

# Create datasets and loaders
train_dataset = InflectionDataset(train_data)
dev_dataset = InflectionDataset(dev_data)
test_dataset = InflectionDataset(test_data)

batch_size = 1

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
dev_loader = DataLoader(dev_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Example usage:
example_idx = 0
input_sequence, target_sequence = train_dataset[example_idx]

In [6]:
test_dataset = InflectionDataset([['Reflektion', 'Reflektionen', 'N;ACC;PL'], ['Scherz', 'Scherzes', 'N;GEN;SG']])
input_root_morph, output_inflected = test_dataset[0]
if "medium" in train_file:
    print("You are using the medium dataset")
    # ASSERT FOR MEDIUM TRAINING SET
    assert torch.equal(input_root_morph, torch.tensor([0, 41, 28, 29, 35, 28, 34, 43, 32, 38, 37,  2, 11,  6, 14,  1]))
    assert torch.equal(output_inflected, torch.tensor([0, 41, 28, 29, 35, 28, 34, 43, 32, 38, 37, 28, 37,  1]))
elif "high" in train_file:
    print("You are using the high dataset")
    # ASSERT FOR HIGH TRAINING SET
    assert torch.equal(input_root_morph, torch.tensor([0, 41, 28, 29, 35, 28, 34, 43, 32, 38, 37,  2, 11,  6, 14,  1]))
    assert torch.equal(output_inflected, torch.tensor([0, 41, 28, 29, 35, 28, 34, 43, 32, 38, 37, 28, 37,  1]))
else:
    print("low not done")

You are using the high dataset


In [7]:
print("Actual Input:", input_root_morph)
print("Actual Output:", output_inflected)

Actual Input: tensor([ 0, 41, 28, 29, 35, 28, 34, 43, 32, 38, 37,  2, 11,  6, 14,  1])
Actual Output: tensor([ 0, 41, 28, 29, 35, 28, 34, 43, 32, 38, 37, 28, 37,  1])


Now that we have the dataset, we have to create our model.

We will use an encoder-decoder LSTM model WITHOUT attention. This simplifies the architecture, but it is not the best choice for this task (see the last task).

**Task 7: Create an encoder-decoder LSTM model! (3P)**

(See the code for hints)

In [8]:
import torch.nn as nn
import torch.nn.functional as F


'''
Your Code here.


Create an encoder-decoder LSTM model!


Try to go along the comments that I wrote.


'''
class EncoderRNN(nn.Module):
    def __init__(self, input_size, hidden_size, dropout_p=0.1):
        super(EncoderRNN, self).__init__()
        self.hidden_size = hidden_size


        # Embedding Layer
        self.embedding = nn.Embedding(input_size, hidden_size)
       
        # Use LSTM
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)  # inputs arrive as (batch, seq_len, hidden_size)
       
        # Use Dropout
        self.dropout = nn.Dropout(dropout_p)


    def forward(self, input):
        # Get the embeddings for the input
        embedded = self.dropout(self.embedding(input))

        # Initialize initial hidden and cell states
        h0 = torch.zeros(1, input.size(0), self.hidden_size)
        c0 = torch.zeros(1, input.size(0), self.hidden_size)
        
        # Forward through your rnn
        output, (hidden, cell) = self.rnn(embedded, (h0, c0))
        # Return the LSTM state as a (hidden, cell) tuple so the decoder LSTM can start from it
        return output, (hidden, cell)


# Now we create our decoder RNN
class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size, dropout_p=0.1):
        super(DecoderRNN, self).__init__()


        # We will share the embeddings with our encoder!
        self.embedding = None


        # Again define RNN (LSTM Module)
        self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)  # one step at a time: (batch, 1, hidden_size)


        # Now we need a linear layer to classify over our vocabulary
        self.out = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(dropout_p)


    def forward_step(self, input, hidden):
        # This function just does one step of a decoder given the input and hidden (context) vector


        # Get the embeddings for the input
        output = self.dropout(self.embedding(input))


        # Forward through your lstm
        output, hidden = self.rnn(output, hidden)


        # Use your linear layer for classification (keep the logits)
        output = self.out(output.squeeze(1))
        
        return output, hidden


    def forward(self, encoder_outputs, encoder_hidden, target_tensor=None, max_length=50):
        batch_size = encoder_outputs.size(0)


        # We start with the start-of-sentence token
        decoder_input = torch.empty(batch_size, 1, dtype=torch.long).fill_(0)


        # Current context vector
        decoder_hidden = encoder_hidden
        decoder_outputs = []


        # Now we predict step by step the next token
        for i in range(max_length):
            # Use your forward_step to get the output and the hidden (context) vector
            decoder_output, decoder_hidden = self.forward_step(decoder_input, decoder_hidden)


            # Save the output
            decoder_outputs.append(decoder_output)


            # Now it is getting tricky: When we have the ground truth, we
            # do not want to keep predicting the next token based on our predicted output as input
            # we want to use the real output token to predict the next token
            if target_tensor is not None:
                # Case 1: We have the ground truth
                if target_tensor.shape[-1] == i + 1:
                    # This ensures that we stop when we exhausted the GT
                    break
                # Teacher forcing: Feed the target as the next input
                decoder_input = target_tensor[:, i].unsqueeze(1)
           
            else:
                # Case 2: Use its own predictions as the next input
                # based on the decoder_output, get the predicted token as index
                # Hint: no need to calculate the probability, just get the highest logit index
                # as prediction
                _, topi = decoder_output.topk(1)
                decoder_input = topi.detach()  # To save space, use .detach() at the end


            # Check if the end-of-sequence token is predicted
            if decoder_input.item() == 1:  # Assuming 1 is the index for the end token
                break


        # Now, concatenate all decoder_outputs
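        # Each step yields (batch, vocab); stacking gives (batch, seq_len, vocab), as the validation code expects
        decoder_outputs = torch.stack(decoder_outputs, dim=1)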


        return decoder_outputs, decoder_hidden, None  # We return `None` for consistency in the training loop


    def share_embedding(self, encoder):
        # Share the embedding with the encoder
        self.embedding = encoder.embedding

Now that we have our encoder-decoder model, we need to create the training loop and a validation loop.



**Task 8: Create a training loop for one epoch based on your model implementation. Return the training loss. (2P)**

In [9]:
'''
Your code here.

Create a training loop for ONE epoch based on your model implementation.

See the next-next cell to see the full training loop.

Return the training loss.

'''

# Assume the following inputs for your training epoch loop.
def training_epoch(encoder, decoder, train_loader, criterion, encoder_optimizer, decoder_optimizer):
    # Hint: you now have an encoder optimizer AND a decoder optimizer.
    # Both have to be called with .zero_grad() and .step()

    total_loss = 0

    # Set the models to train mode
    encoder.train()
    decoder.train()

    for data in train_loader:
        # Your Code
        # Extract input and target data
        input_data, target_data = data
        
        # Zero the gradients for both optimizers
        encoder_optimizer.zero_grad()
        decoder_optimizer.zero_grad()
        
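        # Forward pass (teacher forcing: the decoder receives the target sequence)
        encoder_outputs, encoder_hidden = encoder(input_data)
        decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_data)
        
        # Calculate the loss over all positions: logits as (N, vocab_size), targets as (N,)
        vocab_size = decoder.out.out_features
        loss = criterion(decoder_outputs.reshape(-1, vocab_size), target_data.reshape(-1))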
        
        # Perform backpropagation and optimization
        loss.backward()
        encoder_optimizer.step()
        decoder_optimizer.step()
        
        # Accumulate the total loss
        total_loss += loss.item()

    # Calculate the average loss for the epoch
    avg_loss = total_loss / len(train_loader)
    
    return avg_loss
    

Now that we have the training loop, we only need the validation loop.

This time, you do not have to write the validation loop yourself. I will give you two validation loops and your task is to tell me what the difference between the two metrics is. Both calculate accuracies, but based on what exactly?

In [10]:
# First Version:
def validate(encoder, decoder, val_loader, criterion, print_frequence=100000):
    encoder.eval()
    decoder.eval()
    #total_loss = 0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():
        for index, data in enumerate(val_loader):
            input_tensor, target_tensor = data

            encoder_outputs, encoder_hidden = encoder(input_tensor)
            decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, target_tensor)

            # Calculate accuracy
            _, predicted = decoder_outputs.max(2)
            correct_predictions += (predicted == target_tensor).sum().item()
            total_samples += target_tensor.size(0) * target_tensor.size(1)

            # Just for printing purposes (ignore)
            if index % print_frequence == 0:
                # Convert tensor to words
                words_input = [index_to_word[idx.item()] for idx in input_tensor[0]]
                sep_index = words_input.index('<sep>')
                input_str = ''.join(words_input[:sep_index]) 
                input_str += '<sep>' + ';'.join(words_input[sep_index+1:-1]) + '</s>'
                words_target = [index_to_word[idx.item()] for idx in target_tensor[0]]
                words_predicted = [index_to_word[idx.item()] for idx in predicted[0]]
                print(f"\nExample {index}")
                print(f"Input: {input_str}")
                print(f"Target: {''.join(words_target)}")
                print(f"Predicted: {''.join(words_predicted)}")

    accuracy = correct_predictions / total_samples
    return accuracy

# Second Version:
def validate_hard(encoder, decoder, val_loader, criterion, print_frequence=10):
    encoder.eval()
    decoder.eval()
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():
        for index, data in enumerate(val_loader):
            input_tensor, target_tensor = data

            encoder_outputs, encoder_hidden = encoder(input_tensor)
            decoder_outputs, _, _ = decoder(encoder_outputs, encoder_hidden, None)
            
            # Calculate accuracy
            _, predicted = decoder_outputs.max(2)
            if predicted.shape == target_tensor.shape and torch.equal(predicted, target_tensor):
                correct_predictions += 1
            total_samples += target_tensor.size(0)

            # Just for printing purposes (ignore)
            if index % print_frequence == 0:
                # Convert tensor to words
                words_input = [index_to_word[idx.item()] for idx in input_tensor[0]]
                sep_index = words_input.index('<sep>')
                input_str = ''.join(words_input[:sep_index]) 
                input_str += '<sep>' + ';'.join(words_input[sep_index+1:-1]) + '</s>'
                words_target = [index_to_word[idx.item()] for idx in target_tensor[0]]
                words_predicted = [index_to_word[idx.item()] for idx in predicted[0]]
                print(f"\nExample {index}")
                print(f"Input: {input_str}")
                print(f"Target: {''.join(words_target)}")
                print(f"Predicted: {''.join(words_predicted)}")

    accuracy = correct_predictions / total_samples
    return accuracy

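def transform_input_tensor_to_string(input_tensor):
    # Convert the first sequence of the batch back into a readable string
    # (same formatting as the printing code in the validation loops above)
    words_input = [index_to_word[idx.item()] for idx in input_tensor[0]]
    sep_index = words_input.index('<sep>')
    input_str = ''.join(words_input[:sep_index])
    input_str += '<sep>' + ';'.join(words_input[sep_index+1:-1]) + '</s>'
    return input_str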

**Task 9: Compare the validation accuracy between version 1 and version 2. What is the difference? (1P)**

Both validation functions measure the accuracy of the model, but they count different things and run the decoder differently.

The first version passes the target sequence to the decoder, so decoding uses teacher forcing: at every step the decoder receives the correct previous token. The accuracy is computed per token, i.e. the number of positions where the prediction matches the target divided by the total number of target tokens. A sequence with one wrong character still gets credit for all its correct characters.

The second version passes `None` instead of the target, so the decoder has to feed its own predictions back in until it produces the end token. A sample only counts as correct if the whole predicted sequence is identical to the target (same length and same tokens), so the accuracy is an exact-match rate per sequence. This is the stricter and more realistic measure, and it is expected to be lower than the token-level accuracy.
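The counting difference can be shown on plain lists of characters. This is a minimal sketch with made-up predictions; `token_accuracy` and `exact_match_accuracy` are hypothetical helpers, and the sketch only illustrates how matches are counted, not the teacher-forcing difference between the two loops.

```python
def token_accuracy(predictions, targets):
    """Share of positions where prediction and target agree (pairs of equal length)."""
    correct = sum(
        p == t
        for pred, tgt in zip(predictions, targets)
        for p, t in zip(pred, tgt)
    )
    total = sum(len(tgt) for tgt in targets)
    return correct / total

def exact_match_accuracy(predictions, targets):
    """Share of samples whose whole predicted sequence equals the target."""
    matches = sum(pred == tgt for pred, tgt in zip(predictions, targets))
    return matches / len(targets)

targets = [list("sabbelt"), list("scherzes")]
predictions = [list("sabbelt"), list("scherzen")]  # second word has one wrong character

print(f"Token-level accuracy: {token_accuracy(predictions, targets):.3f}")        # 0.933
print(f"Exact-match accuracy: {exact_match_accuracy(predictions, targets):.3f}")  # 0.500
```

One wrong character out of fifteen barely moves the token-level score, but it costs a full sample under exact match.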

**Task 10: Now write the full training loop with multiple epochs and plot the training loss and validation accuracy. (2P)**

In [11]:
from torch import optim
import torch.nn.functional as F


'''
Your code here.

Write the full training loop with multiple epochs!

Collect the training loss (per sample) and the validation accuracy. Plot both.

'''

# define encoder, decoder
# I used hidden_size=64
# The embedding and the output layer must cover every index in the vocabulary
input_size = len(vocab)
hidden_size = 64 
# Feel free to adjust
encoder = EncoderRNN(input_size, hidden_size)
decoder = DecoderRNN(hidden_size, input_size)

# Share the embeddings
decoder.share_embedding(encoder)

learning_rate = 0.001
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate)


# One epoch takes 1 minute on my laptop
# Either reduce the size of the model or reduce the data amount (medium training data)
# You can also increase the epochs to get the best model possible
num_epochs = 5

# Lists to store the loss and accuracy for each epoch
train_losses = []
val_accuracies = []

# Define criterion
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    # Call your training_epoch()
    train_loss = training_epoch(encoder, decoder, train_loader, criterion, encoder_optimizer, decoder_optimizer)
    train_losses.append(train_loss)
    # Call my validate() function on the development set
    # I am printing one example per epoch,
    # you can change that if you want in the validation function
    val_accuracy = validate(encoder, decoder, dev_loader, criterion)
    val_accuracies.append(val_accuracy)

    # Use training loss per sample
    print(f'Training Epoch [{epoch + 1}/{num_epochs}], Loss: {train_loss:.4f}')
    print(f'Validation Epoch [{epoch + 1}/{num_epochs}], Accuracy: {val_accuracy:.4f}')

# Plot everything
import matplotlib.pyplot as plt

epochs = range(1, num_epochs + 1)
fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
ax_loss.plot(epochs, train_losses)
ax_loss.set_xlabel('Epoch')
ax_loss.set_ylabel('Training loss (per sample)')
ax_acc.plot(epochs, val_accuracies)
ax_acc.set_xlabel('Epoch')
ax_acc.set_ylabel('Validation accuracy')
plt.tight_layout()
plt.show()

**Task 11: Now calculate the test accuracy with version1 and version2 accuracy! (1P)**

In [None]:
encoder.eval()
decoder.eval()
'''
Your code here.
Calculate the test accuracy with version1 and version2 accuracy!

'''

with torch.no_grad():
    test_accuracy_version_1 = validate(encoder, decoder, test_loader, criterion)
    print(f'Test Accuracy for Version 1: {test_accuracy_version_1:.4f}')

    test_accuracy_version_2 = validate_hard(encoder, decoder, test_loader, criterion)
    print(f'Test Accuracy for Version 2: {test_accuracy_version_2:.4f}')

    
    # Your Code

**Task 12: What are the reasons why a vanilla encoder-decoder RNN is bad in this task? Which modifications could help? (1P)**

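A vanilla encoder-decoder RNN has to squeeze the whole input (all characters of the lemma plus the feature tags) into a single fixed-size context vector. The longer the sequence, the more of the early information is lost by the time the decoder needs it; this is the long-term dependency problem. It hurts in this task in particular because most of the inflected form is a character-by-character copy of the lemma, so the decoder needs precise access to individual input characters. To compensate, we can use attention: at every decoding step the decoder looks back at all encoder outputs and gives the highest weight to the input positions that are most relevant to the character it is about to generate, instead of relying on one summary vector. A bidirectional encoder or a copy mechanism could help as well.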
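The idea behind attention can be sketched in a few lines of plain Python. The example below assumes simple dot-product scoring and uses toy two-dimensional vectors; it is not the model from this homework, and `attention` and `dot` are hypothetical helper names.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(decoder_state, encoder_states):
    """Dot-product attention: softmax over similarity scores, then a weighted sum."""
    scores = [dot(decoder_state, h) for h in encoder_states]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(decoder_state)
    context = [
        sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)
    ]
    return weights, context

# Toy encoder outputs for three input positions and one decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [2.0, 0.0]

weights, context = attention(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

The first input position is most similar to the decoder state and therefore receives the largest weight. The context vector is recomputed at every decoding step, so the decoder is no longer limited to one fixed summary of the input.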