# Homework 9: Transformer Models
#### Introduction to Natural Language Processing

* Hyerin, Seo. (hyseo@students.uni-mainz.de)
* Yeonwoo, Nam. (yeonam@students.uni-mainz.de)
* Yevin, Kim. (kyevin@students.uni-mainz.de)

You can reach 20 points on this homework.

In this assignment, we'll be using transformer models to tackle the inflection task from Homework 06. We'll make use of the widely used Hugging Face library (https://huggingface.co). Don't worry, this assignment serves as a simple introduction to this new framework.

If you have questions, you can reach out via mail: minhducbui@uni-mainz.de

# Section 1: Downloading Huggingface and Playing around

In this section, we'll first download huggingface and get familiar with the framework.

## Installing Everything

Execute the following cells. If that doesn't work, run the same commands in your terminal.

In [1]:
!pip install transformers



In [2]:
!pip install accelerate -U



In [3]:
!pip install datasets



In [4]:
!pip install sentencepiece



You might have to restart your kernel after installing everything!

## T5 Model Playing Around

For this task, we are using the T5 Model: https://arxiv.org/abs/1910.10683

Before we start with the task, we have to download the model!

In [5]:
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

**Task 1: Describe the pre-training task of T5! (2P)**

T5 stands for "Text-to-Text Transfer Transformer," and it is focused on addressing natural language processing problems by framing them as "Text-to-Text" problems. This involves training the model to handle all natural language processing (NLP) tasks by transforming them into a text input to text output format.

T5 is pre-trained with a span-corruption (denoising) objective on unlabeled text. Random contiguous spans of the input (about 15% of the tokens in the paper's setup) are each replaced by a unique sentinel token such as <extra_id_0>, and the model is trained to generate the dropped spans, each preceded by its sentinel. In other words, T5 learns to reconstruct the missing parts of a text from the surrounding context, which forces it to model that context and yields representations that transfer to downstream tasks.
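
To make the input/target format concrete, here is a minimal sketch of span corruption on whitespace-separated words. It is an illustration only: the helper name `span_corrupt` and the hand-picked spans are assumptions, whereas real T5 pre-training samples the spans randomly and works on SentencePiece tokens.

```python
def span_corrupt(tokens, spans):
    """Replace each (start, end) span with a sentinel and build the target.

    `spans` are sorted, non-overlapping, half-open index ranges
    (hand-picked here; T5 samples them at random).
    """
    corrupted, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted.extend(tokens[cursor:start])  # keep the unmasked text
        corrupted.append(sentinel)              # one sentinel per dropped span
        target.append(sentinel)
        target.extend(tokens[start:end])        # the span the model must restore
        cursor = end
    corrupted.extend(tokens[cursor:])
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel ends the target
    return " ".join(corrupted), " ".join(target)


source, target = span_corrupt("The cute dog walks in the park".split(), [(1, 3), (5, 6)])
print(source)  # The <extra_id_0> walks in <extra_id_1> park
print(target)  # <extra_id_0> cute dog <extra_id_1> the <extra_id_2>
```

The two printed strings are exactly the input and label strings used in the code snippet of Task 2.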

**Task 2: Explain the following Code Snippet! Explain all steps. (2P)**

In [6]:
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# training
input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits

# inference
input_ids = tokenizer(
    "summarize: studies have shown that owning a dog is good for you", return_tensors="pt"
).input_ids  # Batch size 1
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

studies have shown that owning a dog is good for you.


tokenizer = AutoTokenizer.from_pretrained("t5-small")
`AutoTokenizer.from_pretrained("t5-small")` loads the tokenizer that belongs to the "t5-small" checkpoint; `AutoTokenizer` selects the matching tokenizer class automatically from the checkpoint's configuration. The tokenizer converts text into the token IDs that the model expects as input.


model = T5ForConditionalGeneration.from_pretrained("t5-small")
An instance of T5ForConditionalGeneration is created and initialized with the pre-trained weights of the "t5-small" checkpoint. This class is the T5 encoder-decoder with a language-modeling head on top, which is what is needed to generate text conditioned on an input.

input_ids = tokenizer("The <extra_id_0> walks in <extra_id_1> park", return_tensors="pt").input_ids
labels = tokenizer("<extra_id_0> cute dog <extra_id_1> the <extra_id_2>", return_tensors="pt").input_ids
The input text and the labels are tokenized. The input contains the sentinel tokens <extra_id_0> and <extra_id_1>, special tokens that each mark a span that was masked out. The label sequence lists, after each sentinel, the text that was removed at that position ("cute dog" and "the") and ends with the closing sentinel <extra_id_2>. This mirrors the format of T5's pre-training task. The argument return_tensors="pt" makes the tokenizer return PyTorch tensors.


outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss
logits = outputs.logits
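The input tokens (input_ids) and target labels (labels) are passed to the model in a single forward pass. Because labels are provided, the model returns the loss (the cross-entropy between its predictions and the labels) along with the logits, the raw unnormalized scores over the vocabulary for each target position. During training, this loss would be minimized by adjusting the model parameters so that the outputs move closer to the target labels; the snippet itself only computes the loss and does not update any weights.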
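
As a rough illustration of how a loss value can be derived from logits and labels, the following framework-free sketch computes a token-level cross-entropy. The function name `cross_entropy` and the toy numbers are assumptions; the model computes its loss internally with PyTorch, and the `ignore_index=-100` convention shown here is the one PyTorch's cross-entropy loss uses by default.

```python
import math


def cross_entropy(logits, labels, ignore_index=-100):
    """Mean negative log-likelihood of `labels` under per-position `logits`.

    `logits` is a list of score lists (one per target position),
    `labels` a list of class indices; positions labeled `ignore_index` are skipped.
    """
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue
        top = max(scores)  # subtract the maximum for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[label]  # -log softmax(scores)[label]
        counted += 1
    return total / counted if counted else 0.0


# Toy example with a vocabulary of 3 tokens and 3 target positions
toy_logits = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.0, 1.0]]
toy_labels = [0, 2, -100]  # the last position is ignored
toy_loss = cross_entropy(toy_logits, toy_labels)
print(f"loss = {toy_loss:.4f}")
```

The loss is small when the highest logit at each position belongs to the label token, and it grows as probability mass moves to other tokens.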


input_ids = tokenizer("summarize: studies have shown that owning a dog is good for you", return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_new_tokens=50)
For inference, a new input is tokenized; the prefix "summarize:" tells T5 which task to perform. The generate method then produces the output tokens one after another, and max_new_tokens=50 caps the number of generated tokens at 50.


print(tokenizer.decode(outputs[0], skip_special_tokens=True))
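The generated token IDs are decoded into a human-readable string; skip_special_tokens=True drops special tokens (such as padding, end-of-sequence, or sentinel tokens like <extra_id_0>) from the result. The decoded text is then printed. Here the "summary" is essentially the input sentence with a final period, presumably because the input is already a single short sentence.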


# Section 2: Training T5!

We will use a T5 model for the inflection task in Homework 06. You'll notice that training is significantly simplified with this framework.

Load the dataset from Homework 06!

**IMPORTANT:** This assignment involves computationally intensive tasks. We strongly recommend testing your code on a small subset first. You may also submit the homework with a minimal amount of data if needed. For that, just take a subset of the training/test data.

In [7]:
import os

data_dir = "morphological"

# Define the file paths
train_file = os.path.join(data_dir, "german-train-medium.txt")
dev_file = os.path.join(data_dir, "german-dev.txt")
test_file = os.path.join(data_dir, "german-uncovered-test.txt")

def read_conll_file(file_path):
    data = []
    with open(file_path, 'r', encoding='utf-8') as file:
        current_sentence = []
        for line in file:
            line = line.strip()
            if not line:  # Empty line indicates the end of a sentence
                if current_sentence:
                    data.append(current_sentence)
                    current_sentence = []
            else:
                columns = line.split('\t')
                current_sentence.append(columns)
        # Rows after the last blank line (all rows, if the file contains none) are added individually
        data += current_sentence
    return data

# Read data
train_data = read_conll_file(train_file)
dev_data = read_conll_file(dev_file)
# We are going to reduce the amount of dev data
dev_data = dev_data[:50]
test_data = read_conll_file(test_file)

print("Train Data:")
print(f"Number of training samples: {len(train_data)}")
print(f"Example sentences:")
for example in train_data[-2:]:  # Displaying the last two sentences for illustration
    print(f"   {example}")

print("\nDev Data:")
print(f"Number of development samples: {len(dev_data)}")

print("\nTest Data:")
print(f"Number of test samples: {len(test_data)}")

Train Data:
Number of training samples: 1000
Example sentences:
   ['Reflektion', 'Reflektionen', 'N;ACC;PL']
   ['Scherz', 'Scherzes', 'N;GEN;SG']

Dev Data:
Number of development samples: 50

Test Data:
Number of test samples: 1000
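
The recommendation to test on a small subset first could be followed with a helper like the one below. This is a sketch under assumptions: the name `take_subset` and the fixed seed are hypothetical choices, and plain slicing (as done for `dev_data` above) works just as well if the file order does not matter.

```python
import random


def take_subset(samples, size, seed=0):
    """Return a reproducible random subset with at most `size` samples."""
    if size >= len(samples):
        return list(samples)
    # A dedicated Random instance keeps the subset stable across runs
    return random.Random(seed).sample(samples, size)


# Toy data in the same [lemma, inflected form, features] layout as above
toy_rows = [
    ["Reflektion", "Reflektionen", "N;ACC;PL"],
    ["Scherz", "Scherzes", "N;GEN;SG"],
]
toy_subset = take_subset(toy_rows, 1)
print(len(toy_subset))  # 1
```

Applied to the real data this would read, for example, `train_data = take_subset(train_data, 200)`, where 200 is an arbitrary illustrative size.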


Download T5 Sequence to Sequence model.

In [8]:
from itertools import chain

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset

model_name = "t5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name, max_length=64)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, n_positions=64)



**Task 3: Explain the following Code. What are input_ids and attention_mask? How can the attention_mask be useful? (2P)**

In [9]:
example = train_data[0]
input_tokenized = tokenizer(example[0] + " " + example[2], return_tensors='pt')
print(input_tokenized)

output_tokenized = tokenizer(example[1], return_tensors='pt')
print(output_tokenized)

{'input_ids': tensor([[30926,   445,   117,  4296,   382,   117,  9945,     1]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
{'input_ids': tensor([[30926,     1]]), 'attention_mask': tensor([[1, 1]])}


example = train_data[0]
`train_data` is the list of examples read above; this line takes its first example.


input_tokenized = tokenizer(example[0] + " " + example[2], return_tensors='pt')
print(input_tokenized)
The variable `input_tokenized` holds the tokenized model input. Each example is a triple: `example[0]` is the lemma (base form), `example[1]` is the target inflected form, and `example[2]` is the string of morphological features (for instance `N;ACC;PL`). The lemma and the features are joined with a space, and the resulting string is tokenized. The argument `return_tensors='pt'` makes the tokenizer return PyTorch tensors. The printed output is the tokenized input, containing `input_ids` and `attention_mask`.

output_tokenized = tokenizer(example[1], return_tensors='pt')
print(output_tokenized)
The variable `output_tokenized` stores the tokenized result for the target (output) text. Similarly, the printed output includes the tokenized representation of the output text, containing `input_ids` and `attention_mask`.

input_ids
It is the sequence of token IDs that represents the tokenized text. The tokenizer splits the text into subword tokens and maps each token to its numerical ID in the model's vocabulary. The trailing ID 1 in both outputs is the end-of-sequence token that the T5 tokenizer appends. This tensor is what the model uses as input during training or inference.

attention_mask
It is a binary tensor with the same shape as `input_ids` that tells the model which positions hold real tokens and which hold padding: 1 marks a real token that should be attended to, 0 marks a padding token that should be ignored. In the example above only a single sequence is tokenized, so nothing is padded and the mask consists of ones only.

Reason why attention_mask is useful
Sequences in a batch usually have different lengths, so the shorter ones are padded to a common length before they can be stacked into one tensor. The `attention_mask` sets the padded positions to 0, so the attention layers ignore them and the padding does not change the representation of the real tokens. This is what makes it possible to process input sequences of varying lengths together in one batch.
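
The relationship between padding and the mask can be sketched without the library. The helper `pad_batch` below is hypothetical (the tokenizer does this itself when called with `padding=True` on several texts), and `pad_id=0` is assumed to be T5's padding ID. The token IDs are the ones printed above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to equal length and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return {"input_ids": input_ids, "attention_mask": attention_mask}


padded = pad_batch([
    [30926, 445, 117, 4296, 382, 117, 9945, 1],
    [30926, 1],
])
print(padded["input_ids"][1])       # [30926, 1, 0, 0, 0, 0, 0, 0]
print(padded["attention_mask"][1])  # [1, 1, 0, 0, 0, 0, 0, 0]
```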


**Task 4: Write a PyTorch dataset for our task. (3P)**

In [10]:
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, tokenized_data, tokenizer):
        # Store the tokenizer and the data (rows of [lemma, inflected form, features])
        self.tokenizer = tokenizer
        self.tokenized_data = tokenized_data

    def __len__(self):
        # Return the total number of examples in the dataset
        return len(self.tokenized_data)

    def __getitem__(self, idx):
        # Build the input and target text for the example at this index
        input_text = self.tokenized_data[idx][0] + " " + self.tokenized_data[idx][2]
        output_text = self.tokenized_data[idx][1]

        # Tokenize the input and target text and convert them to tensors
        input_tokenized = self.tokenizer(input_text, return_tensors='pt', padding=True, truncation=True)
        output_tokenized = self.tokenizer(output_text, return_tensors='pt', padding=True, truncation=True)

        # Return the example as a dictionary of PyTorch tensors
        return {
            "input_ids": input_tokenized['input_ids'].squeeze(),
            "labels": output_tokenized['input_ids'].squeeze()
        }


# Create the PyTorch datasets
train_dataset = CustomDataset(train_data, tokenizer)
dev_dataset = CustomDataset(dev_data, tokenizer)
test_dataset = CustomDataset(test_data, tokenizer)


In [11]:
assert torch.equal(dev_dataset[0]["input_ids"], torch.tensor([26801,   445,   117, 14775,   117,  5329,     1]))
assert torch.equal(dev_dataset[0]["labels"], torch.tensor([26801,    35,     1]))

**Task 5: Explain the questions inside the relevant code snippets. (3P)**

In [12]:
from datasets import load_metric


# (1) What is this function calculating?
def compute_metrics(preds):
    output, labels = preds
    logits = output[0]
    logits = torch.tensor(logits)
    labels = torch.tensor(labels)
    # (2) What is this step doing?
    predictions = logits.argmax(-1)

    # (3) Why do you think we had -100 inside our labels? 
    # This is done by the framework.
    labels = torch.where(labels != -100, labels, torch.tensor(0))

    correct_sequences = torch.all(predictions == labels, dim=1)
    accuracy = torch.mean(correct_sequences.float())
    return {"accuracy": accuracy.item()}


# (4) What is the trainer replacing compared to our normal PyTorch model training? 
# Name at least three components
training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    num_train_epochs=1,
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,  # Number of steps before evaluation
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    compute_metrics=compute_metrics
)

(1) This function calculates sequence-level (exact-match) accuracy: the fraction of evaluation examples for which every predicted token equals the corresponding label token. It unpacks the model outputs and the true labels from `preds`, takes the logits from the outputs, and picks the highest-scoring token at each position as the prediction. The line `correct_sequences = torch.all(predictions == labels, dim=1)` then marks a sequence as correct only if all of its positions match, and the mean over these flags is the accuracy.

(2) The step `predictions = logits.argmax(-1)` takes the argmax over the last dimension, the vocabulary. For every position of every sequence it selects the token ID with the highest logit, which is also the token with the highest predicted probability, since softmax preserves the ordering of the scores.

(3) The value -100 marks label positions that should be ignored. It is used for positions, such as padding, that must not contribute to the loss: PyTorch's cross-entropy loss skips targets equal to its `ignore_index`, whose default is -100, so these positions do not influence the model's training. Since -100 is not a valid token ID, `compute_metrics` replaces it with 0 (T5's padding ID) before comparing predictions with labels.

(4) Compared to a hand-written PyTorch training loop, `Seq2SeqTrainer` takes over several components. First, data loading: it builds the data loaders, including batching, from the datasets we pass in. Second, the training loop itself: forward pass, backward pass, parameter updates, and moving batches to the right device. Third, the optimizer and the learning-rate schedule, which it creates from the training arguments. In addition, it handles periodic evaluation (calling the function we provide through `compute_metrics`), logging, and checkpoint saving.
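
The logic described in (1) to (3) can be traced on toy numbers without PyTorch. The sketch below is an assumption-laden re-implementation for illustration, not the `compute_metrics` function itself: the names `argmax` and `sequence_accuracy` are hypothetical, the vocabulary has only 3 tokens, and `pad_id=0` stands in for T5's padding ID.

```python
def argmax(scores):
    """Index of the largest score (the predicted token ID)."""
    return max(range(len(scores)), key=scores.__getitem__)


def sequence_accuracy(batch_logits, batch_labels, ignore_index=-100, pad_id=0):
    """Fraction of sequences whose predicted tokens all match the labels."""
    correct = 0
    for logits, labels in zip(batch_logits, batch_labels):
        predictions = [argmax(step) for step in logits]      # step (2)
        labels = [pad_id if lab == ignore_index else lab     # step (3)
                  for lab in labels]
        correct += predictions == labels                     # all positions must match
    return correct / len(batch_labels)


toy_batch_logits = [
    [[0.1, 2.0, 0.3], [0.0, 0.1, 5.0]],  # predicts tokens [1, 2]
    [[1.0, 0.2, 0.1], [3.0, 0.0, 0.0]],  # predicts tokens [0, 0]
]
toy_batch_labels = [[1, 2], [1, -100]]
print(sequence_accuracy(toy_batch_logits, toy_batch_labels))  # 0.5
```

The second sequence counts as wrong although its ignored position matches, because a single mismatching token is enough to fail the exact-match test.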


In [13]:
# Execute the training here! This can take a while, scale down the data if necessary.

trainer.train()

Step,Training Loss,Validation Loss,Accuracy
100,2.4206,1.747426,0.32
200,2.0821,1.545658,0.32
300,1.504,1.484275,0.32
400,1.828,1.415549,0.32
500,1.5969,1.397326,0.32
600,1.4051,1.341447,0.32
700,1.7057,1.319453,0.32
800,1.9848,1.238837,0.32
900,1.5336,1.23317,0.32
1000,1.6474,1.225518,0.32


Checkpoint destination directory ./output\checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./output\checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1000, training_loss=1.7708260498046875, metrics={'train_runtime': 609.9735, 'train_samples_per_second': 1.639, 'train_steps_per_second': 1.639, 'total_flos': 3212253069312.0, 'train_loss': 1.7708260498046875, 'epoch': 1.0})
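
For comparison with the single `trainer.train()` call, the sketch below shows the pieces a hand-written loop has to provide: shuffling and batching, the forward pass and loss gradient, the parameter update, and periodic evaluation. It is a deliberately tiny, framework-free toy (fitting y = w * x with mini-batch SGD); the function `train_toy` and all hyperparameters are assumptions and have nothing to do with the Trainer's actual implementation.

```python
import random


def train_toy(data, epochs=20, batch_size=2, lr=0.05, eval_every=5, seed=0):
    """Fit y = w * x with mini-batch SGD; returns the weight and an evaluation log."""
    rng = random.Random(seed)
    w, step, log = 0.0, 0, []
    for _ in range(epochs):
        rng.shuffle(data)                          # data loading: shuffle ...
        for i in range(0, len(data), batch_size):  # ... and split into batches
            batch = data[i:i + batch_size]
            # forward pass and gradient of the mean squared error w.r.t. w
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad                         # optimizer step (plain SGD)
            step += 1
            if step % eval_every == 0:             # periodic evaluation and logging
                mse = sum((w * x - y) ** 2 for x, y in data) / len(data)
                log.append((step, mse))
    return w, log


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w, eval_log = train_toy(list(points))
print(f"learned w = {w:.3f} after {eval_log[-1][0]} steps")
```

With `Seq2SeqTrainer`, each of these commented steps is configured through `Seq2SeqTrainingArguments` instead of being written by hand.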

**Task 6: Now write your own evaluation (accuracy) on the test set using PyTorch style. (3P)**

In [21]:
# Batch size and logging interval
batch_size = 1
log_interval = 10
# Use the device the trained model currently lives on
device = model.device
# Test data loader
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

# Switch the model to evaluation mode and reset the running score
model.eval()
total_correct = 0

# Evaluation loop
with torch.no_grad():
    for idx, batch in enumerate(test_dataloader):
        # Move the input to the model's device
        input_ids = batch["input_ids"].to(device)

        # Generate the prediction with the trained model
        outputs = model.generate(input_ids, max_length=50)  # Adjust max_length as needed

        # Decode the generated token IDs into text
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        # Re-encode the text, truncated or padded to the length of the label sequence
        generated_text_ids = tokenizer.encode(generated_text, return_tensors="pt", truncation=True, padding="max_length", max_length=len(batch["labels"][0])).to(device)

        # Compare prediction and label position by position.
        # Note: this is the fraction of matching token positions, not an exact match.
        labels = batch["labels"].to(device)
        correct_sequences = generated_text_ids == labels
        accuracy = torch.mean(correct_sequences.float()).item()
        total_correct += accuracy

        # Log the progress at regular intervals
        if (idx + 1) % log_interval == 0 or (idx + 1) == len(test_dataloader):
            print(f"Processed {idx+1}/{len(test_dataloader)} examples - Correct: {total_correct}")

# Overall accuracy (mean of the per-example token accuracies)
overall_accuracy = total_correct / len(test_dataloader)
print(f"Overall Accuracy: {overall_accuracy * 100:.2f}%")

Processed 10/1000 examples - Correct: 5.169047713279724
Processed 20/1000 examples - Correct: 12.096825487911701
Processed 30/1000 examples - Correct: 20.203968413174152
Processed 40/1000 examples - Correct: 26.192857317626476
Processed 50/1000 examples - Correct: 33.492857329547405
Processed 60/1000 examples - Correct: 39.39047644287348
Processed 70/1000 examples - Correct: 46.434920974075794
Processed 80/1000 examples - Correct: 52.98015909641981
Processed 90/1000 examples - Correct: 60.33015915006399
Processed 100/1000 examples - Correct: 68.95873060077429
Processed 110/1000 examples - Correct: 77.44920682162046
Processed 120/1000 examples - Correct: 86.34920685738325
Processed 130/1000 examples - Correct: 94.26587354391813
Processed 140/1000 examples - Correct: 101.32301642745733
Processed 150/1000 examples - Correct: 108.48968317359686
Processed 160/1000 examples - Correct: 117.08968322724104
Processed 170/1000 examples - Correct: 123.8896832242608
Processed 180/1000 examples - Co
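
The loop above adds up per-token agreement, which is why the running "Correct" value is fractional. A stricter word-level alternative is to compare the decoded strings directly. The sketch below assumes the predictions and references have already been decoded into two lists of strings; the helper name `exact_match_accuracy` and the wrong prediction "Scherze" are made up for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(pred.strip() == ref.strip() for pred, ref in zip(predictions, references))
    return hits / len(references)


# Hypothetical model outputs compared against two gold forms from the training data
print(exact_match_accuracy(["Reflektionen", "Scherze"], ["Reflektionen", "Scherzes"]))  # 0.5
```

Under this metric a prediction that is off by one character counts as fully wrong, which matches how the inflection task is usually scored.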

**Task 7: Write a function, where I can input a word (as a string) with its inflection features, and it outputs the model's prediction as a string! (3P)**

In [23]:
def model_prediction_as_word(input_str, inflection_features, model, tokenizer):
    # Combine the input word and its inflection features
    input_text = f"{input_str} {inflection_features}"

    # Tokenize the input text and move it to the model's device
    input_ids = tokenizer(input_text, return_tensors='pt').input_ids.to(model.device)

    # Generate the prediction with the trained model
    outputs = model.generate(input_ids, max_length=50)  # Adjust max_length as needed

    # Decode the generated token IDs and return the text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

**Task 8: Examine the model errors. What is your hypothesis for why the model is not perfect? (2P)**

One likely reason for the imperfect performance is the small amount of training. The model saw only 1000 examples for a single epoch, and the validation loss was still decreasing at the end of training (from about 1.75 at step 100 to about 1.23 at step 1000), which suggests that it had not converged. Language is complex, and inflection in particular follows many different patterns that 1000 examples may not cover well. In addition, the tokenizer splits words into subword pieces that do not necessarily align with stems and endings, so rare vocabulary and inflectional suffixes may be represented poorly. This limitation in handling fine-grained linguistic structure could contribute to the low accuracy observed.
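
One way to test such a hypothesis is to break the errors down by morphological feature. The sketch below is an assumption about how this could be done: `errors_by_feature` is a hypothetical helper that accepts any prediction function, and `add_en` is a toy stand-in for the model that simply appends "-en" to the lemma.

```python
from collections import Counter


def errors_by_feature(examples, predict):
    """Error rate per feature tag; `predict` maps (lemma, features) to a string."""
    errors, totals = Counter(), Counter()
    for lemma, target, features in examples:
        wrong = predict(lemma, features) != target
        for tag in features.split(";"):
            totals[tag] += 1
            errors[tag] += wrong  # True counts as 1, False as 0
    return {tag: errors[tag] / totals[tag] for tag in totals}


def add_en(lemma, features):
    """Toy stand-in for the model: always append the suffix "-en"."""
    return lemma + "en"


rates = errors_by_feature(
    [("Reflektion", "Reflektionen", "N;ACC;PL"), ("Scherz", "Scherzes", "N;GEN;SG")],
    add_en,
)
print(rates["PL"], rates["GEN"])  # 0.0 1.0
```

With the real model, `predict` could be `lambda lemma, feats: model_prediction_as_word(lemma, feats, model, tokenizer)`; tags with a markedly higher error rate would then point to the inflection patterns the model has not learned.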