# Homework 11: Syntactic Parsing
#### Introduction to Natural Language Processing

* Yevin, Kim. (kyevin@students.uni-mainz.de)
* Yeonwoo, Nam. (yeonam@students.uni-mainz.de)
* Hyerin, Seo. (hyseo@students.uni-mainz.de)

You can reach 20 points on this homework.

In this homework, we will revisit Homework 05: Syntactic Constituency Parsing. In homework 05, we solved the task by transforming it into a sequence labeling task and training RNN models. In this homework, we replace the RNN with a transformer model!

If you have questions, you can reach out via mail: minhducbui@uni-mainz.de

# Evaluation

*Task 1:* Explain why there are more tokens than labels! Give two reasons. XX/2

*Task 2:* Explain why this is an issue for our task. -> Evaluation: XX/2

*Task 3:* Think about other transformations (similar to mine) to solve the task. -> Evaluation: XX/2

*Task 4:* Create the corresponding dataset for my solution! -> Evaluation: XX/4

*Task 5:* Now initialize the pre-trained model "distilbert-base-uncased". Use DistilBertForTokenClassification. -> Evaluation: XX/2

*Task 6:* Write the evaluation function which calculates the accuracy per token during the training loop of the Trainer. XX/2

*Task 7:* Calculate the token accuracy for the test set! -> Evaluation: XX/2

*Task 8:*  Train a non-pre-trained distilbert! Then test the fine-tuned model. -> Evaluation: XX/2

*Task 9:* What are your results? Explain why pre-training helps or does not help on this task. -> Evaluation: XX/2


**Total: XX/20**

# Load the Data

In homework 05, we converted the Syntactic Constituency Parsing task to a sequence labeling task. Execute the following scripts:

In [4]:
!pip install -U scikit-learn



In [6]:
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\kimye\AppData\Roaming\nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

In [1]:
import nltk
from nltk.corpus import treebank
    
# Load the parsed sentences of the Penn Treebank sample
treebank_trees = treebank.parsed_sents()

# Get an example
sorted_treebank_trees = sorted(treebank_trees, key=lambda tree: len(tree.leaves()))
test_example = sorted_treebank_trees[31].copy()

# Function to get the common ancestors and their count using the relative scale
def relative_scale_encoding(tree):
    # Initialize an empty list to store the relative scale encoding for each word pair
    result = []

    # Save the previous number of common ancestors
    prev_common_ancestors = 0

    # Iterate over each pair of consecutive leaves in the tree
    for i in range(len(tree.leaves()) - 1):
        # Extract the current pair of words
        word1, word2 = tree.leaves()[i], tree.leaves()[i + 1]

        # Get the tree positions of the leaves for both words
        # E.g. (1, 0, 0) and (1, 1, 0)
        path1 = tree.leaf_treeposition(i)
        path2 = tree.leaf_treeposition(i + 1)

        # Find the common path between the tree positions
        # E.g. (1, 0, 0) and (1, 1, 0) -> [1]
        common_path = []
        for j, (p1, p2) in enumerate(zip(path1, path2)):
            if p1 == p2:
                common_path.append(p1)
            else:
                break

        # Extract the common ancestor of the current word pair
        common_ancestor = tree[common_path]

        # Number of common ancestors: depth of the lowest common ancestor + 1 (for the root)
        common_ancestors_count = len(common_path) + 1

        # Calculate the relative scale by subtracting the previous common ancestors count
        relative_scale = common_ancestors_count - prev_common_ancestors

        # Append the result for the current word pair as a list of [relative_scale, common_ancestor_label]
        result.append([str(relative_scale), common_ancestor.label()])

        # Update the previous common ancestors count for the next iteration
        prev_common_ancestors = common_ancestors_count

    # Return the list of relative scale encodings for each word pair
    return result


# Get the common ancestors and their count for each pair of adjacent words
encoding = relative_scale_encoding(test_example)

# Print the transformed tree in the specified format
print("Original Tree:")
print(test_example)
print("\nTransformed Tree:")
for i, (n_i, c_i) in enumerate(encoding, start=1):
    print(f"  (w{i} ({n_i}, {c_i}))")


Original Tree:
(S (NP-SBJ (DT This)) (VP (VBZ is) (NP-PRD (NNP Japan))) (. ?))

Transformed Tree:
  (w1 (1, S))
  (w2 (1, VP))
  (w3 (-1, S))
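
To make the encoding concrete, the numbers above can be reproduced by hand from the leaf positions alone. The following is a minimal sketch that does not use NLTK. The helper name `relative_scales` is hypothetical, and the hard-coded positions are an assumption: they are what `tree.leaf_treeposition(i)` should return for the example tree.

```python
def relative_scales(leaf_positions):
    """Return the relative-scale value for each pair of adjacent leaves."""
    scales = []
    prev_count = 0
    for path1, path2 in zip(leaf_positions, leaf_positions[1:]):
        # Length of the shared prefix = depth of the lowest common ancestor
        depth = 0
        for p1, p2 in zip(path1, path2):
            if p1 != p2:
                break
            depth += 1
        count = depth + 1  # +1 counts the root node
        scales.append(count - prev_count)
        prev_count = count
    return scales


# Assumed leaf positions of "This", "is", "Japan", "?" in the example tree
positions = [(0, 0, 0), (1, 0, 0), (1, 1, 0, 0), (2, 0)]
print(relative_scales(positions))  # [1, 1, -1]
```

The three values match the first components of the transformed tree: "This"/"is" share only the root S, "is"/"Japan" additionally share VP, and "Japan"/"?" share only the root again.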


In [2]:
# You might have to install sklearn first. Use the following command and execute it in another cell:
# !pip install -U scikit-learn
from sklearn.preprocessing import LabelEncoder

sentences = []
# sentences = Holds the sentence of the tree

encodings = []
# encodings = Holds the label pairs with a dummy variable at the end
# Hint: I would encode the labels with the dummy variable as ['-1', 'DUMMY']
# For our test example: [['1', 'S'], ['1', 'VP'], ['-1', 'S'], ['-1', 'DUMMY']]

class_labels = []
# labels = Holds the class indices, e.g. [768, 829, 165, 72] for one sentence (each token has one)
# Hint: I would transform each pair to a string by joining them with "_"
# E.g. '1_S', '1_VP', '-1_S', '-1_DUMMY', ...
# Collect all labels and fit your LabelEncoder on all labels (use .fit() method)
# Then transform each example with your LabelEncoder (use .transform() method)
# For our Test Example, I got: [768 829 165 72]


# Sentences
sentences = [tree.leaves() for tree in treebank_trees]

# Encodings
for tree in treebank_trees:
    encoding = relative_scale_encoding(tree)
    # Add a dummy target for the last token
    encoding.append(["-1", "DUMMY"])
    encodings += [encoding]

# Class Labels
def transform_to_string(encoding):
    return ["_".join(pair) for pair in encoding]
label_encoder = LabelEncoder()
targets = [transform_to_string(encoding) for encoding in encodings]
flattened = []
for target in targets:
    flattened += target
label_encoder.fit(flattened)
class_labels = [label_encoder.transform(transform_to_string(encoding)) for encoding in encodings]

We have now converted the dataset into two components: _sentences_, which holds the words of each tree, and _class_labels_, which holds one label per word in the sentence.
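
The label encoding step can be illustrated without scikit-learn. The sketch below only mimics the behaviour relied on above (each distinct label string gets an integer index, assigned in sorted order); `fit_label_index` is a hypothetical helper, not part of any library, and the toy data contains just the four labels of the test example.

```python
def fit_label_index(label_sequences):
    """Map every distinct label string to an integer index (sorted order)."""
    classes = sorted({label for seq in label_sequences for label in seq})
    return {label: index for index, label in enumerate(classes)}


toy_targets = [["1_S", "1_VP", "-1_S", "-1_DUMMY"]]
label_to_index = fit_label_index(toy_targets)
encoded = [[label_to_index[label] for label in seq] for seq in toy_targets]
print(label_to_index)
print(encoded)  # [[2, 3, 1, 0]]
```

The ordering is the same as in the real output for the test example (72 < 165 < 768 < 829 for DUMMY, -1_S, 1_S, 1_VP); only the absolute indices differ because the full dataset contains many more labels.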

In [3]:
test_index = 7
print("Sentences (length: {}): {}".format(len(sentences[test_index]), sentences[test_index]))
print("Labels (length: {}): {}".format(len(class_labels[test_index]), class_labels[test_index]))

Sentences (length: 12): ['A', 'Lorillard', 'spokewoman', 'said', ',', '``', 'This', 'is', 'an', 'old', 'story', '.']
Labels (length: 12): [859 494 165 829 628 628 768 829 688 493 324  72]


# Pre-trained DistilBert

We will experiment with "DistilBert" (https://arxiv.org/pdf/1910.01108.pdf), which is a small BERT-like model, i.e. an encoder-only model. Load the tokenizer of this model:

In [4]:
from transformers import DistilBertTokenizer, DistilBertForTokenClassification
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [5]:
# Example usage of the tokenizer:

test_sentence = " ".join(sentences[test_index])
print("Sentence:\n{}\n".format(test_sentence))
print("Label (length={}):\n{}\n".format(len(class_labels[test_index]), class_labels[test_index]))
print("Tokenizer Output (length={}): \n{}".format(len(tokenizer(test_sentence).input_ids), tokenizer(test_sentence)))
print()

Sentence:
A Lorillard spokewoman said , `` This is an old story .

Label (length=12):
[859 494 165 829 628 628 768 829 688 493 324  72]

Tokenizer Output (length=18): 
{'input_ids': [101, 1037, 18669, 17305, 2094, 3764, 10169, 2056, 1010, 1036, 1036, 2023, 2003, 2019, 2214, 2466, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}



There is an issue with the task now: Our tokenizer creates more tokens than labels! Our task assumes one label per word/token.

**Task 1**: Explain why there are more tokens than labels! Give two reasons. (2P)

: First, the tokenizer uses WordPiece subword tokenization, so a single word can be split into two or more tokens. For example, 'spokewoman' is split into two tokens (3764, 10169), presumably 'spoke' and '##woman'. Second, the tokenizer adds special tokens that do not correspond to any word: [CLS] (id 101) at the beginning and [SEP] (id 102) at the end of the sequence.
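
The two reasons can be checked against the tokenizer output printed above. The following small calculation is a sketch: the number of sub-tokens per word is read off the printed `input_ids`, and the grouping of ids into words is an assumption.

```python
# Number of sub-tokens per word, read off the input_ids printed above
# (the grouping of ids into words is an assumption)
pieces_per_word = {
    "A": 1, "Lorillard": 3, "spokewoman": 2, "said": 1, ",": 1, "``": 2,
    "This": 1, "is": 1, "an": 1, "old": 1, "story": 1, ".": 1,
}
num_special_tokens = 2  # [CLS] (101) and [SEP] (102)

num_words = len(pieces_per_word)
num_tokens = sum(pieces_per_word.values()) + num_special_tokens
print(num_words, num_tokens)  # 12 18
```

Four extra tokens come from subword splits and two from the special tokens, which accounts for the difference between 12 labels and 18 tokens.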

**Task 2:** Explain why this is an issue for our task. (2P)

: DistilBertForTokenClassification produces one prediction per input token, so the loss needs exactly one label per token. Our labels are defined per word, however, so after tokenization the 12 labels no longer line up with the 18 tokens: it is unclear which token each label belongs to, and the loss cannot be computed on sequences of different lengths. Without a defined alignment between tokens and labels, the model can neither be trained nor evaluated properly.

To solve this task, I propose the following solution:

- We only take the last token of each word into account when calculating the loss, and therefore when updating the model:
   - e.g. "spokewoman" is split into two tokens (3764, 10169). In this case, only the prediction for the last token (10169) enters the loss function!
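
The last-token strategy can be sketched in plain Python. This is an illustration under assumptions, not the dataset implementation used below: `align_labels_to_last_piece` is a hypothetical helper, the sub-token strings are made up, and the value -100 stands for "no label at this position".

```python
IGNORE_INDEX = -100


def align_labels_to_last_piece(pieces_per_word, word_labels):
    """Give each word's label to its last sub-token and IGNORE_INDEX to the rest."""
    aligned = [IGNORE_INDEX]  # the leading [CLS] token carries no label
    for pieces, label in zip(pieces_per_word, word_labels):
        aligned.extend([IGNORE_INDEX] * (len(pieces) - 1))
        aligned.append(label)
    return aligned


# Made-up sub-token strings for the first three words of the test sentence
toy_pieces = [["a"], ["lo", "##ril", "##lard"], ["spoke", "##woman"]]
toy_labels = [859, 494, 165]
print(align_labels_to_last_piece(toy_pieces, toy_labels))
# [-100, 859, -100, -100, 494, -100, 165]
```

Every word still contributes exactly one label, so the number of scored positions equals the number of words regardless of how the tokenizer splits them.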

**Task 3:** Think about other transformations (similar to mine) to solve the task. (2P)

: Using the DistilBERT tokenizer for labels is crucial for maintaining consistency in the tokenization process with the input sentences. The labels are initially tokenized to break them into individual tokens, aligning them with the model's understanding. Each resulting token is then matched with its corresponding index in the DistilBERT vocabulary, assigning a specific position within the model's embedding space. The labels are encoded based on these indices, creating an organized list or array. To address additional tokens like [CLS] and [SEP] introduced by the BERT tokenizer, masking is applied to prioritize essential information. Applying these encoded labels during model training ensures that both sentence and label information are considered concurrently, improving the model's performance on tasks related to the provided labels.


So, how do we ignore the predictions for _non-last_ tokens? In Huggingface, the **class index "-100" is ignored in the calculation of the loss function**. So for every token that should not be considered, we put -100 at the corresponding position in class_labels.
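
To see what "ignored in the loss" means, here is a minimal pure-Python sketch of a cross-entropy that skips positions labelled -100. It is meant to mirror the default behaviour of PyTorch's cross-entropy loss (`ignore_index=-100`, mean over the remaining positions); the function `masked_cross_entropy` itself is hypothetical and only for illustration.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over all positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # this position contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])
    return sum(losses) / len(losses)


logits = [[0.0, 0.0], [5.0, -5.0], [0.0, 0.0]]
labels = [0, -100, 1]
print(masked_cross_entropy(logits, labels))  # equals log(2)
```

The confident prediction at the middle position has no influence on the result, because its label is -100.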

**Task 4:** Create the corresponding dataset for my solution! (4P)

In [6]:
import torch
from torch.utils.data import Dataset, DataLoader

'''

Your code here.

Create PyTorch Dataset for my solution. Only take into consideration the last token of 
each word to calculate our loss function and therefore updating our model.

E.g. "spokewoman" is split into two tokens (3764, 10169). 
In this case, we only consider the prediction for the last token (10169) 
into our loss function!

Hint: I used special token [SEP] (token_id = 102) to track boundaries of each word, e.g.
-> "[SEP]".join(sentences[test_index])
-> each token_id before 102 is the last token of a word

'''


class TreeDatasetGloVeIndexed(Dataset):
    def __init__(self, sentences, class_labels):
        self.sentences = sentences
        self.class_labels = class_labels

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        labels = self.class_labels[idx]

        # Join the tokens to form the input sentence & Tokenize the sentence
        input_sentence = "[SEP]".join(sentence)
        tokenized_output = tokenizer(input_sentence)
        
        # Record, for every word, the position of its last sub-token.
        # Positions refer to the sequence with the [SEP] separators (id 102) removed.
        current_index = 0
        last_token_indices = []

        for token in tokenized_output['input_ids']:
            if token != 102:
                current_index += 1
            else:
                last_token_indices.append(current_index - 1)

        # Drop the [SEP] separators; the leading [CLS] (id 101) is kept
        input_ids = [token for token in tokenized_output['input_ids'] if token != 102]

        # Assign the word label to the last sub-token of each word and -100 elsewhere
        last_token_labels = []
        word_idx = 0

        for i in range(len(input_ids)):
            if i < last_token_indices[word_idx]:
                last_token_labels.append(-100)
            else:
                last_token_labels.append(labels[word_idx])
                word_idx += 1

        indices_tensor = torch.tensor(input_ids)
        class_label = torch.tensor(last_token_labels)
        
        # Output should be a dictionary with keys input_ids and labels
        # Both are tensors (see assert function)
        return {"input_ids": indices_tensor, "labels": class_label}


# We use a batch size of 1 only.
batch_size = 1

# Train
train_sentences = sentences[:3000]
train_labels = class_labels[:3000]
train_tree_dataset = TreeDatasetGloVeIndexed(train_sentences, train_labels)

# Dev/Val
dev_sentences = sentences[3000:3100]
dev_labels = class_labels[3000:3100]
dev_tree_dataset = TreeDatasetGloVeIndexed(dev_sentences, dev_labels)

# Test
test_sentences = sentences[3139:]
test_labels = class_labels[3139:]
test_tree_dataset = TreeDatasetGloVeIndexed(test_sentences, test_labels)


In [7]:
# This was my assert. Ignore, if you used a slightly different strategy.
# Multiple strategies could be valid.

assert torch.equal(train_tree_dataset[0]["input_ids"], torch.tensor([  101,  5578, 19354,  7520,  1010,  6079,  2086,  2214,  1010,  2097,
          3693,  1996,  2604,  2004,  1037,  3904,  2595,  8586, 28546,  2472,
         13292,  1012,  2756,  1012]))

assert torch.equal(train_tree_dataset[0]["labels"], torch.tensor([-100, 1009, -100,   90,  494,  855,   61,   90,  165,  829,  829,  675,
         196,  729,  675, -100, -100, -100,  482,  257, -100,  713,  279,   72]))

**Task 5**: Now initialize the **pre-trained** model "distilbert-base-uncased". Use DistilBertForTokenClassification. (2P)

In [15]:
from transformers import DistilBertForTokenClassification

# The number of classes is the number of distinct labels known to the LabelEncoder.
# (Counting unique label sequences per sentence would give the number of distinct
# sentences' labelings, not the number of classes.)
num_classes = len(label_encoder.classes_)

# Look up how you can initialize the pre-trained model with the correct number of classes!
model = DistilBertForTokenClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=num_classes
)

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Task 6:** Write the evaluation function which calculates the accuracy per token during the training loop of the Trainer. (2P)

_Hint: Ignore labels with -100!_

In [24]:
import torch
from transformers import Trainer, TrainingArguments, DataCollatorForTokenClassification

'''

Your code here.

Write the evaluation function which calculates the accuracy per token.

Hint: Ignore labels with -100! Also test this function by executing the training loop.

'''

def compute_token_accuracy(preds):
    output, labels = torch.from_numpy(preds.predictions), torch.from_numpy(preds.label_ids)
    active_loss = labels != -100
    logits_argmax = torch.argmax(output, dim=2)
    
    logits_flat = logits_argmax[active_loss]
    labels_flat = labels[active_loss]
    
    correct_tokens = torch.sum(logits_flat == labels_flat).item()
    total_tokens = len(labels_flat)
    
    return {"token_accuracy": correct_tokens / total_tokens}

Now, let's train the model:

**Important: You can reduce num_train_epochs to shorten the training time!**

In [25]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./sequence_labeling_model",
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=50,
    save_total_limit=3,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    weight_decay=0.01,
)

# Create data collator for token classification
data_collator = DataCollatorForTokenClassification(tokenizer)

# Create Trainer with the custom metric
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tree_dataset,
    eval_dataset=dev_tree_dataset,
    compute_metrics=compute_token_accuracy,
    data_collator=data_collator,
)

trainer.train()

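| Step | Training Loss | Validation Loss | Token Accuracy |
|------|---------------|-----------------|----------------|
| 50 | 0.6618 | 1.226698 | 0.746503 |
| 100 | 1.2176 | 1.019712 | 0.765734 |
| 150 | 1.0148 | 0.950359 | 0.778409 |
| 200 | 0.9484 | 0.908865 | 0.792395 |
| 250 | 0.6583 | 0.907179 | 0.79021 |
| 300 | 0.6399 | 0.88764 | 0.80201 |
| 350 | 0.6558 | 0.915546 | 0.788024 |
| 400 | 0.5596 | 0.886522 | 0.799388 |
| 450 | 0.4393 | 0.864333 | 0.807255 |
| 500 | 0.4473 | 0.865639 | 0.804196 |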


Token indices sequence length is longer than the specified maximum sequence length for this model (688 > 512). Running this sequence through the model will result in indexing errors


TrainOutput(global_step=940, training_loss=0.5288104270366912, metrics={'train_runtime': 2093.464, 'train_samples_per_second': 7.165, 'train_steps_per_second': 0.449, 'total_flos': 296767600186656.0, 'train_loss': 0.5288104270366912, 'epoch': 5.0})

**Task 7**: Calculate the token accuracy for the test set! (2P)

In [26]:
'''

Your code here.

Calculate the token accuracy for the test set!

'''
test_results = trainer.evaluate(test_tree_dataset)

# Extract token accuracy from the evaluation results
token_accuracy = test_results["eval_token_accuracy"]

print(f"Token Accuracy on Test Set: {token_accuracy:.4f}")

Token Accuracy on Test Set: 0.8429


The model looks pretty good! As a last test, let us train a non-pre-trained DistilBERT and see how much pre-training helps us on this task.

**Task 8:** Train a non-pre-trained distilbert! Then test the fine-tuned model. (2P)
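
One way to obtain a model without pre-trained weights is to construct it from a configuration object instead of loading a checkpoint. The following is a minimal sketch under that assumption; the helper name `build_untrained_distilbert` is hypothetical, and `config_overrides` is only there so that a smaller architecture can be requested.

```python
from transformers import DistilBertConfig, DistilBertForTokenClassification


def build_untrained_distilbert(num_labels, **config_overrides):
    """Create a DistilBERT token classifier with randomly initialized weights."""
    # Constructing the model from a config (instead of from_pretrained)
    # does not load any checkpoint weights.
    config = DistilBertConfig(num_labels=num_labels, **config_overrides)
    return DistilBertForTokenClassification(config)
```

Calling `build_untrained_distilbert(num_classes)` uses the default `DistilBertConfig` values, which to the best of my knowledge correspond to the distilbert-base-uncased architecture (6 layers, hidden size 768, 12 heads). The tokenizer can still be loaded with `from_pretrained`, since it only provides the vocabulary and not model weights.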

In [30]:
'''

Your code here.

Hint: Basically copy+paste the above pipeline.

'''

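# Define the second model.
# Note: from_pretrained() loads the pre-trained checkpoint weights; only the classifier
# head is newly initialized. A model built from a DistilBertConfig would start from random weights.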
model = DistilBertForTokenClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=num_classes
)

# Define the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Train
train_sentences = sentences[:3000]
train_labels = class_labels[:3000]
train_tree_dataset = TreeDatasetGloVeIndexed(train_sentences, train_labels)

# Dev/Val
dev_sentences = sentences[3000:3100]
dev_labels = class_labels[3000:3100]
dev_tree_dataset = TreeDatasetGloVeIndexed(dev_sentences, dev_labels)

# Test
test_sentences = sentences[3139:]
test_labels = class_labels[3139:]
test_tree_dataset = TreeDatasetGloVeIndexed(test_sentences, test_labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./sequence_labeling_model",
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=50,
    save_total_limit=3,
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=1,
    num_train_epochs=5,
    weight_decay=0.01,
)

# Create data collator for token classification
data_collator = DataCollatorForTokenClassification(tokenizer)

# Create Trainer with the custom metric
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tree_dataset,
    eval_dataset=dev_tree_dataset,
    compute_metrics=compute_token_accuracy,
    data_collator=data_collator,
)

trainer.train()

# Evaluate on the test set
test_results = trainer.evaluate(test_tree_dataset)

# Print the token accuracy for the test set
print("Token Accuracy on Test Set:", test_results["eval_token_accuracy"])

Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


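| Step | Training Loss | Validation Loss | Token Accuracy |
|------|---------------|-----------------|----------------|
| 50 | 3.9885 | 2.103124 | 0.583042 |
| 100 | 1.925 | 1.504329 | 0.69493 |
| 150 | 1.4423 | 1.288113 | 0.737762 |
| 200 | 1.3061 | 1.173345 | 0.736014 |
| 250 | 1.0111 | 1.06778 | 0.753059 |
| 300 | 0.9126 | 1.005751 | 0.766608 |
| 350 | 0.9078 | 0.947636 | 0.781031 |
| 400 | 0.7978 | 0.926845 | 0.795455 |
| 450 | 0.643 | 0.882969 | 0.803759 |
| 500 | 0.6215 | 0.885691 | 0.802448 |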


Token indices sequence length is longer than the specified maximum sequence length for this model (688 > 512). Running this sequence through the model will result in indexing errors


Token Accuracy on Test Set: 0.8427457882826251


**Task 9:** What are your results? Explain why pre-training helps or does not help on this task. (2P)

: The two runs reach almost the same token accuracy on the test set (0.8429 vs. 0.8427), so we observe no significant difference between the pre-trained and the non-pre-trained model. One possible interpretation is that pre-training mainly teaches general language features, which may not capture the structural information that this task needs, or that the training set is too small for a difference to show. However, this comparison should be read with caution: the second model was also created with from_pretrained('distilbert-base-uncased'), and its log confirms that only the classifier weights were newly initialized. Both runs therefore start from the same pre-trained encoder, which is the most likely explanation for the nearly identical results. A meaningful comparison requires a model whose weights are randomly initialized.
