Assignment: POS tagging with RoBERTa

In this assignment, you will reimplement the part-of-speech tagger from an earlier exercise with neural networks. We will use the Pytorch library and make extensive use of prepackaged datasets and pretrained models from Huggingface. More specifically, we will implement a very simple tagging model that just puts a linear layer on top of XLM-RoBERTa, an optimized, crosslingual variant of BERT that is trained on 100 languages.

Below, I am giving you detailed instructions to help you navigate these more complicated libraries. You are welcome to ignore my instructions if you are a Pytorch expert, but please make sure that (a) you use the dataset specified below, (b) you use the standard Pytorch training loop (and not the more compact code offered e.g. by the Huggingface Trainer class), and (c) the tutors can clearly distinguish the pieces of the code that take care of the four following steps:

Data loading and tokenization: 15 points
Defining the neural model: 30 points
Training: 35 points
Evaluation (on the development set): 20 points

Of course you are also welcome to simply follow the instructions below.

There are a lot of tutorials on implementing neural tagging models on the Internet. You are welcome to read and watch them, but please submit code that is your own and not copied or LLM-generated. Be aware that online tutorials may be based on older versions of Pytorch or Huggingface than yours, so their code may no longer work.

Finally, be aware that training a neural network is time-intensive. On my laptop, one epoch of training takes a few minutes, and a complete training run takes up to an hour. Consider developing and debugging your code on a subset of the training corpus (e.g. the first minibatch) until you believe it works correctly. Also, make sure that you start early. If your own computer is not fast enough to train a neural network, feel free to use the department's compute servers or Google Colab.

Loading the corpus

For this assignment, we will use the de_gsd section of the German Universal Dependencies dataset. Each token in this dataset has been annotated with a POS tag from the Universal POS tagset.

Use the load_dataset function to download de_gsd from the Huggingface dataset hub. As of Huggingface Datasets 4, loading the dataset is a little clunky:

from datasets import load_dataset

base_url = 'https://huggingface.co/datasets/universal-dependencies/universal_dependencies/resolve/refs/convert/parquet/de_gsd/'
data_files = {
    'train': base_url + 'train/0000.parquet',
    'validation': base_url + 'validation/0000.parquet',
    'test': base_url + 'test/0000.parquet',
}
dataset = load_dataset('parquet', data_files=data_files)

Explore the dataset, e.g. by counting the instances in the train, validation, and test set. Read the first couple of sentences with their associated POS tags (in the upos feature), and store the list of possible POS tags in a variable.
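
As a rough sketch of this kind of exploration, the snippet below counts the sentences per split and collects the tag IDs that occur. It does not download anything: toy_dataset is a hypothetical, hand-made stand-in with the same tokens and upos fields, and the tag IDs in it are invented for illustration. With the real data, you would iterate over the splits of your loaded dataset in the same way.

```python
from collections import Counter

# Hypothetical stand-in for the real dataset: each split is a list of
# sentences with "tokens" and "upos" fields. The tag IDs are invented.
toy_dataset = {
    "train": [
        {"tokens": ["Das", "ist", "gut", "."], "upos": [5, 3, 0, 12]},
        {"tokens": ["Wir", "lesen", "."], "upos": [10, 15, 12]},
    ],
    "validation": [
        {"tokens": ["Sie", "kommt", "."], "upos": [10, 15, 12]},
    ],
}

# Number of sentences per split.
split_sizes = {name: len(split) for name, split in toy_dataset.items()}

# How often each tag ID occurs in the training split.
tag_counts = Counter(tag for sent in toy_dataset["train"] for tag in sent["upos"])

# Sorted list of the tag IDs that actually occur in the training split.
observed_tags = sorted(tag_counts)

print(split_sizes)
print(tag_counts.most_common(3))
print(observed_tags)
```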

Tokenization

XLM-RoBERTa comes with its own tokenizer, which can split words into subword units to avoid out-of-vocabulary issues. Install the XLM-RoBERTa tokenizer as explained in the documentation (note this detail). Any comments that you find on the Internet about the original RoBERTa tokenizer also apply to XLM-RoBERTa; but note that other language models, such as BERT, use different tokenizers.

The tokens created by the XLM-RoBERTa tokenizer therefore don't necessarily match the tokens in the dataset; the sequence of POS tag labels will typically be shorter than the sequence of input tokens, and we have to match them up again.

Use the following function, inspired by this tutorial, to tokenize the data and create a labels feature that assigns a POS tag to the first token for each word:

def tokenize_and_align_labels(examples, label_all_tokens=False, skip_index=-100):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True, padding=True) 
    labels = []
    
    for i, label in enumerate(examples["upos"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids : list[int] = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(skip_index)

            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])

            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else skip_index)

            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

This function expects a slice of a dataset as input in the examples variable. For instance, if you loaded the corpus into the variable dataset as above, you could pass dataset["train"][:5]. It will then tokenize the sentences in this slice and will assign a POS tag ID (a number between 0 and 17) to the first token of each word. It will assign the ID -100 to all other tokens (special tokens, padding, and non-initial subword tokens) to indicate that these tokens don't get a POS tag of their own. You can later ignore POS tags with ID -100 in the computation of the loss function and accuracy.

Note that the function relies on a global variable called tokenizer containing the tokenizer.

Tokenize a piece of the training corpus. Compare the original sentence to the tokenized sentence, and check that the POS tags are correctly assigned to the first tokens of the words. To do this, write code to convert the POS tag IDs into human-readable POS tags and print the tokenized first five sentences of the training set, together with their human-readable POS tags.
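
One possible shape for the conversion step is sketched below. The helper name readable_tags, the four-entry tag_names list, and the subword split of the example sentence are all made up for illustration. With the real data, you would use the tag list you stored while exploring the dataset and the tokens produced by the actual tokenizer.

```python
SKIP_INDEX = -100

# Hypothetical tag inventory; with the real data, use the list of tag names
# you stored when exploring the dataset.
tag_names = ["NOUN", "PUNCT", "VERB", "DET"]

def readable_tags(subword_tokens, label_ids, names, skip_index=SKIP_INDEX):
    """Pair each subword token with its tag name, or "-" if it has no tag."""
    return [
        (token, "-" if label == skip_index else names[label])
        for token, label in zip(subword_tokens, label_ids)
    ]

# Made-up tokenizer output for "Die Katze schläft ." including special tokens;
# the subword split is invented for illustration.
tokens = ["<s>", "▁Die", "▁Kat", "ze", "▁schläft", "▁.", "</s>"]
labels = [-100, 3, 0, -100, 2, 1, -100]

pairs = readable_tags(tokens, labels, tag_names)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

Printing a placeholder such as "-" for skipped tokens makes it easy to see at a glance that only the first subword of each word carries a tag.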

Data loading

You now have a function to tokenize a Huggingface dataset, but for your training loop, you will want a Pytorch DataLoader that returns batches of training instances. Fortunately, you can easily convert a dataset into a DataLoader.

Use Huggingface's map function to tokenize the entire training set. Then follow the tutorial above to convert the tokenized training set into a DataLoader.

While you're at it, group your training instances into minibatches when you create the DataLoader; then each time you obtain a new element from the DataLoader when you iterate over it, you will obtain a tensor that represents the entire minibatch. As you do this, you need to make sure that all input sentences in the same batch are of the same length. You can achieve this by allowing the map function to pad them with dummy tokens.

It is easiest to start with a batch size of 1 (one sentence per minibatch), because then you don't have to think about padding and the linear algebra in your model is easier to think about. Increasing the batch size up to the limit that your CPU or GPU can handle will make training much faster. The bottleneck on a GPU is usually the GPU's memory.

Note that the dataset then contains both the input_ids feature as before and an attention_mask feature that tells you which tokens are real tokens and which ones were added in the padding step. Make sure you pass the attention_mask as input to XLM-RoBERTa so it handles them correctly in its internal computations.

Try out the DataLoader by iterating over a few elements. You should obtain dictionaries with the keys input_ids, attention_mask, and labels. The values should be Pytorch LongTensors of shape (batch size, seqlen), where seqlen is the length of the (padded) sequences in each batch. Notice that seqlen can have different values for different batches. Print the first batch.
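
To illustrate the batch layout described above, here is a small sketch that works on toy data. Note that it swaps in a different padding technique from the one in the instructions: instead of padding inside the tokenization step, it pads each minibatch in a hand-written collate function. The examples list, the collate helper, and the value of PAD_ID are assumptions for this sketch; in real code the padding ID should come from the tokenizer.

```python
import torch
from torch.utils.data import DataLoader

PAD_ID = 1         # assumption: stand-in for tokenizer.pad_token_id
SKIP_INDEX = -100  # label for positions that carry no POS tag

# Toy "tokenized" sentences of different lengths (all IDs are made up).
examples = [
    {"input_ids": [0, 11, 12, 2], "labels": [-100, 3, 0, -100]},
    {"input_ids": [0, 21, 22, 23, 24, 2], "labels": [-100, 1, 2, -100, 0, -100]},
    {"input_ids": [0, 31, 2], "labels": [-100, 4, -100]},
]

def collate(batch):
    """Pad all sentences in the batch to the length of the longest one."""
    seqlen = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask, labels = [], [], []
    for ex in batch:
        n = len(ex["input_ids"])
        pad = seqlen - n
        input_ids.append(ex["input_ids"] + [PAD_ID] * pad)
        attention_mask.append([1] * n + [0] * pad)  # 1 = real token, 0 = padding
        labels.append(ex["labels"] + [SKIP_INDEX] * pad)
    return {
        "input_ids": torch.tensor(input_ids),            # (batch, seqlen)
        "attention_mask": torch.tensor(attention_mask),  # (batch, seqlen)
        "labels": torch.tensor(labels),                  # (batch, seqlen)
    }

loader = DataLoader(examples, batch_size=2, shuffle=False, collate_fn=collate)
batches = list(loader)
for batch in batches:
    print({key: tuple(value.shape) for key, value in batch.items()})
```

With three toy sentences and a batch size of 2, the second batch holds a single sentence, so the two batches end up with different values of seqlen.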

Defining your neural model

In this assignment, we will implement a very simple neural tagging model. Given a single input sequence x of length n, we will first apply XLM-RoBERTa to obtain a sequence h of hidden states, one for each token in x. We will then run each element of h through a Linear layer, which projects the 768-dimensional XLM-RoBERTa states down to the 18 possible choices of POS tag, obtaining a sequence y of n logit vectors with one score per POS tag. Finally, we apply a softmax operation at each position to obtain a probability distribution over POS tags.

In practice, the neatest way to implement such a model is to implement your own subclass of the Module class from Pytorch. You can set up all the component models (RoBERTa, Linear) in the __init__ method, and then you implement the forward method to transform your model inputs into your model outputs. Here are some notes that may help you:

It is customary to return not the actual probability distributions (after the application of softmax) from the forward method, but only the logits. You can then directly pass these logits into the CrossEntropyLoss loss function for comparison with the gold labels.
When you train your neural network, the backpropagation algorithm will, by default, attempt to optimize all parameters of your model. In the case of the model sketched above, this includes all of the roughly 270M parameters of XLM-RoBERTa, and not just the roughly 14k parameters of your linear layer. You can set the requires_grad attribute of the Parameters in XLM-RoBERTa to False to "freeze" them and keep them from being optimized; see e.g. this discussion. If you store XLM-RoBERTa in an attribute called roberta, you can recognize its parameters by the fact that their names all start with roberta.
Note that XLMRobertaModel's forward method accepts an attention_mask parameter in which you can pass the attention mask you got from the padding step above. This is described better in the documentation of BertModel, from which XLMRobertaModel inherits.
A very common source of bugs is to pass tensors of the wrong shape to a function. I suggest adding a comment after every line of code that computes a tensor to write down its shape. Print a tensor's shape attribute to keep track of whether it has the shape you think it does.

Implement the Module described above, apply it to a minibatch of inputs, and print the outputs. (Note that they will change every time you run the program because the weights in the Linear layer are initialized randomly.)
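
The freezing and shape bookkeeping from the notes above can be tried out without downloading anything. The sketch below is not the model for this assignment: ToyTagger is a hypothetical class that uses a plain Embedding layer as a stand-in for XLM-RoBERTa, stored in an attribute named roberta so that its parameter names start with roberta. The vocabulary size of 100 is arbitrary.

```python
import torch
from torch import nn

class ToyTagger(nn.Module):
    def __init__(self, vocab_size=100, hidden_size=768, num_tags=18):
        super().__init__()
        # Stand-in for the pretrained encoder (an assumption for this sketch).
        self.roberta = nn.Embedding(vocab_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_tags)

    def forward(self, input_ids, attention_mask=None):
        # The stand-in ignores attention_mask; a real encoder needs it.
        hidden = self.roberta(input_ids)   # (batch, seqlen, hidden_size)
        return self.classifier(hidden)     # (batch, seqlen, num_tags) logits

model = ToyTagger()

# Freeze every parameter whose name starts with "roberta.".
for name, param in model.named_parameters():
    if name.startswith("roberta."):
        param.requires_grad = False

# Count what is left for the optimizer: only the linear layer.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
logits = model(torch.tensor([[0, 5, 7, 2]]))
print(trainable, tuple(logits.shape))
```

After freezing, only the weights and biases of the linear layer remain trainable, which is 768 * 18 + 18 parameters.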

Training

Now it's time to train your model. A typical training loop in Pytorch looks, schematically, as follows:

loss_function = <create your loss function>
optimizer = <create your optimizer>

for epoch in range(NUM_EPOCHS):
	for batch in dataloader:
		inputs = batch["input_ids"]
		gold_outputs = batch["labels"]
		
		predicted_logits = model(inputs)
		batch_loss = loss_function(predicted_logits, gold_outputs)
		
		optimizer.zero_grad()
		batch_loss.backward()
		optimizer.step()

The call to the loss_function computes the loss of your model's predictions against the gold outputs in the batch. The gradient of this loss with respect to your model parameters is computed by backpropagation in the call to backward and stored with each parameter. The optimizer then uses these gradients to update your model's parameters in the call to optimizer.step, and then the next batch of training instances is processed. The call to zero_grad resets the stored gradients, so that gradients from earlier batches do not add up.

Implement this properly for your own model. Here are some tips:

Typical choices for the loss and optimizer are CrossEntropyLoss and Adam. You may need to play around a bit to find a good learning rate for the optimizer. Note that if you change the batch size, you may need to also adjust the learning rate.
Observe that the shape of the tensor that your model returns may not be suitable to be passed directly into the loss function. The view, flatten, and/or transpose functions may be useful here.
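By default, CrossEntropyLoss already averages the per-token losses into a single scalar tensor for the whole minibatch (reduction='mean'), and it skips tokens whose gold label equals its ignore_index parameter, which defaults to -100. Check out the reduction parameter if you want the sum or the separate per-token losses instead; the older reduce parameter is deprecated.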
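
The shape issue from the tips above can be checked in isolation on random tensors. The sketch below is only one way to do it: it flattens the batch and sequence dimensions with view and sets ignore_index explicitly. The tensor sizes and label values are made up, and the second half recomputes the loss by hand as a sanity check.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch_size, seqlen, num_tags = 2, 5, 18

logits = torch.randn(batch_size, seqlen, num_tags)  # (batch, seqlen, num_tags)
labels = torch.tensor([
    [-100, 3, 0, -100, -100],
    [-100, 7, -100, 2, -100],
])                                                   # (batch, seqlen)

# Averages over the tokens whose label is not -100.
loss_function = nn.CrossEntropyLoss(ignore_index=-100)

flat_logits = logits.view(-1, num_tags)  # (batch * seqlen, num_tags)
flat_labels = labels.view(-1)            # (batch * seqlen,)
loss = loss_function(flat_logits, flat_labels)

# Same value computed by hand over the labeled positions only.
keep = flat_labels != -100
log_probs = torch.log_softmax(flat_logits[keep], dim=-1)
rows = torch.arange(int(keep.sum()))
manual = -log_probs[rows, flat_labels[keep]].mean()

print(loss.item(), manual.item())
```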

Plot learning curves during training, so you can diagnose whether your model is learning and how well. There are many ways to do this; my favorite right now is Weights & Biases, which is very easy to set up and free for personal use.

Submit some representative pictures of your learning curves (training and development). Your goal should be to get the training loss very close to zero. (A training loss of exactly zero would mean that the model has perfectly memorized the training data.)

Evaluation

Finally, you should evaluate your model on unseen data. Run your model on the validation dataset after each epoch of training and report the accuracy. Compare your results to those of the HMM assignment, and discuss the advantages and disadvantages of the two tagging models.

Some notes:

You can switch your model to "evaluation mode" by calling its eval method, which turns off training-only features such as dropout. In addition, run the forward pass inside a torch.no_grad() block; this suppresses the computation of data that is needed for backpropagation, speeding up the forward function. Be sure to switch the model back to training mode before processing the next training batch by calling train. See here for an example.
When you compute the accuracy, make sure that you only look at tokens where the gold label is not -100, the "no token" label from the tokenize_and_align_labels function above.
You could also evaluate on the test set after you complete training, but this is not required.
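
A minimal sketch of the masked accuracy computation follows. The helper name masked_accuracy and the toy logits are invented for illustration. In a real evaluation loop, the logits would come from your model, with a call to eval before the loop and a call to train after it.

```python
import torch

def masked_accuracy(logits, labels, skip_index=-100):
    """Fraction of labeled tokens whose highest-scoring tag matches the gold tag."""
    predictions = logits.argmax(dim=-1)   # (batch, seqlen)
    keep = labels != skip_index           # (batch, seqlen), True for real labels
    correct = (predictions == labels) & keep
    return correct.sum().item() / keep.sum().item()

# Toy logits over 3 tags for one sentence of 4 tokens (values are made up).
logits = torch.tensor([[
    [0.1, 2.0, 0.3],   # predicts tag 1
    [1.5, 0.2, 0.1],   # predicts tag 0
    [0.0, 0.1, 3.0],   # predicts tag 2
    [0.9, 0.8, 0.7],   # predicts tag 0
]])
labels = torch.tensor([[-100, 0, 1, 0]])

with torch.no_grad():  # no gradients are needed for evaluation
    accuracy = masked_accuracy(logits, labels)
print(accuracy)
```

When evaluating over many batches, it is safer to sum the correct and labeled token counts across batches and divide once at the end, since batches contain different numbers of labeled tokens.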

Where to go from here

Congratulations! You have implemented a POS tagger with a neural network. I get just under 95% development accuracy on a typical training run with my own implementation.

However, there is still room for improvement. Feel free to extend the model for extra credit. Here are some ideas that might improve the accuracy:

Use multiple linear layers rather than just one, separated by a nonlinearity.
Replace the linear layers with an LSTM or transformer layer.
Finetune XLM-RoBERTa directly; this will probably be much slower to train because RoBERTa has far more parameters than the linear layer. You can do this by not setting requires_grad to False.
Try your tagger out on other languages that XLM-RoBERTa supports and compare your results. You can find the list in Appendix A of the XLM-RoBERTa paper.
Optimize your hyperparameters to see if you can improve the accuracy.

Furthermore, while you can probably train such a small model on the CPU, you could also try out training on the GPU if you have one. If you want to train on a modern Mac (with Apple Silicon, e.g. the M1 chip), you can use the MPS backend (but be sure to use an up-to-date version of Pytorch).
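
One way to pick a device is sketched below, assuming you prefer CUDA, then MPS, then the CPU. The helper name pick_device is made up. The getattr guard is there because older Pytorch versions do not have the MPS backend at all.

```python
import torch

def pick_device():
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps = getattr(torch.backends, "mps", None)  # missing in older Pytorch versions
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()

# Both the model and every batch tensor must be moved to the same device.
x = torch.ones(2, 3).to(device)
print(device, x.device.type)
```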
