Assignment: Dependency parsing

In this assignment, you will implement a graph-based dependency parser. More specifically, you will implement the neural edge scoring model by Dozat & Manning, using RoBERTa encodings for the tokens rather than a three-layer BiLSTM as they did.

As in the neural tagging assignment, you are free to follow the detailed instructions below or not, but please make sure that it is easy for the tutors to distinguish the code for the different sections of the assignment. This time, you can use Huggingface classes in addition to anything in Pytorch or Numpy, but please make sure that you keep to the spirit of the assignment; e.g. don't just use a pre-existing implementation of the Dozat & Manning model.

Throughout the assignment, be aware of the following pitfalls:

The tokenization in the CoNLL-U files with the dependency annotations is not the same as the tokenization produced by the RoBERTa tokenizer. You are familiar with this problem from the tagging assignment, but it is now even more unpleasant because you are predicting positions in the sentence and not just POS tags, and may therefore have to map RoBERTa positions back into CoNLL-U positions.
The HEAD annotations in the CoNLL-U files start counting at 1 for the position of the first word; a head of 0 is used to indicate the root node of the dependency tree (which does not correspond to a word or token in the sentence). By contrast, Python will start counting positions in a list or tensor at 0. Finally, the RoBERTa tokenizer inserts a beginning-of-sequence symbol <s> at the start of the sentence, pushing the position of the first "real" token to 1. You should document your data structures extra carefully to navigate this jungle of indexing decisions.
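
The three indexing conventions can be made concrete with a small sketch. The sentence, its heads, and the subword split below are made up for illustration; they are not taken from en_ewt or from the actual tokenizer, and mapping the root to the position of <s> is an assumption.

```python
# Toy sentence with CoNLL-U style annotations (hypothetical example).
words = ["She", "reads", "books"]
conllu_heads = [2, 0, 2]  # 1-based word positions; 0 means "root"

# Python lists are 0-based, so the word with CoNLL-U ID k is words[k - 1].
def head_word(word_id):
    """Return the head word of the word with 1-based CoNLL-U ID `word_id`."""
    head = conllu_heads[word_id - 1]
    return "ROOT" if head == 0 else words[head - 1]

# Assumed subword split, with <s> at position 0 and </s> at the end.
tokens = ["<s>", "She", "read", "s", "books", "</s>"]
first_token_of_word = [1, 2, 4]  # token position of each word's first subword

# If <s> stands in for the root, CoNLL-U head 0 maps to token position 0.
def head_token_position(word_id):
    """Return the head of a word as a position in the token sequence."""
    head = conllu_heads[word_id - 1]
    return 0 if head == 0 else first_token_of_word[head - 1]

print([head_word(i) for i in (1, 2, 3)])            # ['reads', 'ROOT', 'reads']
print([head_token_position(i) for i in (1, 2, 3)])  # [2, 0, 2]
```

Writing the convention (word or token positions, 0-based or 1-based) into the docstring of every helper, as above, is one way to keep track of it.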

Loading the corpus (10 points)

Use the Huggingface and Pytorch methods for loading the en_ewt UD dataset, as in the tagging assignment. Unlike in that assignment, we will ignore the annotated POS tags and focus on the head and deprel features. Check the CoNLL-U documentation and the documentation of UD dependency relations for details.

You can use this list of UD dependency relations (= edge labels):


all_deprels = [
    # these are the default UD dependency relations according to https://universaldependencies.org/u/dep/
    "acl", "acl:relcl", "advcl", "advcl:relcl", "advmod", "advmod:emph", "advmod:lmod", "amod", "appos",
    "aux", "aux:pass", "case", "cc", "cc:preconj", "ccomp", "clf", "compound", "compound:lvc",
    "compound:prt", "compound:redup", "compound:svc", "conj", "cop", "csubj", "csubj:outer",
    "csubj:pass", "dep", "det", "det:numgov", "det:nummod", "det:poss", "discourse", "dislocated",
    "expl", "expl:impers", "expl:pass", "expl:pv", "fixed", "flat", "flat:foreign", "flat:name",
    "goeswith", "iobj", "list", "mark", "nmod", "nmod:poss", "nmod:tmod", "nsubj", "nsubj:outer",
    "nsubj:pass", "nummod", "nummod:gov", "obj", "obl", "obl:agent", "obl:arg", "obl:lmod",
    "obl:tmod", "orphan", "parataxis", "punct", "reparandum", "root", "vocative", "xcomp",

    # we need some more for en_ewt
    "det:predet", "obl:npmod", "nmod:npmod"
]
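
To convert between relation names and integer IDs, a pair of lookups is enough. This is a minimal sketch that uses a shortened stand-in list so the snippet runs on its own; `deprel_to_id` and `id_to_deprel` are names chosen here for illustration, and treating negative IDs as "no label" is an assumption.

```python
# Shortened stand-in for all_deprels so the snippet is self-contained.
all_deprels = ["acl", "advmod", "det", "nsubj", "obj", "punct", "root"]

# Name -> ID: the position of each relation in the list.
deprel_to_id = {name: idx for idx, name in enumerate(all_deprels)}

def id_to_deprel(idx):
    """Map an ID back to its relation name; negative IDs are shown as '_'."""
    return all_deprels[idx] if idx >= 0 else "_"

print(deprel_to_id["nsubj"])  # 3
print(id_to_deprel(6))        # root
```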

As mentioned above, the (XLM-) RoBERTa tokenizer splits the sentences into smaller tokens than the words that are annotated in the CoNLL-U annotation scheme. You can use my preimplemented data handling function tokenize_and_align_labels to tokenize a list of sentences and return some useful information for mapping back and forth between words and RoBERTa tokens. It serves as a drop-in replacement for the function with the same name from the tagging assignment.

The function returns a dictionary with the following keys:

input_ids: For each position in the tokenized string, a numeric ID for the token at that position. The tokenizer prepends a beginning-of-sequence token <s> at the start of each sentence and appends an end-of-sequence token </s>. The sentences you pass to tokenize_and_align_labels are padded to the same length by adding None tokens, so they can later be made into a single minibatch.
attention_mask: Hints to RoBERTa that certain tokens were padding (None) and should always receive zero attention.
head: The head field from the UD annotation. It contains a list of ints, one for each RoBERTa token. The head values from the original annotation, which referred to word positions, are now remapped to token positions in the RoBERTa tokenization.
deprel_ids: The deprel field of the annotation, mapped to ints. Each deprel ID is an index in the list all_deprels.
tokens_representing_words: The positions of the RoBERTa tokens that represent words in the CoNLL-U annotation. For each word, the list contains the position of the leftmost token that is aligned to that word. The list always starts with 0, the index of the BOS token. It doesn't technically align to a word, but I found it useful to keep it around; if you don't want it, feel free to change the function or remove it post-hoc.
num_words: The number of words in the CoNLL-U annotation (plus the BOS token). This is useful to know because the tokens_representing_words lists for the same batch were padded to the same length.
tokenid_to_wordid: Maps each token position to the position of its corresponding word.
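
The interplay of tokens_representing_words and num_words can be sketched as follows. The values are hand-written to match the descriptions above; they are not actual output of tokenize_and_align_labels, and the padding value 0 is an assumption.

```python
# Hand-written example for one sentence, shaped like the fields above.
tokens = ["<s>", "She", "read", "s", "books", "</s>"]
tokens_representing_words = [0, 1, 2, 4, 0, 0]  # padded to batch length (padding value assumed)
num_words = 4                                   # 3 words plus the BOS token

# Drop the padding, then look up the first token of every word.
word_positions = tokens_representing_words[:num_words]
first_tokens = [tokens[pos] for pos in word_positions]
print(first_tokens)  # ['<s>', 'She', 'read', 'books']

# Invert the mapping: which word does each first token belong to?
token_to_word = {pos: word_id for word_id, pos in enumerate(word_positions)}
print(token_to_word)  # {0: 0, 1: 1, 2: 2, 4: 3}
```

Slicing with num_words before doing anything else avoids mistaking padding entries for real words.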

Implement a function that will print the RoBERTa tokens, one per row, along with their heads, human-readable deprel tokens, and mapping of words and tokens. Use this function to explore the first ten sentences or so of the training set and familiarize yourself with the structure of the data. Submit the output of your function for the first ten sentences.
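
One possible shape for such a function is sketched below. The name `format_sentence`, the column layout, and the toy input are all made up for illustration; in particular, using -100 for deprel IDs and word IDs of unannotated tokens is an assumption, so check the actual output of tokenize_and_align_labels.

```python
def format_sentence(tokens, heads, deprel_ids, tokenid_to_wordid, deprels):
    """Return one tab-separated row per token: position, token, head, deprel, word."""
    rows = []
    for pos, (tok, head, rel, word) in enumerate(
        zip(tokens, heads, deprel_ids, tokenid_to_wordid)
    ):
        # -100 is assumed to mark tokens without their own annotation.
        head_str = "_" if head == -100 else str(head)
        rel_str = "_" if rel == -100 else deprels[rel]
        rows.append(f"{pos}\t{tok}\t{head_str}\t{rel_str}\t{word}")
    return rows

# Hand-written toy input; real values come from tokenize_and_align_labels.
deprels = ["nsubj", "obj", "root"]
tokens = ["<s>", "She", "read", "s", "books", "</s>"]
heads = [-100, 2, 0, -100, 2, -100]
deprel_ids = [-100, 0, 2, -100, 1, -100]
tokenid_to_wordid = [0, 1, 2, 2, 3, -100]

rows = format_sentence(tokens, heads, deprel_ids, tokenid_to_wordid, deprels)
print("\n".join(rows))
```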

Create a Pytorch DataLoader for the train and dev set, like in the other assignment.

Defining your neural model (40 points)

The Dozat & Manning edge scoring model is quite simple, even if it is not explained very well in their paper. Dozat's comment on the paper reviews might be helpful to clarify it.

In a nutshell, you will proceed as follows:

From the XLM-RoBERTa embeddings of each token, extract representations H_head and H_dep using a one-layer MLP with some output dimension $d$ (see the D&M paper for suggestions on hyperparameters). Note that you need a separate MLP for the head and for the dep representation.
Calculate a score for each pair of a potential head $i$ and potential dependent $j$ as H_head[i].T * U1 * H_dep[j] + H_head[i].T * u2. U1 is a $d \times d$ matrix, and u2 is a $d$-dimensional vector; their entries are parameters of the model which are learned in training. Make sure that H_head and H_dep are of the right shapes to make all the matrix multiplications work; the result for each pair should be a single number. You may have to transpose one of the tensors (indicated by the .T above).
Treat these scores as the logits in a cross-entropy loss, as discussed in the tagging assignment.
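
Written out for one sentence with $n$ token positions, the score and the loss could look as follows. The symbols $s_{ij}$, $W$, and $g_j$ are notation introduced here for illustration: $W$ is the set of positions that carry a gold head, and $g_j$ is the gold head of position $j$. Normalizing over the candidate heads $i$ for each dependent $j$ is an assumption about how the cross-entropy is set up.

$$s_{ij} = H_{\text{head}}[i]^\top \, U_1 \, H_{\text{dep}}[j] + H_{\text{head}}[i]^\top \, u_2$$

$$\mathcal{L} = -\frac{1}{|W|} \sum_{j \in W} \log \frac{\exp(s_{g_j j})}{\sum_{i=0}^{n-1} \exp(s_{ij})}$$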

As a reminder, a multilayer perceptron (MLP) is essentially what I explained to you in class as a feed-forward neural network: a stack of linear layers that are separated by nonlinearities. A one-layer MLP is just a linear layer with a nonlinearity of your choice on top.

You can compute the edge scores individually by looping over all pairs of positions with a for loop, but such an implementation may be unworkably slow. Feel free to think through the multiplications with pen and paper and consolidate all the $O(n^2)$ matrix multiplications into a single tensor multiplication; this will exploit hardware parallelization and make the code much faster (and more concise). In addition to transpose, you may want to look into unsqueeze and broadcasting. Instead of using low-level operations like transpose and matmul, you could also have a look at einsum for specifying complicated operations on tensors in a human-readable way.
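
The equivalence between the loop and a single tensor operation can be checked on random data. This is a sketch for one unbatched sentence, with tensor shapes chosen here as an assumption; it is not a complete model.

```python
import torch

torch.manual_seed(0)
n, d = 5, 4                   # tokens in the sentence, MLP output dimension
H_head = torch.randn(n, d)    # row i: head representation of token i
H_dep = torch.randn(n, d)     # row j: dependent representation of token j
U1 = torch.randn(d, d)
u2 = torch.randn(d)

# Reference: one score per (head i, dependent j) pair, computed in a loop.
loop_scores = torch.empty(n, n)
for i in range(n):
    for j in range(n):
        loop_scores[i, j] = H_head[i] @ U1 @ H_dep[j] + H_head[i] @ u2

# Vectorized: the bilinear term via einsum, the u2 term broadcast over j.
bilinear = torch.einsum("id,de,je->ij", H_head, U1, H_dep)
bias = (H_head @ u2).unsqueeze(1)   # shape (n, 1), broadcasts to (n, n)
fast_scores = bilinear + bias

print(torch.allclose(loop_scores, fast_scores, atol=1e-4))  # True
```

For a minibatch, adding a batch index to the einsum string (for example "bid,de,bje->bij") extends the same idea to tensors of shape (batch, n, d).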

Note also that you can use the Parameter class in Pytorch to allocate a tensor whose elements will be optimized in training. This may be useful to you, but if you use Parameters, be aware that Pytorch will not automatically initialize them with reasonable values when training starts. You would have to initialize them yourself, e.g. using one of the methods provided by Pytorch. (Pytorch automatically initializes all other modules that you are likely to use in a suitable way.)
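
A minimal sketch of allocating and initializing such parameters is shown below. The class name `EdgeScorerParams` is hypothetical, and the two initializers are just one reasonable choice among those in torch.nn.init, not a recommendation from the D&M paper.

```python
import torch
from torch import nn

class EdgeScorerParams(nn.Module):
    """Holds only the parameters U1 and u2 (hypothetical class name)."""

    def __init__(self, d):
        super().__init__()
        # torch.empty allocates memory without meaningful values,
        # so an explicit initialization has to follow.
        self.U1 = nn.Parameter(torch.empty(d, d))
        self.u2 = nn.Parameter(torch.empty(d))
        nn.init.xavier_uniform_(self.U1)  # one possible choice for a matrix
        nn.init.zeros_(self.u2)           # one possible choice for a vector

params = EdgeScorerParams(d=8)
print([name for name, _ in params.named_parameters()])  # ['U1', 'u2']
```

Because the tensors are wrapped in nn.Parameter and assigned as attributes of an nn.Module, they show up in parameters() and are picked up by the optimizer.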

Training (20 points)

Training will mostly proceed as in the tagging assignment. Use W&B or a similar tool to plot your training loss. On my Mac Mini, one epoch of training takes roughly two minutes.

Evaluation: Head prediction (20 points)

After each epoch, report the "head tagging accuracy" on the dev set. This is a non-standard but easy-to-implement measure in which you simply predict the highest-scoring head for each word and count the proportion of words that were assigned the correct head. Note that if you work on the RoBERTa tokens, you need to ignore tokens that were assigned the head -100 by tokenize_and_align_labels. This evaluation is similar to the one for the tagger.
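
The measure can be sketched in a few lines of Pytorch. The function name `head_accuracy` and the axis order of the score tensor (heads on dimension 1, dependents on dimension 2) are assumptions of this sketch; adapt them to your own model.

```python
import torch

def head_accuracy(scores, gold_heads, ignore_index=-100):
    """Fraction of annotated tokens whose highest-scoring head is the gold head.

    scores:     (batch, n, n) tensor, scores[b, i, j] = score of head i for dependent j
    gold_heads: (batch, n) tensor of gold head positions, ignore_index where unannotated
    """
    predicted = scores.argmax(dim=1)      # best head i for every dependent j
    mask = gold_heads != ignore_index     # keep only annotated tokens
    correct = (predicted == gold_heads) & mask
    return correct.sum().item() / mask.sum().item()

# Toy batch with one sentence of four tokens.
scores = torch.tensor([[[0.0, 0.1, 9.0, 0.2],
                        [0.0, 0.3, 0.1, 0.1],
                        [0.0, 8.0, 0.2, 7.0],
                        [0.0, 0.2, 0.3, 0.4]]])
gold = torch.tensor([[-100, 2, 0, 1]])

# Predicted heads are [0, 2, 0, 2]; 2 of the 3 annotated tokens are correct.
print(head_accuracy(scores, gold))
```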

Evaluation: MST parsing (10 points)

When you parse the dev set by predicting best heads, there is no guarantee that the edges you predict form a tree. Thus we will need to feed the log-softmax of each edge score into the Chu-Liu-Edmonds algorithm to obtain a maximum spanning tree. Feel free to use an off-the-shelf implementation of CLE (e.g. this one), or implement your own for extra credit.
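
To see why greedy head selection is not enough, it helps to test whether a list of predicted heads forms a tree at all. This is a minimal sketch in word positions, where index 0 is the artificial root; `is_tree` is a helper name invented here, and it only checks that every word reaches the root without running into a cycle.

```python
def is_tree(heads):
    """Check that following heads from every word reaches the root (index 0).

    heads[k] is the predicted head of word k; heads[0] is unused.
    """
    n = len(heads)
    for start in range(1, n):
        seen = set()
        node = start
        while node != 0:
            if node in seen:      # came back to a visited word: a cycle
                return False
            seen.add(node)
            node = heads[node]
    return True

print(is_tree([None, 2, 0, 2]))  # True: 1 -> 2 -> root, 3 -> 2 -> root
print(is_tree([None, 2, 1, 0]))  # False: words 1 and 2 point at each other
```

Counting how many dev sentences fail this check before CLE is applied gives a sense of how much the MST step changes the output.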

You will need to take great care to pass the right values to your CLE implementation and to interpret these values correctly. If you want to pass the scores for all token pairs to CLE, you will need to make sure that the algorithm doesn't assign heads to tokens that should not have heads (because they are not the first token of a word). If you filter the token scores so that only the scores for the first tokens of words remain, you will have to make sure that your predicted heads and the annotated gold heads agree on whether they represent token positions or word positions.

Advanced indexing in Numpy may or may not be useful to you; although it is not well documented, the same indexing will work in Pytorch.
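
As an illustration of advanced indexing, the sketch below reduces a token-level score matrix to the rows and columns of first-subword tokens and maps word-level predictions back to token positions. The matrix and the alignment are hand-written toy values, not output of the model or of tokenize_and_align_labels.

```python
import numpy as np

# Toy 6x6 token-level score matrix: scores[i, j] = score of head i for dependent j.
scores = np.arange(36, dtype=float).reshape(6, 6)

# Hand-written alignment: <s> plus the first token of each of three words.
word_positions = np.array([0, 1, 2, 4])

# np.ix_ builds an index that selects these rows AND these columns.
word_scores = scores[np.ix_(word_positions, word_positions)]
print(word_scores.shape)  # (4, 4)

# Heads predicted in word positions can be mapped back to token positions.
predicted_word_heads = np.array([2, 3, 0])   # heads of words 1..3
predicted_token_heads = word_positions[predicted_word_heads]
print(predicted_token_heads)  # [2 4 0]
```

The same selection can be written as scores[word_positions[:, None], word_positions], a form that relies only on broadcasting of index arrays and carries over to Pytorch tensors.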

Submission

Document the best values you obtained for training loss, head tagging accuracy, and UAS (unlabeled attachment score), and how you got them. Submit a picture of the learning curves that show how these metrics evolved in training.

Where to go from here

Feel free to extend and modify the model in interesting ways. Here are some ideas:

Predict edge labels and compute labeled attachment scores (LAS).
See if you can improve the UAS of your parser to exceed 90 points on en_ewt, e.g. with a more complex model architecture, by including the weights of XLM-RoBERTa in the finetuning, or by implementing a (structured) hinge loss instead of simple cross-entropy.
Save your trained model to a file, and then write some code that loads it and evaluates it on the test set. Maybe you want to read the test data from a CoNLL-U file, save your parses to a CoNLL-U file, and use the official evaluation script.
Try parsing other corpora and languages and discuss your findings.
