Assignment: Dependency parsing

In this assignment, you will implement a graph-based dependency parser. More specifically, you will implement the neural edge scoring model by Dozat & Manning, using RoBERTa encodings for the tokens rather than a three-layer BiLSTM as they did.

Throughout the assignment, be aware of the following pitfalls:

The tokenization in the CoNLL-U files with the dependency annotations is not the same as the tokenization produced by the RoBERTa tokenizer. You are familiar with this from the tagging assignment, but it is now even more unpleasant because you are predicting positions in the sentence and not just POS tags, and may therefore have to map RoBERTa positions back into CoNLL-U positions.
The HEAD annotations in the CoNLL-U files start counting at 1 for the position of the first word; a head of 0 is used to indicate the root node of the dependency tree. By contrast, Python will start counting positions in a list or tensor at 0. Finally, the RoBERTa tokenizer inserts a beginning-of-sequence symbol <s> at the start of the sentence, pushing the position of the first "real" token to 1. You should document your data structures extra carefully to navigate this jungle of indexing decisions.

Loading the corpus

Use the Huggingface and Pytorch methods for loading the en_ewt UD dataset, as in the RoBERTa POS tagging assignment. Unlike in that assignment, we will ignore the annotated POS tags and focus on the head and deprel features. Check the CoNLL-U documentation and the documentation of UD dependency relations for details.

I couldn't find a comprehensive list of UD dependency relations anywhere in the Huggingface source code. Feel free to use this one:


all_deprels = [
                # these are the default UD dependency relations according to https://universaldependencies.org/u/dep/
                "acl", "acl:relcl", "advcl", "advcl:relcl", "advmod", "advmod:emph", "advmod:lmod", "amod", "appos",
               "aux", "aux:pass", "case", "cc", "cc:preconj", "ccomp", "clf", "compound", "compound:lvc",
               "compound:prt", "compound:redup", "compound:svc", "conj", "cop", "csubj", "csubj:outer",
               "csubj:pass", "dep", "det", "det:numgov", "det:nummod", "det:poss", "discourse", "dislocated",
               "expl", "expl:impers", "expl:pass", "expl:pv", "fixed", "flat", "flat:foreign", "flat:name",
               "goeswith", "iobj", "list", "mark", "nmod", "nmod:poss", "nmod:tmod", "nsubj", "nsubj:outer",
               "nsubj:pass", "nummod", "nummod:gov", "obj", "obl", "obl:agent", "obl:arg", "obl:lmod",
               "obl:tmod", "orphan", "parataxis", "punct", "reparandum", "root", "vocative", "xcomp",

                # and yet we need some more for en_ewt
               "det:predet", "obl:npmod", "nmod:npmod"
               ]

Like in the tagging assignment, you will have to deal with the fact that (XLM-) RoBERTa makes tokenization choices that are not compatible with those in the CoNLL-U annotation scheme. You can use the preimplemented tokenize_and_align_labels to tokenize a list of sentences and return some useful information for mapping back and forth between words and RoBERTa tokens. The function returns a dictionary with the following keys:

input_ids: For each position in the tokenized string, a numeric ID for the token at that position. The tokenizer prepends a beginning-of-sequence token <s> at the start of each sentence and appends an end-of-sequence token </s>. The sentences in the same batch of instances that you passed to tokenize_and_align_labels are padded to the same length by adding None tokens.
attention_mask: Hints to RoBERTa that certain tokens were padding (None) and should always receive zero attention.
head: The head field from the UD annotation. It contains a list of ints. In the original annotation, these are positions in the CoNLL-U tokenization; here they have been remapped to positions in the RoBERTa tokenization.
deprel_ids: The deprel field of the annotation, mapped to ints. Each deprel ID is an index in the list all_deprels.
tokens_representing_words: The positions of the RoBERTa tokens that represent words in the CoNLL-U annotation. For each word, the list contains the position of the leftmost token that is aligned to that word. The list always starts with 0, the index of the BOS token. It doesn't technically align to a word, but I found it useful to keep it around; if you don't want it, feel free to change the function or remove it post-hoc.
num_words: The number of words in the CoNLL-U annotation (plus the BOS token). This is useful to know because the tokens_representing_words lists for the same batch were padded to the same length.
tokenid_to_wordid: Maps each token position to the position of its corresponding word.

Implement a function that will print the RoBERTa tokens, one per row, along with their heads, human-readable deprel tokens, and reasonable visualizations of the other features. Use this function to explore the first ten sentences or so of the training set and familiarize yourself with the structure of the data.

Create a Pytorch DataLoader for the train and dev set, like in the other assignment.

Defining your neural model

The Dozat & Manning edge scoring model is quite simple, even if it is not explained very well in their paper. Dozat's comment on the paper reviews might be helpful to clarify it.

In a nutshell, you will proceed as follows:

From the XLM-RoBERTa embeddings of each token, extract representations H_head and H_dep using a one-layer MLP with some output dimension $d$ (see the D&M paper for suggestions on hyperparameters).
Calculate a score for each pair a potential head i and potential dependent j, by multiplying H_head[i].T * U1 * H_dep[j] + H_head[i].T * u2. U1 is a $d \times d$ matrix, and u2 is a $d$-dimensional vector; their entries are parameters of the model which are learned in training. Make sure that H_head and H_dep are of the right shapes to make all the matrix multiplications work; the result should be a single number. You may have to transpose one of the tensors (indicated by the .T above).
Treat these scores as the logits in a cross-entropy loss, as discussed in the tagging assignment.

Note that if you do a bit of thinking with pen and paper, you can consolidate the $O(n^2)$ matrix multiplications into a single tensor multiplication, yielding very concise and efficient code.

Note also that you can use the Parameter class in Pytorch to allocate a tensor whose elements will be optimized in training. This may be useful to you, but if you use Parameters, be aware that Pytorch will not automatically initialize them with reasonable values when training starts. You would have to initialize them yourself, e.g. using one of the methods provided by Pytorch.

Training

Training will mostly proceed as in the tagging assignment. Use W&B or a similar tool to plot your training loss. On my Mac Mini, one epoch of training takes roughly two minutes.

Evaluation

After each epoch, report the "head tagging accuracy" on the dev set. This is a non-standard, but easy to implement measure in which you simply count the proportion of tokens that were assigned the correct head by your parser (ignoring tokens that were assigned the head -100 by the data processing function above). This is a similar evaluation as for the tagger.

However, there is no guarantee that if you simply pick the best head for each token, you will get a tree. Thus we will need to feed the log-softmax of each edge score into the Chu-Liu-Edmonds algorithm to obtain a maximum spanning tree. Feel free to use an off-the-shelf implementation of CLE (e.g. this one), or implement your own for extra credit.

You will need to take great care to pass the right values to your CLE implementation, and that you interpret these values correctly. If you want to pass the scores for all token pairs to CLE, you will need to make sure that the algorithm doesn't assign heads to tokens that should not have heads (because they are not the first token for a word). If you filter the token scores so only the scores for first word tokens remain, you will have to make sure that your predicted heads and the annotated gold heads agree on whether they represent token positions or word positions.

Advanced indexing in Numpy may or may not be useful to you; although it is not well documented, the same indexing will work in Pytorch.

Plot the head tagging accuracy, UAS, and LAS in W&B, and document your best values.

Where to go from here

TBD

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Assignment: Dependency parsing

Loading the corpus

Defining your neural model

Training

Evaluation

Where to go from here

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally