# Assignment 3: Cross-lingual Dependency Parsing

In this assignment you will build a cross-lingual transition-based dependency parser. You will have access to English data for training, but your goal will be to optimize the performance of your model on Danish data.

Your mark will depend on:

* The **reasoning** you provide to explain your dependency parsing model and its predictions,
* The correct **implementation** of your dependency parsing model, and
* The **performance** of your model on a held-out test set.

To develop your model you have access to:

* The data in `data/ud/`. Remember to un-tar the `data.tar.gz` file. The folder contains the train set, the dev set and **a pseudo test set, which matches exactly the name and format of the actual held-out test set**. Note that this pseudo test set is just a copy of the development data, but during grading unseen test data will be used instead.

* Libraries on the [docker image](https://cloud.docker.com/repository/docker/bjerva/stat-nlp-book) which contains everything in [this image](https://github.com/jupyter/docker-stacks/tree/master/scipy-notebook), including scikit-learn, torch 1.2.0 and tensorflow 1.14.0. 


As with the previous assignment, since we have to run the notebooks of all students, and because writing efficient code is important, your notebook should run in 10 minutes at most, including package loading time, on your machine.
Furthermore, you are welcome to provide a saved version of your model with loading code. In this case, loading, testing, and evaluation have to be done within 10 minutes. You can use the pseudo test set to check whether this is the case, and assume that it will also hold for the held-out test set if so.

## Setup Instructions
It is important that this file is placed in the **correct directory**. It will not run otherwise. The correct directory is

    DIRECTORY_OF_YOUR_BOOK/assignments/2019/assignment3/problem/
    
where `DIRECTORY_OF_YOUR_BOOK` is a placeholder for the directory you downloaded the book to. After you have placed it there, **rename the file** to your UCPH ID (of the form `xxxxxx`).

## General Instructions
This notebook will be used by you to provide your solution, and by us to assess your solution. It contains three types of sections:

1. **Setup** Sections: these sections set up code and resources for assessment. **Do not edit these**. 
2. **Assessment** Sections: these sections are used for evaluating the output of your code. **Do not edit these**. 
3. **Task** Sections: these sections require your solutions. They may contain stub code, and you are expected to edit this code. For free-text answers, simply edit the markdown field.

Note that you are free to **create additional notebook cells** within a task section. 

**Do not share** this assignment publicly, by uploading it online, emailing it to friends etc. 

**Do not** copy code from the Web or from other students; this will count as plagiarism.

## Submission Instructions

To submit your solution:

* Make sure that your solution is fully contained in this notebook
and possibly in saved model files. 
* **Rename this notebook to your UCPH ID** (of the form "xxxxxx"), if you have not already done so.
* Download the notebook in Jupyter via *File -> Download as -> Notebook (.ipynb)*.
* Upload the notebook to Absalon, zipped with any saved model files.


## <font color='green'>Setup 1</font>: Load Libraries
This cell loads libraries important for evaluation and assessment of your model. **Do not change it.**

In [2]:
#! SETUP 1
import sys, os
_snlp_book_dir = "../../../../"
sys.path.append(_snlp_book_dir) 
from os.path import join
from collections import deque
import copy
from statnlpbook.dep import *

In this assignment, we will use [Universal Dependencies](https://universaldependencies.org)
treebanks in English and Danish, which can be found in the `data/ud/`
directory of the repository.
The treebanks are given in [the CoNLL-U format](https://universaldependencies.org/format.html).
Familiarize yourself with the format and the information contained in it.

For example, this is a simplified version of one of the annotated sentences from the (English) training set:
~~~
# sent_id = email-enronsent30_01-0030
# text = I would like to get this done ASAP.
1 	I     	I     	PRON  	PRP 	_ 	3 	nsubj  	_ 	_             
2 	would 	would 	AUX   	MD  	_ 	3 	aux    	_ 	_             
3 	like  	like  	VERB  	VB  	_ 	0 	root   	_ 	_             
4 	to    	to    	PART  	TO  	_ 	5 	mark   	_ 	_             
5 	get   	get   	VERB  	VB  	_ 	3 	xcomp  	_ 	_             
6 	this  	this  	PRON  	DT  	_ 	5 	obj    	_ 	_             
7 	done  	do    	VERB  	VBN 	_ 	5 	xcomp  	_ 	_             
8 	ASAP  	asap  	ADV   	RB  	_ 	7 	advmod 	_ 	_ 
9 	.     	.     	PUNCT 	.   	_ 	3 	punct  	_ 	_                                                    
~~~

In this assignment, we will use the `ID`, `FORM`, `LEMMA`, `UPOS`,
`XPOS`, `HEAD` and `DEPREL` columns,
and ignore the `FEATS`, `DEPS` and `MISC` columns (the 6th, 9th and 10th).
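To make the column layout concrete, here is a minimal sketch that splits a single CoNLL-U token line into its ten fields and keeps only the seven used in this assignment. It is not the `load_conllu` implementation from `statnlpbook`; the function name `parse_conllu_line` and the lowercase dictionary keys are assumptions made for illustration only.

```python
# Column order follows the CoNLL-U specification; the lowercase dictionary
# keys are an assumption made for this sketch.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")
USED_FIELDS = ("id", "form", "lemma", "upos", "xpos", "head", "deprel")


def parse_conllu_line(line):
    """Split one tab-separated CoNLL-U token line into a field dictionary."""
    # Strip padding spaces, which appear in the simplified example above.
    values = [value.strip() for value in line.rstrip("\n").split("\t")]
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError("expected 10 columns, got %d" % len(values))
    return dict(zip(CONLLU_FIELDS, values))


# Token 7 of the example sentence.
token = parse_conllu_line("7\tdone\tdo\tVERB\tVBN\t_\t5\txcomp\t_\t_")
used = {name: token[name] for name in USED_FIELDS}
print(used)
```

Note that all values stay strings here, so `head` would need converting with `int` before being used as an index.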

We will use the `load_conllu` and `save_to_conllu` functions from the
`dep` module in the `statnlpbook` package (imported above) to load CoNLL-U files to
sequences of `dict`s and to save them back.
You will use these functions for training and evaluating a dependency parsing model,
and for saving your model's predictions in this format.

Run the following cell to load the English training data and
the Danish development data,
and to show an example Python representation
for an English training sentence:

In [5]:
train_file_path = join(_snlp_book_dir, "data", "ud", "en_ewt-ud-train.conllu")
train_data = load_conllu(train_file_path)
dev_file_path = join(_snlp_book_dir, "data", "ud", "da_ddt-ud-dev.conllu")
dev_data = load_conllu(dev_file_path)
train_data[1740]

OSError: Failed loading line 0 in '':
../../../../data/ud/en_ewt-ud-train.conllu

In [4]:
os.listdir(join(_snlp_book_dir, "data", "ud"))

['da_ddt-ud-dev.conllu',
 'da_ddt-ud-test.conllu',
 'en_ewt-ud-train.conllu',
 'data.tar.gz']

## <font color='blue'>Task 1</font>: Develop an oracle for training the parser

In this task, you will implement a static oracle for the arc-standard transition system.
The oracle's job is to get a parsed dependency tree, and return the sequence
of transitions (actions) needed to reach the tree in the arc-standard transition system.

You can find a description of the expected behaviour of the oracle in the
[reading material](https://web.stanford.edu/~jurafsky/slp3/14.pdf)
and in the [dependency parsing lecture slides](https://nbviewer.jupyter.org/github/copenlu/stat-nlp-book/blob/master/chapters/dependency_parsing_slides.ipynb).
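As a reminder of how arc-standard transitions build a tree, the following sketch replays a hand-written transition sequence for the toy sentence "I like tea" and collects the resulting arcs. It is independent of the `Configuration` class below and is not an oracle: the helper name `run_transitions` and the toy sentence are assumptions for illustration, and tokens are represented only by their indices, with 0 standing for the root.

```python
from collections import deque


def run_transitions(num_words, transitions):
    """Replay arc-standard transitions and return (arcs, stack, buffer).

    Arcs are (head_index, dependent_index, label) triples.
    """
    stack = [0]                               # start with the root on the stack
    buffer = deque(range(1, num_words + 1))   # all words wait in the buffer
    arcs = set()
    for transition in transitions:
        if transition == "shift":
            stack.append(buffer.popleft())
        elif transition.startswith("leftArc"):
            label = transition.split("-", 1)[1]
            dependent = stack.pop(-2)         # second-from-top becomes dependent
            arcs.add((stack[-1], dependent, label))
        elif transition.startswith("rightArc"):
            label = transition.split("-", 1)[1]
            dependent = stack.pop()           # top becomes dependent
            arcs.add((stack[-1], dependent, label))
        else:
            raise ValueError("unknown transition: %r" % transition)
    return arcs, stack, buffer


# Toy sentence: 1=I, 2=like, 3=tea
sequence = ["shift", "shift", "leftArc-nsubj", "shift",
            "rightArc-obj", "rightArc-root"]
arcs, stack, buffer = run_transitions(3, sequence)
print(sorted(arcs))
```

An oracle has to produce such a sequence from the gold tree. In particular, it must delay a right arc until the dependent has collected all of its own dependents, which is why `rightArc-obj` comes only after `tea` has been shifted and `rightArc-root` comes last.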

To perform this task, edit the `TODO` block in the following.
You may also edit anything else in this cell as needed:

In [3]:
## You should improve this cell


class Configuration:
    def __init__(self, nodes):
        # This implements the initial configuration for a sentence
        self.arcs = set()
        self.nodes = nodes
        self.buffer = deque(nodes[1:])  # Initialize with the words
        self.stack = [nodes[0]]  # Initialize with the root

    def apply_transition(self, transition):
        """Modify the configuration accordingly, preparing for the next step"""
        if transition == "shift":
            token = self.buffer.popleft()
            self.stack.append(token)
        elif transition.startswith("leftArc"):
            head = self.stack[-1]
            dependent = self.stack.pop(-2)
            label = transition.split("-")[1]
            self.arcs.add((int(head["index"]), int(dependent["index"]), label))
        elif transition.startswith("rightArc"):
            head = self.stack[-2]
            dependent = self.stack.pop()
            label = transition.split("-")[1]
            self.arcs.add((int(head["index"]), int(dependent["index"]), label))

def oracle(tree):
    """Given a parsed (gold-standard) sentence, return its arc-standard oracle transition sequence.
    Args:
        tree: A parsed sentence.
    Returns:
        Sequence of transitions, as strings: "shift" / "leftArc-"+LABEL / "rightArc-"+LABEL.
    """
    transitions = []  # This stores the generated transition strings
    configuration = Configuration(tree["nodes"])  # Initialize the configuration
    # While the buffer is not empty or the stack contains non-root nodes:
    while configuration.buffer or len(configuration.stack) > 1:
        ### TODO: replace with code to find the correct next transition:
        transition = None
        break
        ### END TODO
        transitions.append(transition)
        configuration.apply_transition(transition)
    return transitions

## <font color='red'>Assessment 1</font>: Correctness of the oracle implementation (20 pts)

We assess if your code implements a correct oracle for the arc-standard transition system:

* 0-5 pts: the oracle does not run correctly or does not constitute a correct oracle
* 5-15 pts: the oracle runs, but is missing some of the requirements laid out above
* 15-20 pts: the oracle correctly implements the requirements

You can test your oracle by running it on some sentences from the training set (will not be assessed explicitly):

In [4]:
oracle(train_data[1740])

[]

## <font color='blue'>Task 2</font>: Design your cross-lingual parser

Before implementing your parser, make the necessary design choices
and motivate them. Note that there are many correct parser designs,
but you must provide reasonable explanations for your choices.

Specifically, answer the following questions:

2.1. What challenges are there in our setting, where training is done on one language (English) and parsing on a different language (Danish)?

2.2. Out of the CoNLL-U columns (`ID`, `FORM`, `LEMMA`, `UPOS`, `XPOS`, `HEAD` and `DEPREL`), which are you going to use for features for the parser? Justify your decision.

2.3. Can beam search be used to improve the parser? How? What kind of parsing errors may it help to overcome?

## <font color='red'>Assessment 2</font>: Assess your explanation (30 pts)

We will mark the explanation along the following dimension: 

* Substance (30pts): correctly explained reasons for model design challenges and decisions.

## <font color='blue'>Task 3</font>: Develop and train a dependency parser

In this task, you will develop a dependency parser,
train it on English, and apply it to Danish.

**Your task is to build a dependency parser, train it on `en_ewt-ud-train.conllu`
(using `da_ddt-ud-dev.conllu` for tuning and model selection if you like),
and save its output on `da_ddt-ud-test.conllu` in CoNLL-U format.**

This model should, at a minimum, implement the following:

1. Classifier to select the next transition given a configuration.

1. Produce valid dependency trees for input sentences from `da_ddt-ud-test.conllu`
and from an unseen test set.
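One way to guarantee valid trees is to restrict the classifier to transitions that are legal in the current configuration. The sketch below lists the legal transition types given only the stack and buffer sizes. The function name `legal_transition_types` and the `single_root` option are assumptions for illustration; it assumes, as in the `Configuration` class above, that the root sits at the bottom of the stack and is counted in the stack size.

```python
def legal_transition_types(stack_size, buffer_size, single_root=True):
    """Return the arc-standard transition types allowed in a configuration.

    stack_size counts the root, which is assumed to sit at the bottom.
    With single_root=True, a word is attached to the root only once the
    buffer is empty, so exactly one word ends up as the root's dependent.
    """
    legal = []
    if buffer_size > 0:
        legal.append("shift")
    if stack_size >= 3:
        # leftArc makes the second-from-top item a dependent,
        # which must never be the root.
        legal.append("leftArc")
    if stack_size >= 2:
        attaches_to_root = stack_size == 2
        if not attaches_to_root or buffer_size == 0 or not single_root:
            legal.append("rightArc")
    return legal


print(legal_transition_types(1, 3))  # initial configuration
print(legal_transition_types(3, 2))
print(legal_transition_types(2, 0))  # last step before termination
```

During parsing, the highest-scoring transition whose type is in this list would be applied. Every non-terminal configuration has at least one legal type under these rules, so the parsing loop cannot get stuck.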

You may assume gold features, i.e.,
that the `ID`, `FORM`, `LEMMA`, `UPOS`, and `XPOS` fields in the input
sentences are annotated correctly.
Note that although this simplifies things, it is an unrealistic scenario:
for real text no gold part-of-speech tags are available, for example, and the
parser would have to use automatically tagged text.
Of course, the `HEAD` and `DEPREL` fields will be given only during training.

You are allowed to use PyTorch or TensorFlow,
and static or contextual word representations,
as in previous assignments.
You are also free to add other improvements, such as
cross-lingual word embeddings
(e.g., [fastText](https://fasttext.cc/docs/en/aligned-vectors.html)),
recurrent neural networks for input representations,
attention over elements in the configuration,
or beam search for decoding.
Please keep the running time limit in mind
if you choose to add this type of extension.
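For orientation, here is a generic beam search sketch on a toy problem, not tied to any particular parser. The names `beam_search`, `expand` and `is_final` are assumptions for illustration. In a parser, a state would be a configuration and `expand` would return the legal transitions with their classifier log-probabilities; here the states are plain tuples of symbols with hand-picked probabilities.

```python
import math


def beam_search(initial_state, expand, is_final, beam_size=4):
    """Keep the beam_size best partial hypotheses at every step.

    expand(state) returns (log_prob, next_state) pairs; scores are summed.
    Returns the (score, state) pair of the best final hypothesis.
    """
    beam = [(0.0, initial_state)]
    while not all(is_final(state) for _, state in beam):
        candidates = []
        for score, state in beam:
            if is_final(state):
                candidates.append((score, state))  # carry finished items over
                continue
            for step_score, next_state in expand(state):
                candidates.append((score + step_score, next_state))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return max(beam, key=lambda item: item[0])


# Toy model: the locally best first step ("a") leads to a worse total score.
TOY_PROBS = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}


def expand(state):
    return [(math.log(p), state + (symbol,))
            for symbol, p in TOY_PROBS[state].items()]


def is_final(state):
    return len(state) == 2


greedy_score, greedy_state = beam_search((), expand, is_final, beam_size=1)
beam_score, beam_state = beam_search((), expand, is_final, beam_size=2)
print(greedy_state, beam_state)
```

With a beam size of 1 the search is greedy and commits to the first step `"a"`; with a beam size of 2 it recovers the sequence with the higher total probability. The cost grows roughly linearly with the beam size, which matters for the running time limit.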

Your task then consists of three steps:

1. Write code that builds a dependency parser that fulfills the requirements laid out above.

2. Train your model on `en_ewt-ud-train.conllu`, optionally using `da_ddt-ud-dev.conllu` as a development set.
If you perform hyperparameter search, model selection or early stopping, make sure the runtime requirement is not
violated. Otherwise run the tuning yourself and include the best hyperparameters in the code.

3. Run your model on `da_ddt-ud-test.conllu` and save the
result as a separate `.conllu` file. Reminder: the test data you have is just a copy of the development data,
but during grading an unseen test file will be used instead.

If training your model takes more than 10 minutes,
you must also implement the saving and loading functions for your model,
and provide saved model files along with your submitted notebook.
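A minimal sketch of saving and loading with the standard-library `pickle` module is shown below, assuming the model is an ordinary picklable Python object. The helper names `save_object` and `load_object` and the toy dictionary standing in for a model are assumptions for illustration; for a PyTorch model, saving the `state_dict` with `torch.save` is the more common route.

```python
import os
import pickle
import tempfile


def save_object(obj, file_path):
    """Serialize a picklable object to file_path."""
    with open(file_path, "wb") as f:
        pickle.dump(obj, f)


def load_object(file_path):
    """Load an object previously written by save_object."""
    with open(file_path, "rb") as f:
        return pickle.load(f)


# A toy stand-in for a trained model (hypothetical contents).
toy_model = {"weights": [0.1, -0.2], "labels": ["shift", "leftArc-nsubj"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    save_object(toy_model, path)
    restored = load_object(path)

print(restored == toy_model)
```

Whatever format you choose, it is worth checking in a fresh kernel that loading plus parsing fits within the time limit.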

The function `evaluate_las` will be used to score your model's predictions.
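The labeled attachment score (LAS) is the fraction of tokens whose predicted head and dependency label both match the gold annotation. The sketch below computes it for a single sentence. It is not the `evaluate_las` implementation, which works on files and may differ in details such as the treatment of particular tokens; the function name `labeled_attachment_score` is an assumption for illustration.

```python
def labeled_attachment_score(gold, predicted):
    """Compute LAS from two equally long lists of (head, deprel) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Gold annotation of the example sentence "I would like to get this done ASAP."
gold = [(3, "nsubj"), (3, "aux"), (0, "root"), (5, "mark"), (3, "xcomp"),
        (5, "obj"), (5, "xcomp"), (7, "advmod"), (3, "punct")]

# A hypothetical prediction with one wrong head and one wrong label.
predicted = list(gold)
predicted[5] = (7, "obj")      # wrong head for "this"
predicted[7] = (7, "obl")      # wrong label for "ASAP"

print(labeled_attachment_score(gold, predicted))
```

Here 7 of 9 tokens are fully correct. A token with the right head but the wrong label counts as an error, which is what distinguishes LAS from the unlabeled attachment score.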

As a basic test, loading the CoNLL-U files and immediately saving them back should produce well-formed `.conllu` files
that score 100% LAS against the original ones:

In [5]:
dev_copy_file_path = join(_snlp_book_dir, "data", "ud", "da_ddt-ud-dev_copy.conllu")
save_to_conllu(dev_data, dev_copy_file_path)
evaluate_las(dev_file_path, dev_copy_file_path)

1.0

Edit the following cell to implement your model.
You should not have to modify `build_tree`, and in `parse` you should
only have to edit the `TODO` block. However, you are free to edit any
part of the cell as needed:

In [6]:
## You should improve this cell


def create_model():
    """Instantiate a dependency parser model.
    Returns:
        A dependency parser model.
    """
    pass

def train_model(model, train_data, dev_data):
    """Train a dependency parser model on the given sequence of parsed sentences.
    Args:
        model: The model to train.
        train_data: The sequence of parsed sentences to train on.
        dev_data: The sequence of parsed sentences to (optionally) use for development.
    """
    pass

def save_model(model, file_path):
    """Save a dependency parser model to the given file path.
    Args:
        model: The model to save.
        file_path: file path to save the model to.
    """
    pass

def load_model(file_path):
    """Load a dependency parser model from a given file path.
    Args:
        file_path: file path to load the model from.
    Returns:
        A dependency parser model.
    """
    pass

def build_tree(sentence, configuration):
    """Insert the arcs from a configuration to a tree"""
    tree = copy.deepcopy(sentence)
    for arc in configuration.arcs:
        head_index, dependent_index, label = arc
        dependent = tree["nodes"][dependent_index]
        dependent["head"] = str(head_index)
        dependent["deprel"] = label
    return tree

def parse(model, data):
    """Apply a dependency parser model to parse the given sequence of (test) sentences.
    Args:
        model: The trained model.
        data: The sequence of unparsed sentences to parse,
        in the same format as parsed sentences but without the
        "head" and "deprel" fields.
    Returns:
        The sequence of sentences parsed by the model.
    """
    trees = []
    for sentence in data:
        configuration = Configuration(sentence["nodes"])  # Initialize the configuration
        # While the buffer is not empty or the stack contains non-root nodes:
        while configuration.buffer or len(configuration.stack) > 1:
            ### TODO: replace with code to find the predicted next transition:
            transition = None
            break
            ### END TODO
            configuration.apply_transition(transition)
        trees.append(build_tree(sentence, configuration))
    return trees

## <font color='red'>Assessment 3</font>: Correctness of the implementation (30 pts)

We assess if your code implements a correct dependency parser (20 points):

* 0-5 pts: the model does not run correctly or does not constitute a dependency parser
* 5-15 pts: the model runs, but is missing some of the requirements laid out above
* 15-20 pts: the model correctly implements the requirements

Additionally, we will assess how well your model performs on an unseen test set (10 points):

* 0-5 pts: performance worse than a simple baseline model
* 5-10 pts: performance better than a simple baseline model


In [7]:
model = create_model()
train_model(model, train_data, dev_data)
# Evaluate on the development set to check yourself:
dev_data_pred = parse(model, dev_data)
dev_pred_file_path = join(_snlp_book_dir, "data", "ud", "dev_predictions.conllu")
save_to_conllu(dev_data_pred, dev_pred_file_path)
evaluate_las(dev_file_path, dev_pred_file_path)

1.0

In [None]:
# Do not modify this cell. We will run this on an unseen test set:
test_file_path = join(_snlp_book_dir, "data", "ud", "da_ddt-ud-test.conllu")
test_data = load_conllu(test_file_path)
test_data_pred = parse(model, test_data)
test_pred_file_path = join(_snlp_book_dir, "data", "ud", "test_predictions.conllu")
save_to_conllu(test_data_pred, test_pred_file_path)
evaluate_las(test_file_path, test_pred_file_path)

## <font color='blue'>Task 4</font>: Error analysis

Reflect on the model implemented in Task 3.
What worked well, what did not, and how would you explain this?
You are welcome to perform a small error analysis on the development set
to support your explanation.

## <font color='red'>Assessment 4</font>: Assess your explanation (20 pts)

We will mark the explanation along the following dimension: 

* Substance (20pts): correctly explained reasons for performance of the model.