# Neural Dependency Parsing

Derived from code for Stanford CS224N, by:
- Sahil Chopra <schopra8@stanford.edu>
- Haoshen Hong <haoshen@stanford.edu>

In this homework, you’ll be implementing a neural-network based dependency parser with the goal of maximizing performance on the UAS (Unlabeled Attachment Score) metric.

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between head words and the dependent words that modify those heads.
There are several types of dependency parsers, including transition-based parsers, graph-based parsers, and feature-based parsers. Your implementation will be a transition-based parser, which incrementally builds up a parse one step at a time.
The parser maintains a state, which is represented as follows:

- A `stack` of words that are currently being processed.
- A `buffer` of words yet to be processed.
- A `list` of dependencies predicted by the parser.

Initially, the stack only contains `ROOT`, the dependencies list is empty, and the buffer contains the list of words of the sentence. At each step, the parser applies a transition to its state until its buffer is empty and the stack size is 1.
The following transitions can be applied:

- `SHIFT`: removes the first word from the buffer and pushes it onto the stack.
- `LEFT-ARC`: marks the second (second most recently added) item on the stack as a dependent of the first item and removes the second item from the stack, adding a first word → second word dependency to the dependency list.
- `RIGHT-ARC`: marks the first (most recently added) item on the stack as a dependent of the second item and removes the first item from the stack, adding a second word → first word dependency to the dependency list.
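
The three transitions above can be sketched on plain Python lists. This is only an illustration of the mechanics, not the `ParserState` class from `parser_state.py`: the helper names `apply_transition` and `run` are hypothetical, and words are plain strings rather than `Token` tuples.

```python
def apply_transition(stack, buffer, deps, transition):
    """Apply one transition in place (hypothetical helper)."""
    if transition == "S":        # SHIFT: move first buffer word onto the stack
        stack.append(buffer.pop(0))
    elif transition == "LA":     # LEFT-ARC: second item becomes dependent of first
        dependent = stack.pop(-2)
        deps.append((stack[-1], dependent))
    elif transition == "RA":     # RIGHT-ARC: first item becomes dependent of second
        dependent = stack.pop()
        deps.append((stack[-1], dependent))
    else:
        raise ValueError(f"unknown transition: {transition}")


def run(words, transitions):
    """Start from the initial state and apply the transitions in order."""
    stack, buffer, deps = ["ROOT"], list(words), []
    for t in transitions:
        apply_transition(stack, buffer, deps, t)
    return stack, buffer, deps


stack, buffer, deps = run(["parse", "this", "sentence"],
                          ["S", "S", "S", "LA", "RA", "RA"])
print(stack, buffer, sorted(deps))
```

Each dependency is stored as a `(head, dependent)` pair, and parsing ends with an empty buffer and only `ROOT` left on the stack.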

On each step, your parser will decide among the three transitions using a neural network classifier.

# 1   Preliminaries

## 1.1  Transitions
Provide the sequence of Attardi’s non-projective transitions for parsing the following sentence:

`The president scheduled a meeting yesterday that nobody attended.`

`S S LA S LA S S LA S RA2 S S S S LA LA RA RA`

The action `RA2` occurs when the state is:

stack: `ROOT`, `scheduled`, `meeting`<br/>
buffer: `yesterday`, `that`, `nobody`, `attended`

creating the arc `scheduled -> yesterday`.

## 1.2 Features

*What is the difference in terms of features between neural network dependency parsers (e.g. Chen&Manning 2014, https://cs.stanford.edu/~danqi/papers/emnlp2014.pdf) and non-neural network dependency parsers (e.g. parsers with lots of features like Zhang&Nivre 2011, www.anthology.aclweb.org/P/P11/P11-2033.pdf), in particular in terms of sparsity?*

Non-neural dependency parsers classify using millions of sparse indicator features, which generalize poorly and are expensive to compute.
A neural network dependency parser uses a small number of dense features, pretrained on a large number of documents, that encode hidden representations of tokens. Such a parser can be both more accurate and more efficient.

## 1.3  Ambiguity

*What is the ambiguity in parsing the following sentence?*<br/>
`There are statistics about poverty that no one is willing to accept`

The subordinate clause `that no one is willing to accept` might modify either `statistics` or `poverty`.

## 1.4 Parse Tree

*Identify the errors that make the following an incorrect dependency tree:*

![image.png](attachment:b0612af2-31cc-4a3e-a7e9-d54a7cfa5359.png)

This is not a tree since B has two heads. The non-projectivity of arc C->A is not a problem.

## Exercise 1.
Implement the `__init__` and `step` methods of the `ParserState` class in `parser_state.py`. They implement the transition mechanics your parser will use.

In [1]:
from parser_state import ParserState

We will represent sentences as lists of tokens, where tokens are named tuples:

In [2]:
from collections import namedtuple

Token = namedtuple('Token', ['id', 'form', 'pos', 'head', 'deprel'], defaults=(0,)*5)

Example of a sentence:

In [3]:
[Token(1, 'The'), Token(2, 'cat')]

[Token(id=1, form='The', pos=0, head=0, deprel=0),
 Token(id=2, form='cat', pos=0, head=0, deprel=0)]

## Test a single parser step

In [4]:
def test_step(transition, stack, buf, deps,
              ex_stack, ex_buf, ex_deps):
    """Tests that a single parse step returns the expected output"""
    ps = ParserState([Token(i, f) for i,f in enumerate(stack)],
                    [Token(i + len(stack), f) for i,f in enumerate(buf)],
                    deps)
    
    ps.step(ps.tr2id[transition]) # convert the action name to its numeric id
    stack = [t.form for t in ps.stack] # collect the words
    buf = [t.form for t in ps.buffer]
    deps = [(a[0].form, a[1].form) for a in sorted(ps.arcs)]
    assert stack == ex_stack, \
        f"{transition} test resulted in stack {stack}, expected {ex_stack}"
    assert buf == ex_buf, \
        f"{transition} test resulted in buffer {buf}, expected {ex_buf}"
    assert deps == ex_deps, \
        f"{transition} test resulted in dependency list {deps}, expected {ex_deps}"
    print(f"{transition} test passed!")

Perform a few tests:

In [5]:
test_step("S", ["ROOT", "the"], ["cat", "sat"], [],
          ["ROOT", "the", "cat"], ["sat"], [])
test_step("LA", ["ROOT", "the", "cat"], ["sat"], [],
          ["ROOT", "cat"], ["sat"], [("cat", "the")])
test_step("RA", ["ROOT", "run", "fast"], [], [],
          ["ROOT", "run"], [], [("run", "fast")])

S test passed!
LA test passed!
RA test passed!


## Test parsing a sentence

In [6]:
ROOT = Token(0, 'ROOT')

def test_parse():
    """Simple tests for the ParserState.parse function.
    Warning: these are not exhaustive.
    """
    sentence = [Token(i+1, f) for i,f in enumerate(["parse", "this", "sentence"])]
    state = ParserState(stack=[ROOT], buffer=sentence)
    dependencies = state.parse(["S", "S", "S", "LA", "RA", "RA"])
    dependencies = [(a[0].form, a[1].form) for a in sorted(dependencies)]
    expected = [('ROOT', 'parse'), ('parse', 'sentence'), ('sentence', 'this')]
    assert dependencies == expected, \
        f"parse test resulted in dependencies {dependencies}, expected {expected}"
    assert [t.form for t in sentence] == ["parse", "this", "sentence"], \
        "parse test failed: the input sentence should not be modified"
    print("parse test passed!")

In [7]:
test_parse()

parse test passed!


# Exercise 2

We are now going to train a neural network to predict, given the state of the stack, buffer, and dependencies, which transition should be applied next.<br/>
First, the model extracts a feature vector representing the current state. We will be using the feature set presented in the paper by Chen and Manning (2014), "A Fast and Accurate Dependency Parser using Neural Networks", https://nlp.stanford.edu/pubs/emnlp2014-depparser.pdf.

The method `ParserState.extract_features()`, which extracts these features, is implemented in `parser_state.py`.
These features consist of a triple:
- a list of tokens (e.g., the last word in the stack, first word in the buffer, dependent of the second-to-last word in the stack if there is one, etc.)
- a list of POS tags for the same tokens
- a list of DEPRELs for the same tokens

Each element is represented by an integer id, so the triple has the form:

$$[ [w_1,w_2,...,w_m], [p_1, p_2,...,p_m], [d_1, d_2,..., d_m] ]$$

where $m$ is the number of features and each $0 ≤ w_i < |V|$ is the index of a token in the vocabulary ($|V|$ is the vocabulary size) and similarly for $p_i$ and $d_i$.
Then our network looks up an embedding for each word and tag and concatenates them into a single input vector:
$$x = [E_{w_1},...,E_{w_m},Ep_{p_1},...,Ep_{p_m},Ed_{d_1},...,Ed_{d_m}] ∈ \mathbb{R}^{(d+d_p+d_d)m}$$
where $E ∈ \mathbb{R}^{|V|×d}$ is an embedding matrix with each row $E_w$ as the vector for a particular word $w$, and similarly $Ep$ and $Ed$ are the embedding matrices for POS tags and DEPRELs, with dimensions $d_p$ and $d_d$ respectively.<br/>
We then compute our prediction as:
$$h = \mathrm{ReLU}(xW + b_1)$$
$$l = hU + b_2$$
$$\hat{y} = \mathrm{softmax}(l)$$
where $h$ is referred to as the hidden layer, $l$ is referred to as the logits, $\hat{y}$ is referred to as the predictions, and $\mathrm{ReLU}(z) = \max(z, 0)$. We will train the model to minimize cross-entropy loss:
$$J(\theta) = CE(y,\hat{y}) = -\sum_{i=1}^{a} y_i \log \hat{y}_i$$
where $a$ is the number of possible parser actions.
To compute the loss for the training set, we average this $J(θ)$ across all training examples.
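
The forward pass and loss above can be traced numerically with a small NumPy sketch. All sizes and weights here are toy assumptions chosen for illustration (random parameters, tiny vocabularies); this is not the Keras model from `model.py`, only the same equations written out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration only)
m, V, n_pos, n_dep = 18, 10, 5, 4    # features per list, vocabulary sizes
d, d_p, d_d = 6, 3, 3                # embedding dimensions
hidden, a = 8, 3                     # hidden units, number of actions

# Randomly initialised parameters
E = rng.normal(scale=0.1, size=(V, d))
Ep = rng.normal(scale=0.1, size=(n_pos, d_p))
Ed = rng.normal(scale=0.1, size=(n_dep, d_d))
W = rng.normal(scale=0.1, size=((d + d_p + d_d) * m, hidden))
b1 = np.zeros(hidden)
U = rng.normal(scale=0.1, size=(hidden, a))
b2 = np.zeros(a)


def forward(w, p, dep):
    """Compute y_hat for one parser state given its three id lists."""
    # x = [E_{w_1..w_m}, Ep_{p_1..p_m}, Ed_{d_1..d_m}]
    x = np.concatenate([E[w].ravel(), Ep[p].ravel(), Ed[dep].ravel()])
    h = np.maximum(x @ W + b1, 0)    # ReLU hidden layer
    l = h @ U + b2                   # logits
    z = np.exp(l - l.max())          # numerically stable softmax
    return z / z.sum()


w = rng.integers(0, V, size=m)
p = rng.integers(0, n_pos, size=m)
dep = rng.integers(0, n_dep, size=m)

y_hat = forward(w, p, dep)
gold_action = 1                      # index of the correct transition
loss = -np.log(y_hat[gold_action])   # cross-entropy with a one-hot y
print(y_hat.shape, float(y_hat.sum()), float(loss))
```

Because $y$ is one-hot, the sum in $J(\theta)$ reduces to the negative log-probability of the gold action.
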
We will use UAS as our evaluation metric. UAS (Unlabeled Attachment Score) is the ratio of the number of correctly predicted dependencies to the total number of dependencies, irrespective of the relation labels.
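
As a minimal sketch of the metric, UAS for one sentence can be computed by comparing the predicted head of each token with its gold head. The function name and the head lists below are illustrative assumptions; how `parser.parse` aggregates over sentences (or treats punctuation) is not shown in this notebook.

```python
def unlabeled_attachment_score(gold_heads, pred_heads):
    """Fraction of tokens whose predicted head matches the gold head."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted head lists differ in length")
    if not gold_heads:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


# Illustrative head indices for a five-token sentence (0 = ROOT)
gold = [2, 0, 2, 5, 2]
pred = [2, 0, 2, 5, 3]
print(unlabeled_attachment_score(gold, pred))  # 0.8
```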

In `model.py` you will find skeleton code to implement this simple neural network using Keras. Complete the `__init__` methods to implement the model.

Then complete the train for epoch and train functions. Finally, run `python run.py` to train your model and compute predictions on test data from the Penn Treebank (annotated with Universal Dependencies), available in the files `data/train.gold.conll`, `data/dev.gold.conll` and `data/test.gold.conll`.


## Load the training data

In [8]:
from corpus import read_conll

train_file = 'data/train.gold.conll'
dev_file = 'data/dev.gold.conll'
test_file = 'data/test.gold.conll'

max_sent = 1000 # limit sentences during development

train_sents = read_conll(train_file, max_sent=max_sent)
dev_sents = read_conll(dev_file, max_sent=max_sent//2)
test_sents = read_conll(test_file, max_sent=max_sent//2)

## Create the parser

In [9]:
from parser import Parser

parser = Parser(train_sents)



## Convert to numeric vectors

In [10]:
train_vectors = parser.vectorize(train_sents)
dev_vectors = parser.vectorize(dev_sents)

## Build the training set

The implementation considers 18 tokens (i.e. $m=18$), more precisely:
  - the top 3 tokens of the stack and the first 3 tokens of the buffer
  - the following children of each of the top 2 stack tokens:
    `lc[0]`, `rc[0]`, `lc[1]`, `rc[1]`, `llc[0]`, `rrc[0]`

The deprels of the following children of the top 2 stack tokens are considered:
`lc[0]`, `rc[0]`, `lc[1]`, `rc[1]`, `llc[0]`, `rrc[0]`

A parser state is represented by a triple of such features:<br/>
[list of form ids], [list of POS ids], [list of DEPREL ids].
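
The slot layout can be checked with a short sketch. The slot names, the `NULL` padding symbol, and the "top of stack first" ordering are assumptions made for illustration; the actual order used by `ParserState.extract_features()` may differ.

```python
NULL = "<NULL>"  # hypothetical padding symbol for missing positions

CHILD_SLOTS = ["lc[0]", "rc[0]", "lc[1]", "rc[1]", "llc[0]", "rrc[0]"]

# 3 stack slots + 3 buffer slots + 6 child slots for each of the top 2 stack tokens
slot_names = (
    [f"s{i}" for i in range(3)]
    + [f"b{i}" for i in range(3)]
    + [f"s{i}.{c}" for i in range(2) for c in CHILD_SLOTS]
)


def stack_buffer_slots(stack, buffer, n=3):
    """Top-n stack items (top first) and first-n buffer items, padded with NULL."""
    top = stack[::-1][:n]
    top = top + [NULL] * (n - len(top))
    first = buffer[:n]
    first = first + [NULL] * (n - len(first))
    return top + first


print(len(slot_names))
print(stack_buffer_slots(["ROOT", "the"], ["cat"]))
```

Padding keeps every state at the same length, which is why short stacks and buffers produce repeated ids in the feature lists.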

In [11]:
train_x, train_y = parser.create_features(train_vectors)
dev_x, dev_y = parser.create_features(dev_vectors)

100%|██████████| 1000/1000 [00:01<00:00, 728.27it/s]
100%|██████████| 500/500 [00:00<00:00, 646.93it/s]


Show sample of features:

In [12]:
dev_x[3], dev_y[3]

(([2, 0, 1963, 286, 4281, 249, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
  [2, 0, 13, 38, 32, 30, 8, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
  [2, 2, 2, 2, 2, 2, 16, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
 0)

## Build the datasets

In [13]:
from tensorflow.data import Dataset

ds_train = Dataset.from_tensor_slices((train_x, train_y)).shuffle(1000).batch(32)
ds_dev = Dataset.from_tensor_slices((dev_x, dev_y)).shuffle(1000).batch(32)

## Load the embeddings

In [14]:
#!pip install glove-python-binary
from glove import Glove

glove_path = 'data/en-cw.txt'

glove_embeddings = Glove.load_stanford(glove_path)

## Prepare embedding matrix
Trimmed to the parser vocabulary.

In [15]:
num_tokens = len(parser.tok2id)
embedding_dim = glove_embeddings.word_vectors.shape[1]

import numpy as np

# Start from random vectors, then overwrite the rows of known words
# with their pretrained embeddings.
# Words not found in the embedding index keep their random vectors.
# This includes the representations for "padding" and "OOV".
embedding_matrix = np.random.uniform(-1, 1, (num_tokens, embedding_dim))
for word, i in parser.tok2id.items():
    idx = glove_embeddings.dictionary.get(word)
    if idx is not None:
        embedding_matrix[i] = glove_embeddings.word_vectors[idx]

## Create the model

In [16]:
from model import ParserModel

n_pos = len(parser.pos2id)
n_tags = len(parser.dep2id)
tag_size = 20 # size of embeddings for POS and DEPRELs 
n_actions = n_tags * 2 + 1 # L-d + R-d + 1
hidden_size = 200

model = ParserModel(embeddings=embedding_matrix,
                    n_pos=n_pos, n_tags=n_tags, tag_size=tag_size,
                    n_actions=n_actions, hidden_size=hidden_size)

## Train the model

Choose an optimizer and a loss function: `SparseCategoricalCrossentropy` expects the target classes as integer ids rather than one-hot vectors.
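
To see why integer targets are enough, the sketch below compares cross-entropy computed from a one-hot vector with the "sparse" form that indexes the predicted probability of the gold class directly. The function names and the probabilities are illustrative assumptions, not Keras internals.

```python
import math


def cross_entropy_one_hot(y, y_hat):
    """CE(y, y_hat) = -sum_i y_i * log(y_hat_i), with y a one-hot list."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, y_hat))


def cross_entropy_sparse(label, y_hat):
    """Same loss, with the gold class given as an integer index."""
    return -math.log(y_hat[label])


y_hat = [0.7, 0.2, 0.1]   # illustrative predicted distribution over 3 actions
print(cross_entropy_one_hot([0, 1, 0], y_hat))
print(cross_entropy_sparse(1, y_hat))
```

Both calls print the same value, so storing `train_y` as integer action ids avoids building one-hot vectors.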

Select metrics to measure the loss and the accuracy of the model. These metrics accumulate their values over the batches of an epoch, and the overall result is reported at the end of each epoch.

Compile the model:

In [17]:
from tensorflow.keras import losses, metrics, optimizers

model.compile(
    # Optimizer
    optimizer = optimizers.Adam(),
    # Loss function to minimize
    loss = losses.SparseCategoricalCrossentropy(name='train_loss'),
    # List of metrics to monitor
    metrics = [metrics.SparseCategoricalAccuracy(name='train UAS')],
)

Train the model:

In [18]:
EPOCHS = 3
history = model.fit(ds_train, epochs=EPOCHS,
                    validation_data=ds_dev)

Epoch 1/3
(None, 3, 18) (None, 1602)
(None, 3, 18) (None, 1602)
Epoch 2/3
Epoch 3/3


# Test the model

In [28]:
UAS, LAS = parser.parse(test_sents[:10], model)
print(f'UAS: {UAS*100:.2f}, LAS: {LAS*100:.2f}')

..........UAS: 75.98, LAS: 0.00


In [20]:
model.summary()

Model: "parser_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 features_embedding (Feature  multiple                 268732    
 sEmbedding)                                                     
                                                                 
 dense (Dense)               multiple                  320600    
                                                                 
 dropout (Dropout)           multiple                  0         
                                                                 
 dense_1 (Dense)             multiple                  16683     
                                                                 
Total params: 606,015
Trainable params: 606,015
Non-trainable params: 0
_________________________________________________________________
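
The parameter counts in the summary can be cross-checked with a little arithmetic. This sketch assumes that the second shape printed during training, `(None, 1602)`, is the width of the input to the first `Dense` layer; the other numbers are copied from the cells and outputs above.

```python
hidden_size = 200          # from the model creation cell
input_width = 1602         # assumed: second shape printed during training

# Dense layer: weight matrix plus one bias per hidden unit
dense = input_width * hidden_size + hidden_size

# Output layer: 16683 = hidden_size * n_actions + n_actions
dense_1_reported = 16683
n_actions = dense_1_reported // (hidden_size + 1)

# n_actions = n_tags * 2 + 1, so:
n_tags = (n_actions - 1) // 2

embedding_reported = 268732
total = embedding_reported + dense + dense_1_reported

print(dense, n_actions, n_tags, total)  # 320600 83 41 606015
```

The results agree with the reported `dense` parameters (320,600) and the total (606,015), and imply 83 actions over 41 dependency labels for this run.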


# Exercise 3

Print the output of a few sentences in CoNLL-U format:

In [23]:
parser.parse(test_sents[:3], model, conllu=True)

1	No	_	RB	_	_	7	advmod	_	_
2	,	_	,	_	_	7	punct	_	_
3	it	_	PRP	_	_	7	nsubj	_	_
4	was	_	VBD	_	_	7	cop	_	_
5	n't	_	RB	_	_	7	neg	_	_
6	Black	_	NNP	_	_	7	compound	_	_
7	Monday	_	NNP	_	_	0	root	_	_
8	.	_	.	_	_	7	punct	_	_

1	But	_	CC	_	_	0	root	_	_
2	while	_	IN	_	_	10	mark	_	_
3	the	_	DT	_	_	7	det	_	_
4	New	_	NNP	_	_	7	compound	_	_
5	York	_	NNP	_	_	7	compound	_	_
6	Stock	_	NNP	_	_	7	compound	_	_
7	Exchange	_	NNP	_	_	10	nsubj	_	_
8	did	_	VBD	_	_	10	aux	_	_
9	n't	_	RB	_	_	10	neg	_	_
10	fall	_	VB	_	_	1	advcl	_	_
11	apart	_	RB	_	_	12	advmod	_	_
12	Friday	_	NNP	_	_	10	nmod:tmod	_	_
13	as	_	IN	_	_	33	mark	_	_
14	the	_	DT	_	_	16	det	_	_
15	Dow	_	NNP	_	_	16	compound	_	_
16	Jones	_	NNP	_	_	33	nsubj	_	_
17	Industrial	_	NNP	_	_	18	compound	_	_
18	Average	_	NNP	_	_	19	nsubj	_	_
19	plunged	_	VBD	_	_	33	advmod	_	_
20	190.58	_	CD	_	_	21	nummod	_	_
21	points	_	NNS	_	_	19	dobj	_	_
22	--	_	:	_	_	19	punct	_	_
23	most	_	JJS	_	_	33	dep	_	_
24	of	_	IN	_	_	25	case	_	_
25	it	_	PRP	_	_	23	nmod	_	_
26	in	_	IN	_	_	29	

(0.7397260273972602, 0.0)

# Exercise 4

In [24]:
ex4 = read_conll('data/ex4.conll')
parser.parse(ex4[0:1], model, conllu=True)

1	Moscow	_	NNP	_	_	2	nsubj	_	_
2	sent	_	VBD	_	_	0	<ROOT>	_	_
3	troops	_	NNS	_	_	2	dobj	_	_
4	to	_	IN	_	_	5	case	_	_
5	Afghaninstan	_	NNP	_	_	0	<ROOT>	_	_



(1.0, 0.0)

`to Afghanistan` is correctly attached to `sent`, not to `troops`, which would be a `Prepositional Phrase Attachment Error`.

In [25]:
parser.parse(ex4[1:2], model, conllu=True)

1	I	_	PR	_	_	2	nsubj	_	_
2	disembarked	_	VBD	_	_	0	<ROOT>	_	_
3	and	_	CC	_	_	2	cc	_	_
4	was	_	VBD	_	_	5	auxpass	_	_
5	heading	_	VBG	_	_	0	<ROOT>	_	_
6	to	_	IN	_	_	8	case	_	_
7	a	_	DT	_	_	8	det	_	_
8	wedding	_	NN	_	_	0	<ROOT>	_	_
9	fearing	_	VBG	_	_	0	<ROOT>	_	_
10	for	_	IN	_	_	12	case	_	_
11	my	_	PRP$	_	_	12	nmod:poss	_	_
12	death	_	NN	_	_	9	nmod	_	_



(0.875, 0.0)

`fearing` is incorrectly attached to `wedding`: a `Verb Phrase Attachment Error`.

In [26]:
parser.parse(ex4[2:3], model, conllu=True)

1	It	_	PRP	_	_	2	nsubj	_	_
2	makes	_	VBZ	_	_	0	root	_	_
3	me	_	PRP	_	_	4	nsubj	_	_
4	want	_	VBP	_	_	2	xcomp	_	_
5	to	_	TO	_	_	6	mark	_	_
6	rush	_	VB	_	_	4	xcomp	_	_
7	out	_	IN	_	_	6	advmod	_	_
8	and	_	CC	_	_	7	cc	_	_
9	rescue	_	VB	_	_	7	conj	_	_
10	people	_	NN	_	_	9	dobj	_	_
11	from	_	IN	_	_	12	case	_	_
12	dilemmas	_	_	_	_	9	nmod	_	_
13	of	_	IN	_	_	16	case	_	_
14	their	_	PRP	_	_	16	nmod:poss	_	_
15	own	_	JJ	_	_	16	amod	_	_
16	making	_	NN	_	_	12	nmod	_	_
17	.	_	.	_	_	2	punct	_	_



(0.8235294117647058, 0.0)

`rescue` is incorrectly attached to `out`: a `Coordination Attachment Error`.

`dilemmas` is incorrectly attached to `people` instead of `rescue`.

In [27]:
parser.parse(ex4[3:4], model, conllu=True)

1	Brian	_	NNP	_	_	2	nsubj	_	_
2	has	_	VBZ	_	_	0	root	_	_
3	been	_	_	_	_	2	dobj	_	_
4	one	_	CD	_	_	3	appos	_	_
5	of	_	IN	_	_	9	case	_	_
6	the	_	DT	_	_	9	det	_	_
7	most	_	JJ	_	_	9	amod	_	_
8	crucial	_	JJ	_	_	9	amod	_	_
9	elements	_	NNS	_	_	4	nmod	_	_
10	to	_	IN	_	_	12	case	_	_
11	the	_	DT	_	_	12	det	_	_
12	success	_	NN	_	_	4	nmod	_	_
13	of	_	IN	_	_	14	case	_	_
14	Mozilla	_	NNP	_	_	12	nmod	_	_
15	.	_	.	_	_	2	punct	_	_



(0.5333333333333333, 0.0)

`success` should attach to `crucial`, not to `one`; `Mozilla` should depend on `success`.

Notice that `crucial -> success` is a non-projective dependency, which crosses the arc `one -> elements`.