# Assignment 1 Solutions

Possible solutions to the assignment questions.
The input and output types in your own solutions may differ.

## Assignment: Working with Dependency Graphs (Parses)

The objective of the assignment is to learn how to work with dependency graphs by defining functions.

Read the [spaCy documentation on the dependency parser](https://spacy.io/api/dependencyparser) to learn the methods it provides.

Define functions to:
- extract a path of dependency relations from the ROOT to a token
- extract the subtree of dependents of a given token
- check if a given list of tokens (segment of a sentence) forms a subtree
- identify the head of a span, given its tokens
- extract sentence subject, direct object and indirect object spans

In [1]:
# Let's import spacy & create a doc
import spacy
from spacy.tokens import Doc, Span, Token

nlp = spacy.load('en')
txt = "I saw a man with a telescope."
doc = nlp(txt)

## Q1.1
Function to extract a path of dependency relations from the ROOT to a token.

The objective is to learn to traverse a dependency parse. 

Also useful as a feature for some sequence labeling tasks.

In [2]:
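def get_token_dependency_path(token, root=None):
    """
    Get dependency path of a token: list of dependency relations from ROOT
    :param token: spaCy Token
    :type token: Token
    :param root: label of the root relation (defaults to 'ROOT')
    :type root: str
    :return: dependency relations from the root down to the token
    :rtype: list
    """
    root = 'ROOT' if root is None else root
    path = []
    # climb from the token towards the sentence root; the root is its own head,
    # so also stop there to avoid looping forever on an unexpected root label
    while token.dep_ != root and token.head.i != token.i:
        path.append(token.dep_)
        token = token.head
    return [root] + list(reversed(path))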

In [3]:
# test
for sent in doc.sents:
    for token in sent:
        path = get_token_dependency_path(token)
        print(path)

['ROOT', 'nsubj']
['ROOT']
['ROOT', 'dobj', 'det']
['ROOT', 'dobj']
['ROOT', 'prep']
['ROOT', 'prep', 'pobj', 'det']
['ROOT', 'prep', 'pobj']
['ROOT', 'punct']
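
The same traversal can be traced by hand without loading a model. Below is a minimal sketch on hand-written `heads` and `deps` lists that mirror the parse printed above. The lists and the helper names `dependency_path` and `path_feature` are illustrative assumptions, not spaCy API. It also shows one possible way to turn a path into a single string feature for a sequence labeler.

```python
# Hand-written copy of the parse of "I saw a man with a telescope ."
words = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
heads = [1, 1, 3, 1, 1, 6, 4, 1]  # index of each token's head; the root points to itself
deps = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj", "punct"]


def dependency_path(i, heads, deps, root="ROOT"):
    """Climb from token i to the root, then reverse to get a root-to-token path."""
    path = []
    while deps[i] != root and heads[i] != i:
        path.append(deps[i])
        i = heads[i]
    return [root] + path[::-1]


def path_feature(i, heads, deps, sep="/"):
    """Join the path into one string, e.g. for use as a categorical feature."""
    return sep.join(dependency_path(i, heads, deps))


for i, word in enumerate(words):
    print(word, path_feature(i, heads, deps))
```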


## Q1.2
Function to extract the subtree of dependents of a given token.

Required for other questions.

Several solutions are possible:
1. use `subtree`
2. use `left_edge`, `right_edge`
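
Both options give the same span when the parse is projective, because a projective subtree covers a contiguous range of token indices. The following spaCy-free sketch illustrates that idea on a hand-written head array for the example sentence. The `heads` values and the helper names `subtree_indices` and `subtree_span` are assumptions made for illustration.

```python
# Hand-written parse of "I saw a man with a telescope ." (not produced by spaCy here).
words = ["I", "saw", "a", "man", "with", "a", "telescope", "."]
heads = [1, 1, 3, 1, 1, 6, 4, 1]  # index of each token's head; the root points to itself


def subtree_indices(i, heads):
    """Return the sorted indices of token i and all of its descendants."""
    found = {i}
    changed = True
    while changed:  # repeat until no new descendant is discovered
        changed = False
        for child, head in enumerate(heads):
            if head in found and child not in found:
                found.add(child)
                changed = True
    return sorted(found)


def subtree_span(i, heads):
    """Return (start, end) of the subtree, end exclusive, like a slice."""
    idx = subtree_indices(i, heads)
    return idx[0], idx[-1] + 1


start, end = subtree_span(4, heads)  # subtree of "with"
print(words[start:end])
```

Taking the leftmost and rightmost descendant plays the role of `left_edge` and `right_edge`; listing all descendants plays the role of `subtree`.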

In [4]:
def get_token_subtree_span(token, doc):
    """
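    Get span of the dependency subtree of a token
    :param token: spaCy Token or string (the first matching token is used)
    :param doc: spaCy Doc
    :type doc: Doc
    :return: span covering the token and all its dependents; None if a string is not found
    :rtype: Span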
    """
    if type(token) is str:
        # need to find it in doc; get the first
        for t in doc:
            if t.text == token:
                token = t
                break
        else:
            return None

    if type(token) is Token:
        # return Span(doc, token.left_edge.i, token.right_edge.i + 1)
        subtree = list(token.subtree)  # generator object
        return doc[subtree[0].i:subtree[-1].i + 1]
        

In [5]:
# test
print(get_token_subtree_span("telescope", doc))
print(get_token_subtree_span("with", doc))

# spacy tokens
for token in doc:
    print(get_token_subtree_span(token, doc))


a telescope
with a telescope
I
I saw a man with a telescope.
a
a man
with a telescope
a
a telescope
.


## Q1.3
Function to check if a given list of tokens (segment of a sentence) forms a subtree.

Q1.2 already extracts the subtree of each token, so the solution is to check whether any of these subtrees matches the token list.

In [6]:
def contains(doc, tokens):
    """
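    Find tokens as a contiguous sub-sequence of doc.
    If the tokens are not contiguous within doc, they cannot form a subtree.
    :param doc: sequence of token strings
    :type doc: list
    :param tokens: token strings to look for
    :type tokens: list
    :return: (start, end) indices of the first match (end exclusive), or False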
    """
    for i in range(len(doc) - len(tokens) + 1):
        for j in range(len(tokens)):
            if doc[i+j] != tokens[j]:
                break
        else:
            return (i, i + len(tokens))
    return False


def is_subtree(tokens, doc):
    """
    Return True if a given list of tokens forms a subtree in the dependency parse of a sentence (doc)
    :param tokens: list of strings
    :type tokens: list
    :param doc: spaCy Doc
    :type doc: Doc
    :return:
    :rtype: bool
    """
    # get beginning and end indices of tokens in a doc
    indices = contains([t.text for t in doc], tokens)
    
    if indices is False:
        return False
    
    for sent in doc.sents:
        for token in sent:
            subtree = get_token_subtree_span(token, doc)
            st_indices = subtree[0].i, subtree[-1].i + 1  # to be compatible with indices
            if indices == st_indices:
                return True
            
    return False

In [7]:
# let's test contains
print("indices:", contains(txt.split(), ["a", "man", "with"]))
print(doc[2:5])

indices: (2, 5)
a man with


In [8]:
print(is_subtree(["a", "man", "telescope"], doc))  # not contiguous
print(is_subtree(["a", "man", "with"], doc))       # not a subtree
print(is_subtree(["a", "man"], doc))               # subtree
print(is_subtree(["saw", "a", "man"], doc))        # not a subtree
print(is_subtree(["with", "a", "telescope"], doc))               # subtree

False
False
True
False
True


## Q1.4
Function to identify the head of a span, given its tokens.

The objective is to learn the relation between `Span`, `Doc`, etc. and how to convert one into another.

In [9]:
def get_span_head(span):
    """
    Get head of a span
    :param span: Span, Doc, string or list of token strings
    :return: root token of the span
    :rtype: Token
    """
    if type(span) is Span:
        span = span
    elif type(span) is Doc:
        span = span[0:]
    elif type(span) is str:
        span = nlp(span)[0:]
    elif type(span) is list:
        span = nlp(" ".join(span))[0:]  # a Doc has no .root, so slice it into a Span
    else:
        raise TypeError
    return span.root
        

In [10]:
print(get_span_head("a man with"))
print(get_span_head(doc[2:]))
print(get_span_head(doc))

man
man
saw


## Q1.5
Function to extract sentence subject, direct object and indirect object spans.

In [11]:
def get_sentence_args(sent):
    sent = sent if type(sent) in [Doc, Span] else nlp(sent)
    args = {}
    for token in sent:
        # print(token.dep_)
        # token indices are Doc-level, so slice token.doc (sent may be a Span)
        if token.dep_ == 'nsubj':
            args["subj"] = get_token_subtree_span(token, token.doc).text
        elif token.dep_ in ['dobj', 'obj']:  # mean the same, spacy uses dobj
            args["dobj"] = get_token_subtree_span(token, token.doc).text
        elif token.dep_ in ['iobj', 'dative']:  # mean the same, spacy uses dative
            args["iobj"] = get_token_subtree_span(token, token.doc).text
    return args

In [12]:
print(get_sentence_args(doc))
print(get_sentence_args("she gave me a book"))

{'subj': 'I', 'dobj': 'a man'}
{'subj': 'she', 'iobj': 'me', 'dobj': 'a book'}


# Assignment 2

The assignment is at the intersection of Named Entity Recognition and Dependency Parsing.

0. Evaluate spaCy NER on CoNLL 2003 data (provided)
    - report token-level performance (per class and total)
        - accuracy of correctly recognizing all tokens that belong to named entities (i.e. tag-level accuracy) 
    - report CoNLL chunk-level performance (per class and total);
        - precision, recall, f-measure of correctly recognizing all the named entities in a chunk per class and total  

1. Grouping of Entities.
Write a function to group recognized named entities using the `noun_chunks` property of a [spaCy](https://spacy.io/usage/linguistic-features#noun-chunks) `Doc`. Analyze the groups in terms of the most frequent combinations (i.e. NER types that go together).

2. One of the possible post-processing steps is to fix segmentation errors.
Write a function that extends an entity span to cover the full noun compound. Make use of the `compound` dependency relation.
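
Chunk-level scoring counts an entity as correct only when both its boundaries and its type match the reference, whereas token-level scoring gives credit tag by tag. The provided `conll.evaluate` is not shown in this notebook, so the following is only a minimal sketch of the idea, not its implementation. `extract_chunks` and `chunk_prf` are hypothetical helper names, and the sketch reports overall scores for a single tag sequence only.

```python
def extract_chunks(tags):
    """Collect (start, end, type) chunks from a list of IOB tags; end is exclusive."""
    chunks = set()
    start, ctype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last chunk
        prefix, _, label = tag.partition("-")
        # an open chunk ends at "O", at a new "B-", or when the type changes
        if start is not None and (prefix != "I" or label != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # a chunk starts at "B-", or at an "I-" when no chunk is open
        if prefix in ("B", "I") and start is None:
            start, ctype = i, label
    return chunks


def chunk_prf(ref_tags, hyp_tags):
    """Precision, recall and F1 over exactly matching chunks."""
    ref, hyp = extract_chunks(ref_tags), extract_chunks(hyp_tags)
    correct = len(ref & hyp)
    p = correct / len(hyp) if hyp else 0.0  # zero-division convention is a choice of this sketch
    r = correct / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


ref = ["B-PER", "I-PER", "O", "B-LOC"]
hyp = ["B-PER", "O", "O", "B-LOC"]
print(chunk_prf(ref, hyp))
```

In this toy pair three of four tags agree, but only one of the two reference chunks is matched exactly, which is why chunk-level scores are usually lower than token-level accuracy.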

In [48]:
# to import conll
import os
import sys
sys.path.insert(0, os.path.abspath('../src/'))

from conll import evaluate, read_corpus_conll

In [49]:
# reading corpus
tst = read_corpus_conll('conll2003/test.txt', fs=" ")
print(tst[1])

[('SOCCER', 'NN', 'B-NP', 'O'), ('-', ':', 'O', 'O'), ('JAPAN', 'NNP', 'B-NP', 'B-LOC'), ('GET', 'VB', 'B-VP', 'O'), ('LUCKY', 'NNP', 'B-NP', 'O'), ('WIN', 'NNP', 'I-NP', 'O'), (',', ',', 'O', 'O'), ('CHINA', 'NNP', 'B-NP', 'B-PER'), ('IN', 'IN', 'B-PP', 'O'), ('SURPRISE', 'DT', 'B-NP', 'O'), ('DEFEAT', 'NN', 'I-NP', 'O'), ('.', '.', 'O', 'O')]


In [50]:
def conll2sent_str(conll_segment):
    return " ".join([t[0] for t in conll_segment])

In [None]:
# we are going to work only with these entities, removing the rest
# alternatively we can convert them
spacy2conll = {
    "PERSON": "PER",
    "GPE": "LOC",
    "ORG": "ORG",
}

In [165]:
def join_label(iob, label, oos=None):
    oos = 'O' if oos is None else oos
    if iob == oos:
        return oos
    elif label not in spacy2conll:
        return oos
    else:
        return "-".join([iob, spacy2conll.get(label)])

In [166]:
# let's define function to get proper output
def doc2conll(doc):
    out = []
    for t in doc:
        out.append((t.text, join_label(t.ent_iob_, t.ent_type_)))
    return out

## 2.1. Evaluation

### Solution 1
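First solution to the tokenization issues: replace spaCy's tokenizer with whitespace tokenization so that its tokens line up with the CoNLL tokens.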

In [167]:
from spacy.tokenizer import Tokenizer

nlp1 = spacy.load("en")
nlp1.tokenizer = Tokenizer(nlp1.vocab)  # to use white space tokenization (generally a bad idea for unknown data)

In [168]:
# parsing test set
out1 = []
for seg in tst:
    seg_txt = conll2sent_str(seg)
    seg_doc = nlp1(seg_txt)
    out1.append(doc2conll(seg_doc))

In [169]:
import pandas as pd

res1 = evaluate(tst, out1)
pd_tbl = pd.DataFrame.from_dict(res1, orient='index')
pd_tbl.round(decimals=3)

| class | p | r | f | s |
|---|---|---|---|---|
| LOC | 0.795 | 0.665 | 0.725 | 1668 |
| PER | 0.674 | 0.542 | 0.601 | 1617 |
| MISC | 1.0 | 0.0 | 0.0 | 702 |
| ORG | 0.431 | 0.286 | 0.344 | 1661 |
| total | 0.648 | 0.436 | 0.521 | 5648 |


In [174]:
# token-level performances (ignoring zero division for misc)
from sklearn.metrics import classification_report

hyp_list = [tok[-1] for seg in out1 for tok in seg]
ref_list = [tok[-1] for seg in tst  for tok in seg]
print(classification_report(ref_list, hyp_list))

              precision    recall  f1-score   support

       B-LOC       0.81      0.68      0.74      1668
      B-MISC       0.00      0.00      0.00       702
       B-ORG       0.49      0.33      0.39      1661
       B-PER       0.72      0.58      0.64      1617
       I-LOC       0.58      0.50      0.54       257
      I-MISC       0.00      0.00      0.00       216
       I-ORG       0.40      0.55      0.47       835
       I-PER       0.69      0.75      0.72      1156
           O       0.94      0.98      0.96     38554

    accuracy                           0.89     46666
   macro avg       0.52      0.48      0.50     46666
weighted avg       0.87      0.89      0.88     46666





### Solution 2
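Second solution: keep spaCy's default tokenizer and use `whitespace_` to merge its tokens back into the whitespace-delimited CoNLL tokens.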
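
The default tokenizer may split one CoNLL token into several pieces, so predictions no longer align one-to-one with the reference. Each spaCy token stores its trailing whitespace in `whitespace_`, which is enough to glue the pieces back together. Here is a spaCy-free sketch of just the regrouping step, with `(text, trailing_whitespace)` pairs standing in for tokens. The pairs and the helper name `regroup` are made up for illustration and do not claim to reproduce spaCy's actual splits.

```python
def regroup(pieces):
    """Merge (text, trailing_whitespace) pieces into whitespace-delimited tokens."""
    merged, current = [], ""
    for n, (text, space) in enumerate(pieces):
        current += text
        # a piece followed by whitespace, or the very last piece, ends a token
        if space or n == len(pieces) - 1:
            merged.append(current)
            current = ""
    return merged


# Hypothetical split of "U.S.-based firm" into four pieces.
pieces = [("U.S.", ""), ("-", ""), ("based", " "), ("firm", "")]
print(regroup(pieces))
```

The function below applies the same grouping and additionally decides which entity tag the merged token should receive.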


In [184]:
def doc2conll_merge(doc):
    out = []
    text = ""
    iob_ = []
    ent_ = []
    for t in doc:
        text += t.text
        iob_.append(t.ent_iob_)
        ent_.append(t.ent_type_) 
        
        if len(t.whitespace_) > 0 or t.i == len(doc) - 1:
            if len(ent_) == 1:
                tag = join_label(iob_[0], ent_[0])
            else:
                # you can use your logic here
                # check if it is all 'O'
                if all([x == 'O' for x in iob_]):
                    tag = 'O'
                else:
                    # entity which is not 'O', but all tokens are the same
                    if len(set(ent_)) == 1:
                        tag = join_label('B', ent_[0])
                    else:
                        # take the last, since it is the head usually
                        tag = join_label('B', ent_[-1])
                
            out.append((text, tag))
            
            text = ""
            iob_ = []
            ent_ = []

    return out

In [176]:
# parsing test set
nlp2 = spacy.load("en")
out2 = []
for seg in tst:
    seg_txt = conll2sent_str(seg)
    seg_doc = nlp2(seg_txt)
    out2.append(doc2conll_merge(seg_doc))

In [177]:
hyp_list = [tok[-1] for seg in out2 for tok in seg]
print(classification_report(ref_list, hyp_list))

              precision    recall  f1-score   support

       B-LOC       0.81      0.68      0.74      1668
      B-MISC       0.00      0.00      0.00       702
       B-ORG       0.51      0.33      0.40      1661
       B-PER       0.72      0.60      0.66      1617
       I-LOC       0.63      0.50      0.56       257
      I-MISC       0.00      0.00      0.00       216
       I-ORG       0.42      0.55      0.47       835
       I-PER       0.72      0.75      0.73      1156
           O       0.94      0.98      0.96     38554

    accuracy                           0.90     46666
   macro avg       0.53      0.49      0.50     46666
weighted avg       0.87      0.90      0.88     46666



In [178]:
res2 = evaluate(tst, out2)
pd_tbl = pd.DataFrame.from_dict(res2, orient='index')
pd_tbl.round(decimals=3)

| class | p | r | f | s |
|---|---|---|---|---|
| LOC | 0.802 | 0.671 | 0.731 | 1668 |
| PER | 0.67 | 0.562 | 0.612 | 1617 |
| MISC | 1.0 | 0.0 | 0.0 | 702 |
| ORG | 0.442 | 0.285 | 0.347 | 1661 |
| total | 0.655 | 0.443 | 0.529 | 5648 |


## 2.2. Grouping

In [179]:
def group_ner(doc):
    out = []
    for nc in doc.noun_chunks:
        group = []
        for ent in nc.ents:
            group.append(ent.label_)
        out.append(group)
    return out

In [183]:
from collections import Counter
groups = []
for seg in tst:
    seg_txt = conll2sent_str(seg)
    seg_doc = nlp(seg_txt)
    chunks = group_ner(seg_doc)
    groups.extend(chunks)

print(Counter(["+".join(g) for g in groups]))

Counter({'': 5992, 'GPE': 1232, 'PERSON': 1024, 'ORG': 826, 'DATE': 515, 'CARDINAL': 358, 'NORP': 294, 'ORDINAL': 87, 'EVENT': 55, 'MONEY': 51, 'PERCENT': 51, 'LOC': 48, 'NORP+PERSON': 45, 'QUANTITY': 45, 'TIME': 41, 'CARDINAL+PERSON': 31, 'GPE+PERSON': 29, 'ORG+PERSON': 28, 'FAC': 22, 'CARDINAL+NORP': 14, 'GPE+GPE': 13, 'CARDINAL+GPE': 13, 'CARDINAL+ORG': 13, 'WORK_OF_ART': 12, 'GPE+ORG': 12, 'PRODUCT': 10, 'DATE+EVENT': 9, 'NORP+ORG': 8, 'DATE+TIME': 8, 'ORG+ORG': 7, 'PERSON+PERSON': 6, 'CARDINAL+CARDINAL': 5, 'ORG+GPE': 5, 'NORP+NORP': 5, 'DATE+NORP': 5, 'PERSON+GPE': 4, 'DATE+PERSON': 4, 'GPE+NORP': 4, 'ORG+DATE': 4, 'GPE+FAC': 4, 'ORDINAL+CARDINAL': 3, 'CARDINAL+DATE': 3, 'LAW': 3, 'PERSON+ORDINAL': 3, 'DATE+ORG': 3, 'NORP+ORDINAL': 3, 'GPE+CARDINAL': 3, 'ORG+NORP': 3, 'GPE+LOC': 3, 'GPE+DATE': 3, 'QUANTITY+ORDINAL': 3, 'LANGUAGE': 3, 'CARDINAL+EVENT': 2, 'ORDINAL+NORP': 2, 'DATE+FAC': 2, 'ORDINAL+PERSON': 2, 'GPE+ORDINAL': 2, 'ORDINAL+GPE': 2, 'MONEY+CARDINAL+CARDINAL': 2, 'ORG+O
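
The counts above treat `GPE+ORG` and `ORG+GPE` as different combinations because labels are joined in order of appearance. If only co-occurrence matters, one option is to sort the labels of each group before joining. A small sketch on made-up groups (not the CoNLL results); `count_combinations` is a hypothetical helper, and unlike the cell above it skips noun chunks without entities.

```python
from collections import Counter

# Made-up groups in the format returned by group_ner: one list of labels per noun chunk.
groups = [["GPE", "ORG"], ["ORG", "GPE"], ["PERSON"], [], ["NORP", "PERSON"]]


def count_combinations(groups, ordered=True):
    """Count label combinations, optionally ignoring the order inside a group."""
    keys = []
    for group in groups:
        if not group:  # noun chunk without any entity
            continue
        labels = group if ordered else sorted(group)
        keys.append("+".join(labels))
    return Counter(keys)


print(count_combinations(groups, ordered=True))
print(count_combinations(groups, ordered=False))
```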

## 2.3. Compounding
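
The idea is that a recognized entity may cover only the head of a noun compound, and the `compound` relation tells us which neighbouring tokens belong to it. The following spaCy-free sketch shows a simplified variant of that idea on a hand-written toy analysis. The `heads` and `deps` values and the helper name `extend_over_compounds` are assumptions for illustration; the solution in this section selects compounds differently.

```python
# Hand-written toy analysis of "Manchester United won" (not actual spaCy output).
words = ["Manchester", "United", "won"]
heads = [1, 2, 2]  # index of each token's head; the root points to itself
deps = ["compound", "nsubj", "ROOT"]


def extend_over_compounds(start, end, heads, deps):
    """Grow the half-open span [start, end) over compound modifiers of its tokens."""
    changed = True
    while changed:  # repeat so that chains of compounds are followed
        changed = False
        for i, (head, dep) in enumerate(zip(heads, deps)):
            inside = start <= i < end
            if dep == "compound" and start <= head < end and not inside:
                start, end = min(start, i), max(end, i + 1)
                changed = True
    return start, end


# Suppose the recognizer tagged only "United" (span [1, 2)) as ORG.
start, end = extend_over_compounds(1, 2, heads, deps)
tags = ["O"] * len(words)
for i in range(start, end):
    tags[i] = ("B-" if i == start else "I-") + "ORG"
print(list(zip(words, tags)))
```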

In [40]:
def get_compounds(doc):
    # get all tokens in compound dependency relation
    compounds = [token for token in doc if token.dep_ == 'compound']
    # remove the middle ones to avoid overlaps
    compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound']
    # take a slice of doc (i.e. make a span) from token to its head
    compounds = [doc[token.i:token.head.i + 1] for token in compounds]
    return compounds


def get_span_compound(span, doc):
    """
    Get head of a span and return NN compound
    :param span: entity span
    :param doc: spaCy Doc containing the span
    :return: the compound span sharing the head with the entity, or the span itself
    """
    span_head = get_span_head(span)
    compounds = get_compounds(doc)

    # we are interested in compounds that share the head with the entity,
    # otherwise tag changes
    compounds = [c for c in compounds if c.root == span_head]

    # if there are several, we are taking the shortest (just for safety)
    if compounds:
        return min(compounds, key=len)

    return span

In [215]:
def entities2compound(doc):
    out = doc2conll(doc)
    for ent in doc.ents:
        nnc = get_span_compound(ent, doc)
        if len(nnc) > len(ent):
            bos = nnc[0].i   # 1st index of span
            eos = nnc[-1].i  # last index of span
            # subsequent entities will over-write
            for i in range(bos, eos+1):
                if i == bos:
                    out[i] = (doc[i].text, join_label("B", ent.label_))
                else:
                    out[i] = (doc[i].text, join_label("I", ent.label_))
    return out

In [220]:
# for evaluation this needs to be combined with tokenization & conversion to conll (w.r.t. labels and format)
hyps = []
for seg in tst:
    # separate names, so the `txt` and `doc` from Assignment 1 are not overwritten
    seg_txt = conll2sent_str(seg)
    seg_doc = nlp1(seg_txt)
    hyps.append(entities2compound(seg_doc))

In [222]:
res3 = evaluate(tst, hyps)
pd_tbl = pd.DataFrame.from_dict(res3, orient='index')
pd_tbl.round(decimals=3)

| class | p | r | f | s |
|---|---|---|---|---|
| LOC | 0.787 | 0.649 | 0.712 | 1668 |
| PER | 0.605 | 0.485 | 0.538 | 1617 |
| MISC | 1.0 | 0.0 | 0.0 | 702 |
| ORG | 0.423 | 0.28 | 0.337 | 1661 |
| total | 0.618 | 0.413 | 0.495 | 5648 |


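It performs worse than without post-processing: compared with Solution 1, which uses the same tokenizer, the total chunk-level F-measure drops from 0.521 to 0.495. It does, however, give slightly higher token-level recall for `I-PER` (0.76 vs. 0.75) and `I-LOC` (0.51 vs. 0.50).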

In [223]:
hyp_list = [tok[-1] for seg in hyps for tok in seg]
print(classification_report(ref_list, hyp_list))

              precision    recall  f1-score   support

       B-LOC       0.80      0.66      0.73      1668
      B-MISC       0.00      0.00      0.00       702
       B-ORG       0.48      0.32      0.39      1661
       B-PER       0.66      0.53      0.58      1617
       I-LOC       0.52      0.51      0.51       257
      I-MISC       0.00      0.00      0.00       216
       I-ORG       0.40      0.55      0.46       835
       I-PER       0.63      0.76      0.69      1156
           O       0.94      0.97      0.95     38554

    accuracy                           0.89     46666
   macro avg       0.49      0.48      0.48     46666
weighted avg       0.87      0.89      0.88     46666



