<a href="https://colab.research.google.com/github/adefgreen98/NLU2021-Assignment1/blob/main/code/Assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Understanding Assignment 1 - Dependency Grammars
_Federico Pedeni_, 223993

### Requirements & Test Sentences

In [1]:
import spacy
from typing import Union

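# the 'en' shortcut only resolves in spaCy 2.x; spaCy 3.x needs the full package name (e.g. 'en_core_web_sm')
nlp_spacy = spacy.load('en')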

In [2]:
sentences = "i saw the man with the telescope"
doc_spacy = nlp_spacy(sentences)

test_sentences = {
    'subtree': ["saw", "I saw a woman that saw a man who saw me yesterday"],
    'check': ["with the telescope", "telescope with the"],
    'head': "the man with the telescope",
    'info': "I gave Pooh all my honey. Also, I told Tigger a good bedtime story. Cristopher Robin was relieved to see all these friends."
}

### 1) Extract a path of dependency relations from ROOT to a token
This function extracts a list of paths that traverse the dependency graph from the root to each node of the tree. Each dependency relation is a `3-tuple` of the form `(head, child, type of dependency)`; within a path, the `child` value of each tuple is therefore also the `head` value of the next one.

For each sentence in the given spaCy `Doc`, a `dict` is created that maps each token to its path; for the `root` node, only a self-referential relation is reported. Tokens that occur multiple times in a sentence are distinguished by the __token offsets__ that are part of the dictionary keys.

The dictionary values are lists of dependency relations: while iterating over a sentence, an empty list is initialized for each key and relations are appended while walking backwards, replacing the token with its head until the root node is reached. At this point, the root-to-itself relation is added and the list is reversed.

The function returns a list in which the dictionary at each index refers to the corresponding sentence in the `Doc`.

In [3]:
def get_dependencies(doc:spacy.tokens.Doc=doc_spacy):
    # iterates over all sentences in the doc
    rlist = []
    for sentence in doc.sents:
        res = {}
        rt = sentence.root
        # gets offset for the sentence, so that keys will be indexed independently for each sentence
        offset = sentence[0].i
        for wd in sentence:
            token = wd
            # forms the key
            k = wd.text + f'<{wd.i - offset}>'
            res[k] = []
            while rt != token:
                res[k].append((token.head.text, token.text, token.dep_))
                token = token.head
            # at this point the root should add itself
            res[k].append((token.head.text, token.text, token.dep_))
            res[k].reverse()
        rlist.append(res)
    return rlist

print(f"Dependencies for sentence '{doc_spacy.text}'")
print(*get_dependencies()[0].items(), sep='\n')

Dependencies for sentence 'i saw the man with the telescope'
('i<0>', [('saw', 'saw', 'ROOT'), ('saw', 'i', 'nsubj')])
('saw<1>', [('saw', 'saw', 'ROOT')])
('the<2>', [('saw', 'saw', 'ROOT'), ('saw', 'man', 'dobj'), ('man', 'the', 'det')])
('man<3>', [('saw', 'saw', 'ROOT'), ('saw', 'man', 'dobj')])
('with<4>', [('saw', 'saw', 'ROOT'), ('saw', 'man', 'dobj'), ('man', 'with', 'prep')])
('the<5>', [('saw', 'saw', 'ROOT'), ('saw', 'man', 'dobj'), ('man', 'with', 'prep'), ('with', 'telescope', 'pobj'), ('telescope', 'the', 'det')])
('telescope<6>', [('saw', 'saw', 'ROOT'), ('saw', 'man', 'dobj'), ('man', 'with', 'prep'), ('with', 'telescope', 'pobj')])
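
The backwards walk described above can also be tried without a parser. The following is only a minimal sketch: the parse is hand-copied from the output above into plain lists, and the names `WORDS`, `HEADS`, `DEPS` and `path_from_root` are hypothetical helpers introduced for illustration, not part of the assignment code.

```python
# Toy parse copied from the printed output above (an assumption: no parser is run here).
# HEADS[i] is the index of token i's head; the root points to itself, as in spaCy.
WORDS = ["i", "saw", "the", "man", "with", "the", "telescope"]
HEADS = [1, 1, 3, 1, 3, 6, 4]
DEPS = ["nsubj", "ROOT", "det", "dobj", "prep", "det", "pobj"]

def path_from_root(i, words=WORDS, heads=HEADS, deps=DEPS):
    """Return the (head, child, dep) triples leading from the root down to token i."""
    path = []
    while heads[i] != i:  # climb until the self-headed root is reached
        path.append((words[heads[i]], words[i], deps[i]))
        i = heads[i]
    path.append((words[i], words[i], deps[i]))  # root-to-itself relation
    path.reverse()  # collected bottom-up, reported top-down
    return path

print(path_from_root(6))
```

For index 6 (`telescope`) this reproduces the last path printed above.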


### 2) Extract a subtree of dependants given a token

This exercise is solved with a wrapper around an internal function that accepts a `Span` object, so that the internal function can be reused in exercise 3.

The external function only parses the input and looks up the specified token in the sentences of the `Doc`. Since a token can appear multiple times in a sentence, its occurrences are distinguished by their token index (similarly to exercise 1), and the corresponding subtrees are returned as a mapping from the __token-occurrence string__ to the __subtree__ of dependants, represented as a list of `Token` objects in sentence order. A key matches the requested token when the part of the key that precedes the index specification equals the token text.

The internal function takes a single sentence and builds a dict that maps each token in the sentence to its subtree. Each subtree is represented as a __list of `Token`__ objects in sentence order. The tokens are visited through a BFS-like exploration of the sentence graph: at each step, the queue is extended with the dependants of the current token (excluding the token itself), which are explored later in order of discovery. The exploration ends at the leaves, which add nothing to the queue.



In [4]:
def _get_subtree(doc:spacy.tokens.Span):
    rsdict = {
        doc.root.text + f'<{doc.root.i}>': list(doc.root.subtree)
    }
    q = list(doc.root.children)
    while len(q) > 0:
        token = q.pop(0)
        rsdict[token.text + f'<{token.i}>'] = list(token.subtree)
        # enqueue only the direct children; deeper descendants are reached when those children are popped
        q.extend(token.children)
    return rsdict

def get_subtree(token:str, doc:str, parser=nlp_spacy):
    _doc = parser(doc)
    res = []
    for sent in _doc.sents:
        res.append({})
        for k,v in _get_subtree(sent).items():
            if k.split('<')[0] == token:
                # fills the last added dictionary (for the current sentence) with subtree list
                res[-1][k] = v
    return res

print(f"Subtree for token '{test_sentences['subtree'][0]}' in sentence '{test_sentences['subtree'][1]}'")
print(*get_subtree(test_sentences['subtree'][0], test_sentences['subtree'][1])[0].items(), sep='\n')

Subtree for token 'saw' in sentence 'I saw a woman that saw a man who saw me yesterday'
('saw<1>', [I, saw, a, woman, that, saw, a, man, who, saw, me, yesterday])
('saw<5>', [that, saw, a, man, who, saw, me, yesterday])
('saw<9>', [who, saw, me, yesterday])
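
The breadth-first collection of a subtree can be illustrated on a plain head array. This is a sketch under stated assumptions: the parse of "i saw the man with the telescope" is hand-copied from exercise 1, and `WORDS`, `HEADS` and `subtree` are hypothetical names that stand in for the spaCy objects.

```python
from collections import deque

# Toy parse copied from exercise 1 (an assumption: no parser is run here).
WORDS = ["i", "saw", "the", "man", "with", "the", "telescope"]
HEADS = [1, 1, 3, 1, 3, 6, 4]

def subtree(i, heads=HEADS):
    """Indices of token i and of all its descendants, in sentence order."""
    # Build a head -> direct children map, skipping the root's self-loop.
    children = {}
    for child, head in enumerate(heads):
        if child != head:
            children.setdefault(head, []).append(child)
    found = [i]
    queue = deque(children.get(i, []))
    while queue:
        node = queue.popleft()                # explore in order of discovery
        found.append(node)
        queue.extend(children.get(node, []))  # leaves add nothing
    return sorted(found)

print([WORDS[j] for j in subtree(3)])
```

Sorting the visited indices at the end restores sentence order, which the BFS itself does not guarantee.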


### 3) Check if a given list of tokens (segment of a sentence) forms a subtree

This function accepts a list of tokens (which can also be given as a string with the tokens separated by spaces) and verifies whether they are _all and only_ the components of a subtree contained in the specified `Doc`. First, it iterates over all the sentences in the `Doc`, looking for any that contain all the specified tokens; if none does, the tokens cannot form a subtree of any sentence and `False` is returned.

If at least one suitable sentence is found, the function iterates over all the subtrees of that sentence, obtained with the internal function of exercise 2, and returns `True` if it finds one whose tokens match the input in sentence order, by comparing the input list with the list of `Token.text` values of each subtree.

In [5]:
def check_subtree(tokens:Union[list, str], doc:spacy.tokens.Doc=doc_spacy):
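    # accept either a list of token strings or a whitespace-separated string
    if isinstance(tokens, str):
        tokens = tokens.split()
    token_pool = set(tokens)

    for sentence in doc.sents:
        # cheap pre-check on token texts (not on whitespace chunks, which keep punctuation attached)
        if token_pool.issubset({tok.text for tok in sentence}):
            for subtr in _get_subtree(sentence).values():
                # compare token texts in sentence order
                if [el.text for el in subtr] == tokens:
                    return True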
    return False

for sent in test_sentences['check']:    
    print(f"Test for check_subtree('{sent}')", " ---> ", check_subtree(sent))

Test for check_subtree('with the telescope')  --->  True
Test for check_subtree('telescope with the')  --->  False
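
The same check can be sketched on a hand-written parse, which makes it easy to see why order matters. The data is copied from exercise 1, and `WORDS`, `HEADS`, `descendants` and `forms_subtree` are hypothetical names used only for this illustration.

```python
# Toy parse copied from exercise 1 (an assumption: no parser is run here).
WORDS = ["i", "saw", "the", "man", "with", "the", "telescope"]
HEADS = [1, 1, 3, 1, 3, 6, 4]

def descendants(i, heads):
    """Indices of token i and all its descendants, in sentence order."""
    out = [i]
    for child, head in enumerate(heads):
        if head == i and child != i:  # skip the root's self-loop
            out.extend(descendants(child, heads))
    return sorted(out)

def forms_subtree(tokens, words=WORDS, heads=HEADS):
    """True if `tokens` are exactly the tokens of some subtree, in sentence order."""
    if isinstance(tokens, str):
        tokens = tokens.split()
    return any(
        [words[j] for j in descendants(i, heads)] == tokens
        for i in range(len(words))
    )

print(forms_subtree("with the telescope"))
print(forms_subtree("telescope with the"))
```

A segment such as "the man" is rejected as well: it is contiguous, but the subtree rooted at `man` also contains "with the telescope".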


### 4) Identify head of a span, given its tokens

The input is assumed to be a string; the function also needs a pre-initialized parser to parse it.

The function creates a new `Doc` from the given span text and returns the root token of the `Span` that covers the whole `Doc`, obtained through __Python slicing__ (`tmp[:]`). Because the span is parsed on its own, its head is computed without the context of any surrounding sentence.

The returned value is a `Token` object.

In [6]:
def head_of_span(sentence:str, parser=nlp_spacy):
    tmp = parser(sentence)
    return tmp[:].root

print(f"Head of sentence '{test_sentences['head']}': ", head_of_span(test_sentences['head']))

Head of sentence 'the man with the telescope':  man
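
As a complement, the head of a span can also be located inside an already parsed sentence, without reparsing the span in isolation. This is a different technique from the function above, shown here only as a sketch on a hand-copied parse; `HEADS`, `WORDS` and `span_head` are hypothetical names.

```python
# Toy parse copied from exercise 1 (an assumption: no parser is run here).
WORDS = ["i", "saw", "the", "man", "with", "the", "telescope"]
HEADS = [1, 1, 3, 1, 3, 6, 4]

def span_head(start, end, heads=HEADS):
    """Index of the first token in [start, end) whose head is outside the span.

    The sentence root counts as well, since it is its own head.
    For a span that forms a connected piece of the tree, this token is unique.
    Returns None only for an empty span.
    """
    for i in range(start, end):
        if heads[i] == i or not (start <= heads[i] < end):
            return i
    return None

# "the man with the telescope" covers indices 2..6
print(WORDS[span_head(2, 7)])
```

On this sentence the result agrees with the reparsing approach, although the two can differ when the parser analyses a fragment differently from the full sentence.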


### 5) Extract sentence subject, direct object and indirect object spans

First, this function defines a mapping from the dependency relation types that match the three requested categories to the categories themselves.

It then iterates over each word of each sentence of the `Doc` and populates a dictionary with the spans for each category. Each span is obtained by taking the subtree of a token that has one of these dependency relations (by casting `Token.subtree` to a list); the relations used to classify subject, object and indirect object are:
+ subject: `nsubj`, `nsubjpass`;
+ object: `dobj`;
+ indirect object: `dative`.

The function returns a list with one element per sentence in the `Doc`; each element maps a requested category to the spans (lists of `Token` objects in sentence order) that match the corresponding dependency relation.

In [7]:
def extract_info(doc:spacy.tokens.Doc):
    relations = {
        'nsubj': 'subject',
        'nsubjpass': 'subject',
        'dobj': 'object',
        'dative': 'indirect object',
    }
    res = []
    for sentence in doc.sents:
        tmp = {
            'subject': [],
            'object': [],
            'indirect object': []
        }
        for word in sentence:
            if word.dep_ in relations:
                tmp[relations[word.dep_]].append(list(word.subtree))
        res.append(tmp)
    return res

for sent, info in zip(nlp_spacy(test_sentences['info']).sents, extract_info(nlp_spacy(test_sentences['info']))):
    print("Current sentence: ", sent)
    print("Info: ")
    print(*info.items(), sep='\n')
    print()

Current sentence:  I gave Pooh all my honey.
Info: 
('subject', [I])
('object', [all, my, honey])
('indirect object', [Pooh])

Current sentence:  Also, I told Tigger a good bedtime story.
Info: 
('subject', [I])
('object', [a, good, bedtime, story])
('indirect object', [Tigger])

Current sentence:  Cristopher Robin was relieved to see all these friends.
Info: 
('subject', [Cristopher, Robin])
('object', [all, these, friends])
('indirect object', None)
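
The label-to-category lookup can be sketched on a hand-written parse of the first test sentence. Everything here is an assumption made for illustration: the head indices follow the spans printed above, the labels for "all", "my" and "." (`predet`, `poss`, `punct`) are guesses that do not affect the result, and `WORDS`, `HEADS`, `DEPS`, `subtree` and `extract_roles` are hypothetical names.

```python
# Hand-written parse of "I gave Pooh all my honey." (assumed, not produced by a parser).
WORDS = ["I", "gave", "Pooh", "all", "my", "honey", "."]
HEADS = [1, 1, 1, 5, 5, 1, 1]
DEPS = ["nsubj", "ROOT", "dative", "predet", "poss", "dobj", "punct"]

# Same relation-to-category mapping as in extract_info.
RELATIONS = {
    "nsubj": "subject",
    "nsubjpass": "subject",
    "dobj": "object",
    "dative": "indirect object",
}

def subtree(i, heads):
    """Indices of token i and all its descendants, in sentence order."""
    out = [i]
    for child, head in enumerate(heads):
        if head == i and child != i:  # skip the root's self-loop
            out.extend(subtree(child, heads))
    return sorted(out)

def extract_roles(words=WORDS, heads=HEADS, deps=DEPS):
    """Map each category to the list of spans (token lists) that fill it."""
    roles = {"subject": [], "object": [], "indirect object": []}
    for i, dep in enumerate(deps):
        if dep in RELATIONS:
            roles[RELATIONS[dep]].append([words[j] for j in subtree(i, heads)])
    return roles

print(extract_roles())
```

Each category holds a list of spans, so a sentence with two subjects (for instance in coordinated clauses) would yield two entries under `subject`.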



# Optional Part

Not done.