# DiaParser
**Direct Attentive Dependency Parser**

In [None]:
!pip install diaparser

In [1]:
from diaparser.parsers import Parser



### Create a parser for Italian

Load a pretrained model for Italian, named `it_isdt.dbmdz-electra-xxl`, i.e. a parser trained on the Italian ISDT treebank, using the transformer model `dbmdz/electra-base-italian-xxl-cased-discriminator`.

The model will be downloaded and cached locally for further use.

In [2]:
parser = Parser.load('it_isdt.dbmdz-electra-xxl')

Some weights of the model checkpoint at dbmdz/electra-base-italian-xxl-cased-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


Alternatively, we can just specify the language and accept a default model:

In [3]:
parser = Parser.load(lang='it')



To parse plain text, pass its language code in the `text` argument:

In [4]:
dataset = parser.predict('Lo schermo è buono, ma la batteria dura poco.', text='it')

`dataset` is an instance of `diaparser.utils.Dataset` containing the parse trees for each sentence.

Let's look at the first one:

In [5]:
dataset.sentences[0]

# sent_id = 1
# text = Lo schermo è buono, ma la batteria dura poco.
1	Lo	_	_	_	_	2	det	_	_
2	schermo	_	_	_	_	4	nsubj	_	_
3	è	_	_	_	_	4	cop	_	_
4	buono	_	_	_	_	0	root	_	_
5	,	_	_	_	_	9	punct	_	_
6	ma	_	_	_	_	9	cc	_	_
7	la	_	_	_	_	8	det	_	_
8	batteria	_	_	_	_	9	nsubj	_	_
9	dura	_	_	_	_	4	conj	_	_
10	poco	_	_	_	_	9	advmod	_	_
11	.	_	_	_	_	4	punct	_	_
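The listing above is in CoNLL-U format: ten tab-separated columns per token, of which the parser fills in ID (column 1), FORM (column 2), HEAD (column 7) and DEPREL (column 8). As a minimal sketch, assuming the sentence is available as a plain string, the arcs can be read back as follows. The helper `read_arcs` is hypothetical and not part of DiaParser.

```python
# Hypothetical helper, not part of DiaParser: read dependency arcs
# from a CoNLL-U string such as the one printed above.
conllu = "\n".join([
    "# sent_id = 1",
    "# text = Lo schermo è buono, ma la batteria dura poco.",
    "1\tLo\t_\t_\t_\t_\t2\tdet\t_\t_",
    "2\tschermo\t_\t_\t_\t_\t4\tnsubj\t_\t_",
    "3\tè\t_\t_\t_\t_\t4\tcop\t_\t_",
    "4\tbuono\t_\t_\t_\t_\t0\troot\t_\t_",
    "5\t,\t_\t_\t_\t_\t9\tpunct\t_\t_",
    "6\tma\t_\t_\t_\t_\t9\tcc\t_\t_",
    "7\tla\t_\t_\t_\t_\t8\tdet\t_\t_",
    "8\tbatteria\t_\t_\t_\t_\t9\tnsubj\t_\t_",
    "9\tdura\t_\t_\t_\t_\t4\tconj\t_\t_",
    "10\tpoco\t_\t_\t_\t_\t9\tadvmod\t_\t_",
    "11\t.\t_\t_\t_\t_\t4\tpunct\t_\t_",
])

def read_arcs(text):
    """Return (dependent, relation, head) triples, using word forms."""
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges such as "2-4"
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    forms = {idx: form for idx, form, _, _ in rows}
    forms[0] = "ROOT"  # HEAD 0 marks the root of the sentence
    return [(form, rel, forms[head]) for _, form, head, rel in rows]

arcs = read_arcs(conllu)
for dep, rel, head in arcs:
    print(f"{dep} --{rel}--> {head}")
```

Each printed line reads as "dependent, relation, head", so the root of the tree is the word whose head is `ROOT`.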

## Display parse tree

In [6]:
from spacy import displacy

def display(sent):
    displacy.render(sent.to_displacy(), style='dep', manual=True,
                    options={'compact': True, 'distance': 120, 
                             'word_spacing': 20, 'offset_x':20})

In [7]:
display(dataset.sentences[0])

In [8]:
morph = parser.predict('Non darmelo, prendilo tu', text='it')
morph.sentences[0]

# sent_id = 1
# text = Non darmelo, prendilo tu
1	Non	_	_	_	_	2	advmod	_	_
2-4	darmelo	_	_	_	_	_	_	_	_
2	dar	_	_	_	_	0	root	_	_
3	me	_	_	_	_	2	iobj	_	_
4	lo	_	_	_	_	2	obj	_	_
5	,	_	_	_	_	2	punct	_	_
6-7	prendilo	_	_	_	_	_	_	_	_
6	prendi	_	_	_	_	2	conj	_	_
7	lo	_	_	_	_	6	obj	_	_
8	tu	_	_	_	_	6	nsubj	_	_
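In the output above, the lines with a range ID (`2-4`, `6-7`) are CoNLL-U multiword tokens: the surface form `darmelo` is split into the syntactic words `dar`, `me` and `lo`, and only the syntactic words carry a head and a relation. The following sketch, which works on (ID, FORM) pairs copied from that output, groups the syntactic words under their surface token. The function name `surface_tokens` is made up for illustration.

```python
# (ID, FORM) pairs copied from the output above.
rows = [("1", "Non"), ("2-4", "darmelo"), ("2", "dar"), ("3", "me"),
        ("4", "lo"), ("5", ","), ("6-7", "prendilo"), ("6", "prendi"),
        ("7", "lo"), ("8", "tu")]

def surface_tokens(rows):
    """Pair each surface token with the syntactic words it covers."""
    words = {int(i): form for i, form in rows if i.isdigit()}
    tokens, covered = [], set()
    for idx, form in rows:
        if "-" in idx:
            # Range line: a surface token spanning several syntactic words.
            start, end = map(int, idx.split("-"))
            span = list(range(start, end + 1))
            tokens.append((form, [words[k] for k in span]))
            covered.update(span)
        elif int(idx) not in covered:
            # Ordinary token: surface form and syntactic word coincide.
            tokens.append((form, [form]))
    return tokens

tokens = surface_tokens(rows)
for form, parts in tokens:
    print(form, "->", parts)
```

The sketch relies on the CoNLL-U convention that a range line precedes the words it covers.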

## English
Load a pretrained model for English, named `en_ewt.electra-base`, i.e. a parser trained on the English EWT treebank, using the transformer model `google/electra-base-discriminator`.

In [9]:
parser_en = Parser.load('en_ewt.electra-base')

Some weights of the model checkpoint at google/electra-base-discriminator were not used when initializing ElectraModel: ['discriminator_predictions.dense.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing ElectraModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing ElectraModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Using bos_token, but it is not set yet.
Using eos_token, but it is not set yet.


To parse plain text, pass its language code in the `text` argument:

In [10]:
dataset = parser_en.predict('I like the display, but the battery life is short.', text='en')

`dataset` is an instance of `diaparser.utils.Dataset` containing the predicted syntactic trees.

Let's look at the first one:

In [11]:
dataset.sentences[0]

# sent_id = 1
# text = I like the display, but the battery life is short.
1	I	_	_	_	_	2	nsubj	_	_
2	like	_	_	_	_	0	root	_	_
3	the	_	_	_	_	4	det	_	_
4	display	_	_	_	_	2	obj	_	_
5	,	_	_	_	_	11	punct	_	_
6	but	_	_	_	_	11	cc	_	_
7	the	_	_	_	_	9	det	_	_
8	battery	_	_	_	_	9	compound	_	_
9	life	_	_	_	_	11	nsubj	_	_
10	is	_	_	_	_	11	cop	_	_
11	short	_	_	_	_	2	conj	_	_
12	.	_	_	_	_	2	punct	_	_

In [12]:
display(dataset.sentences[0])

## Parse from tokenized text

Alternatively, you can provide tokenized text, and also ask for the estimated probability of each predicted arc:

In [13]:
dataset = parser_en.predict(['I', 'liked', 'the', 'display', '.'], prob=True)

You may then look at individual fields of the tokens in a sentence and the probability of their arcs.

In [14]:
import torch
print(f"arcs:  {dataset.arcs[0]}\n"
      f"rels:  {dataset.rels[0]}\n"
      f"probs: {dataset.probs[0].gather(1,torch.tensor(dataset.arcs[0]).unsqueeze(1)).squeeze(-1)}")

arcs:  [2, 0, 4, 2, 2]
rels:  ['nsubj', 'root', 'det', 'obj', 'punct']
probs: tensor([1.0000, 1.0000, 1.0000, 1.0000, 0.9999])


In [45]:
# The gather() call above is a compact way of doing:
[float(dataset.probs[0][i,j]) for i,j in enumerate(dataset.arcs[0])]

[1.0, 1.0, 0.9999973773956299, 0.9999995231628418, 0.9998992681503296]
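The `arcs` list printed above gives, for each token, the 1-based position of its head, with `0` standing for the root, and `rels` gives the matching relation labels. A minimal sketch of turning the two lists into a head-to-dependents map is shown below. The values are copied from the output above, and `children_by_head` is a hypothetical helper, not a DiaParser function.

```python
from collections import defaultdict

# Values copied from the output above.
words = ['I', 'liked', 'the', 'display', '.']
arcs = [2, 0, 4, 2, 2]                      # head of each token (1-based, 0 = root)
rels = ['nsubj', 'root', 'det', 'obj', 'punct']

def children_by_head(arcs, rels):
    """Map each head position to its list of (dependent position, relation)."""
    children = defaultdict(list)
    for dep, (head, rel) in enumerate(zip(arcs, rels), start=1):
        children[head].append((dep, rel))
    return dict(children)

tree = children_by_head(arcs, rels)
root = tree[0][0][0]                        # position of the only child of the root
print("root:", words[root - 1])
for dep, rel in tree[root]:
    print(f"  {rel}: {words[dep - 1]}")
```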

# Information Extraction

From the AODA, the Accessibility for Ontarians with Disabilities Act of Ontario: https://www.aoda.ca/integrated/#etrame

In [15]:
art16 = """In addition to the requirements under section 7, obligated organizations that are school boards
or educational or training institutions shall provide educators with accessibility awareness training
related to accessible program or course delivery and instruction.
Obligated organizations that are school boards or educational or training institutions shall keep
a record of the training provided under this section, including the dates on which the training
is provided and the number of individuals to whom it is provided."""

In [16]:
output = parser_en.predict(art16, text='en')

In [17]:
output.sentences

[# sent_id = 1
 # text = In addition to the requirements under section 7, obligated organizations that are school boards or educational or training institutions shall provide educators with accessibility awareness training related to accessible program or course delivery and instruction.
 1	In	_	_	_	_	2	case	_	_
 2	addition	_	_	_	_	22	obl	_	_
 3	to	_	_	_	_	5	case	_	_
 4	the	_	_	_	_	5	det	_	_
 5	requirements	_	_	_	_	2	nmod	_	_
 6	under	_	_	_	_	7	case	_	_
 7	section	_	_	_	_	5	nmod	_	_
 8	7	_	_	_	_	7	nummod	_	_
 9	,	_	_	_	_	22	punct	_	_
 10	obligated	_	_	_	_	11	amod	_	_
 11	organizations	_	_	_	_	22	nsubj	_	_
 12	that	_	_	_	_	15	nsubj	_	_
 13	are	_	_	_	_	15	cop	_	_
 14	school	_	_	_	_	15	compound	_	_
 15	boards	_	_	_	_	11	acl:relcl	_	_
 16	or	_	_	_	_	20	cc	_	_
 17	educational	_	_	_	_	20	amod	_	_
 18	or	_	_	_	_	19	cc	_	_
 19	training	_	_	_	_	17	conj	_	_
 20	institutions	_	_	_	_	15	conj	_	_
 21	shall	_	_	_	_	22	aux	_	_
 22	provide	_	_	_	_	0	root	_	_
 23	educators	_	_	_	_	22	obj	_	_
 24	with	_

In [18]:
display(output.sentences[0])
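The parse of the first sentence already contains the core of the obligation: the root verb `provide`, its modal `shall`, its subject `organizations` and its object `educators`. As a rough sketch, this information can be pulled out of the rows directly. The rows below are a subset copied from the output above, and `extract_obligation` is a hypothetical helper. Note that this output uses the Universal Dependencies label `obj` for the direct object.

```python
# (ID, FORM, HEAD, DEPREL) rows copied from sentence 1 of the output above.
# Only a subset of the tokens attached to the root verb is listed.
rows = [
    (11, "organizations", 22, "nsubj"),
    (21, "shall", 22, "aux"),
    (22, "provide", 0, "root"),
    (23, "educators", 22, "obj"),
]

def extract_obligation(rows):
    """Collect the root verb with its subject, modal auxiliary and object."""
    root = next(r for r in rows if r[3] == "root")
    # Relation -> word form, restricted to direct dependents of the root.
    deps = {rel: form for _, form, head, rel in rows if head == root[0]}
    return {
        "subject": deps.get("nsubj"),
        "modal": deps.get("aux"),
        "verb": root[1],
        "object": deps.get("obj"),
    }

obligation = extract_obligation(rows)
print(obligation)
```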

# Match a pattern

In [20]:
import spacy
from spacy.matcher import DependencyMatcher
!python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [21]:
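def visualise_subtrees(doc, subtrees):
    """Render with displaCy only the arcs whose two ends both lie in a subtree.

    `subtrees` is a list of token indices, or a list of such lists.
    """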

    words = [{"text": t.text, "tag": t.pos_} for t in doc]

    if not isinstance(subtrees[0], list):
        subtrees = [subtrees]

    for subtree in subtrees:
        arcs = []
        tree_indices = set(subtree)
        for index in subtree:
            token = doc[index]
            head = token.head
            if token.head.i == token.i or token.head.i not in tree_indices:
                continue
            else:
                if token.i < head.i:
                    arcs.append(
                        {
                            "start": token.i,
                            "end": head.i,
                            "label": token.dep_,
                            "dir": "left",
                        }
                    )
                else:
                    arcs.append(
                        {
                            "start": head.i,
                            "end": token.i,
                            "label": token.dep_,
                            "dir": "right",
                        }
                    )
        print("Subtree: ", subtree)
        displacy.render(
            {"words": words, "arcs": arcs},
            style="dep",
            options={'compact': True, "distance": 120},
            manual=True
        )

### Pattern for a subtree

- rooted at a VERB (`keep`)
- with a MODAL auxiliary (`shall`)
- with a SUBJECT (`nsubj`)
- and an OBJECT (`dobj`)

In [22]:
pattern = [{'RIGHT_ID': 'root',
            'RIGHT_ATTRS': {'ORTH': 'keep'}},
           {'LEFT_ID': 'root', 'REL_OP': '>', 'RIGHT_ID': 'MODAL',
             'RIGHT_ATTRS': {'ORTH': 'shall'}},
           {'LEFT_ID': 'root', 'REL_OP': '>', 'RIGHT_ID': 'SUBJ',
             'RIGHT_ATTRS': {'DEP': 'nsubj'}},
           {'LEFT_ID': 'root', 'REL_OP': '>', 'RIGHT_ID': 'OBJ',
            'RIGHT_ATTRS': {'DEP': 'dobj'}}]

In [23]:
matcher = DependencyMatcher(nlp.vocab)
matcher.add("pattern", [pattern])

Apply the matcher to the second sentence of Art. 16:

In [24]:
art16_2 = nlp("""Obligated organizations that are school boards or educational or training institutions shall keep a record of the training provided under this section, including the dates on which the training
is provided and the number of individuals to whom it is provided.""")

Show the parse tree

In [25]:
displacy.render(art16_2, options={'compact': True, 'distance': 120, 
                             'word_spacing': 20, 'offset_x':20})

Extract and visualize the subtree

In [26]:
match = matcher(art16_2)[0]
subtree = match[1]
visualise_subtrees(art16_2, subtree)

Subtree:  [12, 11, 1, 14]


Notice that the words in the requested relation are quite far apart, so a shallow analysis would not be able to highlight them.
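To make the indices in the subtree readable, they can be paired with the names of the pattern nodes: in each match returned by the `DependencyMatcher`, the token indices follow the order of the pattern entries. The sketch below avoids loading spaCy and instead assumes that the first 15 tokens of the sentence, which contain no punctuation, coincide with a whitespace split.

```python
# First 15 tokens of the sentence; they contain no punctuation, so a
# whitespace split is assumed to agree with spaCy's tokenization here.
tokens = ("Obligated organizations that are school boards or educational "
          "or training institutions shall keep a record").split()

pattern_ids = ["root", "MODAL", "SUBJ", "OBJ"]  # RIGHT_ID values, in pattern order
subtree = [12, 11, 1, 14]                       # token indices printed above

# Pair every pattern node with the index and text of the token it matched.
named = {name: (i, tokens[i]) for name, i in zip(pattern_ids, subtree)}
for name, (i, text) in named.items():
    print(f"{name:5} -> token {i}: {text}")

# Distance in tokens between the leftmost and rightmost matched word.
span = max(subtree) - min(subtree)
print("span:", span)
```

The subject and the object of `keep` lie 13 tokens apart, with the whole relative clause in between.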