# **Examples of use**

This notebook shows how to use, deploy and extend the [trasepar](https://github.com/anaezquerro/trasepar) repository.

In [1]:
import sys
sys.path.append('../')

## **Data loading**

The [data](../trasepar/data/) module contains four main classes to load the [CoNLL](https://universaldependencies.org/format.html), [Enhanced CoNLL](https://universaldependencies.org/v2/conll-u.html), [SDP](https://alt.qcri.org/semeval2015/task18/) and [PTB](https://catalog.ldc.upenn.edu/desc/addenda/LDC99T42.mrg.txt) formats. All of them share the `from_file()` method, which loads the instances of a formatted file.

In [2]:
from separ.data import CoNLL, EnhancedCoNLL, SDP, PTB


### **Dependency Parsing**


This is an example of a sentence in the [CoNLL](https://universaldependencies.org/format.html) format ([sample.conllu](sample.conllu)):

```
# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200
# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001
# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001
# text = What if Google Morphed Into GoogleOS?
1	What	what	PRON	WP	PronType=Int	0	root	0:root	_
2	if	if	SCONJ	IN	_	4	mark	4:mark	_
3	Google	Google	PROPN	NNP	Number=Sing	4	nsubj	4:nsubj	_
4	Morphed	morph	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	1	advcl	1:advcl:if	_
5	Into	into	ADP	IN	_	6	case	6:case	_
6	GoogleOS	GoogleOS	PROPN	NNP	Number=Sing	4	obl	4:obl:into	SpaceAfter=No
7	?	?	PUNCT	.	_	4	punct	4:punct	_
```

The `CoNLL` class is used to load a complete CoNLL document and extract the _dependency graphs_.

In [3]:
data = CoNLL.from_file('sample.conllu')
print(f'The CoNLL file has ({len(data)}) sentences')
print(data[0].format()) # index as a list 

The CoNLL file has (1) sentences
# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200
# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001
# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001
# text = What if Google Morphed Into GoogleOS?
1	What	what	PRON	WP	PronType=Int	0	root	0:root	_
2	if	if	SCONJ	IN	_	4	mark	4:mark	_
3	Google	Google	PROPN	NNP	Number=Sing	4	nsubj	4:nsubj	_
4	Morphed	morph	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	1	advcl	1:advcl:if	_
5	Into	into	ADP	IN	_	6	case	6:case	_
6	GoogleOS	GoogleOS	PROPN	NNP	Number=Sing	4	obl	4:obl:into	SpaceAfter=No
7	?	?	PUNCT	.	_	4	punct	4:punct	_




Each `CoNLL.Graph` instance stores two key elements: (1) the list of nodes of the graph and (2) the list of arcs. Each node is a `CoNLL.Node` object, which stores the word-level information. Each arc is an `Arc` object, which stores the position of the head (`HEAD`), the position of the dependent (`DEP`) and the dependency relation associated with the arc (`REL`).

In [4]:
graph = data[0]
print(graph.arcs)

# take the first arc 
arc = graph.arcs[0]
print(arc.HEAD, arc.DEP, arc.REL)

[0 --(root)--> 1, 4 --(mark)--> 2, 4 --(nsubj)--> 3, 1 --(advcl)--> 4, 6 --(case)--> 5, 4 --(obl)--> 6, 4 --(punct)--> 7]
0 1 root
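To make the arc structure concrete, the following is a minimal sketch in plain Python that does not use the `separ` classes. It reads the ID, HEAD and DEPREL columns of the sample sentence and builds one `(HEAD, DEP, REL)` triple per word. `SimpleArc` and `read_arcs` are hypothetical names introduced only for this illustration, and the rows are split on whitespace for readability (real CoNLL-U files are tab-separated).

```python
from typing import List, NamedTuple


class SimpleArc(NamedTuple):
    """Hypothetical stand-in for the library's Arc (illustration only)."""
    HEAD: int  # position of the head word (0 is the artificial root)
    DEP: int   # position of the dependent word
    REL: str   # dependency relation


ROWS = """\
1 What what PRON WP PronType=Int 0 root 0:root _
2 if if SCONJ IN _ 4 mark 4:mark _
3 Google Google PROPN NNP Number=Sing 4 nsubj 4:nsubj _
4 Morphed morph VERB VBD Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin 1 advcl 1:advcl:if _
5 Into into ADP IN _ 6 case 6:case _
6 GoogleOS GoogleOS PROPN NNP Number=Sing 4 obl 4:obl:into SpaceAfter=No
7 ? ? PUNCT . _ 4 punct 4:punct _
"""


def read_arcs(rows: str) -> List[SimpleArc]:
    arcs = []
    for line in rows.strip().splitlines():
        cols = line.split()
        # column 1 is ID, column 7 is HEAD and column 8 is DEPREL (1-based)
        arcs.append(SimpleArc(HEAD=int(cols[6]), DEP=int(cols[0]), REL=cols[7]))
    return arcs


arcs = read_arcs(ROWS)
print(arcs[0].HEAD, arcs[0].DEP, arcs[0].REL)  # 0 1 root
print(len(arcs))                               # 7, one arc per word
```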


### **Semantic Parsing**



To load semantic graphs, our code supports two different formats: [Enhanced CoNLL](https://universaldependencies.org/v2/conll-u.html) and [SDP](https://alt.qcri.org/semeval2015/task18/). The previous example ([sample.conllu](sample.conllu)) is also valid in the [Enhanced CoNLL](https://universaldependencies.org/v2/conll-u.html) format. This is an example of an SDP sentence ([sample.sdp](sample.sdp)):
 
```
#SDP 2015
#22100001
1	Consumers	consumer	NNS	-	-	n_of:x-i	_	ARG1	ARG1	_	_	_	_	_	_
2	may	may	MD	+	+	v_modal:e-h	_	_	_	_	_	_	_	_	_
3	want	want	VB	-	+	v:e-i-h	ARG1	_	_	_	_	_	_	_	_
4	to	to	TO	-	-	_	_	_	_	_	_	_	_	_	_
5	move	move	VB	-	+	v_cause:e-i-p	_	_	_	_	_	subord	_	_	_
6	their	their	PRP$	-	+	q:i-h-h	_	_	_	_	_	_	_	_	_
7	telephones	telephone	NNS	-	-	n:x	_	_	ARG2	poss	_	_	_	_	_
8	a	a+little	DT	-	-	x:e-u	_	_	_	_	mwe	_	_	_	_
9	little	a+little	RB	-	+	x:e-u	_	_	_	_	_	_	_	_	_
10	closer	closer	RBR	-	+	a_to:e-i-i	_	ARG2	_	_	ARG1	_	ARG1	_	_
11	to	to	TO	-	+	p:e-u-i	_	_	_	_	_	_	_	_	_
12	the	the	DT	-	+	q:i-h-h	_	_	_	_	_	_	_	_	_
13	TV	tv	NN	-	+	n:x	_	_	_	_	_	_	_	_	_
14	set	set	NN	-	-	n_of:x	_	_	_	_	_	_	ARG2	BV	compound
15	.	_	.	-	-	_	_	_	_	_	_	_	_	_	_
```

For this type of data, we implemented the `SDP` class:

In [5]:
data = SDP.from_file('sample.sdp')
print(f'The SDP file has ({len(data)}) sentences')
print(data[0].format())

The SDP file has (1) sentences
#22100001
1	Consumers	consumer	NNS	-	-	n_of:x-i	_	ARG1	ARG1	_	_	_	_	_	_
2	may	may	MD	+	+	v_modal:e-h	_	_	_	_	_	_	_	_	_
3	want	want	VB	-	+	v:e-i-h	ARG1	_	_	_	_	_	_	_	_
4	to	to	TO	-	-	_	_	_	_	_	_	_	_	_	_
5	move	move	VB	-	+	v_cause:e-i-p	_	_	_	_	_	subord	_	_	_
6	their	their	PRP$	-	+	q:i-h-h	_	_	_	_	_	_	_	_	_
7	telephones	telephone	NNS	-	-	n:x	_	_	ARG2	poss	_	_	_	_	_
8	a	a+little	DT	-	-	x:e-u	_	_	_	_	mwe	_	_	_	_
9	little	a+little	RB	-	+	x:e-u	_	_	_	_	_	_	_	_	_
10	closer	closer	RBR	-	+	a_to:e-i-i	_	ARG2	_	_	ARG1	_	ARG1	_	_
11	to	to	TO	-	+	p:e-u-i	_	_	_	_	_	_	_	_	_
12	the	the	DT	-	+	q:i-h-h	_	_	_	_	_	_	_	_	_
13	TV	tv	NN	-	+	n:x	_	_	_	_	_	_	_	_	_
14	set	set	NN	-	-	n_of:x	_	_	_	_	_	_	ARG2	BV	compound
15	.	_	.	-	-	_	_	_	_	_	_	_	_	_	_




The `SDP.Graph` instance also stores the list of nodes and arcs of the semantic graph. Note that, in this case, the number of arcs can be greater or smaller than the number of nodes, since a word may have several heads or none. In `CoNLL.Graph` instances, which represent a dependency graph where every word has exactly one head, the number of nodes and arcs must match.

In [6]:
data[0].arcs

[3 --(ARG1)--> 1,
 5 --(ARG1)--> 1,
 0 --(TOP)--> 2,
 2 --(ARG1)--> 3,
 10 --(subord)--> 5,
 5 --(ARG2)--> 7,
 6 --(poss)--> 7,
 9 --(mwe)--> 8,
 3 --(ARG2)--> 10,
 9 --(ARG1)--> 10,
 11 --(ARG1)--> 10,
 11 --(ARG2)--> 14,
 12 --(BV)--> 14,
 13 --(compound)--> 14]
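The arcs above can be read directly from the columns of the sample. The following is a minimal sketch, independent of the `SDP` class, that assumes the column layout visible in the sample: ID, FORM, LEMMA, POS, TOP, PRED, FRAME, followed by one argument column per predicate (the k-th argument column refers to the k-th word marked with `+` in the PRED column). `read_sdp_arcs` is a hypothetical helper, and rows are split on whitespace instead of tabs.

```python
SDP_ROWS = """\
#22100001
1 Consumers consumer NNS - - n_of:x-i _ ARG1 ARG1 _ _ _ _ _ _
2 may may MD + + v_modal:e-h _ _ _ _ _ _ _ _ _
3 want want VB - + v:e-i-h ARG1 _ _ _ _ _ _ _ _
4 to to TO - - _ _ _ _ _ _ _ _ _ _
5 move move VB - + v_cause:e-i-p _ _ _ _ _ subord _ _ _
6 their their PRP$ - + q:i-h-h _ _ _ _ _ _ _ _ _
7 telephones telephone NNS - - n:x _ _ ARG2 poss _ _ _ _ _
8 a a+little DT - - x:e-u _ _ _ _ mwe _ _ _ _
9 little a+little RB - + x:e-u _ _ _ _ _ _ _ _ _
10 closer closer RBR - + a_to:e-i-i _ ARG2 _ _ ARG1 _ ARG1 _ _
11 to to TO - + p:e-u-i _ _ _ _ _ _ _ _ _
12 the the DT - + q:i-h-h _ _ _ _ _ _ _ _ _
13 TV tv NN - + n:x _ _ _ _ _ _ _ _ _
14 set set NN - - n_of:x _ _ _ _ _ _ ARG2 BV compound
15 . _ . - - _ _ _ _ _ _ _ _ _ _
"""


def read_sdp_arcs(rows):
    """Return (head, dependent, relation) triples from SDP-formatted rows."""
    table = [line.split() for line in rows.strip().splitlines()
             if not line.startswith('#')]
    # positions of the predicates, in sentence order (PRED column is '+')
    preds = [int(cols[0]) for cols in table if cols[5] == '+']
    arcs = []
    for cols in table:
        dep = int(cols[0])
        if cols[4] == '+':                   # TOP column: arc from the root
            arcs.append((0, dep, 'TOP'))
        for k, rel in enumerate(cols[7:]):   # k-th argument column
            if rel != '_':
                arcs.append((preds[k], dep, rel))
    return arcs


sdp_arcs = read_sdp_arcs(SDP_ROWS)
print(len(sdp_arcs))  # 14 arcs for 15 words
print(sdp_arcs[:3])   # [(3, 1, 'ARG1'), (5, 1, 'ARG1'), (0, 2, 'TOP')]
```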

The `EnhancedCoNLL` class, instead, reads the arcs from the `DEPS` column, as the following example shows:


In [7]:
data = EnhancedCoNLL.from_file('sample.conllu')
print(f'The EnhancedCoNLL file has ({len(data)}) sentences')
print(data[0].format())
print(data[0].arcs)

The EnhancedCoNLL file has (1) sentences
# newdoc id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200
# sent_id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-0001
# newpar id = weblog-blogspot.com_zentelligence_20040423000200_ENG_20040423_000200-p0001
# text = What if Google Morphed Into GoogleOS?
1	What	what	PRON	WP	PronType=Int	0	root	0:root	_
2	if	if	SCONJ	IN	_	1	mark	4:mark	_
3	Google	Google	PROPN	NNP	Number=Sing	2	nsubj	4:nsubj	_
4	Morphed	morph	VERB	VBD	Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin	3	advcl	1:advcl:if	_
5	Into	into	ADP	IN	_	4	case	6:case	_
6	GoogleOS	GoogleOS	PROPN	NNP	Number=Sing	5	obl	4:obl:into	SpaceAfter=No
7	?	?	PUNCT	.	_	6	punct	4:punct	_
[0 --(root)--> 1, 4 --(mark)--> 2, 4 --(nsubj)--> 3, 1 --(advcl:if)--> 4, 6 --(case)--> 5, 4 --(obl:into)--> 6, 4 --(punct)--> 7]
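As a rough illustration of how a `DEPS` value maps to arcs, the sketch below splits a value such as `4:obl:into` into a head position and a relation. In the CoNLL-U format, `DEPS` holds `head:deprel` pairs separated by `|`, and the relation itself may contain colons, so only the first colon is used as separator. `parse_deps` is a hypothetical helper, not part of the repository, and it ignores empty nodes (heads such as `5.1`).

```python
def parse_deps(dep, deps):
    """Split a DEPS value into (head, dependent, relation) triples."""
    if deps == '_':
        return []
    arcs = []
    for item in deps.split('|'):
        # only the first colon separates the head from the relation
        head, rel = item.split(':', 1)
        arcs.append((int(head), dep, rel))
    return arcs


print(parse_deps(6, '4:obl:into'))       # [(4, 6, 'obl:into')]
print(parse_deps(3, '2:nsubj|4:nsubj'))  # [(2, 3, 'nsubj'), (4, 3, 'nsubj')]
```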




The `CoNLL.Graph`, `EnhancedCoNLL.Graph` and `SDP.Graph` classes inherit their methods from the abstract `Graph` class ([trasepar/structs/graph.py](../trasepar/structs/graph.py)). We suggest looking at its Python implementation to see all the available methods (extract cycles, planes, rebuild, etc.).

### **Constituency Parsing**

The `PTB` class loads a file in bracketed format, such as the one provided in [sample.ptb](sample.ptb):

```
(S (INTJ (RB No)) (, ,) (NP-SBJ (PRP it)) (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .))
```

In [8]:
data = PTB.from_file('sample.ptb')
print(data[0].format())

(S (INTJ (RB No)) (, ,) (NP-SBJ (PRP it)) (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .))




As in the other formats, each element of the `PTB` file is a `PTB.Tree`. Our code implements several methods to process and operate on the elements of the tree (e.g. collapse unary chains, get the spans that make up the tree, obtain the PoS tags, etc.). For more information, see the source code ([trasepar/data/ptb.py](../trasepar/data/ptb.py)).

## **Linearization algorithms**

The [trasepar/models](../trasepar/models) module contains the implementation of the different sequence-labeling parsers: 


| Dependency Parsing | Description | Arguments |
|:---|:---|:---|
| [`IndexDependencyParser`](../trasepar/models/dep/idx/parser.py) | Absolute and relative indexing | `rel` |
| [`PoSDependencyParser`](../trasepar/models/dep/pos/parser.py) | PoS-tag relative indexing | |
| [`BracketDependencyParser`](../trasepar/models/dep/bracket/parser.py) | $k$-planar bracket encoding | `k` | 
| [`Bit4DependencyParser`](../trasepar/models/dep/bit4/parser.py) | $1$-planar bit-encoding |  | 
| [`Bit7DependencyParser`](../trasepar/models/dep/bit7/parser.py) | $2$-planar bit-encoding |  | 


| Semantic Parsing | Description | Arguments |
|:---|:---|:---|
| [`IndexSemanticParser`](../trasepar/models/sdp/idx/parser.py) | Absolute and relative graph indexing | `rel` |
| [`BracketSemanticParser`](../trasepar/models/sdp/bracket/parser.py) | $k$-planar bracket graph encoding | `k` | 
| [`Bit4kSemanticParser`](../trasepar/models/sdp/bit4k/parser.py) | $4k$-bit graph encoding ($k$-planar) | `k` |
| [`Bit6kSemanticParser`](../trasepar/models/sdp/bit6k/parser.py) | $6k$-bit graph encoding ($k$-planar) | `k` |

| Constituency Parsing | Description | Arguments |
|:---|:---|:---|
| [`IndexConstituencyParser`](../trasepar/models/con/idx/parser.py) | Absolute and relative indexing | `rel` |

In all parsers, the linearization is performed by the inner `Labeler` class. Its two key methods are `encode()`, which transforms an input graph or tree into a sequence of labels, and `decode()`, which performs the reverse process.

### **Dependency Parsing as Sequence Labeling**

The dependency labelers work with dependency graphs (`CoNLL.Graph`). The source code of each class contains the algorithm that transforms the input graph into a sequence of labels.

In [9]:
from separ.models import IndexDependencyParser, PoSDependencyParser, BracketDependencyParser, Bit4DependencyParser, Bit7DependencyParser

graph = CoNLL.from_file('sample.conllu')[0]
idx = IndexDependencyParser.Labeler()
print('Arcs of the input graph:', graph.arcs)
labels, rels = idx.encode(graph)
print('Sequence of labels:', labels)
print('Dependency relations:', rels)
print('Applying decoding...')
idx.decode(labels, rels)

Arcs of the input graph: [0 --(root)--> 1, 4 --(mark)--> 2, 4 --(nsubj)--> 3, 1 --(advcl)--> 4, 6 --(case)--> 5, 4 --(obl)--> 6, 4 --(punct)--> 7]
Sequence of labels: ['0', '4', '4', '1', '6', '4', '4']
Dependency relations: ['root', 'mark', 'nsubj', 'advcl', 'case', 'obl', 'punct']
Applying decoding...




[0 --(root)--> 1,
 4 --(mark)--> 2,
 4 --(nsubj)--> 3,
 1 --(advcl)--> 4,
 6 --(case)--> 5,
 4 --(obl)--> 6,
 4 --(punct)--> 7]
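The labels printed above are simply the absolute position of the head of each word. The sketch below reproduces that mapping in plain Python and adds a relative variant in which the label is the offset between head and dependent. That the `rel` argument of the parser toggles exactly this offset is an assumption made for illustration; `encode_index` and `decode_index` are hypothetical helpers, not the library's `Labeler`.

```python
def encode_index(heads, relative=False):
    """heads[i] is the head position of word i + 1 (0 is the root)."""
    labels = []
    for dep, head in enumerate(heads, start=1):
        # absolute: the head position; relative: the offset from the word
        labels.append(str(head - dep) if relative else str(head))
    return labels


def decode_index(labels, relative=False):
    """Invert encode_index and recover the head position of every word."""
    heads = []
    for dep, label in enumerate(labels, start=1):
        heads.append(int(label) + dep if relative else int(label))
    return heads


heads = [0, 4, 4, 1, 6, 4, 4]  # heads of the sample sentence
print(encode_index(heads))                 # ['0', '4', '4', '1', '6', '4', '4']
print(encode_index(heads, relative=True))  # ['-1', '2', '1', '-3', '1', '-2', '-3']
```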

In [10]:
bracket = BracketDependencyParser.Labeler()
print('Arcs of the input graph:', graph.arcs)
labels, rels = bracket.encode(graph)
print('Sequence of labels:', labels)
print('Dependency relations:', rels)
print('Applying decoding...')
bracket.decode(labels, rels)

Arcs of the input graph: [0 --(root)--> 1, 4 --(mark)--> 2, 4 --(nsubj)--> 3, 1 --(advcl)--> 4, 6 --(case)--> 5, 4 --(obl)--> 6, 4 --(punct)--> 7]
Sequence of labels: ['/>', '<', '<', '\\\\//>', '<', '\\>', '>']
Dependency relations: ['root', 'mark', 'nsubj', 'advcl', 'case', 'obl', 'punct']
Applying decoding...


[0 --(root)--> 1,
 4 --(mark)--> 2,
 4 --(nsubj)--> 3,
 1 --(advcl)--> 4,
 6 --(case)--> 5,
 4 --(obl)--> 6,
 4 --(punct)--> 7]
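The bracket labels above can be read as follows: `>` and `<` mark a word whose head lies to its left or right, and each `/` or `\` marks one dependent to the right or left. The sketch below is a simplified reconstruction for the 1-planar case only, inferred from the sample output rather than taken from the library. The position of `<` inside a label is an assumption (the sample has no word that combines `<` with other symbols), and `encode_brackets` and `decode_brackets` are hypothetical names.

```python
from collections import Counter


def encode_brackets(heads):
    """heads[i] is the head of word i + 1; assumes a projective tree."""
    n = len(heads)
    left_deps = [0] * (n + 1)   # left_deps[h]: dependents to the left of h
    right_deps = [0] * (n + 1)  # right_deps[h]: dependents to the right of h
    for dep, head in enumerate(heads, start=1):
        if dep < head:
            left_deps[head] += 1
        else:
            right_deps[head] += 1
    labels = []
    for dep, head in enumerate(heads, start=1):
        label = '<' if head > dep else ''  # placement of '<' is assumed
        label += '\\' * left_deps[dep] + '/' * right_deps[dep]
        label += '>' if head < dep else ''
        labels.append(label)
    return labels


def decode_brackets(labels):
    """Recover the heads with two stacks; symbols are counted, not ordered."""
    heads = [None] * len(labels)
    pending_heads = [0]  # the root and words still owed a right dependent
    pending_deps = []    # words still waiting for a head on their right
    for i, label in enumerate(labels, start=1):
        count = Counter(label)
        if count['>']:                    # head is the closest open '/'
            heads[i - 1] = pending_heads.pop()
        for _ in range(count['\\']):      # attach waiting left dependents
            heads[pending_deps.pop() - 1] = i
        if count['<']:
            pending_deps.append(i)
        pending_heads.extend([i] * count['/'])
    return heads


heads = [0, 4, 4, 1, 6, 4, 4]
labels = encode_brackets(heads)
print(labels)                   # ['/>', '<', '<', '\\\\//>', '<', '\\>', '>']
print(decode_brackets(labels))  # [0, 4, 4, 1, 6, 4, 4]
```

Closing symbols (`>` and `\`) are resolved before opening ones (`<` and `/`) so that a word never attaches to itself.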

### **Semantic Parsing as Sequence Labeling**

Accordingly, the semantic labelers take a semantic graph as input. In this case the encoding only involves the _unlabeled arcs_ of the graph, that is, the positions of the head and dependent nodes of each arc. Note in the following example that only the _unlabeled versions_ of the arcs are recovered.

In [11]:
from separ.models import IndexSemanticParser, BracketSemanticParser, Bit4kSemanticParser, Bit6kSemanticParser

graph = SDP.from_file('sample.sdp')[0]
print(graph.arcs)
idx = IndexSemanticParser.Labeler()
labels = idx.encode(graph)
print('Labels:', labels)
print('Applying decoding...')
idx.decode(labels)

[3 --(ARG1)--> 1, 5 --(ARG1)--> 1, 0 --(TOP)--> 2, 2 --(ARG1)--> 3, 10 --(subord)--> 5, 5 --(ARG2)--> 7, 6 --(poss)--> 7, 9 --(mwe)--> 8, 3 --(ARG2)--> 10, 9 --(ARG1)--> 10, 11 --(ARG1)--> 10, 11 --(ARG2)--> 14, 12 --(BV)--> 14, 13 --(compound)--> 14]
Labels: ['3$5', '0', '2', '', '10', '', '5$6', '9', '', '3$9$11', '', '', '', '11$12$13', '']
Applying decoding...




[3 --(None)--> 1,
 5 --(None)--> 1,
 0 --(None)--> 2,
 2 --(None)--> 3,
 10 --(None)--> 5,
 5 --(None)--> 7,
 6 --(None)--> 7,
 9 --(None)--> 8,
 3 --(None)--> 10,
 9 --(None)--> 10,
 11 --(None)--> 10,
 11 --(None)--> 14,
 12 --(None)--> 14,
 13 --(None)--> 14]
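In the labels shown above, each word receives the positions of all its heads joined by `$`, and an empty label when it has no head. A minimal sketch of that mapping, written from the printed output rather than from the library code (`encode_heads` and `decode_heads` are hypothetical names), could look like this:

```python
def encode_heads(n, arcs, sep='$'):
    """arcs is a list of (head, dependent) pairs over n words."""
    heads = [[] for _ in range(n)]
    for head, dep in sorted(arcs):  # ascending head order within each word
        heads[dep - 1].append(head)
    return [sep.join(map(str, hs)) for hs in heads]


def decode_heads(labels, sep='$'):
    """Invert encode_heads and return the unlabeled (head, dependent) pairs."""
    return [(int(head), dep)
            for dep, label in enumerate(labels, start=1) if label
            for head in label.split(sep)]


sem_arcs = [(3, 1), (5, 1), (0, 2), (2, 3), (10, 5), (5, 7), (6, 7), (9, 8),
            (3, 10), (9, 10), (11, 10), (11, 14), (12, 14), (13, 14)]
sem_labels = encode_heads(15, sem_arcs)
print(sem_labels)
print(decode_heads(sem_labels) == sem_arcs)  # True
```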

To also encode the dependency types, the semantic labeler provides the methods `complete_encode()` and `complete_decode()`:

In [12]:
labels, rels = idx.complete_encode(graph)
print('Labels:', labels)
print('Applying full decoding...')
idx.complete_decode(labels, rels)

Labels: ['3$5', '0', '2', '', '10', '', '5$6', '9', '', '3$9$11', '', '', '', '11$12$13', '']
Applying full decoding...


[3 --(ARG1)--> 1,
 5 --(ARG1)--> 1,
 0 --(TOP)--> 2,
 2 --(ARG1)--> 3,
 10 --(subord)--> 5,
 5 --(ARG2)--> 7,
 6 --(poss)--> 7,
 9 --(mwe)--> 8,
 3 --(ARG2)--> 10,
 9 --(ARG1)--> 10,
 11 --(ARG1)--> 10,
 11 --(ARG2)--> 14,
 12 --(BV)--> 14,
 13 --(compound)--> 14]

### **Constituency Parsing as Sequence Labeling**

At the moment, only the index-based encoding has been implemented for constituency parsing (more are in progress). Like the labelers described above, the constituency linearization algorithms take a `PTB.Tree` as input.

In [14]:
from separ.models import IndexConstituencyParser
tree = PTB.from_file('sample.ptb')[0]
print(tree.format())
idx = IndexConstituencyParser.Labeler()
labels, cons, leaves = idx.encode(tree)
print(labels)
print(cons)
print(leaves)

(S (INTJ (RB No)) (, ,) (NP-SBJ (PRP it)) (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .))
['1', '1', '1', '2', '2', '3', '1', '0']
['S', 'S', 'S', 'VP', 'VP', 'NP-PRD', 'S', '']
['INTJ', '', 'NP-SBJ', '', '', '', '', '']
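One way to read the three sequences above: for each word, the first label is the number of ancestors it shares with the next word, the second is the label of the lowest shared ancestor, and the third collects the constituents that dominate only that word. The sketch below reproduces the printed output under that reading. It is an interpretation of the output, not the library's implementation, and joining several unary labels with `+` is an arbitrary choice of this sketch.

```python
def parse(text):
    """Parse a bracketed tree into (label, children); leaves are (tag, word)."""
    tokens = text.replace('(', ' ( ').replace(')', ' ) ').split()
    pos = 0

    def node():
        nonlocal pos
        label = tokens[pos + 1]  # tokens[pos] is '('
        pos += 2
        if tokens[pos] != '(':   # preterminal: (TAG word)
            word = tokens[pos]
            pos += 2             # skip the word and ')'
            return (label, word)
        children = []
        while tokens[pos] == '(':
            children.append(node())
        pos += 1                 # skip ')'
        return (label, children)

    return node()


def leaf_paths(tree, path=()):
    """Return, for every word, the tuple of its ancestors from the root down."""
    _, rest = tree
    if isinstance(rest, str):
        return [path]
    paths = []
    for child in rest:
        paths.extend(leaf_paths(child, path + (tree,)))
    return paths


def shared(a, b):
    """Number of leading ancestors that two paths have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[n] is b[n]:
        n += 1
    return n


def encode(tree):
    paths = leaf_paths(tree)
    labels, cons, leaves = [], [], []
    for i, path in enumerate(paths):
        prev = shared(paths[i - 1], path) if i > 0 else 0
        nxt = shared(path, paths[i + 1]) if i + 1 < len(paths) else 0
        labels.append(str(nxt))
        cons.append(path[nxt - 1][0] if nxt else '')
        # ancestors shared with neither neighbour dominate only this word
        leaves.append('+'.join(anc[0] for anc in path[max(prev, nxt):]))
    return labels, cons, leaves


sample = "(S (INTJ (RB No)) (, ,) (NP-SBJ (PRP it)) (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .))"
enc_labels, enc_cons, enc_leaves = encode(parse(sample))
print(enc_labels)  # ['1', '1', '1', '2', '2', '3', '1', '0']
print(enc_cons)    # ['S', 'S', 'S', 'VP', 'VP', 'NP-PRD', 'S', '']
print(enc_leaves)  # ['INTJ', '', 'NP-SBJ', '', '', '', '', '']
```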




In this case the decoding process does not return arcs, but the list of spans that make up the decoded tree:

In [15]:
spans = idx.decode(labels, cons, leaves)
spans

[Span(LEFT=5, RIGHT=7, LABEL=NP-PRD),
 Span(LEFT=3, RIGHT=7, LABEL=VP),
 Span(LEFT=0, RIGHT=8, LABEL=S),
 Span(LEFT=0, RIGHT=1, LABEL=INTJ),
 Span(LEFT=2, RIGHT=3, LABEL=NP-SBJ)]

Given the sequence of spans, the `PTB.Tree.from_spans()` method rebuilds the tree instance:

In [16]:
PTB.Tree.from_spans(tree.preterminals, spans)

(S (INTJ (RB No)) (, ,) (NP-SBJ (PRP it)) (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .))
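To see why the preterminals and the spans are enough to recover the tree, here is a minimal sketch that rebuilds the bracketed string from them. It assumes, as the printed spans suggest, that `LEFT` is inclusive, `RIGHT` is exclusive and both are 0-based. `spans_to_brackets` is a hypothetical helper, not the library's `from_spans()`, and it does not define an order for two spans that cover exactly the same words.

```python
def spans_to_brackets(preterminals, spans):
    """preterminals: list of (tag, word); spans: list of (left, right, label)."""
    opens = [[] for _ in preterminals]  # labels opened before each word
    closes = [0] * len(preterminals)    # brackets closed after each word
    # widest spans first, so that parents open before their children
    for left, right, label in sorted(spans, key=lambda s: s[0] - s[1]):
        opens[left].append(label)
        closes[right - 1] += 1
    parts = []
    for i, (tag, word) in enumerate(preterminals):
        opening = ''.join(f'({label} ' for label in opens[i])
        parts.append(opening + f'({tag} {word})' + ')' * closes[i])
    return ' '.join(parts)


preterminals = [('RB', 'No'), (',', ','), ('PRP', 'it'), ('VBD', 'was'),
                ('RB', "n't"), ('NNP', 'Black'), ('NNP', 'Monday'), ('.', '.')]
spans = [(5, 7, 'NP-PRD'), (3, 7, 'VP'), (0, 8, 'S'),
         (0, 1, 'INTJ'), (2, 3, 'NP-SBJ')]
rebuilt = spans_to_brackets(preterminals, spans)
print(rebuilt)
```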

In [17]:
tree.format()

"(S (INTJ (RB No)) (, ,) (NP-SBJ (PRP it)) (VP (VBD was) (RB n't) (NP-PRD (NNP Black) (NNP Monday))) (. .))"