# Example usage of the `Document` class for data manipulation.

In [8]:
# Import module

from dygie.data.dataset_readers import document

## ACE event data

Load in a dataset and print a brief description.

In [9]:
dataset = document.Dataset.from_jsonl("../data/ace-event/normalized-data/default-settings/json/dev.json")
print(dataset)

Dataset with 77 documents.
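The `from_jsonl` name suggests the file holds one JSON document per line. Here is a minimal sketch of reading such a file, assuming that layout. `read_jsonl` and the two toy documents are hypothetical and are not part of `dygie`; the real `Dataset.from_jsonl` may do more.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read one JSON document per non-empty line (hypothetical helper, not dygie's API)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Two toy documents, invented for illustration only.
toy_docs = [
    {"doc_key": "doc_a", "sentences": [["Hello", "world", "."]]},
    {"doc_key": "doc_b", "sentences": [["A", "second", "document", "."]]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.jsonl"
    # Write each document on its own line, then read the file back.
    path.write_text("\n".join(json.dumps(d) for d in toy_docs) + "\n", encoding="utf-8")
    loaded = read_jsonl(path)

print(f"Dataset with {len(loaded)} documents.")
```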


Grab a document, and print out the sentences.

In [41]:
doc = dataset[0]
print(doc)

0: CNN_CF_20030303.1900.02
1: STORY
2: 2003 - 03 - 03T19:00:00 - 05:00
3: New Questions About Attacking Iraq ; Is Torturing Terrorists Necessary ?
4: NOVAK Welcome back .
5: Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .
6: The army 's entire first Calvary division based at Fort Hood , Texas , would join the quarter million U.S. forces already in the region .
7: We 're talking about possibilities of full scale war with former Congressman Tom Andrews , Democrat of Maine .
8: He 's now national director of Win Without War , and former Congressman Bob Dornan , Republican of California .
9: BEGALA Bob , one of the reasons I think so many Americans are worried about this war and so many people around the world do n't want to go is there have been a lot of problems with credibility from this administration .
10: Our president has repeatedly , for example , relied on a man whom you 're aware , Hussein Kamel , Saddam Hussein 's son - in - law , leader of

Grab a single sentence from the document, and print it. The tokens will be shown, with token indices underneath.

In [42]:
sent = doc[5]
print(sent)

Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .
0      1    2   3     4  5      6      7    8    9        10 11  12      13   14     15
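A display like this one can be built by padding each token and its index to a common width. The sketch below is one way to do it; `render_with_indices` is a hypothetical helper, and the library's own printing code may differ.

```python
def render_with_indices(tokens):
    """Return two lines: the tokens, and each token's index aligned beneath it."""
    top, bottom = [], []
    for i, tok in enumerate(tokens):
        # Each column is as wide as the longer of the token and its index.
        width = max(len(tok), len(str(i)))
        top.append(tok.ljust(width))
        bottom.append(str(i).ljust(width))
    return " ".join(top).rstrip() + "\n" + " ".join(bottom).rstrip()


tokens = "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .".split()
print(render_with_indices(tokens))
```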


Examine a named entity in the dataset. Printing an entity mention shows:
- The token indices in the current sentence.
- The mention text.
- The entity type.

In [43]:
ner = sent.ner[0]
print(ner)

(7, 7, ['U.S.']): GPE


Entities are represented as spans. Spans "know" what sentence they're in, and also know their indices with respect to the sentence and the document.

In [44]:
span = ner.span

Printing the span shows the start and end indices, and the text.

In [45]:
print(span)


(7, 7, ['U.S.'])


Spans have a pointer back to the sentence they're from.

In [46]:
print(span.sentence)

Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .
0      1    2   3     4  5      6      7    8    9        10 11  12      13   14     15


Spans also know their indices relative to the sentence.

print(span.start_sent)
print(span.end_sent)

They also know their indices relative to the document they're part of.

In [48]:
print(span.start_doc)
print(span.end_doc)

31
31
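To see where the 31 comes from, here is a minimal sketch of a span that keeps a back-pointer to its sentence. The `Sentence` and `Span` classes below are simplified stand-ins, not the real `dygie` classes, and they assume the sentences are tokenized on whitespace.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sentence:
    tokens: List[str]
    sentence_start: int  # document-level index of this sentence's first token


@dataclass
class Span:
    start_sent: int      # inclusive, relative to the sentence
    end_sent: int        # inclusive, relative to the sentence
    sentence: Sentence   # back-pointer to the containing sentence

    @property
    def start_doc(self) -> int:
        return self.sentence.sentence_start + self.start_sent

    @property
    def end_doc(self) -> int:
        return self.sentence.sentence_start + self.end_sent

    @property
    def text(self) -> List[str]:
        return self.sentence.tokens[self.start_sent : self.end_sent + 1]


# The first six sentences of the document printed earlier, split on whitespace.
raw = [
    "CNN_CF_20030303.1900.02",
    "STORY",
    "2003 - 03 - 03T19:00:00 - 05:00",
    "New Questions About Attacking Iraq ; Is Torturing Terrorists Necessary ?",
    "NOVAK Welcome back .",
    "Orders went out today to deploy 17,000 U.S. Army soldiers in the Persian Gulf region .",
]
sentences, offset = [], 0
for line in raw:
    toks = line.split()
    sentences.append(Sentence(toks, offset))
    offset += len(toks)

span = Span(7, 7, sentences[5])
print(span.text, span.start_sent, span.start_doc)
```

Sentences 0 through 4 contain 24 tokens in total, so sentence-level index 7 becomes document-level index 24 + 7 = 31.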


Relations are represented as two spans and a relation label.

In [49]:
rel = sent.relations[0]
print(rel)

(8, 8, ['Army']), (7, 7, ['U.S.']): PART-WHOLE.Subsidiary
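As a rough mental model, a relation can be pictured as a small record holding two spans and a label. The classes below are a sketch written for this illustration and are not the `dygie` implementation; the field names `first` and `second` are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    text: List[str]

    def __str__(self) -> str:
        return f"({self.start}, {self.end}, {self.text})"


@dataclass(frozen=True)
class Relation:
    first: Span
    second: Span
    label: str

    def __str__(self) -> str:
        return f"{self.first}, {self.second}: {self.label}"


rel = Relation(Span(8, 8, ["Army"]), Span(7, 7, ["U.S."]), "PART-WHOLE.Subsidiary")
print(rel)
```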


Events are represented as a trigger followed by a list of argument spans.

In [51]:
ev = sent.events[0]
print(ev)

<(5, 'deploy', Movement.Transport):
      (9, 9, ['soldiers'], Movement.Transport, Artifact);
      (14, 14, ['region'], Movement.Transport, Destination)>
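The same structure can be sketched as a trigger plus a list of role-labeled arguments. Everything here (`Trigger`, `Argument`, `Event`, and the `by_role` helper) is hypothetical and only mirrors the printed output above; the actual `dygie` classes may expose different attributes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trigger:
    token_index: int
    token: str
    event_type: str


@dataclass
class Argument:
    start: int
    end: int
    text: List[str]
    role: str


@dataclass
class Event:
    trigger: Trigger
    arguments: List[Argument] = field(default_factory=list)

    def by_role(self, role: str) -> List[Argument]:
        """Return the arguments that fill the given role."""
        return [arg for arg in self.arguments if arg.role == role]


ev = Event(
    Trigger(5, "deploy", "Movement.Transport"),
    [Argument(9, 9, ["soldiers"], "Artifact"), Argument(14, 14, ["region"], "Destination")],
)
for arg in ev.arguments:
    print(ev.trigger.token, ev.trigger.event_type, arg.role, arg.text)
```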


You can convert a document back to a json-style dict that matches the DyGIE data format.

In [60]:
js = doc.to_json()
print(js.keys())

dict_keys(['doc_key', 'dataset', 'sentences', 'ner', 'relations', 'events'])
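For orientation, here is a sketch of what the per-sentence entries under `ner`, `relations`, and `events` might look like for the sentence examined above. It assumes the list layouts described in the DyGIE data documentation (document-level token indices, with each event given as a trigger entry followed by its argument entries); check `js` itself for the authoritative structure. `shift_to_doc_level` is a hypothetical helper, and only the annotations printed earlier are included.

```python
import json


def shift_to_doc_level(sentence_start, ner, relations, events):
    """Convert sentence-relative annotations to document-level, JSON-style lists."""
    s = sentence_start
    return {
        "ner": [[start + s, end + s, label] for start, end, label in ner],
        "relations": [
            [s1 + s, e1 + s, s2 + s, e2 + s, label]
            for s1, e1, s2, e2, label in relations
        ],
        "events": [
            [[trig + s, ev_type]] + [[a + s, b + s, role] for a, b, role in args]
            for trig, ev_type, args in events
        ],
    }


entry = shift_to_doc_level(
    sentence_start=24,  # tokens in sentences 0-4 of the document printed earlier
    ner=[(7, 7, "GPE")],
    relations=[(8, 8, 7, 7, "PART-WHOLE.Subsidiary")],
    events=[(5, "Movement.Transport", [(9, 9, "Artifact"), (14, 14, "Destination")])],
)
print(json.dumps(entry, indent=2))
```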


You can also split a long document up into shorter segments. This is useful for dealing with documents that are too large to fit in GPU memory.

**CAVEAT**: This functionality isn't implemented yet for coref annotations. These are challenging because they cross sentence boundaries.

In [71]:
small_docs = doc.split(max_tokens_per_doc=50)
print(small_docs[1])

0: The army 's entire first Calvary division based at Fort Hood , Texas , would join the quarter million U.S. forces already in the region .
1: We 're talking about possibilities of full scale war with former Congressman Tom Andrews , Democrat of Maine .


The document-level span indices update when the document is split.

In [72]:
print(small_docs[1][0].ner[0].span.start_doc)

1
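One strategy consistent with the output above is to pack consecutive sentences greedily until the token budget would be exceeded, then restart the document-level indices at zero for each segment. The sketch below illustrates that idea; it is an assumption about the behavior, not the actual `doc.split` code, and `split_greedy` is a hypothetical name.

```python
from typing import List


def split_greedy(sentence_lengths: List[int], max_tokens_per_doc: int) -> List[List[int]]:
    """Pack consecutive sentences into segments without exceeding the token budget.

    Returns a list of segments, each a list of sentence indices. A sentence longer
    than the budget gets a segment of its own.
    """
    segments, current, current_len = [], [], 0
    for i, n in enumerate(sentence_lengths):
        if current and current_len + n > max_tokens_per_doc:
            segments.append(current)
            current, current_len = [], 0
        current.append(i)
        current_len += n
    if current:
        segments.append(current)
    return segments


# Whitespace token counts of sentences 0-8 of the ACE document printed earlier.
lengths = [1, 1, 7, 11, 4, 16, 26, 19, 20]
segments = split_greedy(lengths, max_tokens_per_doc=50)
print(segments)

# Rebase a document-level index into the segment that contains it.
segment_start = sum(lengths[: segments[1][0]])  # tokens before the second segment
old_index = 41                                  # "army" in the unsplit document
print(old_index - segment_start)
```

With these counts, sentences 6 and 7 land together in the second segment, and the token at index 41 of the original document moves to index 1. Because only the first nine sentences are counted here, the last segment is incomplete.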


## SciERC data

Unlike the ACE event data, the SciERC data have coreference annotations.

In [77]:
dataset = document.Dataset.from_jsonl("../data/scierc/normalized_data/json/dev.json")
doc = dataset[2]
print(doc)

0: An entity-oriented approach to restricted-domain parsing is proposed .
1: In this approach , the definitions of the structure and surface representation of domain entities are grouped together .
2: Like semantic grammar , this allows easy exploitation of limited domain semantics .
3: In addition , it facilitates fragmentary recognition and the use of multiple parsing strategies , and so is particularly useful for robust recognition of extra-grammatical input .
4: Several advantages from the point of view of language definition are also noted .
5: Representative samples from an entity-oriented language definition are presented , along with a control structure for an entity-oriented parser , some parsing strategies that use the control structure , and worked examples of parses .
6: A parser incorporating the control structure and the parsing strategies is currently under implementation .


Let's look at the coreference clusters. Each cluster is printed as:

`cluster_id: [<sent_index> (span_start, span_end, span_text), ...]`

So, cluster 0 has two mentions: "this" in sentence 2, and "it" in sentence 3.

In [82]:
for clust in doc.clusters:
    print(clust)
    print()

0: [<2> (4, 4, ['this']), <3> (3, 3, ['it'])]

1: [<0> (1, 2, ['entity-oriented', 'approach']), <1> (2, 2, ['approach'])]

2: [<5> (21, 22, ['parsing', 'strategies']), <6> (8, 9, ['parsing', 'strategies'])]

3: [<5> (13, 14, ['control', 'structure']), <5> (26, 27, ['control', 'structure']), <6> (4, 5, ['control', 'structure'])]
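The span indices in this printout are relative to each mention's sentence. A short sketch that rebuilds the line for cluster 0 from the sentences above makes this concrete. It assumes whitespace tokenization, and `format_cluster` is a hypothetical helper rather than part of `dygie`.

```python
from typing import Dict, List, Tuple

Mention = Tuple[int, int, int]  # (sentence index, start, end), end inclusive


def format_cluster(cluster_id: int, mentions: List[Mention], sentences: Dict[int, List[str]]) -> str:
    """Render a cluster in the same style as the printout above."""
    parts = []
    for sent_ix, start, end in mentions:
        text = sentences[sent_ix][start : end + 1]
        parts.append(f"<{sent_ix}> ({start}, {end}, {text})")
    return f"{cluster_id}: [{', '.join(parts)}]"


# Sentences 2 and 3 of the SciERC document, split on whitespace.
sentences = {
    2: "Like semantic grammar , this allows easy exploitation of limited domain semantics .".split(),
    3: (
        "In addition , it facilitates fragmentary recognition and the use of multiple "
        "parsing strategies , and so is particularly useful for robust recognition of "
        "extra-grammatical input ."
    ).split(),
}

cluster_0 = [(2, 4, 4), (3, 3, 3)]
print(format_cluster(0, cluster_0, sentences))
```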



## Predictions

You can also load in predicted data. The code will populate attributes such as `predicted_ner` and `predicted_relations`, which mirror the attributes we've shown already.