# Read annotations from CoNLL

The folder `annotations-pickle` contains all documents in the corpus, together with their annotations, as pickled Python objects (created with the script `read_annotations.py`). This notebook illustrates how to load and use these objects in Python.

Note: to unpickle the files, your code must be able to import the module `conll_data.py` (e.g. by putting it in the same directory as this notebook).
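
The reason behind this note can be sketched as follows: a pickle records the module and class name of each object, not the class definition, so that module has to be importable at load time. The example uses `types.SimpleNamespace` as a stand-in for the classes in `conll_data.py`, and the helper `ensure_importable` is a hypothetical convenience function, not part of the corpus code.

```python
import pickle
import sys
from types import SimpleNamespace


def ensure_importable(module_dir):
    """Add module_dir to the import search path if it is not there yet."""
    if module_dir not in sys.path:
        sys.path.insert(0, module_dir)


# Stand-in object; the real objects are instances of classes in conll_data.py
stand_in = SimpleNamespace(text="A short example text.")
payload = pickle.dumps(stand_in)

# The serialized bytes name the module and the class of the object,
# which is why that module must be importable when unpickling
print(b"types" in payload, b"SimpleNamespace" in payload)

# Hypothetical usage, if conll_data.py lives in another directory:
# ensure_importable("path/to/directory/with/conll_data")
```

If `conll_data.py` cannot be imported, `pickle.load` is expected to fail with an import-related error rather than return a partial object.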

In [None]:
import gzip
import pickle
from collections import Counter
from tqdm import tqdm
from glob import glob

### Single document
The following illustrates how you can read one of the documents as a `Document` instance and print some information about its sentences, tokens and annotations.

In [None]:
example_file = "../data/annotations-pickle/21st-Century-Wire_20170627T181355.conll.pickle.gz"
with gzip.open(example_file, "rb") as infile:
    doc = pickle.load(infile)

In [None]:
doc.text

In [None]:
# print statistics
print("Number of sentences:", len(doc.sentences))
print("Number of tokens:", len(doc.tokens))
print("Number of unique words:", len(set(doc.words)))
print("Number of unique lemmas:", len(set(doc.lemmas)))

In [None]:
# inspect a specific sentence
sentence = doc.sentences[0]
print("Text:", sentence.text)
print("Words:", sentence.words)
print("Lemmas:", sentence.lemmas)

In [None]:
# inspect a specific token
token = sentence.get_token(token_id="13") # or: sentence.tokens[12]
print("Word:", token.word)
print("Lemma:", token.lemma)
print("POS:", token.pos)
print("Offset:", token.offset_start, "-", token.offset_end)
print("Full phrase:", sentence.get_full_phrase(head_id="13").text)

In [None]:
# print statistics on the annotations
print(f"{len(doc.events)} events annotated")
print(f"{len(doc.claims)} claims annotated")
print(f"{len(doc.attr_cues)} attribution cues annotated")
print(f"{len(doc.attr_contents)} attribution contents annotated")
print(f"{len(doc.attr_sources)} attribution sources annotated")
print(f"{len(doc.attr_relations)} attribution relations annotated")

In [None]:
# inspect a specific event annotation
doc.events[0].text

In [None]:
# inspect all multi-word events
mw_events = [event for event in doc.events if len(event.tokens) > 1]
for event in mw_events:
    print(event.text)

In [None]:
# inspect a specific claim annotation
print(doc.claims[0].text)

In [None]:
# inspect a specific attribution relation;
# one AR can have multiple sources and cues
ar = doc.attr_relations[-1]
print("Content:", ar.content.text)
for source in ar.sources:
    print("Source:", source.text)
for cue in ar.cues:
    print("Cue:", cue.text)
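
For further processing it can be handy to flatten an attribution relation into plain Python data. The following is a minimal sketch that relies only on the attributes used in the cell above (`content`, `sources`, `cues` and their `text`); `relation_to_record` is a hypothetical helper name, and the example object is a stand-in with made-up text rather than a real annotation.

```python
from types import SimpleNamespace


def relation_to_record(ar):
    """Collect the texts of one attribution relation in a plain dict."""
    return {
        "content": ar.content.text,
        "sources": [source.text for source in ar.sources],
        "cues": [cue.text for cue in ar.cues],
    }


# Stand-in object with made-up text, mimicking the attributes used above
example_ar = SimpleNamespace(
    content=SimpleNamespace(text="the plan would fail"),
    sources=[SimpleNamespace(text="the minister")],
    cues=[SimpleNamespace(text="said"), SimpleNamespace(text="warned")],
)
record = relation_to_record(example_ar)
print(record)
```

With a loaded document, the same helper could be applied as `[relation_to_record(ar) for ar in doc.attr_relations]`.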

### All documents

The following illustrates how you can read all documents as `Document` instances and get some overall information on all annotations.

In [None]:
pickle_files = glob("../data/annotations-pickle/*.pickle.gz")
len(pickle_files)

In [None]:
# get all events
events = []
for pickle_file in tqdm(pickle_files):
    with gzip.open(pickle_file, "rb") as infile:
        doc = pickle.load(infile)
        events.extend(doc.events)
print(len(events), "events annotated")

In [None]:
# most frequent events
# use the lemma for single-token events, the lowercased text otherwise
event_texts = [
    event.tokens[0].lemma.lower() if len(event.tokens) == 1 else event.text.lower()
    for event in events
]
Counter(event_texts).most_common(10)
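
The counting above normalizes each annotation first: single-token annotations are reduced to their lowercased lemma, multi-word annotations to their lowercased text. As a sketch, this rule could be factored into a small helper; `normalized_text` and `make_annotation` are hypothetical names, and the annotations below are stand-ins with made-up values.

```python
from collections import Counter
from types import SimpleNamespace


def normalized_text(annotation):
    """Return the lemma for single-token annotations, else the full text."""
    tokens = annotation.tokens
    if len(tokens) == 1:
        return tokens[0].lemma.lower()
    return annotation.text.lower()


def make_annotation(text, lemmas):
    """Build a stand-in annotation with one token per lemma."""
    tokens = [SimpleNamespace(lemma=lemma) for lemma in lemmas]
    return SimpleNamespace(text=text, tokens=tokens)


annotations = [
    make_annotation("said", ["say"]),
    make_annotation("says", ["say"]),
    make_annotation("Pointed out", ["point", "out"]),
]
counts = Counter(normalized_text(annotation) for annotation in annotations)
print(counts.most_common(10))
```

Using the lemma merges inflected forms of the same single-token annotation, while multi-word annotations are only merged when their lowercased text is identical.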

In [None]:
# get all cues
cues = []
for pickle_file in tqdm(pickle_files):
    with gzip.open(pickle_file, "rb") as infile:
        doc = pickle.load(infile)
        cues.extend(doc.attr_cues)
print(len(cues), "cues annotated")

In [None]:
# most frequent cues
# use the lemma for single-token cues, the lowercased text otherwise
cue_texts = [
    cue.tokens[0].lemma.lower() if len(cue.tokens) == 1 else cue.text.lower()
    for cue in cues
]
Counter(cue_texts).most_common(10)
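
The cells above unpickle every file once for the events and once more for the cues. A possible alternative, sketched here, collects both in a single pass over the files. `iter_documents` and `collect_annotations` are hypothetical helpers, and the demonstration writes small stand-in objects to a temporary directory instead of reading the real corpus.

```python
import gzip
import os
import pickle
import tempfile
from types import SimpleNamespace


def iter_documents(paths):
    """Yield one unpickled document at a time."""
    for path in paths:
        with gzip.open(path, "rb") as infile:
            yield pickle.load(infile)


def collect_annotations(paths):
    """Gather events and attribution cues while loading each file once."""
    events, cues = [], []
    for doc in iter_documents(paths):
        events.extend(doc.events)
        cues.extend(doc.attr_cues)
    return events, cues


# Demonstration with stand-in documents (plain strings as annotations)
stand_ins = [(["e1", "e2"], ["c1"]), (["e3"], [])]
with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for index, (doc_events, doc_cues) in enumerate(stand_ins):
        path = os.path.join(tmp_dir, f"doc{index}.pickle.gz")
        with gzip.open(path, "wb") as outfile:
            pickle.dump(SimpleNamespace(events=doc_events, attr_cues=doc_cues), outfile)
        paths.append(path)
    all_events, all_cues = collect_annotations(paths)

print(len(all_events), "events,", len(all_cues), "cues")
```

On the real data this would be called as `collect_annotations(pickle_files)`, optionally wrapping the file list in `tqdm` to keep the progress bar.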