# Exploring DFN annotated data structure

This notebook introduces the annotated data from the Dutch FrameNet annotation tool. We will analyze the dictionary structure of one annotated reference text. Once you are familiar with the richness of this structure, you can write code to iterate over your corpus, extract the data you want to analyze, and restructure or visualize them in any way you want.

Let us import some packages first.

In [63]:
import json
import pprint
import glob
from collections import defaultdict

In your corpus, we have a json file for each annotated reference text. We will now open one json file from the Dutch Utrecht_shooting corpus. 

In [100]:
with open("Utrecht_shooting/annotated_data/'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’.json", 'r') as infile:
    json_dict = json.load(infile)

Each json file has one key, which is the title of the annotated reference text. Its value is a dictionary with the following keys:

In [101]:
json_dict_title = json_dict["'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’"]
print(json_dict_title.keys())

dict_keys(["fe's without links", 'frames/links', 'historical distance', "implicated fe's", 'raw text', 'subevents'])


We will discuss the content and structure of each key.

## 1. Raw text

The value of 'raw text' displays exactly what it stands for: the raw unannotated text of the document. This can be convenient when you want to perform some qualitative analysis or simply get a good picture of what has been annotated.

In [123]:
print(json_dict_title["raw text"])


'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’ Aanslag UtrechtAl direct heeft hij het gevoel dat er iets niet pluis is. Dit is geen gewone liquidatie, denkt Rob van Bree (41) als hij in de ochtend van 18 maart hoort dat er is geschoten in een tram op het 24 Oktoberplein in Utrecht. Het hoofd operaties van de politie Midden-Nederland pakt meteen de telefoon en belt met de recherchechef en algemeen commandant van het crisisteam. Direct daarna wordt hij zelf gebeld. ,,De meldkamer. Ze werden daar overspoeld met telefoontjes.’’ Dan is duidelijk dat zijn gevoel klopt. ,,Het is echt groot.” Het begin van een dag die z’n weerga niet kent.



## 2. Historical distance

The value of 'historical distance' displays the number of days between the event date and the publication date of the text. The temporal distance range of your corpus can be used as a variable in your analysis to see if your observed patterns change over time.

In [25]:
print("This document is written",json_dict_title['historical distance'], "days after the main event.")

This document is written 12 days after the main event.
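To use historical distance as a variable across the corpus, the values can first be gathered per document. The following is a minimal sketch, assuming each loaded file has the one-title-key structure described above and that the distance is stored as a number. `distance_per_title` is a hypothetical helper, and the two toy documents (titles and values invented for illustration) stand in for files loaded with `json.load`.

```python
def distance_per_title(json_dicts):
    """Map each document title to its historical distance in days."""
    distances = {}
    for json_dict in json_dicts:
        for title, values in json_dict.items():
            distances[title] = values['historical distance']
    return distances

# toy stand-ins for loaded json files (illustrative values, not corpus data)
toy_docs = [
    {"document A": {"historical distance": 12}},
    {"document B": {"historical distance": 0}},
]

distances = distance_per_title(toy_docs)

# print the documents from closest to furthest from the event date
for title, days in sorted(distances.items(), key=lambda item: item[1]):
    print(days, title)
```

With the real corpus, the list passed to the helper would be built by loading every file in the annotated_data folder.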


## 3. Frames/links

This value is a dictionary containing information about text mentions that are annotated with both a link to a structured entry and with frame information. Its keys are identifiers encoding structured entries. In order to interpret these identifiers, we can make use of *structured_data.json*, which contains a mapping from identifiers to labels. We will take one identifier and examine the information linked to it.

In [102]:
print(json_dict_title['frames/links'].keys())

with open("Utrecht_shooting/structured_data.json", 'r') as infile:
    structured_data_dict = json.load(infile)

print()
print("Q62090804 maps to", structured_data_dict["Q62090804"])
print()
print(json_dict_title['frames/links']['Q62090804'].keys())

dict_keys(['Q1632409367599', 'Q62090804', 'Q803'])

Q62090804 maps to {'sem:incidentID': '2019 Utrecht shooting'}

dict_keys(['t19', 't37'])


There seem to be links of in-text mentions to three different structured entries. Identifier Q62090804 maps to the main event itself. In our dictionary, each identifier has a nested dictionary with tokens as keys. These tokens encode words that were tagged by the annotators in the raw text. Each token has its own nested dictionary. This dictionary consists of all linguistic and grounding information that you need. We will take a look at the dictionary of one token.

In [103]:
#these are the layers of nesting:
pprint.pprint(json_dict["'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’"] #document title
                      ['frames/links'] #information type
                      ['Q62090804'] #structured entry
                      ['t19']) #token

{'POS': 'NOUN',
 'article': {'definite': None, 'lemma': None},
 'frame': 'Attack',
 'frame elements': ['Catastrophe@Undesirable_event',
                    'Killing@Cause',
                    'Contacting@Topic',
                    'Contacting@Topic'],
 'lemma': 'aanslag',
 'reftype': 'evoke',
 'sentence': '2',
 'target phrase': 'dat zijn heel moeilijk ’ aanslag utrechtal'}


This dictionary contains all linguistic information about the selected token. We will go over all the keys:

* **lemma** = the lemmatized linguistic form of the tagged word in the raw text. In this case, this is the word 'aanslag' ('attack'). Hence, the annotators have linked this word in the text to the respective identifier in the structured data.

* **POS** = the POS of the token, in this case a noun. 

* **article** = any information about articles (definiteness and lemma) is provided in a dictionary. This can be useful if you want to look at the use of definiteness as a function of common ground. In this case, since 'aanslag' is not modified by an article, the values in this dictionary are None.

* **sentence** = the number of the sentence in which this word occurs. This word appears in the second sentence, which is probably in the beginning of the text.

* **target phrase** = the syntactic phrase of which the word is the semantic head. This was automatically reconstructed and all words are lemmatized, so it is sometimes far from perfect. But you can always look into the raw layer for the linguistic context. If the word is only annotated with a frame, then this key is absent.

* **reftype** = whether the frame and the word both refer to the same structured entry. In this case they do. However, this field has been annotated incorrectly too often to be reliable, so unfortunately you cannot make use of it.

* **frame** = if the annotator has annotated the word with a frame, you can find it here. In this case, 'aanslag' has been annotated with Attack. 

* **frame elements** = a list of frame elements that the word has been annotated with. It can exhibit frame elements of the frame that is evoked by the same word, i.e., an incorporated frame element, but also frame elements belonging to frames evoked in different sentences (as a product of discourse annotation). Each frame element's form is inherently constructed as frame@frame_element. In this case, all frame elements belong to frames evoked in different sentences.  
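Because every frame element is stored as a frame@frame_element string, it can be split into its two parts and compared with the frame evoked by the same word. The sketch below uses a reduced copy of token t19 from the output above; `split_frame_elements` is a hypothetical helper name, not part of the annotation tool.

```python
def split_frame_elements(info_dict):
    """Return (frame, frame_element) tuples for one token dictionary."""
    frame_elements = info_dict.get('frame elements') or []  # the value can be None
    return [tuple(fe.split('@', 1)) for fe in frame_elements]

# token t19 as printed above, reduced to the keys needed here
t19 = {'frame': 'Attack',
       'frame elements': ['Catastrophe@Undesirable_event',
                          'Killing@Cause',
                          'Contacting@Topic',
                          'Contacting@Topic'],
       'lemma': 'aanslag'}

pairs = split_frame_elements(t19)

# incorporated frame elements belong to the frame evoked by the word itself
incorporated = [fe for frame, fe in pairs if frame == t19['frame']]

print(pairs)
print(incorporated)
```

For this token the list of incorporated frame elements is empty, which matches the observation that all its frame elements belong to frames evoked elsewhere.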

## 4. Subevents

A reference text usually uses only a handful of general references to the main event. Those are both frame-annotated and linked to the incidentID in the structured data and can thus be found under the key 'frames/links' (see above). The remaining eventive predicates in the text are not linked to structured data and are frame-annotated only if they are subevents of the main event. Predicates are considered subevents if they play an important part in shaping the narrative of the event. They can be:

* **rising** events: events that are a direct cause of the main event. They occur beforehand.
* events **entailed** by the main event
* **falling** events: events that are directly caused by the main event. They occur in the aftermath.

Predicates that are frame-annotated because they express subevents are found under this key. The dictionary per predicate is structured in a similar fashion to the dictionaries for the predicates under 'frames/links'. Now we will have a look at the tokens grouped under this header, and pick one to examine more closely.

In [41]:
print(json_dict_title['subevents'].keys())

#these are the layers of nesting:
print()
pprint.pprint(json_dict["'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’"] #document title
                      ['subevents'] #information type
                      ['t104.c0']) #token

dict_keys(['t104.c0', 't109', 't58', 't83', 't86.c0', 't92.c0', 't92.c1', 't99'])

{'POS': 'ADJ',
 'article': {'definite': None, 'lemma': None},
 'compound': 'meldkamer',
 'frame': 'Reporting',
 'frame elements': None,
 'function': 'modifier',
 'lemma': 'meld',
 'reftype': 'type',
 'sentence': '7'}


This token encodes the word 'meld' ('report'). This is a lexeme that is part of a Dutch compound. This compound was split in the tool and its lexemes were annotated separately. 'meld' was not linked to structured data, otherwise it would be grouped under 'frames/links'. When an annotated word is part of a compound, its token has a suffix. In this case, the suffix is '.c0'. Because the word is part of a compound, the following keys have been added to the dictionary:

* **compound**: the lemma of the compound that the annotated lexeme is a part of. In this case, the compound is 'meldkamer' ('control room'). Note that 'lemma' only provides the tagged lexeme 'meld'.
* **function**: the syntactic function of the lexeme in the compound. Compounds usually consist of a head and a modifier. In this case, 'meld' is the modifier of 'kamer'.

Note: when the annotators split a Dutch compound, they annotated both parts with any frame or frame element. Exceptionally, the status of the potential frame element was disregarded, which means that a subpart of a compound is also annotated with a frame element if the frame element is peripheral.
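The compound suffix makes it possible to find the parts of a split compound from the token names alone. The sketch below groups the subevent tokens printed above by their base token. It assumes that the suffix always has the form '.c' followed by an index, as in the tokens shown here; `group_compound_tokens` is a hypothetical helper name.

```python
from collections import defaultdict

def group_compound_tokens(tokens):
    """Group compound-part tokens such as 't92.c0' under their base token 't92'."""
    groups = defaultdict(list)
    for token in tokens:
        base, sep, part = token.partition('.c')
        if sep:  # only tokens that carry a compound suffix
            groups[base].append(token)
    return dict(groups)

# the keys of json_dict_title['subevents'] as printed above
subevent_tokens = ['t104.c0', 't109', 't58', 't83', 't86.c0', 't92.c0', 't92.c1', 't99']

compound_groups = group_compound_tokens(subevent_tokens)
print(compound_groups)
```

A base token with only one part in this section, such as t104, may have its other part stored under a different key of the document dictionary.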

## 5. Frame elements without links

The annotators had to annotate all core frame elements of each annotated frame, regardless of whether the mention expressing the frame element is linked to the structured data. All mentions of frame elements without a link to structured data can be found under "fe's without links". Let us have a look at the annotated tokens under this key and pick one token for further examination.

In [105]:
print(json_dict_title["fe's without links"].keys())

#these are the layers of nesting:
print()
pprint.pprint(json_dict["'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’"] #document title
                      ["fe's without links"] #information type
                      ['t104.c1']) #token

dict_keys(['t104.c1', 't111', 't70', 't86.c1', 't97'])

{'POS': 'NOUN',
 'article': {'definite': None, 'lemma': None},
 'compound': 'meldkamer',
 'frame elements': ['Mass_motion@Goal', 'Reporting@Place'],
 'function': 'head',
 'lemma': 'kamer',
 'sentence': '7',
 'target phrase': 'meldkamer'}


As you can see, the structure of the dictionary belonging to 'kamer' ('room') is similar to that of the dictionaries under 'frames/links' and 'subevents'. 'kamer' here is part of the compound 'meldkamer'. Since 'meld' evokes a frame that references a subevent of the main event, its token t104.c0 is grouped under 'subevents'. But 'kamer' only expresses frame elements, and is thus placed under "fe's without links". It expresses the Place of the Reporting frame evoked by 'meld'. Furthermore, we see that it expresses the Goal of a Mass_motion frame, evoked in some other sentence. Notice that there is no key for a frame. If 'kamer' evoked a frame, it would have been put under 'subevents'. Thus, "fe's without links" is reserved for words annotated with frame elements but not frames.
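A frame element such as Reporting@Place can be traced back to the words that evoke its frame in the same document. The sketch below searches 'subevents' and 'frames/links' for tokens annotated with a given frame. `find_evokers` is a hypothetical helper, and `toy_doc` is a heavily reduced copy of the document dictionary shown in this notebook, so the token evoking Mass_motion is not included in it.

```python
def find_evokers(frame, doc):
    """Return (token, lemma) pairs for the tokens in one document that evoke the given frame."""
    evokers = []
    for token, info_dict in doc['subevents'].items():
        if info_dict.get('frame') == frame:
            evokers.append((token, info_dict.get('lemma')))
    for identifier, tokens in doc['frames/links'].items():  # one extra level: structured entry
        for token, info_dict in tokens.items():
            if info_dict.get('frame') == frame:
                evokers.append((token, info_dict.get('lemma')))
    return evokers

# reduced copy of the document dictionary (only the keys needed here)
toy_doc = {
    'frames/links': {'Q62090804': {'t19': {'frame': 'Attack', 'lemma': 'aanslag'}}},
    'subevents': {'t104.c0': {'frame': 'Reporting', 'lemma': 'meld'}},
    "fe's without links": {'t104.c1': {'lemma': 'kamer',
                                       'frame elements': ['Mass_motion@Goal', 'Reporting@Place']}},
}

for fe in toy_doc["fe's without links"]['t104.c1']['frame elements']:
    frame, element = fe.split('@', 1)
    print(fe, '->', find_evokers(frame, toy_doc))
```

Note that a frame can be evoked more than once in a document, in which case the sentence numbers of the tokens can help to decide which evoker a frame element belongs to.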

## 6. Implicated frame elements

In the DFN annotation tool, the annotators performed frame annotation on discourse level. This means that if they could not find a core frame element within the sentence of its frame-evoking lexical unit, they could look for it across sentence boundaries. If they could not find the frame element in discourse, they clicked the "unexpressed" button, meaning that the frame element is completely absent. Since core frame elements must be present in order to cognitively process their frame, their absence from discourse means that they are pragmatically implicated. These frame elements are found in a list under "implicated fe's":

In [46]:
pprint.pprint(json_dict_title["implicated fe's"])

['Attack@Assailant',
 'Killing@Killer',
 'Killing@Instrument',
 'Hit_target@Agent',
 'Mass_motion@Area',
 'Reporting@Informer',
 'Reporting@Wrongdoer',
 'Reporting@Authorities',
 'Reporting@Behavior',
 'Contacting@Communicator',
 'Contacting@Address',
 'Contacting@Communication',
 'Contacting@Address',
 'Contacting@Communication',
 'Catastrophe@Undesirable_event',
 'Catastrophe@Patient',
 'Team@Members',
 'Criminal_investigation@Incident',
 'Criminal_investigation@Suspect']
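The list above can be summarized per frame to see which frames leave most of their core frame elements unexpressed. This is a small sketch using `collections.Counter` on the list exactly as printed above; the variable names are chosen for illustration.

```python
from collections import Counter

# the value of json_dict_title["implicated fe's"] as printed above
implicated = ['Attack@Assailant', 'Killing@Killer', 'Killing@Instrument',
              'Hit_target@Agent', 'Mass_motion@Area', 'Reporting@Informer',
              'Reporting@Wrongdoer', 'Reporting@Authorities', 'Reporting@Behavior',
              'Contacting@Communicator', 'Contacting@Address',
              'Contacting@Communication', 'Contacting@Address',
              'Contacting@Communication', 'Catastrophe@Undesirable_event',
              'Catastrophe@Patient', 'Team@Members',
              'Criminal_investigation@Incident', 'Criminal_investigation@Suspect']

# count the implicated frame elements per frame (the part before the '@')
per_frame = Counter(fe.split('@', 1)[0] for fe in implicated)

for frame, freq in per_frame.most_common():
    print(freq, frame)
```

Repeated entries such as Contacting@Address are counted each time they occur, since the list may contain one entry per evoked instance of a frame.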


In sum: we distinguish the following types of information in our json dictionary, grouped under the following keys:

* **raw text**: the raw unannotated text of the document.
* **historical distance**: the distance in days from the document to the main event.
* **frames/links**: linguistic information about the frame-annotated references per structured entry that they are linked to.
* **subevents**: linguistic information about frame-annotated predicates referencing subevents instead of the main incident (no link to structured data).
* **fe's without links**: linguistic information about frame element annotations without links to structured data.
* **implicated fe's**: list of core frame elements that are absent from the reference text.
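Since the three annotation keys differ in their depth of nesting ('frames/links' has an extra level for the structured entry), a single generator can hide that difference when walking over a document. The sketch below is one possible way to do this; `iter_annotations` is a hypothetical helper and `toy_values` is a reduced copy of the document dictionary discussed above.

```python
def iter_annotations(values):
    """Yield (section, identifier, token, info_dict) for every annotated token in one document."""
    for identifier, tokens in values['frames/links'].items():
        for token, info_dict in tokens.items():
            yield 'frames/links', identifier, token, info_dict
    for section in ('subevents', "fe's without links"):
        for token, info_dict in values[section].items():
            yield section, None, token, info_dict  # no structured entry for these sections

# reduced copy of one document dictionary (only a few keys per token)
toy_values = {
    'frames/links': {'Q62090804': {'t19': {'lemma': 'aanslag', 'frame': 'Attack'}}},
    'subevents': {'t104.c0': {'lemma': 'meld', 'frame': 'Reporting'}},
    "fe's without links": {'t104.c1': {'lemma': 'kamer'}},
}

rows = [(section, identifier, token, info_dict['lemma'])
        for section, identifier, token, info_dict in iter_annotations(toy_values)]

for row in rows:
    print(row)
```

The resulting rows form a flat table, which is convenient as input for counting or for further restructuring.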

Now that you are familiar with the data structure, you can iterate over your corpus and extract and restructure the information you need for your analysis:

In [54]:
print("titles in the Utrecht shooting corpus:")
print()

for filename in glob.glob("Utrecht_shooting/annotated_data/*"):
    with open(filename, 'r') as infile:
        json_dict = json.load(infile)
        for title, info_headers in json_dict.items():
            print(title) 

titles in the Utrecht shooting corpus:

Van Zanen: 'Twee verdachten vrij', politie ontkent
Om kwart voor elf staat de tram stil, daarna heel Utrecht
Vermoedelijke dader en twee andere verdachten aanslag Utrecht opgepakt
Politie Utrecht houdt klopjacht op 37-jarige verdachte Gökmen Tanis
Rutte: vlaggen op overheidsgebouwen vandaag halfstok
'Agenten zagen bij een slachtoffer dat de telefoon maar bleef afgaan. Dat is heel moeilijk’
Rechtbank verplicht verdachte van tramaanslag om naar zitting te komen
Schietpartij 24 Oktoberplein
Onderzoek naar schietincident vandaag verder
Baudet (FvD) voert wél campagne en maakt CDA en VVD verwijten
Verdachte Gökmen T. had schulden en zou woning kwijtraken
Gökmen T. krijgt levenslang voor aanslag in Utrechtse tram
Zwarte dag in Utrecht: dit gebeurde er op de dag van het schietincident
Kleine bijdrage in kosten afscheid lieve Roos
Duizenden mensen melden zich op ikbenveilig.nl
Verplichte rechtsbijstand voor verdachte schietpartij Utrecht
Extra alertheid 

The exercises below will help you to explore the corpus. The full code for the first exercise is given to help you on the way ;)

### Exercise 1:

Use the identifier of the main event to extract all the frames and their lexical units, and rank them in frequency and percentage. Which frame has the largest variety of lexical units?

In [90]:
identifier = "Q62090804"
frames_lus = []

for filename in glob.glob("Utrecht_shooting/annotated_data/*"): #iterate over the corpus
    with open(filename, 'r') as infile:
        json_dict = json.load(infile)

    for title, values in json_dict.items():   #iterate over document
        if identifier in values["frames/links"]:   #check if identifier is present
            for token, info_dict in values["frames/links"][identifier].items():   #iterate over links
                if info_dict['frame'] is not None and 'lemma' in info_dict:   #check if the annotation contains lemma and frame
                    frame = info_dict['frame']
                    if 'compound' in info_dict.keys():
                        lemma_pos = info_dict['compound']+'.'+'NOUN'
                    elif info_dict['POS'] is None:
                        lemma_pos = info_dict['lemma']+'.X'
                    else:
                        lemma_pos = info_dict['lemma']+'.'+info_dict['POS']
                    frames_lus.append((frame, lemma_pos))   #append tuple (frame, LU) to list

print(f"the main event is referenced {len(frames_lus)} times")
print()
print(f"sample of the set of frames used in reference:")
pprint.pprint(frames_lus[:10])

the main event is referenced 249 times

sample of the set of frames used in reference:
[('Attack', 'aanslag.NOUN'),
 ('Hit_target', 'schietincident.NOUN'),
 ('Catastrophe', 'schietincident.NOUN'),
 ('Attack', 'aanslag.NOUN'),
 ('Attack', 'aanslag.NOUN'),
 ('Attack', 'aanslag.NOUN'),
 ('Attack', 'aanslag.NOUN'),
 ('Attack', 'aanval.NOUN'),
 ('Terrorism', 'terroristisch.ADJ'),
 ('Attack', 'aanslag.NOUN')]


In [97]:
def sort_by_values_len(d):
    """Sort a dictionary by the length of its values (longest first). Return a list of single-item dicts."""
    sorted_keys = sorted(d, key=lambda key: len(d[key]), reverse=True)
    return [{key: d[key]} for key in sorted_keys]

frames_dict = defaultdict(list)

for frame_lu in frames_lus:
    frame = frame_lu[0]
    lu = frame_lu[1]
    frames_dict[frame].append(lu)   #create a dictionary with frame as key and list of their annotated lexical units as value

frames_sorted = sort_by_values_len(frames_dict)   #sort dictionary by frequency of annotated lexical units

print("ranking of most dominant frames in reference to the main event:")
print()
print('freq', 'ratio', 'frame')
print()
for frame_dict in frames_sorted[:10]:   #iterate over the top ranked frames
    for frame, lus in frame_dict.items():
        freq = len(lus)
        percentage = round((freq*100)/len(frames_lus), 3)
        print(freq, percentage, frame)

ranking of most dominant frames in reference to the main event:

freq ratio frame

65 26.104 Attack
49 19.679 Hit_target
28 11.245 Catastrophe
25 10.04 Participation
18 7.229 Terrorism
12 4.819 Committing_crime
12 4.819 Killing
7 2.811 Event
7 2.811 Purpose
6 2.41 Commitment


In [99]:
frames_dict = defaultdict(set)

for frame_lu in frames_lus:
    frame = frame_lu[0] 
    lu = frame_lu[1]
    frames_dict[frame].add(lu) #create a dictionary with frame as key and a set of their unique lexical units as value

frames_sorted = sort_by_values_len(frames_dict) #sort dictionary by frequency of unique lexical units

print("ranking of frames with strongest variety of lexical units:")
print()
for frame_d in frames_sorted[:5]:   #iterate over the top ranked frames
    for frame, lus in frame_d.items():
        print(len(lus), frame, set(lus))
        print()

ranking of frames with strongest variety of lexical units:

5 Attack {'aanslagvanochtend.NOUN', 'tramaanslag.NOUN', 'aanslag.NOUN', 'terreuraanslag.NOUN', 'aanval.NOUN'}

5 Killing {'dodelijk.ADJ', 'dood.ADJ', 'liquidatie.NOUN', 'moord.NOUN', 'moordpartij.NOUN'}

4 Hit_target {'doodschieten.NOUN', 'schieten.VERB', 'schietpartij.NOUN', 'schietincident.NOUN'}

4 Terrorism {'terroristisch.ADJ', 'terreur.NOUN', 'terreurdaad.NOUN', 'terreuraanslag.NOUN'}

3 Catastrophe {'incident.NOUN', 'schietincident.NOUN', 'drama.NOUN'}



### Exercise 2

Use the identifier of one of the participants in the main event to extract all the frame elements used to frame this person, and rank them by frequency and percentage. Which frame elements are used most in reference to this person? Basically, you are applying the code of exercise 1 to frame elements linked to participants (instead of frames linked to events).

In [None]:
#your code here


### Exercise 3

Rank the lexical units in this corpus (both linked and not linked) with respect to their variety of frames. In other words, rank the words according to their polysemy.

In [None]:
frames_lus = []

for filename in glob.glob("Utrecht_shooting/annotated_data/*"): #iterate over the corpus
    with open(filename, 'r') as infile:
        json_dict = json.load(infile)

    for title, values in json_dict.items():   #iterate over document
        for identifier, tokens in values["frames/links"].items(): #iterate over structured entries
            for token, info_dict in tokens.items(): #iterate over links
                if info_dict['frame'] is not None and 'lemma' in info_dict.keys():   #check if the annotation contains lemma and frame
                    frame = info_dict['frame']
                    if 'compound' in info_dict.keys():
                        lemma_pos = info_dict['compound']+'.'+'NOUN'
                    elif info_dict['POS'] is None:
                        lemma_pos = info_dict['lemma']+'.X'
                    else:
                        lemma_pos = info_dict['lemma']+'.'+info_dict['POS']
                    frames_lus.append((frame, lemma_pos))   #append tuple (frame, LU) to list
        for token, info_dict in values["subevents"].items(): #iterate over subevents
            #repeat the iteration above for subevents
            #your code here
            pass   #placeholder so the cell runs; replace with your code

lus_dict = defaultdict(set)

for frame_lu in frames_lus:
    frame = frame_lu[0]
    lu = frame_lu[1]
    lus_dict[lu].add(frame)   #collect the unique frames per lexical unit

lus_sorted = sort_by_values_len(lus_dict)   #sort dictionary by frequency of their unique frames

#your code here

### Exercise 4

Compile all frames in reference to subevents and compare them with the frames referencing the main incident.

In [None]:
#your code here
