# Lab2.2: Detecting FrameNet events

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

FrameNet is a database of situation semantics developed at the International Computer Science Institute (ICSI) in Berkeley under the leadership of Charles Fillmore:

https://framenet.icsi.berkeley.edu

FrameNet provides over a thousand frames that represent conceptual schemata for events involving participants in certain roles.

In this notebook, we are going to use the FrameNet module inside the NLTK package to assign frames to words. Words can evoke multiple frames, and FrameNet does not provide frames for every word in the English language. Another problem is that FrameNet does not provide frame distributions. We therefore do not know which frames are more likely than others, nor do we know in which texts frames are more dominant. In this notebook, we are going to use a trick to learn the dominant frames for your text collection.

These are the steps described in this notebook:

* obtain predicates and nominal heads from SpaCy
* look them up in FrameNet through the NLTK toolkit
* determine the dominant frames for a data set
* use the dominant frames to assign frames to a document collection


## 1. FrameNet in NLTK

In [1]:
import nltk

We assume you have NLTK already installed. To use the FrameNet module, you first need to download the FrameNet data through NLTK by running the following cell. If you have done this before, the data is already available to NLTK and you do not need to download it again.

In [3]:
nltk.download('framenet_v17')


After a successful download you can comment out the previous cell.

To check whether the installation was successful, the following code cell should run without errors:

In [4]:
from nltk.corpus import framenet as fn
len(fn.frames())

1221

Now you know how many different frames there are.

There are some instructions on how to use FrameNet in NLTK, although they are quite sparse:

http://www.nltk.org/howto/framenet.html


In [5]:
### get the frame identifier for a specific Frame
print(fn.frames('Killing'))

[<frame ID=590 name=Killing>]


In [6]:
### get the frames for a specific lemma
print(fn.frames_by_lemma('inject'))

[<frame ID=262 name=Abounding_with>, <frame ID=59 name=Filling>, ...]


In [7]:
### get frames with the substring 'medical' regardless of case
print(fn.frames(r'(?i)medical'))

[<frame ID=239 name=Medical_conditions>, <frame ID=257 name=Medical_instruments>, ...]
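
The argument of `fn.frames` is a regular expression, and `(?i)` is the inline flag that switches on case-insensitive matching. A minimal sketch of the same idea with Python's `re` module, assuming a small hand-picked list of frame names taken from this notebook (`toy_frame_names` is illustrative and not read from FrameNet):

```python
import re

# Hand-picked frame names; the real list would come from fn.frames().
toy_frame_names = ["Medical_conditions", "Medical_instruments",
                   "Killing", "Filling", "Cause_motion"]

def match_names(pattern, names):
    """Return the names in which the pattern is found anywhere (re.search)."""
    return [name for name in names if re.search(pattern, name)]

# Without the flag, lowercase 'medical' does not match 'Medical_...'.
case_sensitive = match_names(r"medical", toy_frame_names)
# With (?i), upper and lower case are treated alike.
case_insensitive = match_names(r"(?i)medical", toy_frame_names)

print(case_sensitive)    # []
print(case_insensitive)  # ['Medical_conditions', 'Medical_instruments']
```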


In [9]:
### get a specific frame through its identifier
f = fn.frame(59)
### check what properties and functions are provided for a frame
dict(f)

{'cBy': 'ChW',
 'cDate': '02/07/2001 04:12:13 PST Wed',
 'name': 'Filling',
 'ID': 59,
 '_type': 'frame',
 'definition': "These are words relating to filling containers and covering areas with some thing, things or substance, the Theme. The area or container can appear as the direct object with all these verbs, and is designated Goal because it is the goal of motion of the Theme. Corresponding to its nuclear argument status, it is also affected in some crucial way, unlike goals in other frames.  'Lionel Hutz coated the wall with paint. '",
 'definitionMarkup': '<def-root>These are words relating to filling containers and covering areas with some thing, things or substance, the <fen>Theme</fen>. The area or container can appear as the direct object with all these verbs, and is designated <fen>Goal</fen> because it is the goal of motion of the <fen>Theme</fen>. Corresponding to its nuclear argument status, it is also affected in some crucial way, unlike goals in other frames.\n <ex><fex 

In [10]:
#### print some properties of a frame structure in NLTK

print('ID', f.ID)
print('FRAME:',f.name)
print('DEFINITION', f.definition)
print()
print('LEXICAL UNITS:')
for lu in f.lexUnit:
    print(lu)
print()
print('FRAME ELEMENTS:')
for fe in f.FE:
    print(fe)

ID 59
FRAME: Filling
DEFINITION These are words relating to filling containers and covering areas with some thing, things or substance, the Theme. The area or container can appear as the direct object with all these verbs, and is designated Goal because it is the goal of motion of the Theme. Corresponding to its nuclear argument status, it is also affected in some crucial way, unlike goals in other frames.  'Lionel Hutz coated the wall with paint. '

LEXICAL UNITS:
adorn.v
anoint.v
cover.v
dust.v
load.v
pack.v
smear.v
spread.v
stuff.v
wrap.v
plaster.v
drape.v
dab.v
daub.v
inject.v
cram.v
sow.v
seed.v
brush.v
hang.v
spatter.v
splash.v
splatter.v
spray.v
sprinkle.v
squirt.v
shower.v
drizzle.v
heap.v
pile.v
pump.v
jam.v
plant.v
scatter.v
butter.v
asphalt.v
surface.v
tile.v
wallpaper.v
coat.v
suffuse.v
fill.v
strew.v
douse.v
flood.v
crowd.v
pave.v
varnish.v
paint.v
gild.v
glaze.v
embellish.v
panel.v
wax.v
wash.v
plank.v
yoke.v
dress.v
accessorize.v

FRAME ELEMENTS:
Agent
Theme
Source
Path
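
The lexical unit names printed above combine a lemma and a part-of-speech tag, separated by a dot. A small sketch of how such names could be split, assuming a few names copied from the output above (the helper `split_lu_name` is hypothetical and not part of NLTK):

```python
def split_lu_name(lu_name):
    """Split 'lemma.pos' at the last dot, so only the part-of-speech suffix is cut off."""
    lemma, pos = lu_name.rsplit(".", 1)
    return lemma, pos

# A few lexical unit names copied from the Filling frame.
sample_lu_names = ["inject.v", "fill.v", "wallpaper.v"]
parsed = [split_lu_name(name) for name in sample_lu_names]

print(parsed)  # [('inject', 'v'), ('fill', 'v'), ('wallpaper', 'v')]
```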


In [11]:
print('FRAME RELATIONS:')
for relation in f.frameRelations:
    # each relation connects a super frame to a sub frame;
    # use relation.subFrameName for the other side or print(relation) for the full relation
    print(relation.superFrameName)

FRAME RELATIONS:
Container_focused_placing
Cause_motion
Distributed_position
Placing
Filling


## 2. Getting frames for predicates

Frames can be evoked by many different words and phrases. In the following example, the subject and object of *cause* also denote events and actually carry more information than the main predicate:

```Vaccination can cause autism```

In this notebook, however, we restrict ourselves to predicates, as it is more complex to decide whether subjects and objects denote events as well. To find the predicates, we can rely on the syntactic parsing by spaCy, as we did in the previous notebook.

We repeat here for convenience the cells with our example sentence and the dependency tree rendering. We also re-use our function for obtaining event tuples from the dependency relations.

In [12]:
import spacy
from spacy import displacy
# depending on how you installed spaCy, the name of the model might be different
nlp = spacy.load(name='en_core_web_sm') 
text = "John makes the cake . He got sick . He went to bed ."
doc = nlp(text)

In [13]:
displacy.render(doc, jupyter=True, style='dep')

In [14]:
def get_predicate_subject_object(doc, rels={'nsubj', 'dobj', 'prep'}):
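    """
    Extract predicates together with their subject and object.

    A token counts as a predicate when it is the head of at least one
    token whose dependency relation is in rels.

    :param spacy.tokens.doc.Doc doc: spaCy object after processing text
    :param set rels: dependency relations used to find predicates

    :rtype: list
    :return: list of tuples (predicate, subject, object)
    """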
    predicates = {}
    
    for token in doc:
        if token.dep_ in rels:
            
            head = token.head
            head_id = head.i
            
            if head_id not in predicates:
                predicates[head_id] = dict()
            
            predicates[head_id][token.dep_] = token.lemma_
    
    output = []
    for pred_token, pred_info in predicates.items():
        one_row = (doc[pred_token].lemma_, 
                   pred_info.get('nsubj', None),
                   pred_info.get('dobj', None)
                  )
        output.append(one_row)
    
    return output
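
The grouping step of this function can be tried out without spaCy. The sketch below assumes hand-written `(index, lemma, dep, head_index)` tuples in place of spaCy tokens; the annotations are made up for illustration and are not real parser output. It shows that a word becomes a predicate as soon as it is the head of a token whose relation is in `rels`, and that slots without a match stay `None`.

```python
# Hand-written stand-ins for spaCy tokens: (index, lemma, dep, head_index).
# Assumed annotations for "John makes the cake" and "He went to bed".
toy_tokens_1 = [(0, "John", "nsubj", 1), (1, "make", "ROOT", 1),
                (2, "the", "det", 3), (3, "cake", "dobj", 1)]
toy_tokens_2 = [(0, "he", "nsubj", 1), (1, "go", "ROOT", 1),
                (2, "to", "prep", 1), (3, "bed", "pobj", 2)]

def group_by_head(tokens, rels=frozenset({"nsubj", "dobj", "prep"})):
    """Collect, per head index, the lemmas of dependents whose relation is in rels."""
    lemmas = {index: lemma for index, lemma, _, _ in tokens}
    predicates = {}
    for _, lemma, dep, head_index in tokens:
        if dep in rels:
            predicates.setdefault(head_index, {})[dep] = lemma
    # One tuple per head: (predicate, subject or None, object or None).
    return [(lemmas[head], info.get("nsubj"), info.get("dobj"))
            for head, info in predicates.items()]

toy_events_1 = group_by_head(toy_tokens_1)
toy_events_2 = group_by_head(toy_tokens_2)
print(toy_events_1)  # [('make', 'John', 'cake')]
print(toy_events_2)  # [('go', 'he', None)]
```

Because `prep` is part of `rels`, any word that governs a preposition is also returned as a predicate, which is one reason why nouns can show up as event words later on.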

Given that we can process the text with spaCy and obtain the events, we can now make a simple script to iterate over the event tuples and obtain all the frames for each event word. The next cell does that:

In [15]:
events = get_predicate_subject_object(doc)
for event in events:
    predicate=event[0]
    print(event)
    frames = fn.frames_by_lemma(predicate)
    print('Number of frames:', len(frames))
    frame_names=[]
    for frame in frames:
        frame_names.append(frame.name)
    print(frame_names)

('make', 'John', 'cake')
Number of frames: 27
['Arriving', 'Behind_the_scenes', 'Body_decoration', 'Building', 'Causation', 'Cause_change', 'Communicate_categorization', 'Cooking_creation', 'Creating', 'Destroying', 'Earnings_and_losses', 'Fame', 'Historic_event', 'Intentionally_create', 'Leadership', 'Make_acquaintance', 'Making_arrangements', 'Manufacturing', 'People_by_vocation', 'Personal_success', 'Procreative_sex', 'Reparation', 'Self_motion', 'Sex', 'Theft', 'Type', 'Verification']
('get', '-PRON-', None)
Number of frames: 51
['Abandonment', 'Accompaniment', 'Accoutrements', 'Activity_prepare', 'Activity_start', 'Aiming', 'Amalgamation', 'Arriving', 'Board_vehicle', 'Body_movement', 'Bringing', 'Building', 'Cause_to_amalgamate', 'Cause_to_wake', 'Clothing', 'Collaboration', 'Come_down_with', 'Come_together', 'Contacting', 'Cooking_creation', 'Disembarking', 'Dressing', 'Dynamism', 'Escaping', 'Evading', 'Food', 'Gathering_up', 'Getting', 'Getting_underway', 'Getting_up', 'Giving

We can see that these predicates are very polysemous! Many of these frames are very general and nonspecific. So which of these frames are most relevant for our sentences? In other words, which frames tell the story?
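
Part of this apparent polysemy may come from the lookup itself. According to the NLTK documentation, `fn.frames_by_lemma` treats its argument as a regular expression that is matched against lexical unit names of the form `lemma.POS`, so a short lemma can also match inside longer names. The sketch below illustrates the effect with Python's `re` module on a hand-made list of names; `toy_lu_names` and `exact_lu_pattern` are illustrative assumptions, and whether anchoring the pattern is appropriate for your data is something to verify against FrameNet itself.

```python
import re

# Hand-made lexical unit names in the 'lemma.POS' style (illustrative only).
toy_lu_names = ["make.v", "make up.v", "remake.v", "maker.n", "bake.v"]

def exact_lu_pattern(lemma, pos):
    """Build a pattern that matches only the whole name '<lemma>.<pos>'."""
    return r"^" + re.escape(lemma) + r"\." + re.escape(pos) + r"$"

# Unanchored search: 'make' is found inside several longer names.
loose = [name for name in toy_lu_names if re.search("make", name)]
# Anchored search: only the verb 'make' itself is accepted.
strict = [name for name in toy_lu_names
          if re.search(exact_lu_pattern("make", "v"), name)]

print(loose)   # ['make.v', 'make up.v', 'remake.v', 'maker.n']
print(strict)  # ['make.v']
```

If the lookup behaves as documented, an anchored pattern such as `exact_lu_pattern('make', 'v')` could be passed to `fn.frames_by_lemma` to narrow the candidate frames.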

## 3. How to choose the right frame?

Unfortunately, there is no data on the frame distribution for words. We can therefore not compare the frames in our domain with the frames in other collections of text data. For example, we cannot apply a contrastive analysis using ```TF*IDF``` for our domain data set to learn which frames are more frequent than expected.

We are going to follow a different procedure here and check how well it works. We assume that words that are more specific also tend to have only a single frame and that these frames are precise indications of what the domain is about. Therefore, if we collect the frames from the words with a single frame (monosemous words), we get statistics on the domain-specific frames for free.

We are going to do this in the following steps:

<ol>
    <li>For all events in all documents, put the frames of monosemous predicates in a list
    <li>We count how frequently these frames occur and store the counts for later use. We call these frames the dominant frames of the collection.
    <li>When we process the events from a document, we check for each predicate if it has frames that match these dominant frames.
    <li>If the dominant frame scores above a threshold, we keep the frame and the events, otherwise we ignore them.
</ol>
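
The four steps can be sketched on toy data without FrameNet or spaCy. The mapping `toy_frames_by_lemma` and the predicate list below are invented for illustration; in the notebook these come from `fn.frames_by_lemma` and `get_predicate_subject_object`.

```python
from collections import Counter

# Invented lemma-to-frames mapping standing in for the FrameNet lookup.
toy_frames_by_lemma = {
    "announce": ["Statement"],
    "declare": ["Statement"],
    "state": ["Statement"],
    "say": ["Statement", "Text_creation"],
    "work": ["Work"],
    "make": ["Building", "Cooking_creation", "Manufacturing"],
}
toy_predicates = ["announce", "declare", "state", "say", "work", "make"]

# Steps 1 and 2: count the frames of monosemous predicates only.
dominant = Counter(
    toy_frames_by_lemma[lemma][0]
    for lemma in toy_predicates
    if len(toy_frames_by_lemma[lemma]) == 1
)

# Steps 3 and 4: per predicate, keep the best-scoring frame if it is above the threshold.
threshold = 2
selected = {}
for lemma in toy_predicates:
    best = max(toy_frames_by_lemma[lemma], key=lambda frame: dominant[frame])
    if dominant[best] > threshold:
        selected[lemma] = best

print(dominant)  # Counter({'Statement': 3, 'Work': 1})
print(selected)  # the four 'Statement' predicates; 'work' and 'make' are dropped
```

Note that the polysemous predicate `say` is resolved to `Statement` because that frame was learned from the monosemous predicates, while `work` is dropped because its only frame does not score above the threshold.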


### 3.1 Learning the dominant frames

To test our idea, we are going to apply this to a single document. We get all the event predicates and store the frame only for monosemous words. 

In [16]:
#### Change the path to your own text file
path_to_file='../lab1-getting-text/techcrunch_search_results/apple%20os%20x17.txt'
monosemous_frame_counts = [] ### where we store the frames
with open(path_to_file) as infile:
    text = infile.read()
    doc = nlp(text)
    events = get_predicate_subject_object(doc)
    for event in events:
        predicate=event[0]
        frames = fn.frames_by_lemma(predicate)
        if len(frames)==1:
            monosemous_frame_counts.append(frames[0].name)


print(monosemous_frame_counts)



To obtain a list for a specific domain, we need to apply this to all texts in our collection. We load all the files from our folder to obtain the monosemous frames.

In [18]:
from pathlib import Path

monosemous_frame_counts = []

basepath = Path('../lab1-getting-text/techcrunch_search_results/')
files_in_basepath = basepath.iterdir()
for path_to_file in files_in_basepath:
    if path_to_file.is_file():  # check that the item is not a subdirectory!!
        print(path_to_file.name)
        with open(path_to_file) as infile:
            text = infile.read()
            doc = nlp(text)
            events = get_predicate_subject_object(doc)
            ### We iterate over the events for a document
            for event in events:
                predicate=event[0]
                frames = fn.frames_by_lemma(predicate)
                if len(frames)==1:
                    monosemous_frame_counts.append(frames[0].name)



apple%20os%20x17.txt
apple%20os%20x16.txt
apple%20os%20x14.txt
apple%20os%20x9.txt
apple%20os%20x28.txt
apple%20os%20x29.txt
apple%20os%20x8.txt
apple%20os%20x15.txt
apple%20os%20x39.txt
apple%20os%20x11.txt
apple%20os%20x10.txt
apple%20os%20x38.txt
apple%20os%20x12.txt
apple%20os%20x13.txt
apple%20os%20x60.txt
apple%20os%20x48.txt
apple%20os%20x49.txt
apple%20os%20x61.txt
apple%20os%20x59.txt
apple%20os%20x58.txt
apple%20os%20x55.txt
apple%20os%20x41.txt
apple%20os%20x40.txt
apple%20os%20x54.txt
apple%20os%20x42.txt
apple%20os%20x56.txt
apple%20os%20x57.txt
apple%20os%20x43.txt
apple%20os%20x47.txt
apple%20os%20x53.txt
apple%20os%20x52.txt
apple%20os%20x46.txt
apple%20os%20x50.txt
apple%20os%20x44.txt
apple%20os%20x45.txt
apple%20os%20x51.txt
apple%20os%20x36.txt
apple%20os%20x3.txt
apple%20os%20x22.txt
apple%20os%20x23.txt
apple%20os%20x2.txt
apple%20os%20x37.txt
apple%20os%20x21.txt
apple%20os%20x35.txt
apple%20os%20x1.txt
apple%20os%20x34.txt
apple%20os%20x20.txt
apple%20os%20x18.t

We can now again use the ```Counter``` class to obtain frame frequencies for our whole collection. This gives us a reduced list of more specific frames with their frequency counts.

In [19]:
from collections import Counter
counted_frames = Counter(monosemous_frame_counts)
print(counted_frames)
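
To rank the frames, `Counter` also offers the `most_common` method. A short sketch on an invented list of frame names (the real list is `monosemous_frame_counts` built above), which also turns the raw counts into relative frequencies:

```python
from collections import Counter

# Invented sample; in the notebook this list comes from the document collection.
toy_frame_list = ["Statement", "Work", "Statement", "Containers", "Statement", "Work"]
toy_counts = Counter(toy_frame_list)

# The two most frequent frames with their counts.
top_two = toy_counts.most_common(2)

# Relative frequency of each frame among all counted frames.
total = sum(toy_counts.values())
relative = {frame: count / total for frame, count in toy_counts.items()}

print(top_two)                # [('Statement', 3), ('Work', 2)]
print(relative["Statement"])  # 0.5
```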



### 3.2 Choosing the dominant frame

We can store the above frame set as a distributional resource and load it for future usage. We use the ```pickle``` module to store the result as *binary* data.

In [20]:
import pickle
with open('dominant-frame-counts.pickle', 'wb') as outputfile:
    pickle.dump(counted_frames, outputfile)

We can load the file at any moment using the same ```pickle``` module. We use a different variable to load the counts.

In [21]:
loaded_counts=()
with open('dominant-frame-counts.pickle', 'rb') as inputfile:
    loaded_counts=pickle.load(inputfile)

print(loaded_counts)
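
Pickling keeps the object type, so the loaded object is still a `Counter`. This matters for the lookup later on, because a `Counter` returns 0 for a frame it has never seen instead of raising a `KeyError`. A minimal in-memory sketch with invented counts, using `pickle.dumps` and `pickle.loads` instead of a file:

```python
import pickle
from collections import Counter

toy_counts = Counter({"Statement": 3, "Work": 2})  # invented counts

# Serialize to bytes and back, mirroring pickle.dump / pickle.load on a file.
restored = pickle.loads(pickle.dumps(toy_counts))

print(type(restored).__name__)  # Counter
print(restored["Statement"])    # 3
print(restored["Killing"])      # 0, no KeyError for an unseen frame
```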



Now, we can process any set of documents, obtain all events and check each event for the most dominant frame if any. To find the most dominant frame, we look up the associated frames in *loaded_counts*. If multiple frames are present, we take the one with the highest frequency. If none is present, we ignore the event.

In the next cell, we show how to do this for a single file. We first set a threshold to only keep frames and events if the frame scores above this threshold. In this way, we can tune the degree of dominance and restrict the frames and events.

If the predicate has multiple frames (polysemous), we keep the frame with the highest count. If the predicate has only one frame, that frame is automatically the highest-scoring one.

Next, we store the frame and the event only if the top score is above the threshold. We also keep track of the number of ignored events for the given threshold.

In [22]:
### dictionary in which we store for each frame the list of events
frame_event_dictionary={}

### threshold how dominant the frame has to be to keep an event
threshold=2

#### Change the path to your own text file
path_to_file='../lab1-getting-text/techcrunch_search_results/apple%20os%20x17.txt'
with open(path_to_file) as infile:
    text = infile.read()
    doc = nlp(text)
    events = get_predicate_subject_object(doc)
    number_of_ignored_events=0 ### counter to keep track of ignored events
    for event in events:
        predicate=event[0] ## first element from the tuple
        frames = fn.frames_by_lemma(predicate)
        
        ### best candidate frame
        top_score=0
        top_frame = ""
        for frame in frames:
            ### we get the count
            count = loaded_counts[frame.name]
            ### if it is higher than the current top score, we update the top score and the top frame
            if (count>top_score):
                top_score=count
                top_frame=frame.name

        ### if the top_score is above the threshold, we store the frame and the event
        if top_score>threshold:
            if top_frame in frame_event_dictionary:
                ### if the frame is in the dictionary, we append the event to the list
                frame_event_dictionary[top_frame].append(event)
            else:
                ### if the frame is not present, we create a new list entry for the frame with the event
                frame_event_dictionary[top_frame]=[event]
        else:
            ## We ignore this event because its best frame does not score above the threshold
            print('Ignoring this event:',event)
            number_of_ignored_events+=1

print('Threshold=', threshold, ' ignored:', number_of_ignored_events, ' events out of:', len(events))

Ignoring this event: ('spend', None, 'year')
Ignoring this event: ('industry', None, None)
Ignoring this event: ('continue', 'war', None)
Ignoring this event: ('free', 'delivery', None)
Ignoring this event: ('wipe', 'bug', None)
Ignoring this event: ('remove', 'sweep', 'rating')
Ignoring this event: ('sweep', None, None)
Ignoring this event: ('brand', None, None)
Ignoring this event: ('free', None, None)
Ignoring this event: ('Ahead', None, None)
Ignoring this event: ('launch', None, None)
Ignoring this event: ('affordable', 'service', None)
Ignoring this event: ('launch', 'Spotify', 'app')
Ignoring this event: ('subscriber', None, 'Oct')
Ignoring this event: ('boost', None, 'subscription')
Ignoring this event: ('launch', None, None)
Ignoring this event: ('listen', None, None)
Ignoring this event: ('onli', 'both', None)
Ignoring this event: ('14.99', None, None)
Ignoring this event: ('pricing', None, None)
Ignoring this event: ('along', None, None)
Ignoring this event: ('share', None, 

The last line tells us how many events have been ignored because their frames are not dominant enough.

In [23]:
for frame, events in frame_event_dictionary.items():
    print(frame, events)
    print()

Work [('work', 'Sarah', None), ('work', None, None), ('work', 'Sarah', None), ('work', 'that', None)]

People_by_vocation [('writer', None, None), ('make', 'Amazon', None), ('serve', None, None), ('manager', None, None), ('report', None, None), ('teach', 'set', 'user'), ('home', None, None)]

Emphasizing [('prior', None, None), ('as', None, None), ('grow', '-PRON-', None), ('m', None, None), ('focus', '’s', None)]

Quantified_mass [('number', None, None), ('bil', '194', None), ('let', 'ga', None), ('number', None, None)]

Containers [('offer', 'late', 'pickup'), ('offer', 'service', 'booze'), ('back', None, None), ('back', None, None), ('back', None, None), ('offer', 'GameClub', 'hit'), ('offer', None, 'account')]

Destroying [('late', None, None), ('take', 'GameClub', 'Oct'), ('out', None, None), ('out', None, None)]

Judgment_communication [('tout', 'Walmart', None), ('raise', 'Current', 'B'), ('raise', None, 'billion')]

Statement [('announce', 'retailer', 'milestone'), ('announce',

You can play with the threshold above and see what types of frames and events get selected.
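
One way to explore the threshold more systematically is to count how many events survive at each value. The sketch below assumes an invented list of top scores, one per event, as computed by the loop above (`toy_top_scores` is hypothetical data, not output of this notebook):

```python
# Invented top scores, one per event (the highest frame count found for its predicate).
toy_top_scores = [0, 0, 1, 2, 3, 3, 5, 8]

def kept_events(scores, threshold):
    """Count the events whose top score is strictly above the threshold."""
    return sum(1 for score in scores if score > threshold)

# Number of kept events for a few candidate thresholds.
sweep = {threshold: kept_events(toy_top_scores, threshold)
         for threshold in (0, 1, 2, 4)}

print(sweep)  # {0: 6, 1: 5, 2: 4, 4: 2}
```

The comparison is strict, as in the cell above, so an event whose top score equals the threshold is ignored.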

## End of this Notebook