# Lab2.2: Detecting FrameNet events

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

FrameNet is a lexical database of situation semantics, developed at the International Computer Science Institute (ICSI) in Berkeley under the leadership of Charles J. Fillmore:

https://framenet.icsi.berkeley.edu

FrameNet provides over a thousand frames that represent conceptual schemata for events involving participants in certain roles.

In this notebook, we use the FrameNet module in the NLTK package to assign frames to words. Keep in mind that a word can evoke multiple frames and that FrameNet does not provide frames for every word in the English language. We will:

* obtain predicates and nominal heads from spaCy
* look them up in FrameNet through the NLTK toolkit
* determine the dominant frames for a data set
* use the dominant frames to assign frames to a document collection


## 2. FrameNet in NLTK

We assume you have NLTK already installed. To use the FrameNet module, you need to download FrameNet within it. Run the following cell to download it within NLTK:

In [2]:
import nltk
nltk.download('framenet_v17')

[nltk_data] Downloading package framenet_v17 to
[nltk_data]     /Users/piek/nltk_data...
[nltk_data]   Package framenet_v17 is already up-to-date!


True

To check whether the installation was successful, run the following code cell; it should execute without errors:

In [6]:
from nltk.corpus import framenet as fn
len(fn.frames())

1221

Now you know how many different frames there are.

There are some instructions on how to use FrameNet in NLTK, although they are quite sparse:

http://www.nltk.org/howto/framenet.html


In [22]:
### get the frame identifier for a specific Frame
print(fn.frames('Killing'))

[<frame ID=590 name=Killing>]


In [27]:
### get the frames for a specific lemma
print(fn.frames_by_lemma('inject'))

[<frame ID=262 name=Abounding_with>, <frame ID=59 name=Filling>, ...]


In [7]:
### get frames with the substring 'medical' regardless of case
print(fn.frames(r'(?i)medical'))

[<frame ID=239 name=Medical_conditions>, <frame ID=257 name=Medical_instruments>, ...]


In [28]:
### get a specific frame through its identifier
f = fn.frame(59)
dict(f)

{'cBy': 'ChW',
 'cDate': '02/07/2001 04:12:13 PST Wed',
 'name': 'Filling',
 'ID': 59,
 '_type': 'frame',
 'definition': "These are words relating to filling containers and covering areas with some thing, things or substance, the Theme. The area or container can appear as the direct object with all these verbs, and is designated Goal because it is the goal of motion of the Theme. Corresponding to its nuclear argument status, it is also affected in some crucial way, unlike goals in other frames.  'Lionel Hutz coated the wall with paint. '",
 'definitionMarkup': '<def-root>These are words relating to filling containers and covering areas with some thing, things or substance, the <fen>Theme</fen>. The area or container can appear as the direct object with all these verbs, and is designated <fen>Goal</fen> because it is the goal of motion of the <fen>Theme</fen>. Corresponding to its nuclear argument status, it is also affected in some crucial way, unlike goals in other frames.\n <ex><fex 

In [43]:
#### print some properties of a frame structure in NLTK

print('ID', f.ID)
print('FRAME:',f.name)
print('DEFINITION', f.definition)
print()
print('LEXICAL UNITS:')
for lu in f.lexUnit:
    print(lu)
print()
print('FRAME ELEMENTS:')
for fe in f.FE:
    print(fe)

ID 59
FRAME: Filling
DEFINITION These are words relating to filling containers and covering areas with some thing, things or substance, the Theme. The area or container can appear as the direct object with all these verbs, and is designated Goal because it is the goal of motion of the Theme. Corresponding to its nuclear argument status, it is also affected in some crucial way, unlike goals in other frames.  'Lionel Hutz coated the wall with paint. '

LEXICAL UNITS:
adorn.v
anoint.v
cover.v
dust.v
load.v
pack.v
smear.v
spread.v
stuff.v
wrap.v
plaster.v
drape.v
dab.v
daub.v
inject.v
cram.v
sow.v
seed.v
brush.v
hang.v
spatter.v
splash.v
splatter.v
spray.v
sprinkle.v
squirt.v
shower.v
drizzle.v
heap.v
pile.v
pump.v
jam.v
plant.v
scatter.v
butter.v
asphalt.v
surface.v
tile.v
wallpaper.v
coat.v
suffuse.v
fill.v
strew.v
douse.v
flood.v
crowd.v
pave.v
varnish.v
paint.v
gild.v
glaze.v
embellish.v
panel.v
wax.v
wash.v
plank.v
yoke.v
dress.v
accessorize.v

FRAME ELEMENTS:
Agent
Theme
Source
Path


In [44]:
print('FRAME RELATIONS:')
for relation in f.frameRelations:
    # print(relation.subFrameName)
    print(relation.superFrameName)
    print(relation)


FRAME RELATIONS:
Container_focused_placing
<Parent=Container_focused_placing -- Inheritance -> Child=Filling>
Cause_motion
<MainEntry=Cause_motion -- See_also -> ReferringEntry=Filling>
Distributed_position
<MainEntry=Distributed_position -- See_also -> ReferringEntry=Filling>
Placing
<MainEntry=Placing -- See_also -> ReferringEntry=Filling>
Filling
<Causative=Filling -- Causative_of -> Inchoative/state=Fullness>


## 3. Getting frames for predicates

In [1]:
import spacy
from spacy import displacy
# depending on how you installed spaCy, the name of the model might be different
nlp = spacy.load(name='en_core_web_sm') 
text = "John makes the cake . He got sick . He went to bed ."
doc = nlp(text)

In [2]:
displacy.render(doc, jupyter=True, style='dep')

In [4]:
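def get_predicate_subject_object(doc, rels={'nsubj', 'dobj', 'prep'}):
    """
    Extract predicates together with their subject and direct object.
    
    :param spacy.tokens.doc.Doc doc: spaCy object after processing text
    :param set rels: dependency labels that mark the head of a token as a predicate
    
    :rtype: list 
    :return: list of tuples (predicate, subject, object), all as lemmas;
    subject or object is None when the predicate does not have one
    """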
    predicates = {}
    
    for token in doc:
        if token.dep_ in rels:
            
            head = token.head
            head_id = head.i
            
            if head_id not in predicates:
                predicates[head_id] = dict()
            
            predicates[head_id][token.dep_] = token.lemma_
    
    output = []
    for pred_token, pred_info in predicates.items():
        one_row = (doc[pred_token].lemma_, 
                   pred_info.get('nsubj', None),
                   pred_info.get('dobj', None)
                  )
        output.append(one_row)
    
    return output

In [13]:
events = get_predicate_subject_object(doc)
for event in events:
    # 'obj' instead of 'object' avoids shadowing the Python built-in
    predicate, subject, obj = event
    print(event)
    frames = fn.frames_by_lemma(predicate)
    print('Number of frames:', len(frames))
    frame_names=[]
    for frame in frames:
        frame_names.append(frame.name)
    print(frame_names)
    

('make', 'John', 'cake')
Number of frames: 27
['Arriving', 'Behind_the_scenes', 'Body_decoration', 'Building', 'Causation', 'Cause_change', 'Communicate_categorization', 'Cooking_creation', 'Creating', 'Destroying', 'Earnings_and_losses', 'Fame', 'Historic_event', 'Intentionally_create', 'Leadership', 'Make_acquaintance', 'Making_arrangements', 'Manufacturing', 'People_by_vocation', 'Personal_success', 'Procreative_sex', 'Reparation', 'Self_motion', 'Sex', 'Theft', 'Type', 'Verification']
('get', '-PRON-', None)
Number of frames: 51
['Abandonment', 'Accompaniment', 'Accoutrements', 'Activity_prepare', 'Activity_start', 'Aiming', 'Amalgamation', 'Arriving', 'Board_vehicle', 'Body_movement', 'Bringing', 'Building', 'Cause_to_amalgamate', 'Cause_to_wake', 'Clothing', 'Collaboration', 'Come_down_with', 'Come_together', 'Contacting', 'Cooking_creation', 'Disembarking', 'Dressing', 'Dynamism', 'Escaping', 'Evading', 'Food', 'Gathering_up', 'Getting', 'Getting_underway', 'Getting_up', 'Giving

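We can see that these predicates are very polysemous! Many of these frames are very general and non-specific. Note that `frames_by_lemma` treats its argument as a regular expression that is searched for in the names of the lexical units, so a short lemma such as `get` also matches longer lexical units that merely contain it, which inflates the number of frames. So which of these frames applies to our sentences?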

## 4. Choosing the right frame

There is hardly any distributional data for FrameNet, so we cannot apply a contrastive analysis, e.g. using `TF*IDF`, to our domain data set.

As an alternative method, we are going to investigate the monosemous predicates, i.e. the predicates for which FrameNet returns exactly one frame.
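The idea can be tried out on a toy mapping before running it on real text. The sketch below assumes a plain dictionary from lemma to frame names; the entries and the helper name `monosemous_frames` are illustrative only and do not come from an actual FrameNet lookup.

```python
# Toy mapping of lemma -> candidate frame names (illustrative entries only).
# In the notebook these lists would come from fn.frames_by_lemma(lemma).
toy_frames_by_lemma = {
    "make": ["Building", "Cooking_creation", "Manufacturing"],
    "vaccinate": [],
    "announce": ["Statement"],
}


def monosemous_frames(frames_by_lemma):
    """Keep only the lemmas that evoke exactly one frame."""
    return {
        lemma: frames[0]
        for lemma, frames in frames_by_lemma.items()
        if len(frames) == 1
    }


result = monosemous_frames(toy_frames_by_lemma)
print(result)
```

Lemmas without any frame and lemmas with several frames are both dropped, so only the unambiguous evidence is kept.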

In [19]:
#### Change the path to your own text file
path_to_file='../lab1-getting-text/techcrunch_search_results/apple%20os%20x17.txt'
monesemous_event_frames = {}
monesemous_frame_counts = []
with open(path_to_file) as infile:
    text = infile.read()
    doc = nlp(text)
    events = get_predicate_subject_object(doc)
    for event in events:
        predicate=event[0]
        frames = fn.frames_by_lemma(predicate)
        if len(frames)==1:
            monesemous_event_frames[predicate]=frames[0].name
            monesemous_frame_counts.append(frames[0].name)

print(monesemous_event_frames)




We can now apply this to our whole collection and again use `Counter` to obtain the frame frequencies. This gives us a reduced list of more specific frames.

In [None]:
import spacy
from spacy import displacy
# the 'en' shortcut is not available in spaCy v3 and later; load the model by its full name
nlp = spacy.load('en_core_web_sm')

from pathlib import Path
basepath = Path('/Users/piek/Desktop/Language-as-data/LAD-labs/lab1-getting-text/techcrunch_search_results/')
files_in_basepath = basepath.iterdir()

monesemous_frame_counts = []

for path_to_file in files_in_basepath:
    if path_to_file.is_file():  # check that the item is not a subdirectory
        print(path_to_file.name)
        with open(path_to_file) as infile:
            text = infile.read()
            doc = nlp(text)
            events = get_predicate_subject_object(doc)
            for event in events:
                predicate=event[0]
                frames = fn.frames_by_lemma(predicate)
                if len(frames)==1:
                    monesemous_event_frames[predicate]=frames[0].name
                    monesemous_frame_counts.append(frames[0].name)



apple%20os%20x17.txt
apple%20os%20x16.txt
apple%20os%20x14.txt
apple%20os%20x9.txt
apple%20os%20x28.txt
apple%20os%20x29.txt
apple%20os%20x8.txt
apple%20os%20x15.txt
apple%20os%20x39.txt
apple%20os%20x11.txt
apple%20os%20x10.txt
apple%20os%20x38.txt
apple%20os%20x12.txt
apple%20os%20x13.txt
apple%20os%20x60.txt
apple%20os%20x48.txt
apple%20os%20x49.txt
apple%20os%20x61.txt
apple%20os%20x59.txt
apple%20os%20x58.txt
apple%20os%20x55.txt
apple%20os%20x41.txt
apple%20os%20x40.txt
apple%20os%20x54.txt
apple%20os%20x42.txt
apple%20os%20x56.txt
apple%20os%20x57.txt
apple%20os%20x43.txt
apple%20os%20x47.txt
apple%20os%20x53.txt
apple%20os%20x52.txt
apple%20os%20x46.txt
apple%20os%20x50.txt
apple%20os%20x44.txt
apple%20os%20x45.txt
apple%20os%20x51.txt
apple%20os%20x36.txt


In [None]:
from collections import Counter
print(Counter(monesemous_frame_counts))
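Printing the whole `Counter` can get long for a large collection. As a small sketch, `most_common` returns only the top entries; the list below is a made-up stand-in for `monesemous_frame_counts`, not real output.

```python
from collections import Counter

# Hypothetical frame names standing in for monesemous_frame_counts.
toy_frame_names = [
    "Statement", "Using", "Statement",
    "Launch_process", "Using", "Statement",
]

frame_counts = Counter(toy_frame_names)

# most_common(n) returns the n most frequent items as (item, count) tuples
top_two = frame_counts.most_common(2)
for frame_name, freq in top_two:
    print(f"{frame_name}\t{freq}")
```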

In [69]:
# spaCy uses different PoS tags from FrameNet
spacy_pos_to_fn_pos = {
    'VERB' : 'v',
    'NOUN' : 'n'
}
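FrameNet names its lexical units as `lemma.pos`, e.g. `inject.v`. A minimal sketch of how the mapping above could be used to build such a name from a spaCy lemma and PoS tag follows; the helper name `to_lexical_unit` is an assumption made for this illustration.

```python
SPACY_POS_TO_FN_POS = {"VERB": "v", "NOUN": "n"}


def to_lexical_unit(lemma, spacy_pos, pos_map=SPACY_POS_TO_FN_POS):
    """Return a name such as 'inject.v', or None if the PoS is not mapped."""
    fn_pos = pos_map.get(spacy_pos)
    if fn_pos is None:
        return None
    return f"{lemma}.{fn_pos}"


print(to_lexical_unit("inject", "VERB"))
print(to_lexical_unit("doctor", "NOUN"))
print(to_lexical_unit("the", "DET"))
```

Returning `None` for unmapped tags makes it easy to skip determiners, punctuation and other tokens that we do not treat as event candidates.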

In [56]:
import spacy
from spacy import displacy
# the 'en' shortcut is not available in spaCy v3 and later; load the model by its full name
nlp = spacy.load('en_core_web_sm')

from pathlib import Path
basepath = Path('/Users/piek/Desktop/Language-as-data/LAD-labs/lab1-getting-text/techcrunch_search_results/')
files_in_basepath = basepath.iterdir()

monesemous_frame_counts = []


frame_dict={}
parent_list=[]

for path_to_file in files_in_basepath:
    if path_to_file.is_file():  # check that the item is not a subdirectory
        print(path_to_file.name)
        with open(path_to_file) as infile:
            text = infile.read()
            doc = nlp(text)
            i=0
            for token in doc:
                # lu = token.lemma_ + "." + spacy_pos_to_fn_pos[token.pos_]
                i+=1
                if i>50:
                    break
                try:
                    frames = fn.frames_by_lemma(token.lemma_)
                    if (len(frames)>0):
                        for frame in frames:
                            for relation in frame.frameRelations:
                                if frame.name!=relation.superFrameName:
                                    if frame.name in frame_dict:
                                        if relation.superFrameName not in frame_dict[frame.name]:
                                            frame_dict[frame.name].append(relation.superFrameName)
                                            parent_list.append(relation.superFrameName)
                                    else:
                                        frame_dict[frame.name]=[relation.superFrameName]
                                        parent_list.append(relation.superFrameName)
                                #else:
                                   # print(frame.name, ":",relation.superFrameName )
                except Exception:  # a bare except would also swallow KeyboardInterrupt
                    print('error getting frames for:', token.lemma_)
                
        break
        

apple%20os%20x17.txt


In [57]:
for k,parents in frame_dict.items():
    print(k,':', parents)


Abounding_with : ['Locative_relation', 'Abundance', 'Distributed_position', 'Distributed_abundance']
Accoutrements : ['Artifact']
Attack : ['Intentionally_affect', 'Hostile_encounter']
Behind_the_scenes : ['Performing_arts']
Being_named : ['Name_conferral']
Biological_urge : ['Gradable_attributes']
Body_decoration : ['Artifact', 'Body_parts']
Body_description_holistic : ['Gradable_attributes']
Body_description_part : ['Gradable_attributes', 'Body_parts']
Buildings : ['Artifact', 'Locale_by_use']
Bungling : ['Intentionally_act', 'Intentionally_affect']
Change_accessibility : ['Intentionally_affect']
Chatting : ['Reciprocality', 'Statement', 'Discussion']
Clothing : ['Artifact', 'Closure']
Clothing_parts : ['Clothing']
Complaining : ['Statement']
Compliance : ['Satisfying', 'Social_behavior_evaluation', 'Obligation_scenario']
Contacting : ['Intentionally_act', 'Communication', 'Communication_means']
Containers : ['Bounded_entity', 'Containing']
Counterattack : ['Attack', 'Response']
Cour

In [58]:
from collections import Counter
frame_parent_counts=Counter(parent_list)

In [59]:
print(frame_parent_counts)

Counter({'Gradable_attributes': 57, 'Intentionally_act': 56, 'Intentionally_affect': 50, 'Transitive_action': 41, 'Communication': 38, 'Event': 32, 'Artifact': 23, 'Motion': 23, 'Eventive_affecting': 21, 'Statement': 20, 'Locative_relation': 16, 'Awareness': 16, 'State': 16, 'Social_behavior_evaluation': 15, 'Emotions': 14, 'Traversing': 14, 'Placing': 14, 'Cause_motion': 13, 'People': 12, 'Measurable_attributes': 12, 'Reciprocality': 11, 'Part_whole': 10, 'Process': 10, 'Make_noise': 10, 'Intentionally_create': 10, 'Activity': 10, 'Committing_crime': 9, 'Relation': 9, 'Capability': 9, 'Body_parts': 8, 'Commerce_scenario': 8, 'Desirability': 8, 'Locale': 8, 'Mental_activity': 8, 'Perception': 8, 'Self_motion': 8, 'Giving': 8, 'Experiencer_focus': 7, 'Social_interaction_evaluation': 7, 'Possession': 7, 'Activity_ongoing': 7, 'Removing': 7, 'Using': 7, 'Hostile_encounter': 6, 'Discussion': 6, 'Relation_between_individuals': 6, 'Measures': 6, 'Success_or_failure': 6, 'Biological_entity': 
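The bookkeeping in the loop above can be isolated from NLTK. The sketch below assumes the relations are already available as (frame, super frame) pairs; the pairs and the helper name `collect_parents` are illustrative and not taken from the real run.

```python
from collections import Counter, defaultdict

# Hypothetical (frame, super frame) pairs, as one might collect them from
# relation.superFrameName while looping over frame.frameRelations.
toy_pairs = [
    ("Filling", "Container_focused_placing"),
    ("Filling", "Placing"),
    ("Filling", "Filling"),        # self-reference, skipped below
    ("Attack", "Intentionally_affect"),
    ("Filling", "Placing"),        # duplicate, stored only once
    ("Clothing", "Artifact"),
    ("Accoutrements", "Artifact"),
]


def collect_parents(pairs):
    """Map each frame to the list of distinct super frames it is related to."""
    parents_by_frame = defaultdict(list)
    for frame_name, parent_name in pairs:
        if frame_name == parent_name:
            continue
        if parent_name not in parents_by_frame[frame_name]:
            parents_by_frame[frame_name].append(parent_name)
    return dict(parents_by_frame)


parents = collect_parents(toy_pairs)
# count how many different frames point to each super frame
parent_counts = Counter(p for ps in parents.values() for p in ps)
print(parents)
print(parent_counts)
```

As in the notebook cell, a super frame is counted once per frame that refers to it, so frequent parents are those shared by many different frames.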

## Some code to get you started

In the Python module **event_utils.py**, I've created a function called **load_lu_to_candidate_frames**, which allows you to retrieve the frames that a lexical unit can evoke, e.g.,

In [4]:
import event_utils
# pos_set indicates which parts of speech you want to use
# in this case we focus on nouns and verbs
lu_to_candidate_frames = event_utils.load_lu_to_candidate_frames(fn, 
                                                                 pos_set={'n', 'v'})
lu_to_candidate_frames['win.v']

{'Finish_competition', 'Finish_game', 'Getting', 'Win_prize'}

We observe that the verb **win** can evoke four frames according to FrameNet version 1.7
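The source of **event_utils.py** is not shown here, so the following is only a sketch of how such a mapping could be built, not the actual implementation. It assumes the input is a sequence of (lexical unit name, frame name) pairs; the function name `build_lu_to_candidate_frames` and the toy pairs are made up for this illustration.

```python
from collections import defaultdict


def build_lu_to_candidate_frames(lu_frame_pairs, pos_set):
    """Map lexical unit names such as 'win.v' to the set of frames they evoke."""
    mapping = defaultdict(set)
    for lu_name, frame_name in lu_frame_pairs:
        # the part of speech is the suffix after the last dot
        pos = lu_name.rsplit(".", 1)[-1]
        if pos in pos_set:
            mapping[lu_name].add(frame_name)
    return mapping


# Toy pairs for illustration. With NLTK, an expression along the lines of
# ((lu.name, lu.frame.name) for lu in fn.lus()) could supply the real pairs
# (an untested assumption here).
toy_pairs = [
    ("win.v", "Finish_competition"),
    ("win.v", "Win_prize"),
    ("win.n", "Finish_competition"),
    ("quickly.adv", "Speed_description"),
]

toy_mapping = build_lu_to_candidate_frames(toy_pairs, pos_set={"v"})
win_frames = sorted(toy_mapping["win.v"])
print(win_frames)
print(toy_mapping["relax.v"])
```

Using a `defaultdict(set)` means that looking up an unknown lexical unit returns an empty set instead of raising a `KeyError`.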

We now move to the actual event detection and frame disambiguation. We focus on two steps:
* **target identification**: is a token referring to an event? For example, the token **the** will probably not refer to an event whereas **inject** would probably refer to an event.
* **frame disambiguation**: once we've determined that a token refers to an event, we want to indicate which frame it evokes, e.g., **Ingest_substance**

We first focus on **target_identification**:

In [5]:
def target_identification(lu_to_candidate_frames, lu):
    """
    determine whether an lu is referring to an event by
    checking whether the lu has candidate frames in FrameNet
    
    :param dict lu_to_candidate_frames: 
    mapping of lu, e.g., 'win.v', to a set of the frames it can evoke
    :param str lu: the lu for which you want to perform the 
    target identification, e.g., 'win.v'
    
    :rtype: bool
    :return: True if the lu has at least one candidate frame
    """
    target = False 
    
    # if the lu has candidate frames in FrameNet, we state it is a target
    if lu in lu_to_candidate_frames:
        candidate_frames = lu_to_candidate_frames[lu]
        if candidate_frames:
            target = True 
                
    return target
    
print(target_identification(lu_to_candidate_frames, 'win.v'))
print(target_identification(lu_to_candidate_frames, 'relax.v'))
print(target_identification(lu_to_candidate_frames, 'brew.v'))
print(target_identification(lu_to_candidate_frames, 'the.d'))

True
False
False
False


In [10]:
dataset = 'The doctor vaccinated the person by injecting a substance. He had to inject twice.'
doc = nlp(dataset)

# spaCy uses different PoS tags from FrameNet
spacy_pos_to_fn_pos = {
    'VERB' : 'v',
    'NOUN' : 'n'
}

frame_to_freq = dict() 


for token in doc:
    is_target = False
    lu = None
    candidate_frames = set()
    
    lemma = token.lemma_
    spacy_pos = token.pos_
    if spacy_pos in spacy_pos_to_fn_pos:
        fn_pos = spacy_pos_to_fn_pos[spacy_pos]
        lu = f'{lemma}.{fn_pos}'
        
        is_target = target_identification(lu_to_candidate_frames, lu)
        
        # .get avoids a KeyError when the lu is not in the mapping
        candidate_frames = lu_to_candidate_frames.get(lu, set())
        
        for candidate_frame in candidate_frames:
            if candidate_frame not in frame_to_freq:
                frame_to_freq[candidate_frame] = 0
            frame_to_freq[candidate_frame] += 1
        
    print(token.text, lu, is_target, candidate_frames)

The None False set()
doctor doctor.n True {'Appellations', 'Medical_interaction_scenario', 'Medical_professionals'}
vaccinated vaccinate.v False set()
the None False set()
person person.n True {'People'}
by None False set()
injecting inject.v True {'Placing', 'Ingest_substance', 'Filling'}
a None False set()
substance substance.n False set()
. None False set()
He None False set()
had have.v True {'Giving_birth', 'Possession', 'Sex', 'Have_associated', 'Ingestion', 'Ingest_substance', 'Inclusion'}
to None False set()
inject inject.v True {'Placing', 'Ingest_substance', 'Filling'}
twice None False set()
. None False set()


In [13]:
print()
print('frame frequency in entire text')
print(frame_to_freq)


frame frequency in entire text
{'Appellations': 1, 'Medical_interaction_scenario': 1, 'Medical_professionals': 1, 'People': 1, 'Placing': 2, 'Ingest_substance': 3, 'Filling': 2, 'Giving_birth': 1, 'Possession': 1, 'Sex': 1, 'Have_associated': 1, 'Ingestion': 1, 'Inclusion': 1}
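The same frequencies can be obtained more compactly with `Counter`. The sketch below is an alternative to the explicit dictionary updates, using the candidate frame sets copied from the output above.

```python
from collections import Counter

# Candidate frames per target token, copied from the output above.
candidates_per_target = [
    {"Appellations", "Medical_interaction_scenario", "Medical_professionals"},  # doctor.n
    {"People"},                                                                 # person.n
    {"Placing", "Ingest_substance", "Filling"},                                 # inject.v
    {"Giving_birth", "Possession", "Sex", "Have_associated",
     "Ingestion", "Ingest_substance", "Inclusion"},                             # have.v
    {"Placing", "Ingest_substance", "Filling"},                                 # inject.v
]

frame_counter = Counter()
for candidate_frames in candidates_per_target:
    # update() adds 1 for every frame in the set
    frame_counter.update(candidate_frames)

print(frame_counter["Ingest_substance"],
      frame_counter["Placing"],
      frame_counter["Filling"])
```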


The next step is **frame_disambiguation**, for which we make use of the dictionary **frame_to_freq**.
We define the dominant frame of a lexical unit as the candidate frame that has the highest frequency in the dataset, e.g.,

In [14]:
lu = 'inject.v'
candidate_frames = lu_to_candidate_frames[lu]
print(candidate_frames)

{'Placing', 'Ingest_substance', 'Filling'}


Based on the frequencies in the dictionary **frame_to_freq**, we observe that:
* *Ingest_substance* has a frequency of 3
* *Placing* has a frequency of 2
* *Filling* has a frequency of 2

Since *Ingest_substance* has a frequency of 3, i.e., the highest frequency, we determine that **Ingest_substance** is the dominant frame for the lexical unit **inject.v**
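One possible way to express this selection in code is sketched below. The helper name `pick_most_frequent` and the alphabetical tie-breaking are assumptions made for this illustration; the text above does not say what to do when two candidates are equally frequent.

```python
def pick_most_frequent(frame_to_freq, candidate_frames):
    """Return the most frequent candidate frame, or None without candidates."""
    if not candidate_frames:
        return None
    # sorting first makes ties resolve alphabetically (an assumption),
    # because max() returns the first maximal element it encounters
    return max(sorted(candidate_frames),
               key=lambda frame: frame_to_freq.get(frame, 0))


toy_freq = {"Placing": 2, "Ingest_substance": 3, "Filling": 2}

print(pick_most_frequent(toy_freq, {"Placing", "Ingest_substance", "Filling"}))
print(pick_most_frequent(toy_freq, {"Placing", "Filling"}))
print(pick_most_frequent(toy_freq, set()))
```

Treat this as a starting point only: other tie-breaking strategies, or returning all tied frames, are equally defensible.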

In [40]:
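def determine_dominant_frame(frame_to_freq, candidate_frames):
    """
    Determine the dominant frame of a lexical unit.
    
    :param dict frame_to_freq: mapping of frame name to its frequency in the dataset
    :param set candidate_frames: the frames that the lexical unit can evoke
    
    :rtype: str
    :return: the candidate frame with the highest frequency
    """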
    # Please write a function to determine the dominant frame of a lexical unit

## End of this Notebook