# Lab5.2 Extracting attribution relations

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

Attribution relations consists of three components:

<ol>
    <li>source: an explicit source mentioned in the text to which the content is attributed
    <li>cue: a phrase that indicates an attribution relation between a source and some content
    <li>content: a phrase or phrases that represent the content attributed to the source
</ol>

The source has to be a person or organisation, whereas the cue is often a so-called speech act or cognitive verb, e.g. ''say'', ''claim'', 'state'', ''announce'', ''think'', ''believe''. A good strategy to detect attribution relations in sentences is therefore first finding the such predicates and next to get the subjects as human sources: people or organisations.

In this notebook, we are going to demonstrate a simple approach to extract attribution relations from texts in three steps:

<ol>
    <li>Get the predicate, subject and complement from a sentence
    <li>Select source introducing predicates only
    <li>Select sources that refer to humans aor organizations
</ol>


## 1. Get events, their subjects and complements from spaCY

In order to design a strategy to get events, subjects and complements from syntax, we are first going to analyse two sentences that reflect attribution relation and which are processed by spaCy

In [192]:
import spacy
from spacy import displacy
# depending on how you installed spaCy, the name of the model might be different
nlp = spacy.load(name='en_core_web_sm') 

We first look at two example sentences that contain an attribution relation to see what dependency structure spaCy yields.

In [193]:
example="Google's self-driving car division announces that they will meet their target. Investors did not believe it is true."
doc = nlp(example)
displacy.render(doc, jupyter=True, style='dep')

Here the verbs *announe* and *believe* represent the cues for an attribution relation. The subjects of these verbs are the sources and their complements are the content that is attributed to the source. If we want to obtain these three elements, we need to define the steps to extract these:

<ol>
    <li>Obtain the main predicates from a sentence
    <li>For each predicate get the subject as the source and get the remainder of the sentence that has a dependency to the predicate (the complement) as the content
</ol>

If you design a plan like this, think about the factors that may complicate this, e.g. more than one clause and predicate in the same sentence, ellipsis, conjunctions, complex subject phrases, etc.

We will use a similar function as we used before to get tuples with the predicate, the subject and now also the full complement of the predicate. Another difference with the function that we used for extracting events, is that we now want to operate this function on each sentence and not all tokens in a document.

A complicated factor is that we do not just want the dependency relations between the heads of the constituents but that we also want to have all the other tokens of the constituents that depend on the verb. For example for the above sentence, the source is "Google's self-driving car division", while the dependency is between the verb *announce* and the noun *division*. Similarly, we want to have the full that-clause as a the complement of *announce* and not just the relation between *announce* and *meet*.

What we want is the following tuple:

```
[announce, Google's self-driving car division, that they will meet their target]
```

We therefore need a function, which we call *get_dependent_tokens*, that collects all the tokens governed by a head. So for the main verb we want to be able to collect all the tokens with a dependency but also the tokens that depend on this dependency. We thus need a recursive function that collects all tokens dominated by the main verb at any depth. Below, we give such a recursive function that given a sentence and the token id of a head, 1) collects all tokens with a dependency relation but also calls the same function again to proceed to the tokens that depend on these dependent tokens. The function continues recursively to collect tokens with further dependencies until it does not find any tokens any more.

There is one special for getting the attribution. We are extracting the subject constituent separately from the complement tokens of the main predicate. We therefore add an extra parameter to provide a dependent token that should be excluded. Below you find our function that takes the identifier of the head (head_id), the excluded identifier (for the subject in our case) and the sentence object from spaCy.

In [194]:
# Recursive function that takes the token id of the head to get all tokens that directly or indirectly depend on it given a spaCy sentence object
# There is an additional parameter exclude_id to indicate which dependent constituent should be excluded.
# The result is a list of tokens

def get_dependent_tokens(head_id, exclude_id, sent):
    tokens=[]
    for token in sent:
        ### check if this token is not the same as the head token itself nor the excluded token
        if token.i!=head_id and token.i!=exclude_id:
            head = token.head
            ### check of this token is indeed dependent on the head token
            if (head_id==head.i):
                ### we want this token and put it in the result list
                tokens.append(token.i)
                ### we recursively call the function again with our new token to see if there are other tokens that depend on the new token
                nested_tokens=get_dependent_tokens(token.i, exclude_id, sent)
                ### if we have a result, we extend the result list with the deeper tokens
                if nested_tokens:
                    tokens.extend(nested_tokens)
    return tokens

*WARNING!!!*
Recursive functions are very elegant and powerful, but they are also very dangerous. They could go on for ever for certain input. For example, if the condition create a circular relation in which e.g. condition *a* yields condition *b* and *b* yields *a*. So be careful creating and applying resursive functions. If you end up in an infinite-loop, either kill the function when running, or wait until you get an out-of-memory-error. Don't worry, nothing breaks but you may want to relaunch the notebook or the environment.

We are going to use the above recursive function in a function *get_predicate_subject_complement_phrases* to get the predicates, the subject phrase and the complement phrase from a spaCy sentence.

In [156]:
# Input prarameter is a spaCy sentence
def get_predicate_subject_complement_phrases(doc, sent):
    """
    extract predicates with:
    -subject phrase
    -complement phrase
    
    :param spacy.tokens.Sent sent: spaCy object after processing text
    
    :rtype: list 
    :return: list of tuples (predicate, subject, complement)
    """
    
    ### result list that is returned
    output = []
    
    ### we use a dictionary to collect all predicates and their corresponding subjects if any
    predicates = {}
    
    ### We first get the token that has a nsubj dependency with the main verb
    for token in sent:
        if (token.dep_=='nsubj'):
            predicates[token.head.i] = token.i
    
    ### Note that the next loop is not executed if there are no predicates with such a subject.
    for pred_token, pred_info in predicates.items():
        ## We get the subject identifier for this predicate
        subject_id = pred_info
        ### We get all the tokens that make up the subject phrase
        subject_tokens = get_dependent_tokens(subject_id, pred_token, sent)
        subject_tokens.extend([subject_id])
        ### We sort the tokens to get them in the right order
        subject_tokens.sort()
        ### We get the full phrase from the subject tokens
        subject_phrase = ""
        for token in subject_tokens:
            subject_phrase+=" "+doc[token].text
        
        ### We get all the tokens that make up the complement phrase, we exclude the subject
        complement_tokens=get_dependent_tokens(pred_token, subject_id, sent)
        ### We sort the phrase to get the tokens in the right order
        complement_tokens.sort()
        
        if complement_tokens:
            complement_phrase = ""
            for token in complement_tokens:
                complement_phrase+=" "+doc[token].text
            one_row = (doc[pred_token].lemma_,
                       subject_phrase,
                       complement_phrase
                      )
            output.append(one_row)
    
    return output

We test our function on the example sentences to see if we get the right tuples

In [157]:
for sent in doc.sents:
    print(sent)
    events = get_predicate_subject_complement_phrases(doc, sent)
    if events:
        for event in events:
            print(event)

Google's self-driving car division announces that they will meet their target.
('announce', " Google 's self - driving car division", ' that they will meet their target .')
('meet', ' they', ' that will their target')
Investors did not believe it is true.
('believe', ' Investors', ' did not it is true .')
('be', ' it', ' true')


Now we can apply it to a complete document and aggregate the tuples.

In [214]:
#### Change the path to your own text file
path_to_file='../lab1-getting-text/techcrunch_search_results/apple%20os%20x17.txt'
events=[]
with open(path_to_file) as infile:
    text = infile.read()
    doc = nlp(text)
    for sent in doc.sents:
        events.extend(get_predicate_subject_object(sent))


When we print the result, we can see that we overgenerate tuples for attribution relations. In the next section, we will try to filter them

In [215]:
print(events)

[('work', 'Sarah', ' currently as a writer for TechCrunch , after having previously spent over three years at ReadWriteWeb .'), ('work', 'Sarah', ' Prior to her work as a reporter , in I.T. across a number of industries , including banking , retail and software .'), ('offer', 'Latest', ' now curbside alcohol pickup at 2,000 US stores Oct 30 , 2019'), ('continue', 'wars', ' .'), ('make', 'Amazon', ' this week just grocery delivery free , so Walmart is now touting how its grocery service offers the booze .'), ('tout', 'Walmart', ' is now how its grocery service offers the booze .'), ('offer', 'service', ' how the booze'), ('announce', 'retailer', ' today a new milestone in  '), ('wipe', 'bug', ' out over 20 M ratings Oct 30 , 2019'), ('remove', 'sweep', ' more than 20 million ratings from the most popular apps — including from well - known brands like Google , Microsoft , Starbucks , Hulu , Nike and others & Apple TV+ will be free with an Apple Music student subscription Oct 30 , 2019'),

## 2. Filtering tuples for cue predictates

In order to get these predicates, we are going to use a list of FrameNet frames that have been hand picked as so-called Source-Introducing-Frames. We can use the same function as before for detecting events to first get the predciates with frames and filter the onces that introduce a source.


### 2.1. Loading FrameNet in NLTK

In [216]:
import nltk

In [217]:
## probably already done
#nltk.download('framenet_v17')

In [218]:
from nltk.corpus import framenet as fn

### 2.2. Loading source introducing frames

We load the list of FrameNet frames in the file sip-frames.txt to filter events in texts.

In [219]:
path_to_sip_file='sip-frames.txt'
sip_frames=[]
with open(path_to_sip_file) as fp: sip_frames = fp.read().splitlines()


In [220]:
## We check the number of frames
len(sip_frames)

131

In [221]:
print(sip_frames)

['Achieving_first', 'Adding_up', 'Adducing', 'Agree_or_refuse_to_act', 'Appointing', 'Attempt_suasion', 'Bail_decision', 'Be_in_agreement_on_assessment', 'Be_translation_equivalent', 'Become_silent', 'Behind_the_scenes', 'Being_named', 'Body_movement', 'Bragging', 'Categorization', 'Chatting', 'Choosing', 'Claim_ownership', 'Coming_up_with', 'Commitment', 'Communicate_categorization', 'Communication', 'Communication_manner', 'Communication_means', 'Communication_noise', 'Communication_response', 'Compatibility', 'Complaining', 'Compliance', 'Confronting_problem', 'Contacting', 'Criminal_investigation', 'Deny_permission', 'Deserving', 'Discussion', 'Distinctiveness', 'Encoding', 'Eventive_cognizer_affecting', 'Evidence', 'Experiencer_obj', 'Expressing_publicly', 'Forgiveness', 'Gesture', 'Grant_permission', 'Have_as_translation_equivalent', 'Heralding', 'Imposing_obligation', 'Judgment', 'Judgment_communication', 'Judgment_direct_address', 'Justifying', 'Labeling', 'Linguistic_meaning',

We can now simply iterate over the tuples and check if the associated frames match any of the SiFs

In [222]:
filtered_event=[]
for event_tuple in events:
    frames =  fn.frames_by_lemma(event_tuple[0])
    sip_frame=""
    if frames:
        for frame in frames:
            if frame.name in sip_frames:
                sip_frame = frame.name
                break
    if sip_frame:
        filtered_event.append(event_tuple)


In [223]:
for event in filtered_event:
    print(event)

('work', 'Sarah', ' currently as a writer for TechCrunch , after having previously spent over three years at ReadWriteWeb .')
('work', 'Sarah', ' Prior to her work as a reporter , in I.T. across a number of industries , including banking , retail and software .')
('make', 'Amazon', ' this week just grocery delivery free , so Walmart is now touting how its grocery service offers the booze .')
('tout', 'Walmart', ' is now how its grocery service offers the booze .')
('announce', 'retailer', ' today a new milestone in  ')
('announce', 'company', ' Sarah Perez Ahead of Friday ’s launch of Apple ’s new streaming service , Apple TV+ , an Apple Music / Apple TV+ bundle deal specifically aimed at making the service more affordable fo Spotify launches a dedicated Kids app for Premium Family')
('announce', 'Spotify', ' In a move to boost family subscriptions to its app , this morning the launch of a dedicated Kids application which allows children three and up to listen to their own music , both

If you are not happy with the result, you can adapt the file *sip-frames.txt* to make it more restrictive.

## 3. Filtering the sources as people and organisations

You can inspect the above list and check whether the subjects of these predicates are indeed people or organizations. In so far they are not, we can build in more filters on the subject. We can also sort the results per subject.

Let us first sort the tuple by the presumed source to get a better idea about the source candidates. We define a little function *getKey* that selects the second element from the tuple.
We use the *sorted* function to sort the tuple by that second element.


In [188]:
def getKey(item):
    return item[1]

sorted_by_source=sorted(filtered_event,key=getKey)


We can now print the sorted tuples in the order of the source

In [191]:
for tuple in sorted_by_source:
    source_tuple=(tuple[1], tuple[0], tuple[2])
    print(source_tuple)

('194', 'bil', ' GameClub offers mobile gaming ’s greatest hits for $ 5 per month Oct 24 , 2019')
('Amazon', 'make', ' this week just grocery delivery free , so Walmart is now touting how its grocery service offers the booze .')
('Carolina', 'rise', ' has been as an entrepreneurial hub .')
('Current', 'raise', ' $ 20 M Series B , tops half a million users')
('GameClub', 'take', ' , on Apple Arcade Oct 26 , 2019')
('Google', 'roll', ' Sarah Perez A year ago , out “ .new')
('Pass', 'follow', ' soon as a way to subscribe to a sizable collection of both apps and ga Spotify now lets artists buy a full - screen ‘ recommendation ’ promoting their new album Oct 24 , 2019')
('Sarah', 'work', ' currently as a writer for TechCrunch , after having previously spent over three years at ReadWriteWeb .')
('Sarah', 'work', ' Prior to her work as a reporter , in I.T. across a number of industries , including banking , retail and software .')
('Spotify', 'announce', ' In a move to boost family subscripti

What do you notice about the sources? How do you think you can filter tuples based on the source being human?

One way of solving this is by checking if the subject is a named entity and spaCy assigned the type PERSON or ORG. For this, we extend the tuple with an entity type if there is any. Luckily, spaCy allows us to iterate over the subject tokens and check if an entity label is assigned. Below is the adapted function with the extended tuples.

In [224]:
# Input prarameter is a spaCy sentence
def get_predicate_subject_type_complement_phrases(doc, sent):
    """
    extract predicates with:
    -subject phrase
    -complement phrase
    
    :param spacy.tokens.Sent sent: spaCy object after processing text
    
    :rtype: list 
    :return: list of tuples (predicate, subject, complement)
    """
    
    ### result list that is returned
    output = []
    
    ### we use a dictionary to collect all predicates and their corresponding subjects if any
    predicates = {}
    
    ### We first get the token that has a nsubj dependency with the main verb
    for token in sent:
        if (token.dep_=='nsubj'):
            predicates[token.head.i] = token.i
    
    ### Note that the next loop is not executed if there are no predicates with such a subject.
    for pred_token, pred_info in predicates.items():
        ## We get the subject identifier for this predicate
        subject_id = pred_info
        ### We get all the tokens that make up the subject phrase
        subject_tokens = get_dependent_tokens(subject_id, pred_token, sent)
        subject_tokens.extend([subject_id])
        ### We sort the tokens to get them in the right order
        subject_tokens.sort()
        ### We get the full phrase from the subject tokens
        subject_phrase = ""
        for token in subject_tokens:
            subject_phrase+=" "+doc[token].text
        
        ### We define a variable to store the entity label for the subject tokens if any
        ent_label = ""
            
        for token in subject_tokens:
            ent_label =doc[token].ent_type_
            ### if we have a label, we can break
            if ent_label:
                break
        
        
        ### We get all the tokens that make up the complement phrase, we exclude the subject
        complement_tokens=get_dependent_tokens(pred_token, subject_id, sent)
        ### We sort the phrase to get the tokens in the right order
        complement_tokens.sort()
        
        if complement_tokens:
            complement_phrase = ""
            for token in complement_tokens:
                complement_phrase+=" "+doc[token].text
            one_row = (doc[pred_token].lemma_,
                       subject_phrase,
                       ent_label,
                       complement_phrase
                      )
            output.append(one_row)
    
    return output

We now apply our new function to get the quadruples from the document.

In [242]:
#### Change the path to your own text file
path_to_file='../lab1-getting-text/techcrunch_search_results/apple%20os%20x17.txt'
events=[]
with open(path_to_file) as infile:
    text = infile.read()
    doc = nlp(text)
    for sent in doc.sents:
        events.extend(get_predicate_subject_type_complement_phrases(doc, sent))



We can now combine the frame filter with another filter on the entity label. We also print the subject phrase in case there is no matching type.

In [243]:
filtered_event=[]
for event_tuple in events:
    frames =  fn.frames_by_lemma(event_tuple[0])
    sip_frame=""
    if frames:
        for frame in frames:
            if frame.name in sip_frames:
                sip_frame = frame.name
                break
    if sip_frame:
        if event_tuple[2]=='PERSON' or event_tuple[2]=='ORG':
            filtered_event.append(event_tuple)
        else:
            print(event_tuple[1]) 



 The retailer
 the company
 which
 Oct 29 , 2019 Sarah Perez Sony ’s live TV streaming service , PlayStation Vue ,
 The service
 the company
 The news
 that
 you
 Venmo
 which
 users
 GameClub
 The app industry in 2018
 194
 ga Spotify
 the service
 you
 the company
 it
 they
 the company
 The company
 the service


We can see that the missed subjects contain subjects we do not want but also good subjects such as *the company* and pronouns *you* and *they*.

Next, we consider the selected results.

In [244]:
sorted_by_source=sorted(filtered_event,key=getKey)

In [245]:
for tuple in sorted_by_source:
    source_tuple=(tuple[1], tuple[0], tuple[2], tuple[3])
    print(source_tuple)

(' Amazon', 'make', 'ORG', ' this week just grocery delivery free , so Walmart is now touting how its grocery service offers the booze .')
(' Earl Mobile banking app Current', 'raise', 'PERSON', ' $ 20 M Series B , tops half a million users')
(' Google', 'roll', 'ORG', ' Sarah Perez A year ago , out “ .new')
(' Google Play Pass', 'follow', 'ORG', ' soon as a way to subscribe to a sizable collection of both apps and ga Spotify now lets artists buy a full - screen ‘ recommendation ’ promoting their new album Oct 24 , 2019')
(' Sarah', 'work', 'PERSON', ' Prior to her work as a reporter , in I.T. across a number of industries , including banking , retail and software .')
(' Sarah Perez North Carolina', 'rise', 'PERSON', ' has been as an entrepreneurial hub .')
(' Sarah Perez Spotify', 'add', 'PERSON', ' recently a feature that will occasionally pop up a full - screen recommendation of a new album the service thinks you ’ll like , based on a combination of your listening taste and huma App

Think about how to further improve the results. For now, we save the current result to use it for the next notebook on sentiment analysis.

In [246]:
import pickle
with open('attribution-relations.pickle', 'wb') as outputfile:
    pickle.dump(sorted_by_source, outputfile)

## End of this notebook