# Lab4.2: Detecting predicates and participants

Copyright, Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

In this notebook, we look at the dependency relations generated by spaCy. You have already processed your documents using spaCy: each text is split into sentences, and for each sentence a syntactic parse tree is created.

From these syntactic structures, we extract the main predicate (the root of the tree) to represent an event, and the constituents that depend on it as possible participants and adjuncts.


## 1. Finding predicates in text

In [45]:
import spacy
from spacy import displacy
# depending on how you installed spaCy, the name of the model might be different
nlp = spacy.load(name='en_core_web_sm') 
text = "John makes the cake . He got sick . He went to bed ."
doc = nlp(text)

In [46]:
displacy.render(doc, jupyter=True, style='dep')

In [58]:
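def get_predicate_subject_object(doc, rels={'nsubj', 'dobj', 'prep'}):
    """
    Extract predicates together with their
    - subject (nsubj)
    - direct object (dobj)

    A token counts as a predicate if it is the head of at least one token
    whose dependency relation is in rels. Only the nsubj and dobj dependents
    are returned; a head with only a prep dependent yields (predicate, None, None).

    :param spacy.tokens.doc.Doc doc: spaCy object after processing text
    :param set rels: dependency relations whose heads are treated as predicates

    :rtype: list
    :return: list of tuples (predicate, subject, object)
    """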
    predicates = {}
    
    for token in doc:
        if token.dep_ in rels:
            
            head = token.head
            head_id = head.i
            
            if head_id not in predicates:
                predicates[head_id] = dict()
            
            predicates[head_id][token.dep_] = token.lemma_
    
    output = []
    for pred_token, pred_info in predicates.items():
        one_row = (doc[pred_token].lemma_, 
                   pred_info.get('nsubj', None),
                   pred_info.get('dobj', None)
                  )
        output.append(one_row)
    
    return output

In [59]:
get_predicate_subject_object(doc)

[('make', 'John', None), ('get', '-PRON-', None), ('go', '-PRON-', None)]
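
The grouping step in the function above does not depend on spaCy itself: it only needs, for each token, its lemma, its dependency label and the index of its head. The following is a minimal sketch with hand-written triples. The parse is assumed for illustration and is not actual spaCy output, and `group_by_head` is a hypothetical helper, not part of the lab code.

```python
# Hand-written parse of "John makes the cake": (lemma, dependency label, head index).
# These values are assumed for illustration; a real parser may label them differently.
tokens = [
    ("John", "nsubj", 1),
    ("make", "ROOT", 1),
    ("the", "det", 3),
    ("cake", "dobj", 1),
]


def group_by_head(tokens, rels=("nsubj", "dobj")):
    """Collect the dependents with a relation in `rels` under the index of their head."""
    predicates = {}
    for lemma, dep, head in tokens:
        if dep in rels:
            # Create the entry for this head on first use, then store the dependent
            predicates.setdefault(head, {})[dep] = lemma
    return [
        (tokens[head][0], info.get("nsubj"), info.get("dobj"))
        for head, info in predicates.items()
    ]


rows = group_by_head(tokens)
print(rows)  # [('make', 'John', 'cake')]
```

Because each head keeps one value per relation, a second `nsubj` under the same head would overwrite the first. The same holds for the function above.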

## 2. Aggregating events across text sources

Using the function above, we can now process text documents and obtain all the events, with their subjects and objects as properties. In the next cell, we load one of our text files, process it with spaCy and get the event tuples.

In [62]:
#### Change the path to your own text file
path_to_file='../lab1-getting-text/techcrunch_search_results/apple%20os%20x17.txt'
events = []
with open(path_to_file) as infile:
    text = infile.read()
    doc = nlp(text)
    events = get_predicate_subject_object(doc)
    print(events)

[('work', 'Sarah', None), ('writer', None, None), ('spend', None, 'year'), ('prior', None, None), ('work', None, None), ('work', 'Sarah', None), ('number', None, None), ('industry', None, None), ('offer', 'late', 'pickup'), ('late', None, None), ('continue', 'war', None), ('make', 'Amazon', None), ('free', 'delivery', None), ('tout', 'Walmart', None), ('offer', 'service', 'booze'), ('announce', 'retailer', 'milestone'), ('wipe', 'bug', None), ('remove', 'sweep', 'rating'), ('sweep', None, None), ('rating', None, None), ('include', None, None), ('brand', None, None), ('free', None, None), ('Ahead', None, None), ('launch', None, None), ('announce', 'company', 'deal'), ('aim', None, None), ('affordable', 'service', None), ('launch', 'Spotify', 'app'), ('app', None, None), ('subscriber', None, 'Oct'), ('announce', 'Spotify', 'launch'), ('boost', None, 'subscription'), ('launch', None, None), ('allow', 'which', None), ('three', 'child', None), ('listen', None, None), ('onli', 'both', None),

The same event words may occur more than once so let's aggregate them for this file.

In [79]:
### We define two dictionaries, one for the subjects and one for the objects
event_subjects={}
event_objects={}

### We iterate over the events for a document
for event in events:
    ### We check whether the event word (the first element in the tuple: event[0])
    ### is already in the subject dictionary. If so, and if the subject (event[1])
    ### is not None, we append the subject to the list for that event word
    if event[0] in event_subjects:
        if event[1]:
            event_subjects[event[0]].append(event[1])
    ### if the word is not present, we create a new list entry for the word with the value if not None
    elif event[1]:
        event_subjects[event[0]]=[event[1]]
    ### we repeat the same thing for the object (event[2]) in the object dictionary
    if event[0] in event_objects:
        if event[2]:
            event_objects[event[0]].append(event[2])
    ### if the word is not present, we create a new list entry for the word with the value if not None
    elif event[2]:
        event_objects[event[0]]=[event[2]]
            
print('Subjects:', event_subjects)
print('Objects:', event_objects)            

Subjects: {'work': ['Sarah', 'Sarah', 'that'], 'offer': ['late', 'service', 'GameClub'], 'continue': ['war', 'which'], 'make': ['Amazon'], 'free': ['delivery'], 'tout': ['Walmart'], 'announce': ['retailer', 'company', 'Spotify', 'Walmart', 'company', 'Venmo', 'company', 'company'], 'wipe': ['bug'], 'remove': ['sweep'], 'affordable': ['service'], 'launch': ['Spotify', 'Quibi', 'that'], 'allow': ['which', 'which'], 'three': ['child'], 'onli': ['both'], 'cost': ['Max', 'service'], 'establish': ['Walmart'], 'shut': ['service'], 'be': ['service'], 'come': ['news', 'app'], 'bring': ['Google'], 'roll': ['Google'], 'type': ['-PRON-'], 'earn': ['user'], 'take': ['GameClub'], 'recap': ['that'], 'support': ['-PRON-'], 'flow': ['that'], 'see': ['industry'], 'bil': ['194'], 'introduce': ['Arcade'], 'follow': ['Pass'], 'let': ['ga'], 'buy': ['artist', 'IBM'], 'add': ['Spotify'], 'pop': ['that'], 'think': ['service'], 'like': ['-PRON-'], 'begin': ['app', 'which'], 'raise': ['Current'], 'expand': ['Cu

In the next cell, we repeat the above for each file in our collection and aggregate the predicates with their subjects and objects in a single subject dictionary and a single object dictionary.

In [83]:
from pathlib import Path

# The path to the folder with the text files. 
# Here I use a relative path from where I run the notebook
# Adapt the path accordingly to where your data is and/or where you run your notebook
# You can also specify the absolute path

### We define two dictionaries, one for the subjects and one for the objects
event_subjects={}
event_objects={}

basepath = Path('../lab1-getting-text/techcrunch_search_results/')
files_in_basepath = basepath.iterdir()
for path_to_file in files_in_basepath:
    if path_to_file.is_file():  # check that the item is not a subdirectory
        print(path_to_file.name)
        with open(path_to_file) as infile:
            text = infile.read()
            doc = nlp(text)
            events = get_predicate_subject_object(doc)
            ### We iterate over the events for a document
            for event in events:
                ### We check whether the event word (the first element in the tuple: event[0])
                ### is already in the subject dictionary. If so, and if the subject (event[1])
                ### is not None, we append the subject to the list for that event word
                if event[0] in event_subjects:
                    if event[1]:
                        event_subjects[event[0]].append(event[1])
                ### if the word is not present, we create a new list entry for the word with the value if not None
                elif event[1]:
                    event_subjects[event[0]]=[event[1]]
                ### we repeat the same thing for the object (event[2]) in the object dictionary
                if event[0] in event_objects:
                    if event[2]:
                        event_objects[event[0]].append(event[2])
                ### if the word is not present, we create a new list entry for the word with the value if not None
                elif event[2]:
                    event_objects[event[0]]=[event[2]]

print('Subjects:', event_subjects)
print('Objects:', event_objects) 

apple%20os%20x17.txt
apple%20os%20x16.txt
apple%20os%20x14.txt
apple%20os%20x9.txt
apple%20os%20x28.txt
apple%20os%20x29.txt
apple%20os%20x8.txt
apple%20os%20x15.txt
apple%20os%20x39.txt
apple%20os%20x11.txt
apple%20os%20x10.txt
apple%20os%20x38.txt
apple%20os%20x12.txt
apple%20os%20x13.txt
apple%20os%20x60.txt
apple%20os%20x48.txt
apple%20os%20x49.txt
apple%20os%20x61.txt
apple%20os%20x59.txt
apple%20os%20x58.txt
apple%20os%20x55.txt
apple%20os%20x41.txt
apple%20os%20x40.txt
apple%20os%20x54.txt
apple%20os%20x42.txt
apple%20os%20x56.txt
apple%20os%20x57.txt
apple%20os%20x43.txt
apple%20os%20x47.txt
apple%20os%20x53.txt
apple%20os%20x52.txt
apple%20os%20x46.txt
apple%20os%20x50.txt
apple%20os%20x44.txt
apple%20os%20x45.txt
apple%20os%20x51.txt
apple%20os%20x36.txt
apple%20os%20x3.txt
apple%20os%20x22.txt
apple%20os%20x23.txt
apple%20os%20x2.txt
apple%20os%20x37.txt
apple%20os%20x21.txt
apple%20os%20x35.txt
apple%20os%20x1.txt
apple%20os%20x34.txt
apple%20os%20x20.txt
apple%20os%20x18.t
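
The aggregation loop used in the cells above can also be written more compactly with `dict.setdefault`. The following is a sketch of that alternative, not the lab's own code: `aggregate_events` and the `toy_` names are hypothetical, and the tuples are toy data in the format returned by `get_predicate_subject_object`.

```python
def aggregate_events(events, subjects=None, objects=None):
    """Add (predicate, subject, object) tuples to two predicate -> list dictionaries."""
    subjects = {} if subjects is None else subjects
    objects = {} if objects is None else objects
    for predicate, subj, obj in events:
        # Only create or extend an entry when there is an actual value (not None)
        if subj:
            subjects.setdefault(predicate, []).append(subj)
        if obj:
            objects.setdefault(predicate, []).append(obj)
    return subjects, objects


# Toy tuples standing in for the output of one document
toy_events = [
    ("work", "Sarah", None),
    ("offer", "service", "booze"),
    ("work", "Sarah", None),
]
toy_subjects, toy_objects = aggregate_events(toy_events)
print(toy_subjects)  # {'work': ['Sarah', 'Sarah'], 'offer': ['service']}
print(toy_objects)   # {'offer': ['booze']}
```

To aggregate over several documents, the same two dictionaries can be passed to every call so that the lists keep growing across files.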

## 3. Counting subjects and objects for events

We see there is a lot of repetition. We would like to keep the unique subjects and objects but also count how often each occurs. For this we use the *Counter* class from Python's *collections* module, which counts the elements in a list, and derive new dictionaries with the counts.
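
As a quick illustration of what `Counter` does, here is a small sketch on a toy list. The list and the `toy_` names are made up for this example and do not come from the data.

```python
from collections import Counter

# Toy list of subjects, shaped like one value in event_subjects
toy_subject_list = ["company", "Apple", "company", "Google", "company", "Apple"]
toy_counts = Counter(toy_subject_list)

print(toy_counts["company"])      # 3
print(toy_counts.most_common(2))  # [('company', 3), ('Apple', 2)]
```

A `Counter` behaves like a dictionary from element to count, and `most_common` returns the elements sorted by decreasing count.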

In [85]:
from collections import Counter 

event_counted_subjects={}
event_counted_objects={}

for key, subjects in event_subjects.items():
    event_counted_subjects[key]=Counter(subjects)

for key, objects in event_objects.items():
    event_counted_objects[key]=Counter(objects)

for key, subjects in event_counted_subjects.items():
    print(key, subjects)

for key, objects in event_counted_objects.items():
    print(key, objects)

   
#print('Counted subjects:', event_counted_subjects)
#print('Counted objects:', event_counted_objects)

work Counter({'-PRON-': 28, 'company': 6, 'that': 5, 'Sarah': 2, 'Pencil': 2, 'dongle': 2, 'Microsoft': 2, 'Apple': 2, 'x': 2, 'app': 2, 'Siri': 2, 'piece': 2})
offer Counter({'-PRON-': 6, 'procreate': 2, 'Pencil': 2, 'party': 2, 'Pro': 2, 'message': 2, 'late': 1, 'service': 1, 'GameClub': 1, 'which': 1, 'world': 1})
continue Counter({'trickle': 2, 'app': 2, 'war': 1, 'which': 1, 'crisis': 1, 'death': 1, 'robot': 1})
make Counter({'-PRON-': 22, 'that': 14, 'which': 6, 'app': 4, 'feature': 4, 'what': 3, 'Apple': 2, 'contact': 2, 'company': 2, 'patch': 2, 'move': 2, 'assistant': 2, 'feedback': 2, 'Capitan': 2, 'website': 2, 'Amazon': 1, 'Arcade': 1, 'people': 1, 'all': 1, 'sale': 1})
free Counter({'delivery': 1})
tout Counter({'Walmart': 1})
announce Counter({'company': 25, 'Apple': 18, '’s': 6, 'Google': 5, 'Federighi': 2, 'retailer': 1, 'Spotify': 1, 'Walmart': 1, 'Venmo': 1, 'date': 1, 'PayPal': 1, 'Department': 1, 'Mozilla': 1, 'McDermott': 1, 'Canva': 1, 'BoxGroup': 1, 'Dorsey': 1, 

What can you say about the distribution of the predicates and of their subjects and objects? Does this look familiar?
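
One way to inspect the distribution is to rank the predicates by the total number of subjects observed with them. The following is a minimal sketch on toy counts. `toy_counted` is a hypothetical stand-in for `event_counted_subjects`, and its numbers are for illustration only.

```python
from collections import Counter

# Toy stand-in for event_counted_subjects (illustrative counts only)
toy_counted = {
    "make": Counter({"-PRON-": 22, "that": 14, "app": 4}),
    "announce": Counter({"company": 25, "Apple": 18}),
    "tout": Counter({"Walmart": 1}),
    "free": Counter({"delivery": 1}),
}

# Total number of subject occurrences per predicate
totals = Counter({pred: sum(counts.values()) for pred, counts in toy_counted.items()})

# Print the predicates from most to least frequent, with their rank
for rank, (pred, total) in enumerate(totals.most_common(), start=1):
    print(rank, pred, total)

# Predicates that were seen with a subject only once
singletons = [pred for pred, total in totals.items() if total == 1]
print(singletons)  # ['tout', 'free']
```

Running the same loop on the real `event_counted_subjects` and `event_counted_objects` shows how quickly the totals drop as the rank increases.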

## End of this Notebook