<a href="https://colab.research.google.com/github/dbamman/anlp21/blob/main/16.ie/DependencyPatterns_TODO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explores relation extraction by measuring common dependency paths between two entities that hold a given relation to each other -- here, the relation "born_in" between a PER entity and an GPE entity, using data from Wikipedia biographies.  Feel free to run this notebook either on your on computer or on Google Colab. If using Google Colab, install these libraries first:

In [21]:
#!pip install spacy==2.1.0

In [22]:
#!python -m spacy download en_core_web_sm

In [23]:
#!pip install neuralcoref --no-binary neuralcoref

Otherwise, use your local `anlp21-spacy2` environment we set up in class on Tuesday.

In [4]:
import re
import spacy
import neuralcoref
from collections import Counter

In [5]:
nlp = spacy.load('en_core_web_sm')
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [6]:
def get_path(one, two):
    
    """ Get dependency path between two tokens in a sentence; return None if not reachable """
    
    one_heads=[x for x in one.ancestors]
    two_heads=[x for x in two.ancestors]
    
    up_path=[]
    down_path=[]
    up_path.append(one)
    down_path.append(two)
    
    lca=None
    for head in one_heads:
        if head in two_heads:
            lca=head
            break
            
        up_path.append(head)

    for head in two_heads:
        if head == lca:
            break
    
        down_path.append(head)
   
    if lca is None:
        return None
    
    path="%s->%s<-%s" % ('->'.join(["%s" % x.dep_ for x in up_path]), lca.text, '<-'.join(["%s" % x.dep_ for x in reversed(down_path)]))
    return path

In [7]:
def get_closest_coref(entity1, clusters, target_entity):
    
    """ Given entities e1 and mention m2 of another entity, returns the mention for e1 closest to m2 """
    
    targetCluster=None
    for chain in clusters:
        for mention in chain.mentions:
            if mention.start <= entity1.start and mention.end >= entity1.end:
                targetCluster=chain
                break

    if targetCluster is None:
        return None

    closestMention=None
    dist=100
    for mention in targetCluster:
            sentDist=abs(target_entity.sent.start-mention.sent.start)
            if sentDist < dist:
                dist=sentDist
                closestMention=mention
            if sentDist == dist and closestMention is not None:
                if abs(target_entity.start-mention.start) < abs(target_entity.start-closestMention.start):
                    closetMention=mention
    return closestMention
                

Q0. In class activity: here's [a Google spreadsheet](https://docs.google.com/spreadsheets/d/1PNDInP5JIqad9mOXwRUxGDZntvoUerX22QQcgFCJDxY/edit?usp=sharing) with the first 5 sentences from ~500 Wikipedia biographies.  If you're present in class, pick 10 rows of this spreadsheet that other students haven't already claimed and put your student ID in the "Student ID" column; then go through those 10 rows and read the document. If you can infer that a person from the "Candidate people" column was born in a place in the "Candidate places" column, list that person in the "PER BORN" column and the place in the "PLACE BORN" column.

In [8]:
def read_training(filename):
    
    """ Read in training data for <person, place> tuples that express the "born_in" relation.
    
    -- Use coreference resolution to identity the person mention closest to the place mention.
    -- Use dependency parsing to extract the syntactic path from that person mention to the place.
    
    """
    
    data=[]
    with open(filename) as file:
        for line in file:
            cols=line.split("\t")
            idd=cols[0]
            doc=cols[1]
            pers=cols[4]
            place=cols[5].rstrip()
            
            if pers != "" and place != "":
                doc=nlp(doc)

                target_person=None
                target_place=None
                
                # Annotations are at the type level, so let's anchor them to specific mentions
                for entity in doc.ents:
                    if entity.text == pers:
                        target_person=entity
                    elif entity.text == place:
                        target_place=entity
                
                if target_person is not None and target_place is not None:
                    
                    # Use coreference to get person mention that's closest to the place (ideally in the same sentence).
                    closest_person_mention=get_closest_coref(target_person, doc._.coref_clusters, target_place)
                    if closest_person_mention is None:
                        closest_person_mention=target_person
                    
                    path=get_path(closest_person_mention.root, target_place.root)
                    
                    # if a path can be found between the two
                    if path is not None:
                        data.append((pers, place, path, target_place.sent ))
    return data     

At the end of class, save this Google sheet as a tsv in `born.tsv` and execute the `read_training` function on it to read in the <person, place> tuples.  If you're executing this on Google Colab, download the born.tsv file here (after class).  In both cases, adjust the path to `born.tsv` for your environment.

In [9]:
#!brew install wget
!wget https://raw.githubusercontent.com/dbamman/anlp21/master/data/born.tsv

--2021-11-21 14:37:33--  https://raw.githubusercontent.com/dbamman/anlp21/master/data/born.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 443267 (433K) [text/plain]
Saving to: ‘born.tsv.2’


2021-11-21 14:37:33 (11.0 MB/s) - ‘born.tsv.2’ saved [443267/443267]



In [10]:
trainingData=read_training("born.tsv")
for data in trainingData:
    print ('\t'.join([str(x) for x in data]))

Hutchinson	Alford	nsubjpass->born<-prep<-pobj	Hutchinson was born in Alford , Lincolnshire , England , the daughter of Francis Marbury , an Anglican cleric and school teacher who gave her a far better education than most other girls received .
William Henry Seward	Florida	nsubjpass->born<-prep<-pobj	Seward was born in Florida , Orange County , New York , where his father was a farmer and owned slaves .
Konrad Henlein	Maffersdorf	nsubjpass->born<-prep<-pobj	Konrad Henlein was born in Maffersdorf ( present-day Vratislavice nad Nisou ) near Reichenberg ( Liberec ) , in what was then the Bohemian crown land of Austria-Hungary .
Robert Sylvester Kelly	Chicago	nsubj->began<-advcl<-prep<-pobj	A native of Chicago , Kelly began performing during the late 1980s and debuted in 1992 with the group Public Announcement .
Alastair Nathan Cook	England	nsubjpass->regarded<-prep<-pobj<-prep<-pobj<-relcl<-prep<-pobj	He is regarded as one of the greatest batsmen ever to play for England , and is one of th

Q1: Count the syntactic paths identified in the training data.  What are the two that are most frequently attested?

In [11]:
trainingData=read_training("born.tsv")
synt_dict = {}
for data in trainingData:
    synt = data[2]
    try:
        synt_dict[synt] += 1
    except KeyError:
        synt_dict[synt] = 1
        

In [12]:
sort_orders = sorted(synt_dict.items(), key=lambda x: x[1], reverse=True)

In [13]:
sort_orders[0:2]
# most frequent: nsubjpass->born<-prep<-pobj
# second most frequent: nsubjpass->born<-prep<-pobj<-prep<-pobj

[('nsubjpass->born<-prep<-pobj', 29),
 ('nsubjpass->born<-prep<-pobj<-prep<-pobj', 5)]

Q2: Write a function to read in a target file (containing one document per line) and a syntactic path and identify all people/places that are joined by that path. Hint: you can use the get_path functon defined above to retrieve the syntactic path between two tokens.

In [15]:
 #helper fcn
 def has_numbers(inputString):
     return any(char.isdigit() for char in inputString) 

In [16]:
def extract_relations(filename, target_path):
    
    """ Extract new relations from a file.
    Input: 
        - filename containing one document per line
        - target_path: the syntactic dependency path connecting the person entity to the place entity
    Output:
        - a list of (person, place, path, sentence) tuples in the same format returned from `read_training`.
    
    """
    data=[]
    
    with open(filename) as file:
        for line in file:

            doc=line
            #print(doc)

            doc=nlp(doc)
                
            # Annotations are at the type level, so let's anchor them to specific mentions
            for entity in doc.ents:
                for entity_2 in doc.ents:
                    if (entity != entity_2) and (not has_numbers(str(entity_2))):

                        if entity is not None and entity_2 is not None:
                    
                            closest_person_mention=get_closest_coref(entity, doc._.coref_clusters, entity_2)
                            if closest_person_mention is not None:
                    
                                path=get_path(closest_person_mention.root, entity_2.root)
                    
                                if path == target_path:
                                    data.append((closest_person_mention, entity_2, path, entity_2.sent ))   
    
    return list(set(data))

If you're executing this on Google Colab, download the wiki.bio.born.test.txt file here.

In [17]:
!wget https://raw.githubusercontent.com/dbamman/anlp21/master/data/wiki.bio.born.test.txt

--2021-11-21 14:38:21--  https://raw.githubusercontent.com/dbamman/anlp21/master/data/wiki.bio.born.test.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 144574 (141K) [text/plain]
Saving to: ‘wiki.bio.born.test.txt.8’


2021-11-21 14:38:22 (8.55 MB/s) - ‘wiki.bio.born.test.txt.8’ saved [144574/144574]



In [18]:
#new_examples=extract_relations("wiki.bio.born.test.txt", "nsubjpass->born<-prep<-pobj")
#for data in new_examples:
    #print ('\t'.join([str(x) for x in data]))

Q3: Execute `extract_relations` on `wiki.bio.born.test.txt` and the two most frequent paths identified in the training data above.

In [19]:
# most frequent path

new_examples=extract_relations("wiki.bio.born.test.txt", "nsubjpass->born<-prep<-pobj")
for data in new_examples:
    print ('\t'.join([str(x) for x in data]))

Gladstone	Liverpool	nsubjpass->born<-prep<-pobj	Gladstone was born in Liverpool to Scottish parents .

Kerry	Aurora	nsubjpass->born<-prep<-pobj	Kerry was born in Aurora , Colorado and attended boarding school in Massachusetts and New Hampshire .

He	Uetersen , Holstein	nsubjpass->born<-prep<-pobj	He was born in Uetersen , Holstein , in present-day Germany .
He	Busseto	nsubjpass->born<-prep<-pobj	He was born near Busseto to a provincial family of moderate means , and developed a musical education with the help of a local patron .
Joel	Bronx	nsubjpass->born<-prep<-pobj	Joel was born in 1949 in The Bronx , New York , and grew up on Long Island , New York , both places that influenced his music .
Bonnet	Bassillac	nsubjpass->born<-prep<-pobj	Bonnet was born in Bassillac , Dordogne , the son of a lawyer .
Stengel	Kansas City	nsubjpass->born<-prep<-pobj	Stengel was born in Kansas City , Missouri .
Julia Jackson	Calcutta	nsubjpass->born<-prep<-pobj	Julia Jackson was born in Calcutta to an Angl

In [20]:
# second most frequent path

new_examples=extract_relations("wiki.bio.born.test.txt", "nsubjpass->born<-prep<-pobj<-prep<-pobj")
for data in new_examples:
    print ('\t'.join([str(x) for x in data]))

Chopin	Warsaw	nsubjpass->born<-prep<-pobj<-prep<-pobj	Chopin was born Fryderyk Franciszek Chopin in the Duchy of Warsaw and grew up in Warsaw , which in 1815 became part of Congress Poland .
He	Bohemia	nsubjpass->born<-prep<-pobj<-prep<-pobj	He was born in Louisville , Kentucky , to Jewish immigrant parents from Bohemia ( now in the Czech Republic ) , who raised him in a secular home .
Joel Stephen Kovel	Brooklyn	nsubjpass->born<-prep<-pobj<-prep<-pobj	Joel Stephen Kovel was born on August 27 , 1936 , in Brooklyn , New York .
Mundy	Hammersmith	nsubjpass->born<-prep<-pobj<-prep<-pobj	Mundy was born to a conservative middle-class family in Hammersmith , London .

