This homework explores coreference resolution for the task of timeline generation: for a given biography on Wikipedia, can you extract all of the events associated with the people mentioned and create one timeline for each person?

In [1]:
import wikipedia
import spacy
import neuralcoref
import re
import operator

In [2]:
nlp = spacy.load('en_core_web_sm')
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [3]:
biography = wikipedia.page("Harrison Ford", auto_suggest=False)

In [4]:
doc=nlp(biography.content)

Q1. Create a function `extract_timeline` that constructs a timeline for the subject of the Wikipedia biography; the timeline should consist of events (i.e., verbs) in the biography that the subject participates in and that can be grounded to a year. Your criteria:

- Only include events in sentences that contain a year.
- Only include events that the target of the Wikipedia biography is involved in.
- Only include events where that target person is the subject or direct object of the verb.

Your function should return a dict mapping each sentence that satisfies those criteria to the year mentioned in that sentence. For example, assume the following to be a biography of the target *Bach*:

> Bach is a composer.  He was born in 1685.  He composed 6 suites for solo cello in 1717. Anna Magdalena Wilcke met him in 1721.  Handel was a contemporary and was also born in 1685. Bach's dog was born in 1723.  

Your function should return the following dict:

{"He was born in 1685": 1685, "He composed 6 suites for solo cello in 1717": 1717, "Anna Magdalena Wilcke met him in 1721": 1721}

We exclude the other sentences because:

- "Bach is a composer" does not mention a date.
- Bach is not mentioned in "Handel was a contemporary and was also born in 1685."
- Bach is not the subject or direct object of "Bach's dog was born in 1723."
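
The first criterion (the sentence must contain a year) can be checked with a regular expression alone. The snippet below is a minimal sketch of that single step, not a full solution: the names `YEAR` and `first_year` are hypothetical, and the definition of a year as a standalone four-digit number between 1000 and 2099 is an assumption made for illustration.

```python
import re

# Assumption: a "year" is a standalone four-digit number from 1000 to 2099.
# The word boundaries mean that decade expressions such as "1960s" do not count.
YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")

def first_year(sentence):
    """Return the first standalone year in the sentence as an int, or None."""
    match = YEAR.search(sentence)
    return int(match.group()) if match else None

sentences = [
    "Bach is a composer.",
    "He was born in 1685.",
    "Anna Magdalena Wilcke met him in 1721.",
]

# Keep only the sentences that can be grounded to a year.
dated = {s: first_year(s) for s in sentences if first_year(s) is not None}
print(dated)
```

Whether a decade such as "1960s" should count as a date is a design choice; this sketch excludes it, and the remaining two criteria still require coreference and dependency information.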

In [5]:
def extract_timeline(doc):
    
    events={}
    
    def extractDateEvents(doc, idx):
        """Return the first year-like token in the sentence containing doc[idx], or None."""
        for token in doc[idx].sent:
            # re.match anchors only at the start of the token, so "1960s" also matches
            if re.match(r"\d{4}", token.text) is not None:
                return token.text
        return None
        
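    # Here we assume that the largest coreference cluster is the biography subject
    maxLen=0
    maxId=None
    for cluster in doc._.coref_clusters:
        if len(cluster.mentions) > maxLen:
            maxLen=len(cluster.mentions)
            maxId=cluster

    # No coreference clusters were found, so there is nothing to extract
    if maxId is None:
        return events

    for mention in maxId.mentions:
        # keep mentions whose root is the subject or direct object of its head
        if mention.root.dep_ == "nsubj" or mention.root.dep_== "dobj":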
            
            # get the parent verb
            head=mention.root.head

            year=extractDateEvents(doc, head.i)

            if year is not None:
                events[head.sent.text.strip()]=year

    return events
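
The largest-cluster heuristic used in `extract_timeline` can be illustrated without running the coreference model. The following is a minimal sketch under stated assumptions: `Cluster` is a hypothetical stand-in for a coreference cluster (only a `main` label and a `mentions` list are modeled), and `largest_cluster` is an illustrative helper, not part of neuralcoref.

```python
from collections import namedtuple

# Hypothetical stand-in for a coreference cluster: a main label plus its mentions.
Cluster = namedtuple("Cluster", ["main", "mentions"])

def largest_cluster(clusters):
    """Return the cluster with the most mentions, or None if there are no clusters."""
    return max(clusters, key=lambda c: len(c.mentions), default=None)

# Toy clusters for the Bach example biography.
clusters = [
    Cluster("Bach", ["Bach", "He", "He", "him"]),
    Cluster("Handel", ["Handel"]),
    Cluster("Bach's dog", ["Bach's dog"]),
]

subject = largest_cluster(clusters)
print(subject.main)
```

The heuristic rests on the assumption that a biography mentions its subject more often than anyone else; it may pick the wrong cluster when the coreference model splits the subject's mentions across several clusters.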

Let's print the extracted timeline, ordered from the oldest event to the newest.

In [6]:
events=extract_timeline(doc)
for k,v in sorted(events.items(), key=operator.itemgetter(1)):
    print("%s\t%s" % (v, k))

1960	Ford graduated in 1960 from Maine East High School in Park Ridge, Illinois.
1960s	Ford soon dropped the "J" and worked for Universal Studios, playing minor roles in many television series throughout the late 1960s and early 1970s, including Gunsmoke, Ironside, The Virginian, The F.B.I., Love, American Style and Kung Fu.
1960s	Ford began flight training in the 1960s at Wild Rose Idlewild Airport in Wild Rose, Wisconsin, flying in a Piper PA-22 Tri-Pacer, but at $15 an hour (equivalent to $128 in 2020), he could not afford to continue the training.
1964	In 1964, after a season of summer stock with the Belfry Players in Wisconsin, Ford traveled to Los Angeles to apply for a job in radio voice-overs.
1964	He was first married to Mary Marquardt from 1964 until their divorce in 1979.
1968	He appeared in the western Journey to Shiloh (1968) and had an uncredited, non-speaking role in Michelangelo Antonioni's 1970 film Zabriskie Point as an arrested student protester.
1969	French filmmake