This homework explores coreference resolution for the task of timeline generation: given a biography on Wikipedia, can you extract all of the events associated with the people mentioned and create one timeline for each person? For this homework, be sure you are using the `anlp_spacy2` Anaconda environment that we set up in `CorefSetup.ipynb` (it has spaCy 2 installed, along with the neuralcoref package).

Within this environment, install the wikipedia package:

```pip install wikipedia```

In [2]:
import wikipedia
import spacy
import neuralcoref
import re
import operator

In [3]:
nlp = spacy.load('en_core_web_sm')
coref = neuralcoref.NeuralCoref(nlp.vocab)
nlp.add_pipe(coref, name='neuralcoref')

In [4]:
biography = wikipedia.page("Harrison Ford", auto_suggest=False)

In [5]:
doc=nlp(biography.content)

Q1. Create a function `extract_timeline` that constructs a timeline for the subject of the Wikipedia biography. The timeline should consist of events (i.e., verbs) in the biography that the subject participates in and that can be grounded to a year. Your criteria:

- Only include events in sentences that contain a year.
- Only include events that the target of the Wikipedia biography is involved in.
- Only include events where that target person is the subject or direct object of the verb.
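The first criterion depends on how a "year" is recognized. A bare four-digit pattern also matches inside decade expressions such as "1960s", so one option is to require word boundaries around the digits. The sketch below is only an illustration: `find_years` and `YEAR_PATTERN` are hypothetical names, and the 1000 to 2099 range is an assumption, not part of the assignment.

```python
import re

# Hypothetical pattern: a standalone four-digit year from 1000 to 2099.
# The \b anchors reject digits glued to letters or other digits ("1960s").
YEAR_PATTERN = re.compile(r"\b(?:1\d{3}|20\d{2})\b")


def find_years(sentence):
    """Return every standalone year in the sentence as an int, in order."""
    return [int(match) for match in YEAR_PATTERN.findall(sentence)]


print(find_years("He composed 6 suites for solo cello in 1717."))
print(find_years("Ford began flight training in the 1960s."))
print(find_years("Bach is a composer."))
```

Whether a decade such as "the 1960s" should count as a year is a judgment call; this sketch treats it as not grounded to a single year.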

Your function should return a dict mapping a sentence where those criteria are satisfied to the date mentioned in that sentence.  For example, assume the following to be a biography of the target *Bach*:

> Bach is a composer.  He was born in 1685.  He composed 6 suites for solo cello in 1717. Anna Magdalena Wilcke met him in 1721.  Handel was a contemporary and was also born in 1685. Bach's dog was born in 1723.  

Your function should return the following dict:

{"He was born in 1685": 1685, "He composed 6 suites for solo cello in 1717": 1717, "Anna Magdalena Wilcke met him in 1721": 1721}

We exclude the other sentences because:

- "Bach is a composer" does not mention a date.
- Bach is not mentioned in "Handel was a contemporary and was also born in 1685."
- Bach is not the subject or direct object of "Bach's dog was born in 1723".
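The three criteria can be traced through the Bach example without running a parser. In the sketch below, the dependency labels and resolved referents are written by hand as stand-ins for what a parser and coreference model might produce; they are assumptions, not real spaCy or neuralcoref output. `build_timeline` is a hypothetical name, and accepting `nsubjpass` (the passive subject in "He was born") alongside `nsubj` and `dobj` is also an assumption about how "subject" should be read.

```python
import re

# Hand-written annotations (assumed, not parser output): each sentence is
# paired with (dependency label, resolved referent) for its mentions.
SENTENCES = [
    ("Bach is a composer", [("nsubj", "Bach")]),
    ("He was born in 1685", [("nsubjpass", "Bach")]),
    ("He composed 6 suites for solo cello in 1717", [("nsubj", "Bach")]),
    ("Anna Magdalena Wilcke met him in 1721",
     [("nsubj", "Anna Magdalena Wilcke"), ("dobj", "Bach")]),
    ("Handel was a contemporary and was also born in 1685",
     [("nsubj", "Handel")]),
    ("Bach's dog was born in 1723",
     [("poss", "Bach"), ("nsubjpass", "Bach's dog")]),
]

# Roles that count as "subject or direct object" in this sketch.
ALLOWED_DEPS = {"nsubj", "nsubjpass", "dobj"}
YEAR = re.compile(r"\b\d{4}\b")


def build_timeline(sentences, target):
    """Map each qualifying sentence to the first year it mentions."""
    timeline = {}
    for text, mentions in sentences:
        years = YEAR.findall(text)
        if not years:
            continue  # criterion 1: the sentence must contain a year
        # criteria 2 and 3: the target appears as subject or direct object
        if any(dep in ALLOWED_DEPS and referent == target
               for dep, referent in mentions):
            timeline[text] = int(years[0])
    return timeline


print(build_timeline(SENTENCES, "Bach"))
```

With these annotations the result contains the same three sentences as the expected dict above. The possessive "Bach" in "Bach's dog" is labeled `poss`, so that sentence is skipped.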

In [260]:
def extract_timeline(doc, subject="Ford"):
    # Mentions of the subject are matched on the last name only; judging from
    # the Harrison Ford examples, that is how the coref chains refer to him
    # (this could be wrong for other biographies).

    # Maps year -> sentence so the timeline can be sorted by key.
    events = {}

    # Collect the coreference clusters to inspect. Filtering clusters by name
    # is disabled, so every cluster is kept:
    # if cluster.main.text in subject or subject in cluster.main.text:
    corefs_to_check = list(doc._.coref_clusters)

    # Keep every mention whose sentence contains a four-digit number.
    year_sents = []
    for cluster in corefs_to_check:
        for mention in cluster:
            if re.findall(r"\d{4}", mention.sent.text):
                year_sents.append(mention)

    # Keep mentions that name the subject and act as the subject or direct
    # object of a verb; key each sentence by the first year it mentions.
    for mention in year_sents:
        if (subject in mention.root.text
                and mention.root.dep_ in ("dobj", "nsubj")
                and mention.sent.text not in events.values()):
            year = int(re.findall(r"\d{4}", mention.sent.text)[0])
            events[year] = mention.sent.text

    return events

Let's print that extracted timeline from oldest to newest events.
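If the dict maps each sentence to its year, as described in Q1, chronological order comes from sorting on the value rather than the key. A minimal sketch using the Bach example; `oldest_first` is a hypothetical helper name.

```python
import operator

# Example timeline in the sentence -> year layout described in Q1.
timeline = {
    "Anna Magdalena Wilcke met him in 1721": 1721,
    "He was born in 1685": 1685,
    "He composed 6 suites for solo cello in 1717": 1717,
}


def oldest_first(events):
    """Return (sentence, year) pairs sorted by year, earliest first."""
    return sorted(events.items(), key=operator.itemgetter(1))


for sentence, year in oldest_first(timeline):
    print("%s\t%s" % (year, sentence))
```

A dict keyed by year instead can be sorted on its keys directly, at the cost of holding only one sentence per year.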

In [261]:
events=extract_timeline(doc)

# Note: events maps year -> sentence here, so the provided code (commented out below) does not display it correctly; sort by key instead
for key in sorted(events):
    print("%s: %s" % (key, events[key]))

#for k,v in sorted(events.items(), key=operator.itemgetter(1)):
    #print("%s\t%s" % (v, k))

1942: Harrison Ford (born July 13, 1942) is an American actor.
1960: Ford began flight training in the 1960s at Wild Rose Idlewild Airport in Wild Rose, Wisconsin, flying in a Piper PA-22 Tri-Pacer, but at $15 an hour (equivalent to $128 in 2020), he could not afford to continue the training.
1964: In 1964, after a season of summer stock with the Belfry Players in Wisconsin, Ford traveled to Los Angeles to apply for a job in radio voice-overs.
1969: French filmmaker Jacques Demy chose Ford for the lead role of his first American film, Model Shop (1969), but the head of Columbia Pictures thought Ford had "no future" in the film business and told Demy to hire a more experienced actor.
1970: ===
Ford began to receive bigger roles in films throughout the late 1970s, including Heroes (1977), Force 10 from Navarone (1978) and Hanover Street (1979).
1973: Casting director and fledgling producer Fred Roos championed the young Ford and secured him an audition with George Lucas for the role of B