# Extracting Quotation-Person pairs from Reuters articles
# Seth Vanderwilt, December 18 2016 for Bob West (and my own NLP trial by fire)

Task description:

(1) Download the Reuters corpus here: http://www.daviddlewis.com/resources/testcollections/reuters21578/

(2) **Build a table of quotes alongside the name of the person who uttered it**; i.e., one record of the table could look like this:
They brought us whole binders full of women&lt;TAB&gt;Mitt Romney

For extracting quotes, it's quite easy to handcraft some high-precision extractor patterns, e.g., in the form of regexes such as "(.\*)", said (.\*)
But if you do only this, then you'll miss out on a lot of quotes that are framed slightly differently (e.g., '"Bla bla bla?", the governor asked').

Can you think of ways to automatically or semi-automatically expand the set of extractor patterns? You could think of ways of leveraging the fact that some people (such as Ronald Reagan -- the corpus is from the late '80s...) have probably contributed many quotes to the corpus. So-called bootstrapping approaches (such as this one http://ilpubs.stanford.edu:8090/421/1/1999-65.pdf) could work in such a setting.

**Potentially relevant papers:**
- http://www.cs.columbia.edu/~delson/pubs/AAAI10-ElsonMcKeown.pdf
- http://alpage.inria.fr/~sagot/pub/ltc09sapiens.pdf
- https://aclweb.org/anthology/U/U13/U13-1007.pdf

**What I've done so far:**
- one potentially interesting file (not so interesting now):
     - all-people-strings.lc.txt
         - could use this as a set() of known Person entities and do lookups?
         - turns out this file only consists of people who were the primary focus of an article, so we can use it to look for example quotation-person tuples (get some bootstrap samples) but it's not an exhaustive list of all the people who occur in the corpus.
- the most reliable example patterns of quotations I thought of:
    - QUOTATION SAID_SYNONYM PERSON
    - QUOTATION PERSON SAID_SYNONYM
    - PERSON SAID_SYNONYM QUOTATION
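
A rough sketch of how the three patterns above could be written as regexes. The verb list in `SAID_SYNONYMS` and the capitalized-words stand-in for PERSON are my own assumptions for illustration, not something tuned on the corpus:

```python
import re

# Hypothetical, hand-picked list of "say" verbs; extend as needed.
SAID_SYNONYMS = ["said", "added", "asked", "replied", "declared"]
SAID = r"(?:%s)" % "|".join(SAID_SYNONYMS)
QUOTE = r'"(?P<quote>[^"]+)"'
# Crude stand-in for a person: one to three capitalized words.
PERSON = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+){0,2})"

TEMPLATES = [
    # QUOTATION SAID_SYNONYM PERSON
    re.compile(QUOTE + r",? " + SAID + r" " + PERSON),
    # QUOTATION PERSON SAID_SYNONYM
    re.compile(QUOTE + r",? " + PERSON + r" " + SAID),
    # PERSON SAID_SYNONYM QUOTATION
    re.compile(PERSON + r" " + SAID + r",? (?:that )?" + QUOTE),
]

def extract_pairs(sentence):
    """Return (quote, person) tuples found by any of the three templates."""
    pairs = []
    for template in TEMPLATES:
        for m in template.finditer(sentence):
            pairs.append((m.group("quote").strip(" ,"), m.group("person")))
    return pairs

print(extract_pairs('"Interest rates are down across the spectrum," Reagan said.'))
print(extract_pairs('"Bla bla bla?", the governor asked'))  # lowercase sayer: missed
```

The second call returns an empty list, which is exactly the kind of miss described in the task: the templates are precise but brittle.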


**Interesting stuff**

- import the dataset
- see if we can get some matches for the example pattern (QUOTATION) said (.\*)
- see if we can get some bootstrap samples of Reagan quotes
    
**Paper stuff**

Brin's approach for (Title, Author) extraction:
- Start by manually adding 5 example (Book, Author) pairs.
- Find all documents containing these pairs in a small region of text.
    - occurrence: (author, title, order, url, prefix, middle, suffix)
- From these documents, extract patterns e.g. the document contained &lt;LI&gt;&lt;B&gt;title&lt;/B&gt; by author
- Search for these patterns on the Web to find more books
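
To make the occurrence tuple concrete, here is a minimal sketch of how one could be recorded. The field names follow the tuple above; `find_occurrences`, the fixed-size `window` for prefix/suffix, and treating `order` as a boolean are my own assumptions, and the sample HTML line is made up:

```python
from collections import namedtuple

Occurrence = namedtuple(
    "Occurrence", ["author", "title", "order", "url", "prefix", "middle", "suffix"])

def find_occurrences(author, title, text, url="", window=10):
    """Record each line of text that contains both the author and the title.

    order is True when the author comes before the title.
    window (an assumed parameter) caps the prefix and suffix length.
    """
    occurrences = []
    for line in text.split("\n"):
        a, t = line.find(author), line.find(title)
        if a < 0 or t < 0:
            continue
        author_first = a < t
        first_len = len(author) if author_first else len(title)
        second_len = len(title) if author_first else len(author)
        first_start, second_start = min(a, t), max(a, t)
        first_end = first_start + first_len
        if first_end > second_start:
            continue  # the two strings overlap; skip this line
        second_end = second_start + second_len
        occurrences.append(Occurrence(
            author, title, author_first, url,
            line[max(0, first_start - window):first_start],  # prefix
            line[first_end:second_start],                    # middle
            line[second_end:second_end + window]))           # suffix
    return occurrences

page = "<LI><B>Foundation</B> by Isaac Asimov (1951)\nUnrelated line"
for occ in find_occurrences("Isaac Asimov", "Foundation", page):
    print(occ)
```

For the sample line this gives prefix `<LI><B>`, middle `</B> by ` and suffix ` (1951)`, with `order` False because the title comes first.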

**A similar approach for (Quote, Sayer) extraction:**
- Start with a few (Quote, Famous_Person) pairs
    - So we start with pairs [("Mr. Gorbachev tear down this wall", Ronald Reagan), ("God bless America", Ronald Reagan), ("It's clearly a budget. It's got a lot of numbers in it.", George Bush) ...]
- Find all documents containing these pairs
    - could adapt Brin's occurrence definition - but do we need to find a 'say' verb in the middle/suffix? I'm not sure that we can just use the exact text we find.
        - For example I found "Reagan to Gorbachev: 'Tear down this wall'!" which is a quotation pattern I hadn't thought of. 
        - Another one: "On June 12, 1987, President Ronald Reagan declared, 'Mr. Gorbachev, open this gate. Mr. Gorbachev, tear down this wall.'"
- Extract patterns from these documents
- Search for these patterns to find more quote-sayer pairs
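
The last two steps could look roughly like this: turn one seed sentence into a regex by keeping the literal middle and a few suffix characters, then run it over new text. This is a sketch under my own assumptions (`induce_pattern`, `apply_pattern`, the capitalized-words sayer group and `suffix_len=5` are all hypothetical choices), not a tested extractor:

```python
import re

QUOTE_GROUP = r'"(?P<quote>[^"]+)"'
SAYER_GROUP = r"(?P<sayer>[A-Z]\w+(?: [A-Z]\w+)*)"

def induce_pattern(quote, sayer, sentence, suffix_len=5):
    """Build a regex from one sentence containing a known (quote, sayer) pair.

    Keeps the literal text between the two and suffix_len characters after
    the second one. Returns None if the pair is not found in the sentence.
    """
    quoted = '"%s"' % quote
    q, s = sentence.find(quoted), sentence.find(sayer)
    if q < 0 or s < 0:
        return None
    if q < s:  # quote comes first
        first_end, second_start, second_end = q + len(quoted), s, s + len(sayer)
        groups = (QUOTE_GROUP, SAYER_GROUP)
    else:      # sayer comes first
        first_end, second_start, second_end = s + len(sayer), q, q + len(quoted)
        groups = (SAYER_GROUP, QUOTE_GROUP)
    if first_end > second_start:
        return None  # sayer string sits inside the quote (or vice versa)
    middle = sentence[first_end:second_start]
    suffix = sentence[second_end:second_end + suffix_len]
    return groups[0] + re.escape(middle) + groups[1] + re.escape(suffix)

def apply_pattern(pattern, text):
    """Return every (quote, sayer) pair the induced pattern finds in text."""
    return [(m.group("quote"), m.group("sayer")) for m in re.finditer(pattern, text)]

pattern = induce_pattern("God bless America", "Reagan",
                         '"God bless America" Reagan said.')
print(apply_pattern(pattern, '"Interest rates are down across the spectrum," Reagan said.'))
```

Keeping only a short literal suffix is a trade-off: longer contexts are more precise but match fewer new sentences.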

**One problem** is that this Reuters corpus consists primarily of __mundane__ quotes from famous and not-so-famous people. You can find many of Reagan's quotes in the public record, but most of the quotes in Reuters are one-offs from a press conference or interview. Thus I don't think any quotes are reused across multiple articles.

You might think it's OK to match any quote and fuzzy-match the speaker, i.e. match ("anything", ~Reagan). The problem is that this also matches a sentence like "... the 'geniuses' in the Reagan administration ...", and then we learn some bad patterns. So it seems we need to use specific quotes, plus the grammatical structure, to check that Reagan was the person saying the quote.
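
A tiny illustration of that failure mode. The second sentence is made up to mimic the 'geniuses' case, and the verb list in `SAID` is an assumption:

```python
import re

sentences = [
    '"Interest rates are down across the spectrum," Reagan said.',
    'Critics mocked the "geniuses" in the Reagan administration.',
]

SAID = r"(?:said|added|declared|replied)"
# Loose: any quoted span followed somewhere later by "Reagan".
loose = re.compile(r'"[^"]+".*Reagan')
# Stricter: the quote must be followed directly by "Reagan" and a say-verb.
strict = re.compile(r'"[^"]+",? Reagan ' + SAID + r"\b")

loose_hits = [s for s in sentences if loose.search(s)]
strict_hits = [s for s in sentences if strict.search(s)]
print(len(loose_hits), len(strict_hits))
```

The loose pattern accepts both sentences while the strict one keeps only the real quote, at the cost of missing any framing it does not spell out.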

**Other datasets/approaches that might be useful**
- Cross-reference multiple books or articles that contain the same famous quotations. 
- Find all the articles written about a specific speech/press-conference (journalists would all hear the same quotes but write their articles differently?)
- Use Google search results for each (quote, person) pair, basically automating what I did for the "tear down this wall" example above
- Use supervised learning, and measure the error rate with crowdsourcing

**See what Reagan quotes we can find in the dataset without any preprocessing** 

(NOTE: other than this grep result, the rest of my work is with the NLTK version of reuters-21578)

We can see that in the second execution many of these are Reagan-related but not actual Reagan quotes! So this is not an easy task.

**I preprocess the dataset by getting the raw text of each document, splitting it into sentences (each separated by newlines) and putting all of them into a giant array**

In [1]:
import re
import nltk
from nltk.corpus import reuters 

doc_ids = reuters.fileids()
texts = []
#Keep mapping from docid:text for context?
for doc_id in doc_ids:
    doc = reuters.raw(doc_id)
    #instead of accepting doc as formatted, try to get each sentence on its own line
    doc_oneline = doc.replace('\n',' ')
    doc_sents = nltk.tokenize.sent_tokenize(doc_oneline)
    doc_lines = '\n'.join(doc_sents)
    texts.append(doc_lines)

In [2]:
concatenated_texts = "\n".join(texts)
len(concatenated_texts)

8650373

In [3]:
#Curious what happens if we try to create the table with this pattern.
#Looks like the "sayer" field would be a mess
matches = re.findall('\"(.*)\" said (.*)', concatenated_texts)
print(len(matches))
matches[:5]

458


[("We wouldn't be able to do business,",
  'a spokesman for   leading Japanese electronics firm Matsushita Electric   Industrial Co Ltd &lt;MC.T>.'),
 ('If the tariffs remain in place for any length of time   beyond a few months it will mean the complete erosion of   exports (of goods subject to tariffs) to the U.S.,',
  'Tom   Murtha, a stock analyst at the Tokyo office of broker &lt;James   Capel and Co>.'),
 ('That is a very short-term view,',
  'Lawrence Mills,   director-general of the Federation of Hong Kong Industry.'),
 ('The tie-up is widely looked on as a lame duck because the   Fed was stricter than Sumitomo expected,',
  'one analyst.'),
 ("It's (Sumitomo) been bold in its strategies,", "  Kleinwort's Smithson.")]
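
The sayer field in the output above is messy: runs of spaces inside names, and job descriptions trailing after them. A minimal cleanup sketch, assuming (my guess, not verified across the corpus) that the name is whatever comes before the first comma; `normalize_sayer` is a hypothetical helper:

```python
import re

def normalize_sayer(raw):
    """Collapse whitespace, keep the text before the first comma, drop a final period."""
    collapsed = re.sub(r"\s+", " ", raw).strip()
    head = collapsed.split(",", 1)[0]
    return head.rstrip(".").strip()

print(normalize_sayer("Tom   Murtha, a stock analyst at the Tokyo office"))
print(normalize_sayer("  Kleinwort's Smithson."))
```

This still leaves non-names such as "one analyst" in the table, and stripping the final period would also damage a sayer that legitimately ends in an abbreviation.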

In [4]:
reagan_matches = re.findall('\".*\".*Reagan.*', concatenated_texts)

print(len(reagan_matches))
reagan_matches

19


['"We cannot allow it to be jeopardized by unfair trading   practices," Reagan added in the statement from his California   vacation home at Santa Barbara.',
 '"The commitments made at these meetings need to be   translated into action," Reagan said in a pre-summit speech   celebrating the 40th anniversary of the Marshall Plan.     ',
 '"They died while guarding a chokepoint of freedom, deterring   aggression and reaffirming America\'s willingness to protect its   vital interests," Reagan said.',
 '"No one has ever stated or supported a policy of protecting   all shipping in those waters," Pentagon spokesman Bob Sims said   as the Reagan Administration drew up plans to increase the   protective U.S. military presence in the gulf.',
 '"I believe there is room in the market for a further decline   in interest rates," Reagan said in a statement as he left the   White House to visit his wife Nancy at Bethesda Naval Hospital.',
 '"Interest rates are down across the spectrum," Reagan said.',

In [5]:
reagan_matches_swapped = re.findall('Reagan.*\".*\".*', concatenated_texts)

print(len(reagan_matches_swapped))
reagan_matches_swapped

34


['Reagan said "I am committed to   the full enforcement of our trade agreements designed to   provide American industry with free and fair trade."',
 'Reagan administration was resisting "strong   domestic pressure" for trade protection and was working closely   with the U.S. Congress in crafting a trade bill.',
 'Reagan said "I am committed to   full enforcement of our trade agreements designed to provide   American industry with free and fair trade opportunities."',
 'Reagan to cancel the scheduled 100 pct tariffs on Japanese   electronic exports, he said "slim to none."',
 'Reagan said in announcing the sanctions today that "I regret   that these actions are necessary," but that the health and   vitality of the U.S. semiconductor industry was essential to   American competitiveness in world markets.',
 'Reagan said the American people were aware that "it is not our   interests alone that are being protected."',
 'Reagan said today, "economic policy decisions   made last year in Toky

**Some example valid patterns from these results (clearly not every match is a valid Reagan quote!):**
- "We cannot allow it to be jeopardized by unfair trading   practices," Reagan added ...
- "The United States remains committed to the Louvre   agreement," Reagan said ...
- "We don't want to go down that road," Reagan was quoted as telling ...

In [6]:
mydataset = '"quote1" Reagan said.\n \
"quote1" Reagan was quoted as telling the police.\n \
Reagan reiterated that "quote1" in his speech.\n \
Reagan: "quote2"\n Reagan said that "Quote2" and so on.\n \
Reagan replied, "quote3" before walking away.\n \
Reagan told reporters to "quote3"'

In [8]:
for line in mydataset.split("\n"):
    #match our example pair
    if "Reagan" in line and "quote1" in line:
        print("Found match for line: '%s'" % line)
        #extract pattern: check both orders (sayer before quote, then quote before sayer)
        regex_sq = re.search("(.*)Reagan(.*)\"quote1\"(.*)", line)
        if regex_sq:
            print("sayer-quote:")
            pref,mid,suff = regex_sq.groups()
            print("prefix: '%s'" % pref)
            print("middle: '%s'" % mid)
            print("suffix: '%s'" % suff)
        regex_qs = re.search("(.*)\"quote1\"(.*)Reagan(.*)", line)
        if regex_qs:
            print("quote-sayer:")
            pref,mid,suff = regex_qs.groups()
            print("prefix: '%s'" % pref)
            print("middle: '%s'" % mid)
            print("suffix: '%s'" % suff)
        

Found match for line: '"quote1" Reagan said.'
quote-sayer:
prefix: ''
middle: ' '
suffix: ' said.'
Found match for line: ' "quote1" Reagan was quoted as telling the police.'
quote-sayer:
prefix: ' '
middle: ' '
suffix: ' was quoted as telling the police.'
Found match for line: ' Reagan reiterated that "quote1" in his speech.'
sayer-quote:
prefix: ' '
middle: ' reiterated that '
suffix: ' in his speech.'


**Seems like we need to figure out what's relevant to have a useful pattern, but we can't get too general or we'll get junk**
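
One possible heuristic for that trade-off (an assumption on my part, not something I have evaluated): throw away contexts that carry almost no literal text, then rank the rest by how many seed pairs produced them. The `(order, middle, suffix)` tuples mirror the printout above; `score_contexts` and `min_chars` are hypothetical:

```python
from collections import Counter

def score_contexts(contexts, min_chars=3):
    """Rank (order, middle, suffix) contexts by how often they were observed.

    Contexts whose middle and suffix together hold fewer than min_chars
    non-space characters are dropped as too general.
    """
    kept = [c for c in contexts
            if len((c[1] + c[2]).replace(" ", "")) >= min_chars]
    return Counter(kept).most_common()

contexts = [
    ("quote-sayer", " ", " said."),
    ("quote-sayer", " ", " said."),
    ("sayer-quote", " reiterated that ", " in his speech."),
    ("quote-sayer", " ", " "),  # nothing but spaces: would match almost anything
]
for context, support in score_contexts(contexts):
    print(support, context)
```

Contexts seen with several different seed pairs are less likely to be accidents of one sentence, while the length filter removes the ones that would match almost any quoted text.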