# Week 7 - Information Extraction


This week, we move from arbitrary textual classification to the use of computation and linguistic models to parse precise claims from documents. Rather than focusing on simply the *ideas* in a corpus, here we focus on understanding and extracting its precise *claims*. This process involves a sequential pipeline of classifying and structuring tokens from text, each of which generates potentially useful data for the content analyst. Steps in this process, which we examine in this notebook, include: 1) tagging words by their part of speech (POS) to reveal the linguistic role they play in the sentence (e.g., Verb, Noun, Adjective, etc.); 2) tagging words as named entities (NER) such as places or organizations; 3) structuring or "parsing" sentences into nested phrases that are local to, describe or depend on one another; and 4) extracting informational claims from those phrases, like the Subject-Verb-Object (SVO) triples we extract here. While much of this can be done directly in the python package NLTK that we introduced in week 2, here we use NLTK bindings to the Stanford NLP group's open software, written in Java. Try typing a sentence into the online version [here](http://nlp.stanford.edu:8080/corenlp/) to get a sense of its potential. It is superior in performance to NLTK's implementations, but takes time to run, and so for these exercises we will parse and extract information for a very small text corpus. Of course, for final projects that draw on these tools, we encourage you to install the software on your own machines or shared servers at the university (RCC, SSRC) in order to perform these operations on much more text. 

For this notebook we will be using the following packages:

In [243]:
#Special module written for this class
#This provides access to data and to helper functions from previous weeks
#Make sure you update it before starting this notebook
import lucem_illud #pip install -U git+https://github.com/Computational-Content-Analysis-2018/lucem_illud.git

#All these packages need to be installed from pip
#For NLP
import nltk
import sklearn


import numpy as np #For arrays
import pandas #Gives us DataFrames
import matplotlib.pyplot as plt #For graphics
import seaborn #Makes the graphics look nicer

#Displays the graphs
import graphviz #You also need to install the command line graphviz

#These are from the standard library
import os.path
import zipfile
import subprocess
import io
import tempfile

%matplotlib inline

Run this _once_ to download everything. If you are using Windows or macOS, you will also need [Java 1.8+](http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html).

In [60]:
lucem_illud.setupStanfordNLP()

Starting downloads, this will take 5-10 minutes
../stanford-NLP/parser already exists, skipping download
../stanford-NLP/postagger already exists, skipping download
../stanford-NLP/core already exists, skipping download
../stanford-NLP/ner already exists, skipping download
Done setting up the Stanford NLP collection


We need to have stanford-NLP set up before importing, so we are doing the import here. If you already have stanford-NLP working, you can import at the beginning like you would with any other library.

In [61]:
import lucem_illud.stanford as stanford

Open Information Extraction is a module packaged within Stanford CoreNLP, but it is not yet supported by `nltk`. As a result, we have defined our own `lucem_illud` function that runs the Stanford CoreNLP Java code directly. For other projects, it is often useful to call Java or other programs (in C, C++) from within a Python workflow, and this is an example. `stanford.openIE()` takes a string or list of strings and returns, as a DataFrame, all the subject, verb, object (SVO) triples Stanford CoreNLP can find. You can do this through the links to the Stanford CoreNLP project that we provide here, or play with their interface directly (in the penultimate cell of this notebook), which produces data in "pretty graphics" like this example parsing of the first sentence in the "Shooting of Trayvon Martin" Wikipedia article:

![Output 1](../data/stanford_core1.png)
![Output 2](../data/stanford_core2.png)
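
The general pattern mentioned above, calling an external program from Python and reading back its output, can be sketched with the standard library's `subprocess.run`. This is only an illustration and not how `lucem_illud` is implemented: `run_external` is a hypothetical helper, and a child Python process stands in for the Java program so that the example runs anywhere.

```python
import subprocess
import sys

def run_external(command, input_text):
    """Send input_text to an external command on stdin and return its stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout

# Stand-in for an external tool (assumption): a child Python process that
# upper-cases whatever it reads. A Java command line would take its place.
command = [sys.executable, '-c', 'import sys; print(sys.stdin.read().upper())']
output = run_external(command, 'I saw the elephant.')
print(output)
```

Passing the command as a list avoids shell quoting problems, and `check=True` makes failures in the external program visible as Python exceptions.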

First, we will illustrate these tools on some *very* short examples:

In [9]:
text = ['I saw the elephant in my pajamas.', 'The quick brown fox jumped over the lazy dog.', 'While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.', 'Trayvon Benjamin Martin was an African American from Miami Gardens, Florida, who, at 17 years old, was fatally shot by George Zimmerman, a neighborhood watch volunteer, in Sanford, Florida.', 'Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo']
tokenized_text = [nltk.word_tokenize(t) for t in text]
print('\n'.join(text))


I saw the elephant in my pajamas.
The quick brown fox jumped over the lazy dog.
While in France, Christine Lagarde discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.
Trayvon Benjamin Martin was an African American from Miami Gardens, Florida, who, at 17 years old, was fatally shot by George Zimmerman, a neighborhood watch volunteer, in Sanford, Florida.
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo


# Part-of-Speech (POS) tagging

In POS tagging, we classify each word by its grammatical role in a sentence. The Stanford POS tagger uses the [Penn Treebank tag set](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1603&context=cis_reports) to tag words from input sentences. As discussed in the second assignment, this is a relatively fine-grained tagset, which allows more informative tags, and also more opportunities to err :-).

|#. |Tag |Description |
|---|----|------------|
|1. |CC |Coordinating conjunction |
|2. |CD |Cardinal number |
|3. |DT |Determiner |
|4. |EX |Existential there |
|5. |FW |Foreign word |
|6. |IN |Preposition or subordinating conjunction |
|7. |JJ |Adjective |
|8. |JJR |Adjective, comparative |
|9. |JJS |Adjective, superlative |
|10. |LS |List item marker |
|11. |MD |Modal |
|12. |NN |Noun, singular or mass |
|13. |NNS |Noun, plural |
|14. |NNP |Proper noun, singular |
|15. |NNPS |Proper noun, plural |
|16. |PDT |Predeterminer |
|17. |POS |Possessive ending |
|18. |PRP |Personal pronoun |
|19. |PRP\$ |Possessive pronoun |
|20. |RB |Adverb |
|21. |RBR |Adverb, comparative |
|22. |RBS |Adverb, superlative |
|23. |RP |Particle |
|24. |SYM |Symbol |
|25. |TO |to |
|26. |UH |Interjection |
|27. |VB |Verb, base form |
|28. |VBD |Verb, past tense |
|29. |VBG |Verb, gerund or present participle |
|30. |VBN |Verb, past participle |
|31. |VBP |Verb, non-3rd person singular present |
|32. |VBZ |Verb, 3rd person singular present |
|33. |WDT |Wh-determiner |
|34. |WP |Wh-pronoun |
|35. |WP\$ |Possessive wh-pronoun |
|36. |WRB |Wh-adverb |

In [10]:
pos_sents = stanford.postTagger.tag_sents(tokenized_text)
print(pos_sents)

[[('I', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('elephant', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('pajamas', 'NNS'), ('.', '.')], [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')], [('While', 'IN'), ('in', 'IN'), ('France', 'NNP'), (',', ','), ('Christine', 'NNP'), ('Lagarde', 'NNP'), ('discussed', 'VBD'), ('short-term', 'JJ'), ('stimulus', 'NN'), ('efforts', 'NNS'), ('in', 'IN'), ('a', 'DT'), ('recent', 'JJ'), ('interview', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Wall', 'NNP'), ('Street', 'NNP'), ('Journal', 'NNP'), ('.', '.')], [('Trayvon', 'NNP'), ('Benjamin', 'NNP'), ('Martin', 'NNP'), ('was', 'VBD'), ('an', 'DT'), ('African', 'NNP'), ('American', 'NNP'), ('from', 'IN'), ('Miami', 'NNP'), ('Gardens', 'NNP'), (',', ','), ('Florida', 'NNP'), (',', ','), ('who', 'WP'), (',', ','), ('at', 'IN'), ('17', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('was', 'VBD'), ('fatally'

This looks quite good. Now we will try POS tagging with a somewhat larger corpus. We consider a few of the top posts from the reddit data we used last week.

In [11]:
redditDF = pandas.read_csv('../Codes_Data/data/reddit.csv', index_col=0)

Grabbing the 10 highest scoring posts and tokenizing the sentences. Once again, notice that we aren't going to do any kind of stemming this week (although *semantic* normalization may be performed where we translate synonyms into the same focal word).

In [12]:
redditTopScores = redditDF.sort_values('score')[-10:]
redditTopScores['sentences'] = redditTopScores['text'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
redditTopScores.index = range(len(redditTopScores) - 1, -1,-1) #Reindex to make things nice in the future
redditTopScores[-5:]

Unnamed: 0,author,over_18,score,subreddit,text,title,url,sentences
4,goldie-gold,False,12650,Tales From Tech Support,"This just happened... So, I had a laptop syst...",Engineer is doing drugs!! No. No they aren't.,https://www.reddit.com/r/talesfromtechsupport/...,"[[This, just, happened, ...], [So, ,, I, had, ..."
3,TheDroolinFool,False,13152,Tales From Tech Support,Another tale from the out of hours IT desk... ...,"""I need you to fix Google Bing immediately!""",https://www.reddit.com/r/talesfromtechsupport/...,"[[Another, tale, from, the, out, of, hours, IT..."
2,Clickity_clickity,False,13404,Tales From Tech Support,[Part 1](http://www.reddit.com/r/talesfromtech...,"Jack, the Worst End User, Part 4",https://www.reddit.com/r/talesfromtechsupport/...,"[[[, Part, 1, ], (, http, :, //www.reddit.com/..."
1,SECGaz,False,13724,Tales From Tech Support,"> $Me - Hello, IT. > $Usr - Hi, I am still ...","Hi, I am still off sick but I am not.",https://www.reddit.com/r/talesfromtechsupport/...,"[[>, $, Me, -, Hello, ,, IT, .], [>, $, Usr, -..."
0,guitarsdontdance,False,14089,Tales From Tech Support,So my story starts on what was a normal day ta...,"""Don't bother sending a tech, I'll be dead by ...",https://www.reddit.com/r/talesfromtechsupport/...,"[[So, my, story, starts, on, what, was, a, nor..."


In [13]:
redditTopScores['POS_sents'] = redditTopScores['sentences'].apply(lambda x: stanford.postTagger.tag_sents(x))

In [14]:
redditTopScores['POS_sents']

9    [[(Last, JJ), (year, NN), (,, ,), (Help, NN), ...
8    [[(First, JJ), (post, NN), (in, IN), (quite, R...
7    [[([, NNP), (Original, NNP), (Post, NNP), (], ...
6    [[(I, PRP), (witnessed, VBD), (this, DT), (ast...
5    [[(I, PRP), (work, VBP), (Helpdesk, NNP), (for...
4    [[(This, DT), (just, RB), (happened, VBN), (.....
3    [[(Another, DT), (tale, NN), (from, IN), (the,...
2    [[([, NNP), (Part, NNP), (1, CD), (], FW), ((,...
1    [[(>, JJR), ($, $), (Me, PRP), (-, :), (Hello,...
0    [[(So, RB), (my, PRP$), (story, NN), (starts, ...
Name: POS_sents, dtype: object

Next, we count the most frequent tokens tagged `NN` (singular or mass nouns):

In [15]:
countTarget = 'NN'
targetCounts = {}
for entry in redditTopScores['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

[('password', 21),
 ('(', 19),
 (')', 14),
 ('time', 14),
 ('lot', 12),
 ('computer', 12),
 ('email', 11),
 ('life', 11),
 ('**Genius**', 10),
 ('system', 9),
 ('day', 9),
 ('**Me**', 9),
 ('message', 9),
 ('today', 8),
 ('part', 8),
 ('laptop', 8),
 ('call', 8),
 ('office', 8),
 ('story', 8),
 ('Ok', 7)]

What about the top verbs? Note that `VB` matches base-form verbs only; past-tense (`VBD`) and other inflected forms carry their own tags.

In [16]:
countTarget = 'VB'
targetCounts = {}
for entry in redditTopScores['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

[('be', 18),
 ('have', 17),
 ('get', 14),
 ('do', 11),
 ('change', 9),
 ('make', 8),
 ('know', 7),
 ('say', 7),
 ('help', 6),
 ('send', 6),
 ('tell', 6),
 ('look', 6),
 ('go', 5),
 ('work', 4),
 ('feel', 4),
 ('use', 4),
 ('thank', 4),
 ('take', 4),
 ('receive', 4),
 ('open', 4)]
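
The two counting cells above repeat the same loop with a different target tag. As a minimal sketch, the pattern could be factored into a helper built on `collections.Counter`. The name `count_by_tag` and the toy data are hypothetical and are not part of `lucem_illud`. The helper matches on a tag *prefix*, so `'VB'` also picks up `VBD`, `VBP` and so on, and `'NN'` also picks up plural and proper nouns; pass a full tag and use `==` instead if you want the exact behaviour of the cells above.

```python
from collections import Counter

def count_by_tag(pos_docs, tag_prefix, top_n=20):
    """Count tokens whose POS tag starts with tag_prefix.

    pos_docs: iterable of documents; each document is a list of sentences,
    and each sentence is a list of (token, tag) pairs.
    """
    counts = Counter(
        token
        for doc in pos_docs
        for sentence in doc
        for token, tag in sentence
        if tag.startswith(tag_prefix)
    )
    return counts.most_common(top_n)

# Toy data standing in for redditTopScores['POS_sents']
toy_docs = [
    [[('I', 'PRP'), ('changed', 'VBD'), ('the', 'DT'), ('password', 'NN')],
     [('Passwords', 'NNS'), ('change', 'VBP')]],
    [[('Change', 'VB'), ('the', 'DT'), ('password', 'NN')]],
]

top_nouns = count_by_tag(toy_docs, 'NN')
top_verbs = count_by_tag(toy_docs, 'VB')
print(top_nouns)
print(top_verbs)
```

Counts are case-sensitive here, as in the original loops, so `Change` and `change` are tallied separately.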

What about the adjectives that modify the word, "computer"?

In [17]:
NTarget = 'JJ'
Word = 'computer'
NResults = set()
for entry in redditTopScores['POS_sents']:
    for sentence in entry:
        
        for (ent1, kind1),(ent2,kind2) in zip(sentence[:-1], sentence[1:]):
       
            if (kind1,ent2.lower())==(NTarget,Word):
                NResults.add(ent1)
            else:
                continue

print(NResults)     

{'own', 'unrestricted'}
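
The same adjacent-pair idea extends to a full conditional frequency table: for every noun, which adjectives immediately precede it, and how often. The sketch below is one way to do this with the standard library; `modifiers_of` and the toy sentences are hypothetical. Like the cell above, it only sees adjectives directly before the noun, so "the computer is slow" would be missed.

```python
from collections import Counter, defaultdict

def modifiers_of(pos_docs, head_prefix='NN', modifier_prefix='JJ'):
    """Map each head word to a Counter of the modifiers directly preceding it."""
    table = defaultdict(Counter)
    for doc in pos_docs:
        for sentence in doc:
            # Walk over adjacent (word, tag) pairs in the sentence
            for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
                if t1.startswith(modifier_prefix) and t2.startswith(head_prefix):
                    table[w2.lower()][w1.lower()] += 1
    return table

# Toy data standing in for a column of POS-tagged documents
toy_docs = [
    [[('my', 'PRP$'), ('own', 'JJ'), ('computer', 'NN')],
     [('An', 'DT'), ('unrestricted', 'JJ'), ('computer', 'NN'),
      ('is', 'VBZ'), ('a', 'DT'), ('bad', 'JJ'), ('idea', 'NN')]],
]

table = modifiers_of(toy_docs)
print(table['computer'].most_common())
print(table['idea'].most_common())
```

Swapping the prefixes (for example `head_prefix='VB'`, `modifier_prefix='RB'`) gives adverbs that directly precede verbs.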


## Evaluating POS tagger

We can check the POS tagger by running it on a manually tagged corpus and identifying a reasonable error metric.

In [18]:
treeBank = nltk.corpus.treebank
treeBank.tagged_sents()[0]

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [19]:
treeBank.sents()[0]

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [20]:
stanfordTags = stanford.postTagger.tag_sents(treeBank.sents()[:30])

And compare the two

In [21]:
goldTags = treeBank.tagged_sents()[:30] #Gold-standard tags for the same 30 sentences
NumDiffs = 0
for stanfordSent, goldSent in zip(stanfordTags, goldTags):
    for (word, stanfordTag), (_, goldTag) in zip(stanfordSent, goldSent):
        #Ignore '-NONE-' trace tokens, which the Stanford tagger is not expected to reproduce
        if stanfordTag != goldTag and goldTag != '-NONE-':
            print("Word: {}  \tStanford: {}\tTreebank: {}".format(word, stanfordTag, goldTag))
            NumDiffs += 1
total = sum([len(s) for s in stanfordTags])
#This is token-level accuracy (share of tokens tagged as in the Treebank), not precision
print("The accuracy is {:.3f}%".format((total-NumDiffs)/total * 100))

Word: Dutch  	Stanford: JJ	Treebank: NNP
Word: publishing  	Stanford: NN	Treebank: VBG
Word: used  	Stanford: VBD	Treebank: VBN
Word: more  	Stanford: JJR	Treebank: RBR
Word: ago  	Stanford: RB	Treebank: IN
Word: that  	Stanford: IN	Treebank: WDT
Word: later  	Stanford: RB	Treebank: JJ
Word: New  	Stanford: NNP	Treebank: JJ
Word: that  	Stanford: IN	Treebank: WDT
Word: more  	Stanford: JJR	Treebank: RBR
Word: ago  	Stanford: RB	Treebank: IN
Word: ago  	Stanford: RB	Treebank: IN
Word: replaced  	Stanford: VBD	Treebank: VBN
Word: more  	Stanford: JJR	Treebank: JJ
Word: expected  	Stanford: VBD	Treebank: VBN
Word: study  	Stanford: VBD	Treebank: VBP
Word: studied  	Stanford: VBD	Treebank: VBN
Word: industrialized  	Stanford: JJ	Treebank: VBN
Word: Lorillard  	Stanford: NNP	Treebank: NN
Word: found  	Stanford: VBD	Treebank: VBN
Word: that  	Stanford: IN	Treebank: WDT
Word: that  	Stanford: IN	Treebank: WDT
Word: rejected  	Stanford: VBD	Treebank: VBN
Word: that  	Stanford: IN	Treebank: WDT

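So we can see that the Stanford POS tagger is quite good. Nevertheless, with a per-word accuracy of about 96%, a 20-word sentence has only about a 44% chance ($.96^{20} \approx .44$) of being tagged (and later parsed) entirely correctly, assuming errors are independent across words; the chance of at least one error is $1-.96^{20} \approx .56$.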
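
The sentence-level arithmetic can be checked directly. A minimal sketch, assuming a per-token accuracy of 0.96 (the figure used in the text) and independent errors across tokens, which is a simplification:

```python
token_accuracy = 0.96   # per-token accuracy taken from the text (assumption)
sentence_length = 20    # number of tokens in the sentence

# Probability that every token in the sentence is tagged correctly
p_all_correct = token_accuracy ** sentence_length
# Probability that at least one token is tagged incorrectly
p_some_error = 1 - p_all_correct

print("All {} tokens correct: {:.3f}".format(sentence_length, p_all_correct))  # about 0.442
print("At least one error:    {:.3f}".format(p_some_error))                    # about 0.558
```

Small per-token error rates therefore compound quickly as sentences get longer.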

## <span style="color:red">*Exercise 1*</span>

<span style="color:red">In the cells immediately following, perform POS tagging on a meaningful (but modest) subset of a corpus associated with your final project. Examine the list of words associated with at least three different parts of speech. Consider conditional frequencies (e.g., adjectives associated with nouns of interest or adverbs with verbs of interest). What do these distributions suggest about your corpus?</span>

For this exercise I used a small subset of my Yelp reviews of closed restaurants, because I want to learn why they closed by analyzing the text of their reviews.

In [72]:
closed_restaurant_reviews = pandas.read_csv('../Codes_Data/data/closerestaurant_reviews.csv', sep='\t')
closed_restaurant_reviews = closed_restaurant_reviews.dropna(axis=0, how='any')
closed_restaurant_reviews_smallDF = closed_restaurant_reviews[:200].copy() #Copy the slice so that adding columns later does not trigger SettingWithCopyWarning
closed_restaurant_reviews_smallDF[:5]

Unnamed: 0,business_id,text
0,--g-a85VwrdZJNf0R95GcQ,These guys are great. \n\nExtremely friendly a...
1,--g-a85VwrdZJNf0R95GcQ,"What a wonderful surprise, this restaurant was..."
2,--g-a85VwrdZJNf0R95GcQ,Ordered takeout from the place and the food wa...
3,--g-a85VwrdZJNf0R95GcQ,fantastic to have a true Mediterranean restaur...
4,--g-a85VwrdZJNf0R95GcQ,All the food is very fresh and incredibly deli...


In [74]:
closed_restaurant_reviews_smallDF['sentences'] = closed_restaurant_reviews_smallDF['text'].apply(lambda x: [nltk.word_tokenize(s) for s in nltk.sent_tokenize(x)])
closed_restaurant_reviews_smallDF.index = range(len(closed_restaurant_reviews_smallDF) - 1, -1,-1) #Reindex to make things nice in the future
closed_restaurant_reviews_smallDF[-5:]


Unnamed: 0,business_id,text,sentences
4,-050d_XIor1NpCuWkbIVaQ,I've decided to make this one short and sweet....,"[[I, 've, decided, to, make, this, one, short,..."
3,-050d_XIor1NpCuWkbIVaQ,I really enjoyed this place when I went a few ...,"[[I, really, enjoyed, this, place, when, I, we..."
2,-050d_XIor1NpCuWkbIVaQ,Perfect if you are craving a home-style breakf...,"[[Perfect, if, you, are, craving, a, home-styl..."
1,-050d_XIor1NpCuWkbIVaQ,I've lived downtown for 5 years now and never ...,"[[I, 've, lived, downtown, for, 5, years, now,..."
0,-050d_XIor1NpCuWkbIVaQ,There were a couple of reviews for Matt's Big ...,"[[There, were, a, couple, of, reviews, for, Ma..."


In [75]:
closed_restaurant_reviews_smallDF['POS_sents'] = closed_restaurant_reviews_smallDF['sentences'].apply(lambda x: stanford.postTagger.tag_sents(x))


Let's look at the top nouns. It looks like there are a lot of breakfast places in this subset, and food is talked about often.

In [76]:
countTarget = 'NN'
targetCounts = {}
for entry in closed_restaurant_reviews_smallDF['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

[('place', 170),
 ('food', 169),
 ('breakfast', 151),
 ('wait', 119),
 ('time', 70),
 ('bacon', 69),
 (')', 59),
 ('toast', 57),
 ('(', 56),
 ('coffee', 55),
 ('service', 43),
 ('restaurant', 41),
 ('menu', 38),
 ('side', 31),
 ('everything', 31),
 ('table', 31),
 ('hour', 30),
 ('day', 30),
 ('staff', 29),
 ('home', 28)]

Since I will do sentiment analysis later, I am very interested in the adjectives. However, they are not what I expected. Because I am analyzing closed restaurants, I expected more negative words such as 'bad' or 'worst', but there aren't any among the top 20!

In [77]:
countTarget = 'JJ'
targetCounts = {}
for entry in closed_restaurant_reviews_smallDF['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

[('good', 126),
 ('great', 102),
 ('fresh', 59),
 ('small', 55),
 ('worth', 44),
 ('delicious', 36),
 ('hash', 35),
 ('little', 32),
 ('friendly', 30),
 ('local', 30),
 ('new', 29),
 ('other', 28),
 ('long', 27),
 ('sure', 27),
 ('nice', 26),
 ('first', 25),
 ('few', 25),
 ('thick', 24),
 ('tiny', 24),
 ('amazing', 24)]

I also looked at the interjections, just out of my own interest. It looks like people like to use "Oh" to express their feelings! Haha.

In [82]:
countTarget = 'UH'
targetCounts = {}
for entry in closed_restaurant_reviews_smallDF['POS_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != countTarget:
                continue
            elif ent in targetCounts:
                targetCounts[ent] += 1
            else:
                targetCounts[ent] = 1
sortedTargets = sorted(targetCounts.items(), key = lambda x: x[1], reverse = True)
sortedTargets[:20]

[('Oh', 8),
 ('oh', 6),
 ('Wow', 3),
 ('Yeah', 3),
 ('Yes', 3),
 ('Okay', 2),
 ('Well', 1),
 ('hello', 1),
 ('Hey', 1),
 ('Welcome', 1),
 ('OK', 1),
 ('yes', 1)]

What about the adjectives that modify the word "food"? Again contrary to my expectation, in this subset of the data most reviewers are not saying bad things about the food.

In [78]:
NTarget = 'JJ'
Word = 'food'
NResults = set()
for entry in closed_restaurant_reviews_smallDF['POS_sents']:
    for sentence in entry:
        
        for (ent1, kind1),(ent2,kind2) in zip(sentence[:-1], sentence[1:]):
       
            if (kind1,ent2.lower())==(NTarget,Word):
                NResults.add(ent1)
            else:
                continue

print(NResults) 

{'Outstanding', 'Good', 'fresh', 'favorite', 'exceptional', 'good', 'awesome', 'greasy', 'yummy', 'enough', 'amazing', 'Mediterranean', 'phenomenal', 'great', 'American', 'delicious', 'eastern', 'actual', 'basic', 'devoted', 'fast', 'Great'}


What about service? Still, few negative words describe the service; 'poor' is the only one here.

In [79]:
NTarget = 'JJ'
Word = 'service'
NResults = set()
for entry in closed_restaurant_reviews_smallDF['POS_sents']:
    for sentence in entry:
        
        for (ent1, kind1),(ent2,kind2) in zip(sentence[:-1], sentence[1:]):
       
            if (kind1,ent2.lower())==(NTarget,Word):
                NResults.add(ent1)
            else:
                continue

print(NResults) 

{'friendly', 'great', 'fast', 'good', 'poor', 'Friendly', 'nice', 'quick'}


# Named-Entity Recognition

Named Entity Recognition (NER) is also a classification task, which identifies named objects. Included with Stanford NER are a 4 class model trained on the CoNLL 2003 eng.train, a 7 class model trained on the MUC 6 and MUC 7 training data sets, and a 3 class model trained on both data sets plus some additional data (including ACE 2002 and limited data in-house) on the intersection of those class sets. 

**3 class**:	Location, Person, Organization

**4 class**:	Location, Person, Organization, Misc

**7 class**:	Location, Person, Organization, Money, Percent, Date, Time

These models each use distributional similarity features, which provide some performance gain at the cost of increasing their size and runtime. Also available are the same models missing those features.

(We note that the training data for the 3 class model does not include any material from the CoNLL eng.testa or eng.testb data sets, nor any of the MUC 6 or 7 test or devtest datasets, nor Alan Ritter's Twitter NER data, so all of these would be valid tests of its performance.)

First, we tag our first set of exemplary sentences:

In [22]:
classified_sents = stanford.nerTagger.tag_sents(tokenized_text)
print(classified_sents)

[[('I', 'O'), ('saw', 'O'), ('the', 'O'), ('elephant', 'O'), ('in', 'O'), ('my', 'O'), ('pajamas', 'O'), ('.', 'O')], [('The', 'O'), ('quick', 'O'), ('brown', 'O'), ('fox', 'O'), ('jumped', 'O'), ('over', 'O'), ('the', 'O'), ('lazy', 'O'), ('dog', 'O'), ('.', 'O')], [('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'), ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'), ('short-term', 'O'), ('stimulus', 'O'), ('efforts', 'O'), ('in', 'O'), ('a', 'O'), ('recent', 'O'), ('interview', 'O'), ('with', 'O'), ('the', 'O'), ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'), ('Journal', 'ORGANIZATION'), ('.', 'O')], [('Trayvon', 'PERSON'), ('Benjamin', 'PERSON'), ('Martin', 'PERSON'), ('was', 'O'), ('an', 'O'), ('African', 'O'), ('American', 'O'), ('from', 'O'), ('Miami', 'LOCATION'), ('Gardens', 'LOCATION'), (',', 'O'), ('Florida', 'LOCATION'), (',', 'O'), ('who', 'O'), (',', 'O'), ('at', 'O'), ('17', 'O'), ('years', 'O'), ('old', 'O'), (',', 'O'), ('was', 'O'), 
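
The tagger labels one token at a time, so a multi-word name such as "Wall Street Journal" comes out as a run of three `ORGANIZATION` tokens. A minimal sketch for merging such runs with `itertools.groupby` follows; `merge_entities` is a hypothetical helper, and the sentence is an abridged copy of the third example above. Because the output has no begin/inside markers, two distinct entities of the same type that sit directly next to each other would be merged into one.

```python
from itertools import groupby

def merge_entities(tagged_sentence, outside_tag='O'):
    """Join consecutive tokens that share a non-'O' tag into one entity."""
    entities = []
    for kind, group in groupby(tagged_sentence, key=lambda pair: pair[1]):
        if kind == outside_tag:
            continue  # skip runs of tokens that are not part of any entity
        entities.append((' '.join(token for token, _ in group), kind))
    return entities

# Abridged version of the tagged third example sentence
sentence = [('While', 'O'), ('in', 'O'), ('France', 'LOCATION'), (',', 'O'),
            ('Christine', 'PERSON'), ('Lagarde', 'PERSON'), ('discussed', 'O'),
            ('efforts', 'O'), ('with', 'O'), ('the', 'O'),
            ('Wall', 'ORGANIZATION'), ('Street', 'ORGANIZATION'),
            ('Journal', 'ORGANIZATION'), ('.', 'O')]

entities = merge_entities(sentence)
print(entities)
```

Counting merged entities rather than single tokens avoids splitting "Wall Street Journal" into three separate tallies.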

We can also run NER over our entire corpus:

In [23]:
redditTopScores['classified_sents'] = redditTopScores['sentences'].apply(lambda x: stanford.nerTagger.tag_sents(x))

In [24]:
redditTopScores['classified_sents']

9    [[(Last, O), (year, O), (,, O), (Help, O), (De...
8    [[(First, O), (post, O), (in, O), (quite, O), ...
7    [[([, O), (Original, O), (Post, O), (], O), ((...
6    [[(I, O), (witnessed, O), (this, O), (astoundi...
5    [[(I, O), (work, O), (Helpdesk, ORGANIZATION),...
4    [[(This, O), (just, O), (happened, O), (..., O...
3    [[(Another, O), (tale, O), (from, O), (the, O)...
2    [[([, O), (Part, O), (1, O), (], O), ((, O), (...
1    [[(>, O), ($, O), (Me, O), (-, O), (Hello, O),...
0    [[(So, O), (my, O), (story, O), (starts, O), (...
Name: classified_sents, dtype: object

Find the most common tokens regardless of their tag (which are, of course, boring):

In [25]:
entityCounts = {}
for entry in redditTopScores['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if ent in entityCounts:
                entityCounts[ent] += 1
            else:
                entityCounts[ent] = 1
sortedEntities = sorted(entityCounts.items(), key = lambda x: x[1], reverse = True)
sortedEntities[:10]

[('.', 401),
 ('I', 245),
 ('the', 226),
 (',', 205),
 ('to', 197),
 ('a', 143),
 ('and', 135),
 ('>', 106),
 ('you', 102),
 ('of', 97)]

Or those occurring only twice:

In [26]:
[x[0] for x in sortedEntities if x[1] == 2]

['hear',
 'step',
 'allowed',
 'request',
 'try',
 'ask',
 'stronger',
 'academic',
 'busy',
 'operate',
 'sharing',
 'issues',
 'year',
 'visit',
 'others',
 'videos',
 'idea',
 'Here',
 'myself',
 'believe',
 'bad',
 'immediately',
 'box',
 'asset',
 'done',
 'received',
 'lunch',
 'earlier',
 'shaking',
 'DVD',
 'used',
 'CEO',
 'Ca',
 'Everything',
 '5',
 'business',
 'watched',
 'future',
 'anymore',
 'helped',
 'making',
 'key',
 'taking',
 'small',
 'since',
 'THIS',
 'drowned',
 'cry',
 'Steve',
 'login',
 'bitter',
 'completely',
 'supervisor',
 'web',
 'its',
 'Everyone',
 'needed',
 'Of',
 'terrible',
 "'P4ssword",
 'Things',
 'four',
 'point',
 'tears',
 'search',
 'Then',
 'generate',
 'forwarded',
 'mistakes',
 'store',
 'shortcut',
 'building',
 'XYZ',
 'pointed',
 'hit',
 'organization',
 'stand',
 'information',
 'gildings',
 'themselves',
 'mouse',
 'pay',
 'holiday',
 'wrong',
 'often',
 '17',
 'reply',
 'There',
 '*Note',
 'passed',
 'BING',
 'guide',
 'S',
 'nasty'

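We could also list the most common tokens that received a tag other than `O` (the "non-objects" in the variable names below), i.e., the tokens the tagger treats as part of a named entity. (We note that we're not graphing these because there are so few here.)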

In [27]:
nonObjCounts = {}
for entry in redditTopScores['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind == 'O':
                continue
            elif ent in nonObjCounts:
                nonObjCounts[ent] += 1
            else:
                nonObjCounts[ent] = 1
sortedNonObj = sorted(nonObjCounts.items(), key = lambda x: x[1], reverse = True)
sortedNonObj[:10]

[('Jack', 17),
 ('Google', 6),
 ('Smith', 5),
 ('Steve', 2),
 ('GOOGLE', 1),
 ('UK', 1),
 ('Buzzfeed', 1),
 ('Citrix', 1),
 ('Clickity', 1),
 ('Reddit', 1)]

What about the Organizations?

In [28]:
OrgCounts = {}
for entry in redditTopScores['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != 'ORGANIZATION':
                continue
            elif ent in OrgCounts:
                OrgCounts[ent] += 1
            else:
                OrgCounts[ent] = 1
sortedOrgs = sorted(OrgCounts.items(), key = lambda x: x[1], reverse = True)
sortedOrgs[:10]

[('Google', 6), ('GOOGLE', 1), ('CMD', 1), ('Helpdesk', 1), ('Citrix', 1)]

These, of course, have much smaller counts.
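
Rather than writing one loop per entity type, a single pass can tally how many tokens fall under each type, which also lists every kind of entity the tagger produced. A minimal sketch, with `entity_type_counts` and the toy data as hypothetical stand-ins for the `classified_sents` column:

```python
from collections import Counter

def entity_type_counts(ner_docs, outside_tag='O'):
    """Count tagged tokens per entity type, ignoring the outside tag."""
    return Counter(
        kind
        for doc in ner_docs
        for sentence in doc
        for _, kind in sentence
        if kind != outside_tag
    )

# Toy data standing in for redditTopScores['classified_sents']
toy_docs = [
    [[('Jack', 'PERSON'), ('called', 'O'), ('Google', 'ORGANIZATION')],
     [('He', 'O'), ('lives', 'O'), ('in', 'O'), ('the', 'O'), ('UK', 'LOCATION')]],
    [[('Jack', 'PERSON'), ('uses', 'O'), ('Citrix', 'ORGANIZATION')]],
]

type_counts = entity_type_counts(toy_docs)
print(type_counts.most_common())
```

These are token counts, so a three-word organization name contributes three to the `ORGANIZATION` total.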

## <span style="color:red">*Exercise 2*</span>

<span style="color:red">In the cells immediately following, perform NER on a (modest) subset of your corpus of interest. List all of the different kinds of entities tagged. What does their distribution suggest about the focus of your corpus? For a subset of your corpus, tally at least one type of named entity and calculate the Precision, Recall and F-score for the NER classification just performed (using your own hand-codings as "ground truth").</span>

In [83]:
closed_restaurant_reviews_smallDF['classified_sents'] = closed_restaurant_reviews_smallDF['sentences'].apply(lambda x: stanford.nerTagger.tag_sents(x))


In [84]:
closed_restaurant_reviews_smallDF['classified_sents'][:10]

199    [[(These, O), (guys, O), (are, O), (great, O),...
198    [[(What, O), (a, O), (wonderful, O), (surprise...
197    [[(Ordered, O), (takeout, O), (from, O), (the,...
196    [[(fantastic, O), (to, O), (have, O), (a, O), ...
195    [[(All, O), (the, O), (food, O), (is, O), (ver...
194    [[(super, O), (fresh, O), (food..great, O), (p...
193    [[(Great, O), (Food, O), (!, O)], [(Good, O), ...
192    [[(This, O), (probably, O), (one, O), (of, O),...
191    [[(My, O), (new, O), (favorite, O), (Mediterra...
190    [[(I, O), (saw, O), (a, O), (sign, O), (for, O...
Name: classified_sents, dtype: object

First of all, find the most common (and boring) tokens, regardless of their tag:

In [85]:
entityCounts = {}
for entry in closed_restaurant_reviews_smallDF['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if ent in entityCounts:
                entityCounts[ent] += 1
            else:
                entityCounts[ent] = 1
sortedEntities = sorted(entityCounts.items(), key = lambda x: x[1], reverse = True)
sortedEntities[:10]

[('.', 1494),
 (',', 1158),
 ('the', 1126),
 ('and', 908),
 ('I', 763),
 ('a', 658),
 ('to', 540),
 ('was', 433),
 ('is', 393),
 ('of', 376)]

In [86]:
nonObjCounts = {}
for entry in closed_restaurant_reviews_smallDF['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind == 'O':
                continue
            elif ent in nonObjCounts:
                nonObjCounts[ent] += 1
            else:
                nonObjCounts[ent] = 1
sortedNonObj = sorted(nonObjCounts.items(), key = lambda x: x[1], reverse = True)
sortedNonObj[:10]

[('Matt', 116),
 ('Phoenix', 47),
 ('Denny', 10),
 ('Julia', 8),
 ('Food', 8),
 ('Guy', 8),
 ('Network', 8),
 ('Chick', 7),
 ('Matts', 6),
 ('IHOP', 5)]

Some people's names come up in the reviews; my guess is that they are the names of the reviewers' friends.

In [222]:
PERSONCounts = {}
for entry in closed_restaurant_reviews_smallDF['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != 'PERSON':
                continue
            elif ent in PERSONCounts:
                PERSONCounts[ent] += 1
            else:
                PERSONCounts[ent] = 1
sortedPerson = sorted(PERSONCounts.items(), key = lambda x: x[1], reverse = True)
sortedPerson[:10]

[('Matt', 115),
 ('Denny', 10),
 ('Julia', 8),
 ('Guy', 8),
 ('Matts', 5),
 ('Fieri', 5),
 ('Bacon', 4),
 ('Reuben', 2),
 ('Dottsy', 2),
 ('Chick', 2)]

In [88]:
OrgCounts = {}
for entry in closed_restaurant_reviews_smallDF['classified_sents']:
    for sentence in entry:
        for ent, kind in sentence:
            if kind != 'ORGANIZATION':
                continue
            elif ent in OrgCounts:
                OrgCounts[ent] += 1
            else:
                OrgCounts[ent] = 1
sortedOrgs = sorted(OrgCounts.items(), key = lambda x: x[1], reverse = True)
sortedOrgs[:10]

[('Network', 8),
 ('Food', 8),
 ('Chick', 5),
 ('IHOP', 5),
 ('OJ', 3),
 ('Creek', 2),
 ('Kabab', 2),
 ('&', 2),
 ('Coffee', 2),
 ('MBB', 2)]
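The four counting loops above differ only in which tag they keep. A minimal sketch of one reusable helper, assuming each review is a list of sentences and each sentence a list of `(token, tag)` pairs as in `classified_sents`. The name `count_entities` and the toy data are hypothetical and not part of `lucem_illud`:

```python
from collections import Counter

def count_entities(reviews, keep=None):
    """Count tagged tokens across reviews, optionally keeping only some NER tags.

    reviews: iterable of reviews; each review is a list of sentences,
             each sentence a list of (token, tag) pairs.
    keep:    None to count every non-'O' token, or a set of tags to keep.
    """
    counts = Counter()
    for review in reviews:
        for sentence in review:
            for token, tag in sentence:
                if tag == 'O':
                    continue  # skip tokens that are not named entities
                if keep is None or tag in keep:
                    counts[token] += 1
    return counts

# Toy data in the same shape as the 'classified_sents' column (illustrative only)
toy_reviews = [
    [[('Matt', 'PERSON'), ('makes', 'O'), ('breakfast', 'O')],
     [('IHOP', 'ORGANIZATION'), ('is', 'O'), ('worse', 'O')]],
    [[('Matt', 'PERSON'), ('lives', 'O'), ('in', 'O'), ('Phoenix', 'LOCATION')]],
]

all_counts = count_entities(toy_reviews)
person_counts = count_entities(toy_reviews, keep={'PERSON'})
print(all_counts.most_common(10))
print(person_counts.most_common(10))
```

Under these assumptions, calling `count_entities(closed_restaurant_reviews_smallDF['classified_sents'], keep={'ORGANIZATION'}).most_common(10)` should give the same kind of ranking as the organization cell above.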

Now I pick one review for the three measurements (precision, recall and F-1). First, though, I should say that this kind of information is not central to my interests: I don't care much about the organization or person names in the reviews, and they are not mentioned frequently anyway.

This is the NER classification for the sentence I picked. As I just mentioned, very few of its words are tagged as anything other than 'O'.

In [239]:
# One row per (token, NER tag) pair in the fifth sentence of the first review
review_hand_codingDF = pandas.DataFrame(
    list(closed_restaurant_reviews_smallDF['classified_sents'][0][4]),
    columns=['Entity', 'Kind'])
review_hand_codingDF

Unnamed: 0,Entity,Kind
0,Otherwise,O
1,",",O
2,they,O
3,pour,O
4,pre-scrambled,O
5,eggs,O
6,(,O
7,as,O
8,does,O
9,IHOP,ORGANIZATION


And this is the hand coding part.

In [240]:
hand_coding = ['O','O','O','O','O','O','O','O','O','ORGANIZATION','O','O','O','O','O']
A = pandas.DataFrame({'Hand_coding':hand_coding})
A

Unnamed: 0,Hand_coding
0,O
1,O
2,O
3,O
4,O
5,O
6,O
7,O
8,O
9,ORGANIZATION


In [241]:
review_hand_codingDF= pandas.concat([review_hand_codingDF,A],axis=1)
review_hand_codingDF

Unnamed: 0,Entity,Kind,Hand_coding
0,Otherwise,O,O
1,",",O,O
2,they,O,O
3,pour,O,O
4,pre-scrambled,O,O
5,eggs,O,O
6,(,O,O
7,as,O,O
8,does,O,O
9,IHOP,ORGANIZATION,ORGANIZATION


In [246]:
# sklearn metrics take (y_true, y_pred): the hand coding is the ground truth, the tagger output is the prediction
y_true, y_pred = review_hand_codingDF['Hand_coding'], review_hand_codingDF['Kind']
print('Precision score is:', sklearn.metrics.precision_score(y_true, y_pred, labels = ['ORGANIZATION'], average = 'micro')) #precision
print('Recall score is:', sklearn.metrics.recall_score(y_true, y_pred, labels = ['ORGANIZATION'], average = 'micro')) #recall
print('F-1 measure score is:', sklearn.metrics.f1_score(y_true, y_pred, labels = ['ORGANIZATION'], average = 'micro')) #F-1 measure

Precision score is: 1.0
Recall score is: 1.0
F-1 measure score is: 1.0
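To make explicit what the three scores measure, here is a minimal sketch that computes them for a single label from first principles, with the hand coding as ground truth and the tagger output as prediction. The function name `prf_for_label` and the toy tag lists are assumptions for illustration; the `noisy` list contains deliberate errors so that the scores fall below 1.

```python
def prf_for_label(y_true, y_pred, label):
    """Precision, recall and F1 for one label, treating y_true as ground truth."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly tagged
    fp = sum(1 for t, p in pairs if t != label and p == label)  # tagged but wrong
    fn = sum(1 for t, p in pairs if t == label and p != label)  # missed by the tagger
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy tag sequences (illustrative only, not taken from the reviews)
hand = ['O', 'O', 'ORGANIZATION', 'O', 'ORGANIZATION']
perfect = list(hand)
noisy = ['O', 'ORGANIZATION', 'ORGANIZATION', 'O', 'O']

perfect_scores = prf_for_label(hand, perfect, 'ORGANIZATION')
noisy_scores = prf_for_label(hand, noisy, 'ORGANIZATION')
print(perfect_scores)
print(noisy_scores)
```

A sentence with a single ORGANIZATION token can only yield scores of 0 or 1, so a perfect result here is weak evidence; hand coding more sentences would give a more informative estimate.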


# Parsing

Here we will introduce the Stanford Parser by feeding it tokenized text from our initial example sentences. The parser is a dependency parser, but this initial program outputs a simple, self-explanatory phrase-structure representation.

In [29]:
parses = list(stanford.parser.parse_sents(tokenized_text)) #Converting the iterator to a list so we can call by index. They are still 
fourthSentParseTree = list(parses[3]) #iterators so be careful about re-running code, without re-running this block
print(fourthSentParseTree)

[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NNP', ['Trayvon']), Tree('NNP', ['Benjamin']), Tree('NNP', ['Martin'])]), Tree('VP', [Tree('VBD', ['was']), Tree('NP', [Tree('NP', [Tree('DT', ['an']), Tree('NNP', ['African']), Tree('NNP', ['American'])]), Tree('PP', [Tree('IN', ['from']), Tree('NP', [Tree('NP', [Tree('NNP', ['Miami']), Tree('NNPS', ['Gardens'])]), Tree(',', [',']), Tree('NP', [Tree('NNP', ['Florida'])]), Tree(',', [',']), Tree('SBAR', [Tree('WHNP', [Tree('WP', ['who'])]), Tree('S', [Tree(',', [',']), Tree('PP', [Tree('IN', ['at']), Tree('ADJP', [Tree('NP', [Tree('CD', ['17']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])])]), Tree(',', [',']), Tree('VP', [Tree('VBD', ['was']), Tree('ADVP', [Tree('RB', ['fatally'])]), Tree('VP', [Tree('VBN', ['shot']), Tree('PP', [Tree('IN', ['by']), Tree('NP', [Tree('NP', [Tree('NNP', ['George']), Tree('NNP', ['Zimmerman'])]), Tree(',', [',']), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['neighborhood']), Tree('NN', ['watch']), Tree('NN', [

Trees are a common data structure and there are a large number of things to do with them. What we are interested in is the relationship between different parts of speech:

In [30]:
def treeRelation(parsetree, relationType, *targets):
    if isinstance(parsetree, list):
        parsetree = parsetree[0]
    if set(targets) & set(parsetree.leaves()) != set(targets):
        return []
    else:
        retList = []
        for subT in parsetree.subtrees():
            if subT.label() == relationType:
                if set(targets) & set(subT.leaves()) == set(targets):
                    retList.append([(subT.label(), ' '.join(subT.leaves()))])
    return retList

In [31]:
def treeSubRelation(parsetree, relationTypeScope, relationTypeTarget, *targets):
    if isinstance(parsetree, list):
        parsetree = parsetree[0]
    if set(targets) & set(parsetree.leaves()) != set(targets):
        return []
    else:
        retSet = set()
        for subT in parsetree.subtrees():
            if set(targets) & set(subT.leaves()) == set(targets):
                if subT.label() == relationTypeScope:
                    for subsub in subT.subtrees():
                        if subsub.label()==relationTypeTarget:
                            retSet.add(' '.join(subsub.leaves()))
    return retSet

In [32]:
treeRelation(fourthSentParseTree, 'NP', 'Florida', 'who')

[[('NP',
   'an African American from Miami Gardens , Florida , who , at 17 years old , was fatally shot by George Zimmerman , a neighborhood watch volunteer , in Sanford , Florida')],
 [('NP',
   'Miami Gardens , Florida , who , at 17 years old , was fatally shot by George Zimmerman , a neighborhood watch volunteer , in Sanford , Florida')]]

Notice that Florida occurs twice in two different nested noun phrases in the sentence. 

We can also find all of the verbs within the noun phrase defined by one or more target words:

In [33]:
treeSubRelation(fourthSentParseTree, 'NP', 'VBN', 'Florida', 'who')

{'shot'}

Or, if we want to look at the whole tree:

In [34]:
fourthSentParseTree[0].pretty_print()

                                                                                                                   ROOT                                                                                                                       
                                                                                                                    |                                                                                                                          
                                                                                                                    S                                                                                                                         
            ________________________________________________________________________________________________________|_______________________________________________________________________________________________________________________   
           |                       VP     

Or another sentence

In [35]:
list(parses[1])[0].pretty_print()

                     ROOT                           
                      |                              
                      S                             
       _______________|___________________________   
      |                          VP               | 
      |                __________|___             |  
      |               |              PP           | 
      |               |      ________|___         |  
      NP              |     |            NP       | 
  ____|__________     |     |     _______|____    |  
 DT   JJ    JJ   NN  VBD    IN   DT      JJ   NN  . 
 |    |     |    |    |     |    |       |    |   |  
The quick brown fox jumped over the     lazy dog  . 



## Dependency parsing and graph representations

Dependency parsing was developed to robustly capture linguistic dependencies from text. The complex tags associated with these parses are detailed [here](http://universaldependencies.org/u/overview/syntax.html). When parsing with the dependency parser, we will work directly from the untokenized text. Note that no *processing* takes place before parsing sentences--we do not remove so-called stop words or anything that plays a syntactic role in the sentence, although anaphora resolution and related normalization may be performed before or after parsing to enhance the value of information extraction. 

In [36]:
depParses = list(stanford.depParser.raw_parse_sents(text)) #Converting the iterator to a list so we can call by index. They are still 
secondSentDepParseTree = list(depParses[1])[0] #iterators so be careful about re-running code, without re-running this block
print(secondSentDepParseTree)

defaultdict(<function DependencyGraph.__init__.<locals>.<lambda> at 0x10e58ec80>,
            {0: {'address': 0,
                 'ctag': 'TOP',
                 'deps': defaultdict(<class 'list'>, {'root': [5]}),
                 'feats': None,
                 'head': None,
                 'lemma': None,
                 'rel': None,
                 'tag': 'TOP',
                 'word': None},
             1: {'address': 1,
                 'ctag': 'DT',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 4,
                 'lemma': '_',
                 'rel': 'det',
                 'tag': 'DT',
                 'word': 'The'},
             2: {'address': 2,
                 'ctag': 'JJ',
                 'deps': defaultdict(<class 'list'>, {}),
                 'feats': '_',
                 'head': 4,
                 'lemma': '_',
                 'rel': 'amod',
                 'tag': 'JJ',
                 'word
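The node dictionary printed above can be walked directly, which is one way to traverse the tree and pull out elements that are near one another. The sketch below uses a hand-written dictionary with the same `word`, `head` and `rel` keys for the fox sentence. Only the entries for "The" and "quick" and the root pointing at address 5 come from the printed output; the remaining heads and relations are my assumption about a plausible parse, not parser output, and the helper names are hypothetical.

```python
# Hand-written stand-in for the parse's node dictionary (address -> node)
nodes = {
    0: {'word': None, 'head': None, 'rel': None},      # artificial root
    1: {'word': 'The', 'head': 4, 'rel': 'det'},       # from the output above
    2: {'word': 'quick', 'head': 4, 'rel': 'amod'},    # from the output above
    3: {'word': 'brown', 'head': 4, 'rel': 'amod'},    # assumed
    4: {'word': 'fox', 'head': 5, 'rel': 'nsubj'},     # assumed
    5: {'word': 'jumped', 'head': 0, 'rel': 'root'},   # root address from the output
    6: {'word': 'over', 'head': 9, 'rel': 'case'},     # assumed
    7: {'word': 'the', 'head': 9, 'rel': 'det'},       # assumed
    8: {'word': 'lazy', 'head': 9, 'rel': 'amod'},     # assumed
    9: {'word': 'dog', 'head': 5, 'rel': 'nmod'},      # assumed
    10: {'word': '.', 'head': 5, 'rel': 'punct'},      # assumed
}

def dependents(nodes, address, rel=None):
    """Words whose head is the node at `address`, optionally filtered by relation."""
    return [node['word'] for addr, node in sorted(nodes.items())
            if node['head'] == address and (rel is None or node['rel'] == rel)]

def depth(nodes, address):
    """Number of head links between a node and the artificial root (address 0)."""
    steps = 0
    while nodes[address]['head'] is not None:
        address = nodes[address]['head']
        steps += 1
    return steps

fox_modifiers = dependents(nodes, 4, rel='amod')        # adjectives attached to "fox"
max_depth = max(depth(nodes, addr) for addr in nodes)   # nesting of the whole parse
print(fox_modifiers, max_depth)
```

The same two helpers could be pointed at a real parse's node dictionary, assuming it keeps the keys shown in the printed output.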

This is a graph, and we can convert it to a dot file and use that to visualize it. Try traversing the tree and extracting elements that are near one another. Note that unless you have graphviz successfully installed on your computer (which is not necessary to complete this homework), the following graphviz call will trigger an error. If you are interested in installing graphviz and are working on a Mac, consider installing it through [homebrew](https://brew.sh), a package manager (i.e., with the command "brew install graphviz", once brew is installed). 

In [37]:
try:
    secondSentGraph = graphviz.Source(secondSentDepParseTree.to_dot())
except Exception:
    secondSentGraph = None
    print("There was a problem with graphviz, likely you're missing the program, https://www.graphviz.org/download/")
secondSentGraph

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0x10ed87dd8>

Or another sentence 

In [38]:
try:
    graph = graphviz.Source(list(depParses[3])[0].to_dot())
except IndexError:
    print("You likely have to rerun the depParses")
    raise
except Exception:
    graph = None
    print("There was a problem with graphviz, likely you're missing the program, https://www.graphviz.org/download/")
graph

ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0x10ed7b898>

We can also do a dependency parse on the reddit sentences:

In [39]:
topPostDepParse = list(stanford.depParser.parse_sents(redditTopScores['sentences'][0]))

This takes a few seconds, but now let's look at the parse tree from one of the processed sentences.

The sentence is:

In [40]:
targetSentence = 7
print(' '.join(redditTopScores['sentences'][0][targetSentence]))

So anyway , I get a call from an older gentleman who 's quite bitter and mean right off the bat ( does n't like that I asked for his address / telephone number to verify the account , hates that he has to speak with a machine before reaching an agent , etc . ) .


Which leads to a very rich dependency tree:

In [41]:
try:
    graph = graphviz.Source(list(topPostDepParse[targetSentence])[0].to_dot())
except IndexError:
    print("You likely have to rerun the depParses")
    raise
except Exception:
    graph = None
    print("There was a problem with graphviz, likely you're missing the program, https://www.graphviz.org/download/")
graph


ExecutableNotFound: failed to execute ['dot', '-Tsvg'], make sure the Graphviz executables are on your systems' PATH

<graphviz.files.Source at 0x10ede84a8>

## <span style="color:red">*Exercise 3*</span>

<span style="color:red">In the cells immediately following, parse a (modest) subset of your corpus of interest. How deep are the phrase structure and dependency parse trees nested? How does parse depth relate to perceived sentence complexity? What are five things you can extract from these parses for subsequent analysis? (e.g., nouns collocated in a noun phrase; adjectives that modify a noun; etc.) Capture these sets of things for a focal set of words (e.g., "Bush", "Obama", "Trump"). What do they reveal about the roles that these entities are perceived to play in the social world inscribed by your texts?

For this exercise I used just 10 reviews to speed up my analysis.

In [139]:
ten_reviews = closed_restaurant_reviews_smallDF[:10]['text'].tolist()
tokenized_ten_reviews = [nltk.word_tokenize(t) for t in ten_reviews]


In [140]:
review_parses = list(stanford.parser.parse_sents(tokenized_ten_reviews)) #Converting the iterator to a list so we can call by index

In [141]:
tokenized_ten_reviews[0]

['These',
 'guys',
 'are',
 'great',
 '.',
 '\\n\\nExtremely',
 'friendly',
 'and',
 'nice',
 'service',
 '.',
 'Food',
 'is',
 'perfect',
 ',',
 'I',
 'wish',
 'they',
 'would',
 'dim',
 'their',
 'lights',
 'and',
 'make',
 'more',
 'cosy',
 'atmosphere',
 '.',
 'The',
 'food',
 'sure',
 'deserves',
 'it',
 '.',
 '\\n\\nAnyway',
 ',',
 'this',
 'restaurant',
 'well',
 'deserves',
 'full',
 'five',
 'stars',
 '.',
 '\\n\\nFood',
 'is',
 'great',
 '.',
 '\\nService',
 'is',
 'careful',
 ',',
 'accommodating',
 'and',
 'friendly',
 '.',
 '\\n\\nEveryone',
 'should',
 'try',
 'it',
 'out']

In [142]:
review_FirstSentParseTree = list(review_parses[0]) #iterators so be careful about re-running code, without re-running this block
print(review_FirstSentParseTree)

[Tree('ROOT', [Tree('S', [Tree('S', [Tree('NP', [Tree('DT', ['These']), Tree('NNS', ['guys'])]), Tree('VP', [Tree('VBP', ['are']), Tree('NP', [Tree('NP', [Tree('NP', [Tree('JJ', ['great'])]), Tree('.', ['.']), Tree('NP', [Tree('NN', ['\\n\\nExtremely']), Tree('JJ', ['friendly']), Tree('CC', ['and']), Tree('JJ', ['nice']), Tree('NN', ['service'])]), Tree('.', ['.'])]), Tree('SBAR', [Tree('S', [Tree('NP', [Tree('NNP', ['Food'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('JJ', ['perfect'])])])])])])])]), Tree(',', [',']), Tree('S', [Tree('NP', [Tree('PRP', ['I'])]), Tree('VP', [Tree('VBP', ['wish']), Tree('SBAR', [Tree('S', [Tree('NP', [Tree('PRP', ['they'])]), Tree('VP', [Tree('MD', ['would']), Tree('VP', [Tree('VP', [Tree('VB', ['dim']), Tree('NP', [Tree('PRP$', ['their']), Tree('NNS', ['lights'])])]), Tree('CC', ['and']), Tree('VP', [Tree('VB', ['make']), Tree('NP', [Tree('NP', [Tree('JJR', ['more'])]), Tree('FRAG', [Tree('X', [Tree('ADVP', [Tree('RB', ['cosy'])]), Tree('NP

We can see that 'service' sits inside three nested noun phrases in the sentence.

In [149]:
treeRelation(review_FirstSentParseTree, 'NP', 'service')

[[('NP',
   'great . \\n\\nExtremely friendly and nice service . Food is perfect')],
 [('NP', 'great . \\n\\nExtremely friendly and nice service .')],
 [('NP', '\\n\\nExtremely friendly and nice service')]]

We can see that 'atmosphere' sits inside two nested noun phrases in the sentence.

In [158]:
treeRelation(review_FirstSentParseTree, 'NP','atmosphere')

[[('NP', 'more cosy atmosphere . The food sure deserves it .')],
 [('NP', 'atmosphere')]]

We can see that 'Food' sits inside two nested noun phrases in the sentence.

In [159]:
treeRelation(review_FirstSentParseTree, 'NP','Food')

[[('NP',
   'great . \\n\\nExtremely friendly and nice service . Food is perfect')],
 [('NP', 'Food')]]

Let's see what adjectives are around the word "Food". As before, the words are positive rather than negative :(

In [160]:
treeSubRelation(review_FirstSentParseTree, 'NP','JJ', 'Food')

{'friendly', 'great', 'nice', 'perfect'}

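Let's see what adjectives are around the word "atmosphere". Nothing: in the parse above, "more" is tagged JJR and "cosy" is tagged RB, so no word in its noun phrases carries the JJ tag.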

In [162]:
treeSubRelation(review_FirstSentParseTree, 'NP','JJ', 'atmosphere')

set()

Let's see what adjectives are around the word "service". The set is identical to the one for "Food" because both words share the same outermost noun phrase, and again the words are positive rather than negative :(

In [161]:
treeSubRelation(review_FirstSentParseTree, 'NP', 'JJ', 'service')

{'friendly', 'great', 'nice', 'perfect'}

Now let's see how deep the tree is. It is deeper than the example given by the instructor, which makes sense because my sentence is more complicated than the example sentence. 
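Depth can also be measured rather than judged by eye. A minimal sketch, assuming the tree is available as a bracketed string (which is what `str()` of an NLTK tree produces). `bracket_depth` is a hypothetical helper, and both bracketed strings are transcribed by hand from parses printed earlier in this notebook.

```python
def bracket_depth(bracketed):
    """Maximum nesting depth of parentheses in a Penn-Treebank-style string."""
    depth = deepest = 0
    for ch in bracketed:
        if ch == '(':
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ')':
            depth -= 1
    return deepest

# The fox sentence, transcribed from its printed phrase-structure tree
fox = ("(ROOT (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
       "(VP (VBD jumped) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog)))) (. .)))")

# The clause "Food is perfect", wrapped in ROOT for comparison
food = "(ROOT (S (NP (NNP Food)) (VP (VBZ is) (ADJP (JJ perfect)))))"

fox_depth = bracket_depth(fox)
food_depth = bracket_depth(food)
print(fox_depth, food_depth)
```

NLTK trees also expose a `height()` method, which counts the leaf level as well and so reports one more than this helper. Tokens that are themselves parentheses would inflate the count unless they are escaped first.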

In [148]:
review_FirstSentParseTree[0].pretty_print()

                                                                                                                                                                                            ROOT                                                                                                                                                                                                                             
                                                                                                                                                                                             |                                                                                                                                                                                                                                
                                                                                                                                                                           

# Information extraction

Information extraction approaches typically (as here, with Stanford's Open IE engine) ride atop the dependency parse of a sentence. They are a pre-coded example of the type of analysis performed in the prior section. 

In [42]:
ieDF = stanford.openIE(text)

Starting OpenIE run
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.1 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 14.682 (s)
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [16.4 sec].
[main] INFO edu.stanford.nlp.pip

`openIE()` prints everything Stanford CoreNLP produces, and we can see from the log that initializing the dependency parser takes most of the time, so calling the function will always take at least 12 seconds.

In [43]:
ieDF

Unnamed: 0,certainty,subject,verb,object
0,1.0,elephant,is in,my pajamas
1,1.0,I,saw,elephant in my pajamas
2,1.0,I,saw,elephant
3,1.0,quick brown fox,jumped over,lazy dog
4,1.0,quick brown fox,jumped over,dog
5,1.0,quick fox,jumped over,dog
6,1.0,fox,jumped over,dog
7,1.0,brown fox,jumped over,lazy dog
8,1.0,brown fox,jumped over,dog
9,1.0,quick fox,jumped over,lazy dog


No buffalos (because there were no verbs), but the rest is somewhat promising. Note, however, that it abandoned the key theme of the sentence about the tragic Trayvon Martin death ("fatally shot"), likely because it was buried so deeply within the complex phrase structure. This is obviously a challenge. 

## <span style="color:red">*Exercise 4*</span>

<span style="color:red">How would you extract relevant information about the Trayvon Martin sentence directly from the dependency parse (above)? Code an example here. (For instance, what compound nouns show up with what verb phrases within the sentence?) How could these approaches inform your research project?

In [250]:
treeSubRelation(fourthSentParseTree, 'NP', 'VBN')

{'shot'}

The phrase-structure parse (above) does recover the key theme of the sentence about the tragic Trayvon Martin death, which is "shot". That leaves me with doubts about OpenIE. In my research project, I had better be careful with OpenIE and choose parsers carefully. 
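The cell above queries `fourthSentParseTree`, which is the phrase-structure parse. To work from dependency relations instead, one option is to look for a passive subject and a "by" agent attached to the same verb. The sketch below uses hand-labelled triples for the simplified sentence "Martin was fatally shot by Zimmerman", in the `((head, tag), relation, (dependent, tag))` shape that NLTK's `DependencyGraph.triples()` yields. The relation labels are my assumption, not parser output, and `passive_events` is a hypothetical helper.

```python
# Hand-labelled triples for "Martin was fatally shot by Zimmerman" (assumed labels)
triples = [
    (('shot', 'VBN'), 'nsubjpass', ('Martin', 'NNP')),
    (('shot', 'VBN'), 'auxpass', ('was', 'VBD')),
    (('shot', 'VBN'), 'advmod', ('fatally', 'RB')),
    (('shot', 'VBN'), 'nmod', ('Zimmerman', 'NNP')),
    (('Zimmerman', 'NNP'), 'case', ('by', 'IN')),
]

def passive_events(triples):
    """Return (patient, verb, agent) for each verb that has a passive subject."""
    events = []
    for (head, _), rel, (dep, _) in triples:
        if rel != 'nsubjpass':
            continue
        # An agent is an nmod dependent of the same verb that is introduced by "by"
        agents = [d for (h, _), r, (d, _) in triples
                  if h == head and r == 'nmod'
                  and any(h2 == d and r2 == 'case' and d2.lower() == 'by'
                          for (h2, _), r2, (d2, _) in triples)]
        events.append((dep, head, agents[0] if agents else None))
    return events

events = passive_events(triples)
print(events)
```

Matching on word forms rather than token addresses is a simplification; a sentence that repeats a word would need the addresses from the node dictionary instead.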

We can also look for subject, verb, object triples in one of the Reddit stories.

In [44]:
ieDF = stanford.openIE(redditTopScores['text'][0])

Starting OpenIE run
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.0 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 15.02 (s)
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [16.7 sec].
[main] INFO edu.stanford.nlp.pipe

In [45]:
ieDF

Unnamed: 0,certainty,subject,verb,object
0,1.000000,we,'ll get,calls
1,1.000000,we,Quite often 'll get,calls
2,1.000000,we,often 'll get,calls
3,0.831036,we,coax,direct to TV
4,0.774359,straight analog cable,coax,direct from wall
5,0.774359,analog cable,coax,direct from wall to TV
6,0.774359,straight analog cable,coax,direct to TV
7,1.000000,we,would supply analog cable to,homes
8,0.831036,we,coax,direct from wall
9,0.774359,analog cable,coax,direct from wall


That's almost 200 triples in only:

In [46]:
len(redditTopScores['sentences'][0])

37

sentences and

In [47]:
sum([len(s) for s in redditTopScores['sentences'][0]])

971

words.

Let's look at the most common subjects in this story.

In [48]:
ieDF['subject'].value_counts()

I                        48
it                       42
he                       19
He                       18
we                       11
old man                   8
man                       8
letter                    4
our booking calendar      4
call                      4
analog cable              4
straight analog cable     4
my supervisor             3
TV                        2
his TV set                2
they                      2
you                       2
handling                  1
people                    1
our digital equipment     1
me                        1
our equipment             1
repeat offenders          1
Name: subject, dtype: int64
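Note that `he` and `He` are counted separately in the output above. A minimal sketch of folding case before ranking, using a few counts copied from that output; keeping the pronoun "I" capitalized is my own choice for readability.

```python
from collections import Counter

# A few subject counts copied from the value_counts() output above
subject_counts = {'I': 48, 'it': 42, 'he': 19, 'He': 18, 'we': 11}

folded = Counter()
for subject, n in subject_counts.items():
    # Keep the pronoun "I" as is; lowercase everything else
    key = subject if subject == 'I' else subject.lower()
    folded[key] += n

top_three = folded.most_common(3)
print(top_three)
```

On the full dataframe, `ieDF['subject'].str.lower().value_counts()` would do the same folding, at the cost of also lowercasing "I" and proper names.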

"I" is followed by various male pronouns and compound nouns (e.g., "old man"). "I" occurs most often with the following verbs:

In [49]:
ieDF[ieDF['subject'] == 'I']['verb'].value_counts()

could come                        8
even brought                      5
brought                           5
had                               4
was                               4
speak for                         3
have                              1
do                                1
had cable within                  1
think about                       1
took                              1
've dealt with                    1
get to                            1
still think occasionally about    1
get                               1
felt                              1
think occasionally about          1
speak with                        1
still think about                 1
eventually had                    1
So anyway get                     1
anyway get                        1
ask                               1
complaint in                      1
instantly felt                    1
Name: verb, dtype: int64

and the following objects:

In [None]:
ieDF[ieDF['subject'] == 'I']['object'].value_counts()

Mr. Smith                                             4
him                                                   3
call                                                  3
simplified remote                                     2
this                                                  2
remote for his set top box                            2
get                                                   2
willing                                               2
bad                                                   2
simplified remote for his set top box                 2
remote                                                2
bit about account for Mr. Smith                       1
speak with her for bit about account for Mr. Smith    1
cable running                                         1
cable                                                 1
speak with her for bit about account                  1
bit                                                   1
her                                             

We can also run the CoreNLP server. When you run this server (with the command below), you can click on the browser link provided to experiment with it. Note that running the server with the command below blocks the current Jupyter process, and you will not be able to run code here again (processes will "hang" and never finish) until you interrupt the process by clicking "Kernel" and then "Interrupt".

In [None]:
stanford.startCoreServer()

Starting server on http://localhost:16432 , please wait a few seconds


## <span style="color:red">*Exercise 5*</span>

<span style="color:red">In the cells immediately following, perform open information extraction on a modest subset of texts relevant to your final project. Analyze the relative attachment of several subjects relative to verbs and objects and visa versa. Describe how you would select among these statements to create a database of high-value statements for your project and then do it by extracting relevant statements into a pandas dataframe.

Let's do some analysis on the first review in the dataframe.

In [176]:
review_ieDF = stanford.openIE(closed_restaurant_reviews_smallDF['text'][0])

Starting OpenIE run
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.2 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 16.276 (s)
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [18.5 sec].
[main] INFO edu.stanford.nlp.pip

Let's also do some analysis on the second review in the dataframe.

In [177]:
review_ieDF_1 = stanford.openIE(closed_restaurant_reviews_smallDF['text'][1])

Starting OpenIE run
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [1.6 sec].
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator depparse
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Loading depparse model file: edu/stanford/nlp/models/parser/nndep/english_UD.gz ... 
[main] INFO edu.stanford.nlp.parser.nndep.Classifier - PreComputed 99996, Elapsed Time: 15.33 (s)
[main] INFO edu.stanford.nlp.parser.nndep.DependencyParser - Initializing dependency parser ... done [17.4 sec].
[main] INFO edu.stanford.nlp.pipe

In [167]:
review_ieDF

Unnamed: 0,certainty,subject,verb,object
0,1.0,Matt,for,Big Breakfast
1,1.0,people,made,comment
2,1.0,Denny,makes,breakfast
3,1.0,Denny,makes,better breakfast
4,1.0,Two people,made,comment
5,1.0,That,is,patently ridiculous
6,1.0,someone cracks,open egg at,at Denny
7,1.0,someone cracks,open,egg
8,1.0,you,make food at,home
9,1.0,whole point,is going,OUT


In [178]:
review_ieDF_1

Unnamed: 0,certainty,subject,verb,object
0,1.000000,I,'ve lived now,downtown
1,0.238943,I,'ve lived for,5 years
2,1.000000,I,'ve lived,downtown
3,1.000000,I,'ve lived downtown for,5 years
4,0.238943,I,'ve lived now for,5 years
5,1.000000,I,'ve lived downtown now for,5 years
6,1.000000,I,saw,it
7,1.000000,I,'m,semi-regular
8,1.000000,I,say,semi
9,1.000000,I,go in,series


In [169]:
len(closed_restaurant_reviews_smallDF['sentences'][0])

27

In [170]:
sum([len(s) for s in closed_restaurant_reviews_smallDF['sentences'][0]])

519

Let's look at the most common subjects in the first review. The top one is "I", which is understandable since people like to start their reviews this way.

In [171]:
review_ieDF['subject'].value_counts()

I                 18
it                11
my name            4
richness           3
Denny              2
egg                2
reviews            2
someone cracks     2
It                 1
Two people         1
whole point        1
scramble           1
specials           1
That               1
Matt               1
Food Network       1
point              1
you                1
people             1
Name: subject, dtype: int64

Let's find the most common objects in the second review. It looks like this restaurant has some really good maple syrup!

In [184]:
review_ieDF_1['object'].value_counts()

real maple syrup                    5
awesome                             5
maple syrup                         5
awesome with real maple syrup       4
awesome with maple syrup            4
5 years                             4
breakfast                           3
I go                                2
downtown                            2
leave                               2
great                               2
bill                                2
leave there                         2
I go in series                      2
leave with bill                     2
leave there with bill               2
fried egg on it for breakfast       1
dangerous breakfast                 1
that                                1
most people are saying              1
delightfully dangerous breakfast    1
weekend                             1
semi                                1
series                              1
going                               1
two week period                     1
people are s

In [172]:
review_ieDF[review_ieDF['subject'] == 'I']['verb'].value_counts()

enter                    2
save                     2
had As                   2
decided                  2
wait for                 1
Anyway digress           1
watching                 1
try first                1
save housemade jam as    1
'm in                    1
finished                 1
try                      1
digress                  1
make                     1
Name: verb, dtype: int64

This restaurant is good for breakfast:

In [186]:
review_ieDF_1[review_ieDF_1['object'] == 'breakfast']['verb'].value_counts()

are great for    2
is               1
Name: verb, dtype: int64

In [173]:
review_ieDF[review_ieDF['subject'] == 'I']['object'].value_counts()

nothing terribly special    2
them                        2
tiny eatery                 1
my way                      1
Phoenix                     1
so speak                    1
eatery                      1
speak                       1
my dessert                  1
lover                       1
lover of hash browns        1
DDD                         1
housemade jam               1
my toast                    1
my meal                     1
my plate                    1
Name: object, dtype: int64

For my project, since I am interested in how food, ambience and service relate to restaurant closure, I would try to find all the sentences that pertain to these three categories and analyze the relative attachment of several subjects to verbs and objects. For example, what adjectives do people use when they talk about the food? I would then collect all of this useful information into a dataframe for later use.
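A minimal sketch of that selection step, assuming the triples come as `(certainty, subject, verb, object)` rows like the OpenIE tables above. The five rows are copied from those tables; the keyword sets, the 0.5 certainty threshold and the name `select_statements` are hypothetical choices for illustration.

```python
# Rows copied from the OpenIE tables above: (certainty, subject, verb, object)
triples = [
    (1.0, 'Denny', 'makes', 'better breakfast'),
    (1.0, 'someone cracks', 'open', 'egg'),
    (1.0, 'you', 'make food at', 'home'),
    (0.238943, 'I', "'ve lived for", '5 years'),
    (1.0, 'I', "'ve lived", 'downtown'),
]

# Hypothetical seed vocabulary for each category of interest
CATEGORIES = {
    'food': {'food', 'breakfast', 'egg', 'syrup'},
    'service': {'service', 'waiter', 'staff'},
    'ambience': {'atmosphere', 'lights', 'music'},
}

def select_statements(triples, categories, min_certainty=0.5):
    """Keep confident triples that mention a category keyword, tagged by category."""
    selected = []
    for certainty, subj, verb, obj in triples:
        if certainty < min_certainty:
            continue  # drop low-confidence extractions
        words = set(f'{subj} {verb} {obj}'.lower().split())
        for name, keywords in categories.items():
            if words & keywords:
                selected.append({'category': name, 'certainty': certainty,
                                 'subject': subj, 'verb': verb, 'object': obj})
    return selected

rows = select_statements(triples, CATEGORIES)
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame(rows)` would turn the selection into a dataframe with one column per key. Exact keyword matching misses plurals and synonyms, so a real vocabulary would need lemmatizing or a longer word list.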