# Introduction

Today's workshop will address various concepts in Natural Language Processing, primarily through the use of NLTK. Students should already have a fundamental understanding of Python. We'll work with a corpus of documents and learn how to identify different types of linguistic structure in the text, which can help in classifying the documents or extracting useful information from them. We'll cover:

1. NLTK Corpora
2. Tokenization
3. Part-of-Speech (POS) Tagging
4. Phrase Chunking
5. Named Entity Recognition (NER)
6. Dependency Parsing

You will need:

* NLTK (in Bash $ pip install nltk)

* NLTK Book corpora and packages (In Python >>> nltk.download() )

* NumPy package (in Bash $ pip install numpy)

* Stanford Parser: Download Stanford Parser 3.6.0 and unzip to a location that's easy for you to find (e.g. a folder called SourceCode in your Documents folder). Link: http://nlp.stanford.edu/software/lex-parser.shtml#Download

This workshop will also help solidify your understanding of regular expressions and list comprehensions.

Much of today's work will be adapted, or taken directly, from the NLTK book found here: http://www.nltk.org/book/ .

## Motivation

Why would we use natural language processing? How does it relate to other things we might be doing -- or trying to do -- with text in our research?

Natural language processing is a field of computer science and linguistics; it aims to enable computers to process and derive meaning from input in human language. NLP research is being used to automate tasks like translation, question answering, voice recognition, and language generation.

As social scientists and humanists, we use NLP concepts to improve our analysis of texts of interest in our research. Reasons you might use these methods include:

1. You want to be able to better classify documents, or
2. You want to be able to extract information from those documents.

We'll set up an example of each of these two tasks, look at how well we can accomplish that task without NLP, and then see what we gain by adding each of the concepts we'll cover today.

### Task 1. Document Classification

We're often interested in characterizing text from different sources, e.g. measuring the ideology of different politicians based on the language used in their speeches. A simple case would be a situation in which we have a bunch of documents that we want to label as "positive" or "negative". This is often called "sentiment analysis", and it can be very difficult, despite only having two categories, because sentiment is a subjective and often subtle idea.

Since sentiment analysis involves human judgment about the meaning of language, we'll need to do this in a supervised manner, using training data that has already been labeled. We'll need to use a bit of machine learning for this task, but we'll use one of the existing classifiers provided by NLTK. These classifiers take a set of training documents that have already been categorized, and learn how to predict the categories of other documents. We'll use the NLTK Movie Reviews corpus for our training data, and the NLTK Naive Bayes classifier.

We can't give a classifier raw text, because it wouldn't know how to use human language in its calculations. Instead, we represent each document by a vector of features (numeric or boolean values) and then the machine learns what combinations of those features fall into each category. Throughout the workshop today, we'll learn how NLP tools can help us extract different features from documents, to represent the documents with more meaningful or relevant information that might help classify them more accurately.

### Task 2. Information Extraction

Sometimes we aren't interested in characterizing the documents we've collected; we just want to use them as a source of information about something else. For instance, I might not care *how* different news media outlets describe or articulate elections; I just want to figure out how many elections occurred in a particular place or time. Or I might want to discover what people or groups were involved in elections -- e.g. who voted, who won.

For this type of task, the simplest approach might just be a keyword search, looking for all of the news articles with the word "election" (or maybe "elected" or "voted") and then pulling out the articles or sentences in which those keywords appear. But that is a blunt instrument: we don't know whether an appearance of a word really means an election has occurred, and a keyword search won't help us isolate the names of the entities involved.

Today, we'll learn how NLP tools can help us identify different actions and actors in a text, to get closer to being able to extract instances of events and the actors that filled specific roles in those events. This will be an exploratory task; we don't yet have annotated training data to allow us to test the accuracy of our automated extraction process. But we can compare different approaches to see what different kinds of information we might be able to identify.

## 1) Starting with a Corpus

We can use NLTK on strings, lists or dictionaries of strings, or files containing text. We call the overall body of texts we're working with a "corpus", which is a collection of written documents or texts (plural "corpora"). You might have a single .csv file of sentences, titles, tweets, etc. that you want to read into Python all at once and then analyze. If you have your documents in different files, however, NLTK provides a class called PlaintextCorpusReader for working with a corpus as a group of text files. We can declare an NLTK corpus object containing all text files in the current working directory or subdirectories as follows:

In [None]:
from nltk.corpus import PlaintextCorpusReader

corpus_root = "" # relative path, i.e. current working directory
my_corpus = PlaintextCorpusReader(corpus_root, r'.*\.txt') # regex matching any file name ending in .txt

We can list all of the files we've just included in the corpus as follows:

In [None]:
my_corpus.fileids()

We can read in the contents of a file in our corpus as a string using the .raw() method:

In [None]:
my_corpus.raw('example.txt')

We can also extract either all the words, or sentences and their words, as lists of strings:

In [None]:
my_corpus.words('example.txt')

In [None]:
sents = my_corpus.sents('example.txt')
print(sents)

NLTK comes with a variety of downloadable corpora that can be used for trying out the methods in the toolkit. We'll use two of these corpora in just a bit, when we set up our practical tasks.

## 2) Tokenization

We usually look at grammar and meaning at the level of words, related to each other within sentences, within each document. So if we're starting with raw text, we first need to split the text into sentences, and those sentences into words -- which we call "tokens". An NLTK corpus object does this for us, allowing us to read in lists of words from our text files. But NLTK also provides tools that enable us to "tokenize" strings ourselves, if we've read them in in longer form.

To understand how to pre-process raw text, let's read in the contents of 'example.txt' in your current directory, using the .raw() method to get one long string.

In [None]:
example_text = my_corpus.raw('example.txt')
print(example_text)

Now, you might imagine that the easiest way to identify sentences is to split the document at every period '.', and to split the sentences using white space to get the words.

In [None]:
example_sents = example_text.split('.')
example_sents_toks = [sent.split(' ') for sent in example_sents]

for sent_toks in example_sents_toks:
    print(sent_toks)

This doesn't look right. Not all periods divide sentences (periods may also be used in abbreviations), and not all sentences end in a period (some end in question marks or exclamation points). Words might be separated not only by single spaces, but also by tabs or newlines. We can use the split function from the 're' package with regular expressions that capture these various possibilities:

In [None]:
import re

# Raw strings (r'...') keep Python from treating \s as a string escape
example_sents = re.split(r'(?<=[a-z])[.?!]\s', example_text)
example_sents_toks = [re.split(r'\s+', sent) for sent in example_sents]

for sent_toks in example_sents_toks:
    print(sent_toks)

This looks better, though we've lost the punctuation at the end of each sentence, except for the period at the end of the string (since we only split sentences on a period followed by white space). That last period has remained attached to the word 'out', since we only split words on white space. We could instead use 're.findall()' to search for all sequences of alphanumeric characters. This would split apart contractions, which might be useful if we want to consider 'I' and ''m' (short for 'am') to represent separate words.
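
As a rough illustration of the 're.findall()' idea, here is a minimal sketch on a made-up sentence (`sample` is a hypothetical string, not the contents of example.txt). The first pattern keeps only runs of alphanumeric characters; the second pattern is our own assumption about how one might also keep each punctuation mark as a separate token.

```python
import re

# Hypothetical sample string, standing in for one sentence of a document
sample = "I'm heading out."

# Runs of alphanumeric characters only: punctuation is dropped
word_toks = re.findall(r"\w+", sample)
print(word_toks)   # ['I', 'm', 'heading', 'out']

# Alphanumeric runs OR single characters that are neither word nor space
all_toks = re.findall(r"\w+|[^\w\s]", sample)
print(all_toks)    # ['I', "'", 'm', 'heading', 'out', '.']
```

Either way the contraction is split at the apostrophe, so whether the apostrophe survives as a token depends on the pattern chosen.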

We'll stop there, because NLTK provides handy functions to do this for us:

In [None]:
import nltk

example_sents = nltk.sent_tokenize(example_text)
example_sents_toks = [nltk.word_tokenize(sent) for sent in example_sents]

for sent_toks in example_sents_toks:
    print(sent_toks)

These lists of tokens are what we get with the words() method applied to an NLTK corpus. We'll work with the NLTK corpora from here on, but now you know how to turn your own documents into lists of words, either by creating an NLTK corpus object containing your own text files, or by reading in longer strings and then using NLTK functions to tokenize them.

## Putting into Practice

### Task 1: Classifying Documents

Now we're ready to set up our first task, sentiment analysis (i.e. classifying documents as positive or negative). For this task, we'll use the NLTK Movie Reviews corpus, which contains 2,000 movie reviews already categorized as positive or negative.

In [None]:
from nltk.corpus import movie_reviews

movie_reviews.categories()

In [None]:
movie_reviews.fileids()[:10]

Let's imagine that we have much less training data, though, which might be more realistic. This will make it easier to compare different options, since we don't expect the accuracy to be very high at first. And it'll also shorten the time it takes to process the text, since we don't have a lot of time in a workshop setting. Let's just load the first 200 documents from each of the two categories (negative and positive). We'll need to store each document as a list of tokens, in a tuple with its category. Then we'll shuffle the documents so that the negative and positive ones are interspersed.

In [None]:
import random

docs_tuples = [(movie_reviews.words(fileid), category)
               for category in movie_reviews.categories()
               for fileid in movie_reviews.fileids(category)[:200]]

random.shuffle(docs_tuples)

For our initial set of classification features, we'll look for whether certain words appear in each review (regardless of word order, etc). We could use a pre-defined list of words like "good" and "bad", or we can select a list of words from the corpus. We'll create a list of the most frequent words in the corpus, using the NLTK class FreqDist. It takes a list of words and counts each word's frequency, which can then be sorted from largest to smallest using the method most_common(). We'll take the top 1000 most common words from the 400 documents we read in.

In [None]:
movie_words = [word.lower() for (wordlist, cat) in docs_tuples for word in wordlist]
all_wordfreqs = nltk.FreqDist(movie_words)
top_wordfreqs = all_wordfreqs.most_common()[:1000]
print(top_wordfreqs[:10])

In [None]:
feature_words = [x[0] for x in top_wordfreqs]
print(feature_words[:25])

To use these words as document features for classification, we need to define a function that takes each document (as a list of tokens) and returns a set of features representing which words are in that document. NLTK requires us to provide each document's features as a dictionary object, in which feature names are paired with values. Let's make each word feature a 1 or 0, depending on whether that word appears in the document or not. We'll do this for all 1000 of the top words in the corpus, which are now in our 'feature_words' list (so that each document has the same set of features). We'll create feature names of the form "contains(x)" for each word x in feature_words.

In [None]:
def document_features(doc_toks):
    document_words = set(doc_toks)
    features = {}
    for word in feature_words:
        features['contains({})'.format(word)] = 1 if word in document_words else 0
    return features

Then we'll use our document_features() function on each document's word list, and store a new tuple with the resulting features and the document's category in a list of feature sets. (This is the format we provide to an NLTK classifier: a list of document tuples, each tuple containing a dictionary object of features plus a single value category or label for that document.)

In [None]:
featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]
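
To make the feature-set format concrete, here is a small sketch on toy data. The names `toy_feature_words`, `toy_docs` and `toy_document_features` are hypothetical stand-ins for the real word list, the movie reviews and the function above; the only difference is that the word list is passed in as an argument rather than read from a global variable.

```python
# Hypothetical stand-ins for the real feature_words list and two tokenized reviews
toy_feature_words = ['good', 'bad', 'plot']
toy_docs = [(['a', 'good', 'plot'], 'pos'),
            (['bad', 'acting', ',', 'bad', 'plot'], 'neg')]

def toy_document_features(doc_toks, feature_words):
    """Return a {'contains(word)': 0 or 1} dictionary for one document."""
    document_words = set(doc_toks)   # a set makes membership tests fast
    return {'contains({})'.format(word): int(word in document_words)
            for word in feature_words}

# One (features, category) tuple per document, as the classifier expects
toy_featuresets = [(toy_document_features(toks, toy_feature_words), cat)
                   for (toks, cat) in toy_docs]

for features, cat in toy_featuresets:
    print(cat, features)
```

Note that "bad" appears twice in the second toy document but its feature is still 1: these features record presence, not counts.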

Then we split the documents into a training set and a test set so we can see how well we do. We'll use the first 300 documents for training and the last 100 documents for testing.

In [None]:
from nltk import NaiveBayesClassifier

train_set, test_set = featuresets[:-100], featuresets[-100:]
classifier = NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(classifier, test_set)

The NLTK classifier also lets us look at the most informative features (i.e. the words that were most useful for classifying the documents as positive or negative), and sure enough, we see some very positive and very negative words.

In [None]:
classifier.show_most_informative_features(10)
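
If the train/classify/accuracy cycle feels abstract, the sketch below runs it on a handful of hand-made feature sets, so it needs no corpus download. All of the `toy_` names and the feature values are invented for illustration, and scoring on the training data itself is done here only to keep the example tiny; it is not a sound way to evaluate a real classifier.

```python
import nltk

# Hypothetical hand-made feature sets: (feature dictionary, label) tuples
toy_train = (
    [({'contains(good)': 1, 'contains(bad)': 0}, 'pos')] * 3 +
    [({'contains(good)': 0, 'contains(bad)': 1}, 'neg')] * 3
)

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Classify a new, unlabeled feature dictionary
new_doc = {'contains(good)': 1, 'contains(bad)': 0}
predicted = toy_classifier.classify(new_doc)
print(predicted)

# Accuracy is the share of (features, label) pairs predicted correctly
toy_accuracy = nltk.classify.accuracy(toy_classifier, toy_train)
print(toy_accuracy)
```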

### Task 2. Information Extraction

News articles are often used for information extraction tasks, since they provide information about major events and actors of interest fairly consistently over time. For this task we'll use the NLTK Brown Corpus. The Brown Corpus contains text from 500 sources, categorized by genre, such as news, editorial, and so on. The full list of files in each genre is available here: http://clu.uni.no/icame/brown/bcm-los.html. Let's look at the genres available, and the files in the 'news' genre.

In [None]:
from nltk.corpus import brown

print(brown.categories())

In [None]:
print(brown.fileids(categories='news'))

Let's read in a list of sentences for each news document in the Brown Corpus. We can print out the first couple sentences from the first document to make sure it looks ok.

In [None]:
news_docs = [brown.sents(fileid) for fileid in brown.fileids(categories='news')]
print(news_docs[0][:2])

A simple way to approach information extraction might be to extract all sentences that contain certain keywords, e.g. sentences containing the word "election".

In [None]:
elect_sents = []
for doc in news_docs:
    for sent in doc:
        if 'election' in sent:
            elect_sents.append(sent)
            
len(elect_sents)

In [None]:
print(elect_sents[:2])

That's getting some relevant sentences, but it seems like there should be more. We could instead create a regular expression that would match either the root "elect" or "vote" with any ending (e.g. "elected", "elects", etc). Then we'd need to look for a match to this regular expression for each token in each sentence.

In [None]:
elect_regexp = 'elect|vote'

elect_sents = []
for doc in news_docs:
    for sent in doc:
        for tok in sent:
            if re.match(elect_regexp, tok):
                elect_sents.append(sent)
                break # Break out of the token for loop, so we only add the sentence once
            
len(elect_sents)
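
One detail worth seeing in isolation: `re.match()` only looks at the start of each token and is case-sensitive by default, so the loop above catches "election" but not "Voters" or "re-elected". The sketch below compares that behaviour with a case-insensitive `re.search()` on three made-up sentences (`toy_sents` is hypothetical data, and the looser pattern is our own variation, not part of the original approach).

```python
import re

# Hypothetical tokenized sentences
toy_sents = [
    ['The', 'election', 'was', 'held', 'Tuesday', '.'],
    ['Voters', 're-elected', 'the', 'mayor', '.'],
    ['The', 'jury', 'met', 'on', 'Friday', '.'],
]

# As in the loop above: match at the start of a token, case-sensitive
prefix_hits = [sent for sent in toy_sents
               if any(re.match('elect|vote', tok) for tok in sent)]

# Variation: match anywhere inside a token, ignoring case
loose_pattern = re.compile(r'elect|vote', re.IGNORECASE)
anywhere_hits = [sent for sent in toy_sents
                 if any(loose_pattern.search(tok) for tok in sent)]

print(len(prefix_hits), len(anywhere_hits))
```

The looser version finds the second sentence too, but it would also match unrelated words such as "selection", so which one is preferable depends on how much noise you can tolerate.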

## 3) Part-of-Speech Tagging

One of the fundamental aspects of words that we can use to begin to understand them is their parts of speech -- whether a word is a noun, verb, adjective, etc. Labeling each word in a sequence with its part of speech is called "part-of-speech tagging" or "POS tagging". A part of speech represents a syntactic function; the aim here is to identify the grammatical components of a sentence.

Some parts of speech are easier to spot than others, because they follow certain morphological patterns (e.g. verb endings). We can use regular expressions to find these recognizable words individually:

In [None]:
patterns = [
    (r'.*ing$', 'VBG'),               # gerunds
    (r'.*ed$', 'VBD'),                # simple past
    (r'.*es$', 'VBZ'),                # 3rd singular present
    (r'.*ould$', 'MD'),               # modals
    (r'.*\'s$', 'NN$'),               # possessive nouns
    (r'.*s$', 'NNS'),                 # plural nouns
    (r'.*ly$', 'RB'),                 # adverbs
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN')                     # nouns (default)
    ]

NLTK has a regular expression tagger that we can use, providing it with our own patterns. Let's see how many words we're able to tag in our example text.

In [None]:
from nltk import RegexpTagger

regexp_tagger = RegexpTagger(patterns)

sent = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
sent_tagged = regexp_tagger.tag(sent)

print(sent_tagged)
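
The order of the patterns matters, because the first pattern that matches a token determines its tag. The sketch below mimics that first-match-wins behaviour using only the 're' package; `first_match_tag`, `toy_patterns` and `toy_tokens` are hypothetical names for illustration, not part of NLTK.

```python
import re

# A shortened list of (pattern, tag) pairs, in priority order
toy_patterns = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*s$', 'NNS'),
    (r'.*ly$', 'RB'),
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),
    (r'.*', 'NN'),                      # default: everything else is a noun
]

def first_match_tag(token, patterns):
    """Return the tag of the first pattern that matches the token."""
    for pattern, tag in patterns:
        if re.match(pattern, token):
            return tag
    return None

toy_tokens = ['She', 'wishes', 'dogs', 'walked', 'quickly', '3.5']
toy_tagged = [(tok, first_match_tag(tok, toy_patterns)) for tok in toy_tokens]
print(toy_tagged)
```

Here "wishes" is tagged 'VBZ' only because the '.*es$' pattern is listed before '.*s$'; swapping the two would tag it 'NNS'.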

That didn't work so well, but no problem, this was a very naïve attempt. We can evaluate the accuracy nonetheless, using the POS-tagged Brown corpus:

In [None]:
brown_tagged_sents = brown.tagged_sents(categories='news')
regexp_tagger.evaluate(brown_tagged_sents)

Some words may have different parts of speech depending on their context. In the example sentence above, the words "refuse" and "permit" both appear as verbs and as nouns. We can think of sequences that might indicate parts of speech. For instance, a word after "the" or "an" is more likely to be a noun, while a word after "did" or "does" is more likely to be a verb.
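
A very rough sketch of that idea is shown below: it guesses a tag from the preceding word alone. The cue-word sets are hypothetical (adding "to" as a verb cue is our own assumption), and this is only meant to show how far hand-written context rules get us, not how a real tagger works.

```python
# Hypothetical cue words; 'to' is our own addition to the examples in the text
NOUN_CUES = {'the', 'a', 'an'}
VERB_CUES = {'did', 'does', 'to'}

def guess_from_context(tokens):
    """Guess 'NN' or 'VB' from the preceding token; None if there is no cue."""
    guesses = []
    previous = None
    for tok in tokens:
        if previous in NOUN_CUES:
            guesses.append((tok, 'NN'))
        elif previous in VERB_CUES:
            guesses.append((tok, 'VB'))
        else:
            guesses.append((tok, None))
        previous = tok.lower()
    return guesses

toy_sent = "They refuse to permit us to obtain the refuse permit".split()
context_guesses = guess_from_context(toy_sent)
print(context_guesses)
```

The rules separate the verb "permit" (after "to") from the noun "refuse" (after "the"), but they have nothing to say about the final "permit" or the first "refuse", which is why looking at one neighbouring word is not enough.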

State-of-the-art POS taggers rely on probabilistic models and machine learning to tag tokens sequentially, or jointly across the whole sentence at the same time (finding the most likely combination of tags that make sense in relation to each other). Fortunately, you don't need to do this yourself. NLTK comes with an off-the-shelf POS tagger for English language text:

In [None]:
sent = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
sent_tagged = nltk.pos_tag(sent)
print(sent_tagged)

Some corpora come already tagged. If we read in the Brown Corpus in raw text format (rather than tokenized), we'll actually see pairs of tokens and tags, separated by a '/'.

In [None]:
news_raw = brown.raw('ca01').strip()
print(news_raw[:200])

We can read these tagged tokens in as a list of tuples using the corpus object's tagged_words() method:

In [None]:
news_tagged = brown.tagged_words('ca01')
print(news_tagged[:10])

Different POS taggers use different tags (collectively a "tagset"). NLTK's pos_tag() function uses the Penn Treebank POS tagset. Good documentation can be found here: http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html.

The Brown Corpus is tagged with a different tagset, but you can see the similarities.

## Putting into Practice

### Task 1. Classifying Documents

Part-of-speech tags may be useful in document classification. For sentiment analysis in particular, we might care more about adjectives than about nouns and verbs, and probably much more than about articles or prepositions.

Let's run the code we used in the first version of Task 1, then modify it to only use adjectives as the features for each document. Here's the original code:

In [None]:
# Read in a list of document (wordlist, category) tuples, and shuffle
docs_tuples = [(movie_reviews.words(fileid), category)
               for category in movie_reviews.categories()
               for fileid in movie_reviews.fileids(category)[:200]]
random.shuffle(docs_tuples)

# Create a list of the most frequent words in the entire corpus
movie_words = [word.lower() for (wordlist, cat) in docs_tuples for word in wordlist]
all_wordfreqs = nltk.FreqDist(movie_words)
top_wordfreqs = all_wordfreqs.most_common()[:1000]
feature_words = [x[0] for x in top_wordfreqs]

# Define a function to extract features of the form contains(word) for each document
def document_features(doc_toks):
    document_words = set(doc_toks)
    features = {}
    for word in feature_words:
        features['contains({})'.format(word)] = 1 if word in document_words else 0
    return features

# Create feature sets of document (features, category) tuples
featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]

# Separate train and test sets, train the classifier, print accuracy and best features
train_set, test_set = featuresets[:-100], featuresets[-100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)  # prints its own table and returns None

Now write a modified version of the section that builds a list of feature words from the corpus. Run the POS tagger on the full list of movie_words from the corpus, and pull out only the ones tagged as an adjective ('JJ'). Then create a list of the 1000 most frequent adjectives, and assign that to feature_words.

(We're leaving the first step the way we already did it above, so we only shuffle the document tuples once and leave them in the same order for both versions. Then we can test how well the classifier worked, when trained and tested on the same subset of documents.)

In [None]:
# Create a list of the most frequent adjectives in the entire corpus


We'll leave the rest of the code the same (as long as our new list of adjectives is called "feature_words" again). All we've changed is the words we're looking for in each document. Run this code again and see what happens to the accuracy and informative features.

In [None]:
# Define a function to extract features of the form contains(word) for each document
def document_features(doc_toks):
    document_words = set(doc_toks)
    features = {}
    for word in feature_words:
        features['contains({})'.format(word)] = 1 if word in document_words else 0
    return features

# Create feature sets of document (features, category) tuples
featuresets = [(document_features(wordlist), cat) for (wordlist, cat) in docs_tuples]

# Separate train and test sets, train the classifier, print accuracy and best features
train_set, test_set = featuresets[:-100], featuresets[-100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)  # prints its own table and returns None

### Task 2. Information Extraction

Part-of-speech tags might also help us find the entities involved in elections. The code below is what we used to extract full sentences if they contained an election-related keyword.

In [None]:
# Read in all news docs as a list of sentences, each sentence a list of tokens
news_docs = [brown.sents(fileid) for fileid in brown.fileids(categories='news')]

# Create regular expression to search for election-related words
elect_regexp = 'elect|vote'

# Loop through documents and extract each sentence containing an election-related word
elect_sents = []
for doc in news_docs:
    for sent in doc:
        for tok in sent:
            if re.match(elect_regexp, tok):
                elect_sents.append(sent)
                break # Break out of last for loop, so we only add the sentence once
            
len(elect_sents)

See if you can write new code that will only extract the nouns from each sentence, if the sentence contains a keyword related to elections.

In [None]:
# Loop through docs and extract nouns from each sentence containing an election-related word


Now see if you can write code that will only extract the nouns from each sentence, if that sentence contains a *verb* that's related to elections.

In [None]:
# Loop through docs and extract nouns from each sentence containing an election-related verb


## 4) Phrase Chunking

We may want to work with larger segments of text than single words (but still smaller than a sentence). For instance, in the sentence "The black cat climbed over the tall fence", we might want to treat "The black cat" as one thing (the subject), "climbed over" as a distinct act, and "the tall fence" as another thing (the object). The first and third sequences are noun phrases, and the second is a verb phrase.

We can separate these phrases by "chunking" the sentence, i.e. splitting it into larger chunks than individual tokens. This is also an important step toward identifying entities, which are often represented by more than one word. You can probably imagine certain patterns that would define a noun phrase, using part of speech tags. For instance, a determiner (e.g. an article like "the") could be concatenated onto the noun that follows it. If there's an adjective between them, we can include that too.

To define rules about how to structure words based on their part of speech tags, we use a grammar (in this case, a "chunk grammar"). NLTK provides a RegexpParser that takes as input a grammar composed of regular expressions. The grammar is defined as a string, with one line for each rule we define. Each rule starts with the label we want to assign to the chunk (e.g. NP for "noun phrase"), followed by a colon, then an expression in regex-like notation that will be matched to tokens' POS tags.

We can define a single rule for a noun phrase like this. The rule allows 0 or 1 determiner, then 0 or more adjectives, and finally at least 1 noun. (By using 'NN.*' as the last POS tag, we can match 'NN', 'NNP' for a proper noun, or 'NNS' for a plural noun.) If a matching sequence of tokens is found, it will be labeled 'NP'.

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"

We create a chunk parser object by supplying this grammar, then use it to parse a sentence into chunks. The sentence we want to parse must already be POS-tagged, since our grammar uses those POS tags to identify chunks. Let's try this on the first sentence in the election-related sentences we just extracted.

In [None]:
from nltk import RegexpParser

cp = RegexpParser(grammar)

sent = elect_sents[0]
sent_tagged = nltk.pos_tag(sent)
sent_chunked = cp.parse(sent_tagged)

print(sent_chunked)

When we called print() on this chunked sentence, it printed out a nested list of nodes. Some are phrases (labeled 'NP'), and others that didn't get chunked into a phrase are just the original tagged tokens (e.g. verbs and prepositions).

The chunked sentence is actually an NLTK tree object; we can confirm this by calling type() on the output from the RegexpParser:

In [None]:
type(sent_chunked)

The tree object has a number of methods we can use to interact with its components. For instance, we can use the method draw() to see a more graphical representation. This will open a separate window.

The tree is pretty flat, because we defined a grammar that only grouped words into non-overlapping noun phrases, with no additional hierarchy above them. This is sometimes referred to as "shallow parsing". We'll get to more complex parsing later.

In [None]:
sent_chunked.draw()

If we want to move through the chunks and look at certain phrases, since the tree is essentially flat, we can use a 'for' loop to iterate through all of the nodes in the order they were printed above. Some of the nodes are themselves NLTK tree objects, containing the noun phrases we chunked. Other nodes are just tuples with a token and tag, that didn't make it into a chunk.

If a node is a tree object, it has a method label() that returns the chunk label, in this case 'NP'. It also has a method leaves() that will give us the list of tagged tokens (tuples) in the phrase. If we pull out the first element (the token) from each tuple, and concatenate these, we can get the original phrase back.

In [None]:
for node in sent_chunked:
    if type(node)==nltk.tree.Tree and node.label()=='NP':
        phrase = [tok for (tok, tag) in node.leaves()]
        print(' '.join(phrase))
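
To try the whole chunking step without running a POS tagger, the sketch below applies the same grammar to the "black cat" sentence from the start of this section. The POS tags are assigned by hand (an assumption about what a tagger would return), and `cat_tagged`, `np_parser` and `cat_nps` are names made up for this example.

```python
import nltk

# Hand-assigned Penn Treebank tags (an assumption: no tagger is run here)
cat_tagged = [('The', 'DT'), ('black', 'JJ'), ('cat', 'NN'),
              ('climbed', 'VBD'), ('over', 'IN'),
              ('the', 'DT'), ('tall', 'JJ'), ('fence', 'NN')]

np_parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
cat_chunked = np_parser.parse(cat_tagged)
print(cat_chunked)

# Collect the text of each NP subtree; unchunked tokens are plain tuples
cat_nps = [' '.join(tok for (tok, tag) in node.leaves())
           for node in cat_chunked
           if isinstance(node, nltk.Tree) and node.label() == 'NP']
print(cat_nps)
```

The two noun phrases come out as chunks, while "climbed" and "over" stay behind as plain (token, tag) tuples at the top level of the tree.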

## 5) Named Entity Recognition

Once we have noun phrases separated out, we might find it useful to figure out what categories of things these nouns refer to. Especially if the noun phrase is a proper noun, i.e. a name of something, we might be able to tell if it is the name of a person, an organization, a place, or some other thing. Labeling noun phrases as different types of named entities is called "Named Entity Recognition" or "NER".

Named Entity Recognition involves meaning (semantics) as well as grammar (syntax). The name of a person or an organization might appear in the same place in the exact same sentence, so we also have to know something about existing person and organization names to be able to tell them apart. For that reason, NER taggers are usually trained from labeled training data, using supervised machine learning techniques. NLTK comes with a pre-trained NER tagger we can use for general English text:

In [None]:
sent_nes = nltk.ne_chunk(sent_tagged)
print(sent_nes)

Now we can extract named entities and their NER categories. For instance, we might want to pull out a list of all of the organizations or people mentioned in the document:

In [None]:
entities = {'ORGANIZATION':[], 'PERSON':[], 'LOCATION':[]}
for node in sent_nes:
    if type(node)==nltk.tree.Tree:
        phrase = [tok for (tok, tag) in node.leaves()]
        if node.label() in entities.keys():
            entities[node.label()].append(' '.join(phrase))

for key, value in entities.items():
    print(key, value)
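
The named entity chunker can also emit labels other than the three keys used above, for example 'GPE' (geo-political entity) for many place names, and those would be skipped by a fixed dictionary. A minimal sketch of one alternative is to collect entities under whatever labels appear. The tree below is built by hand to imitate the shape of ne_chunk() output; the sentence and the names in it are invented.

```python
import nltk
from collections import defaultdict

# Hypothetical hand-built tree in the shape that ne_chunk() returns
toy_nes = nltk.Tree('S', [
    nltk.Tree('PERSON', [('Jane', 'NNP'), ('Doe', 'NNP')]),
    ('won', 'VBD'), ('the', 'DT'), ('election', 'NN'), ('in', 'IN'),
    nltk.Tree('GPE', [('Atlanta', 'NNP')]),
])

# Group entity strings by whichever labels occur in the tree
toy_entities = defaultdict(list)
for node in toy_nes:
    if isinstance(node, nltk.Tree):
        phrase = ' '.join(tok for (tok, tag) in node.leaves())
        toy_entities[node.label()].append(phrase)

print(dict(toy_entities))
```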

## Putting into Practice

### Task 1. Classifying Documents

Not all adjectives are alike; maybe what matters is which adjectives modify certain nouns: "awful movie" clearly sounds like a negative review, while "awful lines" or "awful crowds" might actually indicate that the movie was popular.

Let's try using noun phrases as the features for our sentiment classifier. To do so, we'll need to identify a list of common noun phrases from the corpus, then also chunk each document to see if each noun phrase appears there. These operations are time-consuming, so we don't want to do them twice for each document. Let's first chunk all of the documents in our docs_tuples list, and extract a list of the noun phrases in each.

We might also want to change our grammar slightly, so that it just looks for noun phrases with an adjective followed by a noun (i.e. no articles, no nouns by themselves.)

In [None]:
grammar = "NP: {<JJ><NN.*>}"
cp = RegexpParser(grammar)

def extract_nps(wordlist):
    wordlist_tagged = nltk.pos_tag(wordlist)
    wordlist_chunked = cp.parse(wordlist_tagged)
    nps = []
    for node in wordlist_chunked:
        if type(node)==nltk.tree.Tree and node.label()=='NP':
            phrase = [tok for (tok, tag) in node.leaves()]
            nps.append(' '.join(phrase))
    return nps

docs_tuples_nps = [(extract_nps(wordlist), cat) for (wordlist, cat) in docs_tuples]

Now instead of adjectives, write new code to identify the 1000 most common noun phrases in the corpus, using the chunking grammar and RegexpParser we created above.

In [None]:
# Create a list of the most frequent noun phrases in the entire corpus


Now we will also need to modify the last line, so that we pass in each document's list of nps to the function document_features, since we've already chunked all of the documents in our corpus.

In [None]:
# Create feature sets of document (features, category) tuples


Run the rest of the code as-is, and see what happens to accuracy and informative features.

In [None]:
# Separate train and test sets, train the classifier, print accuracy and best features
train_set, test_set = featuresets[:-100], featuresets[-100:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)  # prints its own table and returns None

### Task 2. Information Extraction

Named Entity Recognition is especially useful for information extraction. Let's say we're especially interested in identifying all of the named people and organizations involved in reported elections, but we don't care about locations or other entity names.

Again, to remind you what we started with, the code below is what we used to extract full sentences if they contained a keyword related to elections.

In [None]:
# Loop through documents and extract each sentence containing an election-related word
elect_sents = []
for doc in news_docs:
    for sent in doc:
        for tok in sent:
            if re.match(elect_regexp, tok):
                elect_sents.append(sent)
                break # Break out of last for loop, so we only add the sentence once
            
len(elect_sents)

Try writing new code to extract all of the named entities that are either a PERSON or ORGANIZATION from a sentence that contains an election-related word.

Now we aren't getting places or things, but we're also missing relevant entities like "voters" because they aren't named entities. Try opening it back up to any noun phrase, but be more specific about position in the sentence: extract a noun phrase only if it appears immediately before or immediately after an election-related word. What might those entities represent?

## 6) Parsing

Breaking down parts of a sentence and identifying their grammatical roles constitutes parsing. There are two main types of parsing in NLP: constituency parsing and dependency parsing. 

### Constituency Parsing

The grammar rule we used to chunk noun phrases represents the first layer of a constituency parse. We can add rules to the grammar to identify other types of phrases as well.

The most common additional types of phrases are prepositional phrases and verb phrases. The standard approach is not just to put prepositions in prepositional phrases, and verbs into verb phrases. Instead, we label a verb plus the noun object it's acting on as a verb phrase. Similarly, a preposition plus the following noun is a prepositional phrase.

In other words, these phrases are nested. The noun phrases we identified with our first rule become components (or constituents) of the verb or prepositional phrases in the next layer up, until we get to the level of the sentence overall.

Let's add those additional phrase types to our grammar. We'll use three quotation marks to indicate a string that covers multiple lines.

In [None]:
grammar = r"""
    NP: {<DT>?<JJ>*<NN.*>+}      # Chunk sequences of DT, JJ, NN
    PP: {<IN><NP>}               # Chunk prepositions followed by NP
    VP: {<VB.*><NP|PP|CLAUSE>+} # Chunk verbs and their arguments
    CLAUSE: {<NP><VP>}           # Chunk NP, VP into a clause
    """
cp = RegexpParser(grammar)

Now let's parse the first sentence in our election-related sentences again.

In [None]:
sent = elect_sents[0]
sent_tagged = nltk.pos_tag(sent)
sent_chunked = cp.parse(sent_tagged)
print(sent_chunked)

We have more nested phrases now, NPs within PPs within VPs. We can draw the tree to better visualize the greater depth.

In [None]:
sent_chunked.draw()

Since this is no longer a flat list of noun phrases or other nodes, it would be less wise to use a simple for loop to iterate through the nodes and look for certain phrases of interest. A tree structure is best traversed using recursion: define a function to perform the operations you want to do on one node, then have the function call itself on each of its children.

In [None]:
def extract_nps_recurs(tree):
    nps = []
    if not isinstance(tree, nltk.tree.Tree):
        return nps
    if tree.label()=='NP':
        nps.append(' '.join([tok for (tok, tag) in tree.leaves()]))
    for subtree in tree:
        nps.extend(extract_nps_recurs(subtree))
    return nps

In [None]:
extract_nps_recurs(sent_chunked)

### Dependency Parsing

Another way to parse sentences is to identify which words are syntactically dependent on other words, and what their dependency relationship is. Dependency parsing usually places the main verb of a sentence at the root of the tree, then assigns the verb's subject, direct object, and indirect objects as dependents. An indirect object will usually be connected to a root verb through a preposition. Nouns can have dependents too: words that modify or describe some aspect of the noun.

> #### Prepositional phrase attachment
> 
> Dependency parsing is very complex; determining which words depend on which other words 
> involves not only part-of-speech tags, but other information that's more specific to each 
> verb or noun in the given sequence. Here's a classic example:
> 
> * "He ate pizza with olives."
> * "He ate pizza with a fork."
> 
> Which word in the sentence does the last word modify? In the first sentence, the olives are 
> on the pizza, so they modify the noun. Saying "He ate with olives" wouldn't make sense without 
> the pizza. In the second sentence, we aren't talking about a thing called "a pizza with a 
> fork"; that doesn't make sense. The fork modifies the verb "ate": "He ate with a fork".
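
To make the attachment difference concrete, here is a small sketch that writes the two analyses as (head, relation, dependent) triples. These triples are hand-written for illustration, not parser output, and the relation names (`nsubj`, `dobj`, `nmod`, `case`, `det`) are assumed to follow Universal Dependencies conventions. `head_of` is a hypothetical helper.

```python
# Hand-written (head, relation, dependent) triples, NOT parser output.
olives = [
    ("ate", "nsubj", "He"),
    ("ate", "dobj", "pizza"),
    ("pizza", "nmod", "olives"),   # "with olives" attaches to the noun
    ("olives", "case", "with"),
]
fork = [
    ("ate", "nsubj", "He"),
    ("ate", "dobj", "pizza"),
    ("ate", "nmod", "fork"),       # "with a fork" attaches to the verb
    ("fork", "case", "with"),
    ("fork", "det", "a"),
]

def head_of(word, triples):
    """Return the head that `word` depends on, or None if it is the root."""
    for head, rel, dep in triples:
        if dep == word:
            return head
    return None

print(head_of("olives", olives))  # pizza
print(head_of("fork", fork))      # ate
```

The two sentences have the same part-of-speech sequence apart from the extra article, yet the last noun hangs off a different head in each.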

Because of these nuances, dependency parsers are usually built using extensive training data, in the form of "treebanks" of sentences annotated with dependency relations. Several major dependency parsers are available in pre-trained form for English language text. It is also possible to train open-source dependency parsers on other publicly available treebanks (such as from the Universal Dependencies project, which offers annotated treebanks in many languages).

Today, we'll work with the Stanford Parser, which is part of the Stanford CoreNLP toolkit. Stanford CoreNLP provides a number of state-of-the-art NLP tools and is widely used by computer scientists as well as social scientists and humanists. It is written in Java, but there are APIs that enable you to access some of the tools from Python. Several of the most popular tools can be used through NLTK.

### Stanford CoreNLP

To get started, you'll need to have downloaded the Stanford Parser from this website: http://nlp.stanford.edu/software/lex-parser.html#Download and unzipped it to a location on your computer that's easy to find (e.g. a folder called SourceCode in your Documents folder).

Then in Python, import the StanfordDependencyParser class from NLTK's parser package. You'll also need to import the module 'os' and set the following environment variables to the location on your computer where you put the unzipped Stanford Parser folder.

In [None]:
import os
from nltk.parse.stanford import StanfordDependencyParser

# Replace these paths with the location of your own unzipped Stanford Parser folder
os.environ['STANFORD_PARSER'] = '/Users/natalieahn/Documents/SourceCode/stanford-parser-full-2016-10-31'
os.environ['STANFORD_MODELS'] = '/Users/natalieahn/Documents/SourceCode/stanford-parser-full-2016-10-31'

Now let's create a dependency parser object and try parsing our election-related sentences.

In [None]:
dependency_parser = StanfordDependencyParser(model_path="edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz")
sents_parsed = dependency_parser.parse_sents(elect_sents)

The NLTK interface to the Stanford Parser returns an iterator (over iterators) of NLTK DependencyGraph objects. To be able to access the graph objects more than once, we can convert this into a list:

In [None]:
sents_parseobjs = [obj for sent in sents_parsed for obj in sent]
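
To see why the conversion matters, here is a tiny sketch with no parser involved. `fake_parse_sents` is a hypothetical stand-in that mimics only the iterator-of-iterators shape of the return value; it simply upper-cases strings.

```python
def fake_parse_sents(sents):
    """Return an iterator of iterators, one inner iterator per sentence."""
    return iter(iter([s.upper()]) for s in sents)

parsed = fake_parse_sents(["a", "b"])

# Flatten into a list, exactly as in the comprehension above.
flat = [obj for sent in parsed for obj in sent]
print(flat)          # ['A', 'B']

# The iterator is now exhausted, so a second pass yields nothing.
print(list(parsed))  # []
```

The list can be indexed and looped over as many times as needed, whereas the original iterator is used up after one pass.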

In [None]:
len(sents_parseobjs)

The graph object contains a method .tree() to depict the parse tree. (If we add .draw(), it will open in a separate window.)

In [None]:
sents_parseobjs[0].tree()

This tree shows us the dependencies (i.e. the arcs), but it doesn't show us the labeled dependency relations, which are a huge part of the value of dependency parsing. In other words, it shows us that "investigation" and "evidence" are both dependents of the verb "produced", but it doesn't show which was the subject and which the object of the action.

The method .triples() extracts dependency triples of the form: ((head word, head tag), rel, (dep word, dep tag)). So for every head word - dependent word pair, it will give us a triple, with the dependency relation label in between. (The method .triples() also returns an iterator; here we'll just use a for loop to print out each triple.)

In [None]:
for triple in sents_parseobjs[0].triples():
    print(triple)

This list of triples repeats many head words in order to capture all of their relations. We can also view the parse information in CoNLL format. (CoNLL stands for the Conference on Computational Natural Language Learning, organized by SIGNLL; its annual shared tasks have often focused on syntactic and semantic parsing.) The CoNLL-formatted output is a string with one line for each word in the original sentence. Each line contains the word, its part-of-speech tag (two versions), the line number of the head word it directly depends on, and the label for that dependency relation.

In [None]:
print(sents_parseobjs[0].to_conll(10))
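
If you want to work with the CoNLL output programmatically, one option is to split it into one dictionary per word. The sketch below assumes the 10-column output is tab-separated and follows the usual CoNLL-X column order; the column names in `CONLL_COLUMNS`, the helper `read_conll`, and the sample string are all hand-written for illustration rather than real parser output.

```python
# Assumed column order for the 10-column format (names are illustrative).
CONLL_COLUMNS = ["id", "word", "lemma", "ctag", "tag",
                 "feats", "head", "rel", "phead", "pdeprel"]

# Hand-written sample in the same layout, NOT real parser output.
sample = "\n".join([
    "1\tThe\t_\tDT\tDT\t_\t2\tdet\t_\t_",
    "2\tinvestigation\t_\tNN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tproduced\t_\tVBD\tVBD\t_\t0\troot\t_\t_",
    "4\tno\t_\tDT\tDT\t_\t5\tdet\t_\t_",
    "5\tevidence\t_\tNN\tNN\t_\t3\tdobj\t_\t_",
])

def read_conll(conll_string):
    """Turn a CoNLL-formatted string into a list of row dictionaries."""
    rows = []
    for line in conll_string.strip().split("\n"):
        row = dict(zip(CONLL_COLUMNS, line.split("\t")))
        row["id"] = int(row["id"])      # line number of this word
        row["head"] = int(row["head"])  # line number of its head (0 = root)
        rows.append(row)
    return rows

rows = read_conll(sample)
for row in rows:
    print(row["id"], row["word"], row["tag"], row["head"], row["rel"])
```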

In [None]:
print(dir(sents_parseobjs[0]))

## Putting into Practice

### Task 2. Information Extraction

Parse trees can be used to extract features for document classification, but today we'll focus on applying dependency parsing to the task of information extraction. This should finally enable us to identify entities with particular roles in elections, like voters, candidates, winners, and losers.

Let's try looking for winners. What nouns, in relation to what verbs, would represent the winner of an election? We can look for the subject of verbs like "win" (or "won"), or maybe "defeated" (the direct object would be the loser). We can also look for the direct object of verbs like "elected".

Note: Dependency roles "subject" or "direct object" are syntactic (or grammatical) roles. Roles in events like "winner" or "loser" are semantic (or meaningful) roles. When we use a dependency parse and then add our own rules to extract certain entities that mean something in a real-world event, we're doing "semantic role labeling." This is a hard, complicated task, and we're just scratching the surface of it with this simple example.

Since we're only looking at the dependency relation between a verb and one subject or object in this case, we can use the NLTK graph object's method to get dependency triples. See if you can fill in the rest of the code below to extract words that might represent a winner of an election.
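
A minimal sketch of one way to approach this, assuming triples of the form ((head word, head tag), rel, (dep word, dep tag)) as described earlier. The verb lists, the relation labels (`nsubj`, `dobj`, `nsubjpass`), and the sample triples are assumptions made for illustration; check them against the labels your parser actually produces.

```python
# Hypothetical verb lists; extend them after inspecting your own data.
WIN_VERBS = {"win", "wins", "won", "defeats", "defeated"}  # subject is the winner
ELECT_VERBS = {"elect", "elects", "elected"}               # object is the winner

def extract_winners(triples):
    """Collect dependent words that plausibly name the winner of an election."""
    winners = []
    for (head, head_tag), rel, (dep, dep_tag) in triples:
        verb = head.lower()
        if verb in WIN_VERBS and rel == "nsubj":
            winners.append(dep)
        elif verb in ELECT_VERBS and rel in ("dobj", "nsubjpass"):
            # nsubjpass covers passives such as "Smith was elected"
            winners.append(dep)
    return winners

# Hand-written triples, NOT parser output.
triples = [
    (("won", "VBD"), "nsubj", ("Kennedy", "NNP")),
    (("won", "VBD"), "dobj", ("election", "NN")),
    (("elected", "VBD"), "nsubj", ("voters", "NNS")),
    (("elected", "VBD"), "dobj", ("Smith", "NNP")),
]
print(extract_winners(triples))  # ['Kennedy', 'Smith']
```

On the real data, the same function could be called on `sent.triples()` for each parsed sentence in `sents_parseobjs`.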

## Wrapping Up, Further Exploration:

This last exercise didn't give us a lot of clear entities. But you can see where we're headed. It may be because "elect" and "vote" aren't often the main verb in a sentence about elections. We could also look for subjects of verbs like "won" or "defeated", but that would only show us the candidates, not who elected them.

We also aren't getting entities' full names, just the head word of a noun phrase. And many of the subjects and objects we extracted are pronouns like "it" or "they", so we'd need to look at a previous sentence or clause to figure out what entity that pronoun refers to, adding another NLP task called coreference resolution.

Finally, if we wanted to know which voters or group elected which candidate, we'd need to look at multiple dependents of the same verb. The dependency triples alone don't make that easy. Instead, we could look in the CoNLL-formatted output for lines that have the same line number in the "head" column. Or we could construct a tree of nodes out of the CoNLL output, then traverse it with a recursive function.
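
The first of those ideas, grouping lines that share a head, might look roughly like the sketch below. The rows are hand-written (line number, word, head line number, relation) tuples standing in for parsed CoNLL lines, and `dependents_by_head` is a hypothetical helper; the relation labels are assumed.

```python
from collections import defaultdict

# Hand-written rows: (line number, word, head line number, relation).
rows = [
    (1, "Voters", 2, "nsubj"),
    (2, "elected", 0, "root"),
    (3, "Smith", 2, "dobj"),
    (4, "yesterday", 2, "nmod:tmod"),
]

def dependents_by_head(rows):
    """Group dependents under the line number of their head word."""
    grouped = defaultdict(dict)
    for idx, word, head, rel in rows:
        # Keeps one word per relation; a later duplicate would overwrite it.
        grouped[head][rel] = word
    return grouped

words = {idx: word for idx, word, head, rel in rows}
grouped = dependents_by_head(rows)

# Report verbs that have both a subject and a direct object.
for head_id, deps in grouped.items():
    if "nsubj" in deps and "dobj" in deps:
        print(deps["nsubj"], words[head_id], deps["dobj"])  # Voters elected Smith
```

Because the grouping key is a line number rather than the word itself, repeated words in a sentence stay distinct.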

Clearly, information extraction is complicated. Exploring these additional options is beyond the scope of this workshop. The NLTK book discusses additional resources (see Chapter 7 on information extraction: http://www.nltk.org/book/ch07.html), and the Stanford CoreNLP toolkit provides other tools that can help as well (full suite here: http://nlp.stanford.edu/software/).