# Introduction to Data Science, CS 5963 / Math 3900
## Lab 17: Natural Language Processing (NLP)

In this lab, we'll do sentiment analysis on movie reviews. To do so, we'll introduce the [Natural Language Toolkit (NLTK)](http://www.nltk.org/), a Python library for natural language processing. 

**Further reading:** 

[S. Bird, E. Klein, and E. Loper, *Natural Language Processing with Python - Analyzing Text with the Natural Language Toolkit*](http://www.nltk.org/book/). 


[C. Manning and H. Schütze, *Foundations of Statistical Natural Language Processing* (1999).](http://nlp.stanford.edu/fsnlp/)

[D. Jurafsky and J. H. Martin, *Speech and Language Processing* (2016).](https://web.stanford.edu/~jurafsky/slp3/)

### Recall 
**Last lecture:** Guest lecturer Vivek Srikumar gave a nice overview of Natural Language Processing (NLP). He gave several examples of NLP tasks: 
1. Part-of-speech tagging
2. Information extraction
3. Sentiment analysis
4. Semantic parsing

One of the major takeaways from his talk is that the current state-of-the-art for many NLP tasks is to find a good way to represent the text ("extract features") and then to use machine learning / statistics tools, such as classification or clustering. 

Our goal today is to use NLTK + scikit-learn to do some basic NLP tasks.  

### Install datasets and models

To use NLTK, you must first download and install the datasets and models. I couldn't do this directly from a Jupyter notebook and instead had to go to the command line and enter:

```python
$ python3
>>> import nltk
>>> nltk.download('all')
```

In [None]:
# imports and setup
import nltk

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
# cross_val_score lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (15, 9)
plt.style.use('ggplot')

## Basics of NLTK

In [None]:
from nltk.book import *

### Searching text

The text of Monty Python and the Holy Grail has 16,967 words. 

In [None]:
len(text6)

The word "swallow" appears 10 times. 

In [None]:
text6.count("swallow")

We might want to know the context in which "swallow" appears in the text.

"You shall know a word by the company it keeps."    ---John Firth

Use the `concordance` function to print out the words just before and after all occurrences of the word "swallow". 

In [None]:
text6.concordance("swallow")
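
To make the idea behind `concordance` concrete, here is a minimal plain-Python sketch. The helper `simple_concordance` and the toy token list are hypothetical and made up for illustration; this is not NLTK's implementation, which formats its output differently.

```python
def simple_concordance(tokens, target, window=3):
    """Return each occurrence of `target` with up to `window` tokens on each side."""
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# Toy token list (an assumption for illustration, not taken from text6)
toy_tokens = ("what is the airspeed velocity of an unladen swallow ? "
              "an african or european swallow ?").split()

for left, word, right in simple_concordance(toy_tokens, "swallow"):
    print(f"{left:>25} [{word}] {right}")
```

Each hit is a (left context, word, right context) triple, which is the same information `concordance` prints line by line.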

We can see what other words appear in the same context using the  `similar` function.  

In [None]:
text6.similar("african")

This means that 'african' and 'unladen' both appeared in the text with the same word just before and just after. To see what the phrase is, we can use the `common_contexts` function. 

In [None]:
text6.common_contexts(["unladen", "african"])

We see that both "an unladen swallow" and "an african swallow" appear in the text. 
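
The notion of a shared context can be sketched in a few lines of plain Python: record the (previous word, next word) pair around every token, then intersect the sets for two words. The helper `contexts_by_word` and the toy sentence are hypothetical; NLTK's own bookkeeping may differ in its details.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    for prev, word, nxt in zip(lowered, lowered[1:], lowered[2:]):
        contexts[word].add((prev, nxt))
    return contexts

# Toy sentence (an assumption for illustration, not taken from text6)
toy_tokens = "an unladen swallow and an african swallow flew by".split()
ctx = contexts_by_word(toy_tokens)

# Contexts in which both words occur
shared = ctx["unladen"] & ctx["african"]
print(shared)
```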

In [None]:
text6.concordance("unladen")
print()
text6.concordance("african")

### Dispersion plot

`text4` is the Inaugural Address Corpus, which includes inaugural addresses going back to 1789. 
We can use a dispersion plot to see where in a text certain words appear, and hence how the language of the addresses has changed over time. 


In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duty", "America", "nation", "God"])
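
The data behind a dispersion plot is simply the list of token positions at which each word occurs. A minimal sketch, assuming a hypothetical helper `word_offsets` and a made-up token list:

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {t: [] for t in targets}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Toy token list (an assumption for illustration, not taken from text4)
toy_tokens = "duty and freedom ; freedom and duty ; freedom for the nation".split()
offsets = word_offsets(toy_tokens, ["freedom", "duty", "nation"])
print(offsets)
```

Plotting one row of tick marks per word at these positions gives the kind of picture shown above.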

### Representing  language using statistics

We'll represent a text by counting the frequency of different words.

The total number of words ("outcomes") in Moby Dick is 260,819 and the number of different words is 19,317. 

In [None]:
fdist1 = FreqDist(text1)
print(fdist1)

# find 50 most common words
print('\n',fdist1.most_common(50))

# not surprisingly, whale occurs quite frequently (906 times!)
print('\n', fdist1['whale'])
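
For intuition, the same kind of frequency counts can be sketched with the standard library's `collections.Counter` on a toy token list. The tokens below are made up for illustration and the numbers have nothing to do with Moby Dick.

```python
from collections import Counter

# Toy token list (an assumption for illustration)
toy_tokens = "the whale saw the ship and the ship saw the whale".split()
counts = Counter(toy_tokens)

total_tokens = sum(counts.values())   # total number of tokens ("outcomes")
distinct_tokens = len(counts)         # number of different tokens
print(total_tokens, distinct_tokens)
print(counts.most_common(3))          # most frequent tokens with their counts
print(counts["whale"])                # count of a single word
```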

### Tricky python
We can find all the words in Moby Dick with more than 15 characters

*Note:* it's faster to work with a set, since each distinct word is then checked only once.

In [None]:
V = set(text1)
long_words = [w.lower() for w in V if len(w) > 15]
sorted(long_words)

### Stopwords
It might be useful to ignore frequently used words. These are referred to as *stopwords*.

In [None]:
from nltk.corpus import stopwords
# use a separate name so the list does not shadow the imported corpus reader
stop_words = stopwords.words('english')
print(stop_words)

**Exercise:** Find the most frequently used words in Moby Dick which are not stopwords and not punctuation. 

In [None]:
# your code here



Is there a difference in the frequency with which stopwords appear in the different texts? 

In [None]:
def content_fraction(text):
    # a set makes the membership test below much faster than a list
    stopwords = set(nltk.corpus.stopwords.words('english'))
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

for i,t in enumerate([text1,text2,text3,text4,text5,text6,text7,text8,text9]):
    print(i+1,content_fraction(t))

Apparently, "text8: Personals Corpus" has the highest fraction of content (non-stopword) tokens. 

### Collocations
A *collocation* is a sequence of words that occur together unusually often. 

In [None]:
text8.collocations()
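
One way to make "unusually often" precise is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur if they were independent. The sketch below is an illustration under that assumption, not a description of how NLTK scores collocations; NLTK applies its own scoring and filtering. The helper `bigrams_by_pmi` and the toy tokens are hypothetical.

```python
import math
from collections import Counter

def bigrams_by_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, keeping pairs seen at least `min_count` times."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    scores = {}
    for (w1, w2), count in bigram.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigram[w1] / n_tokens
        p_w2 = unigram[w2] / n_tokens
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy token list (an assumption for illustration, not taken from text8)
toy_tokens = ("the cat saw the dog ; the dog saw new york ; "
              "new york saw the cat").split()
ranked = bigrams_by_pmi(toy_tokens)
print(ranked[0])
```

Here "new york" ranks first: both words are rare on their own, yet they always appear together, whereas pairs involving the frequent word "the" score lower.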

## Sentiment analysis for movie reviews
We ask the simple question: Is the attitude of a movie review positive or negative? 

A *corpus* is a large body of linguistic data. (Plural: corpora)

Our data is a corpus consisting of 2000 movie reviews together with the user's sentiment polarity (positive or negative). Our goal is to predict the sentiment polarity from just the review. 

Of course, this is something that we can do very easily: 
1. That movie was terrible. -> negative
2. That movie was great! -> positive

More information about this dataset is available [from this website](https://www.cs.cornell.edu/people/pabo/movie-review-data/).



In [None]:
from nltk.corpus import movie_reviews as reviews

The dataset contains 1000 positive and 1000 negative movie reviews. 

In [None]:
num_reviews = len(reviews.fileids())
print(num_reviews)
print(len(reviews.fileids('pos')),len(reviews.fileids('neg')))

Let's see the review for the third movie. It's a negative review! 

In [None]:
# the name of the file 
fid = reviews.fileids()[2]
print(fid)

print('\n', reviews.raw(fid))

print('\n', reviews.categories(fid) )

print('\n', reviews.words(fid))

### Sentiment Analysis 
Goal: build a classifier that predicts the label (`'neg'` or `'pos'`) from the review text.

`reviews.categories(file_id)` returns a one-element list containing the label (`['neg']` or `['pos']`) for that review, which is why the code below takes `x[0]`.

In [None]:
categories = [reviews.categories(fid) for fid in reviews.fileids()]
my_dictionary = {'pos':1, 'neg':0}
y = [my_dictionary[x[0]] for x in categories]

In [None]:
doc_words = [list(reviews.words(fid)) for fid in reviews.fileids()]

In [None]:
# first 10 words of the third document
doc_words[2][:10]

Get all of the words in the reviews and make a FreqDist

In [None]:
all_words = nltk.FreqDist(w.lower() for w in reviews.words())

We  define a feature for each word, indicating whether the document contains that word. To limit the total number of features, we construct a list of the 4000 most frequently appearing words in the  corpus. 

In [None]:
num_features = 4000
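# most_common sorts by frequency; list(all_words) alone is not ordered by count
word_features = [w for w, _ in all_words.most_common(num_features)]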
print(word_features)

We define a function that takes a document and returns a list of zeros and ones indicating which of the words in  `word_features` appears in that document. 

In [None]:
def document_features(document):
    document_words = set(document)
    features = np.zeros(num_features)
    for i,word in enumerate(word_features):
        features[i] = (word in document_words)
    return features
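
To see what such a feature vector looks like, here is a toy version with a four-word vocabulary. The names `toy_vocab` and `toy_document_features` and the sample review are hypothetical and used only for illustration.

```python
toy_vocab = ["great", "terrible", "plot", "boring"]

def toy_document_features(document, vocab):
    """Return a 0/1 list with a 1 wherever the vocabulary word occurs in the document."""
    present = set(document)  # set lookups are fast
    return [int(word in present) for word in vocab]

# Made-up review, already split into tokens
review = "the plot was great , really great".split()
toy_features = toy_document_features(review, toy_vocab)
print(toy_features)  # [1, 0, 1, 0]
```

Note that the feature only records presence: "great" appears twice in the review but still contributes a single 1.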

Let's just focus on the third document. Which words from `word_features` are in this document? 

In [None]:
words_in_doc_2 = document_features(doc_words[2])
print(words_in_doc_2)

inds = np.where(words_in_doc_2 == 1)[0]
print('\n', [word_features[i] for i in inds])

In [None]:
X = np.zeros([num_reviews,num_features])
for ii in range(num_reviews):
    X[ii,:] = document_features(doc_words[ii])

### Sentiment Analysis = Classification 

Now that we have features for each document and labels, we have a classification problem! 

NLTK has a built-in classifier, but we'll use the scikit-learn classifiers we're already familiar with. 

In [None]:
k = 5
model = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(model, X, y, cv=10)
print(scores)

In [None]:
model = svm.SVC(kernel='rbf',C=30)
scores = cross_val_score(model, X, y, cv=10)
print(scores)

Of course, we could now use the cross-validation scores to choose the parameters `k` and `C`.
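
One possible way to carry out such a search is scikit-learn's `GridSearchCV`, which runs cross-validation for each candidate value and keeps the best one. The sketch below uses small synthetic data (`X_toy`, `y_toy`, both made up here) instead of the review features so that it runs quickly; the candidate values for `n_neighbors` are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the review features: 200 "documents", 20 binary features
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 20))
# The label depends on the first three features, so there is something to learn
y_toy = (X_toy[:, :3].sum(axis=1) >= 2).astype(int)

# Try several values of k, scoring each by 5-fold cross-validation
param_grid = {"n_neighbors": [1, 3, 5, 9, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

The same pattern would apply to `svm.SVC` with a grid over `C`.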

## We could have also used the Classifier from the NLTK library

Below is the sentiment analysis from [Ch. 6 of the NLTK book](http://www.nltk.org/book/ch06.html). 



In [None]:
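import random

documents = [(list(reviews.words(fileid)), category)
             for category in reviews.categories() 
             for fileid in reviews.fileids(category)]
# shuffle so the train/test split below mixes 'neg' and 'pos' reviews
random.shuffle(documents)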

Extract the features from all of the documents

In [None]:
def document_features(document):
    # Note: checking whether a word occurs in a set is much faster 
    # than checking whether it occurs in a list     
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]

Split into train_set, test_set and perform classification 

In [None]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(10)
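
For intuition about what a naive Bayes classifier computes, here is a minimal plain-Python sketch over the same kind of feature dictionaries. It is an illustration under simplifying assumptions (boolean features, add-one smoothing) and not NLTK's implementation, whose smoothing may differ. The functions `train_nb` and `classify_nb` and the toy training set are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels and, per label, how often each feature takes each value."""
    label_counts = Counter(label for _, label in labeled_featuresets)
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in labeled_featuresets:
        for name, value in features.items():
            value_counts[label][name][value] += 1
    return label_counts, value_counts

def classify_nb(model, features):
    """Pick the label maximizing log P(label) + sum of log P(feature value | label)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)
        for name, value in features.items():
            count = value_counts[label][name][value]
            # add-one smoothing over the two possible values (True / False)
            score += math.log((count + 1) / (n_label + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data in the same format as `featuresets`
toy_train = [
    ({"contains(great)": True, "contains(boring)": False}, "pos"),
    ({"contains(great)": True, "contains(boring)": False}, "pos"),
    ({"contains(great)": False, "contains(boring)": True}, "neg"),
    ({"contains(great)": False, "contains(boring)": True}, "neg"),
]
toy_model = train_nb(toy_train)
print(classify_nb(toy_model, {"contains(great)": True, "contains(boring)": False}))
```

The "naive" part is the assumption that features are independent given the label, which is what lets the score be a simple sum of per-feature terms.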

### Improvements? 

More data, better features: n-grams, part of speech tagging, ... 
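
As a small illustration of why n-grams might help: single-word features see "good" in "not good" but miss the negation, while bigrams keep the two words together. The helper `ngrams` below is a hypothetical plain-Python sketch, and the review is made up.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Made-up review, already split into tokens
review = "this movie was not good".split()
review_bigrams = ngrams(review, 2)
print(review_bigrams)
```

A feature such as "contains the bigram ('not', 'good')" could then be added alongside the single-word features.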