# NLTK

NLTK is a large collection of NLP tools. We won't have time to cover everything, so we'll focus on the most common tools:

[Existing corpora](#existing)<br>

[Tokenization](#tokenization)<br>

[Sentence segmentation](#sent-seg)<br>

[Collocations](#collocations)<br>

[Sentiment analysis](#sentiment)<br>

[Stemming](#stemming)<br>

[What we didn't cover](#didnt)<br>

### Time
- Teaching: 30 minutes
- Exercises: 30 minutes

In [None]:
%matplotlib inline
import os
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Existing corpora <a id='existing'></a>

When you downloaded data from nltk using `nltk.download('all')`, you downloaded a whole bunch of great corpora (collections of text documents) and lexical resources (structured information about words). This gives us data to work with already! If you ever want to learn/practice an NLP method, know that just by importing nltk you have access to some data. Here are some corpora and resources that are particularly useful and that we'll use throughout this workshop:

- ABC
- Brown
- CMU pronunciation dictionary
- Genesis
- Project Gutenberg selections
- Inaugural addresses
- Movie reviews
- Names
- State of the Union addresses
- Stopwords
- Twitter samples
- Universal Declaration of Human Rights
- WordNet

Full list of data in NLTK [here](http://www.nltk.org/nltk_data/).

In [None]:
from nltk.corpus import (abc, brown, cmudict, genesis, gutenberg,
                         inaugural, movie_reviews, names, state_union, 
                         stopwords, swadesh, twitter_samples, udhr2, 
                         wordnet)

Corpora in NLTK are special objects that give you exactly the data you want, and only when you ask for it. For example, `brown` is not a string or a list of words.

In [None]:
brown

#### Words, raw, sents, fileids

But if I wanted the Brown corpus as a list of words, I could ask for it like this:

In [None]:
brown.words()

Similarly, if I wanted the text of the ABC corpus as a string, I could get it like this:

In [None]:
abc.raw()[:100]

If you want the sentences of a corpus, you can ask for them like this:

In [None]:
movie_reviews.sents()

These corpora are often made up of multiple files. You can see the file names by using the `fileids` method.

In [None]:
names.fileids()

To restrict the words, raw or sents to just the words/raw/sents in a particular file, you can list the file name as an optional argument to the `words`/`raw`/`sents` method.

In [None]:
male_names = names.words('male.txt')
male_names[:10]

#### Unique properties

Some corpora have unique features. For example, the CMU pronunciation dictionary lists one or more standard pronunciations for each English word it contains.

In [None]:
pronunciation = cmudict.dict()

In [None]:
pronunciation['hello']
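
One thing you can do with these entries is count syllables. Here's a minimal sketch, assuming the entry for "hello" has the shape hard-coded below (a list of pronunciations, each a list of ARPAbet phonemes in which vowels carry a stress digit). `count_syllables` is a hypothetical helper, not part of NLTK.

```python
# Hard-coded stand-in for pronunciation['hello'] (assumed shape: a list of
# pronunciations, each a list of ARPAbet phoneme strings).
hello_entry = [['HH', 'AH0', 'L', 'OW1'], ['HH', 'EH0', 'L', 'OW1']]

def count_syllables(phonemes):
    """Count vowel phonemes, i.e. those that end in a stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())

# One syllable count per listed pronunciation.
syllables = [count_syllables(p) for p in hello_entry]
print(syllables)
```

If that assumption holds in your installation, swapping `hello_entry` for `pronunciation['hello']` should give the same result.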

### Male vs. female names

In [None]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')

In [None]:
def last_letter(name):
    """Returns the last letter of `name`."""
    return name.strip()[-1]

def count_letters(names):
    """Returns the distribution of the last letters in `names`."""
    return pd.Series([last_letter(n) for n in names]).value_counts(normalize=True)

def letter_distribution():
    male_value_counts = count_letters(male_names)
    female_value_counts = count_letters(female_names)
    return pd.DataFrame.from_dict({'male': male_value_counts, 'female': female_value_counts})

df = letter_distribution()
df.plot(kind='bar', figsize=(16, 8))
plt.legend(prop={'size': 20})
plt.xticks(rotation=0, size=20);

### Challenge

- Count the lengths of the sentences (i.e. the number of words per sentence) in the `inaugural` corpus. Find the minimum, average and maximum sentence length.
- Visualize the distribution of lengths.
- Count the number of times the following words appear in the corpus: "america", "citizen", "united", "senate" and "freedom".
- If you are surprised by anything in the answer to the last question, think about capitalization issues. Make all words lowercase and then perform your counts.
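
As a hint for the capitalization issue, here's a small sketch on a made-up token list (not the `inaugural` corpus). It only shows how case-sensitive and case-insensitive counts can differ; `toy_tokens` is hypothetical.

```python
from collections import Counter

# Hypothetical token list standing in for a real corpus.
toy_tokens = ["America", "is", "free", ".", "Freedom", "for", "america", "!"]

case_sensitive = Counter(toy_tokens)
case_insensitive = Counter(t.lower() for t in toy_tokens)

print(case_sensitive["america"])    # counts only the exact lowercase spelling
print(case_insensitive["america"])  # counts every capitalization variant
```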

In [None]:
# your answer goes here

In [None]:
# your answer goes here

In [None]:
for word in ["america", "citizen", "united", "senate", "freedom"]:
    pass  # your answer goes here (replace `pass` with your code)

In [None]:
# your answer goes here

## Tokenization <a id='tokenization'></a>

More often than not, you'll want to analyze some text that doesn't come from NLTK. Perhaps you've scraped a few websites and stored the text in a text file. One of the first steps in processing your text data is tokenization. **Tokenization refers to breaking a running string of text into individual words.**

I've downloaded the text contents of the Wikipedia page on [Python][1], and saved it in the `data` directory. We can read it in as follows:

[1]: https://en.wikipedia.org/wiki/Python_(programming_language)

In [None]:
DATA_DIR = 'data'
python_wiki_fname = os.path.join(DATA_DIR, 'python_wikipedia.txt')
with open(python_wiki_fname) as f:
    text = f.read()

Now, `text` is a string:

In [None]:
text[:100]

We can tokenize this string by using nltk's `word_tokenize` function, which returns a list of strings. Each string is either a word or a punctuation symbol.

In [None]:
tokens = nltk.word_tokenize(text)
tokens[:10]

This uses NLTK's recommended tokenizer. There are plenty of [other tokenizers in NLTK](https://github.com/nltk/nltk/tree/develop/nltk/tokenize), but unless you have good reason to do otherwise it's best to stick to the recommended tokenizer.
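
To see why a dedicated tokenizer is worth having, compare plain whitespace splitting with a small regex tokenizer. This sketch uses only the standard library; `simple_tokenize` is a hypothetical helper and is much cruder than `word_tokenize`.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Python is dynamically typed, isn't it?"

# Whitespace splitting leaves punctuation glued to words ("typed,", "it?").
print(sample.split())

# The regex separates punctuation, but it also chops "isn't" into three pieces.
print(simple_tokenize(sample))
```

Contractions like "isn't" are one of the places where a purpose-built tokenizer tends to make more sensible choices than a one-line regex.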

### Challenge

I've also downloaded the Wikipedia page for [Berkeley, California][2], and saved the contents as a file called 'berkeley_wikipedia.txt'. Borrowing from the code above, read this file in and tokenize the text. Then find the 10 most frequent "words". If you don't want to count punctuation symbols as "words", remove all punctuation symbols and find the 10 most frequent words again.

[2]: https://en.wikipedia.org/wiki/Berkeley,_California

In [None]:
# your answer goes here

In [None]:
from string import punctuation
# your answer goes here

## Sentence segmentation <a id='sent-seg'></a>

Sentence segmentation refers to finding the beginnings and ends of sentences. It's also sometimes called sentence tokenization. Again, there are lots of ways to do this in NLTK, but a default method has conveniently been chosen for us. The `nltk.sent_tokenize` function takes in a string and returns a list of strings, where each string is a sentence.

In [None]:
sents = nltk.sent_tokenize(text)
sents[:2]
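
Sentence segmentation is harder than it looks. Here's a naive sketch that splits after sentence-final punctuation followed by a capital letter. `naive_sent_split` is a hypothetical helper written for this illustration, not something NLTK provides.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)

sample = "Dr. Smith likes Python. It was released in 1991! Is it popular?"

# The abbreviation "Dr." is wrongly treated as the end of a sentence.
print(naive_sent_split(sample))
```

The sample has three sentences, but the naive rule returns four pieces. Abbreviations like this are a big part of why it's worth using a trained segmenter instead of a regex.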

## Collocations <a id='collocations'></a>

Collocations are words that frequently appear together. They can help us identify key phrases in a text. Collocations can be bigrams (two words), trigrams (three words) or 4-grams (four words). In NLTK, we can use the `BigramCollocationFinder` to find all the bigram collocations in a text. First, we feed in the tokenized text. Here, we'll use the 'learned' portion of the Brown corpus.

In [None]:
tokens = brown.words(categories='learned')
collocations = nltk.BigramCollocationFinder.from_words(tokens)

Then we decide what to filter out. I don't want stopwords or words shorter than three characters, and I only want bigrams that appear at least three times.

In [None]:
ignored_words = set(stopwords.words('english'))  # a set makes the membership test fast
word_filter = lambda w: len(w) < 3 or w.lower() in ignored_words
collocations.apply_freq_filter(3)
collocations.apply_word_filter(word_filter)

Then we decide what method NLTK should use to decide what makes a collocation special. We'll use the likelihood ratio, which is a good standard choice.

In [None]:
scorer = nltk.collocations.BigramAssocMeasures.likelihood_ratio
collocations.nbest(scorer, 15)
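
To get a feel for what a likelihood ratio score measures, here's a standard-library sketch of the G² statistic on a 2x2 contingency table of bigram counts. As far as I understand, this is the idea behind `BigramAssocMeasures.likelihood_ratio`, but NLTK's implementation may differ in details such as smoothing. The function and the toy text below are hypothetical.

```python
import math
from collections import Counter

def likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """G^2 statistic for a bigram (w1, w2).

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams whose first word is w1
    n_xi: count of bigrams whose second word is w2
    n_xx: total number of bigrams
    """
    # Observed cells: (w1, w2), (w1, not w2), (not w1, w2), (not w1, not w2)
    observed = [n_ii, n_ix - n_ii, n_xi - n_ii, n_xx - n_ix - n_xi + n_ii]
    row_totals = [n_ix, n_ix, n_xx - n_ix, n_xx - n_ix]
    col_totals = [n_xi, n_xx - n_xi, n_xi, n_xx - n_xi]
    # Expected cell counts if w1 and w2 occurred independently of each other.
    expected = [r * c / n_xx for r, c in zip(row_totals, col_totals)]
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Made-up text, just to have something to count.
toy_tokens = "new york is big and new york is busy and the city is new".split()
bigrams = list(zip(toy_tokens, toy_tokens[1:]))
bigram_counts = Counter(bigrams)
first_counts = Counter(w1 for w1, _ in bigrams)
second_counts = Counter(w2 for _, w2 in bigrams)

scores = {
    bg: likelihood_ratio(n, first_counts[bg[0]], second_counts[bg[1]], len(bigrams))
    for bg, n in bigram_counts.items()
}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

A score near zero means the two words co-occur about as often as chance would predict. The further the observed counts are from the expected ones, the higher the score.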

This was kinda messy. We can wrap all this up into a nicer function that just takes in the tokens and spits out the collocations.

In [None]:
def my_collocations(tokens):
    collocations = nltk.BigramCollocationFinder.from_words(tokens)
    ignored_words = set(stopwords.words('english'))  # a set makes the membership test fast
    word_filter = lambda w: len(w) < 3 or w.lower() in ignored_words
    collocations.apply_freq_filter(3)
    collocations.apply_word_filter(word_filter)
    scorer = nltk.collocations.BigramAssocMeasures.likelihood_ratio
    return collocations.nbest(scorer, 15)

And now run `my_collocations` on some new text.

In [None]:
my_collocations(state_union.words())

In [None]:
emma = gutenberg.words('austen-emma.txt')
my_collocations(emma)

In [None]:
my_collocations(genesis.words('english-kjv.txt'))

## Sentiment analysis <a id='sentiment'></a>

NLTK has support for sentiment analysis. [Sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis) is the task of extracting [affective states](https://en.wikipedia.org/wiki/Affect_(psychology)) from text. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. It was first developed as a standalone [Python package](https://github.com/cjhutto/vaderSentiment), which was then incorporated into NLTK. Loading it through NLTK is often buggy, so if that fails we can install the original package instead. Both versions are used in the same way.

In [None]:
from nltk.sentiment import SentimentIntensityAnalyzer
try:
    sentiment = SentimentIntensityAnalyzer()
except LookupError:
    print('Sentiment analysis in NLTK is not working at the moment :(')

If the `SentimentIntensityAnalyzer` isn't loading properly from `nltk`, then you'll have to install the original package using the line below:

In [None]:
!pip install -U vaderSentiment

And then import it like this:

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

Whether you used NLTK's `SentimentIntensityAnalyzer` or got it from `vaderSentiment`, the rest of the code is identical.

In [None]:
sentiment = SentimentIntensityAnalyzer()

Analyzing a sentence for its sentiment returns a dictionary with four items. The `compound` key holds the overall score.

In [None]:
sentence = "I hate this sentence so much. I just want it to end. It sucks!"
sentiment.polarity_scores(sentence)

In [None]:
sentences = ["VADER is smart, handsome, and funny.",      # positive sentence example
            "VADER is not smart, handsome, nor funny.",   # negation sentence example
            "VADER is smart, handsome, and funny!",       # punctuation emphasis handled correctly (sentiment intensity adjusted)
            "VADER is very smart, handsome, and funny.",  # booster words handled correctly (sentiment intensity adjusted)
            "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
            "VADER is VERY SMART, handsome, and FUNNY!!!",# combination of signals - VADER appropriately adjusts intensity
            "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!",# booster words & punctuation make this close to ceiling for score
            "The book was good.",                                     # positive sentence
            "The book was kind of good.",                 # qualified positive sentence is handled correctly (intensity adjusted)
            "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
            "At least it isn't a horrible book.",         # negated negative sentence with contraction
            "Make sure you :) or :D today!",              # emoticons handled
            "Today SUX!",                                 # negative slang with capitalization emphasis
            "Today only kinda sux! But I'll get by, lol"  # mixed sentiment example with slang and contrastive conjunction "but"
             ]

In [None]:
scores = []
for sent in sentences:
    score = sentiment.polarity_scores(sent)
    scores.append(score)
df = pd.DataFrame(scores)
df['sentence'] = sentences
df

> _The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. Calling it a 'normalized, weighted composite score' is accurate._

> _It is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative._
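
Here's a rough sketch of the "sum, then normalize" idea from the quote. The lexicon values are made up for illustration, no rules (negation, boosters, capitalization) are applied, and the normalization with `alpha=15` reflects my understanding of the `vaderSentiment` source rather than a documented guarantee.

```python
import math

# Made-up valence scores for illustration; the real lexicon is far larger.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "horrible": -2.5}

def normalize(score, alpha=15):
    """Squash an unbounded score into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

def toy_compound(sentence):
    """Sum the valences of known words, then normalize. No rules are applied."""
    words = sentence.lower().split()
    total = sum(TOY_LEXICON.get(w.strip(".,!?"), 0.0) for w in words)
    return normalize(total)

print(round(toy_compound("The book was good."), 4))
print(round(toy_compound("The book was horrible, really bad!"), 4))
```

Because of the square root in the denominator, stacking up more and more positive words pushes the score towards +1 but never quite reaches it.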

In [None]:
df['positive_sentiment'] = df['compound'] >= 0.5
df
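
The cell above only flags positive sentences. If you want three classes, one option is a symmetric cutoff, sketched below. `label_sentiment` is a hypothetical helper, and mirroring the 0.5 threshold on the negative side is my assumption; pick whatever cutoffs suit your data.

```python
def label_sentiment(compound, threshold=0.5):
    """Map a compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores, just to exercise the three branches.
for score in (0.83, -0.61, 0.12):
    print(score, label_sentiment(score))
```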

### Challenge

I've read in a bunch of tweets from Trump, and stored them as a list of strings in `tweet_text`. Use the code from above to find the positive sentiment tweets and save them to a list called `positive_tweets`. Do the same for negative tweets, storing them in a list called `negative_tweets`. What's the ratio of positive to negative tweets?

In [None]:
tweets_fname = os.path.join(DATA_DIR, 'trump-tweets.csv')
tweets = pd.read_csv(tweets_fname)
tweet_text = list(tweets['Tweet_Text'].values)
tweet_text[:2]

In [None]:
# your answer goes here

In [None]:
# your answer goes here

## Stemming <a id='stemming'></a>

Stemming and lemmatization both refer to removing morphological affixes from words. For example, if we stem the word "grows", we get "grow". If we stem the word "running", we get "run". We do this because often we care more about the core content of the word (i.e. that it has something to do with growth or running) than about the fact that it's a third person present tense verb or a progressive participle.

NLTK provides many algorithms for stemming. For English, a great baseline is the [Porter algorithm](https://tartarus.org/martin/PorterStemmer/), which in spirit isn't that far from a bunch of regular expressions.
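
To get a feel for what "a bunch of regular expressions" could look like, here's a deliberately crude suffix stripper. It is not the Porter algorithm, which has many more rules and conditions; `toy_stem` is a hypothetical function written only for this illustration.

```python
import re

def toy_stem(word):
    """Strip a few common English suffixes. Far cruder than a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        # Require at least three characters to remain, so "sing" stays "sing".
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "runn" -> "run".
            return re.sub(r"([^aeiou])\1$", r"\1", stem)
    # Strip a plural or third person "s", but leave words like "grass" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for w in ("grows", "running", "jumped", "grass", "sing"):
    print(w, "->", toy_stem(w))
```

Rules this simple make plenty of mistakes (try "falling"), which is why real stemmers layer on extra conditions.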

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
stemmer.stem('grows')

In [None]:
stemmer.stem('running')

In [None]:
stemmer.stem('leaves')

NLTK has a variety of other stemming algorithms, and lemmatizers.

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

In [None]:
print(snowball.stem('running'))
print(snowball.stem('eats'))
print(snowball.stem('embarrassed'))

But watch out for errors:

In [None]:
# Thanks to Chris Hench for these examples
print(snowball.stem('cylinder'))
print(snowball.stem('cylindrical'))

And collisions:

In [None]:
# Thanks to Chris Hench for these examples
print(snowball.stem('vacation'))
print(snowball.stem('vacate'))

In [None]:
print(lemmatizer.lemmatize('vacation'))
print(lemmatizer.lemmatize('vacate'))

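But why would you want to stem words in the first place? Well, stemming can improve the performance of downstream models, such as the sentiment classifier below. Compare the two accuracy scores at the end.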

In [None]:
# Thanks again to Chris Hench for inspiration of this example
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer releases
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Don't worry about following along with this code, although it's great if you do!
def read_data():
    airline_fname = 'airline_tweets.csv'
    airline_fname = os.path.join(DATA_DIR, airline_fname)
    df = pd.read_csv(airline_fname)
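    twitter_handle_pattern = r'@(\w+)'
    hashtag_pattern = r'(?:^|\s)[＃#]{1}(\w+)'
    url_pattern = r'https?:\/\/.*\.com'  # escape the dot so it only matches a literal '.com'
    # regex=True is needed because newer pandas versions treat the pattern as a literal string by default
    df['clean_text'] = (df['text']
                        .str.replace(hashtag_pattern, 'HASHTAG', regex=True)
                        .str.replace(twitter_handle_pattern, 'USER', regex=True)
                        .str.replace(url_pattern, 'URL', regex=True)
                              )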
    text = list(df['clean_text'].str.lower())
    sentiment = list(df['airline_sentiment'])
    return text, sentiment

def prepare_stems(sents):
    snowball = SnowballStemmer('english')
    tokenized_sents = [nltk.word_tokenize(s) for s in sents]
    stemmed_sents = [[snowball.stem(s) for s in tokenized_sent] for tokenized_sent in tokenized_sents]
    return [' '.join(sent) for sent in stemmed_sents]

def prepare_no_stems(sents):
    tokenized_sents = [nltk.word_tokenize(s) for s in sents]
    return [' '.join(sent) for sent in tokenized_sents]

def fit_model(X_train, y_train):
    model = RandomForestClassifier(n_estimators=10, criterion='gini')                
    model.fit(X_train, y_train)
    return model

def test_model(model, X_test, y_test):
    print('Accuracy: ', model.score(X_test, y_test))

def classify(sents, target):
    vectorizer = TfidfVectorizer(max_features=5000, binary=True)
    X = vectorizer.fit_transform(sents)
    X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.25, random_state=42)
    model = fit_model(X_train, y_train)
    test_model(model, X_test, y_test)

text, sentiment = read_data()
stemmed_text = prepare_stems(text)
unstemmed_text = prepare_no_stems(text)

In [None]:
classify(stemmed_text, sentiment)

In [None]:
classify(unstemmed_text, sentiment)

## What we didn't cover <a id='didnt'></a>

### Distance

NLTK has some functionality for calculating the distance between two strings. String distance is a measure of how different two strings are. For example:

In [None]:
nltk.edit_distance('hello', 'helo')

In [None]:
nltk.edit_distance('hello', 'hi')

There are lots of different ways to measure the distance between two strings. This function uses Levenshtein distance, which is the minimum number of single-character insertions, deletions and substitutions required to turn one string into another. Edit distance is useful if you're looking for spelling mistakes.
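
If you're curious how Levenshtein distance can be computed, here's a minimal dynamic-programming sketch in pure Python. It's a textbook version for illustration, not NLTK's own code, and `levenshtein` is a name I made up.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute, or match for free
        previous = current
    return previous[-1]

print(levenshtein('hello', 'helo'))
print(levenshtein('hello', 'hi'))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.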

The [fuzzywuzzy library](https://github.com/seatgeek/fuzzywuzzy) does a great job of edit distance too.

In [None]:
!pip install -U fuzzywuzzy

In [None]:
'this is a test' == 'this is a test!'

In [None]:
from fuzzywuzzy import fuzz
fuzz.ratio('this is a test', 'this is a test!')
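
If you'd rather not install anything, the standard library's `difflib` offers a similar similarity ratio. This is a sketch of an alternative, not a drop-in replacement: I'm assuming the two scores are comparable for simple cases like this one, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity between 0 and 1, based on matching blocks of characters."""
    return SequenceMatcher(None, a, b).ratio()

score = similarity('this is a test', 'this is a test!')
print(round(100 * score))  # scaled to 0-100 for easier comparison
```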

### Translation

NLTK offers [some tools](https://github.com/nltk/nltk/tree/develop/nltk/translate) for machine translation. This is great for learning traditional translation models, but it is outdated. If you actually need to translate some text, I'd currently highly recommend using the [Google Translate API](https://cloud.google.com/translate/docs/).

### Text classification

NLTK has [support for text classification](https://github.com/nltk/nltk/tree/develop/nltk/classify) using machine learning. However, I'd recommend using [scikit-learn](http://scikit-learn.org/stable/), [TensorFlow](https://www.tensorflow.org/) or [Keras](https://keras.io/) for this now.

### Chatbots

These are mainly just for fun. But check out the [source code](https://github.com/nltk/nltk/tree/develop/nltk/chat) if you're ever interested in building a simple chatbot yourself.

In [None]:
# doesn't work so well in a Jupyter notebook because it requires interaction,
# but try it in a terminal or IDE!
#nltk.chat.chatbots()