<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Natural Language Processing (NLP) Review Lab

_Authors: Joseph Nelson (DC)_ 

---

> **Note: This lab is intended to be completed with the help of an instructor.**

## Introduction


*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker, [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky, and Kevin Markham's data school curriculum*.

### What Is NLP?

- It uses computers to process (analyze, understand, and generate) natural human languages.
- Most knowledge created by humans is unstructured text, and computers need a way to make sense of it.
- It builds probabilistic models using data about a language.

### What are some of the higher-level task areas?

- **Information retrieval**: Finding relevant and similar results.
    - [Google](https://www.google.com/).
- **Information extraction**: Distilling structured information from unstructured documents.
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en).
- **Machine translation**: Translating one language to another.
    - [Google Translate](https://translate.google.com/).
- **Text simplification**: Preserving the meaning of text but simplifying the grammar and vocabulary.
    - [Rewordify](https://rewordify.com/).
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page).
- **Predictive text input**: Faster or easier typing.
    - [My application](https://justmarkham.shinyapps.io/textprediction/).
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/).
- **Sentiment analysis**: Assessing the attitude of the speaker.
    - [Hater News](http://haternews.herokuapp.com/).
- **Automatic summarization**: Extractive or abstractive summarization.
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0).
- **Natural language generation**: Generating text from data.
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052).
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763).
- **Speech recognition and generation**: Speech-to-text and text-to-speech.
    - [Google's web speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html).
    - [Vocalware’s text-to-speech demo](https://www.vocalware.com/index/demo).
- **Question answering**: Determining the intent of the question, matching the query with the knowledge base, and evaluating the hypotheses.
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson trivia challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html).
    - [The AI behind Watson](http://www.aaai.org/Magazine/Watson/watson.php).

### What are some of the lower-level components?

- **Tokenization**: Breaking text into tokens (words, sentences, and n-grams).
- **Stop word removal**: "a," "an," and "the."
- **Stemming and lemmatization**: Root words.
- **TF-IDF**: Word importance.
- **Parts-of-speech tagging**: Noun, verb, adjective, and adverb.
- **Named entity recognition**: Person, organization, and location.
- **Spelling correction**: "New Yrok City."
- **Word sense disambiguation**: "Buy a mouse."
- **Segmentation**: "New York City subway."
- **Language detection**: "Translate this page."
- **Machine learning.**

### Why is NLP difficult?

- **Ambiguity**:
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: Text messages.
- **Idioms**: "Throw in the towel."
- **Newly coined words**: "Retweet."
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters;" "Mary and Sue are mothers."

NLP requires an understanding of **language** and the **world**.

## Part 1: Reading in the Yelp Reviews

- Corpus = A collection of documents.
- Corpora = The plural form of corpus.

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer
%matplotlib inline

In [2]:
csv_file = './datasets/yelp.csv'

In [3]:
# A:

### 1.A) Subset the reviews to best and worst.

- Select only five-star and one-star reviews.
- The text will be the features and the stars will be the target.
- Create a train/test split.
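
As a rough sketch of the three steps above, the example below uses a small made-up DataFrame rather than the real Yelp file. The column names `stars` and `text`, the sample reviews, and the `test_size` and `random_state` values are all assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Yelp data; the `stars` and `text` column names are assumed.
reviews = pd.DataFrame({
    "stars": [5, 1, 3, 5, 1, 4, 5, 1],
    "text": [
        "Amazing tacos and friendly staff",
        "Cold food and rude service",
        "It was fine",
        "Best coffee in town",
        "Never coming back",
        "Pretty good overall",
        "Loved every bite",
        "Waited an hour for nothing",
    ],
})

# Keep only the best (five-star) and worst (one-star) reviews.
best_worst = reviews[reviews["stars"].isin([1, 5])]

X = best_worst["text"]   # features: the raw review text
y = best_worst["stars"]  # target: the star rating

# Hold out a quarter of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)
print(X_train.shape, X_test.shape)
```

The split happens before any vectorizing so that the vocabulary is later learned from the training text only.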

In [4]:
# A:

## Part 2: Tokenization

- **What:** It separates the text into units such as sentences or words.
- **Why:** It gives structure to previously unstructured text.
- **Notes:** It's relatively easy for English-language text but harder for some other languages.
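
To make word-level tokenization concrete, here is a minimal sketch that uses the analyzer function scikit-learn builds for `CountVectorizer`. The sample sentence is made up; the default settings lowercase the text and keep only tokens of two or more word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer applies to each document:
# lowercase the text, then split it into tokens of two or more word characters.
analyze = CountVectorizer().build_analyzer()

tokens = analyze("The tacos were great. I'd go back!")
print(tokens)
# ['the', 'tacos', 'were', 'great', 'go', 'back']
```

Note that punctuation disappears and the single-character pieces of "I'd" are dropped by the default token pattern.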

### 2.A) Use `CountVectorizer` to convert the training and testing text data.

[`CountVectorizer` documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

- **lowercase:** Boolean; `True` by default.
    - Convert all characters to lowercase before tokenizing.
- **ngram_range:** tuple `(min_n, max_n)`.
    - The lower and upper boundary of the range of `n` values for different n-grams to be extracted. All values of `n` such that `min_n <= n <= max_n` will be used.
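
The effect of `ngram_range` is easiest to see on a tiny corpus. The sketch below uses two made-up documents and extracts both unigrams and bigrams; feature names are read from `vocabulary_` so that it does not depend on a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["New York City subway", "New York pizza"]

# ngram_range=(1, 2) extracts single words and pairs of adjacent words.
vect = CountVectorizer(lowercase=True, ngram_range=(1, 2))
dtm = vect.fit_transform(docs)  # sparse document-term matrix

# Columns of the matrix follow the alphabetically sorted vocabulary.
features = sorted(vect.vocabulary_)
print(features)
print(dtm.toarray())
```

With bigrams included, "new york" becomes its own feature alongside "new" and "york", which is why wider n-gram ranges grow the number of columns quickly.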

In [5]:
# A:

### 2.B) Predict the star rating with the new features from `CountVectorizer`.

Validate on the testing set.
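
One possible shape for this step, sketched on a handful of made-up reviews instead of the Yelp data: fit the vectorizer on the training text, reuse it on the testing text, and score a `MultinomialNB` classifier. The choice of Naive Bayes and of accuracy as the metric are assumptions; the lab leaves both open.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# Made-up reviews standing in for the real train/test split.
X_train = [
    "great food great service",
    "loved it amazing",
    "terrible food awful service",
    "hated it horrible",
]
y_train = [5, 5, 1, 1]
X_test = ["amazing great service", "horrible awful food"]
y_test = [5, 1]

vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)  # learn the vocabulary from training text only
X_test_dtm = vect.transform(X_test)        # reuse that vocabulary on the testing text

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

y_pred = nb.predict(X_test_dtm)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(y_pred, accuracy)
```

Calling `transform` (not `fit_transform`) on the testing text keeps words that appear only in the test set from leaking into the features.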

In [6]:
# A:

## Part 3: Stop Word Removal

- **What:** It removes common words that will likely appear in any text.
- **Why:** Stop words don't tell you much about your text.

### 3.A) Recreate the features and remove stop words using `CountVectorizer`. 

- **stop_words:** string `{'english'}`, list, or `None` (default).
    - If `'english'`, a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If `None`, no stop words will be used. `max_df` can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra-corpus document frequency of terms.
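
A small sketch of the difference `stop_words` makes, using one made-up sentence. It compares the default vocabulary with the vocabulary built using the built-in English list.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The food was great and the service was fast"]

# Default: no stop word filtering.
with_stops = CountVectorizer().fit(docs)

# Built-in English stop word list.
without_stops = CountVectorizer(stop_words="english").fit(docs)

print(sorted(with_stops.vocabulary_))
print(sorted(without_stops.vocabulary_))
```

Words such as "the", "was", and "and" drop out of the second vocabulary, which shrinks the feature matrix without removing the words that describe the review.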

In [7]:
# A:

### 3.B) Validate your model using the features with stop words removed.

In [8]:
# A:

## Part 4: Other CountVectorizer Options

### 4.A) Shrink the maximum number of features and retest the model.

- **max_features:** `int` or `None`; `default=None`.
    - If not `None`, build a vocabulary that only considers the top `max_features` ordered by term frequency across the corpus.
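
As an illustration of `max_features`, the sketch below keeps only the two most frequent terms from three made-up documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good food good service", "good coffee", "bad service"]

# Corpus-wide term counts: good=3, service=2, and every other word appears once.
vect = CountVectorizer(max_features=2).fit(docs)

print(sorted(vect.vocabulary_))
# ['good', 'service']
```

A smaller vocabulary trains faster and can reduce overfitting, at the cost of discarding rare but possibly informative words.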

In [9]:
# A:

### 4.B) Change the minimum document frequency for terms and test the model's performance.

- **min_df:** Float in range `[0.0, 1.0]` or int; `default=1`.
    - When building the vocabulary, ignore terms that have a document frequency that is strictly lower than the given threshold. This value is also called the cut-off. If a float, the parameter represents a proportion of documents; if an integer, it represents an absolute count.
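
The integer and float forms of `min_df` can be compared on a toy corpus. In this sketch (four made-up documents), "great" appears in three documents, "slow" in two, and every other word in one.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great pizza",
    "great pasta",
    "slow service",
    "slow kitchen great view",
]

# Integer: keep terms that appear in at least 2 documents.
at_least_two = CountVectorizer(min_df=2).fit(docs)

# Float: keep terms that appear in at least 75% of documents (3 of 4 here).
at_least_75pct = CountVectorizer(min_df=0.75).fit(docs)

print(sorted(at_least_two.vocabulary_))    # ['great', 'slow']
print(sorted(at_least_75pct.vocabulary_))  # ['great']
```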

In [10]:
# A:

## Part 5: Introduction to TextBlob

TextBlob: Simplified text processing.

### 5.A) Use `TextBlob` to convert the text in the first review in the data set.

In [11]:
# A:

### 5.B) List the words in the `TextBlob` object.

In [12]:
# A:

### 5.C) List the sentences in the `TextBlob` object.

In [13]:
# A:

## Part 6: Stemming and Lemmatization

**Stemming**

- **What:** It reduces a word to its base/stem/root form.
- **Why:** It often makes sense to treat related words the same way.
- **Notes:**
    - It uses a simple and fast rule-based approach.
    - Stemmed words are usually not shown to users (used for analysis/indexing).
    - Some search engines treat words with the same stem as synonyms.
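
A minimal sketch of rule-based stemming with the `SnowballStemmer` imported at the top of the lab. The word list is made up for illustration.

```python
from nltk.stem.snowball import SnowballStemmer

# The English Snowball stemmer applies suffix-stripping rules; no dictionary lookup.
stemmer = SnowballStemmer("english")

words = ["running", "runs", "easily", "parties"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, "->", stem)
```

"running" and "runs" both reduce to "run", so a model treats them as the same feature. Other stems may not be real words, which is why stemmed text is normally used for analysis and indexing and not displayed.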

### 6.A) Initialize the `SnowballStemmer` and stem the words in the first review.

In [14]:
# A:

### 6.B) Use the built-in `lemmatize()` function on the first review's words (parsed by `TextBlob`).

**Lemmatization**

- **What:** It derives the canonical form (lemma) of a word.
- **Why:** It can be more effective than stemming.
- **Notes:** It uses a dictionary-based approach (slower than stemming).

In [15]:
# A:

### 6.C) Write a function that uses `TextBlob` and `lemmatize()` to lemmatize text.

In [16]:
# A:

### 6.D) Provide your function to `CountVectorizer` as the `analyzer` and test the performance of your model.

In [17]:
# A:

## Part 7: Term Frequency-Inverse Document Frequency (TF-IDF)

- **What:** It computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents.
- **Why:** It's more useful than "term frequency" for identifying "important" words in each document (e.g., high frequency in that document, low frequency in other documents).
- **Notes:** It’s used for search engine scoring, text summarization, and document clustering.

### 7.A) Build a simple TF-IDF using `CountVectorizer`.

- Term frequency can be calculated with a default `CountVectorizer`.
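- Document frequency can be calculated with `CountVectorizer` and the argument `binary=True` by summing each term's column across documents; dividing term frequency by document frequency gives a simple TF-IDF.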

**More details:** [TF-IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/).
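
The idea can be sketched on three short made-up documents. This is the plain ratio of term frequency to document frequency, which is an illustrative simplification and not the log-scaled, normalized formula that `TfidfVectorizer` applies.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

docs = ["call you tonight", "call me a cab", "please call me please"]

# Term frequency: how often each term appears in each document.
tf_vect = CountVectorizer()
tf = pd.DataFrame(
    tf_vect.fit_transform(docs).toarray(),
    columns=sorted(tf_vect.vocabulary_),  # column order matches the sorted vocabulary
)

# Document frequency: with binary=True each term counts at most once per document,
# so summing down the rows gives the number of documents containing the term.
df_vect = CountVectorizer(binary=True)
df = df_vect.fit_transform(docs).toarray().sum(axis=0)

# Simple TF-IDF: term frequency divided by document frequency.
tfidf = tf / df
print(tfidf)
```

In the third document "please" scores 2.0 (it appears twice and in no other document), while "call" scores about 0.33 because it appears in every document.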

In [18]:
# A:

## Part 8: Using TF-IDF to Summarize a Yelp Review

> **Note:** Reddit's autotldr uses the [SMMRY](http://smmry.com/about) algorithm, which is based on TF-IDF.
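
A rough sketch of TF-IDF-based keyword extraction on three made-up reviews. The helper name `top_words` and its parameters are hypothetical, introduced only for this illustration; it is one possible approach and not the SMMRY algorithm.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The pizza was great and the crust was perfect",
    "The service was slow but the pizza was great",
    "Parking was terrible and the service was slow",
]

vect = TfidfVectorizer(stop_words="english")
dtm = vect.fit_transform(docs)

# Column order of the matrix matches the alphabetically sorted vocabulary.
features = np.array(sorted(vect.vocabulary_))


def top_words(row_index, n=5):
    """Hypothetical helper: return up to n (word, score) pairs for one document."""
    scores = dtm[row_index].toarray().ravel()
    order = scores.argsort()[::-1][:n]  # indices of the highest scores first
    return [(str(features[i]), float(scores[i])) for i in order if scores[i] > 0]


print(top_words(0, n=2))
```

For the first review, "crust" and "perfect" outrank "pizza" and "great" because they appear in no other review, which is the behavior that makes TF-IDF useful for summarizing.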

### 8.A) Build a TF-IDF predictor matrix that excludes stop words with `TfidfVectorizer`.

In [19]:
# A:

### 8.B) Write a function to pull out the top five words from a review based on TF-IDF score.

In [20]:
# A:

## Part 9: Sentiment Analysis

### 9.A) Extract sentiment from a review parsed with `TextBlob`.

Sentiment polarity ranges from -1 (the most negative) to 1 (the most positive). A parsed `TextBlob` object has sentiment that can be accessed with:

    review.sentiment.polarity

In [21]:
# A:

### 9.B) Calculate the sentiment for every review in the full Yelp data set as a new column.

In [22]:
# A:

### 9.C) Create a box plot of sentiment based on star rating.

In [23]:
# A:

### 9.D) Print the reviews with the highest and lowest sentiment.

In [24]:
# A:

## Part 10: [Bonus] Explore Fun `TextBlob` Features

### 10.A) Correct spelling with `.correct()`.

In [25]:
# A:

### 10.B) Perform spell-checking with `.spellcheck()`.

In [26]:
# A:

### 10.C) Extract definitions with `.define()`.

In [27]:
# A:

## Conclusion

- NLP is a gigantic field.
- Understanding the basics broadens the types of data you can work with.
- Simple techniques go a long way.
- Use scikit-learn for NLP whenever possible.