# NATURAL LANGUAGE PROCESSING

## PART 1: INTRODUCTION
*Adapted from [NLP Crash Course](http://files.meetup.com/7616132/DC-NLP-2013-09%20Charlie%20Greenbacker.pdf) by Charlie Greenbacker and [Introduction to NLP](http://spark-public.s3.amazonaws.com/nlp/slides/intro.pdf) by Dan Jurafsky*

## What is NLP?
- Using computers to process (analyze, understand, generate) natural human languages

## Why is NLP useful?
- Most knowledge created by humans is unstructured text
- Need some way to make sense of it
- Enables quantitative analysis of text data

## What are some of the higher level task areas?
- **Speech recognition and generation**: Apple Siri
    - Speech to text
    - Text to speech
- **Question answering**: IBM Watson
    - Match query with knowledge base
    - Reasoning about intent of question
- **Machine translation**: Google Translate
    - One language to another
- **Information retrieval**: Google
    - Finding relevant results
    - Finding similar results
- **Information extraction**: Gmail
    - Structured information from unstructured documents
- **Assistive technologies**: Google autocompletion
    - Predictive text input
    - Text simplification
- **Natural Language Generation**: computer-generated articles
    - Generating text from data
- **Automatic summarization**: Google News
    - Extractive summarization
    - Abstractive summarization
- **Sentiment analysis**: Twitter analysis
    - Attitude of speaker

## What are some of the lower level components?
- **Tokenization**: breaking text into tokens (words, sentences, n-grams)
- **Stopword removal**: a/an/the
- **Stemming and lemmatization**: root word
- **TF-IDF**: word importance
- **Part-of-speech tagging**: noun/verb/adjective
- **Named entity recognition**: person/organization/location
- **Spelling correction**: "New Yrok City"
- **Word sense disambiguation**: "buy a mouse"
- **Segmentation**: "New York City subway"
- **Language detection**: "translate this page"
- **Machine learning**

## Why is NLP hard?
- **Ambiguity**:
    - Teacher Strikes Idle Kids
    - Red Tape Holds Up New Bridges
    - Hospitals are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English**: tweets/text messages
- **Idioms**: "throw in the towel"
- **Newly coined words**: "retweet"
- **Tricky entity names**: "Where is A Bug's Life playing?"
- **World knowledge**: "Mary and Sue are sisters", "Mary and Sue are mothers"

## How does NLP work?
- Build probabilistic model using data about a language
- Requires an understanding of the language
- Requires an understanding of the world (or a particular domain)


## PART 2: READING IN THE YELP REVIEWS

- "corpus" = collection of documents
- "corpora" = plural form of corpus

In [None]:
## PRE-REQUISITES (Install the following from the Terminal)
## pip install textblob
## python -m textblob.download_corpora

In [None]:
! pip install textblob 

In [None]:
! python -m textblob.download_corpora -y

In [None]:
import pandas as pd
import numpy as np
import scipy as sp
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer

In [None]:
import sys

# Check the Python version: the unicode handling in later cells differs between Python 2 and Python 3
req_version = (3, 0)
cur_version = sys.version_info

In [None]:
# read yelp.csv into a DataFrame
yelp = pd.read_csv('../data/yelp.csv')

In [None]:
yelp.head()

In [None]:
yelp.stars.value_counts()

In [None]:
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

In [None]:
yelp_best_worst.stars.value_counts()

In [None]:
yelp_best_worst.head()

In [None]:
yelp_best_worst.shape

In [None]:
# split the new DataFrame into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(yelp_best_worst.text, yelp_best_worst.stars, random_state=1)

## PART 3: TOKENIZATION
- **What:** Separate text into units such as sentences or words
- **Why:** Gives structure to previously unstructured text
- **Notes:** Relatively easy with English language text, not easy with some languages
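
As a rough illustration of what word-level tokenization produces, the sketch below splits two made-up sentences into lowercase tokens and counts them per document. The `docs` list and the `tokenize` helper are hypothetical names for this example only; the regular expression mirrors scikit-learn's default `token_pattern` (two or more word characters).

```python
import re
from collections import Counter

# Hypothetical example documents (not taken from the Yelp data)
docs = ["The food was great, really great!", "Service was slow."]

def tokenize(text):
    """Lowercase the text and return word tokens of 2+ characters."""
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# One Counter per document: token -> number of occurrences
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all tokens seen in any document
vocabulary = sorted(set(token for c in counts for token in c))

# Document-term matrix as a list of rows (one row per document)
dtm = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)
print(dtm)
```

Each row has one column per vocabulary term, which is the same layout a document-term matrix uses.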

In [None]:
# use CountVectorizer to create document-term matrices from X_train and X_test
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train)
test_dtm = vect.transform(X_test)

In [None]:
# rows are documents, columns are terms (aka "tokens" or "features")
train_dtm.shape

In [None]:
# last 50 features
print (vect.get_feature_names()[-50:])

In [None]:
# show vectorizer options
vect

**[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)**
- **lowercase:** boolean, True by default
- Convert all characters to lowercase before tokenizing.

In [None]:
# don't convert to lowercase
vect = CountVectorizer(lowercase=False)
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

- **token_pattern:** string
- Regular expression denoting what constitutes a "token". The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).
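
To see the effect of the pattern in isolation, the following sketch applies both regular expressions directly with Python's `re` module. The sample sentence is made up, and lowercasing is done by hand here to mimic the vectorizer's default.

```python
import re

text = "I ate a B+ burger at 5 pm"  # hypothetical sample sentence

# Default pattern: a token needs at least two word characters
default_tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())

# Relaxed pattern: single-character tokens are kept as well
relaxed_tokens = re.findall(r"(?u)\b\w+\b", text.lower())

print(default_tokens)
print(relaxed_tokens)
```

With the relaxed pattern, short tokens such as "i", "a", and "5" survive, so the vocabulary grows.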

In [None]:
# allow tokens of one character
vect = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

In [None]:
print (vect.get_feature_names()[-50:])

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
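
A minimal plain-Python sketch of n-gram extraction is shown below, assuming the text has already been tokenized. The `ngrams` helper is a hypothetical name, not part of scikit-learn.

```python
def ngrams(tokens, min_n, max_n):
    """Return all n-grams (joined by spaces) with min_n <= n <= max_n."""
    result = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n across the token list
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result

tokens = ["best", "tacos", "in", "town"]
print(ngrams(tokens, 1, 2))
```

Four tokens yield four 1-grams and three 2-grams, which is why widening `ngram_range` inflates the number of features so quickly.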

In [None]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

In [None]:
print (vect.get_feature_names()[-50:])

In [None]:
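# include 1-grams and 2-grams, and remove English stop words
# (a custom list such as sw could be passed to stop_words instead of "english")
sw = ['in', 'on', 'the']
vect = CountVectorizer(ngram_range=(1, 2), stop_words="english")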
train_dtm = vect.fit_transform(X_train)
train_dtm.shape

In [None]:
vect.get_stop_words()

In [None]:
# last 50 features
print (vect.get_feature_names()[-50:])

### `PREDICTING THE STAR RATING`

In [None]:
# use default options for CountVectorizer
vect = CountVectorizer()
vect.fit(X_train)

# create document-term matrices. In this case, we are transforming X_test tokens to fit the vocabulary generated by X_train
train_dtm = vect.transform(X_train)
test_dtm = vect.transform(X_test)

In [None]:
print (train_dtm.shape)
print (test_dtm.shape)

In [None]:
# You get a different vocabulary (and so a different number of columns) when you fit the train and test sets separately
train_vect = CountVectorizer()
train_vect.fit(X_train)
train_dtm2 = train_vect.transform(X_train)
print (train_dtm2.shape)

test_vect = CountVectorizer()
test_vect.fit(X_test)
test_dtm2 = test_vect.transform(X_test)
print (test_dtm2.shape)
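
The reason for fitting on the training set only can be sketched in plain Python. The helpers below (`fit_vocabulary`, `transform`) are hypothetical stand-ins for the idea behind `fit` and `transform`, not scikit-learn's implementation: the vocabulary learned from the training documents fixes the columns, and test tokens outside that vocabulary are dropped.

```python
import re

def tokenize(text):
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

def fit_vocabulary(docs):
    """Map each distinct token in docs to a column index."""
    tokens = sorted(set(t for doc in docs for t in tokenize(doc)))
    return {token: i for i, token in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

train_docs = ["great food", "terrible service"]   # hypothetical training set
test_docs = ["great service and great prices"]    # hypothetical test set

vocabulary = fit_vocabulary(train_docs)
train_rows = transform(train_docs, vocabulary)
test_rows = transform(test_docs, vocabulary)

print(vocabulary)
print(test_rows)
```

Because both matrices share the same columns, a model trained on `train_rows` can score `test_rows` directly; "and" and "prices" never appeared in training, so they contribute nothing.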

In [None]:
# use Naive Bayes to predict the star rating
nb = MultinomialNB()
nb.fit(train_dtm, y_train)
y_pred_class = nb.predict(test_dtm)

In [None]:
X_train
vect.get_feature_names()[-50:]

In [None]:
y_test

In [None]:
# calculate accuracy
print (metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# calculate null accuracy
y_test_binary = np.where(y_test==5, 1, 0)
y_test_binary.mean()
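
Null accuracy is the accuracy achieved by always predicting the most frequent class. The cell above relies on 5-star reviews being the majority class; a more general sketch, using a hypothetical `null_accuracy` helper and made-up labels, might look like this:

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

labels = [5, 5, 5, 1, 5, 1, 5, 5]  # hypothetical star ratings
print(null_accuracy(labels))
```

A classifier is only useful if its accuracy clearly exceeds this baseline.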

In [None]:
# define a function that accepts a vectorizer and returns the accuracy
def tokenize_test(vect):
    train_dtm = vect.fit_transform(X_train)
    print ('Features: ', train_dtm.shape[1])
    test_dtm = vect.transform(X_test)
    nb = MultinomialNB()
    nb.fit(train_dtm, y_train)
    y_pred_class = nb.predict(test_dtm)
    print ('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# include 1-grams, 2-grams, and 3-grams
vect = CountVectorizer(ngram_range=(1, 3))
tokenize_test(vect)

In [None]:
vect.get_feature_names()[-50:]

## PART 4: STOPWORD REMOVAL
- **What:** Remove common words that will likely appear in any text
- **Why:** They don't tell you much about your text
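
Conceptually, stop word removal is just a filter over the token list. The sketch below uses a tiny made-up stop list; scikit-learn's built-in English list is much longer.

```python
# Tiny hypothetical stop list, for illustration only
stop_words = {"a", "an", "the", "was", "and", "of"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "pizza", "was", "cold", "and", "the", "staff", "was", "rude"]
print(remove_stop_words(tokens, stop_words))
```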

In [None]:
# show vectorizer options
vect

- **stop_words:** string {'english'}, list, or None (default)
- If 'english', a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

In [None]:
# remove English stop words
vect = CountVectorizer(stop_words='english')
tokenize_test(vect)

In [None]:
# set of stop words
print (vect.get_stop_words())

## PART 5: OTHER COUNTVECTORIZER OPTIONS 
- **max_features:** int or None, default=None
- If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
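
The idea of keeping only the most frequent terms can be sketched as follows. The `top_terms` helper and the tokenized documents are hypothetical, and the example avoids ties at the cut-off because how ties are broken is not covered here.

```python
from collections import Counter

def top_terms(tokenized_docs, max_features):
    """Return the max_features terms with the highest total count across all documents."""
    totals = Counter(t for doc in tokenized_docs for t in doc)
    return sorted(term for term, _ in totals.most_common(max_features))

docs = [["good", "food", "good", "service"], ["good", "food"], ["slow"]]
print(top_terms(docs, 2))
```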

In [None]:
# remove English stop words and only keep 100 features
vect = CountVectorizer(stop_words='english', max_features=100)
tokenize_test(vect)

In [None]:
train_dtm = vect.fit_transform(X_train)
pd.DataFrame(train_dtm.toarray(), columns=vect.get_feature_names())

In [None]:
# all 100 features
print (vect.get_feature_names())

In [None]:
# include 1-grams and 2-grams, and limit the number of features
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
tokenize_test(vect)

- **min_df:** float in range [0.0, 1.0] or int, default=1
- When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts.
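
Document frequency counts the documents that contain a term, not the total number of occurrences. A small sketch with hypothetical helpers and made-up data makes the distinction visible:

```python
from collections import Counter

def document_frequency(tokenized_docs):
    """Count the number of documents each term appears in."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))  # set(): count each term once per document
    return df

def filter_min_df(tokenized_docs, min_df):
    """Keep terms that appear in at least min_df documents (integer threshold)."""
    df = document_frequency(tokenized_docs)
    return sorted(term for term, n in df.items() if n >= min_df)

docs = [["tacos", "tacos", "tacos", "good"], ["good", "salsa"], ["salsa", "bar"]]
print(filter_min_df(docs, 2))
```

Here "tacos" occurs three times but only in one document, so it is dropped with a threshold of 2.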

In [None]:
# include 1-grams and 2-grams, and only include terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
tokenize_test(vect)

In [None]:
# display the resulting document-term matrix as a DataFrame
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
train_dtm = vect.fit_transform(X_train)
pd.DataFrame(train_dtm.toarray(), columns=vect.get_feature_names())

## PART 6: INTRODUCTION TO TextBlob
* TextBlob: "Simplified Text Processing"

In [None]:
# print the first review
print (yelp_best_worst.text[0])

In [None]:
# save it as a TextBlob object
review = TextBlob(yelp_best_worst.text[0])

In [None]:
# list the words
review.words

In [None]:
len(review.words)

In [None]:
# list the sentences
review.sentences

In [None]:
# some string methods are available
review.lower()

In [None]:
review.noun_phrases

## PART 7: STEMMING AND LEMMATIZATION

**STEMMING:**
- **What:** Reduce a word to its base/stem/root form
- **Why:** Often makes sense to treat related words the same way
- **Notes:**
    - Uses a "simple" and fast rule-based approach
    - Stemmed words are usually not shown to users (used for analysis/indexing)
    - Some search engines treat words with the same stem as synonyms

In [None]:
# initialize stemmer
stemmer = SnowballStemmer('english')

In [None]:
# stem each word
stems = [stemmer.stem(word) for word in review.words]
print (len(stems))
print (review.words)
print (stems)

In [None]:
print (len(stems))
print (len(set(stems)))

**LEMMATIZATION**
- **What:** Derive the canonical form ('lemma') of a word
- **Why:** Can be better than stemming
- **Notes:** Uses a dictionary-based approach (slower than stemming)
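
The difference between the two approaches can be illustrated with a deliberately crude sketch. Both helpers below are toy stand-ins invented for this example: `toy_stem` is not the Snowball algorithm, and `toy_lemmatize` uses a four-entry made-up dictionary rather than WordNet.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word):
    """Rule-based: strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical mini-dictionary mapping inflected forms to lemmas
LEMMAS = {"was": "be", "better": "good", "dishes": "dish", "ordered": "order"}

def toy_lemmatize(word):
    """Dictionary-based: look the word up, fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["ordered", "dishes", "better", "was", "amazing"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based version is fast but can produce non-words ("amaz"), while the dictionary version handles irregular forms ("better", "was") only if they are in its lookup table.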

In [None]:
# assume every word is a noun
lems = [word.lemmatize() for word in review.words]
print (len(lems))
print (review.words)
print (set(lems))
print (len(set(lems)))

In [None]:
# assume every word is a verb
vlems = [word.lemmatize(pos='v') for word in review.words]
print (len(vlems))
print (len(set(vlems)))

In [None]:
# define a function that accepts text and returns a list of lemmas
def split_into_lemmas(text):
    if cur_version >= req_version:
        text = text.lower()
    else:
        text = unicode(text, 'utf-8').lower()
        
    words = TextBlob(text).words
    return [word.lemmatize() for word in words]

In [None]:
# use split_into_lemmas as the feature extraction function
vect = CountVectorizer(analyzer=split_into_lemmas)
tokenize_test(vect)

In [None]:
# last 50 features
print (vect.get_feature_names()[-50:])

## PART 8: TERM FREQUENCY - INVERSE DOCUMENT FREQUENCY 
- **What:** Computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents
- **Why:** More useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents)
- **Notes:** Used for search engine scoring, text summarization, document clustering

In [None]:
# example documents
train_simple = ['call you tonight',
                'Call me a cab',
                'please call me... PLEASE!']

In [None]:
# CountVectorizer
vect = CountVectorizer()
pd.DataFrame(vect.fit_transform(train_simple).toarray(), columns=vect.get_feature_names())

In [None]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(train_simple).toarray(), columns=vect.get_feature_names())
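
The TF-IDF weights can also be worked out by hand. The sketch below assumes `TfidfVectorizer`'s default settings (smoothed IDF, computed as ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row) and re-creates the three example sentences so that it runs on its own; the helper names are hypothetical.

```python
import math
import re
from collections import Counter

docs = ["call you tonight", "Call me a cab", "please call me... PLEASE!"]

def tokenize(text):
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokenized = [tokenize(d) for d in docs]
vocabulary = sorted(set(t for doc in tokenized for t in doc))
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    """Term count times IDF, scaled so the row has unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]

print(vocabulary)
for row in tfidf:
    print([round(v, 3) for v in row])
```

"call" appears in every document, so its IDF is the minimum value of 1 and it is weighted lower than rarer words such as "tonight" within the same document.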

## PART 9: USING TF-IDF TO SUMMARIZE A YELP REVIEW

In [None]:
# create a document-term matrix using TF-IDF
vect = TfidfVectorizer(stop_words='english')
dtm = vect.fit_transform(yelp.text)
features = vect.get_feature_names()
dtm.shape

In [None]:
def summarize():
    
    # choose a random review that is at least 300 characters
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = ""
        review_length = 0
        
        if cur_version >= req_version:
            review_text = yelp.text[review_id]
            review_length = len(review_text)
        else:
            # Python version 2.7
            review_text = unicode(yelp.text[review_id], 'utf-8')
            review_length = len(review_text)

    # create a dictionary of words and their TF-IDF scores
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]
    
    # print words with the top 5 TF-IDF scores
    print ('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print (word)
    
    # print 5 random words
    print ('RANDOM WORDS:')
    
    if cur_version >= req_version:
        random_words = np.random.choice(list(word_scores.keys()), size=5, replace=False)
    else:
        random_words = np.random.choice(word_scores.keys(), size=5, replace=False)
        
    for word in random_words:
        print (word)
    
    # print the review
    print (review_text)

In [None]:
summarize()

## PART 10: SENTIMENT ANALYSIS

In [None]:
print (review)

In [None]:
# polarity ranges from -1 (most negative) to 1 (most positive)
review.sentiment.polarity

In [None]:
# translate the review to Spanish
review.translate(to="es")

In [None]:
yelp.columns

In [None]:
# understanding the apply method
yelp['length'] = yelp.text.apply(len)

In [None]:
# define a function that accepts text and returns the polarity
def detect_sentiment(text):
    blob = None
    if cur_version >= req_version:
        blob = TextBlob(text).sentiment.polarity
    else:
        blob = TextBlob(text.decode('utf-8')).sentiment.polarity
    return blob

In [None]:
# create a new DataFrame column for sentiment
yelp['sentiment'] = yelp.text.apply(detect_sentiment)

In [None]:
yelp.columns

In [None]:
%matplotlib inline

# boxplot of sentiment grouped by stars
yelp.boxplot(column='sentiment', by='stars')

In [None]:
# reviews with most positive sentiment
yelp[yelp.sentiment == 1].text.head()

In [None]:
# reviews with most negative sentiment
yelp[yelp.sentiment == -1].text.head()

In [None]:
# widen the column display
pd.set_option('display.max_colwidth', 500)

In [None]:
# negative sentiment in a 5-star review
yelp[(yelp.stars == 5) & (yelp.sentiment < -0.3)].head()

In [None]:
# positive sentiment in a 1-star review
yelp[(yelp.stars == 1) & (yelp.sentiment > 0.5)].head()

In [None]:
# reset the column display width
pd.reset_option('display.max_colwidth')

## PART 11: ADDING FEATURES TO A DOCUMENT-TERM MATRIX

In [None]:
# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# split the new DataFrame into training and testing sets
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
X_train.shape

In [None]:
# use CountVectorizer with text column only
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train.iloc[:, 0])
test_dtm = vect.transform(X_test.iloc[:, 0])
print (train_dtm.shape)

In [None]:
# cast other feature columns to float and convert to a sparse matrix
extra = sp.sparse.csr_matrix(X_train.iloc[:, 1:].astype(float))
extra.shape

In [None]:
extra

In [None]:
# combine sparse matrices
train_dtm_extra = sp.sparse.hstack((train_dtm, extra))
train_dtm_extra.shape

In [None]:
# repeat for testing set
extra = sp.sparse.csr_matrix(X_test.iloc[:, 1:].astype(float))
test_dtm_extra = sp.sparse.hstack((test_dtm, extra))
test_dtm_extra.shape
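
Horizontal stacking simply appends the extra feature columns to each row of the document-term matrix. A dense, plain-Python sketch of the same idea is shown below; `hstack_rows` is a hypothetical helper and the numbers are made up.

```python
def hstack_rows(left, right):
    """Concatenate two row-aligned matrices (lists of lists) column-wise."""
    if len(left) != len(right):
        raise ValueError("both matrices need the same number of rows")
    return [l_row + r_row for l_row, r_row in zip(left, right)]

# Hypothetical token counts for two reviews (3 vocabulary terms)
token_counts = [[1, 0, 2], [0, 1, 0]]
# Hypothetical extra features per review: sentiment, cool, useful, funny
extra_features = [[0.4, 2.0, 5.0, 0.0], [-0.1, 0.0, 1.0, 0.0]]

combined = hstack_rows(token_counts, extra_features)
print(combined)
```

The row order must match between the two matrices, which is why the extra columns are taken from the same `X_train` and `X_test` splits as the text.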

In [None]:
# use logistic regression with text column only
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm, y_train)
y_pred_class = logreg.predict(test_dtm)
print (metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# use logistic regression with all features
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm_extra, y_train)
y_pred_class = logreg.predict(test_dtm_extra)
print (metrics.accuracy_score(y_test, y_pred_class))

## PART 12: SAVE AND LOAD YOUR MODEL

In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import wordpunct_tokenize

In [None]:
try:
    import cPickle as pickle
except ImportError:
    import pickle

# create a new DataFrame that only contains the 5-star and 1-star reviews
yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]

# split the new DataFrame into training and testing sets
feature_cols = ['text']
X = yelp_best_worst[feature_cols]
y = yelp_best_worst.stars
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# use CountVectorizer with text column only
vect = CountVectorizer()
train_dtm = vect.fit_transform(X_train.iloc[:, 0])
test_dtm = vect.transform(X_test.iloc[:, 0])

# use logistic regression with text column only
logreg = LogisticRegression(C=1e9)
logreg.fit(train_dtm, y_train)
y_pred_class = logreg.predict(test_dtm)
print (metrics.accuracy_score(y_test, y_pred_class))

## Dump the Logistic Regression Model
with open('yelp_nlp_logreg.pkl', 'wb') as out_s:
    pickle.dump(logreg, out_s)

## Save vocabulary
ngram_size = 1
vectorizer = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=1)
vect = vectorizer.fit(X_train.iloc[:, 0])

dictionary_filepath = 'yelp_nlp_vocabulary.pkl'
# use a context manager so the file is closed after writing
with open(dictionary_filepath, 'wb') as out_v:
    pickle.dump(vect.vocabulary_, out_v)

In [None]:
try:
    import cPickle as pickle
except ImportError:
    import pickle
    
def classify_with_logreg_model(review):
    ## TESTING WITH LOGISTIC REGRESSION & BAG-OF-WORDS
    dictionary_filepath = 'yelp_nlp_vocabulary.pkl'
    model_filepath = 'yelp_nlp_logreg.pkl'

    # LOAD VOCABULARY
    with open(dictionary_filepath, 'rb') as in_vocab:
        vocabulary_to_load = pickle.load(in_vocab)
    # ngram_size is the module-level value (1) that was set when the vocabulary was saved
    loaded_vectorizer = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=1, vocabulary=vocabulary_to_load)
    loaded_vectorizer._validate_vocabulary()

    ## LOAD THE SAVED CLASSIFIER
    with open(model_filepath, 'rb') as in_logreg:
        classifier = pickle.load(in_logreg)
    
    # the vocabulary is fixed, so only transform (do not refit) the new review
    review_counts = loaded_vectorizer.transform([review]).toarray()
    predictions = classifier.predict(review_counts)
    return predictions
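
The save/load round trip itself needs nothing beyond the standard library. The sketch below pickles a small made-up vocabulary dictionary to a temporary file and reads it back; the file name and contents are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical vocabulary: token -> column index
vocabulary = {"food": 0, "great": 1, "service": 2}

path = os.path.join(tempfile.mkdtemp(), "vocabulary.pkl")

# Context managers close the file even if dump/load raises
with open(path, "wb") as f:
    pickle.dump(vocabulary, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == vocabulary)
```

The restored object is an equal copy, not the same object. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.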

In [None]:
yelp_best_worst.text[0]

In [None]:
yelp_best_worst.text[35]

In [None]:
print (classify_with_logreg_model(yelp_best_worst.text[0]))
print (classify_with_logreg_model(yelp_best_worst.text[35]))

In [None]:
some_text = "Addison Londoner is crazy fun during soccer matches... but not so awful for everyday lunches"
classify_with_logreg_model(some_text)

## PART 13: FUN TEXTBLOB FEATURES

In [None]:
# spelling correction
TextBlob('15 minuets late').correct()

In [None]:
# spellcheck
Word('parot').spellcheck()

In [None]:
# definitions
Word('bank').define('v')

In [None]:
import spacy

# Load English tokenizer, tagger, parser, NER and word vectors
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp(yelp_best_worst.text[35])

# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)