
# Natural Language Processing


### Learning Objectives
*After completing this notebook, you will be able to:*
- Discuss the major tasks involved with natural language processing.
- Discuss, on a low level, the components of natural language processing.
- Identify why natural language processing is difficult.
- Demonstrate text classification.
- Demonstrate common text preprocessing techniques.

## <font color='red'> Now you try
    
Let's read in the dataset for this session.

Run the cells below and work out what each row and column corresponds to.

Then run the final cell to filter the dataframe to include **only** reviews with **1 or 5 stars**. 

In [1]:
import pandas as pd
import numpy as np
import scipy as sp

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB         # Naive Bayes
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

from nltk.stem.snowball import SnowballStemmer

%matplotlib inline

In [None]:
# Read yelp.csv into a DataFrame.
yelp = pd.read_csv('./data/yelp.csv')
yelp.head()

In [3]:
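# Create a new DataFrame that only contains the 5-star and 1-star reviews.
# .copy() gives us an independent DataFrame, so adding or overwriting columns
# later does not trigger pandas' SettingWithCopyWarning.
yelp_best_worst = yelp[(yelp['stars']==5) | (yelp['stars']==1)].copy()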
yelp_best_worst.head()

Unnamed: 0,business_id,date,review_id,stars,text,type,user_id,cool,useful,funny
0,9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,My wife took me here on my birthday for breakf...,review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0
1,ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,I have no idea why some people give bad review...,review,0a2KyEL0d3Yb1V6aivbIuQ,0,0,0
3,_1QQZuf4zZOyFCvXc0o6Vg,2010-05-27,G-WvGaISbqqaMHlNnByodA,5,"Rosie, Dakota, and I LOVE Chaparral Dog Park!!...",review,uZetl9T0NcROGOyFfughhg,1,2,0
4,6ozycU1RpktNG2-1BroVtw,2012-01-05,1uJFq2r5QfJG_6ExMRCaGw,5,General Manager Scott Petello is a good egg!!!...,review,vYmM4KTsC8ZfQBg-j5MWkw,0,0,0
6,zp713qNhx8d9KCJJnrw1xA,2010-02-12,riFQ3vxNpP4rWLk_CSri2A,5,Drop what you're doing and drive here. After I...,review,wFweIWhv2fREZV_dYkz_1g,7,7,4


<a id="cleaning"></a>

# <font color='blue'> Cleaning
    
Let's start with some basic cleaning tasks: converting everything to lowercase and stripping out punctuation and special characters.

In [4]:
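# regex=True is required in pandas >= 2.0, where str.replace treats the pattern as a literal string by default.
yelp_best_worst['text'] = yelp_best_worst['text'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.replace('\n', ' ')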
yelp_best_worst.loc[0,'text']



'my wife took me here on my birthday for breakfast and it was excellent  the weather was perfect which made sitting outside overlooking their grounds an absolute pleasure  our waitress was excellent and our food arrived quickly on the semibusy saturday morning  it looked like the place fills up pretty quickly so the earlier you get here the better  do yourself a favor and get their bloody mary  it was phenomenal and simply the best ive ever had  im pretty sure they only use ingredients from their garden and blend them fresh when you order it  it was amazing  while everything on the menu looks excellent i had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious  it came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete  it was the best toast ive ever had  anyway i cant wait to go back'

<a id="removing-stopwords"></a>

# <font color='blue'> Removing stopwords

- **What:** This process is used to remove common words that will likely appear in any text.
- **Why:** Because common words exist in most documents, they likely only add noise to your model and should be removed.

**What are stop words?**
Stop words are some of the most common words in a language. They are function words, such as prepositions, determiners, and conjunctions (e.g., "to," "the," "and"), that make a sentence grammatical. However, they are so commonly used that they are generally of little use for predicting the class of a document.

Example: 

> 1. Original sentence: "The dog jumped over the fence"  
> 2. After stop-word removal: "dog jumped over fence"

The fact that there is a fence and a dog jumped over it can be derived with or without stop words.
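
As a minimal sketch of the idea, the snippet below filters the example sentence against a tiny hand-written stop-word set. The set `toy_stop_words` and the helper `remove_stop_words` are illustrative assumptions, not NLTK's list or API.

```python
# Toy stop-word set for illustration only (an assumption, not NLTK's full list).
toy_stop_words = {"the", "a", "an", "to", "and", "of"}

def remove_stop_words(sentence, stop_words):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    tokens = sentence.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)

filtered = remove_stop_words("The dog jumped over the fence", toy_stop_words)
print(filtered)  # dog jumped over fence
```

A larger stop-word list that also contains words such as "over" would strip the sentence further, so the result always depends on which list you choose.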

NLTK has a built-in list of English stop words that we can inspect below:

In [5]:
import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 ...

We can remove stopwords from each document in our corpus.

In [6]:
yelp_best_worst['text_no_stopwords'] = yelp_best_worst['text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))




In [7]:
yelp_best_worst.loc[0,'text_no_stopwords']

'wife took birthday breakfast excellent weather perfect made sitting outside overlooking grounds absolute pleasure waitress excellent food arrived quickly semibusy saturday morning looked like place fills pretty quickly earlier get better favor get bloody mary phenomenal simply best ive ever im pretty sure use ingredients garden blend fresh order amazing everything menu looks excellent white truffle scrambled eggs vegetable skillet tasty delicious came 2 pieces griddled bread amazing absolutely made meal complete best toast ive ever anyway cant wait go back'

<a id="stemming-lemmatization"></a>

# <font color='blue'> Stemming and lemmatization
    
Stemming is a crude process of removing common endings from words, such as "s", "es", "ly", "ing", and "ed".

- **What:** Reduce a word to its base/stem/root form.
- **Why:** This reduces the number of features by grouping together (hopefully) related words.
- **Notes:**
    - Stemming uses a simple and fast rule-based approach.
    - Stemmed words are usually not shown to users (used for analysis/indexing).
    - Some search engines treat words with the same stem as synonyms.
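
To make "simple and fast rule-based approach" concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not the Snowball algorithm: the suffix list, the minimum stem length of three characters, and the name `toy_stem` are all assumptions made for illustration.

```python
# Suffixes are checked in this order; the first match wins.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word

for word in ["badly", "computing", "computed", "wiping", "is"]:
    print(word, "->", toy_stem(word))
```

Real stemmers apply many more rules and exceptions, but the principle is the same: pattern matching on word endings, with no dictionary lookup.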
    
Lemmatization is a more refined process that uses specific language and grammar rules to derive the root of a word.  

This is useful for words that do not share an obvious root such as "better" and "best".

- **What:** Lemmatization derives the canonical form ("lemma") of a word.
- **Why:** It can be more accurate than stemming, because it maps irregular forms such as "better" to a real dictionary word ("good").
- **Notes:** Uses a dictionary-based approach (slower than stemming).
    
**Lemmatization and Stemming Examples**

|Lemmatization|Stemming|
|-------------|---------|
|shouted → shout|badly → bad|
|best → good|computing → comput|
|better → good|computed → comput|
|good → good|wipes → wip|
|wiping → wipe|wiped → wip|
|hidden → hide|wiping → wip|

## Stemming

Let's try out stemming on some of our documents.

In [8]:
# Initialize stemmer.
stemmer = SnowballStemmer('english')

# Stem each word.
print(' '.join([stemmer.stem(word) for word in yelp_best_worst.loc[0,'text_no_stopwords'].split()]))

wife took birthday breakfast excel weather perfect made sit outsid overlook ground absolut pleasur waitress excel food arriv quick semibusi saturday morn look like place fill pretti quick earlier get better favor get bloodi mari phenomen simpli best ive ever im pretti sure use ingredi garden blend fresh order amaz everyth menu look excel white truffl scrambl egg veget skillet tasti delici came 2 piec griddl bread amaz absolut made meal complet best toast ive ever anyway cant wait go back


## Lemmatize

Now let's try lemmatizing some of our documents.

In [9]:
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 

# Treat every word as a verb (pos='v').
print([lemmatizer.lemmatize(word,pos='v') for word in yelp_best_worst.loc[0,'text_no_stopwords'].split()])


['wife', 'take', 'birthday', 'breakfast', 'excellent', 'weather', 'perfect', 'make', 'sit', 'outside', 'overlook', 'ground', 'absolute', 'pleasure', 'waitress', 'excellent', 'food', 'arrive', 'quickly', 'semibusy', 'saturday', 'morning', 'look', 'like', 'place', 'fill', 'pretty', 'quickly', 'earlier', 'get', 'better', 'favor', 'get', 'bloody', 'mary', 'phenomenal', 'simply', 'best', 'ive', 'ever', 'im', 'pretty', 'sure', 'use', 'ingredients', 'garden', 'blend', 'fresh', 'order', 'amaze', 'everything', 'menu', 'look', 'excellent', 'white', 'truffle', 'scramble', 'egg', 'vegetable', 'skillet', 'tasty', 'delicious', 'come', '2', 'piece', 'griddle', 'bread', 'amaze', 'absolutely', 'make', 'meal', 'complete', 'best', 'toast', 'ive', 'ever', 'anyway', 'cant', 'wait', 'go', 'back']


<a id="text-classification"></a>

# <font color='blue'> Text classification
    
We'll be training a classifier to predict the number of stars of a review based on the review text. 

* What are the features in this case?

* What is the response variable?

**Text classification is the task of predicting which category or topic a text sample is from.**

We may want to identify:
- Is an article a sports or business story?
- Does an email have positive or negative sentiment?
- Is the rating of a recipe 1, 2, 3, 4, or 5 stars?

**Predictions are often made by using the words as features and the label as the target output.**

Starting out, we will make each unique word (across all documents) a single feature. In any given corpus, we may have hundreds of thousands of unique words, so we may have hundreds of thousands of features!

- For a given document, the numeric value of each feature could be the number of times the word appears in the document.
    - So, most features will have a value of zero, resulting in a sparse matrix of features.

- This technique for vectorizing text is referred to as a bag-of-words model. 
    - It is called bag of words because the document's structure is lost — as if the words are all jumbled up in a bag.
    - The first step to creating a bag-of-words model is to create a vocabulary of all possible words in the corpus.

> Alternatively, we could make each column an indicator column, which is 1 if the word is present in the document (no matter how many times) and 0 if not. This vectorization could be used to reduce the importance of repeated words. For example, a website search engine would be susceptible to spammers who load websites with repeated words. So, the search engine might use indicator columns as features rather than word counts.
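
The two vectorizations described above can be sketched in plain Python. The three-document corpus and the helper names `count_vector` and `indicator_vector` are assumptions chosen for illustration; a real pipeline would use a vectorizer class and a sparse matrix.

```python
from collections import Counter

# Toy corpus (an assumption for illustration).
docs = ["call you tomorrow", "call me a cab", "please call me please"]

# Step 1: build a sorted vocabulary of every unique word in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocabulary):
    """One entry per vocabulary word: how many times it occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

def indicator_vector(doc, vocabulary):
    """One entry per vocabulary word: 1 if present in the document, else 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocabulary]

count_matrix = [count_vector(doc, vocabulary) for doc in docs]
indicator_matrix = [indicator_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
print(count_matrix[2])      # "please" is counted twice
print(indicator_matrix[2])  # "please" is only flagged as present
```

Word order plays no part in either representation, which is exactly what the "bag" in bag of words refers to.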

**We need to consider several things to decide if bag-of-words is appropriate.**

- Does order of words matter?
- Does punctuation matter?
- Does upper or lower case matter?

## Training and testing sets

We start by splitting our dataset into a training set and a testing set.

We'll train our classifier on the training set, then test its performance on the testing set.

In [10]:
# Define our features and response
X = yelp_best_worst['text_no_stopwords']
y = yelp_best_worst['stars']

# Split the new DataFrame into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42,test_size=0.7)

Let's inspect our training set.

In [11]:
X_train

7473    unexpected gem wine drinker fabulous flights g...
9598    finally teriyaki bowl place right nothing samu...
5778    sunflower market way cheaper whole foods lot s...
3915    yo favorite ajs gots beef prices go somewhere ...
2713    taylors refreshing cafe menu ive breakfast far...
                              ...                        
2737    let tell first crush phoenix happened least ex...
3142    dear love went museum romantic date lovely tim...
2065    like hotdogs motor thats says get mommys oldsm...
8596    nice facilities nice ac two fatal flaws 1 chil...
7787    single best ribs ive ever restaurant place bra...
Name: text_no_stopwords, Length: 1225, dtype: object

In [12]:
y_train

7473    5
9598    5
5778    5
3915    5
2713    5
       ..
2737    5
3142    5
2065    5
8596    1
7787    5
Name: stars, Length: 1225, dtype: int64

<a id="count-vectoriser"></a>

# <font color='blue'> Converting documents into numerical features

To use a machine learning model, we must convert unstructured text into numeric features. There are several different methods for doing this. 

`CountVectorizer` does what it sounds like! It converts each document into a vector of counts of different words. 

The result of running `CountVectorizer` across a corpus is a matrix, where each row corresponds to a document and each column corresponds to a unique word that occurs across all documents. 

![DTM](images/DTM.png)

We can do this easily using `scikit-learn`:

In [13]:
# Use CountVectorizer to create document-term matrices from X_train and X_test.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

In [14]:
# Rows are documents, columns are terms (aka "tokens" or "features", individual words in this situation).
X_train_dtm.shape

(1225, 11513)

In [15]:
X_test_dtm.shape

(2861, 11513)

We can inspect the features in our matrix easily. Let's preview a slice of our vocabulary. 

In [16]:
vect.vocabulary_

{'unexpected': 10774,
 'gem': 4472,
 'wine': 11275,
 'drinker': 3399,
 'fabulous': 3875,
 'flights': 4149,
 'great': 4683,
 'selection': 9005,
 'cant': 1783,
 'say': 8883,
 'much': 6786,
 'awesome': 955,
 'finally': 4059,
 'teriyaki': 10277,
 'bowl': 1459,
 'place': 7741,
 'right': 8617,
 'nothing': 7046,
 'samuri': 8821,
 'sams': 8820,
 'yogis': 11445,
 'bit': 1290,
 'better': 1229,
 'opinion': 7217,
 'menu': 6509,
 'strait': 9820,
 'forward': 4264,
 'clearly': 2212,
 'lists': 6106,
 'options': 7228,
 'available': 934,
 'clean': 2200,
 'staff': 9680,
 'fast': 3944,
 'friendly': 4338,
 'food': 4208,
 'also': 555,
 'reasonably': 8309,
 'priced': 7988,
 'impressed': 5330,
 'different': 3139,
 'bowlplatesalad': 1461,
 'able': 304,
 'order': 7236,
 'veggies': 10924,
 'chicken': 2056,
 'andor': 613,
 'beef': 1159,
 'pretty': 7980,
 'however': 5191,
 'want': 11076,
 'extra': 3857,
 'meat': 6449,
 'lettuce': 6014,
 'brown': 1565,
 'rice': 8597,
 'white': 11229,
 'etc': 3719,
 'anyway': 669,
 

## N-Grams

N-grams are features that consist of N consecutive words. They are useful because a plain bag-of-words model treats `data` and `scientist` as two independent features, whereas the single feature `data scientist` carries more meaning!

Example:
```
my cat is awesome
Unigrams (1-grams): 'my', 'cat', 'is', 'awesome'
Bigrams (2-grams): 'my cat', 'cat is', 'is awesome'
Trigrams (3-grams): 'my cat is', 'cat is awesome'
4-grams: 'my cat is awesome'
```

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.

In [17]:
# Include 1-grams and 2-grams.
vect = CountVectorizer(ngram_range=(1, 2),stop_words='english')
X_train_dtm = vect.fit_transform(X_train)
X_train_dtm.shape

(1225, 68327)

We can start to see how supplementing our features with n-grams can lead to more feature columns. When we produce n-grams of size $n$ from a document with $W$ words, we add at most $(W-n+1)$ additional features. That said, be careful: when we compute n-grams from an entire corpus, the number of _unique_ n-grams could be vastly higher than the number of _unique_ unigrams! This could cause an undesired feature explosion.

Although we sometimes add important new features that have meaning such as `data scientist`, many of the new features will just be noise. So, particularly if we do not have much data, adding n-grams can actually decrease model performance. This is because if each n-gram is only present once or twice in the training set, we are effectively adding mostly noisy features to the mix.
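
A sliding window is enough to generate n-grams by hand, which makes it easy to check how many a single document produces. The helper `ngrams` below is a hypothetical name used for illustration; it is not part of scikit-learn.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "my cat is awesome".split()

for n in range(1, 5):
    grams = ngrams(tokens, n)
    # A document with len(tokens) words yields len(tokens) - n + 1 n-grams.
    print(n, len(grams), grams)
```

Each document contributes only a handful of n-grams, but across a corpus most of them are distinct, which is where the growth in feature columns comes from.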

In [18]:
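# Preview 50 features (indices 2000 to 2049).
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(vect.get_feature_names_out()[2000:2050].tolist())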

['allowed sample', 'allowed time', 'allowed work', 'allowing', 'allowing continue', 'allowing extremely', 'allowing time', 'allows', 'allows able', 'allows patrons', 'allows person', 'allows premises', 'allstar', 'allstar opened', 'allstars', 'allstars play', 'allure', 'allure trails', 'alluringmaybe', 'alluringmaybe sushi', 'almond', 'almond brittle', 'almond buttercrunch', 'almond chocolate', 'almond croissant', 'almond croissants', 'almonds', 'almonds mixed', 'almondsthis', 'almondsthis choice', 'aloe', 'aloe vera', 'alofts', 'alofts probably', 'alofts youve', 'aloha', 'aloha hour', 'aloha kitchen', 'alongside', 'alongside barely', 'alongside steak', 'alons', 'alons rock', 'aloof', 'aloof fawning', 'alot', 'alot better', 'alot cheese', 'alot cupcakes', 'alot flavor']


<a id='cvec_opt'></a>
### Other CountVectorizer Options

- `max_features`: int or None, default=None
- If not None, build a vocabulary that only considers the top `max_features` terms, ordered by term frequency across the corpus. This allows us to keep more common n-grams and remove ones that may appear only once. Words that occur only once can end up highly associated with a single class and cause overfitting.

In [19]:
# Remove English stop words and only keep 100 features.
vect = CountVectorizer(ngram_range=(1, 2),stop_words='english', max_features=100)
X_train_dtm = vect.fit_transform(X_train)


In [20]:
# All 100 features
print(vect.get_feature_names_out().tolist())

['amazing', 'area', 'awesome', 'bad', 'bar', 'best', 'better', 'big', 'came', 'check', 'cheese', 'chicken', 'come', 'day', 'definitely', 'delicious', 'didnt', 'different', 'dinner', 'dont', 'eat', 'excellent', 'experience', 'family', 'favorite', 'feel', 'food', 'fresh', 'friendly', 'friends', 'going', 'good', 'got', 'great', 'happy', 'home', 'hot', 'hour', 'house', 'im', 'ive', 'know', 'like', 'little', 'location', 'long', 'look', 'looking', 'love', 'lunch', 'make', 'meal', 'menu', 'minutes', 'new', 'nice', 'night', 'old', 'order', 'ordered', 'people', 'perfect', 'phoenix', 'pizza', 'place', 'pretty', 'price', 'prices', 'really', 'recommend', 'restaurant', 'review', 'right', 'room', 'said', 'salad', 'sauce', 'say', 'service', 'small', 'special', 'staff', 'store', 'sure', 'table', 'thing', 'things', 'think', 'time', 'times', 'told', 'took', 'try', 'wait', 'want', 'way', 'went', 'worth', 'years', 'youre']


Just like with all other models, more features do not mean a better model. So, we must tune our feature generator to remove features with little or no predictive power.

In this case, there is roughly a 1.6% increase in accuracy when we double the n-gram size and increase our max features by 1,000-fold. Note that if we restrict it to only unigrams, then the accuracy increases even more! So, bigrams were very likely adding more noise than signal. 

In the end, by only using 16,000 unigram features we came away with a much smaller, simpler, and easier-to-think-about model which also resulted in higher accuracy.

In [26]:
# Include 1-grams and 2-grams, and limit the number of features.

print('1-grams and 2-grams, up to 100K features:')
vect = CountVectorizer(ngram_range=(1, 2), max_features=100000)
X_train_dtm = vect.fit_transform(X_train)
print(vect.get_feature_names_out()[1000:1010].tolist())

print('1-grams only, up to 100K features:')
vect = CountVectorizer(ngram_range=(1, 1), max_features=100000)
X_train_dtm = vect.fit_transform(X_train)
print(vect.get_feature_names_out()[1000:1010].tolist())

1-grams and 2-grams, up to 100K features:
['able open', 'able order', 'able pay', 'able properly', 'able purchase', 'able reference', 'able say', 'able seat', 'able seated', 'able see']
1-grams only, up to 100K features:
['baggage', 'bagged', 'bagginsey', 'baggy', 'bagjust', 'bags', 'baguettes', 'bahn', 'bait', 'baja']


- `min_df`: Float in range [0.0, 1.0] or int, default=1
- When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called the cut-off in the literature. If a float, the parameter represents a proportion of documents; if an integer, an absolute count.
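
The filtering rule can be sketched without scikit-learn. The corpus and the helpers `document_frequency` and `filter_vocabulary` are assumptions for illustration; they mimic the described behaviour of `min_df` rather than reproduce the library's code.

```python
from collections import Counter

# Toy corpus (an assumption for illustration).
docs = ["great food great service", "great pizza", "slow service"]

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.split()))  # set(): each document counts a term once
    return df

def filter_vocabulary(docs, min_df=1):
    """Keep terms whose document frequency is at least the threshold."""
    df = document_frequency(docs)
    # A float is read as a proportion of documents, an int as an absolute count.
    threshold = min_df * len(docs) if isinstance(min_df, float) else min_df
    return sorted(term for term, count in df.items() if count >= threshold)

print(filter_vocabulary(docs, min_df=1))
print(filter_vocabulary(docs, min_df=2))
print(filter_vocabulary(docs, min_df=0.5))
```

Note that "great" occurs three times in total but has a document frequency of two, because document frequency counts documents, not occurrences.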

In [27]:
# Include 1-grams and 2-grams, and only include terms that appear in at least two documents.
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
X_train_dtm = vect.fit_transform(X_train)
print(vect.get_feature_names_out()[1000:1010].tolist())

['bite', 'bite good', 'bite like', 'bites', 'bitterness', 'black', 'black bean', 'black beans', 'black white', 'blah']


<a id="bayes"></a>

# <font color='blue'> Text Classification with Naive Bayes

Naive Bayes is a very popular classifier because it has minimal storage requirements, is fast, can be tuned easily with more data, and has found very useful applications in text classification. For example, Paul Graham originally proposed using Naive Bayes to detect spam in [A Plan for Spam](http://www.paulgraham.com/spam.html).

**What is Bayes?**  
Bayes, or Bayes' Theorem, is a different way to assess probability. It considers prior information in order to more accurately assess the situation.

**Example:** You are playing roulette.

As you approach the table, you see that the last number the ball landed on was Red-3. With a frequentist mindset, you know that the ball is just as likely to land on Red-3 again given that every slot on the wheel has an equal opportunity of 1 in 37.

Given that you started believing that the ball can land in each slot with an equal likelihood _and_ that you have only seen one throw previously, you rationally believe that there would be no difference between picking Red a second time now or picking Black -- ideally they would happen with the same likelihood!

However, as you sit and watch the roulette table, you begin to notice something strange. The ball is _always_ landing on red. Every single time the ball is thrown, it lands in a red slot. Even though your past beliefs stated that red and black were equally likely, every time it lands in red, you change those beliefs a little more towards a biased roulette table. 

This is what Bayes is all about — adjusting probabilities as more data is gathered!

Below is the equation for Bayes.  

$$P(A \ | \ B) = \frac {P(B \ | \ A) \times P(A)} {P(B)}$$

- **$P(A \ | \ B)$** : Probability of `Event A` occurring given `Event B` has occurred.
- **$P(B \ | \ A)$** : Probability of `Event B` occurring given `Event A` has occurred.
- **$P(A)$** : Probability of `Event A` occurring.
- **$P(B)$** : Probability of `Event B` occurring.
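
To connect the formula to the roulette story, the sketch below takes `Event A` to be "the table is biased" and `Event B` to be "the ball landed on red n times in a row". The 1% prior and the assumption that a biased table always lands on red are made-up numbers for illustration; the fair-table probability of 18/37 assumes a wheel with 18 red slots out of 37.

```python
def posterior_biased(prior_biased, n_reds, p_red_fair=18 / 37, p_red_biased=1.0):
    """P(table is biased | n reds in a row), computed with Bayes' theorem."""
    likelihood_biased = p_red_biased ** n_reds  # P(B | A)
    likelihood_fair = p_red_fair ** n_reds      # P(B | not A)
    # P(B): total probability of seeing n reds under either hypothesis.
    evidence = (likelihood_biased * prior_biased
                + likelihood_fair * (1 - prior_biased))
    return likelihood_biased * prior_biased / evidence  # P(A | B)

for n in (0, 1, 5, 10, 20):
    print(n, round(posterior_biased(0.01, n), 4))
```

With no observations the posterior equals the prior, and every additional red pushes the belief further towards a biased table.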


Let's train a Naive Bayes classifier on our Yelp reviews.

In [None]:
# Use default options for CountVectorizer.
vect = CountVectorizer()

# Create document-term matrices.
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)


## Train a Naive Bayes classifier on our training set

We initialise a Naive Bayes classifier object. We then **fit** (or **train**) our classifier on our training dataset, which consists of the document-term matrix `X_train_dtm` together with `y_train`, the labels for each document in our training set.

In [None]:
# Use Naive Bayes to predict the star rating.
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_dtm, y_train)
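
For intuition about what fitting does, here is a minimal sketch of a multinomial Naive Bayes classifier built on word counts. It is not scikit-learn's implementation: the helpers `train_naive_bayes` and `predict`, the four toy reviews, and the use of add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Return log priors and add-one-smoothed log likelihoods for each class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocabulary = {word for counts in word_counts.values() for word in counts}

    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # Add-one smoothing: every vocabulary word gets one extra count.
        total = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            word: math.log((word_counts[c][word] + 1) / total)
            for word in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][word]
            for word in doc.split()
            if word in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (hypothetical reviews with 5-star and 1-star labels).
train_docs = ["amazing food great service", "great place amazing staff",
              "terrible food rude staff", "awful service terrible place"]
train_labels = [5, 5, 1, 1]

log_prior, log_likelihood = train_naive_bayes(train_docs, train_labels)
print(predict("great amazing food", log_prior, log_likelihood))
print(predict("rude terrible service", log_prior, log_likelihood))
```

The "naive" part is the sum over words: each word contributes to the score independently of the words around it.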


## Test our classifier on the testing set

We can now call our classifier's `predict` method to predict the classification (i.e. the star rating) of the reviews in our testing set.

In [None]:
y_pred_class = naive_bayes_classifier.predict(X_test_dtm)
y_pred_class

We can now compute the accuracy of our classifier by calling `scikit-learn`'s `metrics.accuracy_score` function, which compares the actual star ratings (`y_test`) to our classifier's predictions (`y_pred_class`).

In [None]:
# Calculate accuracy.
print((metrics.accuracy_score(y_test, y_pred_class)))

## Comparing our classifier to a baseline

Let's compare the accuracy of our classifier to our baseline accuracy: the accuracy of a classifier that always predicts the most frequently occurring class in our training set (i.e. 5-star reviews).

In [None]:
# Calculate null accuracy.
y_test_binary = np.where(y_test==5, 1, 0) # five stars become 1, one stars become 0
print('Percent 5 Stars:', y_test_binary.mean())
print('Percent 1 Stars:', 1 - y_test_binary.mean())

Our model achieved ~89% accuracy, which is an improvement over the baseline accuracy of 81% (a model that always predicts 5 stars).
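
The baseline itself can be written as a tiny function: find the majority class in the training labels, then score a model that always predicts it. The helper name `null_accuracy` and the toy labels below are assumptions for illustration.

```python
from collections import Counter

def null_accuracy(y_train, y_test):
    """Accuracy on y_test of always predicting the most common class in y_train."""
    majority_class, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_class)
    return majority_class, hits / len(y_test)

# Toy labels (hypothetical star ratings).
toy_train = [5, 5, 5, 1, 5]
toy_test = [5, 1, 5, 5]

majority_class, baseline = null_accuracy(toy_train, toy_test)
print(majority_class, baseline)  # 5 0.75
```

Any classifier worth keeping should beat this number, since it can be reached without looking at the review text at all.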

<a id='tfidf'></a>
## Term Frequency–Inverse Document Frequency (TF–IDF)

While a Count Vectorizer simply totals up the number of times a "word" appears in a document, the more complex TF-IDF Vectorizer analyzes the uniqueness of words between documents to find distinguishing characteristics. 
     
Term frequency–inverse document frequency (TF–IDF) computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents.

It's more useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents).

It's used for search-engine scoring, text summarization, and document clustering.

**More details:** [TF–IDF is about what matters](http://planspace.org/20150524-tfidf_is_about_what_matters/)

In [None]:
# Example documents
simple_train = ['call you tomorrow', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term frequency
vect = CountVectorizer()
tf = pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf

In [None]:
# Document frequency
vect = CountVectorizer(binary=True)
df = vect.fit_transform(simple_train).toarray().sum(axis=0)
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names_out())

In [None]:
# Term frequency–inverse document frequency (simple version)
tf/df

The higher the TF–IDF value, the more "important" the word is to that specific document. Here, counting documents from zero, "cab" is the most important and unique word in document 1, while "please" is the most important and unique word in document 2. TF–IDF is often used for training as a replacement for word count.
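
The same simplified score (term frequency divided by document frequency) can be reproduced in plain Python to see where each number comes from. The helper `tokenize` is an assumption that imitates `CountVectorizer`'s default behaviour of lowercasing text and keeping tokens of two or more word characters.

```python
import re
from collections import Counter

simple_train = ['call you tomorrow', 'Call me a cab', 'please call me... PLEASE!']

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in simple_train]

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# Simplified TF-IDF: term count within the document divided by document frequency.
scores = [
    {term: count / df[term] for term, count in Counter(tokens).items()}
    for tokens in tokenized
]

for index, doc_scores in enumerate(scores):
    print(index, doc_scores)
```

"call" appears in every document, so its score is pulled down to 1/3 everywhere, while "please" appears twice in a single document and scores 2.0.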

In [None]:
# TfidfVectorizer
vect = TfidfVectorizer()
pd.DataFrame(vect.fit_transform(simple_train).toarray(), columns=vect.get_feature_names_out())