# Introduction

## What Is Natural Language Processing (NLP)?

NLP is the use of computers to process (analyze, understand, and generate) natural human languages.

## Why use NLP?

An enormous amount of information is stored as text. Computers can process this information much faster than humans.

## Higher-Level NLP Tasks

- **Chatbots:** Understand natural language from the user and return intelligent responses.
    - [Api.ai](https://api.ai/)
- **Information retrieval:** Find relevant results and similar results.
    - [Google](https://www.google.com/)    
- **Information extraction:** Structured information from unstructured documents.
    - [Events from Gmail](https://support.google.com/calendar/answer/6084018?hl=en)
- **Machine translation:** One language to another.
    - [Google Translate](https://translate.google.com/)
- **Text simplification:** Preserve the meaning of text, but simplify the grammar and vocabulary.
    - [Rewordify](https://rewordify.com/)
    - [Simple English Wikipedia](https://simple.wikipedia.org/wiki/Main_Page)
- **Predictive text input:** Faster or easier typing.
    - [Phrase completion application](https://justmarkham.shinyapps.io/textprediction/)
    - [A much better application](https://farsite.shinyapps.io/swiftkey-cap/)
- **Sentiment analysis:** Attitude of speaker.
    - [Hater News](https://medium.com/@KevinMcAlear/building-hater-news-62062c58325c)
- **Automatic summarization:** Extractive or abstractive summarization.
    - [autotldr](https://www.reddit.com/r/technology/comments/35brc8/21_million_people_still_use_aol_dialup/cr2zzj0)
- **Natural language generation:** Generate text from data.
    - [How a computer describes a sports match](http://www.bbc.com/news/technology-34204052)
    - [Publishers withdraw more than 120 gibberish papers](http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763)
- **Speech recognition and generation:** Speech-to-text, text-to-speech.
    - [Google's Web Speech API demo](https://www.google.com/intl/en/chrome/demos/speech.html)
    - [Vocalware Text-to-Speech demo](https://www.vocalware.com/index/demo)
- **Question answering:** Determine the intent of the question, match query with knowledge base, evaluate hypotheses.
    - [How did supercomputer Watson beat Jeopardy champion Ken Jennings?](http://blog.ted.com/how-did-supercomputer-watson-beat-jeopardy-champion-ken-jennings-experts-discuss/)
    - [IBM's Watson Trivia Challenge](http://www.nytimes.com/interactive/2010/06/16/magazine/watson-trivia-game.html)
    - [The AI Behind Watson](http://www.aaai.org/Magazine/Watson/watson.php)

## Lower-Level Components

- **Tokenization:** Breaking text into tokens (words, sentences, n-grams)
- **Stop-word removal:** a/an/the
- **Stemming and lemmatization:** root word
- **TF-IDF:** word importance
- **Part-of-speech tagging:** noun/verb/adjective
- **Named entity recognition:** person/organization/location
- **Spelling correction:** "New Yrok City"
- **Word sense disambiguation:** "buy a mouse"
- **Segmentation:** "New York City subway"
- **Language detection:** "translate this page"
- **Vectorizing:** Turning documents into vectors of numbers for use in machine learning
- **Machine learning:** specialized models that work well with text

## Why NLP is hard

Natural language processing requires an understanding of both the language and the world. Several things make this difficult:

- **Ambiguity**:
    - Hospitals Are Sued by 7 Foot Doctors
    - Juvenile Court to Try Shooting Defendant
    - Local High School Dropouts Cut in Half
- **Non-standard English:** text messages
- **Idioms:** "throw in the towel"
- **Newly coined words:** "retweet"
- **Tricky entity names:** "Where is A Bug's Life playing?"
- **World knowledge:** "Mary and Sue are sisters", "Mary and Sue are mothers"

## NLP terms

- **corpus** (plural **corpora**): a collection of documents (derived from the Latin word for "body")
- **document**: any item in a corpus (e.g. email, book chapter, tweet, article, or text message).

<a id='yelp_rev'></a>

# Reading in the Yelp Reviews

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import scipy as sp

from nltk.stem.snowball import SnowballStemmer
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from textblob import TextBlob, Word

%matplotlib inline

In [None]:
# Read yelp.csv into a DataFrame.
path = Path('..', 'assets', 'data', 'yelp.csv')
yelp = pd.read_csv(path)

In [None]:
# Simplify star rating prediction problem by making it binary, splitting between 3 and 4


In [None]:
# Define X and y.


In [None]:
# Split the new DataFrame into training and testing sets.


In [None]:
# The head of the data


<a id='text_class'></a>


# Introduction: Text Classification

**Text classification is the task of predicting which category or topic a text sample is from.**

E.g.:
- Is an article a sports or business story?
- Does an email have positive or negative sentiment?
- Is the rating of a recipe 1, 2, 3, 4, or 5 stars?


**Turning text into feature vectors**

The only difference between this task and the kinds of classification tasks we have been considering is that our inputs consist of text rather than numeric features.

If we can find a way to represent text using a set of numeric features, then we can use standard machine learning classifiers for text classification.

We will start out with a **bag-of-words representation:**

- Preprocess the text, e.g. to remove punctuation and convert uppercase letters to lowercase.
- Create a vocabulary, e.g. every word in the corpus.
- Make each word in the vocabulary a feature.
- Represent each document with a vector that indicates how many times each word in the vocabulary appears in that document.
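
The four steps above can be sketched in plain Python. This is only an illustration on two made-up sentences; the helper name `bag_of_words` and the tokenization rule are assumptions chosen for the example, not what scikit-learn does internally.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of raw text documents."""
    # Preprocess: lowercase, then keep runs of letters, digits, and apostrophes.
    tokenized = [re.findall(r"[a-z0-9']+", doc.lower()) for doc in documents]
    # Vocabulary: every distinct word in the corpus, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # One count vector per document, with one position per vocabulary word.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up example documents.
docs = ["The food was great!", "Great service, great food."]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so every document ends up with the same length regardless of how long the original text was.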

## Demo: Text Processing in scikit-learn

<a id='count_vec'></a>
### Creating Features Using CountVectorizer

- **What:** Converts each document into a set of words and their counts.
- **Why:** To use a machine learning model, we must convert unstructured text into numeric features.

<a id='countvectorizer-model'></a>


### Using CountVectorizer in a Model
![DTM](../assets/images/DTM.png)

**Notes on bag-of-words:**

- The phrases "term-document matrix" and "document-term matrix" are interchangeable.
- Vocabulary will often contain tens of thousands of words or more.
- Most features will have a value of zero for most documents, resulting in a sparse matrix of features.
- This approach is called "bag-of-words" because it loses the document's structure — as if the words are all jumbled up in a bag.
- Rather than counting occurrences of each word, you might record just a 1 or 0 to indicate whether it is present, or divide each count by the length of the document to get the word's relative frequency.

In [None]:
# Use CountVectorizer to create document-term matrices from X_train and X_test.


In [None]:
# Transformed feature matrices are stored as sparse matrices for efficiency.

# A sparse representation stores the values and locations of non-zero elements,
# rather than storing a number for every element, which saves space when
# most elements are zero.


The "sparse matrix" data structure records which positions in the matrix have nonzero values and what those values are, as opposed to a "dense matrix" data structure that records the value at every position. The sparse matrix format is much more space-efficient when the matrix consists primarily of zeros, as a typical document-term matrix does.
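
As a small illustration of the difference, the sketch below builds a tiny made-up matrix and converts it to SciPy's compressed sparse row format. The matrix values are arbitrary; only the contrast between stored entries and total entries matters.

```python
import numpy as np
from scipy import sparse

# A made-up 3 x 5 "document-term matrix" that is mostly zeros.
dense = np.array([
    [0, 2, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [3, 0, 0, 0, 0],
])

# Compressed sparse row format stores only the non-zero values and their positions.
compact = sparse.csr_matrix(dense)

print(compact.shape)      # logical shape is unchanged
print(compact.nnz)        # number of values actually stored
print(dense.size)         # number of values a dense array stores
print(compact.toarray())  # convert back to a dense array for viewing
```

With a real vocabulary of tens of thousands of words, the gap between the stored values and the full matrix size becomes much larger.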

In [None]:
# View as a dense matrix


In [None]:
# Rows are documents, columns are terms (aka "tokens" or "features", individual words in this situation).


**Exercise (3 mins., post right away)**

- 7500 of what?

- 25797 of what?

- How would you interpret the output of `X_train_dtm.sum(axis=1)`?

- How would you interpret the output of `X_train_dtm.sum(axis=0)`?

$\blacksquare$

In [None]:
# Last 50 features


In [None]:
# Let's take a look at the vocabulary that was generated (one entry per unique word).
#   'vocabulary_' is a dictionary that converts each word to its index in the sparse matrix.


In [None]:
# Show vectorizer options.


[CountVectorizer documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

By default, CountVectorizer converts all text to lowercase before generating features. Otherwise, "Pizza" at the start of a sentence becomes a different feature from "pizza" in the middle of a sentence.

On the other hand, you would want different features corresponding to "Apple" the company and "apple" the fruit, so this step does discard some information.
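
A minimal sketch of this trade-off, using four made-up sentences rather than the Yelp data, compares the vocabularies learned with and without lowercasing:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only.
docs = ["Pizza is great", "I love pizza", "Apple makes phones", "An apple a day"]

lowered = CountVectorizer().fit(docs)               # lowercase=True is the default
cased = CountVectorizer(lowercase=False).fit(docs)  # keep the original casing

print(sorted(lowered.vocabulary_))
print(sorted(cased.vocabulary_))
```

With `lowercase=False`, "Pizza" and "pizza" become separate features, as do "Apple" and "apple", so the vocabulary grows.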

In [None]:
# Don't convert to lowercase.


**Exercise (1 min., post right away)**

- What is the vocabulary size for CountVectorizer on this dataset with `lowercase=False`? Is this size greater or smaller than the size with `lowercase=True`? Why?

$\blacksquare$

In [None]:
# Create document-term matrices using default options for CountVectorizer.


In [None]:
# Use Naive Bayes to predict the star rating.


In [None]:
# Calculate accuracy.


In [None]:
# Check label balance


In [None]:
# first create an array with the same shape as y
# then fill it in with the most common value -- numpy "broadcasts" the sum over the whole array


In [None]:
# then compare predicting the most common class every time to the true values


Our estimator predicted with ~82% accuracy, which is an improvement over this baseline 69% accuracy (always predicting a positive rating).
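
The baseline idea can be sketched on a handful of made-up labels. The array `y_true` below is hypothetical and is not the Yelp test set; it only shows how a "always predict the most common class" baseline is scored.

```python
import numpy as np

# Hypothetical binary labels: seven positives and three negatives.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Find the most common class and predict it for every observation.
most_common = np.bincount(y_true).argmax()
y_baseline = np.full_like(y_true, most_common)

# Accuracy of the baseline is simply the share of the most common class.
baseline_accuracy = (y_baseline == y_true).mean()
print(most_common, baseline_accuracy)
```

Any model worth keeping should beat this number, since the baseline uses no information from the text at all.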

Let's look more into how the vectorizer works.

In [None]:
# Define a function that accepts a vectorizer and calculates the accuracy.


In [None]:
# min_df=2 says to ignore words that appear in fewer than two documents ('df' means "document frequency").


Let's take a look next at other ways of preprocessing text!

<a id='ngrams'></a>
### N-Grams

N-grams are features that consist of N consecutive words. They are useful because a plain bag-of-words model treats every word independently, whereas a phrase such as `data scientist` carries more meaning as a single feature than the two independent features `data` and `scientist`.

Example:
```
my cat is awesome
Unigrams (1-grams): 'my', 'cat', 'is', 'awesome'
Bigrams (2-grams): 'my cat', 'cat is', 'is awesome'
Trigrams (3-grams): 'my cat is', 'cat is awesome'
4-grams: 'my cat is awesome'
```

- **ngram_range:** tuple (min_n, max_n)
- The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
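
To make the `(min_n, max_n)` range concrete, here is a plain-Python sketch that generates n-grams for the example sentence. The function name `ngrams` is made up for this illustration, and it splits on whitespace only, which is simpler than the tokenization `CountVectorizer` performs.

```python
def ngrams(text, min_n, max_n):
    """Return all n-grams of the text with min_n <= n <= max_n."""
    tokens = text.split()
    result = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


print(ngrams("my cat is awesome", 1, 2))
print(ngrams("my cat is awesome", 3, 4))
```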

In [None]:
# Include 1-grams and 2-grams.


As $n$ gets larger, the number of *unique* n-grams increases greatly. Most of these longer n-grams appear in only a few documents, so they tend to add noise to the data set rather than signal.

In [None]:
# Last 50 features


<a id='stopwords'></a>

### Stop-Word Removal

- **What:** This process is used to remove common words that will likely appear in any text.
- **Why:** Because common words exist in most documents, they likely only add noise to your model and should be removed.

**What are stop words?**
Stop words are some of the most common words in a language. They are mostly function words that hold a sentence together grammatically, such as prepositions, determiners, and conjunctions (e.g., "to," "the," "and"). However, they are used so commonly that they are generally of little use for predicting the class of a document. Since "a" appears in both spam and non-spam emails, for example, it would mostly contribute noise to our model.

Example: 

> 1. Original sentence: "The dog jumped over the fence"  
> 2. After stop-word removal: "dog jumped over fence"

The fact that there is a fence and a dog jumped over it can be derived with or without stop words.

In [None]:
# Show vectorizer options.


- **stop_words:** string {`english`}, list, or None (default)
- If `english`, a built-in stop word list for English is used.
- If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
- If None, no stop words will be used. `max_df` can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms. (If `max_df` = 0.7, then if > 70% of documents contain a word it will not be included in the feature set!)
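
A short sketch of the `stop_words` option on the example sentence from above follows. Note that scikit-learn's built-in English list is fairly broad, so it may remove more words than the hand-worked example does.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The dog jumped over the fence"]

with_stops = CountVectorizer().fit(docs)
without_stops = CountVectorizer(stop_words="english").fit(docs)

print(sorted(with_stops.vocabulary_))     # includes "the"
print(sorted(without_stops.vocabulary_))  # "the" has been filtered out
```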

In [None]:
# CountVectorizer stop words for English


<a id='cvec_opt'></a>
### Other CountVectorizer Options

- `max_features`: int or None, default=None
- If not None, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus. This lets us keep the more common n-grams and drop ones that may appear only once. A term that occurs only once can look strongly associated with a class purely by chance, which can cause overfitting.
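
As a rough illustration of `max_features`, the sketch below fits a vectorizer on four made-up documents and keeps only the two most frequent terms. The documents are chosen so that there are no ties in term frequency.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "good" appears 4 times, "food" 3 times, "service" and "bad" once each.
docs = ["good food good service", "good food", "bad food", "good"]

vect = CountVectorizer(max_features=2).fit(docs)
print(sorted(vect.vocabulary_))
```

The rare terms "service" and "bad" are dropped, which is the pruning behavior described above.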

In [None]:
# Remove English stop words and only keep 100 features.


In [None]:
# All 100 features


Just like with all other models, more features do not necessarily mean a better model. So, we must tune our feature generator to remove features with little or no predictive capability.

In our case, using about 26,000 unigram features rather than 339,112 bigram features gave us a much smaller, simpler, and easier-to-think-about model and also resulted in higher accuracy. Our model and our dataset size were not sufficient to pick out the signal within the noise that the bigrams added.

In [None]:
# Include 1-grams and 2-grams, and limit the number of features.


In [None]:
# Include 1-grams and 2-grams, and only include terms that appear in at least two documents.


Adding bigrams does improve performance modestly if we prune down the set of features that we use.

**Exercise (2 mins., post right away)**

How does each of the following changes to the feature representation used affect the bias and variance of a resulting model?

- Increasing `min_df`

- Using both unigrams and bigrams instead of just unigrams.

- Removing stop words

- Decreasing `max_features`

$\blacksquare$

<a id='textblob'></a>
## Introduction to TextBlob

You should already have downloaded TextBlob, a Python library used to explore common NLP tasks. If you haven’t, please return to [this step](#textblob_install) for instructions on how to do so. We’ll be using this to organize our corpus for analysis.

As mentioned earlier, you can read more on the [TextBlob website](https://textblob.readthedocs.io/en/dev/).

In [None]:
# Print the first review.


In [None]:
# Save it as a TextBlob object.


In [None]:
# List the words.


In [None]:
# List the sentences.


In [None]:
# Some string methods are available.


<a id='stem'></a>
## Stemming and Lemmatization
Interesting read: [Stemming and Lemmatization by Stanford NLP Lab](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Stemming is a crude process of removing common endings from words, such as "s", "es", "ly", "ing", and "ed".

- **What:** Reduce a word to its base/stem/root form.
- **Why:** This reduces the number of features by grouping together (hopefully) related words.
- **Notes:**
    - Stemming uses a simple and fast rule-based approach.
    - Stemmed words are usually not shown to users (used for analysis/indexing).
    - Some search engines treat words with the same stem as synonyms.

The [Snowball stemming algorithm (a.k.a. Porter2)](http://snowball.tartarus.org/algorithms/english/stemmer.html) is an improved version of the original [Porter's stemming algorithm](http://www.cs.odu.edu/~jbollen/IR04/readings/readings5.pdf) by Martin Porter.

In [None]:
# Initialize stemmer.


In [None]:
# Stem each word.


Some examples you can see are "excellent" stemmed to "excel" and "amazing" stemmed to "amaz".
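
A minimal sketch of the stemmer on a few individual words, rather than a full review, is shown below. The word list is chosen for illustration; the stemmer itself is the `SnowballStemmer` imported at the top of the notebook.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# A few hand-picked words to show how endings are stripped.
words = ["excellent", "amazing", "computing", "computed"]
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(word, "->", stem)
```

Notice that a stem does not have to be a real word; it only needs to be the same for related forms so that they collapse into one feature.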

Lemmatization is a more refined process that uses specific language and grammar rules to derive the root of a word.  

This is useful for words that do not share an obvious root such as "better" and "best".

- **What:** Lemmatization derives the canonical form ("lemma") of a word.
- **Why:** It can be more accurate than stemming because it returns real words and can handle irregular forms.
- **Notes:** Uses a dictionary-based approach (slower than stemming).

In [None]:
# Lemmatize, assuming every word is a verb.


Some examples you can see are "was" lemmatized to "be" and "arrived" lemmatized to "arrive".

Without `pos='v'`, the `lemmatize` method used here assumes that each word is a noun, so it leaves most verbs unchanged.

**More Lemmatization and Stemming Examples**

|Word|Lemmatization|Stemming|
|-----|-------------|---------|
|shouted|shout|shout|
|best | good|best|
|better | good|better|
|good | good|good|
|hidden | hide|hidden|
|computing |compute| comput|
|computed |compute| comput|
|wipes |wipe| wip|
|wiped |wipe| wip|
|wiping |wipe| wip|

In [None]:
# Define a function that accepts text and returns a list of lemmas.


In [None]:
# Use split_into_lemmas as the feature extraction function (Warning: SLOW!).


In [None]:
# Last 50 features


Keep in mind that you should constantly be thinking about the result of each preprocessing step instead of blindly trying them. Does each type of preprocessing "make sense" with the input data you are using? Is it likely to keep the signal intact and remove noise?

<a id='tfidf'></a>
## Term Frequency–Inverse Document Frequency (TF–IDF)
     
- **What:** Term frequency–inverse document frequency (TF–IDF) computes the "relative frequency" with which a word appears in a document, compared to its frequency across all documents.
- **Why:** It's more useful than "term frequency" for identifying "important" words in each document (high frequency in that document, low frequency in other documents).
- **Notes:** It's used for search-engine scoring, text summarization, and document clustering.

In [None]:
# Example documents
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']

In [None]:
# Term frequency
vect = CountVectorizer()
vect.fit(simple_train)
tf = pd.DataFrame(vect.transform(simple_train).toarray(), columns=vect.get_feature_names_out())
tf

In [None]:
# Document frequency
vect = CountVectorizer(binary=True)
vect.fit(simple_train)
df = vect.transform(simple_train).toarray().sum(axis=0) # can't use axis='index' in NumPy
pd.DataFrame(df.reshape(1, 6), columns=vect.get_feature_names_out())

In [None]:
# Term frequency–inverse document frequency (simple version)


The higher the TF–IDF value, the more "important" the word is to that specific document. Here, "cab" is the most important and unique word in document 1, while "please" is the most important and unique word in document 2. TF–IDF is often used for training as a replacement for word count.
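
One way to reproduce this ranking by hand is sketched below. It assumes the "simple version" means term frequency divided by document frequency, and it imitates `CountVectorizer`'s default tokenization (lowercase, tokens of two or more word characters) with a regular expression; both are assumptions made for this illustration.

```python
import re
from collections import Counter

simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']


def tokenize(text):
    # Lowercase and keep tokens of at least two word characters, so "a" is dropped.
    return re.findall(r"\b\w\w+\b", text.lower())


# Term frequency: how often each word occurs in each document.
term_counts = [Counter(tokenize(doc)) for doc in simple_train]

# Document frequency: in how many documents each word occurs.
doc_freq = Counter(word for counts in term_counts for word in counts)

# Simple TF-IDF: term frequency divided by document frequency.
scores = [
    {word: tf / doc_freq[word] for word, tf in counts.items()}
    for counts in term_counts
]

for doc_id, doc_scores in enumerate(scores):
    print(doc_id, doc_scores)
```

"call" appears in every document, so its score is low everywhere, while "cab" and "please" each appear in only one document and score highest there.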

In [None]:
# TfidfVectorizer -- uses a slightly different implementation of the same basic idea


[Details of the sklearn implementation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) -- the idf term has a log and some smoothing, and each row is normalized to have unit length.

In [None]:
# Try it on our data


<a id='yelp_tfidf'></a>
## Using TF–IDF to Summarize a Yelp Review

TF-IDF tries to pick out the most distinctive words in a given document relative to the overall corpus. Thus, we would expect that using the words with the highest TF-IDF scores for a given document would give us a better sense of what that document is about than using an equal number of random words from that document. Let's give this idea a try.

In [None]:
# Create a document-term matrix using TF–IDF.


In [None]:
# Fit transform Yelp data.


In [None]:
def summarize():
    # Assumes `yelp`, `dtm` (the TF–IDF document-term matrix for all of `yelp.text`),
    # and `features` (a list of the vectorizer's feature names) are defined in earlier cells.

    # Choose a random review that is at least 300 characters.
    review_length = 0
    while review_length < 300:
        review_id = np.random.randint(0, len(yelp))
        review_text = yelp.text[review_id]
        review_length = len(review_text)

    # Create a dictionary of words and their TF–IDF scores.
    word_scores = {}
    for word in TextBlob(review_text).words:
        word = word.lower()
        if word in features:
            word_scores[word] = dtm[review_id, features.index(word)]

    # Print words with the top five TF–IDF scores.
    print('TOP SCORING WORDS:')
    top_scores = sorted(word_scores.items(), key=lambda x: x[1], reverse=True)[:5]
    for word, score in top_scores:
        print(word)

    # Print five random words.
    print('\nRANDOM WORDS:')
    random_words = np.random.choice(list(word_scores), size=5, replace=False)
    for word in random_words:
        print(word)

    # Print the review.
    print('\n' + review_text)

In [None]:
summarize()

<a id='sentiment'></a>
## Sentiment Analysis

Sentiment analysis estimates how positive or negative a review is. There are many ways in practice to compute a sentiment value. For example:

- Have a list of "positive" words and a list of "negative" words and count how many occur in a document. 
- Train a classifier given many examples of "positive" documents and "negative" documents. 
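
The first approach can be sketched in a few lines. The word lists below are tiny and made up for illustration, and the function name `word_list_sentiment` is hypothetical; a usable lexicon would be far larger.

```python
import re

# Hypothetical, deliberately tiny word lists.
POSITIVE_WORDS = {"great", "good", "love", "excellent", "amazing"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "rude", "slow"}


def word_list_sentiment(text):
    """Return (positive word count) minus (negative word count)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives - negatives


print(word_list_sentiment("Great food and amazing service"))
print(word_list_sentiment("Slow, rude, and terrible"))
```

A simple count like this ignores negation ("not good") and context, which is one reason trained classifiers are often preferred.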

In [None]:
print(review)

In [None]:
# Polarity ranges from -1 (most negative) to 1 (most positive).


In [None]:
# Define a function that accepts text and returns the polarity.


In [None]:
# Create a new DataFrame column for sentiment (Warning: SLOW!).


In [None]:
# Box plot of sentiment grouped by stars


In [None]:
# Reviews with most positive sentiment


In [None]:
# Reviews with most negative sentiment


In [None]:
# Negative sentiment in a 5-star review


In [None]:
# Positive sentiment in a 1-star review


<a id='add_feat'></a>
## Bonus: Adding Features to a Document-Term Matrix

Here, we will add additional features to our `CountVectorizer()`-generated feature set to hopefully improve our model.

To make the best models, you will want to supplement the auto-generated features with new features you think might be important. After all, `CountVectorizer()` lowercases text by default and discards word order. Or, you may have metadata to add in addition to just the text.

> Remember: Although you may have hundreds of thousands of features, each data point is extremely sparse. So, if you add in a new feature, e.g., one that detects if the text is all capital letters, this new feature can still have a huge effect on the model outcome!

In [None]:
# define X and y
feature_cols = ['text', 'sentiment', 'cool', 'useful', 'funny']
X = yelp.loc[:, feature_cols]
y = yelp.loc[:, 'positive_rating']

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# Use CountVectorizer with text column only.
vect = CountVectorizer()
vect.fit(X_train.text)
X_train_dtm = vect.transform(X_train.text)
X_test_dtm = vect.transform(X_test.text)
print(X_train_dtm.shape)
print(X_test_dtm.shape)

In [None]:
# Shape of other four feature columns
X_train.drop('text', axis='columns').shape

In [None]:
# Cast other feature columns to float and convert to a sparse matrix.
extra = sp.sparse.csr_matrix(X_train.drop('text', axis='columns').astype(float))
extra.shape

In [None]:
# Combine sparse matrices.
X_train_dtm_extra = sp.sparse.hstack((X_train_dtm, extra))
X_train_dtm_extra.shape

In [None]:
# Repeat for testing set.
extra = sp.sparse.csr_matrix(X_test.drop('text', axis='columns').astype(float))
X_test_dtm_extra = sp.sparse.hstack((X_test_dtm, extra))
X_test_dtm_extra.shape

In [None]:
# Use logistic regression with text column only.
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm, y_train)
y_pred_class = logreg.predict(X_test_dtm)
print(metrics.accuracy_score(y_test, y_pred_class))

In [None]:
# Use logistic regression with all features.
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train_dtm_extra, y_train)
y_pred_class = logreg.predict(X_test_dtm_extra)
print(metrics.accuracy_score(y_test, y_pred_class))

<a id='more_textblob'></a>
## Bonus: Fun TextBlob Features

In [None]:
# For some reason this code does not work the first time I try to run it.
# Spelling correction
TextBlob('15 minuets late').correct()

In [None]:
# Spellcheck
Word('parot').spellcheck()

In [None]:
# Definitions
Word('bank').define('v')

In [None]:
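# Language identification
# Note: `detect_language()` has been removed from recent TextBlob releases, so this cell may only run on older versions.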
TextBlob('Hola amigos').detect_language()

<a id="bayes"></a>

## Appendix: Intro to Naive Bayes and Text Classification

Naive Bayes is a very popular classifier because it has minimal storage requirements, is fast, can be tuned easily with more data, and has found very useful applications in text classification. Paul Graham popularized using Naive Bayes to detect spam in his essay [A Plan for Spam](http://www.paulgraham.com/spam.html).

Earlier we experimented with text classification using a Naive Bayes model. What exactly are Naive Bayes classifiers? 

**What is Bayes?**  
Bayes, or Bayes' Theorem, is a way to update a probability distribution given some new data.

Below is the equation for Bayes.  

$$P(A \ | \ B) = \frac {P(B \ | \ A) \times P(A)} {P(B)}$$

- **$P(A \ | \ B)$** : Probability of `Event A` occurring given `Event B` has occurred.
- **$P(B \ | \ A)$** : Probability of `Event B` occurring given `Event A` has occurred.
- **$P(A)$** : Probability of `Event A` occurring.
- **$P(B)$** : Probability of `Event B` occurring.



## Applying Naive Bayes Classification to Spam Filtering

Let's pretend we have an email with three words: "Send money now." We'll use Naive Bayes to classify it as **ham or spam.** ("Ham" just means not spam. It can include emails that look like spam but that you opt into!)

$$P(spam \ | \ \text{send money now}) = \frac {P(\text{send money now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

By assuming that the features (the words) are conditionally independent given the class, we can simplify the likelihood function:

$$P(spam \ | \ \text{send money now}) \approx \frac {P(\text{send} \ | \ spam) \times P(\text{money} \ | \ spam) \times P(\text{now} \ | \ spam) \times P(spam)} {P(\text{send money now})}$$

Note that each conditional probability in the numerator is easily calculated directly from the training data!

So, we can calculate all of the values in the numerator by examining a corpus of spam email:

$$P(spam \ | \ \text{send money now}) \approx \frac {0.2 \times 0.1 \times 0.1 \times 0.9} {P(\text{send money now})} = \frac {0.0018} {P(\text{send money now})}$$

We would repeat this process with a corpus of ham email:

$$P(ham \ | \ \text{send money now}) \approx \frac {0.05 \times 0.01 \times 0.1 \times 0.1} {P(\text{send money now})} = \frac {0.000005} {P(\text{send money now})}$$

All we care about is whether spam or ham has the higher probability. Both expressions share the same denominator, and 0.0018 is greater than 0.000005, so we predict that the email is spam.
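
The arithmetic above can be checked with a short sketch that uses the same illustrative probabilities. The dictionary layout and the function name `unnormalized_posterior` are assumptions made for this example. The final normalization step assumes spam and ham are the only two classes.

```python
# Illustrative probabilities taken from the worked example above.
likelihoods = {
    "spam": {"send": 0.2, "money": 0.1, "now": 0.1},
    "ham": {"send": 0.05, "money": 0.01, "now": 0.1},
}
priors = {"spam": 0.9, "ham": 0.1}


def unnormalized_posterior(label, words):
    """Multiply the prior by each word's conditional probability (the numerator only)."""
    score = priors[label]
    for word in words:
        score *= likelihoods[label][word]
    return score


words = ["send", "money", "now"]
scores = {label: unnormalized_posterior(label, words) for label in priors}

# The shared denominator cancels, so comparing numerators is enough to classify.
prediction = max(scores, key=scores.get)

# Dividing by the sum of the numerators recovers probabilities that add up to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(scores)
print(prediction)
print(posteriors)
```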


### Key Takeaways

- The "naive" assumption of Naive Bayes (that the features are conditionally independent) is critical to making these calculations simple.
- The normalization constant (the denominator) can be ignored since it's the same for all classes.
- The prior probability is much less relevant once you have a lot of features.

### Comparing Naive Bayes With Other Models

Advantages of Naive Bayes:

- Model training and prediction are very fast.
- It's somewhat interpretable.
- Little or no tuning is required.
- Features don't need scaling.
- It's insensitive to irrelevant features (with enough observations).
- It performs better than logistic regression when the training set is very small.

Disadvantages of Naive Bayes:

- If the class depends on combinations of words rather than on individual words, it may not work well.
- Predicted probabilities are not well calibrated.
- Correlated features can be problematic (due to the independence assumption).
- It can't handle negative features (with Multinomial Naive Bayes).
- It has a higher "asymptotic error" than logistic regression.

-----

<a id='conclusion'></a>
## Summary

- NLP techniques allow us to do machine learning with text.
- NLP tasks include part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
- Common steps for preprocessing text include splitting into words ("tokenizing"), discarding punctuation, converting to lowercase, and stemming/lemmatizing.
- To apply machine learning to text, we need to convert documents into numeric vectors.
- Bag-of-words representations ignore word order, while n-gram representations preserve it to some extent.
- TF-IDF is a powerful technique for turning the words in a document into a useful feature vector.