# N-Gram models
  
Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

**Resources**
  
[spaCy Documentation](https://spacy.io)  
[Scikit-learn Documentation](https://scikit-learn.org/stable/user_guide.html)  
[Scikit-learn CountVectorizer Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer)

In [2]:
import numpy as np                  # Numerical Python:         Arrays and linear algebra
import pandas as pd                 # Panel Datasets:           Dataset manipulation
import matplotlib.pyplot as plt     # MATLAB Plotting Library:  Visualizations
import seaborn as sns               # Seaborn:                  Visualizations
import re                           # Regular Expressions:      Text manipulation
import spacy                        # spaCy:                    Natural Language Processing
from pprint import pprint           # Pretty Print:             Advanced printing operations

## Building a bag of words model
  
In this chapter, we will cover vectorization, which is, as you may recall, the process of converting text into vectors.
  
**Recap of data format for ML algorithms**
  
Recall that for any ML algorithm to run properly, data fed into it must be in tabular form and all the training features must be numerical. This is clearly not the case for textual data. In this lesson, we will learn a technique called bag of words that converts text documents into vectors.
  
**Bag of words model**
  
The bag of words model is a procedure of extracting word tokens from a text document (henceforth, we will refer to this as just document), computing the frequency of these word tokens and constructing a word vector based on these frequencies and the vocabulary of the entire corpus of documents. This is best explained with the help of an example.
  
**Bag of words model**  
  
- Extract word tokens  
- Compute frequency of word tokens  
- Construct a word vector out of these frequencies and vocabulary of corpus  
  
**Bag of words model example**
  
Consider a corpus of three documents:  
- "The lion is the king of the jungle."  
- "Lions have an average lifespan of 15 years."  
- "The lion is an endangered species."  
  
**Bag of words model example**
  
We now extract the unique word tokens that occur in this corpus of documents. This will be the vocabulary of our model. In this example, 15 word tokens constitute our vocabulary. Since there are 15 words in our vocabulary, our word vectors will have 15 dimensions, and each dimension's value will be the frequency of the word token assigned to that dimension. For instance, the second dimension will hold the number of times the second word in the vocabulary, "an", occurs in the document.
  
Let's now convert our documents into word vectors using this bag of words model. "The lion is the king of the jungle" is converted to the vector shown below. Similarly, the other two sentences have the word vector representations shown below.
  
<img src='../_images/building-a-bag-of-words-model.png' alt='img' width='740'>
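
The three steps can also be traced by hand. The following is a minimal sketch, not the method used to produce the figure: it assumes a simple tokenizer that lowercases the text and keeps runs of letters and digits, and it sorts the vocabulary alphabetically. The names `tokenize`, `vocabulary` and `vectors` are illustrative.

```python
import re
from collections import Counter

corpus = [
    "The lion is the king of the jungle.",
    "Lions have an average lifespan of 15 years.",
    "The lion is an endangered species.",
]


def tokenize(text):
    # Assumption: lowercase the text, then keep runs of letters and digits
    return re.findall(r"[a-z0-9]+", text.lower())


# Step 1: extract word tokens from every document
tokenized = [tokenize(doc) for doc in corpus]

# The vocabulary is the sorted set of unique tokens in the whole corpus
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Steps 2 and 3: count tokens per document and lay the counts out
# in vocabulary order, giving one vector per document
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Under these assumptions the vocabulary has 15 entries. A case-sensitive tokenizer would instead keep "The" and "the" as separate entries.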
  
**Text preprocessing**
  
As we were constructing this model, you may have noticed how text preprocessing would have been extremely useful in creating arguably better models. We would usually want "Lions" and "lion" to mean the same thing and therefore be counted as the same thing. The same applies to 'the' with different cases. We would also want to remove punctuation and stopwords, as they are extremely common and don't really contribute much to the character of the document. Performing text preprocessing usually leads to smaller vocabularies, which is a good thing. When working with vectorization, it is routine to form word vectors running into thousands of dimensions, and keeping the number of dimensions to a minimum helps improve performance.
  
**Text preprocessing**  
  
- Example: Lions, lion -> lion  
- Example: The, the -> the  
- No punctuation  
- No stopwords  
- Leads to smaller vocabularies  
- Reducing number of dimensions helps to improve model performance  
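
To make the effect on vocabulary size concrete, here is a small sketch that compares a raw whitespace tokenization with a preprocessed one on the lion corpus. The tiny `stopwords` set is a hypothetical stand-in for a real stopword list, and no lemmatization is performed, so "lions" and "lion" remain distinct here.

```python
import re

corpus = [
    "The lion is the king of the jungle.",
    "Lions have an average lifespan of 15 years.",
    "The lion is an endangered species.",
]

# Hypothetical, deliberately tiny stopword list for illustration only
stopwords = {"the", "is", "of", "an", "have"}

# Raw vocabulary: split on whitespace, keep case and punctuation
raw_vocabulary = {token for doc in corpus for token in doc.split()}

# Preprocessed vocabulary: lowercase, drop punctuation, drop stopwords
clean_vocabulary = {
    token
    for doc in corpus
    for token in re.findall(r"[a-z0-9]+", doc.lower())
    if token not in stopwords
}

print(len(raw_vocabulary), sorted(raw_vocabulary))
print(len(clean_vocabulary), sorted(clean_vocabulary))
```

With these choices the vocabulary shrinks from 16 entries to 10.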
  
**Bag of words model using sklearn**
  
To construct the bag of words model in Python, we will use the `scikit-learn` library. We will use the corpus from before, consisting of the three sentences on lions. Let's ignore text preprocessing for now.
  
<img src='../_images/building-a-bag-of-words-model1.png' alt='img' width='740'>
  
**Bag of words model using sklearn**
  
We import the `CountVectorizer` class from `sklearn.feature_extraction.text`. This is the class that will help us build our bag of words model. Next, we instantiate a `CountVectorizer` object `vectorizer`. We finally create our matrix of word vectors by passing `corpus` to the `.fit_transform()` method of `vectorizer`. This is stored in `bow_matrix`. 
  
This `bow_matrix` is a sparse matrix and we can print out its 2D array form using `bow_matrix.toarray()`. This gives us the following output. Notice how this is different from the word vectors we generated. This is because `CountVectorizer` automatically lowercases words and ignores single character tokens such as 'a'. Also, the column order is not something we chose: `CountVectorizer` assigns the indices itself (in practice in sorted token order, although its `vocabulary_` dictionary is not printed in that order). We will learn how to map the vocabulary to the indices in the exercises. We can now use this `bow_matrix` as our training features in ML models.
  
<img src='../_images/building-a-bag-of-words-model2.png' alt='img' width='740'>
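
For readers who cannot see the screenshots, the steps described above can be sketched as follows. This is a reconstruction from the prose rather than the exact code in the images, and it uses the three lion sentences listed earlier in this lesson.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The lion is the king of the jungle.",
    "Lions have an average lifespan of 15 years.",
    "The lion is an endangered species.",
]

# Create the vectorizer with default settings (lowercasing enabled,
# tokens of two or more alphanumeric characters)
vectorizer = CountVectorizer()

# Learn the vocabulary and build the sparse document-term matrix
bow_matrix = vectorizer.fit_transform(corpus)

# Dense 2D view of the sparse matrix, one row per document
print(bow_matrix.toarray())

# Mapping from token to column index
print(vectorizer.vocabulary_)
```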
  
**Let's practice!**
  
We've covered a lot of theory in this lesson. Let us practice this in the exercises.

### Word vectors with a given vocabulary
  
You have been given a corpus of documents and you have computed the vocabulary of the corpus to be the following:  
  
**Vocabulary**: a, an, and, but, can, come, evening, forever, go, i, men, may, on, the, women
  
Which of the following corresponds to the bag of words vector for the document "men may come and men may go but i go on forever"?
  
Possible Answers
  
- [x] (0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 2, 2, 1, 0, 0)  
- [ ] (0, 1, 0, 1, 1, 1, 2, 0, 2, 1, 0, 0, 0, 2, 0)  
- [ ] (2, 1, 0, 0, 2, 1, 0, 0, 0, 1)  
- [ ] (0, 0, 1, 2, 1, 2, 1, 1, 1, 0, 0, 1, 1, 1, 1)  
  
The first option is the correct answer. Each value in the vector corresponds to the frequency of the corresponding word in the vocabulary.
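
The answer can be checked mechanically. This short sketch counts the tokens of the document and reads the counts off in the order of the given vocabulary; splitting on whitespace is enough here because the document contains no punctuation.

```python
from collections import Counter

vocabulary = ["a", "an", "and", "but", "can", "come", "evening", "forever",
              "go", "i", "men", "may", "on", "the", "women"]
document = "men may come and men may go but i go on forever"

# Count every token, then look each vocabulary word up in order
counts = Counter(document.split())
vector = tuple(counts[word] for word in vocabulary)

print(vector)
```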

### BoW model for movie taglines
  
In this exercise, you have been provided with a `corpus` of more than 7000 movie tag lines. Your job is to generate the bag of words representation `bow_matrix` for these taglines. For this exercise, we will ignore the text preprocessing step and generate `bow_matrix` directly.
  
We will also investigate the shape of the resultant `bow_matrix`. The first five taglines in `corpus` have been printed to the console for you to examine.
  
```python
1            Roll the dice and unleash the excitement!
2    Still Yelling. Still Fighting. Still Ready for...
3    Friends are the people who let you be yourself...
4    Just When His World Is Back To Normal... He's ...
5                             A Los Angeles Crime Saga
Name: tagline, dtype: object
```
  
1. Import the `CountVectorizer` class from `sklearn`.
2. Instantiate a `CountVectorizer` object. Name it `vectorizer`.
3. Using `.fit_transform()`, generate `bow_matrix` for `corpus`.

In [3]:
movies = pd.read_csv('../_datasets/movie_overviews.csv').dropna()   # Load df
movies['tagline'] = movies['tagline'].str.lower()                   # Standardization
print(movies.shape)                                                 # Dataframe shape
movies.head()                                                       # Display


(7033, 4)


|   | id | title | overview | tagline |
|---|---|---|---|---|
| 1 | 8844 | Jumanji | When siblings Judy and Peter discover an encha... | roll the dice and unleash the excitement! |
| 2 | 15602 | Grumpier Old Men | A family wedding reignites the ancient feud be... | still yelling. still fighting. still ready for... |
| 3 | 31357 | Waiting to Exhale | Cheated on, mistreated and stepped on, the wom... | friends are the people who let you be yourself... |
| 4 | 11862 | Father of the Bride Part II | Just when George Banks has recovered from his ... | just when his world is back to normal... he's ... |
| 5 | 949 | Heat | Obsessive master thief, Neil McCauley leads a ... | a los angeles crime saga |


In [4]:
corpus = movies['tagline']  # Corpus instantiation
print(corpus.shape)         # Shape of the pd.Series
print(corpus)               # Display


(7033,)
1               roll the dice and unleash the excitement!
2       still yelling. still fighting. still ready for...
3       friends are the people who let you be yourself...
4       just when his world is back to normal... he's ...
5                                a los angeles crime saga
                              ...                        
9091                        kingsglaive: final fantasy xv
9093    what happens in vegas, stays in vegas. unless ...
9095    decorated officer. devoted family man. defendi...
9097                      a god incarnate. a city doomed.
9098              the band you know. the story you don't.
Name: tagline, Length: 7033, dtype: object


In [5]:
from sklearn.feature_extraction.text import CountVectorizer


# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix, datatype is csr_matrix
print(bow_matrix.shape)

(7033, 6614)


You now know how to generate a Bag-of-Words (BoW) representation for a given corpus of documents. Notice that the word vectors created have more than 6600 dimensions. However, most of these dimensions have a value of zero, since most words do not occur in a particular tagline.
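
The claim that most entries are zero can be quantified through the density of the sparse matrix. The sketch below uses five complete taglines from the printed corpus as a stand-in for the full dataset; `density` is an illustrative name, and the value for the full `bow_matrix` will differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Small stand-in corpus (five taglines shown in the output above)
sample_corpus = [
    "roll the dice and unleash the excitement!",
    "a los angeles crime saga",
    "kingsglaive: final fantasy xv",
    "a god incarnate. a city doomed.",
    "the band you know. the story you don't.",
]

bow_sample = CountVectorizer().fit_transform(sample_corpus)

# nnz is the number of stored non-zero entries in the sparse matrix
n_rows, n_cols = bow_sample.shape
density = bow_sample.nnz / (n_rows * n_cols)

print(f"shape={bow_sample.shape}, non-zero share={density:.1%}")
```

The same two lines applied to the full `bow_matrix` would show how small the non-zero share becomes when the vocabulary runs into thousands of words.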

### Analyzing dimensionality and preprocessing
  
In this exercise, you have been provided with a `lem_corpus` which contains the pre-processed versions of the movie taglines from the previous exercise. In other words, the taglines have been lowercased and lemmatized, and stopwords have been removed.
  
Your job is to generate the bag of words representation `bow_lem_matrix` for these lemmatized taglines and compare its shape with that of `bow_matrix` obtained in the previous exercise. The first five lemmatized taglines in `lem_corpus` have been printed to the console for you to examine.
  
1. Import the `CountVectorizer` class from `sklearn`.
2. Instantiate a `CountVectorizer` object. Name it `vectorizer`.
3. Using `.fit_transform()`, generate `bow_lem_matrix` for `lem_corpus`.

NOTE: Breakdown of the anonymous function in the next cell
  
1. `corpus.apply()` calls the lambda function on each tagline (row) of the corpus.
2. `nlp(row)` processes the tagline with spaCy and yields its tokens; for each token, the lambda checks whether the token's lemma (base form) is absent from the set of stopwords.
3. If the lemma is not a stopword and consists entirely of alphabetic characters, the list comprehension keeps it.
4. Finally, the kept lemmas are joined into a single string using a space as the separator.
  
```python
# Creating the lemmatized corpus
lem_corpus = corpus.apply(lambda row: ' '.join(
    [t.lemma_ for t in nlp(row) if t.lemma_ not in stopwords and t.lemma_.isalpha()])
    )
```
Datatype: pd.Series  
Shape: (7033,)  

In [6]:
# Loading the pre-trained English pipeline
nlp = spacy.load('en_core_web_sm')

# Creating the stopwords, datatype is a set
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS

# Creating the lemmatized corpus
lem_corpus = corpus.apply(lambda row: ' '.join(
    [t.lemma_ for t in nlp(row) if t.lemma_ not in stopwords and t.lemma_.isalpha()])
    )

lem_corpus

1                            roll dice unleash excitement
2                                   yell fight ready love
3                            friend people let let forget
4                              world normal surprise life
5                                  los angeles crime saga
                              ...                        
9091                         kingsglaive final fantasy xv
9093                       happen vegas stay vegas happen
9095    decorate officer devote family man defend hono...
9097                              god incarnate city doom
9098                                      band know story
Name: tagline, Length: 7033, dtype: object

In [7]:
# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix, datatype is a csr_matrix
print(bow_lem_matrix.shape)

(7033, 4941)


Notice how the number of features has dropped from 6614 to 4941 (a reduction of about 25.3%) for the pre-processed movie taglines. The smaller number of dimensions that results from text preprocessing usually leads to better performance in machine learning, so preprocessing is worth considering. However, as mentioned in a previous lesson, the final decision always depends on the nature of the application.
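
The size of the reduction follows directly from the two shapes printed above. A short check, with illustrative variable names:

```python
features_raw = 6614         # columns of bow_matrix
features_lemmatized = 4941  # columns of bow_lem_matrix

# Relative reduction in the number of features
reduction = (features_raw - features_lemmatized) / features_raw

print(f"{features_raw - features_lemmatized} fewer features ({reduction:.1%})")
```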

### Mapping feature indices with feature names
  
In the lesson, we saw that the column order of the matrix produced by `CountVectorizer` is not something we specify ourselves. In this exercise, we will learn to map each feature index to its corresponding feature name from the vocabulary.
  
We will use three sentences on lions, close to those from the lesson. In the original exercise the sentences are available in a list named `corpus`; the cell below names the list `sentences` instead, which leaves the tagline `corpus` available for later cells.
  
`['The lion is the king of the jungle', 'Lions have lifespans of a decade', 'The lion is an endangered species']`
  
1. Instantiate a `CountVectorizer` object. Name it `vectorizer`.
2. Using `.fit_transform()`, generate `bow_matrix` for `corpus`.
3. Using the `.get_feature_names_out()` method, map the column names to the corresponding word in the vocabulary.

In [8]:
# Creating the list of Lion sentences
sentences = ['The lion is the king of the jungle',
             'Lions have lifespans of a decade', 
             'The lion is an endangered species']

# Create CountVectorizer object, Scikit-learn OOP object
vectorizer = CountVectorizer()

# Generate matrix of word vectors, datatype is a csr_matrix
bow_matrix = vectorizer.fit_transform(sentences)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
print(bow_df.shape)
bow_df

(3, 13)


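|   | an | decade | endangered | have | is | jungle | king | lifespans | lion | lions | of | species | the |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |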


Observe that the column names refer to the token whose frequency is being recorded. Therefore, since the first column name is `an`, the first feature represents the number of times the word 'an' occurs in a particular sentence. `get_feature_names_out()` essentially gives us an array which represents the mapping of the feature indices to the feature names in the vocabulary: the name at position `i` belongs to column `i`.

## Building a BoW Naive Bayes classifier
  
In this lesson, we will walk through a machine learning problem that utilizes feature engineering techniques we've learned, to arrive at a desired result.
  
**Spam filtering**
  
Let's take a look at the spam filtering problem. We're given a dataset of messages that have been labelled as spam or ham. Here, you can see a typical spam and ham message. Our task is to train an ML model that can predict the label given a particular text.
  
  <table>
    <tr>
      <th>Message</th>
      <th>Label</th>
    </tr>
    <tr>
      <td>WINNER!!! As a valued Wa1mart customer you have been selected to WIN $900 dollars and a FREE trip to the BAHAMAS! Enter your social security number to claim your prize!</td>
      <td>Spam</td>
    </tr>
    <tr>
      <td>Hey Alexander, I wanted to reach out to see if you have finished your notebook on Natural Language Processing?</td>
      <td>Ham</td>
    </tr>
  </table>

**Steps**
  
There are 3 steps involved. The first is to preprocess the text. Next, we proceed to build the bag-of-words model. Finally, we conduct predictive modeling using the generated BoW vectors. Note that although we use the term 'modeling' in the context of both BoW and machine learning, they mean two different things.
  
1. Text preprocessing
2. Building a Bag-of-Words representation/model
3. Machine Learning
  
**Text preprocessing using `CountVectorizer`**
  
We've already learned how to conduct text preprocessing using `spaCy`. However, it is also possible to do this using `CountVectorizer`. `CountVectorizer` takes in a number of arguments to perform preprocessing. 
  
The `lowercase=` argument, when set to `True`, converts words to lowercase. The `strip_accents=` argument can convert accented characters according to unicode or ASCII mapping. Passing in a `stop_words=` argument will lead to `CountVectorizer` ignoring stopwords. You can pass in a custom list or the string 'english' to use scikit-learn's list of English stopwords. You can specify tokenization using a regular expression as the value of the `token_pattern=` argument. Tokenization can also be specified using a `tokenizer=` argument. 
  
Here, you can pass a function that takes a string as an argument and returns a list of tokens. This way, `CountVectorizer` allows usage of spaCy's tokenization techniques. `CountVectorizer` cannot perform certain steps such as lemmatization automatically. This is where `spaCy` is useful. Although it performs tokenization and preprocessing, CountVectorizer's main job is to convert a corpus into a matrix of numerical vectors.
  
**Building the BoW model**
  
As usual, we import `CountVectorizer` from `scikit-learn`. We then instantiate a `CountVectorizer` object called `vectorizer`. We perform accent stripping using ASCII mapping and remove English stopwords. We also set the `lowercase=` argument to `False`. This is because spam messages usually tend to abuse all-capital words and we might want to preserve this information for the ML step. The dataset has already been loaded into the dataframe `df`. We split this dataset into training and test sets using scikit-learn's `train_test_split()` function.
  
<img src='../_images/building-the-bow-model.png' alt='img' width='740'>
  
**Building the BoW model**
  
We now fit the vectorizer on the training set and transform it into its bag-of-words representation. We can perform both these steps together using the `.fit_transform()` method. Next, we transform the test set into its BoW representation. Note that we do not fit the vectorizer with the test data. It is possible that there are some words in the test data that are not in the vocabulary of the vectorizer. In such cases, `CountVectorizer` simply ignores these words.
  
<img src='../_images/building-the-bow-model1.png' alt='img' width='740'>
  
**Training the Naive Bayes classifier**
  
We're now in a good position to train an ML model. We will use the Multinomial Naive Bayes classifier for this task. We import the `MultinomialNB` class from `scikit-learn` and create an object named `clf`. We then fit the training BoW vectors and their corresponding labels to `clf`. We can now test the performance of our model. We compute the accuracy of the model on the test set using `clf.score`. In this case, our model registered an accuracy of 76% on the test set.
  
<img src='../_images/building-the-bow-model2.png' alt='img' width='740'>
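
Putting the three steps together, the workflow described in this lesson might look like the sketch below. It is a reconstruction from the prose, not the code in the screenshots: the eight messages are hypothetical stand-ins for the dataset in `df`, and with so little data the printed accuracy says nothing about real performance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the labelled messages
messages = [
    "WIN a FREE prize NOW",
    "FREE entry, claim your CASH reward",
    "URGENT: your account has WON a bonus",
    "Claim your FREE holiday TODAY",
    "Are we still meeting for lunch tomorrow?",
    "Please send me the report when you can",
    "Thanks for the notes from the lecture",
    "Can you call me after the meeting?",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Split before vectorizing so the vocabulary comes from training data only
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=0, stratify=labels)

# Preprocess and vectorize: keep case, strip accents, drop English stopwords
vectorizer = CountVectorizer(strip_accents="ascii",
                             stop_words="english",
                             lowercase=False)
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Train the classifier and measure accuracy on the held-out messages
clf = MultinomialNB()
clf.fit(X_train_bow, y_train)
accuracy = clf.score(X_test_bow, y_test)
print("Accuracy on the toy test set: {:.2f}".format(accuracy))
```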
  
**Let's practice!**
  
We've covered a lot of ground in building a spam filter in this lesson. In the exercises, we will perform similar steps to perform sentiment analysis on movie reviews. Let's practice!

### BoW vectors for movie reviews
  
In this exercise, you have been given two `pandas` Series, `X_train` and `X_test`, which consist of movie reviews. They represent the training and the test review data respectively. Your task is to preprocess the reviews and generate BoW vectors for these two sets using `CountVectorizer`.
  
Once we have generated the BoW vector matrices `X_train_bow` and `X_test_bow`, we will be in a very good position to apply a machine learning model to them and conduct sentiment analysis.
  
1. Import `CountVectorizer` from the `sklearn` library.
2. Instantiate a `CountVectorizer` object named `vectorizer`. Ensure that all words are converted to lowercase and English stopwords are removed.
3. Using `X_train`, fit vectorizer and then use it to transform `X_train` to generate the set of BoW vectors `X_train_bow`.
4. Transform `X_test` using `vectorizer` to generate the set of BoW vectors `X_test_bow`.

In [11]:
movie_reviews = pd.read_csv('../_datasets/movie_reviews_clean.csv')
print(movie_reviews.shape)
movie_reviews.head()

(1000, 2)


|   | review | sentiment |
|---|---|---|
| 0 | this anime series starts out great interesting... | 0 |
| 1 | some may go for a film like this but i most as... | 0 |
| 2 | i ve seen this piece of perfection during the ... | 1 |
| 3 | this movie is likely the worst movie i ve ever... | 0 |
| 4 | it ll soon be 10 yrs since this movie was rele... | 1 |


In [12]:
# X/y split
X = movie_reviews['review']
y = movie_reviews['sentiment']

In [13]:
from sklearn.model_selection import train_test_split


# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [14]:
from sklearn.feature_extraction.text import CountVectorizer


# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)

(750, 14934)
(250, 14934)


You now have a good idea of how to preprocess text and transform it into its bag-of-words representation using `CountVectorizer`. In this exercise, you have set `lowercase=True`. However, note that this is the default value of `lowercase=` and passing it explicitly is not necessary. Also, note that both `X_train_bow` and `X_test_bow` have 14,934 features. There were words present in `X_test` that were not in `X_train`. `CountVectorizer` ignored them, which ensures that the dimensions of both sets remain the same.

### Predicting the sentiment of a movie review
  
In the previous exercise, you generated the bag-of-words representations for the training and test movie review data. In this exercise, we will use this model to train a Naive Bayes classifier that can detect the sentiment of a movie review and compute its accuracy. Note that since this is a binary classification problem, the model is only capable of classifying a review as either positive (1) or negative (0). It is incapable of detecting neutral reviews.
  
In case you don't recall, the training and test BoW vectors are available as `X_train_bow` and `X_test_bow` respectively. The corresponding labels are available as `y_train` and `y_test` respectively. Also, for your reference, the original movie review dataset is available as `movie_reviews` (the dataframe loaded in the previous exercise).
  
1. Instantiate an object of `MultinomialNB`. Name it `clf`.
2. Fit `clf` using `X_train_bow` and `y_train`.
3. Measure the accuracy of `clf` using `X_test_bow` and `y_test`.

In [23]:
from sklearn.naive_bayes import MultinomialNB


# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow, y_test)
print("The accuracy of the classifier on the test set is {:.3f}".format(accuracy))

# Predict the sentiment of a negative review
review = 'The movie was terrible. The music was underwhelming and the acting mediocre.'
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is {}".format(prediction))

The accuracy of the classifier on the test set is 0.792
The sentiment predicted by the classifier is 0


You have successfully performed basic sentiment analysis. Note that the accuracy of the classifier is 79.2%. Considering the fact that it was trained on only 750 reviews, this is reasonably good performance. The classifier also correctly predicts the sentiment of a mini negative review which we passed into it.

## Building n-gram models
  
We already know how to build bag-of-words representations of our documents and use them to conduct various machine learning tasks.
  
**BoW shortcomings**
  
Consider the following mini reviews. One is a positive review which states that the movie was good and not boring. The other is negative, commenting that the movie was not good and boring. If we were to construct BoW vectors for these reviews, we would get identical vectors since both reviews contain exactly the same words. Herein lies the biggest shortcoming of the bag-of-words model: the context of the words is lost. In this example, the position of the word 'not' changes the entire sentiment of the review. Therefore, in this lesson, we will study techniques that will allow us to model this.
  
  <table>
    <tr>
      <th>Review</th>
      <th>Label</th>
    </tr>
    <tr>
      <td>The movie was good and not boring.</td>
      <td>Positive</td>
    </tr>
    <tr>
      <td>The movie was not good and boring.</td>
      <td>Negative</td>
    </tr>
  </table>
  
- Exactly the same BoW representation.  
- Context of the words is lost.  
- Sentiment dependent on the position of "not".  
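
The first point is easy to verify. In this sketch a `Counter` over lowercased word tokens plays the role of the BoW vector; the helper name `bow` and the simple letters-only tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

positive = "The movie was good and not boring."
negative = "The movie was not good and boring."


def bow(text):
    # Word counts, ignoring case, punctuation and word order
    return Counter(re.findall(r"[a-z]+", text.lower()))


print(bow(positive))
print(bow(positive) == bow(negative))
```

Both reviews produce the same counts, so any model that sees only these counts cannot tell them apart.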
  
**n-grams**
  
An n-gram is a contiguous sequence of n elements (or words) in a given document. The bag-of-words model that we've explored so far is nothing but an n-gram model where $n$ is equal to one. Let's now explore n-grams when $n$ is greater than one. Consider the sentence "For you a thousand times over". If we set $n$ to 2, then the n-grams (called bi-grams in this case) would be "for you", "you a", "a thousand", "thousand times" and "times over".
  
<img src='../_images/n-grams-introduction-nlp.png' alt='img' width='740'>
  
**n-grams**
  
Similarly, for $n$ equal to 3, the n-grams (or tri-grams) will be "for you a", "you a thousand", "a thousand times" and "thousand times over". Therefore, we can use these n-grams to capture more context and account for cases like 'not'.
  
<img src='../_images/n-grams-introduction-nlp1.png' alt='img' width='740'>
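
Generating n-grams amounts to sliding a window of size $n$ over the token list. A minimal sketch, assuming whitespace tokenization and lowercasing; the function name `ngrams` is illustrative.

```python
def ngrams(text, n):
    # Slide a window of n tokens over the lowercased, whitespace-split text
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "For you a thousand times over"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)
print(bigrams)
print(trigrams)

# Bi-grams separate the two mini reviews that plain word counts could not
positive_bigrams = ngrams("The movie was good and not boring", 2)
negative_bigrams = ngrams("The movie was not good and boring", 2)
print("not good" in positive_bigrams, "not good" in negative_bigrams)
```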
  
**Applications**
  
Apart from capturing more context, n-grams have a host of other useful applications. They are used in sentence completion, spelling correction and machine translation correction. In all these cases, the model computes the probability of n words occurring contiguously to perform the above processes.
  
- Sentence completion
- Spelling correction
- Machine translation correction
  
**Building n-gram models using scikit-learn**
  
Building these n-gram models using scikit-learn is extremely simple, now that we know how to use `CountVectorizer`. `CountVectorizer` takes in an argument `ngram_range=` which is a tuple containing the lower and upper bound for the range of n-values. For instance, passing `(2, 2)` as the `ngram_range=` will generate only bi-grams. On the other hand, passing in `(1, 3)` will generate n-grams where $n$ is equal to 1, 2 and 3.
  
<img src='../_images/n-grams-introduction-nlp2.png' alt='img' width='740'>
  
**Shortcomings**
  
While on the surface it may seem tempting to generate n-grams of high orders to capture more and more context, this comes with caveats. We've already seen that the BoW vectors run into thousands of dimensions. Adding higher order n-grams increases the number of dimensions even more and, when performing machine learning, leads to a problem known as the curse of dimensionality. Additionally, n-grams for $n$ greater than 3 are rarely found in more than one document, so the corresponding features become effectively useless. For these reasons, it is often a good idea to restrict yourself to n-grams where $n$ is small.
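
The rarity argument can be illustrated by counting, for each $n$, how many distinct n-grams occur in more than one document. The sketch below does this for three short lion sentences; it assumes whitespace tokenization, and the names `ngram_set` and `shared` are illustrative.

```python
from collections import Counter

sentences = ["The lion is the king of the jungle",
             "Lions have lifespans of a decade",
             "The lion is an endangered species"]


def ngram_set(tokens, n):
    # Distinct n-grams of one document
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


shared = {}
for n in range(1, 5):
    # Document frequency: in how many documents does each n-gram appear?
    document_frequency = Counter()
    for sentence in sentences:
        document_frequency.update(ngram_set(sentence.lower().split(), n))
    shared[n] = sum(1 for count in document_frequency.values() if count > 1)

print(shared)
```

On this tiny corpus the number of n-grams shared between documents falls from 4 for $n = 1$ to 0 for $n = 4$.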
  
**Let's practice!**
  
Great! Let's now build these advanced n-gram models and discover more insights in the exercises.

### n-gram models for movie tag lines
  
In this exercise, we have been provided with a `corpus` of more than 7000 movie tag lines. Our job is to generate three n-gram models for this data, with n-grams up to n equal to 1, 2 and 3 respectively, and discover the number of features for each model.
  
We will then compare the number of features generated for each model.
  
1. Generate an n-gram model with n-grams up to n=1. Name it `ng1`
2. Generate an n-gram model with n-grams up to n=2. Name it `ng2`
3. Generate an n-gram model with n-grams up to n=3. Name it `ng3`
4. Print the number of features for each model.

In [26]:
corpus

1               roll the dice and unleash the excitement!
2       still yelling. still fighting. still ready for...
3       friends are the people who let you be yourself...
4       just when his world is back to normal... he's ...
5                                a los angeles crime saga
                              ...                        
9091                        kingsglaive: final fantasy xv
9093    what happens in vegas, stays in vegas. unless ...
9095    decorated officer. devoted family man. defendi...
9097                      a god incarnate. a city doomed.
9098              the band you know. the story you don't.
Name: tagline, Length: 7033, dtype: object

In [25]:
# Generate n-grams up to n=1
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))  # Uni-grams only (bag of words)
ng1 = vectorizer_ng1.fit_transform(corpus)

# Generate n-grams up to n=2
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))  # Uni-grams and bi-grams
ng2 = vectorizer_ng2.fit_transform(corpus)

# Generate n-grams up to n=3
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))  # Uni-grams, bi-grams and tri-grams
ng3 = vectorizer_ng3.fit_transform(corpus)

# Print the number of features for each model
print("ng1, ng2 and ng3 have {}, {} and {} features respectively".format(ng1.shape[1], ng2.shape[1], ng3.shape[1]))

ng1, ng2 and ng3 have 6614, 37100 and 76881 features respectively


You now know how to generate n-gram models containing higher order n-grams. Notice that `ng2` has over 37,000 features whereas `ng3` has over 76,000 features. This is much greater than the roughly 6,600 dimensions obtained for `ng1`. As the n-gram range increases, so does the number of features, leading to increased computational costs and a problem known as the curse of dimensionality.

### Higher order n-grams for sentiment analysis
  
Similar to a previous exercise, we are going to build a classifier that can detect if the review of a particular movie is positive or negative. However, this time, we will use n-grams up to n=2 for the task.
  
The n-gram training reviews are available as `X_train_ng`. The corresponding test reviews are available as `X_test_ng`. Finally, use `y_train` and `y_test` to access the training and test sentiment classes respectively.
  
1. Define an instance of `MultinomialNB`. Name it `clf_ng`
2. Fit the classifier on `X_train_ng` and `y_train`.
3. Measure accuracy on `X_test_ng` and `y_test` using the `.score()` method.

In [29]:
ng_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_ng = ng_vectorizer.fit_transform(X_train)
X_test_ng = ng_vectorizer.transform(X_test)

In [30]:
# Define an instance of MultinomialNB
clf_ng = MultinomialNB()

# Fit the classifier
clf_ng.fit(X_train_ng, y_train)

# Measure the accuracy
accuracy = clf_ng.score(X_test_ng, y_test)
print("The accuracy of the classifier on the test set is {:.3f}".format(accuracy))

# Predict the sentiment of a negative review
review = 'The movie was not good. The plot had several holes and the acting lacked panache'
prediction = clf_ng.predict(ng_vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is {}".format(prediction))

The accuracy of the classifier on the test set is 0.796
The sentiment predicted by the classifier is 0


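You're now adept at performing sentiment analysis using text. Notice how this classifier performs *slightly* better than the BoW version (79.6% versus 79.2%). On a test set of 250 reviews, that difference amounts to a single review, and because the train/test split above was made without a fixed `random_state`, the exact figures will vary between runs. Also, it succeeds at correctly identifying the sentiment of the mini-review as negative. In the next chapter, we will learn more complex methods of vectorizing textual data.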

### Comparing performance of n-gram models
  
You now know how to conduct sentiment analysis by converting text into various n-gram representations and feeding them to a classifier. In this exercise, we will conduct sentiment analysis for the same movie reviews from before using two n-gram models: unigrams and n-grams up to n equal to 3.
  
We will then compare the performance using three criteria: accuracy of the model on the test set, time taken to execute the program and the number of features created when generating the n-gram representation.
  
1. Initialize a `CountVectorizer` object such that it generates uni-grams.
2. Initialize a `CountVectorizer` object such that it generates n-grams up to n=3.

In [33]:
import time


# Starting time
start_time = time.time()

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(movie_reviews['review'],
                                                    movie_reviews['sentiment'], 
                                                    test_size=0.5, 
                                                    random_state=42,
                                                    stratify=movie_reviews['sentiment'])

# Generating uni-grams (n=1 only)
vectorizer = CountVectorizer(ngram_range=(1,1))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took {:.3f} seconds to complete. The accuracy on the test-set is {:.2f}. ".format(time.time() - start_time, clf.score(test_X, test_y)))
print("The n-gram representation had {} features.".format(train_X.shape[1]))

The program took 0.757 seconds to complete. The accuracy on the test-set is 0.75. 
The n-gram representation had 12347 features.


In [38]:
import time


# Starting time
start_time = time.time()

# Splitting the data into training and test sets
train_X, test_X, train_y, test_y = train_test_split(movie_reviews['review'],
                                                    movie_reviews['sentiment'], 
                                                    test_size=0.5, 
                                                    random_state=42,
                                                    stratify=movie_reviews['sentiment'])

# Generating n-grams up to n=3 (uni-grams, bi-grams and tri-grams)
vectorizer = CountVectorizer(ngram_range=(1,3))
train_X = vectorizer.fit_transform(train_X)
test_X = vectorizer.transform(test_X)

# Fit classifier
clf = MultinomialNB()
clf.fit(train_X, train_y)

# Print the accuracy, time and number of dimensions
print("The program took {:.3f} seconds to complete. The accuracy on the test-set is {:.2f}. ".format(time.time() - start_time, clf.score(test_X, test_y)))
print("The n-gram representation had {} features.".format(train_X.shape[1]))

The program took 3.244 seconds to complete. The accuracy on the test-set is 0.77. 
The n-gram representation had 178240 features.


The program took about 0.757 seconds for the unigram model and about 3.244 seconds, roughly 4.3 times as long, for the higher order n-gram model. The unigram model had over 12,000 features whereas the model with n-grams up to n=3 had over 178,000! Despite the longer computation time and the much larger number of features, the classifier performs only marginally better in the latter case, with an accuracy of 77% compared with 75% for the unigram model.
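
The two cells above differ only in their `ngram_range`, so the comparison could also be wrapped in a helper and run in a loop. The following is one possible sketch, not the original exercise code: `evaluate_ngram_range` is a hypothetical name, the eight reviews are made-up stand-ins for `movie_reviews`, and `time.perf_counter()` is used in place of `time.time()` because it is intended for measuring short durations.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate_ngram_range(texts, labels, ngram_range):
    """Split, vectorize, fit and score; return accuracy, feature count, time."""
    start = time.perf_counter()
    train_X, test_X, train_y, test_y = train_test_split(
        texts, labels, test_size=0.5, random_state=42, stratify=labels)
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    train_bow = vectorizer.fit_transform(train_X)
    test_bow = vectorizer.transform(test_X)
    clf = MultinomialNB().fit(train_bow, train_y)
    return {
        "accuracy": clf.score(test_bow, test_y),
        "n_features": train_bow.shape[1],
        "seconds": time.perf_counter() - start,
    }


# Hypothetical toy reviews (1 = positive, 0 = negative)
texts = [
    "a wonderful film with a great cast",
    "the plot was dull and the acting was poor",
    "great story and wonderful music",
    "poor script and a dull ending",
    "i loved the cast and the story",
    "i hated the ending and the script",
    "wonderful acting and a great ending",
    "dull music and poor acting",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

results = {r: evaluate_ngram_range(texts, labels, r) for r in [(1, 1), (1, 3)]}
for ngram_range, result in results.items():
    print(ngram_range, result)
```

Applied to the real dataset, the same loop would report the accuracy, feature count and run time for each range side by side.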