1\. Building a bag of words model
---------------------------------

00:00 - 00:09

In this chapter, we will cover vectorization which is, as you may recall, the process of converting text into vectors.

2\. Recap of data format for ML algorithms
------------------------------------------

00:09 - 00:31

Recall that for any ML algorithm to run properly, data fed into it must be in tabular form and all the training features must be numerical. This is clearly not the case for textual data. In this lesson, we will learn a technique called bag of words that converts text documents into vectors.

For any ML algorithm,
- Data must be in tabular form
- Training features must be numerical


3\. Bag of words model
----------------------

00:31 - 00:56

The bag of words model is a procedure of extracting word tokens from a text document (henceforth, we will refer to this as just document), computing the frequency of these word tokens and constructing a word vector based on these frequencies and the vocabulary of the entire corpus of documents. This is best explained with the help of an example.

- Extract word tokens
- Compute frequency of word tokens
- Construct a word vector out of these frequencies and vocabulary of corpus


4\. Bag of words model example
------------------------------

00:56 - 01:12

Consider a corpus of three documents. The lion is the king of the jungle. Lions have lifespans of a decade. And, the lion is an endangered species.

```markdown
Corpus
- "The lion is the king of the jungle"
- "Lions have lifespans of a decade"
- "The lion is an endangered species"
```

5\. Bag of words model example
------------------------------

01:12 - 02:11

We now extract the unique word tokens that occur in this corpus of documents. This will be the vocabulary of our model. In this example, the following 15 word tokens will constitute our vocabulary. Since there are 15 words in our vocabulary, our word vectors will have 15 dimensions and each dimension's value will correspond to the frequency of the word token corresponding to that dimension. For instance, the second dimension will correspond to the number of times the second word in the vocabulary, an, occurs in the document. Let's now convert our documents into word vectors using this bag of words model. The lion is the king of the jungle is converted to the following vector. Similarly, the other two sentences have the following word vector representations.

```markdown
Vocabulary → a, an, decade, endangered, have, is, jungle, king, lifespans, lion, Lions, of, species, the, The

[0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 2, 1]

[1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]

[0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1]
```
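The three steps above can be sketched in plain Python. This is a minimal illustration, assuming naive whitespace tokenization and no preprocessing; the helper name `to_vector` is our own and not part of any library.

```python
from collections import Counter

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# Step 1: extract word tokens (naive whitespace split, no preprocessing).
tokenized = [doc.split() for doc in corpus]

# The vocabulary is every unique token, ordered alphabetically with the
# lowercase form listed before the capitalized one ('the' before 'The').
vocabulary = sorted(
    {token for tokens in tokenized for token in tokens},
    key=lambda w: (w.lower(), w != w.lower()),
)

# Steps 2 and 3: count token frequencies and lay them out in vocabulary order.
def to_vector(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary word, so every document maps to the same 15 dimensions regardless of its length.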

6\. Text preprocessing
----------------------

02:11 - 03:01

As we were constructing this model, you may have noticed how text preprocessing would have been extremely useful in creating arguably better models. We would usually want Lions and lion to mean the same thing and therefore be counted as the same thing. The same applies to 'the' with different cases. We would also want to remove punctuation and stopwords, as they are extremely common and don't really contribute much to the character of the document. Performing text preprocessing usually leads to smaller vocabularies, which is a good thing. While working with vectorization, it is routine to form word vectors running into thousands of dimensions, and keeping this to a minimum helps improve performance.

```markdown
- Lions, Lion → lion
- The, the → the
- No punctuation
- No stopwords
- Leads to smaller vocabularies
- Reducing number of dimensions helps improve performance
```
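To get a feel for how much preprocessing shrinks the vocabulary, here is a rough sketch on the lions corpus. The stopword set and the `LEMMAS` lookup are tiny hand-written stand-ins (assumptions for illustration only); a real pipeline would use a proper stopword list and a lemmatizer such as spaCy's.

```python
import string

corpus = [
    "The lion is the king of the jungle.",
    "Lions have lifespans of a decade.",
    "The lion is an endangered species.",
]

# Hypothetical, hand-picked stopwords and lemma lookup for this toy corpus.
STOPWORDS = {"a", "an", "the", "is", "of", "have"}
LEMMAS = {"lions": "lion", "lifespans": "lifespan"}

def preprocess(doc):
    # Lowercase and strip punctuation before splitting into tokens.
    cleaned = doc.lower().translate(str.maketrans("", "", string.punctuation))
    # Map each token to its base form, then drop stopwords.
    tokens = [LEMMAS.get(token, token) for token in cleaned.split()]
    return [token for token in tokens if token not in STOPWORDS]

raw_vocab = {token for doc in corpus for token in doc.split()}
clean_vocab = {token for doc in corpus for token in preprocess(doc)}

print(len(raw_vocab), len(clean_vocab))
print(sorted(clean_vocab))
```

On this corpus the vocabulary drops from 15 raw tokens to 7 preprocessed ones, which means word vectors with fewer than half the dimensions.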

7\. Bag of words model using sklearn
------------------------------------

03:01 - 03:16

To construct the bag of words model in Python, we will use the scikit-learn library. We will use the corpus from before, consisting of the three sentences on lions. Let's ignore text preprocessing for now.

```python
import pandas as pd

corpus = pd.Series([
    'The lion is the king of the jungle',
    'Lions have lifespans of a decade',
    'The lion is an endangered species'
])
```

8\. Bag of words model using sklearn
------------------------------------

03:16 - 04:29

We import the CountVectorizer class from sklearn.feature_extraction.text. This is the class that will help us build our bag of words model. Next, we instantiate a CountVectorizer object named vectorizer. Finally, we create our matrix of word vectors by passing corpus to the fit_transform method of vectorizer. This is stored in bow_matrix. This bow_matrix is a sparse matrix, and we can print its 2D array form using bow_matrix.toarray(). This gives us the following output. Notice how this is different from the word vectors we generated by hand: it has 13 columns instead of 15. This is because CountVectorizer automatically lowercases words and ignores single-character tokens such as 'a'. Also, the column order is chosen by the vectorizer and need not match the order of a hand-built vocabulary. We will learn how to map the vocabulary to the indices in the exercises. We can now use this bow_matrix as our training features in ML models.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the dense 2D array form of the sparse matrix
print(bow_matrix.toarray())

[[0 0 0 0 1 1 1 0 1 0 1 0 3]
 [0 1 0 1 0 0 0 1 0 1 1 0 0]
 [1 0 1 0 1 0 0 0 1 0 0 1 1]]
```
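The two default behaviours mentioned here, lowercasing and dropping single-character tokens, can be imitated with the standard library alone. The sketch below is not scikit-learn's implementation; it assumes only the documented default `token_pattern` of `CountVectorizer` (two or more word characters) and an alphabetically sorted vocabulary.

```python
import re

corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

# CountVectorizer's documented default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    # Lowercase first, then keep only tokens matching the pattern.
    return TOKEN_PATTERN.findall(doc.lower())

# Sorted vocabulary; the single-character token 'a' never makes it in.
vocabulary = sorted({token for doc in corpus for token in tokenize(doc)})

# One row per document, one column per vocabulary word.
matrix = [[tokenize(doc).count(word) for word in vocabulary] for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
```

Because 'The' and 'the' collapse into one token and 'a' is dropped, the vocabulary has 13 entries rather than the 15 we found by hand.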

9\. Let's practice!
-------------------

04:29 - 04:36

We've covered a lot of theory in this lesson. Let us practice this in the exercises.

Word vectors with a given vocabulary
====================================

You have been given a corpus of documents and you have computed the vocabulary of the corpus to be the following: ***V***: *a, an, and, but, can, come, evening, forever, go, i, men, may, on, the, women*

Which of the following corresponds to the bag of words vector for the document "men may come and men may go but i go on forever"?

#### Possible Answers

[x] -   `(0, 0, 1, 1, 0, 1, 0, 1, 2, 1, 2, 2, 1, 0, 0)`

-   `(0, 1, 0, 1, 1, 1, 2, 0, 2, 1, 0, 0, 0, 2, 0)`

-   `(2, 1, 0, 0, 2, 1, 0, 0, 0, 1)`

-   `(0, 0, 1, 2, 1, 2, 1, 1, 1, 0, 0, 1, 1, 1, 1)`


BoW model for movie taglines
============================

In this exercise, you have been provided with a `corpus` of more than 7000 movie taglines. Your job is to generate the bag of words representation `bow_matrix` for these taglines. For this exercise, we will ignore the text preprocessing step and generate `bow_matrix` directly.

We will also investigate the shape of the resultant `bow_matrix`. The first five taglines in `corpus` have been printed to the console for you to examine.

Instructions
------------

-   Import the `CountVectorizer` class from `sklearn`.
-   Instantiate a `CountVectorizer` object. Name it `vectorizer`.
-   Using `fit_transform()`, generate `bow_matrix` for `corpus`.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Print the shape of bow_matrix
print(bow_matrix.shape)
```


Analyzing dimensionality and preprocessing
==========================================

In this exercise, you have been provided with a `lem_corpus` which contains the pre-processed versions of the movie taglines from the previous exercise. In other words, the taglines have been lowercased and lemmatized, and stopwords have been removed. 

Your job is to generate the bag of words representation `bow_lem_matrix` for these lemmatized taglines and compare its shape with that of `bow_matrix` obtained in the previous exercise. The first five lemmatized taglines in `lem_corpus` have been printed to the console for you to examine.

Instructions
------------

-   Import the `CountVectorizer` class from `sklearn`.
-   Instantiate a `CountVectorizer` object. Name it `vectorizer`.
-   Using `fit_transform()`, generate `bow_lem_matrix` for `lem_corpus`.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_lem_matrix = vectorizer.fit_transform(lem_corpus)

# Print the shape of bow_lem_matrix
print(bow_lem_matrix.shape)
```

Mapping feature indices with feature names
==========================================

In the lesson video, we saw that the column order of `CountVectorizer`'s output need not match the order of a hand-built vocabulary. In this exercise, we will learn to map each feature index to its corresponding feature name from the vocabulary.

We will use the same three sentences on lions from the video. The sentences are available in a list named `corpus` and have already been printed to the console.

Instructions
------------

-   Instantiate a `CountVectorizer` object. Name it `vectorizer`.
-   Using `fit_transform()`, generate `bow_matrix` for `corpus`.
-   Using the `get_feature_names_out()` method (`get_feature_names()` in scikit-learn versions before 1.0), map the column names to the corresponding word in the vocabulary.

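```python
# Import pandas and CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Generate matrix of word vectors
bow_matrix = vectorizer.fit_transform(corpus)

# Convert bow_matrix into a DataFrame
bow_df = pd.DataFrame(bow_matrix.toarray())

# Map the column names to vocabulary
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
bow_df.columns = vectorizer.get_feature_names_out()

# Print bow_df
print(bow_df)
```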

1\. Building a BoW Naive Bayes classifier
-----------------------------------------

00:00 - 00:09

In this lesson, we will walk through a machine learning problem that utilizes feature engineering techniques we've learned, to arrive at a desired result.

2\. Spam filtering
------------------

00:09 - 00:28

Let's take a look at the spam filtering problem. We're given a dataset of messages that have been labelled as spam or ham. Here, you can see a typical spam and ham message. Our task is to train an ML model that can predict the label given a particular text.

```markdown
| message                                                                                         | label |
|-------------------------------------------------------------------------------------------------|-------|
| WINNER!! As a valued network customer you have been selected to receive a $900 prize reward! To claim call 09061701461 | spam  |
| Ah, work. I vaguely remember that. What does it feel like?                                        | ham   |
```

3\. Steps
---------

00:28 - 00:51

There are 3 steps involved. The first is to preprocess the text. Next, we proceed to build the bag-of-words model. Finally, we conduct predictive modeling using the generated BoW vectors. Note that although we use the term 'modeling' in the context of both BoW and machine learning, they mean two different things.

1. Text preprocessing
2. Building a bag-of-words model (or representation)
3. Machine learning



4\. Text preprocessing using CountVectorizer
--------------------------------------------

00:51 - 02:17

We've already learned how to conduct text preprocessing using spaCy. However, it is also possible to do this using CountVectorizer. CountVectorizer takes in a number of arguments to perform preprocessing. The lowercase argument, when set to True, converts words to lowercase. The strip_accents argument can convert accented characters according to a Unicode or ASCII mapping. Passing in a stop_words argument will lead to CountVectorizer ignoring stopwords. You can pass in a custom list or the string 'english' to use scikit-learn's list of English stopwords. You can specify tokenization using a regular expression as the value of the token_pattern argument. Tokenization can also be specified using a tokenizer argument. Here, you can pass a function that takes a string as an argument and returns a list of tokens. This way, CountVectorizer allows usage of spaCy's tokenization techniques. CountVectorizer cannot perform certain steps such as lemmatization automatically. This is where spaCy is useful. Although it performs tokenization and preprocessing, CountVectorizer's main job is to convert a corpus into a matrix of numerical vectors.

```markdown
CountVectorizer arguments:

- lowercase: False, True
- strip_accents: 'unicode', 'ascii', None
- stop_words: 'english', list, None
- token_pattern: regex
- tokenizer: function
```
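To make these arguments concrete, here is a rough standard-library imitation of what each one does to a single document. The function `preprocess` and its parameters are our own hypothetical names that mirror the argument names above; this is a sketch of the ideas, not scikit-learn's actual code.

```python
import re
import unicodedata

def strip_accents_ascii(text):
    # Decompose accented characters, then discard anything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def preprocess(doc, lowercase=True, strip_accents=None, stop_words=None,
               token_pattern=r"(?u)\b\w\w+\b", tokenizer=None):
    if lowercase:
        doc = doc.lower()
    if strip_accents == "ascii":
        doc = strip_accents_ascii(doc)
    # A custom tokenizer function, if given, takes precedence over the regex.
    tokens = tokenizer(doc) if tokenizer else re.findall(token_pattern, doc)
    if stop_words:
        tokens = [token for token in tokens if token not in stop_words]
    return tokens

# Lowercasing, accent stripping and a custom stopword collection.
tokens = preprocess("The Café is CLOSED today",
                    strip_accents="ascii", stop_words={"the", "is"})
print(tokens)

# Preserving case and using a function as the tokenizer.
shouty = preprocess("FREE prize now", lowercase=False, tokenizer=str.split)
print(shouty)
```

The second call shows why one might turn lowercasing off: the all-capital 'FREE' survives as its own token.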

5\. Building the BoW model
--------------------------

02:17 - 02:57

As usual, we import CountVectorizer from scikit-learn. We then instantiate a CountVectorizer object called vectorizer. We perform accent stripping using ASCII mapping and remove English stopwords. We also set the lowercase argument to False. This is because spam messages tend to abuse all-capital words, and we might want to preserve this information for the ML step. The dataset has already been loaded into the dataframe df. We split this dataset into training and test sets using scikit-learn's train_test_split function.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create CountVectorizer object
vectorizer = CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=False)

# Import train_test_split
from sklearn.model_selection import train_test_split

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.25)
```

6\. Building the BoW model
--------------------------

02:57 - 03:32

We now fit the vectorizer on the training set and transform it into its bag-of-words representation. We can perform both these steps together using the fit_transform method. Next, we transform the test set into its BoW representation. Note that we do not fit the vectorizer with the test data. It is possible that there are some words in the test data that are not in the vocabulary of the vectorizer. In such cases, CountVectorizer simply ignores these words.

```python
# Generate training BoW vectors
X_train_bow = vectorizer.fit_transform(X_train)

# Generate test BoW vectors
X_test_bow = vectorizer.transform(X_test)
```
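The difference between fitting and transforming can be seen in a toy vectorizer. `TinyCountVectorizer` below is a hypothetical, minimal stand-in written for this illustration (whitespace tokens only), and the messages are made up; it only shows why test-set words missing from the training vocabulary are ignored.

```python
class TinyCountVectorizer:
    """Hypothetical, minimal stand-in for a count vectorizer."""

    def fit(self, docs):
        # The vocabulary is learned from the training documents only.
        self.vocabulary_ = sorted({token for doc in docs for token in doc.split()})
        return self

    def transform(self, docs):
        # Count only words in the learned vocabulary; anything else has
        # no column and is silently dropped.
        rows = []
        for doc in docs:
            tokens = doc.split()
            rows.append([tokens.count(word) for word in self.vocabulary_])
        return rows

train = ["free prize now", "see you at lunch"]
test = ["free lunch tomorrow"]

vectorizer = TinyCountVectorizer().fit(train)
X_train_bow = vectorizer.transform(train)
X_test_bow = vectorizer.transform(test)

print(vectorizer.vocabulary_)
print(X_test_bow)
```

The word 'tomorrow' never appeared in training, so it contributes nothing to the test vector, and the training and test matrices keep the same number of columns.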

7\. Training the Naive Bayes classifier
---------------------------------------

03:32 - 04:11

We're now in a good position to train an ML model. We will use the Multinomial Naive Bayes classifier for this task. We import the MultinomialNB class from scikit-learn and create an object named clf. We then fit the training BoW vectors and their corresponding labels to clf. We can now test the performance of our model. We compute the accuracy of the model on the test set using clf.score. In this case, our model registered an accuracy of 76% on the test set.

```python
# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Create MultinomialNB object
clf = MultinomialNB()

# Train clf
clf.fit(X_train_bow, y_train)

# Compute accuracy on test set
accuracy = clf.score(X_test_bow, y_test)
print(accuracy)

0.760051
```
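For intuition about what the classifier does with word counts, here is a compact from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. The four training messages are invented for illustration, and the function names are our own; in practice you would use `MultinomialNB` as shown above.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    # Returns log priors and smoothed per-class log likelihoods of each word.
    vocabulary = sorted({token for doc in docs for token in doc.split()})
    class_sizes = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())

    log_prior = {c: math.log(n / len(docs)) for c, n in class_sizes.items()}
    log_likelihood = {}
    for c in class_sizes:
        total = sum(token_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            word: math.log((token_counts[c][word] + alpha) / total)
            for word in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][token]
            for token in doc.split()
            if token in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy, made-up training data.
docs = ["win a free prize now", "claim your free reward",
        "see you at lunch", "are you free for lunch"]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict("free prize reward", log_prior, log_likelihood))
print(predict("lunch at noon", log_prior, log_likelihood))
```

Smoothing matters here: without it, a word such as 'prize' that never occurs in ham messages would give a zero probability and a log of zero.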

8\. Let's practice!
-------------------

04:11 - 04:26

We've covered a lot of ground in building a spam filter in this lesson. In the exercises, we will follow similar steps to perform sentiment analysis on movie reviews. Let's practice!

BoW vectors for movie reviews
=============================

In this exercise, you have been given two pandas Series, `X_train` and `X_test`, which consist of movie reviews. They represent the training and the test review data respectively. Your task is to preprocess the reviews and generate BoW vectors for these two sets using `CountVectorizer`.

Once we have generated the BoW vector matrices `X_train_bow` and `X_test_bow`, we will be in a very good position to apply a machine learning model to them and conduct sentiment analysis.

Instructions
------------

-   Import `CountVectorizer` from the `sklearn` library.
-   Instantiate a `CountVectorizer` object named `vectorizer`. Ensure that all words are converted to lowercase and `english` stopwords are removed.
-   Using `X_train`, fit `vectorizer` and then use it to transform `X_train` to generate the set of BoW vectors `X_train_bow`.
-   Transform `X_test` using `vectorizer` to generate the set of BoW vectors `X_test_bow`.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create a CountVectorizer object
vectorizer = CountVectorizer(lowercase=True, stop_words='english')

# Fit and transform X_train
X_train_bow = vectorizer.fit_transform(X_train)

# Transform X_test
X_test_bow = vectorizer.transform(X_test)

# Print shape of X_train_bow and X_test_bow
print(X_train_bow.shape)
print(X_test_bow.shape)
```

Predicting the sentiment of a movie review
==========================================

In the previous exercise, you generated the bag-of-words representations for the training and test movie review data. In this exercise, we will use these representations to train a Naive Bayes classifier that can detect the sentiment of a movie review and compute its accuracy. Note that since this is a binary classification problem, the model is only capable of classifying a review as either positive (1) or negative (0). It is incapable of detecting neutral reviews.

In case you don't recall, the training and test BoW vectors are available as `X_train_bow` and `X_test_bow` respectively. The corresponding labels are available as `y_train` and `y_test` respectively. Also, for your reference, the original movie review dataset is available as `df`.

Instructions
------------

-   Instantiate an object of `MultinomialNB`. Name it `clf`.
-   Fit `clf` using `X_train_bow` and `y_train`.
-   Measure the accuracy of `clf` using `X_test_bow` and `y_test`.

```python
# Import MultinomialNB
from sklearn.naive_bayes import MultinomialNB

# Create a MultinomialNB object
clf = MultinomialNB()

# Fit the classifier
clf.fit(X_train_bow, y_train)

# Measure the accuracy
accuracy = clf.score(X_test_bow, y_test)
print("The accuracy of the classifier on the test set is %.3f" % accuracy)

# Predict the sentiment of a negative review
review = "The movie was terrible. The music was underwhelming and the acting mediocre."
prediction = clf.predict(vectorizer.transform([review]))[0]
print("The sentiment predicted by the classifier is %i" % (prediction))
```

1\. Building n-gram models
--------------------------

00:00 - 00:09

We already know how to build bag-of-words representations of our documents and use them to conduct various machine learning tasks.

2\. BoW shortcomings
--------------------

00:09 - 00:54

Consider the following mini reviews. One is a positive review which states that the movie was good and not boring. The other is negative, commenting that the movie was not good and boring. If we were to construct BoW vectors for these reviews, we would get identical vectors since both reviews contain exactly the same words. And herein lies the biggest shortcoming of the bag-of-words model: the context of the words is lost. In this example, the position of the word 'not' changes the entire sentiment of the review. Therefore, in this lesson, we will study techniques that will allow us to model this.

```markdown
| review                                 | label   |
|----------------------------------------|---------|
| 'The movie was good and not boring'    | positive |
| 'The movie was not good and boring'    | negative |

- Exactly the same BoW representation!
- Context of the words is lost.
- Sentiment dependent on the position of 'not'.
```
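We can check this claim directly with a quick sketch, assuming simple lowercasing and whitespace tokenization. A `Counter` of tokens is exactly a bag of words: it records how often each word occurs and nothing about where.

```python
from collections import Counter

positive = "The movie was good and not boring"
negative = "The movie was not good and boring"

# A bag of words keeps word counts but discards word order.
bow_positive = Counter(positive.lower().split())
bow_negative = Counter(negative.lower().split())

print(bow_positive == bow_negative)  # True: the two bags are identical
```

Any classifier fed these two vectors has no way of telling the reviews apart.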

3\. n-grams
-----------

00:54 - 01:30

An n-gram is a contiguous sequence of n elements (or words) in a given document. The bag-of-words model that we've explored so far is nothing but an n-gram model where n is equal to one. Let's now explore n-grams when n is greater than one. Consider the sentence 'for you a thousand times over'. If we set n to 2, then the n-grams (called bigrams in this case) would be for you, you a, a thousand, thousand times and times over.

```markdown
- Contiguous sequence of n elements (or words) in a given document.
- n = 1 → bag-of-words
- 'for you a thousand times over'
- n = 2, n-grams:
  - ['for you', 'you a', 'a thousand', 'thousand times', 'times over']
```

4\. n-grams
-----------

01:30 - 01:50

Similarly, for n equal to 3, the n-grams (or trigrams) will be for you a, you a thousand, a thousand times, thousand times over. Therefore, we can use these n-grams to capture more context and account for cases like 'not'.

```markdown
- 'for you a thousand times over'
- n = 3, n-grams:
  - ['for you a', 'you a thousand', 'a thousand times', 'thousand times over']
- Captures more context.
```
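Generating n-grams amounts to sliding a window of n tokens across the document. A minimal sketch, assuming whitespace tokenization (the function name `ngrams` is our own):

```python
def ngrams(text, n):
    # Slide a window of n tokens across the document.
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "for you a thousand times over"

bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

A document with k tokens yields k - n + 1 n-grams, so this six-word sentence gives five bigrams and four trigrams.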

5\. Applications
----------------

01:50 - 02:12

Apart from capturing more context, n-grams have a host of other useful applications. They are used in sentence completion, spelling correction and machine translation correction. In all these cases, the model computes the probability of n words occurring contiguously to perform the above processes.

- Sentence completion
- Spelling correction
- Machine translation correction


6\. Building n-gram models using scikit-learn
---------------------------------------------

02:12 - 02:47

Building these n-gram models using scikit-learn is extremely simple, now that we know how to use CountVectorizer. CountVectorizer takes in an argument ngram_range, which is a tuple containing the lower and upper bounds for the range of n-values. For instance, passing (2, 2) as the ngram_range will generate only bigrams. On the other hand, passing in (1, 3) will generate n-grams where n is equal to 1, 2 and 3.

```python
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Generates only bigrams.
bigrams = CountVectorizer(ngram_range=(2,2))

# Generates unigrams, bigrams and trigrams.
ngrams = CountVectorizer(ngram_range=(1,3))
```
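To see what an n-gram range buys us, here is a standard-library sketch that mimics the idea of `ngram_range` (the function `ngram_counts` is a hypothetical helper, not scikit-learn code) and applies it to the two mini reviews from earlier in the lesson.

```python
from collections import Counter

def ngram_counts(text, ngram_range=(1, 1)):
    # Count every n-gram for each n between the lower and upper bound, inclusive.
    tokens = text.lower().split()
    low, high = ngram_range
    counts = Counter()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

positive = "The movie was good and not boring"
negative = "The movie was not good and boring"

# Unigrams alone cannot tell the reviews apart, but bigrams can.
same_unigrams = ngram_counts(positive, (1, 1)) == ngram_counts(negative, (1, 1))
same_bigrams = ngram_counts(positive, (2, 2)) == ngram_counts(negative, (2, 2))
print(same_unigrams, same_bigrams)

# Bigrams that occur only in the negative review.
only_negative = set(ngram_counts(negative, (2, 2))) - set(ngram_counts(positive, (2, 2)))
print(sorted(only_negative))
```

The bigram 'not good' appears only in the negative review, which is precisely the context a unigram model throws away.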

7\. Shortcomings
----------------

02:47 - 03:32

While on the surface it may seem tempting to generate n-grams of high orders to capture more and more context, this comes with caveats. We've already seen that BoW vectors run into thousands of dimensions. Adding higher order n-grams increases the number of dimensions even more and, while performing machine learning, leads to a problem known as the curse of dimensionality. Additionally, n-grams for n greater than 3 are exceedingly rare to find in multiple documents, so such features become effectively useless. For these reasons, it is often a good idea to restrict yourself to n-grams where n is small.

- Curse of dimensionality
- Higher order n-grams are rare
- Keep n small
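The growth in dimensions is visible even on the three-sentence lions corpus. The sketch below, assuming lowercasing and whitespace tokenization, counts the distinct features when we include all n-grams up to a given order; `vocabulary_size` is a helper name of our own.

```python
corpus = [
    "The lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

def vocabulary_size(docs, max_n):
    # Collect every distinct n-gram for n = 1 .. max_n across all documents.
    features = set()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                features.add(tuple(tokens[i:i + n]))
    return len(features)

sizes = {max_n: vocabulary_size(corpus, max_n) for max_n in (1, 2, 3)}
print(sizes)
```

Going from unigrams only to unigrams, bigrams and trigrams triples the feature count on this tiny corpus (14 to 42), and nearly all of the added n-grams occur in just one document.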


8\. Let's practice!
-------------------

03:32 - 03:42

Great! Let's now build these advanced n-gram models and discover more insights in the exercises.