1\. Bag-of-words
----------------

00:00 - 00:24

Welcome to the next chapter of this course! We proceed on our journey by embarking on the first step in performing a sentiment analysis task: transforming our text data to numeric form. Why do we need to do that? A machine learning model cannot work with the text data directly, but rather with numeric features we create from the data.

2\. What is a bag-of-words (BOW)?
----------------------------------

00:24 - 00:47

We start with a basic and crude, but often quite useful, method called bag-of-words (BOW). A bag-of-words approach describes the occurrence, or frequency, of words within a document or a collection of documents (called a corpus). It comes down to building a vocabulary of all the words occurring in the documents and keeping track of their frequencies.

3\. Amazon product reviews
--------------------------

00:47 - 01:10

Before we continue with the discussion of BOW, we will introduce the data we will use throughout the chapter, namely reviews of Amazon products. The dataset consists of two columns: the first contains the score, which is 1 if the review is positive and 0 if it is negative; the second contains the actual text of the product review.

4\. Sentiment analysis with BOW: Example
----------------------------------------

01:10 - 02:02

Let's see how BOW would work when applied to an example review. Imagine you have the following string: "This is the best book ever. I loved the book and highly recommend it." The goal of a BOW approach would be to build the following dictionary-like output: 'This' occurs once in our string, so it has a count of 1, 'is' occurs once, 'the' occurs two times, and so on. One thing to note is that we lose the word order and grammar rules. That is why this approach is called a 'bag' of words: it resembles dropping a bunch of items in a bag and losing any sense of their order. This sounds straightforward, but sometimes deciding how to build the vocabulary can be complex. We discuss some of the trade-offs we need to consider in later chapters.
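
As a rough illustration of the counting step described above, the following sketch builds the dictionary-like output for the example string using only the Python standard library. The regular-expression tokenizer is an assumption made for this sketch; it is not how any particular library tokenizes, and it keeps the original capitalization.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it."

# Split on runs of letters so punctuation is dropped; case is kept as-is
tokens = re.findall(r"[A-Za-z]+", review)

# Counter maps each distinct token to the number of times it occurs
bow = Counter(tokens)

print(bow["This"], bow["is"], bow["the"], bow["book"])  # prints: 1 1 2 2
```

Note that `bow` records only how often each token occurs; the order in which the tokens appeared in the review is no longer recoverable from it.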

5\. BOW end result
------------------

02:02 - 02:18

When we transform the text column with a BOW, the end result looks something like the table that we see: each column is a word (also called a token), each row is a review, and each cell records how many times that word occurs in the respective review.

6\. CountVectorizer function
----------------------------

02:18 - 03:13

How do we execute a BOW process in Python? The simplest way is to use CountVectorizer, which we import from sklearn.feature_extraction.text. For the moment we leave its arguments at their defaults, except for max_features, which keeps only the features with the highest term frequency; with a value of 1000, it will pick the 1000 most frequent words across the corpus of reviews. Capping the vocabulary this way is sometimes necessary for memory's sake. We first call the `fit()` method of the CountVectorizer on our text column, which learns the vocabulary. To create a BOW representation, we then call the transform() method, applied again to our text column.
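
The fit-then-transform steps described above can be sketched as follows. The three-sentence `docs` list is a hypothetical stand-in for the review column, and `max_features=3` is chosen only so that the effect is visible on such a tiny corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the review column
docs = [
    "I loved the book",
    "the book was boring",
    "great book, great price",
]

# Keep only the 3 most frequent tokens across the whole corpus
vect = CountVectorizer(max_features=3)
vect.fit(docs)              # learn the vocabulary
X = vect.transform(docs)    # count vocabulary tokens in each document

print(sorted(vect.vocabulary_))  # prints: ['book', 'great', 'the']
print(X.shape)                   # prints: (3, 3)
```

With the default settings, the vectorizer lowercases the text and ignores single-character tokens such as "I", so the vocabulary here is drawn from the remaining words.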

7\. CountVectorizer output
--------------------------

03:13 - 03:27

The result is a sparse matrix, which stores only the entries that are non-zero. Each row corresponds to a row (a review) in the dataset, and each column to a token in the BOW vocabulary.
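
To make the idea of storing only non-zero entries concrete, here is a conceptual sketch in plain Python. The numbers are made up, and the dictionary keyed by (row, column) is only an illustration of the principle, not the storage layout the vectorizer's output actually uses.

```python
# Dense count matrix: one row per document, one column per vocabulary token
dense = [
    [1, 0, 0, 2, 0],
    [0, 0, 1, 0, 0],
    [0, 3, 0, 0, 0],
]

# Sparse view: keep only the non-zero entries, keyed by (row, column)
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

print(len(sparse), "stored entries out of", len(dense) * len(dense[0]))
# prints: 4 stored entries out of 15
```

Because most reviews use only a small fraction of the vocabulary, most cells of a BOW matrix are zero, which is why this kind of storage saves memory.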

8\. Transforming the vectorizer
-------------------------------

03:27 - 03:53

To look at the actual contents of a sparse matrix, we need to perform an additional step to transform it back to a 'dense' NumPy array, using the .toarray() method. We can build a pandas DataFrame from the array, where the column names are obtained from the `.get_feature_names_out()` method of the vectorizer (`.get_feature_names()` in scikit-learn versions before 1.0). This returns a sequence in which every entry corresponds to one feature.

9\. Let's practice!
-------------------

03:53 - 03:59

That was our introduction to BOW! Let's apply what we've learned in the exercises.

Which statement about BOW is true?
==================================

You were introduced to bag-of-words (BOW) and some of its characteristics in the video. Which of the following statements about BOW **is** true?

##### Answer the question

#### Possible Answers

Select one answer

-   Bag-of-words preserves the word order and grammar rules.

-   Bag-of-words describes the order and frequency of words or tokens within a corpus of documents.

-   Bag-of-words is a simple but effective method to build a vocabulary of all the words occurring in a document.

-   Bag-of-words can only be applied to a large document, not to shorter documents or single sentences.

Your first BOW
==============

A bag-of-words is an approach to transform text to numeric form. 

In this exercise, you will apply a BOW to the `annak` list before moving on to a larger dataset in the next exercise. 

Your task will be to work with this list and apply a BOW using the `CountVectorizer()`. This transformation is your first step in being able to understand the sentiment of a text. Pay attention to words which might carry a strong sentiment. 

Remember that the output of a `CountVectorizer()` is a sparse matrix, which stores only the entries that are non-zero. To look at the actual content of this matrix, we convert it to a dense array using the `.toarray()` method.

Note that in this case you don't need to specify the `max_features` argument because the text is short.

Instructions
------------

-   Import the `CountVectorizer` class from `sklearn.feature_extraction.text`.
-   Build and fit the vectorizer on the small dataset.
-   Create the BOW representation with name `anna_bow` by calling the `transform()` method.
-   Print the BOW result as a dense array.

In [None]:
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())
     

BOW using product reviews
=========================

You practiced a BOW on a small dataset. Now you will apply it to a sample of Amazon product reviews. The data has been imported for you and is called `reviews`. It contains two columns. The first one is called `score` and it is `0` when the review is negative, and `1` when it is positive. The second column is called `review` and it contains the text of the review that a customer wrote. Feel free to explore the data in the IPython Shell.

Your task is to build a BOW vocabulary, using the `review` column.

Remember that we can call the `.get_feature_names_out()` method on the vectorizer (`.get_feature_names()` in scikit-learn versions before 1.0) to obtain all the vocabulary elements.

Instructions
------------

-   Create a CountVectorizer object, specifying the maximum number of features. 
-   Fit the vectorizer. 
-   Transform the `review` column using the fitted vectorizer.
-   Create a DataFrame from the sparse matrix converted to a dense array, making sure to correctly specify the column names.

In [None]:
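import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features
vect = CountVectorizer(max_features=100)
# Fit the vectorizer on the review column (`reviews` is the preloaded DataFrame)
vect.fit(reviews.review)

# Transform the review column into a sparse BOW matrix
X_review = vect.transform(reviews.review)

# Build a DataFrame from the dense array, with the vocabulary as column names
# (use vect.get_feature_names() on scikit-learn versions before 1.0)
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())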