1\. Word counts with bag-of-words
---------------------------------

00:00 - 00:08

Welcome to chapter two! We'll begin by using word counts with a bag-of-words approach.

2\. Bag-of-words
----------------

00:08 - 00:35

Bag of words is a very simple and basic method for finding topics in a text. To build a bag of words, you first create tokens using tokenization and then count up all the tokens you have. The theory is that the more frequent a word or token is, the more central or important it might be to the text. Bag of words can be a great way to determine the significant words in a text based on the number of times they are used.

3\. Bag-of-words example
------------------------

00:35 - 01:17

Here we see an example series of sentences, mainly about a cat and a box. If we just use a simple bag-of-words model with tokenization like we learned in chapter one and remove the punctuation, we can see the example result. Box, cat, The and the are some of the most important words because they are the most frequent. Notice that the word "the" appears twice in the bag of words, once capitalized and once lowercase. If we added a preprocessing step to handle this issue, we could lowercase all of the words in the text so that each word is counted under a single entry.
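As a rough sketch of the lowercasing step described above, the snippet below counts tokens before and after lowercasing. The sentences and the regular-expression tokenizer are assumptions for illustration (a stand-in for the NLTK tokenization from chapter one, with punctuation dropped), not the slide's actual text or code.

```python
import re
from collections import Counter

# Illustrative text (an assumption, not the slide's exact sentences).
text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Stand-in tokenizer: keep runs of letters only, which also drops punctuation.
tokens = re.findall(r"[A-Za-z]+", text)

# Case-sensitive counts treat "The" and "the" as different tokens.
raw_counts = Counter(tokens)

# Lowercasing first merges them into a single entry.
lower_counts = Counter(t.lower() for t in tokens)

print(raw_counts["The"], raw_counts["the"])  # two separate counts
print(lower_counts["the"])                   # one combined count
```

The combined count equals the sum of the two case-sensitive counts, which is why lowercasing gives a more faithful picture of how often a word is used.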

4\. Bag-of-words in Python
--------------------------

01:17 - 02:18

We can use the NLP fundamentals we already know, such as tokenization with NLTK, to create a list of tokens. We will use a new class called Counter, which we import from the standard library module collections. The list of tokens generated using word_tokenize can be passed as the initialization argument for the Counter class. The result is a counter object, which has a similar structure to a dictionary and allows us to see each token and the frequency of the token. Counter objects also have a method called `most_common`, which takes an integer argument, such as 2, and would then return the top 2 tokens in terms of frequency. The return object is a list of tuples. For each tuple, the first element holds the token and the second element represents the frequency. Note: `most_common` orders tokens only by frequency; it does not otherwise sort them, and it does not tell us whether other tokens share the same frequency as the ones it returns.
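A minimal sketch of the `Counter` usage described above. The token list is hypothetical; in the video it would come from `word_tokenize`.

```python
from collections import Counter

# Hypothetical token list; in the video it would come from word_tokenize.
tokens = ["the", "cat", "is", "in", "the", "box", "the", "cat", "box"]

counts = Counter(tokens)

# Dictionary-style lookup of a single token's frequency.
print(counts["cat"])

# most_common(n) returns a list of (token, frequency) tuples.
top_two = counts.most_common(2)
print(top_two)

# "cat" and "box" are tied here, but only one of them fits in the top two;
# most_common gives no hint that a tie was cut off.
print(counts["box"])
```

If ties matter for your analysis, one option is to request more items than you need and inspect the frequencies yourself.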

5\. Let's practice!
-------------------

02:18 - 02:25

Now you know a bit about bag of words and can get started building your own using Python.

Bag-of-words picker
===================

It's time for a quick check on your understanding of bag-of-words. Which of the options below, using basic `nltk` tokenization, is the bag-of-words for the following text?

"The cat is in the box. The cat box."

Instructions
------------

50 XP

### Possible answers

('the', 3), ('box.', 2), ('cat', 2), ('is', 1)

('The', 3), ('box', 2), ('cat', 2), ('is', 1), ('in', 1), ('.', 1)

('the', 3), ('cat box', 1), ('cat', 1), ('box', 1), ('is', 1), ('in', 1)

[/] ('The', 2), ('box', 2), ('.', 2), ('cat', 2), ('is', 1), ('in', 1), ('the', 1)

Building a Counter with bag-of-words
====================================

In this exercise, you'll build your first (in this course) bag-of-words counter using a Wikipedia article, which has been pre-loaded as `article`. Try doing the bag-of-words without looking at the full article text, and guessing what the topic is! If you'd like to peek at the title at the end, we've included it as `article_title`. Note that this article text has had very little preprocessing from the raw Wikipedia database entry.

`word_tokenize` has been imported for you.

Instructions
------------

100 XP

-   Import `Counter` from `collections`.
-   Use `word_tokenize()` to split the article into tokens.
-   Use a list comprehension with `t` as the iterator variable to convert all the tokens into lowercase. The `.lower()` method converts text into lowercase.
-   Create a bag-of-words counter called `bow_simple` by using `Counter()` with `lower_tokens` as the argument.
-   Use the `.most_common()` method of `bow_simple` to print the 10 most common tokens.

In [None]:
from collections import Counter

# Tokenize the article: tokens
tokens = word_tokenize(article)

# Convert the tokens into lowercase: lower_tokens
lower_tokens = [t.lower() for t in tokens]

# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(lower_tokens)

# Print the 10 most common tokens
print(bow_simple.most_common(10))

1\. Simple text preprocessing
-----------------------------

00:00 - 00:06

In this video, we will cover some simple text preprocessing.

2\. Why preprocess?
-------------------

00:06 - 01:06

Text preprocessing helps make for better input data when performing machine learning or other statistical methods. For example, in the last few exercises you have applied small bits of preprocessing (like tokenization) to create a bag of words. You also noticed that applying simple techniques, like lowercasing all of the tokens, can lead to slightly better results for a bag-of-words model. Preprocessing steps like tokenization or lowercasing words are commonly used in NLP. Other common techniques include lemmatization or stemming, where you shorten the words to their root stems; removing stop words, which are common words in a language that don't carry a lot of meaning (such as "and" or "the"); and removing punctuation or unwanted tokens. Of course, each model and process will have different results, so it's good to try a few different approaches to preprocessing and see which works best for your task and goal.

3\. Preprocessing example
-------------------------

01:06 - 01:31

We have here some example input and output text we might expect from preprocessing. First we have a simple two sentence string about pets. Then we have some example output tokens we want. You can see that the text has been tokenized and that everything is lowercase. We also notice that stopwords have been removed and the plural nouns have been made singular.

4\. Text preprocessing with Python
----------------------------------

01:31 - 02:53

We can perform text preprocessing using many of the tools we already know. In this code, we are using the same text as in our previous video, a few sentences about a cat with a box. We can use list comprehensions to tokenize the sentences, which we first make lowercase using the string `lower` method. The string `isalpha` method returns True if the string contains only alphabetic characters. We use `isalpha` in the `if` clause of the list comprehension, iterating over our tokenized result, to keep only alphabetic strings (this effectively strips tokens containing numbers or punctuation). To read out the process in both code and English: we take each token from the word_tokenize output of the lowercase text if it contains only alphabetic characters. In the next line, we use another list comprehension to remove words that are in the stopwords list. This stopwords list for English comes built in with the NLTK library. Finally, we can create a counter and check the two most common words, which are now cat and box (unlike the and box, which were the two tokens returned in our first result). Preprocessing has already improved our bag of words and made it more useful by removing the stopwords and non-alphabetic tokens.
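The pipeline described above can be sketched roughly as follows. To keep the snippet self-contained, it makes two assumptions that differ from the video: a regular expression stands in for `word_tokenize`, and a tiny hand-written `english_stops` set stands in for NLTK's English stop word list. The text itself is also illustrative.

```python
import re
from collections import Counter

# Illustrative text (an assumption, not the slide's exact sentences).
text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Hypothetical, tiny stop word set standing in for NLTK's English list.
english_stops = {"the", "is", "in", "over"}

# Stand-in tokenizer: words and single punctuation marks become tokens.
raw_tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Keep only alphabetic tokens; this drops "." and anything with digits.
alpha_tokens = [t for t in raw_tokens if t.isalpha()]

# Remove stop words.
no_stops = [t for t in alpha_tokens if t not in english_stops]

# Count what is left and look at the two most frequent tokens.
top_two = Counter(no_stops).most_common(2)
print(top_two)
```

With the stop words and punctuation gone, the most frequent tokens are the content words of the text rather than "the".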

5\. Let's practice!
-------------------

02:53 - 02:59

You can now get started by preprocessing your own text!

Text preprocessing steps
========================

Which of the following are useful text preprocessing steps?

##### Answer the question

50 XP

#### Possible Answers

Select one answer

-   Stems, spelling corrections, lowercase.

[/] -   Lemmatization, lowercasing, removing unwanted tokens.

-   Removing stop words, leaving in capital words.

-   Strip stop words, word endings and digits.

Text preprocessing practice
===========================

Now, it's your turn to apply the techniques you've learned to help clean up text for better NLP results. You'll need to remove stop words and non-alphabetic characters, lemmatize, and perform a new bag-of-words on your cleaned text.

You start with the same tokens you created in the last exercise: `lower_tokens`. You also have the `Counter` class imported.

Instructions
------------

100 XP

-   Import the `WordNetLemmatizer` class from `nltk.stem`. 
-   Create a list `alpha_only` that contains **only** alphabetic tokens. You can use the `.isalpha()` method to check for this.
-   Create another list called `no_stops` consisting of words from `alpha_only` that **are not** contained in `english_stops`.
-   Initialize a `WordNetLemmatizer` object called `wordnet_lemmatizer` and use its `.lemmatize()` method on the tokens in `no_stops` to create a new list called `lemmatized`.
-   Create a new `Counter` called `bow` with the lemmatized words.
-   Lastly, print the 10 most common tokens.

In [None]:
from nltk.stem import WordNetLemmatizer

# Retain alphabetic words: alpha_only
alpha_only = [t for t in lower_tokens if t.isalpha()]

# Remove all stop words: no_stops
no_stops = [t for t in alpha_only if t not in english_stops]

# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]

# Create the bag-of-words: bow
bow = Counter(lemmatized)

# Print the 10 most common tokens
print(bow.most_common(10))

1\. Introduction to gensim
--------------------------

00:00 - 00:07

In this video, we will get started using a new tool called Gensim.

2\. What is gensim?
-------------------

00:07 - 00:25

**Gensim** is a popular open-source natural language processing library. It uses top academic models to perform complex tasks like building document or word vectors and corpora, and performing topic identification and document comparisons.

3\. What is a word vector?
--------------------------

00:25 - 01:19

You might be wondering what a word or document vector is. Here are some examples in visual form. A word embedding or vector is trained from a larger corpus and is a multi-dimensional representation of a word or document. You can think of it as a multi-dimensional array normally with sparse features (lots of zeros and some ones). With these vectors, we can then see relationships among the words or documents based on how near or far they are and also what similar comparisons we find. For example, in this graphic we can see that the vector operation king minus queen is approximately equal to man minus woman, or that Spain is to Madrid as Italy is to Rome. The deep learning algorithm used to create word vectors has been able to distill this meaning based on how those words are used throughout the text.
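To make the "king minus queen is approximately man minus woman" idea concrete, here is a toy sketch. The 2-D vectors are made-up numbers chosen for illustration; real word vectors are learned from a corpus and usually have many more dimensions.

```python
import math

# Made-up 2-D vectors for illustration only; real embeddings are learned
# from a corpus and typically have many more dimensions.
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.5, 0.8],
    "woman": [0.5, 0.2],
}


def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


royal_offset = subtract(vectors["king"], vectors["queen"])
plain_offset = subtract(vectors["man"], vectors["woman"])

# A similarity close to 1.0 means the two offsets point the same way.
similarity = cosine_similarity(royal_offset, plain_offset)
print(similarity)
```

In this toy setup the two offsets point in the same direction by construction; with trained vectors the match is only approximate.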

4\. Gensim example
------------------

01:19 - 01:46

The graphic we have here is an example of LDA visualization. LDA stands for latent Dirichlet allocation, and it is a statistical model we can apply to text using Gensim for topic analysis and modeling. This graph is just a portion of a blog post written in 2015 using Gensim to analyze US presidential addresses. The article is really neat and you can find the link here.

5\. Creating a gensim dictionary
--------------------------------

01:46 - 02:51

Gensim allows you to build corpora and dictionaries using simple classes and functions. A corpus (or if plural, corpora) is a set of texts used to help perform natural language processing tasks. Here, our documents are a list of strings that look like movie reviews about space or sci-fi films. First we need to do some basic preprocessing. For brevity, we will only tokenize and lowercase. For better results, we would want to apply more of the preprocessing we have learned in this chapter, such as removing punctuation and stop words. Then we can pass the tokenized documents to the Gensim Dictionary class. This will create a mapping with an id for each token. This is the beginning of our corpus. We now can represent whole documents using just a list of their token ids and how often those tokens appear in each document. We can take a look at the tokens and their ids by looking at the token2id attribute, which is a dictionary of all of our tokens and their respective ids in our new dictionary.
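To illustrate the kind of token-to-id mapping described above, here is a plain-Python sketch. It does not use Gensim: the `documents` list, the regular-expression tokenizer, and the id-assignment order are all assumptions for illustration, and the ids Gensim's `Dictionary` assigns to the same tokens may differ.

```python
import re

# Hypothetical documents standing in for the movie reviews on the slide.
documents = [
    "The movie was about a spaceship and aliens.",
    "I really liked the movie!",
    "More space films, please!",
]

# Minimal preprocessing: lowercase and keep word characters only
# (a stand-in for word_tokenize that also drops punctuation).
tokenized_docs = [re.findall(r"\w+", doc.lower()) for doc in documents]

# Assign each previously unseen token the next free integer id.
token2id = {}
for doc in tokenized_docs:
    for token in doc:
        if token not in token2id:
            token2id[token] = len(token2id)

print(token2id["movie"])  # id of one token
print(len(token2id))      # number of distinct tokens
```

The important property is that every distinct token gets exactly one id, so documents can later be described by ids and counts instead of raw strings.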

6\. Creating a gensim corpus
----------------------------

02:51 - 03:59

Using the dictionary we built in the last slide, we can then create a Gensim corpus. This is a bit different from a normal corpus, which is just a collection of documents. Gensim uses a simple bag-of-words model which transforms each document into a bag of words using the token ids and the frequency of each token in the document. Here, we can see that the Gensim corpus is a list of lists, each list item representing one document. Each document is a series of tuples, the first item representing the token id from the dictionary and the second item representing the token frequency in the document. In only a few lines, we have a new bag-of-words model and corpus thanks to Gensim. And unlike our previous Counter-based bag of words, this Gensim model can be easily saved, updated and reused thanks to the extra tools we have available in Gensim. Our dictionary can also be updated with new texts and filtered to keep only words that meet particular thresholds. We are building a more advanced and feature-rich bag-of-words model which can then be used for future exercises.
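The list-of-lists structure described above can be mimicked in plain Python, as in the sketch below. It is not Gensim code: the documents, the tokenizer, and the helper `to_bow` are hypothetical, and the token ids will generally not match the ones Gensim assigns. In Gensim itself, the dictionary's `doc2bow` method performs this conversion.

```python
import re
from collections import Counter

# Hypothetical documents standing in for the movie reviews on the slide.
documents = [
    "The movie was about a spaceship and aliens.",
    "I really liked the movie, really!",
    "More space films, please!",
]

# Lowercase and tokenize (a regex stand-in for word_tokenize).
tokenized_docs = [re.findall(r"\w+", doc.lower()) for doc in documents]

# Token-to-id mapping: each unseen token gets the next free integer id.
token2id = {}
for doc in tokenized_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))


def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# One list of (token_id, frequency) tuples per document.
corpus = [to_bow(doc) for doc in tokenized_docs]
print(corpus[1])
```

In the second document the token "really" appears twice, so its tuple carries a frequency of 2 while every other tuple has a frequency of 1.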

7\. Let's practice!
-------------------

03:59 - 04:04

Now you can get started building your own dictionary with Gensim!

What are word vectors?
======================

What are word vectors and how do they help with NLP?

##### Answer the question

#### Possible Answers

Select one answer

-   They are similar to bags of words, just with numbers. You use them to count how many tokens there are.

-   Word vectors are sparse arrays representing bigrams in the corpora. You can use them to compare two sets of words to one another.

[/] -   Word vectors are multi-dimensional mathematical representations of words created using deep learning methods. They give us insight into relationships between words in a corpus.

-   Word vectors don't actually help NLP and are just hype.