1\. Stop words
--------------

00:00 - 00:11

In every language, there are words that occur too frequently and are not very informative. Sometimes, it is useful to get rid of them before we build a machine learning model.

2\. What are stop words and how to find them?
---------------------------------------------

00:11 - 01:00

Words that occur too frequently and are not very informative are called stop words. But how do we know which words are not informative? First, in every language there is a set of words that most practitioners agree are not useful and should be removed when performing a natural language processing task. For instance, in English the definite and indefinite articles ('the', 'a', 'an'), conjunctions ('and', 'but', 'for'), prepositions ('on', 'in', 'at'), etc. are stop words. Second, depending on the context, we might want to expand the standard set of stop words. For example, in the movie reviews dataset, we might want to exclude words such as 'film', 'movie', 'cinema', etc.

3\. Stop words with word clouds
-------------------------------

01:00 - 01:31

Maybe you recall from a previous video that we built word clouds using movie reviews. Here is an example of two word clouds built from the movie reviews. In the picture on the left, the stop words have not been removed. Words that pop up are 'film' and 'br', the latter being a leftover of the HTML line-break tag. In the cloud on the right, stop words have been removed and now we see words such as 'character', 'see', 'good', 'story'.

4\. Remove stop words from word clouds
--------------------------------------

01:31 - 02:21

How do we remove stop words when creating a word cloud? Let's start by reviewing how we built a word cloud. First, we import the WordCloud class from wordcloud. We also import the default set of STOPWORDS from wordcloud. To create our own collection of stop words, we take a set of the default one. A set is like a list but with unique, non-repeating items. We can add words to the set by calling its update method and passing it a list; update modifies the set in place. We then pass our stop words, called my_stopwords, to the stopwords argument of WordCloud and display the result. So, the only new argument we added here is the one defining the stop words. Everything else stays the same.
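
The set handling described above can be sketched in plain Python. The three-word `default_stopwords` set is a hypothetical stand-in for the much larger `STOPWORDS` set shipped with wordcloud, so that the snippet runs without the package. The point it illustrates is that `update` changes the set in place and returns `None`, so its return value should not be assigned.

```python
# Hypothetical stand-in for wordcloud's STOPWORDS (the real set is much larger)
default_stopwords = {'the', 'a', 'and'}

# Copy the defaults so the original set is left untouched
my_stopwords = set(default_stopwords)

# update() adds the new words in place and returns None;
# 'the' is already present, so it is not duplicated
result = my_stopwords.update(['movie', 'film', 'the'])

print(result)                     # None
print(sorted(my_stopwords))       # ['a', 'and', 'film', 'movie', 'the']
print(sorted(default_stopwords))  # ['a', 'and', 'the']
```

With wordcloud installed, a set built this way would be passed as `WordCloud(stopwords=my_stopwords)`.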

5\. Stop words with BOW
-----------------------

02:21 - 03:34

Removing non-informative words when we are building a BOW transformation can also be very useful, and it is easily incorporated into the CountVectorizer. First, we import the default set of English stop words, ENGLISH_STOP_WORDS, from the same feature_extraction.text module of scikit-learn. Let's assume we want to enrich this default set with movie-specific words. To do that, we call the union method on the default set. Remember that the union of two sets A and B consists of all elements of A and all elements of B, with no element repeated. In our case, the union adds each new word to the default stop words if it is not already there. To use the constructed set, we pass it to the stop_words argument of the CountVectorizer. Everything else stays the same and should look pretty familiar by now. One important thing to note is that removing stop words reduces the size of the vocabulary we build with a BOW or another approach.
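
As a rough illustration of both points (the union of sets and the smaller vocabulary), here is a sketch that uses only the standard library. The five-word `default_stop_words` set, the two reviews and the `build_vocabulary` helper are all made up for this example; the helper is a simple whitespace-based counter, not scikit-learn's `CountVectorizer`.

```python
from collections import Counter

# Hypothetical stand-in for ENGLISH_STOP_WORDS (a frozenset in scikit-learn)
default_stop_words = frozenset({'the', 'a', 'is', 'and', 'this'})

# union() returns a new set; 'the' is already present and is not duplicated
my_stop_words = default_stop_words.union(['movie', 'film', 'the'])

reviews = ['This movie is a masterpiece',
           'The film is slow and the plot is thin']

def build_vocabulary(docs, stop_words=frozenset()):
    """Count lower-cased, whitespace-separated tokens that are not stop words."""
    counts = Counter()
    for doc in docs:
        counts.update(tok for tok in doc.lower().split() if tok not in stop_words)
    return counts

full_vocab = build_vocabulary(reviews)
reduced_vocab = build_vocabulary(reviews, my_stop_words)

print(len(full_vocab), len(reduced_vocab))  # 11 4
print(sorted(reduced_vocab))                # ['masterpiece', 'plot', 'slow', 'thin']
```

With scikit-learn, the same collection would go into the `stop_words` argument of `CountVectorizer`; recent releases appear to expect a list there, so wrapping it as `list(my_stop_words)` is the safer choice.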

6\. Let's practice!
-------------------

03:34 - 03:40

Let's solve some exercises where you will practice removing stop words!

Word cloud of tweets
====================

Your task in this exercise is to plot a word cloud using a sample of Twitter data expressing customers' sentiments about airlines. A string `text_tweet` has been created for you; it contains the messages of 1,000 customers shared on Twitter.

In the first step, you are asked to build the word cloud without removing the stop words, and in the second step to build the same cloud after you have removed the stop words.

Feel free to familiarize yourself with the `text_tweet` string.

Instructions 1/2
----------------

-   Import the word cloud function and package.
-   Create and generate the word cloud, using the `text_tweet` string.

In [None]:
# Import the word cloud function and the plotting library
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Create and generate a word cloud image
my_cloud = WordCloud(background_color='white').generate(text_tweet)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")

# Don't forget to show the final image
plt.show()

Instructions 2/2
----------------

-   Define the default list of stop words and update it.
-   Specify the stop words argument in the `WordCloud` function.

In [None]:
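# Import the word cloud function, stop words list and plotting library
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Define and update the list of stopwords
# (update() changes a set in place and returns None, so copy the defaults first)
my_stop_words = set(STOPWORDS)
my_stop_words.update(['airline', 'airplane'])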

# Create and generate a word cloud image
my_cloud = WordCloud(stopwords=my_stop_words).generate(text_tweet)

# Display the generated wordcloud image
plt.imshow(my_cloud, interpolation='bilinear') 
plt.axis("off")
# Don't forget to show the final image
plt.show()
     

Airline sentiment with stop words
=================================

You are given a dataset, called `tweets`, which contains customers' reviews and sentiments about airlines. It consists of two columns: `airline_sentiment` and `text` where the sentiment can be positive, negative or neutral, and the `text` is the text of the tweet.

In this exercise, you will create a BOW representation but will account for the stop words. Remember that stop words are not informative and you might want to remove them. That will result in a smaller vocabulary and eventually, fewer features. Keep in mind that we can enrich a default list of stop words with ones that are specific to our context.

Instructions
------------

-   Import the default list of English stop words.
-   Update the default list of stop words with the given list `['airline', 'airlines', '@']` to create `my_stop_words`. 
-   Specify the stop words argument in the vectorizer.

In [None]:
# Import the stop words
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
import pandas as pd

# Define the stop words (as a list, which recent scikit-learn versions expect)
my_stop_words = list(ENGLISH_STOP_WORDS.union(['airline', 'airlines', '@']))

# Build and fit the vectorizer
vect = CountVectorizer(stop_words=my_stop_words)
vect.fit(tweets.text)

# Create the bow representation
X_review = vect.transform(tweets.text)
# Create the data frame (get_feature_names_out() replaces the removed get_feature_names())
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

Multiple text columns
=====================

In this exercise, you will continue working with the airline Twitter data. A dataset `tweets` has been imported for you. 

In some situations, you might have more than one text column in a dataset and you might want to create a numeric representation for each of the text columns. Here, besides the `text` column, which contains the body of the tweet, there is a second text column, called `negativereason`. It contains the reason the customer left a negative review. 

Your task is to build BOW representations for both columns and specify the required stop words.

Instructions
------------

-   Import the vectorizer and the default list of English stop words.
-   Update the default list of English stop words to create `my_stop_words`.
-   Set the stop words argument of the first vectorizer to the updated stop words, and that of the second vectorizer to the default English stop words.

In [None]:
# Import the vectorizer and default English stop words list
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Define the stop words (as lists, which recent scikit-learn versions expect)
my_stop_words = list(ENGLISH_STOP_WORDS.union(['airline', 'airlines', '@', 'am', 'pm']))

# Build and fit the vectorizers
vect1 = CountVectorizer(stop_words=my_stop_words)
vect2 = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS))
vect1.fit(tweets.text)
vect2.fit(tweets.negativereason)

# Print the last 15 features from the first, and all from second vectorizer
print(vect1.get_feature_names_out()[-15:])
print(vect2.get_feature_names_out())

1\. Capturing a token pattern
-----------------------------

00:00 - 00:14

You may have noticed while working with the airline sentiment data from Twitter that the text contains many digits and other characters. Sometimes we may want to exclude them from our numeric representation.

2\. String operators and comparisons
------------------------------------

00:14 - 00:49

If we work with a string, how can we make sure we extract only certain characters? There are a few useful string methods we will review here: .isalpha() returns True if a string is composed only of letters and False otherwise; .isdigit() returns True if a string is composed only of digits; and finally .isalnum() returns True if a string is composed only of alphanumeric characters, i.e. letters and digits.
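
A quick way to get a feel for these three methods is to run them over a handful of tokens. The tokens below are made up for illustration.

```python
tokens = ['hello', '2024', 'abc123', 'well-known', '#', '']

# For each token, record (isalpha, isdigit, isalnum)
checks = {tok: (tok.isalpha(), tok.isdigit(), tok.isalnum()) for tok in tokens}

for tok, (alpha, digit, alnum) in checks.items():
    print(repr(tok), 'isalpha:', alpha, 'isdigit:', digit, 'isalnum:', alnum)
```

Note that a token containing a hyphen or a symbol fails all three checks, and so does the empty string.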

3\. String operators with list comprehension
--------------------------------------------

00:49 - 02:01

String operators can improve some of the features we created earlier. As a reminder, in a previous video we used a list comprehension to iterate over each review of the product reviews dataset and create word tokens from each review. We can adjust our original code. If we want to retain only tokens consisting of letters, for example, we can use the .isalpha() method in a second list comprehension. Since the result of the first list comprehension is a list of lists, we first need to iterate over the items in each inner list, filtering out those tokens that do not consist solely of letters. This is what happens in the first part of the list comprehension, enclosed in the inner brackets. In the second part, we iterate over the lists, basically saying that we want to perform this filtering across all lists in the word_tokens list. When we compare the length of the first item of the word_tokens and the cleaned_tokens lists, we see that the filtering decreased the number of tokens, as we might expect.
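
The nested comprehension can be sketched as follows. To keep the snippet free of dependencies, a plain whitespace split stands in for `word_tokenize`, and the two reviews are invented for the example.

```python
reviews = ['Great product , 5 stars !', 'Broke after 2 days']

# Stand-in for word_tokenize: split each review on whitespace
word_tokens = [review.split() for review in reviews]

# Inner comprehension: keep only purely alphabetic tokens of one review
# Outer comprehension: repeat that for every review
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]

print(len(word_tokens[0]), len(cleaned_tokens[0]))  # 6 3
print(cleaned_tokens)
```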

4\. Regular expressions
-----------------------

02:01 - 02:58

Regular expressions are a standard way to extract certain characters from a string. Python has a built-in package, called re, which allows you to work with regular expressions. We will not cover regular expressions in depth here, but here is a quick reminder of the syntax. We import the re package. Then imagine we have the string #Wonderfulday and we want to extract a hash (#) followed by any letter, capital or small. One standard way to do this is to call the search function, passing it the regular expression and our string. In our case, the pattern starts with a # and is followed by either an upper or lower case letter. When we print the result, we see that it is a match object, showing where the match is located (in our case the span is (0, 2), so the match is two characters long) and also the exact characters that were matched.
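
The search described above might look like this. The character class `[A-Za-z]` is one possible way of writing "any upper or lower case letter"; it is an assumption about the pattern, not necessarily the one shown in the video.

```python
import re

my_string = '#Wonderfulday'

# '#' followed by exactly one upper- or lower-case ASCII letter
match = re.search(r'#[A-Za-z]', my_string)

print(match)          # the match object, including span and matched text
print(match.span())   # (0, 2)
print(match.group())  # #W
```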

5\. Token pattern with a BOW
----------------------------

02:58 - 04:01

Our familiar CountVectorizer takes a regular expression as an argument. The default pattern matches words that consist of at least two letters or numbers (\w) and are separated by word boundaries (\b). It ignores single-letter words and splits words such as 'don't' and 'haven't'. If we are fine with this default pattern, we don't need to change any arguments in the CountVectorizer. If we want to change it, we can specify the token_pattern argument. If we want the vectorizer to ignore digits and other characters and only consider words of two or more letters, we can use a token pattern such as r'\b[^\d\W][^\d\W]+\b'. There are multiple ways to specify this, so this pattern is not the only correct or best one. Feel free to experiment with it. Note, however, that we need to add an 'r' before the regular expression itself, which makes it a raw string so that the backslashes reach the regular expression engine unchanged.
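
The effect of the two patterns can be previewed with the re module alone. This sketch mimics the vectorizer (lower-casing the text, then collecting every match of the pattern) rather than calling it, so treat it as an approximation. The sample tweet is made up, and the letters-only pattern is the one used in the exercises.

```python
import re

text = "@united I don't like flight UA123, it was 2 hours late!"

# Default CountVectorizer pattern: two or more word characters
default_pattern = r'(?u)\b\w\w+\b'
# Letters only: two or more word characters that are not digits
letters_pattern = r'\b[^\d\W][^\d\W]+\b'

default_tokens = re.findall(default_pattern, text.lower())
letter_tokens = re.findall(letters_pattern, text.lower())

print(default_tokens)
# ['united', 'don', 'like', 'flight', 'ua123', 'it', 'was', 'hours', 'late']
print(letter_tokens)
# ['united', 'don', 'like', 'flight', 'it', 'was', 'hours', 'late']
```

Both patterns drop the single-character tokens 'i', 't' and '2'; only the second also drops 'ua123', because it contains digits.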

6\. Let's practice!
-------------------

04:01 - 04:05

Let's go to the exercises where you can apply the things you learned here!

Specify the token pattern
=========================

In this exercise, you will work with the `text` column of the `tweets` dataset. Your task is to vectorize this column using `CountVectorizer`. You will apply different patterns of tokens in the vectorizer. Remember that by specifying the token pattern, you can filter out characters.

The `CountVectorizer` has been imported for you.

Instructions 1/2
----------------

-   Build a vectorizer from the `text` column, specifying the pattern of tokens to be equal to `r'\b[^\d\W][^\d\W]+\b'`.

In [None]:
# Build and fit the vectorizer
vect = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(tweets.text)
vect_text = vect.transform(tweets.text)
print('Length of vectorizer: ', len(vect.get_feature_names_out()))
     

Instructions 2/2
----------------

-   Build a vectorizer from the `text` column using the default values of the function's arguments. 
-   Build a second vectorizer, specifying the pattern of tokens to be equal to `r'\b[^\d\W][^\d\W]+\b'`.

In [None]:
# Build the first vectorizer
vect1 = CountVectorizer().fit(tweets.text)
vect1.transform(tweets.text)

# Build the second vectorizer
vect2 = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(tweets.text)
vect2.transform(tweets.text)

# Print out the length of each vectorizer
print('Length of vectorizer 1: ', len(vect1.get_feature_names_out()))
print('Length of vectorizer 2: ', len(vect2.get_feature_names_out()))

String operators with the Twitter data
======================================

You continue working with the `tweets` data where the `text` column stores the content of each tweet. 

Your task is to turn the `text` column into a list of tokens. Then, using string operators, remove every token that is not purely alphabetic from the created list of tokens.

Instructions
------------

-   Import the word tokenizing function.
-   Create word tokens from each tweet.
-   Filter out all non-alphabetic tokens from the created list, i.e. retain only tokens made up of letters.

In [None]:
# Import the word tokenizing package
from nltk import word_tokenize

# Tokenize the text column
word_tokens = [word_tokenize(review) for review in tweets.text]
print('Original tokens: ', word_tokens[0])

# Filter out non-letter characters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
print('Cleaned tokens: ', cleaned_tokens[0])

More string operators and Twitter
=================================

In this exercise, you will apply different string operators to three strings, selected from the `tweets` dataset. A `tweets_list` has been created for you.

You need to construct three new lists by applying different string operators:

-   a list retaining only letters
-   a list retaining only alphanumeric characters (letters and digits)
-   a list retaining only digits 

The required functions have been imported for you from `nltk`.

Instructions
------------

-   Create a list of the tokens from `tweets_list`.
-   In the list `letters` remove all digits and other characters, i.e. keep only letters.
-   Retain alphanumeric characters but remove all other characters in `let_digits`.
-   Create `digits` by removing letters and characters and keeping only numbers.

In [None]:
# Create a list of lists, containing the tokens from tweets_list
tokens = [word_tokenize(item) for item in tweets_list]

# Remove characters and digits, i.e. retain only letters
letters = [[word for word in item if word.isalpha()] for item in tokens]
# Remove characters, i.e. retain only letters and digits
let_digits = [[word for word in item if word.isalnum()] for item in tokens]
# Remove letters and characters, retain only digits
digits = [[word for word in item if word.isdigit()] for item in tokens]

# Print the last item in each list
print('Last item in alphabetic list: ', letters[2])
print('Last item in list of alphanumerics: ', let_digits[2])
print('Last item in the list of digits: ', digits[2])

1\. Stemming and lemmatization
------------------------------

00:00 - 00:17

In a language, words are often derived from other words, meaning words can share the same root. When we create a numeric transformation of a text feature, we might want to strip a word down to its root. This is the topic of this lesson.

2\. What is stemming?
---------------------

00:17 - 00:53

This process is called stemming. More formally, stemming can be defined as the transformation of words to their root forms, even if the stem itself is not a valid word in the language. For example, staying, stays, stayed will be mapped to the root 'stay', and house, houses, housing will be mapped to the root 'hous'. In general, stemming will tend to chop off suffixes such as '-ed', '-ing', '-er', as well as plural or possessive forms.
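
To make the idea concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm that nltk implements: the `toy_stem` function, its suffix list and its minimum-length rule are assumptions chosen only so that the example words above come out as described.

```python
def toy_stem(word):
    """Strip one common suffix; a crude illustration, not the Porter algorithm."""
    for suffix in ('ing', 'ed', 'es', 's', 'e'):
        # Keep at least three characters so that short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['staying', 'stays', 'stayed', 'house', 'houses', 'housing']
stems = [toy_stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, '->', stem)
```

The first three words map to 'stay', a valid word, while the last three map to 'hous', which is not.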

3\. What is lemmatization?
--------------------------

00:53 - 01:15

Lemmatization is quite a similar process to stemming, with the main difference that with lemmatization, the resulting roots are valid words in the language. Going back to our examples of words derived from 'stay', lemmatization reduces them to 'stay'; and words derived from 'house' are reduced to the noun 'house'.

4\. Stemming vs. lemmatization
------------------------------

01:15 - 02:02

You might wonder when to use stemming and when to use lemmatization. The main difference is in the obtained roots. With lemmatization they are actual words and with stemming they might not be. So if it's important in your problem to retain words, not only roots, lemmatization would be more suitable. However, if you use nltk, which is what we will use in this course, stemming follows a rule-based algorithm, which makes it faster than the lemmatization process in nltk. Furthermore, lemmatization depends on knowing the part of speech of the word you want to lemmatize, for example whether we want to transform a noun, a verb, an adjective, etc.
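
The dependence on the part of speech can be illustrated with a toy lookup. The `LEMMAS` table and the `toy_lemmatize` function are hypothetical, hand-written stand-ins for a lexical database; they are not nltk's lemmatizer.

```python
# Hypothetical hand-written lookup: (word, part of speech) -> lemma
LEMMAS = {
    ('meeting', 'n'): 'meeting',  # as a noun, 'meeting' is already a lemma
    ('meeting', 'v'): 'meet',     # as a verb form, it reduces to 'meet'
    ('houses', 'n'): 'house',
}

def toy_lemmatize(word, pos='n'):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize('meeting'))           # meeting
print(toy_lemmatize('meeting', pos='v'))  # meet
print(toy_lemmatize('houses'))            # house
```

The same surface form yields different lemmas depending on the part of speech, which is why a lemmatizer needs that information and a stemmer does not.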

5\. Stemming of strings
-----------------------

02:02 - 02:31

One popular stemmer is the PorterStemmer in the nltk.stem package. The PorterStemmer is not the only stemmer in nltk but it's quite fast and easy to use, so it's often a standard choice. We instantiate the PorterStemmer and store it under the name porter. We can then call porter.stem on a string, for example, 'wonderful'. The result is 'wonder'.

6\. Non-English stemmers
------------------------

02:31 - 03:01

Stemming is possible in other languages as well, such as Danish, Dutch, French, Spanish, German, etc. To use stemmers for languages other than English, we use the SnowballStemmer class. We specify the language we want when creating the stemmer. Then we apply the stem method to our string. For example, we have imported a Dutch stemmer and fed it a Dutch verb. The result is the root of the verb.

7\. How to stem a sentence?
---------------------------

03:01 - 03:30

If you apply the PorterStemmer to a whole sentence, the words in it are not stemmed: our 'Today is a wonderful day!' sentence comes back essentially unchanged. We need to stem each word in the sentence separately. Therefore, as a first step, we transform the sentence into tokens using the familiar word_tokenize function. In the second step, we apply the stem method to each token, using a list comprehension.
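
The two steps can be sketched without nltk. Here `toy_stem` is a hypothetical stand-in for `porter.stem` that only lower-cases a string and strips a trailing 'ful', and a small regular expression stands in for `word_tokenize`; both are assumptions made so that the snippet runs on its own.

```python
import re

def toy_stem(word):
    """Crude stand-in for porter.stem: lower-case and strip a trailing 'ful'."""
    word = word.lower()
    return word[:-3] if word.endswith('ful') else word

sentence = 'Today is a wonderful day!'

# Applied to the whole sentence: the string ends in 'day!', so nothing is stripped
whole = toy_stem(sentence)
print(whole)

# Step 1: split the sentence into word and punctuation tokens
tokens = re.findall(r"\w+|[^\w\s]", sentence)
# Step 2: stem each token separately
stemmed = [toy_stem(token) for token in tokens]
print(stemmed)  # ['today', 'is', 'a', 'wonder', 'day', '!']
```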

8\. Lemmatization of a string
-----------------------------

03:30 - 04:17

The lemmatization of strings is similar to stemming. We import the WordNetLemmatizer from the nltk.stem package. It uses the WordNet database to look up lemmas of words. We instantiate the WordNetLemmatizer and store it under the name WNlemmatizer. We can then call WNlemmatizer.lemmatize() on 'wonderful'. Note that we have specified a part of speech, given by the 'pos' argument. The default pos is noun, or 'n'. Here we specify an adjective, which is why pos = 'a'. The result is 'wonderful'. If you recall, stemming returned 'wonder' as a result.

9\. Let's practice!
-------------------

04:17 - 04:25

Let's solve some exercises and reinforce the concepts related to stemming and lemmatization.

Stems and lemmas from GoT
=========================

In this exercise, you are given a couple of sentences from George R.R. Martin's **Game of Thrones**. Your task is to create stems and lemmas from the given `GoT` string.

Remember that stems reduce a word to its root whereas lemmas produce an actual word. Speed can also differ significantly between the methods, with stemming being much faster. In Steps 2 and 3, pay attention to the total time it takes to perform each operation. We're using the `time.time()` function to measure the time it takes to perform stemming and lemmatization.

Instructions 1/3
----------------

-   Import the stemming and lemmatization functions.
-   Build a list of tokens from the `GoT` string.

In [None]:
# Import the required packages from nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize

porter = PorterStemmer()
WNlemmatizer = WordNetLemmatizer()

# Tokenize the GoT string
tokens = word_tokenize(GoT) 

Instructions 2/3
----------------

-   Using list comprehension and the `porter` stemmer you created, create the `stemmed_tokens` list.

In [None]:
import time

# Log the start time
start_time = time.time()

# Build a stemmed list
stemmed_tokens = [porter.stem(token) for token in tokens] 

# Log the end time
end_time = time.time()

print('Time taken for stemming in seconds: ', end_time - start_time)
print('Stemmed tokens: ', stemmed_tokens) 

Instructions 3/3
----------------

-   Using list comprehension and the `WNlemmatizer` you imported, create the `lem_tokens` list.

In [None]:
import time

# Log the start time
start_time = time.time()

# Build a lemmatized list
lem_tokens = [WNlemmatizer.lemmatize(token) for token in tokens]

# Log the end time
end_time = time.time()

print('Time taken for lemmatizing in seconds: ', end_time - start_time)
print('Lemmatized tokens: ', lem_tokens) 

Stem Spanish reviews
====================

You will recall that in a previous chapter we used a language detection package to determine the language of different Amazon product reviews. In this exercise, you will first detect the languages in the `non_english_reviews`. The reviews are in multiple languages but you will select ONLY those in Spanish. Feel free to go back to the video discussing foreign language detection if you have forgotten some of the concepts. 

In the second step, you will create word tokens from the Spanish reviews and stem them using a Snowball stemmer for the Spanish language. The language detection package is not perfect, unfortunately, so the detected language may sometimes be incorrect.

Instructions 1/2
----------------

-   Import the `langdetect` package.
-   Iterate over the rows of `non_english_reviews` using the `len()` and `range()` functions.
-   Use `detect_langs()` to detect the language of each review in the `for` loop.

In [None]:
# Import the language detection package
import langdetect

# Loop over the rows of the dataset and append  
languages = [] 
for i in range(len(non_english_reviews)):
    languages.append(langdetect.detect_langs(non_english_reviews.iloc[i, 1]))

# Clean the list by splitting     
languages = [str(lang).split(':')[0][1:] for lang in languages]
# Assign the list to a new feature 
non_english_reviews['language'] = languages

# Select the Spanish ones
filtered_reviews = non_english_reviews[non_english_reviews.language == 'es']

Instructions 2/2
----------------

-   Import the `SnowballStemmer` from the respective package.
-   Create word tokens from the `review` column of the `filtered_reviews` from the previous step.
-   Use the Spanish stemmer you imported to stem the created list of tokens.

In [None]:
# Import the required packages
from nltk.stem.snowball import SnowballStemmer
from nltk import word_tokenize

# Import the Spanish SnowballStemmer
SpanishStemmer = SnowballStemmer("spanish")

# Create a list of tokens
tokens = [word_tokenize(review) for review in filtered_reviews.review]
# Stem the list of tokens
stemmed_tokens = [[SpanishStemmer.stem(word) for word in token] for token in tokens]

# Print the first item of the stemmed tokens
print(stemmed_tokens[0])

Stems from tweets
=================

In this exercise, you will work with an array called `tweets`. It contains the text of the airline sentiment data collected from Twitter. 

Your task is to work with this array and transform it into a list of tokens using list comprehension. After that, iterate over the list of tokens and create a stem out of each token. Remember that list comprehensions are a one-line alternative to **for** loops.

Instructions
------------

-   Import the function we used to transform strings into stems. 
-   Call the Porter stemmer function you just imported.
-   Using a list comprehension, create the list `tokens`. It should contain all the word tokens from the `tweets` array.
-   Iterate over the `tokens` list and apply the stemming function to each item in the list.

In [None]:
# Import the function to perform stemming
from nltk.stem import PorterStemmer
from nltk import word_tokenize

# Call the stemmer
porter = PorterStemmer()

# Transform the array of tweets to tokens
tokens = [word_tokenize(tweet) for tweet in tweets]
# Stem the list of tokens
stemmed_tokens = [[porter.stem(word) for word in tweet] for tweet in tokens] 
# Print the first element of the list
print(stemmed_tokens[0])