1\. Bag-of-words
----------------

00:00 - 00:24

Welcome to the next chapter of this course! We proceed on our journey by embarking on the first step in performing a sentiment analysis task: transforming our text data to numeric form. Why do we need to do that? A machine learning model cannot work with the text data directly, but rather with numeric features we create from the data.

2\. What is a bag-of-words (BOW) ?
----------------------------------

00:24 - 00:47

We start with a basic and crude, but often quite useful method, called bag-of-words (BOW). A bag-of-words approach describes the occurrence, or frequency, of words within a document or a collection of documents (called a corpus). It basically comes down to building a vocabulary of all the words occurring in the document and keeping track of their frequencies.

• Describes the occurrence of words within a document or a collection of documents (corpus)

• Builds a vocabulary of the words and a measure of their presence

3\. Amazon product reviews
--------------------------

00:47 - 01:10

Before we continue with the discussion of BOW, we will introduce the data we will use throughout the chapter, namely reviews of Amazon products. The dataset consists of two columns: the first contains the score, which is 1 if positive and 0 if negative; the second contains the actual review of the product.

|   | score | review |
|---|-------|--------|
| 0 | 1 | Stuning even for the non-gamer. This sound tr... |
| 1 | 1 | The best soundtrack ever to anything.: I'm re... |
| 2 | 1 | Amazing!: This soundtrack is my favorite musi... |
| 3 | 1 | Excellent Soundtrack: I truly like this sound... |
| 4 | 1 | Remember, Pull Your Jaw Off The Floor After H... |
| 5 | 1 | an absolute masterpiece: I am quite sure any ... |
| 6 | 0 | Buyer beware: This is a self-published book, ... |
| 7 | 1 | Glorious story: I loved Whisper of the wicked... |
| 8 | 1 | A FIVE STAR BOOK: I just finished reading Whi... |
| 9 | 1 | Whispers of the Wicked Saints: This was a eas... |



4\. Sentiment analysis with BOW: Example
----------------------------------------

01:10 - 02:02

Let's see how BOW would work applied to an example review. Imagine you have the following string: "This is the best book ever. I loved the book and highly recommend it." The goal of a BOW approach would be to build the following dictionary-like output: 'This', occurs once in our string, so it has a count of 1, 'is' occurs once, 'the' occurs two times and so on. One thing to note is that we lose the word order and grammar rules, that's why this approach is called a 'bag' of words, resembling dropping a bunch of items in a bag and losing any sense of their order. This sounds straightforward but sometimes deciding how to build the vocabulary can be complex. We discuss some of the trade-offs we need to consider in later chapters.

This is the best book ever. I loved the book and highly recommend it!!!

```python
{
    'This': 1,
    'is': 1,
    'the': 2,
    'best': 1,
    'book': 2,
    'ever': 1, 
    'I': 1,
    'loved': 1,
    'and': 1,
    'highly': 1,
    'recommend': 1,
    'it': 1
}
```

• Lose word order and grammar rules!
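
The counting step above can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming a simple regular expression as the tokenizer; it is not how any particular library builds its vocabulary.

```python
import re
from collections import Counter

review = "This is the best book ever. I loved the book and highly recommend it."

# Split into word tokens; punctuation is dropped and case is kept as-is
tokens = re.findall(r"\w+", review)

# Count how often each token occurs; the word order is no longer recoverable
bow = Counter(tokens)
print(dict(bow))
```

The resulting counts match the dictionary shown above: 'the' and 'book' occur twice, every other word once.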


5\. BOW end result
------------------

02:02 - 02:18

When we transform the text column with a BOW, the end result looks something like the table that we see: each column is a word (also called a token), each row is a review, and each cell shows how many times the word occurs in the respective review.



6\. CountVectorizer function
----------------------------

02:18 - 03:13

How do we execute a BOW process in Python? The simplest way is to use the CountVectorizer class, which we import from sklearn.feature_extraction.text. For the moment we leave the default options of the CountVectorizer, except for the max_features argument, which keeps only the features with the highest term frequency; with max_features=1000, it will pick the 1000 most frequent words across the corpus of reviews. We sometimes need to do that for memory's sake. We then call the `fit()` method of the CountVectorizer on our text column, which builds the vocabulary. To create a BOW representation, we call the transform() method, applied again to our text column.

| | 10 | 100 | 12 | 15 | 1984 | 20 | 30 | 40 | 451 | 50 | ... | wrong | wrote | year | years | yes | yet | you | young | your | yourself |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 |
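
To make the two steps concrete, here is a minimal plain-Python sketch of what fitting and transforming amount to. The two short documents are hypothetical, and the tokenizer (lowercasing, tokens of at least two word characters) is an assumption chosen to mirror the CountVectorizer defaults; the real class does considerably more.

```python
import re

docs = ["I loved the book", "The book was the worst"]  # hypothetical mini-corpus

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# "fit": collect the vocabulary and give each word a column index
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
column = {word: j for j, word in enumerate(vocabulary)}

# "transform": one row per document, one count per vocabulary word
matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for tok in tokenize(doc):
        row[column[tok]] += 1
    matrix.append(row)

print(vocabulary)
print(matrix)
```

Each row of `matrix` lines up with one document and each column with one vocabulary word, which is the same layout as the table above.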

7\. CountVectorizer output
--------------------------

03:13 - 03:27

The result is a sparse matrix. A sparse matrix stores only the entries that are non-zero. Its rows correspond to the rows of the dataset, and its columns to the BOW vocabulary.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
X = vect.transform(data.review)
```
### CountVectorizer output
```python
X

<10000x1000 sparse matrix of type '<class 'numpy.int64'>'
    with 406668 stored elements in Compressed Sparse Row format>
```
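
The idea of storing only non-zero entries can be illustrated with a small, hypothetical matrix. The sketch below uses a dictionary keyed by (row, column); this is a simplification for illustration and not the Compressed Sparse Row layout reported in the output above.

```python
# Hypothetical dense BOW matrix: 3 documents x 5 vocabulary words
dense = [
    [0, 0, 2, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 3],
]

# Keep only the non-zero entries, keyed by (row, column)
sparse = {
    (i, j): value
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0
}

total_cells = len(dense) * len(dense[0])
print(sparse)
print(f"{len(sparse)} stored elements out of {total_cells} cells")
```

With a vocabulary of 1000 words and short reviews, most cells are zero, so storing only the non-zero ones saves a lot of memory.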

8\. Transforming the vectorizer
-------------------------------

03:27 - 03:53

To look at the actual contents of a sparse matrix, we need to perform an additional step to transform it back to a 'dense' NumPy array, using the .toarray() method. We can build a pandas DataFrame from the array, where the column names are obtained from the vectorizer's `.get_feature_names_out()` method (called `.get_feature_names()` in scikit-learn releases before 1.0). This returns an array in which every entry corresponds to one feature.

```python
# Transform to an array 
my_array = X.toarray()

# Transform back to a dataframe, assign column names
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())
```


9\. Let's practice!
-------------------

03:53 - 03:59

That was our introduction to BOW! Let's apply what we've learned in the exercises.

Which statement about BOW is true?
==================================

You were introduced to a bag-of-words (BOW) and some of its characteristics in the video. Which of the following statements about BOW **is** true?

##### Answer the question

#### Possible Answers

Select one answer

-   Bag-of-words preserves the word order and grammar rules.

-   Bag-of-words describes the order and frequency of words or tokens within a corpus of documents.

[/] -   Bag-of-words is a simple but effective method to build a vocabulary of all the words occurring in a document.

-   Bag-of-words can only be applied to a large document, not to shorter documents or single sentences.

Your first BOW
==============

A bag-of-words is an approach to transform text to numeric form. 

In this exercise, you will apply a BOW to the `annak` list before moving on to a larger dataset in the next exercise. 

Your task will be to work with this list and apply a BOW using the `CountVectorizer()`. This transformation is your first step in being able to understand the sentiment of a text. Pay attention to words which might carry a strong sentiment. 

Remember that the output of a `CountVectorizer()` is a sparse matrix, which stores only the entries that are non-zero. To look at the actual content of this matrix, we convert it to a dense array using the `.toarray()` method.

Note that in this case you don't need to specify the `max_features` argument because the text is short.

Instructions
------------

-   Import the count vectorizer function from `sklearn.feature_extraction.text`.
-   Build and fit the vectorizer on the small dataset.
-   Create the BOW representation with name `anna_bow` by calling the `transform()` method.
-   Print the BOW result as a dense array.

```python
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())
```

BOW using product reviews
=========================

You practiced a BOW on a small dataset. Now you will apply it to a sample of Amazon product reviews. The data has been imported for you and is called `reviews`. It contains two columns. The first one is called `score` and it is `0` when the review is negative, and `1` when it is positive. The second column is called `review` and it contains the text of the review that a customer wrote. Feel free to explore the data in the IPython Shell.

Your task is to build a BOW vocabulary, using the `review` column.

Remember that we can call the `.get_feature_names_out()` method on the vectorizer (named `.get_feature_names()` in scikit-learn releases before 1.0) to obtain all the vocabulary elements.

Instructions
------------

-   Create a CountVectorizer object, specifying the maximum number of features. 
-   Fit the vectorizer. 
-   Transform the fitted vectorizer.
-   Create a DataFrame where you transform the sparse matrix to a dense array and make sure to correctly specify the names of columns.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
```

1\. Getting granular with n-grams
---------------------------------

00:00 - 00:07

You might remember from an earlier video that with a bag-of-words approach the word order is discarded.



2\. Context matters
-------------------

00:07 - 00:31

Imagine you have a sentence such as 'I am happy, not sad' and another one 'I am sad, not happy'. They will have the same representation with a BOW, even though the meanings are inverted. In this case, putting NOT in front of the word (which is also called negation) changes the whole meaning and demonstrates why context is important.

#### I am happy, not sad.

#### I am sad, not happy.

```
Putting 'not' in front of a word (negation) is one example of how context matters.
```
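
A quick check makes the point concrete. The sketch below assumes a simple lowercase, regex-based tokenizer (an illustration, not a library default) and compares the word counts of the two sentences.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase, keep word characters only, and count each token
    return Counter(re.findall(r"\w+", text.lower()))

first = bag_of_words("I am happy, not sad.")
second = bag_of_words("I am sad, not happy.")

# Both sentences contain exactly the same words with the same counts
print(first == second)
```

The comparison is true: once the order is thrown away, the two sentences are indistinguishable.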


3\. Capturing context with a BOW
--------------------------------

00:31 - 01:01

There is a way to capture context when using a BOW: for example, by considering pairs or triples of tokens that appear next to each other. Let's define some terms. Single tokens are what we have used so far and are also called 'unigrams'. Bigrams are pairs of tokens, trigrams are triples of tokens, and a sequence of n tokens is called an 'n-gram'.

```
Unigrams: single tokens
Bigrams: pairs of tokens
Trigrams: triples of tokens
n-grams: sequences of n tokens
```


4\. Capturing context with BOW
------------------------------

01:01 - 01:22

Let's illustrate that with an example. Take the sentence 'The weather today is wonderful' and split it using unigrams, bigrams and trigrams. With unigrams we have single tokens, with bigrams, pairs of neighboring tokens, with trigrams: triples of neighboring tokens.

```
The weather today is wonderful.
Unigrams: {The, weather, today, is, wonderful}
Bigrams: {The weather, weather today, today is, is wonderful}
Trigrams: {The weather today, weather today is, today is wonderful}
```
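
The split can be reproduced with a short helper. The function name `ngrams` is hypothetical and the sentence is tokenized by whitespace only, which is enough for this example.

```python
def ngrams(tokens, n):
    """Return all runs of n neighboring tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The weather today is wonderful".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with five tokens yields five unigrams, four bigrams and three trigrams.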

5\. n-grams with the CountVectorizer
------------------------------------

01:22 - 01:56

It is easy to implement n-grams with the CountVectorizer. To specify the n-grams, we use the ngram_range parameter. The ngram_range is a tuple where the first element is the minimum and the second element is the maximum length of the token sequences. For instance, ngram_range=(1, 1) means we will use only unigrams, (1, 2) means we will use unigrams and bigrams and so on.

```python
from sklearn.feature_extraction.text import CountVectorizer

# General form: CountVectorizer(ngram_range=(min_n, max_n))

# Only unigrams
vect = CountVectorizer(ngram_range=(1, 1))

# Uni- and bigrams
vect = CountVectorizer(ngram_range=(1, 2))
```

6\. What is the best n?
-----------------------

01:56 - 02:37

It's not easy to determine the optimal sequence length for your problem. If we use longer token sequences, this will result in more features. In principle, the number of bigrams could be the number of unigrams squared; trigrams the number of unigrams to the power of 3 and so forth. In general, having longer sequences can result in more precise machine learning models, but this also increases the risk of overfitting. An approach to find the optimal sequence length would be to try different lengths in something like a grid search and see which results in the best model.

### Longer sequence of tokens

```
Results in more features
Higher precision of machine learning models
Risk of overfitting
```
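
How quickly the feature count grows can be seen even on a tiny example. The sketch below uses a hypothetical two-sentence corpus and a hypothetical helper `vocabulary_size`, and counts the distinct n-grams when the maximum sequence length is raised from 1 to 3.

```python
docs = [
    "the weather today is wonderful",
    "the weather yesterday was awful",
]  # hypothetical mini-corpus

def ngrams(tokens, n):
    # All runs of n neighboring tokens, joined by spaces
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def vocabulary_size(docs, min_n, max_n):
    # Number of distinct n-grams with min_n <= n <= max_n across all documents
    vocab = set()
    for doc in docs:
        tokens = doc.split()
        for n in range(min_n, max_n + 1):
            vocab.update(ngrams(tokens, n))
    return len(vocab)

sizes = [vocabulary_size(docs, 1, max_n) for max_n in (1, 2, 3)]
print(sizes)
```

The vocabulary grows from 8 features with unigrams only to 15 with uni- and bigrams and 21 when trigrams are added; on a real corpus the growth is far steeper.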

7\. Specifying vocabulary size
------------------------------

02:37 - 04:03

Determining the length of the token sequence is not the only way to control the size of the vocabulary. There are a few parameters in the CountVectorizer that can also do that. You might remember we set the max_features parameter. It tells the CountVectorizer to keep only the given number of most frequent tokens in the corpus. If it is set to None, all the words in the corpus will be included. So this parameter can remove rare words, which depending on the context may or may not be a good idea. Another parameter you can specify is max_df. If given, it tells CountVectorizer to ignore terms that appear in more documents than the given threshold. We can specify it as an integer, which will be an absolute count of documents, or as a float, which will be a proportion of documents. The default value of max_df is 1.0, which means it does not ignore any terms. Very similar to max_df is min_df. It is used to remove terms that appear in too few documents. It again can be specified either as an integer, in which case it will be a count, or a float, in which case it will be a proportion. The default value is 1, which means "ignore terms that appear in fewer than 1 document". Thus, the default setting does not ignore any terms.

```
CountVectorizer(max_features, max_df, min_df)

• max_features: if specified, it will include only the top most frequent words in the vocabulary
  ○ If max_features = None, all words will be included

• max_df: ignore terms that appear in more documents than the specified threshold
  ○ If it is set to integer, then absolute count; if a float, then it is a proportion
  ○ Default is 1.0, which means it does not ignore any terms

• min_df: ignore terms that appear in fewer documents than the specified threshold
  ○ If it is set to integer, then absolute count; if a float, then it is a proportion 
  ○ Default is 1, which means it does not ignore any terms
```
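
The effect of the three parameters can be sketched in plain Python. The function `limit_vocabulary` and the four-document corpus are hypothetical, the tokenizer is a plain whitespace split, and the alphabetical tie-breaking is an assumption of this sketch; it only imitates the filtering logic described above and is not scikit-learn's implementation.

```python
from collections import Counter

docs = [
    "good book good story",
    "good plot",
    "bad book",
    "good ending",
]  # hypothetical mini-corpus

def limit_vocabulary(docs, max_features=None, max_df=1.0, min_df=1):
    n_docs = len(docs)
    term_counts = Counter()  # total occurrences across the corpus
    doc_counts = Counter()   # number of documents containing the term
    for doc in docs:
        tokens = doc.split()
        term_counts.update(tokens)
        doc_counts.update(set(tokens))

    # A float is a proportion of documents, an integer an absolute count
    upper = max_df * n_docs if isinstance(max_df, float) else max_df
    lower = min_df * n_docs if isinstance(min_df, float) else min_df

    kept = [t for t in term_counts if lower <= doc_counts[t] <= upper]
    # Most frequent terms first; ties are broken alphabetically in this sketch
    kept.sort(key=lambda t: (-term_counts[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

print(limit_vocabulary(docs))                  # defaults keep every term
print(limit_vocabulary(docs, max_features=2))  # two most frequent terms
print(limit_vocabulary(docs, max_df=2))        # drop terms in more than 2 documents
print(limit_vocabulary(docs, min_df=2))        # drop terms in fewer than 2 documents
```

Here 'good' appears in three of the four documents, so `max_df=2` (or equivalently `max_df=0.5`) removes it, while `min_df=2` keeps only 'good' and 'book'.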

8\. Let's practice!
-------------------

04:03 - 04:11

Let's go to the exercises where you will specify the token sequence length and the size of the vocabulary.

Specify token sequence length with BOW
======================================

We saw in the video that by specifying different lengths of token sequences, which we called n-grams, we can better capture the context, which can be very important.

In this exercise, you will work with a sample of the Amazon product reviews. Your task is to build a BOW vocabulary, using the `review` column, and specify the sequence length of tokens.

Instructions
------------

-   Build the vectorizer, specifying the token sequence length to be uni- and bigrams.
-   Fit the vectorizer.
-   Transform the fitted vectorizer.
-   In the DataFrame, make sure to correctly specify the column names.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify token sequence and fit
vect = CountVectorizer(ngram_range=(1, 2))
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
```

Size of vocabulary of movies reviews
====================================

In this exercise, you will practice different ways to limit the size of the vocabulary using a sample of the `movies` reviews dataset. The first column is the `review`, which is of type `object`, and the second column is the `label`, which is `0` for a negative review and `1` for a positive one. 

The three methods that you will use will transform the text column to new numeric columns, capturing the count of a word or a phrase in each review. Each method will ultimately result in building a different number of new features.

Instructions 1/3
----------------

-   Using the `movies` dataset, limit the size of the vocabulary to  `100`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(max_features=100)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
```

Instructions 2/3
----------------

-   Using the `movies` dataset, limit the size of the vocabulary to include terms which occur in no more than 200 documents.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

# Build and fit the vectorizer
vect = CountVectorizer(max_df=200)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
```

Instructions 3/3
----------------

-   Using the `movies` dataset, limit the size of the vocabulary by ignoring terms which occur in fewer than 50 documents.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer 

# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(movies.review)

# Transform the review column
X_review = vect.transform(movies.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
```

BOW with n-grams and vocabulary size
====================================

In this exercise, you will practice building a bag-of-words once more, using the `reviews` dataset of Amazon product reviews. Your main task will be to limit the size of the vocabulary and specify the length of the token sequence.

Instructions
------------

-   Import the vectorizer from `sklearn`.
-   Build the vectorizer and make sure to specify the following parameters: the size of the vocabulary should be limited to 1000, include only bigrams, and ignore terms that appear in more than 500 documents.
-   Fit the vectorizer to the `review` column.
-   Create a DataFrame from the BOW representation.

```python
# Import the vectorizer
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(reviews.review)

# Transform the review
X_review = vect.transform(reviews.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())
```

1\. Build new features from text
--------------------------------

00:00 - 00:09

When we have a sentiment analysis task, which we will solve with machine learning, having extra features usually results in a better model.

2\. Goal of the video
---------------------

00:09 - 00:18

Our goal in this video is to enrich the dataset of choice with extra features related to the text capturing the sentiment.

3\. Product reviews data
------------------------

00:18 - 00:29

We continue to work with the Amazon product reviews dataset. Remember that the first column contains the numeric score, and the second column - the review itself.

```python
reviews.head()
```

```
            score                                               review
0              1       Stuning even for the non-gamer. This sound tr...
1              1       The best soundtrack ever to anything.: I'm re...
2              1       Amazing!: This soundtrack is my favorite musi...
3              1       Excellent Soundtrack: I truly like this sound...
4              1       Remember, Pull Your Jaw Off The Floor After H...
```

4\. Features from the review column
-----------------------------------

00:29 - 00:48

In my own experience, some very predictive features say something about the complexity of the text column. For example, one could measure how long each review is, how many sentences it contains, or say something about the parts of speech involved, punctuation marks, etc.

```
How long is each review?
How many sentences does it contain?
What parts of speech are involved?
How many punctuation marks?
```

5\. Tokenizing a string
-----------------------

00:48 - 01:38

Remember we employed a BOW approach to transform each review to numeric features, counting how many times a word occurred in the respective review. Here, we stop one step earlier and only split the reviews into individual words (usually called tokens, though a token can be a whole sentence as well). We will work with the nltk package, and concretely the word_tokenize function. Let's apply the word_tokenize function to our familiar anna_k string. The returned result is a list, where each item is a token from the string. Note that not only words but also punctuation marks are returned as tokens. The same would have been the case with digits, if we had any in our string.

```python
from nltk import word_tokenize
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
word_tokenize(anna_k)
['Happy','families','are', 'all','alike',',',
'every','unhappy', 'family', 'is','unhappy','in',
'its','own','way','.']
```
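
To see why punctuation marks end up as separate tokens, it can help to look at a deliberately simple stand-in. The regular expression below is an assumption made for illustration; it is not how nltk's word_tokenize works internally and will differ from it on contractions, abbreviations and similar cases.

```python
import re

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

# Match either a run of word characters or a single punctuation character
tokens = re.findall(r"\w+|[^\w\s]", anna_k)

print(tokens)
print(len(tokens))
```

For this sentence the sketch yields 16 tokens: 14 words plus the comma and the full stop.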

6\. Tokens from a column
------------------------

01:38 - 02:29

Now we want to apply the same logic to our column of reviews. One fast way to iterate over strings is a list comprehension. A quick reminder: list comprehensions are like flattened-out for loops. The syntax is an expression we evaluate for each item in an iterable object (such as a list). In our case, a list comprehension will allow us to iterate over the review column, tokenizing every review. The result is going to be a list; if we explore the type of the first item, for example, we see it is also of type list. This means that our word_tokens is a list of lists. Each item stores the tokens from a single review.

```python
# General form of a list comprehension:
# [expression for item in iterable]
word_tokens = [word_tokenize(review) for review in reviews.review]
type(word_tokens)
# list
type(word_tokens[0])
# list
```


7\. Tokens from a column
------------------------

02:29 - 03:16

Now that we have our word_tokens list, we only need to count how many tokens there are in each item of word_tokens. We start by creating an empty list, to which we will append the length of each review as we iterate over the word_tokens list. In the first line of the for loop, we find the number of items in the word_tokens list using the len() function. Since we want to iterate over this number, we need to wrap the len() call in the range() function. In the second line, we find the length of each item, and append that number to our list len_tokens. Lastly, we create a new feature for the length of each review.

```python
len_tokens = []
# Iterate over the word_tokens list
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))
# Create a new feature for the length of each review
reviews['n_tokens'] = len_tokens
```
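
The same counting step can also be written as a list comprehension, in line with the reminder from the previous slide. The tokenized reviews below are hypothetical stand-ins for the real word_tokens list.

```python
# Hypothetical tokenized reviews standing in for word_tokens
word_tokens = [
    ['Great', 'book', '!'],
    ['Not', 'my', 'favorite', '.'],
    ['Loved', 'it'],
]

# Loop version: append the length of each token list
len_tokens = []
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))

# Equivalent list comprehension
len_tokens_short = [len(tokens) for tokens in word_tokens]

print(len_tokens)
print(len_tokens_short)
```

Both versions produce the same list of lengths; which one to use is mostly a matter of readability.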

8\. Dealing with punctuation
----------------------------

03:16 - 03:47

Note that we did not address punctuation but you can exclude it if it suits your context better. You can even create a new feature that measures the number of punctuation signs. In our context, a review with more punctuation signs could signal a very emotionally charged opinion. It's also good to know that we can follow the same logic and create a feature that counts the number of sentences, where one token will be equal to a sentence and not to a single word.

```
We did not address it but you can exclude it
A feature that measures the number of punctuation signs
A review with many punctuation signs could signal a very emotionally charged opinion
```
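
A punctuation feature could be built along these lines. This is a sketch under assumptions: the tokenized reviews are hypothetical, and a token counts as punctuation only if every character is in Python's `string.punctuation`.

```python
import string

# Hypothetical tokenized reviews (words and punctuation marks as tokens)
word_tokens = [
    ['Amazing', '!', '!', '!', 'Best', 'purchase', 'ever', '.'],
    ['It', 'works', 'fine'],
]

def is_punctuation(token):
    # True when the token is non-empty and made up of punctuation characters only
    return bool(token) and all(ch in string.punctuation for ch in token)

# Number of punctuation tokens per review
n_punct = [sum(is_punctuation(tok) for tok in tokens) for tokens in word_tokens]

# Number of remaining tokens per review, with punctuation excluded
n_words = [sum(not is_punctuation(tok) for tok in tokens) for tokens in word_tokens]

print(n_punct)
print(n_words)
```

The first hypothetical review has four punctuation tokens and four words, the second has none and three words.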

9\. Reviews with a feature for the length
-----------------------------------------

03:47 - 03:57

If we check what the product reviews dataset looks like, we see the 'n_tokens' column we created. It shows the number of tokens (words and punctuation marks) in each review.

```python
reviews.head()
```

```
            score                                               review  n_tokens
0              1       Stuning even for the non-gamer. This sound tr...        87
1              1       The best soundtrack ever to anything.: I'm re...       109
2              1       Amazing!: This soundtrack is my favorite musi...       165
3              1       Excellent Soundtrack: I truly like this sound...       145
4              1       Remember, Pull Your Jaw Off The Floor After H...       109
```

10\. Let's practice!
--------------------

03:57 - 04:01

Let's solve some exercises to practice what we've learned.

Tokenize a string from GoT
==========================

A first standard step when working with text is to tokenize it, in other words, split a bigger string into individual strings, which are usually single words (tokens). 

A string `GoT` has been created for you and it contains a quote from George R.R. Martin's *Game of Thrones*. Your task is to split it into individual tokens.

Instructions
------------

-   Import the word tokenizing function from `nltk`.
-   Transform the `GoT` string to word tokens.

```python
# Import the required function
from nltk import word_tokenize

# Transform the GoT string to word tokens
print(word_tokenize(GoT))
```

Word tokens from the Avengers
=============================

Now that you have tokenized your first string, it is time to iterate over items of a list and tokenize them as well. An easy way to do that with one line of code is with a list comprehension.

A list `avengers` has been created for you. It contains a few quotes from the *Avengers* movies. You can explore it in the IPython Shell.

Instructions
------------

-   Import the required function and package.
-   Apply the word tokenizing function on each item of our list.

```python
# Import the word tokenizing function
from nltk import word_tokenize

# Tokenize each item in the avengers 
tokens_avengers = [word_tokenize(item) for item in avengers]

print(tokens_avengers)
```

A feature for the length of a review
====================================

You have now worked with a string and a list with string items. It is time to use a larger sample of data.

Your task in this exercise is to create a new feature for the length of a review, using the familiar `reviews` dataset.

Instructions 1/2
----------------

-   Import the word tokenizing function from the required package.
-   Apply the function to the `review` column of the `reviews` dataset.

```python
# Import the needed packages
from nltk import word_tokenize

# Tokenize each item in the review column 
word_tokens = [word_tokenize(review) for review in reviews.review]

# Print out the first item of the word_tokens list
print(word_tokens[0])
```

Instructions 2/2
----------------

-   Iterate over the created `word_tokens` list. 
-   As you iterate, find the length of each item in the list and append it to the empty `len_tokens` list. 
-   Create a new feature `n_words` in the `reviews` for the length of the reviews.

```python
# Create an empty list to store the length of reviews
len_tokens = []

# Iterate over the word_tokens list and determine the length of each item
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))

# Create a new feature for the length of each review
reviews['n_words'] = len_tokens 
```

1\. Can you guess the language?
-------------------------------

00:00 - 00:12

Often in real applications not all documents carrying sentiment will be in English. We might want to detect what language is used or build specific features related to language.

2\. Language of a string in Python
----------------------------------

00:12 - 01:20

In Python there are a few libraries that can detect the language of a string. In this course, we will use langdetect because it is one of the better performing packages. But you can follow the same structure using another package. We first import the detect_langs function from the langdetect package. Now imagine we have a string called foreign, which is a sentence in another language. Our goal is to identify its language. We apply the detect_langs function to our string. This function will return a list. Each item of the list contains a pair of a language and a number saying how likely it is that the string is in this particular language. In this case, we observe only 1 item in the list, namely Spanish. That's because the function is fairly certain the language is Spanish. In other cases we might get longer lists, where the most likely candidate languages will appear first, followed by less likely ones.

```python
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
detect_langs(foreign)
[es:0.9999945352697024]
```

3\. Language of a column
------------------------

01:20 - 01:51

In real applications, we usually work not with a single string but with many strings, often contained in a column of a dataset. A common problem is to detect the language of each of the strings and capture the most likely language in a new column. How to do that? We again start by importing the detect_langs function from the langdetect package. We import our familiar dataset with product reviews.

- Problem: Detect the language of each of the strings and capture the most likely language in a new column

```python
import pandas as pd
from langdetect import detect_langs

reviews = pd.read_csv('product_reviews.csv')
reviews.head()
```

|   | score | review |
|---|-------|--------|
| 0 | 1 | Stuning even for the non-gamer. This sound tr... |
| 1 | 1 | The best soundtrack ever to anything.: I'm re... |
| 2 | 1 | Amazing!: This soundtrack is my favorite musi... |
| 3 | 1 | Excellent Soundtrack: I truly like this sound... |
| 4 | 1 | Remember, Pull Your Jaw Off The Floor After H... |

4\. Building a feature for the language
---------------------------------------

01:51 - 02:56

The steps we follow next are quite similar to our approach when capturing the length of a review. First, we create an empty list, called languages. We want to iterate over the rows of our dataset using a for loop. In the first line of the loop, we apply the len() function to our dataset, which returns the number of rows. We still need to call the range() function since we want to iterate over the number of rows. In the second line of the loop, we apply the detect_langs function to the review column of the dataset, which is the second column in our case, while selecting one row at a time. We want to store each detected language as an item in a list, therefore we append the result of detect_langs to the empty list languages. When we print languages, we see that it is a list of lists, where each element contains the detected language of the respective row and how likely that language is. In some cases, the individual lists contain more than one item.

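```python
languages = []

# Iterate over the row positions of the dataset
for row in range(len(reviews)):
    # The review text is in the second column (position 1)
    languages.append(detect_langs(reviews.iloc[row, 1]))

languages
# Printed output (truncated):
# [it:0.9999982541301151],
# [es:0.9999954153640488],
# [es:0.7142833997345875, en:0.2857160465706441],
# [es:0.9999942365605781],
# [es:0.999997956049055] ...
```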
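
The row-by-row loop can also be written with `Series.apply`, which calls a function once for every value in a column. The following is a minimal sketch rather than the method used in the slides. The two-row DataFrame and the `fake_detect_langs` stub are hypothetical stand-ins so that the snippet runs without the `langdetect` package or the CSV file; in practice, `detect_langs` would be passed to `apply` instead.

```python
import pandas as pd

# Hypothetical two-row dataset with the same columns as the course data
reviews = pd.DataFrame({
    "score": [1, 0],
    "review": ["La paella es muy buena", "Il prodotto non funziona"],
})


def fake_detect_langs(text):
    """Stub standing in for langdetect.detect_langs (an assumption, not the real detector)."""
    return ["es:0.99"] if "paella" in text else ["it:0.99"]


# apply() calls the function once per review, replacing the explicit for loop
languages = reviews["review"].apply(fake_detect_langs).tolist()
print(languages)
```

Selecting the column by name also avoids relying on the review text being in the second position.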

5\. Building a feature for the language
---------------------------------------

02:56 - 03:46

We have one more step before we create our language feature. We saw that languages is a list of lists. We want to extract the first element of each list within languages, since the first item is always the most likely language. One concise way to do that is with a list comprehension. Let's break down the command in steps. For example, let's take the first element of languages, convert it to a string, and split it on a colon. After that, we extract the first element of the resulting split, returning '[es'. Finally, since there is a left bracket before the language code, we select everything from the second character onwards, resulting in 'es' for Spanish.

```python
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')
# ['[es', '0.9999954153640488]']

# Keep the part before the colon
str(languages[0]).split(':')[0]
# '[es'

# Drop the leading bracket
str(languages[0]).split(':')[0][1:]
# 'es'
```
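
To try these steps without installing langdetect, here is a small sketch that uses a hypothetical `FakeLanguage` class. It is only assumed to print the way the results above do (`code:probability`), and the probabilities are made up. It also shows that the same three steps still pick the first candidate when a row has more than one detected language.

```python
class FakeLanguage:
    """Hypothetical stand-in that prints like a detection result, e.g. es:0.71."""

    def __init__(self, lang, prob):
        self.lang = lang
        self.prob = prob

    def __repr__(self):
        return f"{self.lang}:{self.prob}"


# One detection result with two candidates, most likely first (made-up values)
candidates = [FakeLanguage("es", 0.71), FakeLanguage("en", 0.29)]

as_text = str(candidates)            # the whole list as one string
first_part = as_text.split(":")[0]   # everything before the first colon
code = first_part[1:]                # drop the leading '['

print(as_text)
print(first_part)
print(code)
```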

6\. Building a feature for the language
---------------------------------------

03:46 - 03:59

To write the list comprehension, we put these steps together by iterating over each item in our list of lists. Lastly, we assign the cleaned list to a new feature, called language.

```python
languages = [str(lang).split(':')[0][1:] for lang in languages]
reviews['language'] = languages
```
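
String splitting depends on how the results happen to print. As an alternative sketch, and assuming the objects returned by `detect_langs` expose the language code and probability as `lang` and `prob` attributes (worth verifying against your installed langdetect version), the code can be read directly. The `SimpleNamespace` objects below are stand-ins with made-up values so that the snippet runs on its own.

```python
from types import SimpleNamespace

# Stand-ins for detect_langs results (made-up values); real results are
# assumed to carry the same .lang and .prob attributes
languages = [
    [SimpleNamespace(lang="it", prob=0.99)],
    [SimpleNamespace(lang="es", prob=0.71), SimpleNamespace(lang="en", prob=0.29)],
]

# Take the first (most likely) candidate of each row and read its code
codes = [candidates[0].lang for candidates in languages]
print(codes)
```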

7\. Let's practice!
-------------------

03:59 - 04:05

I know this is a lot of code but the exercises will help you digest it.

Identify the language of a string
=================================

Sometimes you might need to analyze the sentiment of non-English text. Your first task in such a case will be to identify the foreign language. 

In this exercise you will identify the language of a single string. A string called `foreign` has been created for you. It has been randomly extracted from the `reviews` dataset and may contain some grammatical errors. Feel free to explore it in the IPython Shell.

Instructions
------------

-   Import the required function from the language detection package.
-   Detect the language of the `foreign` string.

```python
# Import the language detection function and package
from langdetect import detect_langs

# Detect the language of the foreign string
print(detect_langs(foreign))
```

Detect language of a list of strings
====================================

Now you will detect the language of each item in a list. A list called `sentences` has been created for you and it contains 3 sentences, each in a different language. They have been randomly extracted from the product reviews dataset.

Instructions
------------

-   Iterate over the sentences in the list.
-   Detect the language of each sentence and append the detected language to the empty list `languages`.

```python
from langdetect import detect_langs

languages = []

# Loop over the sentences in the list and detect their language
for sentence in sentences:
    languages.append(detect_langs(sentence))

print('The detected languages are: ', languages)
```

Language detection of product reviews
=====================================

You will practice language detection on a small dataset called `non_english_reviews`. It is a sample of non-English reviews from the Amazon product reviews. 

You will iterate over the rows of the dataset, detecting the language of each row and appending it to an empty list. The list needs to be cleaned so that it only contains the language of the review such as `'en'` for English instead of the regular output `en:0.9987654`. Remember that the language detection function might detect more than one language and the first item in the returned list is the most likely candidate. Finally, you will assign the list to a new column. 

The logic is the same as used in the slides and the exercise before but instead of applying the function to a list, you work with a dataset.

Instructions
------------

-   Iterate over the rows of the `non_english_reviews` dataset. 
-   Inside the loop, detect the language of the second column of the dataset.
-   Clean the string by splitting on a `:` inside the list comprehension expression.
-   Finally, assign the cleaned list to a new column.

```python
from langdetect import detect_langs

languages = []

# Loop over the rows of the dataset and append
for row in range(len(non_english_reviews)):
    languages.append(detect_langs(non_english_reviews.iloc[row, 1]))

# Clean the list by splitting
languages = [str(lang).split(':')[0][1:] for lang in languages]
print(languages)

# Assign the list to a new feature
non_english_reviews['language'] = languages

print(non_english_reviews.head())
```
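
Once the cleaned list exists, a quick sanity check is to count how often each language code appears. This is a minimal sketch using only the standard library; the `languages` list below is hypothetical and stands in for the cleaned list built in the exercise.

```python
from collections import Counter

# Hypothetical cleaned language codes, as produced by the list comprehension
languages = ["es", "it", "es", "fr", "es"]

# Count how often each language code occurs, most frequent first
counts = Counter(languages).most_common()
print(counts)
```

An unexpected code at the top of the counts, or a long tail of rare codes, can be a hint that some reviews were too short to detect reliably.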