# Bag-of-words


## What is a bag-of-words (BOW)?
- Describes the occurrence of words within a document or a collection of documents (corpus)
- Builds a vocabulary of the words and a measure of their presence
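
As a rough illustration of the two bullets above, the sketch below builds a vocabulary and word counts by hand using only the standard library. The two-sentence `corpus` and the whitespace tokenization are assumptions made for this example; they are not part of the course data.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only
corpus = ["the movie was good", "the plot was not good at all"]

# Tokenize naively: lowercase, split on whitespace
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary: every distinct word, in sorted order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row of counts per document, one column per vocabulary word
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in bow:
    print(row)
```

Each row describes one document, and each position in the row says how often the corresponding vocabulary word occurs in it. Word order within the document is discarded.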


![](https://i.imgur.com/mBZZwLU.png)

![](https://i.imgur.com/w5Ku7jU.png)


## Count vectorizer

In [2]:
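import pandas as pd

filename = 'data/IMDB_sample.xls'
# on_bad_lines='skip' replaces the removed error_bad_lines=False (pandas >= 1.3)
data = pd.read_csv(filename, on_bad_lines='skip')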
data.drop(['Unnamed: 0'],axis=1,inplace=True)
data.head()

Unnamed: 0,review,label
0,This short spoof can be found on Elite's Mille...,0
1,A singularly unfunny musical comedy that artif...,0
2,"An excellent series, masterfully acted and dir...",1
3,The master of movie spectacle Cecil B. De Mill...,1
4,I was gifted with this movie as it had such a ...,0


In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer



In [4]:
vect = CountVectorizer(max_features=1000)
vect.fit(data.review)
X = vect.transform(data.review)

In [5]:
X

<7501x1000 sparse matrix of type '<class 'numpy.int64'>'
	with 681082 stored elements in Compressed Sparse Row format>
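
To get a feel for why the result is stored as a sparse matrix, the short calculation below uses the shape and the stored-element count from the output above to estimate how full the matrix is. The variable names are chosen here for illustration.

```python
# Numbers taken from the matrix summary printed above
n_rows, n_cols = 7501, 1000
n_stored = 681082  # entries kept by the sparse format

n_cells = n_rows * n_cols
density = n_stored / n_cells

print(f"{n_cells} cells, {density:.1%} of them stored")
```

Only about 9% of the cells hold a count, so a dense array would spend most of its memory on zeros.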

## Transforming the vectorizer


In [6]:
# Transform to an array
my_array = X.toarray()

In [7]:
# Transform back to a dataframe, assign column names
# (get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0)
X_df = pd.DataFrame(my_array, columns=vect.get_feature_names_out())

In [9]:
X_df.head()

Unnamed: 0,10,20,30,80,able,about,above,absolutely,across,act,...,year,years,yes,yet,york,you,young,your,yourself,zombie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,2,0,0,1,0,3,0,2,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# PRACTICE

# Your first BOW
A bag-of-words is an approach to transform text to numeric form.

In this exercise, you will apply a BOW to the annak list before moving on to a larger dataset in the next exercise.

Your task will be to work with this list and apply a BOW using the CountVectorizer(). This transformation is your first step in being able to understand the sentiment of a text. Pay attention to words which might carry a strong sentiment.

Remember that the output of a CountVectorizer() is a sparse matrix, which stores only entries which are non-zero. To look at the actual content of this matrix, we convert it to a dense array using the .toarray() method.

Note that in this case you don't need to specify the max_features argument because the text is short.

In [47]:
# Import the required function
from sklearn.feature_extraction.text import CountVectorizer

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]
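
To see which word each column stands for, the sketch below rebuilds the same matrix with the standard library only. It assumes CountVectorizer's default behaviour of lowercasing the text, keeping tokens of two or more word characters, and ordering the columns alphabetically; it is a simplified stand-in, not the library's implementation.

```python
import re
from collections import Counter

annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Lowercase and keep tokens of at least two word characters
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in annak]

# Columns follow the alphabetically sorted vocabulary
vocabulary = sorted({tok for tokens in tokenized for tok in tokens})

rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

The 2 in the second row sits in the `unhappy` column, since that word occurs twice in the second string.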


# BOW using product reviews
You practiced a BOW on a small dataset. Now you will apply it to a sample of Amazon product reviews. The data has been imported for you and is called reviews. It contains two columns. The first one is called score and it is 0 when the review is negative, and 1 when it is positive. The second column is called review and it contains the text of the review that a customer wrote. Feel free to explore the data in the IPython Shell.

Your task is to build a BOW vocabulary, using the review column.

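Remember that we can call the .get_feature_names() method on the vectorizer to obtain a list of all the vocabulary elements. In scikit-learn 1.0 and later this method is named .get_feature_names_out(); the old name was removed in version 1.2.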

In [48]:
ls data

amazon_reviews_sample.xls*  IMDB_sample.xls*  Tweets.xls*


In [49]:
filename = 'data/amazon_reviews_sample.xls'
# on_bad_lines='skip' replaces the removed error_bad_lines=False (pandas >= 1.3)
reviews = pd.read_csv(filename, on_bad_lines='skip')
reviews.drop(['Unnamed: 0'],axis=1,inplace=True)
reviews.head()

Unnamed: 0,score,review
0,1,Stuning even for the non-gamer: This sound tr...
1,1,The best soundtrack ever to anything.: I'm re...
2,1,Amazing!: This soundtrack is my favorite musi...
3,1,Excellent Soundtrack: I truly like this sound...
4,1,"Remember, Pull Your Jaw Off The Floor After H..."


In [50]:
from sklearn.feature_extraction.text import CountVectorizer 

# Build the vectorizer, specify max features 
vect = CountVectorizer(max_features=100)
# Fit the vectorizer
vect.fit(reviews.review)

# Transform the review column
X_review = vect.transform(reviews.review)

# Create the bow representation
# (get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0)
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

   about  after  all  also  am  an  and  any  are  as  ...  what  when  which  \
0      0      0    1     0   0   0    2    0    0   0  ...     0     0      0   
1      0      0    0     0   0   0    3    1    1   0  ...     0     0      0   
2      0      0    3     0   0   1    4    0    1   1  ...     0     0      1   
3      0      0    0     0   0   0    9    0    1   0  ...     0     0      0   
4      0      1    0     0   0   0    3    0    1   0  ...     0     0      0   

   who  will  with  work  would  you  your  
0    2     0     1     0      2    0     1  
1    0     0     0     0      1    1     0  
2    1     0     0     1      1    2     0  
3    0     0     0     0      0    0     0  
4    0     0     0     0      0    3     1  

[5 rows x 100 columns]


# Getting granular with n-grams

## Context matters
> I am happy, not sad.

> I am sad, not happy.

- Putting 'not' in front of a word (negation) is one example of how context matters.
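
A quick way to see the problem: with single-word counts the two sentences above are indistinguishable, whereas pairs of adjacent words tell them apart. The sketch below checks this with the standard library; the regex tokenization and the helper names are simplifying assumptions.

```python
import re
from collections import Counter

s1 = "I am happy, not sad."
s2 = "I am sad, not happy."

def tokens(text):
    # Lowercase and keep runs of word characters (drops punctuation)
    return re.findall(r"\w+", text.lower())

def word_pairs(toks):
    # Pair each token with its right-hand neighbour
    return list(zip(toks, toks[1:]))

same_words = Counter(tokens(s1)) == Counter(tokens(s2))
same_pairs = Counter(word_pairs(tokens(s1))) == Counter(word_pairs(tokens(s2)))

print(same_words)  # True: identical word counts
print(same_pairs)  # False: the word pairs differ
```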

## Capturing context with a BOW
- Unigrams: single tokens
- Bigrams: pairs of tokens
- Trigrams: triples of tokens
- n-grams: sequences of n tokens

## Capturing context with BOW
> The weather today is wonderful.

- Unigrams: {The, weather, today, is, wonderful}
- Bigrams: {The weather, weather today, today is, is wonderful}
- Trigrams: {The weather today, weather today is, today is wonderful}
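
The sets above can be generated mechanically by sliding a window of n tokens over the sentence. The sketch below is one minimal way to do that; the `ngrams` helper and the whitespace tokenization are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace split with the final period removed (a simplification)
tokens = "The weather today is wonderful.".rstrip(".").split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence with k tokens yields k - n + 1 n-grams, so the five-token example gives 5 unigrams, 4 bigrams, and 3 trigrams.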

## n-grams with the CountVectorizer


```python
from sklearn.feature_extraction.text import CountVectorizer

# General form: CountVectorizer(ngram_range=(min_n, max_n))

# Only unigrams
vect = CountVectorizer(ngram_range=(1, 1))

# Uni- and bigrams
vect = CountVectorizer(ngram_range=(1, 2))
```

## What is the best n?
### Longer sequences of tokens
- Result in more features
- Can improve the precision of machine learning models
- Carry a risk of overfitting
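
The first bullet can be made concrete by counting distinct n-grams on a tiny corpus. The sketch below reuses the two example sentences from earlier in this section with a simple regex tokenizer; the helper name and the tokenization are assumptions for illustration, and real vocabularies grow far faster.

```python
import re

corpus = ["I am happy, not sad.", "I am sad, not happy."]

def ngram_vocabulary(docs, min_n, max_n):
    """Collect the distinct n-grams with min_n <= n <= max_n across docs."""
    vocab = set()
    for doc in docs:
        toks = re.findall(r"\w+", doc.lower())
        for n in range(min_n, max_n + 1):
            for i in range(len(toks) - n + 1):
                vocab.add(" ".join(toks[i:i + n]))
    return vocab

for max_n in (1, 2, 3):
    print((1, max_n), len(ngram_vocabulary(corpus, 1, max_n)))
```

Even with ten tokens in total, the feature count grows from 5 to 12 to 18 as the range widens, while the number of documents to learn from stays at two.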


![](https://i.imgur.com/OTBfv2C.png)


# PRACTICE

# Build new features from text

In [13]:
data.head()

Unnamed: 0,review,label
0,This short spoof can be found on Elite's Mille...,0
1,A singularly unfunny musical comedy that artif...,0
2,"An excellent series, masterfully acted and dir...",1
3,The master of movie spectacle Cecil B. De Mill...,1
4,I was gifted with this movie as it had such a ...,0


## Features from the review column
- How long is each review?
- How many sentences does it contain?
- What parts of speech are involved?
- How many punctuation marks?

## Tokenizing a string

In [17]:
from nltk import word_tokenize
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'
print(word_tokenize(anna_k))

['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way', '.']


In [18]:
# import nltk
# nltk.download('punkt')

## Tokens from a column
```python
# General form of list comprehension
[expression for item in iterable]
```
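
As a concrete instance of the general form, the sketch below computes the number of tokens in each text of a small list, first with a loop and then with an equivalent list comprehension. The `texts` list and the whitespace split are stand-ins for the review column and `word_tokenize`.

```python
# Hypothetical stand-in for a column of reviews
texts = ["a short review", "another slightly longer review here"]

# Loop version
lengths_loop = []
for text in texts:
    lengths_loop.append(len(text.split()))

# Equivalent list comprehension: [expression for item in iterable]
lengths_comp = [len(text.split()) for text in texts]

print(lengths_loop, lengths_comp)
```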

In [19]:
word_tokens = [word_tokenize(review) for review in data.review]
type(word_tokens)

list

In [20]:
type(word_tokens[0])

list

In [21]:
# Count the tokens in each review
len_tokens = [len(tokens) for tokens in word_tokens]

# Create a new feature for the length of each review
data['n_tokens'] = len_tokens

In [22]:
data.head()

Unnamed: 0,review,label,n_tokens
0,This short spoof can be found on Elite's Mille...,0,155
1,A singularly unfunny musical comedy that artif...,0,646
2,"An excellent series, masterfully acted and dir...",1,121
3,The master of movie spectacle Cecil B. De Mill...,1,128
4,I was gifted with this movie as it had such a ...,0,248


## Dealing with punctuation
- We did not address punctuation here, but you can exclude it from the tokens
- Another option is a feature that counts the punctuation marks in each review
    - A review with many punctuation marks could signal a very emotionally charged opinion
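
A punctuation-count feature along these lines could be sketched as follows. The helper name and the two example reviews are hypothetical, and `string.punctuation` covers ASCII punctuation only.

```python
import string

def count_punctuation(text):
    """Count characters that appear in string.punctuation (ASCII only)."""
    return sum(1 for ch in text if ch in string.punctuation)

# Hypothetical reviews, invented for illustration
calm = "It was fine"
heated = "Worst. Movie. Ever!!! Why?!"

print(count_punctuation(calm), count_punctuation(heated))
```

Applied to every review, the resulting numbers could be stored in a new column in the same way as `n_tokens`.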

# PRACTICE

# Can you guess the language?

In [24]:
from langdetect import detect_langs
foreign = 'Este libro ha sido uno de los mejores libros que he leido.'

In [25]:
detect_langs(foreign)

[es:0.999994587883386]

## Language of a column
- Problem: Detect the language of each of the strings and capture the most likely language in a new column

## Building a feature for the language


In [31]:
data.head()

Unnamed: 0,review,label,n_tokens
0,This short spoof can be found on Elite's Mille...,0,155
1,A singularly unfunny musical comedy that artif...,0,646
2,"An excellent series, masterfully acted and dir...",1,121
3,The master of movie spectacle Cecil B. De Mill...,1,128
4,I was gifted with this movie as it had such a ...,0,248


In [29]:
from langdetect import detect_langs
languages = []

In [32]:
for row in range(len(data)):
    languages.append(detect_langs(data.iloc[row, 0]))


In [34]:
# Transform the first list to a string and split on a colon
str(languages[0]).split(':')


['[en', '0.9999961793479177]']

In [38]:
data.iloc[0,0]

'This short spoof can be found on Elite\'s Millennium Edition DVD of "Night of the Living Dead". Good thing to as I would have never went even a tad out of my way to see it.Replacing zombies with bread sounds just like silly harmless fun on paper. In execution, it\'s a different matter. This short didn\'t even elicit a chuckle from me. I really never thought I\'d say this, but "Night of the Day of the Dawn of the Son of the Bride of the Return of the Revenge of the Terror of the Attack of the Evil, Mutant, Alien, Flesh Eating, Hellbound, Zombified Living Dead Part 2: In Shocking 2-D" was a VERY better parody and not nearly as lame or boring.<br /><br />My Grade: F'

In [39]:
detect_langs(data.iloc[0, 0])

[en:0.9999972518433689]

The probability differs slightly from the value stored in `languages[0]` because langdetect is non-deterministic by default; setting `DetectorFactory.seed` before detection makes the results reproducible.

In [40]:
str(languages[0]).split(':')[0][1:]

'en'

In [41]:
languages = [str(lang).split(':')[0][1:] for lang in languages]
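
The slicing above relies on the printed form of each result, `[code:probability]`. The sketch below wraps the same idea in a small helper that also returns the probability. It works purely on strings, so the inputs are copied from the outputs shown earlier rather than produced by `langdetect`; `parse_top_language` is a hypothetical helper name, and it assumes the first candidate in the list is the most likely one.

```python
def parse_top_language(result_repr):
    """Extract (language code, probability) of the first candidate
    from a string such as '[en:0.99]' or '[en:0.57, de:0.43]'."""
    first = result_repr.strip("[]").split(",")[0]
    code, prob = first.strip().split(":")
    return code, float(prob)

print(parse_top_language("[es:0.999994587883386]"))
print(parse_top_language("[en:0.9999961793479177]"))
```

Keeping the probability alongside the code makes it possible to flag reviews where the detection is uncertain.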

In [42]:
data['language'] = languages

In [43]:
data.head()

Unnamed: 0,review,label,n_tokens,language
0,This short spoof can be found on Elite's Mille...,0,155,en
1,A singularly unfunny musical comedy that artif...,0,646,en
2,"An excellent series, masterfully acted and dir...",1,121,en
3,The master of movie spectacle Cecil B. De Mill...,1,128,en
4,I was gifted with this movie as it had such a ...,0,248,en


In [45]:
data.language.value_counts()

en    7501
Name: language, dtype: int64

# PRACTICE