# 02-BOW

Now that we have mastered preprocessing, let's build our first Bag of Words (BOW).

We will reuse our dataset of Coldplay songs to make a BOW.

As usual, the first step is to import some libraries. So import *nltk* as well as all the libraries you will need.

In [6]:
# Import NLTK and all the needed libraries
import nltk
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Now load the dataset in *coldplay.csv* using pandas.

In [7]:
# TODO: Load the dataset in coldplay.csv
df = pd.read_csv('coldplay.csv')
df.head()

| | Artist | Song | Link | Lyrics |
|---|---|---|---|---|
| 0 | Coldplay | Another's Arms | /c/coldplay/anothers+arms_21079526.html | Late night watching tv \nUsed to be you here ... |
| 1 | Coldplay | Bigger Stronger | /c/coldplay/bigger+stronger_20032648.html | I want to be bigger stronger drive a faster ca... |
| 2 | Coldplay | Daylight | /c/coldplay/daylight_20032625.html | To my surprise, and my delight \nI saw sunris... |
| 3 | Coldplay | Everglow | /c/coldplay/everglow_21104546.html | Oh, they say people come \nThey say people go... |
| 4 | Coldplay | Every Teardrop Is A Waterfall | /c/coldplay/every+teardrop+is+a+waterfall_2091... | I turn the music up, I got my records on \nI ... |


You already know this dataset, but you can check it again if you want to refresh your memory.

In [8]:
# TODO: Explore the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120 entries, 0 to 119
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Artist  120 non-null    object
 1   Song    120 non-null    object
 2   Link    120 non-null    object
 3   Lyrics  120 non-null    object
dtypes: object(4)
memory usage: 3.9+ KB


Now, using scikit-learn's *CountVectorizer*, build a BOW of all the Coldplay lyrics and print the result.

In [19]:
# Collect the lyrics of every song into a plain Python list
lyrics = df['Lyrics'].tolist()
lyrics

["Late night watching tv  \nUsed to be you here beside me  \nUsed to be your arms around me  \nYour body on my body  \n  \nWhen the world means nothing to me  \nAnother's arms, another's arms  \nWhen the pain just rips right through me  \nAnother's arms, another's arms  \n  \nLate night watching tv  \nUsed to be you here beside me  \nIs there someone there to reach me?  \nSomeone there to find me?  \n  \nWhen the pain just rips right through me  \nAnother's arms, another's arms  \nAnd that's just torture to me  \nAnother's arms, another's arms  \n  \nPull yourself into me  \nAnother's arms, another's arms  \nWhen the world means nothing to me  \nAnother's arms, another's arms  \n  \nGot to pull you close into me  \nAnother's arms, another's arms  \nPull yourself right through me  \nAnother's arms, another's arms  \n  \nLate night watching tv  \nWish that you were here beside me  \nWish that your arms were around me  \nYour body on my body\n\n",
 'I want to be bigger stronger drive a fa

In [26]:
# TODO: Compute a BOW
vectorizer = CountVectorizer(max_features=2000, stop_words='english')
BOW = vectorizer.fit_transform(lyrics).toarray()
print(BOW)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
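To make the content of this matrix more concrete, here is a minimal sketch of the same idea done by hand on a toy corpus. The two sentences are made up for illustration (they are not taken from *coldplay.csv*), and the tokenization is a naive whitespace split, so it is a simplification of what *CountVectorizer* does (no lowercasing, no stop-word removal).

```python
from collections import Counter

# Toy corpus (hypothetical lines, not taken from coldplay.csv)
docs = ["yellow yellow stars", "stars shine for you"]

# Naive tokenization: split each document on whitespace
tokenized = [doc.split() for doc in docs]

# Vocabulary: the sorted unique tokens, one column per token
vocab = sorted({tok for toks in tokenized for tok in toks})

# One row per document, holding the count of each vocabulary token
toy_bow = [[Counter(toks)[tok] for tok in vocab] for toks in tokenized]

print(vocab)
for row in toy_bow:
    print(row)
```

Each row is one document and each column is one word of the vocabulary, which is exactly the layout of the matrix printed above, only much smaller.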


Now that we have the BOW matrix, we would like to build a new dataframe with one row per song and one column per word (just as we did at the end of the lecture).

In the end, we should get a dataframe that looks something like the following (120 rows for 120 songs, and as many columns as words):

| | ah | adventure | ... | yeah |
|---|---|---|---|---|
| 0 | 0 | 1 | ... | 4 |
| 1 | 8 | 0 | ... | 2 |
|...|...|...|...|...|
| 119 | 5 | 0 | ... | 8 |

In [27]:
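# TODO: Create a new dataframe containing the BOW outputs and the corresponding words as columns, then display it
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it
tokens = vectorizer.get_feature_names_out()
BOW_matrix = pd.DataFrame(data=BOW, columns=tokens)
BOW_matrix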

Unnamed: 0,10,2000,2gether,76543,aaaaaah,aaaaah,aaaah,achin,adventure,advice,...,x2,x7,ya,yeah,years,yellow,yes,yesterday,young,yuletide
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
116,0,0,0,0,0,0,0,0,0,0,...,0,0,0,11,0,0,0,0,0,0
117,0,0,1,0,0,0,0,0,0,0,...,0,0,0,3,0,0,0,0,0,0
118,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


As you can see, there is still an issue: some tokens are not words, such as '10' or '2000'.

To get rid of them, we could pass a regular expression directly to *CountVectorizer* through its *token_pattern* parameter. Another solution would be to preprocess the lyrics before calling *CountVectorizer*.

For the moment, we won't pay attention to this issue. But if you are curious and have time, you can search online for how to remove those tokens using *CountVectorizer*.
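If you want a starting point, here is a small sketch of the regular-expression approach on toy data. The sentences and the pattern are assumptions chosen for illustration (the pattern keeps only runs of two or more ASCII letters); you may want a different pattern for the real lyrics.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy lyrics (hypothetical), containing the kind of tokens seen above
toy_lyrics = ["10 years ago in 2000 we were 2gether", "yeah yeah x2"]

# Assumed pattern: a token is a run of 2+ ASCII letters between word boundaries,
# so '10', '2000', '2gether' and 'x2' no longer match
letters_only = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b")
letters_only.fit(toy_lyrics)

# vocabulary_ maps each kept token to its column index
kept_tokens = sorted(letters_only.vocabulary_)
print(kept_tokens)
```

Note that this pattern would also drop accented letters, which is acceptable for English lyrics but may not be for other languages.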

Now we would like to see which words Coldplay uses the most.

In [28]:
sum_bow = BOW_matrix.sum()
sum_bow.idxmax()

'oh'

So what is the most used word? Are you surprised?

Now sort the counts to show the 10 most used words.

In [36]:
# TODO: print the 10 most used words by Coldplay
sum_bow.sort_values(ascending=False).head(10)

oh      334
don     190
know    137
just    136
ll      132
come    126
yeah    111
love     95
ooh      95
want     86
dtype: int64
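You may wonder where 'don' and 'll' come from. By default, *CountVectorizer* only keeps tokens of at least two word characters, so contractions are cut at the apostrophe. The sketch below illustrates this on a made-up sentence (an assumption for illustration, not a lyric from the dataset), using the default settings without stop words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer applies to each document
# (lowercasing + tokenization with the default token pattern)
analyzer = CountVectorizer().build_analyzer()

# Made-up sentence containing two contractions
contraction_tokens = analyzer("Don't panic, I'll wait")
print(contraction_tokens)
```

"Don't" becomes 'don' (the single-letter 't' is discarded) and "I'll" becomes 'll' (the single-letter 'i' is discarded), which is why these fragments rank so high in the counts.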

There it is! You now know Coldplay's lyrics better than the band does!