<a href="https://colab.research.google.com/github/acastellanos-ie/NLP-MBD-EN-PT/blob/main/tagging_parsing_practice/bag_of_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Google Colab Configuration

**Run these steps to set up the Google Colab environment for this notebook. They are not required if you are running it locally and have already configured your local environment as explained in the GitHub repository.**

The first step is to clone the repository so that you have access to all the data and files.

In [1]:
repository_name = "NLP-MBD-EN-PT"
repository_url = 'https://github.com/acastellanos-ie/' + repository_name

In [2]:
! git clone $repository_url

Cloning into 'NLP-MBD-EN-PT'...
remote: Enumerating objects: 4452, done.
remote: Counting objects: 100% (112/112), done.
remote: Compressing objects: 100% (108/108), done.
remote: Total 4452 (delta 71), reused 7 (delta 3), pack-reused 4340
Receiving objects: 100% (4452/4452), 14.38 MiB | 20.68 MiB/s, done.
Resolving deltas: 100% (172/172), done.


Install the requirements

In [None]:
! pip install -Uqqr $repository_name/requirements.txt

     12.0/12.0 MB 26.8 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done


Now you have everything you need to execute the code in Colab

# Bag-of-words

In [None]:
import nltk
nltk.download('shakespeare')
nltk.download('stopwords')

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
import numpy as np

[nltk_data] Downloading package shakespeare to /root/nltk_data...
[nltk_data]   Unzipping corpora/shakespeare.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


The `nltk` library includes several corpora for experimentation. In this notebook we use the corpus containing a set of Shakespeare's plays.

The following cell loads the corpus and creates a dataframe with the name of each book and its textual content.

In [None]:
shakespeare_df = pd.DataFrame(columns=["book", "words"])
for ii, book in enumerate(nltk.corpus.shakespeare.fileids()):
    shakespeare_df.loc[ii] = (book, " ".join(nltk.corpus.shakespeare.words(book)))
print(shakespeare_df)

           book                                              words
0   a_and_c.xml  The Tragedy of Antony and Cleopatra Dramatis P...
1     dream.xml  A Midsummer Night ' s Dream Dramatis Personae ...
2    hamlet.xml  The Tragedy of Hamlet , Prince of Denmark Dram...
3  j_caesar.xml  The Tragedy of Julius Caesar Dramatis Personae...
4   macbeth.xml  The Tragedy of Macbeth Dramatis Personae DUNCA...
5  merchant.xml  The Merchant of Venice Dramatis Personae The D...
6   othello.xml  The Tragedy of Othello , the Moor of Venice Dr...
7   r_and_j.xml  The Tragedy of Romeo and Juliet Text placed in...


While this representation can be useful for humans, it is of little use as input to an NLP system.

As we discussed in class, we need to create the document-term matrix, which will be the input for any NLP system built on top of it. In the document-term matrix there is a row for each document (here, each of Shakespeare's plays) and a column for each distinct word in the dataset. Each cell holds the weight of the word in the document (for example, how many times the word appears in the document).
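
To make the structure concrete, here is a minimal sketch that builds a count-based document-term matrix by hand. The three sentences and the names `toy_docs` and `toy_dt_matrix` are made up for illustration and are not part of the Shakespeare corpus; splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

import pandas as pd

# Three tiny, made-up documents (not taken from the Shakespeare corpus).
toy_docs = {
    "doc_a": "the king loves the queen",
    "doc_b": "the queen loves the crown",
    "doc_c": "a crown for a king",
}

# Count the words of each document; whitespace splitting is enough here.
counts = {name: Counter(text.split()) for name, text in toy_docs.items()}

# One row per document, one column per distinct word, 0 where a word is absent.
toy_dt_matrix = pd.DataFrame.from_dict(counts, orient="index").fillna(0).astype(int)
toy_dt_matrix = toy_dt_matrix.sort_index(axis=1)
print(toy_dt_matrix)
```

Each row is one document and each column one word, which is the same layout the `CountVectorizer` cells in this notebook produce at a much larger scale.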

In class we presented several weighting approaches. Let's see how we can create them.

Let's start with the simplest one: binary weighting. Binary weighting only records whether a word appears (1) or does not appear (0) in a document.

In [None]:
binary_weighting = CountVectorizer(binary=True)
binary_shakespeare = binary_weighting.fit_transform(shakespeare_df.words)
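# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
binary_dt_matrix = pd.DataFrame(binary_shakespeare.toarray(), columns=binary_weighting.get_feature_names_out())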
print(binary_dt_matrix)

   1992  1996  1998  1999  abandon  ...  youthful  youths  zeal  zone  zounds
0     0     0     0     0        0  ...         0       0     0     0       0
1     0     0     0     0        0  ...         0       0     0     0       0
2     0     0     0     0        0  ...         0       0     0     1       0
3     0     0     0     0        0  ...         1       1     0     0       0
4     0     0     0     0        0  ...         0       1     0     0       0
5     0     0     0     0        0  ...         1       0     1     0       0
6     0     0     0     0        1  ...         0       0     0     0       1
7     1     1     1     1        0  ...         1       0     0     0       1

[8 rows x 11316 columns]


Let's inspect the most and least important terms for document 6 (Othello).

In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(binary_dt_matrix.iloc[:, np.argsort(binary_dt_matrix.loc[document])[::-1]].iloc[document][-25:])



25 most important terms for document othello.xml
zounds         1
practiser      1
potent         1
potential      1
determined     1
determine      1
potting        1
pottle         1
pour           1
determinate    1
poverty        1
destruction    1
power          1
powerful       1
powers         1
pox            1
practise       1
potations      1
devesting      1
device         1
devotion       1
portents       1
dew            1
devout         1
position       1
Name: 6, dtype: int64
25 least important terms for document othello.xml
outrage        0
origin         0
orient         0
organs         0
organ          0
ore            0
ordnance       0
osier          0
osric          0
ossa           0
ostent         0
ostentation    0
ostents        0
ought          0
ounce          0
ourself        0
ousel          0
outbrave       0
outbreak       0
outcries       0
outcry         0
outface        0
outlawry       0
outlives       0
1992           0
Name: 6, dtype: int64


As you can see, this representation is not very useful as it is: knowing only whether a word appears in a document gives us little information. **Can you think of a situation where binary weighting would be sufficient?**

The next step is to record whether a word appears only once or several times.

In [None]:
tf_weighting = CountVectorizer()
tf_shakespeare = tf_weighting.fit_transform(shakespeare_df.words)
tf_dt_matrix = pd.DataFrame(tf_shakespeare.toarray(), columns=tf_weighting.get_feature_names_out())
print(tf_dt_matrix)

   1992  1996  1998  1999  abandon  ...  youthful  youths  zeal  zone  zounds
0     0     0     0     0        0  ...         0       0     0     0       0
1     0     0     0     0        0  ...         0       0     0     0       0
2     0     0     0     0        0  ...         0       0     0     1       0
3     0     0     0     0        0  ...         1       1     0     0       0
4     0     0     0     0        0  ...         0       1     0     0       0
5     0     0     0     0        0  ...         1       0     1     0       0
6     0     0     0     0        1  ...         0       0     0     0       3
7     1     1     1     1        0  ...         3       0     0     0       2

[8 rows x 11316 columns]


Ok, now we have the words weighted according to how many times they appear in the document.
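
The difference between the two weightings is easier to see on something small. The sketch below runs both vectorizers on two made-up sentences (the variable names are illustrative only) and compares the column for the word `be`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences; "be" occurs twice in the first and never in the second.
toy_docs = ["to be or not to be", "to sleep perchance to dream"]

# Binary weighting: 1 if the word occurs at all, 0 otherwise.
binary_counts = CountVectorizer(binary=True).fit_transform(toy_docs).toarray()

# Term-frequency weighting: the number of occurrences.
raw_counts_vectorizer = CountVectorizer()
raw_counts = raw_counts_vectorizer.fit_transform(toy_docs).toarray()

# vocabulary_ maps each term to its column index (same order in both matrices).
vocabulary = raw_counts_vectorizer.vocabulary_
be_col = vocabulary["be"]
print("binary:", binary_counts[:, be_col])
print("counts:", raw_counts[:, be_col])
```

Both matrices share the same columns; only the cell values change, from presence flags to occurrence counts.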

Let's now check the most and least important words in Othello.

In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_dt_matrix.iloc[:, np.argsort(tf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document othello.xml
and          794
the          761
to           629
you          486
of           475
my           416
that         395
iago         360
in           341
othello      336
not          318
it           317
is           309
me           281
cassio       254
he           246
for          240
desdemona    230
be           226
with         221
but          221
this         220
do           219
her          215
have         207
Name: 6, dtype: int64
25 least important terms for document othello.xml
outrage        0
origin         0
orient         0
organs         0
organ          0
ore            0
ordnance       0
osier          0
osric          0
ossa           0
ostent         0
ostentation    0
ostents        0
ought          0
ounce          0
ourself        0
ousel          0
outbrave       0
outbreak       0
outcries       0
outcry         0
outface        0
outlawry       0
outlives       0
1992           0
Name: 6, dtype: int64


**What problem do you see with the most important words? Are they really representative?**



Let's check now how to create the TF-IDF weighting to see if we can improve this representation

In [None]:
tf_idf_weighting = TfidfVectorizer()
tf_idf_shakespeare = tf_idf_weighting.fit_transform(shakespeare_df.words)
tf_idf_dt_matrix = pd.DataFrame(tf_idf_shakespeare.toarray(), columns=tf_idf_weighting.get_feature_names_out())
print(tf_idf_dt_matrix)

       1992      1996      1998  ...      zeal      zone    zounds
0  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
1  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
2  0.000000  0.000000  0.000000  ...  0.000000  0.000869  0.000000
3  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
4  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
5  0.000000  0.000000  0.000000  ...  0.001329  0.000000  0.000000
6  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.002445
7  0.001163  0.001163  0.001163  ...  0.000000  0.000000  0.001950

[8 rows x 11316 columns]
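
To see where these numbers come from, the following sketch recomputes TF-IDF by hand on three made-up sentences and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`, no sublinear scaling), under which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1` and each row is then scaled to unit length. The toy sentences and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sentences; "speaks" and "and" occur in every document.
toy_docs = [
    "iago speaks and othello listens",
    "hamlet speaks and horatio listens",
    "macbeth speaks and speaks",
]

# Raw term frequencies.
count_vectorizer = CountVectorizer()
tf = count_vectorizer.fit_transform(toy_docs).toarray().astype(float)

# Smoothed inverse document frequency (scikit-learn's default formula).
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then L2-normalise each document row.
weighted = tf * idf
manual_tf_idf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare with the library result.
sklearn_tf_idf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print("matches TfidfVectorizer:", np.allclose(manual_tf_idf, sklearn_tf_idf))

and_col = count_vectorizer.vocabulary_["and"]
print("idf of a term found in every document:", idf[and_col])
```

With this formula, a term that occurs in every document gets an IDF of 1 rather than 0, so its term frequency is kept rather than cancelled out.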


In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(tf_idf_dt_matrix.iloc[:, np.argsort(tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document othello.xml
iago         0.350125
othello      0.326783
and          0.308385
the          0.295568
cassio       0.247033
to           0.244300
desdemona    0.223691
you          0.188760
of           0.184487
my           0.161572
that         0.153416
emilia       0.134215
in           0.132442
not          0.123509
it           0.123121
is           0.120014
me           0.109139
roderigo     0.101147
he           0.095545
for          0.093215
be           0.087777
but          0.085835
with         0.085835
this         0.085447
do           0.085058
Name: 6, dtype: float64
25 least important terms for document othello.xml
outrage        0.0
origin         0.0
orient         0.0
organs         0.0
organ          0.0
ore            0.0
ordnance       0.0
osier          0.0
osric          0.0
ossa           0.0
ostent         0.0
ostentation    0.0
ostents        0.0
ought          0.0
ounce          0.0
ourself        0.0
ousel          0.0
outb

**What do you see now in the representation? Have we solved all the problems?**

# StopWords

In the previous section we ran into some problems caused by stopwords, such as `and` or `of`. These words carry little meaning on their own and are unlikely to provide any advantage for a subsequent NLP task, so it is generally safe to remove them.

Let's see how to do it via NLTK.

Since stopwords are language-dependent, NLTK provides a list for several languages.

In [None]:
from nltk.corpus import stopwords
print("Languages for which NLTK provides a stopword list:", ", ".join(stopwords.fileids()))

Languages for which NLTK provides a stopword list: arabic, azerbaijani, danish, dutch, english, finnish, french, german, greek, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, turkish


We are just interested in the English stopword list

In [None]:
print("Example of 25 English stopwords:", ", ".join(stopwords.words("english")[:25]))

Example of 25 English stopwords: i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers


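We can use a list like this to remove these words from our representation and create the document-term matrix without them. Note that `stop_words='english'` in the next cell selects scikit-learn's own built-in English list, not the NLTK list printed above; to use the NLTK list, pass `stopwords.words("english")` as `stop_words` instead. Let's check.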

In [None]:
sw_free_tf_idf_weighting = TfidfVectorizer(stop_words='english')
sw_free_tf_idf_shakespeare = sw_free_tf_idf_weighting.fit_transform(shakespeare_df.words)
sw_free_tf_idf_dt_matrix = pd.DataFrame(sw_free_tf_idf_shakespeare.toarray(), columns=sw_free_tf_idf_weighting.get_feature_names_out())
print(sw_free_tf_idf_dt_matrix)

       1992      1996      1998  ...      zeal      zone    zounds
0  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
1  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
2  0.000000  0.000000  0.000000  ...  0.000000  0.001609  0.000000
3  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
4  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
5  0.000000  0.000000  0.000000  ...  0.002945  0.000000  0.000000
6  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.003787
7  0.001902  0.001902  0.001902  ...  0.000000  0.000000  0.003189

[8 rows x 11048 columns]


In [None]:
document = 6
print("25 most important terms for document", shakespeare_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][:25])

print("25 least important terms for document", shakespeare_df.iloc[document]['book'])
print(sw_free_tf_idf_dt_matrix.iloc[:, np.argsort(sw_free_tf_idf_dt_matrix.loc[document])[::-1]].iloc[document][-25:])

25 most important terms for document othello.xml
iago            0.542255
othello         0.506104
cassio          0.382591
desdemona       0.346441
emilia          0.207864
roderigo        0.156651
thou            0.086619
brabantio       0.070794
lodovico        0.066276
moor            0.064270
venice          0.059331
shall           0.058348
good            0.055340
montano         0.054225
tis             0.051130
come            0.050528
let             0.049927
lord            0.048723
thy             0.047520
love            0.046919
ll              0.045716
handkerchief    0.045188
thee            0.045114
know            0.043310
bianca          0.042175
Name: 6, dtype: float64
25 least important terms for document othello.xml
overhear        0.0
outbreak        0.0
outbrave        0.0
ousel           0.0
ourself         0.0
ounce           0.0
ought           0.0
outrun          0.0
outside         0.0
outstare        0.0
outstretched    0.0
outstrike       0.0
outswear    

It's much better now, isn't it?
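
The ranking for Othello still contains archaic function words such as `thou`, `thy` and `thee`, which a modern English stopword list does not cover. One possible way to handle them is to pass an explicit list to `stop_words`. The sketch below uses two made-up sentences and a hypothetical hand-written list, `custom_stop_words`; a fuller list, such as the NLTK one extended with a few extra words, could be passed in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences for illustration.
toy_docs = [
    "thou art the moor of venice",
    "the merchant of venice and his bond",
]

# Hypothetical hand-written list; it is an assumption for this example,
# not a list shipped with NLTK or scikit-learn.
custom_stop_words = ["the", "of", "and", "his", "thou", "art"]

# stop_words accepts a list of strings as well as the string 'english'.
vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
vectorizer.fit(toy_docs)

# vocabulary_ holds only the terms that survived the filtering.
remaining_terms = sorted(vectorizer.vocabulary_)
print(remaining_terms)
```

Only the content words remain in the vocabulary, so they are the only columns the document-term matrix would have.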

Try playing with the previous code: change the document to see how the different weightings affect its representation, or use a different corpus from the ones included in NLTK.
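
For this kind of experimentation, a small helper can replace the repeated `np.argsort` expressions. This is only a sketch: `top_terms` is a hypothetical function name, and the toy matrix stands in for any of the document-term matrices built in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer


def top_terms(dt_matrix: pd.DataFrame, document: int, k: int = 25) -> pd.Series:
    """Return the k highest-weighted terms of one row of a document-term matrix."""
    return dt_matrix.iloc[document].sort_values(ascending=False).head(k)


# Made-up documents standing in for the plays.
toy_docs = ["iago iago othello handkerchief", "hamlet hamlet horatio ghost ghost ghost"]
vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(toy_docs).toarray()

# Rebuild the column labels from vocabulary_ (term -> column index).
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
toy_matrix = pd.DataFrame(weights, columns=columns)

print(top_terms(toy_matrix, document=1, k=2))
```

Terms with equal weights may come out in a different order than in the cells above, because `sort_values` and the reversed `np.argsort` break ties differently.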