## LEGALST-190 Lab 3/6

---

In this lab, students will learn about a dominant language model in natural language processing and the basics of how to implement it in Python. We'll be using the data you extracted in the last lab (un-debates-2001-clean.csv).


In [3]:
# dependencies
from datascience import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")
%matplotlib inline

### Overview

Here we will discuss one widely used representation of text:
- <b>Bag-of-Words Encoding</b>: encodes text by the frequency of each word

This model was very popular in early text analysis, and continues to be used today. In fact, the models that have replaced it are still very difficult to actually interpret, giving the BoW approach a slight advantage if we want to understand why the model makes certain decisions. Once we have our BoW model we can analyze it in a high-dimensional vector space, which gives us more insights into the similarities and clustering of different texts.

In [4]:
## retrieve our data
data = Table.read_table(..., index_col=0)
data.show(5)

session,year,country,text,tokens
56,2001,COM,"﻿On behalf of the Comorian delegation, which I have the ...",﻿on behalf comorian deleg i honour lead behalf i offer s ...
56,2001,RWA,"﻿It is a great honour for me, on behalf of the Rwandan  ...",﻿it great honour behalf rwandan deleg join previous spea ...
56,2001,MMR,"﻿On behalf of the delegation of the Union of Myanmar, I ...",﻿on behalf deleg union myanmar i wish extend warmest con ...
56,2001,PHL,"﻿Let me begin by congratulating Your Excellency, Mr. Ha ...",﻿let begin congratul your excel mr han seungsoo elect pr ...
56,2001,MRT,"﻿I am delighted to be able to congratulate you, Sir, on ...",﻿i delight abl congratul sir behalf deleg islam republ m ...


In [11]:
## let's store our text and our tokens into a list
text_list = ...
tokens_list = ...

## Bag-of-Words Encoding

The bag-of-words encoding is widely used and a standard representation for text in many of the popular text clustering algorithms.

__Key Things to Note:__

1. __Stop words are removed.__ Stop words are words like _is_ and _about_ that in isolation contain very little information about the meaning of the sentence. Here is a good list of stop-words in many languages.
2. __Word order information is lost.__ Nonetheless, the vector still suggests that the sentence is about fun, machines, and learning, though there are many possible meanings: _learning machines have fun learning_ or _learning about machines is fun learning_ ...
3. __Capitalization and punctuation__ are typically removed.
4. __Sparse Encoding:__ is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings) and so instantiating a 0 for every word that is not in each record would be incredibly inefficient.

Why is it called a __bag-of-words__?

__SOLUTION:__ A bag is another term for a __multiset__: _an unordered collection which may contain multiple instances of each element._
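
As a small illustration of the multiset idea, the sketch below uses Python's `Counter` on two made-up sentences (they are assumptions for illustration, not part of the lab data). Both contain the same words in a different order, so their bags are identical.

```python
from collections import Counter

# Two hypothetical sentences: same words, different order.
sentence_a = "learning machines have fun learning"
sentence_b = "machines have fun learning learning"

# A Counter acts like a multiset: it records how many times each
# word occurs, but not where in the sentence it occurs.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
print(bag_a == bag_b)  # word order is lost, so the bags compare equal
```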

### Implementing the Bag-of-words Model

We can use sklearn to construct a bag-of-words representation of text. Create an instance of CountVectorizer.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the tokenizer.
cv = ...
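
To see what a `CountVectorizer` does to raw text before any counting happens, here is a minimal sketch on a made-up sentence. The `stop_words="english"` setting and the names `toy_cv` and `toy_tokens` are assumptions chosen for illustration; they are one possible configuration, not necessarily the one intended for `cv` above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical configuration: drop scikit-learn's built-in English stop words.
toy_cv = CountVectorizer(stop_words="english")

# build_analyzer() returns the function the vectorizer applies to each
# document: lowercase, split into word tokens, then drop stop words.
analyze = toy_cv.build_analyzer()
toy_tokens = analyze("The Delegation is about Peace!")

print(toy_tokens)
```

Capitalization, punctuation, and the stop words are gone, which matches points 1 and 3 in the list above.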

### Review of Tokens

If you remember from the last lab, we created tokens and added them to our table. Normally at this stage you would create the tokens yourself, so let's introduce a new tool called `Counter`.

The easiest way to count tokens is using the `Counter` object from `collections`. This will give you back a dictionary with the token counts:

In [None]:
# no need to worry too much in detail about using counter in this lab
from collections import Counter
Counter(tokens_list) # run this to better understand where and why you would use counter
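
One thing to keep in mind: `Counter` counts the items of whatever it is given. If each entry of a list is a whole string of space-separated tokens, each entry is treated as a single item. The sketch below, using a made-up token string in the same style as the `tokens` column, splits the string first so that individual tokens are counted.

```python
from collections import Counter

# Hypothetical token string, space-separated like the tokens column.
toy_tokens = "behalf deleg honour behalf deleg behalf"

# Split into individual tokens before counting.
token_counts = Counter(toy_tokens.split())

# most_common(n) lists the n most frequent tokens with their counts.
print(token_counts.most_common(2))
```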

## Document-Term Matrix

After creating the CountVectorizer object, call its `fit_transform` method on our list of documents.

In [15]:
dtm = ...
dtm

<189x11883 sparse matrix of type '<class 'numpy.int64'>'
	with 113584 stored elements in Compressed Sparse Row format>

What's this? A sparse matrix is one in which most cells are zero, so only the nonzero cells are actually stored. Why are most cells zero here?

__SOLUTION__: Not every word is in each of the texts.
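
To make the idea of stored elements concrete, here is a minimal sketch with a small made-up matrix (the values are assumptions for illustration). It also computes the share of nonzero cells for the matrix reported above, using only the numbers in its printed summary.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 4 count matrix with mostly zeros.
dense = np.array([[0, 2, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 0, 3]])

# The CSR format keeps only the nonzero entries and their positions.
sparse = csr_matrix(dense)
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.nnz, density)

# Same calculation for the lab matrix: 113584 stored elements in 189 x 11883 cells.
lab_density = 113584 / (189 * 11883)
print(round(lab_density, 3))  # roughly 5% of the cells are nonzero
```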

In [16]:
# de-sparsify by turning dtm into an array
desparse = ...

# create labels for columns.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word_list = cv.get_feature_names_out()

# create a new table with pandas data frame
dtm_df = ...
dtm_df

Unnamed: 0,000,10,100,106,11,117,1192,12,120,1208,...,zimbabwean,zimbabweans,zine,zionist,zonal,zone,zones,zuma,état,être
0,0,0,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,1,0,0,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
6,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,5,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,1,0,0,3,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9,0,1,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This is what we call a Document Term Matrix, a core concept in NLP and text analysis.

As you can see, there is a column for each word in the entire vocabulary and a row for each text. The values are the counts of that word in the corresponding text. Note that there are many 0s, indicating that the word just doesn't show up in that text!
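
The layout can be sketched on a tiny made-up example. The words and counts below are assumptions for illustration and are not taken from the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical vocabulary (columns) and counts for two texts (rows).
toy_words = ["assembly", "peace", "zone"]
toy_counts = np.array([[1, 0, 2],
                       [0, 3, 0]])

toy_df = pd.DataFrame(toy_counts, columns=toy_words)

# Selecting a column gives that word's count in every text.
zone_counts = toy_df["zone"]

# Summing the column gives the word's total count across all texts,
# while counting the nonzero rows gives the number of texts that use it.
zone_total = zone_counts.sum()
zone_texts = (zone_counts > 0).sum()
print(zone_total, zone_texts)
```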

In [19]:
# can easily find the frequencies for each of the given words
dtm_df['zone']

0      0
1      0
2      0
3      0
4      0
5      2
6      0
7      0
8      1
9      0
10     0
11     0
12     0
13     0
14     0
15     1
16     0
17     0
18     0
19     0
20     3
21     0
22     0
23     0
24     0
25     0
26     0
27     0
28     0
29     1
      ..
159    0
160    0
161    0
162    0
163    0
164    1
165    0
166    0
167    0
168    0
169    0
170    0
171    0
172    0
173    0
174    0
175    0
176    0
177    0
178    1
179    0
180    0
181    2
182    0
183    0
184    0
185    0
186    0
187    0
188    0
Name: zone, Length: 189, dtype: int64

In [21]:
# what's the total number of times the word 'zone' pops up?
...

189

## Normalization

Let's see if we can take another step and make fair comparisons across the texts. We can normalize the values by dividing each word count by the total number of words in the text. We'll need to sum on axis=1 (summing across each row, as each row is a text), as opposed to summing down each column.

Once we have the total number of words in each text, we can express each word's count as a proportion of that total, and apply the same calculation to every other word across the matrix.
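
The arithmetic can be sketched on a small made-up array (the values and the names `toy_counts`, `row_totals`, and `toy_normed` are assumptions for illustration). The one subtlety is the shape of the row sums: passing `keepdims=True` keeps them as a column, so NumPy divides each row by its own total.

```python
import numpy as np

# Hypothetical counts: 2 texts (rows) by 3 words (columns).
toy_counts = np.array([[2, 0, 2],
                       [1, 3, 0]])

# Sum across each row; keepdims=True gives shape (2, 1) instead of (2,),
# which lets the division below broadcast row by row.
row_totals = toy_counts.sum(axis=1, keepdims=True)

# Each entry becomes that word's share of its text.
toy_normed = toy_counts / row_totals
print(toy_normed)
```

After this step every row sums to 1, so long and short texts can be compared on the same scale.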

In [22]:
# see if you can fill this out on your own following the steps listed above.
row_sums = ... # sum up the desparse on axis=1
normed = ... # divide desparse by row_sums
dtm_df = ... # create a data frame using the word_list and the new data
dtm_df

Unnamed: 0,000,10,100,106,11,117,1192,12,120,1208,...,zimbabwean,zimbabweans,zine,zionist,zonal,zone,zones,zuma,état,être
0,0.000000,0.000000,0.000000,0.0,0.001833,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
1,0.000000,0.000000,0.000000,0.0,0.001225,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
2,0.000000,0.001029,0.000000,0.0,0.000000,0.000000,0.000000,0.002058,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
3,0.001316,0.001316,0.000000,0.0,0.002632,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
4,0.000000,0.001838,0.000000,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
5,0.000000,0.000000,0.000000,0.0,0.001391,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.002782,0.000000,0.0,0.0,0.000000
6,0.000000,0.000000,0.000000,0.0,0.001087,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
7,0.000000,0.000000,0.000000,0.0,0.006321,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000
8,0.000000,0.000832,0.000000,0.0,0.002496,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000832,0.000000,0.0,0.0,0.000000
9,0.000000,0.000961,0.000000,0.0,0.000961,0.000000,0.000000,0.000000,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000


## Streamlining

Overall, this was a lot of work, and if it is as common as we say it is in NLP, shouldn't someone have streamlined it before? In fact, we can simply instruct CountVectorizer not to include stop words at all, and another class, TfidfTransformer, handles the normalization easily.

In [29]:
from sklearn.feature_extraction.text import TfidfTransformer
# fill out this beginning part
cv = ...
dtm = ...

# this is what allows us to easily streamline
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)
dtm_tf

<189x11883 sparse matrix of type '<class 'numpy.float64'>'
	with 113584 stored elements in Compressed Sparse Row format>
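
As a quick check of what these settings do, the sketch below compares `TfidfTransformer(norm='l1', use_idf=False)` with dividing each row by its sum, on a small made-up count matrix (the values and the `toy_` names are assumptions for illustration). With non-negative counts, the L1 norm of a row is simply its sum.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical counts: 2 texts (rows) by 3 words (columns).
toy_counts = np.array([[2, 0, 2],
                       [1, 3, 0]])

# use_idf=False turns off inverse-document-frequency weighting, so only
# the L1 normalization of each row is applied.
toy_tt = TfidfTransformer(norm="l1", use_idf=False)
toy_tf = toy_tt.fit_transform(csr_matrix(toy_counts)).toarray()

# Manual version: divide each row by its total.
manual = toy_counts / toy_counts.sum(axis=1, keepdims=True)
print(np.allclose(toy_tf, manual))
```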

Fantastic! There's no need to directly answer this question, but think about how we could perhaps remove the numbers from the matrix in addition to the stop words.
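
If you want to experiment, one possible approach (an assumption, not the only answer) is to change the `token_pattern` argument of CountVectorizer so that tokens must consist of letters only. The pattern below is a hypothetical choice, tried on a made-up sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pattern: two or more Unicode letters, no digits or underscores.
letters_only = r"(?u)\b[^\W\d_]{2,}\b"
toy_cv = CountVectorizer(token_pattern=letters_only)

# The analyzer shows which tokens would be counted.
toy_tokens = toy_cv.build_analyzer()("In 2001 the zone had 10 états")
print(toy_tokens)
```

Accented words such as `états` survive because the pattern excludes digits rather than listing the ASCII letters.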