## LEGALST-190 Lab 3/6

---

In this lab, you will learn about a dominant model of text in natural language processing and the basics of how to implement it in Python. We'll be using the data you extracted in the last lab (`un-debates-2001-clean.csv`).


In [1]:
# dependencies
from datascience import *
import numpy as np
import pandas as pd


### Overview

Here we will discuss one widely used representation of text:
- <b>Bag-of-Words Encoding</b>: encodes text by the frequency of each word

This model was very popular in early text analysis, and continues to be used today. In fact, the models that have replaced it are still very difficult to actually interpret, giving the BoW approach a slight advantage if we want to understand why the model makes certain decisions. Once we have our BoW model we can analyze it in a high-dimensional vector space, which gives us more insights into the similarities and clustering of different texts.

In [2]:
## retrieve our data
data = Table.read_table('data/un-debates-2001-clean.csv', index_col=0)
data.show(5)

session,year,country,text,tokens
56,2001,COM,"﻿On behalf of the Comorian delegation, which I have the  ...",﻿on behalf comorian deleg i honour lead behalf i offer s ...
56,2001,RWA,"﻿It is a great honour for me, on behalf of the Rwandan d ...",﻿it great honour behalf rwandan deleg join previous spea ...
56,2001,MMR,"﻿On behalf of the delegation of the Union of Myanmar, I ...",﻿on behalf deleg union myanmar i wish extend warmest con ...
56,2001,PHL,"﻿Let me begin by congratulating Your Excellency, Mr. Han ...",﻿let begin congratul your excel mr han seungsoo elect pr ...
56,2001,MRT,"﻿I am delighted to be able to congratulate you, Sir, on  ...",﻿i delight abl congratul sir behalf deleg islam republ m ...


In [3]:
## let's store our text and our tokens into a list
text_list = data['text']
tokens_list = data['tokens']

## Bag-of-Words Encoding

The bag-of-words encoding is widely used and is the standard representation of text in many popular text clustering algorithms.

__Key Things to Note:__

1. __Stop words are removed.__ Stop words are words like "is" and "about" that, in isolation, contain very little information about the meaning of the sentence.
2. __Word order information is lost.__ 
3. __Capitalization and punctuation__ are typically removed.
4. __Sparse encoding__ is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings), so storing a 0 for every word that does not appear in a record would be incredibly inefficient.

Why is it called a __bag-of-words__?

__SOLUTION:__ A bag is another term for a __multiset__: _an unordered collection which may contain multiple instances of each element._
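
A small sketch can make the multiset idea concrete. The two sentences below are made up for illustration; they contain the same words in a different order, so their bags are identical even though their meanings differ.

```python
from collections import Counter

# Two made-up sentences with the same words in a different order.
sentence_a = "the council rejected the proposal"
sentence_b = "the proposal rejected the council"

# A Counter behaves like a multiset: it keeps counts but not positions.
bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

print(bag_a)
print(bag_a == bag_b)  # True: word order is not part of the bag
```

This is exactly the information that the "word order is lost" note refers to.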

### Implementing the Bag-of-words Model

### Review of Tokens

As you may remember from the last lab, we created tokens and added them to our table. Normally you would create the tokens yourself at this stage; since we already have them, let's move on to counting them with a new tool called `Counter`.

The easiest way to count tokens is with the `Counter` object from `collections`. This gives you back a dictionary-like object that maps each token to its count:

In [16]:
from collections import Counter
first_speech = tokens_list[0].split()
counter = Counter(first_speech) 
counter

Counter({'11': 2,
         '1999': 1,
         '20': 1,
         '2000': 1,
         '2001': 2,
         '2010': 1,
         '21': 1,
         '22': 1,
         '30': 1,
         'a': 1,
         'abid': 1,
         'abov': 1,
         'accept': 1,
         'access': 1,
         'accid': 1,
         'accompani': 1,
         'accord': 1,
         'account': 1,
         'achiev': 1,
         'act': 3,
         'action': 2,
         'activ': 1,
         'addit': 2,
         'administ': 1,
         'admit': 1,
         'adopt': 1,
         'advoc': 1,
         'aeroplan': 1,
         'affect': 4,
         'afflict': 2,
         'africa': 1,
         'african': 2,
         'aid': 4,
         'all': 1,
         'allow': 3,
         'alongsid': 1,
         'alqud': 1,
         'alsharif': 1,
         'also': 5,
         'american': 2,
         'among': 3,
         'anarchi': 1,
         'anjouan': 1,
         'annan': 1,
         'announc': 1,
         'anoth': 1,
         'appeal': 1,
      

The `most_common()` method can be called on a Counter to return the most common tokens and their counts. What are the most common tokens in the first speech? What do these common words tell you about the content or tone of the speech?

In [17]:
counter.most_common()

[('nation', 20),
 ('countri', 19),
 ('comoro', 17),
 ('govern', 16),
 ('i', 13),
 ('world', 13),
 ('peopl', 12),
 ('the', 11),
 ('intern', 11),
 ('organ', 10),
 ('peac', 10),
 ('unit', 10),
 ('terror', 9),
 ('last', 8),
 ('in', 8),
 ('respect', 8),
 ('region', 8),
 ('respons', 7),
 ('order', 7),
 ('this', 7),
 ('islam', 7),
 ('session', 6),
 ('we', 6),
 ('entir', 6),
 ('communiti', 6),
 ('thus', 6),
 ('state', 6),
 ('republ', 6),
 ('must', 6),
 ('today', 6),
 ('’', 6),
 ('one', 6),
 ('problem', 6),
 ('diseas', 6),
 ('island', 6),
 ('also', 5),
 ('us', 5),
 ('feder', 5),
 ('way', 5),
 ('concern', 5),
 ('right', 5),
 ('everi', 5),
 ('situat', 5),
 ('human', 5),
 ('develop', 5),
 ('made', 5),
 ('lead', 4),
 ('express', 4),
 ('led', 4),
 ('great', 4),
 ('interest', 4),
 ('new', 4),
 ('mani', 4),
 ('make', 4),
 ('point', 4),
 ('framework', 4),
 ('effort', 4),
 ('held', 4),
 ('commit', 4),
 ('establish', 4),
 ('author', 4),
 ('part', 4),
 ('solut', 4),
 ('affect', 4),
 ('aid', 4),
 ('comoran
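
Calling `most_common()` with no argument returns every token, which makes for a long list. Passing an integer `n` returns only the `n` most frequent tokens. A minimal sketch on a made-up token list (not the speech data):

```python
from collections import Counter

# Hypothetical stemmed tokens, standing in for one speech.
tokens = "nation peac nation world peac nation terror".split()
toy_counter = Counter(tokens)

# Only the two most frequent tokens, as (token, count) pairs.
top_two = toy_counter.most_common(2)
print(top_two)  # [('nation', 3), ('peac', 2)]
```

On the real data, the equivalent call would be something like `counter.most_common(10)`.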

## Document-Term Matrix

We can use sklearn to construct a bag-of-words representation of text. Create an instance of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the tokenizer.
cv = CountVectorizer()

After creating the CountVectorizer object, call `fit_transform` on our list of documents. `fit_transform` takes in the list of documents we want to represent (in this case, the list of tokenized text).

In [7]:
dtm = cv.fit_transform(tokens_list)
dtm

<189x7874 sparse matrix of type '<class 'numpy.int64'>'
	with 109126 stored elements in Compressed Sparse Row format>

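What's this? A sparse matrix is one in which most cells are zero. Rather than storing every zero, the sparse format stores only the nonzero cells: here, 109,126 of the 189 x 7,874 cells.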
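
To make sparse storage concrete, here is a minimal sketch using `scipy.sparse.csr_matrix` on a made-up 3 x 5 count matrix. The matrix and variable names are illustrative assumptions, not part of the lab data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 3x5 count matrix: most entries are zero.
dense = np.array([
    [0, 2, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [3, 0, 0, 0, 0],
])

sparse = csr_matrix(dense)

print(sparse.shape)  # (3, 5): the logical size is unchanged
print(sparse.nnz)    # 3: only the nonzero counts are stored
print(sparse.data)   # the stored values, row by row

# Converting back gives the same matrix, zeros included.
roundtrip = sparse.toarray()
```

The same attributes (`shape`, `nnz`) can be read from `dtm` to see how few of its cells are actually stored.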

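We can get a better look at what's going on by turning the sparse matrix into a data frame. First, get the list of words in our 'bag-of-words' by using `get_feature_names_out()` on your CountVectorizer (scikit-learn 1.0 and later; older versions call this method `get_feature_names()`).

In [8]:
# create labels for columns. look up the documentation on how to create this word_list.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
word_list = list(cv.get_feature_names_out())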
word_list

['024',
 '03',
 '033',
 '04',
 '07',
 '071',
 '10',
 '100',
 '1000',
 '10000',
 '100000',
 '105',
 '106',
 '10817',
 '10year',
 '11',
 '1135',
 '117',
 '1192',
 '12',
 '120',
 '120000',
 '1244',
 '125',
 '127',
 '1278',
 '12month',
 '12step',
 '13',
 '1300',
 '130000',
 '1306',
 '1314',
 '1325',
 '1333',
 '134',
 '1343',
 '1345',
 '135',
 '1359',
 '1365',
 '1368',
 '1371',
 '1373',
 '1375',
 '1376',
 '1377',
 '1378',
 '14',
 '140',
 '14000',
 '1419',
 '142',
 '1440s',
 '145',
 '147',
 '1492',
 '15',
 '150',
 '15000',
 '150000',
 '1514',
 '1580',
 '15th',
 '15year',
 '16',
 '160',
 '164',
 '167',
 '17',
 '1700',
 '17000',
 '18',
 '1800',
 '181',
 '182',
 '1833',
 '187',
 '1884',
 '189',
 '1890',
 '19',
 '1907',
 '1917',
 '1920s',
 '1930s',
 '194',
 '1940s',
 '1942',
 '1945',
 '1946',
 '1947',
 '1948',
 '1949',
 '1950s',
 '1952',
 '1955',
 '1958',
 '19591960',
 '1960s',
 '1961',
 '1963',
 '1964',
 '1967',
 '1970',
 '1971',
 '1972',
 '1973',
 '1974',
 '1976',
 '1977',
 '1978',
 '1979',
 '

You can then de-sparsify the sparse matrix by turning it into an array. Try using `toarray()` on it.

In [9]:
# de-sparsify by turning dtm into an array
desparse = dtm.toarray()

You now have everything you need to convert your sparse matrix to a DataFrame. Double-check the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for a reminder of how to construct the frame.

In [10]:
# create a new table with pandas data frame
dtm_df = pd.DataFrame(columns=word_list, data=desparse)
dtm_df.head()

Unnamed: 0,024,03,033,04,07,071,10,100,1000,10000,...,zia,ziaur,zimbabw,zimbabwean,zine,zionist,zone,àvis,état,être
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


This is what we call a Document Term Matrix, a core concept in NLP and text analysis.

As you can see, there is one column for each word in the vocabulary and one row for each text. Each value is the number of times that word appears in the corresponding text. Note that there are many 0s, hence the matrix is 'sparse'.
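
To see where the rows and columns come from, here is a minimal sketch that builds a tiny document-term matrix by hand with `Counter` rather than `CountVectorizer`. The three documents are made up. `CountVectorizer` does more than this: by default it also lowercases the text and ignores single-character tokens.

```python
from collections import Counter

# Three made-up, already-tokenized documents.
docs = [
    "peac nation peac",
    "nation develop",
    "terror peac world",
]

counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is every distinct token, sorted so the columns are stable.
vocab = sorted(set(token for c in counts for token in c))

# One row per document, one column per vocabulary word.
# A Counter returns 0 for a word it has never seen.
toy_dtm = [[c[word] for word in vocab] for c in counts]

print(vocab)
for row in toy_dtm:
    print(row)
```

Even in this toy case, more than half of the cells are 0, because each document uses only part of the shared vocabulary.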

Why are there so many zeros?

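__SOLUTION__: Each speech uses only a small fraction of the 7,874 words in the combined vocabulary, so most word counts for any given speech are 0. The sparse matrix above stores 109,126 nonzero cells out of 189 x 7,874 = 1,488,186, or about 7%.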

To see the frequency of a word across documents, index that word's column.

In [11]:
# can easily find the frequencies for each of the given words
dtm_df['zone']

0      0
1      0
2      0
3      0
4      0
5      2
6      0
7      0
8      1
9      0
10     0
11     0
12     0
13     0
14     0
15     2
16     0
17     0
18     0
19     0
20     3
21     0
22     0
23     0
24     0
25     1
26     0
27     0
28     0
29     2
      ..
159    0
160    0
161    0
162    0
163    0
164    1
165    0
166    0
167    0
168    0
169    0
170    0
171    0
172    0
173    0
174    0
175    0
176    0
177    0
178    1
179    0
180    0
181    2
182    1
183    0
184    0
185    0
186    0
187    0
188    0
Name: zone, dtype: int64

In [12]:
# what's the total number of times the word 'zone' pops up?
sum(dtm_df['zone'])

41

In [13]:
# how many word occurrences are in the document at index 100 (the 101st document)?
np.sum(dtm_df.loc[100])

1553

## Normalization

Let's take another step and make the texts comparable with one another. We can normalize the values by dividing each word count by the total number of words in its text. To get those totals, we sum along axis=1 (summing each row, since each row is a text) rather than summing each column.

Once we have the total number of words in each text, we can express every count as the proportion of that text's words that the word accounts for, and apply this to every cell in the matrix.
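
The division relies on NumPy broadcasting, which is easy to get wrong. The sketch below uses a made-up 2 x 3 count matrix to show why the row totals need to be reshaped with `[:, None]` before dividing.

```python
import numpy as np

# Made-up counts: 2 documents (rows) by 3 words (columns).
counts = np.array([
    [2, 1, 1],   # 4 words in total
    [1, 0, 9],   # 10 words in total
])

# Shape (2,): one total per document.
row_sums = counts.sum(axis=1)

# [:, None] reshapes the totals to (2, 1), so each row is divided
# by its own total instead of being matched up with the columns.
normed = counts / row_sums[:, None]

print(normed)
print(normed.sum(axis=1))  # each row now sums to 1
```

Without the reshape, NumPy would try to line up the two totals with the three columns and raise a shape error.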

In [14]:
# see if you can fill this out on your own following the steps listed above.
row_sums = np.sum(desparse, axis=1)
normed = desparse/row_sums[:,None]
dtm_df = pd.DataFrame(columns=word_list, data=normed)
dtm_df.head()

Unnamed: 0,024,03,033,04,07,071,10,100,1000,10000,...,zia,ziaur,zimbabw,zimbabwean,zine,zionist,zone,àvis,état,être
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.000924,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.001126,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.001653,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


When would it be most important to normalize word counts?
1. When you have a lot of documents
2. When you have very few documents
3. When the documents are of many different lengths
4. When the documents are all around the same length

Why?

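__SOLUTION__: 3. A document with 10,000 words may use a word the same number of times as a document with 100 words, but the frequencies will be very different. For example, 5 occurrences make up 5% of a 100-word document but only 0.05% of a 10,000-word document.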

## Streamlining

Overall, this was a lot of work, and if it is as common in NLP as we say it is, shouldn't someone have streamlined it already? In fact, we can simply instruct CountVectorizer to drop stop words, and another class, TfidfTransformer, handles the normalization.

In [15]:
from sklearn.feature_extraction.text import TfidfTransformer
# the old sklearn.feature_extraction.stop_words module has been removed; import from .text instead
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

engl_stop_words = list(ENGLISH_STOP_WORDS)

# fill out this beginning part
cv = CountVectorizer(stop_words=engl_stop_words)
dtm = cv.fit_transform(text_list)

# this is what allows us to easily streamline
tt = TfidfTransformer(norm='l1',use_idf=False)
dtm_tf = tt.fit_transform(dtm)
dtm_tf

<189x11883 sparse matrix of type '<class 'numpy.float64'>'
	with 113584 stored elements in Compressed Sparse Row format>
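
To check that `TfidfTransformer(norm='l1', use_idf=False)` does the same thing as the manual normalization in the previous section, here is a minimal sketch on three made-up documents. The documents and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents, for illustration only.
docs = ["peace and security", "peace peace development", "security council"]

counts = CountVectorizer().fit_transform(docs)

# norm='l1' divides each row by the sum of its absolute values;
# use_idf=False skips the inverse-document-frequency weighting.
tf = TfidfTransformer(norm='l1', use_idf=False).fit_transform(counts)

# The manual normalization: each count divided by its row total.
dense = counts.toarray()
manual = dense / dense.sum(axis=1)[:, None]

same = np.allclose(tf.toarray(), manual)
print(same)  # True
```

The practical difference is that the transformer keeps the result sparse, whereas the manual version builds a full dense array first.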

Fantastic! You don't need to write down an answer, but think about how we could remove the numbers from the matrix in addition to the stop words.
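
One possible approach, sketched on a made-up vocabulary in the style of the feature names seen earlier, is to filter the word list and keep only the entries that contain no digits. This is an illustration, not the only answer; a custom `token_pattern` passed to CountVectorizer would be another route.

```python
# Made-up vocabulary in the style of the feature names above.
vocab = ['024', '10year', '1945', 'nation', 'peac', 'zone']

# Keep only the words that contain no digits at all.
keep = [word for word in vocab if not any(ch.isdigit() for ch in word)]

print(keep)  # ['nation', 'peac', 'zone']
```

If the matrix has been converted to a DataFrame with the words as column labels, as earlier in the lab, a list like `keep` could then be used to select just those columns.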

---
## Bibliography

- Document Term Matrix, normalization markdown and code adapted from materials by Chris Hench: https://github.com/henchc/textxd-2017/blob/master/06-DTM.ipynb

---
Notebook developed by: Gibson Chu

Data Science Modules: http://data.berkeley.edu/education/modules