## LEGALST-190 Lab 3/6

---

In this lab, you will learn about a dominant language model in natural language processing, the bag-of-words model, and the basics of implementing it in Python. We'll be using the data you extracted in the last lab (un-debates-2001-clean.csv).


In [None]:
# dependencies
from datascience import *
import numpy as np
import pandas as pd

### Overview

Here we will discuss one widely used representation of text:
- <b>Bag-of-Words Encoding</b>: encodes text by the frequency of each word

This model was very popular in early text analysis, and continues to be used today. In fact, the models that have replaced it are still very difficult to actually interpret, giving the BoW approach a slight advantage if we want to understand why the model makes certain decisions. Once we have our BoW model we can analyze it in a high-dimensional vector space, which gives us more insights into the similarities and clustering of different texts.

In [None]:
## retrieve our data
data = Table.read_table(..., index_col=0)
data.show(5)

In [None]:
## let's store our text and our tokens in two lists
text_list = ...
tokens_list = ...

## Bag-of-Words Encoding

The bag-of-words encoding is a widely used, standard representation for text in many popular text clustering algorithms.

__Key Things to Note:__

1. __Stop words are removed.__ Stop words are words like 'is' and 'about' that in isolation contain very little information about the meaning of the sentence.
2. __Word order information is lost.__
3. __Capitalization and punctuation__ are typically removed.
4. __Sparse encoding__ is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings), so storing a 0 for every word that does not appear in a record would be incredibly inefficient.
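
The first three points can be sketched by hand on two made-up sentences. This is only an illustration: the stop-word set `toy_stop_words` and the helper `bag_of_words` are hypothetical names invented here, and the stop-word list is far shorter than any real one.

```python
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (real lists are much longer).
toy_stop_words = {"is", "about", "the", "a", "of"}

def bag_of_words(sentence):
    """Lowercase, strip punctuation, drop stop words, and count what is left."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in cleaned.split() if tok not in toy_stop_words]
    return Counter(tokens)

# Two made-up sentences that use the same words in a different order.
bow_a = bag_of_words("Peace is the goal of the Assembly.")
bow_b = bag_of_words("The Assembly: the goal is peace!")

print(bow_a)
print(bow_a == bow_b)  # the two encodings can be compared directly
```

Comparing `bow_a` and `bow_b` shows what the encoding keeps and what it throws away.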

Why is it called a __bag-of-words__?

__SOLUTION:__

### Implementing the Bag-of-words Model

### Review of Tokens

If you remember from the last lab, we created tokens and added them to our table. Normally you would tokenize the text yourself at this stage; since that is already done, we can move straight to counting the tokens.

The easiest way to count tokens is with the `Counter` class from `collections`. It gives you back a dictionary-like object that maps each token to its count:
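
As a minimal sketch of how `Counter` behaves, here it is on a made-up string rather than a real speech (`toy_speech` and `toy_counter` are illustrative names):

```python
from collections import Counter

# A made-up, already-cleaned string standing in for one speech.
toy_speech = "peace security peace development security peace"

# Split on whitespace, then count each token.
toy_counter = Counter(toy_speech.split())

print(toy_counter["peace"])    # count for a token that occurs
print(toy_counter["missing"])  # absent tokens count as 0 instead of raising KeyError
```

Because a `Counter` is a `dict` subclass, the usual dictionary methods such as `.items()` and `.keys()` work on it as well.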

In [None]:
from collections import Counter
# extract the first speech in tokens_list, split it by whitespace, then put it into a Counter
first_speech = ...
counter = Counter(first_speech) 
counter

The `most_common()` method can be called on a Counter to return the most common tokens and their counts. What are the most common tokens in the first speech? What do these common words tell you about the content or tone of the speech?
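
For reference, a small sketch of the shape of what `most_common()` returns, using made-up tokens (`toy_tokens` is an illustrative name): a list of `(token, count)` tuples, sorted from most to least frequent. Passing an integer limits the list to that many entries.

```python
from collections import Counter

toy_tokens = "peace security peace development security peace".split()

# Ask for only the two most frequent tokens.
top_two = Counter(toy_tokens).most_common(2)

for token, count in top_two:
    print(token, count)
```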

In [None]:
...

## Document-Term Matrix

We can use sklearn to construct a bag-of-words representation of text. Create an instance of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the vectorizer.
cv = ...

After creating the CountVectorizer object, use `fit_transform` on it. `fit_transform` takes in the list of documents we want to represent (in this case, the list of tokenized text).
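
A minimal sketch of the same two steps on three made-up documents (not the UN debates data) may help show what comes back. The names `toy_docs`, `toy_cv`, and `toy_dtm` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up documents standing in for the speeches.
toy_docs = [
    "peace and security",
    "security council reform",
    "peace peace development",
]

toy_cv = CountVectorizer()                # default settings
toy_dtm = toy_cv.fit_transform(toy_docs)  # learn the vocabulary, then count each document

print(toy_dtm.shape)       # (number of documents, number of distinct words)
print(toy_cv.vocabulary_)  # maps each word to its column index
```

`fit_transform` does two things at once: `fit` learns the vocabulary from the documents, and `transform` counts each vocabulary word in each document.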

In [None]:
dtm = ...
dtm

What's this? A sparse matrix stores only the nonzero cells, which saves memory when most cells in the table are 0.
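
To make the idea concrete, here is a small sketch with a made-up count matrix. It uses SciPy's `csr_matrix`, one of several sparse formats; the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A made-up 3x6 count matrix that is mostly zeros.
dense_counts = np.array([
    [1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1],
    [0, 0, 1, 2, 0, 0],
])
sparse_counts = csr_matrix(dense_counts)

total_cells = dense_counts.size   # every cell of the dense table
stored_cells = sparse_counts.nnz  # only the nonzero cells are stored

print(total_cells, stored_cells)
```

With a vocabulary of thousands of words, the gap between the two numbers becomes very large.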

We can get a better look at what's going on by turning the sparse matrix into a data frame. First, get the list of words in our 'bag-of-words' by calling `get_feature_names_out()` on your CountVectorizer (in scikit-learn versions before 1.0, the method is `get_feature_names()`).

In [None]:

# create labels for columns.
word_list = ...
word_list

You can then de-sparsify the sparse matrix by turning it into an array. Try using `toarray()` on it.

In [None]:
# de-sparsify by turning dtm into an array
desparse = ...
desparse

You now have everything you need to convert your sparse matrix to a DataFrame. Double-check the [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for a reminder of how to construct the frame.

In [None]:
# create a dataframe with words as columns and the de-sparsified counts as the data
dtm_df = ...
dtm_df.head()

This is what we call a Document Term Matrix, a core concept in NLP and text analysis.

As you can see, there is one column for each word in the vocabulary and one row for each text. The values are the counts of that word in the corresponding text. Note that there are many 0s, hence the matrix is 'sparse'.

Why are there so many zeros?

__SOLUTION__: 

To see the frequency of a word across documents, index that word's column.
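
As a sketch on a tiny hand-built frame (made-up counts, illustrative names), column indexing and sums work like this:

```python
import pandas as pd

# Rows are documents, columns are words; the counts are invented.
toy_df = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1], [2, 0, 0]],
    columns=["peace", "reform", "security"],
)

peace_per_doc = toy_df["peace"]            # one count per document
peace_total = toy_df["peace"].sum()        # total occurrences across all documents
words_in_first_doc = toy_df.iloc[0].sum()  # total counted words in the first row

print(peace_total, words_in_first_doc)
```

Note that `iloc` is zero-indexed, so the first document is at position 0.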

In [None]:
# can easily find the frequencies for each of the given words
dtm_df['zone']

In [None]:
# what's the total number of times the word 'zone' pops up?
...

In [None]:
# how many words appear in the 100th document?
...

## Normalization

Let's take another step and make the texts directly comparable. We can normalize the values by dividing each word count by the total number of words in its text. To get those totals we sum along axis=1 (across each row, since each row is a text) rather than down each column.

Once we have the total number of words in a text, dividing a word's count by that total gives the proportion of the text that the word accounts for. Applying this to every cell normalizes the whole matrix.
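
A minimal sketch of this row-wise normalization with NumPy, on a made-up count array. Using `keepdims=True` is one way to get the shapes to line up for the division; it is a choice made for this sketch, and there are other ways to do it.

```python
import numpy as np

# Rows are texts, columns are words; the counts are invented.
toy_counts = np.array([
    [1, 0, 1],
    [0, 1, 1],
    [2, 0, 0],
])

# keepdims=True keeps the sums as a column, so each row is divided by its own total.
toy_row_sums = toy_counts.sum(axis=1, keepdims=True)
toy_normed = toy_counts / toy_row_sums

print(toy_normed)
print(toy_normed.sum(axis=1))  # sanity check on the row totals
```

After normalizing, each row should sum to 1, which is a quick way to check the result.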

In [None]:
# see if you can fill this out on your own following the steps listed above.
row_sums = ... # sum desparse along axis=1 to get one total per text
normed = ... # divide each row of desparse by its row sum
dtm_df = ... # create a data frame using the word_list and the new data
dtm_df.head()

When would it be most important to normalize word counts?
1. When you have a lot of documents
2. When you have very few documents
3. When the documents are of many different lengths
4. When the documents are all around the same length

Why?

__SOLUTION__:

## Streamlining

Overall, this was a lot of work, and if it is as common as we say it is in NLP, shouldn't someone have streamlined it already? In fact, we can simply instruct CountVectorizer to drop stop words itself (so we could use it on our non-tokenized text), and another class, TfidfTransformer, handles the normalization.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

engl_stop_words = list(ENGLISH_STOP_WORDS)

# fill out this beginning part
# when you create your CountVectorizer, set the stop_words argument equal to engl_stop_words
cv = ...
dtm = ...

# this is what allows us to easily streamline
tt = TfidfTransformer(norm='l1', use_idf=False)
dtm_tf = tt.fit_transform(dtm)
dtm_tf

Fantastic! There's no need to directly answer this question, but think about how we could perhaps remove the numbers from the matrix in addition to the stop words.

---
## Bibliography

- Document Term Matrix, normalization markdown and code adapted from materials by Chris Hench: https://github.com/henchc/textxd-2017/blob/master/06-DTM.ipynb

---
Notebook developed by: Gibson Chu

Data Science Modules: http://data.berkeley.edu/education/modules
