## LEGALST-190 Lab 3/6

---

In this lab, students will learn about dominant language models in natural language processing and the basics of how to implement them in Python. We'll be using the data you extracted in the last lab (un-debates-2001-clean.csv).


In [10]:
# dependencies
from datascience import *
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")
%matplotlib inline

### Overview
Encoding text as a real-valued feature is especially challenging, and many of the standard text transformations are lossy. The earlier transformations (e.g., one-hot encoding and Boolean representations) all preserve the information in the feature. In contrast, most of the techniques for encoding text destroy information about word order and, in many cases, key parts of the grammar.

Here we will discuss one widely used representation of text:
- <b>Bag-of-Words Encoding</b>: encodes text by the frequency of each word

This model was very popular in early text analysis, and continues to be used today. In fact, the models that have replaced it are still very difficult to actually interpret, giving the BoW approach a slight advantage if we want to understand why the model makes certain decisions. Once we have our BoW model we can analyze it in a high-dimensional vector space, which gives us more insights into the similarities and clustering of different texts.
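To make the vector-space idea concrete, here is a minimal sketch of one common similarity measure, cosine similarity, applied to word-count vectors. The helper `cosine_similarity` and the three toy documents are invented for illustration and are not taken from the UN data; this is one way to compare texts, not the only one.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two word-count vectors."""
    # Only words present in both bags contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag has no direction
    return dot / (norm_a * norm_b)

# Toy token lists, invented for illustration (not from the UN data).
doc_a = Counter("peace security peace development".split())
doc_b = Counter("peace development trade".split())
doc_c = Counter("climate ocean island".split())

sim_ab = cosine_similarity(doc_a, doc_b)  # shares "peace" and "development"
sim_ac = cosine_similarity(doc_a, doc_c)  # shares no words
print(round(sim_ab, 3), sim_ac)
```

Documents that share many words point in similar directions and score close to 1; documents with no words in common score 0.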

In [11]:
## retrieve our data
data = Table.read_table('data/un-debates-2001-clean.csv', index_col=0)
data.show(5)

session,year,country,text,tokens
56,2001,COM,"﻿On behalf of the Comorian delegation, which I have the ...",﻿on behalf comorian deleg i honour lead behalf i offer s ...
56,2001,RWA,"﻿It is a great honour for me, on behalf of the Rwandan  ...",﻿it great honour behalf rwandan deleg join previous spea ...
56,2001,MMR,"﻿On behalf of the delegation of the Union of Myanmar, I ...",﻿on behalf deleg union myanmar i wish extend warmest con ...
56,2001,PHL,"﻿Let me begin by congratulating Your Excellency, Mr. Ha ...",﻿let begin congratul your excel mr han seungsoo elect pr ...
56,2001,MRT,"﻿I am delighted to be able to congratulate you, Sir, on ...",﻿i delight abl congratul sir behalf deleg islam republ m ...


## Bag-of-Words Encoding

The bag-of-words encoding is widely used and is a standard representation for text in many popular text clustering algorithms.

__Key Things to Note:__

1. __Stop words are removed.__ Stop words are words like _is_ and _about_ that in isolation contain very little information about the meaning of the sentence. Lists of stop words are available for many languages.
2. __Word order information is lost.__ Nonetheless, the vector still suggests that the sentence is about fun, machines, and learning, though there are many possible meanings: _learning machines have fun learning_ or _learning about machines is fun learning_ ...
3. __Capitalization and punctuation__ are typically removed.
4. __Sparse encoding__ is necessary to represent the bag-of-words efficiently. There are millions of possible words (including terminology, names, and misspellings), so instantiating a 0 for every word that is not in each record would be incredibly inefficient.
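The four points above can be sketched in plain Python. This is a simplified illustration, not how sklearn does it internally: the `STOP_WORDS` set, the `bag_of_words` helper, and the small `vocabulary` are all hypothetical, and real stop-word lists and vocabularies are far larger.

```python
import re
from collections import Counter

# A tiny hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"is", "about", "the", "a", "have"}

def bag_of_words(text):
    # Lowercase and keep only runs of letters, which drops punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words, then count what remains.
    return Counter(t for t in tokens if t not in STOP_WORDS)

bag1 = bag_of_words("Learning machines have fun learning.")
bag2 = bag_of_words("Learning about machines is fun learning!")

# Different sentences, identical bags: word order is gone.
print(bag1 == bag2)

# Sparse form: store only the words that occur.
print(dict(bag1))

# Dense form: one slot per vocabulary word, mostly zeros.
vocabulary = sorted({"fun", "learning", "machines", "ocean",
                     "peace", "security", "trade"})
dense = [bag1[w] for w in vocabulary]
print(dense)
```

With a vocabulary of millions of words, the dense list would be almost entirely zeros for every record, which is why the sparse form is preferred.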

Why is it called a __bag-of-words__?

__SOLUTION:__ A bag is another term for a __multiset__: _an unordered collection which may contain multiple instances of each element._
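Python's `collections.Counter` behaves like a multiset, so it gives a quick way to see the definition in action. The token list below is typed in by hand from the start of the first row shown earlier and is only meant as an illustration.

```python
from collections import Counter

# Tokens copied by hand from the start of the first row of the data.
tokens = "on behalf comorian deleg i honour lead behalf i offer".split()

bag = Counter(tokens)
reversed_bag = Counter(reversed(tokens))

# Unordered: reversing the tokens gives the same bag.
print(bag == reversed_bag)

# Multiple instances are kept: "behalf" appears twice.
print(bag["behalf"])

# A plain set would forget the counts.
print(len(set(tokens)), sum(bag.values()))
```

The bag keeps how often each token occurs but nothing about where it occurred, which is exactly the information a bag-of-words encoding retains.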

### Implementing the Bag-of-words Model

We can use sklearn to construct a bag-of-words representation of text:

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Construct the tokenizer with English stop words
bow = CountVectorizer(stop_words="english")