# Text Analysis 02: Pre-Processing and Word Frequency


---
<img src="https://upload.wikimedia.org/wikipedia/commons/2/2a/ELI-LA-Word-Cloud-2-900x600.jpg" style="width: 500px; height: 275px;" />

### Professor Crystal Chang

This notebook will cover tools and strategies used in data pre-processing to prepare text data for analysis. We will then look at how to get basic word counts.

*Estimated Time: 45 minutes*

---

### Topics Covered
- Text Pre-processing
- Word Frequency


### Table of Contents


1 - [Tokens](#section 1)<br>

2 - [Counting Tokens](#section 2)<br>

3 - [Stop Words](#section 3)<br>

4 - [N-grams](#section 4)<br>




**Dependencies:**

In [None]:
import re
import pandas as pd

## Introduction

We've spent a lot of time in Python dealing with strings, and that's because strings can represent text data, and text data is everywhere. It is the primary form of communication between people, between people and computers, and between computers.

This notebook will apply the methods you just learned to a fundamental component of text analysis: **pre-processing**, or getting the text into a form Python can understand and manipulate. We will then use our processed text to calculate **word frequency**, the foundation of the Bag-of-Words model.

---

# 1. Tokens <a id='section 1'></a>

What is a token? In text analysis, we need something to count in order to quantify language. In natural language processing (NLP), the thing we count is the token. A token is often simply a word. But more accurately, a token is the unit of analysis for an NLP project. This may be a word, but it may also be a stem or a lemma.

## Tokenizing

Let's first read in an example text to work with. In this section, we'll use the text from President Obama's speech at the United Nations General Assembly on September 28, 2015:

In [None]:
with open('data/obama_un_2015.txt') as f:
    obama_un = f.read()

print(obama_un)

We may be tempted to use our `str` knowledge and just call `.split()` to get our tokens:

In [None]:
tokens = obama_un.split()
tokens

That looks pretty good. What if we wanted to get some counts? Let's try something we know is in the document: the very first word.

In [None]:
tokens.count('Mr')

Wait, what? We can see two 'Mr's in the first sentence alone! Let's take another look at our tokens:

In [None]:
tokens

We see that the periods stayed attached to our token: the list contains "Mr.", not "Mr". The `.split()` method splits on whitespace only; it has no notion of what a word is.

Similarly, Python treats "It" and "it" as two different tokens, because string comparison is case-sensitive:

In [None]:
tokens.count('It')

In [None]:
tokens.count('it')

A quick and dirty way to get some normalized text back would be to remove punctuation and lower-case the entire string:

In [None]:
from string import punctuation
punctuation

In [None]:
no_punc = ''.join([c for c in obama_un if c not in punctuation])
no_punc

In [None]:
no_punc_lower = no_punc.lower()
no_punc_lower

In [None]:
no_punc_lower_tokens = no_punc_lower.split()
no_punc_lower_tokens

In [None]:
no_punc_lower_tokens.count('mr')

In [None]:
no_punc_lower_tokens.count('It'), no_punc_lower_tokens.count('it')
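
The `re` module imported at the top of this notebook offers another route to tokens. The sketch below is an illustration rather than the method this notebook relies on: it compares the strip-punctuation approach from the cells above with a regular expression that pulls out runs of word characters. The sentence `sample` is made up for the example.

```python
import re
from string import punctuation

# Made-up sample sentence (not taken from the speech).
sample = "Mr. President, Mr. Secretary General: it's an honor. It is."

# Approach from the cells above: delete punctuation, lower-case, split.
stripped = ''.join(c for c in sample if c not in punctuation).lower().split()

# Regex alternative: collect every run of word characters
# (letters, digits, underscore) from the lower-cased text.
regex_tokens = re.findall(r"\w+", sample.lower())

print(stripped)      # "it's" becomes the single token 'its'
print(regex_tokens)  # "it's" becomes two tokens, 'it' and 's'
```

The two approaches disagree on contractions, so the counts for a word like "it" depend on which tokenizer you choose. Neither is wrong; what matters is applying the same rule to every text you compare.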

### Challenge

Write a function that tokenizes your text. Your function should do three things:
- remove the punctuation
- lower-case the text
- split the text on white spaces

In [None]:
def tokenize(text):
    no_punc = ... #your code here
    no_punc_lower = ... #your code here
    no_punc_lower_tokens = ... #your code here
    return no_punc_lower_tokens

Test out your function on a snippet of text.

In [None]:
punctuation

In [None]:
snippet = obama_un[25195:]
print(snippet)


In [None]:
''.join([c for c in 'tomorrow’s' if c not in punctuation])
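
The cell above leaves the apostrophe in place because `string.punctuation` lists only ASCII punctuation, and the curly apostrophe (U+2019) is not ASCII. One possible workaround, sketched here under the assumption that you want to drop every Unicode punctuation character, is to check each character's Unicode category with the standard `unicodedata` module. The helper name `strip_unicode_punct` is hypothetical.

```python
import unicodedata
from string import punctuation

word = 'tomorrow\u2019s'  # \u2019 is the curly apostrophe

# string.punctuation is ASCII-only, so the curly apostrophe survives.
ascii_only = ''.join(c for c in word if c not in punctuation)

def strip_unicode_punct(text):
    """Hypothetical helper: drop characters whose Unicode category starts with 'P'."""
    return ''.join(c for c in text if not unicodedata.category(c).startswith('P'))

print(ascii_only == word)         # True: nothing was removed
print(strip_unicode_punct(word))  # tomorrows
```

Note that the Unicode punctuation categories do not include symbols such as `$` or `+`, which `string.punctuation` does contain, so the two filters are not interchangeable.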

In [None]:
tokenize(snippet)

# 2. Counting Tokens <a id='section 2'></a>

The easiest way to count tokens is using the `Counter` object from `collections`. A `Counter` is a dictionary subclass that maps each token to the number of times it appears. Take a look at the docstring:

In [None]:
from collections import Counter
Counter?

Now, try it out on `no_punc_lower_tokens`.

In [None]:
Counter(no_punc_lower_tokens)

The `.most_common()` method returns a list of `(token, count)` pairs sorted from most to least frequent.

In [None]:
Counter(no_punc_lower_tokens).most_common()
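
To see what `Counter` is doing on something small enough to check by hand, here is a minimal sketch on a made-up list of tokens (`toy_tokens` is not from the speech). It also shows two things the cells above don't use: passing a number to `.most_common()` and turning raw counts into relative frequencies.

```python
from collections import Counter

# Made-up tokens, small enough to count by hand.
toy_tokens = 'the cat and the hat and the bat'.split()
counts = Counter(toy_tokens)

# most_common(n) returns only the n most frequent (token, count) pairs.
top_two = counts.most_common(2)

# Relative frequency: each count divided by the total number of tokens.
total = sum(counts.values())
rel_freq = {token: count / total for token, count in counts.items()}

print(top_two)          # [('the', 3), ('and', 2)]
print(rel_freq['the'])  # 0.375
print(counts['dog'])    # 0: a missing token counts as zero instead of raising an error
```

Relative frequencies are useful when comparing texts of different lengths, since raw counts grow with document size.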

# 3. Stop Words <a id='section 3'></a>

You've probably noticed that a lot of our most frequent words aren't very interesting: they are **stop words** like 'and', 'the', and 'of'. Frequently, we'll want to remove the stop words in order to get more meaningful results from our analysis.

Fortunately, the scikit-learn package has a list of English stop words we can use:

In [None]:
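# The old sklearn.feature_extraction.stop_words module has been removed from
# scikit-learn; the list is available from the text module instead.
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS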

ENGLISH_STOP_WORDS

### List Comprehensions
When we're dealing with a list of items, we often want to check or alter each item. Instead of indexing each item individually, it's better to use a **list comprehension**. A list comprehension takes a list, goes through each item to change or remove it, and returns a new list.

A basic list comprehension is written in the following order:

`[<do_something(item)> for <item> in <sequence> if <condition>]`

For example, to double the numbers in a list:

In [None]:
numbers = [1, 2, 3, 4, 5]

[x * 2 for x in numbers]
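
The general pattern above also has an optional `if` part, which the doubling example leaves out. As a small illustrative sketch, here is a comprehension that uses both pieces at once; the name `doubled_odds` is made up for the example.

```python
numbers = [1, 2, 3, 4, 5]

# Only items that pass the condition (odd numbers) are kept,
# and each kept item is doubled before going into the new list.
doubled_odds = [x * 2 for x in numbers if x % 2 == 1]

print(doubled_odds)  # [2, 6, 10]
print(numbers)       # [1, 2, 3, 4, 5]: the original list is unchanged
```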

Today, we're going to use list comprehensions to filter items. For example, we used one earlier to filter out punctuation:

In [None]:
characters = ['3', 'a', '.', '!', 'g']

[c for c in characters if c not in punctuation]

This list comprehension says "Keep each item `c` in `characters` exactly as it is unless it appears in the string `punctuation`".
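
If the one-line syntax feels dense, it may help to see the same filter written as an ordinary `for` loop. This is only a sketch for comparison; the variable name `kept` is made up for the example.

```python
from string import punctuation

characters = ['3', 'a', '.', '!', 'g']

# Loop version: start with an empty list and append each item that passes the test.
kept = []
for c in characters:
    if c not in punctuation:
        kept.append(c)

print(kept)  # ['3', 'a', 'g']

# The comprehension builds the same list in a single expression.
print(kept == [c for c in characters if c not in punctuation])  # True
```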

### Challenge

Write a list comprehension that looks at each token in `tokens` and leaves the token out if it is in the set of `ENGLISH_STOP_WORDS`.

In [None]:
tokens = tokenize(obama_un)

tokens_no_stops = ... #Your code here

Let's check if it worked:

In [None]:
Counter(tokens_no_stops).most_common()

# 4. N-Grams <a id='section 4'></a>

N-grams help us move beyond looking at one word at a time and allow us to look at co-occurrence. An n-gram function slides a window of `n` words across a list of words, moving forward one word at a time. It's easiest to understand from an example. First we'll define a function, `ngrams`. Don't worry about how the code works for now!

In [None]:
def ngrams(input_list, n):
    # Zip together n copies of the list, each starting one position later
    return zip(*[input_list[i:] for i in range(n)])

Let's look at a bigram (window of 2):

In [None]:
list(ngrams('I like walking my dog'.split(), 2))
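
For anyone curious about how the one-line function works, the sketch below unpacks it for the bigram case. It is meant as an illustration: the names `shifted`, `bigrams_zip`, and `bigrams_slices` are made up, and the slicing version is an alternative way of writing it, not part of the original function.

```python
words = 'I like walking my dog'.split()
n = 2

# The comprehension inside ngrams builds n copies of the list,
# each starting one position later than the one before it.
shifted = [words[i:] for i in range(n)]
print(shifted[0])  # ['I', 'like', 'walking', 'my', 'dog']
print(shifted[1])  # ['like', 'walking', 'my', 'dog']

# zip pairs up the items at the same position and stops at the shortest list.
bigrams_zip = list(zip(*shifted))

# The same windows written with explicit slicing.
bigrams_slices = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

print(bigrams_zip)
print(bigrams_zip == bigrams_slices)  # True
```

Because `zip` returns an iterator, the result of `ngrams` can only be looped over once, which is why the cells here wrap it in `list()`.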

Let's look at bigrams for our tokens:

In [None]:
list(ngrams(no_punc_lower_tokens, 2))

We can use `Counter` for this too:

In [None]:
Counter(list(ngrams(no_punc_lower_tokens, 2))).most_common()

### Challenge

Write the code to get a list of trigrams (windows of 3) for `no_punc_lower_tokens`. Then, use `Counter` to find the most common trigrams.

### Question
Should we remove stop words before looking at bi- and tri-grams? Why or why not?

What kinds of research could we do using word or N-gram frequency analysis?

**YOUR ANSWER**:

---

## Bibliography

- Tokens and Preprocessing section adapted from materials by Chris Hench. https://github.com/henchc/textxd-2017/blob/master/04-Tokens-and-Pre-Processing.ipynb

---
Notebook developed by: Keeley Takimoto

Data Science Modules: http://data.berkeley.edu/education/modules
