# Text Analysis 02: Pre-Processing and Word Frequency


---
<img src="https://upload.wikimedia.org/wikipedia/commons/2/2a/ELI-LA-Word-Cloud-2-900x600.jpg" style="width: 500px; height: 275px;" />

### Professor Crystal Chang

This notebook will cover tools and strategies used in data pre-processing to prepare text data for analysis. We will then look at how to get basic word counts.

*Estimated Time: 45 minutes*

---

### Topics Covered
- Text Pre-processing
- Word Frequency


### Table of Contents


1 - [Tokens](#section 1)<br>

2 - [Counting Tokens](#section 2)<br>

3 - [Stop Words](#section 3)<br>

4 - [N-grams](#section 4)<br>




**Dependencies:**

In [1]:
import re
import pandas as pd

## Introduction

We've spent a lot of time in Python dealing with strings because strings represent text data, and text data is everywhere. It is the primary form of communication between people, between people and computers, and between computers.

This notebook will apply the methods you just learned to a fundamental component of text analysis: **pre-processing**, or getting the text into a form Python can understand and manipulate. We will then use our processed text to calculate **word frequency**, the foundation of the Bag-of-Words model.

---

# 1. Tokens <a id='section 1'></a>

What is a token? To quantify language, we need something to count, and in NLP that something is a token. A token is often simply a word, but more precisely it is the unit of analysis for an NLP project: it may be a word, a stem, or a lemma.

## Tokenizing

Let's first read in an example text to work with. In this section, we'll use the text of an address President Obama gave to the United Nations General Assembly, stored in `data/obama_un_2015.txt`:

In [1]:
with open('data/obama_un_2015.txt') as f:
    obama_un = f.read()

print(obama_un)

Mr. President, Mr. Secretary General, fellow delegates, ladies and gentlemen: we come together at a crossroads between war and peace; between disorder and integration; between fear and hope. 
Around the globe, there are signposts of progress. The shadow of World War that existed at the founding of this institution has been lifted; the prospect of war between major powers reduced. The ranks of member states has more than tripled, and more people live under governments they elected. Hundreds of millions of human beings have been freed from the prison of poverty, with the proportion of those living in extreme poverty cut in half. And the world economy continues to strengthen after the worst financial crisis of our lives. 
Today, whether you live in downtown New York or in my grandmother’s village more than two hundred miles from Nairobi, you can hold in your hand more information than the world’s greatest libraries. Together, we have learned how to cure disease, and harness the power of t

We may be tempted to use our `str` knowledge and just call `.split()` to get our tokens:

In [2]:
tokens = obama_un.split()
tokens

['Mr.',
 'President,',
 'Mr.',
 'Secretary',
 'General,',
 'fellow',
 'delegates,',
 'ladies',
 'and',
 'gentlemen:',
 'we',
 'come',
 'together',
 'at',
 'a',
 'crossroads',
 'between',
 'war',
 'and',
 'peace;',
 'between',
 'disorder',
 'and',
 'integration;',
 'between',
 'fear',
 'and',
 'hope.',
 'Around',
 'the',
 'globe,',
 'there',
 'are',
 'signposts',
 'of',
 'progress.',
 'The',
 'shadow',
 'of',
 'World',
 'War',
 'that',
 'existed',
 'at',
 'the',
 'founding',
 'of',
 'this',
 'institution',
 'has',
 'been',
 'lifted;',
 'the',
 'prospect',
 'of',
 'war',
 'between',
 'major',
 'powers',
 'reduced.',
 'The',
 'ranks',
 'of',
 'member',
 'states',
 'has',
 'more',
 'than',
 'tripled,',
 'and',
 'more',
 'people',
 'live',
 'under',
 'governments',
 'they',
 'elected.',
 'Hundreds',
 'of',
 'millions',
 'of',
 'human',
 'beings',
 'have',
 'been',
 'freed',
 'from',
 'the',
 'prison',
 'of',
 'poverty,',
 'with',
 'the',
 'proportion',
 'of',
 'those',
 'living',
 'in',
 'e

That looks pretty good. What if we wanted to get some counts? Let's try something we know is in the document: the very first word.

In [3]:
tokens.count('Mr')

0

Wait, what? We can see two 'Mr's in the first sentence alone! Let's take another look at our tokens:

In [4]:
tokens

['Mr.',
 'President,',
 'Mr.',
 'Secretary',
 'General,',
 'fellow',
 'delegates,',
 'ladies',
 'and',
 'gentlemen:',
 'we',
 'come',
 'together',
 'at',
 'a',
 'crossroads',
 'between',
 'war',
 'and',
 'peace;',
 'between',
 'disorder',
 'and',
 'integration;',
 'between',
 'fear',
 'and',
 'hope.',
 'Around',
 'the',
 'globe,',
 'there',
 'are',
 'signposts',
 'of',
 'progress.',
 'The',
 'shadow',
 'of',
 'World',
 'War',
 'that',
 'existed',
 'at',
 'the',
 'founding',
 'of',
 'this',
 'institution',
 'has',
 'been',
 'lifted;',
 'the',
 'prospect',
 'of',
 'war',
 'between',
 'major',
 'powers',
 'reduced.',
 'The',
 'ranks',
 'of',
 'member',
 'states',
 'has',
 'more',
 'than',
 'tripled,',
 'and',
 'more',
 'people',
 'live',
 'under',
 'governments',
 'they',
 'elected.',
 'Hundreds',
 'of',
 'millions',
 'of',
 'human',
 'beings',
 'have',
 'been',
 'freed',
 'from',
 'the',
 'prison',
 'of',
 'poverty,',
 'with',
 'the',
 'proportion',
 'of',
 'those',
 'living',
 'in',
 'e

The period stayed attached to each "Mr", so the tokens are `'Mr.'`, not `'Mr'`. The `.split()` method splits on whitespace only; it has no notion of what a word is.

Similarly, Python won't know the difference between "It" and "it":

In [5]:
tokens.count('It')

4

In [6]:
tokens.count('it')

22

A quick and dirty way to get some normalized text back would be to remove punctuation and lower-case the entire string:

In [7]:
from string import punctuation
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [8]:
no_punc = ''.join([c for c in obama_un if c not in punctuation])
no_punc

'Mr President Mr Secretary General fellow delegates ladies and gentlemen we come together at a crossroads between war and peace between disorder and integration between fear and hope \nAround the globe there are signposts of progress The shadow of World War that existed at the founding of this institution has been lifted the prospect of war between major powers reduced The ranks of member states has more than tripled and more people live under governments they elected Hundreds of millions of human beings have been freed from the prison of poverty with the proportion of those living in extreme poverty cut in half And the world economy continues to strengthen after the worst financial crisis of our lives \nToday whether you live in downtown New York or in my grandmother’s village more than two hundred miles from Nairobi you can hold in your hand more information than the world’s greatest libraries Together we have learned how to cure disease and harness the power of the wind and sun The 

In [9]:
no_punc_lower = no_punc.lower()
no_punc_lower

'mr president mr secretary general fellow delegates ladies and gentlemen we come together at a crossroads between war and peace between disorder and integration between fear and hope \naround the globe there are signposts of progress the shadow of world war that existed at the founding of this institution has been lifted the prospect of war between major powers reduced the ranks of member states has more than tripled and more people live under governments they elected hundreds of millions of human beings have been freed from the prison of poverty with the proportion of those living in extreme poverty cut in half and the world economy continues to strengthen after the worst financial crisis of our lives \ntoday whether you live in downtown new york or in my grandmother’s village more than two hundred miles from nairobi you can hold in your hand more information than the world’s greatest libraries together we have learned how to cure disease and harness the power of the wind and sun the 

In [10]:
no_punc_lower_tokens = no_punc_lower.split()
no_punc_lower_tokens

['mr',
 'president',
 'mr',
 'secretary',
 'general',
 'fellow',
 'delegates',
 'ladies',
 'and',
 'gentlemen',
 'we',
 'come',
 'together',
 'at',
 'a',
 'crossroads',
 'between',
 'war',
 'and',
 'peace',
 'between',
 'disorder',
 'and',
 'integration',
 'between',
 'fear',
 'and',
 'hope',
 'around',
 'the',
 'globe',
 'there',
 'are',
 'signposts',
 'of',
 'progress',
 'the',
 'shadow',
 'of',
 'world',
 'war',
 'that',
 'existed',
 'at',
 'the',
 'founding',
 'of',
 'this',
 'institution',
 'has',
 'been',
 'lifted',
 'the',
 'prospect',
 'of',
 'war',
 'between',
 'major',
 'powers',
 'reduced',
 'the',
 'ranks',
 'of',
 'member',
 'states',
 'has',
 'more',
 'than',
 'tripled',
 'and',
 'more',
 'people',
 'live',
 'under',
 'governments',
 'they',
 'elected',
 'hundreds',
 'of',
 'millions',
 'of',
 'human',
 'beings',
 'have',
 'been',
 'freed',
 'from',
 'the',
 'prison',
 'of',
 'poverty',
 'with',
 'the',
 'proportion',
 'of',
 'those',
 'living',
 'in',
 'extreme',
 'pover

In [11]:
no_punc_lower_tokens.count('mr')

2

In [12]:
no_punc_lower_tokens.count('It'), no_punc_lower_tokens.count('it')

(0, 28)
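The same normalization can also be written with `str.translate`, which deletes every character listed in a translation table in a single pass. This is a sketch of an alternative, not the approach this notebook uses, and the `sample` string is made up for illustration.

```python
from string import punctuation

# Build a table that maps every ASCII punctuation character to None (i.e. delete it)
remove_punc = str.maketrans('', '', punctuation)

# A short, made-up sentence used only for illustration
sample = "Mr. President, Mr. Secretary General: It is time."

# Delete punctuation, lower-case, then split on whitespace
sample_tokens = sample.translate(remove_punc).lower().split()
print(sample_tokens)
# ['mr', 'president', 'mr', 'secretary', 'general', 'it', 'is', 'time']
```

Chaining the three steps keeps the intermediate strings out of the namespace, at the cost of being a little harder to inspect step by step.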

### Challenge

Write a function that tokenizes your text. Your function should do three things:
- remove the punctuation
- lower-case the text
- split the text on white spaces

In [17]:
def tokenize(text):
    no_punc = ''.join([c for c in text if c not in punctuation])  # remove ASCII punctuation
    no_punc_lower = no_punc.lower()  # lower-case the text
    no_punc_lower_tokens = no_punc_lower.split()  # split on whitespace
    return no_punc_lower_tokens

Test out your function on a snippet of text.

In [16]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
snippet = obama_un[25195:]
print(snippet)


Join us in this common mission, for today’s children and tomorrow’s. 



In [19]:
''.join([c for c in 'tomorrow’s' if c not in punctuation])

'tomorrow’s'

In [20]:
tokenize(snippet)

['join',
 'us',
 'in',
 'this',
 'common',
 'mission',
 'for',
 'today’s',
 'children',
 'and',
 'tomorrow’s']
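Notice that `today’s` and `tomorrow’s` kept their apostrophes. `string.punctuation` contains only ASCII characters, so the curly apostrophe (`’`) and the en dash (`–`) in this text pass straight through. One possible fix, sketched here as an assumption rather than as part of the original notebook, is to check each character's Unicode category with the standard-library `unicodedata` module. The helper name `tokenize_unicode` is hypothetical.

```python
import unicodedata

def tokenize_unicode(text):
    """Lower-case and split text after dropping any Unicode punctuation character."""
    # Unicode general categories starting with 'P' are punctuation
    # (for example Pd = dash, Pf = final quote, Po = other punctuation)
    no_punc = ''.join(c for c in text if not unicodedata.category(c).startswith('P'))
    return no_punc.lower().split()

print(tokenize_unicode('Join us – for today’s children and tomorrow’s.'))
# ['join', 'us', 'for', 'todays', 'children', 'and', 'tomorrows']
```

Symbols such as `$` or `+` fall under the 'S' categories rather than 'P', so this version keeps them; combine it with the `punctuation` check if you need to drop those as well.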

# 2. Counting Tokens <a id='section 2'></a>

The easiest way to count tokens is with the `Counter` class from `collections`. It returns a dictionary-like object that maps each token to its count. Take a look at the docstring:

In [21]:
from collections import Counter
Counter?

Now, try it out on `no_punc_lower_tokens`.

In [22]:
Counter(no_punc_lower_tokens)

Counter({'mr': 2,
         'president': 5,
         'secretary': 1,
         'general': 1,
         'fellow': 2,
         'delegates': 2,
         'ladies': 1,
         'and': 151,
         'gentlemen': 1,
         'we': 91,
         'come': 11,
         'together': 12,
         'at': 17,
         'a': 90,
         'crossroads': 2,
         'between': 7,
         'war': 18,
         'peace': 13,
         'disorder': 1,
         'integration': 1,
         'fear': 2,
         'hope': 2,
         'around': 2,
         'the': 218,
         'globe': 3,
         'there': 10,
         'are': 25,
         'signposts': 1,
         'of': 146,
         'progress': 5,
         'shadow': 1,
         'world': 31,
         'that': 71,
         'existed': 1,
         'founding': 2,
         'this': 49,
         'institution': 4,
         'has': 19,
         'been': 11,
         'lifted': 1,
         'prospect': 1,
         'major': 3,
         'powers': 2,
         'reduced': 1,
         'ranks': 2,
 

The `.most_common()` method returns the tokens sorted from most to least frequent.

In [23]:
Counter(no_punc_lower_tokens).most_common()

[('the', 218),
 ('and', 151),
 ('to', 151),
 ('of', 146),
 ('we', 91),
 ('a', 90),
 ('in', 78),
 ('that', 71),
 ('is', 58),
 ('–', 57),
 ('this', 49),
 ('our', 45),
 ('will', 43),
 ('be', 40),
 ('for', 40),
 ('have', 33),
 ('world', 31),
 ('as', 30),
 ('people', 28),
 ('it', 28),
 ('or', 26),
 ('not', 26),
 ('are', 25),
 ('can', 24),
 ('their', 24),
 ('with', 23),
 ('by', 23),
 ('who', 23),
 ('america', 21),
 ('but', 20),
 ('has', 19),
 ('war', 18),
 ('from', 18),
 ('you', 18),
 ('at', 17),
 ('they', 16),
 ('on', 15),
 ('no', 15),
 ('more', 14),
 ('so', 14),
 ('peace', 13),
 ('those', 13),
 ('us', 13),
 ('an', 13),
 ('nations', 13),
 ('do', 13),
 ('must', 13),
 ('all', 13),
 ('where', 13),
 ('together', 12),
 ('new', 12),
 ('come', 11),
 ('been', 11),
 ('young', 11),
 ('when', 11),
 ('there', 10),
 ('i', 10),
 ('because', 10),
 ('states', 9),
 ('power', 9),
 ('problems', 9),
 ('into', 9),
 ('see', 9),
 ('conflict', 9),
 ('if', 9),
 ('muslim', 9),
 ('today', 8),
 ('united', 8),
 ('time'
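Because a `Counter` behaves like a dictionary, you can look up a single token directly, and `.most_common()` accepts an optional number if you only want the top few. The token list below is made up so that the sketch runs on its own.

```python
from collections import Counter

# A tiny, made-up token list used only for illustration
sample_tokens = ['war', 'and', 'peace', 'and', 'hope', 'and', 'war']
counts = Counter(sample_tokens)

print(counts['and'])          # 3
print(counts['fear'])         # 0 (missing keys count as zero instead of raising KeyError)
print(counts.most_common(2))  # [('and', 3), ('war', 2)]
```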

# 3. Stop Words <a id='section 3'></a>

You've probably noticed that a lot of our most frequent words aren't very interesting: many of them are **stop words** like 'and', 'the', and 'of'. We'll often want to remove the stop words to get more meaningful results from our analysis.

Fortunately, the scikit-learn package has a list of English stop words we can use:

In [24]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

### List Comprehensions
When we're dealing with a list of items, we often want to check or alter each item. Instead of indexing each item individually, it's better to use a **list comprehension**. A list comprehension takes a list, goes through each item to change or remove it, and returns a new list.

A basic list comprehension is written in the following order:

`[<do_something(item)> for <item> in <sequence> if <condition>]`

For example, to double the numbers in a list:

In [25]:
numbers = [1, 2, 3, 4, 5]

[x * 2 for x in numbers]

[2, 4, 6, 8, 10]

Today, we're going to use list comprehensions to filter items. For example, we used one earlier to filter out punctuation:

In [26]:
characters = ['3', 'a', '.', '!', 'g']

[c for c in characters if c not in punctuation]

['3', 'a', 'g']

This list comprehension says: "Keep each item `c` in `characters` exactly as it is, unless it appears in the `punctuation` string."
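If the one-line form is hard to read at first, it may help to see the same filter written as an ordinary `for` loop. The variable name `kept` is ours, chosen for illustration.

```python
from string import punctuation

characters = ['3', 'a', '.', '!', 'g']

# The same filter written as an explicit loop
kept = []
for c in characters:
    if c not in punctuation:
        kept.append(c)

print(kept)  # ['3', 'a', 'g']

# Both forms build the same list
print(kept == [c for c in characters if c not in punctuation])  # True
```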

### Challenge

Write a list comprehension that looks at each token in `tokens` and leaves the token out if it is in the set `ENGLISH_STOP_WORDS`.

In [29]:
tokens = tokenize(obama_un)

tokens_no_stops = [c for c in tokens if c not in ENGLISH_STOP_WORDS]

Let's check if it worked:

In [30]:
Counter(tokens_no_stops).most_common()

[('–', 57),
 ('world', 31),
 ('people', 28),
 ('america', 21),
 ('war', 18),
 ('peace', 13),
 ('nations', 13),
 ('new', 12),
 ('come', 11),
 ('young', 11),
 ('states', 9),
 ('power', 9),
 ('problems', 9),
 ('conflict', 9),
 ('muslim', 9),
 ('today', 8),
 ('united', 8),
 ('time', 8),
 ('children', 8),
 ('human', 7),
 ('small', 7),
 ('terrorists', 7),
 ('look', 7),
 ('violent', 7),
 ('effort', 7),
 ('address', 7),
 ('economy', 6),
 ('lives', 6),
 ('forces', 6),
 ('international', 6),
 ('extremism', 6),
 ('reject', 6),
 ('ukraine', 6),
 ('russia', 6),
 ('support', 6),
 ('that’s', 6),
 ('security', 6),
 ('isil', 6),
 ('president', 5),
 ('progress', 5),
 ('live', 5),
 ('global', 5),
 ('borders', 5),
 ('syria', 5),
 ('iraq', 5),
 ('make', 5),
 ('future', 5),
 ('like', 5),
 ('years', 5),
 ('join', 5),
 ('path', 5),
 ('prepared', 5),
 ('century', 5),
 ('fight', 5),
 ('leaders', 5),
 ('communities', 5),
 ('ideology', 5),
 ('sectarian', 5),
 ('region', 5),
 ('violence', 5),
 ('civil', 5),
 ('ins
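The en dash (`–`) still tops the list because it is neither ASCII punctuation nor a stop word. One way to handle leftovers like this is to add them to the stop list yourself. The sketch below uses a small hand-written stop set and token list, both made up for illustration, so that it runs without scikit-learn; with the real data you could build the custom set from `ENGLISH_STOP_WORDS` in the same way.

```python
from collections import Counter

# Hypothetical stand-ins for ENGLISH_STOP_WORDS and the speech tokens
base_stop_words = frozenset({'the', 'and', 'of', 'a'})
sample_tokens = ['the', 'world', '–', 'a', 'world', 'of', 'peace', '–', 'and', 'hope']

# frozenset.union returns a new set and leaves the original untouched
custom_stop_words = base_stop_words.union({'–'})

filtered = [t for t in sample_tokens if t not in custom_stop_words]
print(Counter(filtered).most_common())
# [('world', 2), ('peace', 1), ('hope', 1)]
```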

# 4. N-Grams <a id='section 4'></a>

N-grams help us move beyond single words and look at co-occurrence. An n-gram is a window of `n` consecutive words; we slide that window across the list of tokens one word at a time. It's easiest to understand from an example. First we'll define a function `ngrams`. Don't worry about the code here!

In [31]:
def ngrams(input_list, n):
    """Return an iterator over every window of n consecutive items in input_list."""
    return zip(*[input_list[i:] for i in range(n)])

Let's look at a bigram (window of 2):

In [32]:
list(ngrams('I like walking my dog'.split(), 2))

[('I', 'like'), ('like', 'walking'), ('walking', 'my'), ('my', 'dog')]
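For the curious, here is what the one-line `ngrams` function does, unpacked into separate steps for the same sentence. The intermediate names `words`, `shifted`, and `bigrams` are ours.

```python
words = 'I like walking my dog'.split()
n = 2

# Make n copies of the list, each starting one position later than the last
shifted = [words[i:] for i in range(n)]
print(shifted[0])  # ['I', 'like', 'walking', 'my', 'dog']
print(shifted[1])  # ['like', 'walking', 'my', 'dog']

# zip pairs items position by position and stops when the shortest copy runs out
bigrams = list(zip(*shifted))
print(bigrams)
# [('I', 'like'), ('like', 'walking'), ('walking', 'my'), ('my', 'dog')]
```

Because `zip` stops at the shortest input, the last window always ends on the final word, so a list of 5 words yields 4 bigrams.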

Let's look at bigrams for our tokens:

In [33]:
list(ngrams(no_punc_lower_tokens, 2))

[('mr', 'president'),
 ('president', 'mr'),
 ('mr', 'secretary'),
 ('secretary', 'general'),
 ('general', 'fellow'),
 ('fellow', 'delegates'),
 ('delegates', 'ladies'),
 ('ladies', 'and'),
 ('and', 'gentlemen'),
 ('gentlemen', 'we'),
 ('we', 'come'),
 ('come', 'together'),
 ('together', 'at'),
 ('at', 'a'),
 ('a', 'crossroads'),
 ('crossroads', 'between'),
 ('between', 'war'),
 ('war', 'and'),
 ('and', 'peace'),
 ('peace', 'between'),
 ('between', 'disorder'),
 ('disorder', 'and'),
 ('and', 'integration'),
 ('integration', 'between'),
 ('between', 'fear'),
 ('fear', 'and'),
 ('and', 'hope'),
 ('hope', 'around'),
 ('around', 'the'),
 ('the', 'globe'),
 ('globe', 'there'),
 ('there', 'are'),
 ('are', 'signposts'),
 ('signposts', 'of'),
 ('of', 'progress'),
 ('progress', 'the'),
 ('the', 'shadow'),
 ('shadow', 'of'),
 ('of', 'world'),
 ('world', 'war'),
 ('war', 'that'),
 ('that', 'existed'),
 ('existed', 'at'),
 ('at', 'the'),
 ('the', 'founding'),
 ('founding', 'of'),
 ('of', 'this'),
 

We can use `Counter` for this too:

In [34]:
Counter(list(ngrams(no_punc_lower_tokens, 2))).most_common()

[(('of', 'the'), 26),
 (('we', 'will'), 20),
 (('the', 'world'), 18),
 (('in', 'the'), 16),
 (('those', 'who'), 12),
 (('we', 'have'), 11),
 (('at', 'the'), 9),
 (('to', 'be'), 9),
 (('we', 'can'), 9),
 (('for', 'the'), 9),
 (('america', 'is'), 9),
 (('have', 'been'), 8),
 (('and', 'the'), 8),
 (('of', 'our'), 8),
 (('people', 'of'), 8),
 (('it', 'is'), 8),
 (('is', 'a'), 7),
 (('young', 'people'), 7),
 (('this', 'is'), 7),
 (('there', 'is'), 7),
 (('and', 'we'), 7),
 (('to', 'the'), 7),
 (('is', 'not'), 7),
 (('america', 'will'), 7),
 (('of', 'this'), 6),
 (('the', 'united'), 6),
 (('united', 'states'), 6),
 (('is', 'the'), 6),
 (('of', 'a'), 6),
 (('to', 'do'), 6),
 (('we', 'see'), 6),
 (('must', 'be'), 6),
 (('will', 'be'), 6),
 (('we', 'are'), 6),
 (('the', 'people'), 5),
 (('–', 'a'), 5),
 (('they', 'are'), 5),
 (('with', 'a'), 5),
 (('that', 'has'), 5),
 (('prepared', 'to'), 5),
 (('in', 'this'), 5),
 (('for', 'a'), 5),
 (('a', 'new'), 5),
 (('the', 'region'), 5),
 (('this', 'ins

### Challenge

Write the code to get a list of trigrams (windows of 3) for `no_punc_lower_tokens`. Then, use `Counter` to find the most common trigrams.

In [36]:
Counter(list(ngrams(no_punc_lower_tokens, 3))).most_common()


[(('of', 'the', 'world'), 6),
 (('the', 'united', 'states'), 6),
 (('the', 'people', 'of'), 5),
 (('the', 'middle', 'east'), 4),
 (('it', 'is', 'time'), 4),
 (('we', 'see', 'it'), 4),
 (('see', 'it', 'in'), 4),
 (('of', 'this', 'institution'), 3),
 (('people', 'of', 'the'), 3),
 (('be', 'able', 'to'), 3),
 (('world', 'in', 'which'), 3),
 (('we', 'are', 'prepared'), 3),
 (('are', 'prepared', 'to'), 3),
 (('to', 'address', 'the'), 3),
 (('in', 'this', 'effort'), 3),
 (('al', 'qaeda', 'and'), 3),
 (('middle', 'east', 'and'), 3),
 (('look', 'at', 'the'), 3),
 (('we', 'come', 'together'), 2),
 (('around', 'the', 'globe'), 2),
 (('the', 'ranks', 'of'), 2),
 (('human', 'beings', 'have'), 2),
 (('beings', 'have', 'been'), 2),
 (('and', 'the', 'world'), 2),
 (('this', 'institution', 'is'), 2),
 (('this', 'is', 'the'), 2),
 (('are', 'more', 'likely'), 2),
 (('outbreak', 'of', 'ebola'), 2),
 (('move', 'rapidly', 'across'), 2),
 (('rapidly', 'across', 'borders'), 2),
 (('the', 'brutality', 'of'), 

### Question
Should we remove stop words before looking at bigrams and trigrams? Why or why not?

What kinds of research could we do using word or N-gram frequency analysis?

**YOUR ANSWER**:

---

## Bibliography

- Tokens and Preprocessing section adapted from materials by Chris Hench. https://github.com/henchc/textxd-2017/blob/master/04-Tokens-and-Pre-Processing.ipynb

---
Notebook developed by: Keeley Takimoto

Data Science Modules: http://data.berkeley.edu/education/modules
