# Text Analysis 02: Pre-Processing and Word Frequency


---
<img src="https://upload.wikimedia.org/wikipedia/commons/2/2a/ELI-LA-Word-Cloud-2-900x600.jpg" style="width: 500px; height: 275px;" />

### Professor Crystal Chang

This notebook covers tools and strategies used in data pre-processing to prepare text data for analysis. We will then use the prepared text to count word frequencies.

*Estimated Time: 45 minutes*

---

### Topics Covered
- Regular expressions
- Tokens and pre-processing
- Document Term Frequency

### Table of Contents


1 - [Regular Expressions](#section 1)<br>

2 - [Tokens and Word Counts](#section 2)<br>




**Dependencies:**

In [1]:
import re
import pandas as pd

## Introduction



## 1. Regular Expressions (regex) <a id='section 1'></a>

### Overview

Regular expressions (regex or regexp for short) are special sequences of characters that define patterns
to search for in text. They're often used in find-and-replace operations, or to add up the number of words
or phrases matching a particular pattern.

Regular expressions are useful in a variety of applications, and can be used in different programs and
programming languages. We will start by learning the general components of regular expressions, using a
simple online tool, RegExr. We'll also demonstrate how to use them in Python.

To get started:

1. Go to this site: [http://regexr.com](http://regexr.com).
2. Copy and paste the New York Times leads from the file `nyt_leads.txt` into the __Text__ window on the website.
3. Delete what you see in the __Expression__ field. This is where we'll insert our own regular expressions
to find sequences in the headlines below.

~~~ {.input}
New York Times											October 19, 2016
Retaking Mosul From ISIS May Pale to What Comes Next
By TIM ARANGO and RICK GLADSTONE  11:52 ET
If the recaptures of Ramadi, Tikrit and Falluja are a guide,
Iraqi officials will confront devastation and unexploded bombs
once Mosul is reclaimed.

New York Times											October 18, 2016
Short-Term Cease-Fire in Yemen Appears Likely
By BEN HUBBARD  10:05 ET
The rebels known as the Houthis said they would abide by the
cease-fire if the Saudi military coalition halted attacks and
lifted a blockade.
~~~

### a. Special Characters

Strings are composed of characters, and we are writing patterns to match specific sequences of characters.
Various characters have special meaning in regular expressions. When we use these characters in an expression,
we aren't matching the identical character, we're using the character as a placeholder for some other character(s)
or part(s) of a string.

If you want to match a character that happens to be a special character, you have to escape it with a backslash
`\`. Try typing the following special characters into the __Expression__ field on the regexr.com site. What happens
when you type `New York Times` vs. `^New York Times`? How about `.`, `\.`, or `\.$`?

~~~ {.input}
.         any single character
^         start of string
$         end of string
\n        new line
\r        carriage return
\t        tab
~~~
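
The same experiments can be sketched in Python with the built-in `re` module. This is a minimal illustration, not part of the original exercise: the variable `text` is a shortened, made-up excerpt of the sample leads, and note that Python only treats `^` and `$` as line anchors when the `re.MULTILINE` flag is passed.

```python
import re

# Two shortened lines from the NYT sample (illustrative excerpt, not the full file).
text = "New York Times\nonce Mosul is reclaimed."

# Without re.MULTILINE, ^ anchors only at the very start of the string.
starts = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ also matches at the start of every line.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# An unescaped . matches any single character except a newline.
any_char_count = len(re.findall(r".", text))

# An escaped \. matches only a literal period; \.$ requires it at the end.
literal_periods = re.findall(r"\.", text)
final_period = re.findall(r"\.$", text)

print(starts)            # ['New']
print(starts_multiline)  # ['New', 'once']
print(any_char_count)    # 38
print(literal_periods)   # ['.']
print(final_period)      # ['.']
```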

### b. Quantifiers

Some special characters refer to optional characters, to a specific number of characters, or to an open-ended
number of characters matching the preceding pattern. Try looking for the letter 'o' followed by a number of 'f's:
what happens if you type `of`, `of*`, `of+`, `of{1}`, `of{1,2}`?

~~~ {.input}
*        0 or more of the preceding character/expression
+        1 or more of the preceding character/expression
?        0 or 1 of the preceding character/expression
{n}      n copies of the preceding character/expression 
{n,m}    n to m copies of the preceding character/expression 
~~~
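
To see the quantifiers side by side, here is a small sketch in Python. The string `sample` is invented for illustration; the key point is that each quantifier applies only to the character immediately before it (here, the `f`).

```python
import re

# A made-up string with 'o' followed by zero, one, two, and three 'f's.
sample = "o of off offf"

patterns = ["of", "of*", "of+", "of?", "of{1}", "of{1,2}"]

# Collect every non-overlapping match for each pattern.
results = {p: re.findall(p, sample) for p in patterns}

for p, matches in results.items():
    print(f"{p:8} {matches}")

# of       ['of', 'of', 'of']
# of*      ['o', 'of', 'off', 'offf']
# of+      ['of', 'off', 'offf']
# of?      ['o', 'of', 'of', 'of']
# of{1}    ['of', 'of', 'of']
# of{1,2}  ['of', 'off', 'off']
```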

### c. Sets

Regular expressions also allow you to define sets of characters. Within a set of square brackets, you may list
characters individually, e.g. `[aeiou]`, or in a range, e.g. `[A-Z]` (note that regular expressions are case
sensitive by default).

You can also create a complement set by excluding certain characters, using `^` as the first character
in the set. The set `[^A-Za-z]` will match any character except a letter. Most other special characters lose
their special meaning inside a set, so the set `[.?]` will look for a literal period or question mark.

The set will match only one character contained within that set, so to find sequences of multiple characters from
the same set, use a quantifier like `+` or a specific number or number range `{n,m}`.

~~~ {.input}
[0-9]        any numeric character
[a-z]        any lowercase alphabetic character
[A-Z]        any uppercase alphabetic character
[aeiou]      any vowel (i.e. any character within the brackets)
[0-9a-z]     to combine sets, list them one after another 
[^...]       exclude specific characters
~~~
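
A short Python sketch of these sets, applied to one byline from the sample text. The variable names are illustrative only.

```python
import re

lead = "By TIM ARANGO and RICK GLADSTONE  11:52 ET"

# A set matches ONE character; sets are case sensitive.
lowercase_vowels = re.findall(r"[aeiou]", lead)

# Add a quantifier to match runs of characters from the same set.
uppercase_runs = re.findall(r"[A-Z]+", lead)
digit_runs = re.findall(r"[0-9]+", lead)

# A leading ^ inverts the set: runs of anything that is not a letter or a space.
other_runs = re.findall(r"[^A-Za-z ]+", lead)

# Inside a set, . and ? are literal characters.
literal_marks = re.findall(r"[.?]", "Really? Yes.")

print(lowercase_vowels)  # ['a']
print(uppercase_runs)    # ['B', 'TIM', 'ARANGO', 'RICK', 'GLADSTONE', 'ET']
print(digit_runs)        # ['11', '52']
print(other_runs)        # ['11:52']
print(literal_marks)     # ['?', '.']
```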

### d. Special sequences

Several special characters denote special sequences. These begin with a `\` followed by a letter.
Note that the uppercase version is usually the complement of the lowercase version.

~~~ {.input}
\d        Any digit
\D        Any non-digit character
\w        Any alphanumeric character [0-9a-zA-Z_] 
\W        Any non-alphanumeric character
\s        Any whitespace (space, tab, new line)
\S        Any non-whitespace character
\b        Matches the beginning or end of a word (does not consume a character)
\B        Matches only when the position is not the beginning or end of a word (does not consume a character)
~~~
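
The special sequences can be tried out in Python as well. The sketch below uses one byline from the sample text plus a made-up phrase to show how `\b` restricts a match to whole words.

```python
import re

byline = "By BEN HUBBARD  10:05 ET"

digits = re.findall(r"\d+", byline)      # runs of digits
words = re.findall(r"\w+", byline)       # runs of letters, digits, or underscores
gaps = re.findall(r"\s+", byline)        # runs of whitespace
chunks = re.findall(r"\S+", byline)      # runs of non-whitespace

# \b matches a position, not a character: here it demands a word boundary
# on both sides of "cease".
phrase = "cease-fire ceasefire"
whole_word = re.findall(r"\bcease\b", phrase)
anywhere = re.findall(r"cease", phrase)

print(digits)      # ['10', '05']
print(words)       # ['By', 'BEN', 'HUBBARD', '10', '05', 'ET']
print(gaps)        # [' ', ' ', '  ', ' ']
print(chunks)      # ['By', 'BEN', 'HUBBARD', '10:05', 'ET']
print(whole_word)  # ['cease']
print(anywhere)    # ['cease', 'cease']
```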

### e. Groups and Logical OR

Parentheses are used to designate groups of characters, to aid in logical conditions, and to be able to retrieve the
contents of certain groups separately.

The pipe character `|` serves as a logical OR operator, to match the expression before or after the pipe. Group parentheses
can be used to indicate which elements of the expression are being operated on by the `|`.

~~~ {.input}
|            Logical OR operator
(...)        Matches whatever regular expression is inside the parentheses, and notes the start and end of a group
(this|that)  Matches the expression "this" or the expression "that"
~~~
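
As a rough Python illustration of groups and `|`: the sketch below pulls the parts of a date out of a header line and shows how parentheses change what `|` applies to. The month alternatives and the test strings are chosen only for this example.

```python
import re

header = "New York Times\t\t\tOctober 19, 2016"

# Each pair of parentheses is a group whose contents can be retrieved separately.
date_match = re.search(r"(October|November) (\d{1,2}), (\d{4})", header)
full_date = date_match.group(0)   # the whole match
month = date_match.group(1)
day = date_match.group(2)
year = date_match.group(3)

titles = "Short-Term Long-Term"

# Without parentheses, | splits the ENTIRE pattern: "Short" OR "Long-Term".
ungrouped = re.findall(r"Short|Long-Term", titles)

# With parentheses, | only chooses between "Short" and "Long".
# Note: with one group, re.findall returns the group contents, not the full match.
grouped = re.findall(r"(Short|Long)-Term", titles)
grouped_full = [m.group(0) for m in re.finditer(r"(Short|Long)-Term", titles)]

print(full_date, month, day, year)  # October 19, 2016 October 19 2016
print(ungrouped)                    # ['Short', 'Long-Term']
print(grouped)                      # ['Short', 'Long']
print(grouped_full)                 # ['Short-Term', 'Long-Term']
```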

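## Regex in Python

Important functions in Python's `re` module (the cells below define a sample string, then use IPython's `?` to display each function's documentation):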

In [2]:
my_string = '''New York Times											October 19, 2016
Retaking Mosul From ISIS May Pale to What Comes Next
By TIM ARANGO and RICK GLADSTONE  11:52 ET
If the recaptures of Ramadi, Tikrit and Falluja are a guide,
Iraqi officials will confront devastation and unexploded bombs
once Mosul is reclaimed.

New York Times											October 18, 2016
Short-Term Cease-Fire in Yemen Appears Likely
By BEN HUBBARD  10:05 ET
The rebels known as the Houthis said they would abide by the
cease-fire if the Saudi military coalition halted attacks and
lifted a blockade.'''

In [3]:
re.compile?

In [4]:
re.search?

In [5]:
re.match?

In [6]:
re.sub?

In [7]:
re.findall?

In [8]:
re.split?
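
For a quick feel of what each of these functions returns, here is a minimal sketch. The patterns and the short strings are chosen for illustration and are not part of the original notebook.

```python
import re

lead = "Retaking Mosul From ISIS May Pale to What Comes Next"

# re.compile builds a reusable pattern object.
capitalized = re.compile(r"[A-Z][a-z]+")

# search scans the whole string and returns the first match (or None).
first_word = capitalized.search(lead).group()

# match only succeeds if the pattern matches at the START of the string.
match_at_start = re.match(r"Retaking", lead)
match_in_middle = re.match(r"Mosul", lead)

# sub replaces every match with the replacement string.
squeezed = re.sub(r"\s+", "_", "Short-Term  Cease-Fire")

# findall returns all non-overlapping matches as a list of strings.
numbers = re.findall(r"\d+", "October 19, 2016")

# split cuts the string wherever the pattern matches.
cities = re.split(r",\s*", "Ramadi, Tikrit,Falluja")

print(first_word)              # Retaking
print(match_at_start.group())  # Retaking
print(match_in_middle)         # None
print(squeezed)                # Short-Term_Cease-Fire
print(numbers)                 # ['19', '2016']
print(cities)                  # ['Ramadi', 'Tikrit', 'Falluja']
```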

### Challenge

Write some code using a regex that returns all of the fully capitalized names from `my_string`.

Your answer should be *exactly*:

```python
['TIM ARANGO', 'RICK GLADSTONE', 'BEN HUBBARD']
```

---

## 2. Tokens <a id='section 2'></a>

What is a token? In text analysis, we need something to count in order to quantify language. For NLP practitioners, that something is the token. A token is often simply a word, but more accurately, a token is the unit of analysis for an NLP project. This may be a word, but it may also be a stem or a lemma.

### Tokenizing

Let's first read in an example text to work with. In this section, we'll use the text of President Obama's speech at the United Nations General Assembly on September 28, 2015:

In [11]:
with open('data/obama_un_2015.txt', encoding='utf-8') as f:
    obama_un = f.read()

print(obama_un)

Mr. President, Mr. Secretary General, fellow delegates, ladies and gentlemen: we come together at a crossroads between war and peace; between disorder and integration; between fear and hope. 
Around the globe, there are signposts of progress. The shadow of World War that existed at the founding of this institution has been lifted; the prospect of war between major powers reduced. The ranks of member states has more than tripled, and more people live under governments they elected. Hundreds of millions of human beings have been freed from the prison of poverty, with the proportion of those living in extreme poverty cut in half. And the world economy continues to strengthen after the worst financial crisis of our lives. 
Today, whether you live in downtown New York or in my grandmother’s village more than two hundred miles from Nairobi, you can hold in your hand more information than the world’s greatest libraries. Together, we have learned how to cure disease, and harness the power of t

We may be tempted to use our `str` knowledge and just call `.split()` to get our tokens:

In [None]:
tokens = obama_un.split()
tokens

That looks pretty good. What if we wanted to get some counts? Let's try something we know is in the document: the very first word.

In [None]:
tokens.count('Mr')

Wait, what? We can see two 'Mr's in the first sentence alone! Let's take another look at our tokens:

In [None]:
tokens

We see that the periods stayed attached to our token: the list contains `'Mr.'`, not `'Mr'`. The `.split()` method splits on whitespace only; it has no notion of what a word is.

Similarly, Python treats "It" and "it" as two different tokens:

In [None]:
tokens.count('It')

In [None]:
tokens.count('it')

A quick and dirty way to get some normalized text back would be to remove punctuation and lower-case the entire string:

In [None]:
from string import punctuation
punctuation

In [None]:
no_punc = ''.join([c for c in obama_un if c not in punctuation])
no_punc

In [None]:
no_punc_lower = no_punc.lower()
no_punc_lower

In [None]:
no_punc_lower_tokens = no_punc_lower.split()
no_punc_lower_tokens

In [None]:
no_punc_lower_tokens.count('mr')

In [None]:
no_punc_lower_tokens.count('It'), no_punc_lower_tokens.count('it')
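
One caveat worth illustrating: `string.punctuation` contains only ASCII characters, so typographic marks such as the curly apostrophe `’` or the dash `–` pass through this filter untouched. The sketch below uses a made-up sentence to show the effect, then one possible workaround (an assumption on our part, not the notebook's method) that uses the standard `unicodedata` module to drop every character whose Unicode category starts with `P`.

```python
import unicodedata
from string import punctuation

# Made-up sentence mixing ASCII punctuation (' and .) with typographic marks (’ and –).
sample = "Today’s world – it's ours."

# The notebook's approach: only ASCII punctuation is removed.
ascii_cleaned = ''.join(c for c in sample if c not in punctuation)
ascii_tokens = ascii_cleaned.lower().split()

# Possible alternative: drop any character in a Unicode punctuation category (P*).
unicode_cleaned = ''.join(
    c for c in sample if not unicodedata.category(c).startswith('P')
)
unicode_tokens = unicode_cleaned.lower().split()

print(ascii_tokens)    # ['today’s', 'world', '–', 'its', 'ours']
print(unicode_tokens)  # ['todays', 'world', 'its', 'ours']
```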

### Challenge

Write a function that tokenizes your text. Your function should do three things:
- remove the punctuation
- lower-case the text
- split the text on white spaces

In [29]:
def tokenize(text):
    #your code here
    return ...

Test out your function on a snippet of text.

In [28]:
snippet = obama_un[25195:]
print(snippet)

Join us in this common mission, for today’s children and tomorrow’s. 




In [None]:
tokenize(snippet)

### Counting Tokens

The easiest way to count tokens is with the `Counter` class from `collections`. It returns a dictionary-like object that maps each token to its count:

In [57]:
from collections import Counter
Counter?

In [58]:
Counter(no_punc_lower_tokens)

Counter({'mr': 2,
         'president': 5,
         'secretary': 1,
         'general': 1,
         'fellow': 2,
         'delegates': 2,
         'ladies': 1,
         'and': 151,
         'gentlemen': 1,
         'we': 91,
         'come': 11,
         'together': 12,
         'at': 17,
         'a': 90,
         'crossroads': 2,
         'between': 7,
         'war': 18,
         'peace': 13,
         'disorder': 1,
         'integration': 1,
         'fear': 2,
         'hope': 2,
         'around': 2,
         'the': 218,
         'globe': 3,
         'there': 10,
         'are': 25,
         'signposts': 1,
         'of': 146,
         'progress': 5,
         'shadow': 1,
         'world': 31,
         'that': 71,
         'existed': 1,
         'founding': 2,
         'this': 49,
         'institution': 4,
         'has': 19,
         'been': 11,
         'lifted': 1,
         'prospect': 1,
         'major': 3,
         'powers': 2,
         'reduced': 1,
         'ranks': 2,
 

In [59]:
Counter(no_punc_lower_tokens).most_common()

[('the', 218),
 ('and', 151),
 ('to', 151),
 ('of', 146),
 ('we', 91),
 ('a', 90),
 ('in', 78),
 ('that', 71),
 ('is', 58),
 ('–', 57),
 ('this', 49),
 ('our', 45),
 ('will', 43),
 ('be', 40),
 ('for', 40),
 ('have', 33),
 ('world', 31),
 ('as', 30),
 ('people', 28),
 ('it', 28),
 ('or', 26),
 ('not', 26),
 ('are', 25),
 ('can', 24),
 ('their', 24),
 ('with', 23),
 ('by', 23),
 ('who', 23),
 ('america', 21),
 ('but', 20),
 ('has', 19),
 ('war', 18),
 ('from', 18),
 ('you', 18),
 ('at', 17),
 ('they', 16),
 ('on', 15),
 ('no', 15),
 ('more', 14),
 ('so', 14),
 ('peace', 13),
 ('those', 13),
 ('us', 13),
 ('an', 13),
 ('nations', 13),
 ('do', 13),
 ('must', 13),
 ('all', 13),
 ('where', 13),
 ('together', 12),
 ('new', 12),
 ('come', 11),
 ('been', 11),
 ('young', 11),
 ('when', 11),
 ('there', 10),
 ('i', 10),
 ('because', 10),
 ('states', 9),
 ('power', 9),
 ('problems', 9),
 ('into', 9),
 ('see', 9),
 ('conflict', 9),
 ('if', 9),
 ('muslim', 9),
 ('today', 8),
 ('united', 8),
 ('time'
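
Because the full output is long, it can help to see `Counter` on a tiny input first. The sketch below uses a made-up phrase rather than the speech; `most_common(n)` limits the result to the top `n` entries, and looking up a token that never occurs returns `0` instead of raising an error.

```python
from collections import Counter

# A short, made-up token list standing in for the full speech.
tokens = "the world and the people of the world".split()

counts = Counter(tokens)

the_count = counts['the']          # 3
missing_count = counts['peace']    # 0: absent keys count as zero
top_two = counts.most_common(2)    # the two most frequent tokens

print(the_count, missing_count)  # 3 0
print(top_two)                   # [('the', 3), ('world', 2)]
```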

### N-Grams

N-grams help us move beyond looking at one word at a time and allow us to look at co-occurrence. An n-gram function slides a window of `n` words across a list of words, moving one word at a time. It's easiest to understand from an example. First we'll define a function `ngrams`. Don't worry about the code here!

In [32]:
def ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

Let's look at a bigram (window of 2):

In [34]:
list(ngrams('I like walking my dog'.split(), 2))

[('I', 'like'), ('like', 'walking'), ('walking', 'my'), ('my', 'dog')]
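
For readers curious about how the one-line function works, here is a step-by-step sketch of the same idea. It builds `n` copies of the list, each shifted one position further, and lets `zip` pair them up; `zip` stops at the shortest list, which is why the last window ends exactly at the final word.

```python
words = 'I like walking my dog'.split()
n = 2

# Step 1: n copies of the list, each starting one word later.
shifted = [words[i:] for i in range(n)]
# [['I', 'like', 'walking', 'my', 'dog'],
#  ['like', 'walking', 'my', 'dog']]

# Step 2: zip pairs up the items at the same position in each copy.
bigrams = list(zip(*shifted))

print(bigrams)
# [('I', 'like'), ('like', 'walking'), ('walking', 'my'), ('my', 'dog')]
```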

### Challenge

Write the code to get a list of trigrams (windows of 3) for the string `'I like walking my dog'`.

In [62]:
# your code here

[('I', 'like', 'walking'), ('like', 'walking', 'my'), ('walking', 'my', 'dog')]

Let's look at bigrams for our tokens:

In [63]:
list(ngrams(no_punc_lower_tokens, 2))

[('mr', 'president'),
 ('president', 'mr'),
 ('mr', 'secretary'),
 ('secretary', 'general'),
 ('general', 'fellow'),
 ('fellow', 'delegates'),
 ('delegates', 'ladies'),
 ('ladies', 'and'),
 ('and', 'gentlemen'),
 ('gentlemen', 'we'),
 ('we', 'come'),
 ('come', 'together'),
 ('together', 'at'),
 ('at', 'a'),
 ('a', 'crossroads'),
 ('crossroads', 'between'),
 ('between', 'war'),
 ('war', 'and'),
 ('and', 'peace'),
 ('peace', 'between'),
 ('between', 'disorder'),
 ('disorder', 'and'),
 ('and', 'integration'),
 ('integration', 'between'),
 ('between', 'fear'),
 ('fear', 'and'),
 ('and', 'hope'),
 ('hope', 'around'),
 ('around', 'the'),
 ('the', 'globe'),
 ('globe', 'there'),
 ('there', 'are'),
 ('are', 'signposts'),
 ('signposts', 'of'),
 ('of', 'progress'),
 ('progress', 'the'),
 ('the', 'shadow'),
 ('shadow', 'of'),
 ('of', 'world'),
 ('world', 'war'),
 ('war', 'that'),
 ('that', 'existed'),
 ('existed', 'at'),
 ('at', 'the'),
 ('the', 'founding'),
 ('founding', 'of'),
 ('of', 'this'),
 

We can use `Counter` for this too:

In [64]:
Counter(list(ngrams(no_punc_lower_tokens, 2))).most_common()

[(('of', 'the'), 26),
 (('we', 'will'), 20),
 (('the', 'world'), 18),
 (('in', 'the'), 16),
 (('those', 'who'), 12),
 (('we', 'have'), 11),
 (('at', 'the'), 9),
 (('to', 'be'), 9),
 (('we', 'can'), 9),
 (('for', 'the'), 9),
 (('america', 'is'), 9),
 (('have', 'been'), 8),
 (('and', 'the'), 8),
 (('of', 'our'), 8),
 (('people', 'of'), 8),
 (('it', 'is'), 8),
 (('is', 'a'), 7),
 (('young', 'people'), 7),
 (('this', 'is'), 7),
 (('there', 'is'), 7),
 (('and', 'we'), 7),
 (('to', 'the'), 7),
 (('is', 'not'), 7),
 (('america', 'will'), 7),
 (('of', 'this'), 6),
 (('the', 'united'), 6),
 (('united', 'states'), 6),
 (('is', 'the'), 6),
 (('of', 'a'), 6),
 (('to', 'do'), 6),
 (('we', 'see'), 6),
 (('must', 'be'), 6),
 (('will', 'be'), 6),
 (('we', 'are'), 6),
 (('the', 'people'), 5),
 (('–', 'a'), 5),
 (('they', 'are'), 5),
 (('with', 'a'), 5),
 (('that', 'has'), 5),
 (('prepared', 'to'), 5),
 (('in', 'this'), 5),
 (('for', 'a'), 5),
 (('a', 'new'), 5),
 (('the', 'region'), 5),
 (('this', 'ins

## Stop words

You've probably noticed that a lot of our most frequent words, bigrams, and trigrams aren't very interesting: they contain a lot of **stop words** like 'and', 'the', and 'of'. Frequently, we'll want to remove the stop words in order to get more meaningful results from our analysis.

Fortunately, the scikit-learn package has a list of English stop words we can use:

In [30]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'
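
Once a stop word list is available, removing stop words is a simple filter over the tokens. The sketch below is only an illustration: `stop_words` is a tiny hand-written stand-in (an assumption made so the example runs without scikit-learn), and the token list is the opening of the speech after lower-casing and punctuation removal. In the notebook you would use `ENGLISH_STOP_WORDS`, which is much longer and would remove more words.

```python
from collections import Counter

# Tiny illustrative stop word set; NOT the scikit-learn list.
stop_words = {'a', 'and', 'at', 'between', 'of', 'the', 'we'}

tokens = ['we', 'come', 'together', 'at', 'a', 'crossroads',
          'between', 'war', 'and', 'peace']

# Keep only the tokens that are not stop words.
content_tokens = [t for t in tokens if t not in stop_words]

# The filtered list can be counted exactly like before.
content_counts = Counter(content_tokens)

print(content_tokens)
# ['come', 'together', 'crossroads', 'war', 'peace']
```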

---

## Bibliography

- Regex section taken from materials by Chris Hench. https://github.com/henchc/textxd-2017/blob/master/03-regex.ipynb
- Tokens and Preprocessing section taken from materials by Chris Hench. https://github.com/henchc/textxd-2017/blob/master/04-Tokens-and-Pre-Processing.ipynb

---
Notebook developed by: Keeley Takimoto

Data Science Modules: http://data.berkeley.edu/education/modules
