## Introduction to Regular Expressions in Python

On Tuesday, the class used Python's NLTK to perform text analysis. Tuesday's module focused on tokenizing strings and then using the library's built-in functions to analyze those tokens. Today, we will learn another text analysis technique that uses a special kind of pattern called a "regular expression". For this module, we will use functions from Python's `re` module.

### Learning Goals:
The goal of this lesson is to gain an introductory understanding of how regular expressions can be used with large portions of text to cleanly pull data from the text. Regular expressions can seem overwhelming at first, but with practice, they become easier to use. The goal is to add the usage of regular expressions to your text analysis 'toolbox'; from there, how you choose to analyze your own data is based on your own design and vision! Don't worry if you don't understand everything! Ask questions if you require assistance.


### Lesson Outline:
- Q&A regarding Tuesday's module content
- Structure of Strings in code
- Regular Expression Overview
- findall, sub, and more
- Examples (with Clayton's data)
- Practice!


## Structure of Strings in Python

As you may recall from Tuesday, a string (in coding jargon) is a type of value that consists of a sequence of characters in a particular order. Characters include letters, digits, punctuation, and whitespace. Even though strings are immutable (they cannot be changed in place), it is still possible to build new strings from them and pull information out of them. In Python, strings behave much like lists: just as we can pick elements in order from a list, we can pick characters in order from a string. One important fact to keep in mind is that Python indexing starts at 0; that is, the first element of any Python sequence is at index 0.

Try running the code cells below to gain an understanding of string structure. Try to predict what each cell will print before you run it!

In [1]:
example = 'California'
print(len(example)) #This function provides the length of a string. Tip: The length and final index of a string are different.

10


In [2]:
print(example[0]) #We can reach an indexed value using the [] with the index number in the middle
print(example[5])

C
o


In [3]:
for elem in example:
    print(elem)

C
a
l
i
f
o
r
n
i
a


To extract a portion of a string, use a colon within your square brackets. For example, [0:2] will give you the characters at index 0 and index 1 (but not index 2!). As you can see, slicing includes the first index but not the last. If you would like to get everything from the start up to a certain index, leave the left side of the colon empty. If you would like to get everything from a certain index to the end of a string, leave the right side of the colon empty.

In [4]:
print(example[0:2])

Ca


In [5]:
print(example[2:5])

lif


In [6]:
print(example[:5])

Calif


In [7]:
print(example[5:])

ornia
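
As a small additional sketch of the indexing and slicing rules above: negative indices count from the end, and because strings are immutable, "changing" one really means building a new string. The variable names here (`word`, `new_word`) are made up for illustration.

```python
word = 'California'

# Negative indices count backwards from the end of the string.
last_char = word[-1]        # 'a'
last_three = word[-3:]      # 'nia'

# Strings are immutable: assigning to an index raises a TypeError.
try:
    word[0] = 'K'
except TypeError:
    changed_in_place = False
else:
    changed_in_place = True

# "Changing" a string means building a new one from slices.
new_word = 'K' + word[1:]   # 'Kalifornia'
print(last_char, last_three, changed_in_place, new_word)
```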


Do you feel comfortable with strings? Ask questions if not!

## Regular Expressions

On Tuesday, you worked with Python's NLTK to tokenize words and sort/structure/manipulate them using the built-in functions. Regular expressions are an alternative way to 'search' for information within strings. At their most basic level, regular expressions are sequences of characters that define a pattern with which the computer searches. Regular expressions give us immense power by allowing us to search within extremely large portions of text for very specific types of text/information.

Imagine that we are given a farm of various animals. Think of regular expressions as defining features that help us find exactly what we are looking for. In my farm, I want to search for animals that are brown, have 4 legs, and weigh more than 10 pounds. Each of those animal characteristics is analogous to a "sequence of characters" in the context of regular expressions. I may be looking for words that contain capital letters, or words that start with a certain sequence of characters, just as I am looking for brown animals, or animals with 4 legs.

In the cells below, we will introduce some basic regular expression code, and the findall function within the 're' Python library. This function will allow us to transform our confusing regex code into tangible results.



In [8]:
#importing the package for regular expressions
import re

### re.findall

You can use `re.findall` to find all instances of some string/regex/pattern within a larger string.

It is used with the syntax `re.findall(pattern, string)`, where `pattern` is the pattern that you want to look for in `string`. It returns all instances of that pattern in a list.

In [9]:
# this is our example string
example = 'The dog and cat and muskrat and snake and cow and mouse and moose and mare and deer and macaw and bear all went to the store.'

In [10]:
# you can put the string that you want to look for
re.findall('and', example)

['and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and']

In [11]:
# if you call len on the list, it will tell you how many items there are
len(re.findall('and', example))

10

In [12]:
len(re.findall(' ', example))

26

In [13]:
# '.' matches any single character (except a newline)
print(re.findall('.', example))

['T', 'h', 'e', ' ', 'd', 'o', 'g', ' ', 'a', 'n', 'd', ' ', 'c', 'a', 't', ' ', 'a', 'n', 'd', ' ', 'm', 'u', 's', 'k', 'r', 'a', 't', ' ', 'a', 'n', 'd', ' ', 's', 'n', 'a', 'k', 'e', ' ', 'a', 'n', 'd', ' ', 'c', 'o', 'w', ' ', 'a', 'n', 'd', ' ', 'm', 'o', 'u', 's', 'e', ' ', 'a', 'n', 'd', ' ', 'm', 'o', 'o', 's', 'e', ' ', 'a', 'n', 'd', ' ', 'm', 'a', 'r', 'e', ' ', 'a', 'n', 'd', ' ', 'd', 'e', 'e', 'r', ' ', 'a', 'n', 'd', ' ', 'm', 'a', 'c', 'a', 'w', ' ', 'a', 'n', 'd', ' ', 'b', 'e', 'a', 'r', ' ', 'a', 'l', 'l', ' ', 'w', 'e', 'n', 't', ' ', 't', 'o', ' ', 't', 'h', 'e', ' ', 's', 't', 'o', 'r', 'e', '.']


In [14]:
# you can combine special characters like '.' with plain letters
re.findall('m.', example)

['mu', 'mo', 'mo', 'ma', 'ma']

In [15]:
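# '\w' matches any "word" character: a letter, a digit, or an underscore
# '+' indicates that we want instances where there are one or more in a row
# the r prefix makes this a raw string, so Python leaves the backslash alone
print(re.findall(r'\w+', example))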

['The', 'dog', 'and', 'cat', 'and', 'muskrat', 'and', 'snake', 'and', 'cow', 'and', 'mouse', 'and', 'moose', 'and', 'mare', 'and', 'deer', 'and', 'macaw', 'and', 'bear', 'all', 'went', 'to', 'the', 'store']


In [16]:
# you can also specify how many repeats of a character you want using {}
# {1,3} means "between one and three in a row"
print(re.findall(r'\w{1,3}', example))

['The', 'dog', 'and', 'cat', 'and', 'mus', 'kra', 't', 'and', 'sna', 'ke', 'and', 'cow', 'and', 'mou', 'se', 'and', 'moo', 'se', 'and', 'mar', 'e', 'and', 'dee', 'r', 'and', 'mac', 'aw', 'and', 'bea', 'r', 'all', 'wen', 't', 'to', 'the', 'sto', 're']
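
The output above shows that `\w{1,3}` chops longer words into pieces. One way to match only whole words of up to three characters is to add the word-boundary marker `\b` on each side. This is a minimal sketch on a made-up sentence (`sample` is not part of the lesson data).

```python
import re

sample = 'The dog and muskrat went to the store.'

# \w{1,3} alone chops long words into pieces of up to three characters.
chunks = re.findall(r'\w{1,3}', sample)

# \b marks a word boundary, so \b\w{1,3}\b only matches whole short words.
short_words = re.findall(r'\b\w{1,3}\b', sample)
print(short_words)  # ['The', 'dog', 'and', 'to', 'the']
```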


In [17]:
# '\s' matches a single whitespace character
re.findall(r'm\w+\s', example)

['muskrat ', 'mouse ', 'moose ', 'mare ', 'macaw ']

In [18]:
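# you can use [] to indicate that the next character can be any one of the options within the brackets
# the options are listed with no separators: [ua] means "u or a" (a comma inside would match a literal comma)
# '\S' (capital S) matches any single non-whitespace character
re.findall(r'm[ua]\w+\S', example)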

['muskrat', 'mare', 'macaw']

In [19]:
# '?' means that the preceding character is optional
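
The cell above has no code, so here is one possible illustration of `?`, using a short made-up string (`animals`) rather than the lesson's `example`.

```python
import re

animals = 'mouse and moose and mare'

# 'o?' and 'u?' each match zero or one of the preceding letter,
# so this single pattern covers both "mouse" and "moose".
matches = re.findall(r'moo?u?se', animals)
print(matches)  # ['mouse', 'moose']
```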

### re.sub

If you wanted to substitute something in for all of the patterns that you found with `re.findall`, you could use `re.sub`. 

It is used with the syntax `re.sub(pattern, repl, string)`, where `pattern` is the pattern that you are looking for within `string`. The string that you want to replace `pattern` with is `repl`.

In [None]:
# remove every "and <animal> " phrase where the animal starts with "mu" or "ma"
re.sub(r'and m[ua]\w+\S ', '', example)

In [None]:
re.sub(' and', ',',example)

In [None]:
re.sub(' and', ',',example, count=len(re.findall('and', example))-1)
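
Since the cells above were not run, here is a sketch of what `re.sub` and its `count` argument do on a shorter, made-up string (`short`).

```python
import re

short = 'dog and cat and cow'

# With no count, every match is replaced.
all_replaced = re.sub(' and', ',', short)

# count=1 replaces only the first match and leaves the rest alone.
first_only = re.sub(' and', ',', short, count=1)

print(all_replaced)  # dog, cat, cow
print(first_only)    # dog, cat and cow
```

One thing to keep in mind: `count=0` is the default and means "no limit", so an expression like `len(...) - 1` replaces everything when there is exactly one match.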

### Split

`re.sub` returns a single string, but sometimes you want to break your string up into smaller strings. You can use the method `.split()` to break up a string at every occurrence of a specific separator string and put the pieces into a list.

If you want to split up a string by a regular expression, you can use `re.split`, which takes the same arguments as `re.findall`, first the pattern that you want to split by, then the string that is to be split.

In [None]:
#if you don't put anything in the parentheses after .split, it will default to splitting on whitespace
split_by_spaces = example.split()

print(split_by_spaces)


#sometimes it will be more helpful to split by a specific string
split_by_and = example.split('and')

print(split_by_and)
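
The cell above only uses the `.split()` method. As a minimal sketch of `re.split`, the pattern below splits on several different separators at once, which `.split()` cannot do in one call. The string `messy` is made up for illustration.

```python
import re

messy = 'dog, cat;cow  and   deer'

# Split on a comma or semicolon (with optional trailing whitespace),
# or on the word "and" surrounded by any amount of whitespace.
pieces = re.split(r'[,;]\s*|\s+and\s+', messy)
print(pieces)  # ['dog', 'cat', 'cow', 'deer']
```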

### Reading in your text files

In [None]:
# we are opening the file 'coffee_133_152.txt' in read mode ('r'), and calling that item f
with open('coffee_133_152.txt', 'r') as f:
    #we then read all of the lines of that file (f.read) and assign that text to read_data
    read_data = f.read()
    
print(read_data)

### Combining those text files

There are many different ways to combine the files into one corpus, but one way is to concatenate the strings directly, one after the other.

In [None]:
with open('coffee_1_43.txt', 'r') as f:
    #we read the whole file (f.read) and assign that text to read_data2
    read_data2 = f.read()

with open('coffee_44_133.txt', 'r') as f:
    #we read the whole file (f.read) and assign that text to read_data3
    read_data3 = f.read()

In [None]:
corpus = read_data2 + '\n' + read_data3 + '\n' + read_data
print(corpus)
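
If there were many files, repeating the `with open(...)` block would get tedious. One alternative, sketched here under the assumption that the files are plain text and should be joined in the order given, is a small helper. The function name `read_corpus` is hypothetical, not something from the lesson.

```python
from pathlib import Path

def read_corpus(paths, separator='\n'):
    """Read each file in order and join the contents into one string."""
    texts = [Path(p).read_text() for p in paths]
    return separator.join(texts)

# Possible usage, listing the files in page order:
# corpus = read_corpus(['coffee_1_43.txt', 'coffee_44_133.txt', 'coffee_133_152.txt'])
```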

## Applying all of this

In [None]:
# importing additional packages

# datascience is a library that was created for Data 8
# it is mainly used for creating and manipulating tables
from datascience import *
%matplotlib inline

# pandas is similar to datascience, but has more features, 
# though with those features come a steeper learning curve
import pandas as pd

# pprint (pronounced pretty-print) allows us to print with
# special formatting that makes things easier to read
import pprint
pp = pprint.PrettyPrinter()

# Counter counts the number of times that an item appears in a list
from collections import Counter

### Basic Analysis

In [None]:
# figuring out how much coffee actually comes up
# note that we changed the corpus to be lower case 
# when we want to see how many times coffee appears
coffee_mentions = re.findall('coffee', corpus.lower())

print(len(coffee_mentions))
print(coffee_mentions)

In [None]:
# Use findall to quickly get the context surrounding certain words
re.findall('.{,30}coffee.{,30}', corpus)

In [None]:
# the text also is about inter-american trade, 
# so we may want to know about references to america

america_mentions = re.findall('america', corpus.lower())
print('America count:')
print(len(america_mentions))

pan_america_mentions = re.findall('pan america', corpus.lower())
print('\nPan America count:')
print(len(pan_america_mentions))

In [None]:
# Use findall to quickly get the context surrounding certain words
re.findall('.{,30}America.{,30}', corpus)
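
The context searches above are case-sensitive, while the counts were taken on a lowercased corpus. A possible way to get both at once is the `re.IGNORECASE` flag, sketched here on a made-up sentence (`text`) rather than the real corpus.

```python
import re

text = 'Coffee is grown in Brazil. Brazil exports coffee.'

# flags=re.IGNORECASE matches "Coffee" and "coffee" alike,
# while the returned strings keep their original capitalization.
mentions = re.findall('coffee', text, flags=re.IGNORECASE)
contexts = re.findall(r'.{,10}coffee.{,10}', text, flags=re.IGNORECASE)

print(len(mentions))  # 2
print(contexts)
```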

Next, we can look at the frequency of certain words. This is one way to do it, though it would probably be better to use a package that is made explicitly for this sort of thing, like NLTK.

In [None]:
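# the corpus is lowercased first, so patterns for capital letters would never match;
# one pattern is enough: runs of word characters, apostrophes, and hyphens
word_dictionary = Counter(re.findall(r"[\w'-]+", corpus.lower()))
pp.pprint(word_dictionary)

# if you would rather have it as a table, you can uncomment the line below
# pd.DataFrame.from_dict(word_dictionary, orient='index').sort_values(by=0, ascending=False)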
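
A `Counter` can also report its most frequent entries directly. This is a minimal sketch on a made-up sentence, using the same kind of word pattern as above.

```python
import re
from collections import Counter

text = 'The cat and the dog and the cow'
counts = Counter(re.findall(r"[\w'-]+", text.lower()))

# most_common(n) returns the n most frequent items as (word, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('the', 3), ('and', 2)]
```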

### Creating a dictionary of pages

First, we are going to try to make it so that we have a way of sorting through our text. One way to do that is to make a dictionary for each page. A dictionary is a data structure that consists of key-value pairs. Each key is linked to a value, so when you present the dictionary with a key, it returns the key's corresponding value. In our case, we are going to want to make page numbers our keys and the text that they link to our values.

<img src='page_number.png' style='width:200px;'>

Looking at how the pages are numbered, how do you think we can use regular expressions to pick out the page number?

In [None]:
# '\[' and '\]' match literal square brackets; '\d+' matches one or more digits
re.findall(r'\[\d+\]', corpus)

If you inspect the values in the list above, you will notice that we're missing some page numbers, so let's try a different regex.

In [None]:
# allow any 1 to 5 characters between the brackets, not just digits
page_numbers = re.findall(r'\[.{1,5}\]', corpus)
page_numbers

This one still isn't complete, but it's better. Try inspecting the actual text to see what we can change to get more pages!

Now that we know what the page numbers are, we can split the text between those points. We are going to use `re.split` instead of the built in `.split()` method because we want to search by more complex regular expressions.

In [None]:
# [:-1] drops the text after the final page marker, which has no page number of its own
split_pages = re.split(r'\[.{1,5}\]', corpus)[:-1]
print(split_pages)

Now that we have our page numbers and their corresponding content in two lists, we are ready to combine them into a dictionary. We make use of the built-in functions `zip`, which pairs up corresponding elements of the lists, and `dict`, which creates the dictionary from those pairs.

In [None]:
page_dictionary = dict(zip(page_numbers,split_pages))
pp.pprint(page_dictionary)
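
To see the whole `findall`, `re.split`, `zip`, and `dict` sequence on something small, here is a sketch with a made-up string (`toy`) whose page markers follow the text they label, as assumed above.

```python
import re

toy = 'first page text [1] second page text [2] trailing text'

markers = re.findall(r'\[\d+\]', toy)       # the keys
pieces = re.split(r'\[\d+\]', toy)[:-1]     # the text before each marker
toy_pages = dict(zip(markers, pieces))

print(markers)           # ['[1]', '[2]']
print(toy_pages['[2]'])  # ' second page text '
```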

In [None]:
# this is how you call out a value of a dictionary
print(page_dictionary['[5]'])

Right now, we have to look up a page using its number exactly as it appears in the text, brackets included. We can clean that up by stripping the brackets and spaces from each key.

In [None]:
clean_page_numbers = [x.replace('[', '').replace(' ', '').replace(']', '') for x in page_numbers]
clean_dictionary = dict(zip(clean_page_numbers,split_pages))

print(clean_dictionary['5'])
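
The cell above chains three `.replace` calls. The same cleanup could also be done with a single `re.sub`, sketched here on a few made-up page markers (`raw_numbers`).

```python
import re

raw_numbers = ['[5]', '[ 12]', '[iv]']

# The character class matches '[', ']', or any whitespace character,
# and each match is replaced with the empty string.
cleaned = [re.sub(r'[\[\]\s]', '', n) for n in raw_numbers]
print(cleaned)  # ['5', '12', 'iv']
```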

Compare the above dictionary entry for page 5 and the actual page below!

<img src='page_5.png'>

### Making a dictionary for each entry in the appendix

Below is the pattern that separates each entry of the appendix.

<img src='appendix_37.png'>

In [None]:
# pulling out the appendix portion of the text
appendix = corpus.split('V. APPENDICES')[1]

# we split the text up by the appendix headers
# the parentheses around our regex mean that each header will
# remain in the returned list, as its own string
# [oO0] accepts "No", "NO", or "N0" (a zero in place of the letter O)
split_by_appendix = re.split(r'(APPENDIX N[oO0]. \d+)', appendix)
print(split_by_appendix)

In [None]:
# we pick out the appendix numbers to be our keys
app_num = split_by_appendix[1::2]

# picking out the content of the appendices
app_content = split_by_appendix[2::2]

# creating our dictionary
appendix_dictionary = dict(zip(app_num,app_content))
print(appendix_dictionary['APPENDIX NO. 37'])
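
The `[1::2]` and `[2::2]` slices work because a capturing group makes `re.split` alternate between body text and headers. A small sketch on a made-up string (`toy_appendix`) shows the layout.

```python
import re

toy_appendix = 'intro APPENDIX NO. 1 first body APPENDIX NO. 2 second body'

parts = re.split(r'(APPENDIX NO\. \d+)', toy_appendix)
# parts alternates: text, header, text, header, text
headers = parts[1::2]   # every second item, starting at index 1
bodies = parts[2::2]    # every second item, starting at index 2
toy_dictionary = dict(zip(headers, bodies))

print(parts)
print(toy_dictionary['APPENDIX NO. 2'])  # ' second body'
```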

### Reproducing a table that exists in the text

<img src='page_125.png'>

In [None]:
# at the bottom of page 125, there is a table
page_125 = clean_dictionary['125']
print(page_125)

In [None]:
# using a list comprehension to keep only the lines that match the pattern
# re.search returns a match object (which counts as true) if the pattern is found
# anywhere in the given string, and None (which counts as false) otherwise
rows = [line for line in page_125.split('\n') if re.search(r'\w+\s+\d+,+', line)]
rows
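
To make the behavior of `re.search` concrete, here is a sketch on two made-up lines (they are not taken from page 125).

```python
import re

lines = ['Alpha 1,000', 'A heading with no figures']
pattern = r'\w+\s+\d+,+'

first = re.search(pattern, lines[0])    # a match object
second = re.search(pattern, lines[1])   # None, because nothing matches

print(first.group())  # Alpha 1,
print(second)         # None

# In an "if", a match object counts as true and None counts as false.
kept = [line for line in lines if re.search(pattern, line)]
print(kept)           # ['Alpha 1,000']
```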

In [None]:
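# the two groups capture the country name and the quota
# re.split returns ['', country, quota, ''], so [1:3] keeps the middle two items
table_rows = [re.split(r'(.+) (.+)', line)[1:3] for line in rows]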
table_rows
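
An alternative way to pull the two columns apart, sketched here on made-up rows (`toy_rows`), is `re.match` with `.groups()`. Because the first `.+` is greedy, a multi-word name stays together and only the last space-separated item becomes the number.

```python
import re

toy_rows = ['Alpha 1,000', 'Beta Gamma 25,500']

parsed = []
for line in toy_rows:
    match = re.match(r'(.+) (.+)', line)
    if match:
        name, number = match.groups()
        # remove the thousands separators before converting to an integer
        parsed.append((name, int(number.replace(',', ''))))

print(parsed)  # [('Alpha', 1000), ('Beta Gamma', 25500)]
```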

In [None]:
#building a datascience table with the information we have above
quota_table = Table(['Country', 'Quota']).with_rows(table_rows)

# cleaning up that table so that the quota numbers are integers
refined_table = quota_table.with_column('Quota', [int(value.replace(',','')) for value in quota_table['Quota']])
refined_table.show(refined_table.num_rows)

In [None]:
# plotting a barchart with the built in barh function
refined_table.barh('Country')

## Resources

A helpful website to practice all of these regular expressions is https://regexone.com

Here is the documentation for re https://docs.python.org/3/library/re.html#module-re

And here is a quick cheatsheet for regex http://www.rexegg.com/regex-quickstart.html