# Introduction to Regular Expressions in Python

### Workshop Format

* A Very Brief Intro to Essential Regular Expressions

* Use case 1: Replacing text errors in a problematic dataset
* Use case 2: Interacting with files with parsable names 
* Use case 3: Find most frequently occurring words in Jane Eyre

* Own-time: Explore existing use cases & datasets

### A Very Brief Intro to Essential Regular Expressions

Regular expressions are "a sequence of characters that define a search pattern." They allow you to isolate portions of text or numeric data, and do various operations on them. 

They occur in many contexts across various software, e.g. Excel, the command line, and Stata, and in major programming languages, e.g. Ruby, Java, and C. The principles are for the most part very similar, with minor modifications depending on the platform. This tutorial concentrates on the use of regular expressions in Python. 

To start, open up Pythex, an online Python regular expression editor. It lets us start practising regular expressions quickly, and provides real-time verification and tips. 

[Pythex Example 1](http://pythex.org/?regex=.&test_string=%22.%22%20and%20%22%3F%22%20are%20special%20characters%20in%20regex.%20%0A%0AYou%20need%20to%20escape%20them%20with%20backslash%20-%20%22%5C%22.%20%0A%0ABackslash%20is%20also%20a%20special%20character%2C%20you%20need%20to%20escape%20it%20twice%20-%20%22%5C%5C%22%20to%20find%20it.%20%0A&ignorecase=0&multiline=0&dotall=0&verbose=0): Try searching for phrases within the text string. What happens? 

### 1. Special characters 

Various characters have special meanings in regular expressions. For instance, "." will match any character besides newline. Try it out. 

If you want to match a character that happens to be a special character, you have to escape it with backslash - "\". Starting out, the easiest way to identify special characters is to try them out.

[Pythex Example 1](http://pythex.org/?regex=.&test_string=%22.%22%20and%20%22%3F%22%20are%20special%20characters%20in%20regex.%20%0A%0AYou%20need%20to%20escape%20them%20with%20backslash%20-%20%22%5C%22.%20%0A%0ABackslash%20is%20also%20a%20special%20character%2C%20you%20need%20to%20escape%20it%20twice%20-%20%22%5C%5C%22%20to%20find%20it.%20%0A&ignorecase=0&multiline=0&dotall=0&verbose=0): Select the various special characters in the test string. 
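
The same escaping rules can be tried in Python itself. The following is a minimal sketch using the standard `re` module; the sample string is made up for illustration, and `re.escape` is shown as one convenient way to escape a literal phrase rather than the only one.

```python
import re

# Made-up sample text containing two special characters: "." and "\"
text = 'Is "." special? Yes. So is "\\".'

# Unescaped "." matches every character except newline
any_char = re.findall(r".", text)

# Escaped "\." matches only literal periods
periods = re.findall(r"\.", text)

# A literal backslash is written as "\\" in the pattern
backslashes = re.findall(r"\\", text)

# re.escape escapes the special characters in a literal phrase for you
plus_hits = re.findall(re.escape("1 + 3"), "1 + 3 = 4, 1 + 3 + 5 = 9")

print(len(any_char), len(periods), len(backslashes), len(plus_hits))
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the regular expression engine sees them.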

### 2.  Sets & Quantifiers

Regular expressions also allow you to define sets of characters for more customized parsing. A set is a list of possible characters within square brackets, e.g. [a-zA-Z]. A quantifier placed directly after the set specifies how many characters you want from it: "*" for 0 or more, "+" for 1 or more, "?" for 0 or 1, and {n} for exactly n. In combination, your regular expression might look like [0-9]{2}, which matches two digits in a row. 

[0-9] - all numeric values <br />
[a-z] - all lower-case values <br />
[A-Z] - all upper-case values <br />
[0-9a-z] - to combine sets, list them one after another <br />
[ ]  - a space inside the brackets matches a literal space <br />
\* - 0 or more from the set or expression <br />
\+ - 1 or more from the set or expression <br />
\? - 0 or 1 from the set or expression <br />
{n} - n from the set or expression <br />

Refer to your cheatsheet for more examples. 

[Pythex Example 2](http://pythex.org/?regex=%5B0-9%5D%7B2%7D&test_string=Adding%20odd%20sequences%20always%20yields%20squares%3A%201%20%3D%201%20x%201%2C%201%20%2B%203%20%3D%204%20%3D%202%20x%202%2C%201%20%2B%203%20%2B%205%20%3D%209%20%3D%203%20x%203%2C%201%20%2B%203%20%2B%205%20%2B%207%20%3D%2016%20%3D%204%20x%204&ignorecase=0&multiline=0&dotall=0&verbose=0): Try to search for specific sequences, such as "1 + 3", or "4 = 2 x 2". How would you select all sequences with the same structure? 
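
For reference, here is a small sketch of sets and quantifiers run through Python's `re.findall`. The test strings are shortened, made-up variants of the Pythex example, and the variable names are only illustrative.

```python
import re

text = "1 + 3 = 4 = 2 x 2, 1 + 3 + 5 + 7 = 16 = 4 x 4"

# {2}: exactly two digits in a row
two_digits = re.findall(r"[0-9]{2}", text)

# +: one or more digits, so single and multi-digit numbers both match
numbers = re.findall(r"[0-9]+", text)

# Sets and literal text combined: every "n x n" sequence
squares = re.findall(r"[0-9]+ x [0-9]+", text)

# ?: the preceding character may appear zero times or once
optional = re.findall(r"colou?r", "color colour colouur")

print(two_digits)  # ['16']
print(squares)     # ['2 x 2', '4 x 4']
print(optional)    # ['color', 'colour']
```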

### 3. Special sequences 

Finally, special sequences provide shortcuts to predefined sets of characters. Some common special sequences are:

\d - Digit <br />
\D - non-Digit <br />
\w - Alphanumeric [0-9a-zA-Z_] <br />
\s - Whitespace (space, tab, newline)

[Pythex Example 3](http://pythex.org/?regex=%5Cw&test_string=%22.%22%20and%20%22%3F%22%20are%20special%20characters%20in%20regex.%20You%20need%20to%20escape%20them%20with%20backslash%20-%20%22%5C%22.%20Backslash%20is%20also%20a%20special%20character%2C%20you%20need%20to%20escape%20it%20twice%20-%20%22%5C%5C%22%20to%20find%20it.%20%0A%0ARandom%20math%20fact%20-%20adding%20odd%20sequences%20always%20yields%20squares%3A%201%20%3D%201%20x%201%2C%201%20%2B%203%20%3D%204%20%3D%202%20x%202%2C%201%20%2B%203%20%2B%205%20%3D%209%20%3D%203%20x%203%2C%201%20%2B%203%20%2B%205%20%2B%207%20%3D%2016%20%3D%204%20x%204&ignorecase=0&multiline=0&dotall=0&verbose=0): How would you write a regular expression to select everything in the test string? Notice that the only special characters within a set are "]", "\", "-", and "^".
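
The special sequences behave the same way in Python. The sketch below uses a made-up string to show `\d`, `\w`, `\s`, and `\D`, plus a set containing the characters that need care inside square brackets.

```python
import re

# Made-up sample; "\t" is a tab character
text = "Room 42, floor 3\tBuilding_B"

digits = re.findall(r"\d+", text)      # runs of digits
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
parts = re.split(r"\s+", text)         # split on any whitespace
only_digits = re.sub(r"\D", "", text)  # delete every non-digit

# Inside a set, "]" and "-" are escaped here; "^" is literal when not first
set_specials = re.findall(r"[\]\-^]", "a]b-c^d")

print(digits)       # ['42', '3']
print(words)        # ['Room', '42', 'floor', '3', 'Building_B']
print(only_digits)  # 423
```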

### [ 15 minute break to try out above expressions. ]


### Use case 1: Replacing text errors in a problematic dataset

You've received a problematic dataset from a fellow researcher, with some data entry errors/discrepancies. How would you use regular expressions to correct these errors?

1. Replace all instances of "district" or "District" with "County". 
2. Replace all instances of "Not available" or "[Name] looking up" with numeric codes.  

In [3]:
import re

# Read the whole file; the with-block closes it automatically
with open("data/usecase1/problem_dataset.csv", "r") as rfile:
    text = rfile.read()

# Introducing re.sub: pattern, replace, string

# Replace all instances of "district" or "District" with "County".
# (a set matches exactly one character by default, so {1} is not needed)
newtext = re.sub(r"[Dd]istrict", "County", text)

# Replace all instances of "Not available" or "[Name] looking up" with numeric codes. 
newtext = re.sub(r"Not [aA]vailable", "-999", newtext)
newtext = re.sub(r"[a-zA-Z]+ looking up", "-888", newtext)

# Write the cleaned text; the with-block closes the file automatically
with open("data/usecase1/cleaned_dataset.csv", "w") as wfile:
    wfile.write(newtext)
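
To see what each substitution does without touching the files, the same `re.sub` calls can be run on a small in-memory string. The rows below are hypothetical stand-ins, not the contents of the workshop dataset, and `re.subn` is included only as an optional way to count replacements.

```python
import re

# Hypothetical rows standing in for problem_dataset.csv
sample = (
    "name,area,income\n"
    "Ann,Kings district,52000\n"
    "Bob,Queens District,Not available\n"
    "Cy,Bronx District,Dana looking up\n"
)

cleaned = re.sub(r"[Dd]istrict", "County", sample)
cleaned = re.sub(r"Not [aA]vailable", "-999", cleaned)
cleaned = re.sub(r"[a-zA-Z]+ looking up", "-888", cleaned)

# re.subn returns the new text together with the number of replacements
_, n_district = re.subn(r"[Dd]istrict", "County", sample)

print(cleaned)
print(n_district)  # 3
```

Checking the replacement count is a quick sanity test before overwriting a real dataset.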

### Use case 2: Interacting with files with parsable names 

You are working on a nationwide study taking place with different population samples at various times. The PI on the project has sent you a folder with multiple files. Because these studies are led by different researchers in different cities, the data is not collated in a single dataset. The PI wants you to conduct analyses on various subsets of the data, and ultimately create a single dataset combining data from the different files. 

1. How do we select only files with the .txt extension?
2. How do we only select csv files from Boston or Oakland?
3. How do we select files from a range of years? 

In [None]:
import re
import shutil
import glob
import os

# Introducing glob
# Standard * wildcard will get us pretty far 
# * is not the same as its usage in regular expressions 

# help(glob)
glob.glob("*")
os.getcwd()

# Select only files with .txt extension
glob.glob("data/usecase2/*.txt")

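# Select only csv files from Boston or Oakland
# (glob has no "or", so combine two patterns; assumes Oakland files are named like the Boston ones)
glob.glob("data/usecase2/boston*.csv") + glob.glob("data/usecase2/oakland*.csv")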

#  How do we select files from a range of years? 
all_files_list = glob.glob("data/usecase2/*")

# Introducing re.match 
# help(re.match)
# re.match: pattern, string

# try on first element of list
first_filename = all_files_list[0]

# Matching on first_filename
# (raw strings keep the backslash intact; A-Z replaces the A-z typo, with _ listed explicitly)
re.match(r"[a-zA-Z0-9_/]+\.csv", first_filename)

# Matching on a range of years 
re.match(r"[a-zA-Z0-9_/]+201[12]\.csv", first_filename)

for filename in all_files_list:
    if re.match(r"[a-zA-Z0-9_/]+201[12]\.csv", filename):
        shutil.copy(filename, "data/usecase2/moved")

# Aside: Introducing groups in re.match - allows you to retrieve part of the text string 
re.match(r"[a-zA-Z0-9_/]+(201[12])\.csv", first_filename).groups()
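
Groups can also pull several pieces out of a file name at once. This is a sketch under the assumption that files are named `<city><year>.csv`; the names in the list are hypothetical and the real files in `data/usecase2` may follow a different scheme.

```python
import re

# Hypothetical file names; the real names in data/usecase2 may differ
filenames = [
    "data/usecase2/boston2011.csv",
    "data/usecase2/oakland2012.csv",
    "data/usecase2/boston2013.csv",
    "data/usecase2/notes.txt",
]

# Two groups: the city (lower-case letters) and the year (2011 or 2012)
pattern = re.compile(r"[a-zA-Z0-9_/]+/([a-z]+)(201[12])\.csv$")

selected = {}
for name in filenames:
    m = pattern.match(name)
    if m is None:
        # re.match returns None when the pattern does not fit
        continue
    city, year = m.groups()
    selected[name] = (city, int(year))

print(selected)
```

Checking for `None` before calling `.groups()` avoids an `AttributeError` on file names that do not match.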

### Use case 3: Find most frequently occurring words in Jane Eyre

You are writing a paper doing textual analysis of Jane Eyre. You are interested in the relative frequencies of specific words, and extending this analysis to other texts, to track changes in language over time. 

1. How do we count the number of occurrences of a single word in the text?
2. Using nltk, how do we collate frequencies of all words in Jane Eyre? 

In [None]:
import re
from nltk.tokenize import word_tokenize
import nltk

# Read the whole novel; the with-block closes the file automatically
with open("data/usecase3/jane_eyre.txt", "r") as rfile:
    text = rfile.read()

# Introducing re.findall
# help(re.findall)
# re.findall: pattern, string

len(re.findall("Rochester", text))
len(re.findall("fire", text))
len(re.findall("melancholy", text))
len(re.findall("blood", text))
len(re.findall("heart", text))

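# A more efficient solution: nltk 
# (word_tokenize needs NLTK's tokenizer data, which is downloaded separately via nltk.download)
words = word_tokenize(text)
fdist = nltk.FreqDist(words)
# List the 500 most common words in Jane Eyre 
fdist.most_common(500)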
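
One caveat with the `re.findall` counts above is that a plain pattern also matches inside longer words. The sketch below, which uses a short made-up sentence instead of the novel, shows how the `\b` word boundary and the `re.IGNORECASE` flag change the counts, and how `collections.Counter` from the standard library can tally every word without nltk. The simple `[a-z']+` tokenizer is an assumption for illustration and is much cruder than `word_tokenize`.

```python
import re
from collections import Counter

# Short made-up stand-in for the full text of the novel
text = "The fire burned. Fireside talk; the heart, heartfelt and hearty. Fire!"

# A plain pattern also matches inside "heartfelt" and "hearty"
substring_hits = len(re.findall(r"heart", text))

# \b marks a word boundary, so only the standalone word is counted
whole_word_hits = len(re.findall(r"\bheart\b", text))

# re.IGNORECASE counts "fire" and "Fire", but not "Fireside"
fire_hits = len(re.findall(r"\bfire\b", text, flags=re.IGNORECASE))

# Crude tokenizer: lower-case the text and keep runs of letters and apostrophes
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

print(substring_hits, whole_word_hits, fire_hits)  # 3 1 2
print(counts.most_common(2))
```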

### Own-time: Explore existing use cases & datasets

We've introduced a set of essential regular expressions and shown how to use them in various contexts. Pick the use case that is most relevant or interesting to you and explore the datasets provided. 