# Regular Expressions (Regex)

Regular expressions (regex) are a formal language for describing patterns in text. For example, regex provides a way to distinguish between upper- and lowercase alphabetical characters [A-Za-z] and numerical characters [0-9]. This formal description of language makes it possible to perform powerful operations on text, especially for cleaning and extracting data, and can help you avoid keying in changes by hand.

We might define a word as any string of characters separated by white space on either side using regex. But, looking at the previous sentence, this regex pattern will consider "regex." as a word rather than "regex" because "." is a non-space character. Dates also follow recognizable patterns, such as 04-12-2010. Here, we might formalize the year as four digit characters preceded by a hyphen. When we begin working with data, we can use regex to clean up text columns or even extract years from date columns. There are many, many more uses of regex.

Rather than trying to memorize regular expressions, it is perhaps better to find reliable cheat sheets to keep on hand for reference. Below, we will cover the basics of regular expressions. But if you ever need to test regular expressions before running code, which can take a while, or just want a quick reference guide, I recommend using Pythex (https://pythex.org/). While this tutorial will only cover the basics, there are also many sophisticated regular expressions available online.

## Basic Operators
### *,  +,  ?

* \* [asterisk] matches the preceding character 0 or more times
* \+ [plus sign] matches the preceding character 1 or more times
* ? [question mark] matches the preceding character 0 or 1 time
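
As a minimal sketch of how these three operators differ, the Python below tries each one against a few sample strings. The sample strings are made up for illustration, and re.fullmatch (which only succeeds when the whole string fits the pattern) is used here just to keep the comparison clean.

```python
import re

# Hypothetical sample strings, chosen only to illustrate the operators.
samples = ["ct", "cat", "caat"]

# "a*" allows the letter a zero or more times.
star = [s for s in samples if re.fullmatch(r"ca*t", s)]
# "a+" requires the letter a at least once.
plus = [s for s in samples if re.fullmatch(r"ca+t", s)]
# "a?" allows the letter a at most once.
optional = [s for s in samples if re.fullmatch(r"ca?t", s)]

print(star)      # ['ct', 'cat', 'caat']
print(plus)      # ['cat', 'caat']
print(optional)  # ['ct', 'cat']
```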

### Range [ ]

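Square brackets define a set of characters, any one of which may match at that position; [0-9], for example, matches any single digit. Brackets can be combined with the basic operators. "Color" may or may not have a "u" depending on whether it's an American or British spelling. To account for both possibilities, you can use "colo[u]?r", which says there may be a "u" there (once) or may not be (zero times). With "\*" in place of "?", the pattern would also accept "colouuur".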
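
Here is a short sketch of brackets in Python, assuming the re library and two made-up sample sentences. It uses "?" so that the "u" appears at most once, and adds a second set, [ae], to show brackets choosing between several characters.

```python
import re

# Hypothetical sample sentence containing both spellings.
text = "The color of the sky and the colour of the sea."

# [u]? means: the single-character set "u", zero or one time.
spellings = re.findall(r"colo[u]?r", text)
print(spellings)  # ['color', 'colour']

# A set with two members matches either one at that position.
grays = re.findall(r"gr[ae]y", "gray skies, grey seas")
print(grays)  # ['gray', 'grey']
```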

### Groups ( )

It can also be useful to declare a group within a regex pattern. Setting parentheses around part of a pattern tells regex to capture that part of the match as its own group. This can be useful if you want to return only a portion of a match. For example, assuming you have the date mentioned earlier, 04-12-2010, you might return the year with the following regex:

* [0-9]+-[0-9]+-([0-9]+)
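
To see what the group buys us, here is a small sketch using Python's re library with the same date. The variable names are only for illustration.

```python
import re

date = "04-12-2010"
match = re.search(r"[0-9]+-[0-9]+-([0-9]+)", date)

whole = match.group(0)  # the entire matched text
year = match.group(1)   # only what the first set of parentheses captured

print(whole)  # 04-12-2010
print(year)   # 2010
```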

### Exact Occurrences

Sometimes, as with a year, you know exactly how many occurrences to expect. If you want to be more specific than the basic operators (\*, +, ?), you can use curly braces { }. Assuming you know that every run of four digits is a year, you might use the following regex.

* [0-9]{4}
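
The sketch below tries this pattern in Python on two made-up strings. The second string is there to show why the "every four digits is a year" assumption matters: the pattern has no idea what a year is and will happily match the first four digits of a longer number.

```python
import re

# Hypothetical sample sentence with two years in it.
text = "Cane was published in 1923; the date 04-12-2010 comes later."
years = re.findall(r"[0-9]{4}", text)
print(years)  # ['1923', '2010']

# A made-up ID number: the pattern still finds "a year" inside it.
false_positive = re.findall(r"[0-9]{4}", "ID 123456")
print(false_positive)  # ['1234']
```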

### Match Any

The period '.' matches any character except, by default, a newline. This special character can be used with '\*' (match 0 or more times) to match an entire line of text (.\*).
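
A brief sketch of the period in Python, using made-up strings. One detail worth seeing in action: in Python's re library the period stops at a newline unless the re.DOTALL flag is passed.

```python
import re

# "." stands in for exactly one character of any kind.
three_letters = re.findall(r"c.t", "cat cot c-t ct")
print(three_letters)  # ['cat', 'cot', 'c-t']

two_lines = "first line\nsecond line"

# By default ".*" stops at the end of the first line.
first_line = re.search(r".*", two_lines).group(0)
print(first_line)  # first line

# With re.DOTALL, "." also matches the newline, so ".*" takes everything.
everything = re.search(r".*", two_lines, re.DOTALL).group(0)
print(everything == two_lines)  # True
```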

## Escape Characters

But what if you want to match an actual period or a question mark? Because these characters perform special functions, it is necessary to "escape" them if you want to match them literally. Putting a backslash, "\\", in front of a special character tells the regex engine to treat that character as a regular character. For instance, if you encounter a plus sign ("+") in a text and you want to match it, then you'll need to escape it with a backslash: "\\+".
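
Here is a small sketch of escaping in Python on a made-up sentence. It also shows re.escape, a helper in the re library that adds the backslashes for you when you want to match a piece of text literally.

```python
import re

# Hypothetical sample sentence containing three special characters.
text = "Is 2+2 equal to 4? Yes."

plus_signs = re.findall(r"\+", text)
question_marks = re.findall(r"\?", text)
periods = re.findall(r"\.", text)
print(plus_signs, question_marks, periods)  # ['+'] ['?'] ['.']

# re.escape builds an escaped pattern from literal text.
escaped = re.escape("2+2")
literal_hits = re.findall(escaped, text)
print(literal_hits)  # ['2+2']
```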

## Special Sequences

While these basic operators will form the basis of regex, there are other sequences that can be useful as well. This list does not cover every special sequence but will cover the most common.

### Numbers
* \\d = any digit
* \\D = any non-digit

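If we consider our regex for year, we might rewrite that as \\d{4}, which conveys a run of digits exactly four characters in length. To capture the year as a group, put the parentheses around the whole thing, (\\d{4}); writing (\\d){4} instead repeats a one-digit group four times and keeps only the last digit.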
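
The placement of the parentheses is easy to get wrong, so here is a quick sketch in Python comparing the two groupings on our date, along with \D picking out the non-digits.

```python
import re

date = "04-12-2010"

# The group holds one digit and is repeated four times,
# so only the final repetition is kept.
last_digit = re.search(r"(\d){4}", date).group(1)
print(last_digit)  # 0

# The group holds all four digits.
year = re.search(r"(\d{4})", date).group(1)
print(year)  # 2010

# \D matches anything that is not a digit: here, the two hyphens.
separators = re.findall(r"\D", date)
print(separators)  # ['-', '-']
```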

### Alphanumeric
* \\w = any "word" character: letters (A-Z and a-z), digits (0-9), and the underscore
* \\W = any non-word character (such as punctuation or whitespace)
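
A short sketch in Python, on a made-up string, of what these two sequences pick up. Note that the underscore counts as a word character.

```python
import re

# Hypothetical sample text mixing letters, digits, an underscore, and punctuation.
text = "snake_case, Cane (1923)!"

word_runs = re.findall(r"\w+", text)
print(word_runs)  # ['snake_case', 'Cane', '1923']

non_word_runs = re.findall(r"\W+", text)
print(non_word_runs)  # [', ', ' (', ')!']
```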

### Whitespace
* \\s = any whitespace
* \\S = non-whitespace

There are also ways to identify specific kinds of whitespace. Some of the more common examples are:
* \\t = tabs
* \\n = newlines
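
As a sketch of the whitespace sequences in Python, the code below splits a messy made-up string on any run of whitespace, and then revisits the earlier idea of a word as a run of non-space characters. re.split is another function from the re library; it cuts a string wherever the pattern matches.

```python
import re

# Hypothetical messy text: a tab, a double space, and a newline.
messy = "Money\tburns  the\npocket"

# \s+ treats any run of whitespace as one separator.
words = re.split(r"\s+", messy)
print(words)  # ['Money', 'burns', 'the', 'pocket']

# \S+ grabs runs of non-whitespace, so punctuation stays attached.
raw_words = re.findall(r"\S+", "using regex.")
print(raw_words)  # ['using', 'regex.']
```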

This tutorial, of course, only covers the very basics of regular expressions. There are many more sophisticated examples online that may prove useful for your own data cleaning. Regex is a powerful tool that you may use throughout your coding practice.
____

Sources:

1. Schmidt, Ben. Regex worksheet from Humanities Data Analysis, 2015.
2. *Pythex*, a regular expression tester. https://pythex.org/.

# Regex Workshop

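There are a couple of useful libraries to help use regex in Python: re and string. Both are part of Python's standard library, so there is nothing to install.
> The conda package named regex is a separate, third-party library with extra features; it is not needed for this workshop.<br>
> To install it anyway, open a new terminal and type: conda install -c conda-forge regex

Before we can use the library "re," we have to import it. Importing a library makes its functions available in our Python session.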

In [1]:
import re

Although there is no output or update beneath the cell after running "import re," we can now use it. First, though, we need a string to work with. Continuing with our date example, we can use the regex we created earlier to extract the year.

In [16]:
# First, we'll create a string variable named date.
date = "04-12-2010"

# We can then find the year using re and our regular expression.
re.search(r'\d{4}', date).group(0)

'2010'

There are a few things happening in the last line of the previous cell that are not obvious. Moving left to right, I'll explain what that line of code does.

#### re.search

re.search() combines two names. Essentially, we are calling the "search" function. But other libraries might have a search() function as well. So, we clarify that we want the one that belongs to re by writing "re." beforehand.

#### (r'\d{4}', date)

r'\d{4}' is our regular expression. The 'r' before the quotation marks makes this a "raw" string, which tells Python to leave the backslashes alone so that they reach re unchanged. Python happens to pass '\d' through either way (recent versions warn about it), but sequences like '\b' or '\n' mean something different in an ordinary string, so I've made the 'r' a habit.

Our regular expression is also the first "argument" of the function. Some functions require multiple arguments so that they understand how to behave. re.search takes two required arguments: the regex and the string in which to look for the regex. The second argument, then, is the string in which we expect to find our regular expression.

#### .group(0)

re.search does not return a string on its own. Without .group(0), re.search returns a "Match" object (or None if nothing matches). By adding .group(0), we're asking the "Match" object for the entire matched text, which is our year.
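
The sketch below takes that one-liner apart into steps so the "Match" object is visible. It also shows a common stumbling block: when nothing matches, re.search returns None, and calling .group(0) on None raises an error. The second string and the variable names are made up for illustration.

```python
import re

date = "04-12-2010"

match = re.search(r"\d{4}", date)
print(match)           # a Match object, not a string
print(match.group(0))  # 2010
print(match.span())    # (6, 10), the start and end positions of the match

# When the pattern is absent, re.search returns None,
# so check before asking for a group.
missing = re.search(r"\d{4}", "no year here")
missing_year = missing.group(0) if missing else None
print(missing_year)    # None
```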

As a note, I find that I'm constantly having to look up how "re" works. It can seem a little finicky without using it regularly. It is completely normal to have to look up how something works in Python even if you've used it hundreds of times.

## Multiple Matches

Let's say we're trying to gather more complex sets of data than a single year, though. Re provides a way to return a list of matches. For example, what if we want to discover all the words that begin with a capitalized letter?

In [18]:
# First we need a string.
# Here, I'll use triple single quotation marks at the beginning and end to create a multi-line string.
blockstring = '''
"Money burns the pocket, pocket hurts,
Bootleggers in silken shirts,
Ballooned, zooming Cadillacs,
Whizzing, whizzing down the street-car tracks."

"Seventh Street," Cane, Jean Toomer.
'''

# We're using a slightly different re function here.
# re.findall() will return a list object of matches.
# Like re.search, re.findall takes two arguments.
re.findall(r'[A-Z]+[a-z]*', blockstring)

['Money',
 'Bootleggers',
 'Ballooned',
 'Cadillacs',
 'Whizzing',
 'Seventh',
 'Street',
 'Cane',
 'Jean',
 'Toomer']

#### Function & Arguments

Much like our earlier example, this regex function takes two arguments: the regex and the text in which we expect to find that regex.

#### r'[A-Z]+[a-z]*'

We can break down our regex into two components.

1. The first part ([A-Z]+) looks for an uppercase letter that appears one or more times. 
2. The second part ([a-z]\*) looks for any lowercase letter that appears any number of times.

Together, both parts look for a string of alphabetical characters (and nothing else) that starts with at least one uppercase letter. If, for example, there was a typo, and the first line read: "MOney burns...," re.findall would return 'MOney' because it follows the two rules we have set.
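
To check the "MOney" claim, here is a small sketch in Python that runs the pattern on a line with that typo, then shows how moving or removing a single operator changes the result.

```python
import re

# The first line of the poem with the hypothetical typo described above.
typo_line = "MOney burns the pocket, pocket hurts,"

# One or more uppercase letters, then any number of lowercase letters.
original = re.findall(r"[A-Z]+[a-z]*", typo_line)
print(original)  # ['MOney']

# Dropping the "+" allows exactly one uppercase letter, splitting the word.
one_capital = re.findall(r"[A-Z][a-z]*", typo_line)
print(one_capital)  # ['M', 'Oney']

# Requiring at least one lowercase letter skips a lone capital like "A".
needs_lowercase = re.findall(r"[A-Z][a-z]+", "A Cadillac")
print(needs_lowercase)  # ['Cadillac']
```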

Try changing this regular expression below by using different operators to see how slight changes can quickly change the output.

In [24]:
re.findall(r'[A-Z]+[a-z]*', blockstring)

# Note, you can change the appearance of the output data by clicking the left margin underneath "Out[...]."

['Money',
 'Bootleggers',
 'Ballooned',
 'Cadillacs',
 'Whizzing',
 'Seventh',
 'Street',
 'Cane',
 'Jean',
 'Toomer']

#### Unstructured Data

You may have noticed that blockstring is not just a poem\*, properly speaking. The string contains the poem, which "ends" with "tracks," but then we get bibliographic information, such as the title and author. While the distinction between poetry and bibliographic information might be clear to us, the regular expression cannot differentiate between the two. Instead, it sees all of this information as a single string and, therefore, returns matches that do not belong only in the poem.

This string of text can be considered unstructured data. It is unstructured because there is no metadata (data that describes different categories of data, like poem or bibliography). If we only wanted to return capitalized words within the poem, we would have to re-write our regular expression.

\*Technically, "Seventh Street," is a prose-poem hybrid that does not fit into a generic definition. The part I've copied here is only part of the entire work.

## Conclusion

Regular Expressions are very powerful tools for cleaning, finding, and extracting information. Learning how they work and the ways they might fail is incredibly important. Regular expressions are often used when someone wants to make changes to a large portion of text, too large to curate by hand. That means that the regex writer, if not careful, can introduce errors into a data set by trying to clean up a few examples. It is always a good idea to test out and review any regular expressions that you use. And, again, the internet has plenty of sophisticated regular expressions available--just be careful those changes work with (and not against) your particular data set.
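
One way to put that advice into practice, sketched here in Python, is to run a pattern over a handful of sample values, including one you expect it to reject, before trusting it on a whole data set. The sample values are made up, and the stricter pattern uses \b (a word boundary), a sequence this tutorial has not covered.

```python
import re

# Hypothetical sample values; the last one is not a date at all.
samples = ["04-12-2010", "12-25-1999", "555-123-45678"]

# The loose pattern finds "a year" inside the longer number.
loose = [re.findall(r"\d{4}", s) for s in samples]
print(loose)  # [['2010'], ['1999'], ['4567']]

# \b requires the four digits to stand alone, not sit inside a longer run.
strict = [re.findall(r"\b\d{4}\b", s) for s in samples]
print(strict)  # [['2010'], ['1999'], []]
```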

____

## Exercises

In [25]:
# Using the "blockstring" variable above, return only capitalized words that appear in the poem.
# Remember, it is good practice to type out your re.findall function by hand below to build muscle memory.


In [None]:
# Now, try returning all the words (capitalized and uncapitalized) outside of the poem.
