# The Key Terms for Wednesday

* regular expression
* regular expression methods: `findall()`
* regular expression flag
* regular expression special character
* regular expression character sets
* regular expression quantifiers
* regular expression groups

# Introduction

Today we are continuing with our theme of doing as much NLP as possible *with only* plain Python.

Regular expressions can be a great, efficient, interpretable way to do NLP tasks. They are especially good for situations where you need to *convert* text to a structured format (text to numbers, text to dates, text to phone numbers, etc.). However, as with all rule-based systems, if you find yourself with more than 50 regular expressions for a *detection* task, a machine learning approach may be better!

Regular expressions can be used to locate particular characters or sequences of characters in a string. 

# Your First Regular Expressions

A **regular expression** is a pattern that we use to find things in strings. 

You already know how to find *exact matches* in strings. For example, you know how to find:
* `'yes'`
* `1.25`
* `9/29/2023`

Let's review! In the code cell below, write code to find each of those strings in a string.


In [3]:
# Find the strings above in the string `yes, I did buy 1.25 shares of IBM on 9/29/2023.`

But what if you want to find *fuzzy matches* in strings? For example, what if you want to find:
* `'yes'` or `'Yes'`
* any floating point number 
* any date (written in month/day/year format)

To do tasks like this, we use regular expressions.

Python's standard library includes a module called `re` that adds the power to find and match regular expressions.

The `re` module offers a great deal of flexibility in working with regular expressions. The workflow for using `re` generally follows this format:

1. Import the `re` module and put the text being searched into a string
2. Create a Regex object with `re.compile()`
3. Pass the string into the compiled Regex object using a method such as `.findall()`
4. Return the matches

Let's examine these steps in a little more detail.

In the code cell below, import `re`.

In [2]:
import re

In the code cell below, assign the variable `text_to_search` to the text `'yes, I did buy 1.25 shares of IBM on 9/29/2023.'` 

In [7]:
# make a string
text_to_search = 'yes, I did buy 1.25 shares of IBM on 9/29/2023.'

The most basic regular expression is just a string. Here are some basic regular expressions:
* `'yes'`
* `'1.25'`
* `'9/29/2023'`

Let's make a pattern of `'yes'` and use `findall()` to find all occurrences of `'yes'` in `text_to_search`.

In [8]:
# make a pattern
pattern = re.compile(r'yes')

# use findall()
pattern.findall(text_to_search)

['yes']

Now, in the code cell below, look for `'1.25'` and `'9/29/2023'`.

It is not *strictly* necessary to compile a regular expression (pattern) before using it. Compiling pays off mainly when you reuse the same pattern many times, because the pattern is parsed once and then reused. For a regular expression you will only use once, you can pass the pattern string directly to `re.findall()`.

In [9]:
re.findall(r'yes', text_to_search)

['yes']
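As a quick check that the two styles are interchangeable, the sketch below runs the same search both ways. The variable names `compiled`, `via_method`, and `via_function` are illustrative and not part of the notebook.

```python
import re

text_to_search = 'yes, I did buy 1.25 shares of IBM on 9/29/2023.'

# Style 1: compile once, then call the method on the compiled pattern
compiled = re.compile(r'IBM')
via_method = compiled.findall(text_to_search)

# Style 2: hand the raw pattern string straight to the module-level function
via_function = re.findall(r'IBM', text_to_search)

print(via_method)    # ['IBM']
print(via_function)  # ['IBM']
```

Both calls return a list of every non-overlapping match, in the order the matches appear in the string.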

# Regular Expression Flags

Earlier, we said we want to find *fuzzy matches* in strings. For example, what if you want to find:
* `'yes'` or `'Yes'`
* any floating point number 
* any date (written in month/day/year format)

To create regular expressions for these, we will use special patterns.


Let's start with `'yes'` or `'Yes'`. To match *case-insensitively*, we pass a **flag** to `re.compile()`. In the code cell below, notice how we find `'yes'` even though the pattern was `'YES'`, because of the flag.

In [14]:
# make a pattern to recognize any variation on 'yes'
pattern = re.compile(r'YES', re.IGNORECASE)

# use findall()
pattern.findall(text_to_search)


['yes']

There is a complete list of flags in the [Python `re` docs](https://docs.python.org/3/library/re.html) but I mostly use the `IGNORECASE` one.
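To see the flag's effect side by side, here is a small sketch on a made-up sentence (the `sample` string is an assumption for illustration). It also shows that a flag can be passed directly to `re.findall()` as the third argument when you skip `re.compile()`.

```python
import re

# Hypothetical sample text with three spellings of the same word
sample = 'Yes, I said yes. YES!'

# Without the flag, only the exact lowercase spelling matches
case_sensitive = re.findall(r'yes', sample)

# With re.IGNORECASE, every capitalization matches
case_insensitive = re.findall(r'yes', sample, re.IGNORECASE)

print(case_sensitive)    # ['yes']
print(case_insensitive)  # ['Yes', 'yes', 'YES']
```

Note that the matches come back with their original capitalization; the flag changes what is matched, not what is returned.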

# Regular Expression Special Characters

To match any floating point number, we will use a couple of regular expression **special characters**, in particular:

* `\d` - matches any digit (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
* `+` - matches the preceding character (or special character) *one or more times*, so `\d+` matches a run of one or more digits

Since a floating point number has a `.` in it and `.` is a regular expression special character, we have to *escape* it just like we did with fancy strings last class:

* `\.`
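Before you try the exercise, it may help to see each piece on its own. The sketch below uses a made-up `sample` string (an assumption, chosen so it does not give away the answer) to show the difference between `\d` and `\d+`, and between an unescaped `.` and an escaped `\.`.

```python
import re

# Hypothetical sample text
sample = 'Room 12 is on floor 3. Call ext. 4501.'

# \d matches one digit at a time
single_digits = re.findall(r'\d', sample)

# \d+ matches each run of one or more digits
digit_runs = re.findall(r'\d+', sample)

# An unescaped . matches any character (except a newline),
# so '1.' matches a 1 followed by anything
one_then_any = re.findall(r'1.', sample)

# An escaped \. matches only a literal period
one_then_period = re.findall(r'1\.', sample)

print(single_digits)    # ['1', '2', '3', '4', '5', '0', '1']
print(digit_runs)       # ['12', '3', '4501']
print(one_then_any)     # ['12', '1.']
print(one_then_period)  # ['1.']
```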

Using these tips, in the code cell below, make a pattern to match any floating point number. Then apply it to `text_to_search`.

In [None]:
# make a pattern to match floating point numbers 

# use findall() to apply it to text_to_search

Now make a pattern to match dates in the format month/day/year. You don't need to handle special cases like February; if your pattern matches 2/31/23 that's fine.


In [None]:
# make a pattern to match dates in the format month/day/year

# use findall() to apply it to text_to_search

What does the regular expression special character `.` do? Use the code cell below to fill in this table for common regular expression special characters.

| Special character | What does it match? |
| ----------------- | ------------------- |
| \w                |                     |
| \W                |                     |
| \s                |                     |
| \S                |                     |
| \d                |                     |
| \D                |                     |
| .                 |                     |


# Regular Expression Character Sets

Sometimes we want to define our own "set" of characters to match on (other than generic ones like digits, letters or white space). We can define a set of potential characters to match by putting them in brackets `[]`. 

|Expression|Matches|
|---|---|
|[ ]| Characters in brackets |
|[^ ]| Characters not in brackets |

We can specify exact characters to match:

* `[.,-]` Match a period, comma, or dash
* `[rs]` Match the lowercase letter r or s
* `[^t]` Match any character that is not lowercase t

or we can specify a range to match, such as:

* `[A-Z]` Match any capital letter, from A to Z
* `[A-F]` Match any capital letter, from A to F
* `[a-z]` Match any lowercase letter, a to z
* `[A-Fa-f]` Match any letter from A to F, regardless of case
* `[0-3]` Match any digit, from 0 to 3
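Here is a short sketch of character sets in action. The sample strings are made up for illustration, and combining several ranges in one set (`[A-Fa-f0-9]`) is shown as one possible way to match hexadecimal-looking words.

```python
import re

# [rs] matches exactly one character that is either r or s
r_or_s = re.findall(r'[rs]at', 'Cats sat; rats ran.')

# [^t] matches exactly one character that is anything except lowercase t
not_t = re.findall(r'[^t]', 'tot')

# Ranges can be combined in one set; + repeats the set one or more times
hex_words = re.findall(r'[A-Fa-f0-9]+', 'BEEF 3fA9 xyz')

print(r_or_s)     # ['sat', 'rat']
print(not_t)      # ['o']
print(hex_words)  # ['BEEF', '3fA9']
```

A set always matches a single character; to match several characters in a row, follow the set with `+` as in the last pattern.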


In the code cell below, write and test a pattern to recognize names, where a name could be a `First M. Last`, a `First Last`, a `Mr. First M. Last`, a `Ms. First M. Last`, a `Mx. First Last`, etc.

In [None]:
# write the pattern

# test the pattern

# Regular Expression Quantifiers

**Quantifiers** specify how many times the preceding character, special character, or set may repeat.

|Expression|Matches|
|---|---|
|\*| 0 or more |
|+| 1 or more |
|?| 0 or 1 |
|{4}| Exact number |
|{3,6}| Minimum to maximum range |

For example, we could match phone numbers (with any single character as the separator, since `.` is unescaped here) by using:

`\d\d\d.\d\d\d.\d\d\d\d`

or we could write this with a quantifier as:

`\d{3}.\d{3}.\d{4}`
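The sketch below tries the quantifier version on a made-up sentence (the phone numbers are invented for illustration). Because the unescaped `.` accepts any character as a separator, the sketch also shows a stricter variant that uses a character set, which is one possible refinement rather than the only right answer.

```python
import re

# Hypothetical sample text with made-up phone numbers
sample = 'Call 555-123-4567 or 555.987.6543, not 555a123b4567.'

# . matches any single character, so any separator is accepted
loose = re.findall(r'\d{3}.\d{3}.\d{4}', sample)

# A character set restricts the separator to a dash or a period
strict = re.findall(r'\d{3}[-.]\d{3}[-.]\d{4}', sample)

# ? makes the preceding character optional (0 or 1 times)
spellings = re.findall(r'colou?r', 'color colour')

print(loose)      # ['555-123-4567', '555.987.6543', '555a123b4567']
print(strict)     # ['555-123-4567', '555.987.6543']
print(spellings)  # ['color', 'colour']
```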


In the code cell below, write and test a pattern to match dates in the month/day/year format, where the month can be any 1 or 2 digit number, the day any 1 or 2 digit number, and the year any 4 digit number.

# Regular Expression Groups

We can also use **groups** to specify sections of a regular expression. This can be handy if we want to return parts of the regular expression in chunks or if we want to specify a set of possibilities for a particular character or set of characters.

|Expression|Matches|
|---|---|
|(A\|B\|C)| Capital A or capital B or capital C|

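For example, the pattern `(Mrs|Ms|Mr|Mx|Dr)\.?` would match a variety of honorifics with or without a trailing period. Note that the period is escaped, and that `Mrs` is listed before `Mr` so the longer honorific is tried first.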
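One thing to watch for: when a pattern contains a group, `findall()` returns only the text captured by the group, not the whole match. The sketch below shows this on a made-up sentence, using a pattern with an escaped period (`\.?`). It also uses a non-capturing group, `(?:...)`, which is a feature of Python's `re` syntax not covered above, as one way to get the whole match back.

```python
import re

# Hypothetical sample text
sample = 'Dr. Lee met Ms Park and Mrs. Diaz.'

# With a capturing group, findall() returns only the text inside the group
captured = re.findall(r'(Mrs|Ms|Mr|Mx|Dr)\.?', sample)

# (?:...) groups without capturing, so findall() returns the whole match
whole = re.findall(r'(?:Mrs|Ms|Mr|Mx|Dr)\.?', sample)

print(captured)  # ['Dr', 'Ms', 'Mrs']
print(whole)     # ['Dr.', 'Ms', 'Mrs.']
```

Which form you want depends on the task: capturing is handy when you only need the chunk inside the parentheses, while the non-capturing form keeps the trailing period in the result.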

In the code cell below, write and test a regular expression that can match email addresses. In this example, an email address:

* has stuff before `@` and stuff after `@`. 
* before `@` can be anything other than `@`.
* after `@` there can be stuff before `.` and after `.`.
* before `.` can only be alphanumeric or `-`.
* after `.` can only be `org`, `com`, or `edu`.

Don't forget to escape `.`!

# Resources

We have just scratched the surface of regular expressions! They are an incredibly powerful tool for text processing. Even better, you can use them in Python, in Java, in other programming languages *and on the command line*!

This notebook is inspired by [this Constellate tutorial](https://github.com/ithaka/constellate-notebooks/blob/master/regular-expressions.ipynb).

Crafting the right regular expression can be difficult, but can often save hours of labor for many menial tasks. When crafting a regular expression, it can be very helpful to use a tool like [RegExr](https://regexr.com/) that demonstrates how expressions are being matched on a few sample texts as you type them.

Full documentation for Python regular expressions can be found [here](https://docs.python.org/3/library/re.html). There is even a complete example for implementing a tokenizer!

