In [162]:
# This is all we need!
import re 
import string

# Introduction
---
Learning the way of regular expressions is a daunting task. That being said, it is a useful skill for all text-processing tasks. Think of regular expressions as a more powerful version of our beloved `command + F`, which we use on a daily basis to avoid reading the extra fat of an article. Cut to the chase!!

This notebook walks through the fundamentals of Python's regex module (`re`), and serves as a brief introduction. It is by no means a comprehensive walkthrough of RegEx and all of its capabilities, but it should be enough to get you on your way.

# What are Regular Expressions?
---
In my own words, regular expressions are used for searching strings for specific patterns. These patterns are written in a syntax defined by the regular expression language. I like to think of it as a more flexible implementation of the `string.find()` method. If you don't know what `string.find()` does, find out by playing with the code cell below. It will search the sentence `s` (a haystack) in an attempt to find `needle`. If the `needle` is found, it returns the index where the first occurrence begins. If there are no occurrences, the `find()` method will return -1.

In [2]:
s = 'it is like searching for a needle in a haystack'
answer = s.find('needle') > -1
print(f'Question: Is there a needle in the haystack? \nAnswer: {answer}')

Question: Is there a needle in the haystack? 
Answer: True


The find method is really simple and easy to use (elegance!!). I use this all the time for quick and dirty scripts, but it does not generalize well.
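
To make the return value concrete, here is a small sketch (the variable names are my own) that prints the index `find()` reports on a hit and the `-1` it reports on a miss. It also shows why the test should compare against `-1` rather than `0`: a match at the very start of the string sits at index 0.

```python
s = 'it is like searching for a needle in a haystack'

hit = s.find('needle')       # index where the first occurrence begins
miss = s.find('pitchfork')   # -1 signals "not found"
print(hit, miss)

# A match at position 0 is still a match, so compare against -1 (or use `in`).
at_start = 'needle in a haystack'.find('needle')
print(at_start, at_start > -1, 'needle' in 'needle in a haystack')
```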

## A Quick Case Study
----
Suppose you want to extract a phone number from a paragraph of text. You could try and write some code using the `string.find()` method.

You can see right away that a simple `p.find('867-5309')` will do the trick. The solution is simple, and only takes up one line of code. The downside is, you need to know the exact phone number you are looking for. It can't search for ANY phone number, just a specific instance. To become a better developer, you must think generally. How can more people use this software to their advantage?

When it comes to word / string applications, regex is the right tool for generalizing code.

In [3]:
lyrics = 'Eight six seven five three oh nine \n' \
'Jenny jenny you\'re the girl for me \n' \
'You don\'t know me but you make me so happy \n' \
'I tried to call you before but I lost my nerve \n' \
'I tried my imagination but I was disturbed \n' \
'8675309'

find_jenny = lambda x: 'We found jenny' if x.find('8675309') > -1 else 'Hullo?'  # -1 means "not found"; index 0 is a valid hit
find_jenny(lyrics)

'We found jenny'

## Generalize s.find() to Any Phone Number
---
If you wanted to extract any instance of a phone number with `find()`, your code would need to run through EVERY possible phone number. For ten-digit numbers that is approximately $9 \times 10^9$ candidates (assuming you can't have 0 as a valid first entry for a phone number).

```python
paragraph.find('1112223333') # --> Nope
paragraph.find('1112223334') # --> Try again
# You get the point...
```

I refuse to write such a function when I can instead invest a bit of time to learn the regex module. While we learn the daunting syntax of Regular Expressions, please keep this use case in the back of your mind. Hopefully by the end of this notebook we will have some simple code capable of finding ANY phone number in a paragraph of text. 

# Square One
---
We have already imported the regular expression module (see the very first cell). When working with regular expressions, it is best to write patterns as [raw strings](https://www.codevscolor.com/python-raw-string#:~:text=raw%20strings%20are%20raw%20string,n%E2%80%9D%20as%20a%20normal%20character.). To define a raw string, we just put an `r` immediately before the opening quote of a string. See for example:

```python
raw =    r'This is a raw string'
normal = 'This is a normal string'
```

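Raw strings allow us to treat backslashes as *literal* characters. That matters because regex syntax relies on backslashes, and we want them to reach the regex module untouched. The code cell below shows by example the difference in behaviour between a raw string and a normal string when a backslash is present.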

In [4]:
s = r'The backslash is a literal string: \n'
print(s,'\n')
s = 'The backslash creates a new line in a normal string: \nPretty cool!'
print(s)

The backslash is a literal string: \n 

The backslash creates a new line in a normal string: 
Pretty cool!
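
Another way to convince yourself of the difference is to count characters. The sketch below is my own addition (the names `normal`, `raw` and `text` are just for illustration): the normal string holds one character, the raw string holds two, and the regex module still understands the two-character form as "newline".

```python
import re

normal = '\n'   # Python turns this into ONE character: a newline
raw = r'\n'     # the raw string keeps TWO characters: a backslash and the letter n
print(len(normal), len(raw))

# The regex engine understands the two-character \n on its own,
# so the raw pattern still finds the newline in the text.
text = 'line one\nline two'
print(re.findall(raw, text))
```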


# Findall
---
Probably the handiest regex function to start with is `findall(pattern, string)`. Given a `pattern` (written as a *raw* string), it returns a list of all non-overlapping substrings in `string` that match the `pattern`. Let's use this function to discover some digits in a string. 

In [380]:
re.findall(r'[0-9]','1,2,3,4 tell me that you love me more!')

['1', '2', '3', '4']

Using some simple syntax, we found all instances of SINGLE digit numbers with just one line of code. Just imagine the code we would have had to come up with using `string.find()` to do this simple task! Let's break it down:

```python
pattern = r'[0-9]' # regular expression pattern
```
1. The `r` prefix marks the pattern as a raw string (sounding like a broken record here, but repetition matters)
2. The square brackets represent a character class (i.e. a set of characters, any ONE of which may match)
3. The hyphen denotes a range (i.e. any digit from 0 to 9). This is called the *Range Operator* (remember this term, because we are going to talk about it shortly).

I find all of this to be straightforward except maybe the *character class* idea. Let's run through some more examples to solidify this idea. 

## Character Class Example 1
---
We can use a character class to catch alternative ways of spelling the same word. For example, did you know we can spell ambience as *ambiance*? If we wanted to find all instances of this word in a document, we would have to account for both ways of spelling it. That is made easy with regex and character classes! 

In [6]:
p = 'Snobby British Person: The ambience in this room is to die for. '\
'I just looooove the word ambiance.' # An absurd piece of text
result = re.findall(r'ambi[ae]nce',p)

print('Find me all instances of ambience/ambiance in this absurd text')
print(f'Here you go: {result}')

Find me all instances of ambience/ambiance in this absurd text
Here you go: ['ambience', 'ambiance']


## Character Class Example 2
---
To explain character classes using a different approach, we will write the equivalent of `r'[0-9]'` in its full form. Clearly the range operator (the hyphen) is more concise, but I think this example sufficiently explains what the character class is asking of the regex module.
> `r'[0123456789]'` -->  *find me every occurrence of any single character listed in the character class [0123456789]*

In [7]:
pattern = r'[0123456789]'
re.findall(pattern, '01-01-2020')

['0', '1', '0', '1', '2', '0', '2', '0']

# Extending the Range Operator to Letters
---
The range operator is a concise notation to include all elements within an ordered sequence. Examples of ordered sequences include numbers ranging from 0 to 9, or all lowercase letters a to z.

**The Takeaway:** The range operator isn't only for digits; it also applies to alphabetical ordering. For concreteness, see the following example. The pattern `r'[a-z]'` matches every single lowercase letter from a to z. Anything else, such as the space in the example below, is skipped.

In [9]:
anyletter = r'[a-z]'
print(re.findall(anyletter,'every letter'))

['e', 'v', 'e', 'r', 'y', 'l', 'e', 't', 't', 'e', 'r']
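
Ranges can also sit side by side inside one character class. The short sketch below is my own illustration (the sample text is made up) of an uppercase-only class next to a class that accepts letters of either case as well as digits.

```python
import re

text = 'R2D2 and C3PO'
upper = re.findall(r'[A-Z]', text)          # uppercase letters only
anything = re.findall(r'[A-Za-z0-9]', text)  # letters of either case, or digits
print(upper)
print(anything)
```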


# Combining Expressions
---
We know how to extract single digits and letters. Now let's place two character classes side by side to retrieve two-digit numbers. The code below will extract '99' from each '999' substring (the leftover third '9' has no partner, so it is dropped). Of course, if we wanted to extract 3 digits instead, we could just add another `[0-9]` to the mix. 

In [10]:
beer = re.findall(r'[0-9][0-9]','999 bottles of beer on the wall, 999 bottles of beer...')
print(beer)

['99', '99']
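
As a quick check of the "just add another `[0-9]`" idea, here is a sketch of my own comparing the two-digit and three-digit patterns. It also shows that `findall` never reuses a character: matches do not overlap.

```python
import re

song = '999 bottles of beer on the wall'
two = re.findall(r'[0-9][0-9]', song)
three = re.findall(r'[0-9][0-9][0-9]', song)
print(two, three)

# Matches never overlap: '1234' yields '12' and '34', not '12', '23', '34'.
pairs = re.findall(r'[0-9][0-9]', '1234')
print(pairs)
```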


## Example: Compound Expressions for Phone Numbers
---
By now it should be setting in that we have enough information to scrape some phone numbers. I can't guarantee this will be a pretty solution, but it gets us a little closer to where we want to go.

In [10]:
print('Sing it with me now!',''.join(re.findall(r'[0-9]'*7,lyrics)))

Sing it with me now! 8675309


Here we multiplied the pattern string by `7`, which repeats `[0-9]` seven times and therefore searches for 7 consecutive digits. It was that easy! Now we have some rough code that can find 7 digit numbers within a mountain of text. Problem is, what happens if a number appears as such: *867-5309* or *867 5309*? Both are common ways of writing phone numbers... Looks like we need to keep learning some more syntax to be able to handle those edge cases.
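
To see the edge cases fail, the sketch below (my own, with the three spellings as test strings) runs the seven-digit pattern against each way of writing the number. Only the unbroken spelling is found.

```python
import re

seven_digits = r'[0-9]' * 7   # same as writing [0-9] seven times in a row
results = {}
for number in ['8675309', '867-5309', '867 5309']:
    results[number] = re.findall(seven_digits, number)
    print(number, '->', results[number])
```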

# New Operator: '+' (One or More)
---
The `'+'` operator indicates *one or more* of the element that precedes it. It allows us to match a run of any length, as long as it contains at least one character. A couple of examples could be:

```python
r'[0-9]+' # Match a sequence of digits from 0 to 9, of any length
r'[a-z]+' # Match a sequence of letters from a-z, of any length
r'a+'     # Match a sequence of a's, of any length
```

Let's search `lyrics` for Jenny's phone number using this new syntax.

In [11]:
re.findall(r'[0-9]+',lyrics) # we don't have to repeat the string 7 times... we're getting stronger...

['8675309']
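
The lyrics only contain one number, so here is a sketch of my own with a made-up sentence holding several. Each unbroken run of digits comes back as one match, whatever its length, whereas the pattern without `+` returns every digit separately.

```python
import re

text = 'call 867 5309 or dial 911'
runs = re.findall(r'[0-9]+', text)     # each unbroken run of digits is one match
singles = re.findall(r'[0-9]', text)   # without +, every digit is its own match
print(runs)
print(singles)
```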

## Example: Finding an e-mail Address
---
An e-mail address follows the form *xxxx@xxxx.xxx* (I hate how this returns a link...). Let's assume that no numbers are allowed to be included in an e-mail (this makes our life easier). Using the `'+'` operator and compound expressions, this should be a breeze! Feel free to toy around with each character class of the pattern to learn more about the individual components. 

In [18]:
pattern = r'[a-z]+[@][a-z]+[.][a-z]+'
email = 'If you can\'t reach jenny by phone, here email poor@jenny.com is a great second option!'
print("Give me jenny's email while you're at it ... ",'\n',re.findall(pattern,email))

Give me jenny's email while you're at it ...  
 ['poor@jenny.com']
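
If we drop the "no numbers" assumption, the character classes need to grow. The sketch below is one possible extension and not a full e-mail validator: the `relaxed` pattern and the address are hypothetical, chosen only to show how widening a class changes the result.

```python
import re

strict = r'[a-z]+[@][a-z]+[.][a-z]+'            # the pattern used in the cell above
relaxed = r'[a-z0-9._]+[@][a-z0-9]+[.][a-z]+'   # hypothetical: also allow digits, dots, underscores

text = 'write to jenny8675309@tutone.com'
strict_hits = re.findall(strict, text)     # fails: a digit sits right before the @
relaxed_hits = re.findall(relaxed, text)
print(strict_hits)
print(relaxed_hits)
```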


# Character Class vs. Matching Directly
---
If you recall from a few sections ago, we discussed the notion of character classes. A character class is a set of specified characters (i.e. numbers, letters, numbers and letters etc.), and it matches ONE character from that set. You can also choose to match a string directly if you leave out the brackets around the pattern. These behaviours are quite different if we want to match more than just a single character. Consider the examples below, and notice their different outputs.

```python
matches = re.findall(r'[ab]','abcdefg') # returns ['a', 'b']
matches = re.findall(r'ab','abcdefg') # returns ['ab']
```

In [13]:
print(re.findall(r"Exact","Match Exactly"),
      re.findall(r"[class]","Character Class"))

['Exact'] ['a', 'a', 'c', 'l', 'a', 's', 's']


# New Operator: '|' (Or)
---
Formally, the new operator is called disjunction, but I find that to be confusing. Instead, whenever I see the '|' symbol, I read it as "or". The snippet below shows some patterns, and each comment describes how I would read the pattern out loud.

```python
r'[a-z]|[0-9]' # give me a SINGLE letter OR a SINGLE digit
r'[a-z]+|[0-9]+' # give me a sequence of any length of all letters OR a sequence of any length of all digits
```

An example is provided below. Notice that it does not mix letters and numbers. It only matches substrings of ANY length that are either ALL numbers **OR** ALL letters.

In [19]:
re.findall(r'[0-9]+|[a-z]+','hello r2d2')

['hello', 'r', '2', 'd', '2']
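
The "or" also works between plain words. In this sketch of my own (the sample strings are made up), note that Python tries the alternatives from left to right and keeps the first one that succeeds, so the order you write them in can change the result.

```python
import re

pets = re.findall(r'cat|dog', 'cats and dogs and catfish')
print(pets)

# The leftmost alternative that matches wins.
short_first = re.findall(r'cat|cats', 'cats')
long_first = re.findall(r'cats|cat', 'cats')
print(short_first, long_first)
```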

## Example: Getting to know R2D2 Better
---
If we wanted to mix letters and numbers, we don't need the "|" at all. Inside a character class, "|" is just a literal pipe character, so we simply list both ranges within one class: `r'[a-z0-9]+'`. This grabs a sequence of ANY length (hence the "+" operator) made up of digits OR letters. Check it out below!

In [15]:
re.findall(r'[0-9a-z]+','hello r2d2')

['hello', 'r2d2']
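
One thing worth checking for yourself: inside square brackets the `|` symbol is not an "or", it is an ordinary pipe character that the class will happily match. The sketch below is my own demonstration with made-up strings.

```python
import re

literal_pipe = re.findall(r'[a|b]', 'a|b')           # the pipe itself is matched
with_pipe = re.findall(r'[0-9|a-z]+', 'r2|d2')       # so it gets swallowed into the match
without_pipe = re.findall(r'[0-9a-z]+', 'r2|d2')     # listing the ranges side by side is enough
print(literal_pipe)
print(with_pipe)
print(without_pipe)
```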

# New Operator: '?' (Optional)
---
The question mark makes the preceding element of the regular expression (a character, a character class, or a group) optional. If we consider phone numbers, we may write them as:
* 867 5309
* 867-5309
* 8675309

Using this optional operator we can account for all of these cases! I will walk through the inner workings of the all-encompassing phone grabber regular expression `r'[0-9]+[ -]?[0-9]+'`:
* `'[0-9]+'` searches for a run of digits between 0 and 9, of ANY length. 
* `'[ -]?'` makes use of the optional operator, saying a SINGLE space or hyphen may optionally exist
* Then we end with what we started, another expression grabbing a sequence of digits between 0 and 9 of ANY length.

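We test the code below, and see that it is working fairly well! All numbers are extracted correctly. We have some pretty simple code to extract a 7 digit phone number written in any of these styles. Keep in mind that the pattern never counts digits, so it will also accept any other run of two or more digits. That was fast! But let's keep learning!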

In [21]:
pattern = r'[0-9]+[ -]?[0-9]+'
print(re.findall(pattern,'867-5309 is the magic number'))
print(re.findall(pattern,'8675309 we get it... you like jenny'))
print(re.findall(pattern,'This number 867 5309 is stuck in my head now'))

['867-5309']
['8675309']
['867 5309']
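
Because `+` accepts runs of any length, this pattern also picks up numbers that are not phone numbers. A stricter variant, sketched here as my own suggestion, reuses the string-multiplication trick from earlier to demand exactly 3 digits, an optional separator, then exactly 4 digits. The names `loose` and `strict` and the sample sentence are made up.

```python
import re

loose = r'[0-9]+[ -]?[0-9]+'
strict = r'[0-9]' * 3 + r'[ -]?' + r'[0-9]' * 4   # 3 digits, optional separator, 4 digits

text = 'Born in 1999, call 867-5309'
loose_hits = re.findall(loose, text)     # also grabs the year
strict_hits = re.findall(strict, text)   # only the phone number
print(loose_hits)
print(strict_hits)
```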


# New Operator: '\*'  (Zero or More)
---
The asterisk matches zero or more copies of the element that precedes it. In a sense, it is the combination of the optional operator ('?') and the One or More operator ('+').

A good use case for this operator is handling floats/decimals such as $0.5$ or $.5$. In this case, they are both equal but one does not have a leading $0$. This can be phrased as:

> A float is a number with **0 or more** leading digits before the decimal point.

The code below should be able to identify all float-like numbers (numbers that have a decimal point).

In [35]:
pattern = r"[0-9]*[.][0-9]+"
print(re.findall(pattern,"A total of .793 of Canadians love maple syrup!")[0],
      "is a large percentage!")

.793 is a large percentage!
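
To see what the `*` buys us over `+`, the sketch below (my own, with made-up values) runs both versions of the float pattern over the same text. Only the `*` version keeps the number that has no leading digit.

```python
import re

text = 'values: 0.5, .5 and 12.75'
zero_or_more = re.findall(r'[0-9]*[.][0-9]+', text)   # leading digits are optional
one_or_more = re.findall(r'[0-9]+[.][0-9]+', text)    # at least one leading digit required
print(zero_or_more)
print(one_or_more)
```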


# A Note on Escape Characters
---
There may be times where we want to match characters that are reserved as RegEx operators. For example, suppose we want to extract the `+` symbol from text. This requires us to tell the regex compiler not to treat the `+` as the *ONE OR MORE* operator, and instead treat it as a literal character. Doing so isn't hard, it just requires the addition of a backslash before the special character you want to match. 

I couldn't think of a fun example for escape characters since they aren't that cool. Run the code below to extract the plus and minus symbols from the sentence. 

Inside a character class the backslashes are actually optional, because `+` is already literal there and a `-` at the end of the class cannot form a range, so `r"[+-]"` gives the same result. Outside of a class it is a different story: an unescaped `+` on its own has nothing to repeat, and the `re` module raises an error.

In [386]:
res = re.findall(r"[\+\-]","A battery consists of a positive (+) cathode and negative (-) anode.")
print(f"The yin ({res[0]}) and yang ({res[1]}) of battery power")

The yin (+) and yang (-) of battery power
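
Outside of square brackets the backslash really is required. The sketch below is my own (the sample text is made up): it matches an escaped `+`, shows the error Python raises for an unescaped one, and uses the standard `re.escape()` helper, which adds the backslashes for you.

```python
import re

text = 'is 1+1=2?'
plus = re.findall(r'\+', text)    # escaped: matches the literal plus sign
print(plus)

try:
    re.findall(r'+', text)        # unescaped: + has nothing to repeat
except re.error as err:
    print('invalid pattern:', err)

# re.escape() inserts the backslashes when the text to match is arbitrary.
escaped = re.findall(re.escape('1+1'), text)
print(escaped)
```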


# New Operator: '.' (Any Character)
---
Time to introduce the wildcard of the operator family. The `.` operator will match ANY single character except the *new line character*. As a reminder, the new line character is the `\n` escape sequence that tells a string to move to a new line. This operator shortens our phone number pattern, with the trade-off that ANY character (not just a space or hyphen) is accepted between the digits. 

In [130]:
pattern = r"[0-9]+.[0-9]+"
re.findall(pattern,"Alright Mrs. 867-5309, let's stop playing number games")

['867-5309']
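
The wildcard is convenient but not picky. In this sketch of my own (the test strings are made up), the same pattern accepts a letter between the digits, and a second pattern shows that a newline is the one character `.` refuses to match.

```python
import re

pattern = r'[0-9]+.[0-9]+'
odd_hit = re.findall(pattern, 'pay 100x200 now')   # 'x' counts as "any character"
print(odd_hit)

# 'a-b' matches, but the 'a' and 'b' separated by a newline do not.
dot_hits = re.findall(r'a.b', 'a-b a\nb')
print(dot_hits)
```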

# New Operator '^' (Exclude)
---
We can choose to avoid matching specific characters by placing the exclude operator (`^`) as the first character inside of a character class. This is the only place the caret works as an exclude operator! When it is outside of square brackets, it instead anchors the match to the start of the string. 

For example, if we want to exclude all vowels from the alphabet to obtain the list of all consonants, we may want to use the caret.

In [232]:
pattern = r"[^aeiou]"  # any single character that is NOT a vowel
consonants = ''.join(re.findall(pattern,string.ascii_lowercase))
print(f"Read it five times fast: {consonants}")

Read it five times fast: bcdfghjklmnpqrstvwxyz
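
To contrast the two roles of the caret, here is a small sketch of my own with a made-up sentence. Inside the brackets it excludes characters; outside the brackets it pins the match to the start of the string.

```python
import re

text = 'hello world 42'
not_lowercase = re.findall(r'[^a-z]+', text)   # inside brackets: runs of anything except a-z
first_word = re.findall(r'^[a-z]+', text)      # outside brackets: anchor to the start of the string
print(not_lowercase)
print(first_word)
```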


# Adding Parentheses
---
This is where regular expressions become rather confusing in my opinion. There is this notion of *capturing* and *non-capturing* groups. 
* A *capturing* group looks like "I love (cats|dogs)". With `findall`, this will return only the captured part: either "cats" or "dogs"
* A *non-capturing* group looks like "I love (?:cats|dogs)". This will return the full pattern match of "I love cats" or "I love dogs"

**Rule of Thumb:** As far as I can gather, you will use (?:) 99% of the time. 

In [299]:
s = "I am an animal lover, and I love dogs the most!"
print('Capturing output: ',re.findall(r"I love (dogs)",s))
print('Non-Capturing output:', re.findall(r"I love (?:dogs)",s))

Capturing output:  ['dogs']
Non-Capturing output: ['I love dogs']
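
The behaviour of `findall` depends on how many capturing groups the pattern contains. The sketch below is my own (the sentence is made up) and covers the three cases, plus `re.search`, whose match object exposes both the full match and the captured part.

```python
import re

s = 'I love cats and I love dogs'
one_group = re.findall(r'I love (cats|dogs)', s)      # one group: list of group contents
no_group = re.findall(r'I love (?:cats|dogs)', s)     # no capturing group: list of full matches
two_groups = re.findall(r'(I) love (cats|dogs)', s)   # two groups: list of tuples
print(one_group, no_group, two_groups, sep='\n')

# A match object gives access to both views at once.
m = re.search(r'I love (cats|dogs)', s)
print(m.group(0), '|', m.group(1))
```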


To better understand this, try running the code below with and without the `?:` operator. Without the `?:`, `findall` returns only the captured group, `//`. With it, we receive the entire match. (The `{4}` is a repetition count: it asks for exactly four characters from the class `[a-z.]`.)

In [342]:
s = "This is an example of a website url https://www.google.com/ , kind of ugly tbh"
pattern1 = r"(?://)[a-z.]{4}"
re.findall(pattern1,s)

['//www.']

# Conclusion
---
Well that's all I have for my brief introduction to regular expressions using Python. We managed to write some pretty clean code that searches for phone numbers! Along the way we learned the following operators:
* `[abc]`: Square brackets denote a character class, which tells RegEx to match any ONE of the characters contained within the brackets
* `+` : The one or more operator matches one or more repetitions of the preceding element
* `?`: Makes the element preceding it optional
* `*` : Same as the `+` operator but it is **Zero or More**
* `|` : Matches the pattern on its left **or** the pattern on its right
* `\` : Backslashes allow regex patterns to pick up special characters such as `+` or `-`
* `.` : The wildcard picks up any character except the newline character defined as `\n`
* `^` : We can choose to exclude characters using a caret at the start of a character class (i.e. `[^ignoremore]`)
* `(?: )` : We can group patterns together like we can group mathematical expressions. The use of `?:` is highly encouraged for most applications. 
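
To tie the operators together, here is one possible phone number finder. It is a sketch under my own assumptions, not the definitive answer: the helper name `find_phone_numbers`, the optional 3-digit area code, and the sample sentences are all hypothetical. It uses the `{n}` repetition count (exactly `n` copies) that appeared in the URL example, and it will still pick digits out of longer numbers.

```python
import re

# Hypothetical pattern: optional 3-digit area code, then 3 digits and 4 digits,
# with an optional single space or hyphen after each of the first two parts.
PHONE = re.compile(r'(?:[0-9]{3}[ -]?)?[0-9]{3}[ -]?[0-9]{4}')

def find_phone_numbers(text):
    """Return every substring of `text` that looks like a phone number."""
    return PHONE.findall(text)

print(find_phone_numbers('Call 867-5309 or 555 867 5309 today'))
print(find_phone_numbers('Jenny, I got your number: 8675309'))
```

The area code sits inside a non-capturing group so that `findall` keeps returning whole matches rather than just the group.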