# Module 3: Regular Expressions
<br>


## Table of Contents
<br>

## Module 3: Regular Expressions
<ol>
  <li>What are regular expressions for?</li>
  <li>Case Study</li>
  <li>Matching Letters</li>
  <li>Matching Multiple Characters</li>
  <li>The "any" Character</li>
  <li>Escaping Special Characters</li>
  <li>Grouping Expressions</li>
  <li>Starting and Ending Strings</li>
  <li>Programming with REGEX</li>
  <li>Special Character Glossary</li>

</ol>

**Learning Outcomes:** 
By the end of the module you will:
* Be able to understand when regular expressions are useful 

* Be able to write expressions that use:
    + Characters
    + Multiple characters
    + Unknown characters
    + Special characters
    + Groups of expressions
    + Starting and ending characters
 
 
*  Be able to write expressions using combinations of the above.
<br>

Additionally you should be able to:

* Use regular expressions in python with the `re` and `pandas` libraries.

<br>





## About this module

This module introduces the basic concepts of regular expressions through a case study. The regular expression theory in this course is largely language agnostic, in that all major programming languages implement regular expressions in a similar manner.

The beginning of the course will be conducted using a website tool, [regex101.com](https://regex101.com/) that helps us easily test expressions. Later on we will look at how to use regular expressions in python and with data sets in `pandas`.


## What are they for?

Regular expressions are a useful tool when we have text data. 

Often when using text data we want to find out if the text meets some criteria. 

Some example things we might want to check the text data for:

* if it starts with the letter "B"
* if all the letters in it are capitalised
* if the text ends in a question mark "?"
* whether the text contains a specific keyword, like a name or location

These checks can mostly be done using our programming knowledge about strings and looping through the data. To check the starting letter of a text we can check the first element; to check capitalisation we can loop over the text and check whether each letter is capitalised, and so on.

However, there are more complicated checks we may need to make on our text data. These are going to be harder to do by indexing and looping through the data.

Some harder tasks to achieve are checking:
* whether the text is a valid email
* if a password meets some security requirement
* whether a phone number follows a valid format
* if a mistake has been made when inputting some data manually

Take the password security example and suppose we have four security requirements: the password must contain at least one uppercase letter, at least one lowercase letter and at least one digit, and it must be at least 8 characters long.

<img src="../pics/password.jpg" alt="Password checker"
	title="Password verification" width="400" height="200" />


To check a given password against these criteria we would need to program a very specific solution, such as:
* loop over every element and check whether it is uppercase or lowercase; the check passes if at least one element is of each kind
* loop through the text and check whether each element is a digit; the check passes if at least one is
* count the number of elements in the string; the check passes if the length is at least 8

From this it's clear that there is going to be some iteration, and potentially a lot of conditions, in the program to get it to work for our requirements.

Regular expressions allow us to tackle the second group of examples and similar problems more easily than using a typical string programming approach. They are used widely to search and replace parts of text data.

A regular expression lets us design a pattern for our specific problem. This pattern will describe what we need from the text data without needing to loop through the string ourselves, or add complex conditional statements (`if/else` statements). 

The above password problem can be fully described by the following regular expression:

`"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[a-zA-Z]).{8,}$"`

**Note** you do not need to understand the above yet.
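To make the comparison concrete, here is a minimal sketch of both approaches side by side. The function names `password_ok_with_loops` and `password_ok_with_regex` and the candidate passwords are made up for illustration, and the length rule follows the `{8,}` in the expression above, which accepts eight or more characters.

```python
import re

# The expression from the text, written as a raw string so the
# backslashes reach the regex engine unchanged.
PASSWORD_REGEX = r"^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[a-zA-Z]).{8,}$"


def password_ok_with_loops(password):
    """Check the four requirements by looping over the characters."""
    has_upper = has_lower = has_digit = False
    for character in password:
        if character.isupper():
            has_upper = True
        elif character.islower():
            has_lower = True
        elif character.isdigit():
            has_digit = True
    return has_upper and has_lower and has_digit and len(password) >= 8


def password_ok_with_regex(password):
    """Check the same requirements with a single pattern."""
    return re.search(PASSWORD_REGEX, password) is not None


# Hypothetical candidate passwords (plain ASCII) used only for this comparison
candidates = ["Passw0rd", "password1", "PASSWORD1", "Password", "Pw0rd"]
for candidate in candidates:
    print(candidate, password_ok_with_loops(candidate), password_ok_with_regex(candidate))
```

Both functions should agree on these candidates; the regex version simply moves the looping and the conditions into the pattern.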

The expression above may look intimidating, but regular expressions are extremely powerful. They are a language in themselves, but we are going to start small and then build up to more complicated expressions like the one above.

Below is an example of the same regular expression broken into groups, each showing what we want to match from our criteria.

<img src="../pics/annotatedregex.png" alt="password regex"
	title="passwordregex" width="800" height="400" />

## Set up regex101

You are encouraged to get a feel for [regex101.com](https://regex101.com/) yourself, as it is a very handy tool for quickly testing the expressions you have designed. Throughout the course you will be instructed to use it. In this section we will briefly go through the relevant areas of the site.

One of the added benefits of regex101 is that it will also give you an explanation of what your regular expression does in words.

Open up [regex101.com](https://regex101.com/)

<img src="../pics/regex101wide.png" alt="view of regex101"
	title="regex101" width="800" height="400" />
    
The page will look something like the above.

Change the flavour of your regular expressions to **Python**. The below image shows the main areas of the site we will be using.

<img src="../pics/regex101annotated.png" alt="view of regex101 annotated"
	title="regex101 annotated" width="800" height="400" />

# The Case Study

Throughout this module we are going to work through a specific problem, introducing regular expression (regex) concepts as we go. We will test the first few concepts using the regex101.com website; after that we will bring them all together in python.

## The Task

Your team is running a large conference which is only available to current civil servants. Your team has decided that it is going to check that individuals are eligible for the event by checking their registered email.

Broadly there are two things which you will need to check:
* That the person enrolled has a validly formatted email
* That the email has a government domain.

Specifically, you need to check for each email:
* the email follows "firstname" "." "lastname" convention in the username, such as "Jenny.Smith@ons.gov.uk". Each name should just be letters. 
* there is a "@" character between the username and email domain
* the email ends in ".gov.uk"
* the domain contains a department name (it doesn't need to be a real department) before the ".gov.uk", which is all letters.

You have been given the additional information:
* all letters in the emails can be upper or lowercase
* the text will not contain any spaces

## Your role

While you could try and check all the email addresses by hand, or by writing a program with lots of conditional statements, you have decided to tackle this problem with regular expressions. 

You have been told that the best approach to building regular expressions is to tackle each criterion of the pattern as a sub-pattern, then combine them. At each step you are going to check the sub-pattern against some realistic partial test text.

Once the expression is all brought together there is a file containing test emails that can be used to validate whether your expression gets the matching right.



## Matching Letters

The first part of the problem we are going to tackle is to check that the first name in the email contains just letters.

To do this we need two things:
* To be able to match when something is a letter
* To match multiple letters


### Match a letter

We can create a "set" of values we want to match by using square brackets `[]` and putting the elements we want to match inside.

For example, if I wanted to match the letters "a", "b" or "c" I could create the set `[abc]`. This will check an individual element in a string, and see if that element is a member of the set. 

The set `[abc]` when applied to each element in "abracadabra" would give the following matches.

![abracadabra matching](../pics/abracadabra.png)

We can create a more general set by indicating a range of values we want. For example we can:

* match the set of all lowercase letters: `[a-z]`
* match the set of all uppercase letters: `[A-Z]`
* match the set of all digits: `[0-9]`

More characters can be included in the set to widen what it will match. For example, if we want to match all lowercase letters **AND** spaces, we could write `[a-z ]`. Notice the space now inside the set.
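For readers who want to see set matching outside regex101, the sketch below uses `re.findall()` from python's `re` library (covered later in this module) to list every character that a set matches. The `Room 101` test string is made up for illustration.

```python
import re

# Every character of "abracadabra" that is a member of the set [abc]
matches_abc = re.findall(r"[abc]", "abracadabra")
print(matches_abc)

# A range of digits: only the characters 0 to 9 are members of the set
digits = re.findall(r"[0-9]", "Room 101")
print(digits)

# Lowercase letters and the space character; the uppercase "R" and the
# digits are not members of this set
lower_and_space = re.findall(r"[a-z ]", "Room 101")
print(lower_and_space)
```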




### Exercise 1

In regex101.com, in the string test section write your full name in the form
> Jenny Smith

Within the regex section, write a regex set such that all characters are matched in your name, including spaces.

In [None]:
# Import the appropriate function to view images in the exercise answers
from IPython.display import Image 

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise1.py

If, instead of matching a single character from a set, we want to match an exact string of characters, we can write that string as our regular expression. For example, we can write:
> "hi"

if we want to match exactly those characters in that order.
We can even combine specific strings with sets so that we can match a range of words.

> "hi[tp]" - will match both "hit" and "hip"
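
As a small sketch of a literal string combined with a set, the code below checks a few made-up words against `hi[tp]` using `re.fullmatch()`, which only succeeds when the whole word matches.

```python
import re

# Hypothetical word list used only for illustration
words = ["hit", "hip", "his", "hid"]

# "hi" must appear literally, followed by one character from the set [tp]
pattern = r"hi[tp]"

matched = [word for word in words if re.fullmatch(pattern, word)]
print(matched)
```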


### Exercise 2

In regex101.com, in the string test section write the following words
* help
* hell
* held

Within the regex section, write a regex which matches **help** and **held** but not **hell**.

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise2.py

### Match multiple letters

We have so far matched only individual letters, or a specified string of characters. In order to be able to match a first name from our emails we are going to need to specify how many letters there could be.

To change the number of occurrences of a character in our expression we add a "special character" after the character we want to occur.

A special character is a character which has a particular meaning other than itself. 

We will look at several special characters that modify the number of occurrences of a character:

* "+" - preceding character appears one or more times
* "*" - preceding character appears 0 or more times
* "?" - preceding character appears 0 or 1 times
* "{n}" - preceding character appears exactly `n` times
* "{n,}" - preceding character appears `n` or more times
* "{n,m}" - preceding character appears `n` or more times, but no more than `m` times

These specific "special characters" are called "Quantifiers" as they determine how many times a character occurs.

![Quantifier table](../pics/quantifiers.png)


By using these in combination with the characters we want we can fit the right pattern to our needs.
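The curly bracket quantifiers are not used in the examples that follow, so here is a minimal sketch of how they behave. The test strings are made up, and `re.fullmatch()` is used so that only whole-string matches count.

```python
import re

# Hypothetical test strings: one to five repetitions of "a"
candidates = ["a", "aa", "aaa", "aaaa", "aaaaa"]

# {2}   : exactly two occurrences
exactly_two = [s for s in candidates if re.fullmatch(r"a{2}", s)]

# {2,}  : two or more occurrences
two_or_more = [s for s in candidates if re.fullmatch(r"a{2,}", s)]

# {2,4} : at least two and at most four occurrences
two_to_four = [s for s in candidates if re.fullmatch(r"a{2,4}", s)]

print(exactly_two, two_or_more, two_to_four)
```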

If we definitely know we want a character to be present, we will use the "+" special character.

For example, if I want to match the letter "a" followed by at least one "h", I would use the regex `"ah+"`.

Try out `"ah+"` in regex101.com using the below test strings:

* ah
* ahhhh
* ahhhhhhh
* a
* h
* ahhhhha

The regex will fully match only the first three strings. The last case is partially matched, but note that the match does not pick up the final "a".
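The same check can be sketched in python. `re.fullmatch()` reports whole-string matches, while `re.search()` shows the partial match found in the last test string.

```python
import re

test_strings = ["ah", "ahhhh", "ahhhhhhh", "a", "h", "ahhhhha"]

# Strings that the regex matches from start to end
full_matches = [s for s in test_strings if re.fullmatch(r"ah+", s)]
print(full_matches)

# The last string only matches in part: the trailing "a" is left out
partial = re.search(r"ah+", "ahhhhha")
print(partial.group(), partial.span())
```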


If we know that a specific character may or may not appear, we often use the "?" character. This is similar to saying "the preceding character might be there, and either way we want to match it". 

**Be careful** when using quantifiers that allow zero occurrences, as the expression may not pick up what you want!

**Note:** These characters are "greedy" - they will match the maximum numbers of characters that they can.

Write the regex `c?at` into regex101.com and add the test strings below:
* cat
* bat
* chat
* at

Add in some more words ending in "at" and see which match fully. In the above examples only "cat" and "at" match fully, because the "?" character means the "c" may or may not be present for a match to happen.
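A short python sketch of the optional character, using the same four test strings. It also shows why "bat" and "chat" are only partial matches: the regex finds the "at" inside them.

```python
import re

test_strings = ["cat", "bat", "chat", "at"]

# Whole-string matches: the "c" is optional, nothing else is allowed
full_matches = [s for s in test_strings if re.fullmatch(r"c?at", s)]
print(full_matches)

# Partial matches: the part of each string that the regex picks up
partial_matches = {s: re.search(r"c?at", s).group() for s in test_strings}
print(partial_matches)
```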

The last special character in this area is "*". This denotes that the preceding character occurs zero or more times. For this reason it is often said to denote "occurs any number of times". It is like a combination of the two previous characters.

Write the regex `t*he` into regex101.com and add the test strings below:
* he
* the
* tttttttttttthe

As you can see, the regex will fully match all three of these examples. It is very tolerant in that it will accept any number of occurrences. Use it only if you're sure your task can accept having 0 occurrences of some character.
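The tolerance of "*" can be sketched in python as follows. The extra word "she" is a made-up example showing how allowing zero occurrences can produce a match you may not have intended.

```python
import re

test_strings = ["he", "the", "tttttttttttthe"]

# All three match in full, because "t" may occur any number of times
full_matches = [s for s in test_strings if re.fullmatch(r"t*he", s)]
print(full_matches)

# With zero "t" characters allowed, the "he" inside "she" also matches
unintended = re.search(r"t*he", "she")
print(unintended.group())
```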


### Exercise 3

In regex101.com write a regular expression that matches:
* Hello
* Helloooo
* Hllo

But not:
* Heeeelloo
* Hell
    

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise3.py

## Back to emails

We are now equipped to start thinking about our case study. We can now match multiple occurrences of a set of characters. If we look back at our criteria:

* the email follows "firstname" "." "lastname" convention in the username, such as "Jenny.Smith@ons.gov.uk". Each name should just be letters.
* there is a "@" character between the username and email domain
* the email ends in ".gov.uk"
* the domain contains a department name (it doesn't need to be a real department) before the ".gov.uk", which is all letters.

We can start piecing together parts of our final regular expression.

To match the firstname, lastname and government department we are actually going to use the same sub expression. 

What we need is to match a string of letters (one or more letters), possibly a mix of uppercase and lowercase.



### Exercise 4

In regex101 write an expression that fully matches single-word names, department names or abbreviations.

The below test strings should be matches.

* Jenny
* jenkins
* HaRold
* F
* DCMS
* ons


These strings should not be matched fully.

* Jon smith
* !EMPTY STRING!
* sasha123
* caroline.davies

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise4.py

### Any Character

So far we have looked at sets, strings and multiple occurrences. These are useful, but they need us to know in advance what each character will be.

To overcome this there is a special character, ".", which will match any single character, no matter what it is.

This can be a really useful addition to our tool belt as it allows us to add generic placeholders where we don't know what our string will look like.

In regex101 write the regex `.*` and the following test text:

"This CAN hav3 an%thing"

As you can see, this matches the whole string fully, and it will match anything we throw at it, even if we give it an empty string. Test this by adding more text strings with different punctuation and characters. 
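The behaviour of the any character followed by the `*` quantifier, that is the pattern `.*`, can be sketched in python. One detail worth knowing is that in python's `re` library "." does not match a newline character unless the `re.DOTALL` flag is passed; the two-line test string is made up to show this.

```python
import re

any_regex = r".*"

# Any mix of letters, digits and punctuation is matched in full
matches_mixed = re.fullmatch(any_regex, "This CAN hav3 an%thing") is not None

# Zero characters is also acceptable, so the empty string matches
matches_empty = re.fullmatch(any_regex, "") is not None

# By default "." stops at a newline, so a two-line string does not match in full
matches_two_lines = re.fullmatch(any_regex, "line one\nline two") is not None

# With re.DOTALL the "." also matches the newline character
matches_with_dotall = re.fullmatch(any_regex, "line one\nline two", flags=re.DOTALL) is not None

print(matches_mixed, matches_empty, matches_two_lines, matches_with_dotall)
```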

### Exercise 5

In regex101 write an expression that matches all text which starts with "AB_", followed by any number of characters (including none). Test it on the strings below.

These should fully match:
* AB_123
* AB_!!!
* AB_
* AB_check
* AB_AB_

These should not:
* ab_
* AB

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise5.py

### Escaping special characters

Introducing these special characters gives us more flexibility, but what happens if we want to match one of the special characters? For example, what if we want our expression to match a "+" symbol itself, not its special character meaning?

To do this we "escape" the special character using a new special character, "\", which we put before the character we want to escape.

For example, to escape "+" we would write "\+". Or to escape "*" we would write "\*". If we even wanted to escape "\" we can use "\\".
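Here is a minimal python sketch of escaping. The test strings are made up; raw strings (`r"..."`) are used so that python itself leaves the backslashes alone, and `re.escape()` is a helper in the `re` library that adds the escapes for us.

```python
import re

# An escaped "+" matches the plus symbol itself
plus_signs = re.findall(r"\+", "1+1=2")
print(plus_signs)

# Without the escape, "1+" means "one or more 1s", so the literal text does not match
unescaped_matches = re.fullmatch(r"1+1=2", "1+1=2") is not None
print(unescaped_matches)

# An escaped backslash matches a single backslash in the text
backslashes = re.findall(r"\\", r"C:\temp\file")
print(len(backslashes))

# re.escape() builds a pattern that matches the given text literally
escaped_pattern = re.escape("1+1=2")
escaped_matches = re.fullmatch(escaped_pattern, "1+1=2") is not None
print(escaped_matches)
```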

### Exercise 6

Write a regular expression in regex101 to fully match a string ending with a question mark.

The following should match:
* Are you there?
* What am I?
* How are you?

The following should not match:
* Are you okay!
* What's going on

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise6.py

### Grouping Expressions

We have been writing sub-expressions so far that can be combined. It is often useful to contain our sub-expressions in groups by using the special character brackets `()`. This is done particularly when the sub-expression becomes complicated.

Each sub-expression gets put within a pair of brackets.

`(first_expression)(second_expression)`
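In python, groups also let us pull out the part of the text matched by each sub-expression. The sketch below uses a made-up pattern and test string with three groups: letters, a space and digits.

```python
import re

# Three sub-expressions, each wrapped in its own pair of brackets
grouped_regex = r"([a-z]+)( )([0-9]+)"

match = re.fullmatch(grouped_regex, "room 101")

# group(0) is the whole match; groups() returns each bracketed part in order
print(match.group(0))
print(match.groups())
print(match.group(1), match.group(3))
```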

### Exercise 7

We are now able to combine sub-expressions using brackets, as well as escaping special characters. This means we can add some of our expressions together to help match our email formatting.

Using the `[a-zA-Z]+` sub-expression, write a regular expression that matches the format `firstname.lastname`.

It should match the following:
* Fred.woolnough
* Jane.Leslie
* maureen.godson

But not fully match the strings below:
* jack,wigley
* henrywelsh
* sarah-riley
* charlie.alexander123

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise7.py

## Starting and Ending Strings

As you can see in the previous exercise, at the moment we can still have parts of strings that do not match returning a partial match.

For example in `charlie.alexander`**123**, the "123" part goes unmatched. This could lead to unintended consequences, such as creating matches that contain extra leading or trailing strings.

This is counteracted by using two more special characters.

`^` precedes a character/expression to mark that it starts the string.

`$` follows a character to mark that it ends the string.

For example, suppose we want an expression that matches anything as long as it starts with a capital letter and ends with a question mark, with anything in between.

To match the strings starting with a capital letter use `"^[A-Z]"`

To match the strings ending in a question mark use `"\?$"`.

To match anything in between use `".*"`.

All together this becomes:

`"^([A-Z])(.*)(\?$)"`

Copy the expression into regex101 and try it out with your own question strings.

The result will look something like this:

<img src="../pics/example_question.png" alt="question result regex"
	title="question" width="300" height="400" />
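The question expression can also be tried in python. This is a sketch with made-up test strings; it compares the anchored expression with an unanchored version to show the partial match that `^` and `$` rule out.

```python
import re

question_regex = r"^([A-Z])(.*)(\?$)"

# Hypothetical test strings
examples = ["How are you?", "how are you?", "How are you? Fine.", "Why?"]

# True where the anchored expression finds a match
results = {text: re.search(question_regex, text) is not None for text in examples}
print(results)

# Without the anchors, part of a longer string is still matched
unanchored = re.search(r"[A-Z].*\?", "well. How are you? Fine.")
print(unanchored.group())
```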

### Exercise 8

Building on the previous exercise, where the regular expression is `"([a-zA-Z]+)(\.)([a-zA-Z]+)"`, change the expression so that the firstname appears at the beginning of the string.

In addition, add the `@` symbol after the lastname, followed by the department name.

Your new format will be `firstname.lastname@department`

The new expression will match:
* harry.potter@mom
* bilbo.Baggins@SHIRE
* Luke.swalker@rotj


But not the following:
* L.sharpe.1@mod
* gerry.barry.chorley@ons

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise8.py

### Exercise 9 

Write a regular expression that matches strings which end in ".gov.uk". The string can contain any number of any character before the ".gov.uk".

The regex should fully match the following strings:

* 1234567.gov.uk
* jasmine.WALTON.gov.uk
* .gov.uk
* ......gov.uk

But not match:

* Jack.gov.uk.org
* Sarah.gove.uk
* Alison.gov.up

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise9.py

### Exercise 10

Combine the regular expressions from exercises 8 and 9 to solve the case study. In regex101, write your own tests based on the criteria and check that your answer passes them. Below is a reminder of the case study requirements:

> * the email follows "firstname" "." "lastname" convention in the username, such as "Jenny.Smith@ons.gov.uk". Each name should just be letters.
> * there is a "@" character between the username and email domain
> * the email ends in ".gov.uk"
> * the domain contains a department name (it doesn't need to be a real department) before the ".gov.uk", which is all letters.
>
> You have been given the additional information:
>
> * all letters in the emails can be upper or lowercase
> * the text will not contain any spaces

We will test the expression in python later to see if it matches the test cases prepared.

# Using regex in programming

So far we have been using the regex101 tool to test the regular expressions we have designed. This is a great tool for learning, and testing individual expressions, but at some point we need to use the expressions in our programming itself.

Regular expressions are supported in many major programming languages and tools, including python, R, and the Unix and Windows command line interfaces.

There are occasionally slight differences in how languages style their regular expressions, but these differences are minor.

In python there is a library included called `re` which gives the programmer access to regular expressions and related functions. In addition, to use regular expressions in dataframes `pandas` provides a range of string methods which take regular expressions as inputs.

In R, base functions such as `grep()` allow the programmer to use regular expressions with text data. Within the `tidyverse` the `stringr` package has functions that can use regular expressions.

This is a Python tutorial, so we will focus on how to use regex using python packages. 

## The `re` library

The `re` library, short for regular expressions, contains a range of functions that allow us to use regular expressions in text. We are going to show two frequently used operations: searching and substituting. 



### Searching

In [None]:
# Import the package to use its functions
import re

# Define a string we are going to work with
search_text = "Mary had a little lamb."

# Define a regex to match a string starting with "Mary", 
# it can have anything after, but ends in "little"
mary_regex = "^(Mary).*(little)"

# Run the search function using the expression and string
match_object = re.search(mary_regex, search_text)

In [None]:
# The match object found contains all we need from our regex search
match_object

We can access each piece of information individually

In [None]:
# The group() gives us the string that matches the regex
match_object.group()

In [None]:
# The string gives us the original string that was given to the search function
match_object.string

In [None]:
# The span() gives the start and end indexes of the match within the string
match_object.span()

There is also the `re.findall()` function, which returns all non-overlapping substrings of a string that match the given regular expression. It is called in a similar manner to `re.search()`.
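A short sketch of `re.findall()` on the same sentence used above. The two patterns are made up for illustration; the result is a list of matched strings, which is empty when nothing matches.

```python
import re

search_text = "Mary had a little lamb."

# Every run of lowercase letters that starts with "l"
words_starting_with_l = re.findall(r"l[a-z]+", search_text)
print(words_starting_with_l)

# No digits in the text, so the result is an empty list
numbers = re.findall(r"[0-9]+", search_text)
print(numbers)
```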

### Substituting

To substitute we need to define:
* The regular expression stating what we want to match
* The string we want to replace the match with
* The string we want searched

In [None]:
# Import the package to use its functions
import re

# Define a string we are going to work with
original_text = "The      porridge   in the  hotel     was    too hot"

# Define a regex to match the "hot" at the end of the string,
# but not in "hotel"
temperature_regex = "hot$"

# Run the substitution function using the expression, replacement and string
new_text = re.sub(temperature_regex, "cold", original_text)

new_text

In [None]:
# Match one or more consecutive spaces
extra_spaces_regex = " +"

# Replace single or multiple spaces with a single space
no_spaces_text = re.sub(extra_spaces_regex, " ", new_text)

no_spaces_text

### Using regex with `pandas`

Pandas contains a range of string based methods in the `.str` group, such as `.startswith()`, `.replace()` and `.lower()`. 

Just like with `re` we can match and substitute using regular expressions on string data in `pandas` columns.

This is done using the `.str.match(pat=regex)` method for matching and the `.str.replace(pat=regex, repl=replacement, regex=True)` method for substituting, applied to the columns we want to operate on.

The matching method will provide us with a Series of True/False values, depending on whether the string in each row matches the expression (`pat`) given. 

The replace method will return a new column where the string in each row has had any matching substring replaced.
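The sketch below tries these methods on a small, made-up Series rather than the course data. Note that `.str.match()` only requires the pattern to match from the start of each string; `.str.fullmatch()`, which assumes pandas 1.1 or later, requires the whole string to match.

```python
import pandas as pd

# A small, hypothetical Series used only for illustration
animals = pd.Series(["cat", "at", "bat", "cat  flap"])

# True where the pattern matches at the start of the string
starts_with_pattern = animals.str.match(pat=r"c?at")

# True only where the pattern matches the whole string
whole_string_matches = animals.str.fullmatch(pat=r"c?at")

# Replace runs of one or more spaces with a single space
single_spaced = animals.str.replace(pat=r" +", repl=" ", regex=True)

print(starts_with_pattern.tolist())
print(whole_string_matches.tolist())
print(single_spaced.tolist())
```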

## Testing our regex

Below we are going to import a dataset that contains (fake) emails which may or may not meet the specified criteria to be eligible for signing up to the event. There is a second column which tells us whether the individual is eligible, so we can check the result of our matching against this column.

In [None]:
import pandas as pd

# Load dataset
gov_emails = pd.read_csv("../data/emails.csv")

In [None]:
# Look at the columns and values
gov_emails.head()

In [None]:
# We have string and boolean data types
gov_emails.dtypes

Let's check which emails end in ".gov.uk".

In [None]:
# Define our regex as a raw string so the backslashes reach the regex engine unchanged
ends_in_gov_uk = r".*(\.gov\.uk)$"

# Use match to check regex
gov_emails["end_matches"] = gov_emails["email"].str.match(pat=ends_in_gov_uk)

# Show some results
gov_emails.head()

We can see that three of the emails do not end in ".gov.uk". Let's have a look at them.

In [None]:
# Filter the dataframe to show only the non-matching rows
gov_emails[~gov_emails["end_matches"]]

Great, we can see that these shouldn't have matched the overall criteria anyway, probably because their domains are incorrect.

### Exercise 11

Using the regular expression you created in exercise 10 and the `gov_emails` data set, create a new column called "match" which states whether the email matches your regular expression. Compare the values of "match" and "should_match". Does your expression get it right for all the emails?

If your expression does not classify all the emails correctly, keep improving it until it does.

In [None]:
# Write your code here


In [None]:
# You can check how many mistakes your expression makes by running the code below
print("Number of incorrect matches:", (gov_emails["should_match"] != gov_emails["match"]).sum())

In [None]:
# To see the answer run this cell once, to see the image run it again
%load ../exercise_answers/exercise11.py

## Summary

Using regular expressions we can match complex patterns in strings. This is done by using "special characters" in the regex language. Each of these characters has a meaning, and when combined they can describe a pattern of text. 

The expressions will allow us to match and substitute with more flexibility and efficiency than traditional programming methods.

We have covered the following regular expression concepts:
* Matching strings
* Sets
* Multiple occurrences
* The `.` character
* Expression grouping
* Beginning and ending strings

In addition, we have briefly looked at using these concepts in python with:
* The `re` library
* `pandas`
* Matching and substituting
* A case study

From here you should be able to build some of your own regular expressions to solve string based problems. Remember, it is crucial that you apply your expressions to test strings to check that the expression performs as intended.

This tutorial covered a subset of the special characters in regex. It is suggested that you look through the **Further reading** section to become aware of others.

## Further reading

This tutorial goes through a subset of regular expressions and their features; there are many more features that can be explored.

* [Quantifiers](https://launchschool.com/books/regex/read/quantifiers)
* [Or operator - Alternation](https://www.regular-expressions.info/alternation.html)
* [General python implementation](https://docs.python.org/3/howto/regex.html)
* [Lookaheads and negation](https://blog.finxter.com/how-to-find-all-lines-not-containing-a-regex-in-python/)
* [Shorthand character classes](https://www.regular-expressions.info/refshorthand.html)
* [Example of cleaning text](https://medium.com/python-in-plain-english/data-cleaning-in-python-using-regular-expressions-920629586a05)

## Special Character Glossary

Below is a short look-up table showing the meaning of each of the special characters mentioned in this module. A standalone version is available within the supporting materials folder.

| Character  | Example  | Matches | Meaning  |
|:-:|:-|:-:|--:|
| "."  | "Hello ."  | "Hello 6", "Hello Y", "Hello !" | Matches any single character (except a newline)  |
| "[]"  | "[a-z]"  | "a", "h", "g"  | Any character included within the set  |
| "+" | "Hello+" | "Hello", "Helloo", "Hellooooo" | Occurs once or more |
| "*" | "Hello*" | "Hell", "Hello", "Hellooooooo" | Occurs zero or more times |
| "?" | "Hello?" | "Hell", "Hello" | Occurs zero or one times |
| {} | "Goodb{1,3}ye" | "Goodbye", "Goodbbye", "Goodbbbye" | Specifies a range of occurrences, or a specific number |
| ^ | "^Welcome" | "Welcome", **not** "Hello, Welcome" | Matches only at the start of the string|
| \$ | "Adios\$" | "Goodbye, Adios", **not** "Adios!" | Matches only at the end of the string|
| () | "^(Hello)(.)(there)$" | "Hello there", "Hello,there" | Groups expressions|
| \ | "Who\?" | "Who?" | Escapes the following special character |
