<h1>Revising Regular Expressions</h1>

This notebook is a reference for using regular expressions. It starts with regular expressions on their own; further down there are examples of using them in Python with the re module.

The work shown here is a summary of the YouTube videos by Corey Schafer, and you can find out more about him on his <a href="https://www.youtube.com/channel/UCCezIgC97PvUuR4_gbFUs5g">YouTube channel.</a> For a full explanation of regex you should go and check out his videos.

You can also test out all the examples over at <a href="https://regex101.com/">Regex101</a>.

<h2>Getting Started: General Regex</h2>

* You can search for literals, whereby you can type in the search term you are looking for e.g. "abc" as literally "abc". Important to note however:
    * The search is case sensitive
    * The order of letters matters
* MetaCharacters - some characters have special functions, these are <strong>   .\[{()^$|?*+   </strong> To escape these characters (and search for them literally) we use the backslash character
    * So to escape a backslash, you would use '\\\\'
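
As a rough illustration of the points above, here is a minimal Python sketch (the re module is covered further down; the sample string is invented for this example). It shows that a literal search is case and order sensitive, and what changes when the dot is escaped:

```python
import re

text = "abc ABC cba a.c"

# Literal search: case and letter order both matter, so only the first "abc" matches.
literal_hits = re.findall(r"abc", text)

# An unescaped dot is a metacharacter (any character), so "a.c" also matches "abc".
unescaped_hits = re.findall(r"a.c", text)

# Escaping the dot with a backslash restricts the match to a literal period.
escaped_hits = re.findall(r"a\.c", text)

print(literal_hits)    # ['abc']
print(unescaped_hits)  # ['abc', 'a.c']
print(escaped_hits)    # ['a.c']
```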

<h3>MetaCharacters</h3>

As mentioned above, metacharacters have special functions:

* .     - Match any character except a new line
* \\d   - Match any digit 0-9
* \\D   - Match anything that is not a digit
* \\w   - Match any word character (a-z, A-Z, 0-9, _)
* \\W   - Match any non-word character (opposite of above)
* \\s   - Match any whitespace (space, tab, or new line)
* \\S   - Match any non-whitespace 
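
A small sketch of these character classes in Python, assuming a made-up sample string. Note that in Python's re module the underscore counts as a word character:

```python
import re

# Made-up sample: contains digits, an underscore, spaces and a tab.
sample = "Room 42, floor_3\tok"

digits = re.findall(r"\d", sample)      # every single digit
words = re.findall(r"\w+", sample)      # runs of word characters (underscore included)
whitespace = re.findall(r"\s", sample)  # two spaces and one tab

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'floor_3', 'ok']
print(whitespace)  # [' ', ' ', '\t']
```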

<h3>Anchors</h3>

A bit like metacharacters, except that they do not match characters; they match the positions before and after characters

* \\b   - Will match word boundaries
* \\B   - Match any position that is not a word boundary

$Example$

> \\bHa

Would match the following:

> <strong style="text-decoration: underline">Ha</strong> <strong style="text-decoration: underline">Ha</strong>Ha

So because there is a word boundary at the beginning of the first and second 'Ha' it matches both of them, but the last 'Ha' does not have a word boundary at the beginning, only the end. If we were to use the expression *\\bHa\\b* then it would only match the first instance of 'Ha' as this is the only one with a word boundary at the beginning **and** end

> \\BHa

Would match the following:

> Ha Ha<strong style="text-decoration: underline">Ha</strong>

The last 'Ha' is matched because it's the only instance of 'Ha' with no word boundary at the beginning
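
The word boundary behaviour can be checked with a short Python sketch. It reports the span (start and end index) of each match in the same 'Ha HaHa' string, so you can see which instance was picked:

```python
import re

text = "Ha HaHa"

# 'Ha' with a word boundary in front: the first and second 'Ha'.
boundary_before = [m.span() for m in re.finditer(r"\bHa", text)]

# 'Ha' with a word boundary on both sides: only the first 'Ha'.
boundary_both = [m.span() for m in re.finditer(r"\bHa\b", text)]

# 'Ha' with no word boundary in front: only the last 'Ha'.
no_boundary_before = [m.span() for m in re.finditer(r"\BHa", text)]

print(boundary_before)     # [(0, 2), (3, 5)]
print(boundary_both)       # [(0, 2)]
print(no_boundary_before)  # [(5, 7)]
```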

* ^   - Beginning of the string
* \$   - End of the string

$Example$

> ^Ha

Would match the following:

> <strong style="text-decoration: underline">Ha</strong> HaHa

Only match the instance of 'Ha' that is preceded by the beginning of the string

> Ha\$

Would match the following:

> Ha Ha<strong style="text-decoration: underline">Ha</strong>

Only match the instance of 'Ha' that is followed by the end of the string
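
A minimal Python sketch of the two string anchors on the same 'Ha HaHa' string, again reporting the span of each match:

```python
import re

text = "Ha HaHa"

# The caret anchors the match to the beginning of the string.
at_start = [m.span() for m in re.finditer(r"^Ha", text)]

# The dollar sign anchors the match to the end of the string.
at_end = [m.span() for m in re.finditer(r"Ha$", text)]

print(at_start)  # [(0, 2)]
print(at_end)    # [(5, 7)]
```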

<h3>More examples: Matching phone numbers and using character sets</h3>

* []   - Match character set (within square brackets)
* [^ ] - Match characters not in character set

Let's look at matching phone numbers as an example:

> 321-555-4321

> 123.555.1234

We can start by writing an expression that matches three digits in a row:

> \\d\\d\\d

We could then use the period metacharacter to match any character that occurs between our digits, and our final expression for matching phone numbers would look like this:

> \\d\\d\\d.\\d\\d\\d.\\d\\d\\d\\d

However, if we wanted to match only phone numbers where the digits are separated by a dash or a dot, like in the example above, we use an expression with a **character set**. A character set uses square brackets that contain the characters we want to match:

> \\d\\d\\d[-.]\\d\\d\\d[-.]\\d\\d\\d\\d

NB. Most metacharacters do not need to be escaped inside a character set. The main exception is the dash: at the beginning or end of the character set it is treated as a literal dash and needs no escaping, but between two characters it represents a range. (The closing square bracket, the backslash, and a caret in first position also keep a special meaning inside a set.)

So the character set contains all characters that would be acceptable at that position. Note that it looks for a single character that matches one of the characters in the square brackets. If the phone number was *345..-234* then it would not match, because after recognising the first dot, the engine moves along the expression and looks for a digit.
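
A hedged sketch of the character set version in Python. The last two candidate strings are invented to show what the set rejects:

```python
import re

candidates = ["321-555-4321", "123.555.1234", "123*555*1234", "345..-234"]

# Three digits, a dash or dot, three digits, a dash or dot, four digits.
pattern = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d")

# Keep only the candidates that contain a match.
matched = [c for c in candidates if pattern.search(c)]

print(matched)  # ['321-555-4321', '123.555.1234']
```

The asterisk separated number is rejected because '*' is not in the set, and '345..-234' is rejected because the set only consumes one character per position.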

We can use character sets for powerful search queries, so for example, if we wanted to find phone numbers that started with the three digits *800* or *900*:

> [89]00[-.]\\d\\d\\d[-.]\\d\\d\\d\\d

We have a character set that will match the first digit if it is an 8 or 9, followed by two literal 0s that must also be present to match

If we wanted to specify a range of values in the character set, say a phone number that begins with a value between 1 and 7, we would use:

> [1-7]\\d\\d[-.]\\d\\d\\d[-.]\\d\\d\\d\\d

We can use ranges for letters e.g. [a-z] matches all lowercase letters 'a' through 'z', or we can also use ranges back to back, and use [a-zA-Z] to match all lowercase and uppercase letters 'a' through 'z'

We can match everything that is **not** in the character set by using the caret symbol:
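
The set and range examples above could be tried out like this in Python. The phone numbers and the 'abc123DEF' string are made up for the sketch, and fullmatch is used so that each string has to match the expression from start to end:

```python
import re

numbers = ["800-555-1234", "900.555.1234", "700-555-1234", "123-555-1234"]

# First digit must be 8 or 9, followed by two literal zeros.
eight_or_nine = re.compile(r"[89]00[-.]\d\d\d[-.]\d\d\d\d")

# First digit must fall in the range 1 to 7.
one_to_seven = re.compile(r"[1-7]\d\d[-.]\d\d\d[-.]\d\d\d\d")

starts_800_900 = [n for n in numbers if eight_or_nine.fullmatch(n)]
starts_1_to_7 = [n for n in numbers if one_to_seven.fullmatch(n)]

# Back-to-back ranges: runs of lowercase or uppercase letters.
letters = re.findall(r"[a-zA-Z]+", "abc123DEF")

print(starts_800_900)  # ['800-555-1234', '900.555.1234']
print(starts_1_to_7)   # ['700-555-1234', '123-555-1234']
print(letters)         # ['abc', 'DEF']
```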

> [^a-z]

This would match any characters that are not lower case letters between 'a' and 'z'

Another example, say we want to match everything in this list except 'bat':

> cat

> mat

> pat

> bat

We could use the caret and a character set to match everything whose first letter is not 'b', followed by the literals 'at':

> [^b]at
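
A quick Python check of the negated set against the list above:

```python
import re

words = ["cat", "mat", "pat", "bat"]

# Any single character that is not 'b', followed by the literals 'at'.
not_bat = [w for w in words if re.fullmatch(r"[^b]at", w)]

print(not_bat)  # ['cat', 'mat', 'pat']
```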

<h3>Quantifiers</h3>

Rather than repeating ourselves by typing out expressions like:

> \\d\\d\\d[-.]\\d\\d\\d[-.]\\d\\d\\d\\d

We can use quantifiers:
* \*    - 0 or More
* \+    - 1 or More
* ?     - 0 or One
* {3}   - Exact quantity (the value in the curly brackets)
* {2,4} - Range of numbers ({minimum, maximum}, so between 2 and 4 occurrences, in this example)
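
To see how each quantifier behaves, here is a small Python sketch. The word list and the helper function matching are made up for illustration; each quantifier is applied to the letter 'b' and a word only counts if the whole word matches:

```python
import re

words = ["a", "ab", "abb", "abbb", "abbbb"]

def matching(pattern):
    """Return the words that match the pattern from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]

zero_or_more = matching(r"ab*")    # 'a' followed by any number of 'b'
one_or_more = matching(r"ab+")     # at least one 'b'
zero_or_one = matching(r"ab?")     # at most one 'b'
exactly_three = matching(r"ab{3}")  # exactly three 'b'
two_to_four = matching(r"ab{2,4}")  # between two and four 'b'

print(zero_or_more)   # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(one_or_more)    # ['ab', 'abb', 'abbb', 'abbbb']
print(zero_or_one)    # ['a', 'ab']
print(exactly_three)  # ['abbb']
print(two_to_four)    # ['abb', 'abbb', 'abbbb']
```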

Looking at the phone number example again:

> 321-555-4321

> 123.555.1234

We can use quantifiers:

> \\d{3}[-.]\\d{3}[-.]\\d{4}

Another example, say we wanted to match all of the names beginning with 'Mr' in the following list, regardless of whether it was followed by a period:

> Mr. Schafer

> Mr Smith

> Ms Davis

> Mrs. Robinson

> Mr. T

We could use the expression:

> Mr\\.?

This would look for the literal 'Mr' and then a period character occurring once or not at all, indicated by the question mark following the period

To complete this expression we would then add the following:

> Mr\\.?\\s[A-Z]\\w*

Let's break this down: we match the literal 'Mr', then the period character occurring once or not at all, then a whitespace character, then an uppercase letter in the range 'A' to 'Z', and finally any word character occurring 0 or more times, so we follow the \\w with an asterisk

Now, what if we want to grab the female names, which start with 'Ms' or 'Mrs'? For this we have to use groups:

<h3>Groups</h3>

* ( )  - Groups
* |    - Either Or

To grab the female names only, we would use a group that specifies the female prefixes:

> (Ms|Mrs)\\.?\\s[A-Z]\\w*

Here, the expression matches a string that starts with 'Ms' OR 'Mrs', followed by the same expression as before

<h3>A couple of final examples</h3>

Matching email addresses:

> CoreyMSchafer@gmail.com

> corey.schafer@university.edu

> corey-321-schafer@my-work.net

Let's create a regular expression that matches all of the above email addresses:

> [a-zA-Z0-9.-]+@[a-zA-Z-]+\\.(com|edu|net)

So, this matches one or more character 'a' through 'z' upper and lowercase, or numbers 0 to 9, or a dot or dash, followed by a literal '@', then one or more characters 'a' through 'z' upper and lowercase or a dash, followed by a literal period, followed by either 'com', 'edu', or 'net'.
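
As a sketch, the email expression can be checked in Python. The last two entries are invented to show that the expression is deliberately narrow (it only accepts the three listed endings); it is not meant as a general email validator:

```python
import re

emails = [
    "CoreyMSchafer@gmail.com",
    "corey.schafer@university.edu",
    "corey-321-schafer@my-work.net",
    "someone@example.org",   # hypothetical: ending not in the group
    "not-an-email",          # hypothetical: no '@' at all
]

pattern = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z-]+\.(com|edu|net)")

# fullmatch requires the whole string to fit the expression.
matched = [e for e in emails if pattern.fullmatch(e)]

print(matched)
# ['CoreyMSchafer@gmail.com', 'corey.schafer@university.edu', 'corey-321-schafer@my-work.net']
```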

Matching domain names:

> https://www.google.com

> http://coreyms.com

> https://youtube.com

> https://www.nasa.gov

We could match these domains with:

> https?://(www\\.)?\w+\\.\w+

Notice that the group captures the 'www.', which doesn't always occur. We follow the group with a '?', which says: look for 'www.' occurring zero or one time.

Groups also let us split an expression into parts that we can refer to later:

> https?://(www\\.)?(\w+)(\\.\w+)

What's even cooler is that we can now reference these groups. How we reference them depends on the language or tool we are using, but it is often a backslash or sometimes a dollar sign. In some text editors, for example, we could reference the first group, 'www.', with \$1

Group numbering starts at 1 here, not because indexes don't start at 0 (as they should), but because group 0 refers to the entire matched expression.

<h2>Regex in Python</h2>

You can use regex with the re module in Python. I will go over some of the examples above again, this time in Python, using some Python-specific functions.

In [1]:
import re

In [20]:
# Raw string, so the lone backslash in the metacharacter list is kept as-is
text_to_search = r'''
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ

Ha HaHa

Metacharacters (need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )

coreyms.com

321-555-4321
123.555.1234

Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

domain_name_text = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""

A raw string in Python is prefixed with an 'r', which tells Python not to handle backslashes in any special way. In a normal string, a backslash starts an escape sequence such as a tab (\\t) or a new line (\\n), but in a raw string a backslash is just a backslash. This matters because we want the regex engine to receive the pattern exactly as written, without Python interpreting it first. So we pass our patterns as raw strings.

We use the compile function in the re module to turn our regular expression into a pattern object, which we assign to a variable:
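
A minimal sketch of the difference. The strings are made up; the second half shows why it matters for regex, since in a normal string backslash-b is a backspace character rather than the word boundary anchor:

```python
import re

normal = "\tab"   # escape sequence: one tab character, then 'ab'
raw = r"\tab"     # raw string: a backslash, then 'tab'

print(len(normal))  # 3
print(len(raw))     # 4

# Without the r prefix the pattern contains a backspace character,
# so the regex engine never sees the word boundary anchor.
without_raw = re.findall("\bHa", "Ha HaHa")
with_raw = re.findall(r"\bHa", "Ha HaHa")

print(without_raw)  # []
print(with_raw)     # ['Ha', 'Ha']
```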

In [3]:
# Prefix with r to make it a raw string
pattern = re.compile(r'abc')

matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 4), match='abc'>


Span is the beginning and end index of the match. So when we used the finditer method on pattern, it found one occurrence, starting at index 1 and ending just before index 4 (the end index is exclusive). We can then use string slicing to get our match:

In [4]:
print(text_to_search[1:4])

abc
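
The same idea as a self-contained sketch, using a made-up string. A match object exposes the span directly, and slicing with it gives the same text as group(0):

```python
import re

text = "xyz abc xyz"
match = re.search(r"abc", text)

# span() returns (start, end); the end index is exclusive, like a slice.
start, end = match.span()
sliced = text[start:end]

print(start, end)      # 4 7
print(sliced)          # abc
print(match.group(0))  # abc
```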


Pull out phone numbers and add them to a list:

We can use the group method of match to get the string value that matches our expression. Remember that group 0 corresponds to our entire match.

In [18]:
matches = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}').finditer(text_to_search)
phone_numbers = [match.group(0) for match in matches]

In [19]:
phone_numbers

['321-555-4321', '123.555.1234']

Get all male names:

In [21]:
matches = re.compile(r'Mr\.?\s[A-Z]\w*').finditer(text_to_search)
male_names = [match.group(0) for match in matches]

In [22]:
male_names

['Mr. Schafer', 'Mr Smith', 'Mr. T']

Get all female names:

In [23]:
matches = re.compile(r'(Ms|Mrs)\.?\s[A-Z]\w*').finditer(text_to_search)
female_names = [match.group(0) for match in matches]

In [24]:
female_names

['Ms Davis', 'Mrs. Robinson']

Get domain names without the protocol prefix by referring to groups:

In [29]:
matches = re.compile(r'https?://(www\.)?(\w+)(\.\w+)').finditer(domain_name_text)
domain_names = [match.group(2)+match.group(3) for match in matches]

In [30]:
domain_names

['google.com', 'coreyms.com', 'youtube.com', 'nasa.gov']

We can also use back references to replace each URL with just its domain name, as in the list above. We do this with the sub method: the first argument is the replacement string, in which we refer to groups with a backslash followed by the group number, and the second argument is the string to search:

In [31]:
subbed_urls = re.compile(r'https?://(www\.)?(\w+)(\.\w+)').sub(r'\2\3', domain_name_text)

In [33]:
print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov



The findall method will return the matches as a list of strings, so let's get a list of phone numbers:

In [38]:
matches = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}').findall(text_to_search)

In [39]:
matches

['321-555-4321', '123.555.1234']
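
One thing worth knowing about findall: if the expression contains groups, it returns the groups rather than the whole match. A short sketch with the same phone numbers (the grouped pattern is my own variation, not from the videos):

```python
import re

text = "321-555-4321\n123.555.1234"

# No groups: findall returns each whole match as a string.
whole = re.findall(r"\d{3}[-.]\d{3}[-.]\d{4}", text)

# Three groups: findall returns one tuple of group values per match.
parts = re.findall(r"(\d{3})[-.](\d{3})[-.](\d{4})", text)

print(whole)  # ['321-555-4321', '123.555.1234']
print(parts)  # [('321', '555', '4321'), ('123', '555', '1234')]
```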

The match method doesn't return an iterable. Instead it tries to match the expression at the beginning of the string, and returns a match object (or None if the string does not start with a match):

In [43]:
sentence = "Start of a sentence"

In [44]:
matches = re.compile(r'Start').match(sentence)

In [45]:
print(matches)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


The search method also doesn't return an iterable. Instead it finds the first match anywhere in the string, and returns a match object (or None if there is no match):

In [48]:
matches = re.compile(r'of').search(sentence)

In [49]:
print(matches)

<_sre.SRE_Match object; span=(6, 8), match='of'>
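
The difference between match and search shows up when the text we are looking for is not at the start of the string. A small sketch using the same sentence:

```python
import re

sentence = "Start of a sentence"

# match only looks at the beginning of the string.
match_at_start = re.match(r"Start", sentence)
match_not_at_start = re.match(r"sentence", sentence)

# search scans the whole string for the first occurrence.
search_anywhere = re.search(r"sentence", sentence)

print(match_at_start.span())   # (0, 5)
print(match_not_at_start)      # None
print(search_anywhere.span())  # (11, 19)
```

Because both methods can return None, it is usually worth checking the result before calling group or span on it.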


<h3>Flags</h3>

Python allows you to pass in flags as a second parameter to the compile function, which can be quite powerful. So for example, if you wanted to ignore case in your search, rather than using a character set with both cases for every character, you can do:

In [50]:
awkward_text = "ThIs Is SOMe AwKwARD TexT"

In [58]:
matches = re.compile(r'this is some awkward text', re.IGNORECASE).search(awkward_text)

In [59]:
matches

<_sre.SRE_Match object; span=(0, 25), match='ThIs Is SOMe AwKwARD TexT'>
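
To make the effect of the flag visible, this sketch runs the same lowercase expression with and without re.IGNORECASE (the lowercase search phrase is chosen for illustration):

```python
import re

awkward_text = "ThIs Is SOMe AwKwARD TexT"

# Without the flag the lowercase pattern finds nothing.
case_sensitive = re.search(r"some awkward", awkward_text)

# With the flag, case differences are ignored.
case_insensitive = re.search(r"some awkward", awkward_text, re.IGNORECASE)

print(case_sensitive)             # None
print(case_insensitive.group(0))  # SOMe AwKwARD
```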

And that concludes the regex summary. Other useful links:

* https://www.regextester.com/
* https://regex101.com/
* https://www.debuggex.com/cheatsheet/regex/python