
# Regex


Regular expressions (regex) define search patterns for text. This makes them extremely useful for cleaning data, as we can concisely express which part of the text we want. We'll also use them when scraping websites to specify which links to follow. They have many other use cases, such as password or email validation, and are largely language agnostic: nearly every programming language has an implementation of them, and command-line tools like grep and sed support them too.

In [None]:
import re

The re module has a lot of functions; the most useful ones are briefly outlined below.

* .search - search whole string for match
* .match - match from the start of the string
* .findall - finds all matches and returns a list of strings
* .finditer - finds all matches but returns an iterator of match objects
* .sub - can be used to substitute text 
* .split - used to split using a regex
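
As a rough illustration of how these functions differ, the sketch below runs each of them on one sentence. The variable names (`text`, `found_anywhere` and so on) are made up for this example.

```python
import re

text = "the cat sat on the mat"

# .search scans the whole string; .match only tries at position 0
found_anywhere = re.search(r"cat", text)   # a match object
found_at_start = re.match(r"cat", text)    # None, the string starts with "the"

# .findall returns plain strings, .finditer yields match objects
all_matches = re.findall(r"at", text)
all_spans = [m.span() for m in re.finditer(r"at", text)]

# .sub replaces every match, .split cuts the string at every match
replaced = re.sub(r"cat", "dog", text)
pieces = re.split(r" ", text)
```

`found_at_start` is `None` because `.match` only looks at the beginning of the string, which is the usual reason to prefer `.search`.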

In [None]:
pattern = "cat" 
s = "the cat sat on the mat"

In [None]:
m = re.search(pattern,s)
m

`.search` returns a match object which we can use to obtain our match or the position of it.

In [None]:
m.group()

In [None]:
m.span()
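
The span is a `(start, end)` pair, so it can be used to slice the original string. A minimal sketch (the names `start`, `end` and `matched_text` are only for illustration):

```python
import re

s = "the cat sat on the mat"
m = re.search("cat", s)

# span() returns the same values as m.start() and m.end()
start, end = m.span()

# slicing with the span gives back the matched text
matched_text = s[start:end]
```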

# Character set

Let's say I wanted to match all of the words that end with 'at' in the sentence 'the cat sat on the mat'. I could use a character set `[]` containing the letters 'c', 's', and 'm'.

In [None]:
re.findall(r"[csm]at",s)

But maybe we want the position of the matches instead.

In [None]:
[ m.span() for m in re.finditer(r"[csm]at",s) ]

# Exclude sets 

When we use the caret `^` at the start of a character set `[]`, we'll match everything except those characters. Note the caret only has this meaning when inside a character set `[]`. Its meaning changes when used outside a character set.

In [None]:
re.findall(r"[^c]at",s) 
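
One thing worth noticing is that an exclude set still has to match one character, and that character can be a space. The sentence below is a made-up variation of the one above, used only to show this.

```python
import re

sentence = "the cat sat at the mat"

# [^c] must consume exactly one character that is not 'c'.
# The space before the standalone "at" qualifies, so " at" is matched too.
not_c = re.findall(r"[^c]at", sentence)
# → ['sat', ' at', 'mat']
```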

# Ranges

Similar to the caret, the hyphen **`-`** has a special meaning when used inside a character set. It allows us to specify a range of letters or numbers. We can use ranges to specify all of the digits from 5 to 9 or all letters from a to d.

In [None]:
import string

In [None]:
string.ascii_letters

In [None]:
alphabet = string.ascii_letters
alphabet

In [None]:
re.search("[a-z]+",alphabet) #only matches lower case a-z

In [None]:
re.search("[e-z]+",alphabet)

In [None]:
re.search("[f-zA-C]+",alphabet) #match lower case f to z and upper case A to C
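
The digit range 5 to 9 and the letter range a to d mentioned above could look like this. The string `codes` is invented for the example.

```python
import re

codes = "a1 b5 c7 d9 e3 f6"

# digits from 5 to 9
high_digits = re.findall(r"[5-9]", codes)

# letters from a to d
a_to_d = re.findall(r"[a-d]", codes)

# two sets side by side: a letter a-d followed by a digit 5-9
pairs = re.findall(r"[a-d][5-9]", codes)
```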

# Either
 
The pipe `|` can be used to match either word (or character).

In [None]:
s1 = "The rainbow has many colors. Like, a LOT of colours."

In [None]:
re.findall('color|colour',s1)
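
The pipe applies to everything on either side of it, so parentheses are needed to limit it to part of a pattern. The sketch below uses `(?:...)`, a non-capturing group that is not covered elsewhere in this notebook, so treat it as an optional extra.

```python
import re

s1 = "The rainbow has many colors. Like, a LOT of colours."

# The pipe splits the whole pattern: "color" OR "colour"
whole_words = re.findall(r"color|colour", s1)

# (?:r|ur) restricts the choice to the ending, and because the group
# is non-capturing, findall still returns the full match
with_plural = re.findall(r"colo(?:r|ur)s", s1)
```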

# Quantifiers


Quantifiers are used to specify how many times we want to match something.

* `*` - zero or more times, e.g. ca*t --> matches 'ct', 'cat', 'caat', 'caaat' etc.
* `+` - one or more times, e.g. ca+t --> matches 'cat', 'caat', 'caaat' etc.
* `?` - zero or one time (makes the preceding character optional)
* `{n}` - match exactly n times
* `{n,}` - match at least n times
* `{n,m}` - match between n and m times
* `\` - escapes the next character. So that you can match the special characters e.g. [ ] ( ) . * + ? ^ $ \ |
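
To see the quantifiers side by side, the sketch below tests each one against the same four strings. The helper `matching` and the list `candidates` are hypothetical names for this example, and `re.fullmatch` is used so that a string only counts when the whole of it fits the pattern.

```python
import re

candidates = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

star = matching(r"ca*t")            # zero or more 'a'
plus = matching(r"ca+t")            # one or more 'a'
optional = matching(r"ca?t")        # zero or one 'a'
exactly_two = matching(r"ca{2}t")   # exactly two 'a'
two_or_more = matching(r"ca{2,}t")  # at least two 'a'
one_to_two = matching(r"ca{1,2}t")  # between one and two 'a'

# A backslash makes '$' and '.' literal instead of special
price = re.search(r"\$[0-9]+\.[0-9]{2}", "Total: $12.50").group()
```

Note that the braces are written without spaces: in Python, `a{ 2 }` is treated as literal text rather than a quantifier.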

In [None]:
s = "The rainbow has many colors but not the colour silver"
re.findall('colou?r',s)

# Meta Chars

Here are some more symbols with special meanings.

* `\d` - match any digit same as [0-9]
* `\D` - anything but digits, aka [^0-9]
* `\w` - match any word char (a-z, A-Z, 0-9 and _'s)
* `\W` - match any non-word char
* `\s` - match white space (spaces, tabs...)
* `\S` - match non-white space
* `\t` - match tab only
* `.`  - match any single character except a line break


In [None]:
s = "Number of bookmarks: 99 "
re.search(r'\d+', s)

In [None]:
re.search(r'(\w+)\s\w+',s)

In [None]:
re.search(r'\D+',s)
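
The cells above cover `\d`, `\D`, `\w` and `\s`. A short sketch of the remaining symbols, using an invented string `line` that contains a tab:

```python
import re

line = "id:\t42 ok!"

digits = re.findall(r"\d", line)            # every single digit
non_space_runs = re.findall(r"\S+", line)   # runs of non-whitespace
non_word = re.findall(r"\W", line)          # ':', the tab, the space and '!'
tab_count = len(re.findall(r"\t", line))    # tabs only

# '.' stops at a line break, so only the first line is matched
first_line = re.search(r".+", "first\nsecond").group()
```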

# Anchors

Anchors allow us to specify where in the text we want the match to be.

* `^` - matches at the start of the string
* `$` - matches at the end of the string
* `\b` - word boundary - the position between a word character (`\w`) and a non-word character, or the start or end of the string


In [None]:
s = '700^4'
re.search(r'\d\d\^\d', s )

Using ^:
1. [^x] - exclude x in search
2. '\^', or [x^] - look for the literal character '^'
3. '^x' - look for a string starting with the character x
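
The three uses of the caret listed above can be checked quickly. The test string `expr` is made up for this sketch.

```python
import re

expr = "x^2 and y^2"

# 1. inside a set and first: exclude x
not_x = re.findall(r"[^x]", "xyz")

# 2. escaped, or inside a set but not first: a literal caret
escaped = re.findall(r"\^", expr)
in_set = re.findall(r"[x^]", expr)

# 3. outside a set: anchor to the start of the string
starts_with_x = re.search(r"^x", expr)
starts_with_y = re.search(r"^y", expr)   # None, 'y' is not at the start
```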

In [None]:
s = "4252345adesft"
re.search(r'^\d+',s) #only matches the digits at the very start of the string
#using the ^ in different ways

In [None]:
s = "Hello?"
re.search(r'^\w+\?$',s)

In [None]:
s = "Hello? There"
re.search(r'^\w+\w\?',s) #looking for the question mark, NOT for end of string

In [None]:
s = "This island is beautiful."
re.search('is',s) #wrong is :/ (this matches the 'is' inside 'This')

In [None]:
re.search(r'\bis\b',s) #the right is :)
#Why do we use r? It marks a raw string. Without it, Python interprets \b as a backspace character
#For more info on this, see: https://docs.python.org/2/reference/lexical_analysis.html#string-literals
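
The effect of the `r` prefix is easy to check by comparing the two strings directly. This is only a small sketch; the names `plain` and `raw` are illustrative.

```python
import re

plain = "\bis\b"   # Python turns each \b into one backspace character
raw = r"\bis\b"    # backslashes are kept, so the regex engine sees \b

s = "This island is beautiful."

# The plain pattern looks for backspace + "is" + backspace and finds nothing
plain_match = re.search(plain, s)

# The raw pattern finds the standalone word "is"
raw_match = re.search(raw, s)
```

Here `plain` is 4 characters long while `raw` is 6, which is why the two patterns behave so differently.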

# Groups

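Anything contained within `()` is a group. Groups allow us to easily break up our pattern into separate parts, and the text matched by each part can be retrieved from the match object. The parts still have to be matched in the exact order in which they appear in the pattern.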

In [None]:
s = "a great string"
m = re.search(r'(\w+)\s(\w+)\s(\w+)',s)
m
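
Each group can be pulled out of the match object by its number, counting opening parentheses from the left. A minimal sketch using the same string:

```python
import re

s = "a great string"
m = re.search(r"(\w+)\s(\w+)\s(\w+)", s)

whole = m.group(0)       # group 0 is always the entire match
second = m.group(2)      # text captured by the second group
all_groups = m.groups()  # tuple with every captured group
first_span = m.span(1)   # position of the first group in s
```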

#  Backreferences

Once we've defined a matching group, we can use backreferences to refer back to them. For example if we've defined a single matching group, we can use `\1` to match the same text captured by that group.

In [None]:
s ='foo bar foo boo car'
m = re.findall(r'(.ar)\1*',s) #. matches any character except a line break; the raw string keeps \1 as a backreference
m
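
A common use of backreferences is spotting a word that has been typed twice in a row. The sketch below is one way to do that; the sentence `text` is invented, and using `\1` in the replacement string of `re.sub` goes slightly beyond what is described above.

```python
import re

text = "this is is a test test of the the parser"

# (\w+) captures a word, \s+ skips the gap, \1 demands the same word again
doubled = re.findall(r"\b(\w+)\s+\1\b", text)

# In re.sub, \1 in the replacement inserts the captured word once
cleaned = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)
```

The `\b` anchors keep the pattern from pairing the "is" at the end of "this" with the "is" that follows it.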

#  Lazy Operator

Normally regular expressions are greedy, meaning that a quantifier will grab as much text as it can while still allowing the pattern to match. However, sometimes we don't want this behaviour. We can put `?` directly after a quantifier (for example `+?` or `*?`) to make it lazy. This means that the regex will grab as little as possible to make the match.

In [None]:
import re
s = "https://www.openrice.com/en/hongkong/restaurants/cuisine/thai/resturant-name/"

In [None]:
m = re.search('cuisine/(.+)/',s)
m #without the lazy operator

In [None]:
m = re.search('cuisine/(.+?)/',s)
m

In [None]:
s= "The fat cat sat on the mat."

In [None]:
m=re.search('(.*at)',s) #without the lazy operator
m

In [None]:
m=re.search('(.*?at)',s)
m
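
Another place where the difference shows up is text with repeated delimiters, such as markup tags. The string `html` below is a made-up example.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs to the last '>' in the string, giving one long match
greedy = re.findall(r"<.+>", html)

# Lazy: .+? stops at the first '>' it can, giving one match per tag
lazy = re.findall(r"<.+?>", html)
```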

# Exercises

To get more familiar with regex, please play one of the following games:

* [Regex Golf](https://alf.nu/RegexGolf)
* [Regex Crossword Game](https://regexcrossword.com/)

# Resources

Some extra resources

* Breaking down Regex expressions: https://regexr.com/
* [Regular Expression Info](https://www.regular-expressions.info/)
* [Regex cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)
* [Net Ninja Regex Video Tutorials](https://www.youtube.com/watch?v=r6I-Ahc0HB4&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD)
* [Coding Train Regex Video Tutorials](https://www.youtube.com/watch?v=7DG3kCDx53c&list=PLRqwX-V7Uu6YEypLuls7iidwHMdCM6o2w)
