# Regular Expressions

Regular expressions (regex) are patterns for searching strings, and they are more powerful than the string methods we have learned so far in this class. If you are working with large strings for your final project, you may find some basic regex searching tools useful.

In [None]:
from datascience import *

We already know how to do some basic string searching within tables or strings:

In [None]:
t = Table().with_columns([
   'Film', ['Inception', 'Eternal Sunshine of the Spotless Mind', 'Memento', 'The Last Five Years', "Sophie's World", 'Groundhog Day', 'Palm Springs'],
   'Year', [2010, 2004, 2000, 2014, 1999, 1993, 2020],
])
t

In [None]:
t.where('Film', are.containing('the'))

In [None]:
t.where('Film', are.not_containing('the'))

In [None]:
s = "Colourless green ideas sleep furiously."

In [None]:
s.find('green')

In [None]:
s[11:16]

Regular expressions, available through Python's built-in `re` module, can do this kind of searching and much more.

In [None]:
import re

## Basic matches
The `search` function returns a `Match` object if the pattern matches anywhere in the string. If the pattern is not in the string, it returns `None` (and the notebook prints no output).

The syntax is `re.search(pattern, string)`. The `string` argument can be a string literal or a variable that holds a string.

In [None]:
re.search('green',s)

In [None]:
# The 'span' in the output is a tuple giving the start and end indices of the match in the string
s[11:16]

In [None]:
# The 'match' in the output is the substring that matched
re.search('a','a')

In [None]:
# Searches are case sensitive
re.search('a','A')

In [None]:
# You can assign a Match object to a variable
match = re.search('green',s)
print(match.span())
print(match.group())

In [None]:
# Failed searches return None, so the notebook shows no output
re.search('b','a')
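
Because a failed search returns `None`, calling `.span()` or `.group()` on the result raises an `AttributeError`. One way to guard against that is sketched below; the helper name `find_span` is hypothetical and not part of `re`.

```python
import re

s = "Colourless green ideas sleep furiously."

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if nothing matches.

    find_span is a hypothetical helper written for this example.
    """
    match = re.search(pattern, text)
    if match is None:
        # No match: there is no Match object to call .span() on
        return None
    return match.span()

print(find_span('green', s))   # (11, 16)
print(find_span('purple', s))  # None
```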

In [None]:
# re.search reports only the first match, even if there are several
print(re.search('e',s))
print(s[7:])

## Negative matches and expanded searches

Regex is powerful because special characters such as `[]`, `^`, `-`, and `.` let a single pattern match many strings, not just the literal text you typed.

In [None]:
# Search for any single character that is not 'a'
re.search('[^a]',s)

In [None]:
re.search('[^a]','aaa')

In [None]:
re.search('[^a]','aaab')

In [None]:
# Search for any single lowercase letter from 'a' to 'z'
re.search('[a-z]', '12345x')

In [None]:
# It's case-sensitive!
re.search('[a-z]', '12345X')

In [None]:
re.search('[A-Z]', '12345X')

In [None]:
# Search for "a" or "b"
re.search('[ab]', 'abc')

In [None]:
re.search('[ab]', 'bcd')

In [None]:
# Search for non-ASCII characters
re.search('ñ','piñata')

In [None]:
re.search('n','piñata')
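
Ranges can also be combined inside one pair of brackets, and the `re.IGNORECASE` flag offers another way around case sensitivity. The sketch below shows both; the variable names are chosen only for this example.

```python
import re

# [A-Za-z] combines two ranges, so it matches one letter of either case
either_case = re.search('[A-Za-z]', '12345X')
print(either_case)

# The re.IGNORECASE flag lets a lowercase range match uppercase letters too
ignore_case = re.search('[a-z]', '12345X', flags=re.IGNORECASE)
print(ignore_case)

# ^ inside brackets negates a range: the first character that is not a digit
non_digit = re.search('[^0-9]', '12345x')
print(non_digit)
```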

## Multiple matches
Use `re.findall()` to return a list of all matching strings, not just the `Match` object for the first match.

In [None]:
re.findall('[ab]', 'abc')

In [None]:
re.findall('e',s)

In [None]:
len(re.findall('e',s))
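
`re.findall()` returns only the matched text. If you also need to know where each match is, `re.finditer()` yields one `Match` object per match. A minimal sketch, with `positions` as an illustrative name:

```python
import re

s = "Colourless green ideas sleep furiously."

# Collect the start index of every 'e' in the string
positions = [m.start() for m in re.finditer('e', s)]
print(positions)  # [7, 13, 14, 19, 25, 26]
```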

## Wildcard searching

You can use the special character `.` to stand in for any single character except a newline. To match a literal period, escape it with a backslash (`\.`).

In [None]:
re.search('e..o','heyyo')

In [None]:
re.search('e..o','helicoptero')

In [None]:
d = 'Mrs. and Mr. Benoit Blanc'

In [None]:
# The r prefix makes a raw string, so the backslash reaches re unchanged
re.search(r'Mr\.',d)

In [None]:
ko = ['Mr. Harlan Thrombey', 'Ms. Marta Cabrera', 'Mr. Benoit Blanc', 'Mrs. Linda Drysdale', 'Detective Lieutenant Elliott', '[literal string]']

In [None]:
ko

In [None]:
for name in ko:
    print(re.search(r'Mr\.',name))

In [None]:
for name in ko:
    print(re.search(r'\[..............\]', name))

Common escape sequences:

* `\\` escape a backslash
* `\[` escape a bracket
* `\{` escape a curly brace
* `\.` escape a period
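
To see why escaping matters, the sketch below compares an unescaped and an escaped period on the same string. It also uses `re.escape()`, which builds an escaped pattern from a literal string so you do not have to add the backslashes by hand. The `text` string is made up for this example.

```python
import re

d = 'Mrs. and Mr. Benoit Blanc'

# Unescaped, '.' is a wildcard, so 'Mr.' matches 'Mrs' at the start
unescaped = re.search(r'Mr.', d)
print(unescaped.group())  # Mrs

# Escaped, '\.' matches only a literal period
escaped = re.search(r'Mr\.', d)
print(escaped.group())    # Mr.

# re.escape adds the backslashes needed to match a string literally
literal = '[literal string]'
text = 'The list contains [literal string] as its last item'
found = re.search(re.escape(literal), text)
print(found.group())      # [literal string]
```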

## Anchor characters for beginnings and ends of strings

The special character `^` will look for the pattern only at the beginning of a string, while the special character `$` will only look for it at the end.

In [None]:
print(s)
z = 'Greenless colourful sleeps furiously idealize.'
print(z)

In [None]:
# Search for 'Colour' at the beginning of a string
print(re.search('^Colour',s))
print(re.search('^Colour',z))

In [None]:
# Search for 'furiously.' at the end of a string
print(re.search(r'furiously\.$', s))
print(re.search(r'furiously\.$', z))
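
Anchors are handy for filtering a list of strings. The sketch below reuses the list of names from the wildcard section; `misters` and `ends_with_c` are illustrative variable names.

```python
import re

ko = ['Mr. Harlan Thrombey', 'Ms. Marta Cabrera', 'Mr. Benoit Blanc',
      'Mrs. Linda Drysdale', 'Detective Lieutenant Elliott', '[literal string]']

# Keep names that start with 'Mr.' ('Mrs.' fails because 's' is not a period)
misters = [name for name in ko if re.search(r'^Mr\.', name)]
print(misters)

# Keep names whose last character is 'c'
ends_with_c = [name for name in ko if re.search(r'c$', name)]
print(ends_with_c)
```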

## Kleene star, plus, and question mark

The special characters `*`, `+`, and `?` are quantifiers that control how many times the preceding character may repeat: `*` matches zero or more, `+` matches one or more, and `?` matches zero or one.

In [None]:
# Search for 0 or more 'a' followed by 'h'
re.search('a*h', 'h')

In [None]:
re.search('a*h', 'ah')

In [None]:
re.search('a*h', 'aaaaaaaaaaaah')

In [None]:
re.search('a*h', 'aaaaaahaaaaaah')

In [None]:
re.findall('a*h', 'aaaaaahaaaaaah')

In [None]:
# Search for 1 or more 'a' followed by 'h'
re.search('a+h', 'h')

In [None]:
re.search('a+h', 'ah')

In [None]:
# Search for 0 or 1 'u' after 'Colo' and before 'r'
re.search('Colou?r',s)

In [None]:
s_american = "Colorless green ideas"
re.search('Colou?r', s_american)

In [None]:
# More than one 'u' is too many for '?', so this search returns None
re.search('Colou?r', 'Colouuuuuuuuur')
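
It can help to run all three quantifiers against the same inputs. The sketch below does that with a small made-up list of spellings and records which ones each pattern matches.

```python
import re

# Made-up test strings with zero, one, and two 'u' characters
candidates = ['Color', 'Colour', 'Colouur']

results = {}
for pattern in ['Colou*r', 'Colou+r', 'Colou?r']:
    # Keep the candidates in which the pattern is found
    results[pattern] = [c for c in candidates if re.search(pattern, c)]
    print(pattern, results[pattern])

# Colou*r ['Color', 'Colour', 'Colouur']
# Colou+r ['Colour', 'Colouur']
# Colou?r ['Color', 'Colour']
```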

## Pipes and parenthetical groups

The pipe character `|` is another way to indicate "or", and it works for longer string patterns. Parentheses limit how far the "or" reaches:

In [None]:
# Search for 'Color' or 'Colour'
re.search('Color|Colour',s)

In [None]:
# Search for 'Colo', then either nothing or 'u', then 'r'
re.search('Colo(|u)r',s)

In [None]:
# Search for 'Colo' or 'ur'
re.search('Colo|ur',s)
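
Parentheses also capture the text they match, which you can read back with `match.group(1)`. The sketch below shows this, plus a pipe inside a group; the word list is made up for this example.

```python
import re

s = "Colourless green ideas sleep furiously."

# group(0) is the whole match; group(1) is what the parentheses matched
match = re.search('Colo(u?)r', s)
print(match.group(0))  # Colour
print(match.group(1))  # u

# With the American spelling the group matches an empty string
american = re.search('Colo(u?)r', 'Colorless green ideas')
print(repr(american.group(1)))  # ''

# The pipe applies only inside the parentheses: 'gr', then 'e' or 'a', then 'y'
words = ['grey', 'gray', 'groy']
greys = [w for w in words if re.search('gr(e|a)y', w)]
print(greys)  # ['grey', 'gray']
```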

Finally, remember that `*` and `+` are "greedy" quantifiers: they match the longest substring that fits the pattern.

In [None]:
re.findall('e.*s',s)

But you can make them non-greedy by using `*?` and `+?`, which match the shortest substring that fits the pattern.

In [None]:
re.findall('e.*?s',s)
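
The difference is easiest to see on text with repeated delimiters. The sketch below uses a made-up string with angle-bracket tags: the greedy pattern runs from the first `<` to the last `>`, while the non-greedy one stops at the nearest `>`.

```python
import re

# Made-up example string with four short tags
tagged = '<b>bold</b> and <i>italic</i>'

# Greedy: one long match from the first '<' to the last '>'
greedy = re.findall('<.*>', tagged)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: each match ends at the nearest '>'
lazy = re.findall('<.*?>', tagged)
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```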

That's it! Go to [pythex.org](https://pythex.org) to practice your regular expressions if you think you'll find them useful for your final project!