# Regular Expressions
![Trial and Error](https://pbs.twimg.com/media/Cr7mS_OWcAA7Hzt.jpg)
[@ThePracticalDev](https://twitter.com/thepracticaldev/status/774309983467016193)

# The simplest regex is to just match literal text.
# Match objects can be a bit fiddly, though.

In [1]:
import re

In [2]:
pangram = "The quick, brown fox jumps over the lazy dog."

In [3]:
dog = re.compile('dog')
dog_match = re.search(dog,pangram)
print(dog_match.span(), dog_match.group(0))

(41, 44) dog
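
Part of the fiddliness is that `re.search` returns `None` when nothing matches, so calling `.span()` on the result raises an `AttributeError`. A minimal sketch of a guard, assuming a hypothetical helper called `describe_match` (it is not part of `re`):

```python
import re

pangram = "The quick, brown fox jumps over the lazy dog."

def describe_match(pattern, text):
    """Hypothetical helper: return (start, end, matched_text), or None if absent."""
    match = re.search(pattern, text)
    if match is None:  # re.search signals "no match" with None
        return None
    return match.start(), match.end(), match.group(0)

dog_result = describe_match("dog", pangram)
cat_result = describe_match("cat", pangram)

print(dog_result)  # (41, 44, 'dog')
print(cat_result)  # None
```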


# "re.findall" is easier to use most of the time.

In [4]:
re.findall('fox',pangram)

['fox']
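
`re.findall` throws the positions away, though. When you need both the text and where it was found, `re.finditer` yields match objects one at a time. A small sketch (the `"he"` pattern is only an illustrative choice):

```python
import re

pangram = "The quick, brown fox jumps over the lazy dog."

# findall returns plain strings; finditer returns match objects with positions.
texts = re.findall("he", pangram)
positions = [(m.start(), m.group(0)) for m in re.finditer("he", pangram)]

print(texts)      # ['he', 'he']
print(positions)  # [(1, 'he'), (33, 'he')]
```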

# Character classes are very powerful
# What night buses can you catch from The Strand?

In [5]:
strand_busses = """Southampton Street / Covent Garden. Bus routes. 6; 9; 11; 15; 23; 87; 91; 139; 176; N9; N11;
N15; N21; N26; N44; N87; N89; N91; N155; N199; N343; N550; N551. Bus stop A. Waterloo Bridge / South Bank.
Bus routes. 1; 4; 26; 59; 68; 76; 139; 168; 171; 172; 176; 188; 243; 341; 521; N1; N68; N171; N343; RV1; X68."""

In [6]:
print(re.findall('N[0-9]',strand_busses))

['N9', 'N1', 'N1', 'N2', 'N2', 'N4', 'N8', 'N8', 'N9', 'N1', 'N1', 'N3', 'N5', 'N5', 'N1', 'N6', 'N1', 'N3']


# This doesn't quite work. We've only matched "N" and one digit. 

# "+" means that we match one or more digits.

In [7]:
print(re.findall('N[0-9]+',strand_busses))

['N9', 'N11', 'N15', 'N21', 'N26', 'N44', 'N87', 'N89', 'N91', 'N155', 'N199', 'N343', 'N550', 'N551', 'N1', 'N68', 'N171', 'N343']
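
"+" has two close relatives, "*" and "?". The sketch below runs all three on a made-up route string (`routes` is an assumption for illustration, not the real bus data) so the difference is visible side by side:

```python
import re

routes = "N N9 N15 N155"

zero_or_more = re.findall("N[0-9]*", routes)  # "*": zero or more digits
one_or_more = re.findall("N[0-9]+", routes)   # "+": one or more digits
zero_or_one = re.findall("N[0-9]?", routes)   # "?": at most one digit

print(zero_or_more)  # ['N', 'N9', 'N15', 'N155']
print(one_or_more)   # ['N9', 'N15', 'N155']
print(zero_or_one)   # ['N', 'N9', 'N1', 'N1']
```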


# "^" inverts the character class.
# Match one or more digits, preceded by a character that *isn't* "N". That character becomes part of the match.

In [8]:
print(re.findall('[^N][0-9]+',strand_busses))

[' 6', ' 9', ' 11', ' 15', ' 23', ' 87', ' 91', ' 139', ' 176', '11', '15', '21', '26', '44', '87', '89', '91', '155', '199', '343', '550', '551', ' 1', ' 4', ' 26', ' 59', ' 68', ' 76', ' 139', ' 168', ' 171', ' 172', ' 176', ' 188', ' 243', ' 341', ' 521', '68', '171', '343', 'V1', 'X68']
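
The stray spaces and the chopped-up night routes appear because `[^N]` has to consume one character, whatever it is. One possible alternative, sketched on a shortened, made-up route list, is the zero-width word boundary `\b`:

```python
import re

routes = "6; 9; 11; N9; N11; RV1; X68"

# [^N] must consume one character: it swallows the leading space, takes the
# first digit of "N11" as its non-N character, and cannot match the "6" at
# the very start of the string because nothing precedes it.
with_negated_class = re.findall(r"[^N][0-9]+", routes)

# \b consumes nothing: it only requires that no word character (letter,
# digit or underscore) sits directly next to the run of digits.
with_word_boundary = re.findall(r"\b[0-9]+\b", routes)

print(with_negated_class)  # [' 9', ' 11', '11', 'V1', 'X68']
print(with_word_boundary)  # ['6', '9', '11']
```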


# Match capitalized words:

In [9]:
print(re.findall('[A-Z][a-z]+',strand_busses))

['Southampton', 'Street', 'Covent', 'Garden', 'Bus', 'Bus', 'Waterloo', 'Bridge', 'South', 'Bank', 'Bus']


# "ding" is a word, but did we want to match it? 

In [10]:
print(re.findall('[bd][a-z]+',"Open the bottle, I'm on the brink of needing a drink!"))
print(re.findall('[bd]r[a-z]+',"Open the bottle, I'm on the brink of needing a drink!"))

['bottle', 'brink', 'ding', 'drink']
['brink', 'drink']


# "\s" matches any whitespace character (spaces, tabs, newlines).
# Great. We're stuck with leading spaces.

In [11]:
print(re.findall(r'\s[bd][a-z]+',"Open the bottle, I'm on the brink of needing a drink!"))

[' bottle', ' brink', ' drink']
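
Note that `\s` is not limited to the space character. A quick sketch on two made-up strings, showing that tabs and newlines count too:

```python
import re

# \s matches each whitespace character individually.
kinds = re.findall(r"\s", "a b\tc\nd")

# \s+ matches a whole run of whitespace, which makes it handy for splitting.
words = re.split(r"\s+", "Open  the\tbottle\nnow")

print(kinds)  # [' ', '\t', '\n']
print(words)  # ['Open', 'the', 'bottle', 'now']
```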


# This one's complicated. "|" works like a boolean OR.
# [a-z]{2} matches exactly two lowercase letters.

In [12]:
print(re.findall('first|second|third|[1-3][a-z]{2}',"""They came in first and second,
You get a medal if you come in 1st, 2nd or 3rd place, but you only came 4th."""))

['first', 'second', '1st', '2nd', '3rd']
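
One catch with "|": it splits the *whole* pattern, not just the characters next to it. To limit the alternation to part of a pattern you need a group, and with `re.findall` a capturing group changes what comes back. A small sketch on a made-up string:

```python
import re

text = "gray grey"

loose = re.findall(r"gra|ey", text)        # "gra" OR "ey", nothing more
grouped = re.findall(r"gr(?:a|e)y", text)  # (?:...) groups without capturing
captured = re.findall(r"gr(a|e)y", text)   # findall returns only the group

print(loose)     # ['gra', 'ey']
print(grouped)   # ['gray', 'grey']
print(captured)  # ['a', 'e']
```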


# "\d" is shorthand for "[0-9]".

In [13]:
print(re.findall(r'[\d]{4}',"There was a screening of 2001 on 20/01 2001 at 20:01 "))

['2001', '2001']
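
Strictly speaking, for Python 3 `str` patterns `\d` is a little broader than `[0-9]`: it also matches decimal digits from other scripts unless you pass `re.ASCII`. The sketch below uses Arabic-Indic digits, written as escapes, purely as an illustration:

```python
import re

# "2001" in Arabic-Indic digits, followed by the ASCII version.
text = "\u0662\u0660\u0660\u0661 2001"

unicode_digits = re.findall(r"\d{4}", text)               # any Unicode decimal digit
ascii_only = re.findall(r"\d{4}", text, flags=re.ASCII)   # restrict \d to 0-9
explicit_class = re.findall(r"[0-9]{4}", text)            # always 0-9 only

print(len(unicode_digits))  # 2
print(ascii_only)           # ['2001']
print(explicit_class)       # ['2001']
```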


# What if you want a "-" in your character class?
# -Welcome to backslash hell!
# {3,} matches 3 or more characters belonging to the class.

In [14]:
print(re.findall(r'[0-9\-]{3,}=[\d]+','1+1=5, 9-11=74'))

['9-11=74']
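
There is a way to dodge the backslash here: a "-" placed first or last in a character class cannot form a range, so it is taken literally. A short sketch comparing the three spellings on the same input:

```python
import re

sums = '1+1=5, 9-11=74'

escaped = re.findall(r'[0-9\-]{3,}=\d+', sums)  # backslash-escaped hyphen
trailing = re.findall(r'[0-9-]{3,}=\d+', sums)  # hyphen last: literal
leading = re.findall(r'[-0-9]{3,}=\d+', sums)   # hyphen first: literal

print(escaped)   # ['9-11=74']
print(trailing)  # ['9-11=74']
print(leading)   # ['9-11=74']
```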


# Can we match 2 and 4-digit years?
# [\d]{2,4} matches 2 to 4 digits.
# Not quite there yet!

In [15]:
print(re.findall(r'[\d]{2,4}',"""Back in '99 and 2000, Prince's "1999" was played 1000000s of times"""))

['99', '2000', '1999', '1000', '000']


# A leading 1-9 fixes the match on "000".
# Pity about the year 1000

In [16]:
print(re.findall(r'[1-9][\d]{1,3}',"""Back in '99 and 2000, Prince's "1999" was played 1000000s of times"""))

['99', '2000', '1999', '1000']


# We make sure the match ends in a non-digit.
# No more year 1000!
# Pity there's all that junk at the end!

In [17]:
print(re.findall(r'[1-9][\d]{1,3}[^\d]',"""Back in '99 and 2000, Prince's "1999" was played 1000000s of times"""))

['99 ', '2000,', '1999"']


# Finally!
# "(?=[^\d])" is a "non-capturing look-ahead": it checks that a non-digit follows without adding it to the match. (Of *course* it is.)

In [18]:
print(re.findall(r'[1-9][\d]{1,3}(?=[^\d])',"""Back in '99 and 2000, Prince's "1999" was played 1000000s of times"""))

['99', '2000', '1999']
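
One remaining gap: `(?=[^\d])` still needs *some* character after the year, so a year at the very end of the string slips through. A possible fix, sketched below, is the negative look-ahead `(?!\d)`, which only demands that no digit follows. The `ends_with_year` string is a made-up test case:

```python
import re

lyric = """Back in '99 and 2000, Prince's "1999" was played 1000000s of times"""
ends_with_year = "Prince's 1999 came out in 1982"

positive = r"[1-9]\d{1,3}(?=[^\d])"  # a non-digit character must follow
negative = r"[1-9]\d{1,3}(?!\d)"     # just "no digit follows"; end of string is fine

missed_last_year = re.findall(positive, ends_with_year)
caught_last_year = re.findall(negative, ends_with_year)
same_as_before = re.findall(negative, lyric)

print(missed_last_year)  # ['1999']
print(caught_last_year)  # ['1999', '1982']
print(same_as_before)    # ['99', '2000', '1999']
```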


# Now we can solve the horrid leading spaces from before!
# "(?<=[^a-z])" is a "non-capturing look-behind". (Of *course* it is.)

In [19]:
print(re.findall('(?<=[^a-z])[bd][a-z]+',"Open the bottle, I'm on the brink of needing a drink"))

['bottle', 'brink', 'drink']
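
The `(?<=[^a-z])` assertion above needs a character in front of the word, so a word at the very start of the string is missed. Two possible alternatives, sketched on a made-up sentence, are a negative look-behind and the word boundary `\b`:

```python
import re

order = "drink up, then open the bottle"

# Needs some non-lowercase character before the word: misses "drink" at index 0.
positive_lookbehind = re.findall(r"(?<=[^a-z])[bd][a-z]+", order)

# Only needs "no lowercase letter before": start of string is fine.
negative_lookbehind = re.findall(r"(?<![a-z])[bd][a-z]+", order)

# \b matches at the start of a word, including the start of the string.
word_boundary = re.findall(r"\b[bd][a-z]+", order)

print(positive_lookbehind)  # ['bottle']
print(negative_lookbehind)  # ['drink', 'bottle']
print(word_boundary)        # ['drink', 'bottle']
```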


![Googling the Regex](https://pbs.twimg.com/media/Cn1rWcbWcAAgsCA.jpg)
[@ThePracticalDev](https://twitter.com/thepracticaldev/status/755893414890209280)