# Regular Expressions

http://www.pyregex.com

This tutorial is based on Chapter 7, "Pattern Matching with Regular Expressions", of the book *Automate the Boring Stuff with Python* by Al Sweigart.



*Regular expressions* allow you to specify a pattern of text to search for.

Regular expressions are huge time-savers, not just for software users but also for programmers. In fact, tech writer Cory Doctorow argues that even before teaching programming, we should be teaching regular expressions:

> Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.
https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions


## Finding Patterns of Text with Regular Expressions

Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, a `\d` in a regex stands for a digit character, that is, any single numeral `0` to `9` (for Python 3 strings, `\d` also matches decimal digits from other scripts unless the `re.ASCII` flag is set). The regex `\d\d \d\d \d\d \d\d` could be used by Python to match a Danish telephone number written as eight digits in four pairs, each pair separated by a single space.

But regular expressions can be much more sophisticated. For example, adding a `2` in curly brackets (`{2}`) after a pattern is like saying, "Match this pattern two times." So the regex `\d{2} \d{2} \d{2} \d{2}` also matches the correct phone number format. It could be shortened even more to `(\d{2} ){3}\d{2}`.
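
As a quick check of the claim above, the following sketch tests all three spellings of the pattern against one phone number from this tutorial. It uses `re.fullmatch()` from Python's standard `re` module, which this tutorial does not otherwise cover; the names `patterns`, `sample`, and `results` are made up for illustration.

```python
import re

# Three ways to write "four pairs of digits separated by single spaces".
patterns = [
    r'\d\d \d\d \d\d \d\d',
    r'\d{2} \d{2} \d{2} \d{2}',
    r'(\d{2} ){3}\d{2}',
]

sample = '20 86 46 44'  # phone number taken from the examples in this tutorial

# fullmatch() succeeds only if the whole string fits the pattern.
results = [re.fullmatch(p, sample) is not None for p in patterns]
print(results)  # [True, True, True]
```

Note that the shortest form adds a pair of parentheses, so it also creates a group; grouping is covered further down.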

### Creating Regex Objects

All the regex functions in Python are in the `re` module.

Passing a string value representing your regular expression to `re.compile()` returns a Regex pattern object (or simply, a Regex object). Because regular expressions frequently contain backslashes, it is convenient to pass raw strings to `re.compile()` instead of typing extra backslashes. Typing `r'\d{2} \d{2} \d{2} \d{2}'` is much easier than typing `'\\d{2} \\d{2} \\d{2} \\d{2}'`.
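
To make the raw-string point concrete, here is a small sketch comparing the two spellings. The variable names `raw`, `escaped`, and `same_string` are illustrative only.

```python
raw = r'\d{2} \d{2} \d{2} \d{2}'
escaped = '\\d{2} \\d{2} \\d{2} \\d{2}'

# Both literals produce the same string, so they describe the same pattern.
same_string = raw == escaped
print(same_string)  # True

# Without the r prefix, '\n' is one newline character;
# with it, the string holds a backslash followed by the letter n.
print(len('\n'), len(r'\n'))  # 1 2
```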




In [7]:
import re


phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

re.compile(r'\d{2} \d{2} \d{2} \d{2}', re.UNICODE)

### Matching Regex Objects

A Regex object’s `search()` method searches the string it is passed for any matches to the regex. The `search()` method will return `None` if the regex pattern is not found in the string. If the pattern is found, the `search()` method returns a `Match` object. `Match` objects have a `group()` method that will return the actual matched text from the searched string.

In [6]:
import re


phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

address_entry = """Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F"""

mo = phone_num_reg.search(address_entry)
mo.group()



'20 86 46 44'
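
Because `search()` returns `None` when nothing matches, calling `group()` on the result without checking raises an `AttributeError`. A minimal sketch of a guarded lookup follows; the helper name `first_phone_number` is hypothetical and not part of the `re` module.

```python
import re

phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')


def first_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phone_num_reg.search(text)
    # search() returns None when nothing matches, so check before calling group().
    if mo is None:
        return None
    return mo.group()


print(first_phone_number('Møller\n20 86 46 44\nHerningvej 8'))  # 20 86 46 44
print(first_phone_number('Herningvej 8'))  # None
```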

### Grouping with Parentheses


Adding parentheses will create groups in the regex: `r'(\d{4})\n(Nykøbing F)'`. Then you can use the `group()` match object method to grab the matching text from just one group. The first set of parentheses in a regex string will be group 1. The second set will be group 2. By passing the integer 1 or 2 to the `group()` match object method, you can grab different parts of the matched text. Passing 0 or nothing to the `group()` method will return the entire matched text.


In [23]:
phone_num_reg = re.compile(r'(\d{4})\n(Nykøbing F)')

address_entry = """Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F"""

mo = phone_num_reg.search(address_entry)
print(mo.group(0))
print('Group 1: ', mo.group(1))
print('Group 2: ', mo.group(2))
mo.groups()

4800
Nykøbing F
Group 1:  4800
Group 2:  Nykøbing F


('4800', 'Nykøbing F')
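
Since `groups()` returns one string per group, its result can be unpacked straight into variables. The sketch below reuses the pattern from this section; the names `postal_reg`, `postal_code`, and `city` are chosen here for illustration.

```python
import re

# Same pattern as above: a four-digit postal code, a newline, then the city name.
postal_reg = re.compile(r'(\d{4})\n(Nykøbing F)')

mo = postal_reg.search('Herningvej 8\n4800\nNykøbing F')

# groups() returns a tuple with one entry per group, so it can be unpacked directly.
postal_code, city = mo.groups()
print(postal_code)  # 4800
print(city)  # Nykøbing F

# group() also accepts several group numbers and then returns a tuple.
print(mo.group(1, 2))  # ('4800', 'Nykøbing F')
```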

### The `findall()` Method

In addition to the `search()` method, Regex objects also have a `findall()` method. While `search()` will return a Match object of the first matched text in the searched string, the `findall()` method will return the strings of every match in the searched string. 


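If the regular expression contains groups, `findall()` returns the text of the groups instead of the whole match: a list of strings for a single group, or a list of tuples for two or more groups.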

In [29]:
phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

address_entry = """Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F

A Bischoff Møller
86 14 18 31 
Stenkildevej 14
8260
Viby J

A Egelund-Møller
54 94 41 81 
Rønnebærparken 1 0011
4983
Dannemare"""

numbers = phone_num_reg.findall(address_entry)
print('All matches: {}'.format(numbers))

mo = phone_num_reg.search(address_entry)
print('First match: {}'.format(mo.group()))

All matches: ['20 86 46 44', '86 14 18 31', '54 94 41 81']
First match: 20 86 46 44
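
The example above uses a pattern without groups. As a sketch of how `findall()` behaves once groups are present, the following uses a shortened copy of the address data; the patterns and the names `one_group_reg`, `two_group_reg`, `first_pairs`, and `halves` are made up for illustration.

```python
import re

address_entry = """Møller
20 86 46 44
Herningvej 8
4800
Nykøbing F

A Bischoff Møller
86 14 18 31
Stenkildevej 14
8260
Viby J"""

# One group: findall() returns a list of strings holding only that group's text.
one_group_reg = re.compile(r'(\d{2}) \d{2} \d{2} \d{2}')
first_pairs = one_group_reg.findall(address_entry)
print(first_pairs)  # ['20', '86']

# Two groups: findall() returns a list of tuples, one tuple per match.
two_group_reg = re.compile(r'(\d{2} \d{2}) (\d{2} \d{2})')
halves = two_group_reg.findall(address_entry)
print(halves)  # [('20 86', '46 44'), ('86 14', '18 31')]
```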


### More Regex Syntax

On top of grouping and repetition, regexes can express quite a bit more. Have a look at http://www.pyregex.com to see what else is possible. Now, we will play a bit with regexes in the editor.

In [31]:
import re


with open('./addresses.txt') as f:
    addresses = f.read()

print(addresses)


Møller 
20 86 46 44 
Herningvej 8
4800
Nykøbing F

A Bischoff Møller
86 14 18 31 
Stenkildevej 14
8260
Viby J

A Egelund-Møller
54 94 41 81 
Rønnebærparken 1 0011
4983
Dannemare

A K Møller
75 50 75 14 
Bregnerødvej 75, st. 0002
3460
Birkerød

A Møller
86 45 44 36 
Violvej 3
Ø. Bjerregrav
8920
Randers NV

A Møller
97 95 20 01 
Dalstræde 11
Heltborg
7760
Hurup Thy

A Møller
76 42 00 81 
Hyrdevej 16A
7000
Fredericia

A Møller
74 74 36 62 
Jørgensgaardvej 13
6240
Løgumkloster

A Møller
86 13 22 99 
Brammersgade 45
8000
Aarhus C

A Møller Andersen
45 80 47 14 
Gammel Holtevej 60
Gl Holte
2840
Holte

A Møller Hansen
86 28 39 14 
Rugbjergvej 23
Stavtrup
8260
Viby J

A Møller Jensen
86 15 31 51 
Viborgvej 115, 1. tv
Hasle
8210
Aarhus V

A Møller Knudsen
86 15 99 81 
Julivej 15
8210
Aarhus V

A Møller Pedersen
75 62 16 57 
Fredrik Bajers Gade 26, 1. th
8700
Horsens

A Møller Pedersen
75 89 54 29 
Apotekerhaven 6, 2. mf
8722
Hedensted

A Møller Sørensen
35 38 97 81 
Korsørgade 4, 6. tv
2100
Køb