# Regular Expressions
[Read the docs](https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy)

Regex tester: https://regex101.com/

This tutorial is based on chapter 7, "Pattern Matching with Regular Expressions", of the book *Automate the Boring Stuff with Python* by Al Sweigart.



***Regular expressions*** allow you to specify a **pattern of text** to search for.

Regular expressions are huge time-savers, not just for software users but also for programmers. In fact, tech writer Cory Doctorow argues that even before teaching programming, we should be teaching regular expressions:

> Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.

Source: https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions




## Finding Patterns of Text with Regular Expressions

Regular expressions, called regexes for short, are descriptions of a pattern of text. For example, `\d` in a regex stands for a digit character, that is, any single numeral from `0` to `9`. Python could use the regex `\d\d \d\d \d\d \d\d` to match a Danish phone number written as four pairs of digits separated by single spaces.

But regular expressions can be much more sophisticated. For example, adding a `2` in curly brackets (`{2}`) after a pattern is like saying, "Match this pattern two times." So the regex `\d{2} \d{2} \d{2} \d{2}` also matches the correct phone number format. It could be shortened even more to `(\d{2} ){3}\d{2}`.
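
As a quick check, the following sketch confirms that the three patterns above accept the same phone number. It uses Python's `re` module, which is introduced in the next section. The sample number and the variable names are made up for illustration, and `re.fullmatch()` is used so that the whole string has to fit the pattern.

```python
import re

# Three equivalent ways of writing the same phone number pattern
patterns = [
    r'\d\d \d\d \d\d \d\d',
    r'\d{2} \d{2} \d{2} \d{2}',
    r'(\d{2} ){3}\d{2}',
]

sample = '12 34 56 78'  # made-up number for illustration

# fullmatch() returns a Match object only if the entire string matches
results = [bool(re.fullmatch(p, sample)) for p in patterns]
print(results)
```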



### Creating Regex Objects with Raw Strings

All the regex functions in Python are in the `re` module.

Passing a string value representing your regular expression to `re.compile()` returns a Regex pattern object (or simply, a Regex object). Because regular expressions frequently contain backslashes, it is convenient to pass raw strings to `re.compile()` instead of typing extra backslashes. Typing `r'\d{2} \d{2} \d{2} \d{2}'` is much easier than typing `'\\d{2} \\d{2} \\d{2} \\d{2}'`.
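
A small sketch of what the `r` prefix does, assuming nothing beyond plain Python strings (the variable names are invented for illustration):

```python
# A raw string keeps backslashes exactly as they are typed
raw_pattern = r'\d{2} \d{2}'
escaped_pattern = '\\d{2} \\d{2}'
same = raw_pattern == escaped_pattern
print(same)

# Without the r prefix, '\n' is a single newline character;
# with it, r'\n' is two characters: a backslash and the letter n
lengths = (len('\n'), len(r'\n'))
print(lengths)
```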

In [4]:
# phone = '34 34 34 34'
import re

phone_num_reg_ex_pattern_obj = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

### Matching Regex Objects

A Regex Pattern object’s `search()` method searches the string it is passed for any matches to the regex. 
`re.compile(r'pattern').search(string_to_search_through)`

The `search()` method returns **`None`** if the regex pattern is not found in the string.   
If the pattern is found, it returns a `Match` object. `search()` stops at the first match.

`Match` objects have a `group()` method that will return the actual matched text from the searched string.

In [5]:
import re


phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')
print('re.compile returns a pattern object',type(phone_num_reg))
address_entry = """
A Helge Gamborg Møller
Klostergade 28
6760 Ribe
61 69 03 76"""

match_obj = phone_num_reg.search(address_entry)
print(match_obj)
print(match_obj.group())



re.compile returns a pattern object <class 're.Pattern'>
<re.Match object; span=(49, 60), match='61 69 03 76'>
61 69 03 76
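
Because `search()` returns `None` when nothing matches, calling `group()` directly on the result raises an `AttributeError` in that case. A minimal sketch of a guard, where the helper name `find_phone` and the sample strings are made up for illustration:

```python
import re

phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

def find_phone(text):
    """Return the first phone number in text, or None if there is none."""
    match_obj = phone_num_reg.search(text)
    if match_obj is None:
        return None
    return match_obj.group()

print(find_phone('Call 61 69 03 76 after noon'))
print(find_phone('No number here'))
```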


### Grouping with Parentheses


- Adding parentheses will create groups in the regex:   
  - `r'(\d{4})\n(Nykøbing F)'`. 

- Then you can use the `group()` match object method to grab the matching text from just one group.   
- The first set of **parentheses** in a regex string is group 1, the second set is group 2. 
- By passing the integer 1 or 2 to the `group()` match object method, you can grab different parts of the matched text. 

Passing 0 or nothing to the `group()` method will **return the entire matched text.**


In [6]:
# groups are created with parentheses
city_reg1 = re.compile(r'(\d{4}) (\w+)')
city_reg2 = re.compile(r'\d{4} \w+')
address_entry = """
A Henning Gamborg Møller
Klostergade 28
6760 Ribe
4545 Hest
61 69 03 76"""

mo = city_reg1.search(address_entry)
print('Group 0: ', mo.group(0))
print('Group 1: ', mo.group(1))
print('Group 2: ', mo.group(2))
print(list(mo.groups()))

mo = city_reg2.search(address_entry)
print('Groups:',mo.groups(),mo.group(0))

Group 0:  6760 Ribe
Group 1:  6760
Group 2:  Ribe
['6760', 'Ribe']
Groups: () 6760 Ribe


In [16]:

import re 
p = re.compile(r'Hans')
txt = 'Der bor en mand der hedder Hans i det blå hus'
m = p.search(txt)
print(m.group(0)) 

Hans


### The `findall()` Method

In addition to the `search()` method, Regex objects also have a `findall()` method. While `search()` returns a Match object for the first match in the searched string, `findall()` **returns a list of strings**, one for every non-overlapping match in the searched string. 


If the regular expression contains exactly one group, `findall()` returns a list with the text of that group for each match. If it contains two or more groups, **`findall()` returns a list of tuples**, with one tuple per match and one string per group.
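
The effect of groups on the return value of `findall()` can be sketched as follows. The sample text and variable names are made up for illustration.

```python
import re

text = '6760 Ribe, 4545 Hest'  # made-up sample text

no_groups = re.findall(r'\d{4} \w+', text)       # no groups
one_group = re.findall(r'(\d{4}) \w+', text)     # one group
two_groups = re.findall(r'(\d{4}) (\w+)', text)  # two groups

print(no_groups)   # full matches as strings
print(one_group)   # only the text captured by the single group
print(two_groups)  # one tuple per match
```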

### The `finditer()` Method
The `finditer()` method returns an iterator that yields a `Match` object for every non-overlapping match.

In [33]:
phone_num_reg = re.compile(r'\d{2} \d{2} \d{2} \d{2}')

address_entry = """

A Henning Gamborg Møller
Klostergade 28
6760 Ribe
61 69 03 76

A K Møller
Bregnerødvej 75, st. 0002
3460 Birkerød
75 50 75 14

A Møller
Violvej 3
Ø. Bjerregrav
8920 Randers NV
86 45 44 36

A Møller
Hyrdevej 16A
7000 Fredericia
76 42 00 81

A Møller
Brammersgade 45
8000 Aarhus C
86 13 22 99

A Møller
Dalstræde 11
Heltborg
7760 Hurup Thy
97 95 20 01
"""

numbers = phone_num_reg.findall(address_entry)
print('All matches: {}'.format(numbers))

mo = phone_num_reg.search(address_entry)
print('First match: {}'.format(mo.group()))

iter_obj = phone_num_reg.finditer(address_entry)

for idx,o in enumerate(iter_obj):
    print(f'Phone no:{idx+1} is {o.group()}')

All matches: ['61 69 03 76', '75 50 75 14', '86 45 44 36', '76 42 00 81', '86 13 22 99', '97 95 20 01']
First match: 61 69 03 76
Phone no:1 is 61 69 03 76
Phone no:2 is 75 50 75 14
Phone no:3 is 86 45 44 36
Phone no:4 is 76 42 00 81
Phone no:5 is 86 13 22 99
Phone no:6 is 97 95 20 01


## 01 Exercise with findall()
In the following text find all the family names of everyone with first name Peter:

"Peter Hansen was meeting up with Jacob Fransen for a quick lunch, but first he had to go by Peter Beier to pick up some chocolate for his wife. Meanwhile Pastor Peter Jensen was going to church to give his sermon for the same 3 people in his parish. Those were Peter Kold and Henrik Halberg plus a third person who had recently moved here from Norway called Peter Harold".

## Most common patterns

|No|**Symbol**|**Effect**|
|--|--|--|
|1|`.`|matches any character except a newline|
|2|`\w`|matches any word character, i.e. alphanumeric (letters, digits) and underscore (`_`)|
|3|`\W`|matches any non-word character|
|4|`\d`|matches a single digit|
|5|`\D`|matches a single character that is not a digit|
|6|`\s`|matches any whitespace character, such as space, `\t` and `\n`|
|7|`\S`|matches a single non-whitespace character|
|8|`[abc]`|matches a single character in the set, i.e. either a, b or c|
|9|`[^abc]`|matches a single character other than a, b and c|
|10|`[a-z]`|matches a single character in the range a to z|
|11|`[a-zA-ZæøåÆØÅ]`|matches a single letter in the range a-z or A-Z, or one of the Danish letters æ, ø, å, Æ, Ø, Å|
|12|`[0-9]`|matches a single character in the range 0-9|
|13|`^`|matches at the beginning of the string|
|14|`$`|matches at the end of the string|
|15|`+`|matches one or more of the preceding element (greedy match)|
|16|`*`|matches zero or more of the preceding element (greedy match)|
|17|`?`|matches zero or one of the preceding element (greedy match)|
|18|`*?`|non-greedy: matches zero or more|
|19|`??`|non-greedy: matches zero or one|
|20|`+?`|non-greedy: matches one or more|
|21|`\|`|or: matches either the expression before it or the expression after it|
|22|`([\w ]+)*`|zero or more runs of word characters and spaces, e.g. several words|
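
To make a few rows of the table concrete, here is a small sketch of the anchors `^` and `$` and of greedy versus non-greedy matching. The sample string and variable names are invented for illustration.

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # made-up sample string

# Greedy: .+ grabs as much as possible, up to the last '>'
greedy = re.search(r'<.+>', text).group()

# Non-greedy: .+? grabs as little as possible, up to the first '>'
non_greedy = re.search(r'<.+?>', text).group()
tags = re.findall(r'<.+?>', text)

# Anchors: ^ ties the match to the start, $ to the end of the string
starts_with_b = bool(re.search(r'^<b>', text))
ends_with_b = bool(re.search(r'</b>$', text))

print(greedy)
print(non_greedy)
print(tags)
print(starts_with_b, ends_with_b)
```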

## 5 min read exercise
Look at the code below and reflect:
1. What is the difference between `alpha_numeric1`, `alpha_numeric2` and `alpha_numeric3`?

In [2]:
# find different patterns with regular expressions
import re
txt_string = 'her er både tal 123 og bogstaver, %&/ samt specialtegn.'

anything_after_numbers = re.compile(r'\d{2}.+') # 2 digits and anything after
alpha_numeric1 = re.compile(r'[a-zA-ZæøåÆØÅ0-9,. ]+') # not strict (allows many consecutive spaces)
alpha_numeric2 = re.compile(r'\w+( \w+)*') # words separated by single spaces
alpha_numeric3 = re.compile(r'\w{1,}') # one or more alphanumeric characters but NOT space
not_a_digit = re.compile(r'\D+') # one or more characters that are not digits
match_start = re.compile(r'^her (\w+).+')  # text begins with 'her' and we want to get the next word
match_end = re.compile(r'tegn\.$') # text ends with 'tegn.' (the dot is escaped so it matches a literal dot)

r1 = anything_after_numbers.search(txt_string); print('1:',r1.group())
r2 = alpha_numeric1.findall(txt_string);       print('2:',r2)
r3 = alpha_numeric2.finditer(txt_string)
for match in r3: print('3:',match.group()) #finditer returns an iterator
r4 = alpha_numeric3.findall(txt_string);       print('4:',r4)
r5 = not_a_digit.findall(txt_string);           print('5:',r5)
r6 = match_start.findall(txt_string);           print('6:',r6)
r7 = match_end.findall(txt_string);             print('7:',r7)

1: 123 og bogstaver, %&/ samt specialtegn.
2: ['her er både tal 123 og bogstaver, ', ' samt specialtegn.']
3: her er både tal 123 og bogstaver
3: samt specialtegn
4: ['her', 'er', 'både', 'tal', '123', 'og', 'bogstaver', 'samt', 'specialtegn']
5: ['her er både tal ', ' og bogstaver, %&/ samt specialtegn.']
6: ['er']
7: ['tegn.']


In [38]:
# Show example of re.finditer(pattern, txt)
import re

regex = r"\w+( \w+)*"
test_str = """sdsdf 455 %% 2323 sdsd
dsfsdf 3232 dfsd 23"""
matches = re.finditer(regex, test_str, re.MULTILINE) # re.MULTILINE makes ^ and $ match at every line; this pattern uses neither, so the flag has no effect here

for matchNum, match in enumerate(matches, start=1):    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
    

Match 1 was found at 0-9: sdsdf 455
Group 1 found at 5-9:  455
Match 2 was found at 13-22: 2323 sdsd
Group 1 found at 17-22:  sdsd
Match 3 was found at 23-42: dsfsdf 3232 dfsd 23
Group 1 found at 39-42:  23
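
Note that `re.MULTILINE` only changes how `^` and `$` behave. A minimal sketch of that effect, reusing the test string from the cell above with a pattern chosen for illustration:

```python
import re

test_str = """sdsdf 455 %% 2323 sdsd
dsfsdf 3232 dfsd 23"""

# Without the flag, ^ only matches at the very start of the string
first_words_default = re.findall(r'^\w+', test_str)

# With re.MULTILINE, ^ also matches right after each newline
first_words_multiline = re.findall(r'^\w+', test_str, re.MULTILINE)

print(first_words_default)
print(first_words_multiline)
```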


In [10]:
import re
with open('./data/addresses.txt') as f:
    addresses = f.read()

print(addresses)

A Henning Gamborg Møller
Klostergade 28
6760 Ribe
61 69 03 76

A K Møller
Bregnerødvej 75, st. 0002
3460 Birkerød
75 50 75 14

A Møller
Violvej 3
Ø. Bjerregrav
8920 Randers NV
86 45 44 36

A Møller
Hyrdevej 16A
7000 Fredericia
76 42 00 81

A Møller
Brammersgade 45
8000 Aarhus C
86 13 22 99

A Møller
Dalstræde 11
Heltborg
7760 Hurup Thy
97 95 20 01

A Møller
Jørgensgaardvej 13
6240 Løgumkloster
74 74 36 62

A Møller Andersen
Gammel Holtevej 60
Gl Holte
2840 Holte
45 80 47 14

A Møller Jensen
Viborgvej 115, 1. TV
Hasle
8210 Aarhus V
60 94 39 04

A Møller Sørensen
Korsørgade 4, 6. tv
2100 København Ø
35 38 97 81

A Porse Møller
Røddikvej 60
8464 Galten
86 94 66 60

Aage Beck Møller
Rødding Nørregade 8
6630 Rødding
20 44 00 35

Aage Bojsen-Møller
Noret 3
4780 Stege
55 81 46 76

Aage Christian Møller Andersen
Jordemodervej 7
Bislev
9240 Nibe
20 83 70 65

Aage Hansen Møller
Filippavej 38
Hundstrup
5762 Vester Skerninge
62 24 10 81

Aage Jan Møller
Næsborgvej 50, st. th
2650 Hvidovre
20 83 88

## 02 exercise

We will work with the addresses from `data/addresses.txt` shown above.

Write regular expressions that you can use to create 5 lists with:

  * all names in the list above
  * all telephone numbers 
  * all zip codes
  * all city names with corresponding zip code
  * all street names

In [6]:
# Match whole lines that contain Pizza, pizza, Pizzaria or Pizzeria (lines with special characters are skipped)
txt = """Store Pizza stykker
Lille blå Pizzaria
Byens bedste Pizzeria på gågaden
Pizzaria la costa
My new found pizza happiness
Burgere en masse P
pizzaria no 9
(Pizzaria)
Burgere og Pizza
"""
import re
pattern = re.compile(r'[\w ]*(?:[Pp]izza|Pizz[ae]ria)[\w ]*') # change [\w ] to dot if special chars are allowed
for line in txt.split('\n'):
    match = pattern.match(line) # match() only tries to match at the start of the line
    if match:
        print(match.group())


Store Pizza stykker
Lille blå Pizzaria
Byens bedste Pizzeria på gågaden
Pizzaria la costa
My new found pizza happiness
pizzaria no 9
Burgere og Pizza
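
The loop above uses `match()`, which only succeeds when the pattern fits at the very start of the string; that is why `(Pizzaria)` is not printed. `search()` scans the whole string instead. A small sketch of the difference, using a simplified pattern chosen for illustration:

```python
import re

pizza = re.compile(r'[Pp]izz[ae]ria')  # simplified pattern for this sketch
line = '(Pizzaria)'

at_start = pizza.match(line)   # None: the line starts with '('
anywhere = pizza.search(line)  # finds 'Pizzaria' further into the line

print(at_start)
print(anywhere.group(), anywhere.start())
```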


## Using re.VERBOSE
- regular expressions are a very compact notation
- they’re not terribly readable. 
- REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand.
- specifying the re.VERBOSE flag can be helpful
- it allows you to format the regular expression more clearly.

#### The re.VERBOSE flag has several effects. 
- Whitespace in the regular expression that isn’t inside a character class is ignored. 
- This means `dog | cat` is equivalent to the less readable `dog|cat`, but `[a b]` will still match the characters 'a', 'b', or a space. 
- You can put comments inside a RE; comments extend from a `#` character to the next newline. When used with triple-quoted strings, this enables REs to be formatted more neatly:
```python
import re

pattern = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)
```
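
As a rough usage sketch, the pattern above can be applied to a header-style line and the named groups read back by name. The sample line and the variable names `m`, `header` and `value` are made up for illustration. `.strip()` is applied because the `value` group starts right after the colon and therefore keeps the leading space.

```python
import re

pattern = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (non-greedy)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

m = pattern.match('Subject: hello world   ')  # made-up sample line
header = m.group('header')
value = m.group('value').strip()  # drop the space that follows the colon
print(header, '->', value)
```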



### Non greedy
