# Regular Expressions
#### := sequences of characters that define specific patterns to search for in text

![Saving the day with regular expressions!](https://imgs.xkcd.com/comics/regular_expressions.png)

Task to save the day: **"Look for something that looks like an Email Address"**

In [34]:
email_pattern = r"^.+@.+\..*$"  # <- would work, but it's too generic (raw string keeps the backslash intact)

In [35]:
email_pattern = r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"  # <- better: user, domain, 2-5 letter ending

Try out patterns on regex101.com
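
To see what the stricter pattern accepts, here is a minimal sketch that runs it against a few strings. The sample addresses are made up for illustration and are not part of the original notebook.

```python
import re

# The stricter pattern from above, written as a raw string so the
# backslashes reach the regex engine unchanged.
email_pattern = r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"

# Hypothetical sample inputs, made up for illustration.
candidates = ["jane.doe@example.com", "not-an-email", "user@host"]

# re.match returns a match object (truthy) or None (falsy).
results = {c: bool(re.match(email_pattern, c)) for c in candidates}
print(results)

# The three parenthesised groups capture user name, domain and ending.
parts = re.match(email_pattern, "jane.doe@example.com").groups()
print(parts)  # ('jane.doe', 'example', 'com')
```

Only the first candidate matches: the second has no `@`, and the third has no dot after the domain.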

| character | meaning |
|-----------|---------|
| `.` | matches any character except a newline |
| `\w` | matches any alphanumeric character or underscore |
| `\d` | matches any digit character |
| `\s` | matches any whitespace character (space, tab, newline) |
| `[a-z]` | matches any lowercase letter |
| `[0-9]` | matches any digit between 0 and 9 |
| `+` | repeats previous symbol one or more times |
| `*` | repeats previous symbol 0 or more times |
| `\|` | logical OR; used to combine multiple search patterns |
| `\` | escapes special characters |
| `(x)` | match group; extracts whatever you put in parentheses |
| `[^a]` | matches any character except "a" |
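
The symbols in the table can be tried out directly with `re.findall()`. This is a small sketch; the sample string is invented for illustration.

```python
import re

# Hypothetical sample string, made up for illustration.
sample = "Order 66: 3 cats, 12 dogs"

digits = re.findall(r"\d+", sample)          # one or more digits
lower = re.findall(r"[a-z]+", sample)        # runs of lowercase letters only
either = re.findall(r"cats|dogs", sample)    # logical OR of two patterns
counts = re.findall(r"(\d+) \w+", sample)    # group: keep only the number
no_space = re.findall(r"[^\s]+", sample)     # runs of non-whitespace

print(digits)   # ['66', '3', '12']
print(lower)    # ['rder', 'cats', 'dogs']
print(either)   # ['cats', 'dogs']
print(counts)   # ['3', '12']
print(no_space)
```

Note that `[a-z]+` skips the capital "O" in "Order", and that a pattern with one group makes `re.findall()` return the group contents instead of the full match.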

### Regular expressions in Python

- **re.findall()** 	returns a list of matching strings
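- **re.search()** 	returns a match object for the first match anywhere in the string, or `None`
- **re.sub()** 	replaces every match of the pattern with a string
- **re.match()** 	matches only at the beginning of the string (use **re.fullmatch()** to match the entire string)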
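
The difference between searching and matching is easy to mix up, so here is a short sketch on a shortened spice string. `re.fullmatch()` is part of the standard `re` module and is included here for contrast.

```python
import re

text = "thyme coriander cumin"
pattern = r"c\w+"

first = re.search(pattern, text)         # scans the whole string
at_start = re.match(pattern, text)       # only tries position 0
whole = re.fullmatch(pattern, "cumin")   # must cover the entire string

print(first.group())   # coriander
print(at_start)        # None, because the text starts with "thyme"
print(whole.group())   # cumin
```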


- **re.compile()** 	pre-compiles a pattern into a reusable object, which can be faster when it is applied many times
- **re.DOTALL** 	flag that makes `.` match newlines as well
- **re.IGNORECASE** 	flag for case-insensitive matching
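
A minimal sketch of a compiled pattern and the two flags. The short input strings are made up for illustration.

```python
import re

# Compile once, reuse many times; flags are passed at compile time.
spice = re.compile(r"c\w+", re.IGNORECASE)
found = spice.findall("Cinnamon and cumin")
print(found)  # ['Cinnamon', 'cumin']

# re.DOTALL only changes the meaning of ".", nothing else.
two_lines = "start\nend"
without_flag = re.findall(r"start.end", two_lines)
with_flag = re.findall(r"start.end", two_lines, re.DOTALL)
print(without_flag)  # []
print(with_flag)     # ['start\nend']
```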

In [24]:
import re

In [11]:
text = '''thyme <a href="coriander99"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia cumin'''

In [12]:
pattern = r"c\w+"  # a lowercase "c" followed by one or more word characters

### find all occurrences

In [25]:
text

'thyme <a href="coriander99"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia cumin'

In [26]:
re.findall(pattern, text)

['coriander99', 'cumin']

#### match line-breaks as well (re.DOTALL only affects `.`, so this pattern gives the same result)

In [14]:
re.findall(pattern, text, re.DOTALL)

['coriander99', 'cumin']

#### ignore case

In [15]:
re.findall(pattern, text, re.IGNORECASE)

['coriander99', 'Cinnamon', 'cumin']

#### match objects

In [28]:
text

'thyme <a href="coriander99"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia cumin'

In [29]:
pattern

'c\\w+'

In [16]:
s = re.search(pattern, text) # returns a match object for the first occurrence (None if there is no match)
s.span()                     # returns the (start, stop) indices of the match

(15, 26)
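
Besides `span()`, a match object exposes the matched text itself. The sketch below also uses `re.finditer()`, which is not used elsewhere in this notebook, to get one match object per occurrence.

```python
import re

text = 'thyme <a href="coriander99"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia cumin'
pattern = r"c\w+"

s = re.search(pattern, text)
if s is not None:                 # re.search returns None when nothing matches
    start, stop = s.span()
    matched = s.group()           # the matched substring
    sliced = text[start:stop]     # slicing with the span gives the same text
    print(matched, sliced)        # coriander99 coriander99

# One match object per occurrence, in order of appearance.
all_matches = [(m.group(), m.span()) for m in re.finditer(pattern, text)]
print(all_matches)
```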

#### replace patterns

In [32]:
text_update = re.sub(pattern, "SPICE starting with C", text)

In [21]:
text

'thyme <a href="coriander99"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia cumin'

In [33]:
text_update

'thyme <a href="SPICE starting with C"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia SPICE starting with C'
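
`Cinnamon` is left alone above because the pattern is case-sensitive. The following sketch shows three variations of `re.sub()`: a backreference to a group, the `count` argument, and a replacement function combined with `re.IGNORECASE`. The replacement strings are chosen only for illustration.

```python
import re

text = 'thyme <a href="coriander99"> <a href="rosemary"> Cinnamon pepper tarragon basil salvia cumin'

# \1 in the replacement refers to whatever group 1 matched.
tagged = re.sub(r"(c\w+)", r"[\1]", text)

# count=1 stops after the first replacement.
first_only = re.sub(r"c\w+", "SPICE", text, count=1)

# A function receives each match object and returns the replacement text.
shouted = re.sub(r"c\w+", lambda m: m.group().upper(), text, flags=re.IGNORECASE)

print(tagged)
print(first_only)
print(shouted)
```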