# Regular expression
- Regular expressions are a powerful language for matching text patterns.
- The Python `re` module provides regular expression support.
  
    `match = re.search(pat, str)`
- The `re.search()` function takes a regular expression `pattern` and a `string` and searches for that pattern within the string.
- If the search is successful, `search()` returns a match object; otherwise it returns `None`.


In [1]:
# To search for the pattern 'word:' followed by a 3 letter word
import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
  print('found', match.group()) ## 'found word:cat'
else:
  print('did not find')


found word:cat


## Basic Patterns
- a, X, 9, < -- ordinary characters just match themselves exactly. 
- The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( ) (details below).

- . (a period) -- matches any single character except newline '\n'
- \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. **Note**: although "word" is the mnemonic for this, it only matches a single word char, not a whole word.
- \W (upper case W) matches any non-word character.
- \b -- boundary between word and non-word
- \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
- \t, \n, \r -- tab, newline, return
- \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)
- ^ = start, $ = end -- match the start or end of the string
- \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as '@', you can try putting a backslash in front of it, \@. If it's not a valid escape sequence, like \c, your Python program will halt with an error.
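
Word boundaries, the negated codes, and backslash escaping from the list above are easy to misread, so a short sketch may help. The sample strings and variable names here are made up for illustration.

```python
import re

# \b matches the empty position between a word char and a non-word char,
# so r'\bcat\b' only finds 'cat' when it stands alone as a word.
whole_word = re.search(r'\bcat\b', 'concat cat!').group()
inside_word = re.search(r'\bcat\b', 'concatenate')
print(whole_word)    # cat
print(inside_word)   # None

# \W and \S are the negations of \w and \s.
first_non_word = re.search(r'\W', 'abc_123!x').group()
first_two_non_space = re.search(r'\S\S', '  \t hi there').group()
print(first_non_word)        # !
print(first_two_non_space)   # hi

# A backslash removes the special meaning of the next character.
any_char = re.search(r'a.c', 'abc a.c').group()
literal_dot = re.search(r'a\.c', 'abc a.c').group()
print(any_char)      # abc  (dot matches any char)
print(literal_dot)   # a.c  (escaped dot matches only a period)
```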

In [2]:
  ## Search for pattern 'iii' in string 'piiig'.
  ## All of the pattern must match, but it may appear anywhere.
  ## On success, match.group() is matched text.
  match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
  print(match.group())


iii


In [3]:
  match = re.search(r'igs', 'piiig') # not found, match == None
  print(match) # calling match.group() here would raise AttributeError, because match is None



None
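
Because a failed search returns `None`, calling `.group()` on the result without checking raises `AttributeError`. One way to avoid repeating the if-test is a small wrapper. The helper name `find` is hypothetical and not part of the `re` module.

```python
import re

def find(pat, text):
    """Hypothetical helper: return the matched text, or None when nothing matches."""
    match = re.search(pat, text)
    return match.group() if match else None

print(find(r'iii', 'piiig'))  # iii
print(find(r'igs', 'piiig'))  # None
```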

In [5]:
## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"
print(match.group())


iig


In [7]:
## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
print(match.group())

123


In [8]:
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"
print(match.group())

abc


## Repetition
Things get more interesting when you use `+`, `*`, and `?` to specify repetition in the pattern:

  - `+` -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
  - `*` -- 0 or more occurrences of the pattern to its left
  - `?` -- match 0 or 1 occurrences of the pattern to its left 
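
As a small sketch of `?` and of the "zero occurrences" case of `*`, using made-up sample strings:

```python
import re

# 'u?' makes the 'u' optional, so both spellings match.
american = re.search(r'colou?r', 'the color red').group()
british = re.search(r'colou?r', 'the colour red').group()
print(american)  # color
print(british)   # colour

# 'b*' may match zero b's, so the pattern still succeeds on a bare 'a'.
zero_bs = re.search(r'ab*', 'xxaxx').group()
many_bs = re.search(r'ab*', 'xxabbbxx').group()
print(zero_bs)   # a
print(many_bs)   # abbb
```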

### Leftmost & Largest
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. `+` and `*` go as far as possible (the `+` and `*` are said to be `greedy`).
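
The order of the two rules matters: leftmost wins before largest. A sketch of one consequence is that `i*` can succeed with an empty match at the very start of the string, because zero i's is already a valid match there. The call to `match.span()`, which returns the start and end positions of the match, is used only to make that visible.

```python
import re

# 'i*' matches zero i's at position 0, so the search stops there with an empty match.
star = re.search(r'i*', 'piiig')
print(repr(star.group()), star.span())  # '' (0, 0)

# 'i+' needs at least one i, so the leftmost match starts at position 1,
# and greediness then extends it over all three i's.
plus = re.search(r'i+', 'piiig')
print(plus.group(), plus.span())        # iii (1, 4)
```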

### Repetition Examples

In [9]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
print(match.group())


piii


In [10]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
print(match.group())


ii


In [11]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
print(match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
print(match.group())
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
print(match.group())

1 2   3
12  3
123


In [13]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
print(match) # calling match.group() here would raise AttributeError, because match is None

None

In [14]:
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
print(match.group())


bar
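
The `$` anchor works the same way at the other end of the string. A brief sketch, reusing 'foobar' plus one made-up string containing a space:

```python
import re

at_start = re.search(r'^foo', 'foobar').group()
at_end = re.search(r'bar$', 'foobar').group()
not_at_end = re.search(r'foo$', 'foobar')
print(at_start)    # foo
print(at_end)      # bar
print(not_at_end)  # None, 'foo' is not at the end of the string

# Using both anchors forces the pattern to cover the whole string.
whole = re.search(r'^\w+$', 'foobar').group()
not_whole = re.search(r'^\w+$', 'foo bar')
print(whole)       # foobar
print(not_whole)   # None, the space is not a word char
```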


### Email example

In [15]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
  print(match.group())  ## 'b@google'

b@google


#### Square brackets
- Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'.
- The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot.
- For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

In [16]:
  match = re.search(r'[\w.-]+@[\w.-]+', str)
  if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com
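
A few more square-bracket cases, sketched with made-up strings, show the set behavior and the literal dot:

```python
import re

# [abc]+ matches a run of the characters a, b, or c in any order.
run = re.search(r'[abc]+', 'xxbacxx').group()
print(run)        # bac

# Inside brackets the dot is literal, so [.] does not match a letter.
no_dot = re.search(r'[.]', 'abc')
print(no_dot)     # None

# \d still works inside brackets: digits and literal dots together.
version = re.search(r'[\d.]+', 'version 3.11.2 released').group()
print(version)    # 3.11.2
```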
