# Python Regular Expressions
* In Python a regular expression search is typically written as:
  **`match = re.search(pat, str)`**
* The re.search() function takes a regular expression pattern and a string, and searches for that pattern within the string. If the search succeeds, search() returns a match object; otherwise it returns None. The search is therefore usually followed immediately by an if-statement that tests whether it succeeded, as in the Basic Regex Search example below, which searches for the pattern 'word:' followed by three word characters.


In [33]:
import re

dir(re)

['DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__doc__',
 '__file__',
 '__name__',
 '__package__',
 '__version__',
 '_alphanum',
 '_cache',
 '_cache_repl',
 '_compile',
 '_compile_repl',
 '_expand',
 '_pattern_type',
 '_pickle',
 '_subx',
 'compile',
 'copy_reg',
 'error',
 'escape',
 'findall',
 'finditer',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'sys',
 'template']

### #Basic Regex Search

In [5]:
import re

str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:                      
    print 'found', match.group() ## 'found word:cat'
else:
    print 'did not find'
        

found word:cat
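
The example above shows the success path. As a minimal sketch of the failure path, the snippet below searches for a pattern that is not present and shows why the if-statement guard matters. The variable names (`missing`, `error_name`, `found`) are illustrative only, and print() is called with a single argument so the snippet runs under both the Python 2 used in this notebook and Python 3.

```python
import re

text = 'an example word:cat!!'

# 'word:' is never followed by three digits in text, so search() returns None.
missing = re.search(r'word:\d\d\d', text)
print(missing is None)

# Calling .group() on None raises AttributeError, which is why the
# search is normally followed by an if-statement.
try:
    missing.group()
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__
print(error_name)

# Guarded access: fall back to a default when there is no match.
found = missing.group() if missing else 'did not find'
print(found)
```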


### #Create a find(pattern,text) function

In [1]:
import re

str = 'an example word:cat!!'

def find(pat, text):
    match = re.search(pat, text)
    # If-statement after search() tests if it succeeded
    if match:                      
        print 'found', match.group() ## 'found word:cat'
    else:
        print 'did not find'
        
def main():
    find(r'word:\w\w\w', str)
    
if __name__ == '__main__':
    main()

found word:cat
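
A possible variation, sketched under the assumption that the caller wants the matched text rather than printed output: the hypothetical helper `find_text` below returns the match or None, which makes it easier to reuse the result. The name `find_text` is not part of the original notebook.

```python
import re

def find_text(pat, text):
    """Return the first substring of text matching pat, or None."""
    match = re.search(pat, text)
    return match.group() if match else None

hit = find_text(r'word:\w\w\w', 'an example word:cat!!')
miss = find_text(r'word:\w\w\w', 'no such thing here')
print(hit)
print(miss)
```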


### #Basic Patterns

In [17]:
match = re.search(r'iii', 'piiig')
print match.group()
  
## . = any char but \n
match = re.search(r'..g', 'piiig') 
print match.group()

## \d = digit char
match = re.search(r'\d\d\d', 'p123g') 
print match.group()

## \w = word char
match = re.search(r'\w\w\w', '@@abcd!!')
print match.group()

iii
iig
123
abc
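
To make the "any char but \n" rule concrete, here is a small sketch that also uses the re.DOTALL flag, which appears in the dir(re) listing above. The test strings are made up for illustration, and print() takes a single argument so the snippet runs under both Python 2 and Python 3.

```python
import re

# '.' does not match '\n' by default, so this search fails.
no_match = re.search(r'a.c', 'a\nc')
print(no_match is None)

# With re.DOTALL, '.' also matches '\n'.
dotall = re.search(r'a.c', 'a\nc', re.DOTALL)
print(repr(dotall.group()))

# \d and \w are narrower than '.': \d rejects letters, \w rejects punctuation.
digits_missing = re.search(r'\d\d\d', 'pabcg') is None
words_missing = re.search(r'\w\w\w', '@@ab!!') is None
print(digits_missing)
print(words_missing)
```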


### #Repetition

In [22]:
# Repetition Examples

## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') 
print match.group()

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') 
print match.group()

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')    # "1 2   3"
print match.group()

match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')      # "12  3"
print match.group()

match = re.search(r'\d\s*\d\s*\d', 'xx123xx')        # "123"
print match.group()

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')                # None

## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')                 # "bar"
print match.group()



piii
ii
1 2   3
12  3
123
bar
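
The "leftmost first, then largest" rule has a consequence that is easy to miss: because `*` allows zero repetitions, it can succeed with an empty match at the very start of the string. The sketch below contrasts `i*` with `i+` on the same input; the variable names `star` and `plus` are illustrative only.

```python
import re

# i* can match zero characters, and position 0 (before the 'p') is the
# leftmost place where it matches, so the result is the empty string.
star = re.search(r'i*', 'piiig')
print(repr(star.group()))
print(star.span())

# i+ needs at least one 'i', so the leftmost match starts at index 1
# and then extends as far as possible.
plus = re.search(r'i+', 'piiig')
print(plus.group())
print(plus.span())
```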


### #Square Brackets - a set of characters

* Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. 
* The codes \w, \s, etc. also work inside square brackets, with one exception: dot (.) means just a literal dot. 
* To extract an email address, square brackets are an easy way to add '.' and '-' to the set of chars that can appear around the @. The pattern r'[\w.-]+@[\w.-]+' matches the whole email address:

In [23]:
str = 'purple alice-b@google.com monkey dishwasher'

# First try: \w does not match '.' or '-', so only part of the address is found
match = re.search(r'\w+@\w+', 'xxx.yyy.g@hotmail.com')      # "g@hotmail"
print match.group()


# Use [] for a set of characters
match = re.search(r'[\w.-]+@[\w.-]+', str)                  # . (dot) inside the [] means a literal dot
if match:
    print match.group()  ## 'alice-b@google.com'

g@hotmail
alice-b@google.com
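
As a small sketch of the square-bracket rules described in this section, the snippet below compares a dot outside and inside brackets and shows `\w` combined with a literal hyphen. The input strings are made up for illustration.

```python
import re

# Outside brackets, '.' matches any character except a newline.
any_char = re.findall(r'.', 'a.b')
print(any_char)

# Inside brackets, '.' is a literal dot.
literal_dot = re.findall(r'[.]', 'a.b')
print(literal_dot)

# \w still works inside brackets; a '-' placed last is a literal hyphen.
words = re.findall(r'[\w-]+', 'alice-b @ x_1')
print(words)
```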


### #Group Extraction

* The "group" feature of a regular expression allows you to pick out parts of the matching text. 
* Suppose we want to extract the username and host of an email address separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. 
* In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside the match text. 

In [18]:
# group extraction with parentheses
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)   # raw string keeps the backslash in \w intact
if match:
    print match.group()   ## 'alice-b@google.com' (the whole match)
    print match.group(1)  ## 'alice-b' (the username, group 1)
    print match.group(2)  ## 'google.com' (the host, group 2)
else:
    print 'no match found'


alice-b@google.com
alice-b
google.com
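
When every group is needed at once, the match object's groups() method returns them together as a tuple. The following is a minimal sketch of that approach using the same pattern; the names `username` and `host` are illustrative.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
    # groups() returns every parenthesized group as a tuple,
    # which can be unpacked directly.
    username, host = match.groups()
    print(username)
    print(host)
    # group(0) is the same as group(): the whole match.
    print(match.group(0) == match.group())
```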


### #Find All

Above we used re.search() to find the first match for a pattern. re.findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.

For files, you may be in the habit of writing a loop to iterate over the lines of the file, calling findall() on each line. Instead, let findall() do the iteration for you: feed the whole file text into findall() and it returns a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

In [None]:
# Open file; the with-block closes it automatically
with open('test.txt', 'r') as f:
    # Feed the file text into findall(); it returns a list of all the found strings
    strings = re.findall(r'some pattern', f.read())
    

In [31]:
# Find All example
str = 'purple alice-b@google.com monkey dishwasher@xxx.nfl'

# findall() returns every email address in the string
emails = re.findall(r'[\w.-]+@[\w.-]+', str)      # ['alice-b@google.com', 'dishwasher@xxx.nfl']
print emails

['alice-b@google.com', 'dishwasher@xxx.nfl']
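
To illustrate the point about files without needing a real file, the sketch below uses a multi-line string as a stand-in for f.read() and compares a per-line loop with a single findall() call. The sample text is made up, and the two results agree here only because this pattern cannot match across a newline.

```python
import re

# Stand-in for f.read(): several lines of text in one string (made-up data).
file_text = 'alice@google.com wrote\nto bob@abc.com\nno address here\n'
pattern = r'[\w.-]+@[\w.-]+'

# Per-line loop: call findall() on each line and collect the results.
per_line = []
for line in file_text.splitlines():
    per_line.extend(re.findall(pattern, line))

# Single call on the whole text.
whole_text = re.findall(pattern, file_text)

print(whole_text)
print(per_line == whole_text)
```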


### #FindAll and Group

The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesized groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple are the group(1), group(2), ... values. 

In [32]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print tuples  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for pair in tuples:     # 'pair' avoids shadowing the built-in name tuple
    print pair[0]  ## username
    print pair[1]  ## host


[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com
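
The return shape of findall() depends on how many groups the pattern contains. As a sketch, the snippet below runs the same email pattern with zero, one, and two groups; with exactly one group, findall() returns a list of that group's text instead of tuples. The variable names are illustrative.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No groups: findall() returns the whole matches.
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
print(whole)

# Exactly one group: findall() returns only that group's text.
hosts = re.findall(r'[\w.-]+@([\w.-]+)', text)
print(hosts)

# Two groups: findall() returns tuples, which unpack in the loop.
labels = []
for username, host in re.findall(r'([\w.-]+)@([\w.-]+)', text):
    labels.append(username + ' at ' + host)
print(labels)
```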


### #Regex Substitution

The re.sub(pat, replacement, str) function searches for all instances of the pattern in the given string and returns a new string with each instance replaced. The replacement string can include '\1', '\2', which refer to the text from group(1), group(2), and so on from the original matching text.

In [None]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', str)
  ## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
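
As a further sketch of group references in the replacement, the snippet below reorders the two groups and uses re.subn(), listed in the dir(re) output above, to report how many replacements were made. The replacement format `\2:\1` is chosen only for illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pattern = r'([\w.-]+)@([\w.-]+)'

# \2 and \1 can both appear in the replacement, in any order.
swapped = re.sub(pattern, r'\2:\1', text)
print(swapped)

# subn() performs the same kind of substitution and also returns
# the number of replacements made.
result, count = re.subn(pattern, r'\1@yo-yo-dyne.com', text)
print(count)

# sub() and subn() return new strings; the original is unchanged.
print(text == result)
```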