## Regular Expressions



In [1]:
# "re" (regular expression) module provides regular expression matching operations similar to those found in Perl.

import re

## Module Functions

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full-featured methods for compiled regular expressions. Most non-trivial applications use the compiled form.



### Compile and match function

```
re.compile(pattern, flags=0)
```

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below.

The re functions accept optional flags that modify how the pattern is matched. A flag is passed as an extra argument to search(), findall(), and so on, e.g. **re.search(pat, str, re.IGNORECASE)**.

The sequence

```
prog = re.compile(pattern)
result = prog.match(string)
```

is equivalent to


```
result = re.match(pattern, string)
```

but **using re.compile() and saving the resulting regular expression object for reuse is more efficient** when the expression will be used several times in a single program.
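As a minimal sketch of the reuse described above, the following compiles one pattern with a flag and applies it to several strings. The pattern, the variable names, and the sample strings are made up for illustration.

```python
import re

# Compile once, with a flag, and keep the resulting pattern object.
word = re.compile(r'python', re.IGNORECASE)

# Hypothetical inputs; the same compiled object is reused for each one.
samples = ['Python 3', 'I like PYTHON', 'no snakes here']
hits = [bool(word.search(s)) for s in samples]

print(hits)  # [True, True, False]
```

The flag is given once at compile time, so the later calls to `word.search()` need no extra argument.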



### re.search() vs re.match()

Both return a match object for the first match they find, or None if there is no match. The difference is that re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string.

https://docs.python.org/3/library/re.html#search-vs-match

In [2]:
print(re.match("c", "abcdef"))   # No match
print(re.search("c", "abcdef"))   # Match

None
<re.Match object; span=(2, 3), match='c'>


So, if the pattern must start at the first character, use re.match(): it only tries position 0 and does not scan the rest of the string. If the position of the match is unknown, use re.search().
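The difference can also be seen by anchoring a search with `^`. This is a small sketch with made-up variable names; it assumes the default mode (no re.MULTILINE), where `^` matches only at the very start of the string.

```python
import re

text = 'abcdef'

# match() only tries position 0; search() scans forward through the string.
at_start = re.match(r'abc', text)
anywhere = re.search(r'def', text)

# Anchoring search() with ^ restricts it to the start, much like match().
anchored = re.search(r'^def', text)

print(at_start.group())  # abc
print(anywhere.span())   # (3, 6)
print(anchored)          # None
```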

## Some examples

In [3]:
## Search for pattern 'press' in string 'expression'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.

match = re.search(r'press', 'expression') 

if match:
  print('found:', match.group()) 


found: press


In [4]:
match = re.search(r'pressing', 'expression') # not found, match == None
if match:
  print('found:', match.group())
else:
  print('Not Found')
  

Not Found


In [5]:
## . = any char but \n
match = re.search(r'pr..s', 'expression') 

if match:
  print('found:', match.group())


found: press


In [6]:
## \d = digit char
match = re.search(r'\d\d\d', 'expression_123') # found, match.group() == "123"
if match:
  print('found:', match.group())


found: 123


In [7]:
# \w = word char

match = re.search(r'\w\w\w', '@@abcd!!')
if match:
  print('found:', match.group())

found: abc


## Repetition qualifiers




In [8]:
## o+ = one or more o's, as many as possible.

match = re.search(r'wo+w', 'wooooooow! amazing') # found, match.group() == "wooooooow"
if match:
  print('found:', match.group())

found: wooooooow


In [11]:
## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it matches the first run 'ss' and never reaches the final 's'.

match = re.search(r's+', 'expressions') 
if match:
  print('found:', match.group())

found: ss
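To make the "leftmost and largest" behaviour more concrete, this sketch compares re.search() with re.findall() (covered in the findall section of this notebook) on the same string. The variable names are illustrative only.

```python
import re

# search() stops at the first (leftmost) run and takes all of it.
first = re.search(r's+', 'expressions')
print(first.group(), first.span())  # ss (5, 7)

# findall() keeps scanning after each match, so the final 's' also appears.
runs = re.findall(r's+', 'expressions')
print(runs)  # ['ss', 's']
```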


In [12]:
## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.

match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
if match:
  print('found:', match.group())

found: 1 2   3


In [13]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
if match:
  print('found:', match.group())

found: 12  3


In [14]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
if match:
  print('found', match.group())  

found 123


In [15]:
## ^ -> matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')
if match:
  print('found', match.group())
else:
  print('Not found')

Not found


In [16]:
## but without the ^ it succeeds:

match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
if match:
  print('found:', match.group())

found: bar


### Emails Example

In [23]:
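text = 'name admin@google.com of company'  # 'text' avoids shadowing the built-in str

## \w does not match '.', so the match stops at 'google' and drops '.com'
match = re.search(r'\w+@\w+', text)
if match:
  print(match.group())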

admin@google


## Group Extraction

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.



In [24]:
text = 'name admin@google.com of company'  # 'text' avoids shadowing the built-in str
match = re.search(r'([\w.-]+)@([\w.-]+)', text)
if match:
  print(match.group())   # the whole match
  print(match.group(1))  # the username
  print(match.group(2))  # the host

admin@google.com
admin
google.com
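The workflow above could be wrapped in a small helper. This is only a sketch: `split_email` is a hypothetical name, and the pattern is the loose email-like pattern used in this notebook, not a full email validator.

```python
import re

def split_email(text):
    """Return (username, host) for the first email-like token, or None."""
    match = re.search(r'([\w.-]+)@([\w.-]+)', text)
    if match is None:
        return None
    # groups() returns every parenthesis group as one tuple.
    return match.groups()

print(split_email('name admin@google.com of company'))  # ('admin', 'google.com')
print(split_email('no address here'))                   # None
```

Checking for None before calling groups() keeps the helper from raising an AttributeError when nothing matches.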


## findall

In [25]:
## Suppose we have a text with many email addresses
text = 'name admin@google.com, company sales@google.com grouping'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)
for email in emails:
  print(email)

admin@google.com
sales@google.com


## findall and Groups

If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of *tuples*. 

Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. 

So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each of length 2, containing the username and host, e.g. ('admin', 'google.com').





In [26]:
text = 'name admin@google.com, company sales@google.com grouping'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)

print(tuples)
## Unpack each tuple instead of shadowing the built-in name 'tuple'
for username, host in tuples:
  print(username)
  print(host)

[('admin', 'google.com'), ('sales', 'google.com')]
admin
google.com
sales
google.com
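How the shape of the findall() result depends on the number of groups can be sketched by running the same email pattern with zero, one, and two groups. The variable names here are illustrative only.

```python
import re

text = 'name admin@google.com, company sales@google.com grouping'

# No groups: a list of whole matches.
no_groups = re.findall(r'[\w.-]+@[\w.-]+', text)
# One group: a list of that group's text only.
one_group = re.findall(r'([\w.-]+)@[\w.-]+', text)
# Two groups: a list of (group 1, group 2) tuples.
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)

print(no_groups)   # ['admin@google.com', 'sales@google.com']
print(one_group)   # ['admin', 'sales']
print(two_groups)  # [('admin', 'google.com'), ('sales', 'google.com')]
```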


## Splitting

In [27]:
re.split(r'\s+', 'Usage of split function.')  # raw string, so \s is not treated as an invalid string escape

['Usage', 'of', 'split', 'function.']

In [28]:
re.split(r'[a-f]+', '0a3B9AAM', flags=re.IGNORECASE)

['0', '3', '9', 'M']

## Substitution

In [29]:
text = 'name admin@google.com, company sales@google.com grouping'

## re.sub(pat, replacement, text) -- returns a new string with all replacements,
## \1 is group(1), \2 is group(2) in the replacement

print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@GMAIL.com', text))


name admin@GMAIL.com, company sales@GMAIL.com grouping
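The cell above only uses `\1`. As a further sketch, the replacement below uses both backreferences, and a second call passes a function as the replacement, which re.sub() also accepts. The output formats are invented for illustration.

```python
import re

text = 'name admin@google.com, company sales@google.com grouping'
pattern = r'([\w.-]+)@([\w.-]+)'

# \2 is the host and \1 the username, so this swaps them around a colon.
swapped = re.sub(pattern, r'\2:\1', text)
print(swapped)  # name google.com:admin, company google.com:sales grouping

# A function replacement receives each match object and returns the new text.
shouted = re.sub(pattern, lambda m: m.group(1).upper() + '@' + m.group(2), text)
print(shouted)  # name ADMIN@google.com, company SALES@google.com grouping
```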
