# Regular Expressions
In this notebook we review some basic regular expression techniques for exploring a text.
## Functions: match, search, findall and finditer

In [2]:
import re  # Standard-library module needed for working with regular expressions

In [3]:
txt1 = "When the fox saw the ox is breathing all the oxygen decided to sleep in the box."
txt2 = "oxygen is necessary matter for all animals including foxes and oxes, but boxes don't need oxygen."
pattern = "ox"

### The "match" function returns a match object if the pattern appears at the beginning of the string. Otherwise it returns None.

In [4]:
mat0 = re.match(pattern, txt2) 
print(mat0)
if mat0 is not None:
    print(mat0.span())
    print(mat0.start())
    print(mat0.end())
# span(), start() and end() are methods of a match object.

<re.Match object; span=(0, 2), match='ox'>
(0, 2)
0
2


In [5]:
mat1 = re.match(pattern,txt1)
print(mat1)

None


### The "search" function finds the first occurrence of the pattern anywhere in the text and returns it as a match object (or None if there is no occurrence).

In [6]:
mat2 = re.search(pattern, txt1) 
print(mat2)

<re.Match object; span=(10, 12), match='ox'>
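
The difference between "match" and "search" can be seen side by side in a minimal sketch. The sample string `text` and the variable names are chosen here for illustration only, and `group()`, which returns the matched substring, is a match-object method not used elsewhere in this notebook.

```python
import re

text = "When the fox saw the ox"   # hypothetical sample text

# re.match only tries position 0, so "ox" is not found here.
at_start = re.match("ox", text)
# re.search scans forward and stops at the first occurrence.
anywhere = re.search("ox", text)

print(at_start)            # None
print(anywhere.span())     # (10, 12)
print(anywhere.group())    # ox
```

Because both functions may return None, it is safer to test the result before calling methods such as `span()` on it.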


### The "findall" function finds all occurrences and returns them as a list
Unlike "match" and "search", it returns plain strings, not match objects.

In [7]:
mat3 = re.findall(pattern, txt1)
print("mat3 = ", mat3)

mat3 =  ['ox', 'ox', 'ox', 'ox']


### The "finditer" function finds all occurrences and returns an iterator that yields them as match objects

In [8]:
mat4 = re.finditer(pattern, txt1)
print("mat4 = ", mat4)
print("TEXT = ", txt1)
print("=================== iterating through mat4  ========================")
for i, m in enumerate(mat4):
    print("The ", i, "-th match = ",m)
    print("The span, start and end of the {}-th match are {}, {} and {}".format(i, m.span(), m.start(), m.end()),"\n")

mat4 =  <callable_iterator object at 0x0000022A43F19220>
TEXT =  When the fox saw the ox is breathing all the oxygen decided to sleep in the box.
=================== iterating through mat4  ========================
The  0 -th match =  <re.Match object; span=(10, 12), match='ox'>
The span, start and end of the 0-th match are (10, 12), 10 and 12 

The  1 -th match =  <re.Match object; span=(21, 23), match='ox'>
The span, start and end of the 1-th match are (21, 23), 21 and 23 

The  2 -th match =  <re.Match object; span=(45, 47), match='ox'>
The span, start and end of the 2-th match are (45, 47), 45 and 47 

The  3 -th match =  <re.Match object; span=(77, 79), match='ox'>
The span, start and end of the 3-th match are (77, 79), 77 and 79 
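
One practical consequence of getting an iterator back is that it can be consumed only once. The following minimal sketch, using a hypothetical sample string, collects the spans in a first pass and shows that a second pass over the same iterator yields nothing.

```python
import re

text = "the fox and the box"      # hypothetical sample text
matches = re.finditer("ox", text)

# First pass consumes the iterator.
spans = [m.span() for m in matches]
# Second pass finds nothing, because the iterator is already exhausted.
leftover = list(matches)

print(spans)      # [(5, 7), (17, 19)]
print(leftover)   # []
```

If the matches are needed more than once, one option is to store them in a list first, for example `list(re.finditer(pattern, text))`.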



# Patterns Containing Identifiers and Quantifiers
Now we build more sophisticated patterns using "identifiers" and "quantifiers".

## Identifiers:
Each of the following identifiers stands for a single character from a specific group of characters, unless it is immediately followed by a quantifier.
<table style='font-size:20px;' cellspacing=5 cellpadding=10>
    <tr ><th>Identifier</th><th>Description</th><th>Pattern</th><th >Match</th></tr>
    <tr ><th>\d</th><td style="width:30%">Digit</td><td>book_\d</td><td>book_4</td></tr>
    <tr ><th>\w</th><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>T-c_3</td></tr>
    <tr ><th>\s</th><td>White space</td><td>#\s-\sW</td><td># - W</td></tr>
    <tr ><th>\D</th><td>Non-digit</td><td>\D\D\D</td><td>Cat</td></tr>
    <tr ><th>\W</th><td>Non-alphanumeric</td><td>\W\W\W\W</td><td>{+})</td></tr>
    <tr ><th>\S</th><td>Non-whitespace</td><td>\S\S\S\S</td><td>#-oP</td></tr>
</table>
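
A few rows of the table can be checked directly. This is a minimal sketch: it uses `re.fullmatch`, which succeeds only when the entire string fits the pattern, and the names `rows`, `results` and `wrong` are chosen here for illustration.

```python
import re

# (pattern, candidate string) pairs taken from the table rows
rows = [
    (r"book_\d", "book_4"),
    (r"\w-\w\w\w", "T-c_3"),
    (r"#\s-\sW", "# - W"),
    (r"\D\D\D", "Cat"),
    (r"\W\W\W\W", "{+})"),
]

# True means the whole candidate string fits the pattern.
results = [re.fullmatch(pattern, candidate) is not None
           for pattern, candidate in rows]
for (pattern, candidate), ok in zip(rows, results):
    print(pattern, "|", candidate, "|", ok)

# A counterexample: "x" is not a digit, so \d rejects it.
wrong = re.fullmatch(r"book_\d", "book_x")
print(wrong)   # None
```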

### To make sure that backslashes in the pattern are taken literally (not as escape characters), we put the letter 'r' in front of the pattern string, which makes it a raw string.

In [9]:
pat0 = r"\wox"
print(re.findall(pat0, txt1))
# Both "fox" and "box" in txt1 start with a word character and continue with "ox"
mat5 = re.finditer(pat0, txt1)
for m in mat5:
    print(m)  # print each match object, not the iterator itself

['fox', 'box']
<re.Match object; span=(9, 12), match='fox'>
<re.Match object; span=(76, 79), match='box'>
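
The effect of the 'r' prefix can be made visible with a minimal sketch. The variable names are illustrative, and the second half uses `\b` (a word boundary in regular expressions), which is not covered in this notebook and appears here only because it is an escape that Python itself also interprets.

```python
import re

plain = "\n"     # one character: a newline
raw = r"\n"      # two characters: a backslash and the letter n
print(len(plain), len(raw))   # 1 2

# In a plain string "\b" becomes a backspace character before the regex
# engine ever sees it; in a raw string it stays a regex word boundary.
sample = "ox fox"
without_r = re.findall("\box", sample)
with_r = re.findall(r"\box", sample)
print(without_r)   # []
print(with_r)      # ['ox']
```

Identifiers such as `\d` or `\w` happen to work without the prefix, because Python does not treat them as string escapes, but writing every pattern as a raw string avoids having to remember which ones do.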


### Identifiers are especially useful for detecting phone numbers:

In [10]:
txt3 = "My phone number is 555-321-2233 and it is very similar to Alex's phone number 555-321-3322. However, Mary's phone number, 777-902-6720, is totally different."
# Now we want to extract all the phone numbers in the above text!
pat1 = r"\d\d\d-\d\d\d-\d\d\d\d"
print(re.findall(pat1, txt3))

['555-321-2233', '555-321-3322', '777-902-6720']


### In situations like the one above, quantifiers can be used to avoid repeating identifiers:

<table style='font-size:20px;' cellspacing=5 cellpadding=10>
    <tr><th>Quantifier</th><th style="width:30%">Description</th><th>Pattern</th><th >Match</th></tr>
    <tr ><th>+</th><td>one or more times</td><td>\w\d-\d+</td><td>A4-3 or s8-342, ...</td></tr>
    <tr ><th>{2}</th><td>exactly 2 times</td><td>\D{2}</td><td>b}</td></tr>
    <tr ><th>{2,5}</th><td>between 2 and 5 times</td><td>\d{2,5}</td><td>1234</td></tr>
    <tr ><th>{3,}</th><td>at least 3 times</td><td>\d{3,}</td><td>243 or 34003, ...</td></tr>
    <tr ><th>*</th><td>zero or more times</td><td>A*B*</td><td>AA or B, ...</td></tr>
    <tr ><th>?</th><td>at most once</td><td>book\d?</td><td>book or book3, ...</td></tr>
</table>
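
The rows of the table can be tried out with a minimal sketch. It relies on `re.fullmatch`, which accepts a string only if the whole string fits the pattern, and adds two counterexamples that are not in the table; the names `checks` and `outcomes` are illustrative.

```python
import re

# (pattern, candidate string, expected result)
checks = [
    (r"\w\d-\d+", "s8-342", True),
    (r"\D{2}", "b}", True),
    (r"\d{2,5}", "1234", True),
    (r"\d{2,5}", "7", False),        # one digit is fewer than 2
    (r"\d{3,}", "34003", True),
    (r"A*B*", "B", True),
    (r"book\d?", "book", True),
    (r"book\d?", "book34", False),   # ? allows at most one digit
]

outcomes = [re.fullmatch(pattern, candidate) is not None
            for pattern, candidate, _ in checks]
for (pattern, candidate, expected), got in zip(checks, outcomes):
    print(pattern, "|", candidate, "|", got, "| expected:", expected)
```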

## The following examples show some use cases of these quantifiers:

In [12]:
pat2 = r"\d{3}-\d{3}-\d{4}" # This raw string defines the same pattern as pat1 above!
print(re.findall(pat2, txt3))

['555-321-2233', '555-321-3322', '777-902-6720']


In [14]:
# Using the identifiers and quantifiers we can find all words containing the combination "ox" in "txt1" and "txt2".
pat3 = r"\w{0,}ox\w{0,}"
print(re.findall(pat3, txt1))
print(re.findall(pat3, txt2))

['fox', 'ox', 'oxygen', 'box']
['oxygen', 'foxes', 'oxes', 'boxes', 'oxygen']


In [27]:
txt4 = "Let's assume that x, x1, x2 and x43 are real variables and f is a function of one variable. 5 97 xxx2 "
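pat4 = r"x\d?"  # "x" followed by at most one digit
print(re.findall(pat4, txt4))
pat5 = r"x\d*"  # "x" followed by any number of digits (including none)
print(re.findall(pat5, txt4))
pat6 = r"x*\d"  # any number of "x" characters (including none) followed by exactly one digit
print(re.findall(pat6, txt4))
pat7 = r"variables*"  # "variable" followed by any number of "s" characters
print(re.findall(pat7, txt4))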

['x', 'x1', 'x2', 'x4', 'x', 'x', 'x2']
['x', 'x1', 'x2', 'x43', 'x', 'x', 'x2']
['x1', 'x2', 'x4', '3', '5', '9', '7', 'xxx2']
['variables', 'variable']
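
The last result illustrates that a quantifier applies only to the single element directly before it. In `variables*` the `*` belongs to the final "s", not to the whole word. The sketch below uses hypothetical sample strings to make this explicit. The non-capturing group `(?:...)`, which is not covered in this notebook, is shown only as one way to make a quantifier apply to several characters at once.

```python
import re

# The * repeats only the final "s".
words = re.findall(r"variables*", "variable variables variablesss")
print(words)   # ['variable', 'variables', 'variablesss']

# "ab*" means "a" followed by any number of "b", not repetitions of "ab".
single = re.fullmatch(r"ab*", "abbb") is not None
repeated = re.fullmatch(r"ab*", "abab") is not None
grouped = re.fullmatch(r"(?:ab)*", "abab") is not None
print(single, repeated, grouped)   # True False True
```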
