# Python Regular Expressions

### What if we want to find an email address or phone number without knowing its exact value?

We do know the general format, though: many email addresses look like "text" + "@" + "text" + ".com"

Python's built-in 're' module lets us write specialised pattern strings and search for them in text
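
To make the email idea concrete, here is a minimal sketch of such a pattern. The pattern and the sample sentence are assumptions made up for illustration; real email addresses allow far more variety than "text@text.com", so treat this as a starting point rather than a validator.

```python
import re

# Simplified, illustrative pattern: "text" + "@" + "text" + ".com".
# [\w.]+ : one or more letters, digits, underscores or dots
# \.com  : a literal ".com" (the dot is escaped)
email_pattern = r'[\w.]+@\w+\.com'

# Hypothetical sample text, not taken from the notebook
sample = "Please write to ali@example.com before Friday."

match = re.search(email_pattern, sample)
found = match.group() if match else None
print(found)  # ali@example.com
```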

In [2]:
text = "Ali's phone number is 408-555-1234. Call soon!"

In [3]:
'phone' in text

True

In [4]:
import re

In [5]:
pattern = 'phone'

In [6]:
re.search(pattern,text)

<re.Match object; span=(6, 11), match='phone'>

In [7]:
pattern = 'NOT IN TEXT'

In [9]:
re.search(pattern,text)

In [10]:
# 'NOT IN TEXT' does not appear in text, so re.search returns None and the cell shows no output
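
Because re.search returns None when nothing matches, calling a method such as .span() on the result would fail in that case. A small sketch of checking for None first (the helper name `describe_match` is hypothetical, introduced only for this example):

```python
import re

text = "Ali's phone number is 408-555-1234. Call soon!"

def describe_match(pattern, text):
    """Hypothetical helper: report where a pattern was found, if at all."""
    match = re.search(pattern, text)
    if match is None:
        # No match: re.search gave us None instead of a Match object
        return f"{pattern!r} not found"
    return f"{pattern!r} found at {match.span()}"

print(describe_match('phone', text))        # 'phone' found at (6, 11)
print(describe_match('NOT IN TEXT', text))  # 'NOT IN TEXT' not found
```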

In [14]:
pattern = 'phone'

In [15]:
match = re.search(pattern,text)

In [16]:
match

<re.Match object; span=(6, 11), match='phone'>

In [17]:
match.span()

(6, 11)

In [18]:
match.start()

6

In [19]:
match.end()

11
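
The start and end values are ordinary string indexes (the end is exclusive), so they can be used to slice the original text. A short sketch using the same text:

```python
import re

text = "Ali's phone number is 408-555-1234. Call soon!"
match = re.search('phone', text)

start, end = match.span()

matched = text[start:end]  # the matched text itself
before = text[:start]      # everything before the match
after = text[end:]         # everything after the match

print(matched)  # phone
print(before)   # Ali's (followed by a space)
print(after)
```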

In [20]:
# Now put 'phone' in the text twice and see what happens

In [21]:
text = 'my phone once, my phone twice'

In [22]:
re.search('phone',text)

<re.Match object; span=(3, 8), match='phone'>

In [23]:
# re.search returns only the first match

In [28]:
# Now find every occurrence of 'phone'

In [24]:
matches = re.findall('phone',text)

In [25]:
matches

['phone', 'phone']

In [26]:
len(matches)

2

In [30]:
for match in re.finditer('phone',text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(18, 23), match='phone'>


In [31]:
for match in re.finditer('phone',text):
    print(match.span())

(3, 8)
(18, 23)


In [32]:
for match in re.finditer('phone',text):
    print(match.group())

phone
phone
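
The difference between the two calls: re.findall returns plain strings, while re.finditer yields Match objects, so positions are only available from finditer. A minimal sketch that collects the start index of every match:

```python
import re

text = 'my phone once, my phone twice'

# findall gives only the matched strings
words = re.findall('phone', text)

# finditer gives Match objects, so we can also ask where each match is
positions = [m.start() for m in re.finditer('phone', text)]

print(words)      # ['phone', 'phone']
print(positions)  # [3, 18]
```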


![image.png](attachment:33629b68-fc3f-44ed-8765-c7666be4fb06.png)


In [36]:
# How to find something that matches a pattern in a large data set:

In [37]:
text = "Ali's phone number is 408-555-1234. Call soon!"

In [38]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [41]:
phone.group()

'408-555-1234'
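
In the pattern above, \d stands for any single digit, and the r prefix makes it a raw string so Python passes the backslash through to 're' unchanged. A quick sketch of \d on its own:

```python
import re

text = "Ali's phone number is 408-555-1234. Call soon!"

# \d matches one digit at a time, so findall returns each digit separately
digits = re.findall(r'\d', text)
joined = ''.join(digits)

print(digits)
print(joined)  # 4085551234

# The raw string r'\d' is two characters: a backslash and the letter d
print(len(r'\d'))  # 2
```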

In [42]:
# What if this phone number was 100 digits long? We would not want to write \d\d\d\d\d ...

In [43]:
phone

<re.Match object; span=(22, 34), match='408-555-1234'>

In [44]:
# We can use quantifiers to indicate repetition of the same character

### Quantifiers

![image.png](attachment:2b7e7e92-79c6-4826-89f4-e447149acfa3.png)
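
A few commonly used quantifiers, sketched on the same text. This is not a full list, and the patterns here are chosen only for illustration:

```python
import re

text = "Ali's phone number is 408-555-1234. Call soon!"

exactly_three = re.search(r'\d{3}', text).group()  # {3}: exactly 3 digits
four_or_more = re.search(r'\d{4,}', text).group()  # {4,}: 4 or more digits
digit_runs = re.findall(r'\d+', text)              # +: one or more digits
o_runs = re.findall(r'o+', text)                   # +: one or more 'o'
optional_s = re.search(r'phones?', text).group()   # ?: the 's' may be absent

print(exactly_three)  # 408
print(four_or_more)   # 1234
print(digit_runs)     # ['408', '555', '1234']
print(o_runs)         # ['o', 'oo']
print(optional_s)     # phone
```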

In [45]:
phone = re.search(r'\d{3}-\d{3}-\d{4}', text)

In [46]:
phone

<re.Match object; span=(22, 34), match='408-555-1234'>

In [47]:
# Find a phone number and extract its first 3 digits (the area code, here 408)

In [48]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [49]:
results = re.search(phone_pattern,text)

In [50]:
results.group()

'408-555-1234'

In [51]:
results.group(1)  # capture groups are numbered from 1; group(0) is the whole match

'408'

In [52]:
results.group(2)

'555'

In [53]:
results.group(3)

'1234'

In [55]:
results.group(4)

IndexError: no such group
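
The pattern has only three capture groups, which is why group(4) fails. A short sketch of getting all groups at once with .groups() and of checking how many groups a compiled pattern has (the variable names `area_code`, `exchange` and `line_number` are just labels chosen for this example):

```python
import re

text = "Ali's phone number is 408-555-1234. Call soon!"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = phone_pattern.search(text)

# All capture groups as a tuple
parts = results.groups()
print(parts)  # ('408', '555', '1234')

# Unpack the tuple into named variables
area_code, exchange, line_number = parts

# The compiled pattern knows how many capture groups it has
group_count = phone_pattern.groups
print(group_count)  # 3

# Asking for a group that does not exist raises IndexError
try:
    results.group(4)
    error_message = None
except IndexError as err:
    error_message = str(err)
print(error_message)  # no such group
```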