In [1]:
text = "The agent's phone number is 455-666-4365. Call soon"

In [2]:
'phone' in text

True

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern, text)

<re.Match object; span=(12, 17), match='phone'>

In [6]:
match = re.search(pattern, text)

In [7]:
match

<re.Match object; span=(12, 17), match='phone'>

In [8]:
match.span()

(12, 17)

In [9]:
match.start()

12

In [10]:
match.end()

17

In [11]:
# re.search returns only the first match, even if the text contains several
text = 'my phone once, my phone twice'

In [12]:
match = re.search('phone', text)

In [13]:
match

<re.Match object; span=(3, 8), match='phone'>

In [14]:
# To find all matches
matches = re.findall('phone', text)

In [17]:
matches

['phone', 'phone']

In [19]:
for match in matches:
    print(match)

phone
phone


In [20]:
# To find all the matches along with their positions, iterate over re.finditer

In [16]:
for match in re.finditer('phone', text):
    print(match)

<re.Match object; span=(3, 8), match='phone'>
<re.Match object; span=(18, 23), match='phone'>
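
Since each item yielded by `re.finditer` is a full match object, the positions can be collected and reused. The following is a minimal sketch using the same sentence as above; the variable names `spans` and `pieces` are only illustrative.

```python
import re

text = 'my phone once, my phone twice'

# Collect the (start, end) span of every match.
spans = [m.span() for m in re.finditer('phone', text)]
print(spans)  # [(3, 8), (18, 23)]

# Each span can be used to slice the matched text back out of the string.
pieces = [text[start:end] for start, end in spans]
print(pieces)  # ['phone', 'phone']
```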


# Patterns

So far we've learned how to search for a basic string. What about more complex examples, such as finding a telephone number or an email address in a large string of text?

We could just use the search method if we knew the exact phone number or email address, but what if we don't? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

The syntax may appear strange at first, but take your time with it; often it's just a matter of looking up the pattern code.

Let's begin!

## Identifiers for Characters in Patterns

Character types such as a digit or a letter have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string tells Python to treat it as a raw string, so each \ in the pattern reaches the regex engine as-is instead of being interpreted as a string escape.
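
To make the effect of the r prefix concrete, here is a small sketch comparing a regular string with a raw string. The variable names `plain`, `raw`, and `same` are only for illustration.

```python
# Without the r prefix, Python processes escape sequences first.
plain = '\n'   # a single newline character
raw = r'\n'    # two characters: a backslash followed by 'n'

print(len(plain))  # 1
print(len(raw))    # 2

# For the digit identifier, the raw string and the doubled-backslash
# spelling produce the same two-character pattern.
same = (r'\d' == '\\d')
print(same)  # True
```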

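Below is a table of the most common identifiers:

| Identifier | Matches |
| --- | --- |
| `\d` | A digit |
| `\w` | An alphanumeric character or underscore |
| `\s` | A whitespace character |
| `\D` | A non-digit |
| `\W` | A non-alphanumeric character |
| `\S` | A non-whitespace character |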
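
As a quick illustration of a few common identifiers, the sketch below runs `re.findall` with `\d`, `\w`, `\s`, and `\D` on a made-up sample string (the string and variable names are assumptions for the example, not part of the original notebook).

```python
import re

sample = "Room 42, floor_3!"

digits = re.findall(r'\d', sample)       # individual digit characters
words = re.findall(r'\w+', sample)       # runs of letters, digits, or underscores
spaces = re.findall(r'\s', sample)       # whitespace characters
non_digits = re.findall(r'\D+', sample)  # runs of anything that is not a digit

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'floor_3']
print(spaces)      # [' ', ' ']
print(non_digits)  # ['Room ', ', floor_', '!']
```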

In [21]:
text = "My phone number is 405-567-4533"

In [22]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [23]:
phone

<re.Match object; span=(19, 31), match='405-567-4533'>

In [24]:
phone.group()

'405-567-4533'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

In [26]:
my_phone = re.search(r'\d{3}-\d{3}-\d{4}', text)

In [27]:
my_phone

<re.Match object; span=(19, 31), match='405-567-4533'>

In [28]:
my_phone = re.search(r'\d*-\d*-\d*', text)

In [29]:
my_phone

<re.Match object; span=(19, 31), match='405-567-4533'>

In [32]:
my_phone = re.search(r'\d+-\d+-\d+', text)

In [40]:
my_phone

<re.Match object; span=(19, 31), match='405-567-4533'>
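
All three quantifier styles match the phone number above, but they are not equivalent. The sketch below, using made-up test strings, shows where they differ: `{n}` demands an exact count, `*` accepts zero or more, `+` requires at least one, and `{min,max}` accepts a range.

```python
import re

text = "My phone number is 405-567-4533"

# {3} and {4} require exactly that many digits.
exact = re.search(r'\d{3}-\d{3}-\d{4}', text).group()
print(exact)  # 405-567-4533

# * allows zero digits, so two bare hyphens also match.
star = re.search(r'\d*-\d*-\d*', 'a--b').group()
print(star)  # --

# + requires at least one digit in each position, so bare hyphens do not match.
plus = re.search(r'\d+-\d+-\d+', 'a--b')
print(plus)  # None

# {2,3} accepts between two and three digits.
ranged = re.findall(r'\d{2,3}', '1 22 333 4444')
print(ranged)  # ['22', '333', '444']
```

For a fixed format such as a phone number, the exact-count form is usually the safer choice because it rejects malformed input.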

## Groups

What if we wanted to do two tasks: find phone numbers, but also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together regular expressions (so that we can later break them down).

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [36]:
# Grouping
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [37]:
results = re.search(phone_pattern, text)

In [38]:
results.group()

'405-567-4533'

In [39]:
# Separating the results: group(1) is the first parenthesized group
results.group(1)

'405'
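
The remaining groups can be accessed the same way. The sketch below repeats the pattern from above and uses `groups()` to get all captured parts at once; the names `area`, `exchange`, and `line` are illustrative labels, not terms from the original.

```python
import re

text = "My phone number is 405-567-4533"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# A compiled pattern has its own search method.
results = phone_pattern.search(text)

# groups() returns every captured group as a tuple.
all_groups = results.groups()
print(all_groups)  # ('405', '567', '4533')

# Groups are numbered from 1; group(0) is the whole match.
print(results.group(2))  # 567
print(results.group(0))  # 405-567-4533

# Tuple unpacking gives each part a readable name.
area, exchange, line = all_groups
print(area)  # 405
```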

## Additional Regex Syntax

### Or operator |

Use the pipe operator to create an **or** statement. For example:

In [41]:
re.search(r'cat|dog', 'There is a dog')

<re.Match object; span=(11, 14), match='dog'>
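
The pipe also works with `re.findall`, and `re.search` returns `None` when neither alternative is present. A short sketch with made-up sentences:

```python
import re

# findall returns every occurrence of either alternative, in order.
animals = re.findall(r'cat|dog', 'A cat chased a dog past another cat')
print(animals)  # ['cat', 'dog', 'cat']

# When neither option appears, search returns None.
missing = re.search(r'cat|dog', 'There is a bird')
print(missing)  # None
```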

### The Wildcard Character

Use a "wildcard" as a placeholder that will match any character placed there (except a newline, by default). You can use a simple period **.** for this. For example:

In [46]:
re.findall(r'at', 'The cat in the hat went splat')

['at', 'at', 'at']

In [49]:
re.findall(r'.at', 'The cat in the hat went splat')

['cat', 'hat', 'lat']

In [50]:
re.findall(r'..at', 'The cat in the hat went splat')

[' cat', ' hat', 'plat']

In [51]:
re.findall(r'...at', 'The cat in the hat went splat')

['e cat', 'e hat', 'splat']
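
Because the period matches spaces as well as letters, a fixed number of wildcards rarely lines up with whole words. One possible alternative, sketched below, is to combine the non-whitespace identifier `\S` with the `+` quantifier. The second half shows that a literal period must be escaped as `\.`; the test strings are made up for the example.

```python
import re

sentence = 'The cat in the hat went splat'

# \S+ matches one or more non-whitespace characters before 'at'.
whole_words = re.findall(r'\S+at', sentence)
print(whole_words)  # ['cat', 'hat', 'splat']

# An unescaped period matches any character; \. matches only a literal period.
loose = re.findall(r'\d.\d', '3.5 and 3x5')
strict = re.findall(r'\d\.\d', '3.5 and 3x5')
print(loose)   # ['3.5', '3x5']
print(strict)  # ['3.5']
```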

### Starts with and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [53]:
re.findall(r'^\d', '1 is number')

['1']

In [52]:
re.findall(r'^\d', 'I have 23 cats') # Returns an empty list because ^ matches only at the start of the whole string

[]

In [59]:
re.findall(r'\d$', 'I have 23 cats, black are 3') # $ matches only at the end of the whole string

['3']
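
Using both anchors together is a common way to check that an entire string fits a format. The sketch below reuses the phone number pattern from earlier; the name `phone_only` and the test strings are assumptions for the example. It also shows the standard `re.MULTILINE` flag, which makes **^** match at the start of every line rather than only at the start of the string.

```python
import re

phone_only = r'^\d{3}-\d{3}-\d{4}$'

# The whole string is a phone number, so this matches.
exact = re.search(phone_only, '405-567-4533')
print(exact is not None)  # True

# Extra text around the number prevents a match.
embedded = re.search(phone_only, 'Call 405-567-4533 now')
print(embedded)  # None

lines = '1 apple\n2 pears'

# By default ^ anchors to the start of the string only.
first_only = re.findall(r'^\d', lines)
print(first_only)  # ['1']

# With re.MULTILINE, ^ anchors to the start of each line.
per_line = re.findall(r'^\d', lines, flags=re.MULTILINE)
print(per_line)  # ['1', '2']
```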

### Exclusion

To exclude characters, we can use the **^** symbol in conjunction with a set of brackets **[]**. Anything inside the brackets is excluded. For example:

In [60]:
phrase = 'There are 3 numbers 34 inside 5 this sentence'

In [61]:
pattern = r'[^\d]' 

In [62]:
re.findall(pattern, phrase)

['T',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e']

In [64]:
phrase2 = 'This is a string! But it has punctuation. How can we remove it?'

In [65]:
re.findall(r'[^!.?]+', phrase2) # Removing the punctuation marks

['This is a string', ' But it has punctuation', ' How can we remove it']

In [67]:
clean = re.findall(r'[^!.? ]+', phrase2) # Removing the punctuation as well as the spaces, by adding a space to the excluded set

In [69]:
clean

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [71]:
' '.join(clean)

'This is a string But it has punctuation How can we remove it'
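
An alternative to the findall-and-join approach is `re.sub`, which replaces every match with a given string. The sketch below deletes the punctuation directly by substituting an empty string; for this sentence it produces the same result as the join above.

```python
import re

phrase2 = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each punctuation mark in the set with nothing.
no_punct = re.sub(r'[!.?]', '', phrase2)
print(no_punct)
# This is a string But it has punctuation How can we remove it
```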

In [73]:
text_text = "Only find the hypen-words in this sentence. But you do not know how long-ish they are"

In [74]:
pattern = r'[\w]+'

In [81]:
re.findall(pattern, text_text)

['Only',
 'find',
 'the',
 'hypen',
 'words',
 'in',
 'this',
 'sentence',
 'But',
 'you',
 'do',
 'not',
 'know',
 'how',
 'long',
 'ish',
 'they',
 'are']

In [77]:
pattern_hyphon = r'[\w]+-[\w]+'

In [80]:
re.findall(pattern_hyphon, text_text)

['hypen-words', 'long-ish']
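
The pattern above expects exactly one hyphen per word. A possible extension for words with several hyphens, sketched below on a made-up sentence, repeats the hyphen-plus-word part with `(?:...)+`. The `(?:` prefix makes the parentheses non-capturing, so `findall` still returns the whole match. Note also that `[\w]` is equivalent to plain `\w`.

```python
import re

sample = 'a state-of-the-art, well-known idea'

# One hyphen only: longer chains are split into pairs.
single = re.findall(r'\w+-\w+', sample)
print(single)  # ['state-of', 'the-art', 'well-known']

# One or more hyphenated parts: the full chain is kept together.
multi = re.findall(r'\w+(?:-\w+)+', sample)
print(multi)  # ['state-of-the-art', 'well-known']
```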

## Brackets for Grouping

As we showed above, we can use brackets to group together options. For example, the pattern r'[\w]+-[\w]+' finds hyphenated words.

## Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

In [84]:
# Find words that start with cat and end with one of these options: 'fish', 'nap', or 'erpillar'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [89]:
re.search(r'cat(fish|nap|erpillar)', textthree)

<re.Match object; span=(26, 37), match='caterpillar'>
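
One thing to watch for: when a pattern contains a capturing group, `re.findall` returns the group contents rather than the whole match. The sketch below applies the same pattern to all three sentences and then shows a non-capturing group `(?:...)` as one way to get the full words back from `findall`. The list name `texts` and the short test string are assumptions for the example.

```python
import re

texts = [
    'Hello, would you like some catfish?',
    'Hello, would you like to take a catnap?',
    'Hello, have you seen this caterpillar?',
]

pattern = re.compile(r'cat(fish|nap|erpillar)')

# group() with no argument returns the whole match for each sentence.
whole = [pattern.search(t).group() for t in texts]
print(whole)  # ['catfish', 'catnap', 'caterpillar']

# With a capturing group, findall returns only the captured part.
captured = re.findall(r'cat(fish|nap|erpillar)', 'catfish catnap')
print(captured)  # ['fish', 'nap']

# A non-capturing group keeps the alternation but returns full matches.
full = re.findall(r'cat(?:fish|nap|erpillar)', 'catfish catnap')
print(full)  # ['catfish', 'catnap']
```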