## Regular Expressions 

Regular expressions let us search for a specific pattern in text.

The key thing to keep in mind is that every character type has a corresponding pattern code.

For example, digits have the placeholder pattern code \d

The backslash tells Python that \d is a special code and not the literal letter "d"

In [2]:
text="The phone number of the agent is 400-673-1223. Call Now!"
"400-673-1223" in text

True

### Searching for the basic patterns

In [3]:
import re

In [4]:
text

'The phone number of the agent is 400-673-1223. Call Now!'

In [5]:
pattern='phone'
my_match=re.search(pattern,text)

In [6]:
my_match

<re.Match object; span=(4, 9), match='phone'>

In [7]:
my_match.span()   # indexes of the match

(4, 9)

In [8]:
my_match.start()  # Starting index of the match

4

In [10]:
my_match.end()   # Ending index of the match

9

In [11]:
text1="My phone is a new phone"
pattern='phone'
my_match=re.search(pattern,text1)

In [13]:
my_match

<re.Match object; span=(3, 8), match='phone'>

In [14]:
# If the pattern occurs more than once in the text, re.search returns only the first match, not the later ones
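
As a small sketch of that behaviour, the snippet below repeats the search on the same sentence and also shows what happens when the pattern is absent. The second pattern, `"tablet"`, is a made-up example that does not appear in the original notebook.

```python
import re

text1 = "My phone is a new phone"

# re.search stops at the first occurrence, so only one span is reported.
first = re.search("phone", text1)
print(first.span())  # (3, 8)

# When nothing matches, re.search returns None rather than a match object,
# so check the result before calling methods such as .span() on it.
missing = re.search("tablet", text1)  # "tablet" is an illustrative pattern
if missing is None:
    print("no match found")
```

Checking for `None` first avoids an `AttributeError` when the pattern is not found.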

**To find every match of the pattern in the text**

In [16]:
# We will use the findall method to find every occurrence of the pattern in the text

In [17]:
all_matches=re.findall(pattern,text1)

In [18]:
all_matches

['phone', 'phone']

In [19]:
len(all_matches)

2

To get actual match objects, use the iterator:

In [21]:
for match in re.finditer(pattern,text1):
    print(match.group(), match.span())

phone (3, 8)
phone (18, 23)


## Patterns

### Identifiers for Characters in Patterns

Character types such as a digit or a whitespace character have different codes that represent them. You can use these codes to build up a pattern string. Notice how they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string tells Python to keep each \ in the pattern as a literal backslash instead of treating it as the start of a string escape sequence.
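
The difference is easiest to see with a code that is also a Python string escape. The sketch below uses `\b`, the word-boundary code; it is not listed in this notebook's table and is used here only as an illustration, along with the made-up sentence.

```python
import re

# In a normal string, "\b" is a single backspace character;
# in a raw string, r"\b" is two characters: a backslash and the letter b.
print(len("\b"), len(r"\b"))  # 1 2

sentence = "cat concat cat"  # illustrative text

# With a raw string the regex engine receives \b, its word-boundary code.
raw_hits = re.findall(r"\bcat\b", sentence)

# Without the r prefix the engine receives backspace characters instead,
# which do not occur in the sentence, so nothing matches.
plain_hits = re.findall("\bcat\b", sentence)

print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```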

Below is a table of the most common identifiers:
                               

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letter, digit, or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
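
One way to sanity-check the table is to run each example pattern against its example match. The sketch below does that with `re.fullmatch`, which is an addition here (the notebook itself uses `re.search` and `re.findall`); the variable names are illustrative.

```python
import re

# (pattern, example match) pairs copied from the table above.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# re.fullmatch succeeds only if the whole example string fits the pattern.
results = [re.fullmatch(p, s) is not None for p, s in examples]

for (p, s), ok in zip(examples, results):
    print(f"{p!r:18} {s!r:10} {ok}")
```

Every row should report `True`. Note that `A-b_1` matches because `\w` also accepts the underscore.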

In [23]:
text

'The phone number of the agent is 400-673-1223. Call Now!'

In [24]:
pattern=r"\d\d\d-\d\d\d-\d\d\d\d"
phone_number=re.search(pattern,text)

In [25]:
phone_number.span()

(33, 45)

In [26]:
phone_number.group()

'400-673-1223'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
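
The quantifier rows can be checked the same way. This is a minimal sketch: `re.fullmatch` and the digit string `"1234567"` are illustrative additions, not part of the original notebook.

```python
import re

# (pattern, example match) pairs from the quantifier table above.
examples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "abc"),
    (r"\d{2,4}", "123"),
    (r"\w{3,}", "anycharacters"),
    (r"A*B*C*", "AAACC"),
    (r"plurals?", "plural"),
]

# Each example string should be matched in full by its pattern.
results = [re.fullmatch(p, s) is not None for p, s in examples]
print(results)

# Quantifiers are greedy: {2,4} grabs as many digits as it can (up to 4)
# before the search moves on to the remaining characters.
chunks = re.findall(r"\d{2,4}", "1234567")
print(chunks)  # ['1234', '567']
```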

In [28]:
re.search(r"\d{3}-\d{3}-\d{4}",text)

<re.Match object; span=(33, 45), match='400-673-1223'>

## Groups

In [29]:
phone_pattern=re.compile(r"(\d{3})-(\d{3})-(\d{4})")
my_match=re.search(phone_pattern,text)

In [30]:
my_match.group()  # the entire match (same as group(0))

'400-673-1223'

In [32]:
my_match.group(1)

'400'

In [33]:
my_match.group(2)

'673'

In [34]:
my_match.group(3)

'1223'
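
Two related conveniences are worth a quick sketch: `groups()` returns every captured group at once, and the `(?P<name>...)` syntax lets you refer to groups by name. The group names `area`, `exchange`, and `line` below are hypothetical labels chosen for this example.

```python
import re

text = "The phone number of the agent is 400-673-1223. Call Now!"

# groups() returns all captured groups as a tuple.
numbered = re.search(r"(\d{3})-(\d{3})-(\d{4})", text)
print(numbered.groups())  # ('400', '673', '1223')

# Named groups: the names are illustrative, not from the notebook.
named = re.search(
    r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})", text
)
print(named.group("area"))  # 400
print(named.groupdict())
```

Named groups tend to make longer patterns easier to read than positional indexes.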

## Or Operators

In [36]:
re.search(r"man|woman","there was a woman here")

<re.Match object; span=(12, 17), match='woman'>

In [37]:
re.search(r"man|woman","there was a man here")

<re.Match object; span=(12, 15), match='man'>
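
The `|` operator applies to everything on either side of it, so it is often combined with a group to limit its scope. The sketch below assumes a non-capturing group `(?:...)`, which this notebook has not introduced, and uses made-up sample strings.

```python
import re

# Both alternatives can be found in one pass with findall.
both = re.findall(r"man|woman", "a woman and a man")
print(both)  # ['woman', 'man']

# Without a group, | splits the whole pattern: "gra" OR "ey".
unscoped = re.findall(r"gra|ey", "gray grey")
print(unscoped)  # ['gra', 'ey']

# A non-capturing group restricts the choice to the single letter.
scoped = re.findall(r"gr(?:a|e)y", "gray grey griy")
print(scoped)  # ['gray', 'grey']
```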

## The wildcard character
Use a "wildcard" as a placeholder that matches any single character in that position (except a newline, by default). The period . serves as the wildcard.

In [38]:
re.findall(r".at","There is a cat in the hat on the mat")

['cat', 'hat', 'mat']

In [39]:
re.findall(r"..at","This bat went splat")

[' bat', 'plat']

In [40]:
re.findall(r"\S+at","This bat went splat")

['bat', 'splat']
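
Two details of the wildcard are easy to miss: by default it does not match a newline, and a literal period has to be escaped as `\.`. The sketch below illustrates both, using a made-up sample string and the `re.DOTALL` flag.

```python
import re

sample = "abc a\nc a.c"  # illustrative text containing a newline

# By default the dot matches any character except a newline.
default_hits = re.findall(r"a.c", sample)
print(default_hits)  # ['abc', 'a.c']

# With re.DOTALL the dot matches newlines as well.
dotall_hits = re.findall(r"a.c", sample, flags=re.DOTALL)
print(dotall_hits)  # ['abc', 'a\nc', 'a.c']

# Escape the dot to match a literal period only.
literal_hits = re.findall(r"a\.c", sample)
print(literal_hits)  # ['a.c']
```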

## Starts with and ends with
We can use ^ to anchor a pattern to the start of the string and $ to anchor it to the end:

In [44]:
re.findall(r'\d$',"5,2,3 This string ends with and starts with a digit 1")

['1']

In [45]:
re.findall(r"^\d","5,2,4 This string starts and ends with a digit 2")

['5']
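
By default these anchors refer to the whole string, not to individual lines. The sketch below assumes a multi-line string (made up for this example) and uses the `re.MULTILINE` flag to make `^` and `$` apply to each line.

```python
import re

lines = "1 apple\n2 pears\n3 plums"  # illustrative multi-line text

# Without flags, ^ matches only at the very start of the string.
start_default = re.findall(r"^\d", lines)
print(start_default)  # ['1']

# With re.MULTILINE, ^ also matches right after each newline.
start_multi = re.findall(r"^\d", lines, flags=re.MULTILINE)
print(start_multi)  # ['1', '2', '3']

# Likewise, $ then matches at the end of every line.
end_multi = re.findall(r"\w+$", lines, flags=re.MULTILINE)
print(end_multi)  # ['apple', 'pears', 'plums']
```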

## Exclusion
To exclude characters, place the ^ symbol as the first character inside a set of brackets []. The pattern then matches any character that is not listed inside the brackets.

In [46]:
line='This string 3 include 34 some numbers 253'

re.findall(r'[^\d]',line)

['T',
 'h',
 'i',
 's',
 ' ',
 's',
 't',
 'r',
 'i',
 'n',
 'g',
 ' ',
 ' ',
 'i',
 'n',
 'c',
 'l',
 'u',
 'd',
 'e',
 ' ',
 ' ',
 's',
 'o',
 'm',
 'e',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ']

In [48]:
f=re.findall(r'[^\d]+',line)

In [49]:
''.join(f)

'This string  include  some numbers '

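We can use this to strip punctuation from a string. The set below lists only the comma, exclamation mark, question mark, and space, so other characters such as the apostrophe are kept: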

In [50]:
my_str="In this string, we have punctuations! But we don't want them?"
re.findall(r"[^,!? ]+",my_str)

['In',
 'this',
 'string',
 'we',
 'have',
 'punctuations',
 'But',
 'we',
 "don't",
 'want',
 'them']

In [51]:
" ".join(re.findall(r"[^,!? ]+",my_str))

"In this string we have punctuations But we don't want them"

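As a possible alternative to splitting and re-joining, `re.sub` can delete the unwanted characters directly. This is a sketch of a different technique from the one the notebook uses, applied to the same string.

```python
import re

my_str = "In this string, we have punctuations! But we don't want them?"

# Replace every comma, exclamation mark, or question mark with nothing.
# Spaces are left alone, so no join step is needed afterwards.
cleaned = re.sub(r"[,!?]", "", my_str)
print(cleaned)
# In this string we have punctuations But we don't want them
```

Here the bracket set lists the characters to remove, so no ^ is needed inside it.
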
## Brackets for grouping

In [53]:
line_a= 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'
re.findall(r"[\w]+-[\w]+",line_a)

['hyphen-words', 'long-ish']
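
Brackets define a set of allowed characters, so `[\w]+` behaves the same as `\w+`; the brackets become more useful when you list your own set. The sketch below compares the two forms on a made-up sample string and then narrows the set to letters only.

```python
import re

sample = "well-known item-42 x-ray"  # illustrative text

# A set containing only \w is equivalent to \w on its own.
with_brackets = re.findall(r"[\w]+-[\w]+", sample)
without_brackets = re.findall(r"\w+-\w+", sample)
print(with_brackets == without_brackets)  # True

# A custom set: letters only, so "item-42" no longer qualifies.
letters_only = re.findall(r"[A-Za-z]+-[A-Za-z]+", sample)
print(letters_only)  # ['well-known', 'x-ray']
```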