# Regular Expressions

In this notebook we will practice using regular expressions.

## Basic Text Search

First create some example text.

In [1]:
text = "The agent's phone number is 408-555-1234. Call soon."

Suppose we want to check whether the word 'phone' is in the sentence. We can check it with the `in` operator.

In [2]:
'phone' in text

True

Python's built-in `re` library handles regular expressions, so we import it first.

In [3]:
import re

Regular expressions help us locate patterns in text data. Suppose our pattern is the word `phone`. We can use the `re` library to locate that pattern.

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern, text)

<re.Match object; span=(12, 17), match='phone'>

The output above shows that `re` found the pattern at span (12, 17): the match starts at index 12 of the text and ends just before index 17.

In [6]:
pattern = 'Not in the text.'

In [8]:
print(re.search(pattern, text))

None


When `re.search` does not find the pattern, it simply returns `None`.
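Because a failed search returns `None`, calling a method such as `.span()` on the result would raise an error. A minimal sketch of guarding against that follows; the helper name `describe_match` is hypothetical and only used for illustration.

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon."

def describe_match(pattern, text):
    """Return a short description of the first match, or a fallback message."""
    match = re.search(pattern, text)
    if match is None:  # re.search returns None when nothing matches
        return f"{pattern!r} not found"
    return f"{pattern!r} found at {match.span()}"

found = describe_match("phone", text)
missing = describe_match("Not in the text.", text)
print(found)    # 'phone' found at (12, 17)
print(missing)  # 'Not in the text.' not found
```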

We can save the result in a match object, which lets us perform more actions on it.

In [9]:
pattern = 'phone'

In [10]:
match = re.search(pattern,text)

In [11]:
print(match)

<re.Match object; span=(12, 17), match='phone'>


As mentioned above, the `span` has a start and an end index. We can use them to slice the text and recover the matched substring.

In [12]:
match.span()

(12, 17)

In [13]:
match.start()

12

In [14]:
match.end()

17
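The slicing mentioned above can be sketched as follows, reusing the same example sentence. The variable names `sliced`, `same`, and `rest` are illustrative only.

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon."
match = re.search("phone", text)

# Slicing the text with the span recovers the matched substring.
sliced = text[match.start():match.end()]

# match.group() returns the same substring without manual slicing.
same = match.group()

# The end index can also be used to take everything after the match.
rest = text[match.end():]

print(sliced)  # phone
print(rest)    #  number is 408-555-1234. Call soon.
```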

Sometimes the pattern occurs more than once. `re.search` returns only the first match, whereas `re.findall` returns every instance of the pattern. Let's create a new example text.

In [15]:
text = "My new phone is an Android phone."

In [18]:
match = re.search('phone', text)

In [30]:
match.span()

(7, 12)

We can see that it matched only the first instance of the word 'phone'. 

To match all of them we will use the `findall` function.

In [40]:
match = re.findall('phone', text)

In [41]:
match

['phone', 'phone']

In [33]:
len(match)

2

To get the location of every match, we can loop over the iterator returned by `re.finditer`, which yields one match object per occurrence.

In [38]:
for match in re.finditer('phone',text):
    print(match.span())

(7, 12)
(27, 32)
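If the matches are needed later rather than just printed, one option is to collect them in a list. This is a small sketch using a list comprehension; the variable name `occurrences` is illustrative.

```python
import re

text = "My new phone is an Android phone."

# Each item pairs the matched text with its (start, end) span.
occurrences = [(m.group(), m.span()) for m in re.finditer("phone", text)]

print(occurrences)  # [('phone', (7, 12)), ('phone', (27, 32))]
```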


## Patterns

So far we've learned how to search for a basic string. What about more complex cases, such as finding a telephone number or an email address in a large string of text?

We could use the `search` function directly if we knew the exact phone number or email address, but what if we don't? We may know only the general format, and regular expressions let us search the document for strings that match that format.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
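One way to check the table rows is to run each example pattern against its example match. The sketch below uses `re.fullmatch`, which succeeds only if the whole string matches the pattern; the names `examples` and `results` are illustrative.

```python
import re

# (pattern, sample) pairs copied from the table above.
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# True when the entire sample matches its pattern.
results = {
    pattern: re.fullmatch(pattern, sample) is not None
    for pattern, sample in examples
}
print(results)

# A counterexample: 'a' is not a digit, so this does not match.
mismatch = re.fullmatch(r"file_\d\d", "file_2a")
print(mismatch)  # None
```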

In [44]:
text = "To contact the IRS anonymous tip center call 1-800-829-0433"

Suppose we want to locate the phone number listed in the text.

In [47]:
pattern = r"\d-\d\d\d-\d\d\d-\d\d\d\d"

In [49]:
match = re.search(pattern, text)
match.group()

'1-800-829-0433'

We have repeated `\d` many times. We can avoid this repetition by using quantifiers.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
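The quantifier rows can be checked the same way, by matching each example pattern against its example string with `re.fullmatch`. This is a sketch; the names `quantifier_examples` and `quantifier_results` are illustrative.

```python
import re

# (pattern, sample) pairs based on the quantifier table above.
quantifier_examples = [
    (r"Version \w-\w+", "Version A-b1_1"),  # + : one or more
    (r"\D{3}", "abc"),                      # {3} : exactly three
    (r"\d{2,4}", "123"),                    # {2,4} : two to four
    (r"\w{3,}", "anycharacters"),           # {3,} : three or more
    (r"A*B*C*", "AAACC"),                   # * : zero or more (no B here)
    (r"plurals?", "plural"),                # ? : optional trailing 's'
]

quantifier_results = [
    re.fullmatch(pattern, sample) is not None
    for pattern, sample in quantifier_examples
]
print(quantifier_results)

# {2,4} rejects a single digit because at least two are required.
too_short = re.fullmatch(r"\d{2,4}", "1")
print(too_short)  # None
```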

In [62]:
pattern = re.compile(r"(\d)-(\d{3})-(\d{3})-(\d{4})")
match = re.search(pattern, text)
match.group()

'1-800-829-0433'

Because the pattern contains capture groups (the parts in parentheses), we can retrieve each piece of the match by its group number, counting from 1.

In [66]:
match.group(2)

'800'

In [68]:
match.group(4)

'0433'
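Two related tools may be useful here: `match.groups()` returns all captured groups at once, and named groups let us refer to a group by name instead of by position. The group names below (`country`, `area`, `prefix`, `line`) are hypothetical labels chosen for this sketch.

```python
import re

text = "To contact the IRS anonymous tip center call 1-800-829-0433"

# (?P<name>...) gives a capture group a name.
pattern = re.compile(
    r"(?P<country>\d)-(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})"
)
match = pattern.search(text)

all_groups = match.groups()  # every captured group as a tuple
area = match.group("area")   # look up a group by name
whole = match.group(0)       # group 0 is the entire match
as_dict = match.groupdict()  # named groups as a dictionary

print(all_groups)  # ('1', '800', '829', '0433')
print(area)        # 800
```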

## Additional Regex

### OR operator |

In [70]:
text = "Jim likes oats and milk."
pattern = r"cereal|oats"
re.search(pattern,text)

<re.Match object; span=(10, 14), match='oats'>
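One detail worth checking when writing alternations: spaces inside a pattern are literal characters, so any whitespace around `|` becomes part of the alternatives. The sketch below compares the two spellings; the variable names are illustrative.

```python
import re

text = "Jim likes oats and milk."

# With spaces, the alternatives are 'cereal ' and ' oats'.
with_spaces = re.search(r"cereal | oats", text)
# Without spaces, the alternatives are 'cereal' and 'oats'.
without_spaces = re.search(r"cereal|oats", text)

print(repr(with_spaces.group()))     # ' oats'
print(repr(without_spaces.group()))  # 'oats'

# The spaced version fails when 'oats' has no space before it.
at_start = re.search(r"cereal | oats", "oats are great")
print(at_start)  # None
```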

### The Wildcard Character

In [72]:
text = "These animals rhyme, cat, bat, rat."
pattern = r".at"
re.findall(pattern,text)

['cat', 'bat', 'rat']

In [73]:
text = "These words rhyme, cat, bat, rat, aristocrat."
pattern = r".at"
re.findall(pattern,text)

['cat', 'bat', 'rat', 'rat']

To match the entire word 'aristocrat' we can use the `\S+` pattern, which matches one or more non-whitespace characters before `at`.

In [74]:
text = "These words rhyme, cat, bat, rat, aristocrat."
pattern = r"\S+at"
re.findall(pattern,text)

['cat', 'bat', 'rat', 'aristocrat']
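Note that `\S+` accepts any non-whitespace character, including punctuation, and requires at least one character before `at`. A possible alternative, shown here only as a sketch, is to use word boundaries (`\b`) with `\w*`; the second sample sentence is made up for illustration.

```python
import re

text = "These words rhyme, cat, bat, rat, aristocrat."

# \b marks a word boundary; \w* allows zero or more word characters.
whole_words = re.findall(r"\b\w*at\b", text)
print(whole_words)  # ['cat', 'bat', 'rat', 'aristocrat']

# On text with brackets, the two patterns behave differently.
sample = "(cat) and at"
with_nonspace = re.findall(r"\S+at", sample)     # keeps the '(' and skips 'at'
with_boundary = re.findall(r"\b\w*at\b", sample)
print(with_nonspace)  # ['(cat']
print(with_boundary)  # ['cat', 'at']
```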

### Starts With and Ends With

`^` anchors a pattern to the start of the string and `$` anchors it to the end.

In [80]:
text = "3 closest airports to get to New York, JFK, LGA and EWR."
pattern = r"^[\d]"
re.findall(pattern,text)

['3']

In [85]:
text = "Q: How many states in the USA? 50"
pattern = r"\d+$"
re.findall(pattern,text)

['50']
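To see that the anchors really constrain the position, it helps to try strings where the digits are in the wrong place. In this sketch the sample strings other than the one taken from the cell above are made up for illustration.

```python
import re

# ^\d matches only when the string begins with a digit.
leading = re.search(r"^\d", "3 closest airports")
not_leading = re.search(r"^\d", "Top 3 airports")

# \d+$ matches only when the string ends with digits.
trailing = re.search(r"\d+$", "Q: How many states in the USA? 50")
not_trailing = re.search(r"\d+$", "50 states")

print(leading.group())   # 3
print(not_leading)       # None
print(trailing.group())  # 50
print(not_trailing)      # None
```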

### Exclusion

Inside square brackets, a leading `^` negates the character class, so `[^\d]` matches any character that is not a digit.

In [88]:
text = "S&P500 was 1.2% up, Nasdaq was 1.5% up and Dow was 1.05% up."
pattern = r"[^\d]"
re.findall(pattern,text)

['S',
 '&',
 'P',
 ' ',
 'w',
 'a',
 's',
 ' ',
 '.',
 '%',
 ' ',
 'u',
 'p',
 ',',
 ' ',
 'N',
 'a',
 's',
 'd',
 'a',
 'q',
 ' ',
 'w',
 'a',
 's',
 ' ',
 '.',
 '%',
 ' ',
 'u',
 'p',
 ' ',
 'a',
 'n',
 'd',
 ' ',
 'D',
 'o',
 'w',
 ' ',
 'w',
 'a',
 's',
 ' ',
 '.',
 '%',
 ' ',
 'u',
 'p',
 '.']

In [90]:
text = "S&P500 closed 20 points up, Nasdaq closed 30 points up and Dow was 157 points up."
pattern = r"[^\d]+"
re.findall(pattern,text)

['S&P',
 ' closed ',
 ' points up, Nasdaq closed ',
 ' points up and Dow was ',
 ' points up.']
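The negated class `[^\d]` behaves the same as the shorthand `\D` from the table earlier, and the complementary pattern `\d+` pulls out the numbers instead of the text around them. A short sketch, with illustrative variable names:

```python
import re

text = "S&P500 closed 20 points up, Nasdaq closed 30 points up and Dow was 157 points up."

# [^\d] and \D describe the same set of characters.
negated_class = re.findall(r"[^\d]+", text)
shorthand = re.findall(r"\D+", text)
print(negated_class == shorthand)  # True

# \d+ extracts the runs of digits that the patterns above skip.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['500', '20', '30', '157']
```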

### Removing Punctuation

In [104]:
text = 'This is a string! But it has punctuation. How can we remove it?'
pattern = r"[^\W]+"
re.findall(pattern,text)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

We can join them back to form a sentence without punctuation.

In [108]:
text = 'This is a string! But it has punctuation. How can we remove it?'
pattern = r"[^\W]+"
clean = ' '.join(re.findall(pattern,text))
clean

'This is a string But it has punctuation How can we remove it'
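An alternative that may be simpler is `re.sub`, which replaces every match with a given string. The sketch below deletes any character that is neither a word character nor whitespace; unlike the join approach, it leaves the original spacing untouched.

```python
import re

text = 'This is a string! But it has punctuation. How can we remove it?'

# [^\w\s] matches anything that is not a word character or whitespace.
clean = re.sub(r"[^\w\s]", "", text)
print(clean)  # This is a string But it has punctuation How can we remove it

# [^\W] is a double negative and matches the same characters as \w.
same_words = re.findall(r"[^\W]+", text) == re.findall(r"\w+", text)
print(same_words)  # True
```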

### Grouping

In [111]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

re.findall(r'[\w]+-[\w]+', text)

['hyphen-words', 'long-ish']

We can also match words that are spelled differently by grouping the alternatives.

In [123]:
text = 'The word gray can be spelled gray or grey'
for match in re.finditer(r"gr(e|a)y", text):
    print(match.group())

gray
gray
grey
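One thing to watch for: when a pattern contains a capture group, `re.findall` returns the captured group rather than the whole match. The sketch below shows two ways around that, a non-capturing group `(?:...)` and a character class; the variable names are illustrative.

```python
import re

text = 'The word gray can be spelled gray or grey'

# With a capture group, findall returns only the captured letter.
captured = re.findall(r"gr(e|a)y", text)
print(captured)  # ['a', 'a', 'e']

# A non-capturing group keeps the grouping but returns the full match.
non_capturing = re.findall(r"gr(?:e|a)y", text)
print(non_capturing)  # ['gray', 'gray', 'grey']

# For single-character alternatives, a character class does the same job.
char_class = re.findall(r"gr[ea]y", text)
print(char_class)  # ['gray', 'gray', 'grey']
```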
