# 1.4 Regular Expression

> This post explains how to use regular expressions to find and modify text.

- toc : true
- badges : false
- comments : false
- categories : [regular-expression, re, text-cleaning,re.findall, NPL-Chapter-1]
- image : false

## Regular expression library

In [6]:
import re

### Finding the first instance in the text

In [25]:
text = "The phone number given in the helpline is 408-999-4567"
pattern = 'phone'
re.search(pattern, text)

<re.Match object; span=(4, 9), match='phone'>

If a match is found, re.search returns a match object that records where the match is located; if nothing matches, it returns None. Note: it reports only the first occurrence in the text.

The span is the start and end index of the match. Indexing starts at zero, and the end index is exclusive.

In [7]:
match=re.search(pattern, text)
match

<re.Match object; span=(4, 9), match='phone'>

**.span() gives the span of the match, .start() gives the start index, and .end() gives the end index**

In [8]:
match.span()

(4, 9)

In [10]:
match.start()

4

In [11]:
match.end()

9
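As a small sketch of what the span means in practice, the indices can be used to slice the original string. The variable name sliced is only illustrative; the text and pattern are the ones used above.

```python
import re

text = "The phone number given in the helpline is 408-999-4567"
match = re.search("phone", text)

# The end index is exclusive, so slicing with the span recovers the match.
start, end = match.span()
sliced = text[start:end]

print(sliced)
print(sliced == match.group())
```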

### Find all instances in the text

In [14]:
text1 = "My phone is a hi-tech phone. The phone is dual band, with the lastest phone-tech processor"

In [16]:
matches = re.findall("phone", text1)
matches

['phone', 'phone', 'phone', 'phone']

In [17]:
len(matches)

4

### Using an iterator to get the span of each match

In [19]:
for match in re.finditer('phone', text1):
    print(match.span())

(3, 8)
(22, 27)
(33, 38)
(70, 75)


**To get the matched text, use the .group() method. Here match is the last match object left over from the loop above.**

In [21]:
match.group()

'phone'
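One possible way to keep the matched text and its position together is to call .group() and .span() on each match object while iterating. This is only a sketch; the list name found is an assumption made for illustration.

```python
import re

text1 = "My phone is a hi-tech phone. The phone is dual band, with the lastest phone-tech processor"

# Collect (matched text, span) pairs for every occurrence of the pattern.
found = [(m.group(), m.span()) for m in re.finditer("phone", text1)]

for word, span in found:
    print(word, span)
```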

## Identifiers in Regex

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

In [26]:
text 

'The phone number given in the helpline is 408-999-4567'

To find a phone number with the pattern xxx-xxx-xxxx, we can write one \d identifier for each digit.

In [30]:
re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text).group()

'408-999-4567'
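The identifiers from the table can be tried out directly. The sketch below uses a made-up sample string (an assumption, not part of the original post) chosen so that it contains the example matches from the table.

```python
import re

# Hypothetical sample text that contains the table's example matches.
sample = "Order A-b_1 saved as file_25"

file_names = re.findall(r'file_\d\d', sample)   # 'file_' followed by two digits
codes = re.findall(r'\w-\w\w\w', sample)        # \w also matches the underscore
spaces = re.findall(r'\s', sample)              # every whitespace character
non_digits = re.findall(r'\D\D\D', "ABC123")    # three non-digits in a row

print(file_names, codes, len(spaces), non_digits)
```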

## Quantifiers in Regex

Instead of repeating an identifier, we can use a quantifier to say how many times it should occur.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

In [33]:
re.search(r'\d{3}-\d{3}-\d{4}', text).group()

'408-999-4567'
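A short sketch that exercises the quantifiers from the table. The input strings are made up for illustration, apart from the example matches taken from the table itself.

```python
import re

# + : one or more word characters after the hyphen
version = re.search(r'Version \w-\w+', "Version A-b1_1 released").group()

# {2,4} : between two and four digits; the lone '7' is too short,
# and only the first four digits of '98765' are taken
numbers = re.findall(r'\d{2,4}', "7 42 123 98765")

# {3,} : three or more word characters
long_words = re.findall(r'\w{3,}', "an ox ate anycharacters")

# * : zero or more, so the B part may be missing entirely
abc = re.fullmatch(r'A*B*C*', "AAACC").group()

# ? : the final 's' is optional
singular = re.search(r'plurals?', "plural").group()
plural = re.search(r'plurals?', "plurals").group()

print(version, numbers, long_words, abc, singular, plural)
```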

## Groups in Regex search

Parentheses in a regex create groups that capture parts of the matched text.

In [41]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [46]:
results = re.search(phone_pattern, text)

In [48]:
results.group()

'408-999-4567'

Each pair of parentheses in the pattern defines a group, which can be retrieved by its number. Calling .group() with no argument returns the whole match.

In [51]:
results.group(1)

'408'

In [52]:
results.group(2)

'999'

In [53]:
results.group(3)

'4567'
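When all groups are needed at once, the match object's .groups() method returns them as a tuple. The sketch below also calls .search() on the compiled pattern itself, which is equivalent to passing it to re.search. The names area, exchange and line are illustrative labels, not terms used in the original post.

```python
import re

text = "The phone number given in the helpline is 408-999-4567"
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

# A compiled pattern has its own search method.
results = phone_pattern.search(text)

# groups() returns every captured group in order.
area, exchange, line = results.groups()

print(results.groups())
print(area, exchange, line)
```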

## Or operator |

In [55]:
re.search(r"man|woman", "This man is a good person")

<re.Match object; span=(5, 8), match='man'>

In [57]:
re.search(r"man|woman", "This woman is a good person")

<re.Match object; span=(5, 10), match='woman'>
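The two searches above can be combined into one call with re.findall. The sketch below also illustrates that, at a given position, alternatives are tried from left to right and the first one that succeeds is used. The sentences and the pattern wo|woman are made up for illustration.

```python
import re

# Both alternatives are found in a single pass over the sentence.
both = re.findall(r"man|woman", "A man and a woman walked in")

# Alternatives are tried left to right, so the shorter 'wo' wins here.
first_wins = re.search(r"wo|woman", "woman").group()

print(both)
print(first_wins)
```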

## Wildcard characters

In [62]:
re.findall(r".at", "The fat cat ate the peta bread and sat on the rattop and splat")

['fat', 'cat', ' at', 'sat', 'rat', 'lat']

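Every three-character sequence ending in 'at' is matched, not just three-letter words: ' at' (from ' ate') starts with a space, and 'lat' is only the tail of 'splat'. A single period is a wildcard that matches any one character in front of 'at'.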

In [64]:
re.findall(r"..at", "The fat cat ate the peta bread and sat on the rattop and splat")

[' fat', ' cat', ' sat', ' rat', 'plat']

In [67]:
re.findall(r"\S+at", "The fat cat ate the peta bread and sat on the rattop and splat")

['fat', 'cat', 'sat', 'rat', 'splat']

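Here \S+ requires one or more non-whitespace characters before 'at', so only runs of non-whitespace characters ending in 'at' are matched. Note that 'rat' is matched inside 'rattop', so a match need not be a whole word.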
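Two details of the period wildcard are easy to miss: to match a literal period it has to be escaped, and by default the wildcard does not match a newline. A minimal sketch with made-up input strings:

```python
import re

# An unescaped period matches any single character (except a newline).
loose = re.findall(r"3.14", "3.14 3x14")

# A backslash-escaped period matches only a literal '.'.
strict = re.findall(r"3\.14", "3.14 3x14")

# By default '.' does not match a newline; re.DOTALL lifts that restriction.
default_mode = re.findall(r".at", "cat\nat")
dotall_mode = re.findall(r".at", "cat\nat", flags=re.DOTALL)

print(loose, strict, default_mode, dotall_mode)
```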

## Starts with and ends with

^ anchors the pattern to the start of the string, and $ anchors it to the end.

In [70]:
re.findall(r'\d$', "This ends with a number 2")

['2']

In [72]:
re.findall(r'^\d', "5 is the number of choice")

['5']
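To see that the anchors really restrict where a match may occur, the sketch below moves the digit to the other side of the sentence and then uses both anchors together. The sentences are made up for illustration.

```python
import re

# The digit is at the start, so an end-anchored pattern finds nothing.
not_at_end = re.findall(r'\d$', "2 is not at the end")

# The digit is at the end, so a start-anchored pattern finds nothing.
not_at_start = re.findall(r'^\d', "The number is 5")

# With both anchors, the whole string must consist of digits.
all_digits = re.findall(r'^\d+$', "2024")
mixed = re.findall(r'^\d+$', "year 2024")

print(not_at_end, not_at_start, all_digits, mixed)
```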

## Exclusion

Inside square brackets, a leading ^ excludes characters: [^...] matches any character that is not listed.

In [12]:
phrase = "there are 3 numbers 34 insides 5 this sentence."


In [13]:
re.findall(r'[^\d]+', phrase)

['there are ', ' numbers ', ' insides ', ' this sentence.']
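For digits specifically, the excluded class [^\d] behaves the same as the identifier \D from the table above. The sketch below checks this and also shows that ^ only negates when it comes first inside the brackets. The string "3^2" is a made-up example.

```python
import re

phrase = "there are 3 numbers 34 insides 5 this sentence."

# [^\d] and \D describe the same set of characters.
with_brackets = re.findall(r'[^\d]+', phrase)
with_identifier = re.findall(r'\D+', phrase)

# When ^ is not the first character in the brackets, it is an ordinary character.
literal_caret = re.findall(r'[\d^]', "3^2")

print(with_brackets == with_identifier)
print(literal_caret)
```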

## Removing the punctuation

In [78]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [79]:
test_phrase

'This is a string! But it has punctuation. How can we remove it?'

In [81]:
re.findall(r'[^!.? ]+', test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

**Putting it together**

In [82]:
clean = ' '.join(re.findall(r'[^!.? ]+', test_phrase))

In [83]:
clean

'This is a string But it has punctuation How can we remove it'
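An alternative to finding the pieces and joining them is to substitute the unwanted characters directly with re.sub, which replaces every match of a pattern. This is a sketch of a different technique from the one used above, not a replacement for it; for this particular phrase both give the same result.

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Approach from the post: keep everything that is not punctuation or a space.
clean = ' '.join(re.findall(r'[^!.? ]+', test_phrase))

# Alternative: replace each punctuation mark with an empty string.
clean_sub = re.sub(r'[!.?]', '', test_phrase)

print(clean_sub)
print(clean == clean_sub)
```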

## Brackets for Grouping

In [3]:
text3 = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

In [8]:
text3

'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

In [10]:
re.findall(r'[\w]+-[\w]+', text3)

['hypen-words', 'long-ish']

**Note:** Difference between [] and ()

The [] construct in a regex is essentially shorthand for an | between each of the characters it contains. For example, [abc] matches a, b or c. Additionally, the - character has a special meaning inside []: it defines a range. The regex [a-z] will match any lowercase letter from a through z.

The () construct is a grouping construct that establishes precedence (it also captures the matched substring, as shown in the groups section above). The regex (abc) will match the string "abc".
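The difference can be checked on a couple of small strings. This is only a sketch with made-up inputs: the bracket patterns match individual characters in any order, while the parenthesised pattern needs the exact sequence.

```python
import re

# [abc] matches any single one of the listed characters.
any_of = re.findall(r'[abc]', "cab")

# [a-c]+ matches runs of characters in the range a to c.
ranged = re.findall(r'[a-c]+', "cab abc xyz")

# (abc) matches only the exact sequence 'abc'.
exact = re.search(r'(abc)', "cab abc")
no_match = re.search(r'(abc)', "cab")

print(any_of, ranged, exact.span(), no_match)
```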


## Parentheses for Multiple Options

In [15]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [16]:
re.search(r'cat(fish|nap|claw)',text).group()

'catfish'

In [17]:
re.search(r'cat(fish|nap|claw)',texttwo).group()

'catnap'

In [19]:
re.search(r'cat(fish|nap|claw)',textthree)
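The last search produces no output because 'caterpillar' does not continue with any of the three options, so re.search returns None. Calling .group() on None would raise an AttributeError, so it is safer to check the result first. The helper name cat_ending below is hypothetical and used only for illustration.

```python
import re

text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"


def cat_ending(sentence):
    """Return the option that follows 'cat', or None if there is no match."""
    m = re.search(r'cat(fish|nap|claw)', sentence)
    # re.search returns None when nothing matches, so guard before using it.
    return m.group(1) if m else None


for sentence in (text, texttwo, textthree):
    print(cat_ending(sentence))
```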