# lecture 4 - Regular Expressions

Regular Expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document. 

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to describe any string pattern you can imagine, which is why their pattern format is complex.

In [1]:
text = 'The phone number of the agent is 408-555-1234. Call soon!'

In [2]:
'phone' in text

True

In [3]:
'408-555-1234' in text

True

In [4]:
import re # regular expression library

In [5]:
pattern = 'phone'

In [7]:
re.search(pattern,text)

<re.Match object; span=(4, 9), match='phone'>

In [8]:
my_match = re.search(pattern,text)

In [9]:
my_match.span() # match starts at index 4 and ends just before index 9

(4, 9)

In [10]:
my_match.start()

4

In [11]:
my_match.end()

9

In [12]:
text = 'my phone is a new phone'

In [13]:
match = re.search(pattern,text)

In [14]:
match.span()

(3, 8)

In [15]:
all_matches = re.findall('phone',text)

In [16]:
len(all_matches) # two matches

2

In [17]:
for match in re.finditer('phone',text):
    print(match.span())

(3, 8)
(18, 23)


## patterns

So far we've learned how to search for a basic string. What about more complex cases, such as finding a telephone number or an email address in a large string of text?

We could just use the search method if we knew the exact phone number or email address, but what if we don't? We may know only the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

## identifiers for characters in patterns

Characters such as a digit or a whitespace character have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, which tells Python that the \ characters in the pattern are not meant to be escape characters.

Below you can find a table of the most common identifiers:
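
To see why the r prefix matters, here is a minimal sketch comparing the same text written as a normal string and as a raw string. It uses \b, the word-boundary code, which is not in this lecture's table; it is chosen here only because Python also reads \b in a normal string as a backspace character.

```python
import re

# Without the r prefix, Python turns \b into a single backspace character.
plain = '\bcat\b'
# With the r prefix, the backslash and the b stay as two characters,
# which the re module then interprets as a word boundary.
raw = r'\bcat\b'

print(len(plain))  # 5
print(len(raw))    # 7

print(re.findall(plain, 'the cat sat'))  # []
print(re.findall(raw, 'the cat sat'))    # ['cat']
```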

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
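
As a quick check, the pattern and example pairs from the table above can be run through re.fullmatch, which succeeds only when the whole string fits the pattern. This is a small sketch; the examples list is just a helper name for this illustration.

```python
import re

# (pattern, example match) pairs copied from the identifier table
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

for pattern, example in examples:
    # fullmatch returns a Match object on success and None otherwise
    matched = re.fullmatch(pattern, example) is not None
    print(pattern, repr(example), matched)  # every row prints True
```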

In [19]:
text = "My telephone number is 408-555-1234"

In [20]:
text

'My telephone number is 408-555-1234'

In [21]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d' # generalized pattern

In [22]:
phone_number = re.search(pattern,text)

In [24]:
phone_number

<re.Match object; span=(23, 35), match='408-555-1234'>

In [25]:
phone_number.group()

'408-555-1234'

## quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
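
The quantifier rows can be checked the same way. The sketch below is based on the table's examples and adds one extra case (the string '123456' is made up for this illustration) to show that a range quantifier takes as many characters as it is allowed to.

```python
import re

# (pattern, example match) pairs based on the quantifier table
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),   # zero B's is fine because * allows zero
    (r'plurals?', 'plural'),
]

for pattern, example in examples:
    print(pattern, repr(example), re.fullmatch(pattern, example) is not None)

# {2,4} is greedy: given six digits, it stops after the first four
longest = re.search(r'\d{2,4}', '123456').group()
print(longest)  # 1234
```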

In [26]:
pattern = r'\d{3}-\d{3}-\d{4}' # same pattern, written more concisely with quantifiers

In [27]:
re.search(pattern,text)

<re.Match object; span=(23, 35), match='408-555-1234'>

In [28]:
pattern = r'(\d{3})-(\d{3})-(\d{4})'

In [29]:
my_match = re.search(pattern,text)

In [31]:
my_match.group(1)

'408'

In [32]:
my_match.group(2)

'555'

In [33]:
my_match.group(3)

'1234'
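
The parentheses in the pattern create numbered groups. As a short sketch, all groups can also be retrieved at once with the match object's groups() method; the variable names area_code, exchange, and line_number are only illustrative labels.

```python
import re

text = "My telephone number is 408-555-1234"
match = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)

# group(0) is the whole match, the same as calling group() with no argument
print(match.group(0))  # 408-555-1234

# groups() returns every parenthesized group as a tuple
print(match.groups())  # ('408', '555', '1234')

# The tuple can be unpacked into separate names
area_code, exchange, line_number = match.groups()
print(area_code)       # 408
```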

In [35]:
re.search(r'man|woman','This man was here')

<re.Match object; span=(5, 8), match='man'>

In [36]:
re.search(r'man|woman','This woman was here')

<re.Match object; span=(5, 10), match='woman'>

In [37]:
re.findall(r'.at','The cat is the hat sat')

['cat', 'hat', 'sat']

In [38]:
re.findall(r'.at','The cat is the hat sat splat')

['cat', 'hat', 'sat', 'lat']

The substring 'splat' is not matched in full because . matches exactly one character before at, so only 'lat' is returned.
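
If the goal is to capture each whole word ending in at, one possible approach (a sketch, not the only way) is to replace the single wildcard with \S+, which matches one or more non-whitespace characters.

```python
import re

text = 'The cat is the hat sat splat'

# \S+ takes every non-whitespace character in front of 'at',
# so the match grows to cover the start of each word
whole_words = re.findall(r'\S+at', text)
print(whole_words)  # ['cat', 'hat', 'sat', 'splat']
```

Note that this pattern only requires at to appear after at least one other character, so it would also match the first part of a longer word such as 'cat' in 'catalog'.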

In [40]:
re.findall(r'..at','The cat is the hat sat splat')


[' cat', ' hat', ' sat', 'plat']

In [42]:
re.findall(r'...at','The cat is the hat sat splat')

['e cat', 'e hat', 'splat']

In [43]:
re.findall(r'\d$','This ends with number 2') # $ means 'ends with'

['2']

In [45]:
re.findall(r'\d','1 is the loneliest number')

['1']

In [46]:
phrase = 'these are 2 numbers 34 inside 5 this sentence'

In [56]:
re.findall(r'[^\d]+',phrase) # inside [], a leading ^ excludes the listed characters

['these are ', ' numbers ', ' inside ', ' this sentence']

In [57]:
test_phrase = 'This is a string! but it has punctuation. how to remove it?'

In [58]:
my_list = re.findall(r'[^!.?]+',test_phrase)

Inside square brackets ([]), a leading '^' means negation, so [^!.?] matches any character except '!', '.', and '?'. The '+' makes each match a run of consecutive non-punctuation characters.

In [59]:
''.join(my_list)

'This is a string but it has punctuation how to remove it'
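
The caret means something different outside square brackets, where it anchors the match to the start of the string, just as $ anchors it to the end. The sketch below contrasts the three uses on a made-up phrase adapted from the earlier cells.

```python
import re

phrase = '1 is the loneliest number 2'

# ^ outside brackets: the digit must be at the start of the string
starts_with = re.findall(r'^\d', phrase)
print(starts_with)  # ['1']

# $ : the digit must be at the end of the string
ends_with = re.findall(r'\d$', phrase)
print(ends_with)    # ['2']

# ^ inside brackets: match runs of characters that are NOT digits
non_digits = re.findall(r'[^\d]+', phrase)
print(non_digits)   # [' is the loneliest number ']
```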

In [61]:
text = 'Only find the hyphen-words. Where are the long-ish dash words?'

In [62]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']