# RegEx for text processing

The many built-in string functions work fine for most text processing tasks. However, sometimes you will encounter a situation where you need the power of regular expressions.

The code below uses a list comprehension to build a new list that keeps only the items from a starter list that contain the pattern 'ain'. Note that re.search() looks for the pattern anywhere in the string, not just at the start.

In [2]:
import re
word_list = ['contain', 'restrain', 'complaining', 'nothing']
[w for w in word_list if re.search('ain', w)]

['contain', 'restrain', 'complaining']

There are some important special characters in regular expressions:
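* `$` matches the end of the string
* `^` matches the start of the string
* `.` matches any single character (except a newline, by default)
* `?` makes the previous character optional (zero or one occurrence)
* `*` repeats the previous character zero or more times
* `+` repeats the previous character one or more times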

In [14]:
# $ example
[w for w in word_list if re.search('ain$', w)]


['contain', 'restrain']

In [15]:
# ^ example
[w for w in word_list if re.search('^co', w)]

['contain', 'complaining']

In [16]:
# ? example
[w for w in word_list if re.search('tr?ain', w)]

['contain', 'restrain']

In [17]:
# . and * example
[w for w in word_list if re.search('co.*ain', w)]

['contain', 'complaining']

In [18]:
# + example
[w for w in word_list if re.search('in+', w)]

['contain', 'restrain', 'complaining', 'nothing']
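
The three repetition characters are easy to confuse, so here is a minimal sketch that compares them side by side. The strings in `candidates` are made up for illustration, and `re.fullmatch()` (which requires the whole string to match) is used here only to make the differences visible; it is not part of the examples above.

```python
import re

# Made-up test strings: zero, one, and two b's between a and c.
candidates = ['ac', 'abc', 'abbc']

# b* allows zero or more b's, so all three strings match.
star = [s for s in candidates if re.fullmatch('ab*c', s)]
print(star)      # ['ac', 'abc', 'abbc']

# b+ requires at least one b, so 'ac' is excluded.
plus = [s for s in candidates if re.fullmatch('ab+c', s)]
print(plus)      # ['abc', 'abbc']

# b? allows zero or one b, so 'abbc' is excluded.
optional = [s for s in candidates if re.fullmatch('ab?c', s)]
print(optional)  # ['ac', 'abc']
```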

### brackets []

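Square brackets define a character class: a set of characters, any one of which can match at that position. For example, [0-9] matches any single digit. A number in braces, such as {3}, repeats the preceding item exactly that many times. The following looks for a pattern that starts at the beginning of the string and contains exactly 3 digits followed by a dash.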

In [21]:
str1 = '123-45-6789'
patt = '^[0-9]{3}-'
re.search(patt, str1)

<_sre.SRE_Match object; span=(0, 4), match='123-'>

In [4]:
# ^ has a different meaning inside []: as the first character it means NOT
str1 = '123-45-6789'
patt = '^[^0-9]{3}-'  # first ^ means start of string, second ^ negates the class
re.search(patt, str1)  # returns None (so no output): str1 starts with digits
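
The cell above shows no output because re.search() returns None when nothing matches. As a rough sketch, the two patterns can be compared directly. The second string, `str2`, is a made-up value added only to show a case where the negated class does match.

```python
import re

str1 = '123-45-6789'

# [0-9] matches one digit; {3} requires exactly three in a row.
digits = re.search('^[0-9]{3}-', str1)
print(digits.group())    # 123-

# [^0-9] matches one character that is NOT a digit, so nothing is found here.
non_digits = re.search('^[^0-9]{3}-', str1)
print(non_digits)        # None

# Hypothetical string that starts with three non-digits.
str2 = 'abc-45-6789'
letters = re.search('^[^0-9]{3}-', str2)
print(letters.group())   # abc-
```

Because a failed search returns None, it is safest to check the result before calling `.group()` on it.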

### More

There is much more to RegEx than we covered here. Chapter 3 of the NLTK book also gives examples using the re.findall() method, which returns all non-overlapping matches of a pattern. Many tutorials are available on the web, and it is handy to keep a cheat sheet nearby, [like this one](http://docs.cs-cart.com/4.3.x/_downloads/regular-expressions-cheat-sheet-v1.pdf).
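
As a small illustration of re.findall(), the sketch below pulls every match out of a string and returns them as a list. The sentence is invented for this example, and the patterns use only the special characters introduced earlier.

```python
import re

# Made-up sentence for illustration.
sentence = 'it is raining on the main train line'

# [a-z]* on either side of 'ain' captures the whole word around each match.
ain_words = re.findall('[a-z]*ain[a-z]*', sentence)
print(ain_words)     # ['raining', 'main', 'train']

# [0-9]+ grabs each run of one or more digits.
digit_groups = re.findall('[0-9]+', '123-45-6789')
print(digit_groups)  # ['123', '45', '6789']
```

Unlike re.search(), which stops at the first match and returns a match object, re.findall() returns plain strings, so the result can be used directly in a list comprehension or a loop.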

# Practice

Look at Table 3.3 on [this page](http://www.nltk.org/book/ch03.html). Make up an example that uses at least 5 of the metacharacters and share it with the class.
    