Now that we can write regular expressions and test them online (e.g. <a href = "http://www.pythex.org">Pythex</a>), let's take a look at some of the functions Python has for working with them.

First, import the `re` module, which is part of the Python standard library.

In [1]:
import re
s = "Test string with some numbers like 439 and 234 and maybe some others like 19 and 20004"

So that we can focus on learning the _Python_, let's use a very simple regular expression that matches simple integers.

In [2]:
regex = r"\d+" # unsigned integers (no sign, no thousands separators, etc.)

Notice the `r` in front of the string. It marks the string as a _raw_ string, in which Python keeps every backslash as a literal character instead of treating it as the start of an escape sequence. Without the preceding `r`, the `\` is typically interpreted as an _escape character_ that gives special meaning to the following character. For instance, the newline character is written as `\n`:

In [3]:
print("hello\nworld")

hello
world


But for a raw string...

In [4]:
print(r"hello\nworld")

hello\nworld


We use raw strings with regular expressions because we don't want the core Python language to be interpreting the strings; we want it to pass them exactly as we enter them to the functions of the `re` module.
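To see why this matters in practice, here is a small sketch using `\b`, which means "backspace" to core Python but "word boundary" to the `re` module. The sample text and variable names are made up for illustration.

```python
import re

# In a regular string, "\b" collapses into one backspace character.
# In a raw string, the backslash and the letter "b" both survive.
plain = "\bcat\b"
raw = r"\bcat\b"
print(len(plain))  # 5
print(len(raw))    # 7

text = "cat concat cat"  # illustrative sample text
plain_hits = re.findall(plain, text)  # searches for literal backspace characters
raw_hits = re.findall(raw, text)      # \b reaches re as a word boundary
print(plain_hits)  # []
print(raw_hits)    # ['cat', 'cat']
```

The non-raw pattern fails silently: it is a valid regular expression, just not the one we meant to write.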

<a href = "https://docs.python.org/2/library/re.html#re.search">`re.search`</a> is a function that scans a string for a regular expression. It returns as soon as it finds the first match, and it returns `None` if it finds no match at all.

In [5]:
match = re.search(regex, s)
print(match)
print(type(match))

<_sre.SRE_Match object at 0x06DF3288>
<type '_sre.SRE_Match'>


Apparently `re.search` returns a <a href = "https://docs.python.org/2/library/re.html#re.MatchObject">_match object_</a>. Here are a few of the useful methods; take a look at the documentation for more.

In [6]:
print(match.start()) # index where the match begins
print(match.end())   # index just past the last matched character
print(match.group()) # the part of the string that is matched

35
38
439
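Two details are worth checking by hand. The sketch below restates the string `s` from above so that it runs on its own; the "no digits here" string is an invented example. It shows that `end()` is exclusive, so slicing with `start()` and `end()` reproduces `group()`, and that `re.search` returns `None` when nothing matches.

```python
import re

s = "Test string with some numbers like 439 and 234 and maybe some others like 19 and 20004"

match = re.search(r"\d+", s)
# span() returns (start, end); end is one past the last matched character
start, end = match.span()
sliced = s[start:end]
print(sliced)  # 439

# re.search returns None when there is no match, so test before using the result
missing = re.search(r"\d+", "no digits here")
if missing is None:
    print("no match")
```

Calling a method such as `group()` on `None` raises an `AttributeError`, so a check like this one is a common precaution.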


Actually, the `group` method is for more than that, but to explore its purpose we'll need to use a regular expression with some capture groups. A trivial example is:

In [7]:
regex = r"(\w+) (\d+)" # word followed by a number; capture the word and the number separately

In [8]:
match = re.search(regex, s)
print(match.group(0)) # the entire matched text
print(match.group(1)) # the first capture group
print(match.group(2)) # the second capture group

like 439
like
439
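The other match object methods accept a group number as well. As a brief sketch (again restating `s` so the cell is self-contained), `groups()` returns all capture groups at once, while `start` and `span` can report where an individual group was found.

```python
import re

s = "Test string with some numbers like 439 and 234 and maybe some others like 19 and 20004"

match = re.search(r"(\w+) (\d+)", s)
all_groups = match.groups()     # every capture group, as a tuple
word_span = match.span(1)       # (start, end) of the first capture group
number_start = match.start(2)   # where the second capture group begins
print(all_groups)    # ('like', '439')
print(word_span)     # (30, 34)
print(number_start)  # 35
```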


Since `search` only finds the first match, it's best suited to testing regular expressions. More often we'll want to use <a href = "https://docs.python.org/2/library/re.html#re.findall">`re.findall`</a>, which returns every match.

In [9]:
matches = re.findall(regex, s)
print(matches)

[('like', '439'), ('and', '234'), ('like', '19'), ('and', '20004')]
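The shape of the list that `findall` returns depends on how many capture groups the pattern contains. The sketch below (restating `s`; the variable names are illustrative) contrasts a pattern with no groups and a pattern with one group against the two-group pattern used above.

```python
import re

s = "Test string with some numbers like 439 and 234 and maybe some others like 19 and 20004"

# No capture groups: a list of the full matches
no_groups = re.findall(r"\d+", s)
print(no_groups)  # ['439', '234', '19', '20004']

# One capture group: a list of just that group's text
one_group = re.findall(r"(\w+) \d+", s)
print(one_group)  # ['like', 'and', 'like', 'and']

# The results are always strings; convert them if numbers are needed
numbers = [int(n) for n in no_groups]
print(numbers)  # [439, 234, 19, 20004]
```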


The `match` object returned by `search` had some handy methods, but `search` only found one `match`. <br>
`findall` gives us _all_ the matches (here as a list of tuples holding each match's capture groups), but it doesn't tell us where they were found or anything else. <br>
To get the best of both worlds, we use the <a  href = "https://docs.python.org/2/library/re.html#re.finditer">`re.finditer`</a> function.

In [10]:
match_iterator = re.finditer(regex, s)
print(match_iterator)

<callable-iterator object at 0x06DE8350>


In case you haven't seen Python iterators yet, let's not worry about what that means. Let's just turn that thing into a more familiar `list`. 

In [11]:
matches = list(match_iterator)
matches

[<_sre.SRE_Match at 0x6af6da0>,
 <_sre.SRE_Match at 0x6dde8d8>,
 <_sre.SRE_Match at 0x6dde890>,
 <_sre.SRE_Match at 0x6dde4a0>]

In [12]:
for match in matches:
    print("The text '" + match.group() + "' begins at index " + str(match.start()))

The text 'like 439' begins at index 30
The text 'and 234' begins at index 39
The text 'like 19' begins at index 69
The text 'and 20004' begins at index 77
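Converting to a `list` is optional: a `for` loop or a list comprehension can consume the iterator directly. The sketch below (restating `s` and the two-group pattern; `records` is an illustrative name) collects the word, the number, and the position of each match. It also shows one thing to watch for: an iterator can be consumed only once.

```python
import re

s = "Test string with some numbers like 439 and 234 and maybe some others like 19 and 20004"
regex = r"(\w+) (\d+)"

# Loop over the iterator directly and keep what we need from each match object
records = [(m.group(1), int(m.group(2)), m.start()) for m in re.finditer(regex, s)]
print(records)
# [('like', 439, 30), ('and', 234, 39), ('like', 19, 69), ('and', 20004, 77)]

# An iterator is exhausted after one pass; a second pass yields nothing
match_iterator = re.finditer(regex, s)
first_pass = list(match_iterator)
second_pass = list(match_iterator)
print(len(first_pass))   # 4
print(len(second_pass))  # 0
```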


Finally, besides extracting information from text, another very important use of regular expressions is _replacing_ text. The `re` module covers that, too, with <a href = "https://docs.python.org/2/library/re.html#re.sub">`re.sub`</a>. In another horribly contrived example, we'll flip the order of the number and the preceding word and enclose them in brackets, just because we can.

In [13]:
replace_with = r"[\2 \1]" # put capture group 2 before capture group 1, inside literal brackets
s2 = re.sub(regex, replace_with, s)
print(s2)

Test string with some numbers [439 like] [234 and] and maybe some others [19 like] [20004 and]


Notice that `re.sub` interprets the brackets in the string `replace_with` literally, but interprets the numbers preceded by backslashes as references to the capture groups. This allows you to replace matches with a combination of regular text and the text found in the capture groups.
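The replacement does not have to be a string. `re.sub` also accepts a function that receives each match object and returns the replacement text, and an optional `count` argument limits how many matches are replaced. Here is a minimal sketch of both (restating `s`; the doubling rule and the `N` placeholder are arbitrary choices for illustration).

```python
import re

s = "Test string with some numbers like 439 and 234 and maybe some others like 19 and 20004"

# Function replacement: compute each replacement from the match object
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), s)
print(doubled)
# Test string with some numbers like 878 and 468 and maybe some others like 38 and 40008

# count=1 replaces only the first match and leaves the rest untouched
first_only = re.sub(r"\d+", "N", s, count=1)
print(first_only)
# Test string with some numbers like N and 234 and maybe some others like 19 and 20004
```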

For more information, see <a href = "https://regexone.com/references/python">RegexOne "Using Regular Expressions in Python"</a> and the <a href = "https://docs.python.org/2/library/re.html">official documentation</a>.