# Regexes in Python

Python comes with a built-in `re` module for regular expressions.  There is also a more advanced `regex` module that you can install separately, which is backwards-compatible with `re`, but which we will not need.

The official Python documentation has a good [tutorial](https://docs.python.org/3/howto/regex.html), which you might want to look at after this lecture.

## Regex methods

In [2]:
import re

Importing the `re` module brings the module object into your namespace, and the functions for working with regexes are called through that object, as in `re.search(...)`.

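What these methods return depends on the method: `match` and `search` return `None` if the pattern isn't found and a match object otherwise, whereas `findall` returns a list of matches, which is empty if there are none.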
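As a quick illustration of these return values, here is a minimal sketch; the `sample` string and the variable names are made up for this example.

```python
import re

sample = 'The quick brown fox'   # hypothetical sample text

hit = re.search('quick', sample)     # a match object
miss = re.search('slow', sample)     # None, since nothing matched
all_hits = re.findall('o', sample)   # a list of the matching strings
no_hits = re.findall('z', sample)    # an empty list, not None

print(hit, miss, all_hits, no_hits)

# Check for None before calling methods on a possible match object.
if miss is None:
    print('no match')
```

Calling something like `miss.group()` without that check would raise an `AttributeError`.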

It would seem logical to start with the `match` method, but it is stupid and is usually not what you want.  I only mention it so that you know to avoid it.

In [3]:
text = 'The quick brown fox jumped over the lazy dogs, but not these ones.'
re.match('[A-Z]', text)

<re.Match object; span=(0, 1), match='T'>

In [None]:
match = re.match('T', text)
match.start(), match.end()

In [None]:
print(re.match('[a-z]', text))

The `match` method is pretty restricted, since it only applies the pattern at the beginning of the text, so I never use it (except accidentally).

If you want to restrict your pattern to matching at the beginning of the string, just start your pattern with `^`.
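A small sketch of that equivalence, using a made-up `sample` string: with no flags set, `search` with a leading `^` behaves like `match`.

```python
import re

sample = 'The quick brown fox'   # hypothetical sample text

anchored = re.search('^[A-Z]', sample)   # search, anchored with ^
via_match = re.match('[A-Z]', sample)    # match, implicitly anchored

print(anchored.group(), via_match.group())

# A lowercase letter is not at the start, so the anchored search fails.
print(re.search('^[a-z]', sample))
```

The two stop being equivalent once the `re.M` flag is used, because `^` then also matches after each newline while `match` still only tries the very start of the string.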

The `search` method is more useful, as it looks for the pattern anywhere in the text, which is almost always what you want.

In [None]:
re.search('[a-z]', text)

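But it only finds the first match in the text.  The `search` method of a compiled pattern (created with `re.compile`) does have a second parameter that tells it what position in the text string to start searching from, just like the `index` string method.  The module-level `re.search` function does not: its optional third argument is `flags`.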
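Here is a minimal sketch of searching from a start position. It assumes a compiled pattern, since the start position is accepted by the `search` method of a pattern made with `re.compile`; the `sample` string and variable names are made up for this example.

```python
import re

sample = 'The quick brown fox jumped over the lazy dogs'   # hypothetical text
pattern = re.compile('[Tt]he')

pos = 0
starts = []
while True:
    m = pattern.search(sample, pos)   # start looking at index pos
    if m is None:
        break
    starts.append(m.start())
    pos = m.end()                     # resume just after this match

print(starts)
```

This loop relies on the pattern never matching an empty string; otherwise `pos` would stop advancing.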

So you could use that to find the next match after the first one, and so on, as with `index`.  But there is an easier way. `findall` returns a list of all matching strings.

This is **much** more useful if you are looking for multiple appearances of a pattern.

In [None]:
re.findall('[Tt]he', text)

`finditer` returns an iterator that yields a match object for each match, so you get positions as well as the matching strings.

In [None]:
for match in re.finditer('[Tt]he', text):
    print(match)
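Each match object carries the matched string and its position. A short sketch, with a made-up `sample` string:

```python
import re

sample = 'The quick brown fox jumped over the lazy dogs'   # hypothetical text

for m in re.finditer('[Tt]he', sample):
    # group() is the matched string, span() is (start, end)
    print(m.group(), m.span())

spans = [m.span() for m in re.finditer('[Tt]he', sample)]
```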

You can also use the `re.split` method to split a text on a regex.

In [None]:
# r'...' is a raw string; see "Managing backslashes" below
re.split(r'\s+', text)

In [None]:
re.split('[aeiou]+', text)

In [None]:
text = 'Sam-I-Am, that Sam-I-Am'
re.split(r'-|\s+', text)  # raw string, so \s reaches the regex engine unchanged

Finally, there is a `sub` (substitution) method that permits you to replace the pattern you find with something else (or with nothing to delete the pattern).  Like a string method, it returns a changed version of the original string.

This is very handy for tidying up texts.

**NB** Like a string method, `sub` does not change the string you give it.  Remember, strings are immutable.  It returns a copy of the string with the substitutions performed and keeps the original string the same.

In [None]:
text = 'Sam-I-Am, that Sam-I-Am'
re.sub('-', ' ', text)

In [None]:
text
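To keep the result, assign it to a name. The sketch below uses made-up variable names; the `count` argument is an extra not covered above, which limits how many substitutions are made.

```python
import re

original = 'Sam-I-Am, that Sam-I-Am'

cleaned = re.sub('-', ' ', original)               # replace every hyphen
first_only = re.sub('-', ' ', original, count=1)   # replace only the first

print(cleaned)
print(first_only)
print(original)   # unchanged
```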

In [5]:
from string import punctuation as punct
punct

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
text = "The, quick-br*wn fox' (jumped) over the l@zy dogs!!"
pat = '['+punct+']'
print(pat)

[!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]


In [None]:
re.sub(pat, '', text)
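Pasting arbitrary characters between square brackets works here partly by luck, since characters such as `]`, `\`, `^` and `-` have special meanings inside a character class. A more defensive sketch, assuming you want every punctuation character treated literally, passes the string through `re.escape` first; the names `messy` and `safe_pat` are made up for this example.

```python
import re
from string import punctuation

messy = "The, quick-br*wn fox' (jumped) over the l@zy dogs!!"

# re.escape puts a backslash in front of every special character.
safe_pat = '[' + re.escape(punctuation) + ']'
stripped = re.sub(safe_pat, '', messy)

print(stripped)
```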

## Managing backslashes

When you type in a pattern, you do so as a string, between quote marks.  Backslashes sometimes have a special meaning between quote marks, and in these cases surprising things can happen when you try to insert them in your pattern.  These problems are hard to debug, since you can't see what's going on unless you check.

In [7]:
text = 'word word word wordwordword'
re.findall('\bword\b', text)

[]

In [8]:
'\bword\b'

'\x08word\x08'

One fix is to double the backslashes.

In [9]:
re.findall('\\bword\\b', text)

['word', 'word', 'word']

In [None]:
print('\\bword\\b')
'\\bword\\b'

But then you need to remember to do it, and either do it everywhere (which makes the pattern hard to read) or know which backslashes are going to cause trouble and only do it there.

A better solution is always to enter regexes as "raw" strings, in which Python does no interpretation of backslashes when reading in the string between quote marks.

In [None]:
'\bword\b'

In [None]:
r'\bword\b'

In [None]:
print(r'\bword\b')

You should get into the habit of always using `r'...'` when typing in a regex in Python.

In [None]:
re.findall(r'\bword\b', text)
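One way to see the difference is to compare lengths. In this small sketch (variable names are made up), the plain string turns each `\b` into a single backspace character, while the raw string keeps the backslash and the `b` as two characters.

```python
plain = '\bword\b'       # each \b becomes one backspace character (\x08)
doubled = '\\bword\\b'   # each \\ becomes one backslash
raw = r'\bword\b'        # backslashes are kept as typed

print(len(plain), len(doubled), len(raw))
print(raw == doubled)
```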

There are a few such string prefixes, but the most important ones are `f` and `r`.  

Remember that the `f` is for a format string, where anything between curly braces is a Python expression to be evaluated and the result inserted into the string.

In [None]:
name = 'Peter'
f'My name is {name}'
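The two prefixes can be combined as `rf'...'`, which is handy for building a regex around a variable. The following is only a sketch: the `word` variable and the sample sentence are made up, and `re.escape` is used on the assumption that the inserted text should be matched literally.

```python
import re

word = 'the'   # hypothetical search term

# raw + format string: backslashes are kept, {…} is evaluated
pattern = rf'\b{re.escape(word)}\b'
found = re.findall(pattern, 'The other day the cat sat there', flags=re.I)

print(pattern)
print(found)
```

In such a string, a regex quantifier like `{2,3}` has to be written with doubled braces, `{{2,3}}`, so that it is not taken for a Python expression.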

## Regex flags

You can modify the behavior of your pattern by adding flags to the function call.

The most common is:

* re.I: ignore case

In [10]:
text = '''The quick brown fox jumped over 
the lazy dogs, but not these ones.'''
re.findall(r'\bthe\b', text)

['the']

In [11]:
re.findall(r'\bthe\b', text, flags=re.I)

['The', 'the']

There are two other flags that are useful for multi-line texts.

* re.M: Normally `^` and `$` match at the beginning and end of the string.  With this flag, they also match at the beginning and end of each line within the string. 

* re.S: Normally the `.` special character matches any character except a newline.  This flag makes it match newlines as well.

In [12]:
re.findall(r'^the', text)

[]

In [13]:
print(text)

The quick brown fox jumped over 
the lazy dogs, but not these ones.


In [14]:
re.findall(r'^the', text, flags=re.M)

['the']
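The `re.S` flag can be sketched in the same way. In this made-up two-line string, `.*` cannot cross the newline unless the flag is set.

```python
import re

two_lines = 'fox jumped over \nthe lazy dogs'   # hypothetical text

without_s = re.findall(r'over.*lazy', two_lines)
with_s = re.findall(r'over.*lazy', two_lines, flags=re.S)

print(without_s)
print(with_s)
```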

If you want to add multiple flags, join them with the vertical bar `|` (Python's bitwise OR operator, which here combines the flags).

In [15]:
re.findall(r'^the', text, flags=re.M|re.I)

['The', 'the']
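The short flag names are aliases for longer, more readable ones, and a combination of flags is itself a flag value that you can store in a variable. A small sketch (the name `combined` is made up):

```python
import re

combined = re.M | re.I   # a single value carrying both flags

# The short names are aliases for the long ones.
print(re.I == re.IGNORECASE, re.M == re.MULTILINE, re.S == re.DOTALL)

# Test whether a particular flag is part of the combination.
print(bool(combined & re.I), bool(combined & re.S))
```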

## Matching with groups

Often, you define the match by including the context of the pattern you are interested in, but you are not actually interested in capturing that full context.  So you can put parentheses around part of the pattern to form a group, and `findall` then returns only what the groups capture.

In [16]:
text = '''The quick brown fox jumped over 
the lazy dogs, but not these ones.'''

In [None]:
re.findall(r'the\s+\b\w+\b', text, flags=re.M|re.I)

['The quick', 'the lazy']

In [None]:
re.findall(r'the\s+\b(\w+)\b', text, flags=re.M|re.I)

In [None]:
re.findall(r'(the)\s+(\w+)', text, flags=re.M|re.I)
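The shape of the result depends on the number of groups. The sketch below repeats the patterns above on the same text, with made-up variable names, and also shows how to read groups from a single match object.

```python
import re

sample = '''The quick brown fox jumped over
the lazy dogs, but not these ones.'''

# One group: findall returns a list of strings.
one_group = re.findall(r'the\s+\b(\w+)\b', sample, flags=re.I)

# Two groups: findall returns a list of tuples.
two_groups = re.findall(r'(the)\s+(\w+)', sample, flags=re.I)

# On a match object, group(0) is the whole match, group(1) the first group.
m = re.search(r'(the)\s+(\w+)', sample, flags=re.I)

print(one_group)
print(two_groups)
print(m.group(0), m.group(1), m.group(2))
```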