# Regular Expressions in Python

**By Arpit Omprakash, Byte Sized Code**

## Basic RegExp

**What is a Regular Expression?**  
A regular expression (a.k.a. regex or regexp) is essentially a search query for text, expressed as a string pattern.  
When you run a search against a piece of text, anything that matches the pattern you specified is returned as a result of the search.  
In short, regular expressions let us search a text for strings that match a specific pattern.

**Module**  
First, we need to import Python's `re` module, which lets us create and work with regular expressions.

In [34]:
import re

**Search**  

Most of what we do with regex in Python can be accomplished with the `re.search()` function.  
We provide two arguments to the function:  
- A regex pattern
- A string to match  

In [35]:
result = re.search(r"zza", "pizza")
print(result)

<re.Match object; span=(2, 5), match='zza'>


In [36]:
result = re.search(r"zza", "youzza")
print(result)

<re.Match object; span=(3, 6), match='zza'>


As we can see above, when there is a match, `re.search` returns a `re.Match` object.  
Its printed form shows a `span`, which tells us where the match occurred in the given string,  
and a `match`, which shows the exact text that was matched.

What about when there is no match?  
The function returns `None`.

In [37]:
result = re.search(r"zza", "lol")
print(result)

None
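
Because `re.search` returns `None` when nothing matches, it is usually worth checking the result before reading anything from it. Here is a minimal sketch of that check; `describe_match` is a hypothetical helper name used only for illustration.

```python
import re

def describe_match(pattern, text):
    """Return (start, end, matched_text), or None when nothing matches."""
    result = re.search(pattern, text)
    if result is None:
        return None
    # span() gives the (start, end) indices; group() gives the matched text.
    start, end = result.span()
    return start, end, result.group()

print(describe_match(r"zza", "pizza"))  # (2, 5, 'zza')
print(describe_match(r"zza", "lol"))    # None
```

Calling a method such as `.span()` directly on a failed search would raise an `AttributeError`, which is why the `None` check comes first.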


**IGNORECASE**

By default, regex matching is case sensitive. To make the matching case insensitive we can use the `re.IGNORECASE` flag.

In [38]:
print(re.search(r"ping", "Ping"))

None


In [39]:
print(re.search(r"ping", "Ping", re.IGNORECASE))

<re.Match object; span=(0, 4), match='Ping'>
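
As a small aside, the flag can also be passed by keyword and has a short alias, `re.I`. The sketch below assumes nothing beyond the standard `re` module; the sample strings are made up for illustration.

```python
import re

# re.I is the short alias for re.IGNORECASE.
upper_match = re.search(r"ping", "PING", flags=re.I)
print(upper_match.group())  # PING

# Flags work with the other re functions too, such as re.findall.
all_pings = re.findall(r"ping", "Ping ping PING", flags=re.IGNORECASE)
print(all_pings)  # ['Ping', 'ping', 'PING']
```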


**FINDALL**

The `re.search` function, although very handy, only ever returns the first match in a string.  
What if we want to match every occurrence of a given pattern in a string?  
We use the `re.findall` function, which returns a list of all non-overlapping matches.

In [40]:
print(re.search(r"ding", "dingdingdingding"))

<re.Match object; span=(0, 4), match='ding'>


In [41]:
print(re.findall(r"ding", "dingdingdingding"))

['ding', 'ding', 'ding', 'ding']
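
`re.findall` gives us the matched text but not where each match was found. If the positions matter, one option is `re.finditer`, which yields a match object for every match. A short sketch:

```python
import re

text = "dingdingdingding"

# re.finditer yields one match object per non-overlapping match,
# so we can read the span of each one.
spans = [m.span() for m in re.finditer(r"ding", text)]
print(spans)  # [(0, 4), (4, 8), (8, 12), (12, 16)]
```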


### Special Characters

Next we take a look at some of the special characters that help with pattern matching in regex.

We can use the caret (^) symbol to explicitly state that we want our match to start at the first character of the string.  
Similarly, we use the dollar ($) symbol to state that we want our match to end at the last character of the string.

In [42]:
print(re.search(r"x", "axon"))

<re.Match object; span=(1, 2), match='x'>


In [43]:
print(re.search(r"^x", "axon"))

None


In [44]:
print(re.search(r"n", "xenon"))

<re.Match object; span=(2, 3), match='n'>


In [45]:
print(re.search(r"n$", "xenon"))

<re.Match object; span=(4, 5), match='n'>
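
By default, the caret and the dollar anchor to the whole string rather than to each line in it. If the text spans several lines, the `re.MULTILINE` flag makes them work line by line. The two-line sample string below is made up for illustration.

```python
import re

text = "xenon\naxon"

# By default ^ anchors to the start of the whole string, so the "a"
# that starts the second line is not found.
no_flag = re.search(r"^a", text)
print(no_flag)  # None

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.search(r"^a", text, re.MULTILINE)
print(with_flag.span())  # (6, 7)

# Likewise, $ then matches before each newline as well as at the very end.
print(re.findall(r"n$", text))                # ['n']
print(re.findall(r"n$", text, re.MULTILINE))  # ['n', 'n']
```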


**Wildcard**  

A wildcard (represented by a dot) is a special character that matches any single character (letters, digits, spaces, and symbols) except a newline.

In [46]:
print(re.search(r"p.ng", "penguin"))

<re.Match object; span=(0, 4), match='peng'>


In [47]:
print(re.search(r"p.ng", "ping"))

<re.Match object; span=(0, 4), match='ping'>


In [48]:
print(re.search(r"p.ng", "sponge"))

<re.Match object; span=(1, 5), match='pong'>


In [49]:
print(re.search(r"p.ng", "p ng"))

<re.Match object; span=(0, 4), match='p ng'>
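
One caveat worth seeing in action: by default the dot does not match a newline character. The `re.DOTALL` flag changes that. A short sketch:

```python
import re

# The dot skips newlines by default, so this search fails.
plain = re.search(r"p.ng", "p\nng")
print(plain)  # None

# With re.DOTALL, the dot matches the newline as well.
dotall = re.search(r"p.ng", "p\nng", re.DOTALL)
print(repr(dotall.group()))  # 'p\nng'
```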


**Character class**

A character class is written with square brackets ([ ]).  
Any one of the characters inside the brackets can match at that position in the pattern.  
For instance, in the two examples below, either p or P can match at the first position of the pattern.

In [50]:
print(re.search(r"[Pp]ing", "ping"))

<re.Match object; span=(0, 4), match='ping'>


In [51]:
print(re.search(r"[pP]ing", "Ping"))

<re.Match object; span=(0, 4), match='Ping'>


A character class can also contain a range of characters.  
For example, [a-z] represents all lowercase letters.

In [52]:
print(re.search(r"[a-z]way", "The highway"))

<re.Match object; span=(7, 11), match='hway'>


In [53]:
print(re.search(r"[a-z]way", "The way"))

None


In [54]:
print(re.search(r"cloud[a-zA-Z0-9]", "cloudy"))

<re.Match object; span=(0, 6), match='cloudy'>


In [55]:
print(re.search(r"cloud[a-zA-Z0-9]", "cloud9"))

<re.Match object; span=(0, 6), match='cloud9'>


At times we want to match only characters that are *not* in a given set.  
To do so, we put the caret (^) symbol at the start of a character class, which negates the class.  
For example, the first pattern below matches the first character that is not a lowercase or uppercase letter.

In [56]:
print(re.search(r"[^a-zA-Z]","This is a sentence."))

<re.Match object; span=(4, 5), match=' '>


In [57]:
print(re.search(r"[^a-zA-Z ]","This is a sentence."))

<re.Match object; span=(18, 19), match='.'>
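
Note that the caret only negates a character class when it is the first character inside the brackets. Anywhere else in the class it is an ordinary character. The tiny sample string below is made up to show the difference.

```python
import re

# Caret first: match every character that is NOT "a".
negated = re.findall(r"[^a]", "a^b")
print(negated)  # ['^', 'b']

# Caret not first: match a literal "a" or a literal "^".
literal = re.findall(r"[a^]", "a^b")
print(literal)  # ['a', '^']
```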


**Or**

Sometimes we might want to use an or condition to match either one of two things in our string.  
An "or" is represented by the pipe (|) operator.

In [58]:
print(re.search(r"cat|dog", "nice cat"))

<re.Match object; span=(5, 8), match='cat'>


In [59]:
print(re.search(r"cat|dog", "good dog"))

<re.Match object; span=(5, 8), match='dog'>


We can combine *or* and *findall* to find every match in a given sentence for either of our individual patterns.  
With what we have covered so far, we can already build fairly complex patterns.

In [60]:
print(re.search(r"cat|dog", "I have a cat and a dog."))

<re.Match object; span=(9, 12), match='cat'>


In [61]:
print(re.findall(r"cat|dog", "I have a cat and a dog."))

['cat', 'dog']


**Quantifiers (a.k.a. Repetition Qualifiers)** 

Till now, each part of our pattern has matched exactly one character.  
What if we want to match things that are repeated?  
We can use repetition qualifiers for that.

There are three basic quantifiers that we use in regex. Each one applies to the element just before it (a single character, a wildcard, or a character class):
- The asterisk (\*) symbol - matches the preceding element 0 or more times.
- The plus (+) sign - matches the preceding element 1 or more times.
- The question mark (?) - matches the preceding element 0 or 1 times.

*The asterisk*

In [62]:
print(re.search(r"Py.", "Python"))

<re.Match object; span=(0, 3), match='Pyt'>


In [63]:
print(re.search(r"Py.*", "Python"))

<re.Match object; span=(0, 6), match='Python'>


In [64]:
print(re.search(r"Py.*", "Py"))

<re.Match object; span=(0, 2), match='Py'>


In [65]:
print(re.search(r"Py.*", "Python programming"))

<re.Match object; span=(0, 18), match='Python programming'>


In [66]:
print(re.search(r"Py[a-zA-Z]*", "Python programming"))

<re.Match object; span=(0, 6), match='Python'>


*The plus sign*

In [67]:
print(re.search(r"o+l+", "goldfish"))

<re.Match object; span=(1, 3), match='ol'>


In [68]:
print(re.search(r"o+l+", "woolly"))

<re.Match object; span=(1, 5), match='ooll'>


In [69]:
print(re.search(r"o+l+", "wool"))

<re.Match object; span=(1, 4), match='ool'>


In [70]:
print(re.search(r"o+l+", "boil"))

None


*The question mark*

In [71]:
print(re.search(r"p?each", "to each his own"))

<re.Match object; span=(3, 7), match='each'>


In [72]:
print(re.search(r"p?each", "peachy"))

<re.Match object; span=(0, 5), match='peach'>
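
To tie the three quantifiers together, here is a small sketch that uses them with `re.findall`. The sample sentences are made up for illustration.

```python
import re

# "?" makes the "u" optional, so both spellings match.
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']

# A quantifier after a character class applies to the whole class:
# "[0-9]+" means one or more digits in a row.
numbers = re.findall(r"[0-9]+", "room 12, floor 3")
print(numbers)  # ['12', '3']
```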


**Escaping Characters**  

What if we want to match a special character itself (like . or +)?  
We can use a backslash (`\`) to escape special characters so that they are treated as literal characters.  
**Note:** most special characters lose their special meaning inside a character class and are treated as literal characters.

In [73]:
print(re.search(r".com", "www.pythoncomputer.com"))

<re.Match object; span=(9, 13), match='ncom'>


In [74]:
print(re.search(r"\.com", "www.pythoncomputer.com"))

<re.Match object; span=(18, 22), match='.com'>
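
When the text we want to match literally comes from a variable, escaping it by hand gets tedious. The `re.escape` function adds the backslashes for us. In this sketch, `domain` and the sample sentences are hypothetical values chosen for illustration.

```python
import re

domain = "example.com"  # hypothetical literal text we want to find

# re.escape puts a backslash in front of every special character.
pattern = re.escape(domain)
print(pattern)  # example\.com

# The escaped dot no longer acts as a wildcard.
wrong = re.search(pattern, "visit examplexcom today")
right = re.search(pattern, "visit example.com today")
print(wrong)          # None
print(right.group())  # example.com
```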


**Other uses of the backslash**

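Some predefined character classes are written as a backslash followed by a character.  
* \\w - word characters (letters, digits, and the underscore) = [a-zA-Z0-9_] for ASCII text
* \\s - whitespace = [ \t\n\r\f\v]
* \\d - digits = [0-9]
* \\b - word boundaries (a position between a word character and a non-word character, rather than a character itself)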

In [75]:
print(re.search(r"\w*", "This is an example."))

<re.Match object; span=(0, 4), match='This'>


In [76]:
print(re.search(r"\w*", "This_is_an_example."))

<re.Match object; span=(0, 18), match='This_is_an_example'>


In [77]:
print(re.search(r"\d+", "There are 12 months"))

<re.Match object; span=(10, 12), match='12'>


In [78]:
print(re.findall(r"\s+", "Space is dark.\nNewline darker."))

[' ', ' ', '\n', ' ']


We can use word boundaries to tell regex that a match must start or end at the edge of a word, as shown here:

In [79]:
print(re.search(r"a\w+", "Match a word with a like awesome."))
print(re.search(r"\ba\w+", "Match a word with a like awesome."))

<re.Match object; span=(1, 5), match='atch'>
<re.Match object; span=(25, 32), match='awesome'>
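
Combining `\b`, `\w`, and `re.findall` gives us a simple way to pull the words out of a sentence. This is a rough sketch rather than a full tokenizer: it treats any run of word characters as a word, and the second sample string is made up to show how underscores and hyphens behave.

```python
import re

words = re.findall(r"\b\w+\b", "This is an example.")
print(words)  # ['This', 'is', 'an', 'example']

# The underscore counts as a word character, but the hyphen does not.
pieces = re.findall(r"\w+", "snake_case and kebab-case")
print(pieces)  # ['snake_case', 'and', 'kebab', 'case']
```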


### A Basic Example

Let's write a pattern that matches valid Python variable names.  
Variable names in Python follow these rules:
* they contain only letters, digits, or underscores
* they begin with a letter or an underscore

The pattern that matches the rules is shown below:

In [80]:
pattern = r"^[a-zA-Z_]\w*$"

In [81]:
print(re.search(pattern, "_this_is_a_valid_variable"))

<re.Match object; span=(0, 25), match='_this_is_a_valid_variable'>


In [82]:
print(re.search(pattern, "_this_is_an invalid_variable"))

None


In [83]:
print(re.search(pattern, "it_can_contain_numbers123"))

<re.Match object; span=(0, 25), match='it_can_contain_numbers123'>


In [84]:
print(re.search(pattern, "123cant_start_with_numbers"))

None
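
The checks above can be wrapped in a small reusable function. The sketch below is one possible way to do it, not the only one: `NAME_PATTERN` and `is_valid_name` are hypothetical names, and it uses `re.compile` with `fullmatch` instead of the `^` and `$` anchors. It also ignores reserved keywords such as `class`, which fit the pattern but cannot be used as variable names.

```python
import re

# Compile once so the pattern can be reused across many calls.
NAME_PATTERN = re.compile(r"[a-zA-Z_]\w*")

def is_valid_name(name):
    """Return True if name looks like a valid Python variable name."""
    # fullmatch succeeds only when the whole string matches the pattern,
    # so the ^ and $ anchors are not needed here.
    return NAME_PATTERN.fullmatch(name) is not None

print(is_valid_name("_this_is_a_valid_variable"))   # True
print(is_valid_name("123cant_start_with_numbers"))  # False
print(is_valid_name("trailing_newline\n"))          # False
```

The last call shows a subtle difference: with `re.search`, the `$` anchor also matches just before a newline at the very end of the string, so the anchored pattern would accept that input, while `fullmatch` rejects it.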


## Advanced RegExp

**More Repetition Qualifiers**  

There is another type of repetition qualifier that gives us finer control over how many repeats we want in our matches.  
Curly brackets ({}) define how many repetitions of the preceding element we want our pattern to match.  
{m} - matches exactly m repeats of the preceding element.  
{m,n} - matches if the preceding element is repeated at least m times and at most n times. Note that there is no space after the comma: Python treats `{m, n}` as literal text rather than as a quantifier.

In [85]:
print(re.search(r"\w{5}", "a scary ghost appears."))

<re.Match object; span=(2, 7), match='scary'>


In [86]:
print(re.findall(r"\w{5}", "a scary ghost appears."))

['scary', 'ghost', 'appea']


We can use a word boundary to explicitly say that we want to match full words.

In [87]:
print(re.findall(r"\b\w{5}\b", "a scary ghost appears."))

['scary', 'ghost']


In [88]:
print(re.findall(r"\b\w{5,10}\b", "a scary ghost appears."))

['scary', 'ghost', 'appears']


A number followed by a comma gives us a lower bound on the number of repeats.

In [89]:
print(re.findall(r"\b\w{5,}\b", "a scary ghost appears."))

['scary', 'ghost', 'appears']


A comma followed by a number gives us an upper bound on the number of repeats.

In [90]:
print(re.findall(r"\bs\w{,5}\b", "a scary ghost appears."))

['scary']
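
The same curly-bracket forms work with any element, not just `\w`. Here is a short sketch using digits; the list of codes is made up for illustration.

```python
import re

text = "Codes: 7, 42, 123, 2024, 31415"

# Between 2 and 4 digits, as a whole number.
two_to_four = re.findall(r"\b\d{2,4}\b", text)
print(two_to_four)  # ['42', '123', '2024']

# Exactly 3 digits.
exactly_three = re.findall(r"\b\d{3}\b", text)
print(exactly_three)  # ['123']

# At least 4 digits.
four_or_more = re.findall(r"\b\d{4,}\b", text)
print(four_or_more)  # ['2024', '31415']
```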


**Capturing Groups**

With regex, we can capture parts of a match (single characters or groups of characters) by wrapping them in parentheses (()).  
The match object lets us access the captures by index, much like a list.  
Index 0 holds the whole matched string.  
Index 1 holds the first captured group.  
Index 2 holds the second captured group,  
and so on.

Here is an example to illustrate this:

In [91]:
result = re.search(r"^(\w+) (\w+)$", "Elvis Presley")
print(result)
print(result.groups())
print(result[0])
print(result[1])
print(result[2])

<re.Match object; span=(0, 13), match='Elvis Presley'>
('Elvis', 'Presley')
Elvis Presley
Elvis
Presley
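
Besides indexing, the match object offers a few other ways to reach the captured groups. A brief sketch using the same pattern:

```python
import re

result = re.search(r"^(\w+) (\w+)$", "Elvis Presley")

# groups() returns a tuple, so it can be unpacked directly.
first, last = result.groups()
print(first, last)  # Elvis Presley

# group() accepts several indices at once and returns a tuple.
print(result.group(1, 2))  # ('Elvis', 'Presley')

# span() also takes a group index and reports where that group matched.
print(result.span(2))  # (6, 13)
```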


Now we have a pattern that we can use to match names. Let's stop printing and write a function that takes a name of the form *FIRSTNAME LASTNAME* and rearranges it into the format *LASTNAME,FIRSTNAME*.

In [92]:
def rearrange_name(name):
    result = re.search(r"^(\w+) (\w+)$", name)
    if result is None:
        return name
    return "{},{}".format(result[2], result[1])

In [93]:
print(rearrange_name("Elvis Presley"))
print(rearrange_name("Edward Norton"))

Presley,Elvis
Norton,Edward


Woohoo! We wrote our first useful function using regex. But wait, the same task can be done pretty easily with regex and without writing a function. The next part shows how.

**Splitting and Replacing**  

Apart from searching for and finding patterns, we can use the `re` module to do much more in Python.  
Two other important uses are splitting strings and replacing parts of strings.  

We can use the `re.split()` function to split a string wherever a regex pattern matches. If the pattern contains a capturing group, the matched separators are kept in the result.  
We can use the `re.sub()` function to replace every match of a pattern with a specified string.

In [94]:
print(re.split(r"[.?!]","One sentence. Another one? The last one!"))

['One sentence', ' Another one', ' The last one', '']


In [95]:
print(re.split(r"([.?!])","One sentence. Another one? The last one!"))

['One sentence', '.', ' Another one', '?', ' The last one', '!', '']


In [96]:
print(re.sub(r"[\w.%+-]+@[\w.-]+", "[REDACTED]", "Received email from lol@google.com"))

Received email from [REDACTED]
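
The split results above still contain leading spaces and a trailing empty string. One way to tidy them up, sketched here under the assumption that sentences end with a single `.`, `?`, or `!`, is to let the pattern consume the whitespace after the punctuation and then drop empty pieces.

```python
import re

text = "One sentence. Another one? The last one!"

# Split on the punctuation mark plus any whitespace that follows it.
parts = re.split(r"[.?!]\s*", text)
# Drop the empty string left over after the final "!".
sentences = [part for part in parts if part]
print(sentences)  # ['One sentence', 'Another one', 'The last one']

# re.sub can collapse runs of whitespace into a single space.
tidy = re.sub(r"\s+", " ", "too   many    spaces")
print(tidy)  # too many spaces
```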


We can reference captured groups in the replacement string by using a backslash followed by the group's index (\\1 for the first group, \\2 for the second, and so on).  
Equipped with this new piece of information and the `re.sub` function, we can now do the work of the `rearrange_name` function in a single line:

In [97]:
print(re.sub(r"(\w+) (\w+)", r"\2,\1", "Elvis Presley"))

Presley,Elvis
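
To get the *LASTNAME,FIRSTNAME* order, the replacement string has to reference the second group before the first. A sketch of a function built on `re.sub` follows; `rearrange_name_sub` is a hypothetical name chosen so that it does not clash with the earlier function.

```python
import re

def rearrange_name_sub(name):
    """Turn 'FIRSTNAME LASTNAME' into 'LASTNAME,FIRSTNAME' using re.sub."""
    # \2 is the last name and \1 is the first name.
    # If the pattern does not match, re.sub returns the input unchanged.
    return re.sub(r"^(\w+) (\w+)$", r"\2,\1", name)

print(rearrange_name_sub("Elvis Presley"))  # Presley,Elvis
print(rearrange_name_sub("Cher"))           # Cher
```

The anchors keep the behaviour close to the earlier function: names that are not exactly two words are returned as they are.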


**Backreferencing**

We learned that we can reference captured groups by using a backslash and the index of the group.  
We can also use the backslash and index to reference a captured group from within the pattern itself. This is known as backreferencing.  
Backreferencing is a fairly advanced regex concept with many uses, but we will look at a small example of where it helps.

We want to match two repeats of the same word, but we don't know what exactly the word is.  
How can we approach this problem using regex?

In [98]:
print(re.search(r"(\w+) (\w+)", "okay okay"))

<re.Match object; span=(0, 9), match='okay okay'>


In [99]:
print(re.search(r"(\w+) (\w+)", "alright alright"))

<re.Match object; span=(0, 15), match='alright alright'>


The above pattern seems to match what we want, right?  
Let's look at a different string.

In [100]:
print(re.search(r"(\w+) (\w+)", "okay alright"))

<re.Match object; span=(0, 12), match='okay alright'>


This is an issue: although the pattern matches all the strings we wanted to match, it also matches pairs of different words.

Backreferencing to the rescue!  
By *backreferencing* the first captured group, we ensure that the second word is the same as the first one that was found. Thus, we match the repeated words that we want and steer clear of the incorrect matches.

In [101]:
print(re.search(r"(\w+) \1", "okay okay"))

<re.Match object; span=(0, 9), match='okay okay'>


In [102]:
print(re.search(r"(\w+) \1", "alright alright"))

<re.Match object; span=(0, 15), match='alright alright'>


In [103]:
print(re.search(r"(\w+) \1", "okay alright"))

None
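
A practical use of this idea is spotting accidentally doubled words in a sentence. The sketch below adds word boundaries around the pattern so that a word ending such as the "is" in "This" is not mistaken for a repeat. The sample sentence is made up for illustration.

```python
import re

text = "This is is a test of the the backreference."

# \b(\w+) captures a whole word; \1\b requires the same word again.
# findall returns the contents of the captured group for each match.
doubled = re.findall(r"\b(\w+) \1\b", text)
print(doubled)  # ['is', 'the']
```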


Phew! That was a lot. Have a second look, experiment, and try out different ideas and matches.  
Feel free to use this notebook as a cheat sheet for regex in Python.