# 100 Regex Exercises in Python

A hands-on regular expressions tutorial for Python students.

## 1-20: Getting started

A couple of warm-up exercises to refresh the basic concepts of regular expressions.


In [2]:
import re
poem = '''Once upon a midnight dreary, while I pondered, weak and weary,
Over many a quaint and curious volume of forgotten lore—
    While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
“’Tis some visitor,” I muttered, “tapping at my chamber door—
            Only this and nothing more.
Ah, distinctly I remember it was in the bleak December;
And each separate dying ember wrought its ghost upon the floor.
    Eagerly I wished the morrow;—vainly I had sought to borrow
    From my books surcease of sorrow—sorrow for the lost Lenore—
For the rare and radiant maiden whom the angels name Lenore—
            Nameless here for evermore.
'''

### 1: Basic pattern matching

Simple regular expressions will look just like any other Python string. Let's use the re.findall function to create a list with all occurrences of the word 'door'. Your regex should be a string preceded by the letter ```r```.

In [3]:
re.findall(r'door',poem)

['door', 'door']
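To see why the `r` prefix is recommended, here is a minimal sketch. The `sample` string is made up for illustration and is not part of the exercise. In a normal Python string, a sequence such as `\n` is converted by Python before the regex engine ever sees it; a raw string leaves the backslash alone.

```python
import re

# A normal string turns '\n' into a single newline character;
# a raw string keeps the backslash and the 'n' as two characters.
normal = '\n'
raw = r'\n'
print(len(normal), len(raw))  # 1 2

# For plain words there is no backslash, so both forms behave the same.
sample = 'chamber door, open door'  # hypothetical sample text
print(re.findall(r'door', sample) == re.findall('door', sample))  # True
```

Using `r'...'` everywhere is simply a habit that avoids surprises once patterns start to contain backslashes.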

### 2: Basic pattern matching - beyond letters

Characters such as commas and spaces can also be used in regular expressions. Let's use re.findall again to match the only occurrence of ```While I nodded, nearly napping```.

In [4]:
re.findall(r'While I nodded, nearly napping',poem)

['While I nodded, nearly napping']

### 3: The re.IGNORECASE flag

By default, Python regular expressions treat uppercase and lowercase letters as different characters. You can pass the ```re.IGNORECASE``` (or ```re.I```) flag as a third argument to ```re.findall``` to change that behavior. Try searching for the word 'nameless' in the poem with the ```re.I``` flag and then without it. Do you notice any differences?

In [7]:
re.findall(r'nameless',poem, re.I)

['Nameless']
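The difference is easier to see side by side. The sketch below uses a made-up `sample` string (an assumption, not the poem) that contains the word in both capitalizations.

```python
import re

sample = 'Nameless here, nameless there'  # hypothetical sample text

# Without the flag, only the exact lowercase spelling matches.
without_flag = re.findall(r'nameless', sample)
# With re.I, capitalization is ignored.
with_flag = re.findall(r'nameless', sample, re.I)

print(without_flag)  # ['nameless']
print(with_flag)     # ['Nameless', 'nameless']

# re.I is just a shorter name for re.IGNORECASE.
print(re.I == re.IGNORECASE)  # True
```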

### 4: Using the re.I flag

Now try it yourself. Let's use re.findall to match all occurrences of the word 'and', with or without capital letters.

In [8]:
re.findall(r'and', poem, re.I)

['and', 'and', 'and', 'And', 'and']

### 5: Character sets [ ]

So far, our regular expressions have looked a lot like normal Python strings. Let's start exploring their power a little bit more. One of the most useful regex features is the character set. You can define one by including any number of characters inside square brackets. Your regular expression will then match any one of the characters inside the set at that position.

Let's look at an example. Complete the code below so that it will match all occurrences of the words "morrow" and "sorrow". We already included the character set for you.

In [9]:

re.findall(r'[ms]orrow',poem)

['morrow', 'sorrow', 'sorrow']
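Here is the same idea on a smaller, made-up string (the `sample` text is an assumption for illustration), which makes it easier to see which words are skipped.

```python
import re

sample = 'The cat sat on the mat with a bat'  # hypothetical sample text

# [cm]at matches 'c' or 'm' followed by 'at'; 'sat' and 'bat' are skipped
# because 's' and 'b' are not in the set.
matches = re.findall(r'[cm]at', sample)
print(matches)  # ['cat', 'mat']
```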

### 6: Using character sets

Now it's your turn. Let's use character sets to match all the occurrences of the words 'tapping', 'napping' and 'rapping'.

In [10]:
re.findall(r'[tnr]apping', poem)

['napping', 'tapping', 'rapping', 'rapping', 'tapping']

### 7: Using character sets (2)

Let's try something a little bit more challenging. How would you use character sets to match all vowels in our poem (case-insensitive)?

In [11]:
re.findall(r'[aeiou]',poem, re.I)

['O',
 'e',
 'u',
 'o',
 'a',
 'i',
 'i',
 'e',
 'a',
 'i',
 'e',
 'I',
 'o',
 'e',
 'e',
 'e',
 'a',
 'a',
 'e',
 'a',
 'O',
 'e',
 'a',
 'a',
 'u',
 'a',
 'i',
 'a',
 'u',
 'i',
 'o',
 'u',
 'o',
 'u',
 'e',
 'o',
 'o',
 'o',
 'e',
 'o',
 'e',
 'i',
 'e',
 'I',
 'o',
 'e',
 'e',
 'a',
 'a',
 'i',
 'u',
 'e',
 'e',
 'e',
 'a',
 'e',
 'a',
 'a',
 'i',
 'A',
 'o',
 'o',
 'e',
 'o',
 'e',
 'e',
 'a',
 'i',
 'a',
 'i',
 'a',
 'a',
 'e',
 'o',
 'o',
 'i',
 'o',
 'e',
 'i',
 'i',
 'o',
 'I',
 'u',
 'e',
 'e',
 'a',
 'i',
 'a',
 'a',
 'e',
 'o',
 'o',
 'O',
 'i',
 'a',
 'o',
 'i',
 'o',
 'e',
 'A',
 'i',
 'i',
 'I',
 'e',
 'e',
 'e',
 'i',
 'a',
 'i',
 'e',
 'e',
 'a',
 'e',
 'e',
 'e',
 'A',
 'e',
 'a',
 'e',
 'a',
 'a',
 'e',
 'i',
 'e',
 'e',
 'o',
 'u',
 'i',
 'o',
 'u',
 'o',
 'e',
 'o',
 'o',
 'E',
 'a',
 'e',
 'I',
 'i',
 'e',
 'e',
 'o',
 'o',
 'a',
 'i',
 'I',
 'a',
 'o',
 'u',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'u',
 'e',
 'a',
 'e',
 'o',
 'o',
 'o',
 'o',
 'o',
 'o',
 'e',
 'o',
 ...]

### 8: Negating a character set

Including the ^ modifier at the beginning of a character set will match everything that is NOT inside the square brackets. ```[^t]```, for example, will match all characters except for the letter 't'.

```re.findall(r'[^b]',poem)``` would match all characters from our string, except for the letter b.
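A minimal sketch of negation on short, made-up strings (assumptions for illustration only). It also shows that `^` only negates when it is the first character inside the brackets.

```python
import re

sample = 'abba'  # hypothetical sample text

# [^b] matches any single character that is not a lowercase 'b'.
not_b = re.findall(r'[^b]', sample)
print(not_b)  # ['a', 'a']

# When ^ is not the first character in the set, it is a literal caret.
caret_or_b = re.findall(r'[b^]', 'a^b')
print(caret_or_b)  # ['^', 'b']
```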

Let's say we want to match all characters in our string that are not vowels. Can you see how negating character sets would help you with that?

In [12]:
re.findall(r'[^aeiou]', poem, re.I)

['n',
 'c',
 ' ',
 'p',
 'n',
 ' ',
 ' ',
 'm',
 'd',
 'n',
 'g',
 'h',
 't',
 ' ',
 'd',
 'r',
 'r',
 'y',
 ',',
 ' ',
 'w',
 'h',
 'l',
 ' ',
 ' ',
 'p',
 'n',
 'd',
 'r',
 'd',
 ',',
 ' ',
 'w',
 'k',
 ' ',
 'n',
 'd',
 ' ',
 'w',
 'r',
 'y',
 ',',
 '\n',
 'v',
 'r',
 ' ',
 'm',
 'n',
 'y',
 ' ',
 ' ',
 'q',
 'n',
 't',
 ' ',
 'n',
 'd',
 ' ',
 'c',
 'r',
 's',
 ' ',
 'v',
 'l',
 'm',
 ' ',
 'f',
 ' ',
 'f',
 'r',
 'g',
 't',
 't',
 'n',
 ' ',
 'l',
 'r',
 '—',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 'W',
 'h',
 'l',
 ' ',
 ' ',
 'n',
 'd',
 'd',
 'd',
 ',',
 ' ',
 'n',
 'r',
 'l',
 'y',
 ' ',
 'n',
 'p',
 'p',
 'n',
 'g',
 ',',
 ' ',
 's',
 'd',
 'd',
 'n',
 'l',
 'y',
 ' ',
 't',
 'h',
 'r',
 ' ',
 'c',
 'm',
 ' ',
 ' ',
 't',
 'p',
 'p',
 'n',
 'g',
 ',',
 '\n',
 's',
 ' ',
 'f',
 ' ',
 's',
 'm',
 ' ',
 'n',
 ' ',
 'g',
 'n',
 't',
 'l',
 'y',
 ' ',
 'r',
 'p',
 'p',
 'n',
 'g',
 ',',
 ' ',
 'r',
 'p',
 'p',
 'n',
 'g',
 ' ',
 't',
 ' ',
 'm',
 'y',
 ' ',
 'c',
 'h',
 'm',
 'b',
 'r',
 ...]
 

### 9: Using ranges in character sets

Character sets can also be used to match a range of characters. ```[a-z]```, for example, will match all lowercase letters from a to z. You can also combine ranges in the same character set: ```[A-Za-z0-9]``` will match both uppercase and lowercase letters between a and z, as well as digits from 0 to 9.
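As a quick illustration of ranges, the sketch below runs three sets against a made-up string (`sample` is an assumption, not part of the exercise).

```python
import re

sample = 'Room 42B, floor 7'  # hypothetical sample text

lower = re.findall(r'[a-z]', sample)             # lowercase letters only
digits = re.findall(r'[0-9]', sample)            # digits only
upper_or_digit = re.findall(r'[A-Z0-9]', sample) # two ranges combined in one set

print(lower)           # ['o', 'o', 'm', 'f', 'l', 'o', 'o', 'r']
print(digits)          # ['4', '2', '7']
print(upper_or_digit)  # ['R', '4', '2', 'B', '7']
```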

Let's use our knowledge of ranges in character sets to match all letters from a to m in our poem (both lowercase and uppercase).

In [13]:
re.findall(r'[A-Ma-m]',poem)

['c',
 'e',
 'a',
 'm',
 'i',
 'd',
 'i',
 'g',
 'h',
 'd',
 'e',
 'a',
 'h',
 'i',
 'l',
 'e',
 'I',
 'd',
 'e',
 'e',
 'd',
 'e',
 'a',
 'k',
 'a',
 'd',
 'e',
 'a',
 'e',
 'm',
 'a',
 'a',
 'a',
 'i',
 'a',
 'd',
 'c',
 'i',
 'l',
 'm',
 'e',
 'f',
 'f',
 'g',
 'e',
 'l',
 'e',
 'h',
 'i',
 'l',
 'e',
 'I',
 'd',
 'd',
 'e',
 'd',
 'e',
 'a',
 'l',
 'a',
 'i',
 'g',
 'd',
 'd',
 'e',
 'l',
 'h',
 'e',
 'e',
 'c',
 'a',
 'm',
 'e',
 'a',
 'a',
 'i',
 'g',
 'A',
 'f',
 'm',
 'e',
 'e',
 'g',
 'e',
 'l',
 'a',
 'i',
 'g',
 'a',
 'i',
 'g',
 'a',
 'm',
 'c',
 'h',
 'a',
 'm',
 'b',
 'e',
 'd',
 'i',
 'm',
 'e',
 'i',
 'i',
 'I',
 'm',
 'e',
 'e',
 'd',
 'a',
 'i',
 'g',
 'a',
 'm',
 'c',
 'h',
 'a',
 'm',
 'b',
 'e',
 'd',
 'l',
 'h',
 'i',
 'a',
 'd',
 'h',
 'i',
 'g',
 'm',
 'e',
 'A',
 'h',
 'd',
 'i',
 'i',
 'c',
 'l',
 'I',
 'e',
 'm',
 'e',
 'm',
 'b',
 'e',
 'i',
 'a',
 'i',
 'h',
 'e',
 'b',
 'l',
 'e',
 'a',
 'k',
 'D',
 'e',
 'c',
 'e',
 'm',
 'b',
 'e',
 'A',
 'd',
 'e',
 'a',
 ...]

### 10: Using ranges in negative character sets

You can use ranges in conjunction with the negation operator ```^``` in your character sets. ```[^0-9]``` will match all characters that are NOT digits from 0 to 9.
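A small sketch of a negated range, using a made-up string (an assumption for illustration). Joining the matches back together shows what is left once the digits are skipped.

```python
import re

sample = 'Call 555-0199 now!'  # hypothetical sample text

# [^0-9] matches every character that is not a digit, including
# spaces and punctuation.
non_digits = re.findall(r'[^0-9]', sample)
print(''.join(non_digits))  # Call - now!
```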

Use the code box below to write a regular expression that will match all characters in our poem that are NOT letters from a to z (lowercase).

In [14]:
re.findall(r'[^a-z]',poem)

['O',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 'I',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ',',
 '\n',
 'O',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '—',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 'W',
 ' ',
 'I',
 ' ',
 ',',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 '\n',
 'A',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ',',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 '“',
 '’',
 'T',
 ' ',
 ' ',
 ',',
 '”',
 ' ',
 'I',
 ' ',
 ',',
 ' ',
 '“',
 ' ',
 ' ',
 ' ',
 ' ',
 '—',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 'O',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 'A',
 ',',
 ' ',
 ' ',
 'I',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 'D',
 ';',
 '\n',
 'A',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 'E',
 ' ',
 'I',
 ' ',
 ' ',
 ' ',
 ';',
 '—',
 ' ',
 'I',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 'F',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '—',
 ' ',
 ' ',
 ' ',
 ' ',
 'L',
 '—',
 '\n',
 'F',
 ...]


### 11: Quantifiers - ?

Character sets have added a little bit of flexibility to our regular expressions, but up to now every character (or character set) in our regular expressions has matched only a single character in our string. This is where quantifiers come to the rescue.

The `?` quantifier is the first one we'll see here. We can use it to indicate that a character in our regular expression is optional: we want our string to match the regex regardless of whether the character appears or not. For example, if we wanted to match both 'a' and 'at', here's what our regular expression would look like:

```r'at?'```
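Here is that pattern in action on a short, made-up string, plus a second hypothetical example (`colou?r`) showing an optional letter in the middle of a word. Both sample strings are assumptions for illustration.

```python
import re

# 't' is optional, so both 'a' and 'at' match; the 'a' in 'as' matches too.
a_or_at = re.findall(r'at?', 'a at as')
print(a_or_at)  # ['a', 'at', 'a']

# The optional character can sit anywhere in the pattern.
spellings = re.findall(r'colou?r', 'color colour')
print(spellings)  # ['color', 'colour']
```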

Now it's your turn. Let's use the `?` quantifier to create a regular expression that will match all occurrences of 'here' and 'there' in our poem.

In [15]:
re.findall(r't?here',poem)

['there', 'here']

### 12: Quantifiers - ? (2)

The code below was written to match the protocol (`http` or `https`) of every URL in the list. However, it's not working properly. Can you fix the regular expression so that it matches both 'http' and 'https'?

In [16]:
urls = '''http://ironhack.com
https://ironhack.com
http://ironhack.gov
http://ironhack.net
https://ironhack.org
'''

re.findall(r'https?',urls)

['http', 'https', 'http', 'http', 'https']

### 13: Quantifiers - *

The `*` quantifier behaves similarly to `?`, but is a lot more powerful. Instead of matching just zero or one occurrence of a character, it will match **any number of occurrences** of the character, from zero upward. ```r'10*'```, for example, will match all of the following numbers: 1, 10, 100, 1000, 10000... and so on.
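A minimal sketch comparing `*` with `?` on a made-up string of numbers (`sample` is an assumption for illustration).

```python
import re

sample = '1 10 100 1000 2'  # hypothetical sample text

# '0*' allows any number of zeros after the 1, including none.
star = re.findall(r'10*', sample)
print(star)  # ['1', '10', '100', '1000']

# '0?' allows at most one zero, so longer numbers are cut short.
optional = re.findall(r'10?', sample)
print(optional)  # ['1', '10', '10', '10']
```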

Using what we just learned, let's write a regular expression that will match all strings in the variable ```hello_hell```.

In [17]:
hello_hell = '''
hello
helloo
helloooo
helloooooooooo
hell hell hell
'''

re.findall(r'hello*',hello_hell)

['hello', 'helloo', 'helloooo', 'helloooooooooo', 'hell', 'hell', 'hell']

### 14: Quantifiers - +

The `+` quantifier is similar to `*`, but with a key difference: instead of matching any number of occurrences from 0 to infinity, it will require the quantified character to appear at least once. `r'10+'`, for example, will match 10,100,1000,10000 and so on... but not the number one if it appears alone.
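The sketch below puts `+` and `*` side by side on a made-up string (an assumption for illustration), so the only difference is whether a lone `1` is accepted.

```python
import re

sample = '1 10 100 1000'  # hypothetical sample text

# '0+' needs at least one zero, so the lone '1' is not matched.
plus = re.findall(r'10+', sample)
print(plus)  # ['10', '100', '1000']

# '0*' accepts zero occurrences, so the lone '1' is matched as well.
star = re.findall(r'10*', sample)
print(star)  # ['1', '10', '100', '1000']
```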

Let's go back to our `hello_hell` variable. How would you use the `+` quantifier so that your regular expression matches all variations of 'hello', but none of the occurrences of the word 'hell' ? 

In [18]:
re.findall(r'hello+',hello_hell)

['hello', 'helloo', 'helloooo', 'helloooooooooo']

### 15: Quantifiers - playing with numbers (binary)

In the next few exercises we will use our knowledge of character sets and quantifiers to match different types of numbers. We'll start with binary. Binary numbers consist of a series of 0s and 1s -- that's it, nothing else. With that in mind, how would you use character sets and quantifiers to match all of the binary numbers contained in the string below (and nothing else)?

In [28]:
is_it_binary = '''
100010
010101
123456
Hello, World
121212
101010
010101
'''

re.findall(r'[01]{6}',is_it_binary)

['100010', '010101', '101010', '010101']
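The solution above relies on `{6}`, a fixed-count quantifier that is only introduced in exercise 20. A minimal sketch on a shorter, made-up string (an assumption for illustration) shows why `+` alone is not enough here.

```python
import re

# Hypothetical sample text, shorter than the exercise data.
sample = '''
1010
1234
0110
'''

# [01]+ alone also picks up the leading '1' of '1234'.
with_plus = re.findall(r'[01]+', sample)
print(with_plus)  # ['1010', '1', '0110']

# Requiring exactly four characters from the set skips '1234', but this
# only works because every number in the sample has the same length.
with_count = re.findall(r'[01]{4}', sample)
print(with_count)  # ['1010', '0110']
```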

### 16: Quantifiers - playing with numbers (positive integers)

Binary was easy. But what if we wanted to match any positive integer in the variable below? How would you change your code to do that?

In [29]:
is_it_a_number = '''
123456
1
ffdgdv
300000
999999
I'm a number, I swear!
<(^_^<) ^(^_^)^ (>^_^)>
0
import pandas as pd... what? Wrong line?
42
'''

re.findall(r'[0-9]+',is_it_a_number)

['123456', '1', '300000', '999999', '0', '42']

### 17: Quantifiers - playing with numbers (any integer)

So far, our code only matches positive numbers: if we try to use the same regular expression to match a negative number, it will ignore the `-` sign. Using your knowledge of quantifiers, how would you change your code to also match the sign of negative numbers? 

In [31]:
any_number = '''
-100
ffdgdv
-2
989899
-I'm a negative number! Look at my sign!
^(^_^)> ~(^_^)~ <(^_^)^
-1
0
1
2
...
42
'''
re.findall(r'-?[0-9]+',any_number)

['-100', '-2', '989899', '-1', '0', '1', '2', '42']

### 18: Quantifiers - playing with numbers (hexadecimal)

Hexadecimal numbers are formed by digits from 0 to 9 and letters from A to F (case insensitive). They are used for several purposes, including the representation of color codes in web pages. How would you write a regular expression to match all valid hexadecimal numbers in the string below?

In [32]:
is_it_hex = '''
FFFFFF
bfbfbf
101010
Wut?
<(^_^)> <(~_^)> <(^_^)>
123123
zzzzzz
a1b2c3
wowowo
'''
re.findall(r'[a-f0-9]{6}', is_it_hex, re.I)

['FFFFFF', 'bfbfbf', '101010', '123123', 'a1b2c3']

### 19: Quantifiers - playing with numbers (no leading zeros)

Here's our final number challenge before we move on to our next topic. When dealing with integers, it's common practice to ignore leading zeros. `0000012`, for example, would become just `12`. `10001`, on the other hand, would still be `10001`, since those zeros make a difference in the number's value. 

Depending on how you approached the previous exercises, your regular expression may match the entire numeric string instead of just the significant part. Let's see if we can match all positive numbers in the string below while skipping any leading zeros. (The number 0 itself will not be tested.)

In [52]:
no_leading_zeros = '''
000012
negative ten billion
90009
10203040506
0<(^_^)>0<(^_^)>0<(^_^)>0
pizza and beer
00002
00000000000000000000000000000000000000000042
'''
re.findall(r'[1-9][0-9]*', no_leading_zeros)

['12', '90009', '10203040506', '2', '42']
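One way to sanity-check a pattern like this is to compare it with Python's own integer parsing, which also drops leading zeros. This is a sketch under assumptions: the `samples` list is made up, and the `int()` cross-check is not part of the exercise.

```python
import re

samples = ['000012', '90009', '00002']  # hypothetical digit-only strings

results = []
for s in samples:
    # One non-zero digit, then any digits: the match starts after the zeros.
    found = re.findall(r'[1-9][0-9]*', s)
    results.append(found)
    # int() drops leading zeros as well, so both should agree.
    print(found, str(int(s)))
```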

### 20: Quantifiers - {_}

The last quantifier we'll see (for now) is the `{}`. It allows us to determine exactly how many consecutive times we want a character to appear in our regular expression. `r'z{4}'`, for example, would match the string `'zzzz'`, but not `'z'` or `'zzz'`. Let's get back to our `hello_hell` variable. Can you use `{}` to write a regular expression that will match only `helloooo`, our secret greeting, but none of the other substrings?
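Before you try it, here is a minimal sketch of the `z{4}` example on made-up strings (assumptions for illustration). Note that a longer run of the character still contains a four-character match.

```python
import re

# Only the run of exactly four z's is long enough to match.
exact = re.findall(r'z{4}', 'z zzz zzzz')
print(exact)  # ['zzzz']

# In a run of six, the first four match and the last two are left over.
longer = re.findall(r'z{4}', 'zzzzzz')
print(longer)  # ['zzzz']
```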

In [53]:
hello_hell = '''
hello
helloo
hellooo
helloooo
hell hell hell
'''
re.findall(r'hello{4}', hello_hell)

['helloooo']

### 21: Quantifiers - {_,_}

You can also use `{}` to define a lower and upper limit on the number of times you want the character to appear in your string. `r'z{3,5}'`, for example, would match only `'zzz'`, `'zzzz'` and `'zzzzz'`: `'z'` or `'zz'` would not be matched, and in `'zzzzzzz'` all characters after the fifth one would be left out of the match.
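The `z{3,5}` example can be checked with a short sketch. The `sample` string is made up for illustration and contains runs of one to five z's plus a run of seven.

```python
import re

sample = 'z zz zzz zzzz zzzzz zzzzzzz'  # hypothetical sample text

# Runs shorter than three are skipped. In the run of seven, the first
# five z's match and the remaining two are too short to match again.
ranged = re.findall(r'z{3,5}', sample)
print(ranged)  # ['zzz', 'zzzz', 'zzzzz', 'zzzzz']
```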

In the string below, we want to match only `'hello, darkness'`, `'helloo, darkness'` and `'hellooo, darkness'`. All other substrings should be ignored. How would you solve that problem using what we just learned?

In [56]:
hello_darkness = '''
hell, darkness
hello, darkness
helloo, darkness
hellooo, darkness
helloooo, darkness
helloooooo, darkness
hellooooooooooooooooo
darkness? darkness? darkneeeeesss?
I'll call you later
'''

re.findall(r'hello{1,3}, darkness', hello_darkness)

['hello, darkness', 'helloo, darkness', 'hellooo, darkness']