# Introduction to Regular Expressions in Python

## What are Regular Expressions?


Regular expressions, or regex for short, are sequences of characters that define a pattern to search for in a string. Some examples are `^\w{5}\s\w{5}$`, `[xyz]{2}`, and `\d{4}`.

Each of these regular expressions is built from smaller pieces, each of which matches one particular kind of text. Combining the pieces produces a more complex regular expression that describes a longer, more specific pattern.
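
As a rough illustration of how the pieces combine, the sketch below takes the first example pattern apart and reassembles it. The two test strings are made up for this example.

```python
import re

# Pieces of the pattern ^\w{5}\s\w{5}$, read left to right
pieces = [
    r'^',      # anchor: start of the string
    r'\w{5}',  # exactly five word characters
    r'\s',     # one whitespace character
    r'\w{5}',  # exactly five more word characters
    r'$',      # anchor: end of the string
]
pattern = ''.join(pieces)

# Hypothetical test strings
matches_hello = re.search(pattern, 'hello world') is not None
matches_short = re.search(pattern, 'hi world') is not None
print(pattern, matches_hello, matches_short)
```

Only the first string has five word characters on each side of the space, so only it matches.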

## How can regex be used?

Regex can be used on a single string or on many strings stored in a list or dictionary. It is a very useful tool for data scientists because it can search through large bodies of text, collections of strings, or table columns that contain strings. Whenever you have many strings, a large text file, or one very long string, and you need to find exactly where a pattern occurs (and possibly operate on those locations), regex comes in handy.

### The regex methods

The first step in using regex is importing the module called `re`.

The main `re` methods are:
- `findall()` $\to$ searches a string for ALL instances of a pattern,
- `search()` $\to$ searches the string for the FIRST instance of a pattern,
- `match()` $\to$ checks whether the pattern matches at the BEGINNING of the string,
- `sub()` $\to$ replaces all (or a specified number of) the substrings that match a pattern,
- `split()` $\to$ splits the string at each instance of a pattern (or at a specified number of instances).

Note: these are functions of the `re` module, so you call them with the `re.` prefix (`re.findall()`, `re.sub()`, etc.).
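
The five methods can be compared side by side on one small input. This is a minimal sketch: the sample string and pattern are invented for illustration.

```python
import re

text = 'cat bat rat'      # hypothetical sample string
pattern = r'[bcr]at'      # 'b', 'c' or 'r' followed by 'at'

all_hits = re.findall(pattern, text)      # list with every match
first_hit = re.search(pattern, text)      # Match object for the first match, or None
at_start = re.match(r'bat', text)         # None, because 'bat' is not at the beginning
replaced = re.sub(pattern, 'dog', text)   # new string with every match replaced
parts = re.split(r'\s', text)             # list of pieces between whitespace characters

print(all_hits, first_hit.group(), at_start, replaced, parts)
```

Note that `search()` and `match()` return a match object (or `None`), while `findall()` and `split()` return lists and `sub()` returns a string.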

### The arguments

Suppose you have a string called `name` and a pattern, `regex`. Then the methods above would look like:

- `findall(pattern=regex, string=name, flags (optional))`,
- `search(pattern=regex, string=name, flags (optional))`,
- `match(pattern=regex, string=name, flags (optional))`,

where `flags` is an optional argument that changes how the pattern is matched against the string (for example, `re.IGNORECASE` makes matching case-insensitive).

The other methods would look like:

- `sub(pattern=regex, repl=replacement, string=name, count (int, optional), flags (optional))`
- `split(pattern=regex, string=name, maxsplit (int, optional), flags (optional))`

Here `replacement` is the string that replaces each substring of `name` matching the regex pattern, `count` is the maximum number of replacements to make, and `maxsplit` is the maximum number of splits to perform. Leaving `count` or `maxsplit` at its default of 0 means no limit.
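
The optional arguments are easiest to understand by trying them. The sketch below passes `flags`, `count`, and `maxsplit` by keyword. The strings are hypothetical samples.

```python
import re

name = 'Ada Lovelace, ada lovelace'   # hypothetical sample string
csv_line = 'a,b,c,d'                  # hypothetical sample string

# flags: re.IGNORECASE lets the pattern 'ada' match 'Ada' as well
both = re.findall(pattern=r'ada', string=name, flags=re.IGNORECASE)

# count: stop after the first replacement
one_sub = re.sub(pattern=r'\s', repl='_', string=name, count=1)

# maxsplit: perform at most two splits and leave the remainder intact
two_splits = re.split(pattern=r',', string=csv_line, maxsplit=2)

print(both, one_sub, two_splits)
```

Passing these arguments by keyword keeps each call readable and avoids mixing up their positions.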

In [32]:
import re

## Let's write some regex!

To write accurate regex, we need to know what each building block means. Refer to the table below:

| Regular Expression | Description | Example |
| --- | --- | --- |
| [...] | match any one character contained within the brackets | [abc] |
| [^...] | match any one character not contained in the brackets | [^a] |
| - | inside brackets, match a range of characters/digits | [0-9] |
| . | match any single character except a newline | a. |
| \d | match any digit | a\d (same as a[0-9] for ASCII digits) |
| \w | match any word character (letter, digit, or underscore) | a\w |
| \s | match any whitespace character | a\s |
| [A-Za-z] | match any ASCII letter | a[A-Za-z] |
| ? | 0 or 1 instance of the preceding character | a? |
| + | 1 or more instances of the preceding character | a+ |
| * | 0 or more instances of the preceding character | a* |
| {n} | exactly n instances of the preceding character | a{2} |
| ^ | start of a string | ^a |
| \$ | end of a string | a$ |
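
One character class deserves a warning. `[A-z]` is sometimes used as shorthand for "any letter", but it covers the whole ASCII range from `A` to `z`, which includes several punctuation characters. The sketch below compares it with `[A-Za-z]` and `\w` on a made-up string.

```python
import re

sample = 'a_b[c]^D'   # hypothetical string mixing letters and punctuation

# [A-z] spans the ASCII range from 'A' to 'z', which also contains [ \ ] ^ _ `
loose = re.findall(r'[A-z]', sample)

# [A-Za-z] lists the two letter ranges explicitly
strict = re.findall(r'[A-Za-z]', sample)

# \w matches letters, digits and the underscore
word_chars = re.findall(r'\w', sample)

print(loose)
print(strict)
print(word_chars)
```

Only `[A-Za-z]` returns the letters alone.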

### Let's try some examples

1. We define `string = 'I23like45to2drink6coffee90every8morning!'`. Suppose we want to find all instances of 2 consecutive digits in this string. We have two options: use `[0-9]` or `\d`.

In [33]:
string = 'I23like45to2drink6coffee90every8morning!'
regex = '[0-9]{2}'
result = re.findall(pattern=regex, string=string)
print(result)

regex = r'\d{2}'  # raw string, so the backslash reaches the regex engine unchanged
result = re.findall(pattern=regex, string=string)
print(result)

['23', '45', '90']
['23', '45', '90']


Both patterns return the same result.

2. Using the same string, suppose we want to find all characters that are not digits. `[^\d]` does exactly that; the letter-based patterns that follow are narrower and skip the `!`:

In [34]:
string = 'I23like45to2drink6coffee90every8morning!'
regex = r'[^\d]'  # any non-digit; r'\D' is equivalent
result = re.findall(pattern=regex, string=string)
print(result, '\n') 

regex = '[a-z]'
result = re.findall(pattern=regex, string=string)
print(result, '\n') # notice that this ignores the uppercase I at the beginning 

regex = '[a-z]'
result = re.findall(pattern=regex, string=string, flags=re.IGNORECASE)
print(result, '\n') # includes the "I"

regex = '[A-Za-z]'
result = re.findall(pattern=regex, string=string)
print(result, '\n') # [A-Za-z] lists both letter ranges; avoid [A-z], which also matches [ \ ] ^ _ `

['I', 'l', 'i', 'k', 'e', 't', 'o', 'd', 'r', 'i', 'n', 'k', 'c', 'o', 'f', 'f', 'e', 'e', 'e', 'v', 'e', 'r', 'y', 'm', 'o', 'r', 'n', 'i', 'n', 'g', '!'] 

['l', 'i', 'k', 'e', 't', 'o', 'd', 'r', 'i', 'n', 'k', 'c', 'o', 'f', 'f', 'e', 'e', 'e', 'v', 'e', 'r', 'y', 'm', 'o', 'r', 'n', 'i', 'n', 'g'] 

['I', 'l', 'i', 'k', 'e', 't', 'o', 'd', 'r', 'i', 'n', 'k', 'c', 'o', 'f', 'f', 'e', 'e', 'e', 'v', 'e', 'r', 'y', 'm', 'o', 'r', 'n', 'i', 'n', 'g'] 

['I', 'l', 'i', 'k', 'e', 't', 'o', 'd', 'r', 'i', 'n', 'k', 'c', 'o', 'f', 'f', 'e', 'e', 'e', 'v', 'e', 'r', 'y', 'm', 'o', 'r', 'n', 'i', 'n', 'g'] 



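Note: `[^\d]` matches every non-digit character, including punctuation such as `!`, while `[a-z]` matches lowercase letters only. To include uppercase letters, either pass the `re.IGNORECASE` flag or write both ranges explicitly as `[A-Za-z]`. Be careful with the shorthand `[A-z]`: it covers every ASCII character from `A` to `z`, which includes the punctuation characters `[`, `\`, `]`, `^`, `_`, and the backtick, so it matches more than letters.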

3. Now, we define a list of strings and we want to find which ones start with a number.

In [35]:
strings = [
    '123apple', 
    'pear456', 
    '2orange3', 
    'banana45',
    '78mango3', 
    ''
]

regex = r'^\d+'
for string in strings:
    result = re.search(regex, string)
    if result:
        print("The following string is a match:", string)
        print(result, '\n')

The following string is a match: 123apple
<re.Match object; span=(0, 3), match='123'> 

The following string is a match: 2orange3
<re.Match object; span=(0, 1), match='2'> 

The following string is a match: 78mango3
<re.Match object; span=(0, 2), match='78'> 



We use `^` to show that the pattern STARTS with `\d+`, 1 digit or more. 
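
The `^` anchor matters because `re.search()` scans the whole string. A small sketch of the difference, using a shortened version of the list above, also shows that `re.match()` checks only the beginning of the string and so behaves like an anchored search here.

```python
import re

strings = ['123apple', 'pear456', '']

# re.match only looks at the beginning of the string, so no ^ is needed
via_match = [s for s in strings if re.match(r'\d+', s)]

# re.search scans the whole string, so ^ is what pins it to the start
via_search = [s for s in strings if re.search(r'^\d+', s)]

# without ^, re.search also accepts digits that appear later in the string
unanchored = [s for s in strings if re.search(r'\d+', s)]

print(via_match, via_search, unanchored)
```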

What about the strings that end with a number?

In [36]:
strings = [
    '123apple', 
    'pear456', 
    '2orange3', 
    'banana45',
    '78mango3', 
    ''
]

regex = r'\d+$'
for string in strings:
    result = re.search(regex, string)
    if result:
        print("The following string is a match:", string)
        print(result, '\n')

The following string is a match: pear456
<re.Match object; span=(4, 7), match='456'> 

The following string is a match: 2orange3
<re.Match object; span=(7, 8), match='3'> 

The following string is a match: banana45
<re.Match object; span=(6, 8), match='45'> 

The following string is a match: 78mango3
<re.Match object; span=(7, 8), match='3'> 



We use `$` to show that the pattern ENDS with `\d+`, 1 digit or more.

What about the strings that start AND end with a number?


In [37]:
strings = [
    '123apple', 
    'pear456', 
    '2orange3', 
    'banana45',
    '78mango3', 
    ''
]

regex = r'^\d+\w*\d+$'
for string in strings:
    result = re.search(regex, string)
    if result:
        print("The following string is a match:", string)
        print(result, '\n')

The following string is a match: 2orange3
<re.Match object; span=(0, 8), match='2orange3'> 

The following string is a match: 78mango3
<re.Match object; span=(0, 8), match='78mango3'> 



Here, we combine both anchors and add one more piece in the middle. Strings that start and end with digits can have other characters in between, so we account for those with `\w*`, which matches 0 or more word characters (letters, digits, or underscores). This gives the combined regex `^\d+\w*\d+$`. Note that the pattern requires at least two digits, so a one-character string such as `'5'` does not match.
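
It is worth probing a combined pattern with a few edge cases before relying on it. The inputs below are hypothetical and chosen to show what `^\d+\w*\d+$` accepts and rejects.

```python
import re

regex = r'^\d+\w*\d+$'
edge_cases = ['12', '7', '1-2', '3_4']   # hypothetical inputs

# Map each input to True/False depending on whether the pattern matches
results = {s: bool(re.search(regex, s)) for s in edge_cases}
print(results)
```

A single digit fails because the pattern needs one digit at each end, and `'1-2'` fails because `-` is not a word character. The underscore in `'3_4'` is a word character, so that string is accepted.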

4. Now, let us define a list of names. We want to find all names that consist of exactly two words: a first and a last name.

In [70]:
names = [
    'Michael Scott',
    'Helena Bonham Carter',
    'MathematicalWizard',
    'Ada Lovelace',
    'Eddie Vedder',
    'Jigglypuff',
    'Anakin',
    'Edna Mode',
    'Señor Don Gato',
    'Daisy',
    'Ted Mosby',
]

In [71]:
regex = r'^\w+\s+\w+$'

# \w+ matches one or more word characters
# \s+ matches one or more whitespace characters

for name in names:
    result = re.search(regex, name)
    if result:
        print(name)

Michael Scott
Ada Lovelace
Eddie Vedder
Edna Mode
Ted Mosby


5. Find all names that have a first, middle, and last name

In [72]:
regex = r'^\w+\s+\w+\s+\w+$'

for name in names:
    result = re.search(regex, name)
    if result:
        print(name)

Helena Bonham Carter
Señor Don Gato
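
The two-word and three-word patterns differ only in how often the "whitespace + word" piece repeats, so they can be generated from a word count. The helper `has_n_words` below is a hypothetical name introduced for this sketch, and it uses a non-capturing group `(?:...)`, which is not covered in the table above.

```python
import re

def has_n_words(name, n):
    """Hypothetical helper: True if `name` is exactly n whitespace-separated words."""
    # one word, then (n - 1) repetitions of "whitespace + word"
    regex = rf'^\w+(?:\s+\w+){{{n - 1}}}$'
    return re.search(regex, name) is not None

names = ['Michael Scott', 'Helena Bonham Carter', 'Daisy']
two_words = [x for x in names if has_n_words(x, 2)]
three_words = [x for x in names if has_n_words(x, 3)]
print(two_words, three_words)
```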


6. Find all names that have 2 consecutive vowels in any part of the name

In [73]:
regex = '[aeiou]{2}'

for name in names:
    result = re.search(regex, name)
    if result:
        print(name)

Michael Scott
Eddie Vedder
Daisy
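
Because `[aeiou]` lists lowercase vowels only, a name that starts with two vowels, the first of them capitalized, is missed. A minimal sketch with the `re.IGNORECASE` flag shows the difference. `'Aaron Burr'` is a hypothetical name added for this example.

```python
import re

names = ['Aaron Burr', 'Daisy', 'Anakin']   # 'Aaron Burr' is a hypothetical addition

case_sensitive = [n for n in names if re.search(r'[aeiou]{2}', n)]
case_insensitive = [n for n in names
                    if re.search(r'[aeiou]{2}', n, flags=re.IGNORECASE)]

print(case_sensitive)
print(case_insensitive)
```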


7. Find all names that begin with 4 consecutive word characters followed by a whitespace

In [74]:
regex = r'^\w\w\w\w\s'

for name in names:
    result = re.search(regex, name)
    if result:
        print(name)

Edna Mode
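
Writing `\w` four times works, but the `{n}` quantifier expresses the same thing more compactly. The sketch below checks that both spellings agree on a subset of the names.

```python
import re

names = ['Edna Mode', 'Ted Mosby', 'Eddie Vedder']   # subset of the list above

repeated = [n for n in names if re.search(r'^\w\w\w\w\s', n)]
quantified = [n for n in names if re.search(r'^\w{4}\s', n)]

print(repeated, quantified)
```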


8. Find all names that are exactly 5 word characters

In [75]:
regex = r'^\w{5}$'

for name in names:
    result = re.search(regex, name)
    if result:
        print(name)

Daisy


## The findall() and sub() methods

The `findall()` method takes 2 required arguments: a regex pattern and a string, in addition to the optional `flags`. It will search for ALL instances of a pattern in a string. 

The `sub()` method will search a string for a regex pattern and replace all (or a defined count of) substrings that match the pattern with a new specified replacement string, called `repl`. 

Let's try some more examples of `findall()` and try out the `sub()` method! We will use the same string as before, `string = 'I23like45to2drink6coffee90every8morning!'`.

1. Suppose we want to find all instances of a number (1 or more digits) and print them. We can use `findall()` with `\d+`.


In [44]:
string = 'I23like45to2drink6coffee90every8morning!'

regex = r'\d+'
result = re.findall(pattern=regex, string=string)
print(result)

['23', '45', '2', '6', '90', '8']


Now, if we want to replace each of those numbers with a single space, we can use the `sub()` method. Here we pass an argument called `repl`, the string that replaces every match in the original string.

In [45]:
string = 'I23like45to2drink6coffee90every8morning!'
 
regex = r'\d+'
replacement = ' '
result = re.sub(pattern=regex, repl=replacement, string=string)
print(result)

I like to drink coffee every morning!


This returns the string with every number replaced by a space, which turns it into a readable sentence.

Notice that both methods have a similar operation at first. Both of them search for all instances of a pattern, but it is the `sub()` method that will replace those instances with a new string. `findall()` will return a list of all of the matching instances, and `sub()` will return the new string after replacement.

2. `sub()` has an additional argument, `count`, that may be useful depending on the problem. If we want to replace only the first 3 numbers in the string above, we can set `count = 3`.


In [46]:
string = 'I23like45to2drink6coffee90every8morning!'

regex = r'\d+'
replacement = ' '
count = 3
result = re.sub(pattern=regex, repl=replacement, string=string, count=count)
print(result)

I like to drink6coffee90every8morning!


Now, only the first 3 instances of a number are replaced with a single whitespace.

3. We define a list of phone numbers in string format. We want to iterate through and find all valid phone numbers that are composed of exactly 10 digits.

In [76]:
phone_numbers = [
'1234567890', 
'7134273948',
'5123940057',
'713923840',
'40583948',
'80029394583',
'2139495047',
'3452847390',
] 


regex = r'^\d{10}$'
for num in phone_numbers:
    result = re.findall(pattern=regex, string=num)
    if result:
        print(result)

['1234567890']
['7134273948']
['5123940057']
['2139495047']
['3452847390']


Now suppose the phone numbers carry a country code, and we want to find the French numbers (country code `+33`) that have exactly 10 digits after the country code.

In [48]:
phone_numbers = [
'+62 234567890', 
'+33 7134273948',
'+1 5123940057',
'713923840',
'+34 40583948',
'+61 029394583',
'+2139495047',
'+1 3452847390',
'+1 153948273',
'+33 2148304938'
] 


regex = r'^\+33\s\d{10}$'
for num in phone_numbers:
    result = re.findall(pattern=regex, string=num)
    if result:
        print(result)

['+33 7134273948']
['+33 2148304938']


We want the number to start with `+33` followed by a whitespace. The `^` anchors the match to the start of the string, `\+33` matches the literal characters `+33` (the backslash is needed because a bare `+` is a quantifier), `\s` matches one whitespace character, and `\d{10}$` matches exactly 10 digits up to the end of the string. Writing `[+33]{3}` instead would be a bug: square brackets define a character class, so it would accept any three characters drawn from `+` and `3`, such as `333`.
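
To see why the country code is safer written as the literal `\+33` than inside square brackets, the sketch below runs both spellings against a few candidates. The second and third candidates are hypothetical strings invented to expose the difference.

```python
import re

loose = r'^[+33]{3}\s\d{10}$'   # character class: any three characters from '+' and '3'
strict = r'^\+33\s\d{10}$'      # the literal text '+33'

candidates = ['+33 7134273948', '333 7134273948', '3+3 7134273948']

loose_hits = [c for c in candidates if re.search(loose, c)]
strict_hits = [c for c in candidates if re.search(strict, c)]

print(loose_hits)
print(strict_hits)
```

The character-class version accepts all three candidates, while the escaped literal accepts only the number that really begins with `+33`.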