# Python RegEx

In this session, you will learn about regular expressions (RegEx).

A **Reg**ular **Ex**pression (RegEx) is a sequence of characters that defines a search pattern. For example,

```python
^a...s$
```

The pattern above matches any five-character string that starts with **`a`** and ends with **`s`**.

| Expression | String | Matched? | 
|:----| :--- |:--- |
|               |  **`abs`** | No match | 
|               |  **`alias`** | Match |
| **`^a...s$`** |  **`abyss`** | Match |
|               |  **`Alias`** | No match |
|               |  **`An abacus`** | No match |

In [1]:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful!")	

Search successful.
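As a quick check of the table above, the following sketch runs the same pattern against all five strings. The names `candidates` and `results` are illustrative and not part of the original example.

```python
import re

pattern = r'^a...s$'
candidates = ['abs', 'alias', 'abyss', 'Alias', 'An abacus']

# re.match() returns a Match object or None, so bool() turns it into True/False
results = {s: bool(re.match(pattern, s)) for s in candidates}

for s, matched in results.items():
    print(f'{s!r}: {"Match" if matched else "No match"}')
```

Only `alias` and `abyss` should be reported as matches, since the pattern is case-sensitive and requires exactly three characters between `a` and `s`.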


## Specify Pattern Using RegEx

To specify regular expressions, metacharacters are used. In the above example, **`^`** and **`$`** are metacharacters.

## MetaCharacters

Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters:

###  **`[] . ^ $ * + ? {} () \ |`**

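### 1. `[]` - Square brackets

Square brackets specify a set of characters you wish to match. In the table below, **`[abc]`** matches any single `a`, `b` or `c` in the string.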



| Expression | String | Matched? | 
|:----| :--- |:--- |
|             |  **`a`** | 1 match | 
|             |  **`ac`** | 2 matches |
| **`[abc]`** |  **`Hey Jude`** | No match |
|             |  **`abc de ca`** | 5 matches |


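You can also specify a range of characters using **`-`** inside square brackets:

* **`[a-c]`** is the same as **`[abc]`**.
* **`[a-z]`** matches any lowercase ASCII letter.
* **`[A-Z]`** matches any uppercase ASCII letter.
* **`[0-3]`** is the same as **`[0123]`**.
* **`[0-9]`** matches any digit.
* **`[A-Za-z0-9]`** matches any ASCII letter or digit.

You can invert a character set by placing the caret **`^`** right after the opening bracket:

* **`[^abc]`** matches any character except `a`, `b` or `c`.
* **`[^0-9]`** matches any non-digit character.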
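The character sets above can be tried out with `re.findall()`. This is a minimal sketch; the string `'Room 42B'` is a made-up sample used only for illustration.

```python
import re

# Count how many characters of each string belong to the set [abc]
strings = ['a', 'ac', 'Hey Jude', 'abc de ca']
counts = {s: len(re.findall(r'[abc]', s)) for s in strings}
print(counts)

# A range and its negated counterpart on a made-up sample string
sample = 'Room 42B'
digits = re.findall(r'[0-9]', sample)
non_digits = re.findall(r'[^0-9]', sample)
print(digits)
print(non_digits)
```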

### 2. `.` - Period

A period matches any single character except a newline (`'\n'`).



| Expression | String | Matched? | 
|:----| :--- |:--- |
|          |  **`a`** | No match | 
|          |  **`ac`** | 1 match |
| **`..`** |  **`acd`** | 1 match |
|          |  **`acde`** | 2 matches (contains 4 characters) |

### 3. `^` - Caret

The caret symbol **`^`** checks whether a string starts with a certain character or pattern.

| Expression | String | Matched? | 
|:----| :--- |:--- |
|          |  **`a`** | 1 match | 
| **`^a`** |  **`abc`** | 1 match |
|          |  **`bac`** | No match |
|----------|------------|------------------------------------------------------------|
| **`^ab`** |  **`abc`** | 1 match |
|          |  **`acd`** | No match (starts with **`a`** but not followed by **`b`**) |

### 4. `$` - Dollar

The dollar symbol **`$`** checks whether a string ends with a certain character or pattern.



| Expression | String | Matched? | 
|:----| :--- |:--- |
|          |  **`a`** | 1 match | 
| **`a$`** |  **`formula`** | 1 match |
|          |  **`cab`** | No match |
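The two anchors can be checked side by side. A minimal sketch using the strings from the caret and dollar tables; the dictionary names are illustrative.

```python
import re

# ^a: the string must start with 'a'
starts_with_a = {s: bool(re.search(r'^a', s)) for s in ['a', 'abc', 'bac']}

# a$: the string must end with 'a'
ends_with_a = {s: bool(re.search(r'a$', s)) for s in ['a', 'formula', 'cab']}

print(starts_with_a)
print(ends_with_a)
```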

### 5. `*` - Star

The star symbol **`*`** matches zero or more occurrences of the pattern to its left.


| Expression | String | Matched? | 
|:----| :--- |:--- |
|            |  **`mn`** | 1 match | 
|            |  **`man`** | 1 match |
| **`ma*n`** |  **`maaan`** | 1 match |
|            |  **`main`** | No match (**`a`** is not followed by **`n`**) |
|            |  **`woman`** | 1 match |

### 6. `+` - Plus

The plus symbol **`+`** matches one or more occurrences of the pattern to its left.


| Expression | String | Matched? | 
|:----| :--- |:--- |
|            |  **`mn`** | No match (no **`a`** character) | 
|            |  **`man`** | 1 match |
| **`ma+n`** |  **`maaan`** | 1 match |
|            |  **`main`** | No match (**`a`** is not followed by **`n`**) |
|            |  **`woman`** | 1 match |

### 7. `?` - Question Mark

The question mark symbol **`?`** matches zero or one occurrence of the pattern to its left.

| Expression | String | Matched? | 
|:----| :--- |:--- |
|            |  **`mn`** | 1 match | 
|            |  **`man`** | 1 match |
| **`ma?n`** |  **`maaan`** | No match (more than one **`a`** character) |
|            |  **`main`** | No match (**`a`** is not followed by **`n`**) |
|            |  **`woman`** | 1 match |
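The three quantifier tables differ only in how many `a` characters they accept between `m` and `n`. The sketch below evaluates all three patterns on the same five words; `words`, `patterns` and `table` are illustrative names.

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']
patterns = {'*': r'ma*n', '+': r'ma+n', '?': r'ma?n'}

# For each quantifier, record whether the pattern is found in each word
table = {
    name: [bool(re.search(p, w)) for w in words]
    for name, p in patterns.items()
}

for name, row in table.items():
    print(name, row)
```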

### 8. `{}` - Braces

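Consider the form **`{n,m}`**. It means at least `n` and at most `m` repetitions of the pattern to its left.

* **`{3}`** means exactly 3 repetitions.
* **`{3,}`** means 3 or more repetitions.
* **`{3,8}`** means from 3 to 8 repetitions.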

| Expression | String | Matched? | 
|:----| :--- |:--- |
|              |  **`abc dat`** | No match | 
|              |  **`abc daat`** | 1 match (at **`daat`**) |
| **`a{2,3}`** |  **`aabc daaat`** | 2 matches (at **`aabc`** and **`daaat`**) |
|              |  **`aabc daaaat`** | 2 matches (at **`aabc`** and **`daaaat`**) |



| Expression | String | Matched? | 
|:----| :--- |:--- |
|                  |  **`ab123csde`** | 1 match (match at **`ab123csde`**) | 
| **`[0-9]{2,4}`** |  **`12 and 345673`** | 3 matches (**`12`**, **`3456`**, **`73`**) |
|                  |  **`1 and 2`** | No match |
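The rows of both tables can be reproduced with `re.findall()`, which makes it easier to see where each match starts and stops. A minimal sketch with illustrative variable names:

```python
import re

# a{2,3}: two or three consecutive 'a' characters
a_runs = re.findall(r'a{2,3}', 'aabc daaaat')

# [0-9]{2,4}: runs of two to four digits, matched greedily from left to right
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
too_short = re.findall(r'[0-9]{2,4}', '1 and 2')

print(a_runs)
print(digit_runs)
print(too_short)
```

In `daaaat` the engine takes three `a` characters first, so the remaining single `a` is too short to form another match.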

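### 9. `|` - Alternation

The vertical bar **`|`** is used for alternation (the `or` operator). Here, **`a|b`** matches any string that contains either `a` or `b`.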



| Expression | String | Matched? | 
|:----| :--- |:--- |
|           |  **`cde`** | No match | 
| **`a\|b`** |  **`ade`** | 1 match (match at **`ade`**) |
|           |  **`acdbea`** | 3 matches (at **`acdbea`**) |



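### 10. `()` - Group

Parentheses **`()`** are used to group sub-patterns. For example, **`(a|b|c)xz`** matches any string that has `a`, `b` or `c` followed by `xz`.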


| Expression | String | Matched? | 
|:----| :--- |:--- |
|                 |  **`ab xz`** | No match | 
| **`(a\|b\|c)xz`** |  **`abxz`** | 1 match (match at **`abxz`**) |
|                 |  **`axz cabxz`** | 2 matches (at **`axz cabxz`**) |
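One detail worth seeing in code: when a pattern contains a group, `re.findall()` returns the text captured by the group rather than the whole match. The sketch below uses `re.finditer()`, a standard `re` function not otherwise covered in this session, to recover the full matches.

```python
import re

# Alternation without a group: each 'a' or 'b' is a separate match
either = re.findall(r'a|b', 'acdbea')

# With a capturing group, findall() returns only the captured part
captured = re.findall(r'(a|b|c)xz', 'axz cabxz')

# finditer() yields Match objects, so group() gives the whole match
full = [m.group() for m in re.finditer(r'(a|b|c)xz', 'axz cabxz')]

print(either)
print(captured)
print(full)
```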

### 11. `\` - Backslash

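Backslash **`\`** is used to escape various characters, including all metacharacters. For example, **`\$a`** matches a `$` followed by `a`: here `$` is not interpreted by the RegEx engine in a special way. If you are unsure whether a character has a special meaning, you can put **`\`** in front of it.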
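A short sketch of escaping in practice. The sample `text` is made up for illustration; it shows the difference between `.` as a metacharacter and `\.` as a literal dot, and how `\$` matches a literal dollar sign.

```python
import re

text = 'Total: $25.50 (approx.)'

# Unescaped '.' matches any character
any_char = re.findall(r'.', 'a.b')

# Escaped '\.' matches only a literal dot
literal_dot = re.findall(r'\.', text)

# Escaped '\$' matches a literal dollar sign followed by digits
price = re.findall(r'\$\d+', text)

print(any_char)
print(literal_dot)
print(price)
```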

### Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

#### `\A` - Matches if the specified characters are at the start of a string.

| Expression | String | Matched? | 
|:----| :--- |:--- |
| **`\Athe`** |  **`the sun`** | Match | 
|             |  **`In the sun`** | No match |

#### `\b` - Matches if the specified characters are at the beginning or end of a word.

| Expression | String | Matched? | 
|:----| :--- |:--- |
|             |  **`football`** | Match | 
| **`\bfoo`** |  **`a football`** | Match |
|             |  **`afootball`** | No match |
|-------------|------------|-----------|
| **`foo\b`** |  **`the foo`** | Match |
|             |  **`the afoo test`** | Match |
|             |  **`the afootest`** | No match |

#### `\B` - Opposite of `\b`. Matches if the specified characters are not at the beginning or end of a word.

| Expression | String | Matched? | 
|:----| :--- |:--- |
|             |  **`football`** | No match | 
| **`\Bfoo`** |  **`a football`** | No match |
|             |  **`afootball`** | Match |
|-------------|------------|-----------|
| **`foo\B`** |  **`the foo`** | No match |
|             |  **`the afoo test`** | No match |
|             |  **`the afootest`** | Match |
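The `\b` and `\B` tables mirror each other, which the following sketch confirms on the same three strings. The list names are illustrative.

```python
import re

samples = ['football', 'a football', 'afootball']

# \bfoo: 'foo' must begin at a word boundary
at_boundary = [bool(re.search(r'\bfoo', s)) for s in samples]

# \Bfoo: 'foo' must NOT begin at a word boundary
not_at_boundary = [bool(re.search(r'\Bfoo', s)) for s in samples]

print(at_boundary)
print(not_at_boundary)
```

Note the raw-string prefix `r`: in a normal Python string, `'\b'` is the backspace character rather than a word boundary.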

#### `\d` - Matches any decimal digit. Equivalent to `[0-9]`. `\D` - Matches any non-decimal digit. Equivalent to `[^0-9]`

- **`\d`** means: match where the string contains digits (numbers from 0-9)
- **`\D`** means: match where the string contains characters that are not digits
    
| Expression | String | Matched? | 
|:----| :--- |:--- |
| **`\D`** |  **`1ab34"50`** | 3 matches (at **`a`**, **`b`** and **`"`**) | 
|          |  **`1345`** | No match |
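To see how `\d` and `\D` divide a string between them, the sketch below applies both to the same input.

```python
import re

text = '1ab34"50'

digits = re.findall(r'\d', text)      # every decimal digit
non_digits = re.findall(r'\D', text)  # everything that is not a digit

print(digits)
print(non_digits)
```

Every character of the string ends up in exactly one of the two lists.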

#### `\s` - Matches where a string contains any whitespace character. Equivalent to `[ \t\n\r\f\v]`.

| Expression | String | Matched? | 
|:----| :--- |:--- |
| **`\s`** |  **`Python RegEx`** | 1 match | 
|          |  **`PythonRegEx`** | No match |

#### `\S` - Matches where a string contains any non-whitespace character. Equivalent to `[^ \t\n\r\f\v]`.

| Expression | String | Matched? | 
|:----| :--- |:--- |
| **`\S`** |  **`a b`** | 2 matches (at  **`a b`**) | 
|          |  **` `** | No match |

#### `\w` - Matches any alphanumeric character (digits and alphabets). Equivalent to `[a-zA-Z0-9_]`. By the way, underscore `_` is also considered an alphanumeric character.

| Expression | String | Matched? | 
|:----| :--- |:--- |
| **`\w`** |  **`12&": ;c`** | 3 matches (at **`12&": ;c`**) | 
|          |  **`%"> !`** | No match |

#### `\W` - Matches any non-alphanumeric character. Equivalent to `[^a-zA-Z0-9_]`.

| Expression | String | Matched? | 
|:----| :--- |:--- |
| **`\W`** |  **`1a2%c`** | 1 match (at **`1a2%c`**) | 
|          |  **`Python`** | No match |

#### `\Z` - Matches if the specified characters are at the end of a string.

| Expression | String | Matched? | 
|:----| :--- |:--- |
|                |  **`I like Python`** | 1 match |
| **`Python\Z`** |  **`I like Python Programming`** | No match | 
|                |  **`Python is fun.`** | No match |

### Summary - MetaCharacters

<div>
<img src="img/regex.png" width="1000"/>
</div>

### Python RegEx Methods

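The **`re`** module offers a set of functions that allow us to search a string for a match:

* **`re.findall`**: returns a list containing all matches.
* **`re.split`**: splits the string wherever there is a match and returns a list of the pieces.
* **`re.sub`**: replaces the matches with a string of your choice.
* **`re.search`**: looks for the first location where the pattern matches anywhere in the string and returns a match object if found, otherwise **`None`**.
* **`re.match()`**: checks for a match only at the beginning of the string and returns a match object if found, otherwise it returns **`None`**.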
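The difference between `re.match()` and `re.search()` is easiest to see on a string where the pattern does not occur at the start. A minimal sketch with illustrative variable names:

```python
import re

text = 'In the sun'

# re.match() only looks at the beginning of the string
at_start = re.match(r'the', text)

# re.search() scans the whole string for the first match
anywhere = re.search(r'the', text)

print(at_start)
print(anywhere.span())
```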


In [2]:
import re

### 1. `re.findall()`


In [3]:
# Example 1: re.findall()

# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string avoids an invalid-escape warning for \d

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

['12', '89', '34']


In [4]:
# Example 2: re.findall()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']

['language', 'language']


In [7]:
# Example 3: re.findall()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']

['Python', 'python']


In [8]:
# Example 4: re.findall()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

matches = re.findall('Python|python', txt)
print(matches)  # ['Python', 'python']

matches = re.findall('[Pp]ython', txt)
print(matches)  # ['Python', 'python']

['Python', 'python']
['Python', 'python']


### 2. `re.split()`


In [10]:
# Example 1: re.split()

txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # splitting using \n - end of line symbol

['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']


In [11]:
# Example 2: re.split()

import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

# Output: ['Twelve:', ' Eighty nine:', '.']

['Twelve:', ' Eighty nine:', '.']


In [13]:
# Example 3: re.split()

import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

# maxsplit=1
# split only at the first occurrence
result = re.split(pattern, string, maxsplit=1) 
print(result)

# Output: ['Twelve:', ' Eighty nine:89 Nine:9.']

['Twelve:', ' Eighty nine:89 Nine:9.']
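The effect of `maxsplit` is clearer when the results are compared on one string. The sketch below passes `maxsplit` by keyword, which keeps the call readable; the variable names are illustrative.

```python
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'

# No limit: split at every run of digits
all_parts = re.split(r'\d+', string)

# Split at the first two runs of digits only
first_two = re.split(r'\d+', string, maxsplit=2)

# If the pattern is not found, the list holds the original string
no_match = re.split(r'\d+', 'no digits here')

print(all_parts)
print(first_two)
print(no_match)
```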


### 3. `re.sub()`

**Syntax:**

```python
re.sub(pattern, replace, string)
```


In [14]:
# Example 1: re.sub()

# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

# Output: abc12de23f456

abc12de23f456


In [17]:
# Example 2: re.sub()

import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'
replace = ''

# count=1 replaces only the first occurrence
new_string = re.sub(pattern, replace, string, count=1) 
print(new_string)

# Output:
# abc12de 23
# f45 6

abc12de 23 
 f45 6


In [18]:
# Example 3: re.sub()

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Pass re.I by keyword: the fourth positional argument of re.sub() is count, not flags
match_replaced = re.sub('Python|python', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # both 'Python' and 'python' are replaced with 'JavaScript'
# OR
match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # same result as above


JavaScript is the most beautiful language that a human being has ever created.
I recommend JavaScript for a first programming language
JavaScript is the most beautiful language that a human being has ever created.
I recommend JavaScript for a first programming language
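A note on arguments: the fourth positional parameter of `re.sub()` is `count`, not `flags`. Because `re.I` has the integer value 2, passing it positionally is read as "replace at most two matches". The sketch below passes both by keyword to keep the intent explicit; the variable names are illustrative.

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Case-insensitive replacement of every occurrence
all_replaced = re.sub(r'python', 'JavaScript', txt, flags=re.I)

# Case-insensitive replacement of the first occurrence only
first_only = re.sub(r'python', 'JavaScript', txt, count=1, flags=re.I)

print(all_replaced)
print(first_only)
```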


In [19]:
# Example 4: re.sub()

txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

matches = re.sub('%', '', txt)
print(matches)

I am teacher and  I love teaching. 
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs. 
Does this motivate you to be a teacher?


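### 4. `re.subn()`

The **`re.subn()`** function is similar to **`re.sub()`**, except that it returns a tuple of two items: the new string and the number of substitutions made.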


In [20]:
# Example 1: re.subn()

# Program to remove all whitespaces
import re

# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches all whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

# Output: ('abc12de23f456', 4)

('abc12de23f456', 4)


### 5. `re.search()`


```python
match = re.search(pattern, string)
```

In [21]:
# Example 1: re.search()

import re

string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")  

# Output: pattern found inside the string

pattern found inside the string


In [22]:
# Example 2: re.search()

import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns an object with span and match
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (100, 105)
# Lets find the start and stop position from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first

<re.Match object; span=(100, 105), match='first'>
(100, 105)
100 105
first


## Match object


In [23]:
import re

txt = 'I love to teach python and javaScript'
# It returns an object with span, and match
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (0, 15)
# Lets find the start and stop position from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to teach

<re.Match object; span=(0, 15), match='I love to teach'>
(0, 15)
0 15
I love to teach


In [24]:
import re

txt = 'I love to teach python and javaScript'
match = re.match('I like to teach', txt, re.I)
print(match)  # None

None
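Since `re.match()` returns `None` when nothing matches, calling `.span()` or `.group()` directly on the result would raise an `AttributeError`. One way to guard against this is sketched below; `first_span` is a hypothetical helper written for this illustration, not part of the `re` module.

```python
import re

txt = 'I love to teach python and javaScript'

def first_span(pattern, text):
    """Return the (start, end) span of a match at the start of text, or None."""
    match = re.match(pattern, text, flags=re.I)
    # Only touch the Match object if there actually is one
    return match.span() if match else None

found = first_span('I love to teach', txt)
missing = first_span('I like to teach', txt)

print(found)
print(missing)
```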


### 1. `match.group()`



In [25]:
# Example 6: Match object

import re

string = '39801 356, 2102 1111'

# Three digit number followed by space followed by two digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
    print(match.group())
else:
    print("pattern not found")

# Output: 801 35

801 35


In [26]:
match.group(1)

'801'

In [27]:
match.group(2)

'35'

In [28]:
match.group(1, 2)

('801', '35')

In [29]:
match.groups()

('801', '35')
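`re.search()` stops at the first match, so the groups above describe only `801 35`. As a sketch of how the same groups behave across the whole string, `re.findall()` with a grouped pattern returns one tuple per match. The variable names are illustrative.

```python
import re

string = '39801 356, 2102 1111'
pattern = r'(\d{3}) (\d{2})'

# First match only: groups() can be unpacked like any tuple
match = re.search(pattern, string)
three_digits, two_digits = match.groups()

# Every match: findall() returns a list of group tuples
all_groups = re.findall(pattern, string)

print(three_digits, two_digits)
print(all_groups)
```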

### 2. `match.start(), match.end() and match.span()`


In [30]:
match.start()

2

In [31]:
match.end()

8

In [32]:
match.span()

(2, 8)

### 3. `match.re and match.string`


In [33]:
match.re

re.compile(r'(\d{3}) (\d{2})', re.UNICODE)

In [34]:
match.string

'39801 356, 2102 1111'

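## Using `r` prefix before RegEx

When the **`r`** or **`R`** prefix is used before a regular expression, the string is a raw string. For example, **`'\n'`** is a newline character, whereas **`r'\n'`** is two characters: a backslash **`\`** followed by **`n`**. Because backslash is also the escape character in RegEx, writing patterns as raw strings avoids having to double every backslash.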


In [35]:
# Example 7: Raw string using r prefix

import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']

['\n', '\r']
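The difference between a normal string and a raw string can be checked directly. A minimal sketch with illustrative variable names:

```python
import re

newline = '\n'   # one character: a line feed
raw = r'\n'      # two characters: a backslash and 'n'
lengths = (len(newline), len(raw))

# Without the r prefix, each backslash in a pattern must be doubled
same_pattern = '\\d+' == r'\d+'

found = re.findall(r'\d+', 'a1b22')

print(lengths)
print(same_pattern)
print(found)
```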


## Example of RegEx with Metacharacters

Let us use examples to clarify the meta characters with RegEx methods:

### Square Brackets `[]`

Let us use square brackets to match both lower and upper case:

In [36]:
# Example 1:

regex_pattern = r'[Aa]pple' # this square bracket means either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']

['Apple', 'apple']


In [37]:
# Example 2:

regex_pattern = r'[Aa]pple|[Bb]anana' # this square bracket means either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']

['Apple', 'banana', 'apple', 'banana']


### Escape character `\`

In [38]:
# Example 1:

regex_pattern = r'\d'  # \d is a special sequence that matches a single digit
txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1', '9', '4', '2', '1', '4', '2', '0', '1', '8', '7', '6'] - this is not what we want

['8', '1', '9', '4', '2', '1', '4', '2', '0', '1', '8', '7', '6']


### One or more times `+`

In [39]:
# Example 1:

regex_pattern = r'\d+'  # \d matches a digit, + means one or more times
txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1942', '14', '2018', '76'] - this is better!

['8', '1942', '14', '2018', '76']


### Period `.`

In [40]:
# Example 1:

regex_pattern = r'[a].'  # this square bracket means a and . means any character except new line
txt = '''Apple and Banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

['an', 'an', 'an', 'a ', 'ar']


In [41]:
# Example 2: [] with +

regex_pattern = r'[a].+'  # . any character, + any character one or more times 
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and Banana are fruits']

['and Banana are fruits']


### Zero or more times `*`

In [42]:
# Example 1:

regex_pattern = r'[a].*'  # . any character, * any character zero or more times 
txt = '''Apple and Banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and Banana are fruits']

['and Banana are fruits']


### Zero or one time `?`

Zero or one time. The pattern may not occur or it may occur once.

In [43]:
# Example 1:

txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']

['e-mail', 'email', 'Email', 'E-mail']


### Quantifier `{}`

We can specify the length of the substring we are looking for in a text, using a curly brackets **`{}`**. Let us imagine, we are interested in a substring with a length of 4 characters:

In [44]:
# Example 1:

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['1942', '2018']

['1942', '2018']


In [45]:
# Example 2:

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'\d{1,4}'   # 1 to 4
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1942', '14', '2018', '76']

['8', '1942', '14', '2018', '76']


### Caret `^`

In [46]:
# Example 1: Starts with

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'^Hawking'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Hawking']

['Hawking']


In [47]:
# Example 2: Negation

txt = "Hawking born on 8 January 1942 and died on 14 March 2018 Einstein's birth anniversary (Pi-Day) and both died at 76"
regex_pattern = r'[^A-Za-z ]+'  # ^ in set character means negation, not A to Z, not a to z, no space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['8', '1942', '14', '2018', "'", '(', '-', ')', '76']

['8', '1942', '14', '2018', "'", '(', '-', ')', '76']


## 💻 Exercises ➞ <span class='label label-default'>RegEx</span>

### Exercises ➞ <span class='label label-default'>Level 1</span>

 1. What is the most frequent word in the following paragraph?
    - ```py
paragraph = 'I love teaching. If you do not love teaching what else can you love. I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.'
    ```

   - ```py
    [
    (6, 'love'),
    (5, 'you'),
    (3, 'can'),
    (2, 'what'),
    (2, 'teaching'),
    (2, 'not'),
    (2, 'else'),
    (2, 'do'),
    (2, 'I'),
    (1, 'which'),
    (1, 'to'),
    (1, 'the'),
    (1, 'something'),
    (1, 'if'),
    (1, 'give'),
    (1, 'develop'),
    (1, 'capabilities'),
    (1, 'application'),
    (1, 'an'),
    (1, 'all'),
    (1, 'Python'),
    (1, 'If')
    ]
    ```

2. The positions of some particles on the horizontal x-axis are -12, -4, -3 and -1 in the negative direction, 0 at origin, 4 and 8 in the positive direction. Extract these numbers from this whole text and find the distance between the two furthest particles.

    - ```py
points = ['-12', '-4', '-3', '-1', '0', '4', '8']
sorted_points =  [-12, -4, -3, -1, 0, 4, 8]
distance = 8 - (-12) # 20
    ```

### Exercises ➞ <span class='label label-default'>Level 2</span>

1. Write a pattern which identifies if a string is a valid python variable

    - ```py
    is_valid_variable('first_name')  # True
    is_valid_variable('first-name')  # False
    is_valid_variable('1first_name') # False
    is_valid_variable('firstname')   # True
    ```


### Exercises ➞ <span class='label label-default'>Level 3</span>

1. Clean the following text. After cleaning, count the three most frequent words in the string.

    - ```py
    sentence = '''%I $am@% a %tea@cher%, &and& I lo%#ve %tea@ching%;. There $is nothing; &as& mo@re rewarding as educa@ting &and& @emp%o@wering peo@ple. ;I found tea@ching m%o@re interesting tha@n any other %jo@bs. %Do@es thi%s mo@tivate yo@u to be a tea@cher!?'''

    print(clean_text(sentence))
    I am a teacher and I love teaching There is nothing as more rewarding as educating and empowering people I found teaching more interesting than any other jobs Does this motivate you to be a teacher
    print(most_frequent_words(cleaned_text)) # [(3, 'I'), (2, 'teaching'), (2, 'teacher')]
    ```