# Data Wrangling with Python - Regular Expressions

#### Import Library

In [2]:
import re

## Finding characters in a text
Let's use the sample text below and see how we can extract different parts of it using regular expressions.

In [3]:
sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

#### Exact match
Let's find all the occurrences of 'abc' in the above text

In [4]:
pattern = re.compile(r'abc') # compile and store the pattern you want to search for
matches = pattern.findall(sample_text) # find all instances in the text that match the pattern
print(matches) # the result is a Python list of strings

['abc', 'abc']
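As a small aside, the same search can be sketched without an explicit compile step, and `finditer` can report where each match sits in the text. This is only an illustration on the sample text above; the variable name `spans` is chosen here for the example.

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

# re.findall compiles the pattern internally and returns the same list
matches = re.findall(r'abc', sample_text)
print(matches)

# finditer yields match objects, which also carry the (start, end) position
spans = [m.span() for m in re.finditer(r'abc', sample_text)]
print(spans)
```

Compiling first is mainly useful when the same pattern is reused many times, which is why the rest of this notebook keeps using `re.compile`.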


<b>Note:</b> The compiled pattern is case-sensitive, so searching for 'abc' does not find 'ABC'.
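If you do want 'ABC' as well, one option is the `re.IGNORECASE` flag. The snippet below is a minimal sketch on the same sample text; the name `matches_any_case` is just for this example.

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

# re.IGNORECASE makes the pattern match regardless of letter case
pattern = re.compile(r'abc', re.IGNORECASE)
matches_any_case = pattern.findall(sample_text)
print(matches_any_case)
```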

#### Match everything
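The dot character <b>.</b> matches any single character except a new line, so findall returns every character in the text one by one (repeated characters included)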

In [5]:
pattern = re.compile(r'.')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', ' ', 'x', 'y', 'z', '%', '$', '.', '.', '.', '|', ' ', '6', '7', '6', '-', '8', '9', '8', ' ']
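The sample text has no line breaks, so the new line exception does not show up in the output above. Here is a small sketch with a made-up two-line string (`two_lines` is hypothetical, not part of the notebook's data) that shows the default behaviour and the `re.DOTALL` flag.

```python
import re

two_lines = "ab\ncd"  # hypothetical two-line string for illustration

# By default the dot skips the newline character
default_matches = re.findall(r'.', two_lines)
print(default_matches)

# With re.DOTALL the dot matches the newline as well
dotall_matches = re.findall(r'.', two_lines, flags=re.DOTALL)
print(dotall_matches)
```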


#### Match all numbers
The identifier <b>\d</b> finds all digits - numbers from 0 to 9

In [6]:
pattern = re.compile(r'\d')
matches = pattern.findall(sample_text)
print(matches)

['6', '7', '6', '8', '9', '8']


#### Match all which are NOT numbers
The identifier <b>\D</b> finds all characters which are not digits

In [7]:
pattern = re.compile(r'\D')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', ' ', 'x', 'y', 'z', '%', '$', '.', '.', '.', '|', ' ', '-', ' ']


#### Match all word characters
The identifier <b>\w</b> finds all word characters - letters a to z and A to Z, digits 0 to 9, and the underscore

In [8]:
pattern = re.compile(r'\w')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', 'x', 'y', 'z', '6', '7', '6', '8', '9', '8']
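Notice that the digits are part of the result above. If you only want letters, one possible approach is a character class such as `[A-Za-z]` (square brackets are covered later in this notebook). The sketch below assumes plain ASCII letters are all you need; for Python `str` patterns, `\w` also matches non-ASCII letters by default.

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

# \w matches letters, digits and the underscore
word_chars = re.findall(r'\w', "a_1")
print(word_chars)

# A character class restricted to ASCII letters leaves the digits out
letters_only = re.findall(r'[A-Za-z]', sample_text)
print(''.join(letters_only))
```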


#### Match all which are NOT word characters
The identifier <b>\W</b> finds all characters which are not letters, digits or the underscore

In [9]:
pattern = re.compile(r'\W')
matches = pattern.findall(sample_text)
print(matches)

[' ', '%', '$', '.', '.', '.', '|', ' ', '-', ' ']


#### Match all whitespaces
The identifier <b>\s</b> finds all whitespaces - space, tab, new line

In [10]:
pattern = re.compile(r'\s')
matches = pattern.findall(sample_text)
print(matches)

[' ', ' ', ' ']
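The sample text only contains plain spaces. To see tabs and new lines being matched too, here is a short sketch on a made-up string (`whitespace_text` is hypothetical and only used for this illustration).

```python
import re

whitespace_text = "a b\tc\nd"  # hypothetical string with a space, a tab and a newline

# \s matches each kind of whitespace character
whitespace_matches = re.findall(r'\s', whitespace_text)
print(whitespace_matches)
```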


#### Match all which are NOT whitespaces
The identifier <b>\S</b> finds all which are not space, tab and new line

In [11]:
pattern = re.compile(r'\S')
matches = pattern.findall(sample_text)
print(matches)

['a', 'b', 'c', 'x', 'y', 'z', 'A', 'B', 'C', 'X', 'Y', 'Z', 'a', 'b', 'c', 'x', 'y', 'z', '%', '$', '.', '.', '.', '|', '6', '7', '6', '-', '8', '9', '8']
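A quick way to convince yourself that each upper-case identifier is the exact opposite of its lower-case partner is to count the matches: together they should cover every character of the text. The loop below is a small sketch of that check; the `counts` dictionary is a name introduced here.

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

counts = {}
for lower, upper in [(r'\d', r'\D'), (r'\w', r'\W'), (r'\s', r'\S')]:
    n_lower = len(re.findall(lower, sample_text))
    n_upper = len(re.findall(upper, sample_text))
    counts[lower] = (n_lower, n_upper)
    # every character is matched by exactly one identifier of the pair
    print(lower, n_lower, upper, n_upper, n_lower + n_upper == len(sample_text))
```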


#### Reserved Characters 
Some characters are reserved for specific functions in regular expressions - . ^ $ * + ? { } [ ] | \ ( )

We cannot use them directly to match themselves. For example, as we saw earlier, using '.' gives you all the characters, but what if you want to find all the occurrences of a dot in a text? You can put a backslash in front of the dot to tell the regular expression that you actually want to find a literal dot and not use the dot as an identifier

In [12]:
pattern = re.compile(r'\.')
matches = pattern.findall(sample_text)
print(matches)

['.', '.', '.']
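When the text you want to find contains several reserved characters, escaping each one by hand gets tedious. The standard library offers `re.escape` for this. The sketch below uses it to look for the literal fragment `$...|` from the sample text; the variable names are only for this example.

```python
import re

sample_text = "abcxyzABCXYZabc xyz%$...| 676-898 "

literal = "$...|"             # every character here is reserved in a regex
escaped = re.escape(literal)  # adds the backslashes for us
print(escaped)

literal_matches = re.findall(escaped, sample_text)
print(literal_matches)
```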


## Matching patterns in a text

Let's use our understanding from the above exercise to extract dates from the sample text below

In [14]:
sample_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
2018-07-01: hha iuas
21-10-19: xyuz sus
21-10-18
21-9-20
21-8-14
21/12/02
21/11/14
21/10/21
"""

#### Let's start simple
First, let's try to find all text that matches 4 digits, a dash, then 2 digits, a dash and finally 2 more digits

In [15]:
pattern = re.compile(r'\d\d\d\d-\d\d-\d\d')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01']


#### Using Quantifiers to match patterns
You can use quantifiers to check for multiple occurrences. For example, instead of repeating <b>\d</b> four times, you can say <b>\d+</b> which will look for 1 or more digits

In [17]:
pattern = re.compile(r'\d+-\d+-\d+')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '0-1102-010002', '2020-02-01', '0-12-010002', '2019-09-18', '2018-07-01', '21-10-19', '21-10-18', '21-9-20', '21-8-14']


But if we do this, we end up with garbage values like <b>0-1102-010002</b> and <b>0-12-010002</b>. You can avoid these by specifying the exact number of digits to check for using curly brackets - <b>{how many?}</b>

In [18]:
pattern = re.compile(r'\d{4}-\d{2}-\d{2}')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01']


Now, we are missing out on dates which have only 2 digits to represent the year - like <b>21-10-19</b> and 1 digit to represent the month - like <b>21-8-14</b>. To include these as well, we can change the pattern to look for a range of digit counts. You can do this with <b>{min,max}</b>

In [25]:
pattern = re.compile(r'\d{2,4}-\d{1,2}-\d{2}')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01', '21-10-19', '21-10-18', '21-9-20', '21-8-14']
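To see the quantifiers from this section in isolation, here is a small sketch on a made-up string of digit runs (`digit_runs` is hypothetical). It also shows that a quantifier does not care what comes before or after the match, so `\d{4}` happily matches the first four digits of a longer run.

```python
import re

digit_runs = "1 12 123 1234 12345"  # hypothetical string for illustration

one_or_more = re.findall(r'\d+', digit_runs)       # 1 or more digits
exactly_four = re.findall(r'\d{4}', digit_runs)    # exactly 4 digits
two_to_three = re.findall(r'\d{2,3}', digit_runs)  # between 2 and 3 digits

print(one_or_more)
print(exactly_four)
print(two_to_three)
```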


#### Matching multiple characters

The above code is not yet perfect. It misses out on dates which are separated by '/' - like <b>21/10/21</b>. We can use square brackets <b>[ ]</b> to match any one of the characters listed inside them, here either a '-' or a '/'

In [30]:
pattern = re.compile(r'\d{2,4}[-/]\d{1,2}[-/]\d{2}')
matches = pattern.findall(sample_text)
print(matches)

['2021-01-01', '2021-07-19', '2021-05-28', '2021/11/10', '2020-01-01', '2020-04-02', '2020-02-01', '2019-09-18', '2018-07-01', '21-10-19', '21-10-18', '21-9-20', '21-8-14', '21/12/02', '21/11/14', '21/10/21']
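One thing to keep in mind is that each `[-/]` is checked independently, so a date with mixed separators would also match. If that matters for your data, a possible refinement is to capture the first separator in a group and reuse it with a backreference `\1`. Backreferences are not covered in this notebook, so treat the following as a sketch on a made-up string (`mixed_text` is hypothetical).

```python
import re

mixed_text = "2021-01/02 2021/11/10 21-9-20"  # hypothetical string for illustration

# Each [-/] is independent, so the mixed-separator date is accepted
loose = re.findall(r'\d{2,4}[-/]\d{1,2}[-/]\d{2}', mixed_text)
print(loose)

# ([-/]) captures the first separator and \1 requires the same one again.
# finditer is used because findall would return only the captured group.
strict_pattern = re.compile(r'\d{2,4}([-/])\d{1,2}\1\d{2}')
strict = [m.group(0) for m in strict_pattern.finditer(mixed_text)]
print(strict)
```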


## Matching groups of characters

Let's try to extract all the names in the sample text. We need 8 names - Mr. Xyz, Mr L, Ms. IOU abc, Dr. Moby D, Dr Dora The Exp, Mrs. Joy, Mr. Mario and Mr. Luigi

In [None]:
sample_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
2018-07-01: hha iuas
21-10-19: xyuz sus
21-10-18
21-9-20
21-8-14
21/12/02
21/11/14
21/10/21
"""

Let's start by getting all the Mr. Note that this also matches the 'Mr' inside 'Mrs', which is why we get 5 matches

In [None]:
pattern = re.compile(r'Mr')
matches = pattern.findall(sample_text)
print(matches)

['Mr', 'Mr', 'Mr', 'Mr', 'Mr']


After Mr, there may or may not be a dot. We can use the <b>?</b> quantifier to specify 0 or 1 dot

In [None]:
pattern = re.compile(r'Mr\.?')
matches = pattern.findall(sample_text)
print(matches)

['Mr.', 'Mr', 'Mr', 'Mr.', 'Mr.']


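Now let's bring Ms, Mrs and Dr into the match as well. For this, we can put the alternatives in a <b>(group)</b> and separate them with the either-or operator <b>|</b>. Writing <b>(?: )</b> makes the group non-capturing, so findall still returns the whole match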

In [None]:
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?')
matches = pattern.findall(sample_text)
print(matches)

['Mr.', 'Mr', 'Ms.', 'Dr.', 'Dr', 'Mr', 'Mr.', 'Mr.']
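You may have spotted that 'Mrs.' shows up as just 'Mr' in the output above. The alternatives are tried from left to right, so 'Mr' wins before 'Mrs' is ever tried, and the optional dot then matches nothing because the next character is 's'. One way to avoid this is to list the longer alternative first. The sketch below shows the difference on a made-up string (`titles_text` is hypothetical).

```python
import re

titles_text = "Mr. A, Mrs. B, Ms C, Dr. D"  # hypothetical string for illustration

# 'Mr' is tried before 'Mrs', so 'Mrs.' is cut short to 'Mr'
short_first = re.findall(r'(?:Mr|Ms|Mrs|Dr)\.?', titles_text)
print(short_first)

# Listing 'Mrs' before 'Mr' lets the longer title match in full
long_first = re.findall(r'(?:Mrs|Mr|Ms|Dr)\.?', titles_text)
print(long_first)
```

Once more of the pattern follows the group, as in the next cells, the regular expression backtracks and tries 'Mrs' on its own, so the original order still works there.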


Now let's get the first names. You have a space, then a capital letter, then 0 or more word characters

In [None]:
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?\s[A-Z]\w*')
matches = pattern.findall(sample_text)
print(matches)

['Mr. Xyz', 'Mr L', 'Ms. IOU', 'Dr. Moby', 'Dr Dora', 'Mrs. Joy', 'Mr. Mario', 'Mr. Luigi']
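So far the groups have been non-capturing. If you want the title and the first name as separate pieces, one option is to use plain capturing groups `( )`, in which case findall returns one tuple per match. This is a sketch on a made-up string (`people_text` is hypothetical), not a step the notebook itself takes.

```python
import re

people_text = "Mr. Xyz, Mrs. Joy, Dr Dora"  # hypothetical string for illustration

# Two capturing groups: the title and the first name
pattern = re.compile(r'(Mr|Ms|Mrs|Dr)\.?\s([A-Z]\w*)')
title_and_name = pattern.findall(people_text)
print(title_and_name)
```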


Now let's get their middle names and last names

In [None]:
sample_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
2018-07-01: hha iuas
21-10-19: xyuz sus
21-10-18
21-9-20
21-8-14
21/12/02
21/11/14
21/10/21
"""

After the first name, there may be a space followed by another name. Also note that not all names have middle and last names, so these expressions have to be optional.

<b> ?</b> finds the optional space
    
<b>(?:[A-Za-z]\w*)?</b> finds the optional middle and last names

In [None]:
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?\s[A-Z]\w* ?(?:[A-Za-z]\w*)? ?(?:[A-Za-z]\w*)?')
matches = pattern.findall(sample_text)
print(matches)

['Mr. Xyz', 'Mr L', 'Ms. IOU abc', 'Dr. Moby D', 'Dr Dora The Exp', 'Mrs. Joy', 'Mr. Mario', 'Mr. Luigi']
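The two optional name parts in the pattern above are identical, so they could also be folded into a single group with a <b>{min,max}</b> quantifier. The variant below is a sketch of that idea, using only the lines of the sample text that contain a name. It requires the space before each extra name, which is slightly stricter than the original pattern but gives the same result on this data.

```python
import re

# The eight name-bearing lines from the sample text
names_text = """
2021-01-01: January 2021, Friday: Mr. Xyz
2021-07-19: July 2021, Monday: Mr L
2021-05-28: May 2021, Friday: Ms. IOU abc
2021/11/10: November 2021, Wednesday: Dr. Moby D
2020-01-01xyzabc: Jan 10th: Dr Dora The Exp
2020-04-02 0-1102-010002: Mrs. Joy
2020-02-01 0-12-010002: Mr. Mario
2019-09-18: suu: Mr. Luigi
"""

# (?: [A-Za-z]\w*){0,2} allows up to two more names, each preceded by a space
pattern = re.compile(r'(?:Mr|Ms|Mrs|Dr)\.?\s[A-Z]\w*(?: [A-Za-z]\w*){0,2}')
full_names = pattern.findall(names_text)
print(full_names)
```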


----------------------------------