## Topic 3

In this notebook, we explore how to use regular expressions (RegEx) to examine text patterns. Python supports regular expressions via the `re` module.

In [1]:
import pandas as pd
import re

We define regular expressions as raw strings: `r'...'`. The `re.findall` function returns a list of all non-overlapping matches of a pattern in a string.

In [2]:
re.findall(r'\d', 'dse3101-sem2')

['3', '1', '0', '1', '2']

In [3]:
re.findall(r'\d+', 'dse3101-sem2')

['3101', '2']

In [4]:
re.findall(r'\d+', 'dse')

[]
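To see why the raw-string prefix matters, the short sketch below compares a plain string with a raw string. The `\b` word-boundary token is not used elsewhere in this notebook; it is chosen here only because Python itself interprets `'\b'` as a backspace character, which makes the difference visible.

```python
import re

# '\\d+' (escaped backslash) and r'\d+' are the same two-token pattern
same_pattern = ('\\d+' == r'\d+')

# Without the r prefix, Python converts '\b' to a backspace character
# before the regex engine sees it, so the pattern no longer means "word boundary".
plain = '\bdse'
raw = r'\bdse'

no_raw = re.findall(plain, 'dse3101-sem2')    # looks for backspace + 'dse'
with_raw = re.findall(raw, 'dse3101-sem2')    # looks for 'dse' at a word boundary

print(same_pattern, len(plain), len(raw))
print(no_raw, with_raw)
```

Writing every pattern as a raw string avoids having to remember which backslash sequences Python rewrites on its own.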

The leading `^` and trailing `$` are known as position anchors: `^` matches at the start of the string and `$` matches at the end.

In [5]:
re.findall(r'^\d+', 'dse3101-sem2')

[]

In [6]:
re.findall(r'^\d+', '3101-sem2')

['3101']

In [7]:
re.findall(r'\d$', '3101-sem2')

['2']
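Using both anchors together requires the pattern to cover the whole string, which is a common way to validate input. The helper name `is_all_digits` below is hypothetical and serves only as an illustration.

```python
import re

def is_all_digits(s):
    # ^ and $ together force \d+ to span the entire string
    return bool(re.findall(r'^\d+$', s))

print(is_all_digits('3101'))
print(is_all_digits('3101-sem2'))
print(is_all_digits('dse3101'))
```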

More on repetition operators:

+ `+` matches one or more times

+ `*` matches zero or more times

+ `?` matches zero or one times

+ `{m}` matches exactly `m` times

+ `{m,n}` matches at least `m` and at most `n` times

In [8]:
re.findall(r'[0-9]', 'abc123xyz45_0')

['1', '2', '3', '4', '5', '0']

In [9]:
re.findall(r'[0-9]+', 'abc123xyz45_0')

['123', '45', '0']

In [10]:
re.findall(r'[0-9]*', 'abc123xyz45_0')

['', '', '', '123', '', '', '', '45', '', '0', '']

In [11]:
re.findall(r'[0-9]?', 'abc123xyz45_0')

['', '', '', '1', '2', '3', '', '', '', '4', '5', '', '0', '']

In [12]:
re.findall(r'[0-9]{2}', 'abc123xyz456_0')

['12', '45']

In [13]:
re.findall(r'[0-9]{2,3}', 'abc123xyz456_0')

['123', '456']
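The empty strings returned by `*` and `?` can look puzzling at first. One way to inspect them is `re.finditer`, which yields match objects that carry the start and end position of every match. The sketch below assumes Python 3.7 or later, where an empty match may directly follow a non-empty one.

```python
import re

text = 'abc123xyz45_0'

# Each tuple is (start, end, matched text); empty matches have start == end
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r'[0-9]*', text)]

# Keep only the matches that consumed at least one character
non_empty = [t for t in matches if t[2]]

print(len(matches))
print(non_empty)
```

Because `*` accepts zero repetitions, the pattern succeeds with an empty match at every position where no digit is found, including the end of the string.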

Example: Image filenames

In [14]:
re.findall(r'^[a-zA-Z0-9_]+\.(gif|png|jpg|jpeg)$', 'wk1_intro.png')

['png']

In [15]:
pattern = re.compile(r'^[a-zA-Z0-9_]+\.(gif|png|jpg|jpeg)$', re.IGNORECASE)
filenames = ['wk1_intro.png', 'tidy_diag.jpeg', 'screenshot.PNG', 'test.gif', 'test.txt']
valid_files = [name for name in filenames if pattern.match(name)]
valid_files

['wk1_intro.png', 'tidy_diag.jpeg', 'screenshot.PNG', 'test.gif']
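Note that the earlier `re.findall` call returned only `'png'`, not the full filename: when a pattern contains a capturing group, `findall` returns the group's contents. As a small sketch, a non-capturing group `(?:...)` keeps the alternation but makes `findall` return the whole match.

```python
import re

filename = 'wk1_intro.png'

# Capturing group: findall returns only what the parentheses matched
ext_only = re.findall(r'^[a-zA-Z0-9_]+\.(gif|png|jpg|jpeg)$', filename)

# Non-capturing group (?:...): findall returns the entire match
whole = re.findall(r'^[a-zA-Z0-9_]+\.(?:gif|png|jpg|jpeg)$', filename)

print(ext_only)
print(whole)
```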

Example: Email addresses

+ Explore more at [https://regex101.com/](https://regex101.com/).

In [16]:
re.findall(r'^[\w.+-]+@([\w.+-]+)$', 'e0123456@u.nus.sg')

['u.nus.sg']
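The pattern above captures only the domain. As an illustrative variation (not part of the original example), adding a second pair of parentheses around the local part lets `re.match` return both pieces through `groups()`.

```python
import re

# Two capturing groups: one before the @ and one after it
m = re.match(r'^([\w.+-]+)@([\w.+-]+)$', 'e0123456@u.nus.sg')
user, domain = m.groups()

# With several groups, findall returns a list of tuples instead
pairs = re.findall(r'^([\w.+-]+)@([\w.+-]+)$', 'e0123456@u.nus.sg')

print(user, domain)
print(pairs)
```

This pattern is a deliberately loose check; it does not attempt to validate every rule for real email addresses.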

### Escape sequences

Several characters have special meaning in RegEx. To match one of these characters literally, we prepend it with a backslash (`\`); the resulting pair is known as an escape sequence.

In [17]:
re.findall(r'\.', 'u.nus.sg')

['.', '.']

In [18]:
re.findall(r'.*\s\d\.$', 'today is a thursday 4.')

['today is a thursday 4.']
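When the text to match comes from a variable, escaping each special character by hand is error-prone. The standard library's `re.escape` adds the backslashes for us. The sample strings below are made up for illustration.

```python
import re

literal = '$4.50'

# re.escape prefixes every regex metacharacter with a backslash
escaped = re.escape(literal)

text = 'cost: $4.50, not $4x50'
with_escape = re.findall(escaped, text)

# Unescaped, $ is an end-of-string anchor and . matches any character
without_escape = re.findall(literal, text)

print(escaped)
print(with_escape, without_escape)
```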

### Application 1

In [19]:
txt = pd.read_csv('data/nasa.txt', sep = '\t', header = None)
txt


| | 0 |
|---|---|
| 0 | SEC. 101. FISCAL YEAR 2017. |
| 1 | (a) There are authorized to be appropriated to... |
| 2 | (1) For Exploration, $4,330,000,000. |
| 3 | (2) For Space Operations, $5,023,000,000. |
| 4 | (3) For Science, $5,500,000,000. |
| 5 | (4) For Aeronautics, $640,000,000. |
| 6 | (5) For Space Technology, $686,000,000. |
| 7 | (6) For Education, $115,000,000. |
| 8 | (7) For Safety, Security, and Mission Services... |
| 9 | (8) For Construction and Environmental Complia... |


In [20]:
names = txt[0].str.extract(r'For\s(.+),\s\$')
names = names.dropna()
names

| | 0 |
|---|---|
| 2 | Exploration |
| 3 | Space Operations |
| 4 | Science |
| 5 | Aeronautics |
| 6 | Space Technology |
| 7 | Education |
| 8 | Safety, Security, and Mission Services |
| 9 | Construction and Environmental Compliance and ... |
| 10 | Inspector General |


In [21]:
amount = txt[0].str.extract(r'\$([,\d]+)')
amount = amount.dropna()
amount

| | 0 |
|---|---|
| 1 | 19508000000 |
| 2 | 4330000000 |
| 3 | 5023000000 |
| 4 | 5500000000 |
| 5 | 640000000 |
| 6 | 686000000 |
| 7 | 115000000 |
| 8 | 2788600000 |
| 9 | 388000000 |
| 10 | 37400000 |


In [22]:
budget = pd.concat([names, amount], axis = 1)
budget.columns = ['name', 'amount']
budget.head()

| | name | amount |
|---|---|---|
| 2 | Exploration | 4330000000 |
| 3 | Space Operations | 5023000000 |
| 4 | Science | 5500000000 |
| 5 | Aeronautics | 640000000 |
| 6 | Space Technology | 686000000 |
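The two extraction steps can also be combined into a single pattern with named groups, and the captured amounts converted to integers. This is a minimal sketch, not the notebook's own method: it uses a small in-memory `Series` built from a few lines shown above instead of `data/nasa.txt`, and the names `lines` and `budget_sketch` are hypothetical.

```python
import pandas as pd

# A few sample lines copied from the output above (stand-in for the full file)
lines = pd.Series([
    'SEC. 101. FISCAL YEAR 2017.',
    '(1) For Exploration, $4,330,000,000.',
    '(2) For Space Operations, $5,023,000,000.',
    '(4) For Aeronautics, $640,000,000.',
])

# Named groups become column names; rows that do not match are NaN and get dropped
budget_sketch = lines.str.extract(r'For\s(?P<name>.+),\s\$(?P<amount>[,\d]+)').dropna()

# Remove the thousands separators, then convert the strings to integers
budget_sketch['amount'] = (
    budget_sketch['amount'].str.replace(',', '', regex=False).astype('int64')
)

print(budget_sketch)
```

One trade-off of the combined pattern is that a line containing an amount but no `For <name>,` prefix is dropped entirely, whereas the separate extractions keep its amount.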
