# Regular expressions
Regular expressions are string patterns. The `re` module, included in the standard library, lets you define, search for, isolate and operate on string patterns within text. For example:

- `FRA`, `USA` and `ITA` are all 3-letter, uppercase strings
- `+33`, `+44`, `+91` are all strings starting with a + sign followed by 2 digits
- `lady.gaga@gmail.com` and `bradley-cooper@aol.net` are both email addresses: a sequence of characters (A through Z along with some special characters), followed by an at sign (`@`), followed by a domain name and ending with a top-level domain (e.g. `.net`)
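
The three families above can each be written as a pattern. The sketch below is illustrative only: the constant names and the `is_full_match` helper are hypothetical, and the email pattern is deliberately simplified (it would reject many valid addresses).

```python
import re

# Hypothetical, simplified patterns for the three examples above
COUNTRY_CODE = r'[A-Z]{3}'                          # exactly three uppercase letters
DIAL_PREFIX = r'\+\d{2}'                            # a literal + followed by two digits
EMAIL = r'[A-Za-z0-9._-]+@[A-Za-z0-9-]+\.[a-z]+'    # local part, @, domain name, top-level domain

def is_full_match(pattern, candidate):
    """Return True when the whole candidate string fits the pattern."""
    return re.fullmatch(pattern, candidate) is not None

print(is_full_match(COUNTRY_CODE, 'FRA'))                  # True
print(is_full_match(DIAL_PREFIX, '+33'))                   # True
print(is_full_match(EMAIL, 'bradley-cooper@aol.net'))      # True
print(is_full_match(COUNTRY_CODE, 'Fra'))                  # False: lowercase letters
```

`re.fullmatch` is used here so that the entire string has to fit the pattern, rather than just a part of it.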

In [1]:
#import the module
import re

## 1. Defining and finding patterns

In [2]:
#create a pattern using a string prefixed with the r character
#the 'r' means 'raw'
pattern = r'crude'

#can we find the pattern in the text?
text = "The price of crude increased by 7.5% in the week of 7 January 2019"

#let's try! 
match = re.search(pattern, text)

print(match)

<_sre.SRE_Match object; span=(13, 18), match='crude'>


In [3]:
#We found our string!
print(match.start())
print(match.end())
print(match.span())
print(match.group())

13
18
(13, 18)
crude
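
The `r` prefix used for `pattern` above makes no difference for a plain word like `crude`, but it matters as soon as a pattern contains a backslash. A minimal sketch of the difference (the variable `first_number` is just an illustrative name):

```python
import re

# In a raw string the backslash is kept, so r'\n' is two characters: \ and n
print(len(r'\n'), len('\n'))   # 2 1

# Without the r prefix every backslash has to be doubled to survive
print(r'\d+' == '\\d+')        # True

text = "The price of crude increased by 7.5% in the week of 7 January 2019"
first_number = re.search(r'\d+', text)
print(first_number.group())    # 7
```

Writing every pattern as a raw string avoids having to think about which backslashes Python would otherwise interpret itself.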


In [4]:
#either one pattern or another
match = re.search(r'bullish|bearish|neutral', "we are bullish on Apple and Microsoft")

print(match)

<_sre.SRE_Match object; span=(7, 14), match='bullish'>


In [5]:
#optional character (0 or 1 occurrence)
match = re.search(r'colou?r', 'My favourite color is blue')

print(match)

<_sre.SRE_Match object; span=(13, 18), match='color'>


In [6]:
#one or more occurrences
match = re.search(r'A+', 'The bond is rated AAA')

print(match)

<_sre.SRE_Match object; span=(18, 21), match='AAA'>


In [7]:
#any number of times (including 0)
match = re.search(r'9*', '9999')

print(match)

<_sre.SRE_Match object; span=(0, 4), match='9999'>


In [8]:
#between 3 and 5 times
match = re.search(r'9{3,5}','019374564999953578934532453459834058345089')

print(match)

<_sre.SRE_Match object; span=(9, 13), match='9999'>


In [9]:
#careful: search scans from the left and returns the first match it finds
#here, the pattern is any number of 9s, including none
#as the first character is not a 9, an empty match (zero 9s) succeeds at position 0
match = re.search(r"9*", "123987")

print(match)

<_sre.SRE_Match object; span=(0, 0), match=''>
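
One way to avoid the empty match, sketched below, is to require at least one occurrence with `+` instead of `*`. The variable names are illustrative only.

```python
import re

digits = "123987"

# 9* may match zero characters, so the search succeeds immediately at position 0
empty = re.search(r"9*", digits)
print(empty.span())                            # (0, 0)

# 9+ requires at least one 9, so the search moves on to the first real 9
first_nine = re.search(r"9+", digits)
print(first_nine.span(), first_nine.group())   # (3, 4) 9
```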


In [10]:
#any letter a through z, lower or upper case (the dash defines a range), one or more times (+)
match = re.search(r'price of [a-zA-Z]+', text)

print(match.group())

price of crude


In [11]:
#at the start of a string
matches = re.search(r'^NY', 'I live in NY'), re.search(r'^NY', 'NY is awesome')

print(matches)

(None, <_sre.SRE_Match object; span=(0, 2), match='NY'>)


In [12]:
#at the end of a string
match = re.search(r'[A-Za-z]+$', 'The last word of the sentence')

print(match)

<_sre.SRE_Match object; span=(21, 29), match='sentence'>


In [13]:
#your turn to play! 
#can you spot your birthday in the first million digits of pi? 
with open('data/pi.txt', 'r') as source: 
    pi = source.read().strip()

match = re.search('040995', pi)

print(match)

<_sre.SRE_Match object; span=(304481, 304487), match='040995'>


## 2. Meta-characters
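Meta-characters let you refer to whole classes of characters with a short escape sequence. The bracketed classes below describe the ASCII case: on Python 3 `str` patterns, `\d`, `\s` and `\w` also match non-ASCII digits, whitespace and word characters unless the `re.ASCII` flag is set.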

1. `\d` matches any decimal digit; this is equivalent to the class `[0-9]`.
2. `\D` matches any non-digit character; this is equivalent to the class `[^0-9]`
3. `\s` matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`
4. `\S` matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`
5. `\w` matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`
6. `\W` matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`
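
A short sketch of these classes in action. The `sample` string is made up for illustration. `re.findall` returns every non-overlapping match as a list, and `re.split` cuts the string wherever the pattern matches.

```python
import re

sample = "Order 66 shipped on 2019-01-07"

digit_runs = re.findall(r'\d+', sample)   # runs of digits
word_runs = re.findall(r'\w+', sample)    # runs of word characters
tokens = re.split(r'\s+', sample)         # pieces separated by whitespace

print(digit_runs)   # ['66', '2019', '01', '07']
print(word_runs)    # ['Order', '66', 'shipped', 'on', '2019', '01', '07']
print(tokens)       # ['Order', '66', 'shipped', 'on', '2019-01-07']
```

Note that the dash is neither a digit nor a word character, so it splits the date in the first two results but not in the third.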


In [14]:
#let's check this is a valid address
address = "341 Roosevelt Avenue, London EC34 2C, UNITED KINGDOM"

match = re.search(r'^\d{1,4},? (\w|\s)+, \w+ [A-Z0-9 ]+,[A-Z ]+$',address)

if match: 
    print("Address is valid:", match.group())
else:
    print("Address is invalid")

Address is valid: 341 Roosevelt Avenue, London EC34 2C, UNITED KINGDOM


## 3. Capturing
Capturing lets you isolate parts of a matched pattern.

In [15]:
#create groups with brackets
match = re.search(r'(\d{2})-([A-Za-z]{3})-(\d{2,4})', '12-Feb-2019')

for i, group in enumerate(match.groups()):
    print("Brackets", i+1,":",group)

Brackets 1 : 12
Brackets 2 : Feb
Brackets 3 : 2019


In [16]:
#You can make non-capturing brackets with (?:XXX) where XXX is your pattern
address = "341 Roosevelt Avenue, London EC34 2C, UNITED KINGDOM"

match = re.search(r'^(\d{1,4}),? ((?:\w|\s)+), (\w+) ([A-Z0-9 ]+), ([A-Z ]+)$',address)

for i, group in enumerate(match.groups()):
    print("Brackets", i+1, ':', group)

Brackets 1 : 341
Brackets 2 : Roosevelt Avenue
Brackets 3 : London
Brackets 4 : EC34 2C
Brackets 5 : UNITED KINGDOM


In [17]:
#You can name groups by using ?P<name> within the brackets
date = "23 March 2019"

match = re.search(r'(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{2,4})', date)

print("Day:", match.group("day"), sep="\t")
print("Month:", match.group("month"), sep="\t")
print("Year:", match.group("year"), sep="\t")

Day:	23
Month:	March
Year:	2019
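
Named groups can also be retrieved all at once. A minimal sketch, reusing the same pattern and date as above, with `parts` as an illustrative variable name:

```python
import re

date = "23 March 2019"
match = re.search(r'(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]+) (?P<year>\d{2,4})', date)

# groupdict() maps each group name to the text it captured
parts = match.groupdict()
print(parts)   # {'day': '23', 'month': 'March', 'year': '2019'}

# captured values are always strings; convert them explicitly when needed
print(int(parts["day"]), parts["month"], int(parts["year"]))   # 23 March 2019
```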


## 4. Splitting and replacing

In [18]:
#let's try to parse some data provided by the US Department of Agriculture
with open("data/WASDE.txt", "r") as f: 
    data = f.read().strip()
    
print(data)

World and U.S Supply and Use for Grains
Million Metric Tons

Total Grains 4/          
    2017/18                  2616.60    3413.89     415.07    2601.51     812.39
    2018/19 (Est.)           2625.43    3437.81     427.50    2636.02     801.79
    2019/20 (Proj.)          2664.79    3465.70     436.11    2678.54     787.16
    
Wheat                    
    2017/18                   761.87    1023.94     182.04     742.77     281.18
    2018/19 (Est.)            730.55    1011.72     174.16     736.23     275.49
    2019/20 (Proj.)           771.46    1046.61     183.11     760.15     286.46
    
Coarse Grains 5/         
    2017/18                  1359.87    1745.48     185.91    1376.51     368.97
    2018/19 (Est.)           1396.26    1765.22     207.80    1410.22     355.00
    2019/20 (Proj.)          1395.51    1750.36     206.09    1422.32     328.05
    
Rice, milled             
    2017/18                   494.86     644.48      47.13     482.23     162.25
    2018/1

In [19]:
table = []

for i, line in enumerate(data.split("\n")):
    if i == 0: 
        title = line.strip()
        continue
        
    if i == 1: 
        units = line.strip()
        continue
    
    #is it an empty line?
    match = re.search(r"^\s*$", line)
    if match: 
        continue
        
    #try to match it with a data line
    #(raw string, literal dots escaped, data limited to digits, spaces and dots)
    match = re.search(r"\s+(?P<year>\d{4}/\d{2}) (?P<status>\((?:Est\.|Proj\.)\))? (?P<data>[\d .]+)$", line)
    if match:
        year   = match.group("year")
        status = match.group("status") or "(Final)"
        points = re.split(r"\s+", match.group("data").strip())  #we are splitting on runs of whitespace
        table.append([section, year, status]+[float(point) for point in points])
        continue
        
    #else it must be section title
    match = re.search("([A-Za-z ,]+)", line)
    section = match.groups()[0].strip()

print("Title", title)
print("Units", units)
for row in table: 
    print(*row, sep=",")

Title World and U.S Supply and Use for Grains
Units Million Metric Tons
Total Grains,2017/18,(Final),2616.6,3413.89,415.07,2601.51,812.39
Total Grains,2018/19,(Est.),2625.43,3437.81,427.5,2636.02,801.79
Total Grains,2019/20,(Proj.),2664.79,3465.7,436.11,2678.54,787.16
Wheat,2017/18,(Final),761.87,1023.94,182.04,742.77,281.18
Wheat,2018/19,(Est.),730.55,1011.72,174.16,736.23,275.49
Wheat,2019/20,(Proj.),771.46,1046.61,183.11,760.15,286.46
Coarse Grains,2017/18,(Final),1359.87,1745.48,185.91,1376.51,368.97
Coarse Grains,2018/19,(Est.),1396.26,1765.22,207.8,1410.22,355.0
Coarse Grains,2019/20,(Proj.),1395.51,1750.36,206.09,1422.32,328.05
Rice, milled,2017/18,(Final),494.86,644.48,47.13,482.23,162.25
Rice, milled,2018/19,(Est.),498.62,660.87,45.55,489.57,171.3
Rice, milled,2019/20,(Proj.),497.82,668.73,46.91,496.08,172.65
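
The parsing above only uses `re.split`. Replacing is done with `re.sub`, which substitutes every match of a pattern. A minimal sketch on a shortened sample row, where `row`, `compact` and `expanded` are illustrative names and the second substitution assumes both years fall in the same century:

```python
import re

row = "Wheat    2018/19 (Est.)    730.55   1011.72"

# Collapse every run of whitespace into a single space
compact = re.sub(r"\s+", " ", row)
print(compact)    # Wheat 2018/19 (Est.) 730.55 1011.72

# Rewrite the marketing year 2018/19 as 2018-2019 using captured groups:
# \1 is the century, \2 the first year, \3 the second year
expanded = re.sub(r"(\d{2})(\d{2})/(\d{2})", r"\1\2-\1\3", compact)
print(expanded)   # Wheat 2018-2019 (Est.) 730.55 1011.72
```

In the replacement string, `\1`, `\2` and `\3` refer back to the groups captured by the pattern, in the order their opening brackets appear.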


## 5. Your turn to play! 

#### Exercise 1
Create a function to verify that a phone number is valid. Phone numbers can start with a country code (e.g. '+44') or not.

#### Exercise 2
In finance, futures on commodity markets generally follow a 3-part naming convention: 
1. Commodity: 1 to 3 letters
2. Month: one of FGHJKMNQUVXZ
3. Year: one, two or four digits. For a single digit, assume the year is the current year if it ends with that digit, otherwise the next year that does.

For example:
- `CLZ9` refers to the December 2019 crude oil future
- `NGX20` refers to the November 2020 natural gas future
- `WF2018` refers to the January 2018 wheat future

Create a function that accepts a ticker (e.g. `CLZ9`) and returns a 3-element tuple: commodity, month and year.