# Financial Data Analytics

## Regular expressions in Python - Review


Regular expressions (also called REs, regexes, or regex patterns) are specially encoded text strings used as patterns for matching parts of text. They are essentially a tiny, highly specialized programming language embedded inside Python (and other programming languages). To access regexes, we must import the `re` module.

A good place to experiment with regular expressions and learn how they work is [regexr.com](https://regexr.com).

The `re` module provides several functions for working with regular expressions. We'll mostly use:

* `re.search` — Find *one instance* of a pattern in a string
* `re.findall` — Find *all instances* of a pattern in a string
* `re.sub` — String substitution using a regex pattern

In [None]:
import re
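
As a quick side-by-side of the three functions listed above, here is a minimal sketch. The `sample` string and the price pattern are made up for illustration and are not part of the class examples.

```python
import re

# Hypothetical sample text (an assumption, not taken from the class examples)
sample = 'Bought 100 shares at $25.50 and 40 shares at $31.00'
price = r'\$\d+\.\d{2}'   # a literal '$', digits, a dot, two digits

first = re.search(price, sample)        # Match object for the first hit (or None)
all_prices = re.findall(price, sample)  # list with every hit
masked = re.sub(price, '$X', sample)    # new string with every hit replaced

print(first.group(0))
print(all_prices)
print(masked)
```

Note that `re.search` returns a match object rather than a string, so the matched text is read with `.group(0)`.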

# Class Examples 1 to 9:

## Example 1

In [None]:
# Capture the area code in:
text = '+1 (812) 856-5664'

In [None]:
m = re.search(r'\(\d{3}\)', text)
m

In [None]:
m.group(0)

## Example 2

In [None]:
# Capture the top-level domain (.com, .org, …) in a URL like this:
url = 'http://kelley.iu.edu/About/'

In [None]:
m = re.search(r'\.\w{3}', url)
m.group(0)

In [None]:
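# For this URL, the following gives the same result
# (in general \w is broader: it also matches digits, uppercase letters, and '_'):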
m = re.search(r'\.[a-z]{3}', url)
m.group(0)

## Example 3

In [None]:
# Capture the ticker symbol in:
info = '''
Ford Motor Co. (F) - NYSE
Property Insurance Holdings, Inc. (PIH) - Property-Casualty Insurers
ABIOMED, Inc. (ABMD) - Medical/Dental Instruments
Microsoft Corp. (MSFT) - Software
'''

In [None]:
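# This raises an AttributeError: re.match only matches at the start of the string,
# and info starts with a newline, so m is None and m.group(0) fails.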
m = re.match(r'\([A-Z]+\)', info)
m.group(0)
# print(m)
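
One way to avoid the error is to check for `None` before calling `.group()`. The sketch below uses a shortened copy of `info`; the `result` variable and its message are illustrative choices only.

```python
import re

# Shortened copy of the text above; note the leading newline
info = '''
Ford Motor Co. (F) - NYSE
Microsoft Corp. (MSFT) - Software
'''

m = re.match(r'\([A-Z]+\)', info)   # match() only tries the very start of the string
if m is None:
    result = 'no match at the start of the string'
else:
    result = m.group(0)
print(result)
```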

In [None]:
m = re.search(r'\([A-Z]+\)', info)
m.group(0)

In [None]:
m = re.findall(r'\([A-Z]+\)', info)
m

## Example 4

In [None]:
# Capture the domains in the following emails:
emails = '''
Mor Haziza; 'mhaziza@iu.edu'
Jay Z; JZ50@jproductions.com
Katy Perry; katy.perry01@kpstudio.org
'''

In [None]:
m = re.search(r'@.*', emails)
m.group(0)

In [None]:
m = re.findall(r'@.*', emails)
m
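
Because `.` matches any character except a newline, `@.*` also picks up the closing quote after the first address. A tighter pattern is sketched below; it assumes domains consist only of word characters, dots, and hyphens, which is a simplification.

```python
import re

emails = '''
Mor Haziza; 'mhaziza@iu.edu'
Jay Z; JZ50@jproductions.com
Katy Perry; katy.perry01@kpstudio.org
'''

loose = re.findall(r'@.*', emails)        # runs to the end of each line
tight = re.findall(r'@[\w.-]+', emails)   # stops at the first character outside the class

print(loose)
print(tight)
```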

## Example 5

In [None]:
# Capture the usernames in the same emails:
emails = '''
Mor Haziza; 'mhaziza@iu.edu'
Jay Z; JZ50@example.com
Katy Perry; katy.perry01@kpstudio.org
'''

In [None]:
m = re.findall(r'[\w]+@', emails)
m
# This misses 'katy.' because \w does not match the dot; the pattern needs revising.

In [None]:
m = re.findall(r'[\w.]+@', emails)
m

## Example 6

In [None]:
# Capture the percent change in the DJ index:
DJ = '''
Dow Jones Industrial Average (^DJI)
DJI - DJI Real Time Price. Currency in USD

26,048.51  -14.17   (-0.05%)
'''

In [None]:
m = re.search(r'[+-]\d+\.\d+%', DJ)
m

In [None]:
m.group(0)

## Example 7

In [None]:
# Capture the names of all the people involved in publishing this book:
text  = '''
Python for Data Analysis
by Wes McKinney Copyright © 2018 William McKinney. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North,
Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional
use. Online editions are also available for most titles (http://oreilly.com/safari).
For more information, contact our corporate/institutional sales department: 800-
998-9938 or corporate@oreilly.com.
Editor: Marie Beaugureau
Production Editor: Kristen Brown
Copyeditor: Jasmine Kwityn
Proofreader: Rachel Monaghan
Indexer: Lucie Haskins
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
October 2012: First Edition
October 2017: Second Edition
'''

In [None]:
m = re.findall(r':\s\w+ \w+', text)
m

# The last two matches (': First Edition', ': Second Edition') are not names; the pattern needs revising.

In [None]:
m = re.findall(r'[^\d]:\s\w+ \w+', text)
m

### (…) Group for “capturing” a match

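We got the names, but each match also includes text we don't need (the preceding character, the colon, and the whitespace).
Wrap the name part of the pattern in `( )` so that `re.findall` returns only the captured group: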

In [None]:
m = re.findall(r'[^\d]:\s(\w+ \w+)', text)
m
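
When a pattern contains more than one group, `re.findall` returns a list of tuples, one tuple per match. The following sketch uses a shortened, hypothetical excerpt of the text above (the variable name `credits_text` is an assumption) to capture both the role and the name.

```python
import re

# Shortened excerpt of the book text, for illustration only
credits_text = '''
Copyeditor: Jasmine Kwityn
Indexer: Lucie Haskins
October 2017: Second Edition
'''

# Group 1: the letters right before the colon; group 2: two words after it.
# Requiring letters before ':' skips the '2017: Second Edition' line.
pairs = re.findall(r'([A-Za-z]+):\s(\w+ \w+)', credits_text)
print(pairs)
```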

## Example 8

In [None]:
# Capture the page numbers in the following table of contents:

Table_of_contents = '''
1. Introduction...................................................................................................1
2. Futures markets and central counterparties....................................................... 24
3. Hedging strategies using futures ...................................................................... 49
4. Interest rates ................................................................................................ 77
5. Determination of forward and futures prices................................................... 107
6. Interest rate futures ..................................................................................... 135
7. Swaps ....................................................................................................... 155
8. Securitization and the credit crisis of 2007 ...................................................... 184
'''

In [None]:
m = re.findall(r'\d+$', Table_of_contents)
m

# Only the number at the end of the whole text was captured.
# To revise: pass the MULTILINE flag so that '$' also matches at the end of each line.

In [None]:
m = re.findall(r'\d+$', Table_of_contents, re.MULTILINE)
m
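
The MULTILINE flag affects `^` in the same way: it then matches at the start of every line, not just the start of the string. A minimal sketch with a shortened, made-up table of contents (`toc` is a hypothetical name):

```python
import re

# Shortened table of contents, for illustration only
toc = '''1. Introduction ........ 1
2. Futures markets ..... 24
3. Hedging strategies .. 49
'''

no_flag = re.findall(r'^\d+', toc)                  # only the start of the string
chapters = re.findall(r'^\d+', toc, re.MULTILINE)   # the start of every line
pages = re.findall(r'\d+$', toc, re.MULTILINE)      # the end of every line

print(no_flag)
print(chapters)
print(pages)
```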

## Example 9

In [None]:
# Capture all instances of the word ‘wood’:
WC_tongue_twister = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

In [None]:
m = re.search(r'\bwood\b', WC_tongue_twister)
m

In [None]:
m = re.findall(r'\bwood\b', WC_tongue_twister)
m

In [None]:
# re.sub replaces every occurrence of the word 'wood' with another word:
m = re.sub(r'\bwood\b', r'wooops', WC_tongue_twister)
m
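
To see what the `\b` word boundaries contribute, the sketch below runs the same search with and without them. The variable names and the replacement text `'X'` are illustrative.

```python
import re

twister = 'How much wood would a woodchuck chuck if a woodchuck could chuck wood?'

with_boundary = re.findall(r'\bwood\b', twister)   # whole word only
without_boundary = re.findall(r'wood', twister)    # also hits 'wood' inside 'woodchuck'
replaced = re.sub(r'wood', 'X', twister)           # without \b, 'woodchuck' is altered too

print(len(with_boundary), len(without_boundary))
print(replaced)
```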

# Greedy Vs Non-Greedy:

In [None]:
import re
s = '<html><head><title>Title</title>'
# len(s)
print(re.match(r'<.*>', s).group(0))
# '*' is greedy: the match extends as far as possible, up to the last '>'

In [None]:
print(re.match(r'<.*?>', s).group(0))
# The '?' after '*' makes it non-greedy: the match stops at the first '>'
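
The difference is easier to see with `re.findall`, which collects every match. A short sketch on the same string:

```python
import re

s = '<html><head><title>Title</title>'

greedy = re.findall(r'<.*>', s)    # one long match spanning the whole string
lazy = re.findall(r'<.*?>', s)     # one match per tag

print(greedy)
print(lazy)
```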

# Working With Groups (...)

## Example 1

In [None]:
m = re.match('(ab)c', 'abcde')
m

In [None]:
m.group(0)

In [None]:
m.group(1)

## Example 2

In [None]:
m = re.match('(a(b))c(de)', 'abcde')
m
# m.group(0) 
# m.group(1)
# m.group(2)
# m.group(3)
# m.group(2,3,0,0,1)   # this returns a tuple
# m.groups()           # this returns a tuple

# Equivalent to m.group(0), without storing the match object first:
# re.match('(a(b))c(de)', 'abcde').group(0)
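
For reference, here is a sketch of what the commented-out calls return. Groups are numbered by the position of their opening parenthesis, counting from the left, and group 0 is always the whole match. The variable names are illustrative.

```python
import re

m = re.match(r'(a(b))c(de)', 'abcde')

whole = m.group(0)                   # the entire match
outer = m.group(1)                   # first '(' from the left: (a(b))
inner = m.group(2)                   # second '(': (b)
last = m.group(3)                    # third '(': (de)
picked = m.group(2, 3, 0, 0, 1)      # several groups at once, as a tuple
all_groups = m.groups()              # groups 1..n as a tuple (group 0 excluded)

print(whole, outer, inner, last)
print(picked)
print(all_groups)
```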