# Regex in Python with `re`

> You have a text processing problem. You decide to use regex. Now you have two problems.

Regular expressions are useful for finding patterns in text. For example:

- Find things that look like unique well identifiers (UWIs).
- Find numbers with certain units after them.
- Replace fuzzy-matched patterns with other text. 

In [3]:
import re

In [4]:
s = """The pressure was about 1500.0 
psi (on Tuesday 23 February), 
and the next day it was >2000 psi, 
a change of -500 psi."""

In [5]:
print(s)

The pressure was about 1500.0 
psi (on Tuesday 23 February), 
and the next day it was >2000 psi, 
a change of -500 psi.


If all we want to do is check that a pattern exists in a text, we can use `re.search()`.

For example, to find one or more digits (`\d+`) followed by a space and the letters 'Feb':

In [6]:
if re.search(r'\d+ Feb', s):
    print("Found a date in February")

Found a date in February


In [7]:
re.search(r'\d+ Feb', s)

<re.Match object; span=(47, 53), match='23 Feb'>

In [8]:
m = re.search(r'\d+ Feb', s)

In [9]:
m.group()

'23 Feb'
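The match object carries more than the matched text. The following is a small sketch, using a shortened version of the example string (the name `text` is only for illustration), of how to find where the match sits and what `re.search()` returns when nothing matches:

```python
import re

text = "psi (on Tuesday 23 February)"

m = re.search(r'\d+ Feb', text)
print(m.group())                 # the matched text
print(m.span())                  # (start, end) indices into text
print(text[m.start():m.end()])   # slicing with the span recovers the match

# When nothing matches, re.search() returns None, which is falsy.
missing = re.search(r'\d+ Mar', text)
print(missing)
```

This is why the `if re.search(...)` test works: a match object is truthy and `None` is not.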

## `re.findall()`

Often, we'd like to find all the matches in a string. One way to do this is with `re.findall()`.

As before, we can find digits with `\d`, and adding `+` says we want sequences of one or more digits.

In [10]:
re.findall(r'\d+', s)

['1500', '0', '23', '2000', '500']

The `r` prefix marks this as a raw string. We use raw strings for regular expressions because both Python and the regex 'mini-language' treat the backslash as an escape character; a raw string passes the backslashes through to the regex engine untouched.
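To see why this matters, here is a minimal sketch using `\b`, which Python reads as a backspace character but the regex engine reads as a word boundary. The test string is made up for illustration:

```python
import re

# In a normal string, Python turns '\b' into a single backspace character
# before the regex engine ever sees it.
print(len('\b'), len(r'\b'))   # 1 2

sample = "1500 psi, psig"

# As a raw string, \b reaches the regex engine and means 'word boundary'.
raw_hits = re.findall(r'\bpsi\b', sample)

# Without the r prefix, the pattern looks for literal backspace characters.
plain_hits = re.findall('\bpsi\b', sample)

print(raw_hits, plain_hits)
```

The raw pattern finds the standalone 'psi' only, while the non-raw one finds nothing.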

We can require the numbers to have 'psi' after them:

In [11]:
re.findall(r'\d+ psi', s)

['2000 psi', '500 psi']

But this doesn't catch the one with a space and a newline; we can use `\s` to say 'any kind of white space':

In [12]:
re.findall(r'\d+\s+psi', s)

['0 \npsi', '2000 psi', '500 psi']

Now the problem is that numbers sometimes contain decimal points or minus signs (OK, maybe not for pressure), and may be qualified with `<` or `>`:

In [13]:
re.findall(r' [-<>0-9.]+\s+psi', s)

[' 1500.0 \npsi', ' >2000 psi', ' -500 psi']

Finally, we'd like to separate out the number and the units, and only *capture* those:

In [14]:
re.findall(r' ([-<>0-9.]+)\s+(psi)', s)

[('1500.0', 'psi'), ('>2000', 'psi'), ('-500', 'psi')]
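If positions or labelled fields are needed as well, one option is `re.finditer()` with named groups. This is a sketch rather than part of the original exercise; the group names `value` and `units` are chosen here for illustration:

```python
import re

s = ("The pressure was about 1500.0 \npsi (on Tuesday 23 February), \n"
     "and the next day it was >2000 psi, \na change of -500 psi.")

# (?P<name>...) gives a capture group a name.
pattern = re.compile(r' (?P<value>[-<>0-9.]+)\s+(?P<units>psi)')

# finditer() yields match objects, so each hit keeps its groups and position.
records = [m.groupdict() for m in pattern.finditer(s)]
for record in records:
    print(record)
```

Each record is a small dictionary, which can be easier to work with than a bare tuple.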

There are a few shortcuts like `\d` and `\s`:

- `\d` &mdash; all digits, like `[0-9]`
- `\D` &mdash; all non-digits, like `[^\d]`
- `\w` &mdash; all alphanumerics, like `[a-zA-Z0-9_]`
- `\W` &mdash; all non-alphanumerics, like `[^\w]`
- `\s` &mdash; all spacey characters, like `[ \t\n\r\f\v]`
- `\S` &mdash; all non-spacey characters, like `[^\s]`
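A quick way to get a feel for these shortcuts is to run them against one short string. The string below is invented for the purpose:

```python
import re

text = "TD: 4268.0 m"

digits = re.findall(r'\d+', text)       # runs of digits
non_digits = re.findall(r'\D+', text)   # runs of anything except digits
words = re.findall(r'\w+', text)        # runs of word characters
non_space = re.findall(r'\S+', text)    # runs of non-whitespace

print(digits)       # ['4268', '0']
print(non_digits)   # ['TD: ', '.', ' m']
print(words)        # ['TD', '4268', '0', 'm']
print(non_space)    # ['TD:', '4268.0', 'm']
```

Note that for ordinary (Unicode) strings, `\d`, `\w` and `\s` also match non-ASCII digits, letters and spaces; pass `flags=re.ASCII` to restrict them to the ASCII sets listed above.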

Type `help(re)` for more.

In [15]:
help(re)

Help on module re:

NAME
    re - Support for regular expressions (RE).

MODULE REFERENCE
    https://docs.python.org/3.7/library/re
    
    The following documentation is automatically generated from the Python
    source files.  It may be incomplete, incorrect or include features that
    are considered implementation detail and may vary between Python
    implementations.  When in doubt, consult the module reference at the
    location listed above.

DESCRIPTION
    This module provides regular expression matching operations similar to
    those found in Perl.  It supports both 8-bit and Unicode strings; both
    the pattern and the strings being processed can contain null bytes and
    characters outside the US ASCII range.
    
    Regular expressions can contain both special and ordinary characters.
    Most ordinary characters, like "A", "a", or "0", are the simplest
    regular expressions; they simply match themselves.  You can
    concatenate ordinary characters, so last mat

### Exercise

Write a function like `clean_depth` in **Practice_functions.ipynb**, but using regex instead of string processing.

Your function should work on these examples:

    tests = "2342", "4353.4 m", "23423.123FT", "6371km"
    
...producing:

    [(2342.0, None), (4353.4, 'm'), (23423.123, 'ft'), (6371.0, 'km')]

In [None]:
# YOUR CODE HERE



In [1]:
import re

def clean_depth(depth):
    units = ['m', 'km', 'ft']
    u = '|'.join(units)

    # Capture the number, then an optional unit after any spaces.
    pattern = re.compile(rf"([-.\d]+) *({u})?", flags=re.IGNORECASE)

    value, unit = pattern.search(depth).groups()

    return float(value), unit.lower() if unit else None

tests = "2342", "4353.4 m", "23423.123FT", "6371km"
[clean_depth(s) for s in tests]

[(2342.0, None), (4353.4, 'm'), (23423.123, 'ft'), (6371.0, 'km')]
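The solution above raises an `AttributeError` if the string contains no number at all, because `search()` returns `None`. A stricter variant, sketched here under the assumption that the whole string should be a number plus an optional unit, uses `fullmatch()` and returns `None` for anything else. The name `parse_depth` is hypothetical:

```python
import re

# Optional sign, digits, optional decimal part, then an optional unit.
DEPTH = re.compile(r"\s*(-?\d+(?:\.\d+)?)\s*(km|m|ft)?\s*", flags=re.IGNORECASE)

def parse_depth(text):
    """Return (value, unit) or None if text is not a clean depth."""
    m = DEPTH.fullmatch(text)
    if m is None:
        return None
    value, unit = m.groups()
    return float(value), unit.lower() if unit else None

tests = "2342", "4353.4 m", "23423.123FT", "6371km", "unknown"
results = [parse_depth(t) for t in tests]
print(results)
```

Whether to return `None` or raise an error for bad input is a design choice; returning `None` makes it easy to filter a messy column.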

### Exercise

Use `re.findall` to parse the tops file, `L-30_tops.txt`. You should end up with the same dictionary you got when we did the exercise with a line-by-line file reader. 

<a title="You can pass a list of little lists (or tuples) to dict() to get a dictionary right from the result of running findall()."><b>Hint</b></a>

In [2]:
with open('../data/L-30_tops.txt') as f:
    data = f.read()

# YOUR CODE HERE


In [16]:
print(data)

# L-30 well tops
WyanDot FM,867.156
DAWSON CANYON FM,984.50402
LOGAN CANYON FM,1136.904
Upper MISSISAUGA FM,2251.2529
Lower MISSISAUGA FM,3190.6464
ABENAKI FM,3404.3112
MID BACCARO,3485.0832
Lower BACCARO,3964.5337
Base O-Marker,2469.207
TD,4268.0
Pay_sand_1-rft,2478.0
pay_sand_2,2499.0
pay_sand_3,2543.0
pay_sand_4,2637.0
sand_5,2699.0
sand_6,2795.0
sand_7,2835.0



In [39]:
import re

# Names can contain hyphens, so allow '-' in the first group. The number
# group is greedy, so it captures the whole depth, not just its first digit.
pattern = re.compile(r"^([-\w ]+),([-.\d]+)", flags=re.MULTILINE)
tops = dict(pattern.findall(data))

# If you want to cast to floats, without writing a loop, you can do...
#tops = dict(map(lambda x: (x[0], float(x[1])), pattern.findall(data)))

# Or:
tops = {k: float(v) for k, v in tops.items()}

In [40]:
tops

{'WyanDot FM': 867.156,
 'DAWSON CANYON FM': 984.50402,
 'LOGAN CANYON FM': 1136.904,
 'Upper MISSISAUGA FM': 2251.2529,
 'Lower MISSISAUGA FM': 3190.6464,
 'ABENAKI FM': 3404.3112,
 'MID BACCARO': 3485.0832,
 'Lower BACCARO': 3964.5337,
 'Base O-Marker': 2469.207,
 'TD': 4268.0,
 'Pay_sand_1-rft': 2478.0,
 'pay_sand_2': 2499.0,
 'pay_sand_3': 2543.0,
 'pay_sand_4': 2637.0,
 'sand_5': 2699.0,
 'sand_6': 2795.0,
 'sand_7': 2835.0}

## Substitution with `re.sub`

`re.sub()` replaces every match of a pattern. In the replacement string, `\1` refers to the text captured by the first group:

In [16]:
re.sub(r' ([-.0-9]+)\s+(psi)', r' \1 PSI', s)

'The pressure was about 1500.0 PSI (on Tuesday 23 February), \nand the next day it was >2000 psi, \na change of -500 PSI.'
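Backreferences can also be written as `\g<1>`, or `\g<name>` for named groups, which avoids ambiguity when a digit follows the reference. A short sketch on a simplified string (the string and group names are illustrative):

```python
import re

s = "about 1500.0 psi, then -500 psi"

# \g<1> is the unambiguous spelling of \1.
upper = re.sub(r'([-.0-9]+)\s+(psi)', r'\g<1> PSI', s)

# Named groups can be referenced by name, in any order.
swapped = re.sub(r'(?P<value>[-.0-9]+)\s+(?P<units>psi)',
                 r'\g<units>: \g<value>', s)

print(upper)     # about 1500.0 PSI, then -500 PSI
print(swapped)   # about psi: 1500.0, then psi: -500
```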

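A lookbehind assertion, `(?<=...)`, requires some text to come immediately before the match, but does not include that text in the match: it is used but not gathered. Note that in Python the lookbehind must be fixed-width. This is quite a constraint; for example, you can't use it to find "pressure" followed by no more than 2 words, followed by a number. But you could look for, say, a number that comes straight after "about":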

In [36]:
re.findall(r'(?<=about) ([-.0-9]+)\s+(psi)', s)

[('1500.0', 'psi')]
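The fixed-width restriction shows up as an error when the pattern is compiled. One possible workaround, sketched below, is to drop the lookbehind, match the context normally, and capture only the number. The pattern is an illustration of the "pressure, then up to 2 words, then a number" case, not the only way to write it:

```python
import re

s = "The pressure was about 1500.0 \npsi (on Tuesday 23 February)."

# A variable-width lookbehind is rejected at compile time.
rejected = False
try:
    re.compile(r'(?<=pressure.*) [-.0-9]+')
except re.error as err:
    rejected = True
    print("re.error:", err)

# Workaround: 'pressure', up to two words, then a captured number.
pattern = r'pressure(?:\s+\w+){0,2}\s+([-.0-9]+)'
found = re.findall(pattern, s)
print(found)   # ['1500.0']
```

`(?:...)` is a non-capturing group, so `findall()` returns only the number.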

Let's get the greater-than and less-than characters too:

In [33]:
re.sub(r' ([-<>.0-9]+)\s+(psi)', r' \1 PSI', s)

'The pressure was about 1500.0 PSI (on Tuesday 23 February), \nand the next day it was >2000 PSI, \na change of -500 PSI.'

## Callback

Instead of a replacement string, `re.sub()` also accepts a function. It is called with each match object and must return the replacement text:

In [41]:
def kPa_transform(match):
    value, _ = match.groups()
    value = float(value) / 0.145  # 1 kPa is about 0.145 psi
    return f" {value:0.3f} kPa"

In [42]:
def bar_transform(match):
    value, _ = match.groups()
    value = float(value) / 14.5  # 1 bar is about 14.5 psi
    return f" {value:0.3f} bar"

In [43]:
re.sub(r' ([-.0-9]+)\s+(psi)', kPa_transform, s)

'The pressure was about 10344.828 kPa (on Tuesday 23 February), \nand the next day it was >2000 psi, \na change of -3448.276 kPa.'

Note that the ">2000 psi" was not changed. If you try to fix this by adding `<` and `>` to the character class, you run into trouble, because `float()` cannot parse the qualifier:

In [47]:
re.sub(r' ([-<>.0-9]+)\s+(psi)', kPa_transform, s)

ValueError: could not convert string to float: '>2000'
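One way around this, sketched here on a shortened string, is to capture the `<` or `>` qualifier in its own optional group so that only the numeric part reaches `float()`. The function name `kPa_with_qualifier` is hypothetical, and the conversion assumes 1 kPa is about 0.145 psi:

```python
import re

s = "it was >2000 psi, a change of -500 psi."

def kPa_with_qualifier(match):
    # Three groups: optional qualifier, number, units.
    qualifier, value, _ = match.groups()
    value = float(value) / 0.145   # assumption: 1 kPa is about 0.145 psi
    return f" {qualifier}{value:0.3f} kPa"

result = re.sub(r' ([<>]?)([-.0-9]+)\s+(psi)', kPa_with_qualifier, s)
print(result)
```

The qualifier group matches an empty string when there is no `<` or `>`, so plain numbers pass through the same callback.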

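Writing one callback per unit gets repetitive. You could do better with a single callback, built by an outer function that takes the target units as a parameter. The inner function remembers `units` from the enclosing call; this is a closure, a technique from functional programming.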

In [45]:
def unit_transform(units):
    # Approximate number of psi in one of each target unit.
    divisors = {'kPa': 0.145, 'bar': 14.5}
    def transform(match):
        value, _ = match.groups()
        value = float(value) / divisors[units]
        return f" {value:0.3f} {units}"
    return transform

In [48]:
re.sub(r' ([-.0-9]+)\s+(psi)', unit_transform('kPa'), s)

'The pressure was about 10344.828 kPa (on Tuesday 23 February), \nand the next day it was >2000 psi, \na change of -3448.276 kPa.'
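An alternative to writing the closure by hand is `functools.partial`, which fixes some arguments of an ordinary function and returns a new callable. This is a sketch with hypothetical names (`convert`, `PSI_PER_UNIT`) and approximate conversion factors:

```python
import re
from functools import partial

# Assumed, approximate number of psi in one of each target unit.
PSI_PER_UNIT = {'kPa': 0.145, 'bar': 14.5}

def convert(match, units):
    """Callback taking an extra argument that re.sub() cannot supply."""
    value = float(match.group(1)) / PSI_PER_UNIT[units]
    return f" {value:0.3f} {units}"

s = "about 1500.0 psi"

# partial() pins units='bar', leaving a one-argument callback for re.sub().
in_bar = re.sub(r' ([-.0-9]+)\s+(psi)', partial(convert, units='bar'), s)
print(in_bar)   # about 103.448 bar
```

Both approaches give `re.sub()` a function of one argument; which reads better is largely a matter of taste.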