# An introduction to regular expressions [regex]

The [Python 3 documentation](https://docs.python.org/3/howto/regex.html) tells us that:

>Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

Regular expressions are very powerful, but they are anything but intuitive. There are both [regex libraries](http://regexlib.com) and regex builder sites like [pythex](https://pythex.org/), [regex101](https://regex101.com/), [regexr](https://regexr.com/) or [debuggex](https://www.debuggex.com/), but it's still good to be able to refer back to a simple _cookbook_. Hopefully this notebook can serve that purpose.

Regex support is part of the [Python standard library](https://docs.python.org/3/library/) through the `re` module, which we can import like so:

In [1]:
import re

The `re` module provides many functions. In a notebook you can list them with the dot-tab combination: type `re.` and press Tab.

In [2]:
#re.
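
As a quick orientation, here is a minimal sketch of the `re` functions you are most likely to reach for. The sample string `text` is made up for illustration; the functions themselves (`search`, `match`, `findall`, `sub`, `split`) are all part of the standard `re` module.

```python
import re

# hypothetical sample string, not taken from a real report
text = 'Well 29/8a-3 reached TD in 1985'

# search() returns the first match anywhere in the string (or None)
first_year = re.search(r'\d{4}', text).group()

# match() only tries to match at the very start of the string
at_start = re.match(r'\d{4}', text)

# findall() returns every non-overlapping match as a list of strings
numbers = re.findall(r'\d+', text)

# sub() replaces matches, split() cuts the string apart at matches
masked = re.sub(r'\d{4}', 'YYYY', text)
words = re.split(r'\s+', text)

print(first_year, at_start, numbers, masked, words, sep='\n')
```

The rest of this notebook mostly uses `findall()`, because it returns plain strings that are easy to inspect.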

## Regex metacharacters

Regex uses metacharacters as placeholders to define string patterns that can then be matched.

The complete list of Python regex metacharacters is quite short:

> `. [ ] ^ \ * + ? { } | $ ( )`

- `.` matches anything except a newline character.
- `[` and `]` are used for specifying a _character class_, which is a set of characters that you wish to match.
- `^` at the start of a character class (`[]`) _complements the set_, i.e. it matches the characters _not listed_ within the class.
- note that most metacharacters lose their special meaning _inside_ a class and are treated like any other character. The exceptions are a leading `^`, `\` (which still escapes), `-` between two characters (which defines a range) and the closing `]`.
- `\` is used to _escape_ various characters to signal various special sequences, for example `\d` matches any digit character (see below for more examples). It’s also used to escape all the metacharacters so you can still match them in patterns.
- `*` specifies that the previous character can be matched _zero or more times_, instead of exactly once.
- `+` specifies that the previous character can be matched _one or more times_.
- `?` specifies that the previous character can be matched either _once or zero times_.
- `{` and `}` are used to specify that the previous character repeats between *m* and *n* times inclusive, where *m* and *n* are integers. The syntax is `{m,n}`; `{,n}` implies a minimum of `0` and `{m,}` implies no upper limit.
- `|` signifies alternation, or the `or` operator. If *A and B* are regular expressions, `A|B` will match any string that matches either *A or B*.
- `^` outside a character class matches at the beginning of the string, and also at the beginning of each line when the `MULTILINE` flag is set. To match a literal `^`, use `\^`.
- `$` matches at the end of the string (or just before a trailing newline), and also at the end of each line when the `MULTILINE` flag is set. To match a literal `$`, use `\$` or enclose it inside a character class, as in `[$]`.
- `(` and `)` mark a _group_: they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as `*`, `+`, `?`, or `{m,n}`.
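
The metacharacters listed above can be tried out on a handful of short strings. The sketch below is only an illustration: the `samples` list and the `full()` helper are invented for this example, and `re.fullmatch()` is used so that a pattern has to describe the whole string.

```python
import re

# made-up sample strings
samples = ['gs', 'gas', 'gaas', 'oil', 'GAS']

def full(pattern):
    """Return the samples that the pattern matches in their entirety."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = full(r'ga*s')          # 'a' zero or more times
plus = full(r'ga+s')          # 'a' one or more times
optional = full(r'ga?s')      # 'a' zero times or once
counted = full(r'ga{2}s')     # 'a' exactly twice
either = full(r'gas|oil')     # alternation
three_any = full(r'...')      # any three characters
not_lower = full(r'[^a-z]+')  # complemented character class

# a quantifier after a group repeats the whole group
grouped = re.fullmatch(r'(ab)+', 'ababab') is not None

# anchors: first and last word of a string
first_word = re.findall(r'^\w+', 'oil and gas')
last_word = re.findall(r'\w+$', 'oil and gas')

print(star, plus, optional, counted, either, three_any, not_lower)
print(grouped, first_word, last_word)
```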

## Special character sequences

Below is a short list of special character sequences taken from the [documentation](https://docs.python.org/3/howto/regex.html). For a complete list of these sequences, visit the [last part of Regular Expression Syntax in the Standard Library reference](https://docs.python.org/3/library/re.html#re-syntax):

`\d` Matches any decimal digit; this is equivalent to the class `[0-9]`

`\D` Matches any non-digit character; this is equivalent to the class `[^0-9]`

`\s` Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`

`\S` Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`

`\w` Matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`

`\W` Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`
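
A short sketch of these sequences in action, using a made-up `line` of text. One caveat worth knowing: for ordinary (Unicode) string patterns in Python 3 the equivalences above are only exact when the `re.ASCII` flag is set; without it, `\w` and `\d` also match non-ASCII letters and digits, as the last two lines illustrate.

```python
import re

# made-up example line containing spaces and a tab
line = 'TD: 3200 m\tKB_elev 25'

digits = re.findall(r'\d+', line)      # runs of digits
words = re.findall(r'\w+', line)       # runs of letters, digits and '_'
spaces = re.findall(r'\s', line)       # individual whitespace characters
non_words = re.findall(r'\W+', line)   # everything between the words

# \w is Unicode-aware by default, and ASCII-only with re.ASCII
unicode_name = re.fullmatch(r'\w+', 'Bjørn') is not None
ascii_name = re.fullmatch(r'\w+', 'Bjørn', re.ASCII) is not None

print(digits, words, spaces, non_words, unicode_name, ascii_name)
```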

## Matching examples

### 1. Matching well UWIs

In [3]:
# get some data
canadian_uwi = '100/12-04-091-05-W5/00'
api_well = '42-501-20130-03-00'

In [4]:
# define the regex
re_canada = r'\d{3}\/\d{2}-\d{2}-\d{3}-\d{2}-[A-Z]\d\/\d{2}'
re_canada_grp = r'\d{3}\/(?:\d{2}-){2}\d{3}-\d{2}-[A-Z]\d\/\d{2}'
re_api = r'\d{2}-\d{3}-\d{5}-\d{2}-\d{2}'
re_api_short = r'-?(?:\d+)-?' # note this does not give us a single match

In [5]:
# match the regex to the text
match_canada = re.findall(re_canada, canadian_uwi)
match_canada_grp = re.findall(re_canada_grp, canadian_uwi)
match_api = re.findall(re_api, api_well)
match_api_short = re.findall(re_api_short, api_well)

In [6]:
match_canada, match_canada_grp

(['100/12-04-091-05-W5/00'], ['100/12-04-091-05-W5/00'])

In [7]:
match_api, match_api_short

(['42-501-20130-03-00'], ['42-', '501-', '20130-', '03-', '00'])
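
The `(?:...)` in `re_canada_grp` is a _non-capturing_ group, and it matters when using `findall()`: if a pattern contains a capturing group, `findall()` returns the content of the group rather than the whole match. The sketch below contrasts the two; the variable names are chosen for this example only.

```python
import re

canadian_uwi = '100/12-04-091-05-W5/00'

# same pattern twice, once with (?:...) and once with a plain (...)
non_capturing = r'\d{3}/(?:\d{2}-){2}\d{3}-\d{2}-[A-Z]\d/\d{2}'
capturing = r'\d{3}/(\d{2}-){2}\d{3}-\d{2}-[A-Z]\d/\d{2}'

whole_match = re.findall(non_capturing, canadian_uwi)
group_only = re.findall(capturing, canadian_uwi)

print(whole_match)  # the full UWI
print(group_only)   # only the last repetition of the captured group
```

With the capturing version, the list holds only the last thing the repeated group matched, which is rarely what you want when extracting identifiers.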

### 2. Matching well depths

In [8]:
# a Lorem Ipsum text containing random well depths
input_text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin nec risus ut velit consectetur finibus. Vivamus veh -8550.217 691.7777 icula cursus rh -7811.0964 oncus. Nullam dictum -7864.6305  nec eros ut sagittis. Nulla id venenatis arcu. Mauris ac tristique magna. Vivamus quis augue sed urna ultricies cursus. Cras semper, lectus eget sagittis consectetur, nisi ma -4472.429 uris suscipit enim, tristique convallis augue orci sed elit. Ut sapie 235.7094 n libero, v 454.369 olutpat quis lobortis nec, viverra  -7816.274 feugiat nisi. Nunc sapien ligula, conseq 225.864 uat fermentum mi sed, feugiat efficitur nulla. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Cura -3664.382 e;'
input_text

'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin nec risus ut velit consectetur finibus. Vivamus veh -8550.217 691.7777 icula cursus rh -7811.0964 oncus. Nullam dictum -7864.6305  nec eros ut sagittis. Nulla id venenatis arcu. Mauris ac tristique magna. Vivamus quis augue sed urna ultricies cursus. Cras semper, lectus eget sagittis consectetur, nisi ma -4472.429 uris suscipit enim, tristique convallis augue orci sed elit. Ut sapie 235.7094 n libero, v 454.369 olutpat quis lobortis nec, viverra  -7816.274 feugiat nisi. Nunc sapien ligula, conseq 225.864 uat fermentum mi sed, feugiat efficitur nulla. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Cura -3664.382 e;'

In [9]:
re_depth = r'-?\d{3,5}\.\d{,4}' # the dot is escaped so that it only matches a decimal point
match_depth = re.findall(re_depth, input_text)
match_depth

['-8550.217',
 '691.7777',
 '-7811.0964',
 '-7864.6305',
 '-4472.429',
 '235.7094',
 '454.369',
 '-7816.274',
 '225.864',
 '-3664.382']
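
When matching decimal numbers, keep in mind that an unescaped `.` matches _any_ character, whereas `\.` matches only a literal dot. The sketch below shows the difference on a made-up `sample` string and then converts the matches to floats, which is usually the next step with depth data.

```python
import re

# made-up string: one real decimal number and one lookalike
sample = 'depths 1234.56 and 7890x12'

loose = re.findall(r'-?\d{3,5}.\d{,4}', sample)    # '.' matches any character
strict = re.findall(r'-?\d{3,5}\.\d{,4}', sample)  # '\.' matches a literal dot

# findall() returns strings, so convert them before doing arithmetic
depths = [float(m) for m in strict]

print(loose)
print(strict)
print(depths)
```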

### 3. Matching pressures

We use an example from the [petrowiki](https://petrowiki.org/Reservoir_pressure_data_interpretation):

In [10]:
example_paragraph = 'Permeability barriers can also be identified as illustrated in Fig. 6. The barrier is indicated in Fig. 6a by the hydrostatic potential difference between the layers above and below the detected permeability barrier of approximately 20 psi. The line with a gradient of 0.497 psi/ft represents the mud pressure, which was measured in the same trip in the well while acquiring the formation pressure. In Fig. 6b, the reservoir fluid gradients differ across the permeability barrier. Nevertheless, a potential difference of approximately 140 psi across the barrier is interpreted as indicating a no-flow barrier. Zero permeability is implied. Otherwise, the pressure would have equilibrated on both sides of the barrier over geologic time.'
example_paragraph

'Permeability barriers can also be identified as illustrated in Fig. 6. The barrier is indicated in Fig. 6a by the hydrostatic potential difference between the layers above and below the detected permeability barrier of approximately 20 psi. The line with a gradient of 0.497 psi/ft represents the mud pressure, which was measured in the same trip in the well while acquiring the formation pressure. In Fig. 6b, the reservoir fluid gradients differ across the permeability barrier. Nevertheless, a potential difference of approximately 140 psi across the barrier is interpreted as indicating a no-flow barrier. Zero permeability is implied. Otherwise, the pressure would have equilibrated on both sides of the barrier over geologic time.'

### Lookbehind and lookahead

A simple regex such as `r'\d{1,3} psi'` would capture `20 psi`, but it would also capture the `497 psi` inside `0.497 psi/ft`. To exclude floats from the psi regex, we will use lookbehind and lookahead patterns:

- `(?<!...)` Matches if the current position in the string is not preceded by a match for ....
- `(?!...)` Matches if ... doesn’t match next.

The equivalent positive assertions exist:

- `(?=...)` Matches if ... matches next, but doesn’t consume any of the string. 
- `(?<=...)` Matches if the current position in the string is preceded by a match for ... that ends at the current position.

There are many such patterns in the [documentation](https://docs.python.org/3/library/re.html).
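
To get a feel for the four assertions, here is a minimal sketch on a made-up string. None of the assertions consume any text, so only the digits end up in the results. The `\b` word boundaries are added so that the regex engine cannot satisfy a negative assertion by matching just part of a number.

```python
import re

# made-up example string
text = 'top 1500 ft, base 1620 ft, pressure 35 psi'

followed_by_ft = re.findall(r'\d+(?= ft)', text)           # positive lookahead
not_followed_by_ft = re.findall(r'\b\d+\b(?! ft)', text)   # negative lookahead
after_base = re.findall(r'(?<=base )\d+', text)            # positive lookbehind
not_after_base = re.findall(r'(?<!base )\b\d+', text)      # negative lookbehind

print(followed_by_ft, not_followed_by_ft, after_base, not_after_base)
```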

In [11]:
re_psi = r'(?<![\d.])\d{1,4} psi(?!\/)' # the lookbehind rejects the decimals of a float, the lookahead rejects 'psi/ft'
re_gradient = r'0\.\d{1,3} psi\/ft' # the dot is escaped so that it only matches a decimal point

In [12]:
match_psi = re.findall(re_psi, example_paragraph)
match_gradient = re.findall(re_gradient, example_paragraph)

In [13]:
match_psi, match_gradient

(['20 psi', '140 psi'], ['0.497 psi/ft'])
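
To see what the lookbehind buys us, the sketch below compares a naive pattern with a guarded one on a shortened version of the paragraph. The variable names are chosen for this example only.

```python
import re

# shortened excerpt of the paragraph above
excerpt = 'approximately 20 psi. The line with a gradient of 0.497 psi/ft'

# naive: also picks up the decimals of the gradient
naive = re.findall(r'\d{1,3} psi', excerpt)

# guarded: the number must not be preceded by a digit or a dot
guarded = re.findall(r'(?<![\d.])\d{1,3} psi', excerpt)

print(naive)
print(guarded)
```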

### 4. Object-oriented 'compile' pattern

So far we have used the same functional pattern: passing a raw-string regex to the `findall()` function together with its target string. There is another common pattern that is object-oriented:

- we compile the `pattern` to match with `re.compile()`
- we call the `findall()` method _on that pattern_ to return the matches

Again, we use an example from the [petrowiki](https://petrowiki.org/Hole_geometry). Here we want to extract the two casing sizes listed (refer to [unicode.org](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Decomposition_Type=Fraction:]) for the code points of the fraction characters used in the casing sizes).

In [14]:
casing_design = 'The large flow string in Fig. 1 resulted in a 13⅜-in. intermediate string and a 20-in. surface casing. However, these strings may be difficult to design if high formation pressures are encountered. Table 1 shows the pipe required for various conditions on the intermediate string, assuming that a single weight and grade will be used.'
casing_design

'The large flow string in Fig. 1 resulted in a 13⅜-in. intermediate string and a 20-in. surface casing. However, these strings may be difficult to design if high formation pressures are encountered. Table 1 shows the pipe required for various conditions on the intermediate string, assuming that a single weight and grade will be used.'

In [15]:
pattern = re.compile(r'\d{2}[\u2150-\u2189]?-in')
match_casing = pattern.findall(casing_design)

In [16]:
match_casing

['13⅜-in', '20-in']
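
A compiled pattern is an object that can be reused on as many strings as you like, and it offers the same operations as the module-level functions (`search()`, `findall()`, `finditer()`, `sub()` and so on) as methods. A minimal sketch, assuming a few made-up `lines` of text:

```python
import re

casing = re.compile(r'\d{2}[\u2150-\u2189]?-in')

# made-up lines, one of them without a casing size
lines = [
    'a 13⅜-in. intermediate string',
    'a 20-in. surface casing',
    'no casing size here',
]

# search() returns a match object, or None when nothing matches
found = []
for line in lines:
    m = casing.search(line)
    found.append(m.group() if m else None)

# the original regex stays available on the compiled object
source = casing.pattern

print(found)
print(source)
```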

## Flags

Although you can do a lot of regex searches without using `compilation flags`, occasionally you will need them.

The [docs](https://docs.python.org/3/howto/regex.html#compilation-flags) list six `compilation flags`, each available under a long name (e.g. `IGNORECASE`) and an equivalent short, one-letter form (e.g. `I`). We will illustrate their use with `IGNORECASE`. To start, let's search for a UWI in a short paragraph, ignoring case:

In [17]:
lorem_uwi = 'Lorem ipsum dolor sit ametUWI 223 Z 023 K 224G83 56, consectetur adipiscing elit. Cras eleifend, turpis at blandit dignissim, ligula nisl pellentesque augue, sit amet tempus tellus lorem id erat. Suspendisse ac rhoncus nibh, id mollis odio. Vestibulum sagittis nulla purus, pharetra cursus nibh aliquet ut. Nullam mauris enim, facilisis sed metus a, bibendum malesuada enim. Fusce euismod felis vitae lacus sodales fringilla. Pellentesque dolor odio,UWI 200 D 096 H 094A15 00 mattis sed est nec, placerat venenatis tellus.UWI 200 T 096 G 094B15 00 Nam in laoreet urna.'
lorem_uwi

'Lorem ipsum dolor sit ametUWI 223 Z 023 K 224G83 56, consectetur adipiscing elit. Cras eleifend, turpis at blandit dignissim, ligula nisl pellentesque augue, sit amet tempus tellus lorem id erat. Suspendisse ac rhoncus nibh, id mollis odio. Vestibulum sagittis nulla purus, pharetra cursus nibh aliquet ut. Nullam mauris enim, facilisis sed metus a, bibendum malesuada enim. Fusce euismod felis vitae lacus sodales fringilla. Pellentesque dolor odio,UWI 200 D 096 H 094A15 00 mattis sed est nec, placerat venenatis tellus.UWI 200 T 096 G 094B15 00 Nam in laoreet urna.'

With the long name:

In [18]:
pattern = re.compile(r'UWI \d{3} [a-z] \d{3} [a-z] \d{3}[a-z]\d{2} \d{2}', re.IGNORECASE)
match_uwi = pattern.findall(lorem_uwi)

In [19]:
match_uwi

['UWI 223 Z 023 K 224G83 56',
 'UWI 200 D 096 H 094A15 00',
 'UWI 200 T 096 G 094B15 00']

And with the short name:

In [20]:
pattern_short = re.compile(r'UWI \d{3} [a-z] \d{3} [a-z] \d{3}[a-z]\d{2} \d{2}', re.I)
match_uwi_short = pattern_short.findall(lorem_uwi)

In [21]:
match_uwi_short

['UWI 223 Z 023 K 224G83 56',
 'UWI 200 D 096 H 094A15 00',
 'UWI 200 T 096 G 094B15 00']
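
Two more things are worth knowing about flags: they can be set inside the pattern itself with an inline prefix such as `(?i)`, and several flags can be combined with `|`. The sketch below uses a made-up lowercase `text` and combines `IGNORECASE` with `VERBOSE`, which lets you spread a pattern over several lines with comments. In verbose mode whitespace in the pattern is ignored, so literal spaces are written as `[ ]`.

```python
import re

# made-up lowercase variant of one of the UWIs above
text = 'uwi 200 d 096 h 094a15 00'

# no flag: the uppercase 'UWI' in the pattern does not match
strict = re.findall(r'UWI \d{3} [a-z] \d{3} [a-z] \d{3}[a-z]\d{2} \d{2}', text)

# inline flag at the start of the pattern, equivalent to re.IGNORECASE
inline = re.findall(r'(?i)UWI \d{3} [a-z] \d{3} [a-z] \d{3}[a-z]\d{2} \d{2}', text)

# combined flags: ignore case and allow whitespace and comments in the pattern
verbose_pattern = re.compile(r'''
    UWI [ ] \d{3} [ ] [a-z] [ ]     # prefix, three digits, one letter
    \d{3} [ ] [a-z] [ ]             # three digits, one letter
    \d{3} [a-z] \d{2} [ ] \d{2}     # e.g. 094A15 00
''', re.IGNORECASE | re.VERBOSE)
verbose = verbose_pattern.findall(text)

print(strict, inline, verbose)
```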

### Exercise

Using the `drilling_history` below extracted from a Shell and ExxonMobil Technical report on the [OGA portal](https://data-ogauthority.opendata.arcgis.com/datasets?q=well%20report):

1. extract all `years`
2. extract all `well names`
3. extract all pressures `p` at depths `d`
4. extract the `formation names at TD`
5. extract all `flow rates`

In [22]:
drilling_history = '1983: 29/8b-2 &29/8b-2s drilled by Union Oil. Oil discovered in Fulmar sands (ar s Acorn South). Well TDed in Smith Bank Fm. 1985: 29/8a-3 drilled by Shell/Esso. Acorn discovery well; producible oil in eservoir sands. Reservoir pressure of 10997 psia at 13200ft tvdss datum. DST oil r Wet Cromarty sands in overburden section (Oak Prospect). Well TDed in Smith Bank Fm. 1985: 29/9b-2 drilled by Premier Oil. Beechnut East discovery well; successful ulmar and Triassic sands. Reservoir pressure of 11040 psia at 13800ft tvdss datu 7000 bbl/d. Well TDed in Skagerrak Fm. 1986: 29/9b-3 drilled by Premier Oil. Beechnut West unsuccessful dry hole. 1988: 29/8a-4 drilled by Shell/Esso. Oil discovered in Pentland and Skagerrak s kagerrak Fm. 1989: 29/9b-6 drilled by Premier Oil. Proven producible oil discovered in Fulm ressure of 11231 psia at 13800ft tvdss datum. DST oil rates of ~1200 bbl/d. Wel m. 1992: 29/9c-8 drilled by BG. Dry hole with Triassic Skagerrak sands (Fulmar abs kagerrak Fm. 2001: 29/9b-9 drilled by Hess. Proven producible oil in Fulmar sands. Reservoir sia at 13800ft tvdss datum. DST oil rates of ~2400 bbl/d. Well TDed in Zechstein Fm. 2001:29/9b-9z drilled by Hess. Incomplete, tight Fulmar section, single oil sample ressure of 11130 psia at 13800ft tvdss datum. Well TDed in Rattray Fm. 2009: 29/8a-6 drilled by Venture/Centrica. Horizontal well with EWT in Triassic Sk eservoir pressure of 10901 at 13200ft tvdss datum. Proven producible oil from E 2000 bbl/d declining to 5000 bbl/d. Well TDed in Skagerrak Fm.'
drilling_history

'1983: 29/8b-2 &29/8b-2s drilled by Union Oil. Oil discovered in Fulmar sands (ar s Acorn South). Well TDed in Smith Bank Fm. 1985: 29/8a-3 drilled by Shell/Esso. Acorn discovery well; producible oil in eservoir sands. Reservoir pressure of 10997 psia at 13200ft tvdss datum. DST oil r Wet Cromarty sands in overburden section (Oak Prospect). Well TDed in Smith Bank Fm. 1985: 29/9b-2 drilled by Premier Oil. Beechnut East discovery well; successful ulmar and Triassic sands. Reservoir pressure of 11040 psia at 13800ft tvdss datu 7000 bbl/d. Well TDed in Skagerrak Fm. 1986: 29/9b-3 drilled by Premier Oil. Beechnut West unsuccessful dry hole. 1988: 29/8a-4 drilled by Shell/Esso. Oil discovered in Pentland and Skagerrak s kagerrak Fm. 1989: 29/9b-6 drilled by Premier Oil. Proven producible oil discovered in Fulm ressure of 11231 psia at 13800ft tvdss datum. DST oil rates of ~1200 bbl/d. Wel m. 1992: 29/9c-8 drilled by BG. Dry hole with Triassic Skagerrak sands (Fulmar abs kagerrak Fm. 2001: 2

In [23]:
# patterns
pattern_years = re.compile(r'\d{4}(?=:)', re.I)
pattern_wells = re.compile(r'(\d{2}\/\d[a-z]-\d[a-z]?)', re.I)
pattern_p_d = re.compile(r'\d{4,5} psia at \d{4,5}ft tvdss', re.I)
pattern_fm_td = re.compile(r'(?<=TDed in )([a-zA-Z]+ ?[a-zA-Z]+?)(?= Fm)')
pattern_flow_rates = re.compile(r'~?\d{4}(?= bbl\/d)', re.I)
# matches
match_years = pattern_years.findall(drilling_history)
match_wells = pattern_wells.findall(drilling_history)
match_p_d = pattern_p_d.findall(drilling_history)
match_fm_td = pattern_fm_td.findall(drilling_history)
match_flow_rates = pattern_flow_rates.findall(drilling_history)

In [24]:
match_years

['1983',
 '1985',
 '1985',
 '1986',
 '1988',
 '1989',
 '1992',
 '2001',
 '2001',
 '2009']

In [25]:
match_wells

['29/8b-2',
 '29/8b-2s',
 '29/8a-3',
 '29/9b-2',
 '29/9b-3',
 '29/8a-4',
 '29/9b-6',
 '29/9c-8',
 '29/9b-9',
 '29/9b-9z',
 '29/8a-6']

In [26]:
match_p_d

['10997 psia at 13200ft tvdss',
 '11040 psia at 13800ft tvdss',
 '11231 psia at 13800ft tvdss',
 '11130 psia at 13800ft tvdss']

In [27]:
match_fm_td

['Smith Bank', 'Smith Bank', 'Skagerrak', 'Zechstein', 'Rattray', 'Skagerrak']

In [28]:
match_flow_rates

['7000', '~1200', '~2400', '2000', '5000']
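
For item 3, the matches above are still strings that mix numbers and units. One possible refinement, sketched here on a short excerpt of `drilling_history`, is to use named capturing groups together with `finditer()` so that each pressure and depth comes back as a pair of integers. The group names `p` and `d` are chosen for this example only.

```python
import re

# short excerpt of the drilling history
excerpt = ('Reservoir pressure of 10997 psia at 13200ft tvdss datum. '
           'Reservoir pressure of 11040 psia at 13800ft tvdss datum.')

# (?P<name>...) is a capturing group that can be retrieved by name
pattern_pairs = re.compile(r'(?P<p>\d{4,5}) psia at (?P<d>\d{4,5})ft tvdss')

# finditer() yields match objects, which give access to the groups
pairs = [(int(m.group('p')), int(m.group('d')))
         for m in pattern_pairs.finditer(excerpt)]

print(pairs)
```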

<hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2019</p>
</div>