# An introduction to regular expressions (regex)

> Some people, when confronted with a problem, think, "I know, I'll use regular expressions". Now they have two problems.

### What are regular expressions?

The [Python 3 documentation](https://docs.python.org/3/howto/regex.html) tells us that:

>Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

While they are very powerful, they are anything but intuitive. There are both [regex libraries](http://regexlib.com) and regex builder sites like [pythex](https://pythex.org/), [regex101](https://regex101.com/), [regexr](https://regexr.com/) or [debuggex](https://www.debuggex.com/), but it's good to be able to refer back to a simple _cookbook_. Hopefully this notebook can serve that purpose.

Regex is part of the [Python standard library](https://docs.python.org/3/library/) as the `re` module.

## Matching well UWIs

In [98]:
uwi = '100/12-04-091-05-W5/00'

Regular expressions are 'find' on steroids.

In [137]:
import re

re.search(r'091', uwi)

<_sre.SRE_Match object; span=(10, 13), match='091'>

In [138]:
re.search(r'\d{3}', uwi)

<_sre.SRE_Match object; span=(0, 3), match='100'>

In [106]:
re.search(r'W[345]/00', uwi)

<_sre.SRE_Match object; span=(17, 22), match='W5/00'>

In [112]:
# 'group' gives the entire match.
re.search(r'(W[345])/00', uwi).group()

'W5/00'

In [113]:
# 'groups' is a tuple of all the captures.
re.search(r'(W[345])/00', uwi).groups()

('W5',)

In [115]:
re.findall(r'W[345]/00', uwi)

['W5/00']
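As a slightly larger sketch of groups, we can give each capture a name with `(?P<name>...)` and pull a whole UWI apart in one go. The field names below (`prefix`, `lsd`, `section` and so on) are assumptions chosen for illustration, not an official schema.

```python
import re

uwi = '100/12-04-091-05-W5/00'

# Field names are illustrative assumptions, not an official UWI schema.
uwi_pattern = re.compile(
    r'(?P<prefix>\d{3})/'
    r'(?P<lsd>\d{2})-(?P<section>\d{2})-(?P<township>\d{3})-(?P<rge>\d{2})-'
    r'W(?P<meridian>\d)/'
    r'(?P<event>\d{2})'
)

# fullmatch only succeeds if the whole string fits the pattern.
match = uwi_pattern.fullmatch(uwi)

# groupdict() maps each group name to the text it captured.
fields = match.groupdict()
print(fields)
print(match.group('township'), match.group('meridian'))
```

Named groups can still be reached by number, so `match.group(1)` and `match.group('prefix')` return the same string.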

Lookahead and lookbehind let us find things without capturing what comes before or after.

This enables tricky substitutions, for example, that would be impossible with ordinary search and replace.

In [132]:
text = "Both 100/12-04-091-05-W5/00 and 100/13-05-121-05-W4/00 are in Lease W4/00"

In [134]:
re.sub(r'(?<=\d-W)([345])(?=/00)', r'0\1', text)

'Both 100/12-04-091-05-W05/00 and 100/13-05-121-05-W04/00 are in Lease W4/00'
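To see why context matters here, this small comparison runs a plain `str.replace` next to the lookaround version on the same text. It is only a sketch of the difference: the plain replacement has no way of knowing that the lease name should be left alone.

```python
import re

text = "Both 100/12-04-091-05-W5/00 and 100/13-05-121-05-W4/00 are in Lease W4/00"

# Plain replacement has no notion of context, so it also rewrites 'Lease W4/00'.
naive = text.replace('W4/00', 'W04/00').replace('W5/00', 'W05/00')

# The lookbehind demands a digit and a hyphen before the 'W', which the lease name lacks.
careful = re.sub(r'(?<=\d-W)([345])(?=/00)', r'0\1', text)

print(naive)    # ends with 'Lease W04/00'
print(careful)  # ends with 'Lease W4/00'
```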

## Regex metacharacters

Regex uses metacharacters as placeholders to define string patterns that can then be matched.

The complete list of Python regex metacharacters is quite short:

> `. [ ] ^ \ * + ? { } | $ ( )`

- `.` matches anything except a newline character.
- `[` and `]` are used for specifying a _character class_, which is a set of characters that you wish to match.
- `^` at the start of a character class (`[]`) _complements the set_: the class then matches any character _not listed_ within it.
- Note that apart from `^` at the start of a class, most metacharacters lose their special meaning _inside_ a class and are treated like any other character. The exceptions are `\`, `]`, and a `-` between two characters, which denotes a range.
- `\` is used to _escape_ various characters to signal various special sequences, for example `\d` matches any digit character (see below for more examples). It’s also used to escape all the metacharacters so you can still match them in patterns.
- `*` specifies that the previous character or group can be matched _zero or more times_, instead of exactly once.
- `+` specifies that the previous character or group can be matched _one or more times_.
- `?` specifies that the previous character or group can be matched either _once or zero times_.
- `{` and `}` specify that the previous character or group repeats between *m* and *n* times inclusive, where *m* and *n* are integers; the syntax is `{m,n}`. `{,n}` implies a minimum of `0` and `{m,}` implies no upper limit.
- `|` signifies alternation, or the `or` operator. If *A and B* are regular expressions, `A|B` will match any string that matches either *A or B*.
- `^` matches at the beginning of the string, and in `MULTILINE` mode also at the beginning of each line. To match a literal `^`, use `\^`.
- `$` matches at the end of the string or just before a newline at the end of the string; in `MULTILINE` mode it also matches before every newline. To match a literal `$`, use `\$` or enclose it inside a character class, as in `[$]`.
- `(` and `)` mark a _group_: they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as `*`, `+`, `?`, or `{m,n}`.
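The list above is easier to remember with one tiny match per metacharacter. The strings below are made-up snippets in the style of this notebook's examples, and the variable names are just labels for this sketch.

```python
import re

any_char = re.findall(r'W.', 'W4 W5 WX')                    # ['W4', 'W5', 'WX']
in_class = re.findall(r'W[45]', 'W3 W4 W5')                 # ['W4', 'W5']
not_in_class = re.findall(r'W[^45]', 'W3 W4 W5')            # ['W3']
one_or_more = re.findall(r'\d+', '13800ft at 11040 psia')   # ['13800', '11040']
optional = re.findall(r'-\d[a-z]?', '29/9b-9 29/9b-9z')     # ['-9', '-9z']
counted = re.findall(r'\d{4,5}', '500 6500 32000')          # ['6500', '32000']
either = re.findall(r'ft|m', '32000 ft, 9750 m')            # ['ft', 'm']
grouped = re.findall(r'(?:ab)+', 'ab abab')                 # ['ab', 'abab']

# '^' anchors to the start of the string, or of every line with re.M (MULTILINE).
history = '1983: drilled\n1985: drilled'
anchored = re.findall(r'^\d+', history)                     # ['1983']
multiline = re.findall(r'^\d+', history, flags=re.M)        # ['1983', '1985']

for name, result in [('.', any_char), ('[]', in_class), ('[^]', not_in_class),
                     ('+', one_or_more), ('?', optional), ('{m,n}', counted),
                     ('|', either), ('()', grouped), ('^', anchored), ('^ + re.M', multiline)]:
    print(name, result)
```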

## Special character sequences

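Below is a short list of special character sequences taken from the [documentation](https://docs.python.org/3/howto/regex.html). For a complete list of these sequences, visit the [last part of Regular Expression Syntax in the Standard Library reference](https://docs.python.org/3/library/re.html#re-syntax). Note that for `str` patterns these sequences also match the corresponding Unicode characters unless the `re.ASCII` flag is set, so the class equivalents below describe the ASCII-only behaviour.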

`\d` Matches any decimal digit; this is equivalent to the class `[0-9]`

`\D` Matches any non-digit character; this is equivalent to the class `[^0-9]`

`\s` Matches any whitespace character; this is equivalent to the class `[ \t\n\r\f\v]`

`\S` Matches any non-whitespace character; this is equivalent to the class `[^ \t\n\r\f\v]`

`\w` Matches any alphanumeric character; this is equivalent to the class `[a-zA-Z0-9_]`

`\W` Matches any non-alphanumeric character; this is equivalent to the class `[^a-zA-Z0-9_]`
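One caveat worth checking for yourself: with ordinary `str` patterns, `\d` and `\w` also match non-ASCII digits and letters, and the class equivalents listed above only hold exactly when the `re.ASCII` flag is set. The sample strings in this sketch are invented for the demonstration.

```python
import re

# '\u0663\u0662' is two Arabic-Indic digits (three and two); the sentence is made up.
sample = 'TD 9750 m or \u0663\u0662 m'

unicode_digits = re.findall(r'\d+', sample)                # also picks up the Arabic-Indic digits
ascii_digits = re.findall(r'\d+', sample, flags=re.ASCII)  # behaves like [0-9]+

# '\u00ef' is 'i' with a diaeresis, as in 'naive' spelled with an accent.
word = 'na\u00efve_tag'

unicode_words = re.findall(r'\w+', word)                   # the accented letter counts as a word character
ascii_words = re.findall(r'\w+', word, flags=re.ASCII)     # behaves like [a-zA-Z0-9_]+

print(unicode_digits, ascii_digits)
print(unicode_words, ascii_words)
```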

## Matching well depths

Let's try capturing everything that looks like a depth from a document. We'll use part of [this news article](https://www.ogj.com/drilling-production/article/17279205/leon-drilling-due-under-llogrepsol-swap) as an example.

In [15]:
text = """
Under an asset exchange and new joint operating agreement, LLOG will operate the well on Keathley Canyon Block 642 with a 33% working interest.

The Leon discovery well was drilled to 32,000 ft (about 9,750 m) TD in 6,000 ft (about 1830 m) of water, encountering nearly 500 ft of net oil pay.

Repsol will acquire a 30% interest from LLOG in the 2011 Moccasin discovery on Keathley Canyon Block 736. The discovery well, drilled by Chevron Corp. to deeper than 31,000 ft in more than 6,500 ft of water, found nearly 400 ft of net oil pay (OGJ Online, Sept. 6, 2011).
"""

We'll also use a different way to 'compile' and apply the regular expression. Some people prefer this more object-oriented interface, and in some circumstances it will be faster.

In [16]:
pattern = re.compile(r' ([,0-9]+ (?:ft|m))')

pattern.findall(text)

['32,000 ft',
 '9,750 m',
 '6,000 ft',
 '1830 m',
 '500 ft',
 '31,000 ft',
 '6,500 ft',
 '400 ft']
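Once the depths are captured as strings, a natural next step is turning them into numbers. The sketch below is one possible way to do that on a shortened excerpt of the article text: it splits the pattern into two groups (value and unit) and adds a `\b` word boundary after the unit, which is an addition to the pattern used above.

```python
import re

# Shortened excerpt of the article text used above.
text = "drilled to 32,000 ft (about 9,750 m) TD in 6,000 ft (about 1830 m) of water"

# Two groups this time: the number and the unit are captured separately.
# The trailing \b stops 'm' from matching the first letter of a longer word.
pattern = re.compile(r' ([,0-9]+) (ft|m)\b')

# With more than one group, findall returns a list of tuples.
depths = [(float(value.replace(',', '')), unit)
          for value, unit in pattern.findall(text)]

print(depths)
```

Stripping the thousands separator before calling `float` is the only cleaning this excerpt needs; messier text would need more checks.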

### EXERCISE

Can you capture everything that looks like a block number from the same text?

In [18]:
pattern = re.compile(r'(Block [0-9]+)')

pattern.findall(text)

['Block 642', 'Block 736']

If you also want the position of each match, use `finditer`, which yields match objects instead of strings:

In [28]:
for match in pattern.finditer(text):
    print(match.group(0), 'at', match.span())

Block 642 at (106, 115)
Block 736 at (390, 399)
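The spans are ordinary string indices, so they can be used to slice the original text. As a small sketch, here we grab some text just before each match to see which area the block belongs to; the window of 16 characters is an arbitrary choice that happens to fit this sentence.

```python
import re

# One sentence from the article text used above.
text = "LLOG will operate the well on Keathley Canyon Block 642 with a 33% working interest."
pattern = re.compile(r'Block [0-9]+')

contexts = []
for match in pattern.finditer(text):
    start, end = match.span()
    # Slicing with the span recovers the match itself...
    assert text[start:end] == match.group(0)
    # ...and widening the slice gives the preceding context.
    contexts.append(text[max(0, start - 16):end])

print(contexts)
```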


## Matching pressures

We use an example from the [petrowiki](https://petrowiki.org/Reservoir_pressure_data_interpretation):

In [34]:
text = """Permeability barriers can also be identified as illustrated in Fig. 6.

The barrier is indicated in Fig. 6a by the hydrostatic potential difference between the layers above and below the detected permeability barrier of approximately 20 psi.

The line with a gradient of 0.497 psi/ft represents the mud pressure, which was measured in the same trip in the well while acquiring the formation pressure.

In Fig. 6b, the reservoir fluid gradients differ across the permeability barrier. Nevertheless, a potential difference of approximately 140 psi across the barrier is interpreted as indicating a no-flow barrier. Zero permeability is implied.

Otherwise, the pressure would have equilibrated on both sides of the barrier over geologic time."""

### Lookbehind and lookahead

A simple regex such as `r'\d{1,3} psi'` would match both `20 psi` and the `497 psi` inside `0.497 psi/ft`, so we will use lookahead and lookbehind assertions to keep the pressure gradient out of the psi results:

- `(?<!...)` Matches if the current position in the string is not preceded by a match for `...`.
- `(?!...)` Matches if `...` doesn’t match next.

The equivalent positive assertions also exist:

- `(?=...)` Matches if `...` matches next, but doesn’t consume any of the string.
- `(?<=...)` Matches if the current position in the string is preceded by a match for `...` that ends at the current position.

There are many such patterns in the [documentation](https://docs.python.org/3/library/re.html).
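Here is a minimal sketch that exercises all four assertions on one invented sample string, so you can compare what each one lets through.

```python
import re

# Made-up sample mixing pressures, a gradient and a price.
sample = '20 psi, 0.497 psi/ft, 140 psi, $15 fee'

# Positive lookahead: digits that are followed by ' psi' (the unit is not captured).
lookahead = re.findall(r'\d+(?= psi)', sample)

# Negative lookahead: 'psi' values that are not followed by '/ft'.
neg_lookahead = re.findall(r'\d+ psi(?!/ft)', sample)

# Positive lookbehind: digits that are preceded by a dollar sign.
lookbehind = re.findall(r'(?<=\$)\d+', sample)

# Negative lookbehind: numbers not preceded by a digit or a dot, which drops the float.
neg_lookbehind = re.findall(r'(?<![\d.])\d+ psi', sample)

print(lookahead)       # ['20', '497', '140']
print(neg_lookahead)   # ['20 psi', '140 psi']
print(lookbehind)      # ['15']
print(neg_lookbehind)  # ['20 psi', '140 psi']
```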

In [51]:
re_psi = r' ([-.,0-9]+ psi(?!/ft))'  # 'psi' but not 'psi/ft'
re_grad = r' ([-.,0-9]+ psi\/ft)'

In [52]:
re.findall(re_psi, text)

['20 psi', '140 psi']

In [53]:
re.findall(re_grad, text)

['0.497 psi/ft']

## Finding casing size

Again, we use an example from the [petrowiki](https://petrowiki.org/Hole_geometry). Here we want to extract the two casing sizes listed; refer to [unicode.org](https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Decomposition_Type=Fraction:]) for the code points of the fraction characters used in the casing sizes.

In [63]:
print('\u2151')

⅑


In [61]:
text = 'The large flow string in Fig. 1 resulted in a 13⅜-in. intermediate string and a 20-in. surface casing. However, these strings may be difficult to design if high formation pressures are encountered. Table 1 shows the pipe required for various conditions on the intermediate string, assuming that a single weight and grade will be used.'
text

'The large flow string in Fig. 1 resulted in a 13⅜-in. intermediate string and a 20-in. surface casing. However, these strings may be difficult to design if high formation pressures are encountered. Table 1 shows the pipe required for various conditions on the intermediate string, assuming that a single weight and grade will be used.'

In [62]:
pattern = re.compile(r'\d{2}[\u2150-\u2189]?-in')

pattern.findall(text)

['13⅜-in', '20-in']
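If we want the sizes as numbers rather than strings, the standard library's `unicodedata.numeric` can translate a fraction character into a float. This is a sketch on a shortened excerpt of the text above, with the whole number and the optional fraction captured as separate groups.

```python
import re
import unicodedata

# Shortened excerpt of the petrowiki text used above.
text = 'resulted in a 13⅜-in. intermediate string and a 20-in. surface casing'

# Group 1 is the whole number, group 2 the optional fraction character.
pattern = re.compile(r'(\d{2})([\u2150-\u2189]?)-in')

sizes = []
for whole, fraction in pattern.findall(text):
    # An absent fraction comes back as an empty string.
    extra = unicodedata.numeric(fraction) if fraction else 0.0
    sizes.append(int(whole) + extra)

print(sizes)
```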

## Flags

Although you can do a lot of regex searches without using `compilation flags`, occasionally you will need them.

The [docs](https://docs.python.org/3/howto/regex.html#compilation-flags) list six `compilation flags`, each available under a long name (e.g. `IGNORECASE`) and an equivalent short, one-letter form (e.g. `I`). We will illustrate their use with `IGNORECASE`. To start, let's search for a UWI in a short paragraph, ignoring case:

In [65]:
text = 'Lorem ipsum dolor sit ametuwi 223 z 023 k 224G83 56, consectetur adipiscing elit. Cras eleifend, turpis at blandit dignissim, ligula nisl pellentesque augue, sit amet tempus tellus lorem id erat. Suspendisse ac rhoncus nibh, id mollis odio. Vestibulum sagittis nulla purus, pharetra cursus nibh aliquet ut. Nullam mauris enim, facilisis sed metus a, bibendum malesuada enim. Fusce euismod felis vitae lacus sodales fringilla. Pellentesque dolor odio,UWI 200 D 096 H 094A15 00 mattis sed est nec, placerat venenatis tellus.UWI 200 T 096 G 094B15 00 Nam in laoreet urna.'
text

'Lorem ipsum dolor sit ametuwi 223 z 023 k 224G83 56, consectetur adipiscing elit. Cras eleifend, turpis at blandit dignissim, ligula nisl pellentesque augue, sit amet tempus tellus lorem id erat. Suspendisse ac rhoncus nibh, id mollis odio. Vestibulum sagittis nulla purus, pharetra cursus nibh aliquet ut. Nullam mauris enim, facilisis sed metus a, bibendum malesuada enim. Fusce euismod felis vitae lacus sodales fringilla. Pellentesque dolor odio,UWI 200 D 096 H 094A15 00 mattis sed est nec, placerat venenatis tellus.UWI 200 T 096 G 094B15 00 Nam in laoreet urna.'

With the long name:

In [66]:
pattern = re.compile(r'UWI \d{3} [A-Z] \d{3} [A-Z] \d{3}[A-Z]\d{2} \d{2}', re.IGNORECASE)

pattern.findall(text)

['uwi 223 z 023 k 224G83 56',
 'UWI 200 D 096 H 094A15 00',
 'UWI 200 T 096 G 094B15 00']
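The same search can be written with the short name, or with the inline `(?i)` form at the start of the pattern. The sketch below runs all three on a shortened excerpt of the paragraph above and compares them with a case-sensitive search.

```python
import re

# Shortened excerpt of the paragraph used above.
text = 'sit ametuwi 223 z 023 k 224G83 56, odio,UWI 200 D 096 H 094A15 00 mattis'

core = r'UWI \d{3} [A-Z] \d{3} [A-Z] \d{3}[A-Z]\d{2} \d{2}'

long_form = re.findall(core, text, flags=re.IGNORECASE)  # long name
short_form = re.findall(core, text, flags=re.I)          # short, one-letter name
inline_form = re.findall(r'(?i)' + core, text)           # flag written inside the pattern
case_sensitive = re.findall(core, text)                  # no flag at all

print(long_form)
print(case_sensitive)
```

All three flagged versions return the same two UWIs, while the case-sensitive search misses the lower-case one.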

## Exercise

Using the `drilling_history` below extracted from a Shell and ExxonMobil Technical report on the [OGA portal](https://data-ogauthority.opendata.arcgis.com/datasets?q=well%20report):

1. extract all `years`
2. extract all `well names`
3. extract all pressures `p` at depths `d`
4. extract the `formation names at TD`
5. extract all `flow rates`

In [67]:
text = '1983: 29/8b-2 &29/8b-2s drilled by Union Oil. Oil discovered in Fulmar sands (ar s Acorn South). Well TDed in Smith Bank Fm. 1985: 29/8a-3 drilled by Shell/Esso. Acorn discovery well; producible oil in eservoir sands. Reservoir pressure of 10997 psia at 13200ft tvdss datum. DST oil r Wet Cromarty sands in overburden section (Oak Prospect). Well TDed in Smith Bank Fm. 1985: 29/9b-2 drilled by Premier Oil. Beechnut East discovery well; successful ulmar and Triassic sands. Reservoir pressure of 11040 psia at 13800ft tvdss datu 7000 bbl/d. Well TDed in Skagerrak Fm. 1986: 29/9b-3 drilled by Premier Oil. Beechnut West unsuccessful dry hole. 1988: 29/8a-4 drilled by Shell/Esso. Oil discovered in Pentland and Skagerrak s kagerrak Fm. 1989: 29/9b-6 drilled by Premier Oil. Proven producible oil discovered in Fulm ressure of 11231 psia at 13800ft tvdss datum. DST oil rates of ~1200 bbl/d. Wel m. 1992: 29/9c-8 drilled by BG. Dry hole with Triassic Skagerrak sands (Fulmar abs kagerrak Fm. 2001: 29/9b-9 drilled by Hess. Proven producible oil in Fulmar sands. Reservoir sia at 13800ft tvdss datum. DST oil rates of ~2400 bbl/d. Well TDed in Zechstein Fm. 2001:29/9b-9z drilled by Hess. Incomplete, tight Fulmar section, single oil sample ressure of 11130 psia at 13800ft tvdss datum. Well TDed in Rattray Fm. 2009: 29/8a-6 drilled by Venture/Centrica. Horizontal well with EWT in Triassic Sk eservoir pressure of 10901 at 13200ft tvdss datum. Proven producible oil from E 2000 bbl/d declining to 5000 bbl/d. Well TDed in Skagerrak Fm.'
text

'1983: 29/8b-2 &29/8b-2s drilled by Union Oil. Oil discovered in Fulmar sands (ar s Acorn South). Well TDed in Smith Bank Fm. 1985: 29/8a-3 drilled by Shell/Esso. Acorn discovery well; producible oil in eservoir sands. Reservoir pressure of 10997 psia at 13200ft tvdss datum. DST oil r Wet Cromarty sands in overburden section (Oak Prospect). Well TDed in Smith Bank Fm. 1985: 29/9b-2 drilled by Premier Oil. Beechnut East discovery well; successful ulmar and Triassic sands. Reservoir pressure of 11040 psia at 13800ft tvdss datu 7000 bbl/d. Well TDed in Skagerrak Fm. 1986: 29/9b-3 drilled by Premier Oil. Beechnut West unsuccessful dry hole. 1988: 29/8a-4 drilled by Shell/Esso. Oil discovered in Pentland and Skagerrak s kagerrak Fm. 1989: 29/9b-6 drilled by Premier Oil. Proven producible oil discovered in Fulm ressure of 11231 psia at 13800ft tvdss datum. DST oil rates of ~1200 bbl/d. Wel m. 1992: 29/9c-8 drilled by BG. Dry hole with Triassic Skagerrak sands (Fulmar abs kagerrak Fm. 2001: 2

In [91]:
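pattern_years = re.compile(r'((?:19|20)\d{2})(?=:)', re.I)  # four-digit year followed by ':'
pattern_wells = re.compile(r'(\d{2}\/\d[a-z]-\d[a-z]?)', re.I)  # e.g. '29/9b-9z'
pattern_p_d = re.compile(r'(\d{4,5} psia) at (\d{4,5}ft) tvdss', re.I)  # pressure and depth as two groups
pattern_fm_td = re.compile(r'(?<=TDed in )([a-zA-Z]+ ?[a-zA-Z]+?)(?= Fm)')  # one or two words between 'TDed in ' and ' Fm'
pattern_flow_rates = re.compile(r'~?\d{4} ?bbl/d', re.I)  # optional '~', four digits, 'bbl/d'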

In [92]:
pattern_years.findall(text)

['1983',
 '1985',
 '1985',
 '1986',
 '1988',
 '1989',
 '1992',
 '2001',
 '2001',
 '2009']

In [93]:
pattern_wells.findall(text)

['29/8b-2',
 '29/8b-2s',
 '29/8a-3',
 '29/9b-2',
 '29/9b-3',
 '29/8a-4',
 '29/9b-6',
 '29/9c-8',
 '29/9b-9',
 '29/9b-9z',
 '29/8a-6']

In [94]:
pattern_p_d.findall(text)

[('10997 psia', '13200ft'),
 ('11040 psia', '13800ft'),
 ('11231 psia', '13800ft'),
 ('11130 psia', '13800ft')]

In [95]:
pattern_fm_td.findall(text)

['Smith Bank', 'Smith Bank', 'Skagerrak', 'Zechstein', 'Rattray', 'Skagerrak']

In [96]:
pattern_flow_rates.findall(text)

['7000 bbl/d', '~1200 bbl/d', '~2400 bbl/d', '2000 bbl/d', '5000 bbl/d']
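The separate lists above lose the link between a year and its well. One possible extension, sketched here on three sentences lifted from the drilling history, is to capture both in a single pattern so that each match keeps them together. The optional space after the colon is there because one entry in the text is written as `2001:29/9b-9z`.

```python
import re

# Three sentences taken from the drilling history above.
excerpt = ('1983: 29/8b-2 &29/8b-2s drilled by Union Oil. '
           '1986: 29/9b-3 drilled by Premier Oil. '
           '2001:29/9b-9z drilled by Hess.')

# Group 1 is the year, group 2 the first well name that follows it.
pattern_year_well = re.compile(r'((?:19|20)\d{2}): ?(\d{2}/\d[a-z]-\d[a-z]?)')

# Years can repeat in the full text, so a list of tuples is safer than a dict.
year_wells = pattern_year_well.findall(excerpt)

print(year_wells)
```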

<hr />

<div>
<img src="https://avatars1.githubusercontent.com/u/1692321?s=50"><p style="text-align:center">© Agile Geoscience 2019</p>
</div>