# Regular expressions in Python

- hide: true
- toc: true
- badges: true
- comments: true
- categories: [python]

## Preliminaries

### Raw strings

> Raw string notation keeps regular expressions sane. [`re` tutorial](https://docs.python.org/3/library/re.html#raw-string-notation)

Just like the regex engine, Python uses `\` to [escape](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals) characters in strings that otherwise have special meaning (e.g. `'` and `\`) and to create tokens with special meaning (e.g. `\n`).

In [26]:
'It's raining'

SyntaxError: invalid syntax (3769801028.py, line 1)

In [27]:
'It\'s raining'

"It's raining"

In [21]:
print('Hello\nWorld')

Hello
World


A string is processed by the interpreter before being passed on to the regex engine. This means that to search for a single literal backslash, `\`, we need to escape twice and search for the regex `\\\\`. The interpreter reads this as `\\` and passes it to the regex engine, which then reads it as `\` as desired.

In [34]:
import re

s = 'a \\ b'
m = re.search('\\\\', s)
print(m.group())

\


A useful alternative is raw strings (`r'...'`), in which the interpreter treats backslashes as literal characters rather than as the start of an escape sequence, making the first set of escapes unnecessary. Hence, it's a good idea to write regex patterns as raw strings in Python.

In [26]:
m = re.search(r'\\', s)
print(m.group())

\
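To make the two layers of escaping concrete, here is a minimal sketch that compares what each literal actually contains. The variable names `escaped` and `raw` are only illustrative.

```python
import re

escaped = '\\\\'   # normal literal: each escaped pair collapses to one backslash
raw = r'\\'        # raw literal: backslashes are kept exactly as typed

# Both literals hand the same two-character pattern to the regex engine.
print(len(escaped), len(raw))   # 2 2
print(escaped == raw)           # True

# The regex engine then reads the two backslashes as one literal backslash.
print(re.search(raw, 'a \\ b').span())   # (2, 3)
```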


## `re` module

In [3]:
import re

Overview of search methods

In [3]:
pattern = 'a'
string = 'Jack is a boy'

methods = [
    ('re.match (start of string)', re.match(pattern, string)),
    ('re.search (anywhere in string)', re.search(pattern, string)),
    ('re.findall (all matches)', re.findall(pattern, string)),
    ('re.finditer (all matches as iterator)', re.finditer(pattern, string))
]

for desc, result in methods:
    print('{:40} -> {}'.format(desc, result))

re.match (start of string)               -> None
re.search (anywhere in string)           -> <re.Match object; span=(1, 2), match='a'>
re.findall (all matches)                 -> ['a', 'a']
re.finditer (all matches as iterator)    -> <callable_iterator object at 0x11236d2e0>
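The iterator returned by `re.finditer` is lazy, which is why the overview only prints its representation. A short sketch of how it might be consumed, reusing the same pattern and string:

```python
import re

string = 'Jack is a boy'

# Each item yielded by finditer is a Match object, so positions are
# available in addition to the matched text.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer('a', string)]
print(spans)   # [(1, 2, 'a'), (8, 9, 'a')]
```

Compared with `re.findall`, this keeps the position of every match, at the cost of a little more code.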


### `re.findall()`

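Returns a list of all non-overlapping matches. With no capturing groups the list holds the full matches; with one group it holds that group's text; with several groups it holds one tuple of groups per match.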

In [4]:
data = """
 012
foo34 
     56
78bar
9
 a10b
"""

data

'\n 012\nfoo34 \n     56\n78bar\n9\n a10b\n'

In [5]:
# no capturing groups
proper_digits = r'\s+\d+\s+'
re.findall(proper_digits, data, flags=re.MULTILINE)

['\n 012\n', ' \n     56\n', '\n9\n ']

In [6]:
# single capturing group
proper_digits = r'(?m)\s+(\d+)\s+'
re.findall(proper_digits, data, flags=re.MULTILINE)

['012', '56', '9']

In [7]:
# multiple capturing groups
proper_digits = r'\s+(\d)(\d+)?\s+'
re.findall(proper_digits, data, flags=re.MULTILINE)

[('0', '12'), ('5', '6'), ('9', '')]
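If grouping is needed only for structure, a non-capturing group `(?:...)` keeps `re.findall` returning whole matches. The sketch below reuses the `data` string from above and also shows `re.finditer` as one way to get both the full match and its groups. Note that an optional group that did not participate is `None` on a match object, whereas `re.findall` reports it as `''`.

```python
import re

data = '\n 012\nfoo34 \n     56\n78bar\n9\n a10b\n'

# (?:...) groups without capturing, so findall still returns whole matches.
whole = re.findall(r'\s+(?:\d)(?:\d+)?\s+', data)
print(whole)   # ['\n 012\n', ' \n     56\n', '\n9\n ']

# finditer exposes the full match (group 0) and the capturing groups together.
pairs = [(m.group(0).strip(), m.groups())
         for m in re.finditer(r'\s+(\d)(\d+)?\s+', data)]
print(pairs)   # [('012', ('0', '12')), ('56', ('5', '6')), ('9', ('9', None))]
```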

### `re.match()`

Matches a pattern only at the beginning of a string.

In [15]:
line = '"688293"|"777"|"2011-07-20"|"1969"|"20K to 30K"|"WA1 4"|"E01012553"|"E02002603"|"M"|"2012-01-25"|"262916"|"NatWest Bank"|"Current"|"364.22"|"9572 24jan12 , tcs bowdon , bowdon gb - pos"|"Debit"|"25.03"|"No Tag"|"No Tag"|"No Tag"|"No Merchant"|"Unknown Merchant"|"2011-07-20"|"2020-07-21 20:32:00"|"2014-07-18"|"2017-10-24"|"U"'

pattern = r'"\d+"\|"(?P<user_id>\d+)"'
match = re.match(pattern, line)
print(line, end='\n\n')
print(match)
print(match.group('user_id'))

"688293"|"777"|"2011-07-20"|"1969"|"20K to 30K"|"WA1 4"|"E01012553"|"E02002603"|"M"|"2012-01-25"|"262916"|"NatWest Bank"|"Current"|"364.22"|"9572 24jan12 , tcs bowdon , bowdon gb - pos"|"Debit"|"25.03"|"No Tag"|"No Tag"|"No Tag"|"No Merchant"|"Unknown Merchant"|"2011-07-20"|"2020-07-21 20:32:00"|"2014-07-18"|"2017-10-24"|"U"

<re.Match object; span=(0, 14), match='"688293"|"777"'>
777
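Named groups can also be read all at once. The following sketch uses a shortened version of the line above; the group name `row_id` is a hypothetical label for the first field, not something the data defines.

```python
import re

line = '"688293"|"777"|"2011-07-20"'
pattern = r'"(?P<row_id>\d+)"\|"(?P<user_id>\d+)"'   # row_id is a made-up name

m = re.match(pattern, line)
print(m.groupdict())       # {'row_id': '688293', 'user_id': '777'}
print(m.span('user_id'))   # (10, 13)

# match only anchors the start; fullmatch requires the whole string to match.
print(re.fullmatch(pattern, line))   # None
```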


In [11]:
from itertools import compress

addresses = [
    '5412 N CLARK',
    '5148 N CLARK',
    '5800 E 58TH',
    '2122 N CLARK',
    '5645 N RAVENSWOOD',
    '1060 W ADDISON',
    '4801 N BROADWAY',
    '1039 W GRANVILLE',
]

def large_house_number(address, threshold=2000):
    # the first run of digits in the address is taken to be the house number
    house_number = int(re.search(r'\d+', address)[0])
    return house_number > threshold

large_number = [large_house_number(x) for x in addresses]
list(compress(addresses, large_number))

['5412 N CLARK',
 '5148 N CLARK',
 '5800 E 58TH',
 '2122 N CLARK',
 '5645 N RAVENSWOOD',
 '4801 N BROADWAY']

In [13]:
match.group()

'"688293"|"777"'

### `re.sub()`

In [12]:
import string

Strip a string of whitespace and punctuation.

In [15]:
s = 'String. With! Punctu@tion# and _whitespace'
re.sub(r'[\W_]', '', s)

'StringWithPunctutionandwhitespace'
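To remove punctuation but keep the words separated, one option is to build the character class from `string.punctuation`. This is a sketch under the assumption that ASCII punctuation is all that needs removing; the names `punct_class`, `no_punct` and `collapsed` are illustrative.

```python
import re
import string

s = 'String. With! Punctu@tion# and _whitespace'

# re.escape protects characters such as ], ^, \ and - that are special
# inside a character class.
punct_class = '[' + re.escape(string.punctuation) + ']'
no_punct = re.sub(punct_class, '', s)
print(no_punct)    # String With Punctution and whitespace

# Runs of whitespace can be collapsed to a single space in a second pass.
collapsed = re.sub(r'\s+', ' ', 'too    many \n spaces').strip()
print(collapsed)   # too many spaces
```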

### `re.escape()`

I want to match "(other)". To match the parentheses literally, I have to escape them. If I don't, the regex engine interprets them as a capturing group.

In [19]:
m = re.search('(other)', 'some (other) word')
print(m)
m.group()

<re.Match object; span=(6, 11), match='other'>


'other'

I can escape manually.

In [20]:
re.search(r'\(other\)', 'some (other) word')

<re.Match object; span=(5, 12), match='(other)'>

But if I have many fields with metacharacters (e.g. variable values that contain parentheses) this is a massive pain. The solution is to just use `re.escape()`, which does all the work for me.

In [10]:
re.search(re.escape('(other)'), 'some (other) word')

<re.Match object; span=(5, 12), match='(other)'>
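The same idea scales to many values: escape each one, then join them with `|`. A minimal sketch, in which the list `values` and the sample `text` are made up for illustration:

```python
import re

# Hypothetical values that contain regex metacharacters.
values = ['(other)', 'a+b', 'cost ($)']

# Escape each value separately, then combine them as alternatives.
pattern = '|'.join(re.escape(v) for v in values)

text = 'some (other) word, a+b and cost ($)'
found = re.findall(pattern, text)
print(found)   # ['(other)', 'a+b', 'cost ($)']
```

Escaping before joining matters: calling `re.escape` on the joined string would also escape the `|` separators.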

## `regex` module

Awesome [comparison](https://github.com/rexdwyer/Splitsville/blob/master/Splitsville.ipynb) between `re` and `regex`.

In [8]:
import regex as re

In [11]:
pattern = r'e\Z'
subject = 'apple\norange'
re.search(pattern, subject)

<regex.Match object; span=(11, 12), match='e'>
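For context on the `\Z` anchor used above, here is a small sketch of how it differs from `$`. It uses only the standard-library `re` module (not `regex`), and the subject string has a trailing newline added to make the difference visible.

```python
import re

subject = 'apple\norange\n'

# $ also matches just before a trailing newline at the end of the string.
dollar = re.search(r'e$', subject)
print(dollar.span())   # (11, 12)

# \Z matches only at the very end, which here comes after the newline.
strict = re.search(r'e\Z', subject)
print(strict)          # None

# With MULTILINE, $ matches at the end of every line.
per_line = re.findall(r'e$', subject, flags=re.MULTILINE)
print(per_line)        # ['e', 'e']
```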

## Pandas

In [40]:
# assumption: `data` below is the vega_datasets loader (it replaces the string `data` defined earlier)
from vega_datasets import data

def colname_cleaner(df):
    """Convert column names to stripped lowercase."""
    df.columns = df.columns.str.lower().str.strip()
    return df

def str_cleaner(df):
    """Convert string values to stripped lowercase."""
    str_cols = df.select_dtypes('object')
    for col in str_cols:
        df[col] = df[col].str.lower().str.strip()
    return df
    
movies = (data.movies()
          .pipe(colname_cleaner)
          .pipe(str_cleaner))
movies.head(2)

Unnamed: 0,title,us gross,worldwide gross,us dvd sales,production budget,release date,mpaa rating,running time min,distributor,source,major genre,creative type,director,rotten tomatoes rating,imdb rating,imdb votes
0,the land girls,146083.0,146083.0,,8000000.0,jun 12 1998,r,,gramercy,,,,,,6.1,1071.0
1,"first love, last rites",10876.0,10876.0,,300000.0,aug 07 1998,r,,strand,,drama,,,,6.9,207.0


In [48]:
import re

### Finding a single pattern in text

In [63]:
pattern = 'hello'
text = 'hello world it is a beautiful day.'

match = re.search(pattern, text)
match.start(), match.end(), match.group()

(0, 5, 'hello')

In Pandas

In [67]:
movies.title.str.extract('(love)')

Unnamed: 0,0
0,
1,love
2,
3,
4,
...,...
3196,
3197,
3198,
3199,


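- `contains()`: Test if pattern or regex is contained within a string of a Series or Index.
- `match()`: Determine if each string starts with a match of a regular expression.
- `fullmatch()`: Determine if each string entirely matches a regular expression.
- `extract()`: Extract capture groups in the regex pattern as columns in a DataFrame (first match only).
- `extractall()`: Extract capture groups from all matches, not just the first.
- `find()`: Return the lowest index at which a plain substring occurs in each string (no regex support).
- `findall()`: Find all occurrences of a pattern or regex in each string.
- `replace()`: Replace each occurrence of a pattern or regex in each string.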
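A quick sketch of several of these `Series.str` methods on a tiny, made-up Series (three titles borrowed from the output above), so the results are easy to check by eye:

```python
import pandas as pd

titles = pd.Series(['the land girls', 'first love, last rites', 'slam'])

contains = titles.str.contains('love')    # pattern anywhere in the string
starts = titles.str.match('the')          # pattern at the start of the string
full = titles.str.fullmatch('slam')       # pattern must span the whole string
first = titles.str.extract(r'\b(l\w+)')   # first capture per row, as a DataFrame
every = titles.str.findall(r'\bl\w+')     # list of all matches per row
swapped = titles.str.replace('girls', 'hello', regex=False)  # substring replacement

print(contains.tolist())   # [False, True, False]
print(every.tolist())      # [['land'], ['love', 'last'], []]
print(swapped.tolist())    # ['the land hello', 'first love, last rites', 'slam']
```

Rows without a match come back as `NaN` from `extract()` and as an empty list from `findall()`.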

In [47]:
# Series.replace matches whole values, so no title changes here; use .str.replace for substrings
movies.title.replace('girls', 'hello')

0                   the land girls
1           first love, last rites
2       i married a strange person
3             let's talk about sex
4                             slam
                   ...            
3196    zack and miri make a porno
3197                        zodiac
3198                          zoom
3199           the legend of zorro
3200             the mask of zorro
Name: title, Length: 3201, dtype: object

Let's drop all movies by distributors with "Pictures" or "Universal" in their name.

In [108]:
# inverted masking

names = ['Universal', 'Pictures']
pattern = '|'.join(names)
mask = movies.distributor.str.contains(pattern, na=True)
result = movies[~mask]
result.head(2)

Unnamed: 0,title,us_gross,worldwide_gross,us_dvd_sales,production_budget,release_date,mpaa_rating,running_time_min,distributor,source,major_genre,creative_type,director,rotten_tomatoes_rating,imdb_rating,imdb_votes
0,The Land Girls,146083.0,146083.0,,8000000.0,Jun 12 1998,R,,Gramercy,,,,,,6.1,1071.0
1,"First Love, Last Rites",10876.0,10876.0,,300000.0,Aug 07 1998,R,,Strand,,Drama,,,,6.9,207.0


In [112]:
# negated regex

names = ['Universal', 'Pictures']
pattern = r'\|'.join(names)
neg_pattern = f'[^{pattern}]'
mask = movies.distributor.str.contains(neg_pattern, na=False)
result2 = movies[mask]

In [113]:
neg_pattern

'[^Universal\\|Pictures]'

In [114]:
result == result2

ValueError: Can only compare identically-labeled DataFrame objects
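The two results differ because `[^...]` is a character class: it matches any single character that is not in the set, rather than strings that lack the listed words. If the negation has to live inside the regex, a negative lookahead is one way to express it. The sketch below uses plain `re` on a handful of illustrative distributor names (not taken from the dataset) to compare the three approaches.

```python
import re

# Illustrative names only.
distributors = ['Universal', 'Paramount Pictures', 'Gramercy', 'Strand']
names = ['Universal', 'Pictures']

# Inverted match: keep names that contain none of the words.
inverted = [not re.search('|'.join(names), d) for d in distributors]

# Character class: True as soon as one character falls outside the set.
char_class = '[^' + r'\|'.join(names) + ']'
class_hits = [bool(re.search(char_class, d)) for d in distributors]

# Negative lookahead: succeeds only if no word occurs anywhere in the string.
lookahead = '^(?!.*(?:' + '|'.join(names) + '))'
lookahead_hits = [bool(re.search(lookahead, d)) for d in distributors]

print(inverted)         # [False, False, True, True]
print(class_hits)       # [False, True, True, True]
print(lookahead_hits)   # [False, False, True, True]
```

The inverted mask is usually the easier option to read; the lookahead is mainly useful when only a pattern string can be passed along.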

In [None]:
def drop_card_repayments(df):
    """Drop card repayment transactions from current accounts."""
    tags = ['credit card repayment', 'credit card payment', 'credit card']
    pattern = '|'.join(tags)
    mask = df.auto_tag.str.contains(pattern) & df.account_type.eq('current')
    return df[~mask]

