# Regular expressions

When we refer to regular expressions, we mean a combination of normal characters and metacharacters that describes patterns for finding text or positions within a text. The built-in `re` module offers a rich collection of functions for working with even the most complex regular expressions.

In [1]:
import re

The most commonly used functions of the `re` module are:

* `re.search` # finds only the first match
* `re.match` # the match must be at the beginning of the string
* `re.findall` # finds all matches
* `re.finditer` # iterator with all matches
* `re.sub` # substitutes the match with the text we provide
* `re.split` # splits the string at the specified delimiter(s)

First we will see some simple applications.

## re.match

In [2]:
text = "SNPs unaccounted for in the catalog and were excluded."

In [4]:
match = re.match('SNPs', text)
match

<re.Match object; span=(0, 4), match='SNPs'>

`re.match` returns a match only if the searched string starts with the pattern we search for. Otherwise it returns `None`.

The returned match object has several methods, for example for getting the offsets of the matched string.

In [5]:
match.span()

(0, 4)

In [6]:
match.start()

0

In [7]:
match.end()

4

We can also use the match object as a boolean: a match object is always truthy, while a failed match returns `None`, which is falsy.

In [8]:
if match:
    print('Found a match')
else:
    print('No match was found')

Found a match
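
The failing case is worth seeing as well. The following is a minimal sketch, assuming a sentence in which `SNPs` is not the first word; the variable name `no_match` is only illustrative.

```python
import re

text = "Unaccounted SNPs in the catalog were excluded."

# The pattern occurs in the text, but not at position 0, so re.match fails.
no_match = re.match('SNPs', text)
print(no_match)

# A failed match is None, so check it before calling methods such as .span().
if no_match is None:
    print('No match was found')
```

Calling `no_match.span()` without the check would raise an `AttributeError`, because `None` has no such method.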


## re.search

On the other hand, `re.search` is more flexible: it scans the whole string, so the match can start at any position.

In [11]:
text = "Unaccounted SNPs for in the catalog and were excluded."

In [12]:
match = re.search('SNPs', text)
match

<re.Match object; span=(12, 16), match='SNPs'>

It's important to remember that `re.search` returns only the first match, even if the pattern occurs several times.

In [13]:
if re.search('SNPs', text):
    print('Found a match')
else:
    print('No match was found')

Found a match
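
To illustrate the "first match only" behaviour, here is a small sketch with a sentence that contains the pattern twice. It assumes we want to continue searching after the first hit, and it uses a compiled pattern because the compiled `search` method accepts a start position. The names `first` and `second` are illustrative.

```python
import re

text = "SNPs unaccounted for in SNPs."

# re.search stops at the first occurrence.
first = re.search('SNPs', text)
print(first.span())

# A compiled pattern can resume the search from a given position.
pattern = re.compile('SNPs')
second = pattern.search(text, first.end())
print(second.span())
```

In practice `re.findall` or `re.finditer` are usually more convenient when all occurrences are needed.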


## re.findall

In [14]:
text = "SNPs unaccounted for in SNPs. The catalog and were excluded."

In [15]:
re.findall('SNPs', text)

['SNPs', 'SNPs']

## re.finditer

In [16]:
re.finditer('SNPs',text)

<callable_iterator at 0x110bdaa40>

In [17]:
for match in re.finditer('SNPs',text):
    print(match.group(), match.span())

SNPs (0, 4)
SNPs (24, 28)


## re.sub

In [18]:
re.sub('SNPs', 'SNP', text)

'SNP unaccounted for in SNP. The catalog and were excluded.'
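
`re.sub` has a few more options that are often handy. The sketch below assumes the same kind of sentence as above and shows the `count` argument, a function used as the replacement, and `re.subn`, which also reports the number of substitutions. The helper `to_upper` is a hypothetical name chosen for this example.

```python
import re

text = "SNPs unaccounted for in SNPs."

# Replace only the first occurrence.
only_first = re.sub('SNPs', 'SNP', text, count=1)
print(only_first)

# The replacement can be a function that receives each match object.
def to_upper(match):
    return match.group().upper()

shouted = re.sub('SNPs', to_upper, text)
print(shouted)

# re.subn returns the new string together with the number of replacements.
new_text, n_replaced = re.subn('SNPs', 'SNP', text)
print(new_text, n_replaced)
```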

## re.split

In [19]:
re.split('\.', text)

['SNPs unaccounted for in SNPs', ' The catalog and were excluded', '']
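
The output above contains a leading space and an empty string at the end. A possible way to tidy this up, sketched under the assumption that we want clean sentences, is to include the trailing whitespace in the delimiter and to drop empty items. The `maxsplit` argument limits the number of splits.

```python
import re

text = "SNPs unaccounted for in SNPs. The catalog and were excluded."

# Split at a full stop plus any whitespace that follows it.
parts = re.split(r'\.\s*', text)
print(parts)

# Drop the empty string produced by the final full stop.
sentences = [part for part in parts if part]
print(sentences)

# Split at runs of spaces or full stops, but at most twice.
head = re.split(r'[ .]+', text, maxsplit=2)
print(head)
```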

## Regular expression metacharacters and quantifiers

* `\d`  matches a digit
* `\D`  matches a non digit
* `\w`  matches an alphanumeric character or an underscore
* `\W`  matches any character that is not alphanumeric or an underscore
* `\s`  matches a whitespace
* `\S`  matches non-whitespace characters
* `\b`  matches a word boundary

* `.` matches any single character apart from a newline
* `^` start of the string
* `$` end of the string
* `*` zero or more times the searched pattern
* `+` one or more times the searched pattern
* `?` zero or one time the searched pattern
* `{}` specified number of times the searched pattern needs to appear, e.g. `{3}` or `{2,5}`
* `\` escapes a metacharacter, removing its special function
* `[]` a set of characters, any one of which can match
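
Two of the entries above are easy to misread, so here is a small sketch of the `{}` quantifier and of the `.` metacharacter. The string `sample` is made up for this illustration.

```python
import re

sample = "1 22 333 4444\nnext line"

# Exactly three digits; note that '4444' contributes its first three digits.
three = re.findall(r'\d{3}', sample)
print(three)

# Two or three digits as a whole number, using word boundaries.
two_or_three = re.findall(r'\b\d{2,3}\b', sample)
print(two_or_three)

# '.' matches a space ...
with_space = re.findall(r'1.2', sample)
print(with_space)

# ... but not a newline, unless the re.DOTALL flag is given.
no_newline = re.findall(r'4.n', sample)
with_dotall = re.findall(r'4.n', sample, flags=re.DOTALL)
print(no_newline, with_dotall)
```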

In [4]:
text = '''57614 matching loci, 261 contained no verified haplotypes
0 loci contained SNPs unaccounted for in the catalog and were excluded
74934 total haplotypes examined from matching loci, 74673 verified.
'''

In [21]:
re.findall('\d+', text) # finds all numbers in our text

['57614', '261', '0', '74934', '74673']

In [22]:
re.findall('\w+',text)

['57614',
 'matching',
 'loci',
 '261',
 'contained',
 'no',
 'verified',
 'haplotypes',
 '0',
 'loci',
 'contained',
 'SNPs',
 'unaccounted',
 'for',
 'in',
 'the',
 'catalog',
 'and',
 'were',
 'excluded',
 '74934',
 'total',
 'haplotypes',
 'examined',
 'from',
 'matching',
 'loci',
 '74673',
 'verified']

In [23]:
re.findall('\W+', text)

[' ',
 ' ',
 ', ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ', ',
 ' ',
 '.\n']

The quantifiers `*` and `+` are greedy: they match as much text as possible.

In [19]:
re.search('\d+','57614 matching loci')

<re.Match object; span=(0, 5), match='57614'>

In [24]:
re.search('\w+\s\w+','57614 matching')

<re.Match object; span=(0, 14), match='57614 matching'>

We can change their behaviour by appending a `?` to the quantifier. In that case we get the shortest possible match.

In [25]:
re.search('\d+?', '57614')

<re.Match object; span=(0, 1), match='5'>

In [26]:
re.search('\w+ \w+?','57614 matching')

<re.Match object; span=(0, 7), match='57614 m'>
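
The difference matters most when a delimiter occurs more than once. The following sketch uses a made-up sentence with two quoted words to contrast the greedy and the lazy version of the same pattern.

```python
import re

sentence = 'say "yes" or "no"'

# Greedy: runs from the first quote to the last quote in the string.
greedy = re.findall(r'".+"', sentence)
print(greedy)

# Lazy: stops at the nearest closing quote.
lazy = re.findall(r'".+?"', sentence)
print(lazy)
```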

In [27]:
re.findall('\s',text)

[' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '\n']

In [28]:
re.findall('\S', text)

['5',
 '7',
 '6',
 '1',
 '4',
 'm',
 'a',
 't',
 'c',
 'h',
 'i',
 'n',
 'g',
 'l',
 'o',
 'c',
 'i',
 ',',
 '2',
 '6',
 '1',
 'c',
 'o',
 'n',
 't',
 'a',
 'i',
 'n',
 'e',
 'd',
 'n',
 'o',
 'v',
 'e',
 'r',
 'i',
 'f',
 'i',
 'e',
 'd',
 'h',
 'a',
 'p',
 'l',
 'o',
 't',
 'y',
 'p',
 'e',
 's',
 '0',
 'l',
 'o',
 'c',
 'i',
 'c',
 'o',
 'n',
 't',
 'a',
 'i',
 'n',
 'e',
 'd',
 'S',
 'N',
 'P',
 's',
 'u',
 'n',
 'a',
 'c',
 'c',
 'o',
 'u',
 'n',
 't',
 'e',
 'd',
 'f',
 'o',
 'r',
 'i',
 'n',
 't',
 'h',
 'e',
 'c',
 'a',
 't',
 'a',
 'l',
 'o',
 'g',
 'a',
 'n',
 'd',
 'w',
 'e',
 'r',
 'e',
 'e',
 'x',
 'c',
 'l',
 'u',
 'd',
 'e',
 'd',
 '7',
 '4',
 '9',
 '3',
 '4',
 't',
 'o',
 't',
 'a',
 'l',
 'h',
 'a',
 'p',
 'l',
 'o',
 't',
 'y',
 'p',
 'e',
 's',
 'e',
 'x',
 'a',
 'm',
 'i',
 'n',
 'e',
 'd',
 'f',
 'r',
 'o',
 'm',
 'm',
 'a',
 't',
 'c',
 'h',
 'i',
 'n',
 'g',
 'l',
 'o',
 'c',
 'i',
 ',',
 '7',
 '4',
 '6',
 '7',
 '3',
 'v',
 'e',
 'r',
 'i',
 'f',
 'i',
 'e',
 'd']

Combining the above we can create complex patterns

In [29]:
text

'57614 matching loci, 261 contained no verified haplotypes\n0 loci contained SNPs unaccounted for in the catalog and were excluded\n74934 total haplotypes examined from matching loci, 74673 verified.\n'

In [30]:
re.findall('\d+\s\w+',text)

['57614 matching', '261 contained', '0 loci', '74934 total', '74673 verified']

In [32]:
re.findall('\d?\s\w+',text)

['4 matching',
 ' loci',
 ' 261',
 ' contained',
 ' no',
 ' verified',
 ' haplotypes',
 '\n0',
 ' loci',
 ' contained',
 ' SNPs',
 ' unaccounted',
 ' for',
 ' in',
 ' the',
 ' catalog',
 ' and',
 ' were',
 ' excluded',
 '\n74934',
 ' total',
 ' haplotypes',
 ' examined',
 ' from',
 ' matching',
 ' loci',
 ' 74673',
 ' verified']

In [33]:
re.findall('\d\s\w*', text)

['4 matching', '1 contained', '0 loci', '4 total', '3 verified']

In [34]:
re.findall('^\d+', text) 

['57614']

The functions we saw accept flags that alter their behaviour. Two of the most useful are `re.M`, which makes `^` and `$` match at the start and end of every line instead of only at the start and end of the whole string, and `re.I`, which ignores the case of the searched string.

In [5]:
text

'57614 matching loci, 261 contained no verified haplotypes\n0 loci contained SNPs unaccounted for in the catalog and were excluded\n74934 total haplotypes examined from matching loci, 74673 verified.\n'

In [6]:
re.findall('snps',text)

[]

In [7]:
re.findall('snps', text, flags=re.I)

['SNPs']
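
The `re.M` flag can be sketched in the same way. The example below reuses the three-line text from above; flags are combined with `|`.

```python
import re

text = '''57614 matching loci, 261 contained no verified haplotypes
0 loci contained SNPs unaccounted for in the catalog and were excluded
74934 total haplotypes examined from matching loci, 74673 verified.
'''

# Without re.M, '^' matches only at the very start of the string.
first_only = re.findall(r'^\d+', text)
print(first_only)

# With re.M, '^' matches at the start of every line.
every_line = re.findall(r'^\d+', text, flags=re.M)
print(every_line)

# Flags can be combined: multiline and case-insensitive.
combined = re.findall(r'^\d+ LOCI', text, flags=re.M | re.I)
print(combined)
```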

Another important concept is escaping. As we saw, there are special characters like `.`. What can we do if we want to match them literally in the searched pattern? We put a `\` in front of them.

In [31]:
re.findall('\w+.', text) # was aiming to find the last word of each sentence

['57614 ',
 'matching ',
 'loci,',
 '261 ',
 'contained ',
 'no ',
 'verified ',
 'haplotypes',
 '0 ',
 'loci ',
 'contained ',
 'SNPs ',
 'unaccounted ',
 'for ',
 'in ',
 'the ',
 'catalog ',
 'and ',
 'were ',
 'excluded',
 '74934 ',
 'total ',
 'haplotypes ',
 'examined ',
 'from ',
 'matching ',
 'loci,',
 '74673 ',
 'verified.']

In [9]:
text

'57614 matching loci, 261 contained no verified haplotypes\n0 loci contained SNPs unaccounted for in the catalog and were excluded\n74934 total haplotypes examined from matching loci, 74673 verified.\n'

In [8]:
re.findall('\w+\.', text)

['verified.']
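
When the text to search for comes from somewhere else and may contain several metacharacters, escaping by hand becomes error prone. A possible approach is `re.escape`, which escapes them for us. The strings below are made up for this sketch.

```python
import re

literal = '4,000 (approx.)'
sentence = 'about 4,000 (approx.) adults'

# Used as-is, the parentheses form a group and '.' is a wildcard, so nothing matches.
unescaped = re.findall(literal, sentence)
print(unescaped)

# re.escape puts a backslash in front of every special character.
escaped = re.findall(re.escape(literal), sentence)
print(escaped)
```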

We can also combine patterns using the OR `|` symbol 

In [10]:
re.findall('loci|haplotypes', text)

['loci', 'haplotypes', 'loci', 'haplotypes', 'loci']

Another useful tool is `[]`, which matches any single character from a set of characters.

In [11]:
re.findall('[Abcd]', 'With the wind, they hear us coming')

['d', 'c']

In [12]:
re.findall('[a-z]+', 'With the wind, they hear us coming')

['ith', 'the', 'wind', 'they', 'hear', 'us', 'coming']

In [13]:
re.findall('[a-zåöä]+', 'With the wind, they hear us coming åöä')

['ith', 'the', 'wind', 'they', 'hear', 'us', 'coming', 'åöä']

In [14]:
re.findall('[A-Za-z]+','With the wind, they hear us coming')

['With', 'the', 'wind', 'they', 'hear', 'us', 'coming']

Pay attention to the `^` symbol when it is the first character inside the square brackets. In that case the set is negated: the pattern matches any character that is not listed in the square brackets.

In [15]:
re.findall('[^A-Qa-q]+','With the wind, they hear us coming')

['W', 't', ' t', ' w', ', t', 'y ', 'r us ']
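
Character sets are a natural fit for sequence data. The sketch below, which assumes a plain DNA alphabet of `A`, `C`, `G` and `T` and uses made-up sequences, checks a sequence with a set and reports unexpected characters with a negated set.

```python
import re

# A negated set finds every character outside the DNA alphabet.
invalid = re.findall('[^ACGT]', 'ACGTNNACGX')
print(invalid)

# re.fullmatch succeeds only if the whole string consists of the set.
is_valid = re.fullmatch('[ACGT]+', 'ACGTTGCA') is not None
print(is_valid)

# Inside a set most metacharacters, such as '.', lose their special meaning.
punctuation = re.findall('[.,]', 'a.b,c')
print(punctuation)
```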

## Using groups in our patterns

Using `()` we can capture the part of the match that falls inside the parentheses.

In [42]:
match = re.search('\w+\s(\w+).*','With the wind, they hear us coming')
match

<re.Match object; span=(0, 34), match='With the wind, they hear us coming'>

In [43]:
match.groups()

('the',)

In [44]:
text = '''57614 matching loci, 261 contained no verified haplotypes.
  0 loci contained SNPs unaccounted for in the catalog and were excluded.'''

In [45]:
match = re.search('(\d+)\s\w+.*?(\d+).*',text)
match

<re.Match object; span=(0, 58), match='57614 matching loci, 261 contained no verified ha>

In [46]:
match.groups()

('57614', '261')

In [47]:
match.group(1)

'57614'

In [48]:
match.group(2)

'261'

In [49]:
match.group(0)

'57614 matching loci, 261 contained no verified haplotypes.'

In [43]:
match.group()

'57614 matching loci, 261 contained no verified haplotypes.'

While the match object returned by `re.search` gives access to the entire match, `re.findall` returns only the captured groups when the pattern contains groups.

In [51]:
re.findall('(\d{3,5})\s\w+.*?',text)

['57614', '261']

We can name groups using the `(?P<name>...)` syntax

In [52]:
match = re.search('(?P<valid>\d+)\s\w+.*?(?P<invalid>\d+).*',text)
match

<re.Match object; span=(0, 58), match='57614 matching loci, 261 contained no verified ha>

In [53]:
match.groups()

('57614', '261')

In [54]:
match.group('valid')

'57614'

In [55]:
match.group('invalid')

'261'
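
Named groups can also be collected in one step with `groupdict`. The following sketch assumes we want the two counts as integers; the dictionary name `counts` is illustrative. It also shows that `re.findall` returns tuples when the pattern has more than one group.

```python
import re

text = '57614 matching loci, 261 contained no verified haplotypes.'

match = re.search(r'(?P<valid>\d+)\s\w+.*?(?P<invalid>\d+)', text)

# All named groups as a dictionary.
named = match.groupdict()
print(named)

# Captured values are strings, so convert them before doing arithmetic.
counts = {name: int(value) for name, value in named.items()}
print(counts['valid'] - counts['invalid'])

# With several groups, re.findall returns one tuple per match.
pairs = re.findall(r'(\d+)\s(\w+)', text)
print(pairs)
```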

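We can also backreference the captured groups. In the following attempt the replacement is a normal string, so Python converts `\1` and `\2` to the control characters `\x01` and `\x02` before `re.sub` ever sees them.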

In [56]:
re.sub('(\d+)\s\w+.*?(\d+).*', 'We have \1 valid loci and \2 invalid loci', 
       '57614 matching loci, 261 contained no verified haplotypes.')

'We have \x01 valid loci and \x02 invalid loci'

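This is a good point to introduce the `r` prefix. It is not a metacharacter but a feature of Python strings: it creates a raw string, in which backslashes are kept as they are.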

In [57]:
re.sub('(\d+)\s\w+.*?(\d+).*', 
       r'We have \1 valid loci and \2 invalid loci', 
       '57614 matching loci, 261 contained no verified haplotypes.')

'We have 57614 valid loci and 261 invalid loci'

In the case of named groups we need to use the `\g<name>` syntax

In [58]:
re.sub('(?P<valid>\d+)\s\w+.*?(?P<invalid>\d+).*', 
       r'We have \g<valid> valid loci and \g<invalid> invalid loci', 
       '57614 matching loci, 261 contained no verified haplotypes.')

'We have 57614 valid loci and 261 invalid loci'
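
The effect of the `r` prefix can be checked directly on the strings, without any regular expression involved. The sketch below also shows why raw strings matter for patterns: in a normal string `\b` is a backspace character, not a word boundary.

```python
import re

# In a normal string '\1' is a single control character.
print(len('\1'), len(r'\1'))

# A raw string is equivalent to doubling every backslash.
print('\\1' == r'\1')

# '\b' is a backspace in a normal string, so the pattern finds nothing.
normal = re.findall('\bloci\b', 'matching loci, 261')
raw = re.findall(r'\bloci\b', 'matching loci, 261')
print(normal, raw)
```

For this reason it is a common habit to write every pattern as a raw string.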

## Look arounds

There are cases where we want our pattern to match only if it is followed by a particular pattern (look ahead) or, similarly, only if it is preceded by a particular pattern (look behind). The look around itself is not included in the match.

In [59]:
text

'57614 matching loci, 261 contained no verified haplotypes.\n  0 loci contained SNPs unaccounted for in the catalog and were excluded.'

Let us start with look ahead

In [60]:
re.findall('(\d+)\s.*?',text)

['57614', '261', '0']

In [61]:
re.findall('(\d+)\s(?=matching).*?',text)

['57614']

In [62]:
re.findall('(\d+)\s(?!matching).*',text)

['261', '0']

Now let's see some examples with look behind

In [63]:
re.findall('(?<=\s)(\d+)',text)

['261', '0']

The following is a bit more tricky: `261` is preceded by a space, so the match can only start at `6`, which is preceded by `2`.

In [64]:
re.findall('(?<!\s)(\d+)',text)

['57614', '61']

In [84]:
re.findall('(?<=\d)\s(\w+)',text)

['matching', 'contained', 'loci']

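A tricky part with look behinds is that, in the `re` module, the pattern inside them must have a fixed width. Variable-length quantifiers such as `+`, `*` or `{2,}` are therefore not allowed.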

In [65]:
re.findall('(?<=\d{2,})(\d+)',text)

error: look-behind requires fixed-width pattern
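
A quantifier with a single fixed count, such as `{2}`, is still accepted because its width is known. The sketch below uses a shortened version of the text and catches the error of the variable-width version so that the cell keeps running; the variable names are illustrative.

```python
import re

text = '57614 matching loci, 261 contained'

# Fixed width: digits that are preceded by exactly two digits.
fixed = re.findall(r'(?<=\d{2})\d+', text)
print(fixed)

# Variable width: the pattern is rejected when it is compiled.
try:
    re.findall(r'(?<=\d{2,})\d+', text)
except re.error as err:
    message = str(err)
    print(message)
```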

## Exercises

### Exercise 1

Using the text below:

* Extract all numbers
* Extract only the number referring to a date
* Insert after `Dunedin` the following `(Scottish Gaelic for Edinburgh)`
* Create a list with all the unique words (including numbers) in the text
* Create a list with all the unique non-alphanumeric characters of the text
* Extract all the instances of `Hoiho` whether it includes uppercase letters or not
* Extract the first word that follows a number
* Extract the word just before a `,`
* Using the `.` as delimiter create a list containing all the sentences of the text
* Using the `,` as delimiter create a list containing the relevant text sections. Make sure you don't split at the commas found in the numbers (look arounds could be handy)


In [68]:
bbc_text = '''Sometimes, saving a species means treating one animal at a time. The veterinarians at The Wildlife Hospital, Dunedin do just that, going small to go big by caring exclusively for native animals. Headquartered close to the wildlife-rich Otago Peninsula on New Zealand's South Island, the hospital is ideally placed to help where it's most needed. Hoiho are among the world's most endangered penguin species, with just an estimated 4,000 to 5,000 adults left in the wild, and they arrive at the hospital for a variety of reasons including starvation, injury and disease.
But each animal has a better chance at survival than ever before, thanks to the combined efforts of The Wildlife Hospital and Penguin Place, a nearby recovery home that has been helping the hoiho since the 1990s.'''
bbc_text

"Sometimes, saving a species means treating one animal at a time. The veterinarians at The Wildlife Hospital, Dunedin do just that, going small to go big by caring exclusively for native animals. Headquartered close to the wildlife-rich Otago Peninsula on New Zealand's South Island, the hospital is ideally placed to help where it's most needed. Hoiho are among the world's most endangered penguin species, with just an estimated 4,000 to 5,000 adults left in the wild, and they arrive at the hospital for a variety of reasons including starvation, injury and disease.\nBut each animal has a better chance at survival than ever before, thanks to the combined efforts of The Wildlife Hospital and Penguin Place, a nearby recovery home that has been helping the hoiho since the 1990s."

### Exercise 2

In [88]:
alignment = '''2742497 reads; of these:
  2742497 (100.00%) were paired; of these:
    1510621 (55.08%) aligned concordantly 0 times
    903271 (32.94%) aligned concordantly exactly 1 time
    328605 (11.98%) aligned concordantly >1 times
    ----
    1510621 pairs aligned concordantly 0 times; of these:
      215924 (14.29%) aligned discordantly 1 time
    ----
    1294697 pairs aligned 0 times concordantly or discordantly; of these:
      2589394 mates make up the pairs; of these:
        1091850 (42.17%) aligned 0 times
        678263 (26.19%) aligned exactly 1 time
        819281 (31.64%) aligned >1 times
80.09% overall alignment rate
'''

* Extract the number of paired reads
* Extract the overall alignment rate
* Extract the number of reads aligned exactly 1 time

### Exercise 3

In [85]:
import pandas as pd
snp_array = pd.read_csv('SNP_data.txt')
snp_array.head()                        

Unnamed: 0,Sequence
0,ACCTAATGCACACCCAGCAGGTTATGGGGG[C/T]GCAGTTAGGTC...
1,CCTGCTGGTGTGCTGTGCCATCTGGACTCA[A/G]AGAAACACCAG...
2,TGTGGGCTGTCTGATCAGGCTGTTCTTCAG[C/T]ATGTGGAACAT...
3,AAACGGAACTTGT[A/T]AGGAACACCTCATGCATCTGATATTACA...
4,AATGACTAAAGAAAACTGTGTGTGCAAAATGCATT[C/G]CTGAGG...


In the DNA sequences above, single nucleotide polymorphisms (SNPs) are inside the square brackets

* Extract the SNPs
* Extract 10 bases upstream of each SNP
* Extract 10 bases downstream of each SNP
* Extract 10 bases upstream of each SNP only if the first base before the SNP is not a `G`