# Regular Expressions in Python

"Regular expressions" are used to identify patterns in text. Python provides an extensive library of functions for this purpose in the module ```re```. The syntax of regular expressions is quite distinct from the rest of Python and acts as a sort of mini-language within it.

The types of questions you can address with regular expressions include: "Does this string match a pattern?", "Where and how often does this pattern appear in a string?". You can also use regular expressions to split or extract parts of a string.

For a more thorough description with examples check out the documentation:

https://docs.python.org/3/howto/regex.html
https://docs.python.org/3.3/library/re.html


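There are many ways to work with regular expressions. One common approach is to first "compile" the pattern into a pattern object, which can then be used to search a string. Alternatively, the pattern can be kept as a plain (raw) string and passed directly to the module-level functions in ```re```, which compile it behind the scenes. The two options look like this: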

In [1]:
import re

motif = re.compile("CCTCGA|GCTCGA")
print(type(motif))

motif = r"CCTCGA|GCTCGA"
print(type(motif))

<class '_sre.SRE_Pattern'>
<class 'str'>
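As a minimal sketch of how the two options are used in practice, the cell below searches the same string once through a compiled pattern object and once through the module-level ```re.search``` function. The short sequence and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

seq = "GTGCCCCTCGAGAGG"  # short illustrative sequence (an assumption)

# Option 1: compile once, then call methods on the pattern object
compiled = re.compile(r"CCTCGA|GCTCGA")
hit_compiled = compiled.search(seq)

# Option 2: pass the pattern string straight to a module-level function
hit_direct = re.search(r"CCTCGA|GCTCGA", seq)

# Both routes find the same match
print(hit_compiled.span(), hit_direct.span())
```

Compiling is mostly worthwhile when the same pattern is reused many times; for a one-off search the module-level function is usually enough.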


In [2]:
motif = re.compile("CCTCGA|GCTCGA")
print(dir(motif))

['__class__', '__copy__', '__deepcopy__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'findall', 'finditer', 'flags', 'fullmatch', 'groupindex', 'groups', 'match', 'pattern', 'scanner', 'search', 'split', 'sub', 'subn']


In regex, the pipe character means "or". Other regex-specific syntax includes square brackets, which mean "one of the enclosed characters":

In [None]:
motif = re.compile('[CG]CTCGA')
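A quick check, under the assumption of a small made-up sequence, suggests that the character class and the spelled-out alternation describe the same motif:

```python
import re

seq = "AACCTCGATTGCTCGAAA"  # made-up sequence containing both variants

# Alternation: either of two full alternatives
alternation_hits = re.findall("CCTCGA|GCTCGA", seq)

# Character class: C or G at the first position, then a fixed tail
char_class_hits = re.findall("[CG]CTCGA", seq)

print(alternation_hits)
print(char_class_hits)
```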

The compiled regex can now be used to search a string for a match:

In [None]:
import re

#motif = re.compile('[CG]CTCGA')
motif = re.compile('CCTCGA|GCTCGA')

mySeq = "GTGCCCCTCGAGAGGAGGGCGCGCGCCGCGCGCTCGACGCGATCGGCGCTCAGCGAGCGAGCTCCTCGAAGCGATCCGCGCGCGCT"

m = motif.search(mySeq)

print(type(motif))
print(type(m))
print(dir(m))

In [None]:
print(m)
print(m.span())
print(m.start())
print(m.end())

In [None]:
print(m.group())
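When the pattern is not found, ```search``` returns ```None``` rather than a match object, so calling ```span()``` or ```group()``` on the result would fail. A minimal sketch of guarding against that case (the helper name ```describe_first_hit``` is hypothetical):

```python
import re

motif = re.compile("[CG]CTCGA")

def describe_first_hit(seq):
    """Return a short description of the first motif hit, or 'no match'."""
    m = motif.search(seq)
    if m is None:
        # search() found nothing, so there is no match object to query
        return "no match"
    return f"{m.group()} at {m.start()}-{m.end()}"

print(describe_first_hit("TTGCTCGA"))
print(describe_first_hit("AAAA"))
```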

The ```search``` method is helpful, but it clearly isn't finding all of the matches for our motif/pattern: it only returns the first match. To find all matches, use the ```findall``` method:

In [None]:
import re

motif = re.compile('[CG]CTCGA')

mySeq = "GTGCCCCTCGAGAGGAGGGCGCGCGCCGCGCGCTCGACGCGATCGGCGCTCAGCGAGCGAGCTCCTCGAAGCGATCCGCGCGCGCT"

m = motif.findall(mySeq)
    
print(type(m))
print(type(m[0]))
    
print(m)
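One detail worth keeping in mind is that ```findall``` scans from left to right and reports non-overlapping matches only. The toy example below (an illustration, not taken from the sequence above) shows the effect:

```python
import re

# "AA" could start at positions 0, 1 and 2 of "AAAA",
# but only the non-overlapping hits at 0 and 2 are reported
hits = re.findall("AA", "AAAA")

print(hits)
print(len(hits))  # number of non-overlapping matches
```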

The ```findall``` method does indeed find all of the matches. However, it only returns a list of the matching subsequences and, unlike the ```search``` method, does not return the positions of the matches. The method that gives us both is ```finditer```:

In [1]:
import re

#motif = re.compile('[CG]CTCGA')
motif = re.compile('CCTCGA|GCTCGA')
#motif = re.compile('CC+T')

mySeq = "GTGCCCCTCGAGAGGAGGGCGCGCGCCGCGCGCTCGACGCGATCGGCGCTCAGCGAGCGAGCTCCTCGAAGCGATCCGCGCGCGCT"

m = motif.finditer(mySeq)
    
print(type(m))

<class 'callable_iterator'>


In [2]:
for match in m:
    
    print(type(match))
    print(match.span())
    print(match.group())

<class '_sre.SRE_Match'>
(5, 11)
CCTCGA
<class '_sre.SRE_Match'>
(31, 37)
GCTCGA
<class '_sre.SRE_Match'>
(63, 69)
CCTCGA


In [None]:
for match in m:
    
    print(str(match.span()[0]) + "\t" + str(match.span()[1]) + "\t" + match.group())

GTGCC**CCTCGA**GAGGAGGGCGCGCGCCGCGC**GCTCGA**CGCGATCGGCGCTCAGCGAGCGAGCT**CCTCGA**AGCGATCCGCGCGCGCT

In [None]:
# recall that iterator objects can be "exhausted"
m = motif.finditer(mySeq)

for match in m:
    
    print(str(match.span()[0]) + "\t" + str(match.span()[1]) + "\t" + match.group())

If we want to know the number of matches from finditer():

In [None]:
print(sum(1 for _ in motif.finditer(mySeq)))

In [None]:
print(sum(1 for _ in re.finditer(r"CCTCGA|GCTCGA", mySeq)))

Note that iterating over a callable iterator exhausts it, so the loop after the count below prints nothing:

In [None]:
motif = re.compile('CC+T')

mySeq = "GTGCCCCTCGAGAGGAGGGCGCGCGCCGCGCGCTCGACGCGATCGGCGCTCAGCGAGCGAGCTCCTCGAAGCGATCCGCGCGCGCT"

m = motif.finditer(mySeq)

print(sum(1 for _ in m))

for match in m:
    
    print(match.span())
    print(match.group())
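If the matches are needed more than once, one option is to store them in a list before doing anything else. The sketch below uses a short made-up sequence to show both the exhaustion and the workaround:

```python
import re

seq = "ACCCTGCCTA"  # made-up sequence with two CC+T hits

m = re.finditer("CC+T", seq)
first_pass = [x.group() for x in m]   # consumes the iterator
second_pass = [x.group() for x in m]  # nothing left to iterate over

print(first_pass)
print(second_pass)

# Workaround: keep the match objects in a list, which can be reused freely
matches = list(re.finditer("CC+T", seq))
print(len(matches))
print([x.span() for x in matches])
```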

There is a concept of "groups" in Python regex. A regular expression can be sub-divided into components or "groups" using round brackets. For example, we can define a regex of three groups, where the first group matches a specific string, the second group matches a whitespace character, and the third group matches any sequence of word characters (letters, digits and underscores):

In [None]:
r"(\W)(\w+)"  # a pattern with two groups: one non-word character, then one or more word characters

In [None]:
MyRe = r"(Arabidopsis)(\s)(\w+)"

myString = "Arabidopsis thaliana"

myMatch = re.search(MyRe, myString)

print(myMatch)
print(type(myMatch))

We can access the groups separately using the group() method on the match object (reported as SRE_Match in older Python versions and as re.Match in newer ones):

In [None]:
# the entire match (group 0):
print(myMatch.group(0))

# print the first group:
print(myMatch.group(1))

# print the second group:
print(myMatch.group(2))

# print the third group:
print(myMatch.group(3))
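Besides ```group(n)```, match objects also offer ```groups()```, which returns all captured groups at once. Groups can additionally be given names with the ```(?P<name>...)``` syntax. The cell below is a small sketch of both; the group names ```genus``` and ```species``` and the second pattern are illustrative choices, not part of the original example.

```python
import re

my_match = re.search(r"(Arabidopsis)(\s)(\w+)", "Arabidopsis thaliana")

# groups() returns every captured group as a tuple, in order
genus, gap, species = my_match.groups()
print(genus, species)

# Named groups can be looked up by name instead of by number
named = re.search(r"(?P<genus>[A-Z][a-z]+)\s(?P<species>[a-z]+)",
                  "Arabidopsis thaliana")
print(named.group("genus"), named.group("species"))
```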

## Overview of re methods

Method | Description
-------------------|----------------
re.search() | Return the first match of a pattern anywhere in the string (or None)
re.findall() | Return a list of all matching subsequences
re.finditer() | Similar to findall(), but returns an iterator of match objects with position data, as per search()
re.match() | Like re.search(), but the pattern must match at the beginning of the string (use re.fullmatch() to require the entire string to match)
re.sub(query, replacement, string) | Make *all* substitutions in a string
re.split() | Split a string according to a pattern
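The differences between ```search```, ```match``` and ```fullmatch``` are easiest to see side by side. The following sketch uses made-up date-like strings and also shows ```split``` with a pattern:

```python
import re

pattern = r"\d{2}"  # exactly two digits

# search() looks anywhere in the string
found_anywhere = re.search(pattern, "on 21 Aug") is not None

# match() only succeeds if the pattern matches at the start
at_start_fail = re.match(pattern, "on 21 Aug") is None
at_start_ok = re.match(pattern, "21 Aug") is not None

# fullmatch() requires the whole string to match
whole_fail = re.fullmatch(pattern, "21 Aug") is None
whole_ok = re.fullmatch(pattern, "21") is not None

print(found_anywhere, at_start_fail, at_start_ok, whole_fail, whole_ok)

# split() on one or more whitespace characters
parts = re.split(r"\s+", "21 Aug   1234")
print(parts)
```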

In [None]:
print(type(re))

In [None]:
MyString = "23rd May 2000"
re.sub("May", "July", MyString)

In [None]:
MyString = "Well hello hello hello"
print(re.sub("hello", "ciao", MyString, 1))
print(re.sub("hello", "ciao", MyString, 2))

Groups can be referenced using (yet another) special syntax:

In [None]:
# general form (placeholders only): re.sub(r"(group1)(group2)", r"\1", string)

In [None]:
re.sub(r"(\d+\.\d{3})(\d+)", r"\1", "34.73322532")

In [None]:
re.sub(r"(\d+\.\d{3})(\d+)", r"\2", "34.73322532")

Removing a duplicated word:

In [None]:
re.sub(r"(\b[a-z]+)( \1)", r"\1", "cat in the the hat")

In [None]:
re.sub(r"(\b[a-z])( \1)", r"\1", "a a cat in the the hat")
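The patterns above have no word boundary after the back-reference, so they can also fire when a word is merely followed by a longer word that starts with it. A possible refinement, offered here as a sketch rather than as the notebook's intended solution, adds a trailing ```\b```:

```python
import re

loose = r"(\b[a-z]+)( \1)"     # pattern used above
strict = r"\b([a-z]+)( \1\b)"  # assumed refinement: the repeat must be a whole word

# Both remove a genuinely duplicated word
fixed_loose = re.sub(loose, r"\1", "cat in the the hat")
fixed_strict = re.sub(strict, r"\1", "cat in the the hat")
print(fixed_loose)
print(fixed_strict)

# Only the loose pattern damages text that has no duplicate
damaged = re.sub(loose, r"\1", "the theory")
untouched = re.sub(strict, r"\1", "the theory")
print(damaged)
print(untouched)
```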

Matching at the beginning and end of a string:

In [None]:
MyFormat = r"(\d{2}\s\w{3}\s\d{4})"

string1 = "114 Aug 20166"
m1 = re.search(MyFormat, string1)
print(m1)

string2 = "21 Aug 1234"
m2 = re.search(MyFormat, string2)
print(m2)

In [None]:
MyFormat2 = r"(^\d{2}\s\w{3}\s\d{4})"

string1 = "114 Aug 20166"
m1v2 = re.search(MyFormat2, string1)
print(m1v2)

string2 = "21 Aug 1234"
m2v2 = re.search(MyFormat2, string2)
print(m2v2)

In [None]:
MyFormat3 = r"(\d{2}\s\w{3}\s\d{4}$)"

string1 = "114 Aug 20166"
m1v3 = re.search(MyFormat3, string1)
print(m1v3)

string2 = "21 Aug 1234"
m2v3 = re.search(MyFormat3, string2)
print(m2v3)
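The two anchors can also be combined so that the whole string has to fit the format. A minimal sketch, with the name ```strict_format``` and the third test string chosen for illustration:

```python
import re

# ^ anchors the start and $ anchors the end of the string
strict_format = r"^\d{2}\s\w{3}\s\d{4}$"

good = re.search(strict_format, "21 Aug 1234")
too_long_both = re.search(strict_format, "114 Aug 20166")
too_long_year = re.search(strict_format, "21 Aug 12345")

print(good)
print(too_long_both)
print(too_long_year)
```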

## Exercise 1

Have a closer look at the syntax for regular expression logic. For example, one can express "or" using the pipe character:

```python
motif = re.compile('[CG]CTCGA|GCGCGC')
```

Write a function that will find the occurrences and positions of the motif CAGCCGCG in the following gapped sequences, and return the gapped position of each match (*i.e.* don't cheat and simply ignore the gaps):

CCA--G-C---A--GCCG---C-GG--TA-AT

CGCA--G-C---A--GCCG---C-GG--TA-AT

TGCA--G-C---A--GCCG---C-GG--TA-AT

---CTTGCTCGT---C---A------GC-CGC----------G-----

## Exercise 2

Revisit Exercise 12 from the "Working with Files" notebook. Use regular expressions to print the gene identifier in the FASTA formatted UTR sequences. Your script should produce the following:

<pre>
>ENSMUSG00000026565
CTGCTGCTGCTGCCGCCGCGGCTGAACTCTCATCTTGACTCGCTGCTC
>ENSMUSG00000026565
CTGCTGCTGCTGCCGCCGCGGCTGAACTCTCATCTTGACTCGCTGCTC
>ENSMUSG00000026565
CTGCTGCTGCTGCCGCCGCGGCTGAACTCTCATCTTGACTCGCTGCTCCTCC
>ENSMUSG00000026565
CTGCTGCTGCTGCCGCCGCGGCTGAACTCTCATCTTGACTCGCTGCTCCTCCGTCCG
>ENSMUSG00000026565
TTTGAAT
>ENSMUSG00000026565
TTTGAATATTTTAACCAAAATCGCCCGGTCGATAAACCCTCCCTCGCTCCCGCTCC
>ENSMUSG00000026564
TATTTGACAGCATATCAATTTAATTGAAAAGAAGCCATGATAGTCAAGCATTGGCAGGGAGAGGCCAGAATCACAACAGAATGGATCACCTGGGTCTCATTGCGAAACCTCAATGAAGAGAACCACAGCCTGCAAGTCAGGTCTTTCTCAGTGGCTTTGTAGACTCACTTCTCCACTCTGTGGTGGACACTACCAAATGCAGAAGGGAAACAGGAAAGCTTAGAAGGGAAACAGGAAAGCTTAGAGGGCCTTTGGTGATAGAAAGTGTAAGCTGGTCCAGAATTTGGGGCTCGCATGAATGGCTCTTGTCTCTTCTCCACCCTACTGTCCAGGCCCTTTCTACTTGGATGCTTGCATTTTTGCCCATATGGAAGAGCCTGCATAACCCTTGGCAGGTCATGGTAAGCTGTTCCCAAGCCCAGTGTTAAATGGCTTCTTTCCGTGTTCCCAGTGTCTCCAAGGAGGGCCTCATTCCGC
>ENSMUSG00000026564
GACAGCATATCAATTTAATTGAAAAGAAGCCATGATAGTCAAGCATTGGCAGGGAGAGGCCAGAATCACAACAGAATGGATCACCTGGGTCTCATTGCGAAACCTCAATGAAGAGAACCACAGCCTGCAAGTCAGGTCTTTCTCAGTGGCTTTGTAGACTCACTTCTCCACTCTGTGGTGGACACTACCAAATGCAGAAGGGAAACAGGAAAGCTTAGAAGGGAAACAGGAAAGCTTAGAGGGCCTTTGGTGATAGAAAGTGTAAGCTGGTCCAGAATTTGGGGCTCGCATGAATGGCTCTTGTCTCTTCTCCACCCTACTGTCCAGGCCCTTTCTACTTGGATGCTTGCATTTTTGCCCATATGGAAGAGCCTGCATAACCCTTGGCAGGTCATGGTAAGCTGTTCCCAAGCCCAGTGTTAAATGGCTTCTTTCCGTGTTCCCAGTGTCTCCAAGGAGGGCCTCATTCCGC
</pre>

# Sources

- [Regular Expression HOWTO (Python doc)](https://docs.python.org/2/howto/regex.html#regex-howto)
- [Software Carpentry v4](http://software-carpentry.org/v4/regexp/index.html)
- [Haddock & Dunn. Practical Computing for Biologists. Sinauer Associates 2011.](http://practicalcomputing.org)
- [Python for Biologists](http://pythonforbiologists.com/index.php/introduction-to-python-for-biologists/regular-expressions/)

# Links

**Online tools to try regular expressions**  
- [http://regex101.com/](http://regex101.com/)   
- [http://www.regexr.com/](http://www.regexr.com/)   
- [https://www.debuggex.com/](https://www.debuggex.com/)  

**Cheat Sheets**  
- [CheatSheet from Practical Computing Biologists](http://practicalcomputing.org/files/PCfB_Appendices.pdf)

**Regular expression in other languages**  
- [in R](http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Functions_which_use_regular_expressions_in_R)  
- [using sed](http://www.grymoire.com/Unix/Sed.html#uh-4)

**More on regular expressions**
- [5 Videos on Software Carpentry](http://software-carpentry.org/v4/regexp/index.html)
- [Sequence Analysis with RegExps](http://www.dalkescientific.com/writings/NBN/slides/regexps.pdf)