# Regular Expressions

Regular expressions, also known as regexes, are a staple in virtually all programming languages.  They allow you to perform **string matching** (also called string searching), which generally involves searching for a smaller string referred to as a **pattern** in a string of equal or greater size (remember, a string is just a sequence of characters).

String matching algorithms are ubiquitous in bioinformatics and computational biology/chemistry.  Why?  Because DNA, RNA, proteins, and chemical structures are all represented to a computer as strings.  

## Regex in Python

While most programming languages provide functionality for regular expressions, the syntax differs between languages.  In Python, simple string matching/searching can be done with built-in string methods, and more flexible matching is handled by the `re` module.  This will be a relatively shallow dive into regex.  For a more thorough guide to regex in Python, see [this article](https://realpython.com/regex-python/).  There are also great resources out there like [this one](https://regexr.com) that will let you test your regexes.

Let's say you have the string 'Python is Fun!'.  You are pretty sure that the **substring** 'Fun' exists within the string, but you would like to confirm this.  If the substring exists, you would also like to know where it begins.  There are a couple of ways we can do this.

To confirm that 'Fun' is a substring of 'Python is Fun!', we can use `in`.

In [10]:
a_str = 'Python is Fun!'

'Fun' in a_str

True

Having confirmed that 'Fun' is in 'Python is Fun!', we can now figure out where in the string it begins.

In [9]:
a_str.index('Fun')

10

In [6]:
a_str.find('Fun')

10

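Note that the two methods behave differently when the substring is not in `a_str`: `index()` raises a `ValueError`, while `find()` returns -1.  Either way, the result tells you whether the substring is present, so running the `in` statement first is technically unnecessary.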

In [12]:
a_str.index('fun') # also note that the searching is case-sensitive

ValueError: substring not found
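As a quick check of how the two built-in methods behave when the substring is missing, here is a minimal sketch.  The variable names `missing_with_find` and `missing_with_index` are only for illustration.

```python
a_str = 'Python is Fun!'

# find() returns -1 when the substring is not present
missing_with_find = a_str.find('fun')

# index() raises ValueError instead, so it needs a try/except
try:
    missing_with_index = a_str.index('fun')
except ValueError:
    missing_with_index = None

print(missing_with_find)   # -1
print(missing_with_index)  # None
```

Which one to use is a matter of taste: `find()` is convenient for a quick test, while `index()` makes a missing substring impossible to ignore.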

The examples above are simple cases in which specific characters are compared exactly.  Sometimes this will be enough, but many times you will need more flexibility.

### The `re` Module

As a first example, let's say you again want to know if 'Fun' is a substring of 'Python is Fun!'.  But, this time, you would be satisfied if either 'fun' or 'Fun' is in your string.  With the `re` module, we can ask if either is present using `search()` and a regular expression.

When we use `search()`, the syntax will be to provide the **pattern** (our regular expression that we will be searching for) and then the string to be searched.

In [14]:
# first, we need to import the module
import re

# notice that we give the function the pattern and then the string
re.search('[Ff]un', a_str)

<re.Match object; span=(10, 13), match='Fun'>

As you can see, the function `search()` confirmed that our pattern is in our string `a_str` and also returned a match object telling us that the substring spans from position 10 to position 13 (remember, the end index of a slice is not included in Python) and that the match was 'Fun'.
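The pieces of a match object can also be pulled out individually.  The sketch below uses the match object's `span()`, `start()`, `end()`, and `group()` methods; the variable names are only for illustration.

```python
import re

a_str = 'Python is Fun!'
match = re.search('[Ff]un', a_str)

print(match.span())    # (10, 13)
print(match.start())   # 10
print(match.end())     # 13
print(match.group())   # Fun

# slicing the original string with the span gives back the matched text
start, end = match.span()
print(a_str[start:end])  # Fun
```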

Note that if our string had been slightly different, we would have gotten a different value for match:

In [15]:
b_str = 'Python is fun!'

re.search('[Ff]un', b_str)

<re.Match object; span=(10, 13), match='fun'>

This time, our match is 'fun' with a lower case f.  As you may have guessed, the `[Ff]` portion of the pattern we searched for means that either an uppercase or lowercase f is acceptable.  We use the square brackets `[ ]` to indicate a **character class**.  Basically, the square brackets say, "hey python/re, I want you to look for any of the characters in these brackets in the target string."  

Let's see what happens when we play around with the brackets.

In [17]:
re.search('[Ffun]', a_str)

<re.Match object; span=(5, 6), match='n'>

In [18]:
re.search('[Ffu]n', a_str)

<re.Match object; span=(11, 13), match='un'>

In [22]:
re.search('Ff[un]', a_str)

In [23]:
print(re.search('Ff[un]', a_str))

None


So, what's going on here?  

In the first example, we asked to find the character F, f, u or n in `a_str` and we got a match.  We actually matched the n at the end of 'Python' rather than the one in 'Fun.'  Why?  Because when `search()` finds a match, the function terminates and returns the match.  I will show you how to get all matches later on.

In the second example, we asked to find the character F, f, or u followed by the character n.  In other words, we were looking for Fn, fn, or un.  In this case, we again had a match, this time to the un in 'Fun'.

In the last example, we effectively searched for Ffu or Ffn.  Because our string did not contain either of those substrings, `search()` returned `None`, which the notebook does not display unless we print it.  If you haven't heard of the `None` type in Python, you can read up on it [here](https://realpython.com/null-in-python/).

In these examples, we illustrated that characters outside of the brackets must be matched exactly and that a set of brackets matches exactly one character from the class (unless additional special characters are used to repeat it).  We also learned that `search()` only returns the first match.
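Since `search()` returns `None` when nothing matches, it is usually worth checking the result before using it.  The following is a minimal sketch; `describe` is a hypothetical helper written for this example and is not part of the `re` module.

```python
import re

a_str = 'Python is Fun!'

# hypothetical helper, not part of the re module
def describe(pattern, text):
    match = re.search(pattern, text)
    if match is None:
        return 'no match'
    return f'{match.group()!r} at {match.span()}'

print(describe('[Ffun]', a_str))   # 'n' at (5, 6)
print(describe('[Ffu]n', a_str))   # 'un' at (11, 13)
print(describe('Ff[un]', a_str))   # no match
```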

There are some other special characters that we can use to match certain parts of a string:

In [24]:
# ^ looks for a pattern at the beginning of a string
re.search('^[Pp]ython', a_str)

<re.Match object; span=(0, 6), match='Python'>

In [25]:
# $ looks for a pattern at the end of a string
re.search('Fun!$', a_str)

<re.Match object; span=(10, 14), match='Fun!'>
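To see that `^` and `$` really do restrict where a match can occur, here is a short sketch using patterns that exist in the string but in the wrong place.

```python
import re

a_str = 'Python is Fun!'

# 'Fun' is in the string, but not at the beginning
starts_with_fun = re.search('^Fun', a_str)
print(starts_with_fun)     # None

# 'Python' is in the string, but not at the end
ends_with_python = re.search('Python$', a_str)
print(ends_with_python)    # None

# with both anchors, the pattern has to cover the whole string
whole = re.search('^Python is Fun!$', a_str)
print(whole.span())        # (0, 14)
```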

You can also match numbers and whitespace characters:

In [36]:
numbers = '\n 1 2 3 4 \n 5 6 7 8'

print(numbers)

# searches for a number from 0 to 9 inclusive
re.search('[0-9]', numbers)


 1 2 3 4 
 5 6 7 8


<re.Match object; span=(2, 3), match='1'>

In [35]:
# . matches any single character except a newline
re.search('.', numbers) # this one matched the first space!

<re.Match object; span=(1, 2), match=' '>
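The `re` module also has shorthand classes for digits and whitespace, which were not used above: `\d` for a digit and `\s` for a whitespace character.  The sketch below writes the patterns as raw strings (the `r` prefix) so that Python leaves the backslashes alone.

```python
import re

numbers = '\n 1 2 3 4 \n 5 6 7 8'

# \d matches a digit; for this string it behaves like [0-9]
first_digit = re.search(r'\d', numbers)
print(first_digit.span())          # (2, 3)

# \s matches a whitespace character, including the newline that . skips
first_space = re.search(r'\s', numbers)
print(first_space.span())          # (0, 1)
print(repr(first_space.group()))   # '\n'
```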

You can also use more than one set of square brackets in a single regex.

In [41]:
# searches for three digits each from 0 to 9
re.search('[0-9][0-9][0-9]', numbers)

Note that the above example returns `None` (so nothing is displayed) because the string never contains three digits in a row; there are spaces between the numbers.  If we wanted to match the first three numbers, we would have to include the spaces:

In [40]:
re.search('[0-9] [0-9] [0-9]', numbers)

<re.Match object; span=(2, 7), match='1 2 3'>

In [42]:
# this also works
re.search('[0-9].[0-9].[0-9]', numbers)

<re.Match object; span=(2, 7), match='1 2 3'>

I've shown you how to match any digit, so here is how you match any letter:

In [43]:
# matches any uppercase letter
re.search('[A-Z]', a_str)

<re.Match object; span=(0, 1), match='P'>

In [44]:
# matches any lowercase letter
re.search('[a-z]', a_str)

<re.Match object; span=(1, 2), match='y'>

In [46]:
# matches any letter at all (uppercase or lowercase)
re.search('[a-zA-Z]', a_str)

<re.Match object; span=(0, 1), match='P'>

### Functions other than `search()`

I told you earlier that I would show you how to match more than just the first occurrence of a pattern.  One option is `findall()`, which will return a list of every match:

In [47]:
re.findall('[a-z]n', a_str)

['on', 'un']

Another option is `finditer()`, which returns an iterator with match objects.  Here, I've turned the iterator into a list so you can better see what it's doing:

In [49]:
iter_list = list(re.finditer('[a-z]n', a_str))

print(iter_list)

[<re.Match object; span=(4, 6), match='on'>, <re.Match object; span=(11, 13), match='un'>]
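In practice, the iterator from `finditer()` is usually consumed in a loop rather than turned into a list.  A minimal sketch, with `starts` as an illustrative variable name:

```python
import re

a_str = 'Python is Fun!'

# each item is a match object, so we can ask it for its text and position
for match in re.finditer('[a-z]n', a_str):
    print(match.group(), match.span())
# on (4, 6)
# un (11, 13)

# collect just the starting positions of every match
starts = [m.start() for m in re.finditer('[a-z]n', a_str)]
print(starts)  # [4, 11]
```

Compared with `findall()`, this keeps the position of each match, not only its text.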


You can also use `sub()` with regular expressions to replace substrings with other strings.  For example:

In [50]:
re.sub('[Ff]un', 'lame', a_str)

'Python is lame!'

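Of course, Python is fun, not lame, so we had better fix our string.  Because `sub()` returns a new string and leaves `a_str` unchanged, we apply the fix to the string 'Python is lame!' directly.

In [51]:
re.sub('[a-z]{4}!', 'Fun!', 'Python is lame!')

'Python is Fun!'

In that example, we used regex to find any four lowercase letters followed by an exclamation point and replaced them with 'Fun!'.  The `{4}` means "exactly four of the preceding character class," and because the exclamation point is part of the match, it has to appear in the replacement as well.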
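To make the behavior of `sub()` and the `{4}` repetition explicit, here is a small sketch that stores each result in its own variable.  The names `lame_str` and `fixed_str` are only for illustration.

```python
import re

a_str = 'Python is Fun!'

# sub() returns a new string; a_str itself is unchanged
lame_str = re.sub('[Ff]un', 'lame', a_str)
print(a_str)      # Python is Fun!
print(lame_str)   # Python is lame!

# [a-z]{4} is shorthand for [a-z][a-z][a-z][a-z]
first_four = re.search('[a-z]{4}', lame_str)
print(first_four.group())   # ytho

# the '!' is part of the match, so the replacement has to put it back
fixed_str = re.sub('[a-z]{4}!', 'Fun!', lame_str)
print(fixed_str)  # Python is Fun!
```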

# Conclusions

Regular expressions, or regexes, are sequences of characters that describe a search pattern, allowing us to perform more flexible pattern matching than exact substring comparison.  In Python, regex is handled by the `re` module, which must be imported before use.

As mentioned above, this was a very shallow dive into regular expressions.  For a much more thorough review, see [this article](https://realpython.com/regex-python/).