# Regular Expressions



### Escape characters

Normally, each line of code in Python contains one instruction to be executed. If you split an instruction across two lines, it will usually result in an error.

In [None]:
x = 1 +
2

This may be a problem if you want to write a **string** of text that contains multiple lines.

In [None]:
x = "hello
world"

Python provides the special character `"\n"` to represent a "new line" in your **strings**. This is called an **escape character**: it's a backslash `\` followed by a regular character.

**Escape characters** are recognized by Python as special and treated accordingly.
Note how the `"\n"` is exactly replaced by a new line when you print it.

In [None]:
print("hello\nworld")
print("escaping \n characters")

New lines are not the only **escape characters** in Python.
There are many more and they are used to easily represent complex text.

Each **escape character** is exactly replaced independently from the adjacent characters. You can also have one **escape character** after the other.

In [None]:
x = "This is a \ttab"
print(x)

x = "Mixing t and tabs t\t\tt\t"
print(x)

x = "These are quotes \' \""
print(x)

x = "This is an\n\tindented text on a new line"
print(x)

x = "And finally backslashes \\\\"
print(x)
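
A quick way to check that an escape character really counts as a single character is to look at the length of the string. The sketch below only uses the built-in `len()` function; the variable names are just for illustration.

```python
# Each escape character is stored as ONE character, even though we type two.
newline = "\n"
tab = "\t"
backslash = "\\"

print(len(newline))    # 1
print(len(tab))        # 1
print(len(backslash))  # 1

# "a", a new line, "b": three characters in total
text = "a\nb"
print(len(text))       # 3
```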

### The `re` module

An important Python **module** used for working with **strings** is the **regular expressions** one, named `re`.

A **regular expression** (or **regex** for short) lets you implement functionalities like **find** or **find and replace** using special search patterns.

**Regular expressions** add a whole new layer of special characters that are extremely helpful for searching for matches in a **string**.
They are more complex than **escape characters**, but way more powerful.

Let's see how to use this new module for searching characters in a **string**.

The **function** `search()` is provided by the `re` **module**. It takes two **strings** as input and searches for an occurrence of the first **string** in the second.
This function can be used within an `if` statement to get different behaviors depending on whether an occurrence has been found or not.

**Never forget to `import re` before using regular expressions.**

In [None]:
import re

dna = "ATCGCGGTCCCAC"

if re.search("GAATTC", dna): # EcoRI restriction site is "GAATTC"
    print("EcoRI restriction site found!")
else:
    print("EcoRI restriction site not found!")
        
if re.search("GGACC", dna) or re.search("GGTCC", dna): # AvaII restriction site is "GGACC" or "GGTCC" 
    print("AvaII restriction site found!")
else:
    print("AvaII restriction site not found!")
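
`re.search()` does not return `True` or `False` directly: it returns a match object when the pattern is found and `None` otherwise, which is why it works inside an `if`. A minimal sketch, reusing the `dna` string from above (the variable names `match` and `no_match` are only illustrative):

```python
import re

dna = "ATCGCGGTCCCAC"

match = re.search("GGTCC", dna)
print(match.group())   # the text that was matched: GGTCC
print(match.start())   # index of the first matched character: 5

no_match = re.search("GAATTC", dna)
print(no_match)        # None, because the pattern is not in dna
```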

### The alternative `(x|y)`

The **regular expression** `(A|T)` represents a character that can be either `A` or `T`.
Note how you add an `r` in front of a **string** containing a **regular expression**: this tells Python to treat the text as a raw string, so backslashes are passed to the regex unchanged instead of being read as **escape characters**. Remember that the `r` is outside of the quotes.

This regex is an **alternative**: between parentheses `(` `)` you have a number of patterns separated by pipes `|`, and only one of the patterns will be used.

**You can have any number of patterns and each of them can be either a single character, more than one character or even another regex**.

Be very careful not to put whitespace inside the regular expression (e.g. between the sequences and the pipe symbol), otherwise the whitespace will be searched for in the strings.

In [None]:
import re

dna = "ATCGCGGTCCCAC"

if re.search(r"GG(A|T)CC", dna): # AvaII restriction site is "GGACC" or "GGTCC" 
    print("AvaII restriction site found!")
else:
    print("AvaII restriction site not found!")

In [None]:
import re

sequences = ["AGT", "AAAT", "AAAAT", "CAGTA", "AT", "AGAT", "AGAAAT"]

for seq in sequences:
    if re.search(r"(AG|AAA)T", seq):
        print("Matched", seq)
    else:
        print("Not matched", seq)
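
Whitespace inside a pattern is matched literally, like any other character. The short sketch below shows what goes wrong; the two patterns differ only by the spaces around the pipe.

```python
import re

dna = "ATCGCGGTCCCAC"

# Correct: no spaces inside the alternative
print(re.search(r"GG(A|T)CC", dna) is not None)    # True

# Wrong: the spaces become part of the patterns "A " and " T",
# so the regex looks for "GGA CC" or "GG TCC"
print(re.search(r"GG(A | T)CC", dna) is not None)  # False
```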

### Online regex tools

When working with regexes, it is sometimes useful to have a way to quickly test a new pattern on some text.
You could create a simple Python program for that, but there are many websites that provide this functionality with a nice graphical interface.

An example is https://pythex.org/. Try to open this website, write `GG(A|T)CC` in the regular expression section and then write multiple sequences in the test string section below it.
You will see all the matches highlighted.
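
If you prefer to stay in Python, a rough equivalent can be sketched with `re.finditer()`, which yields one match object for each non-overlapping match in the text. The helper name `show_matches` and the test string are made up for this example.

```python
import re

def show_matches(pattern, text):
    # Hypothetical helper: collect (start, matched text) for every match
    found = []
    for m in re.finditer(pattern, text):
        found.append((m.start(), m.group()))
    return found

text = "AAGGACCTTGGTCCA"
for start, matched in show_matches(r"GG(A|T)CC", text):
    print(matched, "found at position", start)
```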

### Exercise

Write a single regular expression that matches all the sequences in the first list and none of the sequences in the second list.

Test it using the search function.

Hint: use for loops to quickly test each list of sequences.

In [None]:
import re

sequences_to_match = ["CAT", "GAT", "CAG", "GAC"]
sequences_to_not_match = ["A", "CA", "AT", "AG", "CGT", "GAA", "AAT", "AAA"]

### Character groups `[xy]`

A character group defines a set of characters and it will match only one of them.
It is indicated as a sequence of characters between square brackets `[abc]`.
Note that this is exactly equivalent to using the alternative regex among the individual characters in the sequence, e.g. `(a|b|c)`.

The character group is a more compact representation for an alternative regex where each alternative consists of only 1 character.
If you have more complex alternatives, i.e. made of more than one character or that involve other regex, you can't use the character group.

In [None]:
import re

regex = [r"[AG]GC", r"[AGC]G[AGC]"]
sequences  = ["AGC", "GGC", "CGC", "TGC"]

for reg in regex:
    for seq in sequences:
        if re.search(reg, seq):
            print(reg, "matched", seq)
        else:
            print(reg, "not matched", seq)
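
The equivalence between a character group and a single-character alternative can be checked directly. The sketch below compares `[AG]GC` with `(A|G)GC` on the same sequences; the variable names `with_group` and `with_alternative` are only illustrative.

```python
import re

sequences = ["AGC", "GGC", "CGC", "TGC"]

for seq in sequences:
    with_group = re.search(r"[AG]GC", seq) is not None
    with_alternative = re.search(r"(A|G)GC", seq) is not None
    # The two results are the same for every sequence
    print(seq, with_group, with_alternative)
```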

### Combining regex

Now that you know some different regular expressions, you can start combining them to create more complex patterns.

Remember that:
 - The alternative regex is made of two or more alternative patterns. Each pattern can be either a single character, a sequence of characters or another regex.
 - The character group defines a set of characters and it will match only one of the characters in its set.

In [None]:
import re

sequences  = ["AGGT", "CCATGTC", "AAAAAT", "TACTGC", "AGT", "AGTGT", "AGGAT"]
regex = r"A([GT]G|[AC])T"
for seq in sequences:
    if re.search(regex, seq):
        print(regex, "matched", seq)
    else:
        print(regex, "not matched", seq)
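
One way to read a combined pattern is to spell out every string it can match. For `A([GT]G|[AC])T` these are `AGGT`, `ATGT`, `AAT` and `ACT`. The sketch below checks that the compact regex and the spelled-out alternative agree on the sequences above; the names `compact` and `expanded` are only illustrative.

```python
import re

sequences = ["AGGT", "CCATGTC", "AAAAAT", "TACTGC", "AGT", "AGTGT", "AGGAT"]

compact = r"A([GT]G|[AC])T"
expanded = r"(AGGT|ATGT|AAT|ACT)"

for seq in sequences:
    a = re.search(compact, seq) is not None
    b = re.search(expanded, seq) is not None
    print(seq, a, b)  # a and b are equal for every sequence
```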

### Exercise

Write a single regular expression that describes the following DNA sequence:
 - `A` or `G` or `C` or `AGC`
 - `TT`
 - two generic bases (i.e. they can be any of `A`, `T`, `G`, `C`)
 - `A` or `G`

Test that your regex matches the following sequences:
 - `ATTTTA`, `GTTCAA`, `AGCTTGGG`
 
Test that your regex does not match the following sequences:
 - `AGCTTA`, `ATTG`, `TTTAAA`, `CTATTG`, `AGCTTAA`
 
Hint: when you are designing a complex regex, it can be useful to build it step by step, and remember to try it on the online visualizer.

### Quantifiers `{2}`

Quantifiers are symbols that let you control the number of times a pattern is repeated.

 - `{3}`: matches the preceding pattern exactly 3 times.
 - `{1,4}`: matches the preceding pattern at least 1 time and up to 4 times.

Quantifiers are applied to the preceding "pattern".
This "pattern" can either be:
 - A normal character
 - An alternative regex
 - A character group
 - Any combination of the previous, as long as it's enclosed in parentheses `()`

In [None]:
import re

def search_regex(reg, sequences):
    # This is a utility function for quickly testing a regex on a list of strings
    for seq in sequences:
        if re.search(reg, seq):
            print(reg, "matched", seq)
        else:
            print(reg, "not matched", seq)
    print("*** ----------- ***")


search_regex(r"A{3}T", ["AAAT", "CAAATG", "AA", "ATAATC"])

search_regex(r"AT{3}", ["ATTTG", "ATTTT", "ATATAT"])

search_regex(r"(AT){3}", ["ATTTG", "ATTTT", "ATATAT"])

search_regex(r"(A|T){3}", ["AAA", "CTTTG", "ATAG", "CAA", "AT", "AAGTT", "AATT"])

search_regex(r"[AT]{3}", ["AAA", "CTTTG", "ATAG", "CAA", "AT", "AAGTT", "AATT"])

search_regex(r"(A{2}|A{3})", ["AAA", "CAAAG", "AA", "ACAAGT", "TAC"])

search_regex(r"A{2,3}", ["AAA", "CAAAG", "AA", "ACAAGT", "TAC"])

search_regex(r"AT{0,1}", ["A", "AT", "T", "GATG", "ATT"])
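
To see exactly which part of a string a quantified pattern matched, you can print the matched text with the match object's `.group()` method. A minimal sketch, using test strings taken from the calls above:

```python
import re

# {2,3} takes as many repetitions as it can, up to 3
print(re.search(r"A{2,3}", "CAAAG").group())    # AAA

# The quantifier applies only to the T, not to "AT"
print(re.search(r"AT{3}", "ATTTT").group())     # ATTT

# With parentheses, the whole "AT" is repeated
print(re.search(r"(AT){3}", "ATATAT").group())  # ATATAT

# {0,1} makes the T optional
print(re.search(r"AT{0,1}", "A").group())       # A
print(re.search(r"AT{0,1}", "GATG").group())    # AT
```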

### Exercise

Write a single regular expression that describes the following DNA sequence:
 - 3 equal bases (i.e. the same base, any of `A`, `T`, `G`, `C`, repeated 3 times)
 - `TT`
 - between 0 and 3 `A`

Test that your regex matches the following sequences:
 - `AAATT`, `GGGTTAAA`, `CCCTTA`, `TTTTTT`
 
Test that your regex does not match the following sequences:
 - `AGTTT`, `AACTT`, `AATT`, `AAATTAAAA`