# Regular Expressions



### Escape characters

Python normally expects each statement to fit on a single line. If you break a statement in the middle without telling Python that it continues, you get an error.

In [2]:
x = 1 +
2

SyntaxError: invalid syntax (<ipython-input-2-bdb480158c9f>, line 1)

This is a problem if you want to write a **string** whose text spans more than one line.

In [3]:
x = "hello
world"

SyntaxError: EOL while scanning string literal (<ipython-input-3-42dbf8328085>, line 1)

Python provides the special sequence `"\n"` to represent a "new line" in your **strings**. This is called an **escape character**: a backslash `\` followed by a regular character.

**Escape characters** are recognized by Python as special and treated accordingly.
Note how each `"\n"` is replaced by a new line when you print the string.

In [5]:
print("hello\nworld")
print("escaping \n characters")

hello
world
escaping 
 characters
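
One way to convince yourself that `"\n"` is a single character, and not a backslash followed by an `n`, is to check the length of the string and to compare `print` with `repr`. The snippet below is only a small sketch; the variable name `text` is chosen for illustration.

```python
text = "hello\nworld"

# The escape character counts as one character: 5 + 1 + 5 = 11
print(len(text))

# repr() shows the string as you would type it, with the escape still visible
print(repr(text))

# splitlines() breaks the string at each new line
print(text.splitlines())
```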


New lines are not the only **escape characters** in Python.
There are many more, and they make it easy to write characters such as tabs, quotes and backslashes inside a **string**.

Each **escape character** is replaced on its own, regardless of the characters next to it. You can also put one **escape character** right after another.

In [26]:
x = "This is a \ttab"
print(x)

x = "Mixing t and tabs t\t\tt\t"
print(x)

x = "These are quotes \' \""
print(x)

x = "This is an\n\tindented text on a new line"
print(x)

x = "And finally backslashes \\\\"
print(x)

This is a 	tab
Mixing t and tabs t		t	
These are quotes ' "
This is an
	indented text on a new line
And finally backslashes \\
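
To check that each **escape character** really is handled on its own, you can look at the individual characters of a string. This is a minimal sketch; the names `mixed` and `backslashes` are illustrative and reuse pieces of the strings printed above.

```python
mixed = "t\t\tt\t"
backslashes = "\\\\"

# Each tab is a separate item, next to the plain letters "t"
print(list(mixed))

# Eight typed characters, but only five characters in the string
print(len(mixed))

# Four typed backslashes become two real backslashes
print(len(backslashes))
```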


### The `re` module

An important Python **module** for working with **strings** is the **regular expressions** module, named `re`.

A **regular expression** (or **regex** for short) lets you implement functionalities like **find** or **find and replace** using special search patterns.

**Regular expressions** add a whole new layer of special characters that are extremely helpful for searching for matches in a **string**.
They are more complex than **escape characters**, but way more powerful.

Let's see how to use this new module to search for characters in a **string**.

In [86]:
import re

dna = "ATCGCGGTCCCAC"

if re.search("GAATTC", dna):
    print("EcoRI restriction site found!")
else:
    print("EcoRI restriction site not found!")
        
if re.search("GGACC", dna) or re.search("GGTCC", dna):
    print("AvaII restriction site found!")
else:
    print("AvaII restriction site not found!")
    
if re.search(r"GG[AT]CC", dna):
    print("AvaII restriction site found!")
else:
    print("AvaII restriction site not found!")

EcoRI restriction site not found!
AvaII restriction site found!
AvaII restriction site found!
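
The `if` tests above work because `re.search` returns a match object when the pattern is found and `None` when it is not. The sketch below reuses the same `dna` sequence to show what the match object contains; the variable names `match` and `no_match` are only illustrative.

```python
import re

dna = "ATCGCGGTCCCAC"

# A successful search returns a match object
match = re.search(r"GG[AT]CC", dna)
if match:
    print("Matched text:", match.group())
    print("Start position:", match.start())
    print("End position:", match.end())

# An unsuccessful search returns None, which counts as False in an if
no_match = re.search("GAATTC", dna)
print(no_match)
```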


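The **regular expression** `[AT]` matches a single character that can be either `A` or `T`.
Note the `r` in front of the **string** containing the **regular expression**: it tells Python to treat the text as a raw string, so backslashes are passed to `re` unchanged instead of being read as **escape characters**. The pattern above contains no backslash and would also work without the `r`, but adding it to every pattern is a good habit. Remember that the `r` goes outside the quotes.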
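
To make the effect of the `r` prefix concrete, the following sketch compares a normal **string** with a raw one. The names `normal` and `raw` and the sample text `"site 42"` are made up for this illustration.

```python
import re

normal = "\n"   # one character: a new line
raw = r"\n"     # two characters: a backslash and the letter n

print(len(normal), len(raw))

# Without the r prefix, the backslash has to be doubled by hand
print(r"\d" == "\\d")

# \d matches a digit; the first digit found in the text is returned
print(re.search(r"\d", "site 42").group())
```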

You can put any number of characters within square brackets `[` `]`: the pattern matches if any one of them is found at that position.
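
As a small sketch of a longer list of characters, the pattern below accepts any of the four DNA bases in the third position. The pattern `AT[ACGT]G` and the two test sequences are invented for this example.

```python
import re

# [ACGT] matches exactly one character, which must be A, C, G or T
results = {}
for sequence in ["ATCG", "ATNG"]:
    found = re.search(r"AT[ACGT]G", sequence) is not None
    results[sequence] = found
    print(sequence, "matches" if found else "does not match")
```

The second sequence fails because `N` is not one of the characters listed inside the brackets.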

You can try out **regular expressions** interactively at https://pythex.org/

### Exercise

Check if 

### Exercise

### `^` and `$`

In [None]:
x = "Hello world".replace("o", "0")
print("Substituted \"o\" with \"0\" with replace:", x)

x = re.sub("o", "0", "Hello world")
print("Substituted \"o\" with \"0\" with a regex:", x)

x = re.sub(r"\d", "a", "1 z 2 z 3 z 4 z")
print("Substituted any digit with \"a\":", x)