# Regex – Introduction

- Keep it simple: Operators
- Example:
    - Find EcoRI restriction enzyme site in sequence:

In [1]:
string = "TGCATAGCGAATTCGGACGT"
"GAATTC" in string

True

- Find Eco13kl restriction site in sequence
- CCNGG --> CCAGG, CCCGG, CCGGG or CCTGG

In [2]:
string = "CCTGGAGCCCAGGGGACGT"
"CCNGG" in string

False

In [3]:
string = "CCTGGAGCCCAGGGGACGT"
"CCAGG" in string

True

In [4]:
"CCTGG" in string

True

In [5]:
"CCCGG" in string

False

In [6]:
"CCGGG" in string

False
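Checking every expansion of N by hand gets tedious quickly. As a minimal sketch (the loop and the names `variants` and `hits` are illustrative additions, not part of the original notebook), the four `in` tests above can be folded into one loop:

```python
string = "CCTGGAGCCCAGGGGACGT"

# Expand the ambiguous base N into the four concrete sites.
variants = ["CC" + base + "GG" for base in "ACGT"]

# Keep only the variants that actually occur in the sequence.
hits = [site for site in variants if site in string]
print(hits)
```

This still needs one test per variant, which is the motivation for using a regex instead.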

# Regex – re.findall – exercise
- Use re.findall to find:

## EcoRI_site = "GAATTC"
- sequence = "TGCATAGCGAATTCGAGCGT"

## AG_nucl = "AG"
- sequence = "TGCATAGCGAATTCGAGCGT"

## Eco13kl_site = "CCNGG"
- sequence = "CCTGGAGCCCAGGGAGCGT"

In [21]:
import re

eco_r1 = 'GAATTC'
AG_nucl = 'AG'
eco13kl_site = 'CCNGG'

sequence = 'TGCATAGCGAATTCGAGCGT'

print re.findall(eco_r1, sequence,0)
print re.findall(AG_nucl, sequence,0)
print re.findall(eco13kl_site, sequence,0)  # empty: N is matched as the literal letter N, not as "any base"


['GAATTC']
['AG', 'AG']
[]


## Regex – metacharacters

Metacharacters are characters with a special meaning in a pattern: instead of matching themselves, they describe which characters, or which positions in the string, should match.

Some examples of metacharacters:
- ^ Matches beginning of line
- $ Matches end of line
- . Matches any single character except newline
- [...] Matches any single character in brackets
- [^...] Matches any single character not in brackets
- a | b Matches either a or b

- Now repeat the Eco13kl_site question using [...]

In [38]:
## [ACTG] = A|C|T|G

eco13kl_site = 'CC[ACTG]GG'

sequence = "CCTGGAGCCCAGGGAGCGT"

print re.findall(eco13kl_site, sequence,0)

['CCTGG', 'CCAGG']
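The other metacharacters from the list in this section can be tried on the same sequence. This is a small sketch, not part of the original notebook; the variable names are illustrative and the patterns were chosen only to show each metacharacter once.

```python
import re

sequence = "CCTGGAGCCCAGGGAGCGT"

starts_with_cc = re.findall(r"^CC", sequence)    # ^ anchors at the beginning
ends_with_gt = re.findall(r"GT$", sequence)      # $ anchors at the end
any_middle = re.findall(r"CC.GG", sequence)      # . is any single character
not_a_or_t = re.findall(r"CC[^AT]GG", sequence)  # [^AT] excludes A and T
either = re.findall(r"CCT|CCA", sequence)        # | means "or"

for result in (starts_with_cc, ends_with_gt, any_middle, not_a_or_t, either):
    print(result)
```

On this sequence `CC.GG` finds the same sites as `CC[ACTG]GG`, but `.` would also accept characters that are not bases, so the bracket form is the safer choice for DNA.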


# Regex – exercise

Explore the regexes listed below using the script that follows. For each pair, try to find out what the difference is and why:
1. CC vs ^CC 
2. G\*G vs G.\*G 
3. GT$ vs GT 
4. [AC] vs [^AC] 
5. GAG|GAC vs CAG|GAG 
6. TGA|TGG vs TG[AG]
7. CC\* vs CC+
8. CC{1,2} vs CC{1,}
9. \w\w\w vs \w\w\s
10. \d\d\S vs \d\d\D

In [46]:

line = "CCTGGAG123CCCCAGGTGACGT\nTGT"
find_output = re.findall("\d\d\D",line)
print find_output

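# ^CC      CC only at the very beginning of the string (CC alone: anywhere)
# G*G      zero or more G's followed by a G, so any run of one or more G's
# G.*G     everything from the first G to the last G on a line (greedy)
# GT$      GT only at the end of the string (GT alone: anywhere)
# [AC]     a single A or C
# [^AC]    a single character that is neither A nor C
# x|y      either x or y
# TG[AG]   TG followed by A or G, the same as TGA|TGG
# CC*      a C followed by zero or more C's: one or more C's
# CC+      a C followed by one or more C's: two or more C's
# CC{1,2}  a C followed by one or two C's: two or three C's
# CC{1,}   a C followed by at least one C: two or more C's, same as CC+
# \w\w\w   three word characters (letters, digits or underscore) in a row, e.g. a codon
# \w\w\s   two word characters followed by whitespace (here the \n)
# \d\d\S   two digits followed by any non-whitespace character
# \d\d\D   two digits followed by any non-digit character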

['23C']
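To compare several of the patterns from the exercise side by side, a short loop helps. This is only a sketch: the loop and the `results` dictionary are additions for illustration, and only some of the listed pairs are shown.

```python
import re

line = "CCTGGAG123CCCCAGGTGACGT\nTGT"

patterns = [r"CC*", r"CC+", r"CC{1,2}", r"G*G", r"G.*G", r"\d\d\S", r"\d\d\D"]

# Run every pattern on the same line and collect the matches.
results = {}
for pattern in patterns:
    results[pattern] = re.findall(pattern, line)
    print("%-8s %r" % (pattern, results[pattern]))
```

Note how `G.*G` returns a single long match: `.*` is greedy and runs to the last G on the line, while `G*G` only ever matches runs of G's.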


# Regex - Raw string notation
\>\>\> find_output = re.findall("\\\\\\\\",line)

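Without raw string notation every backslash has to be escaped twice: Python first turns the four backslashes of the string literal into two, and the regex engine then reads those two as one literal backslash.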

\>\>\> find_output = re.findall(r"\\\\",line)

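With a raw string (the r prefix) Python leaves the backslashes untouched, so two backslashes are enough: the regex engine again reads them as one literal backslash.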

In [47]:
line = "this\nis\na\\ntest"
find_output = re.findall("\\\\",line)
print find_output

['\\']


In [48]:
line = "this\nis\na\\ntest"
find_output = re.findall(r"\\",line)
print find_output

['\\']
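That both notations produce the same pattern can be checked directly. A minimal sketch; the names `plain` and `raw` are illustrative:

```python
import re

plain = "\\\\"  # four backslashes typed, two characters stored
raw = r"\\"     # two backslashes typed, two characters stored

print(plain == raw)  # the two literals hold the same pattern
print(len(plain))    # number of characters the regex engine receives

line = "this\nis\na\\ntest"
print(re.findall(plain, line) == re.findall(raw, line))
```

Because the raw form needs half as many backslashes, it is the usual choice for regex patterns.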


# Regex – other “problems” with strings

Execute the regex below, what does it find?

In [52]:
line = "CctGGAGccCAggGGacGT"
s = line.upper()
find_output = re.findall("CC[ACTG]GG",s)
print find_output

# without .upper() nothing is found: the pattern is uppercase and matching is case-sensitive

['CCTGG', 'CCAGG']


# Regex – FLAGS – exercise

Now try the ignore-case flag. What does it find now?

Remember that you can always use string.upper() or string.lower() instead.

In [54]:
line = "CctGGAGccCAggGGacGT"
find_output = re.findall("(?i)CC[ACTG]GG",line)
print find_output

# (?i) at the start of the pattern makes the whole match case-insensitive

# find_output = re.findall("CC[ACTG]GG",line, re.I) works as well

# Type here your code

['CctGG', 'cCAgg']
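The ignore-case flag can be given in several equivalent ways. The sketch below compares three of them; the variable names are illustrative, and `re.IGNORECASE` is the long spelling of `re.I`.

```python
import re

line = "CctGGAGccCAggGGacGT"

inline = re.findall(r"(?i)CC[ACTG]GG", line)              # flag inside the pattern
flagged = re.findall(r"CC[ACTG]GG", line, re.IGNORECASE)  # flag as an argument
compiled = re.compile(r"CC[ACTG]GG", re.I).findall(line)  # flag on a compiled pattern

print(inline)
print(inline == flagged == compiled)
```

Unlike string.upper(), the flag leaves the matched text in its original case, which matters when the case carries information.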


# Regex – FLAGS – exercise – re.S, re.M
- Apply re.S on the example below

In [74]:
line = "CCTGGAGCCC\nAGGGGACGT"
find_output = re.findall("CC.AGG",line, re.S)
print find_output

['CC\nAGG']


# Regex – FLAGS – exercise – re.S, re.M
- Apply re.M to the example below, then combine the re.S and re.M flags on the same example.

In [76]:
line = "a\nmultiline test\nto\ntest the multi\nline flag"
find_output = re.findall("^test.*",line, re.M|re.S)
print find_output

['test the multi\nline flag']
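To see what each flag contributes, the same pattern can be run with no flag, with each flag alone, and with both. A small sketch with illustrative variable names:

```python
import re

line = "a\nmultiline test\nto\ntest the multi\nline flag"
pattern = r"^test.*"

no_flags = re.findall(pattern, line)          # ^ only matches at the very start
multiline = re.findall(pattern, line, re.M)   # ^ also matches after every newline
dotall = re.findall(pattern, line, re.S)      # . also matches newlines
both = re.findall(pattern, line, re.M | re.S)

for name, result in [("no flags", no_flags), ("re.M", multiline),
                     ("re.S", dotall), ("re.M|re.S", both)]:
    print("%-10s %r" % (name, result))
```

re.S alone finds nothing here because the string does not start with "test"; it only changes what `.` matches, not where `^` may match.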


# Regex – re.sub – exercise
- In the example below correct the sentence using re.sub

In [83]:
line = "the hedgehog is teh most dangerous animal in teh world"

find_output = re.findall("teh", line, 0)
print find_output

sub = re.sub("teh", "the", line)

print sub


['teh', 'teh']
the hedgehog is the most dangerous animal in the world


# Regex – re.sub – exercise
- In the example below replace the two “colors” by red using re.sub and regex

In [84]:
line = "My computer should be grey and my car should also be gray"

sub = re.sub("gr[ae]y", "red", line)  # [ae] = a or e; a | inside [] would be matched literally

print sub


My computer should be red and my car should also be red
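Inside square brackets `|` is an ordinary character, not "or". The sketch below uses a made-up test string (an assumption for illustration) to show the difference between a character class and an alternation:

```python
import re

text = "grey gray gr|y"

with_pipe_in_class = re.sub(r"gr[a|e]y", "red", text)  # class contains a, | and e
char_class = re.sub(r"gr[ae]y", "red", text)           # class contains only a and e
alternation = re.sub(r"gr(a|e)y", "red", text)         # | as "or", outside brackets

print(with_pipe_in_class)
print(char_class)
print(alternation)
```

For single characters the class `[ae]` is the shorter form; alternation is needed when the options are longer than one character.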


# Regex – re.split – exercise
- Split the line below on the numbers, including the spaces around them
- What happened to the spaces and the numbers in the output?

In [90]:
line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"

output_line = re.split(" 1 | 2 | 3 ", line)

print output_line

## the separators (the numbers and the spaces around them) are consumed by the split and do not appear in the output

['You', 'should pay attention', 'will pay attention', 'and otherwise you will fail']
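Listing every number explicitly does not scale. As a sketch (the pattern below is an alternative, not the notebook's own solution), a digit metacharacter gives the same result for any single-digit number:

```python
import re

line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"

explicit = re.split(" 1 | 2 | 3 ", line)  # the pattern used above
general = re.split(r" \d ", line)         # a space, any digit, a space

print(general)
print(explicit == general)
```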


# Regex – re.split – groups – exercise

- In the previous exercise you could split the line, but the numbers and spaces themselves were lost.
- To keep the separators in the output, we can use groups

Exercise:

- Split the line again, but now use "(\s\*\d\s\*)". What happens?

- And what happens if you use "(\s\*)(\d)(\s\*)"?

In [94]:
line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"

output_line = re.split("(\s*\d\s*)" , line)

print output_line

## one group: each separator (the number plus the spaces around it) is kept as a single list item

output_line2 = re.split("(\s*)(\d)(\s*)" , line)

print output_line2

## three groups: the leading spaces, the digit and the trailing spaces are each kept as their own list item

['You', ' 1 ', 'should pay attention', ' 2 ', 'will pay attention', ' 3 ', 'and otherwise you will fail']
['You', ' ', '1', ' ', 'should pay attention', ' ', '2', ' ', 'will pay attention', ' ', '3', ' ', 'and otherwise you will fail']


In [None]:
line = "You 1 should pay attention 2 will pay attention 3 and otherwise you will fail"
# Type here your code

# Regex – re.sub – groups – exercise
- Groups are also very handy in substitutions
- See what happens when you use grouping on the line below:
- \g<1> stands for group 1, the first group between ()
- \g<2> stands for group 2, etc.

In [97]:
line = "My computer should be grey and my car should also be gray"
find_output = re.sub("(gr[ae]y)", "\g<1>blue", line)
print find_output

My computer should be greyblue and my car should also be grayblue
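The shorter backreference `\1` usually works as well, but `\g<1>` is unambiguous when the group is followed by a digit. The sketch below illustrates this; the string `lanes` and the variable names are assumptions made up for the example.

```python
import re

line = "My computer should be grey and my car should also be gray"

# \1 and \g<1> refer to the same group when a letter follows.
short_form = re.sub(r"(gr[ae]y)", r"\1blue", line)
long_form = re.sub(r"(gr[ae]y)", r"\g<1>blue", line)
print(short_form == long_form)

lanes = "lane 1 and lane 2"
appended = re.sub(r"(\d)", r"\g<1>0", lanes)  # group 1 followed by a literal 0
print(appended)

# \10 is read as "group 10", which does not exist in this pattern.
ambiguous_failed = False
try:
    re.sub(r"(\d)", r"\10", lanes)
except re.error as err:
    ambiguous_failed = True
    print("error: %s" % err)
```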


# Regex – re.sub – groups – exercise
- Try to understand what happens in the example below

In [98]:
line = "My computer should be grey and my car should also be gray"
find_output = re.sub("(gr[ae]y)(\D*)(gr[ae]y)", "\g<1>blue\g<2>not\g<3>black", line)
print find_output

# group 1 = the first colour (grey): "blue" is appended to it
# group 2 = all non-digit text between the two colours: "not" is appended to it
# group 3 = the second colour (gray): "black" is appended to it


My computer should be greyblue and my car should also be notgrayblack


# Regex – re.search
- example:

In [103]:
line = "TGCATAGCGAATTCGAGCGT"
match_output = re.search("GAATTC",line)
print match_output.group()

# unlike re.findall, re.search returns a match object (or None if nothing matches);
# ask the match object for the matched text with .group()

GAATTC


In [100]:
if match_output:
    print "GAATTC site found!"
    print match_output.group()  #print the group content
    print match_output.start()  #print the starting position
    print match_output.end()    #print the ending position
    print match_output.span()   #print the starting and ending position

GAATTC site found!
GAATTC
8
14
(8, 14)
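re.search stops at the first match. When the positions of all matches are needed, `re.finditer` yields one match object per match. A minimal sketch with illustrative variable names:

```python
import re

line = "TGCATAGCGAATTCGAGCGT"

# One match object per non-overlapping occurrence of AG.
spans = [match.span() for match in re.finditer("AG", line)]
print(spans)

# re.search returns None when the pattern is absent, so test before calling .group().
missing = re.search("GGATCC", line)
print(missing is None)
```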


# Regex – final exercise
- We are going to digest the DNA sequence below with two restriction enzymes (| marks where the enzyme cuts):
    - BamH1 G|GATCC
    - AccI GT|MKAC (M=A/C, K=G/T)

- It is forbidden to use str.split(), str.lower(), str.upper()!

Q1: How many times is the site of each restriction enzyme found?

Q2: After digestion, how many DNA fragments are there and what is the length of each product (provide a list)?

Challenge:

Try to answer the questions in as few lines as possible: use groups and nesting

In [104]:
dna = "CGTGACCTTGGACCTCACTCACCATGTAGTACTCCTCTGAGAGGAATTGTACTAGAGGTGAAAACCGATAAGAAATCACAGTCTGATATGCGTGTGTGTCGACATGCATAATGTATACCCCTTACTGAGTCGTATGGGAATATCCGGCATGACGGGAGAAGCCGTAGACCAAAGGTGTGAGTGAGCATCGTTGTGAACAGTCTGGGTAAACGCGCATATGTAATGTAGTGGATCCTGACACACTCTGGACAAGGGCTCTCTGGGGAACTTGATTTTACTAATGGACTCCAAGAAGCGACGCGCACTCGGTTATGGCGCGCACACTAAAGCGAGGGATCCTAAAAGCTCATGAAGAGGTTCGATCGCTGACTAGTATGGTTATACCCGACACCGCACTGTCGCGTAGACCGCTCCTAGGATTAAATGATCACCCGCACATTGATGCGCGCGTTGCGGGTGAAAGTAGTGAACCCAAGAGTACTTGCCCGTCCGTGGCTCTAGCGTGCATACGTTACATTTTGACGCCTAAAGGTGTCTTGTCAGAGCACGTCCGGGCACAGTAGCAGATACCGGATATCTCATACGTCCGGAGCAGCGCGCGTACTCAAAGTGTGCCCAAGCTCGCATCCGAATTCGGATCCTGCCTTGCTCCCCTACACAAACTATCACGAATAAGCGCATATAAAGCGTCCACCACCTGTAACTTTACTGACCAAAGCATGTCGAGGCGATTAAAGTGGCCGTATGGACATCACAGCCCGTGCCCGACCATTATTAGCGCCGCTACTTCTCCGCGCGCATGTTGACGCTTCTGATGTAGGGTGTGCGGGTCCCAATTGATATATTTATTCGGAGTTACAAAACTGGTACAGAGGCTGTCCGTGCTCTA"

In [179]:
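BamH1 = "(G)(GATCC)"        # the cut falls between group 1 and group 2
AccI = "(GT)([AC][GT]AC)"   # M = [AC], K = [GT]
BA = "-"                    # marker inserted at every cut position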

BAM_output = re.findall(BamH1,dna)
print BAM_output
print len(BAM_output)

ACC_output = re.findall(AccI,dna)
print ACC_output
print len(ACC_output)

sub = re.sub( BamH1, "\g<1>-\g<2>", dna)
sub2 = re.sub( AccI, "\g<1>-\g<2>", sub)

print sub2 
spl = re.split( BA , sub2)

print spl
print len(spl)

# length of every fragment
lenfrag = [len(fragment) for fragment in spl]
print lenfrag

[('G', 'GATCC'), ('G', 'GATCC'), ('G', 'GATCC')]
3
[('GT', 'CGAC'), ('GT', 'ATAC'), ('GT', 'AGAC'), ('GT', 'AGAC')]
4
CGTGACCTTGGACCTCACTCACCATGTAGTACTCCTCTGAGAGGAATTGTACTAGAGGTGAAAACCGATAAGAAATCACAGTCTGATATGCGTGTGTGT-CGACATGCATAATGT-ATACCCCTTACTGAGTCGTATGGGAATATCCGGCATGACGGGAGAAGCCGT-AGACCAAAGGTGTGAGTGAGCATCGTTGTGAACAGTCTGGGTAAACGCGCATATGTAATGTAGTG-GATCCTGACACACTCTGGACAAGGGCTCTCTGGGGAACTTGATTTTACTAATGGACTCCAAGAAGCGACGCGCACTCGGTTATGGCGCGCACACTAAAGCGAGG-GATCCTAAAAGCTCATGAAGAGGTTCGATCGCTGACTAGTATGGTTATACCCGACACCGCACTGTCGCGT-AGACCGCTCCTAGGATTAAATGATCACCCGCACATTGATGCGCGCGTTGCGGGTGAAAGTAGTGAACCCAAGAGTACTTGCCCGTCCGTGGCTCTAGCGTGCATACGTTACATTTTGACGCCTAAAGGTGTCTTGTCAGAGCACGTCCGGGCACAGTAGCAGATACCGGATATCTCATACGTCCGGAGCAGCGCGCGTACTCAAAGTGTGCCCAAGCTCGCATCCGAATTCG-GATCCTGCCTTGCTCCCCTACACAAACTATCACGAATAAGCGCATATAAAGCGTCCACCACCTGTAACTTTACTGACCAAAGCATGTCGAGGCGATTAAAGTGGCCGTATGGACATCACAGCCCGTGCCCGACCATTATTAGCGCCGCTACTTCTCCGCGCGCATGTTGACGCTTCTGATGTAGGGTGTGCGGGTCCCAATTGATATATTTATTCGGAGTTACAAAACTGGTACAGAGG
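An alternative to inserting a marker and splitting is to collect the cut positions with `re.finditer` and take the differences between them. The sketch below is one possible approach, not the notebook's solution: the helper `fragment_lengths`, the `sites` list and the short sequence `toy` are all hypothetical names introduced for illustration, and the toy sequence was made up so the result is easy to check by eye.

```python
import re

# Hypothetical toy sequence with one BamHI site and one AccI site.
toy = "AAAGGATCCTTTGTCGACCC"

# (pattern, offset of the cut inside the site):
# G|GATCC cuts after 1 base, GT|MKAC cuts after 2 bases.
sites = [(r"GGATCC", 1), (r"GT[AC][GT]AC", 2)]

def fragment_lengths(seq, sites):
    """Return the fragment lengths after cutting seq at every site."""
    cuts = set()
    for pattern, offset in sites:
        for match in re.finditer(pattern, seq):
            cuts.add(match.start() + offset)
    # Fragment borders: start of the sequence, every cut, end of the sequence.
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

lengths = fragment_lengths(toy, sites)
print(lengths)
```

The same call should work on the full sequence as `fragment_lengths(dna, sites)`. The lengths always add up to the length of the input, which is a quick sanity check on the digestion.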