# Section 1 BioInformatics

In [24]:
import re

Python has a built-in package called re, which can be used to work with Regular Expressions.
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
RegEx can be used to check if a string contains the specified search pattern.

Python re methods
* findall:	Returns a list containing all matches
* search:	Returns a Match object if there is a match anywhere in the string
* split:	Returns a list where the string has been split at each match


### re.search method
* re.search will identify a pattern occurring anywhere in the string.



In [25]:
dna="ATCGCGAATTCAC"
if re.search(r"GGACC",dna) or re.search(r"GGTCC",dna):
    print("Found")
# Another method
if re.search(r"GG(A|T)CC",dna):
    print("Found")
else:
    print("Not found")

Not found


In [26]:
# Search for a pattern with any base (N) at a specific position,
# for example GCNGC
if re.search(r"GC(A|T|G|C)GC",dna):
    print("Found")
# Or 
if re.search(r"GC[ATGC]GC",dna):# this will match GCAGC, GCTGC, GCGGC AND GCCGC
    print("Found")
if re.search(r"GC[ATGC]AA",dna):
    print("Found")

Found


* A question mark ? immediately following a character means that the character is optional
* So, in pattern GAT?C the T is optional, and the pattern will match either GATC or GAC.
* In the pattern GGG(AAA)?TTT the group of three As is optional, so the pattern will match either GGGAAATTT or GGGTTT



In [27]:
dna="GAGCGATC"
if re.search(r"GAT?C",dna):
    print("Found")
dna="TACGGGTTT"
if re.search(r"GGG(AAA)?TTT",dna):
    print("Found")

Found
Found


* A plus sign + immediately following a character or group means that the character or group must be present but can be repeated any number of times.

In [28]:
dna0="GGGTTT"
dna1="GGGATTT"
dna2="GGGAATTT"
dna3="GGGAAATTT"

DNA=[dna0,dna1,dna2,dna3]
for Dna in DNA:
    if re.search(r"GGGA+TTT",Dna):
        print(Dna,", Found")
    else:
        print(Dna,", Not Found")


GGGTTT , Not Found
GGGATTT , Found
GGGAATTT , Found
GGGAAATTT , Found


* An asterisk * immediately following a character or group means that the character or group is optional, but can also be repeated.


In [29]:
DNA=[dna0,dna1,dna2,dna3]
for Dna in DNA:
    if re.search(r"GGGA*TTT",Dna):
        print(Dna,", Found")
    else:
        print(Dna,", Not Found")

GGGTTT , Found
GGGATTT , Found
GGGAATTT , Found
GGGAAATTT , Found


* Following a character or group with a single number inside curly brackets {} will match exactly that number of repeats
* Following a character or group with a pair of numbers inside curly brackets {} separated with a comma allows us to specify an acceptable range of number of repeats

In [30]:
for Dna in DNA:
    if re.search(r"GGGA{2}TTT",Dna):
        print(Dna,", Found")
    else:
        print(Dna,", Not Found")

GGGTTT , Not Found
GGGATTT , Not Found
GGGAATTT , Found
GGGAAATTT , Not Found


In [31]:
for Dna in DNA:
    if re.search(r"GGGA{0,2}TTT",Dna):
        print(Dna,", Found")
    else:
        print(Dna,", Not Found")

GGGTTT , Found
GGGATTT , Found
GGGAATTT , Found
GGGAAATTT , Not Found


* The caret symbol ^ matches the start of a string, and the dollar symbol $ matches the end of a string.

In [32]:
for Dna in DNA:
    if re.search(r"^GGG",Dna):
        print(Dna,", Found")
    else:
        print(Dna,", Not Found")


GGGTTT , Found
GGGATTT , Found
GGGAATTT , Found
GGGAAATTT , Found


In [33]:
for Dna in DNA:
    if re.search(r"TTT$",Dna):
        print(Dna,", Found")
    else:
        print(Dna,", Not Found")

GGGTTT , Found
GGGATTT , Found
GGGAATTT , Found
GGGAAATTT , Found
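
Using both anchors in one pattern constrains the whole string rather than just one end. The sketch below applies that idea to check that a sequence contains only unambiguous bases. The helper name `is_unambiguous_dna` is hypothetical and chosen only for illustration.

```python
import re

def is_unambiguous_dna(seq):
    """Return True if seq consists only of the bases A, T, G and C."""
    # ^ and $ tie the character class to the start and end of the string
    return re.search(r"^[ATGC]+$", seq) is not None

for candidate in ["GGGAATTT", "GGGNATTT", ""]:
    print(repr(candidate), is_unambiguous_dna(candidate))
```

Note that $ also matches just before a trailing newline, so it is safer to strip lines read from a file before testing them.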


### re.search method
* re.search will identify a pattern occurring anywhere in the string.
* re.search returns a Match object describing the first match, or None if the pattern does not occur.
* The Match object has properties and methods used to retrieve information about the search, and the result:

1. .span() returns a tuple containing the start and end positions of the match.
2. .string returns the string passed into the function
3. .group() returns the part of the string where there was a match

In [34]:
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print(m)
print("entire match: " + m.group())
print("The first: " + m.group(1))
print("The first Span: " , m.span(1))


<re.Match object; span=(2, 13), match='GACGTACGTAC'>
entire match: GACGTACGTAC
The first: CGT
The first Span:  (4, 7)


#### Getting the position of a match using the start() and end() methods

In [55]:
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start: " + str(m.start()))
print("end: " + str(m.end()))

start: 2
end: 13


In [56]:
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("start: " + str(m.start()))
print("end: " + str(m.end()))
print("group one start: " + str(m.start(1)))
print("group one end: " + str(m.end(1)))
print("group two start: " + str(m.start(2)))
print("group two end: " + str(m.end(2)))

start: 2
end: 13
group one start: 4
group one end: 7
group two start: 9
group two end: 11
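
The cells above assume the pattern is always found. When it is not, re.search returns None, and calling a method such as group() on None raises an AttributeError. A minimal sketch of guarding against that case follows; the function name `describe_match` is hypothetical.

```python
import re

pattern = r"GA([ATGC]{3})AC([ATGC]{2})AC"

def describe_match(dna):
    """Return a short description of the first match, or a fallback message."""
    m = re.search(pattern, dna)
    if m is None:
        # No match: m.group() would raise AttributeError here
        return "no match"
    # groups() returns all captured groups as a tuple
    return f"{m.group()} at {m.span()}, groups={m.groups()}"

print(describe_match("ATGACGTACGTACGACTG"))
print(describe_match("TTTTTTTT"))
```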


### re.split method
* Returns a list where the string has been split at each match
* Inside square brackets [ ], if the first character of the set is ^, the set is negated: it matches any character that is not in the set.

Imagine we have a consensus DNA sequence that contains ambiguity codes, and we
want to extract all runs of contiguous unambiguous bases. We need to split the DNA
string wherever we see a base that isn't A, T, G or C.

In [76]:
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
cons = re.split(r"[^ACTG]", dna)  # split wherever a character other than A, C, T or G occurs
print(cons)

['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
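
The example sequence has no adjacent ambiguity codes and none at either end. When those occur, splitting on a single character leaves empty strings in the result. One possible way to handle this, shown as a sketch on a made-up sequence, is to split on runs of ambiguous characters and drop any empty pieces.

```python
import re

# Made-up sequence with leading, trailing and adjacent ambiguity codes
dna = "NACTNNGCATRGCTY"

# Splitting on single characters produces empty strings
print(re.split(r"[^ACTG]", dna))

# Treat a run of ambiguity codes as one separator, then drop empty pieces
runs = [run for run in re.split(r"[^ACTG]+", dna) if run]
print(runs)
```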


### re.findall() method
The findall() function returns a list containing all matches.

In [77]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[TC]{1,5}", dna)
print(runs)

['CT', 'C', 'TT', 'T', 'TC', 'T', 'C', 'TT', 'T', 'C', 'C', 'C']


In [80]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.findall(r"[AT]{4,100}", dna)  # runs of A/T that are between 4 and 100 bases long
print(runs)

['ATTATAT', 'AAATTATA']


#### Finding multiple matches using re.finditer() function
* If we want to do anything more complicated than simply extracting the text of the matches, we need to use the re.finditer method.
* finditer returns an iterator of Match objects, one for each non-overlapping match.

In [81]:
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{3,100}", dna)
for match in runs:
    run_start = match.start()
    run_end = match.end()
    print("AT rich region from " + str(run_start) + " to " + str(run_end))

AT rich region from 5 to 12
AT rich region from 18 to 26
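
Because each item is a full Match object, the matched text and its coordinates can be collected together. The following sketch gathers them into a list of tuples; the variable name `regions` is an illustrative choice.

```python
import re

dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"

# Collect (start, end, text) for every AT-rich run of at least 3 bases
regions = [(m.start(), m.end(), m.group())
           for m in re.finditer(r"[AT]{3,100}", dna)]

for start, end, text in regions:
    print(f"{text}: {start}-{end}, length {end - start}")
```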


# Introduction to Biopython 

## What is Biopython
* Biopython is a set of freely available tools for biological computation written in Python by an international team of developers.
* The goal of Biopython is to make it as easy as possible to use Python for bioinformatics

* The <a href="http://www.biopython.org" target="_top">Biopython website</a> provides an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research.

### What can Biopython do?

1. Parse bioinformatics files (FASTA, GenBank, Blast output, ...)
2. Perform common operations on sequences, such as translation and transcription.
3. Perform classification of data using K-Nearest Neighbors, Naive Bayes or Support Vector Machines.
4. Access certain databases.

In [82]:
import Bio

In [83]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
print(my_seq)


AGTACACTGGT


In [84]:
print('Complement',my_seq.complement())

Complement TCATGTGACCA
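
The complement maps each base to its pairing partner. As a rough illustration of the idea on plain strings (not Biopython's own implementation), the mapping can be sketched with str.maketrans. The names `COMPLEMENT`, `complement` and `reverse_complement` below are hypothetical helpers, and only the four unambiguous bases are handled.

```python
# Map each unambiguous base to its pairing partner
COMPLEMENT = str.maketrans("ATGC", "TACG")

def complement(dna):
    """Return the complement of a plain DNA string."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]

print(complement("AGTACACTGGT"))
print(reverse_complement("AGTACACTGGT"))
```

Seq objects also provide a reverse_complement() method, which is usually the more convenient choice when working with Biopython.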


In [85]:
# You can access elements of the sequence in the same way as for strings (but remember, Python counts from zero!):

print(my_seq[0]) #first letter
print(my_seq[2]) #third letter
print(my_seq[-1]) #last letter


A
T
T


In [86]:
# Seq.count gives a non-overlapping count
my_seq2 = Seq("TATTATAT")
print(my_seq2.count("TAT")) # prints 2; counting overlapping occurrences would give 3

2
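
If overlapping occurrences matter, one option is a regular expression with a zero-width lookahead, which does not consume the characters it matches. The sketch below works on a plain Python string rather than a Seq object.

```python
import re

sequence = "TATTATAT"

# (?=TAT) matches at every position where TAT begins, so matches may overlap
overlap_starts = [m.start() for m in re.finditer(r"(?=TAT)", sequence)]

print(overlap_starts)         # start position of every occurrence
print(len(overlap_starts))    # overlapping count
print(sequence.count("TAT"))  # non-overlapping count, for comparison
```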


### Slicing a sequence

In [87]:
my_seq3 = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq3[4:12]


Seq('GATGGGCC')

### Concatenating or adding sequences

In [88]:
protein_seq = Seq("EVRNAK")
dna_seq = Seq("ACGT")
protein_seq + dna_seq

Seq('EVRNAKACGT')

## Transcription
Transcription changes the coding strand to mRNA (T → U).


In [89]:
coding_dna=Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
messenger_rna = coding_dna.transcribe()
messenger_rna


Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

#### back-transcription
The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U → T substitution:

In [90]:
bck_transcription=messenger_rna.back_transcribe()
bck_transcription

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
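
Since both operations are plain substitutions on the coding strand, they can be imitated on ordinary Python strings. This is only a sketch of the idea, not how Biopython implements it.

```python
coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

# Transcription of the coding strand: replace every T with U
mrna = coding_dna.replace("T", "U")

# Back-transcription: replace every U with T
back = mrna.replace("U", "T")

print(mrna)
print(back == coding_dna)
```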

## Translation
Translation refers to the process of creating proteins from an mRNA template.
* Note: translation uses codon table objects derived from NCBI information.

In [91]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table)


Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [92]:
# You can translate from the mRNA sequence
messenger_rna.translate()

Seq('MAIVMGR*KGAR*')

In [93]:
# You can also translate directly from the coding strand DNA sequence:
coding_dna.translate()


Seq('MAIVMGR*KGAR*')

In [94]:
coding_dna.translate(to_stop=True)


Seq('MAIVMGR')

In [95]:
# You can change the stop symbol
coding_dna.translate(stop_symbol="@")

Seq('MAIVMGR@KGAR@')
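
To make the role of the codon table concrete, here is a minimal pure-Python sketch of translation with the standard table shown above. It is an illustration only, not Biopython's implementation: the names `CODON_TABLE` and `translate` are hypothetical, and the sketch handles only unambiguous DNA, ignoring any incomplete codon at the end.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna, to_stop=False, stop_symbol="*"):
    """Translate a coding-strand DNA string codon by codon."""
    protein = []
    # Step through complete codons only
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            if to_stop:
                break  # stop at the first stop codon
            aa = stop_symbol
        protein.append(aa)
    return "".join(protein)

coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
print(translate(coding_dna, stop_symbol="@"))
```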

# Assignment
1. Open a FASTA file and print its contents in the following form:
* Ex:
* gi|2765658|emb|Z78533.1|CIZ78533
* Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC')
* 740
2. Explain the difference between Seq and MutableSeq.