# Biologically relevant computing

Programming lets us perform tasks on data that would otherwise be infeasible by hand, for example calculating the GC content of a gene or genome of interest, or deriving a protein sequence from a DNA sequence. The most important concepts for these tasks are:
-  Reading and writing files
-  For loops
-  Conditional statements
-  Basic data types, e.g. strings, lists, and dictionaries
-  Functions
-  Regular expressions
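
As a first taste, the GC content mentioned above can be computed with a few string operations. This is only an illustrative sketch: the function name `gc_content` and the example sequence are assumptions, not part of the original notebook.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.strip().lower()  # drop any trailing newline and normalise case
    if not seq:  # avoid dividing by zero on an empty sequence
        return 0.0
    return (seq.count("g") + seq.count("c")) / float(len(seq))

print(gc_content("atgcgc"))
```

The concepts listed above (strings, functions, conditionals) are each covered in the sections that follow.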

## Reading and Writing a file

First let's make a file (seqs.txt) containing 4 nucleotide exon sequences, each ending with a newline character.
Note that we assign a variable name to every object. Some names, such as str and list (and file in Python 2), are built-in names in Python; assigning to them hides the built-in, so avoid using them as variable names.

### Writing

In [20]:
seq1 = 'atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga\n'
seq2 = 'tcattcgcatgtaaacattgagctagcgccgtctctgcgcacggatgaatatgtatagtaacgaatggagcgcgtaagggacatctgcggaaagtgcat\n'
seq3 = 'tcattggggtatcgacgaatcactagcatagcgcagccagctgctcgatcgatccagtctgctacgagtaggattctgagctgcatcgacactagtcat\n'
seq4 = 'atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga\n'

outfile  = open("seqs.txt","w")
outfile.write(seq1)
outfile.write(seq2)
outfile.write(seq3)
outfile.write(seq4)
outfile.close()
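
A common alternative is to let a `with` block close the file automatically, even if an error occurs part-way through. A minimal sketch, assuming a hypothetical file name `seqs_demo.txt` written into a temporary directory, with shortened placeholder sequences:

```python
import os
import tempfile

# Hypothetical short sequences, used only for illustration
demo_seqs = ["atgaaatga\n", "tcatttcat\n"]

# Write into a temporary directory so no existing file is overwritten
demo_path = os.path.join(tempfile.mkdtemp(), "seqs_demo.txt")

with open(demo_path, "w") as outfile:
    for seq in demo_seqs:
        outfile.write(seq)
# the file is closed here, without an explicit close()

print(outfile.closed)
```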

### Reading

There are 3 different ways to read data from a file in Python:

-  read
-  readline
-  readlines 

We will look at what each of the three does.

In [21]:
file_connection = open("seqs.txt","r") # make connection to the file

seqs = file_connection.read() # load all the data as one string, stored in the variable seqs

file_connection.close() # always close the connection to the file

print(type(seqs))
print(seqs)

<type 'str'>
atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga
tcattcgcatgtaaacattgagctagcgccgtctctgcgcacggatgaatatgtatagtaacgaatggagcgcgtaagggacatctgcggaaagtgcat
tcattggggtatcgacgaatcactagcatagcgcagccagctgctcgatcgatccagtctgctacgagtaggattctgagctgcatcgacactagtcat
atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga



In [22]:
file_connection = open("seqs.txt","r")

seqs1 = file_connection.readline() # load the data one line at a time

print(type(seqs1)) 
print(seqs1)

seqs1 = file_connection.readline() # read second line 
print(seqs1)

file_connection.close()

<type 'str'>
atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga

tcattcgcatgtaaacattgagctagcgccgtctctgcgcacggatgaatatgtatagtaacgaatggagcgcgtaagggacatctgcggaaagtgcat



In [23]:
file_connection = open("seqs.txt","r")
seqs2 = file_connection.readlines() # load each line of the file and store as a list of sequences
file_connection.close()
print(type(seqs2))
print(seqs2)

<type 'list'>
['atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga\n', 'tcattcgcatgtaaacattgagctagcgccgtctctgcgcacggatgaatatgtatagtaacgaatggagcgcgtaagggacatctgcggaaagtgcat\n', 'tcattggggtatcgacgaatcactagcatagcgcagccagctgctcgatcgatccagtctgctacgagtaggattctgagctgcatcgacactagtcat\n', 'atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga\n']


In most cases readlines() is the easiest to use as it allows us to read all the data as a list and easily perform iterations through the list.
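
Each element returned by `readlines()` still ends in a newline character. The sketch below shows one way to remove it while reading, by looping over the file object directly; the file name and sequences are placeholders chosen for this example.

```python
import os
import tempfile

# Create a small placeholder file so the example is self-contained
demo_path = os.path.join(tempfile.mkdtemp(), "seqs_demo.txt")
with open(demo_path, "w") as handle:
    handle.write("atgaaatga\ntcatttcat\n")

clean_seqs = []
with open(demo_path, "r") as handle:
    # a file object can be looped over line by line
    for line in handle:
        clean_seqs.append(line.strip())  # strip() drops the trailing "\n"

print(clean_seqs)
```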

## For loops 
In programming languages certain data types are iterable, i.e. we can step through each entry in turn. A list is one such type.

Recall how we read the data using ```readlines()``` into a list called ```seqs2```.

The for loop allows us to read and process the sequences one at a time, in the order they appear in the list.

In [24]:
counter = 0
for seq_iter in seqs2:
    counter += 1 # equivalent to counter = counter + 1
    print("Sequence {}".format(counter)) # format() returns a new string with the integer inserted (strings are immutable)
    print(seq_iter)

Sequence 1
atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga

Sequence 2
tcattcgcatgtaaacattgagctagcgccgtctctgcgcacggatgaatatgtatagtaacgaatggagcgcgtaagggacatctgcggaaagtgcat

Sequence 3
tcattggggtatcgacgaatcactagcatagcgcagccagctgctcgatcgatccagtctgctacgagtaggattctgagctgcatcgacactagtcat

Sequence 4
atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga
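
The manual counter can also be replaced by the built-in `enumerate`, which yields a position alongside each item. A sketch using a short placeholder list rather than the `seqs2` list read above:

```python
demo_seqs = ["atgaaatga\n", "tcatttcat\n"]  # placeholder sequences

labels = []
# start=1 makes the numbering begin at 1 instead of Python's default 0
for number, seq in enumerate(demo_seqs, start=1):
    label = "Sequence {}".format(number)
    labels.append(label)
    print(label)
    print(seq.strip())  # strip() removes the newline so no blank line is printed
```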



## Conditional statements

Often in programming we only want to perform a task if some condition is met. The condition can be anything from a variable being equal to some value or object, to the result of a mathematical comparison, as we will see below.

In [25]:
# if / if not
if seqs2:
    print("seqs 2 true")
if not seqs2:
    print("seqs 2 false")

# when false: None (like an empty list or an empty string) evaluates as false
seqs3 = None

if seqs3:
    print("seqs 3 true")
if not seqs3:
    print("seqs 3 false")
    
# We can rewrite these statements together using else:
if seqs3:
    print("seqs 3 true")
else:
    print("seqs 3 false")


seqs 2 true
seqs 3 false
seqs 3 false


Often there are more than two possible outcomes. In these situations we can use elif ("else if") to test further conditions in turn; only the first branch whose condition is true is run.



In [26]:
if 9%2 == 0:
    print("divisible by 2")
elif 9%3 == 0:
    print("divisible by 3")
else:
    print("Not divisible by 2 or 3")


divisible by 3


We can also chain conditions together using the boolean operators ```and```, ```or``` and ```not```.


In [27]:
if 12%2 == 0 and 12%3 == 0:
    print("multiple factors")
else:
    print("not divisible by 2 and 3")

if 14%7 == 0 or 14%3 == 0:
    print("divisible by 7 or 3")
else:
    print("not divisible by 7 or 3")


multiple factors
divisible by 7 or 3
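
When `and` and `or` are mixed in one condition, `and` is evaluated first, so parentheses are needed to group the `or` terms. A short sketch with made-up boolean values standing in for real checks:

```python
starts_ok = False   # e.g. a sequence that does not start with 'atg'
ends_taa = False
ends_tga = True

# Without parentheses this is read as (starts_ok and ends_taa) or ends_tga
without_parens = starts_ok and ends_taa or ends_tga

# With parentheses the start check applies to every alternative
with_parens = starts_ok and (ends_taa or ends_tga)

print(without_parens)
print(with_parens)
```

The two results differ, which is why grouping matters whenever one requirement must hold together with any of several alternatives.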


We can also negate these conditions.


In [28]:
if not 9%2 == 0:
    print("not divisible by 2")
elif not 9%3 == 0:
    print("not divisible by 3")
else:
    print("divisible by 2 and 3") # reached only when both remainders are 0

###
# "not A and not B" is true only when neither A nor B holds
if not 12%2 == 0 and not 12%3 == 0:
    print("divisible by neither 2 nor 3")
else:
    print("divisible by 2 or 3")

###
# "not A or not B" is false only when both A and B hold
if not 14%7 == 0 or not 14%3 == 0:
    print("not divisible by both 7 and 3")
else:
    print("divisible by 7 and 3")

not divisible by 2
divisible by 2 or 3
not divisible by both 7 and 3


In [29]:
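test = 10
test1 = 100//10   # integer division, also gives the integer 10
test2 = str(10)   # the string '10', a different type and a different object
# id() returns an identifier for the object a name refers to.
# test and test1 share one id here because CPython reuses small integer objects.
print(test, id(test))
print(test1, id(test1))
print(test2, id(test2))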

(10, 94754865456352)
(10, 94754865456352)
('10', 140267934434288)


## Dictionaries

The central dogma states DNA -> mRNA -> protein. In practice this means we can convert a DNA sequence string into its corresponding protein sequence. The easiest way to do this is to use a dictionary and some string processing. A dictionary is a data type that holds key, value pairs. Each key must be unique, whereas a value can be any object and may be repeated. In our context we choose the codon (the triplet of DNA bases encoding an amino acid) as the key and the amino acid as the value. The reason is that several codons can code for the same amino acid (TAT = Tyr and TAC = Tyr), so the amino acid could not serve as a unique key.

In [30]:
# a dictionary is defined using {}; remember that [] defines a list and "" a string
codons = {'aaa': 'K', 'aac': 'N', 'aag': 'K', 'aat': 'N', 'aca': 'T', 'acc': 'T',
 'acg': 'T', 'act': 'T', 'aga': 'R', 'agc': 'S', 'agg': 'R', 'agt': 'S',
 'ata': 'I', 'atc': 'I', 'atg': 'M', 'att': 'I', 'caa': 'Q', 'cac': 'H',
 'cag': 'Q', 'cat': 'H', 'cca': 'P', 'ccc': 'P', 'ccg': 'P', 'cct': 'P', 
 'cga': 'R', 'cgc': 'R', 'cgg': 'R', 'cgt': 'R', 'cta': 'L', 'ctc': 'L',
 'ctg': 'L', 'ctt': 'L', 'gaa': 'E', 'gac': 'D', 'gag': 'E', 'gat': 'D',
 'gca': 'A', 'gcc': 'A', 'gcg': 'A', 'gct': 'A', 'gga': 'G', 'ggc': 'G',
 'ggg': 'G', 'ggt': 'G', 'gta': 'V', 'gtc': 'V', 'gtg': 'V', 'gtt': 'V',
 'taa': '*', 'tac': 'Y', 'tag': '*', 'tat': 'Y', 'tca': 'S', 'tcc': 'S',
 'tcg': 'S', 'tct': 'S', 'tga': '*', 'tgc': 'C', 'tgg': 'W', 'tgt': 'C',
 'tta': 'L', 'ttc': 'F', 'ttg': 'L', 'ttt': 'F'}

Dictionaries can be iterated by key, by value, or by key and value together. The next cells show each in turn.

In [31]:
print(codons.values())

['Y', 'C', 'G', 'S', 'F', 'C', '*', '*', 'Y', 'F', 'S', 'L', 'L', 'S', 'S', 'A', 'V', 'A', 'V', 'A', 'V', 'R', 'V', 'A', 'D', 'L', 'P', 'R', 'R', 'L', 'T', 'R', 'G', 'G', 'G', 'E', 'T', 'D', 'P', 'E', 'T', 'M', 'K', 'K', 'I', 'N', 'I', 'R', 'P', 'S', 'R', 'H', 'N', 'I', 'L', 'L', 'T', 'H', '*', 'Q', 'S', 'Q', 'P', 'W']


In [32]:
print(codons.keys())

['tat', 'tgt', 'ggt', 'tct', 'ttt', 'tgc', 'tag', 'taa', 'tac', 'ttc', 'tcg', 'tta', 'ttg', 'tcc', 'tca', 'gca', 'gta', 'gcc', 'gtc', 'gcg', 'gtg', 'cgt', 'gtt', 'gct', 'gat', 'ctt', 'cct', 'cga', 'cgc', 'ctc', 'aca', 'cgg', 'ggg', 'gga', 'ggc', 'gag', 'acg', 'gac', 'ccg', 'gaa', 'acc', 'atg', 'aag', 'aaa', 'atc', 'aac', 'ata', 'agg', 'cca', 'agc', 'aga', 'cat', 'aat', 'att', 'ctg', 'cta', 'act', 'cac', 'tga', 'caa', 'agt', 'cag', 'ccc', 'tgg']


In [33]:
print(codons.items())

[('tat', 'Y'), ('tgt', 'C'), ('ggt', 'G'), ('tct', 'S'), ('ttt', 'F'), ('tgc', 'C'), ('tag', '*'), ('taa', '*'), ('tac', 'Y'), ('ttc', 'F'), ('tcg', 'S'), ('tta', 'L'), ('ttg', 'L'), ('tcc', 'S'), ('tca', 'S'), ('gca', 'A'), ('gta', 'V'), ('gcc', 'A'), ('gtc', 'V'), ('gcg', 'A'), ('gtg', 'V'), ('cgt', 'R'), ('gtt', 'V'), ('gct', 'A'), ('gat', 'D'), ('ctt', 'L'), ('cct', 'P'), ('cga', 'R'), ('cgc', 'R'), ('ctc', 'L'), ('aca', 'T'), ('cgg', 'R'), ('ggg', 'G'), ('gga', 'G'), ('ggc', 'G'), ('gag', 'E'), ('acg', 'T'), ('gac', 'D'), ('ccg', 'P'), ('gaa', 'E'), ('acc', 'T'), ('atg', 'M'), ('aag', 'K'), ('aaa', 'K'), ('atc', 'I'), ('aac', 'N'), ('ata', 'I'), ('agg', 'R'), ('cca', 'P'), ('agc', 'S'), ('aga', 'R'), ('cat', 'H'), ('aat', 'N'), ('att', 'I'), ('ctg', 'L'), ('cta', 'L'), ('act', 'T'), ('cac', 'H'), ('tga', '*'), ('caa', 'Q'), ('agt', 'S'), ('cag', 'Q'), ('ccc', 'P'), ('tgg', 'W')]


In [34]:
for key in codons.keys():
    if key =="atg":
        print("found {} in dictionary\n".format(key))
        
for value in codons.values():
    if value == "T":
        print("found {} in dictionary".format(value))
        
for key,value in codons.items():
    if value == "T":
        print("The triplets encoding {} are {}".format(value, key))

found atg in dictionary

found T in dictionary
found T in dictionary
found T in dictionary
found T in dictionary
The triplets encoding T are aca
The triplets encoding T are acg
The triplets encoding T are acc
The triplets encoding T are act
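
Looping is useful for searching by value, but a key can also be looked up directly, which is the main reason to use a dictionary. A sketch with a small subset of the codon table; the subset and the `amino_to_codons` name are assumptions made for this example.

```python
# A four-codon subset of the table above, for illustration only
codon_subset = {"atg": "M", "aca": "T", "acc": "T", "tga": "*"}

print(codon_subset["atg"])           # direct lookup by key
print("nnn" in codon_subset)         # membership test on the keys
print(codon_subset.get("nnn", "X"))  # get() returns a default instead of raising KeyError

# Invert the mapping: amino acid -> list of codons that encode it
amino_to_codons = {}
for codon, amino in codon_subset.items():
    amino_to_codons.setdefault(amino, []).append(codon)
for amino in amino_to_codons:
    amino_to_codons[amino].sort()  # sort so the order does not depend on the dictionary

print(amino_to_codons["T"])
```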


We also need to perform some checks on our sequences. Is a start triplet (ATG) present? Do we need to reverse complement the sequence? First we need some string processing: the length of the string must be divisible by 3. Remember that Python indices begin at 0.

In [35]:
for seq_iter in seqs2:
    if len(seq_iter.strip())%3 == 0: # remember the string has a newline character
        print("good")

good
good
good
good


Protein-coding sequences begin with the start codon '*atg*' at the DNA level.

In [36]:
counter = 0
for seq_iter in seqs2:
    counter += 1
    if seq_iter.strip().startswith('atg'):
        print("sequence {} good ".format(counter))
    elif seq_iter.strip().endswith('cat'):
        print("reverse complement sequence {}".format(counter))

sequence 1 good 
reverse complement sequence 2
reverse complement sequence 3
sequence 4 good 


Because DNA is double stranded, a sequence may be given in the reverse complementary orientation, i.e. 'atg' -> 'cat' (a:t and c:g pairing, with the 5' and 3' direction switched). We can use a simple function to correct this; we will look at functions in more detail later. The function uses the string slice ```string[::-1]```, which reads the string from right to left, so that all strings (DNA sequences) end up orientated in the same 5' -> 3' direction. A dictionary supplies the complement of each base, a for loop written in list comprehension form builds the list of complemented bases, and the join function converts that list back into a string.

In [37]:
def reverse_complement(dna):
    complement = {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
    return ''.join([complement[base] for base in dna[::-1]])

reverse_complement('atg')

'cat'
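
An alternative sketch of the same idea uses a translation table instead of a dictionary and a loop. This assumes Python 3, where `str.maketrans` is available, and the function name `reverse_complement_tr` is made up for this example.

```python
# Map each base to its partner, in both lower and upper case
COMPLEMENT_TABLE = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement_tr(dna):
    """Complement every base, then reverse the string to read 5' -> 3'."""
    return dna.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_tr("atg"))
```

Unlike the dictionary version, characters that are not in the table (for example `n`) are passed through unchanged rather than raising a `KeyError`; whether that is desirable depends on how strict the input checking should be.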

In [38]:
seq_aligned = []
for seq_iter in seqs2:
    if seq_iter.strip().startswith('atg'):
        seq_aligned.append(seq_iter.strip())
    elif seq_iter.strip().endswith('cat'):
        seq_aligned.append(reverse_complement(seq_iter.strip()))

print(seqs2)
print(seq_aligned) # newline character (\n) removed and sequences reverse complemented where needed


['atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga\n', 'tcattcgcatgtaaacattgagctagcgccgtctctgcgcacggatgaatatgtatagtaacgaatggagcgcgtaagggacatctgcggaaagtgcat\n', 'tcattggggtatcgacgaatcactagcatagcgcagccagctgctcgatcgatccagtctgctacgagtaggattctgagctgcatcgacactagtcat\n', 'atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga\n']
['atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga', 'atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga', 'atgactagtgtcgatgcagctcagaatcctactcgtagcagactggatcgatcgagcagctggctgcgctatgctagtgattcgtcgataccccaatga', 'atgcactttccgcagatgtcccttacgcgctccattcgttactatacatattcatccgtgcgcagagacggcgctagctcaatgtttacatgcgaatga']


We will also want to break our DNA sequence up into triplet codons. We can do this by slicing the DNA sequence string every 3 bases using ```string[pos:pos+3]``` and appending each slice to a list called ```triplets```.

In [39]:
for seq_iter in seq_aligned:
    triplets = []
    for i in range(0, len(seq_iter), 3):
        triplets.append(seq_iter[i:i+3])
    print(triplets) 

['atg', 'act', 'agt', 'gtc', 'gat', 'gca', 'gct', 'cag', 'aat', 'cct', 'act', 'cgt', 'agc', 'aga', 'ctg', 'gat', 'cga', 'tcg', 'agc', 'agc', 'tgg', 'ctg', 'cgc', 'tat', 'gct', 'agt', 'gat', 'tcg', 'tcg', 'ata', 'ccc', 'caa', 'tga']
['atg', 'cac', 'ttt', 'ccg', 'cag', 'atg', 'tcc', 'ctt', 'acg', 'cgc', 'tcc', 'att', 'cgt', 'tac', 'tat', 'aca', 'tat', 'tca', 'tcc', 'gtg', 'cgc', 'aga', 'gac', 'ggc', 'gct', 'agc', 'tca', 'atg', 'ttt', 'aca', 'tgc', 'gaa', 'tga']
['atg', 'act', 'agt', 'gtc', 'gat', 'gca', 'gct', 'cag', 'aat', 'cct', 'act', 'cgt', 'agc', 'aga', 'ctg', 'gat', 'cga', 'tcg', 'agc', 'agc', 'tgg', 'ctg', 'cgc', 'tat', 'gct', 'agt', 'gat', 'tcg', 'tcg', 'ata', 'ccc', 'caa', 'tga']
['atg', 'cac', 'ttt', 'ccg', 'cag', 'atg', 'tcc', 'ctt', 'acg', 'cgc', 'tcc', 'att', 'cgt', 'tac', 'tat', 'aca', 'tat', 'tca', 'tcc', 'gtg', 'cgc', 'aga', 'gac', 'ggc', 'gct', 'agc', 'tca', 'atg', 'ttt', 'aca', 'tgc', 'gaa', 'tga']


We can use the above triplets directly to get the protein sequence by querying our dictionary as a lookup table, *i.e.* for each nucleotide triplet, what is the corresponding amino acid?

In [40]:
proteins_list = []

for seq_iter in seq_aligned:
    triplets = []
    protein = []
    for i in range(0, len(seq_iter), 3):
        triplets.append(seq_iter[i:i+3])
    for trip in triplets:
        protein.append(codons[trip])
    proteins_list.append("".join(protein))

print(proteins_list)     

['MTSVDAAQNPTRSRLDRSSSWLRYASDSSIPQ*', 'MHFPQMSLTRSIRYYTYSSVRRDGASSMFTCE*', 'MTSVDAAQNPTRSRLDRSSSWLRYASDSSIPQ*', 'MHFPQMSLTRSIRYYTYSSVRRDGASSMFTCE*']
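
Looking up `codons[trip]` raises a `KeyError` for any triplet that is not in the table, such as one containing `n` or an incomplete codon at the end of a sequence. Below is a hedged sketch of a more forgiving lookup. The helper name `translate`, the use of `X` for unknown codons, and the compact way of building the standard codon table are choices made for this example, not part of the original notebook.

```python
from itertools import product

# Build the standard codon table compactly (base order t, c, a, g)
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.strip().lower()
    protein = []
    # stop before any trailing partial codon
    for i in range(0, len(dna) - len(dna) % 3, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)

print(translate("atgactagttga"))
print(translate("atgnnntga"))
```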


## Functions
Let's make a function that
-  takes a file as input
-  checks whether each DNA sequence begins with atg and ends with a stop codon
-  if not, checks whether its reverse complement does
-  returns a list of protein sequences



In [41]:
def reverse_complement(dna):
    complement = {'a': 't', 'c': 'g', 'g': 'c', 't': 'a'}
    return ''.join([complement[base] for base in dna[::-1]])

def get_prot_seq(infile):
    stop_codons = ('taa', 'tag', 'tga')
    file_connection = open(infile,"r")
    seqs2 = file_connection.readlines() 
    file_connection.close()
    seq_aligned = []
    proteins_list=[]
    for seq_iter in seqs2:
        seq = seq_iter.strip()
        rev = reverse_complement(seq)
        # endswith accepts a tuple, so the atg check applies to every stop codon
        if seq.startswith('atg') and seq.endswith(stop_codons):
            seq_aligned.append(seq)
        elif rev.startswith('atg') and rev.endswith(stop_codons):
            seq_aligned.append(rev)
        else:
            continue # skip sequences with no atg ... stop codon in either orientation
    for seq_iter2 in seq_aligned:
        triplets = []
        protein = []
        for i in range(0, len(seq_iter2), 3):
            triplets.append(seq_iter2[i:i+3])
        for trip in triplets:
            protein.append(codons[trip])
        proteins_list.append("".join(protein))
    return proteins_list


proteins = get_prot_seq("seqs.txt")

print(proteins)

['MTSVDAAQNPTRSRLDRSSSWLRYASDSSIPQ*', 'MHFPQMSLTRSIRYYTYSSVRRDGASSMFTCE*', 'MTSVDAAQNPTRSRLDRSSSWLRYASDSSIPQ*', 'MHFPQMSLTRSIRYYTYSSVRRDGASSMFTCE*']


## Regular Expressions

Regular expressions, or regex, can be used to find strings that match a pattern. For a broader description and examples of regex go [here](http://www.maths.nuigalway.ie/~rossmann/cs103/). We will start by looking at some simplified examples and then a more typical biological scenario.


An example of how regex are used is identifying DNA sequences where a protein is predicted to bind, e.g. proteinX binds at *caga* sites in the genome.
We can use re.search to find whether a pattern is contained in the DNA string.

In [42]:
import re 
for seq_iter2 in seq_aligned:
    match = re.search('caga', seq_iter2)
    print(match)

<_sre.SRE_Match object at 0x7f92ac69c440>
<_sre.SRE_Match object at 0x7f92ac69c168>
<_sre.SRE_Match object at 0x7f92ac69c440>
<_sre.SRE_Match object at 0x7f92ac69c168>


In [43]:
counter = 0
for seq_iter2 in seq_aligned:
    counter += 1
    match = re.search('caga', seq_iter2)
    if match:
        print("sequence {} has match starting at {} ".format(counter,match.start()))

sequence 1 has match starting at 21 
sequence 2 has match starting at 12 
sequence 3 has match starting at 21 
sequence 4 has match starting at 12 
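
`re.search` reports only the first match in each sequence. To list every non-overlapping occurrence, `re.finditer` can be used instead. A sketch on a short made-up sequence:

```python
import re

dna = "ttcagaatcagacc"  # made-up sequence containing 'caga' twice

# finditer yields one match object per non-overlapping occurrence
positions = [match.start() for match in re.finditer("caga", dna)]
print(positions)
```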


In [44]:
# find runs of 4 to 50 consecutive A or T bases (AT-rich regions)
dna = "ACTGCATTATATCGTACGAAATTATACGCGCG"
runs = re.finditer(r"[AT]{4,50}", dna)   
for match in runs:
    print("AT run from {} to {}".format(match.start(),match.end()))

AT run from 5 to 12
AT run from 18 to 26


In [45]:
# search for GA, any 3 nucleotides, AC, any 2 nucleotides, AC; the parentheses capture the variable parts
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))

entire match: GACGTACGTAC
first bit: CGT
second bit: GT


In [46]:
dna = "A34T53GCGTG56ACT23"
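runs = re.split(r"[^ATGC]+", dna) # split on runs of characters other than A, C, G, T
print(runs)
runs = re.split(r'(\D+)', dna) # split on runs of non-digits; the group keeps them in the result
print(runs)
runs = re.findall(r"[^ATCG]+", dna) # all runs of characters other than A, C, G, T
print(runs)
runs = re.findall(r"(\D+)", dna) # all runs of non-digits
print(runs)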

['A', 'T', 'GCGTG', 'ACT', '']
['', 'A', '34', 'T', '53', 'GCGTG', '56', 'ACT', '23']
['34', '53', '56', '23']
['A', 'T', 'GCGTG', 'ACT']


Using a regex pattern we can revisit our function to calculate a protein sequence. We will compare the two functions

In [47]:
import time
def get_prot_seq2(infile):
    stop_codons = ('taa', 'tag', 'tga')
    file_connection = open(infile,"r")
    seqs2 = file_connection.readlines() 
    file_connection.close()
    seq_aligned = []
    proteins_list=[]
    for seq_iter in seqs2:
        seq = seq_iter.strip()
        rev = reverse_complement(seq)
        # endswith accepts a tuple, so the atg check applies to every stop codon
        if seq.startswith('atg') and seq.endswith(stop_codons):
            seq_aligned.append(seq)
        elif rev.startswith('atg') and rev.endswith(stop_codons):
            seq_aligned.append(rev)
        else:
            continue # skip sequences with no atg ... stop codon in either orientation
    for seq_iter2 in seq_aligned:
        protein = []
        # the regex replaces the range()/slice loop: each match is one triplet of a, c, g or t
        triplets = re.findall(r'[acgt]{3}', seq_iter2)
        for trip in triplets:
            protein.append(codons[trip])
        proteins_list.append("".join(protein))
    return proteins_list

# time new regex function
# time.perf_counter() needs Python 3.3+; time.clock() was removed in Python 3.8
start2 = time.perf_counter()
proteins_new = get_prot_seq2('seqs.txt')
end2 = time.perf_counter()

# time original function
start1 = time.perf_counter()
proteins_orig = get_prot_seq('seqs.txt')
end1 = time.perf_counter()

print("function 1 took {}".format(end1-start1))
print("function 2 took {}".format(end2-start2))

function 1 took 0.000237
function 2 took 0.001331


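In this run the regex function took longer, most likely because it searches each sequence for matches to a defined pattern, whereas the original function simply slices the string every 3 characters. Timings of a single call on such a small file vary from run to run, so treat the numbers as indicative only. Note also that the original function makes triplets regardless of what is in the string, *i.e.* non-ACGT characters are included, while the regex version keeps only triplets that match its pattern.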
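
One way to get a steadier comparison is the standard-library `timeit` module, which repeats a call many times. The sketch below compares only the two triplet-splitting strategies on a made-up sequence; the function names and the repeat count are assumptions, and no particular timings are implied.

```python
import re
import timeit

dna = "atgactagt" * 100 + "tga"  # made-up sequence, length divisible by 3

def triplets_by_slicing(seq):
    # slice every 3 characters, whatever they are
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]

def triplets_by_regex(seq):
    # keep only non-overlapping triplets made of a, c, g or t
    return re.findall(r"[acgt]{3}", seq)

# Both approaches should agree on clean, lower-case input
same_result = triplets_by_slicing(dna) == triplets_by_regex(dna)
print(same_result)

slicing_time = timeit.timeit(lambda: triplets_by_slicing(dna), number=1000)
regex_time = timeit.timeit(lambda: triplets_by_regex(dna), number=1000)
print("slicing: {:.4f} s for 1000 calls".format(slicing_time))
print("regex:   {:.4f} s for 1000 calls".format(regex_time))
```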