**Biological sequences are arguably the central object in bioinformatics.**

We’ll introduce the Biopython mechanism for dealing with sequences, the `Seq` object.

Sequences are essentially strings of letters like `AGTACACTGGT`, which seems very natural since this is the most common way that sequences are seen in biological file formats.
The most important difference between `Seq` objects and standard Python strings is that they have different methods. Although the `Seq` object supports many of the same methods as a plain string, its `translate()` method differs by doing biological translation, and there are also additional biologically relevant methods like `reverse_complement()`.

In most ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements:

### Sequences act like strings

In [1]:
from Bio.Seq import Seq

my_seq = Seq("GATCG")
for i, l in enumerate(my_seq):
    print("%i %s " % (i, l))

0 G 
1 A 
2 T 
3 C 
4 G 


In [2]:
len(my_seq)

5

In [4]:
my_seq[0]

'G'

In [5]:
my_seq[2]

'T'

In [7]:
x = Seq("AAAA")
x.count("A"), x.count("AA")

(4, 2)
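
As with `str.count()`, the count above is non-overlapping, which is why `"AA"` is reported only twice in `"AAAA"`. If overlapping matches matter, one option is to test every start position. This is a minimal sketch on a plain string; `count_overlapping` is a hypothetical helper name, not part of Biopython.

```python
def count_overlapping(sequence: str, sub: str) -> int:
    """Count occurrences of sub in sequence, allowing overlaps."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    last_start = len(sequence) - len(sub)
    # Check each possible start position; True counts as 1 in sum().
    return sum(sequence.startswith(sub, i) for i in range(last_start + 1))


print("AAAA".count("AA"))               # non-overlapping count
print(count_overlapping("AAAA", "AA"))  # overlapping count
```

To use it with a `Seq` object, pass `str(my_seq)`.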

In [8]:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
len(my_seq)

32

In [9]:
my_seq.count("G")

9

In [12]:
my_seq.count("C")

6

#### Activity
*`Write a short Python snippet to calculate the 'GC' content of a sequence, as a fraction or a percentage.`*

In [None]:
((my_seq.count("G")+my_seq.count("C"))/len(my_seq)) * 100

46.875

In [14]:
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
gc_fraction(my_seq)

0.46875
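
For the activity, the calculation can also be wrapped in a small reusable function. The sketch below works on a plain string, ignores case, and makes no attempt to handle ambiguous nucleotide codes; `gc_content` is a hypothetical name, and returning `0.0` for an empty sequence is an assumption made here to avoid dividing by zero.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C letters, ignoring case."""
    if not sequence:
        return 0.0  # assumption: report 0 for an empty sequence
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(gc_content(dna))          # as a fraction
print(f"{gc_content(dna):.1%}")  # as a percentage
```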

### Slicing a sequence

In [2]:
from Bio.Seq import Seq

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq[4:12]

Seq('GATGGGCC')

In [3]:
my_seq[::-1]

Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG')

In [16]:
my_seq[0::3]

Seq('GCTGTAGTAAG')

In [17]:
my_seq[1::3]

Seq('AGGCATGCATC')

In [18]:
my_seq[2::3]

Seq('TAGCTAAGAC')
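
With a stride of 3, the three slices above collect the first, second, and third codon positions, assuming the reading frame starts at index 0. A related task is splitting the sequence into codons. The sketch below does this on a plain string; `split_codons` is a hypothetical helper, and dropping a trailing partial codon is a choice made for this example.

```python
def split_codons(sequence: str) -> list:
    """Split a sequence into complete codons, dropping a trailing partial codon."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
codons = split_codons(dna)
print(codons)

# The first letter of each codon matches the stride-3 slice, except for
# the letters of the trailing partial codon that split_codons drops.
first_positions = "".join(codon[0] for codon in codons)
print(first_positions, dna[0::3])
```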

### Turning Seq objects into strings

If you really do just need a plain string, for example to write to a file, or insert into a database, then this is very easy to get:

In [21]:
str(my_seq)

'GATCGATGGGCCTATATAGGATCGAAAATCGC'

In [23]:
fasta_format_string = ">Name\n%s\n" % my_seq
print(fasta_format_string)

>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
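
FASTA files usually wrap long sequences over several lines. The sketch below extends the string formatting above with the standard library's `textwrap` module. `format_fasta` is a hypothetical helper and the width of 60 letters is an assumed default; for real files, Biopython's `Bio.SeqIO` module is the usual route.

```python
import textwrap


def format_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Build a FASTA record with the sequence wrapped at `width` letters."""
    lines = textwrap.wrap(sequence, width=width)
    return ">%s\n%s\n" % (name, "\n".join(lines))


# A 96-letter example sequence, so the wrapping is visible.
long_dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC" * 3
record = format_fasta("Name", long_dna)
print(record)
```

With a `Seq` object, pass `str(my_seq)` as the sequence argument.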



### Concatenating or adding sequences

In [None]:
from Bio.Seq import Seq

seq1 = Seq("ACGT")
seq2 = Seq("AACCGG")
seq1 + seq2

Seq('ACGTAACCGG')

In [25]:
list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")]

concatenated = Seq("")

for s in list_of_seqs:
    concatenated += s
concatenated

Seq('ACGTAACCGGTT')

In [26]:
c = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]

spacer = Seq("N" * 10)

spacer.join(c)

Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')

### Changing case

In [28]:
dna_seq = Seq("acgtACGT")
dna_seq

Seq('acgtACGT')

In [29]:
dna_seq.upper()

Seq('ACGTACGT')

In [30]:

dna_seq.lower()

Seq('acgtacgt')

### Nucleotide sequences and (reverse) complements

For nucleotide sequences, you can easily obtain the complement or reverse complement of a `Seq` object using its built-in methods.

In [31]:
from Bio.Seq import Seq

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')

In [32]:
my_seq.complement()

Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG')

In [4]:
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
my_seq

Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC')

In [33]:
my_seq.reverse_complement()

Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC')
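
To see what these two methods compute, here is a minimal plain-string sketch for unambiguous DNA letters only. It is not how Biopython implements them. Note that `str.translate()` here is the ordinary character-mapping method of Python strings, not biological translation; `reverse_complement_str` is a hypothetical name.

```python
# Map each unambiguous DNA letter to its complement, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement_str(dna: str) -> str:
    """Complement every base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(dna.translate(COMPLEMENT))    # complement only
print(reverse_complement_str(dna))  # reverse complement
```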

### Transcription

\begin{gathered}
    \text{DNA coding strand (aka Crick strand, strand } +1 \text{)} \\
    \text{5'} \qquad \texttt{ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG} \qquad \text{3'} \\
    \texttt{|||||||||||||||||||||||||||||||||||||||} \\
    \text{3'} \qquad \texttt{TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC} \qquad \text{5'} \\
    \text{DNA template strand (aka Watson strand, strand } -1 \text{)}
\end{gathered}

\begin{gathered}
    \text{5'} \qquad \texttt{AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG} \qquad \text{3'} \\
    \text{Single-stranded messenger RNA}
\end{gathered}

In [34]:
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

In [35]:
template_dna = coding_dna.reverse_complement()
template_dna

Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT')

In [36]:
messenger_rna = coding_dna.transcribe()
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

In [37]:
messenger_rna.back_transcribe()

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
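
Biologically, transcription reads the template strand, whereas `transcribe()` is applied to the coding strand and amounts to swapping T for U. The plain-string sketch below, for unambiguous uppercase DNA only, checks that both routes give the same mRNA. The variable names are chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"

# Shortcut on the coding strand: swap T for U.
mrna_from_coding = coding_dna.replace("T", "U")

# Biological route: start from the template strand (written 5' to 3'),
# take its reverse complement to recover the coding strand, then swap T for U.
template_dna = coding_dna.translate(COMPLEMENT)[::-1]
mrna_from_template = template_dna.translate(COMPLEMENT)[::-1].replace("T", "U")

print(mrna_from_coding)
print(mrna_from_coding == mrna_from_template)

# Back-transcription is the reverse swap.
print(mrna_from_coding.replace("U", "T") == coding_dna)
```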

### Translation

In [None]:
from Bio.Seq import Seq

messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
messenger_rna

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG')

In [40]:
messenger_rna.translate()

Seq('MAIVMGR*KGAR*')

In [41]:
# You can also translate directly from the coding strand DNA sequence:

from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

coding_dna.translate()

Seq('MAIVMGR*KGAR*')

In [43]:
coding_dna.translate(table="Vertebrate Mitochondrial")

Seq('MAIVMGRWKGAR*')

In [48]:
# You can also specify the table using the NCBI table number, which is shorter and is often included in the feature annotation of GenBank files:

coding_dna.translate(table=2)

Seq('MAIVMGRWKGAR*')

In [45]:
# Now, you may want to translate the nucleotides up to the first in-frame stop codon, and then stop (as happens in nature):

print(coding_dna.translate())
print(coding_dna.translate(to_stop=True))
print(coding_dna.translate(table=2))
print(coding_dna.translate(table=2, to_stop=True))


MAIVMGR*KGAR*
MAIVMGR
MAIVMGRWKGAR*
MAIVMGRWKGAR
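
To make the effect of `to_stop` concrete, here is a minimal plain-Python sketch of codon-by-codon translation with the standard table only. It is not Biopython's implementation: it handles only unambiguous uppercase DNA, and `translate_dna` and `STANDARD_TABLE` are hypothetical names. The 64-letter string lists the amino acids for all codons with the bases ordered T, C, A, G.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons of the standard table, in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_dna(dna: str, to_stop: bool = False) -> str:
    """Translate unambiguous coding-strand DNA, codon by codon."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino = STANDARD_TABLE[dna[i:i + 3]]
        if amino == "*" and to_stop:
            break  # stop at the first in-frame stop codon
        protein.append(amino)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate_dna(coding_dna))
print(translate_dna(coding_dna, to_stop=True))
```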


In [50]:
# You can even specify the stop symbol if you don’t like the default asterisk:

coding_dna.translate(table=2, stop_symbol="$")

Seq('MAIVMGRWKGAR$')

In [51]:
gene = Seq(
    "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
    "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
    "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
    "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
    "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA"
)

gene

Seq('GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCC...TAA')

In [52]:
gene.translate(table="Bacterial")

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*')

In the bacterial genetic code, GTG is a valid start codon. It normally encodes valine, but when used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS:

In [53]:
gene.translate(table="Bacterial", cds=True)

Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR')

In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you’ll get an exception if not).
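
As a rough illustration of what "a valid CDS" means, the sketch below runs the kind of checks a complete coding sequence is expected to pass: a length that is a multiple of three, a start codon at the beginning, a stop codon at the end, and no in-frame stop before it. It is a plain-Python approximation, not Biopython's code. `cds_problems` is a hypothetical helper, the codon sets are assumed values for the bacterial table, and it returns a list of problems where Biopython would raise an exception.

```python
# Assumed start and stop codons for the bacterial table (NCBI table 11).
START_CODONS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(dna: str) -> list:
    """Return reasons why dna is not a complete CDS (empty list if none)."""
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in START_CODONS:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("contains an in-frame stop codon before the end")
    return problems


print(cds_problems("GTGAAAAAGATGTAA"))  # a short complete CDS
print(cds_problems("GTGTAAAAGATGTA"))   # fails several checks
```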

### Translation Tables

This section looks at two tables: the standard translation table and the translation table for vertebrate mitochondrial DNA.

Proteins do the real work in the cell (enzymes, hormones, structure, signaling). DNA is the instruction manual, RNA is the messenger, and protein is the final product.

In [63]:
from Bio.Data import CodonTable

standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

In [62]:
# Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:

# standard_table = CodonTable.unambiguous_dna_by_id[1]
# mito_table = CodonTable.unambiguous_dna_by_id[2]

In [64]:
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [65]:
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [67]:
mito_table.start_codons, mito_table.stop_codons

(['ATT', 'ATC', 'ATA', 'ATG', 'GTG'], ['TAA', 'TAG', 'AGA', 'AGG'])

In [68]:
mito_table.forward_table["ACG"]

'T'
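
Comparing the two printed tables by eye is error-prone, so here is a small plain-Python sketch that lists the codons whose meaning differs. Each table is written as a 64-letter string of amino acids with the bases ordered T, C, A, G, transcribed from the tables printed above. The names `STANDARD`, `VERT_MITO`, and `differences` are chosen for illustration, and start-codon differences are not covered.

```python
from itertools import product

BASES = "TCAG"
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERT_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

codons = ["".join(c) for c in product(BASES, repeat=3)]

# Collect codons that are read differently in the two tables.
differences = {
    codon: (std, mito)
    for codon, std, mito in zip(codons, STANDARD, VERT_MITO)
    if std != mito
}

for codon, (std, mito) in differences.items():
    print(codon, std, "->", mito)
```

In this encoding `*` marks a stop codon.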

### Comparing Seq objects

In [72]:
from Bio.Seq import Seq

seq1 = Seq("ACGT")
"ACGT" == seq1

True

In [73]:
seq1 == "ACGT"

True

### Sequences with unknown sequence contents

In [75]:
from Bio.Seq import Seq

unknown_seq = Seq(None, 10)

# The Seq object thus created has a well-defined length. Any attempt to access the sequence contents, however, will raise an UndefinedSequenceError:

In [76]:
unknown_seq

Seq(None, length=10)

In [78]:
len(unknown_seq)

10

In [79]:
print(unknown_seq)

UndefinedSequenceError: Sequence content is undefined

### Sequences with partially defined sequence contents

In [80]:
from Bio.Seq import Seq

seq = Seq({117512683: "TTGAAAACCTGAATGTGAGAGTCAGTCAAGGATAGT"}, length=159345973)

In [81]:
seq[1000:1020]

Seq(None, length=20)

In [82]:
seq[117512690:117512700]

Seq('CCTGAATGTG')

In [83]:
seq[117512670:117512690]

Seq({13: 'TTGAAAA'}, length=20)

In [86]:
seq = Seq("ACGT")
undefined_seq = Seq(None, length=10)
x = seq + undefined_seq + seq
x

Seq({0: 'ACGT', 14: 'ACGT'}, length=18)

### MutableSeq objects

In [90]:
from Bio.Seq import Seq

my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
my_seq[5] = "G"

TypeError: 'Seq' object does not support item assignment

In [91]:
from Bio.Seq import MutableSeq

mutable_seq = MutableSeq(my_seq)
mutable_seq

MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [94]:
# Alternatively, you can create a MutableSeq object directly from a string:

from Bio.Seq import MutableSeq

mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
mutable_seq

MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [95]:
mutable_seq[5] = "C"
mutable_seq

MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [96]:
mutable_seq.remove("T")
mutable_seq

MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [97]:
mutable_seq.reverse()
mutable_seq

MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG')

In [98]:
# Once you have finished editing your MutableSeq object, it’s easy to get back to a read-only Seq object should you need to:

from Bio.Seq import Seq

new_seq = Seq(mutable_seq)
new_seq

Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG')

### Finding subsequences

Sequence objects have `find`, `rfind`, `index`, and `rindex` methods that perform the same function as the corresponding methods on plain string objects. The only difference is that the subsequence can be a string (`str`), `bytes`, `bytearray`, `Seq`, or `MutableSeq` object:

In [111]:
from Bio.Seq import Seq, MutableSeq

seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
seq.index("ATGGGCCGC")

9

In [103]:
seq.index(b"ATGGGCCGC")

9

In [105]:
seq.index(bytearray(b"ATGGGCCGC"))

9

In [104]:
seq.index(Seq("ATGGGCCGC"))

9

In [106]:
seq.index(MutableSeq("ATGGGCCGC"))

9

In [107]:
# A ValueError is raised if the subsequence is not found:

seq.index("ACTG")

ValueError: subsection not found

In [108]:
# while the find method returns -1 if the subsequence is not found:

seq.find("ACTG")

-1

In [109]:
seq.find("CC")

1

In [110]:
seq.rfind("CC")

29
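
`find` reports only the first match and `rfind` only the last. To list every start position of one subsequence, overlaps included, a common pattern is to call `find` repeatedly with a moving start offset. The sketch below uses a plain string; `find_all` is a hypothetical helper, not a Biopython method.

```python
def find_all(sequence: str, sub: str) -> list:
    """Return the start index of every occurrence of sub, overlaps included."""
    positions = []
    start = sequence.find(sub)
    while start != -1:
        positions.append(start)
        # Resume one position after the last hit so overlaps are found.
        start = sequence.find(sub, start + 1)
    return positions


dna = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"
print(find_all(dna, "CC"))
print(find_all(dna, "GGG"))
```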

In [113]:
# Use the search method to search for multiple subsequences at the same time. This method returns an iterator:

for index, sub in seq.search(["CC", "GGG", "CC"]):
    print(index, sub)

1 CC
11 GGG
14 CC
23 GGG
28 CC
29 CC


### Working with strings directly

In [114]:
from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate


my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"

print(reverse_complement(my_string))

print(transcribe(my_string))

print(back_transcribe(my_string))

print(translate(my_string))

CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC
GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG
GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG
AVMGRWKGGRAAG*
