# Chapter 3: Sequence objects

* Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats.
* The most important difference between Seq objects and standard Python strings is that they have different methods. Although the Seq object supports many of the same methods as a plain string, its translate() method differs by doing biological translation, and there are also additional biologically relevant methods like reverse_complement().

## 3.1 Sequences act like strings

In [7]:
from Bio.Seq import Seq
my_seq = Seq("GATCG")
for index, letter in enumerate(my_seq):
    print("%i %s" % (index, letter))

print( "First element: " + my_seq[0] )
print( "Total elements: " + str( len( my_seq ) ) )

0 G
1 A
2 T
3 C
4 G
First element: G
Total elements: 5


### Count elements

In [10]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")

In [11]:
len(my_seq)

32

In [12]:
my_seq.count("G")

9

In [15]:
my_seq.count("GA")

4
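
The count of 4 for "GA" is a non-overlapping count, which is also how Python's str.count behaves. A minimal sketch of the difference, using plain strings rather than Seq objects; count_overlapping is a hypothetical helper introduced here, not a Biopython function:

```python
def count_overlapping(text, sub):
    """Count occurrences of sub in text, allowing matches to overlap."""
    if not sub:
        raise ValueError("sub must be a non-empty string")
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found.
        start = text.find(sub, start + 1)
    return count


non_overlapping = "AAAA".count("AA")            # str.count skips past each match
overlapping = count_overlapping("AAAA", "AA")   # matches start at index 0, 1 and 2
print(non_overlapping, overlapping)
```

For a pattern like "GA" that cannot overlap with itself, both approaches give the same number.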

#### GC%

In [18]:
100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)

46.875

In [19]:
from Bio.SeqUtils import GC
GC( my_seq )

46.875
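
The manual GC% formula above divides by the sequence length, so it fails on an empty sequence and ignores lower-case letters. A small plain-Python sketch that handles both cases; gc_percent is a hypothetical helper, and returning 0.0 for an empty sequence is an assumption made here, not documented Biopython behaviour:

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a sequence string."""
    if not sequence:
        # Assumption: report 0.0 for an empty sequence instead of dividing by zero.
        return 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100 * gc_count / len(upper)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
```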

## 3.2 Slicing a sequence

In [20]:
my_seq[4:12]

Seq('GATGGGCC')

In [23]:
my_seq[0::2]

Seq('GTGTGCTTTGACAATG')

In [27]:
# Reverse
print( my_seq )
print( my_seq[::-1] )

GATCGATGGGCCTATATAGGATCGAAAATCGC
CGCTAAAAGCTAGGATATATCCGGGTAGCTAG


In [28]:
# Back to string:
str(my_seq)

'GATCGATGGGCCTATATAGGATCGAAAATCGC'

# 3.4 Concatenating or adding sequences

In [30]:
from Bio.Seq import Seq
seq_1 = Seq("ATATAGG")
seq_2 = Seq("ACGT")
seq_1 + seq_2

Seq('ATATAGGACGT')

## Loop-based concat

In [32]:
from Bio.Seq import Seq
list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")]
concatenated = Seq("")
for s in list_of_seqs:
    concatenated += s
concatenated

Seq('ACGTAACCGGTT')

## Join-based concat

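Note: the tutorial shows this example, but it does not work here. The Biopython version used for these cells has no Seq.join method, so both calls below raise AttributeError; more recent Biopython releases do provide Seq.join.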

In [46]:
from Bio.Seq import Seq
concatenated = Seq('NNNNN').join([Seq("AAA")])
concatenated

AttributeError: 'Seq' object has no attribute 'join'

In [47]:
from Bio.Seq import Seq
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N"*10)
spacer.join(contigs)

AttributeError: 'Seq' object has no attribute 'join'
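
For reference, joining with a spacer is expected to place the spacer between consecutive contigs, not at the ends. A minimal sketch of that behaviour using plain Python strings instead of Seq objects, so it runs regardless of the installed Biopython version; the loop variant is an assumption about how one might work around a missing join method:

```python
contigs = ["ATG", "ATCCCG", "TTGCA"]
spacer = "N" * 10

# str.join inserts the spacer between consecutive contigs only.
joined = spacer.join(contigs)
print(joined)

# Equivalent loop, usable where no join method is available.
looped = ""
for index, contig in enumerate(contigs):
    if index > 0:
        looped += spacer
    looped += contig
print(looped == joined)
```

With Seq objects, the same loop should work using the + operator shown in the loop-based example above.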

## 3.6 Nucleotide sequences and (reverse) complements

* DNA: A=T, G≡C
* RNA: A=U, G≡C

* A **complementary sequence** is a nucleic acid sequence of bases that can form a double-stranded structure with another sequence by matching base pairs. For example, the complementary sequence to C-A-T-G (where each letter stands for one of the bases in DNA) is G-T-A-C.
* The **reverse complement** of a DNA sequence is formed by reversing the letters, interchanging A and T and interchanging C and G. Thus the reverse complement of ACCTGAG is CTCAGGT.
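
The two definitions above can be sketched in plain Python with a character translation table. This is an illustration of the rule for unambiguous DNA letters only, not Biopython's implementation; the names DNA_COMPLEMENT, complement and reverse_complement are chosen here for the example:

```python
# Map each DNA base to its pairing partner (upper and lower case).
DNA_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap A with T and C with G, keeping the letter order."""
    return dna.translate(DNA_COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then reverse the letter order."""
    return complement(dna)[::-1]


print(complement("CATG"))             # GTAC
print(reverse_complement("ACCTGAG"))  # CTCAGGT
```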

In [48]:
from Bio.Seq import Seq
my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")
print( my_seq )
print( my_seq.complement() )
print( my_seq.reverse_complement() )

GATCGATGGGCCTATATAGGATCGAAAATCGC
CTAGCTACCCGGATATATCCTAGCTTTTAGCG
GCGATTTTCGATCCTATATAGGCCCATCGATC


# 3.7 Transcription

In [56]:
from Bio.Seq import Seq
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
template_dna = coding_dna.reverse_complement()
print("5' - " + coding_dna + " - 3'")
print("3' - " + template_dna[::-1] + " - 5'")

5' - ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG - 3'
3' - TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC - 5'


## Transcribe

### DNA Coding Strand transcription

"As you can see, all this does is to replace T by U."

In [57]:
coding_dna.transcribe()

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', RNAAlphabet())

In [58]:
coding_dna

Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')

### Template Strand transcription

In [59]:
template_dna.reverse_complement().transcribe()

Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', RNAAlphabet())

### mRNA to DNA

"Again, this is a simple U -> T substitution:"

In [60]:
from Bio.Seq import Seq
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
print( messenger_rna )
print( messenger_rna.back_transcribe() )

AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
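
Since transcription of the coding strand and back-transcription are described as simple letter substitutions, they can be sketched with plain string replacement. This is an illustration of the substitution only, written for ordinary Python strings; the function names mirror the Seq methods but are defined here for the example:

```python
def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return coding_dna.replace("T", "U").replace("t", "u")


def back_transcribe(rna):
    """mRNA back to coding-strand DNA: replace U with T."""
    return rna.replace("U", "T").replace("u", "t")


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
rna = transcribe(coding)
print(rna)
print(back_transcribe(rna) == coding)
```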


# 3.8 Translation

Translation uses the genetic codes (translation tables) defined by NCBI.

* [NCBI Coding Scheme](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)

## Translate from mRNA

In [61]:
from Bio.Seq import Seq
messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
print( messenger_rna )
print( messenger_rna.translate() )

AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
MAIVMGR*KGAR*


## Translate from coding DNA

In [62]:
from Bio.Seq import Seq
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print( coding_dna )
print( coding_dna.translate() )

ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG
MAIVMGR*KGAR*


### Specify the genetic code for the DNA source

In [65]:
print( coding_dna.translate(table="Vertebrate Mitochondrial") )

MAIVMGRWKGAR*


### NCBI table number (instead of using Vertebrate Mitochondrial)

In [67]:
print( coding_dna.translate(table=2) )

MAIVMGRWKGAR*


### To Stop Codon

In [68]:
print( coding_dna.translate(to_stop=True) )

MAIVMGR


In [69]:
print( coding_dna.translate(to_stop=True, table="Vertebrate Mitochondrial") )

MAIVMGRWKGAR
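
To make the effect of the table and to_stop arguments concrete, here is a plain-Python sketch of codon-by-codon translation. It is not Biopython's implementation: it assumes upper- or lower-case unambiguous DNA, ignores any trailing partial codon, and the names STANDARD, VERTEBRATE_MITO and translate are introduced for this example. The mitochondrial table only overrides the four codons that differ from the standard table in the listings of section 3.9.

```python
from itertools import product

BASES = "TCAG"
# Amino acids of the standard genetic code, for codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Vertebrate mitochondrial code: TGA is W, ATA is M, AGA and AGG are stops.
VERTEBRATE_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def translate(dna, table=STANDARD, to_stop=False):
    """Translate DNA codon by codon; '*' marks a stop codon."""
    protein = []
    usable_length = len(dna) - len(dna) % 3  # drop a trailing partial codon
    for start in range(0, usable_length, 3):
        amino_acid = table[dna[start:start + 3].upper()]
        if amino_acid == "*" and to_stop:
            break  # stop before the first in-frame stop codon
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))                                 # MAIVMGR*KGAR*
print(translate(coding_dna, VERTEBRATE_MITO))                # MAIVMGRWKGAR*
print(translate(coding_dna, to_stop=True))                   # MAIVMGR
print(translate(coding_dna, VERTEBRATE_MITO, to_stop=True))  # MAIVMGRWKGAR
```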


## Complete coding sequence CDS

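Passing cds=True tells Biopython that the sequence is a complete coding sequence (CDS). The first codon, here the alternative start codon GTG, is then translated as **methionine** instead of **valine**, and the trailing stop codon is dropped from the protein.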

In [71]:
from Bio.Seq import Seq
gene = Seq( "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT"
            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT"
            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA")

In [72]:
gene.translate(table="Bacterial")

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [73]:
gene.translate(table="Bacterial", cds = True)

Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())

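### An invalid CDS raises an exception

With cds=True, Biopython raises a TranslationError when the sequence is not a valid CDS. The gene below has a length that is not a multiple of three, so the error is "Sequence length 296 is not a multiple of three".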

In [79]:
gene = Seq( "GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA"
            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCACTAAAATTACAGATAGGCGATCGTGAT"
            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT"
            "TATGAATGGCGAGGCAGTCGCTGGCACCTACACGGACCGCCGCCACCGCCCGCCACCAT"
            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA")
gene.translate(table="Bacterial", cds = True)

TranslationError: Sequence length 296 is not a multiple of three
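
The kind of checks a complete CDS has to pass can be sketched in plain Python. This is an illustrative approximation, not Biopython's code: check_cds is a hypothetical helper, it raises ValueError rather than TranslationError, and the STARTS and STOPS sets are small example sets, not a complete NCBI table.

```python
def check_cds(dna, start_codons, stop_codons):
    """Return the codons of dna if it looks like a complete CDS, else raise ValueError."""
    if len(dna) % 3 != 0:
        raise ValueError(f"Sequence length {len(dna)} is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if not codons or codons[0] not in start_codons:
        raise ValueError("First codon is not a start codon")
    if codons[-1] not in stop_codons:
        raise ValueError("Final codon is not a stop codon")
    if any(codon in stop_codons for codon in codons[1:-1]):
        raise ValueError("Extra in-frame stop codon found")
    return codons


# Illustrative codon sets only (assumption), not a full translation table.
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}

print(check_cds("GTGAAATAA", STARTS, STOPS))  # ['GTG', 'AAA', 'TAA']
try:
    check_cds("GTGAAATA", STARTS, STOPS)
except ValueError as error:
    print(error)  # Sequence length 8 is not a multiple of three
```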

# 3.9 Translation Tables

* Internally, translation uses codon table objects derived from the NCBI information in [gc.prt](ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/)
* The same tables are also shown at [wprintgc.cgi](https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)

In [80]:
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

In [81]:
print(standard_table)

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

In [82]:
print(mito_table)

Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

## Example: Find Codons

In [84]:
standard_table.stop_codons

['TAA', 'TAG', 'TGA']

In [85]:
standard_table.start_codons

['TTG', 'CTG', 'ATG']

In [86]:
print( mito_table.forward_table["ACG"] )
print( mito_table.forward_table["CCG"] )

T
P


# 3.10 Comparing Seq objects

As of Biopython 1.65, comparing Seq objects only looks at the sequence itself, so two Seq objects compare in the same way as Python strings.

# 3.11 MutableSeq objects

In [88]:
from Bio.Seq import Seq
my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
my_seq[5] = "G" # Throws exception

TypeError: 'Seq' object does not support item assignment

In [97]:
mutable_seq = my_seq.tomutable()
mutable_seq

MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [98]:
mutable_seq[0] = "C"
mutable_seq

MutableSeq('CCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [99]:
mutable_seq.remove("T")
mutable_seq

MutableSeq('CCCATGTAATGGGCCGCTGAAAGGGTGCCCGA')

## Back to immutable

In [101]:
immutable_again_seq = mutable_seq.toseq()
immutable_again_seq

Seq('CCCATGTAATGGGCCGCTGAAAGGGTGCCCGA')

In [103]:
immutable_again_seq[5] = "G"

TypeError: 'Seq' object does not support item assignment
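
Plain Python strings follow the same immutable pattern, which may help as a mental model: item assignment on a str raises TypeError, and the usual workaround is to edit a mutable copy and convert back. The sketch below is an analogy using a list of letters, not how MutableSeq is implemented:

```python
original = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"

# Work on a mutable copy; the original string is left untouched.
letters = list(original)
letters[0] = "C"       # replace the first letter
letters.remove("T")    # remove only the first "T"

# Convert back to an immutable string once editing is done.
edited = "".join(letters)
print(edited)
```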

# 3.12 UnknownSeq objects

UnknownSeq is a (more) memory-efficient way of holding a sequence of unknown letters.

In [104]:
from Bio.Seq import UnknownSeq
unk = UnknownSeq(20)
print(unk)

????????????????????


* Unknown DNA bases are commonly written as N
* Unknown amino acids in a protein are commonly written as X

In [106]:
unk_dna = UnknownSeq(20,character="N")
unk_dna

UnknownSeq(20, character='N')

In [108]:
unk_protein = UnknownSeq(20,character="X")
unk_protein

UnknownSeq(20, character='X')
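
One way to see why such an object can save memory is that it only needs to remember a length and a single placeholder character, and can build the full string on demand. The class below is a hypothetical illustration of that idea (UnknownRun is a name invented here), not Biopython's UnknownSeq:

```python
class UnknownRun:
    """Stores a length and one placeholder letter instead of the full string."""

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must be non-negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self.length = length
        self.character = character

    def __len__(self):
        return self.length

    def __str__(self):
        # The full string is only built when it is actually requested.
        return self.character * self.length

    def __repr__(self):
        return f"UnknownRun({self.length}, character={self.character!r})"


unknown_dna = UnknownRun(20, character="N")
print(len(unknown_dna), str(unknown_dna))
```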