# Biopython's SeqUtils module

The SeqUtils module provides a collection of miscellaneous functions for common sequence analyses, such as checksums, GC content, molecular weight, and codon usage.

An overview of this module and its capabilities can be found at https://biopython.org/DIST/docs/api/Bio.SeqUtils-module.html

Before exploring the SeqUtils tools, we will create a SeqRecord object that will be used throughout this notebook: 

In [5]:
from Bio import SeqIO
record = SeqIO.read("Cebus_albifrons_NC_002763.1.gb", "genbank")
record

SeqRecord(seq=Seq('GTTAATGTAGCTTAATACTCAAAGCAAGGCACTGAAAATGCCTAGACGGGTATT...TAA', IUPACAmbiguousDNA()), id='NC_002763.1', name='NC_002763', description='Cebus albifrons mitochondrion, complete genome', dbxrefs=['Project:11945', 'BioProject:PRJNA11945'])

With the record loaded, some of this module's capabilities are described below:

---
**Calculate sequence checksum**: Useful for checking that two sequences are identical. Each of the available checksum functions computes a different type of checksum and accepts either a string ("ATGTGCATG") or a Seq object (record.seq) as input:

In [6]:
from Bio.SeqUtils import CheckSum

- crc32 (case sensitive, returns integer)

In [7]:
print("Uppercase: {}".format(CheckSum.crc32("ACGTACGTACGT")))
print("Lowercase: {}".format(CheckSum.crc32("acgtACGTacgt")))
print("Seq object: {}".format(CheckSum.crc32(record.seq)))

Uppercase: 20049947
Lowercase: 1688586483
Seq object: 1882896850


- crc64 (case sensitive, returns string)

In [8]:
print("Uppercase: {}".format(CheckSum.crc64("ACGTACGTACGT")))
print("Lowercase: {}".format(CheckSum.crc64("acgtACGTacgt")))
print("Seq object: {}".format(CheckSum.crc64(record.seq)))

Uppercase: CRC-C4FBB762C4A87EBD
Lowercase: CRC-DA4509DC64A87EBD
Seq object: CRC-97D76EC2A49D9D31


- gcg (case insensitive - all converted to uppercase, returns integer)

In [9]:
print("Uppercase: {}".format(CheckSum.gcg("ACGTACGTACGT")))
print("Lowercase: {}".format(CheckSum.gcg("acgtACGTacgt")))
print("Seq object: {}".format(CheckSum.gcg(record.seq)))

Uppercase: 5688
Lowercase: 5688
Seq object: 3012
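
To make the case insensitivity concrete, here is a minimal pure-Python sketch of the GCG checksum as it is commonly described: a position-weighted sum of the uppercased character codes, with the position counter wrapping after 57, taken modulo 10000. The name `gcg_checksum_sketch` is hypothetical, and this illustrates the idea rather than reproducing Biopython's own code:

```python
def gcg_checksum_sketch(seq):
    """Position-weighted checksum of the uppercased sequence, modulo 10000."""
    checksum = 0
    index = 0
    for char in str(seq):
        index += 1
        checksum += index * ord(char.upper())
        if index == 57:  # the position weight wraps around after 57
            index = 0
    return checksum % 10000


print(gcg_checksum_sketch("ACGTACGTACGT"))
print(gcg_checksum_sketch("acgtACGTacgt"))
```

Both calls print 5688, the value shown in the output above, because every character is uppercased before it is weighted.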


- seguid (case insensitive - returns string)

In [10]:
print("Uppercase: {}".format(CheckSum.seguid("ACGTACGTACGT")))
print("Lowercase: {}".format(CheckSum.seguid("acgtACGTacgt")))
print("Seq object: {}".format(CheckSum.seguid(record.seq)))

Uppercase: If6HIvcnRSQDVNiAoefAzySc6i4
Lowercase: If6HIvcnRSQDVNiAoefAzySc6i4
Seq object: kxq7y1tosvw+Rmo0bjAOiW/Hx5A
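
A SEGUID is usually defined as the base64-encoded SHA-1 digest of the uppercased sequence, with the trailing padding removed. The sketch below follows that definition using only the standard library. `seguid_sketch` is a hypothetical name, and the code is an assumption about the algorithm, not a copy of Biopython's implementation:

```python
import base64
import hashlib


def seguid_sketch(seq):
    """Base64 of the SHA-1 digest of the uppercased sequence, without '=' padding."""
    digest = hashlib.sha1(str(seq).upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


upper = seguid_sketch("ACGTACGTACGT")
mixed = seguid_sketch("acgtACGTacgt")
print(upper == mixed)  # the case of the input does not matter
print(len(upper))      # a 20-byte digest encodes to 27 characters once padding is stripped
```

Comparing two such strings is a quick way to check whether two sequences are identical without comparing them base by base.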


---
**Calculate sequence metrics**: Functions for calculating several sequence metrics. Most of them take a string or a Seq object as their argument and are case insensitive:

In [11]:
from Bio.SeqUtils import *

- GC(seq) - calculates G+C content, returns percentage as float:

In [12]:
print("Uppercase: {}".format(GC("ACGTACGTACGT")))
print("Lowercase: {}".format(GC("acgtACGTacgt")))
print("Seq object: {}".format(GC(record.seq)))

Uppercase: 50.0
Lowercase: 50.0
Seq object: 39.08420925456083
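
The calculation itself is simple enough to sketch by hand. The function below (hypothetical name `gc_percent_sketch`) counts G and C regardless of case and divides by the sequence length. It ignores ambiguity codes, which is a simplification assumed for this example:

```python
def gc_percent_sketch(seq):
    """Percentage of G and C in the sequence, ignoring case."""
    seq = str(seq).upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_percent_sketch("ACGTACGTACGT"))  # 50.0
print(gc_percent_sketch("acgtACGTacgt"))  # 50.0
```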


- GC123(seq) - Calculates G+C content for: total, first, second and third codon positions. Returns a tuple with these 4 floats.

In [13]:
print("Uppercase: {}".format(GC123("ACGTACGTACGT")))
print("Lowercase: {}".format(GC123("acgtACGTacgt")))
print("Seq object: {}".format(GC123(record.seq)))

Uppercase: (50.0, 50.0, 50.0, 50.0)
Lowercase: (50.0, 50.0, 50.0, 50.0)
Seq object: (39.08420925456083, 38.945270025371514, 39.03588256614715, 39.271475172163825)
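
The per-position values can be sketched with string slicing: every third base starting at offset 0, 1, and 2 gives the first, second, and third codon positions. `gc_by_codon_position_sketch` is a hypothetical name. Positions are counted from the first base of whatever sequence is passed in, so the per-position numbers only reflect real codons when the input is an in-frame coding sequence:

```python
def gc_by_codon_position_sketch(seq):
    """Return (total, first, second, third) GC percentages."""
    seq = str(seq).upper()

    def gc(bases):
        if not bases:
            return 0.0
        return 100.0 * (bases.count("G") + bases.count("C")) / len(bases)

    # seq[k::3] picks every third base, starting at codon position k + 1
    return (gc(seq), gc(seq[0::3]), gc(seq[1::3]), gc(seq[2::3]))


print(gc_by_codon_position_sketch("ACGTACGTACGT"))  # (50.0, 50.0, 50.0, 50.0)
```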


- GC_skew(seq, window=100) - Calculates the GC skew, (G-C)/(G+C), for consecutive windows along the sequence and returns a list with one value per window. If the window is larger than the sequence, the list holds a single value computed over the whole sequence (0.0 in the first example below, because that sequence contains as many G's as C's).

In [14]:
print("Window larger than seq: {}".format(GC_skew("ACGTACGTACGT")))
print("Window smaller than seq: {}".format(GC_skew("acgtACGTacgt", window=3)))
print("Seq object: {}".format(GC_skew(record.seq, window=1000)))

Window larger than seq: [0.0]
Window smaller than seq: [0.0, -1.0, 1.0, 0.0]
Seq object: [-0.1543942992874109, -0.2, -0.2146341463414634, -0.4010416666666667, -0.5647382920110193, -0.23515439429928742, -0.15404699738903394, -0.3424657534246575, -0.4469135802469136, -0.3385416666666667, -0.4904109589041096, -0.43243243243243246, -0.38974358974358975, -0.5736842105263158, -0.42278481012658226, -0.33665835411471323, -0.20987654320987653]
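
The windowed calculation can be sketched as follows, assuming consecutive, non-overlapping windows and a skew of 0.0 for windows that contain neither G nor C. `gc_skew_sketch` is a hypothetical name for this illustration:

```python
def gc_skew_sketch(seq, window=100):
    """GC skew (G - C) / (G + C) for consecutive, non-overlapping windows."""
    seq = str(seq).upper()
    values = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # A window without G or C has no defined skew; report 0.0 for it
        values.append((g - c) / (g + c) if g + c else 0.0)
    return values


print(gc_skew_sketch("ACGTACGTACGT"))            # [0.0]
print(gc_skew_sketch("acgtACGTacgt", window=3))  # [0.0, -1.0, 1.0, 0.0]
```

With `window=3` the sequence is split into `acg`, `tAC`, `GTa`, and `cgt`, which explains the four values in the output above.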


- xGC_skew(seq, window=1000, zoom=100, r=300, px=100, py=100) - Calculates the normal and accumulated GC skew and plots them on a Tkinter canvas in a separate window.

In [15]:
xGC_skew(record.seq, window=100, zoom=50, r=150, px=50, py=50)

- nt_search(seq, subseq) - Searches for a DNA subsequence in a sequence and returns a list whose first element is the search pattern, followed by the 0-based start position of each match. To search a Seq object, convert it to a string first.

In [49]:
print("Subsequences from string: {}".format(nt_search("ACGTACGTACGT", "GTACG")))
print("Subsequences from Seq object: {}".format(nt_search(str(record.seq), "ATGCAA")))

Subsequences from string: ['GTACG', 2, 6]
Subsequences from Seq object: ['ATGCAA', 117, 2879, 7532, 10329, 11638]
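
A search of this kind can be sketched with a regular expression. The version below (hypothetical name `nt_search_sketch`) expands a small, illustrative subset of IUPAC ambiguity codes and uses a lookahead so that overlapping matches are reported. The table, the uppercase-only input, and the use of `re.finditer` are assumptions of this sketch, not a description of how Biopython does it:

```python
import re

# Illustrative subset of IUPAC ambiguity codes (assumed for this sketch)
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def nt_search_sketch(seq, subseq):
    """Return [pattern, pos1, pos2, ...] with 0-based match positions."""
    pattern = "".join(AMBIGUITY[nt] for nt in subseq)
    # The zero-width lookahead lets matches overlap
    positions = [m.start() for m in re.finditer("(?=" + pattern + ")", seq)]
    return [pattern] + positions


print(nt_search_sketch("ACGTACGTACGT", "GTACG"))  # ['GTACG', 2, 6]
print(nt_search_sketch("ACGTACGTACGT", "GTNCG"))  # ['GT[ACGT]CG', 2, 6]
```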


- seq3(seq, custom_map=None, undef_code='Xaa') - Convert protein sequence from one-letter (string or Seq object) to three-letter code.

In [16]:
protein = ""
for feature in record.features:
    if feature.type == 'CDS' and feature.qualifiers.get('gene') == ['COX1']:
        protein = feature.qualifiers.get('translation')[0]
print(protein)

MFMNRWLFSTNHKDIGTLYLMFGAWAGATGTALSLLIRAELGQPGSLMEDDHVYNVIVTSHAFIMIFFMVMPIMIGGFGNWLVPLMIGAPDMAFPRMNNMSFWLLPPSLLLLLASSTLEAGVGTGWTVYPPLAGNMSHPGASVDLTIFSLHLAGISSILGAINFITTIINMKPPATTQYQTPLFVWSVLITAILLLLSLPVLAAGITMLLTDRNLNTTFFDPAGGGDPILYQHLFWFFGHPEVYILILPGFGMISHIVTYYSNKKEPFGYMGMVWAMMSIGFLGFIVWAHHMFTVGMDVDTRAYFTSATMIIAIPTGVKVFSWLATLHGGNIKWSPAMLWALGFIFLFTVGGLTGVVLANSSLDIVLHDTYYVVAHFHYVLSMGAVFAIMGGFIHWFPLFSGYTLDQTYAKIHFTIMFVGVNMTFFPQHFLGLSGMPRRYSDYPDAYTAWNIISSVGSFISLAAVILMIFMIWEAFSSKRKVLTIEQMSTNLEWLYGCPPPYHTFEEATYGKTPK


In [17]:
Three_letter_aa = seq3(protein)
print("Three-letter aminoacid translation:\n{}".format(Three_letter_aa))

Three-letter aminoacid translation:
MetPheMetAsnArgTrpLeuPheSerThrAsnHisLysAspIleGlyThrLeuTyrLeuMetPheGlyAlaTrpAlaGlyAlaThrGlyThrAlaLeuSerLeuLeuIleArgAlaGluLeuGlyGlnProGlySerLeuMetGluAspAspHisValTyrAsnValIleValThrSerHisAlaPheIleMetIlePhePheMetValMetProIleMetIleGlyGlyPheGlyAsnTrpLeuValProLeuMetIleGlyAlaProAspMetAlaPheProArgMetAsnAsnMetSerPheTrpLeuLeuProProSerLeuLeuLeuLeuLeuAlaSerSerThrLeuGluAlaGlyValGlyThrGlyTrpThrValTyrProProLeuAlaGlyAsnMetSerHisProGlyAlaSerValAspLeuThrIlePheSerLeuHisLeuAlaGlyIleSerSerIleLeuGlyAlaIleAsnPheIleThrThrIleIleAsnMetLysProProAlaThrThrGlnTyrGlnThrProLeuPheValTrpSerValLeuIleThrAlaIleLeuLeuLeuLeuSerLeuProValLeuAlaAlaGlyIleThrMetLeuLeuThrAspArgAsnLeuAsnThrThrPhePheAspProAlaGlyGlyGlyAspProIleLeuTyrGlnHisLeuPheTrpPhePheGlyHisProGluValTyrIleLeuIleLeuProGlyPheGlyMetIleSerHisIleValThrTyrTyrSerAsnLysLysGluProPheGlyTyrMetGlyMetValTrpAlaMetMetSerIleGlyPheLeuGlyPheIleValTrpAlaHisHisMetPheThrValGlyMetAspValAspThrArgAlaTyrPheThrSerAlaThrMetIleIleAlaIleProThrGlyValLysValPheS

In [18]:
print("Translating a single aminoacid (T) to three-letter code:\n{}".format(seq3("T")))

Translating a single aminoacid (T) to three-letter code:
Thr


In [19]:
from Bio.Data import CodonTable
for num, i in enumerate(CodonTable.unambiguous_dna_by_id[2].back_table.keys(), 1):
    if i is not None:
        print("{}: {} - {}".format(num, i, seq3(i)))

1: K - Lys
2: N - Asn
3: T - Thr
4: S - Ser
5: M - Met
6: I - Ile
7: Q - Gln
8: H - His
9: P - Pro
10: R - Arg
11: L - Leu
12: E - Glu
13: D - Asp
14: A - Ala
15: G - Gly
16: V - Val
17: Y - Tyr
18: W - Trp
19: C - Cys
20: F - Phe


You can set a custom translation of the codon termination code using the dictionary "custom_map" argument (which defaults to {'*': 'Ter'}), e.g.

In [20]:
seq3("MAIVMGRWKGAR*", custom_map={"*": "***"})

'MetAlaIleValMetGlyArgTrpLysGlyAlaArg***'

You can also set a custom translation for non-amino acid characters, such as '-', using the "undef_code" argument, e.g.

In [21]:
seq3("MAIVMGRWKGA--R*", undef_code='---')

'MetAlaIleValMetGlyArgTrpLysGlyAla------ArgTer'
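
How the two arguments interact can be sketched with a plain dictionary lookup: `custom_map` entries are merged into the standard table, and anything still missing falls back to `undef_code`. `seq3_sketch` and `THREE_LETTER` are hypothetical names, and the table only covers the 20 standard amino acids:

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def seq3_sketch(seq, custom_map=None, undef_code="Xaa"):
    """Convert one-letter amino acid codes to three-letter codes."""
    if custom_map is None:
        custom_map = {"*": "Ter"}
    table = {**THREE_LETTER, **custom_map}  # custom entries take precedence
    return "".join(table.get(aa, undef_code) for aa in str(seq))


print(seq3_sketch("MAIVMGRWKGAR*", custom_map={"*": "***"}))
print(seq3_sketch("MAIVMGRWKGA--R*", undef_code="---"))
```

The two calls reproduce the strings shown in the outputs above.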

- seq1(seq, custom_map=None, undef_code='X') - Converts a protein sequence from three-letter to one-letter code. It is the inverse of seq3 and takes the same arguments.

In [22]:
print(Three_letter_aa, "\n")
print("Three letter converted to single letter:\n{}".format(seq1(Three_letter_aa)))

MetPheMetAsnArgTrpLeuPheSerThrAsnHisLysAspIleGlyThrLeuTyrLeuMetPheGlyAlaTrpAlaGlyAlaThrGlyThrAlaLeuSerLeuLeuIleArgAlaGluLeuGlyGlnProGlySerLeuMetGluAspAspHisValTyrAsnValIleValThrSerHisAlaPheIleMetIlePhePheMetValMetProIleMetIleGlyGlyPheGlyAsnTrpLeuValProLeuMetIleGlyAlaProAspMetAlaPheProArgMetAsnAsnMetSerPheTrpLeuLeuProProSerLeuLeuLeuLeuLeuAlaSerSerThrLeuGluAlaGlyValGlyThrGlyTrpThrValTyrProProLeuAlaGlyAsnMetSerHisProGlyAlaSerValAspLeuThrIlePheSerLeuHisLeuAlaGlyIleSerSerIleLeuGlyAlaIleAsnPheIleThrThrIleIleAsnMetLysProProAlaThrThrGlnTyrGlnThrProLeuPheValTrpSerValLeuIleThrAlaIleLeuLeuLeuLeuSerLeuProValLeuAlaAlaGlyIleThrMetLeuLeuThrAspArgAsnLeuAsnThrThrPhePheAspProAlaGlyGlyGlyAspProIleLeuTyrGlnHisLeuPheTrpPhePheGlyHisProGluValTyrIleLeuIleLeuProGlyPheGlyMetIleSerHisIleValThrTyrTyrSerAsnLysLysGluProPheGlyTyrMetGlyMetValTrpAlaMetMetSerIleGlyPheLeuGlyPheIleValTrpAlaHisHisMetPheThrValGlyMetAspValAspThrArgAlaTyrPheThrSerAlaThrMetIleIleAlaIleProThrGlyValLysValPheSerTrpLeuAlaThrLeuHisGlyGlyAsnIleLysT
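
The reverse conversion can be sketched by reading the input three characters at a time. `seq1_sketch` and `ONE_LETTER` are hypothetical names. Normalizing each chunk with `capitalize()` is an assumption made here so that `MET`, `met`, and `Met` are treated alike:

```python
ONE_LETTER = dict(zip(
    "Ala Arg Asn Asp Cys Gln Glu Gly His Ile "
    "Leu Lys Met Phe Pro Ser Thr Trp Tyr Val".split(),
    "ARNDCQEGHILKMFPSTWYV",
))


def seq1_sketch(seq, custom_map=None, undef_code="X"):
    """Convert three-letter amino acid codes to one-letter codes."""
    if custom_map is None:
        custom_map = {"Ter": "*"}
    table = {**ONE_LETTER, **custom_map}
    seq = str(seq)
    chunks = (seq[i:i + 3].capitalize() for i in range(0, len(seq), 3))
    return "".join(table.get(chunk, undef_code) for chunk in chunks)


print(seq1_sketch("MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer"))  # MAIVMGRWKGAR*
```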

- molecular_weight(seq, seq_type=None, double_stranded=False, circular=False, monoisotopic=False) - Calculate the molecular mass of DNA, RNA or protein sequences as float.

Only unambiguous letters are allowed. Nucleotide sequences are assumed to have a 5' phosphate.

Arguments:
- seq: String or Biopython sequence object.
- seq_type: The default (None) is to take the alphabet from the seq argument, or assume DNA if the seq argument is a string. Override this with a string 'DNA', 'RNA', or 'protein'.
- double_stranded: Calculate the mass for the double stranded molecule?
- circular: Is the molecule circular (has no ends)?
- monoisotopic: Use the monoisotopic mass tables?



Note that for backwards compatibility, if the seq argument is a string, or Seq object with a generic alphabet, and no seq_type is specified (i.e. left as None), then DNA is assumed.

In [23]:
from Bio.Seq import Seq

print("String with no seq_type argument: %0.2f" % molecular_weight("AGC"))
print("Seq object with no specified alphabet: %0.2f" % molecular_weight(Seq("AGC")))

String with no seq_type argument: 949.61
Seq object with no specified alphabet: 949.61


However, it is always better to be explicit - for example with strings, for the same sequence:

In [24]:
print("'AGC' DNA weight: %0.2f" % molecular_weight("AGC", "DNA"))
print("'AGC' RNA weight: %0.2f" % molecular_weight("AGC", "RNA"))
print("'AGC' Protein weight: %0.2f" % molecular_weight("AGC", "protein"))

'AGC' DNA weight: 949.61
'AGC' RNA weight: 997.61
'AGC' Protein weight: 249.29
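
The DNA figure can be checked by hand: add up the masses of the individual nucleotides and subtract one water molecule for each bond formed between them. The sketch below does this for linear single-stranded DNA. The monomer masses are average nucleoside monophosphate masses quoted from memory, so treat them as assumptions and compare them with the tables in `Bio.Data.IUPACData` before relying on them. `ssdna_weight_sketch` is a hypothetical name:

```python
# Average masses of the nucleoside monophosphates (assumed values, see above)
DNA_WEIGHTS = {"A": 331.2218, "C": 307.1971, "G": 347.2212, "T": 322.2085}
WATER = 18.0153


def ssdna_weight_sketch(seq):
    """Mass of a linear, single-stranded DNA molecule with a 5' phosphate."""
    seq = str(seq).upper()
    if not seq:
        return 0.0
    total = sum(DNA_WEIGHTS[nt] for nt in seq)
    # One water molecule is lost for each bond between adjacent nucleotides
    return total - (len(seq) - 1) * WATER


print("%0.2f" % ssdna_weight_sketch("AGC"))  # 949.61
```

With these masses the sketch reproduces the 949.61 reported for 'AGC' as DNA. The RNA and protein values differ only because other monomer tables are used.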


Or, with the sequence alphabet (Bio.Alphabet is only available in Biopython releases before 1.78):

In [25]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna, generic_rna, generic_protein

print("'AGC' DNA weight: %0.2f" % molecular_weight(Seq("AGC", generic_dna)))
print("'AGC' RNA weight: %0.2f" % molecular_weight(Seq("AGC", generic_rna)))
print("'AGC' Protein weight: %0.2f" % molecular_weight(Seq("AGC", generic_protein)))


'AGC' DNA weight: 949.61
'AGC' RNA weight: 997.61
'AGC' Protein weight: 249.29


Note that a seq_type that contradicts the sequence alphabet raises an exception:

In [26]:
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
print("%0.2f" % molecular_weight(Seq("AGC", generic_dna), "RNA"))

ValueError: seq_type='RNA' contradicts DNA from seq alphabet

- six_frame_translations(seq, genetic_code=1) - Return pretty string showing the 6 frame translations and GC content.

In [27]:
print(six_frame_translations("AUGGCCAUUGUAAUGGGCCGCUGA"))


GC_Frame: a:5 t:0 g:8 c:5 
Sequence: auggccauug ... gggccgcuga, 24 nt, 54.17 %GC


1/1
  G  H  C  N  G  P  L
 W  P  L  *  W  A  A
M  A  I  V  M  G  R  *
auggccauuguaaugggccgcuga   54 %
uaccgguaacauuacccggcgacu
A  M  T  I  P  R  Q 
 H  G  N  Y  H  A  A  S
  P  W  Q  L  P  G  S
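
The six frames are the three possible codon offsets on the given strand plus the same three offsets on its reverse complement. The sketch below only splits a DNA string into codons for each frame and leaves out the translation and the pretty-printing. `six_frames_sketch` is a hypothetical name, and the code handles unambiguous DNA only:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frames_sketch(seq):
    """Return six lists of codons: forward offsets 0-2, then reverse offsets 0-2."""
    forward = str(seq).upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = []
    for strand in (forward, reverse):
        for offset in range(3):
            # Stop early enough that every codon has three bases
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


for number, codons in enumerate(six_frames_sketch("ATGGCCATT"), 1):
    print(number, codons)
```

Each list of codons could then be translated with the codon table selected by `genetic_code`.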




We can also get the six-frame translation from the COX1 gene of our SeqRecord object (Vertebrate mitochondrial table, genetic code number 2).

In [28]:
gene = ''
for feature in record.features:
    if feature.type == 'CDS' and feature.qualifiers.get('gene') == ['COX1']:
        gene = feature.location.extract(record.seq)
print(gene, "\n")
print(six_frame_translations(gene, genetic_code=2))

ATGTTCATAAATCGCTGACTATTTTCAACTAACCATAAAGATATTGGTACACTGTATTTAATATTTGGTGCATGAGCCGGAGCAACAGGAACAGCCTTAAGTCTTCTAATTCGAGCTGAGCTGGGCCAACCAGGAAGCCTAATAGAAGACGACCATGTTTACAATGTTATTGTTACCTCTCACGCATTTATTATAATCTTCTTCATGGTCATACCAATTATAATTGGCGGCTTCGGGAATTGATTAGTGCCTCTAATAATCGGCGCCCCCGATATAGCTTTTCCTCGTATAAATAATATAAGCTTCTGACTCCTCCCCCCATCCCTTCTTCTCCTACTTGCCTCTTCAACTCTAGAGGCTGGTGTTGGAACTGGCTGAACAGTCTACCCTCCCCTGGCAGGAAATATATCACACCCTGGAGCCTCTGTAGACTTAACTATCTTTTCACTGCATCTAGCGGGTATTTCCTCTATTCTAGGGGCTATTAACTTTATTACAACAATTATTAATATAAAACCACCAGCCACAACCCAATATCAAACACCCCTATTTGTATGATCCGTACTCATTACAGCAATCCTTCTACTTCTCTCCCTCCCAGTCCTAGCTGCTGGAATTACTATACTATTAACCGACCGTAACCTTAACACCACTTTCTTCGACCCTGCTGGTGGTGGTGACCCCATTCTATATCAACACCTATTTTGATTTTTTGGTCACCCCGAAGTTTATATCCTTATTTTACCAGGGTTTGGGATAATCTCACATATTGTAACATATTACTCTAACAAAAAAGAGCCATTTGGGTATATAGGAATAGTTTGAGCCATAATATCCATTGGCTTCCTAGGCTTTATCGTATGAGCTCACCATATATTCACAGTAGGAATAGATGTGGATACACGCGCGTATTTTACATCAGCTACCATAATCATTGCCATCCCCACTGGGGTCAAAGTATTTAGCTGGTTGGCTACACTACATGGAGGCAACATCAAGT

---
**Calculate Codon Usage**: Several methods for codon usage calculations. This module does not support multiple genetic codes, so the [CAI](https://github.com/Benjamin-Lee/CodonAdaptationIndex) package is recommended when working with mitochondrial genomes.

To calculate codon usage, we first import the module and create an empty CodonAdaptationIndex object:

In [81]:
from Bio.SeqUtils import CodonUsage

cdusg = CodonUsage.CodonAdaptationIndex()
print(cdusg)

<Bio.SeqUtils.CodonUsage.CodonAdaptationIndex object at 0x7ff486f16b38>


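Then we provide it with a FASTA file. generate_index takes a file name rather than a SeqRecord object, so we need to write a FASTA file first, even if only a temporary one:

In [98]:
import os

# Write the record to a temporary FASTA file; leaving the "with" block
# closes the file, so all data is on disk before it is read back
with open("tempfile", "w") as tempfile:
    tempfile.write(record.format("fasta"))

# Build the index from the closed file, then remove the temporary file
cdusg.generate_index("tempfile")
os.remove("tempfile")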

After that, several codon usage metrics are available through the attributes and methods of the CodonAdaptationIndex object:

In [99]:
print(dir(cdusg))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_count_codons', 'cai_for_gene', 'codon_count', 'generate_index', 'index', 'print_index', 'set_cai_index']


Through the 'codon_count' attribute, we can get the number of times each codon appears in the sequence, as a dictionary:

In [100]:
print(cdusg.codon_count)

{'TTT': 110, 'TTC': 95, 'TTA': 159, 'TTG': 49, 'CTT': 108, 'CTC': 102, 'CTA': 153, 'CTG': 56, 'ATT': 149, 'ATC': 100, 'ATA': 160, 'ATG': 63, 'GTT': 35, 'GTC': 24, 'GTA': 61, 'GTG': 31, 'TAT': 219, 'TAC': 136, 'TAA': 176, 'TAG': 91, 'CAT': 108, 'CAC': 122, 'CAA': 153, 'CAG': 80, 'AAT': 206, 'AAC': 161, 'AAA': 189, 'AAG': 71, 'GAT': 53, 'GAC': 46, 'GAA': 56, 'GAG': 42, 'TCT': 120, 'TCC': 89, 'TCA': 87, 'TCG': 26, 'CCT': 182, 'CCC': 127, 'CCA': 132, 'CCG': 48, 'ACT': 193, 'ACC': 129, 'ACA': 110, 'ACG': 28, 'GCT': 51, 'GCC': 61, 'GCA': 50, 'GCG': 13, 'TGT': 28, 'TGC': 40, 'TGA': 52, 'TGG': 33, 'CGT': 30, 'CGC': 41, 'CGA': 26, 'CGG': 24, 'AGT': 58, 'AGC': 98, 'AGA': 75, 'AGG': 69, 'GGT': 29, 'GGC': 49, 'GGA': 33, 'GGG': 23}
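
Counting of this kind amounts to reading the sequence in consecutive triplets starting at the first base. The sketch below (hypothetical name `count_codons_sketch`) does that with `collections.Counter`. Rejecting sequences whose length is not a multiple of three is a choice made for this example:

```python
from collections import Counter


def count_codons_sketch(seq):
    """Count consecutive, non-overlapping triplets starting at the first base."""
    seq = str(seq).upper()
    if len(seq) % 3:
        raise ValueError("sequence length is not a multiple of three")
    return Counter(seq[i:i + 3] for i in range(0, len(seq), 3))


counts = count_codons_sketch("ATGGCCATGTAA")
print(counts["ATG"], counts["GCC"], counts["TAA"])  # 2 1 1
```

Because the reading frame is fixed by the first base, counts taken over a whole genome, as above, mix coding and non-coding triplets. Passing individual coding sequences gives counts that reflect actual codon usage.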


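As the name of the class implies, it can also calculate the Codon Adaptation Index (CAI) of a sequence, through the cai_for_gene method. The 'index' attribute holds the relative adaptiveness of each codon, computed by generate_index, on which that calculation is based: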

In [102]:
print(cdusg.index)

{'TGT': 0.7, 'TGC': 1.0, 'GAT': 1.0, 'GAC': 0.8679245283018868, 'TCT': 1.0, 'TCG': 0.21666666666666665, 'TCA': 0.725, 'TCC': 0.7416666666666666, 'AGC': 0.8166666666666667, 'AGT': 0.48333333333333334, 'CAA': 1.0, 'CAG': 0.5228758169934641, 'ATG': 1.0, 'AAC': 0.7815533980582524, 'AAT': 1.0, 'CCT': 1.0, 'CCG': 0.26373626373626374, 'CCA': 0.7252747252747253, 'CCC': 0.6978021978021979, 'AAG': 0.37566137566137564, 'AAA': 1.0, 'TAG': 0.5170454545454546, 'TGA': 0.29545454545454547, 'TAA': 1.0, 'ACC': 0.6683937823834196, 'ACA': 0.5699481865284974, 'ACG': 0.14507772020725387, 'ACT': 1.0, 'TTT': 1.0, 'TTC': 0.8636363636363636, 'GCA': 0.8196721311475409, 'GCC': 1.0, 'GCG': 0.21311475409836064, 'GCT': 0.8360655737704917, 'GGT': 0.5918367346938775, 'GGG': 0.4693877551020408, 'GGA': 0.673469387755102, 'GGC': 1.0, 'ATC': 0.625, 'ATA': 1.0, 'ATT': 0.93125, 'TTA': 1.0, 'TTG': 0.3081761006289308, 'CTC': 0.6415094339622641, 'CTT': 0.679245283018868, 'CTG': 0.35220125786163525, 'CTA': 0.9622641509433963, '
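
The values in this dictionary follow from the codon counts: within each group of synonymous codons, every count is divided by the count of the most frequent codon in the group. A minimal sketch, with the hypothetical name `relative_adaptiveness_sketch` and only two codon families taken from the counts printed earlier:

```python
def relative_adaptiveness_sketch(codon_count, synonymous_codons):
    """Scale each codon count by the highest count among its synonyms."""
    index = {}
    for codons in synonymous_codons.values():
        best = max(codon_count[codon] for codon in codons)
        for codon in codons:
            index[codon] = codon_count[codon] / best if best else 0.0
    return index


counts = {"TGT": 28, "TGC": 40, "GAT": 53, "GAC": 46}
families = {"Cys": ["TGT", "TGC"], "Asp": ["GAT", "GAC"]}
print(relative_adaptiveness_sketch(counts, families))
```

The result matches the TGT, TGC, GAT, and GAC entries above: 28/40 = 0.7 and 46/53 is about 0.868.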

**The CodonUsage module depends on the information contained in the CodonUsageIndices module**. The default index used for CAI calculation is the E. coli index described in [Sharp & Li (1987)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC340524/):

In [110]:
from Bio.SeqUtils import CodonUsageIndices

print(CodonUsageIndices.SharpEcoliIndex)

{'GCA': 0.586, 'GCC': 0.122, 'GCG': 0.424, 'GCT': 1, 'AGA': 0.004, 'AGG': 0.002, 'CGA': 0.004, 'CGC': 0.356, 'CGG': 0.004, 'CGT': 1, 'AAC': 1, 'AAT': 0.051, 'GAC': 1, 'GAT': 0.434, 'TGC': 1, 'TGT': 0.5, 'CAA': 0.124, 'CAG': 1, 'GAA': 1, 'GAG': 0.259, 'GGA': 0.01, 'GGC': 0.724, 'GGG': 0.019, 'GGT': 1, 'CAC': 1, 'CAT': 0.291, 'ATA': 0.003, 'ATC': 1, 'ATT': 0.185, 'CTA': 0.007, 'CTC': 0.037, 'CTG': 1, 'CTT': 0.042, 'TTA': 0.02, 'TTG': 0.02, 'AAA': 1, 'AAG': 0.253, 'ATG': 1, 'TTC': 1, 'TTT': 0.296, 'CCA': 0.135, 'CCC': 0.012, 'CCG': 1, 'CCT': 0.07, 'AGC': 0.41, 'AGT': 0.085, 'TCA': 0.077, 'TCC': 0.744, 'TCG': 0.017, 'TCT': 1, 'ACA': 0.076, 'ACC': 1, 'ACG': 0.099, 'ACT': 0.965, 'TGG': 1, 'TAC': 1, 'TAT': 0.239, 'GTA': 0.495, 'GTC': 0.066, 'GTG': 0.221, 'GTT': 1}


You can replace these values before calculating the CAI by passing set_cai_index a dictionary similar to the SharpEcoliIndex in the CodonUsageIndices module:

In [94]:
help(cdusg.set_cai_index)

Help on method set_cai_index in module Bio.SeqUtils.CodonUsage:

set_cai_index(index) method of Bio.SeqUtils.CodonUsage.CodonAdaptationIndex instance
    Set up an index to be used when calculating CAI for a gene.
    
    Just pass a dictionary similar to the SharpEcoliIndex in the
    CodonUsageIndices module.
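
Once an index is set, the CAI of a gene is, in the textbook definition, the geometric mean of the relative adaptiveness values of its codons. The sketch below implements that definition with a toy two-codon index. `cai_sketch` is a hypothetical name, codons missing from the index are simply skipped, and Biopython's cai_for_gene may treat special cases (such as Met, Trp, and stop codons) differently:

```python
import math


def cai_sketch(gene, index):
    """Geometric mean of the index values of the codons in the gene."""
    gene = str(gene).upper()
    codons = [gene[i:i + 3] for i in range(0, len(gene) - 2, 3)]
    # Codons absent from the index are ignored in this sketch
    logs = [math.log(index[codon]) for codon in codons if codon in index]
    if not logs:
        return 0.0
    return math.exp(sum(logs) / len(logs))


toy_index = {"GCT": 1.0, "GCC": 0.25}  # made-up values for illustration
print(cai_sketch("GCTGCC", toy_index))
```

For the toy gene the result is the square root of 1.0 × 0.25, that is 0.5 up to floating-point rounding. A gene that only uses the preferred codon of each family scores 1.0.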



---
**Protein analysis**: The ProtParam module provides several methods for simple protein analysis.

In [121]:
from Bio.SeqUtils.ProtParam import ProteinAnalysis

First, we need to create a ProteinAnalysis object:

In [125]:
X = ProteinAnalysis(protein)
type(X)

Bio.SeqUtils.ProtParam.ProteinAnalysis

This class has several methods that give us information on the protein sequence:

In [126]:
print(dir(X))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_weight_list', 'amino_acids_content', 'amino_acids_percent', 'aromaticity', 'count_amino_acids', 'flexibility', 'get_amino_acids_percent', 'gravy', 'instability_index', 'isoelectric_point', 'length', 'molar_extinction_coefficient', 'molecular_weight', 'monoisotopic', 'protein_scale', 'secondary_structure_fraction', 'sequence']


In [138]:
print("We can count the occurrences of any amino acid in the sequence. E.g. 'A' occurs {} times.".format(X.count_amino_acids()['A']))
print("So, 'A' makes up a fraction of %0.2f of the sequence" % X.get_amino_acids_percent()['A'])
print("The molecular weight of the sequence is %0.2f" % X.molecular_weight())
print("Its aromaticity is %0.2f" % X.aromaticity())
print("Instability index: %0.2f" % X.instability_index())
print("Isoelectric point: %0.2f" % X.isoelectric_point())
sec_struc = X.secondary_structure_fraction() # [helix, turn, sheet]
print("Secondary structure fraction (helix): %0.2f" % sec_struc[0])
epsilon_prot = X.molar_extinction_coefficient() # [reduced, oxidized]
print("Molar extinction coefficient (reduced): {}".format(epsilon_prot[0]))
print("Molar extinction coefficient (oxidized): {}".format(epsilon_prot[1]))

We can count the occurrences of any amino acid in the sequence. E.g. 'A' occurs 38 times.
So, 'A' makes up a fraction of 0.07 of the sequence
The molecular weight of the sequence is 57274.83
Its aromaticity is 0.15
Instability index: 32.25
Isoelectric point: 6.29
Secondary structure fraction (helix): 0.41
Molar extinction coefficient (reduced): 119290
Molar extinction coefficient (oxidized): 119290
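
Two of these metrics are easy to reproduce by hand. Aromaticity is the fraction of aromatic residues (F, W, and Y), and the molar extinction coefficient at 280 nm is commonly estimated from the number of Trp, Tyr, and cystine residues. The sketch below uses the widely quoted coefficients 5500, 1490, and 125. The function names are hypothetical, and whether Biopython uses exactly these constants is an assumption:

```python
def aromaticity_sketch(protein):
    """Fraction of aromatic residues (Phe, Trp, Tyr) in the sequence."""
    protein = str(protein).upper()
    return sum(protein.count(aa) for aa in "FWY") / len(protein)


def extinction_sketch(protein):
    """Return (reduced, oxidized) molar extinction coefficients at 280 nm."""
    protein = str(protein).upper()
    reduced = protein.count("W") * 5500 + protein.count("Y") * 1490
    # Each pair of cysteines can form one cystine (disulfide bridge)
    oxidized = reduced + (protein.count("C") // 2) * 125
    return reduced, oxidized


print(aromaticity_sketch("FWYA"))  # 0.75
print(extinction_sketch("WYCC"))   # (6990, 7115)
```

Under this formula the reduced and oxidized coefficients are equal whenever the sequence has fewer than two cysteines, which would be consistent with the identical values in the output above.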


---
**Local Composition Complexity**: The 'lcc' module calculates sequence complexity from an unambiguous DNA sequence.
- **lcc_mult(seq, wsize)** - Calculate Local Composition Complexity (LCC) values over sliding window. Returns a list of floats, the LCC values for a sliding window over the sequence.
- **lcc_simp(seq)** - Calculate Local Composition Complexity (LCC) for a sequence. Returns a float.

In [152]:
from Bio.SeqUtils.lcc import *
print("Local Composition Complexity for whole sequence: {}\n".format(lcc_simp(record.seq)))
print("Local Composition Complexity using sliding window (1000):\n{}".format(lcc_mult(record.seq, 1000)))

Local Composition Complexity for whole sequence: 1.9279076250476705

Local Composition Complexity using sliding window (1000):
[0, 1.9597577375135669, 1.9597577375135669, 1.9597577375135669, 1.960151402155542, 1.9607021249155308, 1.9607021249155308, 1.9606477909866498, 1.9602541263446747, 1.9606477909866498, 1.961579912483908, 1.9606477909866498, 1.960151402155542, 1.9595903201265248, 1.9599635677523357, 1.9605308416245717, 1.9614591542221003, 1.9614591542221003, 1.9614591542221003, 1.9614591542221003, 1.9614591542221003, 1.961513716499784, 1.9624298522519787, 1.963333842655896, 1.963333842655896, 1.96292375358996, 1.963333842655896, 1.963333842655896, 1.963333842655896, 1.9629721504854547, 1.9625541344838044, 1.9620602331458232, 1.9620602331458232, 1.9615561518285567, 1.9619257709347122, 1.9615561518285567, 1.9624722875807514, 1.9630084466678661, 1.9630084466678661, 1.9639082062884734, 1.963871910106062, 1.9635040787892593, 1.9635342618387976, 1.9630326434137497, 1.9625025974595083, 1
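
Values just below 2 are what one would expect from the Shannon entropy of the base composition measured in bits, which reaches 2.0 when all four bases are equally frequent. The sketch below computes that entropy for a whole sequence and for a sliding window. The function names are hypothetical, and the logarithm base is a parameter because the normalization Biopython uses may differ between releases (an assumption worth checking against the installed version):

```python
import math
from collections import Counter


def composition_entropy_sketch(seq, base=2):
    """Shannon entropy of the base composition of the sequence."""
    seq = str(seq).upper()
    counts = Counter(seq)
    total = len(seq)
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())


def sliding_entropy_sketch(seq, wsize, base=2):
    """Entropy of every window of length wsize, moving one base at a time."""
    seq = str(seq)
    return [
        composition_entropy_sketch(seq[i:i + wsize], base)
        for i in range(len(seq) - wsize + 1)
    ]


print(composition_entropy_sketch("ACGT"))
print(composition_entropy_sketch("AAAA"))
print(sliding_entropy_sketch("ACGTAAAA", 4))
```

A sequence made of a single repeated base has zero entropy, while one with equal amounts of A, C, G, and T reaches the maximum of 2 bits. The window that slides into the `AAAA` run shows the complexity dropping.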

There are other modules in the SeqUtils package. Feel free to experiment with them! Some of the modules not covered here include:

- **Module IsoelectricPoint**
- **Module MeltingTemp**