# Tutorial Part 21: Introduction to Bioinformatics

So far in this tutorial, we've primarily worked on the problems of cheminformatics. We've been interested in seeing how we can use the techniques of machine learning to make predictions about the properties of molecules. In this tutorial, we're going to shift a bit and see how we can use classical computer science techniques and machine learning to tackle problems in bioinformatics.

For this, we're going to use the venerable [biopython](https://biopython.org/) library to do some basic bioinformatics. A lot of the material in this notebook is adapted from the extensive official [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html). We strongly recommend checking out the official tutorial after you work through this notebook!

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/21_Introduction_to_Bioinformatics.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [1]:
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e



add /root/miniconda/lib/python3.6/site-packages to PYTHONPATH
python version: 3.6.9
fetching installer from https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
done
installing miniconda to /root/miniconda
done
installing rdkit, openmm, pdbfixer
added omnia to channels
added conda-forge to channels
done
conda packages installation finished!


# conda environments:
#
base                  *  /root/miniconda



In [2]:
!pip install --pre deepchem
import deepchem
deepchem.__version__

Collecting deepchem
  Downloading https://files.pythonhosted.org/packages/b5/d7/3ba15ec6f676ef4d93855d01e40cba75e231339e7d9ea403a2f53cabbab0/deepchem-2.4.0rc1.dev20200805054153.tar.gz (351kB)

'2.4.0-rc1.dev'

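We'll use `pip` to install `biopython`. The output below is from version 1.77; note that the `Bio.Alphabet` module imported later in this notebook was removed in Biopython 1.78, so newer releases will not run those cells unchanged.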

In [3]:
!pip install biopython

Collecting biopython
  Downloading https://files.pythonhosted.org/packages/a8/66/134dbd5f885fc71493c61b6cf04c9ea08082da28da5ed07709b02857cbd0/biopython-1.77-cp36-cp36m-manylinux1_x86_64.whl (2.3MB)
Installing collected packages: biopython
Successfully installed biopython-1.77


In [4]:
import Bio
Bio.__version__

'1.77'

In [5]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACATTG")
my_seq

Seq('AGTACACATTG')

In [6]:
my_seq.complement()

Seq('TCATGTGTAAC')

In [7]:
my_seq.reverse_complement()

Seq('CAATGTGTACT')
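
To make clear what the two methods above compute, here is a minimal plain-Python sketch. It is not Biopython's implementation: the standalone helpers `complement` and `reverse_complement` are hypothetical, and they handle only the four unambiguous DNA letters.

```python
# Hand-rolled illustration (not Biopython's code): Watson-Crick pairing
# for the four unambiguous DNA letters only.
PAIRS = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Swap each base for its pairing partner, keeping the order."""
    return dna.translate(PAIRS)


def reverse_complement(dna: str) -> str:
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGTACACATTG"))          # TCATGTGTAAC
print(reverse_complement("AGTACACATTG"))  # CAATGTGTACT
```

The printed strings match the `Seq` outputs shown above for the same input.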

## Parsing Sequence Records

We're going to download a sample `fasta` file from the Biopython tutorial to use in some exercises. This file is a set of hits for a sequence (of lady slipper orchid genes).

In [8]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta

--2020-08-05 14:50:55--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76480 (75K) [text/plain]
Saving to: ‘ls_orchid.fasta’


2020-08-05 14:50:55 (4.97 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]



Let's take a look at what the contents of this file look like:

In [9]:
from Bio import SeqIO

for seq_record in SeqIO.parse('ls_orchid.fasta', 'fasta'):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GC
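
To show what a FASTA parser has to do, the sketch below reads records by hand. It assumes the usual FASTA layout (a header line starting with `>` followed by one or more sequence lines). The `read_fasta` function is a hypothetical helper, not part of Biopython, and the two toy records are made up rather than taken from `ls_orchid.fasta`.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up two-record example used only for illustration.
example = io.StringIO(
    ">seq1 first toy record\nACGT\nACGT\n>seq2 second toy record\nGGCC\n"
)
records = list(read_fasta(example))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

The same function could be pointed at the downloaded file by passing an open file handle instead of the `io.StringIO` object. In practice `SeqIO.parse` is the better choice, since it also handles many other formats.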

## Sequence Objects

A large part of the biopython infrastructure deals with tools for handling sequences. These could be DNA sequences, RNA sequences, amino acid sequences or even more exotic constructs. To tell biopython what type of sequence it's dealing with, you can specify the alphabet explicitly.

In [10]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("ACAGTAGAC", IUPAC.unambiguous_dna)
my_seq

Seq('ACAGTAGAC', IUPACUnambiguousDNA())

In [11]:
my_seq.alphabet

IUPACUnambiguousDNA()

If we want to code a protein sequence, we can do that just as easily.

In [12]:
my_prot = Seq("AAAAA", IUPAC.protein) # Alanine pentapeptide
my_prot

Seq('AAAAA', IUPACProtein())

In [13]:
my_prot.alphabet

IUPACProtein()

We can take the length of sequences and index into them like strings.

In [14]:
print(len(my_prot))

5


In [15]:
my_prot[0]

'A'

You can also use slice notation on sequences to get subsequences.

In [16]:
my_prot[0:3]

Seq('AAA', IUPACProtein())

You can concatenate sequences if they have the same type, so this works:

In [17]:
my_prot + my_prot

Seq('AAAAAAAAAA', IUPACProtein())

But this fails, because a protein sequence cannot be joined to a DNA sequence:

In [18]:
my_prot + my_seq

TypeError: ignored
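
The type check behind this failure can be sketched in a few lines of plain Python. The `TypedSeq` class below is hypothetical and only illustrates the idea of refusing to join sequences of different kinds. It is not how Biopython implements alphabets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedSeq:
    """Toy sequence that remembers what kind of letters it holds."""

    letters: str
    kind: str  # e.g. "dna" or "protein" (labels chosen for this sketch)

    def __add__(self, other: "TypedSeq") -> "TypedSeq":
        if not isinstance(other, TypedSeq):
            return NotImplemented
        if self.kind != other.kind:
            raise TypeError(f"cannot join {self.kind} and {other.kind} sequences")
        return TypedSeq(self.letters + other.letters, self.kind)


protein = TypedSeq("AAAAA", "protein")
dna = TypedSeq("ACAGTAGAC", "dna")

print((protein + protein).letters)  # same kind, so concatenation works
try:
    protein + dna                   # different kinds, so this raises
except TypeError as err:
    print("TypeError:", err)
```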

## Transcription

Transcription is the process by which a DNA sequence is converted into messenger RNA. Remember that this is part of the "central dogma" of biology in which DNA engenders messenger RNA which engenders proteins. Here's a nice representation of this cycle borrowed from a Khan Academy [lesson](https://cdn.kastatic.org/ka-perseus-images/20ce29384b2e7ff0cdea72acaa5b1dbd7287ab00.png).

<img src="https://cdn.kastatic.org/ka-perseus-images/20ce29384b2e7ff0cdea72acaa5b1dbd7287ab00.png">

Note from the image above that DNA has two strands. The top strand is typically called the coding strand, and the bottom the template strand. The template strand is used for the actual transcription process of conversion into messenger RNA, but in bioinformatics, it's more common to work with the coding strand. Let's now see how we can execute a transcription computationally using Biopython.

In [19]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

coding_dna = Seq("ATGATCTCGTAA", IUPAC.unambiguous_dna)
coding_dna

Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())

In [20]:
template_dna = coding_dna.reverse_complement()
template_dna

Seq('TTACGAGATCAT', IUPACUnambiguousDNA())

Note that these sequences match those in the image above. You might be confused about why the `template_dna` sequence is shown reversed. The reason is that by convention, the template strand is read in the reverse direction.

Let's now see how we can transcribe our `coding_dna` strand into messenger RNA. This will only swap 'T' for 'U' and change the alphabet for our object.

In [21]:
messenger_rna = coding_dna.transcribe()
messenger_rna

Seq('AUGAUCUCGUAA', IUPACUnambiguousRNA())

We can also perform a "back-transcription" to recover the original coding strand from the messenger RNA.

In [22]:
messenger_rna.back_transcribe()

Seq('ATGATCTCGTAA', IUPACUnambiguousDNA())
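
The relationship between the two strands and the messenger RNA can be checked with a small plain-Python sketch. The function names are hypothetical and the code is only an illustration of the letter-level bookkeeping, not of Biopython's internals. It shows that swapping 'T' for 'U' on the coding strand gives the same result as pairing bases against the template strand.

```python
# Base pairing for unambiguous DNA letters (illustrative helper table).
PAIRS = str.maketrans("ACGT", "TGCA")


def transcribe(coding_dna: str) -> str:
    """Shortcut used in bioinformatics: copy the coding strand, T -> U."""
    return coding_dna.replace("T", "U")


def back_transcribe(rna: str) -> str:
    """Recover the coding strand from the messenger RNA, U -> T."""
    return rna.replace("U", "T")


def transcribe_from_template(template_dna: str) -> str:
    """Pair bases against the template strand, as the cell does.

    The template is assumed to be written in the reverse direction
    relative to the coding strand, so we reverse-complement it first.
    """
    coding = template_dna.translate(PAIRS)[::-1]
    return transcribe(coding)


coding_dna = "ATGATCTCGTAA"
template_dna = "TTACGAGATCAT"

print(transcribe(coding_dna))                  # AUGAUCUCGUAA
print(transcribe_from_template(template_dna))  # AUGAUCUCGUAA
print(back_transcribe("AUGAUCUCGUAA"))         # ATGATCTCGTAA
```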

## Translation

Translation is the next step in the process, whereby a messenger RNA is transformed into a protein sequence. Here's a beautiful diagram [from Wikipedia](https://en.wikipedia.org/wiki/Translation_(biology)#/media/File:Ribosome_mRNA_translation_en.svg) that lays out the basics of this process.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b1/Ribosome_mRNA_translation_en.svg/1000px-Ribosome_mRNA_translation_en.svg.png">

Note how 3 nucleotides at a time correspond to one new amino acid added to the growing protein chain. A set of 3 nucleotides which codes for a given amino acid is called a "codon." We can use the `translate()` method on the messenger RNA, as in `messenger_rna.translate()`, to perform this transformation in code.

The translation can also be performed directly from the coding DNA sequence:

In [23]:
coding_dna.translate()

Seq('MIS*', HasStopCodon(IUPACProtein(), '*'))

Let's now consider a longer genetic sequence that has some more interesting structure for us to look at.

In [24]:
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
coding_dna.translate()

Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

In both of the sequences above, '*' represents the [stop codon](https://en.wikipedia.org/wiki/Stop_codon). A stop codon is a sequence of 3 nucleotides that signals the translation machinery to stop. In DNA, the stop codons are 'TGA', 'TAA', 'TAG'. Note that this latest sequence has multiple stop codons. It's also possible to translate only up to the first stop codon and stop there.

In [25]:
coding_dna.translate(to_stop=True)

Seq('MAIVMGR', IUPACProtein())
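
Under the hood, translation is a table lookup over successive codons. The sketch below builds the standard genetic code from its usual compact 64-letter encoding (codons enumerated with bases in the order T, C, A, G) and mimics the `to_stop` behaviour. The `translate` function here is a hypothetical standalone helper, not Biopython's method, and it simply ignores trailing bases that do not form a full codon.

```python
from itertools import product

# Standard genetic code: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, to_stop: bool = False) -> str:
    """Translate coding DNA three letters at a time ('*' marks a stop)."""
    protein = []
    usable = len(dna) - len(dna) % 3  # drop any incomplete trailing codon
    for i in range(0, usable, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break  # stop at the first stop codon, without emitting '*'
        protein.append(amino_acid)
    return "".join(protein)


long_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate("ATGATCTCGTAA"))        # MIS*
print(translate(long_dna))              # MAIVMGR*KGAR*
print(translate(long_dna, to_stop=True))  # MAIVMGR
```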

We're going to introduce a bit of terminology here. A complete coding sequence (CDS) is a nucleotide sequence of messenger RNA which is made of a whole number of codons (that is, the length of the sequence is a multiple of 3), starts with a "start codon" and ends with a "stop codon". A start codon is basically the opposite of a stop codon and is most commonly the sequence "AUG", but can be different (especially if you're dealing with something like bacterial DNA).

Let's see how we can translate a complete CDS of bacterial messenger RNA.

In [26]:
from Bio.Alphabet import generic_dna

gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
           "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
           "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
           "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
           "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
           generic_dna)
# We specify a "table" to use a different translation table for bacterial proteins
gene.translate(table="Bacterial")

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In [27]:
gene.translate(table="Bacterial", to_stop=True)

Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein())
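
Notice that the translations above begin with 'V', because 'GTG' normally codes for valine. When a codon acts as a start codon, however, it is read as methionine. Biopython's `translate()` accepts a `cds=True` argument that checks for a complete CDS and applies this rule. The plain-Python sketch below illustrates checks of that kind under stated assumptions: `translate_cds` is a hypothetical helper, `START_CODONS` is only an illustrative subset of bacterial start codons, and the standard codon assignments are used for the remaining codons.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Assumption: a small illustrative subset of bacterial start codons.
START_CODONS = {"ATG", "GTG", "TTG"}


def translate_cds(dna: str) -> str:
    """Translate a complete CDS, raising ValueError if it is not one."""
    if len(dna) % 3 != 0:
        raise ValueError("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    if len(codons) < 2:
        raise ValueError("a CDS needs at least a start and a stop codon")
    if codons[0] not in START_CODONS:
        raise ValueError(f"{codons[0]} is not a recognised start codon")
    if CODON_TABLE[codons[-1]] != "*":
        raise ValueError("sequence does not end with a stop codon")
    body = [CODON_TABLE[codon] for codon in codons[1:-1]]
    if "*" in body:
        raise ValueError("stop codon found inside the sequence")
    # The start codon is read as methionine; the final stop is dropped.
    return "M" + "".join(body)


print(translate_cds("GTGAAATAA"))     # MK
print(translate_cds("ATGATCTCGTAA"))  # MIS
```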

## Handling Annotated Sequences

Sometimes it will be useful to handle sequences that carry richer annotations, as in GenBank or EMBL files. For these purposes, we'll want to use the `SeqRecord` class.

In [28]:
from Bio.SeqRecord import SeqRecord
help(SeqRecord)

Help on class SeqRecord in module Bio.SeqRecord:

class SeqRecord(builtins.object)
 |  A SeqRecord object holds a sequence and information about it.
 |  
 |  Main attributes:
 |   - id          - Identifier such as a locus tag (string)
 |   - seq         - The sequence itself (Seq object or similar)
 |  
 |  Additional attributes:
 |   - name        - Sequence name, e.g. gene name (string)
 |   - description - Additional text (string)
 |   - dbxrefs     - List of database cross references (list of strings)
 |   - features    - Any (sub)features defined (list of SeqFeature objects)
 |   - annotations - Further information about the whole sequence (dictionary).
 |     Most entries are strings, or lists of strings.
 |   - letter_annotations - Per letter/symbol annotation (restricted
 |     dictionary). This holds Python sequences (lists, strings
 |     or tuples) whose length matches that of the sequence.
 |     A typical use would be to hold a list of integers
 |     representing sequenc

Let's write a bit of code involving `SeqRecord` and see how it comes out looking.

In [29]:
from Bio.SeqRecord import SeqRecord

simple_seq = Seq("GATC")
simple_seq_r = SeqRecord(simple_seq)

In [30]:
simple_seq_r.id = "AC12345"
simple_seq_r.description = "Made up sequence"
print(simple_seq_r.id)
print(simple_seq_r.description)

AC12345
Made up sequence


Let's now see how we can use `SeqRecord` to parse a large fasta file. We'll pull down a file hosted on the biopython site.

In [31]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna

--2020-08-05 14:52:05--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9853 (9.6K) [text/plain]
Saving to: ‘NC_005816.fna’


2020-08-05 14:52:05 (50.1 MB/s) - ‘NC_005816.fna’ saved [9853/9853]



In [32]:
from Bio import SeqIO

record = SeqIO.read("NC_005816.fna", "fasta")
record

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])

Note how there are a number of annotations attached to the `SeqRecord` object!

Let's take a closer look.

In [33]:
record.id

'gi|45478711|ref|NC_005816.1|'

In [34]:
record.name

'gi|45478711|ref|NC_005816.1|'

In [35]:
record.description

'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
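
The three values above follow a simple pattern for FASTA input: the identifier is the first whitespace-delimited word of the header line, the name repeats it, and the description is the whole header. A minimal sketch of that mapping is given below. The `split_fasta_header` helper is hypothetical and mirrors the output shown above rather than Biopython's actual parser code.

```python
from typing import Dict


def split_fasta_header(header_line: str) -> Dict[str, str]:
    """Split a FASTA header into id, name and description fields."""
    title = header_line.lstrip(">").strip()
    # The identifier is everything up to the first run of whitespace.
    identifier = title.split(None, 1)[0] if title else ""
    return {"id": identifier, "name": identifier, "description": title}


header = (
    ">gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus "
    "str. 91001 plasmid pPCP1, complete sequence"
)
fields = split_fasta_header(header)
print(fields["id"])
print(fields["name"])
print(fields["description"])
```

A plain FASTA header carries no structured fields beyond this, which is why `id` and `name` are identical here.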

Let's now look at the same sequence, but downloaded from GenBank. We'll download the hosted file from the biopython tutorial website as before.

In [36]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb

--2020-08-05 14:52:19--  https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.gb
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31838 (31K) [text/plain]
Saving to: ‘NC_005816.gb’


2020-08-05 14:52:20 (3.80 MB/s) - ‘NC_005816.gb’ saved [31838/31838]



In [37]:
from Bio import SeqIO

record = SeqIO.read("NC_005816.gb", "genbank")
record

SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])

## SeqIO Objects

TODO(rbharath): Continue filling this up in future PRs.