# Tutorial Part 21: Introduction to Bioinformatics

So far in this tutorial series, we've primarily worked on problems in cheminformatics, using machine learning to predict the properties of molecules. In this tutorial, we're going to shift focus a bit and see how classical computer science techniques and machine learning can be used to tackle problems in bioinformatics.

For this, we're going to use the venerable [biopython](https://biopython.org/) library to do some basic bioinformatics. A lot of the material in this notebook is adapted from the extensive official [Biopython tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html). We strongly recommend checking out the official tutorial after you work through this notebook!

## Colab

This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/21_Introduction_to_Bioinformatics.ipynb)

## Setup

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [None]:
!wget -c https://repo.anaconda.com/archive/Anaconda3-2019.10-Linux-x86_64.sh
!chmod +x Anaconda3-2019.10-Linux-x86_64.sh
!bash ./Anaconda3-2019.10-Linux-x86_64.sh -b -f -p /usr/local
!conda install -y -c deepchem -c rdkit -c conda-forge -c omnia deepchem-gpu=2.3.0
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')

We'll use `pip` to install `biopython`:

In [1]:
!pip install biopython



In [2]:
import Bio
Bio.__version__

'1.76'

Biopython represents biological sequences with the `Seq` object, which supports sequence-specific operations such as taking the complement and the reverse complement.

In [3]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACATTG")
my_seq

Seq('AGTACACATTG')

In [4]:
my_seq.complement()

Seq('TCATGTGTAAC')

In [5]:
my_seq.reverse_complement()

Seq('CAATGTGTACT')
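
To see what these two methods compute, here is a minimal sketch in plain Python. It assumes an unambiguous DNA string containing only the uppercase letters A, C, G and T; the helper names `complement` and `reverse_complement` are our own for illustration and are not Biopython's implementation.

```python
# Illustrative helpers (not Biopython's code): map each base to its pairing partner.
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def complement(dna):
    """Return the base-by-base complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT_TABLE)

def reverse_complement(dna):
    """Return the complement read in the opposite direction."""
    return complement(dna)[::-1]

dna = "AGTACACATTG"
print(complement(dna))          # TCATGTGTAAC
print(reverse_complement(dna))  # CAATGTGTACT
```

The printed strings match the `Seq` results shown above for the same input.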

## Parsing Sequence Records

We're going to download a sample `fasta` file from the Biopython tutorial to use in some exercises. This file is a set of hits for a sequence (of lady slipper orchid genes).

In [6]:
!wget https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta

--2020-03-07 20:24:08--  https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.40.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.40.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76480 (75K) [text/plain]
Saving to: ‘ls_orchid.fasta’


2020-03-07 20:24:08 (2.81 MB/s) - ‘ls_orchid.fasta’ saved [76480/76480]



Let's take a look at what the contents of this file look like:

In [8]:
from Bio import SeqIO

for seq_record in SeqIO.parse('ls_orchid.fasta', 'fasta'):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
gi|2765655|emb|Z78530.1|CMZ78530
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAAACAACAT...CAT', SingleLetterAlphabet())
744
gi|2765654|emb|Z78529.1|CLZ78529
Seq('ACGGCGAGCTGCCGAAGGACATTGTTGAGACAGCAGAATATACGATTGAGTGAA...AAA', SingleLetterAlphabet())
733
gi|2765652|emb|Z78527.1|CYZ78527
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...CCC', SingleLetterAlphabet())
718
gi|2765651|emb|Z78526.1|CGZ78526
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAG...TGT', SingleLetterAlphabet())
730
gi|2765650|emb|Z78525.1|CAZ78525
Seq('TGTTGAGATAGCAGAATATACATCGAGTGAATCCGGAGGACCTGTGGTTATTCG...GC
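
To make the FASTA layout concrete, here is a minimal sketch of a parser in plain Python. It assumes the simplest form of the format: each record starts with a `>` header line, followed by one or more lines of sequence. The function `parse_fasta` and the two sample records are made up for illustration; this is not how `SeqIO.parse` is implemented, and it skips the validation a real parser would do.

```python
from io import StringIO

def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)

# Made-up records, used only to exercise the sketch.
sample = StringIO(
    ">seq1 made-up example record\n"
    "ACGT\n"
    "TTGA\n"
    ">seq2 another made-up record\n"
    "GGCC\n"
)
records = list(parse_fasta(sample))
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

In practice you would still use `SeqIO.parse`, which returns full record objects rather than plain tuples.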

## Sequence Objects

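A large part of the biopython infrastructure deals with tools for handling sequences. These could be DNA sequences, RNA sequences, amino acid sequences or even more exotic constructs. To tell biopython what type of sequence it's dealing with, you can specify the alphabet explicitly. Note that this relies on the `Bio.Alphabet` module, which is available in the Biopython 1.76 release used here but was removed in Biopython 1.78, so the alphabet examples in this section require an older release.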

In [12]:
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
my_seq = Seq("ACAGTAGAC", IUPAC.unambiguous_dna)
my_seq

Seq('ACAGTAGAC', IUPACUnambiguousDNA())

In [14]:
my_seq.alphabet

IUPACUnambiguousDNA()

If we want to encode a protein sequence, we can do that just as easily.

In [15]:
my_prot = Seq("AAAAA", IUPAC.protein) # Alanine pentapeptide
my_prot

Seq('AAAAA', IUPACProtein())

In [16]:
my_prot.alphabet

IUPACProtein()

We can take the length of sequences and index into them like strings.

In [17]:
print(len(my_prot))

5


In [18]:
my_prot[0]

'A'

You can also use slice notation on sequences to get subsequences.

In [19]:
my_prot[0:3]

Seq('AAA', IUPACProtein())

Sequences that share an alphabet can be concatenated with `+`.

In [20]:
my_prot + my_prot

Seq('AAAAAAAAAA', IUPACProtein())

But concatenating sequences with incompatible alphabets, such as a protein and a DNA sequence, fails:

In [21]:
my_prot + my_seq

TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
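
The idea behind this check can be sketched in plain Python: tag each sequence with its alphabet and refuse to concatenate when the tags differ. The `TypedSeq` class and its string alphabet labels below are hypothetical and only illustrate the behavior; they are not Biopython's classes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TypedSeq:
    """Hypothetical stand-in for a sequence that remembers its alphabet."""
    letters: str
    alphabet: str  # illustrative labels such as "dna" or "protein"

    def __add__(self, other):
        if not isinstance(other, TypedSeq):
            return NotImplemented
        if self.alphabet != other.alphabet:
            raise TypeError(
                f"Incompatible alphabets {self.alphabet} and {other.alphabet}"
            )
        return TypedSeq(self.letters + other.letters, self.alphabet)

prot = TypedSeq("AAAAA", "protein")
dna = TypedSeq("ACAGTAGAC", "dna")

print(prot + prot)  # same alphabet: concatenation succeeds

try:
    prot + dna      # mixed alphabets: rejected
except TypeError as err:
    print("TypeError:", err)
```

Failing loudly here is a deliberate design choice: a protein glued onto a DNA sequence has no biological meaning, so an error is more useful than a silently produced string.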

## Transcription

Transcription is the process by which a DNA sequence is converted into messenger RNA. Remember that this is part of the "central dogma" of biology, in which DNA engenders messenger RNA, which in turn engenders proteins. Here's a nice representation of this process borrowed from a Khan Academy [lesson](https://cdn.kastatic.org/ka-perseus-images/20ce29384b2e7ff0cdea72acaa5b1dbd7287ab00.png).

<img src="https://cdn.kastatic.org/ka-perseus-images/20ce29384b2e7ff0cdea72acaa5b1dbd7287ab00.png">
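
As a rough sketch of what transcription means at the string level, the plain-Python functions below produce an mRNA string from either DNA strand. They assume uppercase-compatible, unambiguous DNA with both strands written 5' to 3'; the names `transcribe_coding` and `transcribe_template` and the short example sequences are made up for illustration and are not Biopython functions.

```python
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")

def transcribe_coding(coding_strand):
    """mRNA has the same letters as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def transcribe_template(template_strand):
    """Recover the coding strand as the reverse complement, then swap T for U."""
    coding = template_strand.upper().translate(COMPLEMENT_TABLE)[::-1]
    return transcribe_coding(coding)

coding = "ATGGCC"    # made-up coding strand
template = "GGCCAT"  # its reverse complement, i.e. the template strand

print(transcribe_coding(coding))      # AUGGCC
print(transcribe_template(template))  # AUGGCC
```

Both calls give the same mRNA because the two inputs are the two strands of the same double-stranded DNA.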

Note from the image above that the reverse compl