# Biopython 
For anyone working with DNA or protein sequence information, `biopython` provides an extremely helpful set of tools. It lets the user interact programmatically with biological sequence data and includes interfaces to popular alignment and homology search tools such as BLAST and CLUSTAL, phylogenetic packages, and much more. We will barely scratch the surface today of what one can accomplish, so if you are interested you can start reading the documentation [here](https://biopython.org/wiki/Documentation).

## Installing biopython
The first order of business is to install biopython on your system. Open the command line (or the Anaconda Prompt on Windows systems) and type

`conda install -c anaconda biopython`

That should be all that is needed.
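A quick way to confirm that the install worked is to ask Python which version of the package it can see. The sketch below uses only the standard library; the helper name `installed_biopython_version` is made up for this example and is not part of biopython.

```python
from importlib import metadata


def installed_biopython_version():
    """Return the installed biopython version string, or None if it is missing."""
    try:
        # "biopython" is the distribution name used by pip and conda
        return metadata.version("biopython")
    except metadata.PackageNotFoundError:
        return None


version = installed_biopython_version()
print("biopython version:", version if version else "not installed")
```

If this prints "not installed", the notebook is probably running in a different environment from the one you installed into.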

## Working with sequences
The first use case for us will be working with DNA sequences. biopython provides a `Seq` object that contains at its heart a string of biological sequence but that "knows" how to do certain tricks:

In [7]:
from Bio.Seq import Seq
my_seq = Seq("AGTATCTTTGGT")
print(my_seq)

print(my_seq.complement())
print(my_seq.reverse_complement())

AGTATCTTTGGT
TCATAGAAACCA
ACCAAAGATACT
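To see what these two tricks compute, here is a plain-Python sketch of the same operations on an ordinary string. It is an illustration only, not how biopython implements them, and it assumes unambiguous DNA letters (A, C, G, T); the function names are chosen for this example.

```python
# Map each unambiguous DNA base to its Watson-Crick partner (both cases).
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap every base for its partner, keeping the original order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGTATCTTTGGT"))
print(reverse_complement("AGTATCTTTGGT"))
```

For the sequence used above, this gives the same two strings that the `Seq` methods printed.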


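Aside from containing strings, `Seq` objects in older Biopython releases also have an alphabet that can be set so that the object is even a bit smarter (the `Bio.Alphabet` module was removed in Biopython 1.78, so the cell below only runs on earlier versions). For instance: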

In [9]:
my_seq = Seq("AGTATCTTTGGT")
#check the alphabet of my_seq
print(my_seq.alphabet) #returns a generic thing

#set alphabet specifically
from Bio.Alphabet import IUPAC
my_seq = Seq("AGTATCTTTGGT",IUPAC.unambiguous_dna)
print(my_seq.alphabet)

Alphabet()
IUPACUnambiguousDNA()
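One thing an alphabet expresses is which letters a sequence is allowed to contain. Where the `alphabet` attribute is not available, a similar check can be done by hand. The following is a minimal sketch under that assumption; `is_unambiguous_dna` is a hypothetical helper, not a biopython function.

```python
# The four letters of the unambiguous DNA alphabet.
UNAMBIGUOUS_DNA = frozenset("ACGT")


def is_unambiguous_dna(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T."""
    letters = set(str(sequence).upper())
    return bool(letters) and letters <= UNAMBIGUOUS_DNA


print(is_unambiguous_dna("AGTATCTTTGGT"))  # only A, C, G, T
print(is_unambiguous_dna("AGTNNCTTTGGT"))  # contains the ambiguity code N
```

Because the helper calls `str()` on its argument, it should accept either a plain string or a `Seq` object.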


Sequences generally behave like strings, meaning that you can index them, iterate over them, etc.

In [13]:
for c in my_seq:
    print(c)
    
print("here is my_seq[0]: ",my_seq[0])

A
G
T
A
T
C
T
T
T
G
G
T
here is my_seq[0]:  A


In [26]:
# compute the GC percentage and the six-frame translation
from Bio.SeqUtils import GC,six_frame_translations
print(my_seq)
print("percent GC ",GC(my_seq))
print("\n")
print(six_frame_translations(my_seq))

AGTATCTTTGGT
percent GC  33.333333333333336


GC_Frame: a:2 t:6 g:3 c:1 
Sequence: agtatctttggt, 12 nt, 33.33 %GC


1/1
  Y  L  W
 V  S  L
S  I  F  G
agtatctttggt   33 %
tcatagaaacca
I  K  P 
 T  D  K  T
  Y  R  Q
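The GC percentage reported above is simply the share of G and C letters in the sequence. The sketch below recomputes it in plain Python; `gc_percent` is a name made up for this example. Note that newer biopython releases offer `gc_fraction` in `Bio.SeqUtils`, which returns a fraction between 0 and 1 rather than a percentage.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in the sequence (0.0 if empty)."""
    seq = str(sequence).upper()
    if not seq:
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


# AGTATCTTTGGT has 3 G and 1 C out of 12 letters
print("percent GC ", gc_percent("AGTATCTTTGGT"))
```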




There are many other basic sequence utilities that biopython provides. You have to wade through the `Bio.SeqUtils` documentation a bit to find out everything that it can do out of the box.

## Reading in sequences
Perhaps the single most useful thing that biopython provides is basic utilities to read from and write to common data formats such as FASTA and FASTQ. These parsers really help us make quick headway on even sophisticated datasets. We will work with a set of orchid rRNA gene sequences that you can download [here](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/ls_orchid.fasta), although I have also included the file in the notebooks directory of the GitHub repo.
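To make the FASTA format concrete, here is a small plain-Python sketch of what a FASTA reader has to do: each record starts with a `>` header line, and the sequence may be wrapped over several following lines. This is an illustration only, not how `SeqIO` is implemented. The function `parse_fasta` and the two toy records are made up for this example.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # a new header closes the previous record, if there was one
            if record_id is not None:
                yield record_id, "".join(chunks)
            header_fields = line[1:].split()
            # the id is the first whitespace-delimited token of the header
            record_id = header_fields[0] if header_fields else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line)  # sequence lines are joined back together
    if record_id is not None:
        yield record_id, "".join(chunks)


# Two made-up records, held in memory instead of in a file
toy_fasta = """>toy_seq_1 made-up example record
ACGTAC
GTTA
>toy_seq_2 another made-up record
GGGCCC
"""

records = list(parse_fasta(io.StringIO(toy_fasta)))
for record_id, sequence in records:
    print("id: ", record_id, "length: ", len(sequence))
```

For real data, `SeqIO.parse` is the better choice, since it handles many formats and returns full record objects rather than bare tuples.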



In [31]:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print("id: ",seq_record.id,"length: ",len(seq_record.seq))


id:  gi|2765658|emb|Z78533.1|CIZ78533 length:  740
id:  gi|2765657|emb|Z78532.1|CCZ78532 length:  753
id:  gi|2765656|emb|Z78531.1|CFZ78531 length:  748
id:  gi|2765655|emb|Z78530.1|CMZ78530 length:  744
id:  gi|2765654|emb|Z78529.1|CLZ78529 length:  733
id:  gi|2765652|emb|Z78527.1|CYZ78527 length:  718
id:  gi|2765651|emb|Z78526.1|CGZ78526 length:  730
id:  gi|2765650|emb|Z78525.1|CAZ78525 length:  704
id:  gi|2765649|emb|Z78524.1|CFZ78524 length:  740
id:  gi|2765648|emb|Z78523.1|CHZ78523 length:  709
id:  gi|2765647|emb|Z78522.1|CMZ78522 length:  700
id:  gi|2765646|emb|Z78521.1|CCZ78521 length:  726
id:  gi|2765645|emb|Z78520.1|CSZ78520 length:  753
id:  gi|2765644|emb|Z78519.1|CPZ78519 length:  699
id:  gi|2765643|emb|Z78518.1|CRZ78518 length:  658
id:  gi|2765642|emb|Z78517.1|CFZ78517 length:  752
id:  gi|2765641|emb|Z78516.1|CPZ78516 length:  726
id:  gi|2765640|emb|Z78515.1|MXZ78515 length:  765
id:  gi|2765639|emb|Z78514.1|PSZ78514 length:  755
id:  gi|2765638|emb|Z78513.1|PB