# Introduction to Biopython

## Installing Biopython

Up to now, every Python library we've needed, such as pandas, NumPy, and Matplotlib, came included with our Anaconda Python distribution. However, many useful Python libraries are not included in Anaconda Python by default, or are not directly available through the Anaconda package manager.

In class we'll review three ways to install modules:

### Installing modules using the Anaconda GUI

1. Run the Anaconda Navigator program
2. Select the "Environments" tab on the left
3. Select the environment you want to install a package into -- "base" by default
4. Select "All" in the package pull-down menu in the right pane
5. Search for the package of interest -- e.g. "biopython"
6. Click the checkbox next to the packages you wish to install and then select the "Apply" button

### Installing modules using the `conda` command line tool

1. Search for the package of interest -- `conda search biopython`
2. Install the package of interest -- `conda install biopython`

### Installing modules using pip

1. Search for packages of interest on [PyPI](https://pypi.org/). Searching from the command line (e.g. `pip search gff`) no longer works against PyPI, which has disabled the search API that command relied on.
2. Install via `pip install` command -- e.g. `pip install gffpandas`



## Biopython

Biopython is a library that contains a wide variety of functions and classes for working with bioinformatics data of various kinds. Nucleotide and protein sequence information is particularly well supported, but Biopython also has tools for many other tasks, such as running automated database searches over the internet, working with 3D structural data, and running population genetic simulations. Today we'll focus primarily on working with sequence data and associated metadata.

In [1]:
import Bio  # base library, this is a check to see if we installed it correctly
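
If the import fails, it helps to know whether the package is missing altogether or just installed into a different environment. The sketch below is one possible check that uses only the standard library. `installed_version` is a helper name made up for this example, and it assumes the package is distributed as `biopython` but imported as `Bio`.

```python
import importlib.util
from importlib import metadata


def installed_version(package_name, import_name=None):
    """Return a package's installed version, or None if it cannot be imported."""
    import_name = import_name or package_name
    # find_spec returns None when no top-level module of that name is found
    if importlib.util.find_spec(import_name) is None:
        return None
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        # importable, but no installation metadata (e.g. a standard library module)
        return "unknown"


# Prints a version string if Biopython is installed in this environment, else None
print(installed_version("biopython", import_name="Bio"))
```

A result of `None` in the environment where Jupyter is running suggests the package was installed somewhere else, or not at all.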

### How do I start to learn a new library?

1. Find the documentation and look for a tutorial
2. Read, test, and extend code examples illustrating how the library works
3. Learn how to use API documentation effectively
4. Learn how to query Python objects in an interactive session
5. Read the source code

We'll illustrate all of these steps today as we start to get acquainted with Biopython.

## CLASS TODO
1. Find the Biopython home page
2. Find the link to the Biopython documentation
3. Go to the API (application programming interface) documentation

### Seq objects

In [2]:
from Bio.Seq import Seq

In [3]:
# create a sequence object
s1 = Seq("ATGCGCGATGA")

In [4]:
# looks like a fancy string
s1

Seq('ATGCGCGATGA')

In [5]:
s1[0] # we can index Seqs like strings

'A'

In [6]:
s1[:3] # we can slice Seqs like strings

Seq('ATG')

In [7]:
# We can cast Seqs to strings
str(s1)

'ATGCGCGATGA'

## CLASS TODO

1. Find the Bio.Seq page in API docs
2. Skim the documentation for the non-dunder methods to get a sense of what sort of built-in functionality Seq objects have

### Python tools for introspection and documentation

How can we discover what we can do with objects in Python? A great way of course is to read the documentation online, but there are a number of ways to explore what sorts of methods are associated with objects from within the interpreter itself.


#### The type() function 
A good starting place is to use the `type` function to query an object about its type.

In [8]:
type(s1) 

Bio.Seq.Seq

### Accessing doc strings

Once you know what type of object you're dealing with you can see if there is any useful information in the doc strings for that type.  Two ways to do that are using the standard Python `help` function or using the Jupyter `?` command (specific to using the Jupyter environment).  The `?` command usually produces shorter, more succinct output so I'll usually try that first before the `help` command.

In [9]:
Bio.Seq.Seq?

Init signature: Bio.Seq.Seq(data, alphabet=Alphabet())
Docstring:
Read-only sequence object (essentially a string with an alphabet).

Like normal python strings, our basic sequence object is immutable.
This prevents you from doing my_seq[5] = "A" for example, but does allow
Seq objects to be used as dictionary keys.

The Seq object provides a number of string like methods (such as count,
find, split and strip), which are alphabet aware where appropriate.

In addition to the string like sequence, the Seq object has an alphabet
property. This is an instance of an Alphabet class from Bio.Alphabet,
for example generic DNA, or IUPAC DNA. This describes the type of molecule
(e.g. RNA, DNA, protein) and may also indicate the expected symbols
(letters).

The Seq object also provides some biological methods, 
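
The `?` shortcut only works in Jupyter/IPython. In a plain Python session, the standard `inspect` module gives similar information. The following is a small sketch: `short_doc` is a helper invented here, and the built-in `str.upper` is used as a stand-in object so that the example runs without Biopython. The same calls should work on `Bio.Seq.Seq` or any of its methods.

```python
import inspect


def short_doc(obj):
    """Return the first line of an object's docstring, or an empty string."""
    doc = inspect.getdoc(obj)  # returns None when there is no docstring
    return doc.splitlines()[0] if doc else ""


print(short_doc(str.upper))          # one-line summary of a built-in method
print(inspect.signature(short_doc))  # the call signature of our own function
```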

### The `dir` function

The `dir` function is another useful tool for introspection. `dir` returns a list of strings that give the names of all the "attributes" or "fields" (methods and non-method attributes) associated with an object or type.

In [10]:
dir(s1) # could also do dir(Bio.Seq.Seq)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__mul__',
 '__ne__',
 '__new__',
 '__radd__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 '_get_seq_str_and_check_alphabet',
 'alphabet',
 'back_transcribe',
 'complement',
 'count',
 'count_overlap',
 'encode',
 'endswith',
 'find',
 'index',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'rfind',
 'rindex',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

Names of the form `__name__`, with two leading and two trailing underscores, are called "dunder" (double underscore) names, and names with a single leading underscore (e.g. `_data`) are private by convention. These names matter for how the functionality of an object is implemented, but they tend to be less important to us as users of these objects, so it can be useful to filter them out.

In [11]:
[i for i in dir(s1) if not i.startswith("_")]  # get the attributes, hiding the "dunders"

['alphabet',
 'back_transcribe',
 'complement',
 'count',
 'count_overlap',
 'encode',
 'endswith',
 'find',
 'index',
 'join',
 'lower',
 'lstrip',
 'reverse_complement',
 'rfind',
 'rindex',
 'rsplit',
 'rstrip',
 'split',
 'startswith',
 'strip',
 'tomutable',
 'transcribe',
 'translate',
 'ungap',
 'upper']

We can also apply the `type`, `help`, and `dir` functions recursively to the attributes of an object, for example `type(s1.alphabet)` or `help(s1.translate)`.
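
One way to apply this idea to every public attribute at once is sketched below. `describe_public_attributes` is a name invented for this example, and a complex number is used as a stand-in object so the code runs without Biopython. Passing `s1` instead should summarize the attributes of a `Seq`.

```python
def describe_public_attributes(obj):
    """Map each public attribute name of obj to 'method' or to its type name."""
    summary = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip dunder and private names
        value = getattr(obj, name)
        summary[name] = "method" if callable(value) else type(value).__name__
    return summary


info = describe_public_attributes(3 + 4j)
for name, kind in info.items():
    print(f"{name}: {kind}")
```

This separates methods from non-method attributes, which is the distinction the API documentation also makes.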

## The Jupyter environment facilitates introspection

Since we're primarily working in the Jupyter notebook environment, we can take advantage of the features that Jupyter provides for exploring objects.

### Use code completion to explore object attributes

Jupyter's code completion facilities can help you identify object attributes and access docstrings.  

To use this feature, start by typing the variable name of an object, followed by a period, and then hit the tab key on your keyboard. This prompts Jupyter to show you the object's attributes:

* `s1.<Tab>` -- show object attributes

If you're working with a function or method and can't recall the details of its arguments, Jupyter can help there as well. Type the name of the function or method followed by parentheses, and with your cursor inside the parentheses hold down the shift key and press tab (`<Shift-Tab>`). Jupyter will show you a pop-up window with the function signature and a short version of the function's docstring.

* `s1.translate(<cursor here, <Shift-Tab>>)` -- show doc strings


### Methods on Bio.Seq objects

`Seq` objects define a number of useful methods, some related to string-like manipulations and others that are specific to biological sequences.

In [12]:
s1.complement()

Seq('TACGCGCTACT')

In [13]:
s1.reverse_complement()

Seq('TCATCGCGCAT')

In [14]:
s1.transcribe() 

Seq('AUGCGCGAUGA', RNAAlphabet())

In [15]:
s1.translate()



Seq('MRD', ExtendedIUPACProtein())
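
The translation has only three amino acids because `s1` is 11 bases long, so the last two bases do not form a complete codon. The plain-Python sketch below shows that bookkeeping. `split_codons` is a name made up for illustration, and this is not how Biopython implements `translate`.

```python
def split_codons(dna):
    """Split a DNA string into complete codons plus any leftover bases."""
    usable = len(dna) - len(dna) % 3  # length rounded down to a multiple of 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return codons, dna[usable:]


codons, leftover = split_codons("ATGCGCGATGA")
print(codons)    # ['ATG', 'CGC', 'GAT']
print(leftover)  # GA
```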

In [16]:
s1.count("TG") # count the number of occurrences of the substring "TG"

2

In [17]:
s1.find("TG") # find the index of the first occurrence of "TG"

1
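
To make concrete what these methods compute, here is a rough equivalent written with ordinary Python strings. It is a sketch for intuition, not Biopython's implementation, and it assumes unambiguous uppercase DNA (only A, C, G, and T). The function names simply mirror the `Seq` method names.

```python
# Translation table for unambiguous, uppercase DNA only (an assumption of this sketch)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap each base for its Watson-Crick partner."""
    return dna.translate(COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then reverse it."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Replace T with U to get the RNA version of a coding strand."""
    return dna.replace("T", "U")


dna = "ATGCGCGATGA"
print(complement(dna))                  # TACGCGCTACT
print(reverse_complement(dna))          # TCATCGCGCAT
print(transcribe(dna))                  # AUGCGCGAUGA
print(dna.count("TG"), dna.find("TG"))  # 2 1
```

The results match the `Seq` outputs above, which is a useful sanity check when learning what a library method does.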

### Specifying the Alphabet of a Seq object

You may have noticed in the examples above that when we used the `transcribe` and `translate` methods we got back sequences that displayed additional information, for example: `Seq('AUGCGCGAUGA', RNAAlphabet())`. This is because, in the version of Biopython used here, `Seq` objects always have an associated "Alphabet" that specifies the type of sequence we're dealing with (DNA, RNA, or protein). If we don't specify an Alphabet when the `Seq` is created, then a generic Alphabet is used. Note that the `Bio.Alphabet` module was removed in Biopython 1.78, so in newer releases `Seq` objects have no `alphabet` attribute and the examples in this section will not run.

In [18]:
s1.alphabet  # generic alphabet

Alphabet()

In [19]:
s2 = Seq("ATGCAT", alphabet=Bio.Alphabet.DNAAlphabet())  # explicitly create a DNA sequence
s2

Seq('ATGCAT', DNAAlphabet())

In [20]:
s1.alphabet = Bio.Alphabet.DNAAlphabet()  # You can change the alphabet by setting the alphabet attribute
s1

Seq('ATGCGCGATGA', DNAAlphabet())

In [21]:
r2 = s2.transcribe()
r2.alphabet  # the transcribe function returns a new sequence whose alphabet is RNAAlphabet

RNAAlphabet()

In [22]:
p2 = s2.translate()
p2.alphabet  # the translate function returns a new sequence whose alphabet is ExtendedIUPACProtein

ExtendedIUPACProtein()

### Parsing sequence records from a FASTA file

## CLASS TODO

1. Read the Bio.SeqIO.parse docs and short examples

In [23]:
from Bio import SeqIO

In [24]:
# use a for loop to iterate over fasta records in a file
# change the file path as needed

for rec in SeqIO.parse("../data/covid-S-and-E.fsa", format="fasta"):
    print(rec.name)

YP_009724390.1
YP_009724392.1


In [25]:
# use a list comprehension to get all the fasta records out of a file and store them in a list
recs = [rec for rec in SeqIO.parse("../data/covid-S-and-E.fsa","fasta")]

In [26]:
recs

[SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT', SingleLetterAlphabet()), id='YP_009724390.1', name='YP_009724390.1', description='YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[]),
 SeqRecord(seq=Seq('MYSFVSEETGTLIVNSVLLFLAFVVFLLVTLAILTALRLCAYCCNIVNVSLVKP...LLV', SingleLetterAlphabet()), id='YP_009724392.1', name='YP_009724392.1', description='YP_009724392.1 envelope protein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])]

In [27]:
len(recs)

2

In [28]:
recs[0]

SeqRecord(seq=Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT', SingleLetterAlphabet()), id='YP_009724390.1', name='YP_009724390.1', description='YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]', dbxrefs=[])

In [29]:
type(recs[0])

Bio.SeqRecord.SeqRecord

In [30]:
rec0 = recs[0]  # assign the first rec to the variable rec0

## CLASS TODO

1. Find the SeqRecord page in API docs
2. What are the non-method attributes associated with SeqRecords?
3. What are the methods associated with SeqRecords?

In [31]:
rec0.id

'YP_009724390.1'

In [32]:
rec0.name

'YP_009724390.1'

In [33]:
rec0.description

'YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]'

In [34]:
rec0.seq  # get the sequence associated with rec0

Seq('MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDL...HYT', SingleLetterAlphabet())

In [35]:
rec0.features # get any features associated with rec0

[]
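
The `id` and `description` values above follow from how FASTA headers are laid out: the header is the line starting with `>`, the first word of the header serves as the identifier, and the full header is kept as the description. The sketch below is a bare-bones FASTA reader written only to illustrate that layout. It is not Biopython's parser, `read_fasta` is a made-up name, and the two toy records are invented data.

```python
from io import StringIO


def _make_record(header, chunks):
    """Build an (id, description, sequence) tuple from a header and sequence lines."""
    words = header.split()
    return (words[0] if words else "", header, "".join(chunks))


def read_fasta(handle):
    """Yield (id, description, sequence) tuples from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)  # finish the previous record
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)  # finish the last record


# Invented example data, held in memory instead of a file
toy_fasta = """>toy1 first made-up protein
MKV
LLA
>toy2 second made-up protein
MYS
"""
toy_records = list(read_fasta(StringIO(toy_fasta)))
for rec_id, description, sequence in toy_records:
    print(rec_id, "|", description, "|", sequence)
```

FASTA headers carry no structured annotation, which is consistent with `rec0.features` being empty.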

## Parsing records from a Genbank file

`SeqIO.parse` also works with Genbank files (and other common file types as well, see the SeqIO documentation):

In [36]:
filename = "../data/NC_045512.gb"
covidrecs = [rec for rec in SeqIO.parse(filename, format="genbank")]

In [37]:
covidrecs

[SeqRecord(seq=Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA()), id='NC_045512.2', name='NC_045512', description='Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', dbxrefs=['BioProject:PRJNA485481'])]

In [38]:
len(covidrecs)

1

In [39]:
covidref = covidrecs[0]

In [40]:
covidref.name, covidref.description

('NC_045512',
 'Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome')

In [41]:
covidref.seq

Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA())

In [42]:
len(covidref.features)  # Since the genbank file specifies features, we can access them here

57

In [43]:
covidref.features[:3]

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29903), strand=1), type='source'),
 SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(265), strand=1), type="5'UTR"),
 SeqFeature(FeatureLocation(ExactPosition(265), ExactPosition(21555), strand=1), type='gene')]

In [44]:
ftr0 = covidref.features[0]
ftr0

SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29903), strand=1), type='source')

In [45]:
type(ftr0)

Bio.SeqFeature.SeqFeature

In [46]:
ftr0.id, ftr0.type, ftr0.qualifiers

('<unknown id>',
 'source',
 OrderedDict([('organism',
               ['Severe acute respiratory syndrome coronavirus 2']),
              ('mol_type', ['genomic RNA']),
              ('isolate', ['Wuhan-Hu-1']),
              ('host', ['Homo sapiens']),
              ('db_xref', ['taxon:2697049']),
              ('country', ['China']),
              ('collection_date', ['Dec-2019'])]))

**IMPORTANT**: features have a location, the coordinates of which are 0-indexed. Biopython converts the 1-indexed, inclusive coordinates used in Genbank files to 0-indexed, end-exclusive coordinates, so that when we use the converted coordinates to slice Seq objects we get the right subsequences.

In [47]:
ftr0.location

FeatureLocation(ExactPosition(0), ExactPosition(29903), strand=1)

In [48]:
type(ftr0.location)

Bio.SeqFeature.FeatureLocation

In [49]:
ftr0.location.start, ftr0.location.end

(ExactPosition(0), ExactPosition(29903))

In [50]:
covidref.seq[ftr0.location.start:ftr0.location.end]

Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', IUPACAmbiguousDNA())
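
The coordinate conversion is simple enough to write out by hand: the start moves down by one and the end stays the same, because Python slices exclude the end index. The sketch below assumes a simple range written Genbank-style as `start..end`. `genbank_to_python` is a name invented for this example.

```python
def genbank_to_python(start_1based, end_1based):
    """Convert an inclusive 1-based range to a 0-based, end-exclusive (start, end) pair."""
    return start_1based - 1, end_1based


genome = "ATTAAAGGTTTATACC"  # first 16 bases of the sequence shown above

# Bases 3 through 8 in 1-based, inclusive numbering
start, end = genbank_to_python(3, 8)
print(start, end)          # 2 8
print(genome[start:end])   # TAAAGG
print(end - start)         # 6 bases, the same as 8 - 3 + 1
```

A convenient side effect of this convention is that the length of a region is just `end - start`.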

Let's iterate over all the features to find out their types:

In [51]:
# iterate over all the features, printing the respective index and their types

for i in range(len(covidref.features)):
    ftrtype = covidref.features[i].type
    print(f"Feature {i} is a {ftrtype}")

Feature 0 is a source
Feature 1 is a 5'UTR
Feature 2 is a gene
Feature 3 is a CDS
Feature 4 is a mat_peptide
Feature 5 is a mat_peptide
Feature 6 is a mat_peptide
Feature 7 is a mat_peptide
Feature 8 is a mat_peptide
Feature 9 is a mat_peptide
Feature 10 is a mat_peptide
Feature 11 is a mat_peptide
Feature 12 is a mat_peptide
Feature 13 is a mat_peptide
Feature 14 is a mat_peptide
Feature 15 is a mat_peptide
Feature 16 is a mat_peptide
Feature 17 is a mat_peptide
Feature 18 is a mat_peptide
Feature 19 is a CDS
Feature 20 is a mat_peptide
Feature 21 is a mat_peptide
Feature 22 is a mat_peptide
Feature 23 is a mat_peptide
Feature 24 is a mat_peptide
Feature 25 is a mat_peptide
Feature 26 is a mat_peptide
Feature 27 is a mat_peptide
Feature 28 is a mat_peptide
Feature 29 is a mat_peptide
Feature 30 is a mat_peptide
Feature 31 is a stem_loop
Feature 32 is a stem_loop
Feature 33 is a gene
Feature 34 is a CDS
Feature 35 is a gene
Feature 36 is a CDS
Feature 37 is a gene
Feature 38 is a CDS
...
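
With this many features, a tally by type is often easier to read than a full listing. The sketch below uses `collections.Counter` on a short stand-in list (the first six types printed above) so that it runs on its own. With the real record, the list could instead be built as `[ftr.type for ftr in covidref.features]`.

```python
from collections import Counter

# Stand-in data: the first six feature types from the listing above
feature_types = ["source", "5'UTR", "gene", "CDS", "mat_peptide", "mat_peptide"]

type_counts = Counter(feature_types)
for ftrtype, n in type_counts.most_common():  # most frequent types first
    print(f"{ftrtype}: {n}")
```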

## CLASS TODO

1. Read the SeqFeature and FeatureLocation docs
2. What are the non-method attributes associated with SeqFeatures?
3. What are the methods associated with SeqFeatures?
4. What are the attributes of FeatureLocation?

In [52]:
# get all the gene features
genes = [ftr for ftr in covidref.features if ftr.type == "gene"]

In [53]:
len(genes)

11

In [54]:
# the qualifiers attribute is a dictionary that contains useful information about the features
genes[0].qualifiers 

OrderedDict([('gene', ['ORF1ab']),
             ('locus_tag', ['GU280_gp01']),
             ('db_xref', ['GeneID:43740578'])])
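
Notice that each qualifier value is a list, even when it holds a single item, and that not every feature has every key. A small helper can smooth over both points. This is a sketch using a plain dictionary copied from the output above. `first_qualifier` is a name invented here, not part of Biopython.

```python
def first_qualifier(qualifiers, key, default=None):
    """Return the first value stored under key, or default if the key is absent or empty."""
    values = qualifiers.get(key, [])
    return values[0] if values else default


# Plain-dict copy of the qualifiers shown above
quals = {
    "gene": ["ORF1ab"],
    "locus_tag": ["GU280_gp01"],
    "db_xref": ["GeneID:43740578"],
}

print(first_qualifier(quals, "gene"))            # ORF1ab
print(first_qualifier(quals, "product", "n/a"))  # n/a
```

Since the helper only relies on `.get`, it should behave the same way when given `genes[0].qualifiers` directly.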

In [55]:
for i, gene in enumerate(genes):  # look up the help on the enumerate function
    print(f"Gene {i} ({gene.qualifiers['gene'][0]}) is located at {gene.location.start}...{gene.location.end}\n")

Gene 0 (ORF1ab) is located at 265...21555

Gene 1 (S) is located at 21562...25384

Gene 2 (ORF3a) is located at 25392...26220

Gene 3 (E) is located at 26244...26472

Gene 4 (M) is located at 26522...27191

Gene 5 (ORF6) is located at 27201...27387

Gene 6 (ORF7a) is located at 27393...27759

Gene 7 (ORF7b) is located at 27755...27887

Gene 8 (ORF8) is located at 27893...28259

Gene 9 (N) is located at 28273...29533

Gene 10 (ORF10) is located at 29557...29674
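
Because these coordinates are 0-indexed and end-exclusive, the length of each gene is simply `end - start`. The sketch below works from coordinates copied from the first few lines of output above, so that it runs without the Genbank file. With the real features, the same arithmetic could use `gene.location.start` and `gene.location.end`.

```python
# (name, start, end) copied from the output above
gene_coords = [("ORF1ab", 265, 21555), ("S", 21562, 25384), ("E", 26244, 26472)]

gene_lengths = {name: end - start for name, start, end in gene_coords}
for name, length in gene_lengths.items():
    print(f"{name}: {length} bases")
```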



In [56]:
gene0 = genes[0]
gene0.location.start, gene0.location.end

(ExactPosition(265), ExactPosition(21555))

In [57]:
# one way to get the sequence corresponding to this gene
covidref[gene0.location.start:gene0.location.end]

SeqRecord(seq=Seq('ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTG...TAA', IUPACAmbiguousDNA()), id='NC_045512.2', name='NC_045512', description='Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', dbxrefs=[])

In [58]:
# an easier way to do the same thing, using the extract method
gene0.extract(covidref)

SeqRecord(seq=Seq('ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTG...TAA', IUPACAmbiguousDNA()), id='NC_045512.2', name='NC_045512', description='Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome', dbxrefs=[])
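
Conceptually, for a feature with a single start and end, `extract` amounts to slicing the parent sequence and, for features on the reverse strand, taking the reverse complement. The sketch below shows that idea on plain strings. `extract_region` is a hypothetical helper and the toy sequence is made up. Biopython's `extract` also handles locations made of several joined parts, which this sketch does not attempt.

```python
# Complement table for unambiguous, uppercase DNA only (an assumption of this sketch)
COMPLEMENT_TABLE = str.maketrans("ACGT", "TGCA")


def extract_region(sequence, start, end, strand=1):
    """Return sequence[start:end], reverse-complemented when strand is -1."""
    region = sequence[start:end]
    if strand == -1:
        region = region.translate(COMPLEMENT_TABLE)[::-1]
    return region


toy_genome = "AAATGCCC"  # invented sequence
print(extract_region(toy_genome, 2, 5))             # ATG
print(extract_region(toy_genome, 2, 5, strand=-1))  # CAT
```

All the genes listed above have `strand=1`, so for this genome plain slicing and `extract` give the same sequence.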