<div align=right>
<img src="img/logosmall.png" width="100px" align=right>
</div>

# Working with 3rd party libraries and modules

Let's refresh our memory about the overall geography of the Python ecosystem:

![Python ecosystem](img/Python%20ecosystem.png)

* At the core, we have the Python language itself.


* Shipping with the Python language is a selection of high-quality modules known as the Python *standard library*.


* Around that, there's a boundless sea of third-party modules and libraries, written by a large array of individuals, organisations and ad hoc groups of developers all over the world.

If you want to do something in Python, and this thing feels like it's even remotely something that somebody else might have done before, you can probably count on there being a module somewhere out there that does this thing already.  In fact, you're likely to find *several*.

The challenges facing you are:

* How do I find a module that does what I need?
* How do I tell if any modules I find are of decent quality / trustworthy?
* How do I install a module I have found?
* How do I learn to use the module properly?

Let's give ourselves a simple enough challenge:

**We need the tools to read FASTA files reliably and efficiently, and parse the data contained in order to use it in our own programs.**

Parsing a basic FASTA file is not very hard; we've already seen ways of doing so in this course.  The *problem* with parsing FASTA files is that the format has never been formally standardised.  Hence, there are myriad little differences between the FASTA files written by different applications.  Writing a FASTA parser that takes all (or even most) of these into account is a tedious, time-consuming undertaking.

It's also almost certainly something that someone has already done.

In addition to making allowances for variation in FASTA formats, it would also be good to have a FASTA parser that's *efficient* in terms of computing resources.  One that, for instance, doesn't attempt to read an entire file into memory.
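
To make "doesn't read the entire file into memory" concrete, here is a minimal sketch of a lazy FASTA reader built on a Python generator.  The name `read_fasta` is made up for this illustration, and the function only copes with plain, well-behaved FASTA; it makes no attempt to handle the format variations described above.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples one record at a time.

    Hypothetical helper for illustration only: it handles plain,
    well-behaved FASTA and nothing more.
    """
    header, chunks = None, []
    for line in handle:              # iterating a file reads one line at a time
        line = line.strip()
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            if header is not None:   # a new header closes the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)      # sequence may be spread over several lines
    if header is not None:           # don't forget the last record
        yield header, "".join(chunks)

# A tiny in-memory "file" stands in for a real FASTA file here.
demo = io.StringIO(">seq1 first\nACGT\nTTGA\n\n>seq2\nGGCC\n")
for header, sequence in read_fasta(demo):
    print(header, len(sequence))
```

Because the function yields records as it goes, only one record needs to be held in memory at any moment, however large the file is.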

Let's venture out onto the Internet and see what we can come up with!

## Navigating the world of add-on modules

Venturing into the universe of 3rd party libraries from the relative safety of the standard library can be a little scary at first.  Fortunately, there are a couple of tools out there to make our life a little easier:

Your first stop when searching for a module that fulfills some need you have should probably be the official Python Package Index, *PyPI*:

<http://pypi.python.org>

PyPI is a voluntary registry where developers can add any modules they've created and made publicly available.  There's no real barrier to entry to this index — very little intrinsic quality control.  This means that the modules listed on PyPI are (a) vast in number and (b) widely varying in quality.

Let's go to PyPI and type the term “`fasta`” in the search box on the top right…

*Oh dear…*

Yes, there are dozens upon dozens of modules that somehow relate to the FASTA format.  On the one hand this is great, but on the other hand, how do we distinguish a well-written, trustworthy, continually maintained module from … something that someone wrote in an afternoon and published on a lark?

The unfortunate answer is, "not easily".  There are a couple of pointers one can use:

* Firstly, just click on an individual module with a likely-sounding name.  Read its description.  Make *sure* it's what you need;  very often names alone can be misleading.


* Check its version number.  Sure, version numbers are arbitrary, but if the developer of a module assigns a version number like `0.0.2`, it's a likely sign that they themselves don't consider the module very mature yet.


* Check when it was last updated.  Well-maintained modules get fairly regular updates.  Modules that have languished for a couple of years without support are often not worth bothering with.
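
The version and last-update checks above can also be done from Python rather than by clicking around.  The sketch below assumes PyPI's JSON endpoint at `https://pypi.org/pypi/<package>/json` and the field names `info`, `urls` and `upload_time` as I understand them; treat those as assumptions to verify.  The function names are made up for this example.

```python
import json
from urllib.request import urlopen

def fetch_pypi_metadata(package):
    """Download a package's metadata from PyPI (needs network access).

    Assumes the JSON endpoint https://pypi.org/pypi/<package>/json.
    """
    with urlopen("https://pypi.org/pypi/{}/json".format(package)) as response:
        return json.load(response)

def summarise(metadata):
    """Return (name, latest version, upload date of the latest release)."""
    info = metadata["info"]
    # "urls" is assumed to list the files of the latest release
    uploads = [f["upload_time"] for f in metadata.get("urls", [])]
    # ISO timestamps sort correctly as plain strings; keep only YYYY-MM-DD
    last_upload = max(uploads)[:10] if uploads else "unknown"
    return info["name"], info["version"], last_upload

# Uncomment to try it with a live network connection:
# print(summarise(fetch_pypi_metadata("biopython")))
```

A low version number combined with an upload date several years in the past is the warning sign described in the list above.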

Even keeping all of this in mind, it's often still hard to make an informed choice.  Often, a good solution is to turn to the community.  There are many virtual places where the bioinformatics community gathers, e.g.:


* The Q&A site [Biostars](http://www.biostars.org)


* The [SEQanswers](http://seqanswers.com) forum

And if all else fails, there's always … *Google!*  Let's google the term "python parse fasta":  [Click here](http://lmgtfy.com/?q=python+parse+fasta) to do the search…

The very first two results seem interesting:

![Google search](img/google_search.png)

* The first result is a [link to a question on Biostars](https://www.biostars.org/p/710/).  You can already see from the few lines of preview text on Google that it concerns something called "Biopython".  Please go and look at the answer thread there.


* The second result is actually a [link into the documentation](http://biopython.org/wiki/SeqIO) of something called `SeqIO` on a site called <http://biopython.org>.


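On Biostars, the top-rated answer is from user Zhaorong, who suggests that you "just use Biopython" and provides a code snippet along these lines (lightly updated here to work with current Biopython; `input_file`, `output_file`, `some_function` and `write_fasta` are placeholders you would supply yourself):

In [None]:
from Bio import SeqIO

# Lazily iterate over the records in the input file
fasta_sequences = SeqIO.parse(open(input_file), 'fasta')

with open(output_file, 'w') as out_file:   # 'w': this file is written to
    for fasta in fasta_sequences:
        # str(...) replaces Seq.tostring(), which newer Biopython releases removed
        name, sequence = fasta.id, str(fasta.seq)
        new_sequence = some_function(sequence)
        write_fasta(out_file)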

That seems… really easy.  So let's learn more about this Biopython by looking at its home page:

<http://biopython.org>

>Biopython is a set of **freely available** tools for biological computation written in Python by an **international team of developers**.

>It is a **distributed collaborative effort** to develop Python libraries and applications which address the needs of current and future work in bioinformatics. The **source code is made available** under the Biopython License, which is extremely liberal and compatible with almost every license in the world. We **work along with the Open Bioinformatics Foundation**, who generously host our website, bug tracker, and mailing lists.

The emphasis in that text was mine.  All those words in **bold** make me happy — Biopython sounds like a worthwhile toolset.  But words should be backed up by deeds.  By following some links on the Biopython home page, I find that the Biopython source code is hosted on GitHub, so let's have a look there:

<https://github.com/biopython>

The first thing we notice is that the Biopython code is maintained by a group of developers.  This isn't just one person's bedroom project.  So far, so good.  Now let's look at the actual repository:

<https://github.com/biopython/biopython>

10400 commits (as I write this).  That sounds significant.

Clicking around on the tabs at the top, I get every impression that this is an *active* project (last commit 6 days ago as I write this) with a *long history* (going back to 2000!)

I'm almost convinced.  This seems like a great tool to have in my toolbox to do far more than just parse FASTA files, though it seems it will serve that immediate need as well.

One more thing:  Let's browse the official documentation and see what this thing can do:

<http://biopython.org/wiki/Documentation>

OK, there's a *lot* of it, but a quick browse around satisfies me that Biopython provides lots (and lots) of tools I would use on a regular basis.  Time to install it.

## Installing a 3rd party module with `conda`

Let's first try the obvious thing, and search for `biopython` using the `conda` tool on the command line:

In [None]:
!conda search biopython

As you can see, the *Anaconda Cloud* contains quite a few packaged versions of Biopython.  The one with the dot (“`.`”) next to it is the one that's most suited for your system — the latest version of Biopython compiled for the right version of Python, and the right computer architecture.

At this point, installing Biopython becomes as simple as:

In [None]:
!conda install -y biopython

>The `-y` flag tells `conda` to assume the answer `yes` to all questions, so it won't stop and wait for your input before performing the installation.

Once the installation has finished, we can check whether Biopython has been successfully installed by trying to `import` its top-level namespace, `Bio`:

In [1]:
import Bio

If executing the `import` statement didn't raise any error, then congratulations, you've installed a major third party Python module.

>No really, you have no idea how lucky you are.  Even just a few years ago, installing something like Biopython would've taken a knowledgeable user the best part of a morning.
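
If you'd rather get a clear message than a traceback when a module is missing, the import check can be wrapped in a small helper.  This is only a sketch: `module_version` is a made-up name, and it relies on the convention of a module exposing a `__version__` attribute, which Biopython follows but not every module does.

```python
import importlib

def module_version(name):
    """Return a module's version string, "unknown" if it doesn't
    advertise one, or None if the module cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")

version = module_version("Bio")
if version is None:
    print("Biopython is not installed in this environment")
else:
    print("Biopython version:", version)
```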

## Using a 3rd party module

Let's try to use Biopython to parse a simple FASTA file.  In the `files` subdirectory, there's a FASTA file called `sample2.fa`.

In [2]:
%cd files

/Users/sabineurban/EVOP2017/files


If we want to figure out the intricacies of parsing FASTA files (and many other formats used in bioinformatics) we should read the documentation for the Biopython module `SeqIO`:

<http://biopython.org/wiki/SeqIO>

Here's the short version:

In [3]:
from Bio import SeqIO

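# SeqIO.parse accepts a filename directly, so no file handle is left open
for record in SeqIO.parse("sample2.fa", "fasta"):
    # avoid the name `id`, which would shadow Python's built-in id() function
    record_id, seq = record.id, record.seq
    print("Record {} has sequence {}".format(record_id, seq), end='\n'*2)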

Record YL069W-1.334 has sequence CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATATTGAAACGCTAACAA

Record YAL068C-7235.2170 has sequence TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGATACATTACGTGCAACCAAAAGTGTAAAATGATTGGTTGCAATGTTTCACCTAAATTACTT

Record YAL070W-223.3355 has sequence CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATCTAATATGCCT



## What did we achieve?

Arguably, just writing our own small FASTA parser would've been easier than all the hoops we jumped through to install and use Biopython.

>That might not have been the case if we had wanted to parse something considerably more difficult, such as BLAST output.

For our trouble, though, we have gained a bulletproof FASTA parser that should parse almost any FASTA file we ever come across.  Additionally, it has more features than we could currently imagine we'd ever need … until we need them one day.
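
As a small taste of those extra features, the sketch below parses two made-up records from an in-memory string and uses a few attributes of the record objects that `SeqIO.parse` returns.  The sequences and headers are invented for the example; `StringIO` simply stands in for a file on disk.

```python
from io import StringIO
from Bio import SeqIO

# Two made-up records, held in memory so the example needs no file on disk.
fasta_text = ">seq1 first example\nATGCCG\n>seq2 second example\nGGCC\n"

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # id is the first word of the header; description is the whole header line
    print(record.id, len(record.seq), record.description)

# Sequence objects know some biology, e.g. how to reverse-complement themselves
print(records[0].seq.reverse_complement())
```

Calling `list(...)` reads every record into memory, which is fine for two short sequences; for a large file you would loop over `SeqIO.parse(...)` directly instead.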

As always when installing a 3rd party module, it pays richly to sit down and read the documentation to get a good idea of what your new tool can accomplish.

## Other useful modules

There are too many other useful 3rd party modules to list.  But here are a few that are very generally useful in the scientific endeavour:

**NumPy** is an underlying part of many science-focused modules.  It provides high-performance support for numerical computing, including super-efficient vectors and matrices.

* <http://numpy.org>

In [None]:
!conda install -y numpy

**Matplotlib** is the most widely used scientific plotting and visualisation library in Python.  Its syntax is reminiscent of Matlab, and some find it a little archaic.  Hence, it has some up-and-coming competition from newer plotting libraries.  For now, though, it remains the standard and interacts closely with NumPy.

* <http://matplotlib.org>

In [None]:
!conda install -y matplotlib

**scikit-bio** is, like Biopython, a collection of general-purpose tools useful in bioinformatics.

* <http://scikit-bio.org>

In [None]:
!conda install -y scikit-bio

**Pandas** is a toolkit for statistical data analysis, built on top of NumPy and closely integrated with Matplotlib.  It provides an equivalent of R's data frame.

* <http://pandas.pydata.org>

In [None]:
!conda install -y pandas

## Care and feeding of `conda`

Good 3rd party modules are continually developed and updated, and it's a good idea to update your installed modules from time to time.

To update the `conda` toolkit itself, you should perform this command from time to time:

In [None]:
!conda update -y conda

To update an individual 3rd party module — say Biopython — you can again use `conda update`:

In [None]:
!conda update -y biopython

To update *all* your 3rd party modules installed by `conda`, you can do the following:

In [None]:
!conda update -y --all

Do take some care that you don't *break backwards compatibility*, i.e. update a module to a new version, only to find the code you wrote against that module now breaks.

This shouldn't happen very often with well-written 3rd party modules, but it remains a concern.

One way to overcome this problem is to use `conda`'s facility to create separate "environments", each with its own set of installed modules.  That is beyond the scope of this course, but you can look it up in the Anaconda documentation:

<http://conda.pydata.org/docs/test-drive.html>
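
A lighter-weight safeguard than separate environments is to note which versions you have before an update and compare afterwards, so that a sudden breakage can be traced to a specific module.  The sketch below uses the standard library's `importlib.metadata` (Python 3.8 or later); the function names `snapshot` and `changed` are made up for this example, and the package names passed in are the distribution names as used on PyPI.

```python
from importlib.metadata import version, PackageNotFoundError

def snapshot(packages):
    """Map each package name to its installed version (None if not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = version(name)
        except PackageNotFoundError:
            versions[name] = None
    return versions

def changed(before, after):
    """Return {name: (old, new)} for packages whose version differs."""
    return {name: (before.get(name), new)
            for name, new in after.items()
            if before.get(name) != new}

# Take a snapshot before updating ...
before = snapshot(["biopython", "numpy", "pandas"])
# ... run the update, take a second snapshot, then compare the two:
# print(changed(before, snapshot(["biopython", "numpy", "pandas"])))
```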

---

# Take-home exercise

The FASTA format file `exons.fasta` in the `files` subdirectory contains DNA sequence data. 

Use the Biopython FASTA parser to parse this file, and answer the following questions:

* How many records are in the file?

* How many records have a sequence length of 3408?

* What is the header for the record with the shortest sequence? Is there more than one record with that length?

* What is the header for the record with the longest sequence? Is there more than one record with that length?

* How many records have sequences that contain 20-nucleotide repeats (the same nucleotide repeated at least 20 consecutive times)?

* Do any records contain 100% identical sequences?

* The records in the file represent exons.  How many exons can you find for the gene with Ensembl ID `ENSG00000006831`?  What are their exon IDs?

* Which of the exons of `ENSG00000006831` has the highest GC content?