In [None]:
%%html
<link rel='stylesheet' type='text/css' href='custom.css'/>

In [None]:
!rm -f data/converted-seqs.fasta data/converted-seqs.qual data/not-yasf.fna

![](assets/logo.svg)

# A Bioinformatics Library for Data Scientists, Students, and Developers

Jai Rideout and Evan Bolyen

*[Caporaso Lab](http://caporasolab.us), Northern Arizona University*

## What is scikit-bio?

Wed Jul  8 20:26:12 CDT 2015


A Python bioinformatics library for:

- data scientists

- students

- developers

- high-level API designed for biological data munging
- extensive docs and companion texts (the scikit-bio cookbook and IAB, *An Introduction to Applied Bioinformatics*)
- it's a scikit; guarantees about API stability

> "The first step in developing a new genetic analysis algorithm is to decide how to make the input data file format different from all pre-existing analysis data file formats." - [Law's First Law](http://www.bioinformatics.roslin.ed.ac.uk/lawslaws/)















<span style='line-height:2em; word-spacing:2em'>Axt BAM SAM BED bedGraph bigBed bigGenePred table bigWig Chain GenePred table GFF GTF HAL MAF Microarray Net Personal Genome SNP format PSL VCF WIG  abi ace clustal embl fasta fastq genbank ig imgt nexus phred phylip pir seqxml sff stockholm swiss tab qual uniprot-xml emboss PhyloXML NexML newick CDAO MDL bcf caf gcproj scf SBML lsmat ordination qseq BIOM ASN.1 .2bit .nib ENCODE ... </span>

In [None]:
#TODO: review the list

<span style='line-height:2em; word-spacing:2em'>Axt BAM SAM BED bedGraph bigBed bigGenePred table bigWig Chain GenePred table GFF GTF HAL MAF Microarray Net Personal Genome SNP format PSL VCF WIG  abi ace <span class='supio'>clustal</span> embl <span class='supio'>fasta</span> <span class='supio'>fastq</span> genbank ig imgt nexus phred <span class='supio'>phylip</span> pir seqxml sff stockholm swiss tab qual uniprot-xml emboss PhyloXML NexML <span class='supio'>newick</span> CDAO MDL bcf caf gcproj scf SBML <span class='supio'>lsmat</span> <span class='supio'>ordination</span> <span class='supio'>qseq</span> BIOM ASN.1 .2bit .nib ENCODE ... </span>

## I/O in bioinformatics is hard


- format redundancy (many-to-many)
  - Multiple file formats can be **read** into the same object.
  - A single object can be **written** in multiple formats.

- format ambiguity

- heterogeneous sources

## How can we solve this?


# An I/O Registry!


- file format implemented in single submodule
- registry provides simple API to implement format against
- (messy) format logic separate from object implementation
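
To make the idea concrete, here is a toy registry in plain Python. It is only a sketch of the pattern and not scikit-bio's implementation: the `Format` and `Registry` classes, the `Seq` object, and the `toy_fasta` and `toy_tab` formats are all made up for illustration. Two formats read into the same object, and that object can be written back out as either one.

```python
from collections import namedtuple

# Toy sequence object; it knows nothing about file formats.
Seq = namedtuple("Seq", ["id", "residues"])


class Format:
    """One file format: a sniffer plus readers/writers keyed by object class."""

    def __init__(self, name):
        self.name = name
        self.sniffer_func = None
        self.readers = {}
        self.writers = {}

    def _register(self, table, cls):
        def decorator(func):
            table[cls] = func
            return func
        return decorator

    def sniffer(self):
        def decorator(func):
            self.sniffer_func = func
            return func
        return decorator

    def reader(self, cls):
        return self._register(self.readers, cls)

    def writer(self, cls):
        return self._register(self.writers, cls)


class Registry:
    def __init__(self):
        self.formats = {}

    def create_format(self, name):
        self.formats[name] = Format(name)
        return self.formats[name]

    def read(self, lines, into, format=None):
        if format is None:
            # No format given: ask every registered sniffer.
            hits = [f.name for f in self.formats.values()
                    if f.sniffer_func and f.sniffer_func(lines)]
            if len(hits) != 1:
                raise ValueError("could not identify format: %r" % hits)
            format = hits[0]
        return self.formats[format].readers[into](lines)

    def write(self, obj, format):
        return self.formats[format].writers[type(obj)](obj)


registry = Registry()
toy_fasta = registry.create_format("toy_fasta")
toy_tab = registry.create_format("toy_tab")


@toy_fasta.sniffer()
def sniff_toy_fasta(lines):
    return bool(lines) and lines[0].startswith(">")


@toy_fasta.reader(Seq)
def toy_fasta_to_seq(lines):
    return Seq(lines[0][1:].strip(), "".join(l.strip() for l in lines[1:]))


@toy_fasta.writer(Seq)
def seq_to_toy_fasta(seq):
    return [">" + seq.id, seq.residues]


@toy_tab.sniffer()
def sniff_toy_tab(lines):
    return bool(lines) and "\t" in lines[0] and not lines[0].startswith(">")


@toy_tab.reader(Seq)
def toy_tab_to_seq(lines):
    seq_id, residues = lines[0].rstrip("\n").split("\t")
    return Seq(seq_id, residues)


@toy_tab.writer(Seq)
def seq_to_toy_tab(seq):
    return [seq.id + "\t" + seq.residues]


from_fasta = registry.read([">s1", "AC", "GT"], into=Seq)
from_tab = registry.read(["s1\tACGT"], into=Seq)
print(from_fasta == from_tab)
print(registry.write(from_fasta, format="toy_tab"))
```

All format-specific parsing lives in the decorated functions, so adding a third format would not require touching `Seq` at all.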

## Format redundancy (many-to-many)


In [None]:
from skbio import DNA

seq1 = DNA.read('data/seqs.fasta', qual='data/seqs.qual')
seq2 = DNA.read('data/seqs.fastq', variant='illumina1.8')
seq1

In [None]:
seq1 == seq2

## Efficient format conversion

In [None]:
import skbio.io
stream_of_seqs = skbio.io.read("data/seqs.fastq", format='fastq', 
                               variant='illumina1.8')
stream_of_seqs

In [None]:
skbio.io.write(stream_of_seqs, format='fasta', into='data/converted-seqs.fasta', 
               qual='data/converted-seqs.qual')

In [None]:
!head -2 data/converted-seqs.fasta

In [None]:
!head -2 data/converted-seqs.qual

## Format ambiguity

It is often unclear to the user what format a file is in.

File extensions aren't standardized, or are missing entirely (fasta/fna/txt).
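
One way around unreliable extensions is to inspect the content itself. The sketch below is a deliberately crude heuristic, not how scikit-bio's sniffers work: `guess_format` is a hypothetical helper that checks the gzip and bz2 magic bytes, decompresses if needed, and then looks at the first characters of the text.

```python
import bz2
import gzip


def guess_format(data):
    """Guess (compression, format) from raw bytes. Toy heuristic only."""
    if data[:2] == b"\x1f\x8b":        # gzip magic number
        compression, data = "gzip", gzip.decompress(data)
    elif data[:3] == b"BZh":           # bz2 magic number
        compression, data = "bz2", bz2.decompress(data)
    else:
        compression = None

    text = data.decode("ascii", errors="replace").strip()
    if text.startswith(">"):
        fmt = "fasta"
    elif text.startswith("@"):
        fmt = "fastq"
    elif text.startswith("(") and text.endswith(";"):
        fmt = "newick"
    else:
        fmt = None
    return compression, fmt


print(guess_format(gzip.compress(b"((a, b, c), d:15):0;\n")))  # ('gzip', 'newick')
print(guess_format(b">seq1\nACGT\n"))                          # (None, 'fasta')
```

A real sniffer needs to read more than the first character to avoid false positives, but the principle is the same: trust the bytes, not the file name.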

In [None]:
skbio.io.sniff('data/mystery_file.gz')

## Heterogeneous sources

#### Read a gzip file from a URL:

In [None]:
from skbio import TreeNode

tree1 = skbio.io.read('http://localhost:8888/files/data/newick.gz', 
                      into=TreeNode)
print(tree1.ascii_art())

#### Read a bz2 file from an open filehandle:

In [None]:
# Open in binary mode so skbio.io receives the raw bz2 bytes and can decompress them.
with open('data/newick.bz2', mode='rb') as open_filehandle:
    tree2 = skbio.io.read(open_filehandle, into=TreeNode)

print(tree2.ascii_art())

#### Read a list of lines:

In [None]:
tree3 = skbio.io.read(['((a, b, c), d:15):0;'], into=TreeNode)
print(tree3.ascii_art())

## Let's make a format!

# YASF (Yet Another Sequence Format)

In [None]:
!cat data/yasf-seq.yml

In [None]:
import yaml

yasf = skbio.io.create_format('yasf')

@yasf.sniffer()
def yasf_sniffer(fh):
    return fh.readline().rstrip() == "#YASF", {}

@yasf.reader(DNA)
def yasf_to_dna(fh):
    # safe_load parses plain YAML without constructing arbitrary Python objects.
    seq = yaml.safe_load(fh.read())
    return DNA(seq['Sequence'], metadata={
        'id': seq['ID'],
        'location': seq['Location'],
        'description': seq['Description']
    })

In [None]:
seq = DNA.read("data/yasf-seq.yml")
seq

## Convert YASF to FASTA

In [None]:
seq.write("data/not-yasf.fna", format='fasta')
!cat data/not-yasf.fna

Developers using scikit-bio can rely on our object model to support current and future file formats.

## We are in beta - should you even use our software?

# YES!

## API Lifecycle
![](assets/stability-state-diagram.svg)
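
A decorator is a natural way to attach a lifecycle state to a function, because it can record the state in the docstring where `help()` will show it. The following is a minimal sketch of that idea under our own assumptions: `stable_sketch`, `add_sketch`, and the wording of the inserted note are hypothetical and do not reproduce scikit-bio's decorator.

```python
import functools


def stable_sketch(as_of):
    """Hypothetical decorator: mark a function as stable since version `as_of`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)

        note = "State: Stable as of {}.".format(as_of)
        summary, _, rest = (func.__doc__ or "").partition("\n")
        # Keep the one-line summary first, then the stability note, then the rest.
        wrapper.__doc__ = "{}\n\n{}\n{}".format(summary, note, rest)
        return wrapper
    return decorator


@stable_sketch(as_of="0.4.0")
def add_sketch(a, b):
    """Add two numbers."""
    return a + b


print(add_sketch.__doc__)
```

Because the state lives in the docstring, users see it in the same place they already look for usage information.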


In [None]:
from skbio.util._decorator import stable

@stable(as_of='0.4.0')
def add(a, b):
    """Add two numbers.
    
    Parameters
    ----------
    a, b : int
        Numbers to add.
        
    Returns
    -------
    int
        Sum of `a` and `b`.
    
    """
    return a + b

In [None]:
help(add)

### What is stable:

- `skbio.io` 
- `skbio.sequence`

&nbsp;
&nbsp;
### What is next:

- `skbio.alignment`
- `skbio.tree`
- `skbio.diversity`
- `skbio.stats`
- &lt;`your awesome subpackage!`&gt;

## Sequence API: putting the *scikit* in scikit-bio

- interoperability with the SciPy stack
- "numpythonic" API
- performance
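
To illustrate why an array-backed design helps, here is a small NumPy-only sketch. It is an illustration of the general approach and not scikit-bio's internals; the variable names are ours. The sequence is held as an array of byte codes, so case detection, uppercasing, and mask-based slicing are all vectorized operations.

```python
import numpy as np

raw = "AacgtGTggA"

# View the ASCII bytes of the string as an unsigned 8-bit array.
codes = np.frombuffer(raw.encode("ascii"), dtype=np.uint8)

# Lowercase ASCII letters occupy the codes for 'a' through 'z'.
is_lower = (codes >= ord("a")) & (codes <= ord("z"))

# Uppercase the whole sequence at once: lowercase codes are 32 above uppercase.
upper = np.where(is_lower, codes - 32, codes).astype(np.uint8)

# A boolean mask selects positions, much like a per-position metadata column.
masked = upper[is_lower]

print(upper.tobytes().decode("ascii"))   # AACGTGTGGA
print(masked.tobytes().decode("ascii"))  # ACGTGG
```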

In [None]:
seq = DNA("AacgtGTggA", lowercase='exon')
seq

## Made with numpy

In [None]:
seq.values

## And a pinch of pandas

In [None]:
seq.positional_metadata

## Slicing with positional metadata:

In [None]:
seq[seq.positional_metadata['exon']]

## Application: building a taxonomy classifier