## Overview
```
Author: Florian Wagner
Email: florian.wagner@duke.edu
```
This notebook downloads all data used in the DMAP GO-PCA demo:
- Human gene metadata from [NCBI](ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/), used to map Entrez gene IDs to gene names
- Human genome annotations from [Ensembl](http://www.ensembl.org/info/data/ftp/index.html), used to generate a list of all human protein-coding genes
- The DMAP expression dataset from the [Broad Differentiation Map Portal (DMAP)](http://www.broadinstitute.org/dmap/home)
- The GO annotations from the [UniProt-GOA database](http://www.ebi.ac.uk/GOA/downloads)
- The Gene Ontology from the [Gene Ontology Consortium](http://geneontology.org/page/download-ontology)

### Programs and third-party Python packages used

In [1]:
from pkg_resources import require

print 'Curl:'
!curl --version | head -n 1
print

print 'Gzip:'
!gzip --version | head -n 1
print

print 'Python:'
!python -V
print

print 'Third-party Python packages:'
print str(require('configparser')[0])
print str(require('genometools')[0])

Curl:
curl 7.42.1 (x86_64-unknown-linux-gnu) libcurl/7.42.1 OpenSSL/1.0.2a zlib/1.2.8

Gzip:
gzip 1.4

Python:
Python 2.7.9

Third-party Python packages:
configparser 3.3.0.post2
genometools 1.2rc5


## Read config file (config.ini)

In [2]:
import codecs
from configparser import ConfigParser

config_file = 'config.ini'

config = ConfigParser()
with codecs.open(config_file, 'rb', encoding='UTF-8') as fh:
    config.read_file(fh)
    
config = config['Demo']
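
The rest of this notebook reads a fixed set of keys from the `Demo` section, so it can help to check up front that none are missing. The following is a minimal sketch, not part of the original notebook: it assumes Python 3's standard-library `configparser` (the notebook itself runs on Python 2.7 with the backport), and the helper name `missing_keys` and the example values are hypothetical.

```python
from configparser import ConfigParser

# Keys that this notebook looks up in the [Demo] section.
REQUIRED_KEYS = [
    'data_dir',
    'gene2accession_file', 'gene2accession_url',
    'genome_annotation_file', 'genome_annotation_url',
    'expression_file', 'expression_url',
    'go_annotation_file', 'go_annotation_url',
    'gene_ontology_file', 'gene_ontology_url',
]


def missing_keys(section, required=REQUIRED_KEYS):
    """Return the required keys that are absent from a config section."""
    return [key for key in required if key not in section]


# Hypothetical, deliberately incomplete config used only to exercise the helper.
example = ConfigParser()
example.read_string("""
[Demo]
data_dir = data
expression_file = data/expression.tsv
""")
print(missing_keys(example['Demo']))
```

Calling `missing_keys(config)` right after the cell above would list any entries that still need to be added to `config.ini` before the download cells run.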

In [3]:
# create data directory, if necessary
import os
if not os.path.isdir(config['data_dir']):
    os.mkdir(config['data_dir'])

## Download data

### Download NCBI data

In [4]:
# The gene2accession.gz file was downloaded from the NCBI FTP server on Jan 18, 2016:
# $ curl -o gene2accession_2016-01-18.tsv.gz ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

# It was then filtered for human genes (NCBI taxonomy ID 9606):
# $ gunzip -c gene2accession_2016-01-18.tsv.gz | grep -P '^9606\t' | gzip > gene2accession_2016-01-18_human.tsv.gz

# Download the filtered file from Dropbox:
!curl -L -o "{config['gene2accession_file']}" "{config['gene2accession_url']}"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   506    0   506    0     0    691      0 --:--:-- --:--:-- --:--:--   694
100 15.9M  100 15.9M    0     0  3543k      0  0:00:04  0:00:04 --:--:-- 4566k
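
The `gunzip | grep | gzip` pipeline quoted in the comments of the cell above keeps only the lines whose first tab-separated column is `9606`. For readers without GNU `grep -P`, the same filter could be sketched in Python as follows. This is an illustrative equivalent, not the command that produced the hosted file; the function name `filter_by_tax_id` is hypothetical, and the code assumes Python 3.

```python
import gzip

HUMAN_TAX_ID = b'9606'  # NCBI taxonomy ID for Homo sapiens


def filter_by_tax_id(input_path, output_path, tax_id=HUMAN_TAX_ID):
    """Copy lines whose first tab-separated field equals tax_id.

    Both files are gzip-compressed. Returns the number of lines kept.
    """
    prefix = tax_id + b'\t'
    kept = 0
    with gzip.open(input_path, 'rb') as src, \
            gzip.open(output_path, 'wb') as dst:
        for line in src:
            # Equivalent to grep '^9606\t': match at the start of the line only.
            if line.startswith(prefix):
                dst.write(line)
                kept += 1
    return kept
```

Like the `grep` command, this drops every line that does not start with the taxonomy ID, including any header line.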


### Download human Ensembl genome annotations

In [5]:
!curl -o "{config['genome_annotation_file']}" "{config['genome_annotation_url']}"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 43.5M  100 43.5M    0     0  3795k      0  0:00:11  0:00:11 --:--:-- 7424k


### Download DMAP expression dataset

In [6]:
!curl -o "{config['expression_file']}" "{config['expression_url']}"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.4M  100 22.4M    0     0  1007k      0  0:00:22  0:00:22 --:--:-- 1097k


### Download GO annotations

In [7]:
!curl -o "{config['go_annotation_file']}" "{config['go_annotation_url']}"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6244k  100 6244k    0     0  1663k      0  0:00:03  0:00:03 --:--:-- 1797k


### Download Gene Ontology
The Gene Ontology version we download should match the version used for the GO annotations.
The first lines of the GO annotation file state which version that is:

In [8]:
!gunzip -c "{config['go_annotation_file']}" | head -n 10

!gaf-version: 2.1
!
!The set of protein accessions included in this file is based on UniProt complete proteomes, which may provide more than one protein per gene.
!They include all Swiss-Prot entries for the species plus any TrEMBL entries that have an Ensembl DR line. The TrEMBL entries are likely to overlap with the Swiss-Prot entries or their isoforms.
!If a particular protein accession is not annotated with GO, then it will not appear in this file.
!
!Note that the annotation set in this file is filtered in order to reduce redundancy; the full, unfiltered set can be found in
!ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz
!
!Generated: 2015-12-07 09:35

gzip: stdout: Broken pipe


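The version (date) of the Gene Ontology used is 2015-12-07 (see the line starting with `!Generated:`). The `gzip: stdout: Broken pipe` message is harmless: it appears because `head` closes the pipe after reading ten lines.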
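
The date can also be read programmatically instead of by eye. A minimal sketch, assuming Python 3 and a gzip-compressed annotation file whose header lines start with `!`, as in the output above; the function name `read_gaf_generated_date` is hypothetical.

```python
import gzip


def read_gaf_generated_date(path):
    """Return the date from the '!Generated:' header line, or None if absent."""
    with gzip.open(path, 'rt', encoding='utf-8') as fh:
        for line in fh:
            if not line.startswith('!'):
                break  # header comments are over; stop before the data rows
            if line.startswith('!Generated:'):
                # e.g. '!Generated: 2015-12-07 09:35' -> '2015-12-07'
                fields = line.split(':', 1)[1].split()
                return fields[0] if fields else None
    return None
```

With the notebook's configuration this would be called as `read_gaf_generated_date(config['go_annotation_file'])`. Because the file is closed from within Python, no `Broken pipe` message is produced.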

In [9]:
!curl -o "{config['gene_ontology_file']}" "{config['gene_ontology_url']}"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 30.0M    0 30.0M    0     0  6684k      0 --:--:--  0:00:04 --:--:-- 7051k
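
All five downloads in this notebook follow the same pattern: fetch `config['<name>_url']` into `config['<name>_file']`. Where `curl` is not available, that pattern could be sketched with the Python 3 standard library as shown below. This is an assumption-laden alternative, not the method used to produce the outputs above: the function name `download_all` and the `overwrite` parameter are hypothetical, and unlike `curl -o` the sketch skips files that already exist unless told otherwise.

```python
import os
from urllib.request import urlretrieve

# Key prefixes used in this notebook: each has a '<name>_file' and a
# '<name>_url' entry in the [Demo] section of config.ini.
DATASETS = ['gene2accession', 'genome_annotation', 'expression',
            'go_annotation', 'gene_ontology']


def download_all(config, names=DATASETS, overwrite=False):
    """Fetch each dataset's URL into its target file.

    Returns the list of files that were actually downloaded.
    """
    downloaded = []
    for name in names:
        target = config[name + '_file']
        if os.path.exists(target) and not overwrite:
            continue  # keep the existing copy
        urlretrieve(config[name + '_url'], target)
        downloaded.append(target)
    return downloaded
```

Skipping existing files makes re-running the notebook cheaper, at the cost of not refreshing stale copies; pass `overwrite=True` to mimic the behavior of the `curl -o` cells.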


## Copyright and License

Copyright (c) 2016 Florian Wagner.

This work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).