# Retrieving data

## Introduction

- Repertoire sequence datasets are large.
- They are generated by a variety of platforms, e.g.
  - SFF (Roche 454);
  - Illumina (Native/SRF);
  - PacBio.

## Sequence Read Archive (SRA)

- The Sequence Read Archive (SRA) stores raw sequence data from next-generation sequencing technologies.
- The SRA is part of the International Nucleotide Sequence Database Collaboration (INSDC), a partnership between the NCBI, the European Bioinformatics Institute (EBI) and the DNA Data Bank of Japan (DDBJ).
  - Data submitted to any of the three organizations are shared among them.
- These databases differ:
  - In the way that metadata are stored.
  - In how data are retrieved.


## BioProject

- Holds metadata on why the analysis was performed
  - Project title and abstract
  - Aims and objectives
  - Organism(s) sequenced
  - Optional: Funding sources, publications, etc.
- This links to an *SRA Study* that provides technical details.

## BioSample

- Holds metadata on what was sequenced.
  - Descriptive sample information
  - Often in tabular format
  - Examples: Organism(s), age(s), gender(s), location data, cell line(s), etc.
- Links to *SRA Sample* that provides technical details.

## SRA Experiment

- A description of a sample-specific sequencing library
- How the sequencing was performed
  - Sequencing methods
  - Kits used
  - Instrument model(s)
- Multiple Experiments can 'point' to a single Sample, but not vice versa
- A separate Experiment is required for
  - Different sequencing or preparation methods
  - Biological or technical replicates

## SRA Run

- Runs link data files to an Experiment
- This represents the raw data that we will need to access
- Stored in a proprietary format that needs to be converted (e.g. to fastq) before analysis
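
The relationships between Sample, Experiment and Run can be summarised in a small data model. The Python sketch below is only an illustration, not an NCBI schema: the class names are hypothetical, the first set of accessions is taken from the runinfo table for SRA058972 shown later, and `SRX_EXAMPLE_2` is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    accession: str  # SRS..., what was sequenced


@dataclass(frozen=True)
class Experiment:
    accession: str  # SRX..., one sequencing library
    sample: Sample  # each Experiment points to exactly one Sample


@dataclass(frozen=True)
class Run:
    accession: str  # SRR..., the raw data files
    experiment: Experiment


sample = Sample("SRS366185")
library_1 = Experiment("SRX190717", sample)
# Placeholder accession: a second library (e.g. a replicate) of the same Sample.
library_2 = Experiment("SRX_EXAMPLE_2", sample)
run = Run("SRR735691", library_1)

# Several Experiments may share a Sample; a Run leads back to its Sample.
print(library_1.sample == library_2.sample)
print(run.experiment.sample.accession)
```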

## Accessing SRA using the web

- Located at [https://www.ncbi.nlm.nih.gov/sra/](https://www.ncbi.nlm.nih.gov/sra/)
- Access any data type stored in the SRA independently of any other data type.
- Access related data from other NCBI resources that are integrated with SRA.
- Retrieve data based on ancillary information and/or sequence comparisons.
- Review the descriptions of studies and experiments (metadata) independently of experimental data.

## Searching NCBI

- A search can start at multiple points
  - Publication in `pubmed`
  - Project in `bioproject`

## Entrez Direct

- While a web browser is convenient for searching for data, browser-based searches are hard to reproduce and hard to share with others
- [Entrez Direct](https://www.ncbi.nlm.nih.gov/books/NBK179288/) (EDirect) provides access to the NCBI's suite of interconnected databases from the command line. It is made up of a series of Perl scripts, including:
  - `esearch` performs a new Entrez search using terms in indexed fields.
  - `elink` looks up neighbors (within a database) or links (between databases).
  - `efilter` filters or restricts the results of a previous query.
  - `efetch` downloads records or reports in a designated format.
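
Because each tool reads the previous tool's output from standard input, a search can be written down as a single pipeline that is easy to rerun or share. A minimal Python sketch, assuming EDirect is installed and on the `PATH`; the helper names `pipeline_command` and `run_pipeline` are hypothetical and not part of EDirect:

```python
import shlex
import subprocess


def pipeline_command(stages):
    """Join EDirect stages (lists of arguments) into one shell pipeline string."""
    return " | ".join(shlex.join(stage) for stage in stages)


def run_pipeline(stages):
    """Run the pipeline and return its standard output as text.

    Needs EDirect on the PATH and network access to the NCBI.
    """
    result = subprocess.run(
        pipeline_command(stages),
        shell=True,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Search PubMed, then fetch the matching records as plain-text abstracts.
stages = [
    ["esearch", "-db", "pubmed", "-query", "Jiang Quake lineage influenza"],
    ["efetch", "-format", "abstract"],
]
print(pipeline_command(stages))
```

Only `pipeline_command` is exercised here; `run_pipeline` contacts the NCBI when called. With a plain shell pipeline, `check=True` only reports a failure of the last stage.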

## Other EDirect utilities

- `einfo` obtains information on indexed fields in an Entrez database.
- `epost` uploads unique identifiers (UIDs) or sequence accession numbers.
- `nquire` sends a URL request to a web page or CGI service.
- `xtract` converts XML into a table of data values.

In [4]:
%%bash
esearch -db bioproject -query "immunoglobulin"

<ENTREZ_DIRECT>
  <Db>bioproject</Db>
  <WebEnv>NCID_1_77295837_130.14.18.34_9001_1476509839_1876082318_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>239</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
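
The XML printed by `esearch` is a summary of the search that is held on the NCBI server, not the records themselves; `Count` gives the number of hits. A small Python sketch that reads the count from output like the above, using only the standard library (the function name `hit_count` is hypothetical):

```python
import xml.etree.ElementTree as ET

# Output of the esearch call above, pasted in as a string.
ESEARCH_OUTPUT = """<ENTREZ_DIRECT>
  <Db>bioproject</Db>
  <WebEnv>NCID_1_77295837_130.14.18.34_9001_1476509839_1876082318_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>239</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>"""


def hit_count(esearch_xml):
    """Return (database, number of hits) from an esearch summary."""
    root = ET.fromstring(esearch_xml)
    return root.findtext("Db"), int(root.findtext("Count"))


print(hit_count(ESEARCH_OUTPUT))
```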


## SRA Toolkit

- The [SRA Toolkit](https://www.ncbi.nlm.nih.gov/books/NBK158900/) allows programmatic access to data in the SRA and conversion of the SRA format into other formats
- Comprises several 'data dump' programs
  - `fastq-dump`: converts to fastq and fasta formats
- When combined with EDirect, this allows one to download data in fastq format, via a list of SRR accessions

## Example

- Finding the data from [Jiang et al. 2013](https://www.ncbi.nlm.nih.gov/pubmed/23390249)

## Searching PubMed

In [2]:
%%bash
esearch -db pubmed -query "Jiang Quake lineage influenza"

<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>NCID_1_904445_130.14.22.215_9001_1477917337_1277602431_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>


## Getting the abstract

In [3]:
%%bash
esearch -db pubmed -query "Jiang Quake lineage influenza" | efetch -format abstract


1. Sci Transl Med. 2013 Feb 6;5(171):171ra19. doi: 10.1126/scitranslmed.3004794.

Lineage structure of the human antibody repertoire in response to influenza
vaccination.

Jiang N(1), He J, Weinstein JA, Penland L, Sasaki S, He XS, Dekker CL, Zheng NY, 
Huang M, Sullivan M, Wilson PC, Greenberg HB, Davis MM, Fisher DS, Quake SR.

Author information: 
(1)Department of Bioengineering, Stanford University School of Medicine,
Stanford, CA 94305, USA.

Erratum in
    Sci Transl Med. 2013 Jul 10;5(193):193er8.

The human antibody repertoire is one of the most important defenses against
infectious disease, and the development of vaccines has enabled the conferral of 
targeted protection to specific pathogens. However, there are many challenges to 
measuring and analyzing the immunoglobulin sequence repertoire, including that
each B cell's genome encodes a distinct antibody sequence, that the antibody
repertoire changes over time, and the high similarity between antibody sequences.
We have ad

## Searching SRA

In [8]:
%%bash
esearch -db sra -query SRA058972

<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>NCID_1_1433884_130.14.22.215_9001_1477919254_952558587_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>


In [9]:
%%bash
esearch -db sra -query SRA058972 | efetch -format runinfo

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
SRR735691,2013-02-16,2013-02-16,14890,3305580,0,222,1,,https://sra-download.ncbi.nlm.nih.gov/srapub/SRR735691,SRX190717,,RNA-Seq,PCR,TRANSCRIPTOMIC,SINGLE,0,0,LS454,454 GS FLX Titanium,SRP015957,PRJNA176314,,176314,SRS366185,SAMN01737268,simple,9606,Homo sapiens,6,,,,,,,no,,,,,STANFORD UNIVERSITY,SRA058972,,public,EB5AD9CFD23FD4DDC17733913E95D003,42BEB11212FB634A454A1050BA36EDB9
SRR747232,2013-02-19,2013-02-19,790903,270580865,0,342,145,,https://sra-download.ncbi.nlm.nih.gov/
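
Each row of the runinfo table describes one Run and carries the accessions of the records it belongs to. As a rough Python sketch, the first record above can be read into a dictionary with the standard `csv` module. The header and row are copied from the output but cut off after the `BioProject` column to keep the example short, and `RUNINFO` is just an illustrative variable name.

```python
import csv
import io

# First 22 columns of the runinfo header and of the first row shown above.
RUNINFO = (
    "Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,"
    "AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,"
    "LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,"
    "Platform,Model,SRAStudy,BioProject\n"
    "SRR735691,2013-02-16,2013-02-16,14890,3305580,0,222,1,,"
    "https://sra-download.ncbi.nlm.nih.gov/srapub/SRR735691,SRX190717,,"
    "RNA-Seq,PCR,TRANSCRIPTOMIC,SINGLE,0,0,LS454,454 GS FLX Titanium,"
    "SRP015957,PRJNA176314\n"
)

# DictReader uses the header line as keys for each following row.
record = next(csv.DictReader(io.StringIO(RUNINFO)))

# Accessions linking the Run to its Experiment, Study and BioProject.
for field in ("Run", "Experiment", "SRAStudy", "BioProject"):
    print(field, record[field])

# The layout shows whether the reads are single or paired.
print(record["LibraryLayout"], record["Platform"], record["Model"])
```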

## Munging the runinfo table

- Once we have the runinfo table, the first column (`Run`) can be extracted
  - Python code for this is given in the exercise
  - Some bash code is given below

In [11]:
%%bash
esearch -db sra -query SRA058972 | efetch -format runinfo | cut -d ',' -f 1 | grep SRR

SRR735691
SRR747232
SRR747256
SRR747758
SRR747759
SRR747760
SRR747761
SRR747763
SRR747764
SRR747766
SRR747767
SRR747768
SRR747785
SRR765688
SRR770500


## Downloading fastq files

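- We can pass a list of accessions to `fastq-dump` via `xargs` in order to download the data; here `accessions.txt` is assumed to hold the SRR accessions extracted above, one per line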

```
cat accessions.txt | xargs fastq-dump --outdir SRA058972
```

- If we had paired-end reads (which we don't here), we would need two more options to `fastq-dump`:


```
cat accessions.txt | xargs fastq-dump --split-files --readids --outdir SRA058972
```
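
To keep the download step reproducible from Python, the `fastq-dump` calls can be built from the list of accessions. This is a hedged sketch: it assumes the SRA Toolkit is installed and on the `PATH`, `fastq_dump_command` and `download` are hypothetical helpers, and only the options already used above are passed.

```python
import subprocess


def fastq_dump_command(accession, outdir, paired=False):
    """Build the fastq-dump argument list for one run accession."""
    command = ["fastq-dump"]
    if paired:
        # The two extra options used above for paired-end data.
        command += ["--split-files", "--readids"]
    return command + ["--outdir", outdir, accession]


def download(accessions, outdir, paired=False):
    """Run fastq-dump once per accession (needs the toolkit and network access)."""
    for accession in accessions:
        subprocess.run(fastq_dump_command(accession, outdir, paired), check=True)


accessions = ["SRR735691", "SRR747232"]  # first two runs of SRA058972
for accession in accessions:
    print(" ".join(fastq_dump_command(accession, "SRA058972")))
```

Unlike the `xargs` version, which may hand several accessions to one `fastq-dump` call, this sketch issues one call per accession, so a failed download can be traced to a single run. Nothing is downloaded until `download` is called.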