# Data practical using the Sequence Read Archive

The goal of this practical is to download the data associated with this paper:


Greiff V, Menzel U, Haessler U, Cook SC, Friedensohn S, Khan TA, Pogson M, Hellmann I, Reddy ST. (2014) Quantitative assessment of the robustness of next-generation sequencing of antibody variable gene repertoires from immunized mice. *BMC Immunol.* **15**:40.

You will do this in two ways:
  - Using a web browser
  - Using the EDirect utilities

Use this notebook to conduct your searches using EDirect, and feel free to add any other notes that help you.

## Finding accessions using a web browser

- Go to [PubMed](https://www.ncbi.nlm.nih.gov/pubmed)
- Search for the publication
- In the 'Related Information' panel on the right-hand side, click SRA.

## Finding accessions using EDirect

You will need to piece together a series of commands joined by pipes (`|`) to obtain a list of accessions that can be passed to `fastq-dump`.

First search PubMed for the paper using `esearch` with the database set to `pubmed` and an appropriate query.

In [1]:
%%bash
esearch -db pubmed -query "Greiff Reddy Quantitative assessment"

<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>NCID_1_12863221_130.14.18.34_9001_1477921695_380838357_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>
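
Note that `esearch` does not print the matching records themselves. It prints a small XML message that points at the results held on the NCBI history server (`WebEnv` and `QueryKey`) and reports how many records matched (`Count`). As a minimal sketch, the message above can be inspected in Python with the standard library; the function name `summarize_entrez_direct` is hypothetical and not part of EDirect.

```python
import xml.etree.ElementTree as ET

# Output of the esearch command above, copied verbatim from this notebook.
ESEARCH_XML = """<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>NCID_1_12863221_130.14.18.34_9001_1477921695_380838357_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>"""


def summarize_entrez_direct(xml_text):
    """Return the database name and hit count from an ENTREZ_DIRECT message.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    return {"db": root.findtext("Db"), "count": int(root.findtext("Count"))}


summary = summarize_entrez_direct(ESEARCH_XML)
print(summary)
```

A `count` of 1 confirms that the query is specific enough to match only the paper of interest before you link it to SRA.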


Now pipe the output of the above to `elink -target sra` to get the linked records.

In [2]:
%%bash
esearch -db pubmed -query "Greiff Reddy Quantitative assessment" | elink -target sra

<ENTREZ_DIRECT>
  <Db>sra</Db>
  <WebEnv>NCID_1_2124901_130.14.22.215_9001_1477921696_661660241_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>3</QueryKey>
  <Count>6</Count>
  <Step>2</Step>
</ENTREZ_DIRECT>


Now fetch the run info using `efetch -format runinfo`.

In [3]:
%%bash
esearch -db pubmed -query "Greiff Reddy Quantitative assessment" | elink -target sra | efetch -format runinfo

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR346596,2013-11-02,2013-10-27,1453261,729537022,1453261,502,394,,https://sra-download.ncbi.nlm.nih.gov/srapub/ERR346596,ERX319436,Replicate-3-M1,AMPLICON,RT-PCR,TRANSCRIPTOMIC,PAIRED,330,30,ILLUMINA,Illumina MiSeq,ERP003950,PRJEB4643,2,224769,ERS351324,SAMEA2240927,simple,10090,Mus musculus,E-MTAB-1896:one mouse,,,,,,,no,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,D4302A4E05DBE0D551D1D13983D3F66F,469B971EC19E8CC94C03DA6FA29D13AB
ERR346597,2013-

Now save this output as a CSV file called `greiff_runinfo.csv` by redirecting it with `>`.

In [4]:
%%bash
esearch -db pubmed -query "Greiff Reddy Quantitative assessment" | elink -target sra | efetch -format runinfo > greiff_runinfo.csv

The following Python code loads the file you just saved into a pandas DataFrame and displays it.

In [5]:
import pandas as pd
runinfo = pd.read_csv("greiff_runinfo.csv")
runinfo

,Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,...,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
0,ERR346596,2013-11-02,2013-10-27,1453261,729537022,1453261,502,394,,https://sra-download.ncbi.nlm.nih.gov/srapub/E...,...,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,D4302A4E05DBE0D551D1D13983D3F66F,469B971EC19E8CC94C03DA6FA29D13AB
1,ERR346597,2013-11-02,2013-10-27,1569925,788102350,1569925,502,440,,https://sra-download.ncbi.nlm.nih.gov/srapub/E...,...,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,7AB10FBDC2FB16E649AA834E338129C9,51CFC57B33552809B262078415EE91B0
2,ERR346598,2013-11-02,2013-10-27,2779764,1395441528,2779764,502,755,,https://sra-download.ncbi.nlm.nih.gov/srapub/E...,...,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,84E337878764A96E1DBC18A179DBB490,A7EBCC3D312725600DF39E78E1B4EB9C
3,ERR346599,2013-11-02,2013-10-27,1466959,736413418,1466959,502,393,,https://sra-download.ncbi.nlm.nih.gov/srapub/E...,...,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,745F3EC51981E20E0BC16B18C8DC0C5F,DC533836C37ED7443A8BF3B39845694E
4,ERR346600,2013-11-02,2013-10-27,1085869,545106238,1085869,502,287,,https://sra-download.ncbi.nlm.nih.gov/srapub/E...,...,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,4B3E3EB334E81D07D652F6F63AB605EA,EDCA48A77FCBC0958B17AAE1870B65CB
5,ERR346601,2013-11-02,2013-10-27,1467650,736760300,1467650,502,393,,https://sra-download.ncbi.nlm.nih.gov/srapub/E...,...,,,,,"Quantitative Genomics Facility, ETH Zurich, Basel",ERA251331,,public,76D93527BA4BBCDBD761A2D850E51343,2ED84386B4555075FEE0E673265944FC
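
Before downloading, it can help to check how much data the runs add up to. The sketch below totals the `spots` and `size_MB` columns using only the standard library. To keep it self-contained it reads a three-column excerpt copied from the table above rather than the full file, and the function name `summarize_runs` is hypothetical.

```python
import csv
import io

# Three columns copied from the run info table above; the real file has many more.
RUNINFO_EXCERPT = """Run,spots,size_MB
ERR346596,1453261,394
ERR346597,1569925,440
ERR346598,2779764,755
ERR346599,1466959,393
ERR346600,1085869,287
ERR346601,1467650,393
"""


def summarize_runs(handle):
    """Count runs and total the spots and size_MB columns of a run info CSV.

    Hypothetical helper for illustration only.
    """
    rows = list(csv.DictReader(handle))
    return {
        "runs": len(rows),
        "total_spots": sum(int(row["spots"]) for row in rows),
        "total_size_mb": sum(int(row["size_MB"]) for row in rows),
    }


totals = summarize_runs(io.StringIO(RUNINFO_EXCERPT))
print(totals)
```

To run the same summary on the real file, pass `open("greiff_runinfo.csv", newline="")` instead of the `io.StringIO` object.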


The following saves the run accessions, one per line, to `greiff_accessions.txt`.

In [6]:
runinfo["Run"].to_csv("greiff_accessions.txt", header=False, index=False)

In [7]:
%%bash
cat greiff_accessions.txt

ERR346596
ERR346597
ERR346598
ERR346599
ERR346600
ERR346601


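You can now pass these accessions to `fastq-dump`. Note that `fastq-dump` takes accessions as command-line arguments rather than reading them from standard input, so use a tool such as `xargs` to supply them.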
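
As a rough sketch of this last step, the Python below turns the accession list into one `fastq-dump` command per run and prints the commands without executing them. The `--split-files` option is an assumption made here because the run info lists the library layout as `PAIRED`; the practical itself does not specify any options. The function name `build_fastq_dump_commands` is hypothetical.

```python
import shlex

# Contents of greiff_accessions.txt as shown above.
ACCESSIONS_TEXT = """ERR346596
ERR346597
ERR346598
ERR346599
ERR346600
ERR346601
"""


def build_fastq_dump_commands(accessions_text, extra_args=("--split-files",)):
    """Build one fastq-dump argument list per non-empty accession line.

    Hypothetical helper; extra_args defaults to an assumed option choice.
    """
    commands = []
    for line in accessions_text.splitlines():
        accession = line.strip()
        if accession:
            commands.append(["fastq-dump", *extra_args, accession])
    return commands


commands = build_fastq_dump_commands(ACCESSIONS_TEXT)
for command in commands:
    # Print a shell-ready version of each command without running it.
    print(shlex.join(command))
```

Once the printed commands look right, each one could be executed with `subprocess.run(command, check=True)`, provided the SRA Toolkit is installed and on your `PATH`.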