---
layout: post  
title:  Downloading RefSeq Genomes  
date:   2019-07-16  

---

To download all of the reference genomes in the RefSeq database, we first need the assembly summary metadata file, available [here](ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt).

In [1]:
if !isfile( "../_data/assembly_summary_refseq.txt")
    download("ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_refseq.txt", "../_data/assembly_summary_refseq.txt")
end
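
Judging from the `header=2` argument used when the file is read and the `# assembly_accession` column name in the output, the file appears to begin with a comment line followed by a `#`-prefixed, tab-separated header line. Here is a minimal sketch of that layout using only Base Julia. The `sample` string is a tiny stand-in written for illustration (its comment line is paraphrased, and only three of the columns are included); it is not the real file.

```julia
# Tiny stand-in for the first lines of the summary file (illustrative only).
sample = join([
    "#   See the README for a description of the columns in this file.",
    "# assembly_accession\tbioproject\ttaxid",
    "GCF_000001215.4\tPRJNA164\t7227",
], "\n")

lines = split(sample, '\n')

# Line 1 is a comment, so the column names live on line 2.
header = split(lines[2], '\t')

# Strip the leading "# " that the first column name carries.
clean_header = [replace(name, r"^#\s*" => "") for name in header]

# Everything after the header is tab-separated data.
rows = [split(line, '\t') for line in lines[3:end]]

println(clean_header)
println(rows[1])
```

The notebook itself keeps the `#` in the first column name, which is why the accession column is labelled `# assembly_accession` in the output tables.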

In [2]:
using uCSV
using DataFrames

In [3]:
refseq_metadata = DataFrame(uCSV.read("../_data/assembly_summary_refseq.txt", delim="\t", header=2))

Row,# assembly_accession,bioproject,biosample,wgs_master,refseq_category,taxid,species_taxid,organism_name,infraspecific_name,isolate,version_status,assembly_level,release_type,genome_rep,seq_rel_date,asm_name,submitter,gbrs_paired_asm,paired_asm_comp,ftp_path,excluded_from_refseq,relation_to_type_material
,String,String,String,String,String,Int64,Int64,String,String,String,String,String,String,String,String,String,String,String,String,String,String,String
1,GCF_000001215.4,PRJNA164,SAMN02803731,,reference genome,7227,7227,Drosophila melanogaster,,,latest,Chromosome,Major,Full,2014/08/01,Release 6 plus ISO1 MT,The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics,GCA_000001215.4,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT,,
2,GCF_000001405.39,PRJNA168,,,reference genome,9606,9606,Homo sapiens,,,latest,Chromosome,Patch,Full,2019/02/28,GRCh38.p13,Genome Reference Consortium,GCA_000001405.28,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13,,
3,GCF_000001635.26,PRJNA169,,,reference genome,10090,10090,Mus musculus,,,latest,Chromosome,Patch,Full,2017/09/15,GRCm38.p6,Genome Reference Consortium,GCA_000001635.8,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6,,
4,GCF_000001735.4,PRJNA116,SAMN03081427,,reference genome,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,latest,Chromosome,Minor,Full,2018/03/15,TAIR10.1,The Arabidopsis Information Resource (TAIR),GCA_000001735.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1,,
5,GCF_000001765.3,PRJNA18793,SAMN00779672,AADE00000000.1,representative genome,46245,7237,Drosophila pseudoobscura pseudoobscura,strain=MV2-25,,latest,Chromosome,Major,Full,2013/04/11,Dpse_3.0,Baylor College of Medicine,GCA_000001765.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/765/GCF_000001765.3_Dpse_3.0,,
6,GCF_000001895.5,PRJNA12455,SAMN02808228,AABR00000000.7,representative genome,10116,10116,Rattus norvegicus,strain=mixed,,latest,Chromosome,Major,Full,2014/07/01,Rnor_6.0,Rat Genome Sequencing Consortium,GCA_000001895.4,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/895/GCF_000001895.5_Rnor_6.0,,
7,GCF_000001905.1,PRJNA70973,SAMN02953622,AAGU00000000.3,representative genome,9785,9785,Loxodonta africana,,ISIS603380,latest,Scaffold,Major,Full,2009/07/15,Loxafr3.0,Broad Institute,GCA_000001905.1,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/905/GCF_000001905.1_Loxafr3.0,,
8,GCF_000001985.1,PRJNA32665,SAMN02953685,ABAR00000000.1,representative genome,441960,37727,Talaromyces marneffei ATCC 18224,strain=ATCC 18224,,latest,Scaffold,Major,Full,2008/10/29,JCVI-PMFA1-2.0,J. Craig Venter Institute,GCA_000001985.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/985/GCF_000001985.1_JCVI-PMFA1-2.0,,assembly from type material
9,GCF_000002035.6,PRJNA13922,SAMN06930106,,reference genome,7955,7955,Danio rerio,,,latest,Chromosome,Major,Full,2017/05/09,GRCz11,Genome Reference Consortium,GCA_000002035.4,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.6_GRCz11,,
10,GCF_000002075.1,PRJNA209509,SAMN02953658,AASC00000000.3,representative genome,6500,6500,Aplysia californica,,F4 #8,latest,Scaffold,Major,Full,2013/05/15,AplCal3.0,Broad Institute,GCA_000002075.2,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/075/GCF_000002075.1_AplCal3.0,,


RefSeq contains many genomes, but to start we will only get the "highest quality" ones, which I am defining as genomes that are the latest version, have full genome representation, and are designated a reference genome.
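
The three conditions are combined element-wise into a single Boolean mask, which then selects the rows to keep. Here is a small sketch of how such a mask behaves, using plain vectors with made-up toy values rather than the real metadata:

```julia
# Toy columns for three hypothetical assemblies (values are illustrative only).
version_status  = ["latest", "latest", "replaced"]
genome_rep      = ["Full", "Full", "Full"]
refseq_category = ["reference genome", "representative genome", "reference genome"]

# `.==` compares each element to the string; `.&` combines the results element-wise.
mask = (version_status .== "latest") .&
       (genome_rep .== "Full") .&
       (refseq_category .== "reference genome")

println(mask)         # one Bool per row
println(count(mask))  # number of rows that pass all three conditions
```

Only the first toy row satisfies all three conditions, so it is the only one the mask keeps.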

In [4]:
refseq_metadata_subset = 
    refseq_metadata[
        (refseq_metadata.version_status .== "latest") .&
        (refseq_metadata.genome_rep .== "Full") .&
        (refseq_metadata.refseq_category .== "reference genome")
        , :]

Row,# assembly_accession,bioproject,biosample,wgs_master,refseq_category,taxid,species_taxid,organism_name,infraspecific_name,isolate,version_status,assembly_level,release_type,genome_rep,seq_rel_date,asm_name,submitter,gbrs_paired_asm,paired_asm_comp,ftp_path,excluded_from_refseq,relation_to_type_material
,String,String,String,String,String,Int64,Int64,String,String,String,String,String,String,String,String,String,String,String,String,String,String,String
1,GCF_000001215.4,PRJNA164,SAMN02803731,,reference genome,7227,7227,Drosophila melanogaster,,,latest,Chromosome,Major,Full,2014/08/01,Release 6 plus ISO1 MT,The FlyBase Consortium/Berkeley Drosophila Genome Project/Celera Genomics,GCA_000001215.4,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/215/GCF_000001215.4_Release_6_plus_ISO1_MT,,
2,GCF_000001405.39,PRJNA168,,,reference genome,9606,9606,Homo sapiens,,,latest,Chromosome,Patch,Full,2019/02/28,GRCh38.p13,Genome Reference Consortium,GCA_000001405.28,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13,,
3,GCF_000001635.26,PRJNA169,,,reference genome,10090,10090,Mus musculus,,,latest,Chromosome,Patch,Full,2017/09/15,GRCm38.p6,Genome Reference Consortium,GCA_000001635.8,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.26_GRCm38.p6,,
4,GCF_000001735.4,PRJNA116,SAMN03081427,,reference genome,3702,3702,Arabidopsis thaliana,ecotype=Columbia,,latest,Chromosome,Minor,Full,2018/03/15,TAIR10.1,The Arabidopsis Information Resource (TAIR),GCA_000001735.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/735/GCF_000001735.4_TAIR10.1,,
5,GCF_000002035.6,PRJNA13922,SAMN06930106,,reference genome,7955,7955,Danio rerio,,,latest,Chromosome,Major,Full,2017/05/09,GRCz11,Genome Reference Consortium,GCA_000002035.4,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/035/GCF_000002035.6_GRCz11,,
6,GCF_000002985.6,PRJNA158,SAMEA3138177,,reference genome,6239,6239,Caenorhabditis elegans,strain=Bristol N2,,latest,Complete Genome,Major,Full,2013/02/07,WBcel235,C. elegans Sequencing Consortium,GCA_000002985.3,different,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235,,
7,GCF_000005845.2,PRJNA57779,SAMN02604091,,reference genome,511145,562,Escherichia coli str. K-12 substr. MG1655,strain=K-12 substr. MG1655,,latest,Complete Genome,Major,Full,2013/09/26,ASM584v2,Univ. Wisconsin,GCA_000005845.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2,,
8,GCF_000006745.1,PRJNA57623,SAMN02603969,,reference genome,243277,666,Vibrio cholerae O1 biovar El Tor str. N16961,strain=N16961,,latest,Complete Genome,Major,Full,2001/01/09,ASM674v1,TIGR,GCA_000006745.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/745/GCF_000006745.1_ASM674v1,,
9,GCF_000006765.1,PRJNA57945,SAMN02603714,,reference genome,208964,287,Pseudomonas aeruginosa PAO1,strain=PAO1,,latest,Complete Genome,Major,Full,2006/07/07,ASM676v1,PathoGenesis Corporation,GCA_000006765.1,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/765/GCF_000006765.1_ASM676v1,,
10,GCF_000006785.2,PRJNA57845,SAMN02604089,,reference genome,160490,1314,Streptococcus pyogenes M1 GAS,strain=SF370,,latest,Complete Genome,Major,Full,2014/04/01,ASM678v2,Univ. Oklahoma,GCA_000006785.2,identical,ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/785/GCF_000006785.2_ASM678v2,,


Now we can download all of these genomes.
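
Each download URL is the assembly's `ftp_path` joined with a file name derived from the last component of that path. Here is a minimal sketch of that construction, using the E. coli `ftp_path` value from the table above. `genome_url` is a hypothetical helper name introduced only for illustration.

```julia
# Hypothetical helper: build the genomic FASTA URL from an assembly's ftp_path.
function genome_url(ftp_path::AbstractString)
    # The last path component is "<accession>_<assembly name>".
    genome_file = basename(ftp_path) * "_genomic.fna.gz"
    return ftp_path * "/" * genome_file
end

# ftp_path copied from the Escherichia coli K-12 MG1655 row.
ftp = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"

println(basename(ftp) * "_genomic.fna.gz")
println(genome_url(ftp))
```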

In [5]:
# Create the output directory on the first run
if !isdir("../_data/refseq_reference_genomes")
    mkdir("../_data/refseq_reference_genomes")
end
for ftp in refseq_metadata_subset.ftp_path
    # The FASTA file is named after the last component of the FTP path
    genome_file = basename(ftp) * "_genomic.fna.gz"
    genome_file_ftp_link = ftp * "/" * genome_file
    local_file = "../_data/refseq_reference_genomes/$genome_file"
    # Skip files that are already present so the cell can be re-run safely
    if !isfile(local_file)
        download(genome_file_ftp_link, local_file)
    else
        println("$genome_file has already been downloaded, skipping...")
    end
end

GCF_000001215.4_Release_6_plus_ISO1_MT_genomic.fna.gz has already been downloaded, skipping...
GCF_000001405.39_GRCh38.p13_genomic.fna.gz has already been downloaded, skipping...
GCF_000001635.26_GRCm38.p6_genomic.fna.gz has already been downloaded, skipping...
GCF_000001735.4_TAIR10.1_genomic.fna.gz has already been downloaded, skipping...
GCF_000002035.6_GRCz11_genomic.fna.gz has already been downloaded, skipping...
GCF_000002985.6_WBcel235_genomic.fna.gz has already been downloaded, skipping...
GCF_000005845.2_ASM584v2_genomic.fna.gz has already been downloaded, skipping...
GCF_000006745.1_ASM674v1_genomic.fna.gz has already been downloaded, skipping...
GCF_000006765.1_ASM676v1_genomic.fna.gz has already been downloaded, skipping...
GCF_000006785.2_ASM678v2_genomic.fna.gz has already been downloaded, skipping...
GCF_000006845.1_ASM684v1_genomic.fna.gz has already been downloaded, skipping...
GCF_000006865.1_ASM686v1_genomic.fna.gz has already been downloaded, skipping...
GCF_0000069