HOWTO:_Getting_Genomic_Sequences

Naohisa Goto edited this page Jun 14, 2016 · 1 revision

Get all NCBI ESTs for a given taxon

Developped a quick script following this biostar question:

#!~/bin/ruby
require 'bio'
Bio::NCBI.default_email = "your.email@bioruby.org" # required by NCBI

ncbi = Bio::NCBI::REST.new

hymenopteranSequenceIDs = ncbi.esearch("Hymenoptera[organism]",
                                       { "db"=>"protein", "rettype"=>"gb", "retmax"=> 10000000})


outputFile = 'genbankHymenopteranProts.fasta'
raise IOError, 'File exists!!' + outputFile if File.exists?(outputFile)

# get sequences, 100 at a time to show progression for many sequences. Writing to file in realtime in case the connection dies.
while (hymenopteranSequenceIDs.length > 1)
    puts hymenopteranSequenceIDs.length.to_s + ' sequences left to download'
    sequenceIdSlice = hymenopteranSequenceIDs.slice!(1..100)
    sequences       = ncbi.efetch(ids  = sequenceIdSlice,
                               hash ={"db"=>"protein", "rettype"=>"fasta", "retmax"=> 10000000})
    sequences.gsub!("\n\n", "\n")  # ncbi returns a single big string with records separated by two newlines
    File.open(outputFile, 'a') {|f| f.write(sequences + "\n") }
end