ncbi_search

FILE: ncbi_search.pl
AUTH: Paul Stothard stothard@ualberta.ca
DATE: April 18, 2020
VERS: 1.2

This script uses NCBI's Entrez Programming Utilities (E-utilities) to search NCBI databases. It can return either the complete database records or only the IDs of the matching records.

For additional information on NCBI's Entrez Programming Utilities see: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch

USAGE:
   perl ncbi_search.pl [-arguments]

  -q [STRING]     : Entrez query text (Required).
  -o [FILE]       : output file to create (Required). If the -s option is used,
                    this is the output directory to create.
  -d [STRING]     : name of the NCBI database to search, such as 'nuccore',
                    'protein', or 'gene' (Required).
  -r [STRING]     : the type of information to download. For sequences, 'fasta'
                    is typically specified. The accepted formats depend on the
                    database being queried (Optional).
  -m [INTEGER]    : the maximum number of records to return (Optional; default
                    is to return all matches satisfying the query).
  -s              : request each record separately and save as a separate file.
                    This option is only supported for -r values of 'gb'
                    and 'gbwithparts' (Optional).
  -v              : provide progress messages (Optional).
  -h              : show this message (Optional).
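
Because -s is only supported for -r values of 'gb' and 'gbwithparts', a wrapper script may want to check the format before adding -s. The following is a minimal sketch of such a check; the function name supports_separate_files is hypothetical and is not part of ncbi_search.pl.

```shell
# Succeed only for the -r values that the usage text lists as compatible with -s.
# (Hypothetical helper for wrapper scripts; not part of ncbi_search.pl.)
supports_separate_files() {
  case "$1" in
    gb|gbwithparts) return 0 ;;
    *) return 1 ;;
  esac
}

format=fasta
if supports_separate_files "$format"; then
  echo "-s can be combined with -r $format"
else
  echo "-s is not supported with -r $format; omit -s"
fi
```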

Example usage:

Download a sequence in GenBank format (with the full sequence included), using an accession number:

perl ncbi_search.pl -q 'NC_045512[Accession]' \
-o NC_045512.gbk \
-d nuccore \
-r gbwithparts \
-v

Download the protein sequences encoded by a genome, using the genome's accession number:

perl ncbi_search.pl -q 'NC_012920.1[Accession]' \
-o NC_012920.1.faa \
-d nuccore \
-r fasta_cds_aa \
-v

Download multiple genomes using an accession number range, and save each genome to a file named after its accession number:

perl ncbi_search.pl -q 'NC_009925:NC_009934[Accession]' \
-o outdir1 \
-d nuccore \
-r gbwithparts \
-s \
-v

Download five coronavirus genomes from the RefSeq collection, and save each genome to a separate file:

perl ncbi_search.pl -q 'coronavirus[Organism] AND nucleotide genome[Filter] AND refseq[Filter]' \
-o outdir2 \
-d nuccore \
-r gbwithparts \
-m 5 \
-s \
-v

Download five abstracts from PubMed using an author name:

perl ncbi_search.pl -q 'Stothard P[Author]' \
-o abstracts.txt \
-d pubmed \
-r abstract \
-m 5 \
-v

Download information on the genes located in a genome region of interest:

perl ncbi_search.pl -q 'homo sapiens[Organism] AND 17[Chromosome] AND 7614064:7833711[Base position] AND GRCh38.p13[Assembly name]' \
-o gene_list.txt \
-d gene \
-r gene_table \
-v

Download information about a gene of interest:

perl ncbi_search.pl -q 'homo sapiens[Organism] AND PRNP[Gene name]' \
-o gene_info.txt \
-d gene \
-v

Download information about health-affecting variants for a genome region of interest:

perl ncbi_search.pl -q '17[Chromosome] AND 7614064:7620000[Base Position]' \
-o clinvar_info.xml \
-d clinvar \
-r clinvarset \
-v

Download a sequence record for each accession number in a file of accession numbers:

#preparing sample file of accession numbers
echo $'NP_776246.1\nNP_001073369.1\nXP_006724594.1\nNP_995328.2\nNP_115906.3\n' \
> accessions.txt

#performing search for each accession using xargs
cat accessions.txt | xargs -t -I{} \
perl ncbi_search.pl -q '{}[Accession]' \
-o {}.fasta \
-d protein \
-r fasta \
-v
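
As an alternative to xargs, the commands can be previewed before anything is sent to NCBI. The sketch below only prints one command per accession and skips empty lines; the function name print_search_commands and the temporary sample file are assumptions made for illustration.

```shell
# Print one ncbi_search.pl command per accession, skipping empty lines.
# This is a dry run: nothing is executed and no request is sent.
print_search_commands() {
  while IFS= read -r acc || [ -n "$acc" ]; do
    [ -n "$acc" ] || continue
    printf "perl ncbi_search.pl -q '%s[Accession]' -o %s.fasta -d protein -r fasta -v\n" "$acc" "$acc"
  done < "$1"
}

# Small sample list with an empty line in the middle.
sample_list=$(mktemp)
printf 'NP_776246.1\n\nNP_001073369.1\n' > "$sample_list"
print_search_commands "$sample_list"
```

Once the printed commands look right, the output can be piped to sh to run them.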

Download sequences in fasta format and then save each sequence as a separate file:

#download fasta file containing multiple sequences
perl ncbi_search.pl -q 'coronavirus[Organism] AND nucleotide genome[Filter] AND refseq[Filter]' \
-o sequences.fasta \
-d nuccore \
-r fasta \
-m 5 \
-v

#create separate file for each sequence
outputdir=output_directory/
mkdir -p "$outputdir"
awk '/^>/ {OUT=substr($0,2); split(OUT, a, " "); gsub(/[^A-Za-z_0-9\.\-]/, "", a[1]); OUT = "'"$outputdir"'" a[1] ".fa"}; OUT {print >>OUT; close(OUT)}' \
sequences.fasta
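
The splitting step can be tried without a download by running it on a small local file. The sketch below is a variant of the awk command that passes the output directory with awk's -v option rather than splicing the shell variable into the program text. The record names seqA.1 and seqB.1 and the temporary directory are made-up sample values.

```shell
# Build a tiny two-record FASTA file so the split can be tested offline.
workdir=$(mktemp -d)
printf '>seqA.1 first sample record\nACGT\nTTGA\n>seqB.1 second sample record\nGGCC\n' \
  > "$workdir/sample.fasta"

mkdir -p "$workdir/split"
awk -v outdir="$workdir/split/" '
  /^>/ {
    id = substr($0, 2)                  # header without the leading ">"
    sub(/[ \t].*/, "", id)              # keep only the first word
    gsub(/[^A-Za-z_0-9.-]/, "", id)     # drop characters that are unsafe in file names
    out = outdir id ".fa"
  }
  out { print >> out; close(out) }      # append each line to the current record file
' "$workdir/sample.fasta"

ls "$workdir/split"
```

Closing the file after each line keeps the number of open files low when the input contains many records.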

Supported -d option values:

annotinfo
assembly
bioproject
biosample
biosystems
blastdbinfo
books
cdd
clinvar
dbvar
gap
gapplus
gds
gene
genome
geoprofiles
grasp
homologene
ipg
medgen
mesh
ncbisearch
nlmcatalog
nuccore
nucleotide
omim
orgtrack
pcassay
pccompound
pcsubstance
pmc
popset
probe
protein
proteinclusters
pubmed
seqannot
snp
sparcle
sra
structure
taxonomy

Supported -r option values:

The supported -r option values are grouped by database type (i.e. -d option value) below. The name of each format is followed by the corresponding -r option value in parentheses. A value of null indicates that the -r option should be omitted in order to obtain that output format.
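
When commands are generated by a script, the null convention can be handled by leaving -r out of the command line entirely. The following is a minimal sketch; the helper name r_option is hypothetical, and the commands are only printed, not run.

```shell
# Print the " -r VALUE" part of a command line for a given format value.
# Entries listed as "null" mean that -r is omitted, so nothing is printed for them.
r_option() {
  case "$1" in
    null|'') ;;
    *) printf ' -r %s' "$1" ;;
  esac
}

# text ASN.1 from the gene database: -r is left out
echo "perl ncbi_search.pl -q 'homo sapiens[Organism] AND PRNP[Gene name]' -o gene_info.txt -d gene$(r_option null) -v"
# gene table from the gene database: -r gene_table is added
echo "perl ncbi_search.pl -q 'homo sapiens[Organism] AND PRNP[Gene name]' -o gene_info.txt -d gene$(r_option gene_table) -v"
```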

All Databases

Document summary (docsum)
List of UIDs in plain text (uilist)

d = bioproject

Full record XML (xml)

d = biosample

Full record text (full)

d = biosystems

Full record XML (xml)

d = gds

Summary (summary)

d = gene

text ASN.1 (null)
Gene table (gene_table)

d = homologene

text ASN.1 (null)
Alignment scores (alignmentscores)
FASTA (fasta)
HomoloGene (homologene)

d = mesh

Full record (full)

d = nlmcatalog

Full record (null)

d = nuccore, nucest, nucgss, protein or popset

text ASN.1 (null)
Full record in XML (native)
Accession number(s) (acc)
FASTA (fasta)
SeqID string (seqid)

Additional options for d = nuccore, nucest, nucgss or popset

GenBank flat file (gb)
INSDSeq XML (gbc)

Additional option for d = nuccore and protein

Feature table (ft)

Additional options for d = nuccore

GenBank flat file with full sequence (gbwithparts)
CDS nucleotide FASTA (fasta_cds_na)
CDS protein FASTA (fasta_cds_aa)

Additional option for d = nucest

EST report (est)

Additional option for d = nucgss

GSS report (gss)

Additional options for d = protein

GenPept flat file (gp)
INSDSeq XML (gpc)
Identical Protein XML (ipg)

d = pmc

XML (null)
MEDLINE (medline)

d = pubmed

text ASN.1 (null)
MEDLINE (medline)
PMID list (uilist)
Abstract (abstract)

d = sequences

text ASN.1 (null)
Accession number(s) (acc)
FASTA (fasta)
SeqID string (seqid)

d = snp

text ASN.1 (null)
Flat file (flt)
FASTA (fasta)
RS Cluster report (rsr)
SS Exemplar list (ssexemplar)
Chromosome report (chr)
Summary (docset)
UID list (uilist)

d = sra

XML (full)

d = taxonomy

XML (null)
TaxID list (uilist)

d = clinvar

ClinVar Set (clinvarset)
UID list (uilist)

d = gtr

GTR Test Report (gtracc)
