FILE: ncbi_search.pl
AUTH: Paul Stothard stothard@ualberta.ca
DATE: April 18, 2020
VERS: 1.2
This script uses NCBI's Entrez Programming Utilities (E-utilities) to search NCBI databases. It can return either the complete database records or only the IDs of the matching records.
For additional information on NCBI's Entrez Programming Utilities see: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
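To give a rough idea of what an E-utilities search request looks like, the sketch below builds an ESearch URL from a database name and a query. It is an illustration only and is not taken from ncbi_search.pl: the helper names urlencode and esearch_url are hypothetical, and the script itself may construct its requests differently. The endpoint and the db, term and retmax parameters are those described in the NCBI documentation linked above. No network request is made.

```shell
# Hypothetical helpers (not part of ncbi_search.pl).
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Percent-encode a string for use as a URL query value.
# Assumption: the input is plain ASCII, as in the example queries.
urlencode() (
  LC_ALL=C
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}        # everything after the first character
    c=${s%"$rest"}     # the first character
    s=$rest
    case $c in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s' "$out"
)

# Print an ESearch URL for a database, a query and a maximum record count.
esearch_url() {
  printf '%s/esearch.fcgi?db=%s&term=%s&retmax=%s\n' \
    "$EUTILS_BASE" "$1" "$(urlencode "$2")" "${3:-20}"
}

esearch_url nuccore 'NC_045512[Accession]' 1
```

The printed URL can be opened in a browser or passed to a download tool to inspect the raw ESearch response.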
USAGE:
   perl ncbi_search.pl [-arguments]
    -q [STRING]  : Entrez query text (Required).
    -o [FILE]    : output file to create (Required). If the -s option is used,
                   this is the output directory to create.
    -d [STRING]  : name of the NCBI database to search, such as 'nuccore',
                   'protein', or 'gene' (Required).
    -r [STRING]  : the type of information to download. For sequences, 'fasta'
                   is typically specified. The accepted formats depend on the
                   database being queried (Optional).
    -m [INTEGER] : the maximum number of records to return (Optional; default
                   is to return all matches satisfying the query).
    -s           : request each record separately and save as a separate file.
                   This option is only supported for -r values of 'gb'
                   and 'gbwithparts' (Optional).
    -v           : provide progress messages (Optional).
    -h           : show this message (Optional).
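Because -s is only supported for -r values of 'gb' and 'gbwithparts', it can be useful to check the format before starting a long batch of downloads. The following is a minimal sketch of such a check; separate_ok is a hypothetical helper written for this example and is not part of ncbi_search.pl.

```shell
# Hypothetical pre-flight check (not part of ncbi_search.pl).
# Returns 0 when the given -r value may be combined with -s, 1 otherwise.
separate_ok() {
  case $1 in
    gb|gbwithparts) return 0 ;;
    *) return 1 ;;
  esac
}

for fmt in gb gbwithparts fasta; do
  if separate_ok "$fmt"; then
    echo "$fmt: -s can be used"
  else
    echo "$fmt: -s is not supported, omit -s"
  fi
done
```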
Download a sequence in GenBank format (with the full sequence included), using an accession number:
perl ncbi_search.pl -q 'NC_045512[Accession]' \
-o NC_045512.gbk \
-d nuccore \
-r gbwithparts \
-v
Download the protein sequences encoded by a genome, using the genome's accession number:
perl ncbi_search.pl -q 'NC_012920.1[Accession]' \
-o NC_012920.1.faa \
-d nuccore \
-r fasta_cds_aa \
-v
Download multiple genomes using an accession number range, and save each genome to a file named after its accession number:
perl ncbi_search.pl -q 'NC_009925:NC_009934[Accession]' \
-o outdir1 \
-d nuccore \
-r gbwithparts \
-s \
-v
Download five coronavirus genomes from the RefSeq collection, and save each genome to a separate file:
perl ncbi_search.pl -q 'coronavirus[Organism] AND nucleotide genome[Filter] AND refseq[Filter]' \
-o outdir2 \
-d nuccore \
-r gbwithparts \
-m 5 \
-s \
-v
Download five abstracts from PubMed using an author name:
perl ncbi_search.pl -q 'Stothard P[Author]' \
-o abstracts.txt \
-d pubmed \
-r abstract \
-m 5 \
-v
Download information on the genes located in a genome region of interest:
perl ncbi_search.pl -q 'homo sapiens[Organism] AND 17[Chromosome] AND 7614064:7833711[Base position] AND GRCh38.p13[Assembly name]' \
-o gene_list.txt \
-d gene \
-r gene_table \
-v
Download information about a gene of interest:
perl ncbi_search.pl -q 'homo sapiens[Organism] AND PRNP[Gene name]' \
-o gene_info.txt \
-d gene \
-v
Download information about health-affecting variants for a genome region of interest:
perl ncbi_search.pl -q '17[Chromosome] AND 7614064:7620000[Base Position]' \
-o clinvar_info.xml \
-d clinvar \
-r clinvarset \
-v
Download a sequence record for each accession number in a file of accession numbers:
#preparing sample file of accession numbers
echo $'NP_776246.1\nNP_001073369.1\nXP_006724594.1\nNP_995328.2\nNP_115906.3\n' \
> accessions.txt
#performing search for each accession using xargs
cat accessions.txt | xargs -t -I{} \
perl ncbi_search.pl -q '{}[Accession]' \
-o {}.fasta \
-d protein \
-r fasta \
-v
Download sequences in fasta format and then save each sequence as a separate file:
#download fasta file containing multiple sequences
perl ncbi_search.pl -q 'coronavirus[Organism] AND nucleotide genome[Filter] AND refseq[Filter]' \
-o sequences.fasta \
-d nuccore \
-r fasta \
-m 5 \
-v
#create separate file for each sequence
outputdir=output_directory/
mkdir -p "$outputdir"
awk '/^>/ {OUT=substr($0,2); split(OUT, a, " "); gsub(/[^A-Za-z_0-9\.\-]/, "", a[1]); OUT = "'"$outputdir"'" a[1] ".fa"}; OUT {print >>OUT; close(OUT)}' \
sequences.fasta
The supported -d option values are:
- annotinfo
- assembly
- bioproject
- biosample
- biosystems
- blastdbinfo
- books
- cdd
- clinvar
- dbvar
- gap
- gapplus
- gds
- gene
- genome
- geoprofiles
- grasp
- homologene
- ipg
- medgen
- mesh
- ncbisearch
- nlmcatalog
- nuccore
- nucleotide
- omim
- orgtrack
- pcassay
- pccompound
- pcsubstance
- pmc
- popset
- probe
- protein
- proteinclusters
- pubmed
- seqannot
- snp
- sparcle
- sra
- structure
- taxonomy
The supported -r option values are grouped by database type (i.e. -d option value) below. The name of each format is followed by the corresponding -r option value in parentheses. A value of null indicates that the -r option should be omitted in order to obtain that output format.
- Document summary (docsum)
- List of UIDs in plain text (uilist)
- Full record XML (xml)
- Full record text (full)
- Full record XML (xml)
- Summary (summary)
- text ASN.1 (null)
- Gene table (gene_table)
- text ASN.1 (null)
- Alignment scores (alignmentscores)
- FASTA (fasta)
- HomoloGene (homologene)
- Full record (full)
- Full record (null)
- text ASN.1 (null)
- Full record in XML (native)
- Accession number(s) (acc)
- FASTA (fasta)
- SeqID string (seqid)
- GenBank flat file (gb)
- INSDSeq XML (gbc)
- Feature table (ft)
- GenBank flat file with full sequence (gbwithparts)
- CDS nucleotide FASTA (fasta_cds_na)
- CDS protein FASTA (fasta_cds_aa)
- EST report (est)
- GSS report (gss)
- GenPept flat file (gp)
- INSDSeq XML (gpc)
- Identical Protein XML (ipg)
- XML (null)
- MEDLINE (medline)
- text ASN.1 (null)
- MEDLINE (medline)
- PMID list (uilist)
- Abstract (abstract)
- text ASN.1 (null)
- Accession number(s) (acc)
- FASTA (fasta)
- SeqID string (seqid)
- text ASN.1 (null)
- Flat file (flt)
- FASTA (fasta)
- RS Cluster report (rsr)
- SS Exemplar list (ssexemplar)
- Chromosome report (chr)
- Summary (docset)
- UID list (uilist)
- XML (full)
- XML (null)
- TaxID list (uilist)
- ClinVar Set (clinvarset)
- UID list (uilist)
- GTR Test Report (gtracc)
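As noted above, a value of null means that -r should be left out of the command. The sketch below shows one way to apply that rule when assembling commands from the list; build_cmd is a hypothetical helper written for this example (not part of ncbi_search.pl), and it only prints the command rather than running it.

```shell
# Hypothetical helper (not part of ncbi_search.pl).
# Prints the ncbi_search.pl command for a query, omitting -r when the
# format value is empty or listed as null.
# Arguments: query, database (-d), output file (-o), format (-r value)
build_cmd() {
  query=$1
  db=$2
  out=$3
  fmt=$4
  cmd="perl ncbi_search.pl -q '$query' -o $out -d $db"
  if [ -n "$fmt" ] && [ "$fmt" != "null" ]; then
    cmd="$cmd -r $fmt"
  fi
  printf '%s\n' "$cmd"
}

# text ASN.1 from the gene database: value is null, so -r is omitted
build_cmd 'homo sapiens[Organism] AND PRNP[Gene name]' gene gene_info.txt null

# abstracts from pubmed: -r abstract is included
build_cmd 'Stothard P[Author]' pubmed abstracts.txt abstract
```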