# Accessing NCBI Databases with Biopython

Biopython gives access to NCBI data through Entrez, NCBI's data retrieval system. A web interface is also available at [https://www.ncbi.nlm.nih.gov/search/](https://www.ncbi.nlm.nih.gov/search/).

__Tips__
- Specify an email address with your queries.
- Avoid large numbers of requests (100+) during peak hours (9-5, Monday to Friday).
- Do not post more than 3 queries per second (Biopython delays requests to stay under this limit).

Following these guidelines is good citizenship, and you risk being blocked if you overuse NCBI's servers. That is also a good reason to give a real email address, because NCBI may try to contact you.
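The peak-hours tip can be checked in code before starting a large job. The sketch below is illustrative only: `is_peak_hours` is a hypothetical helper, and it assumes the timestamp passed in is already expressed in NCBI's local (US Eastern) time, which the tips above do not state explicitly.

```python
from datetime import datetime, time

# Peak window taken from the tip above: 9-5, Monday to Friday.
PEAK_START = time(9, 0)
PEAK_END = time(17, 0)


def is_peak_hours(when: datetime) -> bool:
    """Return True if `when` is a weekday between 09:00 and 17:00.

    Assumption: `when` is already in US Eastern time.
    """
    is_weekday = when.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and PEAK_START <= when.time() < PEAK_END
```

A script could call `is_peak_hours(...)` with the current Eastern time and postpone a batch of 100+ requests when it returns `True`.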

In [1]:
from Bio import Entrez, SeqIO
import certifi

Entrez.email = "fortinopineda303@gmail.com"

__Einfo:__ obtain a list of all database names available through Entrez

In [4]:
# This returns a list of all available databases
handle = Entrez.einfo()
rec = Entrez.read(handle)
handle.close()
print(rec.keys())

dict_keys(['DbList'])


In [5]:
rec['DbList']

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']
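Calling `Entrez.einfo()` with a `db` argument returns details about a single database, including the fields that can be used in search terms. The sketch below pulls the field names out of an already-parsed record. `search_fields` is a hypothetical helper, and it assumes the parsed record has the `DbInfo` / `FieldList` layout with `Name` and `FullName` keys.

```python
from typing import Dict, Mapping


def search_fields(einfo_record: Mapping) -> Dict[str, str]:
    """Map each searchable field's short name to its full name.

    `einfo_record` is assumed to be the parsed result of
    Entrez.einfo(db=...), i.e. a mapping with a "DbInfo" entry.
    """
    fields = einfo_record["DbInfo"]["FieldList"]
    return {field["Name"]: field["FullName"] for field in fields}
```

With Biopython, the record could be obtained with `handle = Entrez.einfo(db="nucleotide")` followed by `Entrez.read(handle)`, then passed to `search_fields` to see which field names are available for queries.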

# Accessing the nucleotide database

This section shows how to access databases at the National Center for Biotechnology Information (NCBI), covering not only GenBank but other NCBI databases as well. Many people (wrongly) refer to the whole set of NCBI databases as GenBank, but NCBI hosts the nucleotide database and many others, for example PubMed. The nucleotide database includes entries from GenBank, RefSeq, TPA, and PDB.

`eSearch`: search the Entrez databases

Note: by default only 20 results are returned. To override this, set `retmax` to the desired number of records (up to 10,000).

In [7]:
handle = Entrez.esearch(db="nucleotide",term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax=40)
rec_list = Entrez.read(handle)
handle.close()
rec_list["Count"]

'3081'

In [8]:
len(rec_list["IdList"])  # IdList holds only the record IDs; use these IDs to fetch the sequences

40

In [9]:
rec_list["IdList"]

['2587918588', '2507817686', '2507817684', '2507817682', '2507817680', '2507817678', '2507817676', '2507817674', '2507817672', '2507817670', '2507817668', '2507817666', '2507817664', '2507817662', '2507817660', '2507817658', '2507817656', '2507817654', '2507817652', '2507817650', '2507817648', '2507817646', '2507817644', '2507817642', '2507817640', '2507817638', '2507817636', '2507817634', '2507817632', '2507817630', '2507817628', '2507817626', '2507817624', '2507817622', '2507817620', '2507817618', '2507817616', '2507817614', '2507817612', '2507817610']
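The search above matched 3081 records but returned only the 40 requested through `retmax`. One way to collect every ID is to page through the results with the `retstart` parameter of `esearch`. The sketch below is a minimal version under stated assumptions: `page_offsets` and `search_all_ids` are hypothetical helpers, and the `esearch` and `read` arguments are expected to be `Bio.Entrez.esearch` and `Bio.Entrez.read`, passed in so the paging logic can be tried without network access.

```python
from typing import Callable, Iterator, List


def page_offsets(total: int, page_size: int) -> Iterator[int]:
    """Yield the retstart offset of each page needed to cover `total` hits."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    yield from range(0, total, page_size)


def search_all_ids(
    esearch: Callable,
    read: Callable,
    term: str,
    db: str = "nucleotide",
    page_size: int = 1000,
) -> List[str]:
    """Collect every ID matching `term`, one page per request."""
    # A first request with retmax=0 only asks for the total hit count.
    handle = esearch(db=db, term=term, retmax=0)
    total = int(read(handle)["Count"])
    handle.close()

    ids: List[str] = []
    for start in page_offsets(total, page_size):
        handle = esearch(db=db, term=term, retstart=start, retmax=page_size)
        ids.extend(read(handle)["IdList"])
        handle.close()
    return ids
```

It might be called as `search_all_ids(Entrez.esearch, Entrez.read, 'CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')`. Each page is a separate request, so the usage tips at the top of this notebook still apply.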

Now that we have IDs, we can use `eFetch` to access and download the full records.

To request a specific file format, add the following parameters to `Bio.Entrez.efetch()`:
- `rettype`: return type; `gb` requests GenBank format, which is convenient to parse with `SeqIO`
- `retmode`: return mode, for example `text` or `xml`

In [10]:
id_list = rec_list["IdList"]
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="gb")
recs = list(SeqIO.parse(handle, "gb"))  # storing the records in a list avoids repeated API calls
handle.close()
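For ID lists much longer than the 40 used here, it may be gentler on NCBI's servers, and on local memory, to fetch in batches rather than in one request. The sketch below shows one possible approach. `batched` and `fetch_genbank_text` are hypothetical helpers, the default batch size of 20 is an arbitrary choice, and the `efetch` argument is expected to be `Bio.Entrez.efetch`.

```python
from typing import Callable, Iterator, List, Sequence


def batched(ids: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Split `ids` into consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield list(ids[start:start + batch_size])


def fetch_genbank_text(
    efetch: Callable, ids: Sequence[str], batch_size: int = 20
) -> Iterator[str]:
    """Yield the GenBank text of each batch, one efetch request per batch."""
    for batch in batched(ids, batch_size):
        handle = efetch(
            db="nucleotide", id=",".join(batch), rettype="gb", retmode="text"
        )
        try:
            text = handle.read()
        finally:
            handle.close()  # close the handle even if reading fails
        yield text
```

Because `fetch_genbank_text` is a generator, only one batch of text is held at a time. Each chunk could be written to a file, or wrapped in `io.StringIO` and passed to `SeqIO.parse(..., "gb")`.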

In [11]:
recs  # be careful when saving to a list: large sequences can use a lot of memory

[SeqRecord(seq=Seq('GGTGGAGGTTCTTGTCTTGGTAAATGTGCTCATGTGTTTAAACTTATTTTTAAA...AAA'), id='OR483864.1', name='OR483864', description='Plasmodium falciparum isolate PE-26 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672451.1', name='OQ672451', description='Plasmodium falciparum isolate ML_14 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672450.1', name='OQ672450', description='Plasmodium falciparum isolate ML_13 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672449.1', name='OQ672449', description='Plasmodium falciparum isolate ML_12 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTA

In [12]:
for rec in recs:
    if rec.name == "KM288867":
        break
# If no record is named KM288867, the loop never breaks and rec is the last record in recs
print(rec.name)
print(rec.description)

OQ672413
Plasmodium falciparum isolate MAO_27 chloroquine resistance transporter (crt) gene, partial cds
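The output above shows a record other than `KM288867`: when no record matches, the loop runs to the end and `rec` is simply the last record. A lookup that makes the "not found" case explicit could look like the sketch below, where `find_by_name` is a hypothetical helper that works on any objects with a `name` attribute, such as `SeqRecord`.

```python
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def find_by_name(records: Iterable[T], name: str) -> Optional[T]:
    """Return the first record whose .name equals `name`, or None if absent."""
    return next((rec for rec in records if rec.name == name), None)
```

With the list fetched earlier, `find_by_name(recs, "KM288867")` would return `None` when that accession is not among the fetched records, so the caller can handle the missing case instead of silently using a different record.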


In [13]:
print(rec.seq)

TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATTTATATTTTAAGTATTATTTATTTAAGTGTATCTGTAATGAATACAATTTTTGCTAAAAGAACTTTAAACAAAATTGGTAACTATAGTTTTG
