# Accessing NCBI databases
This section shows how to access databases at the National Center for Biotechnology Information (NCBI). We will discuss not only GenBank but also other NCBI databases. Many people refer (wrongly) to the whole set of NCBI databases as GenBank, but NCBI hosts the nucleotide database and many others, for example, PubMed.

# Available databases at NCBI
Biopython provides an interface to Entrez, the data retrieval system made available by NCBI. Entrez can also be used through a web browser: https://www.ncbi.nlm.nih.gov/search/
TIPS:

* Specify an email address with your query
* Avoid large numbers of requests (100 or more) during peak times (between 9.00 a.m. and 5.00 p.m. American Eastern Time on weekdays)
* Do not post more than three queries per second (Biopython will take care of this for you)

Following these tips is not only good citizenship: you risk getting blocked if you overuse NCBI's servers (a good reason to give a real email address, because NCBI may try to contact you first).
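The peak-time tip can be turned into a small check before launching a large job. The sketch below is only an illustration: `is_peak_time` is a hypothetical helper (not part of Biopython), and it assumes the `datetime` passed in is already expressed in US Eastern time.

```python
from datetime import datetime


def is_peak_time(eastern_time: datetime) -> bool:
    """Return True if the given US Eastern time falls in NCBI peak hours.

    Peak hours are taken from the tips above: weekdays, 9 a.m. to 5 p.m.
    The caller is assumed to pass a time already converted to Eastern time.
    """
    is_weekday = eastern_time.weekday() < 5  # Monday=0 ... Friday=4
    in_peak_hours = 9 <= eastern_time.hour < 17
    return is_weekday and in_peak_hours


# Wednesday 12 October 2022, 10:00 is inside the peak window
print(is_peak_time(datetime(2022, 10, 12, 10, 0)))  # True
# Saturday 15 October 2022, 10:00 is a weekend
print(is_peak_time(datetime(2022, 10, 15, 10, 0)))  # False
```

A script could postpone a batch of 100 or more requests whenever this check returns `True`.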

In [2]:
!pip install biopython

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
     |████████████████████████████████| 2.6 MB 7.5 MB/s 
Installing collected packages: biopython
Successfully installed biopython-1.79


In [3]:
from Bio import Entrez, SeqIO  # Entrez queries the NCBI databases; SeqIO reads and writes sequence files

Entrez.email = 'alejandro.delgado@yachaytech.edu.ec'

In [12]:
# List the available databases

handle = Entrez.einfo()
rec = Entrez.read(handle)  # dict-like result holding the database list
handle.close()  # always close the handle when done
print(rec.keys())  # show the dict keys

dict_keys(['DbList'])


In [14]:
rec

{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}

## Example: Chloroquine resistance transporter
This part searches the nucleotide database for the **chloroquine resistance transporter (CRT)** gene (KM288867) in **Plasmodium falciparum** (the parasite that causes the deadliest form of malaria):

**ESearch:** Function of Entrez to search in a database

**Note:** By default, a search returns at most 20 record references. If there are more, we can raise the limit with the **retmax** parameter as follows:

In [18]:
handle = Entrez.esearch(db='nucleotide', term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]',
                        retmax = "40")
rec_list = Entrez.read(handle)
handle.close()
rec_list['Count']

'2925'
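For orientation, the search above corresponds to a plain HTTP request against the E-utilities `esearch.fcgi` endpoint. The sketch below only builds that URL with the standard library and sends nothing; `build_esearch_url` is a hypothetical helper, and Biopython is understood to append further parameters (such as the tool name and email) that are left out here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch service
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL; retmax defaults to 20 like a standard search."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode percent-escapes brackets and quotes in the search term
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_esearch_url(
    "nucleotide",
    'CRT[Gene Name] AND "Plasmodium falciparum"[Organism]',
    retmax=40,
)
print(url)
```

Seeing the raw URL makes it clearer that `db`, `term`, and `retmax` are simply query parameters passed through to NCBI.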

In [19]:
len(rec_list['IdList'])

40

In [21]:
rec_list['IdList']

['2301594124', '2301594089', '2262825096', '2262825094', '2262825092', '2262825090', '2262825088', '2262825086', '2262825084', '2262825082', '2262825080', '2262825078', '2262825076', '2262825074', '2262825072', '2262825070', '2262825068', '2262825066', '2262825064', '2262825062', '2262825060', '2262825058', '2262825056', '2262825054', '2262825052', '2262825050', '2262825048', '2262825046', '2262825044', '2262825042', '2262825040', '2262825038', '2262825036', '2262825034', '2262825032', '2262825030', '2262825028', '2262825026', '2262825024', '2262825022']
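The count reported above (2925) is far larger than the 40 IDs returned. One way to collect the rest is to page through the results, combining `retmax` with the E-utilities `retstart` offset. The sketch below only computes the offsets; `page_offsets` is a hypothetical helper and the page size of 500 is an arbitrary assumption.

```python
def page_offsets(total_count, page_size):
    """Yield (retstart, retmax) pairs that together cover total_count records."""
    for start in range(0, total_count, page_size):
        # The last page may be shorter than page_size
        yield start, min(page_size, total_count - start)


pages = list(page_offsets(2925, 500))
print(len(pages))   # 6
print(pages[0])     # (0, 500)
print(pages[-1])    # (2500, 425)
```

Each pair could then be passed to `Entrez.esearch` as `retstart=` and `retmax=`, appending the returned `IdList` to a running list.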

We now have the IDs of the first 40 matching records (out of 2925), but we still need to retrieve the records themselves.

**EFetch:** Function to download full records from Entrez

Requesting a specific file format from Entrez with Bio.Entrez.efetch() requires the optional **rettype** and/or **retmode** arguments. The valid combinations for each database type are described on the pages linked from the NCBI efetch webpage: https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

* rettype - return type; gb == GenBank
* retmax - total number of records from the input set to be retrieved, up to a maximum of 10,000

In [22]:
id_list = rec_list['IdList']  # save the IDs of the records
handle = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')  # request GenBank format, to be parsed with the SeqIO module

In [24]:
# The response is in GenBank format, so we parse it with the SeqIO module
recs = list(SeqIO.parse(handle, 'gb'))
handle.close()

ValueError: ignored

**NOTE:** We have converted an iterator (the result of SeqIO.parse) to a list. The advantage is that we can reuse the result as many times as we want (for example, iterate over it several times) without repeating the query on the server.

**BUT:** Be careful with this technique, because you retrieve a large number of complete records, and some of them contain fairly large sequences. You risk downloading a lot of data, which strains both your side and the NCBI servers.
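One way to limit the amount of data held at once is to request the IDs in small batches and consume the records lazily. The sketch below is a generic pattern, not Biopython API: `fetch_in_batches` and its `fetch_batch` callback are hypothetical names, and the demo uses a stand-in function so that it runs offline.

```python
def fetch_in_batches(ids, fetch_batch, batch_size=10):
    """Lazily yield records, requesting at most batch_size IDs per call.

    fetch_batch is any callable that takes a list of IDs and returns an
    iterable of records (an assumption made for this sketch).
    """
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        yield from fetch_batch(batch)


def fake_fetch(batch):
    """Offline stand-in for a real download: one dummy record per ID."""
    return [f"record-{record_id}" for record_id in batch]


demo_ids = ['2301594124', '2301594089', '2262825096']
for record in fetch_in_batches(demo_ids, fake_fetch, batch_size=2):
    print(record)
```

With Biopython, `fetch_batch` could be something like `lambda batch: SeqIO.parse(Entrez.efetch(db='nucleotide', id=batch, rettype='gb', retmode='text'), 'gb')`. Because the function is a generator, no request is made until the records are actually iterated.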

In [25]:
recs

[SeqRecord(seq=Seq('CTTATTTTTAAAGAGATTAAGGATAATATTTTTATTTATATTTTAAGTATTATT...TTC'), id='MZ054304.1', name='MZ054304', description='Plasmodium falciparum 3D7 isolate CRT33 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('CTTATTTTTAAAGAGATTAAGGATAATATTTTTATTTATATTTTAAGTATTATT...TCA'), id='MZ054303.1', name='MZ054303', description='Plasmodium falciparum 3D7 isolate CRT4 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('AGTTGTATACAAAGTCCAGCATTAGCAATTGCTTATTACTTTAAATTCTTAGCC...AAA'), id='OM964469.1', name='OM964469', description='Plasmodium falciparum isolate SPK77 chloroquine resistance transporter (CRT) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('AGTTGTATACAAAGTCCAGCATTAGCAATTGCTTATTACTTTAAATTCTTAGCC...AAA'), id='OM964468.1', name='OM964468', description='Plasmodium falciparum isolate SPK66 chloroquine resistance transporter (CRT) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('AGTTGTATA

In [26]:
for rec in recs:
  if rec.name == 'KM288867':  # look for the CRT record among the 40 results; if it is absent, rec stays on the last record
    break
print(rec.name)
print(rec.description)

OM964432
Plasmodium falciparum isolate SPK67 chloroquine resistance transporter (CRT) gene, partial cds
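Note that the printed name is OM964432, not KM288867: when no record matches, the loop ends without reaching `break`, and `rec` is left pointing at the last record in the list. A lookup that signals "not found" explicitly avoids this confusion. The sketch below uses a hypothetical helper, `find_by_name`, and stand-in objects that only carry a `name` attribute, so it runs without a download.

```python
from types import SimpleNamespace


def find_by_name(records, name):
    """Return the first record whose .name equals name, or None if absent."""
    return next((rec for rec in records if rec.name == name), None)


# Stand-in records; real SeqRecord objects expose the same .name attribute
demo_records = [SimpleNamespace(name='OM964469'), SimpleNamespace(name='OM964432')]

match = find_by_name(demo_records, 'KM288867')
if match is None:
    print('KM288867 is not among the retrieved records')
else:
    print(match.name)
```

If the target is missing from the first 40 results, a narrower search term or a larger `retmax` would be needed to retrieve it.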


In [27]:
str(rec.seq)

'AGTTGTATACAAGGTCCAGCATTAGCAATTGCTTATTACTTTAAATTCTTAGCCGTAAGAATTAAA'