# Accessing NCBI databases with Biopython

In this tutorial, we will learn how to access the databases of the National Center for Biotechnology Information (NCBI) using Biopython.

## Context:
- Entrez gives us access to all the databases in the NCBI
- Our target is the chloroquine resistance transporter (CRT) gene (KM288867) of Plasmodium falciparum in the nucleotide database
- We retrieve the sequence records matching our search in GenBank format
- Among the 40 records retrieved, we look for the CRT record named 'KM288867'
- Finally, we print the full sequence of the selected record

In [1]:
from Bio import Entrez, SeqIO

In [2]:
Entrez.email = "debnathk1997@gmail.com"  # NCBI asks for a contact email with Entrez requests

In [3]:
# Get the list of all database names accessible through Entrez
handle = Entrez.einfo()
rec = Entrez.read(handle)
handle.close()
print(rec.keys())

dict_keys(['DbList'])


In [4]:
rec['DbList']

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

We will now try to find the chloroquine resistance transporter (CRT) gene (KM288867) of Plasmodium falciparum (the parasite that causes the deadliest form of malaria) in the nucleotide database:

Note: by default, a search returns at most 20 record IDs. To get more, we override retmax with the desired number of records.

In [5]:
# Search the nucleotide database; retmax raises the default cap of 20 returned IDs to 40
handle = Entrez.esearch(db="nucleotide", term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]', retmax="40")
rec_list = Entrez.read(handle)
handle.close()
rec_list['Count'] # total number of records matching the term in the nucleotide database

'3080'
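The search term above joins field-tagged clauses with AND. As a minimal sketch, a small helper can assemble such a term from a mapping of field names to values, quoting multi-word values as in the query above. The name `build_term` is hypothetical and not part of Biopython.

```python
def build_term(fields):
    """Join field-tagged clauses with AND, e.g. CRT[Gene Name].

    `fields` maps an Entrez field name to the value to search for.
    Illustrative helper only; it is not part of Biopython.
    """
    clauses = []
    for field, value in fields.items():
        # Quote multi-word values so they are searched as a phrase.
        if " " in value:
            value = f'"{value}"'
        clauses.append(f"{value}[{field}]")
    return " AND ".join(clauses)


term = build_term({"Gene Name": "CRT", "Organism": "Plasmodium falciparum"})
print(term)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch`.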

In [6]:
len(rec_list['IdList']) # number of IDs returned, capped by retmax

40
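Only 40 of the 3080 matching IDs come back because of retmax. One way to collect more is to page through the results with the `retstart` and `retmax` search parameters. The sketch below only computes the (retstart, retmax) pairs and makes no network calls. The helper name `batch_windows` and the batch size of 500 are assumptions chosen for illustration. Note that `Count` is returned as a string, so it is converted to an integer first.

```python
def batch_windows(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    for start in range(0, total, batch_size):
        # The last window may be smaller than batch_size.
        yield start, min(batch_size, total - start)


count = int("3080")  # rec_list['Count'] is a string, as in the output above
windows = list(batch_windows(count, 500))
print(len(windows), windows[0], windows[-1])
```

Each pair could then be passed to a search call, for example `Entrez.esearch(db="nucleotide", term=..., retstart=start, retmax=size)`, and the resulting `IdList` values concatenated.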

In [7]:
rec_list['IdList']

['2507817686', '2507817684', '2507817682', '2507817680', '2507817678', '2507817676', '2507817674', '2507817672', '2507817670', '2507817668', '2507817666', '2507817664', '2507817662', '2507817660', '2507817658', '2507817656', '2507817654', '2507817652', '2507817650', '2507817648', '2507817646', '2507817644', '2507817642', '2507817640', '2507817638', '2507817636', '2507817634', '2507817632', '2507817630', '2507817628', '2507817626', '2507817624', '2507817622', '2507817620', '2507817618', '2507817616', '2507817614', '2507817612', '2507817610', '2507817608']

In [8]:
id_list = rec_list['IdList']
handle = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb') # downloading full records from Entrez

In [9]:
recs = list(SeqIO.parse(handle, 'gb'))
handle.close()

In [10]:
recs

[SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672451.1', name='OQ672451', description='Plasmodium falciparum isolate ML_14 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672450.1', name='OQ672450', description='Plasmodium falciparum isolate ML_13 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672449.1', name='OQ672449', description='Plasmodium falciparum isolate ML_12 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATT...TTG'), id='OQ672448.1', name='OQ672448', description='Plasmodium falciparum isolate ML_11 chloroquine resistance transporter (crt) gene, partial cds', dbxrefs=[]),
 SeqRecord(seq=Seq('TGTGCTCATGTGTTTA

In [11]:
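# Look for the record named 'KM288867' among the 40 records we fetched
for rec in recs:
    if rec.name == 'KM288867':
        break
# If no record matches, the loop ends without a break and rec is the last record in recs
print(rec.name)
print(rec.description)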

OQ672412
Plasmodium falciparum isolate MAO_26 chloroquine resistance transporter (crt) gene, partial cds
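The name printed here is OQ672412 rather than KM288867: when no record matches, the loop above finishes without a break and `rec` stays bound to the last record. A more defensive lookup returns `None` when the name is absent. The sketch below uses `types.SimpleNamespace` stand-ins instead of real `SeqRecord` objects so that it runs offline. The helper name `find_by_name` is hypothetical.

```python
from types import SimpleNamespace


def find_by_name(records, name):
    """Return the first record whose .name equals `name`, else None."""
    return next((r for r in records if r.name == name), None)


# Stand-ins for SeqRecord objects; only the .name attribute is needed here.
demo_recs = [SimpleNamespace(name="OQ672451"), SimpleNamespace(name="OQ672412")]

match = find_by_name(demo_recs, "KM288867")
if match is None:
    print("KM288867 is not among the fetched records")
else:
    print(match.name)
```

With the real `recs` list, the same call would be `find_by_name(recs, 'KM288867')`. If the record is missing from the batch, one option is to request it directly by passing the accession as the `id` argument of `Entrez.efetch`.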


In [12]:
str(rec.seq)

'TGTGCTCATGTGTTTAAACTTATTTTTAAAGAGATTAAGGATAATATTTTTATTTATATTTTAAGTATTATTTATTTAAGTGTATCTGTAATGAATACAATTTTTGCTAAAAGAACTTTAAACAAAATTGGTAACTATAGTTTTG'

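## Summary:

In this tutorial, we used Entrez to search the nucleotide database for the chloroquine resistance transporter (CRT) gene of Plasmodium falciparum, fetched 40 matching records in GenBank format, and looked for KM288867 among them. KM288867 was not in this batch, so the sequence printed above belongs to the last record fetched, OQ672412.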

# Finish!