### How to do it...
#### 1. We start by importing the relevant modules and configuring the e-mail address:

In [1]:
from Bio import Entrez, Medline, SeqIO

In [2]:
Entrez.email = "kakyung.kim@gmail.com" 

#### 2. We will now try to find the Chloroquine Resistance Transporter (CRT) gene in Plasmodium falciparum (the parasite that causes the deadliest form of malaria) in the nucleotide database:

In [3]:
#This gives you the list of available databases
handle = Entrez.einfo()
rec = Entrez.read(handle)
print(rec)

{u'DbList': ['pubmed', 'protein', 'nuccore', 'nucleotide', 'nucgss', 'nucest', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'epigenomics', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'unigene', 'gencoll', 'gtr']}


In [4]:
handle = Entrez.esearch(db="nucleotide", term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]')
rec_list = Entrez.read(handle)
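# By default at most 20 IDs are returned; if there are more matches,
# repeat the search with retmax set to the total count.
# Entrez.read returns these values as strings, so compare them as integers.
if int(rec_list['RetMax']) < int(rec_list['Count']):
    handle = Entrez.esearch(db="nucleotide", term='CRT[Gene Name] AND "Plasmodium falciparum"[Organism]',
                            retmax=rec_list['Count'])
    rec_list = Entrez.read(handle)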
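
The comparison between `RetMax` and `Count` is only safe on integers. Below is a minimal sketch of that check, assuming the values arrive as strings (as they typically do from `Entrez.read`). The helper name `more_results_available` and the mock dictionary are ours, for illustration only; no network access is needed.

```python
def more_results_available(search_result):
    """Return True when the search matched more IDs than were returned."""
    returned = int(search_result["RetMax"])  # convert before comparing
    total = int(search_result["Count"])
    return returned < total


# Mock result shaped like the output of Entrez.read(Entrez.esearch(...))
mock_result = {"Count": "281", "RetMax": "20", "IdList": []}
print(more_results_available(mock_result))

# Comparing the raw strings is lexicographic and can give the wrong answer:
# "20" < "100" is False, although 20 < 100 numerically.
print("20" < "100")
```

With string comparison, a search that matched 100 records would never trigger the second query, so the explicit `int()` conversion is the safer habit.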

#### 3. Let's now retrieve all these records. The following query downloads all matching nucleotide sequences from GenBank; there were 281 at the time of writing this book.

In [5]:
id_list = rec_list['IdList']  # be careful: this fetches every complete record, which can be a lot of data
hdl = Entrez.efetch(db='nucleotide', id=id_list, rettype='gb')
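
When the ID list is large, one option is to fetch it in smaller batches rather than in a single request. The following is only a sketch under assumptions: `chunked` and `fetch_in_batches` are hypothetical helper names, the batch size of 100 is arbitrary, and `fetch_in_batches` needs Biopython, network access, and `Entrez.email` to be set before it is called.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(id_list, batch_size=100):
    """Download GenBank records batch by batch (hypothetical helper)."""
    # Imported here so that chunked() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    records = []
    for batch in chunked(id_list, batch_size):
        handle = Entrez.efetch(db="nucleotide", id=batch,
                               rettype="gb", retmode="text")
        records.extend(SeqIO.parse(handle, "gb"))
        handle.close()
    return records


# The batching logic itself can be checked offline:
print(list(chunked(["1", "2", "3", "4", "5"], 2)))
```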

#### 4. Let's read and parse the result:
We convert the iterator returned by SeqIO.parse into a **list**, so that we can use the result as many times as we want (for example, iterate over it several times) without repeating the query on the server.

The disadvantage is that memory is allocated for all records at once.

In [6]:
recs = list(SeqIO.parse(hdl, 'gb'))
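
The difference between an iterator and a list can be seen without touching the server. In this sketch `parse_records` is a toy stand-in for `SeqIO.parse` (it is not Biopython code), and the record names are made up for illustration.

```python
def parse_records(raw_lines):
    """Toy stand-in for SeqIO.parse: lazily yields one 'record' per line."""
    for line in raw_lines:
        yield line.strip().upper()


raw = ["record_a\n", "record_b\n"]

lazy = parse_records(raw)
first_pass = list(lazy)    # consumes the iterator
second_pass = list(lazy)   # nothing left: the iterator is exhausted

cached = list(parse_records(raw))          # materialize once...
passes = [list(cached), list(cached)]      # ...and reuse as often as needed

print(first_pass, second_pass)
print(passes)
```

A list trades memory for reusability; for very large downloads, processing the iterator once may be the better choice.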

#### 5. We will now concentrate on a single record. This will only work if you used the exact same preceding query:

In [7]:
for rec in recs: #rec : record of interest
    if rec.name == 'KM288867':
        break
print(rec.name)
print(rec.description) #rec.description : human-readable description

KM288867
Plasmodium falciparum clone PF3D7_0709000 chloroquine resistance transporter (CRT) gene, complete cds.
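
Note that a `for`/`break` loop leaves `rec` bound to the last record when nothing matches. A more defensive lookup can be sketched as follows, assuming each record exposes a `name` attribute. `Record` is a hypothetical stand-in for `SeqRecord`, and `find_record` is our own helper name.

```python
from collections import namedtuple

# Hypothetical stand-in for Bio.SeqRecord.SeqRecord, for illustration only
Record = namedtuple("Record", ["name", "description"])

recs = [
    Record("record_1", "placeholder entry"),
    Record("KM288867", "Plasmodium falciparum clone PF3D7_0709000 "
                       "chloroquine resistance transporter (CRT) gene, complete cds."),
]


def find_record(records, name):
    """Return the first record with the given name, or None if absent."""
    return next((r for r in records if r.name == name), None)


rec = find_record(recs, "KM288867")
if rec is None:
    print("record not found; check that the query returned it")
else:
    print(rec.name)
    print(rec.description)
```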


#### 6. Let's now extract some sequence features, which contain information such as gene products and exon positions on the sequence:

In [8]:
for feature in rec.features:
    if feature.type == 'gene':
        print(feature.qualifiers['gene'])
    elif feature.type == 'exon':
        loc = feature.location
        print('Exon', loc.start, loc.end, loc.strand)
    else:
        print('not processed:\n%s' % feature)

not processed:
type: source
location: [0:10000](+)
qualifiers:
    Key: clone, Value: ['PF3D7_0709000']
    Key: db_xref, Value: ['taxon:5833']
    Key: mol_type, Value: ['genomic DNA']
    Key: organism, Value: ['Plasmodium falciparum']

['CRT']
not processed:
type: mRNA
location: join{[2751:3543](+), [3720:3989](+), [4168:4341](+), [4513:4646](+), [4799:4871](+), [4994:5070](+), [5166:5249](+), [5376:5427](+), [5564:5621](+), [5769:5862](+), [6055:6100](+), [6247:6302](+), [6471:7598](+)}
qualifiers:
    Key: gene, Value: ['CRT']
    Key: product, Value: ['chloroquine resistance transporter']
Sub-Features
type: mRNA
location: [2751:3543](+)
qualifiers:

type: mRNA
location: [3720:3989](+)
qualifiers:

type: mRNA
location: [4168:4341](+)
qualifiers:

type: mRNA
location: [4513:4646](+)
qualifiers:

type: mRNA
location: [4799:4871](+)
qualifiers:

type: mRNA
location: [4994:5070](+)
qualifiers:

type: mRNA
location: [5166:5249](+)
qualifiers:

type: mRNA
location: [5376:5427](+)
qualif
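
The `join{...}` location of the mRNA feature is a list of start/end pairs that behave like Python slices (0-based, end-exclusive). As a sketch, the coordinates printed above can be copied into plain tuples to compute the spliced length and the gaps between parts. With Biopython, the same pairs should be available from `feature.location.parts`, which we state as an assumption to check against your installed version.

```python
# (start, end) pairs copied from the mRNA join{...} location printed above
mrna_parts = [
    (2751, 3543), (3720, 3989), (4168, 4341), (4513, 4646),
    (4799, 4871), (4994, 5070), (5166, 5249), (5376, 5427),
    (5564, 5621), (5769, 5862), (6055, 6100), (6247, 6302),
    (6471, 7598),
]

# Length of each part and of the spliced transcript
part_lengths = [end - start for start, end in mrna_parts]
spliced_length = sum(part_lengths)

# Gaps between consecutive parts (the regions spliced out)
gap_lengths = [
    next_start - prev_end
    for (_, prev_end), (next_start, _) in zip(mrna_parts, mrna_parts[1:])
]

print(len(mrna_parts), "parts, spliced length:", spliced_length)
print("gap lengths:", gap_lengths)
```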

#### 7. We will now look at the annotations on the record, which are mostly metadata not tied to a position on the sequence:

In [9]:
for name, value in rec.annotations.items():
    print('%s=%s' % (name, value))

sequence_version=1
source=Plasmodium falciparum (malaria parasite P. falciparum)
taxonomy=['Eukaryota', 'Alveolata', 'Apicomplexa', 'Aconoidasida', 'Haemosporida', 'Plasmodiidae', 'Plasmodium', 'Plasmodium (Laverania)']
keywords=['']
references=[Reference(title='Versatile control of Plasmodium falciparum gene expression with an inducible protein-RNA interaction', ...), Reference(title='Direct Submission', ...)]
accessions=['KM288867']
data_file_division=INV
date=12-NOV-2014
organism=Plasmodium falciparum
gi=706072608


#### 8. Finally, you can access the most fundamental piece of information, the sequence:

In [10]:
print(len(rec.seq))

10000


### There's more…
You will probably want to check the Sequence Read Archive (SRA, formerly the
Short Read Archive) database if you are working with next-generation sequencing data. The SNP
database contains information on single-nucleotide polymorphisms (SNPs), whereas the
protein database has protein sequences, and so on. A full list of databases in Entrez is
linked in the See also section of this recipe.

Another NCBI database that you probably already know about is PubMed, which includes a
list of scientific and medical citations, abstracts, and even full texts. You can also access it via
Biopython. Furthermore, GenBank records often contain links to PubMed. For example, we
can follow these links from our previous record, as shown here:

In [11]:
refs = rec.annotations['references']
for ref in refs:
    if ref.pubmed_id != '': #1)pubmed id check
        print(ref.pubmed_id)
        handle = Entrez.efetch(db="pubmed", id=[ref.pubmed_id],
                                rettype="medline", retmode="text") #2)retrieve
        records = Medline.parse(handle) #3)parse
        for med_rec in records:
            for k, v in med_rec.items():
                print('%s: %s' % (k, v)) #4)print

25370483
LID: 10.1038/ncomms6329 [doi]
STAT: In-Process
DEP: 20141105
MID: ['NIHMS630149']
DA: 20141105
AID: ['ncomms6329 [pii]', '10.1038/ncomms6329 [doi]']
CRDT: ['2014/11/06 06:00']
DP: 2014
GR: ['1DP2OD007124/OD/NIH HHS/United States', '5-T32-ES007020/ES/NIEHS NIH HHS/United States', '5-T32-GM08334/GM/NIGMS NIH HHS/United States', 'DP2 OD007124/OD/NIH HHS/United States', 'P30 ES002109/ES/NIEHS NIH HHS/United States']
OWN: NLM
PT: ['Journal Article', 'Research Support, N.I.H., Extramural', "Research Support, Non-U.S. Gov't"]
LA: ['eng']
FAU: ['Goldfless, Stephen J', 'Wagner, Jeffrey C', 'Niles, Jacquin C']
JT: Nature communications
LR: 20150505
PG: 5329
TI: Versatile control of Plasmodium falciparum gene expression with an inducible protein-RNA interaction.
PL: England
TA: Nat Commun
JID: 101528555
AB: The available tools for conditional gene expression in Plasmodium falciparum are limited. Here, to enable reliable control of target gene expression, we build a system to efficiently 
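
Each Medline record behaves like a dictionary keyed by field tags, so individual fields can be picked out instead of printing everything. The sketch below uses a plain dictionary filled with values from the output above. `format_citation` is a hypothetical helper, and the fallback strings are our own choice.

```python
def format_citation(med_rec):
    """Build a short citation string from a Medline-style mapping."""
    authors = "; ".join(med_rec.get("FAU", [])) or "Unknown authors"
    year = med_rec.get("DP", "n.d.")
    title = med_rec.get("TI", "Untitled")
    journal = med_rec.get("JT", "")
    return f"{authors} ({year}). {title} {journal}".strip()


# Fields copied from the record printed above
med_rec = {
    "FAU": ["Goldfless, Stephen J", "Wagner, Jeffrey C", "Niles, Jacquin C"],
    "DP": "2014",
    "TI": "Versatile control of Plasmodium falciparum gene expression "
          "with an inducible protein-RNA interaction.",
    "JT": "Nature communications",
}

print(format_citation(med_rec))
```

Using `.get` with defaults keeps the code working for records that lack a field.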

### See also
- Biopython tutorial http://biopython.org/DIST/docs/tutorial/Tutorial.html
- A list of accessible NCBI databases http://www.ncbi.nlm.nih.gov/gquery/
- A Q&A site where you can find help with databases and sequence analysis http://www.biostars.org