# Entrez queries

Before you start: NCBI asks for an email address with every query, so let's define a variable that can be reused throughout the notebook.

In [27]:
ENTREZ_EMAIL = "your email address here"
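
If the notebook is going to be shared, one option is to keep the address out of the notebook altogether and read it from the environment. The sketch below assumes an environment variable named `ENTREZ_EMAIL` (a name chosen here for illustration; nothing in NCBI or Biopython requires it) and falls back to the placeholder string when the variable is not set.

```python
import os


def get_entrez_email(default="your email address here"):
    """Return the address stored in the (hypothetical) ENTREZ_EMAIL
    environment variable, or the given placeholder if it is not set."""
    return os.environ.get("ENTREZ_EMAIL", default)


# Same variable name as in the cell above, so later cells keep working.
ENTREZ_EMAIL = get_entrez_email()
print(ENTREZ_EMAIL)
```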

In [2]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL
handle = Entrez.einfo()
result = handle.read()
handle.close()

In [3]:
print(result)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>

	<DbName>pubmed</DbName>
	<DbName>protein</DbName>
	<DbName>nuccore</DbName>
	<DbName>ipg</DbName>
	<DbName>nucleotide</DbName>
	<DbName>nucgss</DbName>
	<DbName>nucest</DbName>
	<DbName>structure</DbName>
	<DbName>sparcle</DbName>
	<DbName>genome</DbName>
	<DbName>annotinfo</DbName>
	<DbName>assembly</DbName>
	<DbName>bioproject</DbName>
	<DbName>biosample</DbName>
	<DbName>blastdbinfo</DbName>
	<DbName>books</DbName>
	<DbName>cdd</DbName>
	<DbName>clinvar</DbName>
	<DbName>clone</DbName>
	<DbName>gap</DbName>
	<DbName>gapplus</DbName>
	<DbName>grasp</DbName>
	<DbName>dbvar</DbName>
	<DbName>gene</DbName>
	<DbName>gds</DbName>
	<DbName>geoprofiles</DbName>
	<DbName>homologene</DbName>
	<DbName>medgen</DbName>
	<DbName>mesh</DbName>
	<DbName>ncbisearch</DbName>
	<DbName>nlmcatalog</DbName>
	<DbName

In [4]:
from Bio import Entrez
handle = Entrez.einfo()
record = Entrez.read(handle)

record.keys()

dict_keys(['DbList'])

In [5]:
record["DbList"]

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
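
To see how the raw XML printed earlier relates to this list, here is a minimal standard-library sketch that pulls the `DbName` entries out of a shortened, hand-copied sample of the EInfo response. It only illustrates the mapping from XML to a Python list; it is not how `Entrez.read` is implemented, and `db_names` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the EInfo XML shown above (first three entries only).
SAMPLE_XML = """<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
</DbList>
</eInfoResult>"""


def db_names(xml_text):
    """Return the text of every <DbName> element under <DbList>."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./DbList/DbName")]


print(db_names(SAMPLE_XML))
```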

In [6]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
record["DbInfo"]["Description"]

'PubMed bibliographic record'

In [7]:
record["DbInfo"]["Count"]

'29234208'

In [8]:
record["DbInfo"]["LastUpdate"]

'2019/01/03 15:47'

In [9]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
handle = Entrez.esearch(db="pubmed", term="biopython", retmax="20")
record = Entrez.read(handle)
"19304878" in record["IdList"]

True

The number of IDs returned, capped by `retmax`, is

In [10]:
len(record['IdList'])

20

which differs from the total number of records matching the query:

In [11]:
record['Count']

'24'
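
When `Count` is larger than `retmax`, the remaining IDs can be collected page by page. The sketch below assumes the search is wrapped in a callable taking `retstart` and `retmax` (both are standard ESearch parameters) and returning a record with `Count` and `IdList`, as above. `fetch_all_ids` and `fake_search` are hypothetical names; `fake_search` is an offline stand-in so the example runs without contacting NCBI.

```python
def fetch_all_ids(search, page_size=20):
    """Collect every ID by calling search(retstart=..., retmax=...) repeatedly."""
    ids = []
    total = None
    start = 0
    while total is None or start < total:
        record = search(retstart=start, retmax=page_size)
        total = int(record["Count"])   # Count arrives as a string
        page = list(record["IdList"])
        if not page:                   # defensive stop if a page comes back empty
            break
        ids.extend(page)
        start += len(page)
    return ids


def fake_search(retstart, retmax):
    """Offline stand-in that mimics the shape of an ESearch record with 24 hits."""
    all_ids = [str(i) for i in range(24)]
    return {"Count": "24", "IdList": all_ids[retstart:retstart + retmax]}


print(len(fetch_all_ids(fake_search, page_size=20)))
```

With Biopython, `search` could be something like `lambda retstart, retmax: Entrez.read(Entrez.esearch(db="pubmed", term="biopython", retstart=retstart, retmax=retmax))`.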

## Esearch

In [12]:
handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
record = Entrez.read(handle)

In [13]:
record["Count"]

'416'

In [15]:
record["IdList"][:10]

['MG522909.1',
 'MG522908.1',
 'MG522907.1',
 'MG522906.1',
 'MG522905.1',
 'MG522904.1',
 'MG522903.1',
 'MG522902.1',
 'MG522901.1',
 'MG522900.1']
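
The search term above combines field-tagged clauses with `AND`. If such terms are assembled from variables, a small helper can keep the syntax consistent. This is only a sketch: `build_term` is a hypothetical helper, and it handles nothing beyond plain `value[Field]` clauses joined by a single operator.

```python
def build_term(fields, operator="AND"):
    """Join {field_tag: value} pairs into an Entrez query such as
    'Cypripedioideae[Orgn] AND matK[Gene]'."""
    clauses = ["{}[{}]".format(value, tag) for tag, value in fields.items()]
    return " {} ".format(operator).join(clauses)


term = build_term({"Orgn": "Cypripedioideae", "Gene": "matK"})
print(term)
```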

In [16]:
handle = Entrez.esearch(db="nlmcatalog", term="computational[Journal]", retmax="20")
record = Entrez.read(handle)

In [17]:
print("{} computational journals found".format(record["Count"]))

158 computational journals found


In [18]:
print("The first 20 are\n{}".format(record["IdList"]))

The first 20 are
['101737789', '101736625', '101728813', '101723217', '101723351', '101719151', '101718871', '101717513', '101708081', '101707097', '101724357', '101721723', '101705423', '101703420', '101689612', '101738367', '101729936', '101726364', '101696157', '101660833']


These are not PubMed IDs. They belong to the database we searched, NLM Catalog. Here is the entry for record [101660833](https://www.ncbi.nlm.nih.gov/nlmcatalog/?term=101660833); the page itself is embedded below.

In [21]:
%%html 
<iframe src="https://www.ncbi.nlm.nih.gov/nlmcatalog/?term=101660833" width="700" height="400"></iframe>

## Epost

You can upload a list of identifiers and then run further tasks on them. The returned XML includes two important strings, `QueryKey` and `WebEnv`, which together define your history session. You would extract these values for use with another Entrez call such as EFetch.

In [22]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
id_list = ["19304878", "18606172", "16403221", "16377612", "14871861", "14630660"]
search_results = Entrez.read(Entrez.epost("pubmed", id=",".join(id_list)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]

In [26]:
print("Webenv: {}\nQuery Key: {}".format(webenv,query_key))

Webenv: NCID_1_119918942_130.14.22.215_9001_1546587838_1781743763_0MetA0_S_MegaStore
Query Key: 1
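
A sketch of how these two values might be carried into a follow-up call: the helper below turns an EPost result into the keyword arguments `webenv` and `query_key`, which `Entrez.efetch` accepts in place of an explicit `id` list. `history_kwargs` is a hypothetical helper, and the `WebEnv` string in the example is a placeholder, not a real session.

```python
def history_kwargs(post_result, **extra):
    """Build keyword arguments that point a later Entrez call at a history session."""
    params = {
        "webenv": post_result["WebEnv"],
        "query_key": post_result["QueryKey"],
    }
    params.update(extra)   # e.g. db, rettype, retmode
    return params


# Placeholder values with the same shape as the EPost result above.
example_result = {"WebEnv": "NCID_placeholder", "QueryKey": "1"}
kwargs = history_kwargs(example_result, db="pubmed", rettype="medline", retmode="text")
print(kwargs)
```

With the real `search_results` from the cell above, `Entrez.efetch(**history_kwargs(search_results, db="pubmed", rettype="medline", retmode="text"))` would then fetch the posted records.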


## Esummary

ESummary retrieves document summaries for a list of primary IDs.

In [29]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
handle = Entrez.esummary(db="nlmcatalog", id="101660833")
record = Entrez.read(handle)
info = record[0]["TitleMainList"][0]
print("Journal info\nid: {}\nTitle: {}".format(record[0]["Id"], info["Title"]))

Journal info
id: 101660833
Title: IEEE transactions on computational imaging.


In [43]:
record[0]["PublicationInfoList"][0]["DatesOfSerialPublication"]

'Began with Vol. 1, issue 1 (March 2015).'

## Efetch

EFetch is what you use when you want to retrieve a full record from Entrez. It works across several of the Entrez databases.

Requesting a specific file format from Entrez using Bio.Entrez.efetch() requires specifying the `rettype` and/or `retmode` optional arguments.

In [28]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
print(handle.read())

LOCUS       EU490707                1302 bp    DNA     linear   PLN 26-JUL-2016
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
            chloroplast.
ACCESSION   EU490707
VERSION     EU490707.1
KEYWORDS    .
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae;
            Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A., Endara,L.,
            Williams,N.H. and Moore,M.
  TITLE     Phylogenetic utility of ycf1 in orchids: a plastid gene more
            variable than matK
  JOURNAL   Plant Syst. Evol. 277 (1-2), 75-84 (2009)
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TIT

In [48]:
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="fasta", retmode="text")
print(handle.read())

>EU490707.1 Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast
ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTACTTGTGAAACGTTTAA
TTACTCGAATGTATCAACAGAATTTTTTGATTTCTTCGGTTAATGATTCTAACCAAAAAGGATTTTGGGG
GCACAAGCATTTTTTTTCTTCTCATTTTTCTTCTCAAATGGTATCAGAAGGTTTTGGAGTCATTCTGGAA
ATTCCATTCTCGTCGCAATTAGTATCTTCTCTTGAAGAAAAAAAAATACCAAAATATCAGAATTTACGAT
CTATTCATTCAATATTTCCCTTTTTAGAAGACAAATTTTTACATTTGAATTATGTGTCAGATCTACTAAT
ACCCCATCCCATCCATCTGGAAATCTTGGTTCAAATCCTTCAATGCCGGATCAAGGATGTTCCTTCTTTG
CATTTATTGCGATTGCTTTTCCACGAATATCATAATTTGAATAGTCTCATTACTTCAAAGAAATTCATTT
ACGCCTTTTCAAAAAGAAAGAAAAGATTCCTTTGGTTACTATATAATTCTTATGTATATGAATGCGAATA
TCTATTCCAGTTTCTTCGTAAACAGTCTTCTTATTTACGATCAACATCTTCTGGAGTCTTTCTTGAGCGA
ACACATTTATATGTAAAAATAGAACATCTTCTAGTAGTGTGTTGTAATTCTTTTCAGAGGATCCTATGCT
TTCTCAAGGATCCTTTCATGCATTATGTTCGATATCAAGGAAAAGCAATTCTGGCTTCAAAGGGAACTCT
TATTCTGATGAAGAAATGGAAATTTCATCTTGTGAATTTTTGGCAATCTTATTTTCACTTTTGGTCTCAA
CCGTATAGGATTCATATAAAGCAATTATCCAACTATTCCTTCTCTTTTCTGGGGTATTTT

In [56]:
from Bio import SeqIO
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
print(record.id)

EU490707.1


In [58]:
print(record.reverse_complement().seq[:50])

TTCTTCTTCCATAAAGAATTCTTCTAATAATCCCGAACCTAATCTTCGCA
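
For intuition, reverse-complementing a plain DNA string can be sketched with the standard library alone. This is an illustration under the assumption of unambiguous A/C/G/T input; it is not Biopython's implementation, which also handles ambiguity codes and returns a record rather than a string.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


# First ten bases of the FASTA record shown earlier.
print(reverse_complement("ATTTTTTACG"))
```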


## Elink

In [59]:
from Bio import Entrez
Entrez.email = ENTREZ_EMAIL    # Always tell NCBI who you are
pmid = "19304878"
record = Entrez.read(Entrez.elink(dbfrom="pubmed", id=pmid))

In [60]:
record[0]["DbFrom"]

'pubmed'

In [61]:
len(record[0]["LinkSetDb"])

8

In [62]:
for linksetdb in record[0]["LinkSetDb"]:
    print(linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"]))

pubmed pubmed_pubmed 227
pubmed pubmed_pubmed_alsoviewed 2
pubmed pubmed_pubmed_citedin 641
pubmed pubmed_pubmed_combined 6
pubmed pubmed_pubmed_five 6
pubmed pubmed_pubmed_refs 17
pubmed pubmed_pubmed_reviews 8
pubmed pubmed_pubmed_reviews_five 6


In [64]:
%%html 
<iframe src="https://www.ncbi.nlm.nih.gov/pubmed/19304878" width="700" height="400"></iframe>

In [68]:
for link in record[0]["LinkSetDb"][1]["Link"]:
    print(link["Id"])

14630660
19505943


> Why don't you retrieve the "also viewed" articles for these two articles, then keep following "also viewed" links to construct a network? :o
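
One possible way to approach that exercise is a breadth-first crawl with a depth limit. The sketch below assumes a callable `get_neighbors(pmid)` that returns the "also viewed" IDs for one article (for instance, built on `Entrez.elink`). `build_network` and the toy graph are made up for illustration, so the example runs offline and sends no requests to NCBI.

```python
from collections import deque


def build_network(seeds, get_neighbors, max_depth=2):
    """Breadth-first crawl: return (nodes, edges) reachable within max_depth hops."""
    seen = set(seeds)
    edges = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth >= max_depth:          # do not expand beyond the depth limit
            continue
        for neighbor in get_neighbors(node):
            edges.add((node, neighbor))
            if neighbor not in seen:    # visit each article only once
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen, edges


# Toy "also viewed" graph standing in for real ELink results.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": [], "D": ["E"]}
nodes, edges = build_network(["A"], lambda pmid: toy_graph.get(pmid, []), max_depth=2)
print(sorted(nodes), sorted(edges))
```

A real crawl would also need to pause between requests and keep `max_depth` small, since each hop multiplies the number of ELink calls.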

> remove email addresses from publication