### 1. Sandbox for Bioinformatics - NCBI APIs

1. [NCBI/Entrez - PubMed ESearch (XML format)](#id1)
2. [NCBI/Entrez - PubMed ESummary (XML format)](#id2)
3. [NCBI/Entrez - Combined ESearch/ESummary (json format)](#id3)
4. [NCBI/Entrez - EFetch Query and Parse using Biopython](#id4)
5. [NCBI/Entrez - EPost as Class Object](#id5)
6. [NCBI/Entrez - Entrez Functions as Class Objects](#id6)

#### Summary of Entrez Functions
**esearch** - Searches and retrieves primary IDs (for use in EFetch, ELink, and ESummary) and term translations, and optionally retains the results in the user's environment for future use.

**efetch** - Retrieves records in the requested format from a list of one or more primary IDs or from the user's environment.

**esummary** - Retrieves document summaries from a list of primary IDs or from the user's environment.

**einfo** - Provides field index term counts, last update, and available links for each database.

**elink** - Checks for the existence of an external or Related Articles link from a list of one or more primary IDs. Retrieves primary IDs and relevancy scores for links to Entrez databases or Related Articles; creates a hyperlink to the primary LinkOut provider for a specific ID and database, or lists LinkOut URLs and Attributes for multiple IDs.

**epost** - Posts a file containing a list of primary IDs for future use in the user's environment to use with subsequent search strategies.
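
All of these utilities are reached through the same base URL: the utility name, followed by `.fcgi` and a query string. The following is a minimal sketch of composing such a URL with the standard library; `build_eutils_url` is a hypothetical helper name used only for illustration, not part of any NCBI library.

```python
from urllib.parse import urlencode

BASE_ENTREZ = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_eutils_url(utility, **params):
    """Return the URL for one E-utility call (hypothetical helper)."""
    # Drop parameters that were left unset so they do not appear as "None".
    query = {key: value for key, value in params.items() if value is not None}
    # urlencode percent-encodes special characters such as the square brackets
    # used in field tags like [mesh].
    return f"{BASE_ENTREZ}{utility}.fcgi?{urlencode(query)}"

url = build_eutils_url("einfo", db="pubmed", retmode="json", api_key=None)
print(url)
```

Letting `urlencode` do the escaping avoids hand-writing `+AND+` style terms, at the cost of slightly less readable URLs.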

In [1]:
# Install Biopython only if it is not already available.
try:
  import Bio
except ImportError:
  !pip install biopython

<a name="id1"></a>

In [2]:
import requests
from lxml import etree

# ------------------------------------------------------------------------------
# NCBI/Entrez - PubMed ESearch (XML format)
# ------------------------------------------------------------------------------

# API key (left blank here; keep real keys out of shared notebooks)
api_key = ''

# Base request
base_entrez = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

db = 'pubmed' # one of numerous databases maintained by NCBI
query = 'cancer[mesh]+AND+epigenomics[mesh]+AND+2024[pdat]'

# 'query' is embedded in 'esearch'.
esearch = (f'esearch.fcgi?db={db}&term={query}&usehistory=y&retmax=3')
url = base_entrez + esearch # final url

response = requests.get(url) # GET

if response.status_code == 200: # successful
  # Parse XML w/ lxml
  tree = etree.fromstring(response.content)

  # Use lxml's API to extract data
  # (i.e. find all elements with a specific tag)
  pmid_elements = tree.findall(".//Id")
  # findtext returns None instead of raising if the element is missing
  webenv = tree.findtext(".//WebEnv") # needed for later requests
  querykey = tree.findtext(".//QueryKey") # also needed

  # Iterate over the PMID elements
  print("PMIDs retrieved:\n")
  for element in pmid_elements: # PubMed Journal Article ID (PMID)
    print(element.text)
  print(f"\nThe WebEnv value: {webenv}")
  print(f"\nThe QueryKey value: {querykey}")
else:
  print("Error fetching the XML data:", response.status_code) # not successful

PMIDs retrieved:

39623476
39596027
39588573

The WebEnv value: MCID_67677bfa2c7209c78f09201d

The QueryKey value: 1
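
The parsing step can also be tried without a network call. The sketch below uses the standard library's `xml.etree.ElementTree` in place of lxml and a hand-trimmed sample built from the values printed above; a real ESearch response carries additional elements that are omitted here, and `parse_esearch` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample using the values printed above (not a full response).
SAMPLE = """<eSearchResult>
  <QueryKey>1</QueryKey>
  <WebEnv>MCID_67677bfa2c7209c78f09201d</WebEnv>
  <IdList>
    <Id>39623476</Id>
    <Id>39596027</Id>
    <Id>39588573</Id>
  </IdList>
</eSearchResult>"""

def parse_esearch(xml_text):
    """Return (pmids, webenv, querykey) from ESearch XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    pmids = [element.text for element in root.findall(".//Id")]
    # findtext returns None when the element is absent, whereas indexing
    # findall(...)[0] would raise an IndexError.
    webenv = root.findtext(".//WebEnv")
    querykey = root.findtext(".//QueryKey")
    return pmids, webenv, querykey

pmids, webenv_value, querykey_value = parse_esearch(SAMPLE)
print(pmids, webenv_value, querykey_value)
```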


<a name="id2"></a>

In [None]:
import requests
from lxml import etree

# ------------------------------------------------------------------------------
# NCBI/Entrez - PubMed ESummary (XML format)
# ------------------------------------------------------------------------------

# API key
api_key = ''

# Build request
base_entrez = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

db = 'pubmed' # one of numerous databases maintained by NCBI

# Construct query to retrieve a single PMID only
esummary = (f"esummary.fcgi?db={db}&query_key={querykey}&WebEnv={webenv}&retmax=1")

url = base_entrez + esummary

response = requests.get(url) # GET
print(f"The URL retrieved: \n{url}")

if response.status_code == 200: # success
  # Parse XML w/ lxml
  tree = etree.fromstring(response.content)
  # Get the element PMID
  for element in tree.xpath(".//Id"):
    print(f"\nThe PMID: {element.text}")

  print("\nDOCUMENT TREE:\n")
  # Navigate the document tree
  for parent in tree:
    print(f"Parent: {parent.tag}")
    for child in parent:
      print(f"  Child: {child.tag}")
      for grandchild in child:
        print(f"    Grandchild: {grandchild.tag}, Attribute: {grandchild.attrib}, Text: {grandchild.text}")
else:
  print("Error fetching the XML data:", response.status_code) # not successful

The URL retrieved: 
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&query_key=1&WebEnv=MCID_67677bfa2c7209c78f09201d&retmax=1

The PMID: 39623476

DOCUMENT TREE:

Parent: DocSum
  Child: Id
  Child: Item
  Child: Item
  Child: Item
  Child: Item
    Grandchild: Item, Attribute: {'Name': 'Author', 'Type': 'String'}, Text: Kelly K
    Grandchild: Item, Attribute: {'Name': 'Author', 'Type': 'String'}, Text: Scherer M
    Grandchild: Item, Attribute: {'Name': 'Author', 'Type': 'String'}, Text: Braun MM
    Grandchild: Item, Attribute: {'Name': 'Author', 'Type': 'String'}, Text: Lutsik P
    Grandchild: Item, Attribute: {'Name': 'Author', 'Type': 'String'}, Text: Plass C
  Child: Item
  Child: Item
  Child: Item
  Child: Item
  Child: Item
  Child: Item
    Grandchild: Item, Attribute: {'Name': 'Lang', 'Type': 'String'}, Text: English
  Child: Item
  Child: Item
  Child: Item
  Child: Item
    Grandchild: Item, Attribute: {'Name': 'PubType', 'Type': 'String'}, Text: Jo
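
Walking the tree by hand shows the structure, but a dictionary keyed by each `Item`'s `Name` attribute is usually easier to work with. The following is a rough sketch on a hand-trimmed sample; the parent names `AuthorList` and `LangList` and the `Type="List"` attribute are assumptions about the response layout that should be checked against a live response, and `docsum_to_dict` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed sample; the list-item names and Type="List" are assumptions.
SAMPLE = """<eSummaryResult>
  <DocSum>
    <Id>39623476</Id>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Kelly K</Item>
      <Item Name="Author" Type="String">Scherer M</Item>
    </Item>
    <Item Name="LangList" Type="List">
      <Item Name="Lang" Type="String">English</Item>
    </Item>
  </DocSum>
</eSummaryResult>"""

def docsum_to_dict(docsum):
    """Flatten one DocSum element into a dict keyed by each Item's Name."""
    summary = {"Id": docsum.findtext("Id")}
    for item in docsum.findall("Item"):  # direct children only
        if item.get("Type") == "List":
            # Nested Items become a list of their text values.
            summary[item.get("Name")] = [child.text for child in item.findall("Item")]
        else:
            summary[item.get("Name")] = item.text
    return summary

root = ET.fromstring(SAMPLE)
doc = docsum_to_dict(root.find("DocSum"))
print(doc)
```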

<a name="id3"></a>

In [None]:
import requests
import json

# ------------------------------------------------------------------------------
# Combined ESearch/ESummary (json format)
# ------------------------------------------------------------------------------

# API key for NCBI
api_key = ''

# Base request
base_entrez = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

db = 'nucleotide'
query = 'viruses[orgn]'

# retmode is `json`.
esearch = (f'esearch.fcgi?db={db}&term={query}&usehistory=y&retmode=json&retmax=1')
url = base_entrez + esearch # final url

response = requests.get(url) # STEP 1: Search ----------------------------------
response.raise_for_status() # stop here if the search failed

search_data = response.json() # returns a Python dict
webenv = search_data['esearchresult']['webenv'] # need this
querykey = search_data['esearchresult']['querykey'] # and this

# Construct query to retrieve a single summary, as json.
esummary = (f"esummary.fcgi?db={db}&query_key={querykey}&WebEnv={webenv}&retmode=json&retmax=1")

url = base_entrez + esummary

response = requests.get(url) # STEP 2: Retrieve --------------------------------
response.raise_for_status() # stop here if the summary request failed

summary_data = response.json() # returns a Python dict

# Convert to JSON formatted string with 2 spaces of indentation
json_str = json.dumps(summary_data, indent=2)

print(json_str)

{
  "header": {
    "type": "esummary",
    "version": "0.3"
  },
  "result": {
    "uids": [
      "2871752916"
    ],
    "2871752916": {
      "uid": "2871752916",
      "term": "2871752916",
      "caption": "PQ779047",
      "title": "Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/NY-CUIMC-NP-15699/2023 ORF1ab polyprotein (ORF1ab) and ORF1a polyprotein (ORF1ab) genes, partial cds; surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), and ORF7b (ORF7b) genes, complete cds; ORF8 gene, complete sequence; and nucleocapsid phosphoprotein (N) and ORF10 protein (ORF10) genes, complete cds",
      "extra": "gi|2871752916|gb|PQ779047.1|",
      "gi": 2871752916,
      "createdate": "2024/12/18",
      "updatedate": "2024/12/18",
      "flags": "",
      "taxid": 2697049,
      "slen": 29177,
      "biomol": "genomic",
      "moltype": "rna",
      "topology": "not-set",
   

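In the JSON form, `result` holds a `uids` list plus one entry per UID, so individual fields can be read without any XML parsing. A small sketch follows, using a sample dictionary cut down from the output above; `pick_fields` is a hypothetical helper name.

```python
# Sample cut down from the ESummary output shown above.
sample = {
    "result": {
        "uids": ["2871752916"],
        "2871752916": {
            "uid": "2871752916",
            "caption": "PQ779047",
            "taxid": 2697049,
            "slen": 29177,
            "moltype": "rna",
        },
    }
}

def pick_fields(summary_json, fields=("caption", "slen", "moltype")):
    """Return one small dict per UID with only the requested fields."""
    result = summary_json.get("result", {})
    rows = []
    for uid in result.get("uids", []):
        record = result.get(uid, {})
        # .get() yields None for a missing field rather than raising KeyError.
        rows.append({field: record.get(field) for field in fields})
    return rows

rows = pick_fields(sample)
print(rows)
```
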
<a name="id4"></a>

In [None]:
import requests
import Bio
from Bio import SeqIO

# ------------------------------------------------------------------------------
# NCBI/Entrez - EFetch using Biopython
# ------------------------------------------------------------------------------

def download_file(url):
  """Downloads a SINGLE file given an NCBI EFetch URL.
     DOES NOT HANDLE MORE THAN ONE FILE!
  """
  response = requests.get(url)
  response.raise_for_status() # do not write an error page to disk
  with open('sequence.gb', 'wb') as out_file:
    out_file.write(response.content)

def seqrecord_to_dict(record):
  """Converts a Bio.SeqRecord.SeqRecord to a plain Python dictionary.
  """
  return {
      "id": record.id,
      "name": record.name,
      "description": record.description,
      "seq": str(record.seq),
      "features": [feature.__dict__ for feature in record.features],
      "annotations": record.annotations,
      "letter_annotations": record.letter_annotations
  }

# API key for NCBI
api_key = ''

# Base request
base_entrez = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

# Construct query to fetch a single record as a GenBank flat file.
# Reuses db, querykey and webenv from the previous cell.
efetch = (f"efetch.fcgi?db={db}&query_key={querykey}&WebEnv={webenv}&rettype=gb&retmode=text&retmax=1")

url = base_entrez + efetch

download_file(url)

with open("sequence.gb", "r") as handle:
  record = SeqIO.read(handle, "genbank")

print(f"Record id: {record.id}")
print(f"Record description: {record.description}")
print(f"Record seq: {record.seq}\n")

print("NCBI record as a dictionary:\n")
seqrecord_to_dict(record)


Record id: PQ779047.1
Record description: Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/NY-CUIMC-NP-15699/2023 ORF1ab polyprotein (ORF1ab) and ORF1a polyprotein (ORF1ab) genes, partial cds; surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), and ORF7b (ORF7b) genes, complete cds; ORF8 gene, complete sequence; and nucleocapsid phosphoprotein (N) and ORF10 protein (ORF10) genes, complete cds
Record seq: CTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTCT

{'id': 'PQ779047.1',
 'name': 'PQ779047',
 'description': 'Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/NY-CUIMC-NP-15699/2023 ORF1ab polyprotein (ORF1ab) and ORF1a polyprotein (ORF1ab) genes, partial cds; surface glycoprotein (S), ORF3a protein (ORF3a), envelope protein (E), membrane glycoprotein (M), ORF6 protein (ORF6), ORF7a protein (ORF7a), and ORF7b (ORF7b) genes, complete cds; ORF8 gene, complete sequence; and nucleocapsid phosphoprotein (N) and ORF10 protein (ORF10) genes, complete cds',
 'seq': 'CTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGGTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTG
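
The dictionary returned by `seqrecord_to_dict` can still contain nested objects (each entry in `features` comes straight from `feature.__dict__`), so passing it to `json.dumps` may fail. One possible workaround, sketched below with a made-up stand-in class rather than a real Biopython object, is to let `json.dumps` fall back to `str()` for anything it cannot serialize.

```python
import json

class FakeLocation:
    """Made-up stand-in for a non-JSON object nested in the dictionary."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __str__(self):
        return f"[{self.start}:{self.end}]"

# Illustrative data only; the feature below is not taken from the record.
record_dict = {
    "id": "PQ779047.1",
    "features": [{"type": "source", "location": FakeLocation(0, 10)}],
}

# default=str is called for every value json cannot serialize on its own.
json_text = json.dumps(record_dict, default=str, indent=2)
print(json_text)
```

The trade-off is that the nested objects are flattened to strings, so the JSON cannot be turned back into the original objects.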

<a name="id5"></a>

In [None]:
import requests
import json
import Bio
from Bio import SeqIO

# ------------------------------------------------------------------------------
# NCBI/Entrez - EPost as Class Objects - First Try
# ------------------------------------------------------------------------------

class Parameter(object):

  db = {1: 'assembly', 2: 'clinvar', 3: 'dbvar', 4: 'gap',
        5: 'gene', 6: 'genome', 7: 'mesh', 8: 'nuccore', 9: 'nucleotide',
        10: 'omim', 11: 'protein', 12: 'pubmed', 13: 'snp', 14: 'sra'}

  uids = [0]
  webenv = None
  retmode = 'json'
  rettype = None
  querykey = None


class EutilsRequest(object):
  def __init__(self, db):
    self.db = db
    self.id = None
    self.querykey = None
    self.tool = 'gene_genie'
    self.url = None
    self.contact = 'gerald.mccollam@gmail.com'
    self.apikey = ''
    self.status = 2
    self.request_error = None
    self.size = 1

  def prepare_base_qry(self, extend=None):
    """
    Returns the parameters required for every request, plus the API key
    (when set) and any extra parameters passed in `extend`.
    """
    # Build the dict once so the updates below are not thrown away.
    base = {'email' : self.contact, 'tool' : self.tool, 'db' : self.db}
    if self.apikey:
      base['api_key'] = self.apikey
    if extend:
      base.update(extend)
    return base

class EpostRequest(EutilsRequest):
  def __init__(self, parameter):
    super().__init__(parameter.db)
    self.uids = parameter.uids
    self.size = len(parameter.uids)
    self.webenv = parameter.webenv
    self.querykey = parameter.querykey
    self.retmode = parameter.retmode
    self.rettype = parameter.rettype


  def get_post_parameter(self):
    return self.prepare_base_qry(extend={'id':','.join(str(x) for x in self.uids),
                                         'WebEnv':self.webenv})

  def dump(self):
    # retstart and retmax are not set in __init__, so fall back to None.
    return {'db' : self.db,
            'WebEnv':self.webenv,
            'query_key' : self.querykey,
            'uids' : self.uids,
            'retmode' : self.retmode,
            'rettype' : self.rettype,
            'retstart' : getattr(self, 'retstart', None),
            'retmax' : getattr(self, 'retmax', None),
            'request_size' : self.size
            }

<a name="id6"></a>

In [None]:
import io
import time
import warnings
import json
import requests

# ------------------------------------------------------------------------------
# NCBI/Entrez - Entrez Functions as Function Objects
# Based on Biopython but uses the Requests lib.
# NCBI defaults to XML. Here I've tried to use JSON wherever possible.
# ------------------------------------------------------------------------------

email = 'gerald.mccollam@gmail.com'
max_tries = 3
sleep_between_tries = 15
tool = "gene_genie"
api_key = ''
local_cache = None

def esearch(db, term, **keywords):
  '''This function searches and retrieves primary IDs (for use in EFetch,
     ELink and ESummary).
  '''
  # ESearch has its own endpoint, and the search term must be sent with it.
  base_cgi = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  variables = {"db": db, "term": term}
  variables.update(keywords)
  request = _construct(base_cgi, variables)
  return _process(request)

def efetch(db, **keywords):
  '''This function retrieves records in the requested format from a list or
     set of one or more UIDs or from the user's environment.
  '''
  pass

def esummary(**keywords):
  '''This function retrieves document summaries from a list of primary IDs or
     from the user's environment.
  '''

  pass

def einfo(**keywords):
  '''This function returns a summary of the Entrez databases as a results handle.
  '''
  pass

def elink(**keywords):
  '''This function checks for the existence of an external or Related Articles
     link from a list of one or more primary IDs; retrieves IDs and relevancy
     scores for links to Entrez databases or Related Articles.
  '''
  pass

def epost(db, **keywords):
  '''Posts a file containing a list of primary IDs for future use.
  '''
  pass

def _get_params(params, join_ids=True):
  pass

def _format_ids(ids):
    '''
    Input: a single ID (int or str), or iterable of strings/ints,
    or a string of IDs separated by commas.
    '''
    if isinstance(ids, int):
        return str(ids) # Single int, convert it to str.

    if isinstance(ids, str):
        # Multiple IDs, remove white space.
        return ",".join(id.strip() for id in ids.split(","))

    # Not a string or integer, assume iterable
    return ",".join(map(str, ids))

def _has_api_key(request):
  pass

def _construct(base_cgi, params=None, join_ids=True):
  '''
  :param str base_cgi: base URL.
  :param params: Mapping containing options to pass to NCBI.
  :type params: dict or None (containing only strings).
  :param bool join_ids: Passed to ``_get_params``.
  :returns: A request object ready to pass to ``_process``.
  '''

  params = _get_params(params, join_ids=join_ids)
  # One way to build the request with Requests; _get_params is still a stub.
  return requests.Request("GET", base_cgi, params=params).prepare()

def _process(request):
  pass
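
The helpers above are still stubs. As one possible direction for `_get_params`, the sketch below drops unset options, joins IDs in the same way as `_format_ids`, and adds the identifying parameters when they are set. It is an assumption about the intended behavior, not the finished implementation, and `get_params_sketch` is a hypothetical name.

```python
def get_params_sketch(params, join_ids=True, email=None, tool=None, api_key=None):
    """Clean a parameter mapping before it is sent to NCBI (sketch only)."""
    # Leave out options that were never set.
    cleaned = {key: value for key, value in (params or {}).items() if value is not None}

    if join_ids and "id" in cleaned:
        ids = cleaned["id"]
        if isinstance(ids, int):
            cleaned["id"] = str(ids)
        elif isinstance(ids, str):
            # Comma-separated string: strip stray white space around each ID.
            cleaned["id"] = ",".join(part.strip() for part in ids.split(","))
        else:
            # Any other iterable of ints or strings.
            cleaned["id"] = ",".join(map(str, ids))

    # Add identifying parameters without overriding explicit values.
    for key, value in (("tool", tool), ("email", email), ("api_key", api_key)):
        if value and key not in cleaned:
            cleaned[key] = value
    return cleaned

example = get_params_sketch({"db": "pubmed", "id": [1, 2], "retmax": None}, tool="gene_genie")
print(example)
```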

