# Exercise 1.
# Using Entrez API to Fetch PubMed Data

The Entrez API provides programmatic access to various biomedical databases hosted by the National Center for Biotechnology Information (NCBI). In this task, we aim to retrieve the metadata of 1000 Alzheimer's papers and 1000 cancer papers from 2023 available in the PubMed database.

## Steps to Achieve the Task:

1. **Search for Papers**: Use the Entrez API to search for papers based on specific queries and retrieve their PubMed IDs.
2. **Fetch Paper Metadata**: For each retrieved PubMed ID, fetch the paper's metadata.
3. **Parse and Save Metadata**: Parse the fetched metadata to extract the required information and save it in a JSON format.

In [1]:
import requests
import time
import xml.dom.minidom as m
import json

In [2]:
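def get_id(disease):
    """Return up to 1000 PubMed IDs for papers matching `disease` published in 2023."""
    r = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        # Passing params lets requests URL-encode the spaces and brackets in the term.
        params={
            "db": "pubmed",
            "term": f"{disease} AND 2023[pdat]",
            "retmode": "xml",
            "retmax": 1000,
        },
    )
    time.sleep(1)  # pause between requests to stay within NCBI rate limits
    if r.status_code == 200:
        doc = m.parseString(r.text)
        IDs = doc.getElementsByTagName("Id")
        return [ID.childNodes[0].data for ID in IDs]
    else:
        print("Failed to fetch IDs:", r.status_code)
        return []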

In [3]:
def getText(nodelist):
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
        else:
            rc.append(getText(node.childNodes))
    return ''.join(rc)

In [4]:
def get_metadata(id_list, query_term):
    metadata_dict = {}
    if id_list:
        r = requests.post(
            "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
            data={
                "db": "pubmed",
                "retmode": "xml",
                "id": ",".join(id_list)
            }
        )
        time.sleep(1)
        if r.status_code == 200:
            doc = m.parseString(r.text)
            articles = doc.getElementsByTagName("PubmedArticle")
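            for article in articles:
                # Take the PMID from the record itself. Pairing articles with id_list
                # by position mislabels records if efetch returns fewer PubmedArticle
                # elements than requested (e.g. book entries) or a different order.
                pmid_node = article.getElementsByTagName("PMID")
                if not pmid_node:
                    continue
                pubmed_id = getText(pmid_node[0].childNodes)
                title_node = article.getElementsByTagName("ArticleTitle")
                abstract_nodes = article.getElementsByTagName("AbstractText")
                title = getText(title_node[0].childNodes) if title_node else "N/A"
                # Multiple AbstractText sections are joined with a single space.
                abstract = " ".join(getText(abstract_node.childNodes) for abstract_node in abstract_nodes) if abstract_nodes else "N/A"
                metadata_dict[pubmed_id] = {
                    "ArticleTitle": title,
                    "AbstractText": abstract,
                    "query": query_term
                }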
        else:
            print("Failed to fetch metadata:", r.status_code)
    return metadata_dict

In [5]:
if __name__ == "__main__":
    Alzheimers_id = get_id("Alzheimers")
    Cancer_id = get_id("cancer")

    common_ids = list(set(Alzheimers_id) & set(Cancer_id))
    print("Common IDs:", common_ids)

    Alzheimers_metadata = get_metadata(Alzheimers_id, "Alzheimers")
    Cancer_metadata = get_metadata(Cancer_id, "cancer")

    all_metadata = {**Alzheimers_metadata, **Cancer_metadata}
    with open('metadata.json', 'w') as f:
        json.dump(all_metadata, f, indent=4)
    print("Metadata saved to 'metadata.json'")

Common IDs: ['37895928', '37897137', '37895969', '37901920', '37902389', '37899058']
Metadata saved to 'metadata.json'


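The output shows that the two sets of papers (Alzheimer's and cancer) overlap: six PubMed IDs appear in both result lists and are reported as `common_ids`. Because `all_metadata` is built as `{**Alzheimers_metadata, **Cancer_metadata}`, an ID present in both dictionaries keeps only its cancer entry in `metadata.json`.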
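
For papers returned by both searches, a plain dictionary merge can store only one `query` value per ID. The following is a minimal sketch of a merge that records every query term instead, assuming the dictionary shape produced by `get_metadata`. The helper name `merge_metadata` and the `queries` field are hypothetical and not part of the original notebook, and the records are made-up placeholders rather than real PubMed data.

```python
def merge_metadata(*metadata_dicts):
    """Merge per-query metadata dicts, keeping every query term for shared IDs."""
    merged = {}
    for metadata in metadata_dicts:
        for pubmed_id, record in metadata.items():
            if pubmed_id not in merged:
                # First time this ID is seen: copy it and start a list of queries.
                merged[pubmed_id] = {
                    "ArticleTitle": record["ArticleTitle"],
                    "AbstractText": record["AbstractText"],
                    "queries": [record["query"]],
                }
            elif record["query"] not in merged[pubmed_id]["queries"]:
                # ID already present: only add the new query term.
                merged[pubmed_id]["queries"].append(record["query"])
    return merged


# Placeholder records for illustration only (hypothetical IDs and text).
alz = {
    "1": {"ArticleTitle": "A", "AbstractText": "a", "query": "Alzheimers"},
    "2": {"ArticleTitle": "B", "AbstractText": "b", "query": "Alzheimers"},
}
cancer = {
    "2": {"ArticleTitle": "B", "AbstractText": "b", "query": "cancer"},
    "3": {"ArticleTitle": "C", "AbstractText": "c", "query": "cancer"},
}

merged = merge_metadata(alz, cancer)
print(merged["2"]["queries"])
```

With this shape, a paper found by both searches appears once in the output and carries both labels, at the cost of changing the `query` string into a list.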

### Regarding the handling of multiple AbstractText fields:
**Concatenating with a Space**: This is a straightforward approach that makes the abstract easy to read and process later as a single string. However, it discards the boundaries between the original `AbstractText` elements, so the section structure of a structured abstract (its flow from one section to the next) cannot be recovered from the saved text.
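
If the section structure matters, an alternative is to keep each `AbstractText` element as its own entry. The sketch below assumes that structured abstracts carry a `Label` attribute on each `AbstractText` element, which is how PubMed XML commonly marks sections. The function name `abstract_sections` and the sample XML are illustrative assumptions, not part of the original notebook or real PubMed output.

```python
import xml.dom.minidom as m

# Made-up fragment that imitates a structured abstract.
SAMPLE_XML = """<Abstract>
  <AbstractText Label="BACKGROUND">Why the study was done.</AbstractText>
  <AbstractText Label="RESULTS">What was <i>found</i>.</AbstractText>
</Abstract>"""


def get_text(nodelist):
    """Recursively collect the text of a node list, including nested tags."""
    parts = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            parts.append(node.data)
        else:
            parts.append(get_text(node.childNodes))
    return "".join(parts)


def abstract_sections(article):
    """Return one {"label", "text"} dict per AbstractText element."""
    sections = []
    for node in article.getElementsByTagName("AbstractText"):
        # getAttribute returns "" when the attribute is absent; store None instead.
        label = node.getAttribute("Label") or None
        sections.append({"label": label, "text": get_text(node.childNodes)})
    return sections


doc = m.parseString(SAMPLE_XML)
sections = abstract_sections(doc)
for section in sections:
    print(section["label"], "->", section["text"])
```

A list of sections still serializes cleanly with `json.dump`, and the single-string form can be rebuilt at any time with `" ".join(s["text"] for s in sections)`.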