
# PubMed API Script

This Jupyter notebook demonstrates how to use the PubMed API (NCBI E-utilities) to fetch and process academic papers.
It includes functions for querying PubMed, fetching metadata, and identifying authors whose affiliations are not academic.

### Importing Required Libraries


In [1]:

import requests
import xml.etree.ElementTree as ET
import pandas as pd
from typing import List, Dict, Tuple
import string



## Function Definitions

The following section contains functions for:
- Querying the PubMed API
- Parsing XML responses
- Processing and categorizing papers


### Construct Query URLs
The `mkquery` function helps in constructing a URL with query parameters for API requests.

In [2]:
from urllib.parse import urlencode

def mkquery(base_url: str, params: Dict[str, str]) -> str:
    # urlencode percent-encodes values, so terms with spaces or '&' stay intact
    return f"{base_url}?{urlencode(params)}"
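
One detail worth checking when a query string is built by hand is how reserved characters are treated. The sketch below is illustrative only: the search term is a made-up example. It compares a plain `key=value` join with the standard library's `urllib.parse.urlencode`, then parses both strings back with `parse_qs` to see which one preserves the term.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical parameters; the term contains spaces and a reserved '&'
params = {"db": "pubmed", "term": "BRCA1 & cancer"}

# Plain join: the '&' inside the term looks exactly like a parameter separator
naive = "&".join(f"{key}={value}" for key, value in params.items())

# urlencode percent-encodes each key and value before joining
encoded = urlencode(params)

print(naive)    # db=pubmed&term=BRCA1 & cancer
print(encoded)  # db=pubmed&term=BRCA1+%26+cancer

# Round-trip both strings to see what a query-string parser recovers
print(parse_qs(naive)["term"])    # ['BRCA1 ']
print(parse_qs(encoded)["term"])  # ['BRCA1 & cancer']
```

Only the encoded form survives the round trip, which is why encoding the parameters is the safer default for free-text search terms.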


### Fetch XML Data
The `getXmlFromURL` function sends a request to the PubMed API and parses the XML response.

In [3]:
def getXmlFromURL(base_url: str, params: Dict[str, str]) -> ET.Element:
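    # A timeout keeps a stalled connection from hanging the notebook
    response = requests.get(mkquery(base_url, params), timeout=30)
    response.raise_for_status()
    # Parse the raw bytes so the parser honors the encoding declared in the XML
    return ET.fromstring(response.content)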
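
`ET.fromstring` raises `xml.etree.ElementTree.ParseError` when the body is not well-formed XML, for example when a response is truncated. The following is a minimal sketch of guarding against that case. `parse_xml` is a hypothetical helper that is not part of this notebook, and both payloads are hand-written rather than real API responses.

```python
import xml.etree.ElementTree as ET
from typing import Optional

def parse_xml(payload: bytes) -> Optional[ET.Element]:
    """Hypothetical helper: return the root element, or None if parsing fails."""
    try:
        return ET.fromstring(payload)
    except ET.ParseError as err:
        print(f"Could not parse response: {err}")
        return None

good = b'<?xml version="1.0" encoding="UTF-8"?><root><Id>1</Id></root>'
bad = b"<root><Id>1</Id>"  # truncated: the closing </root> tag is missing

root = parse_xml(good)
print(root.tag)                  # root
print(parse_xml(bad) is None)    # True (after the error message)
```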


### Fetch Paper IDs
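The `fetch_paper_ids` function retrieves PubMed IDs for papers matching a given query. Because it sets `usehistory=y`, it also returns the `QueryKey` and `WebEnv` values that identify the stored result set for later fetch calls.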

In [4]:
def fetch_paper_ids(query: str, max_results: int = 100) -> Tuple[List[str], str, str]:
    params = {
        'db': 'pubmed',
        'term': query,
        'retmax': str(max_results),
        'usehistory': 'y'
    }
    root = getXmlFromURL(BASEURL_SRCH, params)
    ids = [id_node.text for id_node in root.findall('.//Id')]
    query_key = root.findtext('.//QueryKey')
    web_env = root.findtext('.//WebEnv')
    return ids, query_key, web_env
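
To see what the `findall` and `findtext` calls pick out, the sketch below runs them against a hand-written XML snippet shaped like a search result. Every value in it (the IDs and the `WebEnv` string) is invented for illustration and is not real API output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; all values are made up
sample = """
<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>
""".strip()

root = ET.fromstring(sample)

# './/Id' matches every <Id> element at any depth below the root
ids = [id_node.text for id_node in root.findall('.//Id')]
query_key = root.findtext('.//QueryKey')
web_env = root.findtext('.//WebEnv')

print(ids)        # ['11111111', '22222222']
print(query_key)  # 1
print(web_env)    # EXAMPLE_WEBENV_TOKEN

# findtext returns None when the element is absent, so callers may want to check
print(root.findtext('.//Missing'))  # None
```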


### Fetch Paper Details
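The `fetch_paper_details` function retrieves detailed records for the stored search results, which it references through `query_key` and `web_env` rather than through individual PubMed IDs. A single call fetches only the first `batch_size` records, and an author is kept only when both a name and an affiliation are present.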

In [5]:
def fetch_paper_details(query_key: str, web_env: str, batch_size: int = 10) -> List[Dict]:
    params = {
        'db': 'pubmed',
        'query_key': query_key,
        'WebEnv': web_env,
        'retmax': str(batch_size),
        'retmode': 'xml'
    }
    root = getXmlFromURL(BASEURL_FTCH, params)
    papers = []

    for article in root.iter('PubmedArticle'):
        paper = {
            'PubmedID': article.findtext('.//PMID'),
            'Title': article.findtext('.//ArticleTitle'),
            'PublicationDate': article.findtext('.//PubDate/Year'),
            'Authors': [],
            'Affiliations': []
        }
        for author in article.findall('.//Author'):
            name = f"{author.findtext('ForeName', '')} {author.findtext('LastName', '')}".strip()
            affiliation = author.findtext('.//Affiliation')
            if name and affiliation:
                paper['Authors'].append(name)
                paper['Affiliations'].append(affiliation)
        papers.append(paper)
    return papers
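
Because one fetch call returns at most `retmax` records, retrieving every stored result means repeating the call with an increasing `retstart` offset. The generator below is a sketch under that assumption. `batch_params` is a hypothetical helper, not part of this notebook, and `total` would come from the number of IDs found by the search step.

```python
from typing import Dict, Iterator

def batch_params(query_key: str, web_env: str, total: int,
                 batch_size: int = 10) -> Iterator[Dict[str, str]]:
    """Hypothetical helper: yield one fetch parameter dict per batch."""
    for start in range(0, total, batch_size):
        yield {
            'db': 'pubmed',
            'query_key': query_key,
            'WebEnv': web_env,
            'retstart': str(start),     # zero-based offset of the first record
            'retmax': str(batch_size),  # upper bound on records per call
            'retmode': 'xml',
        }

# Placeholder history values; real ones come from the search step
offsets = [p['retstart'] for p in batch_params('1', 'EXAMPLE_WEBENV', total=25)]
print(offsets)  # ['0', '10', '20']
```

Each dict could be passed to `getXmlFromURL(BASEURL_FTCH, params)` in a loop. Pausing briefly between calls helps stay within NCBI's request-rate limits.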


### Check Academic Affiliation
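The `check_academic` function classifies an affiliation as academic when any whole word in it, after punctuation is removed and the text is lowercased, matches one of a fixed list of keywords.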

In [6]:
def check_academic(affiliation: str) -> bool:
    academic_keywords = ["school", "university", "college", "institute", "research", "lab"]
    affiliation = affiliation.translate(str.maketrans('', '', string.punctuation))
    affiliation = affiliation.lower().split()
    academic_keywords = set(academic_keywords)
    affiliation = set(affiliation)
    return len(academic_keywords.intersection(affiliation)) > 0
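
Whole-word matching has consequences that are easy to miss. The sketch below repeats the same tokenize-and-intersect steps on a few made-up affiliation strings. `KEYWORDS`, `tokens`, and the sample strings are names introduced here for illustration only.

```python
import string

KEYWORDS = {"school", "university", "college", "institute", "research", "lab"}

def tokens(affiliation: str) -> set:
    """Strip punctuation, lowercase, and split into a set of whole words."""
    cleaned = affiliation.translate(str.maketrans('', '', string.punctuation))
    return set(cleaned.lower().split())

# Made-up affiliation strings
samples = [
    "Department of Biology, Example University, USA",
    "Example Pharma Inc., New York, NY, USA",
    "Example Laboratory, Cambridge, UK",
    "Example Corp Research, Seattle, WA, USA",
]

# An affiliation counts as academic if it shares any word with KEYWORDS
results = {s: bool(KEYWORDS & tokens(s)) for s in samples}
for s, flag in results.items():
    print(flag, "|", s)
```

The first string matches on `university` and the second matches nothing. The third is not matched either, because `laboratory` is a different token from `lab`. The fourth is matched on `research` even though it names a company. Both edge cases suggest the keyword list is a rough heuristic and not a reliable classifier.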


### Process Papers
The `process_papers` function builds one DataFrame row per paper. Each row lists only the authors whose affiliations are not classified as academic, together with those affiliations.

In [7]:
def process_papers(papers: List[Dict]) -> pd.DataFrame:
    rows = []
    for paper in papers:
        non_academic_authors = []
        affiliations = []
        for author, affiliation in zip(paper['Authors'], paper['Affiliations']):
            is_academic = check_academic(affiliation)
            if not is_academic:
                non_academic_authors.append(author)
                affiliations.append(affiliation)
        rows.append({
            'PubmedID': paper['PubmedID'],
            'Title': paper['Title'],
            'Publication Date': paper['PublicationDate'],
            'Non-academic Author(s)': "; ".join(non_academic_authors),
            'Company Affiliation(s)': "; ".join(affiliations),
        })
    return pd.DataFrame(rows)



## Example Usage

The example demonstrates querying PubMed for papers and processing results.


In [None]:

# Constants for the PubMed API; the functions above read these globals when called
BASEURL_SRCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
BASEURL_FTCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

# Example query
query = "machine learning"

# Fetch PubMed IDs
ids, query_key, web_env = fetch_paper_ids(query)
print(f"Fetched {len(ids)} IDs")

# Fetch details
papers = fetch_paper_details(query_key, web_env)
print(f"Fetched details for {len(papers)} papers")

# Process papers
df = process_papers(papers)

# Show the non-academic authors found for each paper
df['Non-academic Author(s)']


Fetched 100 IDs
Fetched details for 10 papers
0                                                     
1                                                     
2                                                     
3                                                     
4                                                     
5     Zhe Zhao; Chang'e Liu; Qiuping Li; Xiaoyang Hong
6    Hannes De Meulemeester; Frank De Smet; Johan v...
7                                         Milan Špánik
8                                         Xiaodong Luo
9                                                     
Name: Non-academic Author(s), dtype: object
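
Rows with an empty value, such as rows 0 to 4 in the output above, are papers for which no author was classified as non-academic. A short sketch of keeping only the remaining rows follows. It uses a small hand-made DataFrame with placeholder values in place of the real `df`, and `demo_df` and `with_non_academic` are names chosen here for illustration.

```python
import pandas as pd

# Stand-in for the DataFrame returned by process_papers; values are placeholders
demo_df = pd.DataFrame({
    'PubmedID': ['1', '2', '3'],
    'Non-academic Author(s)': ['', 'A. Author; B. Author', ''],
})

# Keep only papers with at least one non-academic author
mask = demo_df['Non-academic Author(s)'] != ''
with_non_academic = demo_df[mask].reset_index(drop=True)

print(with_non_academic['PubmedID'].tolist())  # ['2']
```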
