# PubMed Article Search and Extractor

This Python script searches PubMed, one of the most comprehensive databases of medical and life science journal articles, and extracts publication metadata and abstracts. It uses the Entrez Programming Utilities (E-utilities) API provided by the National Center for Biotechnology Information (NCBI), accessed through Biopython's `Bio.Entrez` module, to run searches by keyword, author name, publication date, and other PubMed search fields.

## Features

- **Customizable Searches**: Perform searches using complex queries involving keywords, phrases, and logical operators (AND, OR).
- **Date Range Filtering**: Specify start and end dates to narrow down the search to articles published within a certain timeframe.
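- **Fetch All Records**: Pages through the result set in batches of up to 10,000 IDs, so you do not need to specify a maximum record count. Note that NCBI documents that ESearch returns only the first 10,000 matches for a PubMed query, so larger result sets need a narrower query or date range.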
- **Comprehensive Metadata Extraction**: Extract detailed information for each article, including titles, abstracts, authors, publication dates, journal names, and direct URLs to the PubMed database.
- **Excel Export**: Save the extracted article metadata and abstracts into an Excel spreadsheet, making it easy for users to review and analyze the data.
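
As an illustration of how the search and date-range features combine, the sketch below builds a PubMed query string from a boolean expression and an optional publication-date range. The helper name `build_query` is hypothetical and not part of the script; the `[Date - Publication]` field tag is the one the script itself uses.

```python
from datetime import date
from typing import Optional


def build_query(terms: str, mindate: Optional[str] = None,
                maxdate: Optional[str] = None) -> str:
    """Combine a boolean search expression with an optional publication-date range.

    Hypothetical helper for illustration; dates use the YYYY/MM/DD format.
    """
    if not mindate:
        # No start date: leave the expression untouched.
        return terms
    if not maxdate:
        # Mirror the script's behaviour of defaulting the end date to today.
        maxdate = date.today().strftime("%Y/%m/%d")
    # Parenthesise the user's expression so the date filter applies to all of it.
    return (f"({terms}) AND "
            f"({mindate}[Date - Publication] : {maxdate}[Date - Publication])")


query = build_query(
    "Carbapenem-Resistant AND (Pseudomonas aeruginosa OR Enterobacterales)",
    mindate="2020/01/01",
    maxdate="2020/12/31",
)
print(query)
```

Wrapping the original expression in parentheses is a design choice that keeps a top-level `OR` in the user's query from escaping the date filter.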

## How to Use

1. **Set Up Your Environment**: Ensure you have Python installed on your system. This script was developed and tested with Python 3.8+. You'll also need to install the required libraries:

    ```bash
    pip install biopython pandas openpyxl
    ```

2. **Configuration**: Before running the script, replace the placeholder `your_email@example.com` with your own email address. NCBI requires a contact email for use of the E-utilities API.

3. **Running the Script**:
    - To perform a search, you'll need to modify the `query` variable in the script to reflect your search criteria.
    - Optionally, set the `mindate` and `maxdate` variables (format `YYYY/MM/DD`) to restrict results to a publication date range. If `maxdate` is not provided, it defaults to the current date.
    - Execute the script. The results will be fetched and saved in an Excel file named `PubMed_Articles_Details.xlsx` by default.

4. **Reviewing Results**: Open the generated Excel file to review the fetched article details. Each row represents an article, with columns for each piece of metadata, including a direct URL to the article on PubMed.
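
The row layout described in step 4 could look roughly like the sketch below. This is an assumption about how records might be flattened, not the script's actual code: the function names `pubmed_url` and `article_row` and the column names are hypothetical.

```python
from typing import Dict, List


def pubmed_url(pmid: str) -> str:
    """Return the public PubMed page for a PMID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def article_row(pmid: str, title: str, authors: List[str], journal: str,
                pub_date: str, abstract: str) -> Dict[str, str]:
    """Flatten one article into a single spreadsheet row (illustrative columns)."""
    return {
        "PMID": pmid,
        "Title": title,
        "Authors": "; ".join(authors),  # one cell holding all author names
        "Journal": journal,
        "Publication Date": pub_date,
        "Abstract": abstract,
        "URL": pubmed_url(pmid),
    }


# Placeholder values only, not a real record.
row = article_row("12345678", "Example title", ["Doe J", "Roe A"],
                  "Example Journal", "2021/05/01", "Example abstract.")
print(row["URL"])
```

A list of such rows could then be written with `pandas.DataFrame(rows).to_excel("PubMed_Articles_Details.xlsx", index=False)`, which relies on `openpyxl` for `.xlsx` output.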

## Limitations and Considerations

- **API Rate Limits**: Without an API key, NCBI asks clients to send no more than three requests per second. If you plan to use this script heavily, consider obtaining an API key from NCBI, which raises the limit to ten requests per second.
- **Compliance with Terms of Use**: Ensure your use of the script complies with the NCBI's terms of service for automated querying and data usage.
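
To make the rate limit concrete, here is a minimal throttle sketch that spaces out calls to a fixed number per second. The class name `Throttle` and its parameters are hypothetical. Biopython's `Bio.Entrez` is understood to pace its own requests, so a helper like this is mainly relevant when calling E-utilities through another HTTP client; with Biopython, an API key can be supplied by setting `Entrez.api_key`.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out calls so that at most `max_per_second` happen each second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None  # time of the previous call

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return the delay applied."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
        if delay > 0:
            self._sleep(delay)
        # Record when this call is considered to have happened.
        self._last_call = now + delay
        return delay


throttle = Throttle(max_per_second=3)  # three per second without an API key
for _ in range(3):
    throttle.wait()  # call this before each E-utilities request
```

The clock and sleep functions are injectable so the pacing logic can be checked without real waiting.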

## Contributing

Contributions to enhance the functionality, improve efficiency, or extend the capabilities of this script are welcome. Please feel free to fork the repository, make your changes, and submit a pull request.

In [None]:
from Bio import Entrez
import pandas as pd
from datetime import datetime

def search_pubmed(query, email, db="pubmed", mindate=None, maxdate=None):
    """
    Search PubMed for articles matching a specific query within specified date ranges.

    Parameters:
    - query: The search query.
    - email: Your email address.
    - db: The database to search. "pubmed" for PubMed citations and abstracts, "pmc" for PubMed Central full-text articles.
    - mindate: The start date for filtering articles (format: YYYY/MM/DD).
    - maxdate: The end date for filtering articles (format: YYYY/MM/DD). Defaults to the current date if not provided.

    Returns:
    A list of article IDs.
    """
    Entrez.email = email  # Set the email

    # Default maxdate to today's date if not provided
    if not maxdate:
        maxdate = datetime.today().strftime('%Y/%m/%d')

    # Apply the date range once, inside the query itself. Passing mindate/maxdate
    # to esearch as well would duplicate the filter, and E-utilities expects
    # those two parameters to be supplied together.
    if mindate:
        query = f"({query}) AND ({mindate}[Date - Publication] : {maxdate}[Date - Publication])"

    # Initialize variables for pagination
    article_ids = []
    retstart = 0
    batch_size = 10000

    while True:
        search_handle = Entrez.esearch(db=db, term=query, retstart=retstart, retmax=batch_size)
        search_results = Entrez.read(search_handle)
        search_handle.close()

        # Update variables for pagination
        count = int(search_results["Count"])
        batch_ids = search_results["IdList"]
        article_ids.extend(batch_ids)
        retstart += len(batch_ids)

        # Stop once every ID is collected, or when a batch comes back empty
        # (without this check the loop would never terminate).
        if not batch_ids or len(article_ids) >= count:
            break

    return article_ids

# fetch_article_details and save_articles_to_excel are defined elsewhere in the script (not shown here, unchanged)

# Example usage
email = "your_email@example.com"
query = 'Carbapenem-Resistant AND (Pseudomonas aeruginosa OR Enterobacterales)'
mindate = "2020/01/01"
# maxdate is not provided and will default to the current date
article_ids = search_pubmed(query, email, db="pubmed", mindate=mindate)
articles = fetch_article_details(article_ids, email)
save_articles_to_excel(articles)
