<a href="https://colab.research.google.com/github/dkisselev-zz/mmc-pipeline/blob/main/Microbiome_Articles.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Microbiome Metadata Crisis project
Sam Degregori, PhD
SD IRACDA Fellow, Pediatrics
University of California San Diego
samdegregori@gmail.com

Code: Dmitry Kisselev

## IMPORTS & INITIAL SETUP

In [1]:
!pip install pymed google-generativeai boto3


In [2]:
from pymed import PubMed
import pandas as pd
import google.generativeai as genai
from google.colab import userdata
import requests
import time
import re
import xml.etree.ElementTree as ET
import boto3
from botocore.client import Config
from botocore import UNSIGNED
from botocore.exceptions import NoCredentialsError, ClientError

## PARAMETERS & CONFIGURATION

In [56]:
# --- Search Parameters ---
ENVIRONMENT = "oral"

# 1. Define your environments with a list of related keywords.
#    Using [Title/Abstract] (or [tiab]) makes the search highly relevant.
ENVIRONMENTS = {
    "oral": [
        '"oral"[tiab]', '"saliva"[tiab]', '"salivary"[tiab]', '"dental plaque"[tiab]',
        '"gingival"[tiab]', '"periodontal"[tiab]', '"subgingival"[tiab]', '"supragingival"[tiab]'
    ],
    "gut": [
        '"gut"[tiab]', '"gastrointestinal"[tiab]', '"intestinal"[tiab]',
        '"fecal"[tiab]', '"feces"[tiab]', '"colonic"[tiab]'
    ],
    "skin": [
        '"skin"[tiab]', '"cutaneous"[tiab]'
    ]
}

# 2. Define the core concepts related to the microbiome.
MICROBIOME_CONCEPTS = [
    '"microbiome"[Title/Abstract]', '"microbiota"[Title/Abstract]',
    '"microflora"[Title/Abstract]', '"dysbiosis"[Title/Abstract]'
]

# 3. Select your target environment for this run.
TARGET_ENVIRONMENT = "oral"

# 4. Programmatically build the final search query.
# This creates the (A OR B OR C) structure for each part of the query.
environment_query = "(" + " OR ".join(ENVIRONMENTS[TARGET_ENVIRONMENT]) + ")"
microbiome_query = "(" + " OR ".join(MICROBIOME_CONCEPTS) + ")"

# 5. Assemble the final, complete search term.
SEARCH_TERM = (
    f"({environment_query} AND {microbiome_query}) "
    "AND (ffrft[Filter]) "
    "NOT (Review[PT] OR Editorial[PT] OR Letter[PT] OR Comment[PT] OR Case Reports[PT])"
)

# SEARCH_TERM = f"{ENVIRONMENT} microbiome NOT (Review[Publication Type])"
MAX_RESULTS = 100

# --- LLM Configuration ---
LLM_MODEL = "gemini-1.5-flash" # gemini-2.5-flash

# --- Google API Configuration ---
try:
    API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=API_KEY)
except (ValueError, FileNotFoundError):
    raise ValueError("GOOGLE_API_KEY not found in Colab secrets. Please add it.")

# --- LLM Prompts ---
# Centralized prompts
PROMPTS = {
    "get_country": (
        "Based on the following metadata from a scientific article, suggest the most likely country of publication or the primary country "
        "associated with this research. Provide ONLY the country name.\n\nMetadata:\n{metadata_text}"
    ),
    "predict_article_type": (
        "Analyze the following article information to determine if it is a Review or Original Research. "
        "Provide a brief one-sentence rationale for your choice, followed by the classification. "
        "Use this format exactly:\n"
        "Rationale: [Your one-sentence reasoning]\n"
        "Classification: [Review or Original Research or Unknown]\n\n"
        "Information:\n{llm_input_text}"
    ),
    "predict_sequencing_type": (
        "Analyze the text below from a scientific article to identify all high-throughput sequencing methods used. Classify the method(s) into one or more of the following categories:\n"
        "- **16S:** ...\n"
        "- **Other:** ... such as rpoB, rpoC, cpn60, or ITS...\n"
        "- **Shotgun:** ...\n"
        "- **Unknown:** ...\n\n"
        "Instructions:\n"
        "1. List all methods you identify, separated by commas.\n"
        "2. If you identify 'Other', specify the gene in parentheses (e.g., 'Other (rpoB)').\n"
        "3. Extract a short quote, if there is no method identified use None value for it...\n\n"
        "Respond in the following format exactly:\n"
        "Primary Method(s): [Method1, Method2, ...]\n"
        "Supporting Quote:\n\n"
        "Text Snippet:\n"
        "{text_snippet}\n"
    ),
    "find_accession_code": (
        "Analyze the following text from a scientific article. Your task is to find sequencing data accession codes "
        "(e.g., PRJNA..., SRP..., SRA..., etc.) or a URL pointing to the data.\n\n"
        "Respond with ONLY the accession code(s) or the URL. If you find multiple, separate them with commas. "
        "If you find none, respond with 'NO RELEVANT INFORMATION'.\n\nText:\n{text_snippet}"
    ),
    "classify_environment": (
        "Analyze the following text from a scientific paper. Your task is to classify the primary microbiome environment being studied into ONE of the following categories.\n\n"
        "Allowed Categories:\n"
        "- Gastrointestinal Tract (e.g., gut, intestinal, fecal, colonic)\n"
        "- Oral Cavity (e.g., saliva, dental, periodontal, buccal)\n"
        "- Respiratory System (e.g., lung, nasal, airway)\n"
        "- Urogenital System (e.g., vaginal, urinary)\n"
        "- Dermal / Cutaneous (e.g., skin)\n"
        "- Ocular (e.g., eye)\n"
        "- Blood\n"
        "- Breast Milk\n"
        "- Cardiac (e.g., heart)\n"
        "- Wound\n"
        "- Tumor\n"
        "- Aquatic (e.g., marine, freshwater)\n"
        "- Terrestrial (e.g., soil, environmental)\n"
        "- Plant-Associated\n"
        "- Invertebrate-Associated\n"
        "- Multiple (if the study clearly focuses on two or more distinct environments, e.g., both gut and skin)\n"
        "- Not Sure (if you cannot determine the environment from the text)\n\n"
        "Respond with ONLY the category name from the list above.\n\n"
        "Text Snippet:\n"
        "{text_snippet}\n"
    ),
}
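
The query assembly in the cell above (two OR-groups joined with AND, followed by filters) can be factored into a reusable function. The following is a minimal sketch, not part of the original pipeline: the helper name `build_search_term` and its `excluded_types` parameter are assumptions introduced here for illustration, and the filter strings are copied from the cell above.

```python
def build_search_term(environment_terms, concept_terms,
                      excluded_types=("Review", "Editorial")):
    """Hypothetical helper: combine two OR-groups with AND, then append filters."""
    # Each group becomes "(A OR B OR C)".
    environment_query = "(" + " OR ".join(environment_terms) + ")"
    microbiome_query = "(" + " OR ".join(concept_terms) + ")"
    # Publication types to exclude, each tagged with [PT].
    exclusions = " OR ".join(f"{pub_type}[PT]" for pub_type in excluded_types)
    return (
        f"({environment_query} AND {microbiome_query}) "
        "AND (ffrft[Filter]) "
        f"NOT ({exclusions})"
    )


demo_term = build_search_term(
    ['"skin"[tiab]', '"cutaneous"[tiab]'],
    ['"microbiome"[Title/Abstract]'],
)
print(demo_term)
```

Keeping the assembly in one function makes it easier to build a query for each key of `ENVIRONMENTS` in a loop instead of editing `TARGET_ENVIRONMENT` by hand.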

## Utility classes and methods
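
Before the LLM is consulted, `ArticleProcessor.find_accession_codes` (defined in the cell below) first tries a plain regex over the whole text. The sketch below isolates that step so it can be tried without API keys. The pattern is copied from the class; the helper name `extract_accessions`, the upper-casing of matches, and the sample sentence with placeholder accession numbers are assumptions made for illustration.

```python
import re

# Same pattern as ArticleProcessor.accession_pattern.
ACCESSION_PATTERN = re.compile(
    r'(PRJNA\d+|PRJEB\d+|SRP\d+|ERP\d+|DRP\d+|SRA\d+)', re.IGNORECASE
)


def extract_accessions(text):
    """Hypothetical helper: return unique accession codes, upper-cased and sorted."""
    # Upper-casing merges case variants that the case-insensitive pattern matches.
    return sorted({match.upper() for match in ACCESSION_PATTERN.findall(text)})


# Made-up sentence with placeholder accession numbers (not real records).
sample_text = "Reads were deposited under PRJNA000000 (see also srp000000 and SRP000000)."
codes = extract_accessions(sample_text)
print(codes)
```

Upper-casing before de-duplication is a design choice of this sketch; the class itself keeps matches in their original case.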

In [61]:
class ArticleProcessor:
    """
    A class to process and enrich PubMed article data using LLMs and AWS S3.
    """
    def __init__(self, gemini_model_name, s3_config=None):
        """
        Initializes the processor with the Gemini model and an S3 client.
        """
        self.model = genai.GenerativeModel(gemini_model_name)
        if s3_config is None:
            s3_config = Config(signature_version=UNSIGNED)
        self.s3_client = boto3.client('s3', config=s3_config)

        # Pre-compile regex for accession number check
        self.accession_pattern = re.compile(r'(PRJNA\d+|PRJEB\d+|SRP\d+|ERP\d+|DRP\d+|SRA\d+)', re.IGNORECASE)

        # Add a dictionary for standardizing country names.
        self.COUNTRY_SYNONYMS = {
            'USA': 'United States',
            'U.S.A': 'United States',
            'U.S.A.': 'United States',
            'U.S.': 'United States',
            'UK': 'United Kingdom',
            'U.K.': 'United Kingdom',
            'P.R.C.': 'China',
            'PRC': 'China'
        }

        # Define the keywords for major, top-level sections of a paper.
        # This list will be used to determine the true boundaries of a section.
        self.MAJOR_SECTION_KEYWORDS = [
            'introduction', 'background',
            'methods', 'materials and methods', 'experimental procedures', 'study design',
            'results', 'discussion', 'conclusion',
            'ethics', 'acknowledgements', 'abbreviations',
            'author contributions', 'competing interests',
            'funding', 'references', 'supplementary'
        ]

    def _call_gemini(self, prompt, retries=3, delay=5):
        """
        Private method for all Gemini API calls.
        Handles API calls, response parsing, and basic error handling/retries.
        """
        for attempt in range(retries):
            try:
                response = self.model.generate_content(prompt)
                return response.text.strip()
            except Exception as e:
                print(f"Error calling Gemini API (Attempt {attempt + 1}/{retries}): {e}")
                if attempt < retries - 1:
                    time.sleep(delay)
                else:
                    return "LLM Error"

    def classify_environment(self, abstract, full_text_content=None):
        """
        Uses the LLM to classify the microbiome environment based on the abstract and,
        when available, the opening of the full text (a proxy for the introduction).
        Returns one of the predefined categories, defaulting to "Not Sure".
        """
        # --- Step 1: Prepare the context snippet ---
        context_parts = []
        if abstract:
            context_parts.append(f"ABSTRACT:\n{abstract}")

        # Use the beginning of the full text as a proxy for the introduction
        if full_text_content:
            # Taking the first 2500 chars of the body is a good heuristic for the intro
            intro_snippet = full_text_content[:2500]
            context_parts.append(f"\n\nINTRODUCTION (FROM FULL TEXT):\n{intro_snippet}")

        if not context_parts:
            return "Not Sure"

        text_snippet = "\n".join(context_parts)

        # --- Step 2: Define allowed categories and call the LLM ---
        allowed_categories = {
            'Gastrointestinal Tract', 'Oral Cavity', 'Respiratory System', 'Urogenital System',
            'Dermal / Cutaneous', 'Ocular', 'Blood', 'Breast Milk', 'Cardiac', 'Wound', 'Tumor', 'Aquatic',
            'Terrestrial', 'Plant-Associated', 'Invertebrate-Associated', 'Multiple', 'Not Sure'
        }

        prompt = PROMPTS["classify_environment"].format(text_snippet=text_snippet)
        llm_response = self._call_gemini(prompt).strip()

        # --- Step 3: Validate the response ---
        # If the LLM returns one of the exact category names, use it.
        if llm_response in allowed_categories:
            return llm_response
        else:
            # If the LLM fails to follow instructions, default to "Not Sure".
            print(f"Warning: LLM returned a non-standard environment class: '{llm_response}'. Defaulting to 'Not Sure'.")
            return "Not Sure"

    # --- Methods for Country Extraction ---
    def get_country_from_metadata(self, pubmed_id):
        """
        Fetches article metadata, uses an LLM to determine the country,
        and standardizes the result.
        """
        efetch_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pubmed_id}&retmode=xml"
        try:
            response = requests.get(efetch_url)
            response.raise_for_status()
            root = ET.fromstring(response.content)

            # Extract affiliations and journal info
            affiliations = " ".join([elem.text for elem in root.findall('.//AffiliationInfo/Affiliation') if elem.text])
            journal_titles = " ".join([elem.text for elem in root.findall('.//Journal/Title') if elem.text])
            metadata_text = f"{affiliations} {journal_titles}".strip()

            if not metadata_text:
                return "No Metadata"

            prompt = PROMPTS["get_country"].format(metadata_text=metadata_text)
            country_from_llm = self._call_gemini(prompt)

            # Clean up potential LLM additions like "Country: "
            cleaned_country = country_from_llm.replace("Suggested Country:", "").replace("Country:", "").strip()

            # --- Standardization Step ---
            # Use .upper() for a case-insensitive lookup in our synonym dictionary.
            # .get() returns the standardized name if found, otherwise it returns the
            # original cleaned_country name as a default.
            standardized_country = self.COUNTRY_SYNONYMS.get(cleaned_country.upper(), cleaned_country)

            return standardized_country

        except requests.exceptions.RequestException as e:
            print(f"Error fetching metadata for {pubmed_id}: {e}")
            return "Request Error"
        except ET.ParseError as e:
            print(f"Error parsing XML for {pubmed_id}: {e}")
            return "XML Error"

    # --- Methods for Article Classification ---
    def predict_article_type(self, title, abstract, publication_types):
        """
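        Classifies an article as 'Review', 'Original Research', or 'Unknown' from its
        title, abstract, and PubMed publication types.
        Returns a (classification, confidence) tuple; confidence is 'High', 'Medium', or 'Low'.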
        """
        # --- Tier 1: High-Confidence Rule-Based Check ---
        pub_types = {pt.lower() for pt in publication_types}
        REVIEW_TYPES = {'review', 'systematic review', 'meta-analysis'}
        # Simplified the original research types for broader applicability
        ORIGINAL_RESEARCH_TYPES = {'clinical trial', 'randomized controlled trial', 'comparative study'}

        if REVIEW_TYPES.intersection(pub_types):
            return 'Review', 'High'
        if ORIGINAL_RESEARCH_TYPES.intersection(pub_types):
            return 'Original Research', 'High'

        # --- Tier 2: LLM Analysis ---
        input_parts = []
        if title: input_parts.append(f"Title: {title}")
        if abstract: input_parts.append(f"Abstract: {abstract}")

        if not input_parts: return "Unknown", "Low"

        # Build the LLM input and request a classification.
        llm_input_text = "\n".join(input_parts)
        prompt = PROMPTS["predict_article_type"].format(llm_input_text=llm_input_text)
        response_text = self._call_gemini(prompt)

        llm_classification = "Unknown"
        for line in response_text.split('\n'):
            if line.lower().startswith("classification:"):
                llm_classification = line.split(":", 1)[1].strip()

        confidence = "Medium"
        text_to_search = llm_input_text.lower()

        if 'review' in llm_classification.lower():
            llm_classification = 'Review'
            if any(kw in text_to_search for kw in ['a review', 'systematic review', 'meta-analysis', 'narrative review']):
                confidence = 'High'
        elif 'original research' in llm_classification.lower():
            llm_classification = 'Original Research'
            if any(kw in text_to_search for kw in ['methods', 'results', 'we enrolled', 'we analyzed', 'our study', 'we conducted']):
                confidence = 'High'
        else:
            llm_classification = 'Unknown'
            confidence = 'Low'

        return llm_classification, confidence

    def _parse_sequencing_response(self, response_text):
        """
        Parses the structured LLM response for sequencing type and cleans the output,
        handling placeholder values (e.g., "None") so that failures are reported consistently.
        Returns a (sequencing_type, supporting_quote) tuple.
        """
        if "LLM Error" in response_text:
            return "Prediction Error", ""

        # --- Step 1: Parse the raw response from the LLM ---
        seq_type = "Unknown"
        quote = "No supporting quote found"

        for line in response_text.split('\n'):
            clean_line = line.lower().strip()
            if clean_line.startswith("primary method(s):"):
                seq_type = line.split(":", 1)[1].strip()
            elif clean_line.startswith("supporting quote:"):
                quote = line.split(":", 1)[1].strip()

        # --- Step 2: Cleanup and validation ---

        # Values that indicate the LLM could not identify a method.
        failure_keywords = ["unknown", "error", "no abstract", "none"]
        # Check if the returned type is a failure type. Use strip() to handle whitespace.
        is_type_failure = seq_type.strip().lower() in failure_keywords

        # Define unhelpful, placeholder quotes.
        placeholder_quotes = [
            "no supporting quote found",
            "not specified",
            "not mentioned",
            "n/a",
            "none",
            "[...]",
            "[]",
            ""
        ]
        is_quote_placeholder = quote.strip().lower() in placeholder_quotes

        # --- Final Decision Logic ---
        # If the type indicates a failure, standardize the output completely.
        if is_type_failure:
            seq_type = "Unknown"  # Standardize the type to our internal value.
            quote = ""            # Always clear the quote on failure.

        # If the type is valid but the quote is just a placeholder, clear the quote.
        elif is_quote_placeholder:
            quote = ""

        return seq_type, quote

    def predict_sequencing_type_advanced(self, abstract, keywords, full_text_content=None):
        """
        Predicts sequencing type using a tiered strategy: the abstract and keywords
        are tried first, then the 'Methods' section of the full text if available.
        Returns a (sequencing_type, supporting_quote) tuple.
        """
        # --- Tier 1: Use Abstract and Keywords ---
        print("Tier 1: Analyzing abstract and keywords...")
        context_parts = []
        if abstract:
            context_parts.append(f"Abstract:\n{abstract}")
        if keywords:
            context_parts.append(f"Keywords:\n{', '.join(keywords)}")

        if not context_parts:
            # If no abstract/keywords, proceed directly to full text search
            print("Tier 1 skipped: No abstract or keywords available.")
        else:
            initial_context = "\n\n".join(context_parts)
            prompt = PROMPTS["predict_sequencing_type"].format(text_snippet=initial_context)
            response_text = self._call_gemini(prompt)
            seq_type, seq_quote = self._parse_sequencing_response(response_text)

            # If we get a clear answer, we're done.
            if "unknown" not in seq_type.lower():
                return seq_type, seq_quote

        # --- Tier 2: Search Full Text ---
        if not full_text_content:
            print("Tier 2 skipped: No full text available.")
            return "Unknown", "" # Return a definite unknown if no other info is available

        print("Tier 2: Abstract/keywords inconclusive. Analyzing 'Methods' section of full text...")

        methods_keywords = ['method', 'materials and methods', 'experimental procedures', 'study design']
        methods_section = self._extract_section_text(full_text_content, methods_keywords)

        if not methods_section:
            print("Tier 2 failed: Could not find 'Methods' section.")
            return "Unknown", ""

        prompt = PROMPTS["predict_sequencing_type"].format(text_snippet=methods_section)
        response_text = self._call_gemini(prompt)

        return self._parse_sequencing_response(response_text)

    # --- S3 and Accession Code Search ---
    def get_s3_article_text(self, pmc_id):
        """
        Checks for a PMC article in public S3 buckets and returns its content if found.
        """
        if not pmc_id or not str(pmc_id).strip():
            return None, "No PMC ID"

        # The bucket is always the same, but the prefix within the bucket changes.
        bucket_name = 'pmc-oa-opendata'
        key_prefixes = ['oa_comm/txt/all/', 'oa_noncomm/txt/all/']

        object_key = f"{pmc_id}.txt"

        for prefix in key_prefixes:
            full_key = prefix + object_key
            try:
                # Use head_object for a fast check
                self.s3_client.head_object(Bucket=bucket_name, Key=full_key)
                # If it exists, get the full object
                response = self.s3_client.get_object(Bucket=bucket_name, Key=full_key)
                return response['Body'].read().decode('utf-8'), "Found"
            except ClientError as e:
                if e.response['Error']['Code'] == '404':
                    continue # Not in this prefix, try the next one
                else:
                    print(f"S3 ClientError for {pmc_id}: {e}")
                    return None, "S3 Error"
            except Exception as e:
                print(f"Unexpected error for {pmc_id}: {e}")
                return None, "S3 Error"

        return None, "Not Found in S3"

    def _map_all_sections(self, full_text):
        """
        Maps candidate section headings in the full text.
        Any single line that looks like a heading (starts with a capital letter and is
        5 to 101 characters long) is recorded with its title and start offset.
        """
        section_map = []
        # This regex identifies single lines that look like headings (reasonable length, starts with a capital).
        heading_pattern = re.compile(r"^\s*([A-Z][^\n]{4,100})$", re.MULTILINE)

        for match in heading_pattern.finditer(full_text):
            title = match.group(1).strip()

            # Simple filter for common non-headings
            if title.upper() in ['STRENGTHS AND LIMITATIONS OF THIS STUDY']:
                continue

            section_map.append({
                'title': title,
                'start': match.start()
            })
        return section_map

    def _extract_section_text(self, full_text, heading_keywords):
        """
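        Extracts the longest section whose heading contains one of `heading_keywords`.
        A section runs until the next heading that contains a major section keyword
        (see MAJOR_SECTION_KEYWORDS), so its subsections are included.
        The result is truncated to 25,000 characters; returns None if nothing matches.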
        """
        if not full_text:
            return None

        section_map = self._map_all_sections(full_text)
        if not section_map:
            return None

        longest_section_text = ""

        # Find all candidate sections that match our target keywords (e.g., "methods")
        for i, start_section in enumerate(section_map):
            # Check if the title of the section is one we're looking for
            if any(keyword in start_section['title'].lower() for keyword in heading_keywords):

                start_pos = start_section['start']
                end_pos = len(full_text) # Default to the end of the document

                # Look ahead to find the boundary
                for j in range(i + 1, len(section_map)):
                    next_section_title = section_map[j]['title'].lower()

                    # The boundary is the NEXT major section.
                    # We check if any of our major keywords are in the next section's title.
                    if any(major_keyword in next_section_title for major_keyword in self.MAJOR_SECTION_KEYWORDS):
                        end_pos = section_map[j]['start']
                        break # Found the boundary, stop looking ahead

                current_section_text = full_text[start_pos:end_pos]

                # We still keep the "longest section" logic to handle cases where "Methods"
                # might appear in both the abstract and the body.
                if len(current_section_text) > len(longest_section_text):
                    longest_section_text = current_section_text

        # Use a hard limit to avoid sending excessively large snippets to the LLM
        MAX_SNIPPET_LENGTH = 25000
        final_text = longest_section_text[:MAX_SNIPPET_LENGTH]

        return final_text if final_text else None

    def find_accession_codes(self, text_content):
        """
        Finds accession codes using a tiered search strategy.
        1. Runs a fast regex search on the whole text.
        2. If the regex finds nothing, picks a snippet for the LLM: the 'Data Availability'
           section, else 'Supplementary Material', else 'Methods'.
        3. As a last resort, uses a generic snippet (start and end of the text).
        Returns a (codes, note) tuple, where note describes how the codes were found.
        """
        if not text_content:
            return "", "No Text Provided"

        # --- Step 1: Fast Regex Search (Most Efficient) ---
        regex_matches = self.accession_pattern.findall(text_content)
        if regex_matches:
            # Return unique, sorted codes
            return ", ".join(sorted(set(regex_matches))), "Found by Regex"

        # --- Step 2: Intelligent Tiered Section Search for LLM ---
        snippet = None
        find_method_note = "Not Found" # Default note

        # Define the sections to search for, in order of priority
        search_tiers = {
            "Data Availability Statement": [
                'data availability', 'availability of data', 'data and material availability',
                'data access', 'deposition of data'
            ],
            "Supplementary Material": [
                'supplementary material', 'supporting information'
            ],
            "Methods Section": [
                'method', 'materials and methods', 'experimental procedures'
            ]
        }

        for section_name, keywords in search_tiers.items():
            section_text = self._extract_section_text(text_content, keywords)
            if section_text:
                snippet = section_text
                find_method_note = f"Found by LLM in {section_name}"
                print(f"Found potential section '{section_name}'. Using it for LLM search.")
                break # Found a section, stop searching

        # --- Step 3: Fallback to Generic Snippet if no section was found ---
        if not snippet:
            print("No specific section found. Falling back to generic snippet for LLM search.")
            find_method_note = "Found by LLM in Generic Snippet"
            # For long texts, keep only the first and last 4000 characters
            if len(text_content) > 8000:
                snippet = text_content[:4000] + "\n...\n" + text_content[-4000:]
            else:
                snippet = text_content

        # --- Step 4: Call LLM with the best available snippet ---
        prompt = PROMPTS["find_accession_code"].format(text_snippet=snippet)
        llm_result = self._call_gemini(prompt)

        # _call_gemini returns "LLM Error" on failure, so compare case-insensitively.
        if "NO RELEVANT INFORMATION" in llm_result.upper() or "LLM ERROR" in llm_result.upper():
            return "", "Not Found by LLM"
        else:
            # Return the LLM's finding and the method we used to find it
            return llm_result, find_method_note

    # --- NCBI SRA Methods ---
    def get_sra_sample_details(self, accession_code):
        """
        Queries the NCBI SRA database to validate an accession number
        and retrieve metadata about sample/run titles.

        Args:
            accession_code (str): A single SRA accession code (e.g., 'PRJEB76625').

        Returns:
            A tuple containing:
            - sra_name_example (str): An example title of a run.
            - differentiates_samples (str): 'Yes' if titles are different, 'No' if they are the same.
                                            Returns 'N/A' for errors or single-sample projects.
        """
        if not accession_code or not str(accession_code).strip():
            return "N/A", "N/A"

        # NCBI E-utilities base URL
        base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

        # Take the first accession code if multiple are listed
        main_accession = accession_code.split(',')[0].strip()

        try:
            # Use esearch to validate and get UIDs
            esearch_url = f"{base_url}esearch.fcgi?db=sra&term={main_accession}"
            # NCBI rate limits: max 3 requests/sec without API key
            # TODO: Use NCBI API Key instead
            time.sleep(0.4)
            response = requests.get(esearch_url)
            response.raise_for_status()

            root = ET.fromstring(response.content)
            id_list = [elem.text for elem in root.findall('.//Id')]

            if not id_list:
                print(f"SRA Validation Failed: Accession {main_accession} not found.")
                return "Invalid Accession", "N/A"

            # Use efetch to get full metadata for the UIDs
            efetch_url = f"{base_url}efetch.fcgi?db=sra&id={','.join(id_list)}&retmode=xml"
            time.sleep(0.4)
            response = requests.get(efetch_url)
            response.raise_for_status()

            sra_root = ET.fromstring(response.content)

            # Parse the XML to find all run titles
            # The title for each run is usually located at EXPERIMENT_PACKAGE/RUN_SET/RUN/TITLE
            all_titles = [elem.text for elem in sra_root.findall('.//RUN/TITLE') if elem.text]

            if not all_titles:
                # Sometimes titles are at the Experiment level
                all_titles = [elem.text for elem in sra_root.findall('.//EXPERIMENT/TITLE') if elem.text]

            if not all_titles:
                print(f"SRA Metadata Warning: No run titles found for {main_accession}.")
                return "No Titles Found", "N/A"

            # Analyze titles and return results
            sra_name_example = all_titles[0]

            # Check if there are multiple runs to compare
            if len(all_titles) <= 1:
                differentiates_samples = "N/A (Single Run)"
            else:
                # Use a set to find unique titles. If len > 1, they are different.
                unique_titles = set(all_titles)
                differentiates_samples = "Yes" if len(unique_titles) > 1 else "No"

            return sra_name_example, differentiates_samples

        except requests.exceptions.RequestException as e:
            print(f"NCBI API request error for {main_accession}: {e}")
            return "API Request Error", "N/A"
        except ET.ParseError as e:
            print(f"NCBI XML parse error for {main_accession}: {e}")
            return "XML Parse Error", "N/A"
        except Exception as e:
            print(f"An unexpected error occurred for {main_accession}: {e}")
            return "Unexpected Error", "N/A"

    # --- Static Utility Methods ---
    @staticmethod
    def find_pmc_id_in_xml(article_xml_element):
        """Finds the PMC ID from the article's XML element."""
        if not isinstance(article_xml_element, ET.Element): return None
        try:
            for article_id in article_xml_element.findall('.//ArticleId'):
                if article_id.get('IdType') == 'pmc':
                    return article_id.text
            return None
        except Exception as e:
            print(f"Error searching for PMC ID: {e}")
            return None

    @staticmethod
    def get_clean_article_title_from_xml(article_xml_element):
        """Extracts and cleans the article title from the XML."""
        if not isinstance(article_xml_element, ET.Element): return "N/A"
        try:
            title_element = article_xml_element.find('.//ArticleTitle')
            if title_element is not None:
                # Reconstruct text content, handling tags like <i>, <b>
                raw_title = "".join(title_element.itertext())
                # Clean up whitespace
                clean_title = re.sub(r'\s+', ' ', raw_title).strip()
                return clean_title
            return "N/A"
        except Exception as e:
            print(f"Error extracting title from XML: {e}")
            return "Error"

    @staticmethod
    def get_publication_year(article_xml_element):
        """
        Robustly finds the publication year by checking multiple XML locations.
        Checks for 'pubmed', 'entrez', 'JournalIssue', and 'accepted' dates in order of priority.

        Args:
            article_xml_element (xml.etree.ElementTree.Element): The root XML element for the article.

        Returns:
            A string representing the 4-digit year, or "N/A" if not found.
        """
        if not isinstance(article_xml_element, ET.Element):
            return "N/A"

        # Define XPath expressions in our desired order of preference
        xpath_priority_list = [
            ".//PubMedPubDate[@PubStatus='pubmed']",
            ".//PubMedPubDate[@PubStatus='entrez']",
            ".//Journal/JournalIssue/PubDate",
            ".//PubMedPubDate[@PubStatus='accepted']",
        ]

        for xpath in xpath_priority_list:
            date_element = article_xml_element.find(xpath)
            if date_element is not None:
                # The 'Year' tag is a direct child in all these cases
                year_element = date_element.find('Year')
                if year_element is not None and year_element.text:
                    # Found a valid year, return it and stop searching
                    return year_element.text.strip()

        # If the loop completes without finding a year in any of the prioritized locations
        return "N/A"

    @staticmethod
    def get_clean_abstract_from_xml(article_xml_element):
        """
        Robustly extracts the full abstract from XML.
        Handles multiple <AbstractText> sections and inline formatting tags (e.g., <i>).
        It preserves the section labels (e.g., BACKGROUND, METHODS) for better context.

        Args:
            article_xml_element (xml.etree.ElementTree.Element): The root XML element for the article.

        Returns:
            A single string containing the full, cleaned abstract, or an empty string if not found.
        """
        if not isinstance(article_xml_element, ET.Element):
            return ""

        abstract_element = article_xml_element.find('.//Abstract')
        if abstract_element is None:
            return ""

        abstract_parts = []
        # Iterate through all <AbstractText> nodes within the <Abstract> tag
        for abstract_text_node in abstract_element.findall('AbstractText'):
            # Get the section label (e.g., "BACKGROUND") if it exists
            label = abstract_text_node.get('Label', '').strip()

            # .itertext() correctly extracts all text, including from child tags like <i>
            text_content = "".join(abstract_text_node.itertext()).strip()

            if label:
                # Format with the label for better readability and context for the LLM
                abstract_parts.append(f"{label.upper()}: {text_content}")
            else:
                # If there's no label, just append the text
                abstract_parts.append(text_content)

        # Join all parts with a double newline to separate the sections
        return "\n\n".join(abstract_parts)
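
The fallback order in `get_publication_year` is easiest to see on a small record. The sketch below repeats the same XPath priority list on a hand-written, hypothetical XML fragment (not a real PubMed record). It has no `pubmed` date, so the `entrez` date wins over the journal issue date and the `accepted` date. The helper name `first_year` is illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only the date fields matter for this illustration.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <Article>
      <Journal><JournalIssue><PubDate><Year>2025</Year></PubDate></JournalIssue></Journal>
    </Article>
  </MedlineCitation>
  <PubmedData>
    <History>
      <PubMedPubDate PubStatus="accepted"><Year>2023</Year></PubMedPubDate>
      <PubMedPubDate PubStatus="entrez"><Year>2024</Year></PubMedPubDate>
    </History>
  </PubmedData>
</PubmedArticle>
"""

# Same order of preference as in get_publication_year above.
XPATH_PRIORITY = [
    ".//PubMedPubDate[@PubStatus='pubmed']",
    ".//PubMedPubDate[@PubStatus='entrez']",
    ".//Journal/JournalIssue/PubDate",
    ".//PubMedPubDate[@PubStatus='accepted']",
]

def first_year(article):
    """Return the first non-empty <Year> found in priority order, else "N/A"."""
    for xpath in XPATH_PRIORITY:
        date_element = article.find(xpath)
        if date_element is not None:
            year = date_element.findtext("Year")
            if year and year.strip():
                return year.strip()
    return "N/A"

article = ET.fromstring(SAMPLE_XML)
year = first_year(article)
print(year)
```

Note that the position of a date in the XML does not matter; only its position in the priority list does.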

In [None]:
# pyMed returns every ID found in the record (newline-separated) in the metadata,
# which breaks any URL built directly from that field.

def clean_study_link(link):
    if isinstance(link, str) and link.strip():
        # Split the string by newline characters
        parts = link.split('\n')
        # The first part should contain the base URL and the first ID
        first_part = parts[0]
        # Ensure it ends with a '/'
        if not first_part.endswith('/'):
            first_part += '/'
        return first_part
    else:
        return "" # Return empty string for non-string or empty values
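
A short usage sketch for the helper above. The function is repeated in compact form so the block runs on its own, and the raw link is a hypothetical value built to mimic the newline-separated IDs described in the comment.

```python
def clean_study_link(link):
    """Same logic as the helper above, repeated so this block is self-contained."""
    if isinstance(link, str) and link.strip():
        first_part = link.split('\n')[0]
        return first_part if first_part.endswith('/') else first_part + '/'
    return ""

# Hypothetical raw value: a PubMed URL followed by extra IDs on new lines.
raw_link = "https://pubmed.ncbi.nlm.nih.gov/40544684\n10.1000/example-doi\n"
cleaned = clean_study_link(raw_link)
print(cleaned)

# Non-string and empty inputs fall through to the empty string.
print(repr(clean_study_link(None)))
```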


## MAIN EXECUTION SCRIPT

In [62]:
# --- Initialization ---
print("Initializing PubMed Searcher and Article Processor...")
pubmed = PubMed(tool="PubMedSearcher", email="dmitry756@gmail.com")
processor = ArticleProcessor(gemini_model_name=LLM_MODEL)

Initializing PubMed Searcher and Article Processor...


In [6]:
# --- Perform Search ---
print(f"Searching PubMed for: '{SEARCH_TERM}'...")
results = pubmed.query(SEARCH_TERM, max_results=MAX_RESULTS)
articles = list(results)
print(f"Found {len(articles)} articles.")

Searching PubMed for: '(("oral"[tiab] OR "saliva"[tiab] OR "salivary"[tiab] OR "dental plaque"[tiab] OR "gingival"[tiab] OR "periodontal"[tiab] OR "subgingival"[tiab] OR "supragingival"[tiab]) AND ("microbiome"[Title/Abstract] OR "microbiota"[Title/Abstract] OR "microflora"[Title/Abstract] OR "dysbiosis"[Title/Abstract])) AND (ffrft[Filter]) NOT (Review[PT] OR Editorial[PT] OR Letter[PT] OR Comment[PT] OR Case Reports[PT])'...
Found 100 articles.


In [63]:
print("\n--- Step 1: Building initial DataFrame from PubMed metadata ---")
initial_data = []
for article in articles[0:15]:
    pmid = article.pubmed_id.split()[0]
    pub_date = processor.get_publication_year(article.xml)
    authors = [a.get('lastname', '') for a in article.authors if a.get('lastname')]
    author_string = f"{''.join(authors[:2])}{pub_date}"

    initial_data.append({
        'PMID': pmid,
        'author1author2Year': author_string,
        'StudyTitle': processor.get_clean_article_title_from_xml(article.xml),
        'Year': pub_date,
        'PMC': processor.find_pmc_id_in_xml(article.xml),
        # We will enrich the rest of the data in the next step
    })
df = pd.DataFrame(initial_data)


--- Step 1: Building initial DataFrame from PubMed metadata ---


In [64]:
# --- Step 2: Enrich DataFrame with API Calls (LLM, S3, NCBI) ---
# This consolidated loop handles all processing that requires external calls or full text.
# It is more efficient because S3 content is fetched only ONCE per article.
print("\n--- Step 2: Enriching DataFrame with API calls ---")

enrichment_results = []
for index, row in df.iterrows():
    article = articles[index] # Get the original pymed object using the index
    pmid = row['PMID']
    pmc_id = row['PMC']
    print(f"\nProcessing article {index + 1}/{len(df)} (PMID: {pmid})...")

    # --- Get Clean Data Once ---
    clean_title = df.loc[index, 'StudyTitle'] # We already have this from Step 1
    clean_abstract = processor.get_clean_abstract_from_xml(article.xml)
    keywords = article.keywords if hasattr(article, 'keywords') else []

    # --- S3 and Full-Text Processing ---
    text_content, s3_status = processor.get_s3_article_text(pmc_id)

    # --- Run ALL content-based analyses ---
    # Classify environment based on abstract and full text
    classified_environment = processor.classify_environment(clean_abstract, text_content)

    seq_type, seq_quote = processor.predict_sequencing_type_advanced(clean_abstract, keywords, text_content)
    accession_code, find_method = processor.find_accession_codes(text_content)

    # --- Other API Calls ---
    time.sleep(1) # Rate limit
    country = processor.get_country_from_metadata(pmid)
    time.sleep(1)

    # Pass clean data to this function as well
    pub_type = article.publication_types if hasattr(article, 'publication_types') else []
    article_type, type_confidence = processor.predict_article_type(clean_title, clean_abstract, pub_type)

    # --- SRA Lookup ---
    sra_example, sra_differentiates = processor.get_sra_sample_details(accession_code)

    enrichment_results.append({
        'PMID': pmid,
        'Environment': classified_environment,
        'Country': country,
        'Type': article_type,
        'TypeConfidence': type_confidence,
        'SequencingType': seq_type,
        'SequencingQuote': seq_quote,
        'S3_Status': s3_status,
        'AccessionCode': accession_code,
        'AccessionFindMethod': find_method,
        'SRAnameExample': sra_example,
        'SRAnamePresentAndDifferentiatesSamples': sra_differentiates,
        'Abstract': clean_abstract # Save the clean abstract to the DataFrame
    })

# Merge the enrichment results back into the main DataFrame
enrichment_df = pd.DataFrame(enrichment_results)
df = pd.merge(df, enrichment_df, on='PMID')


--- Step 2: Enriching DataFrame with API calls ---

Processing article 1/15 (PMID: 40544684)...
Tier 1: Analyzing abstract and keywords...
Tier 2 skipped: No full text available.

Processing article 2/15 (PMID: 40541100)...
Tier 1: Analyzing abstract and keywords...
Tier 2 skipped: No full text available.

Processing article 3/15 (PMID: 40540389)...
Tier 1: Analyzing abstract and keywords...
Tier 2 skipped: No full text available.

Processing article 4/15 (PMID: 40538752)...
Tier 1: Analyzing abstract and keywords...
Tier 2: Abstract/keywords inconclusive. Analyzing 'Methods' section of full text...
Tier 2 failed: Could not find 'Methods' section.
Found potential section 'Data Availability Statement'. Using it for LLM search.
SRA Validation Failed: Accession https://doi.org/10.1016/j.mtbio.2025.101945 not found.

Processing article 5/15 (PMID: 40535541)...
Tier 1: Analyzing abstract and keywords...
Found potential section 'Data Availability Statement'. Using it for LLM search.

Proces
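
The merge at the end of Step 2 assumes each PMID appears exactly once on both sides. A minimal sketch of how that assumption could be checked, using two rows taken from the results table and the `validate` argument of `pd.merge`. The frame names `df_meta` and `df_enrich` are illustrative and not part of the notebook.

```python
import pandas as pd

# Two rows mirroring the Step 1 and Step 2 outputs.
df_meta = pd.DataFrame({"PMID": ["40544684", "40541100"], "Year": ["2025", "2025"]})
df_enrich = pd.DataFrame({"PMID": ["40544684", "40541100"], "Country": ["Germany", "China"]})

# validate="one_to_one" raises MergeError if a PMID is duplicated on either side.
merged = pd.merge(df_meta, df_enrich, on="PMID", how="left", validate="one_to_one")
print(merged)

# A duplicated PMID in the enrichment results is caught instead of silently
# multiplying rows in the merged frame.
dup = pd.concat([df_enrich, df_enrich.iloc[[0]]], ignore_index=True)
try:
    pd.merge(df_meta, dup, on="PMID", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```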

In [65]:
# --- Step 3: Final Cleanup & Display ---
print("\n--- Step 3: Finalizing DataFrame ---")
final_columns = [
    'author1author2Year',
    'YourName', 'Environment',
    'StudyLink', 'StudyTitle', 'Year', 'Country',
    'Type', 'TypeConfidence',
    'SequencingType', 'SequencingQuote', 'AccessionCode', 'AccessionFindMethod',
    'SRAnameExample', 'SRAnamePresentAndDifferentiatesSamples', 'S3_Status',
    'PMC', 'PMID'
]
# Add any missing columns and set default values
for col in final_columns:
    if col not in df.columns:
        df[col] = ''
df['StudyLink'] = df['PMID'].apply(lambda pmid: f'https://pubmed.ncbi.nlm.nih.gov/{pmid}/')
# Fill in other static columns
df['YourName'] = 'DmitryKisselev'
# df['Environment'] = ENVIRONMENT

# Reorder columns and display
df = df[final_columns]
display(df)


--- Step 3: Finalizing DataFrame ---


Unnamed: 0,author1author2Year,YourName,Environment,StudyLink,StudyTitle,Year,Country,Type,TypeConfidence,SequencingType,SequencingQuote,AccessionCode,AccessionFindMethod,SRAnameExample,SRAnamePresentAndDifferentiatesSamples,S3_Status,PMC,PMID
0,EckermannKlein2025,DmitryKisselev,Oral Cavity,https://pubmed.ncbi.nlm.nih.gov/40544684/,Probiotics-embedded polymer films for oral hea...,2025,Germany,Original Research,Medium,Unknown,,,No Text Provided,,,No PMC ID,,40544684
1,TangMao2025,DmitryKisselev,Gastrointestinal Tract,https://pubmed.ncbi.nlm.nih.gov/40541100/,Cecal microbiota transplantation enhances calc...,2025,China,Original Research,High,Unknown,,,No Text Provided,,,No PMC ID,,40541100
2,HuangLu2025,DmitryKisselev,Gastrointestinal Tract,https://pubmed.ncbi.nlm.nih.gov/40540389/,Dietary Selenium Deficiency Accelerates the On...,2025,United States,Original Research,High,Unknown,,,No Text Provided,,,No PMC ID,,40540389
3,LeiCheng2025,DmitryKisselev,Gastrointestinal Tract,https://pubmed.ncbi.nlm.nih.gov/40538752/,Microalgal-enhanced cerium oxide nanotherapeut...,2025,China,Original Research,Medium,Unknown,,https://doi.org/10.1016/j.mtbio.2025.101945,Found by LLM in Data Availability Statement,Invalid Accession,,Found,PMC12178715,40538752
4,Kato-KogoeTsuda2025,DmitryKisselev,Oral Cavity,https://pubmed.ncbi.nlm.nih.gov/40535541/,Salivary microbiota and IgA responses are diff...,2025,Japan,Original Research,High,"16S, IgA-seq","""Further, 16S rRNA metagenomic analysis was pe...",,Not Found by LLM,,,Found,PMC12174153,40535541
5,WangHu2025,DmitryKisselev,Oral Cavity,https://pubmed.ncbi.nlm.nih.gov/40535082/,Oral microbiome and risk of lung cancer: resul...,2025,China,Original Research,High,Shotgun,"""Summary statistics for the oral microbiomes w...",,Not Found by LLM,,,Found,PMC12170216,40535082
6,Stolberg-MathieuMikkelsen2025,DmitryKisselev,Gastrointestinal Tract,https://pubmed.ncbi.nlm.nih.gov/40533209/,The MOTILITY Mother-Child Cohort: a Danish pro...,2025,Denmark,Original Research,High,"[16S, Shotgun]","""The primary outcome will be a characterisatio...",,Not Found by LLM,,,Found,PMC12182182,40533209
7,NakashimaShinohara2025,DmitryKisselev,Gastrointestinal Tract,https://pubmed.ncbi.nlm.nih.gov/40533160/,Phocaeicola dorei and Phocaeicola vulgatus Pro...,2025,Japan,Original Research,High,Unknown,,,No Text Provided,,,No PMC ID,,40533160
8,RussellCain2025,DmitryKisselev,Multiple,https://pubmed.ncbi.nlm.nih.gov/40533070/,Collecting at-Home Biometric Measures for Long...,2025,United States,Original Research,High,Unknown,,,No Text Provided,,,No PMC ID,,40533070
9,MatsumotoHitaka2025,DmitryKisselev,Multiple,https://pubmed.ncbi.nlm.nih.gov/40532026/,Exploration of predictive factors based on ora...,2025,Japan,Original Research,High,16S,"""Oral saliva and stool samples were analyzed f...",GSE293354,Found by LLM in Data Availability Statement,Invalid Accession,,Found,PMC12176287,40532026
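
The "add missing columns, then reorder" pattern in Step 3 can also be written as a single `reindex` call. This is a sketch of an alternative, not the notebook's method. It uses a shortened, illustrative column list and one row from the table above.

```python
import pandas as pd

# Shortened column list for illustration; the notebook's final_columns is longer.
final_columns = ["author1author2Year", "YourName", "StudyLink", "PMID"]

df_small = pd.DataFrame({
    "PMID": ["40544684"],
    "author1author2Year": ["EckermannKlein2025"],
})

# reindex creates any missing columns (filled with '') and applies the order in one step.
df_small = df_small.reindex(columns=final_columns, fill_value="")
df_small["StudyLink"] = "https://pubmed.ncbi.nlm.nih.gov/" + df_small["PMID"] + "/"
print(df_small)
```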


## Fix data

In [None]:
indx=5
seq = predict_sequencing_type_and_quote(df['Abstract'][indx])
df.at[indx,'SequencingType']=seq[0]
df.at[indx,'SequencingQuote']=seq[1]

In [None]:
for index, row in df.iterrows():
    # The helper is a static method on ArticleProcessor, so call it through the processor.
    title = processor.get_clean_article_title_from_xml(articles[index].xml)
    df.at[index,'StudyTitle']=title

## Debug and Test

In [41]:
df

Unnamed: 0,PMID,author1author2Year,StudyTitle,Year,PMC,Country,Type,SequencingType,SequencingQuote,S3_Status,AccessionCode,AccessionFindMethod,SRAnameExample,SRAnamePresentAndDifferentiatesSamples
0,40544684,EckermannKlein2025,Probiotics-embedded polymer films for oral hea...,2025,,Germany,Original Research,Other/Unknown,,No PMC ID,,No Text Provided,,
1,40544638,ZhangChen2025,Computationally-guided multifunctional nanopla...,2025,,China,Original Research,,,No PMC ID,,No Text Provided,,
2,40543709,BrehmWang2025,Inflammatory markers and microbiome dysbiosis ...,2025,,United States,Original Research,16S rRNA Amplicon,"""...analyzed both the oral and gut microbiome ...",No PMC ID,,No Text Provided,,
3,40542437,Duran-PinedoSolbiati2025,Correction: Longitudinal host-microbiome dynam...,2025,PMC12180239,United States,Review,No Abstract/Keywords,,Found,,Not Found by LLM,,
4,40541836,PostolacheSha2025,Porphyromonas gingivalis capsular K1 serotype ...,2025,,United States,Original Research,,,No PMC ID,,No Text Provided,,
5,40541100,TangMao2025,Cecal microbiota transplantation enhances calc...,2025,,China,Original Research,Other/Unknown,,No PMC ID,,No Text Provided,,
6,40540389,HuangLu2025,Dietary Selenium Deficiency Accelerates the On...,2025,,United States,Original Research,,,No PMC ID,,No Text Provided,,
7,40539790,AdamsCampbell2025,Legionella pneumophila type II secretome revea...,2025,,United States,Unknown,,,No PMC ID,,No Text Provided,,
8,40539731,LeeKim2025,Nasal Microbial Community Shifts Following Tre...,2025,,South Korea,Original Research,16S rRNA Amplicon,"""16S rRNA‐based sequencing, chronic rhinitis, ...",No PMC ID,,No Text Provided,,
9,40538752,LeiCheng2025,Microalgal-enhanced cerium oxide nanotherapeut...,2025,PMC12178715,China,Original Research,,,Found,,Not Found by LLM,,


In [12]:
# List available models
for m in genai.list_models():
  print(m.name)

models/embedding-gecko-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-2.5-pro-exp-03-25
models/gemini-2.5-pro-preview-03-25
models/gemini-2.5-flash-preview-04-17
models/gemini-2.5-flash-preview-05-20
models/gemini-2.5-flash
models/gemini-2.5-flash-preview-04-17-thinking
models/gemini-2.5-flash-lite-preview-06-17
models/gemini-2.5-pro-preview-05-06
models/gemini-2.5-pro-preview-06-05
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-preview-image-generation
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-

In [51]:
print(ET.tostring(articles[1].xml, encoding='unicode'))

<PubmedArticle><MedlineCitation Status="Publisher" Owner="NLM"><PMID Version="1">40552124</PMID><DateRevised><Year>2025</Year><Month>06</Month><Day>24</Day></DateRevised><Article PubModel="Electronic-eCollection"><Journal><ISSN IssnType="Electronic">2235-2988</ISSN><JournalIssue CitedMedium="Internet"><Volume>15</Volume><PubDate><Year>2025</Year></PubDate></JournalIssue><Title>Frontiers in cellular and infection microbiology</Title><ISOAbbreviation>Front Cell Infect Microbiol</ISOAbbreviation></Journal><ArticleTitle><i>Porphyromonas gingivalis</i>-induced periodontitis promotes neuroinflammation and neuronal loss associated with dysfunction of the brain barrier.</ArticleTitle><Pagination><StartPage>1559182</StartPage><MedlinePgn>1559182</MedlinePgn></Pagination><ELocationID EIdType="doi" ValidYN="Y">10.3389/fcimb.2025.1559182</ELocationID><Abstract><AbstractText Label="BACKGROUND" NlmCategory="UNASSIGNED">In our previous study, <i>Porphyromonas gingivalis</i> (<i>P. gingivalis</i>)-ind
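
The `<i>` markup in the `ArticleTitle` above is why the title helper joins `itertext()` instead of reading `.text`. A small sketch using that title element on its own:

```python
import re
import xml.etree.ElementTree as ET

title_xml = (
    "<ArticleTitle><i>Porphyromonas gingivalis</i>-induced periodontitis "
    "promotes neuroinflammation and neuronal loss associated with dysfunction "
    "of the brain barrier.</ArticleTitle>"
)
title_element = ET.fromstring(title_xml)

# .text only holds the text before the first child tag; here the title starts
# with <i>, so there is none.
text_only = title_element.text
print(repr(text_only))

# itertext() walks the element and all of its children in document order.
full_title = re.sub(r"\s+", " ", "".join(title_element.itertext())).strip()
print(full_title)
```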

In [34]:
text_content, s3_status = processor.get_s3_article_text('PMC12182182')
print(s3_status)
print(text_content[0:200])

Found

==== Front
BMJ Open
BMJ Open
bmjopen
bmjopen
BMJ Open
2044-6055
BMJ Publishing Group BMA House, Tavistock Square, London, WC1H 9JR

40533209
10.1136/bmjopen-2024-094965
bmjopen-2024-094965
Protocol
G


In [44]:
article_type, type_confidence = processor.predict_article_type(articles[0])


Analyze the following article information to determine if it is a Review or Original Research. Provide a brief one-sentence rationale for your choice, followed by the classification. Use this format exactly:
Rationale: [Your one-sentence reasoning]
Classification: [Review or Original Research or Unknown]

Information:
Title: Subgingival microbiota and genetic factors (A-2570G, A896G, and C1196T TLR4 polymorphisms) as periodontal disease determinants.
Abstract: Subgingival microbiota play an important role in maintaining oral health. Subgingival dysbiosis leads to the aggregation of highly pathogenic bacteria, and the host's genetics modulates the innate immune response. The interaction between these two factors plays an important role in the aggravation of periodontitis. Therefore, evaluating the association between the TLR-4 polymorphisms and subgingival microbiota in patients with periodontitis is necessary.
We included 58 cases with periodontitis and 53 controls without periodontiti

In [45]:
print(article_type, type_confidence)

Original Research Medium


In [35]:
clean_abstract = processor.get_clean_abstract_from_xml(articles[6].xml)
print(clean_abstract)

INTRODUCTION: Concurrent with infants' progression in dietary complexity and gut microbiome diversity, infants gradually change their defecation patterns during the first year of life. However, the links between bowel habits, the gut microbiota and early life nutrition remain unclear. The primary outcome is to characterise the gut microbiome development from birth to 1 year of age. Second, to investigate how bowel habits and nutrition in early life relate to the gut microbiome and metabolome during this period of life, and to explore how the development of the gut microbiome associates with host development.

METHODS AND ANALYSIS: The MOTILITY Mother-Child Cohort (MOTILITY) is a Danish prospective longitudinal cohort study enrolling up to 125 mother-infant dyads. Assessments occur at 36 weeks gestation (visit 1), birth (screening of infant) and 3, 6, 9 and 12 months (±2 weeks) post partum (visits 2-5). At visit 1, maternal anthropometrics, self-collected faecal and urine samples, and q

In [51]:
seq_type, seq_quote = processor.predict_sequencing_type_advanced(clean_abstract, articles[6].keywords, text_content)

Tier 1: Analyzing abstract and keywords...
Tier 2: Abstract/keywords inconclusive. Analyzing 'Methods' section of full text...


In [52]:
print(seq_type, seq_quote)

[16S, Shotgun] "The primary outcome will be a characterisation of changes in faecal microbial diversity and microbial community structures from longitudinal infant faecal samples collected biweekly from birth to 12 months of age, which will be analysed by 16S rRNA gene amplicon sequencing and shotgun metagenomic sequencing."


In [23]:
print(text_content)


==== Front
BMJ Open
BMJ Open
bmjopen
bmjopen
BMJ Open
2044-6055
BMJ Publishing Group BMA House, Tavistock Square, London, WC1H 9JR

40533209
10.1136/bmjopen-2024-094965
bmjopen-2024-094965
Protocol
Gastroenterology and Hepatology
1695
1506
The MOTILITY Mother-Child Cohort: a Danish prospective longitudinal cohort study of the infant gut microbiome, nutrition and bowel habits – a study protocol
https://orcid.org/0000-0002-4345-5182
Stolberg-Mathieu Gladys 1
https://orcid.org/0000-0001-7273-5733
Mikkelsen Lasse Sommer 1
Gottlieb Adam Duun 1
Mølgaard Christian 1
https://orcid.org/0000-0002-2504-8313
Roager Henrik M. 1
1 Department of Nutrition, Exercise and Sports, University of Copenhagen, Frederiksberg, Denmark
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ d

In [69]:
SEARCH_TERM

'(("oral"[tiab] OR "saliva"[tiab] OR "salivary"[tiab] OR "dental plaque"[tiab] OR "gingival"[tiab] OR "periodontal"[tiab] OR "subgingival"[tiab] OR "supragingival"[tiab]) AND ("microbiome"[Title/Abstract] OR "microbiota"[Title/Abstract] OR "microflora"[Title/Abstract] OR "dysbiosis"[Title/Abstract])) AND (ffrft[Filter]) NOT (Review[PT] OR Editorial[PT] OR Letter[PT] OR Comment[PT] OR Case Reports[PT])'

In [None]:
class Article():
  def __init__(self, article) -> None:
      self.author=''
      self.name=''
      self.environment=''
      self.study_link=''
      self.study_title=''
      self.year=''
      self.country=''
      self.sequencing_type=''
      self.sequencing_quote=''
      self.accession_code=''
      self.metadata_present=''
      self.sample_id_matches=''
      self.sra_name_present=''
      self.sra_name_example=''
      self.pmc=''
      self.notes=''
      self.questions=''
      self.article_type=''
      self.abstract=''

  # Placeholder methods (not implemented yet); instance methods need `self`.
  def get_country(self):
    pass

  def get_sequencing_type(self):
    pass

  def get_sequencing_quote(self):
    pass

  def get_article_type(self):
    pass



In [110]:
print(llm_search_for_accession(tt))

HTTPS://TP.AMEGROUPS.COM/ARTICLE/VIEW/10.21037/TP-2024-571/DSS


In [99]:
qq

['PRJNA1240943', 'PRJNA1229416']

In [113]:
df


Unnamed: 0,author1author2Year,YourName,Environment,StudyLink,StudyTitle,Year,Country,SequencingType,SequencingQuote,AccessionCode,...,SampleIDmatches?,SRAnamePresent,SRAnameExample,PMC,Notes,Questions,Type,Abstract,S3_Bucket,S3_Object_Key
0,ZhangZhang2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40525871/,"Will gut, oral, and vaginal microbiota influen...",2025,China,Prediction Error,"Error: ('Connection aborted.', RemoteDisconnec...",,...,,,,,,,Original Research,This study aims to examine the specific relati...,,
1,JarrettXia2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40525692/,Characterizing Microbiome Changes in Veno-Veno...,2025,USA,16S,"""Specimens underwent 16S sequencing to identif...",,...,,,,,,,Original Research,Microbiome analysis using metagenomics next-ge...,,
2,GongHuang2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40525407/,Combination of mitomycin C and low-dose metron...,2025,Taiwan,No Abstract,,,...,,,,,,,Unknown,,,
3,MavropoulosZaiton2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40524744/,Rapid Griess assay (RGA): a chairside test for,2025,Sweden,Unknown,,,...,,,,PMC12168414,,,Original Research,"Nitrite (NO\nFrom 12 healthy individuals, tong...",,
4,Saberi kakhkiHarks2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40524743/,Association of supragingival plaque management...,2025,Germany,16S,"""Subgingival microbiota were profiled via Illu...",,...,,,,PMC12168411,,,Prediction Error,This study examines the relationship between s...,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,HuangHuang2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40458824/,Therapeutic efficacy of fecal microbiota trans...,2025,China,Unknown,,,...,,,,PMC12127135,,,Original Research,This report presents the first documented appl...,pmc-oa-opendata,oa_comm/txt/all/PMC12127135.txt
96,ShanshanShu2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40458807/,Modulating the gut microbiota and inflammation...,2025,China,Unknown,,,...,,,,PMC12127415,,,Original Research,Diabetic nephropathy (DN) is a severe complica...,pmc-oa-opendata,oa_comm/txt/all/PMC12127415.txt
97,SiddiquiTsai2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40458201/,Utilizing a naturopathic mouthwash with select...,2025,United States,Unknown,,,...,,,,PMC12127372,,,Original Research,Oral rinses intended for the prevention and tr...,pmc-oa-opendata,oa_comm/txt/all/PMC12127372.txt
98,CortizoCasarin2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40457720/,Microbiome and Metaproteome of Craniofacial Im...,2025,Brazil,16S,"""Microbiome profiling was conducted through DN...",,...,,,,,,,Original Research,Craniofacial defects from cancer surgery led t...,,


In [191]:
df.to_csv('tt.csv')

In [159]:
df.at[indx,'SequencingType']

'Shotgun'

In [125]:
predict_sequencing_type_and_quote(df['Abstract'][18])

('16S',
 '"The use of antimicrobials reduced ... bacterial diversity, and genera such as Blautia and Turicibacter."')

In [133]:
predict_sequencing_type_and_quote(df['Abstract'][18])

('Shotgun', '"This modulation promoted BA production"')

In [189]:
df

Unnamed: 0,author1author2Year,YourName,Environment,StudyLink,StudyTitle,Year,Country,SequencingType,SequencingQuote,AccessionCode,...,SRAnameExample,PMC,Notes,Questions,Type,Abstract,S3_Bucket,S3_Object_Key,Regex_Accession_Codes,LLM_Accession_Search_Status
0,ZhangZhang2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40525871/,"Will gut, oral, and vaginal microbiota influen...",2025,China,Unknown,"""The underlying mechanism was furtherly confir...",,...,,,,,Original Research,This study aims to examine the specific relati...,,,,
1,JarrettXia2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40525692/,Characterizing Microbiome Changes in Veno-Veno...,2025,USA,16S,"""Specimens underwent 16S sequencing to identif...",,...,,,,,Original Research,Microbiome analysis using metagenomics next-ge...,,,,
2,GongHuang2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40525407/,Combination of mitomycin C and low-dose metron...,2025,Taiwan,No Abstract,,,...,,,,,Unknown,,,,,
3,MavropoulosZaiton2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40524744/,Rapid Griess assay (RGA): a chairside test for...,2025,Sweden,Unknown,,,...,,PMC12168414,,,Original Research,"Nitrite (NO\nFrom 12 healthy individuals, tong...",,,,
4,Saberi kakhkiHarks2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40524743/,Association of supragingival plaque management...,2025,Germany,16S,"""Subgingival microbiota were profiled via Illu...",,...,,PMC12168411,,,Prediction Error,This study examines the relationship between s...,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,HuangHuang2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40458824/,Therapeutic efficacy of fecal microbiota trans...,2025,China,Unknown,,,...,,PMC12127135,,,Original Research,This report presents the first documented appl...,pmc-oa-opendata,oa_comm/txt/all/PMC12127135.txt,,HTTPS://WWW.FRONTIERSIN.ORG/ARTICLES/10.3389/F...
96,ShanshanShu2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40458807/,Modulating the gut microbiota and inflammation...,2025,China,Unknown,,,...,,PMC12127415,,,Original Research,Diabetic nephropathy (DN) is a severe complica...,pmc-oa-opendata,oa_comm/txt/all/PMC12127415.txt,,NO RELEVANT INFORMATION.
97,SiddiquiTsai2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40458201/,Utilizing a naturopathic mouthwash with select...,2025,United States,Unknown,,,...,,PMC12127372,,,Original Research,Oral rinses intended for the prevention and tr...,pmc-oa-opendata,oa_comm/txt/all/PMC12127372.txt,,HTTPS://WWW.FRONTIERSIN.ORG/ARTICLES/10.3389/F...
98,CortizoCasarin2025,DmitryKisselev,oral,https://pubmed.ncbi.nlm.nih.gov/40457720/,Microbiome and Metaproteome of Craniofacial Im...,2025,Brazil,16S,"""Microbiome profiling was conducted through DN...",,...,,,,,Original Research,Craniofacial defects from cancer surgery led t...,,,,


In [190]:
print(df['StudyTitle'][29])

Roseburia intestinalis Modulates Immune Responses by Inducing M1 Macrophage Polarization.


In [185]:
get_clean_article_title_from_xml(articles[29].xml)

<Element 'ArticleTitle' at 0x7974178e4810>
Roseburia intestinalis Modulates Immune Responses by Inducing M1 Macrophage Polarization.


'N/A'

In [24]:
find_accession_codes_in_text(tt1)

['PRJEB76179']
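
The Step 2 output shows a DOI URL being passed to SRA validation as if it were an accession. A pattern check before the lookup would filter such values out. The sketch below is not the notebook's `find_accession_codes_in_text`; the regular expression is an assumption covering only BioProject (`PRJNA`/`PRJEB`/`PRJDB`), SRA study (`SRP`/`ERP`/`DRP`) and GEO series (`GSE`) identifiers, and `extract_accessions` is a hypothetical name.

```python
import re

# Assumed identifier shapes; extend as needed for other archives.
ACCESSION_PATTERN = re.compile(r"\b(?:PRJ(?:NA|EB|DB)\d+|[SED]RP\d+|GSE\d+)\b")

def extract_accessions(text):
    """Return accession-like tokens in order of first appearance, without duplicates."""
    seen = []
    for match in ACCESSION_PATTERN.findall(text.upper()):
        if match not in seen:
            seen.append(match)
    return seen

sample = (
    "Data are available under PRJEB76179. See also "
    "https://doi.org/10.1016/j.mtbio.2025.101945 and GSE293354."
)
codes = extract_accessions(sample)
print(codes)
```

The DOI is dropped because it matches none of the patterns, so it would never reach the SRA lookup.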

In [13]:
tt1=get_s3_file_content(buck[0],buck[1])

In [9]:
buck=check_pmc_file_in_s3('PMC12168284')

Checking s3://pmc-oa-opendata/oa_noncomm/txt/all/PMC12168284.txt
Checking s3://pmc-oa-opendata/oa_comm/txt/all/PMC12168284.txt
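
The two paths printed above follow a fixed layout in the `pmc-oa-opendata` bucket. A minimal sketch of how the candidate keys could be built from a PMC ID, assuming that layout; `candidate_s3_keys` is a hypothetical helper, and the existence check against the bucket is left out.

```python
PMC_BUCKET = "pmc-oa-opendata"
# Prefixes in the order shown in the log output above.
PREFIXES = ("oa_noncomm/txt/all/", "oa_comm/txt/all/")

def candidate_s3_keys(pmc_id):
    """Build the object keys to probe for a given PMC ID (with or without the 'PMC' prefix)."""
    pmc_id = pmc_id.strip().upper()
    if not pmc_id.startswith("PMC"):
        pmc_id = "PMC" + pmc_id
    return [f"{prefix}{pmc_id}.txt" for prefix in PREFIXES]

keys = candidate_s3_keys("PMC12168284")
for key in keys:
    print(f"s3://{PMC_BUCKET}/{key}")
```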


In [19]:
tt1[0:100]

'\n==== Front\nMicrobiome\nMicrobiome\nMicrobiome\n2049-2618\nBioMed Central London\n\n2123\n10.1186/s40168-02'
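
The plain-text files start with a `==== Front` marker. A sketch of splitting such a file into named blocks on lines that begin with `==== `. Only the `Front` marker appears in the outputs above; the `Body` marker in the sample is an assumption about the file layout, and `split_pmc_sections` is a hypothetical helper.

```python
import re

def split_pmc_sections(text):
    """Split text into {marker name: content} on lines of the form '==== Name'."""
    sections = {}
    current = None
    for line in text.splitlines():
        marker = re.match(r"^==== (\S.*)$", line)
        if marker:
            current = marker.group(1).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}

# Hypothetical miniature file; '==== Body' is assumed, not shown in the notebook output.
sample = "\n==== Front\nMicrobiome\n2049-2618\n\n==== Body\nIntroduction text.\n"
sections = split_pmc_sections(sample)
print(sections)
```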

In [22]:
buck[1]

'oa_comm/txt/all/PMC12168284.txt'

In [29]:
llm_search_for_accession(tt1)

'ACCESSION CODES: PRJEB76179\nURL: HTTPS://GITHUB.COM/AEMANN01/LONG_ORAL_MICROBIOME'