<a href="https://colab.research.google.com/github/dkisselev-zz/mmc-pipeline/blob/main/Get_ICD_11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install pymed google-generativeai boto3

Collecting pymed
  Downloading pymed-0.8.9-py3-none-any.whl.metadata (2.4 kB)
Collecting boto3
  Downloading boto3-1.38.45-py3-none-any.whl.metadata (6.6 kB)
Collecting botocore<1.39.0,>=1.38.45 (from boto3)
  Downloading botocore-1.38.45-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3)
  Downloading jmespath-1.0.1-py3-none-any.whl.metadata (7.6 kB)
Collecting s3transfer<0.14.0,>=0.13.0 (from boto3)
  Downloading s3transfer-0.13.0-py3-none-any.whl.metadata (1.7 kB)
Downloading pymed-0.8.9-py3-none-any.whl (9.6 kB)
Downloading boto3-1.38.45-py3-none-any.whl (139 kB)
Downloading botocore-1.38.45-py3-none-any.whl (13.7 MB)
Downloading jmespath-1.0.1-py3-none-any.whl (20 kB)
Downloading s3transfer-0.13.0-py3-n

In [2]:
from pymed import PubMed
import pandas as pd
import google.generativeai as genai
from google.colab import userdata
import requests
import time
import json
from typing import List, Dict, Optional, Tuple
import re
import xml.etree.ElementTree as ET
import boto3
from botocore.client import Config
from botocore import UNSIGNED
from botocore.exceptions import NoCredentialsError, ClientError

In [17]:
# userdata.get raises SecretNotFoundError when a secret is missing and
# NotebookAccessError when this notebook has not been granted access to it.
SECRET_ERRORS = (userdata.SecretNotFoundError, userdata.NotebookAccessError)

try:
    GEMINI_API_KEY = userdata.get('GOOGLE_API_KEY')
    genai.configure(api_key=GEMINI_API_KEY)
except SECRET_ERRORS as e:
    raise ValueError("GOOGLE_API_KEY not found in Colab secrets. Please add it.") from e

try:
    ICD11_CLIENT_ID = userdata.get('ICD11_CLIENT_ID')
    ICD11_CLIENT_SECRET = userdata.get('ICD11_CLIENT_SECRET')
except SECRET_ERRORS as e:
    raise ValueError("ICD11_CLIENT_ID or ICD11_CLIENT_SECRET not found in Colab secrets. Please add both.") from e

try:
    EMAIL = userdata.get('EMAIL')
except SECRET_ERRORS as e:
    raise ValueError("EMAIL not found in Colab secrets. Please add it.") from e

In [71]:
class DOIClassifier:
    def __init__(self, gemini_api_key: str, icd11_client_id: Optional[str] = None, icd11_client_secret: Optional[str] = None):
        """
        Initialize the DOI classifier with Gemini API key and optional ICD-11 OAuth credentials.

        Args:
            gemini_api_key (str): Google Gemini API key
            icd11_client_id (Optional[str]): WHO ICD-11 API client ID
            icd11_client_secret (Optional[str]): WHO ICD-11 API client secret
        """
        self.pubmed = PubMed(tool="DOIClassifier", email=EMAIL)
        genai.configure(api_key=gemini_api_key)
        self.model = genai.GenerativeModel('gemini-1.5-flash')

        # ICD-11 API endpoint and OAuth credentials
        self.icd11_api_base = "https://id.who.int/icd/entity"
        self.icd11_token_url = "https://icdaccessmanagement.who.int/connect/token"
        self.icd11_client_id = icd11_client_id
        self.icd11_client_secret = icd11_client_secret
        self.icd11_access_token = None
        self.icd11_token_expires_at = None

        # Set up headers for ICD-11 API calls
        self.icd11_headers = {
            'Accept': 'application/json',
            'API-Version': 'v2',
            'Accept-Language': 'en'
        }

        # Get initial access token if credentials are provided
        if self.icd11_client_id and self.icd11_client_secret:
            self._get_icd11_access_token()

    def _get_icd11_access_token(self):
        """Request an access token from WHO ICD-11 API using OAuth 2.0."""
        try:
            token_data = {
                'grant_type': 'client_credentials',
                'client_id': self.icd11_client_id,
                'client_secret': self.icd11_client_secret,
                'scope': 'icdapi_access'
            }

            # A timeout keeps the notebook from hanging if the token endpoint stalls
            response = requests.post(self.icd11_token_url, data=token_data, timeout=30)
            response.raise_for_status()

            token_info = response.json()
            self.icd11_access_token = token_info.get('access_token')
            expires_in = token_info.get('expires_in', 3600)  # Default to 1 hour

            # Calculate expiration time (subtract 5 minutes for safety)
            self.icd11_token_expires_at = time.time() + expires_in - 300

            if self.icd11_access_token:
                self.icd11_headers['Authorization'] = f'Bearer {self.icd11_access_token}'
                print("Successfully obtained ICD-11 API access token")
            else:
                print("Warning: Could not obtain ICD-11 access token")

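        except Exception as e:
            print(f"Error obtaining ICD-11 access token: {str(e)}")
            self.icd11_access_token = None
            # Drop any stale bearer header so later requests do not send an expired token
            self.icd11_headers.pop('Authorization', None)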

    def _ensure_valid_icd11_token(self):
        """Ensure we have a valid access token, refresh if needed."""
        if not self.icd11_client_id or not self.icd11_client_secret:
            return

        current_time = time.time()

        # Check if token is expired or will expire soon
        if (not self.icd11_access_token or
            not self.icd11_token_expires_at or
            current_time >= self.icd11_token_expires_at):
            self._get_icd11_access_token()

    def read_dois_from_file(self, file_path: str) -> List[str]:
        """
        Read DOIs from a newline-separated file.

        Args:
            file_path (str): Path to the file containing DOIs

        Returns:
            List[str]: List of DOIs
        """
        with open(file_path, 'r') as file:
            dois = [line.strip() for line in file if line.strip()]
        return dois

    def get_article_info(self, doi: str) -> Optional[Dict]:
        """
        Retrieve article title and abstract using pymed.

        Args:
            doi (str): DOI of the article

        Returns:
            Optional[Dict]: Dictionary containing title and abstract, or None if not found
        """
        try:
            # Search for the article using DOI
            query = f"{doi}[doi]"
            results = self.pubmed.query(query, max_results=1)

            for article in results:
                meta_xml = getattr(article, 'xml', '')
                title = self.get_clean_article_title_from_xml(meta_xml)
                abstract = self.get_clean_abstract_from_xml(meta_xml)

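                # The title helper returns the placeholders "N/A" or "Error" on failure,
                # so treat those as missing rather than as a real title.
                has_title = bool(title) and title not in ("N/A", "Error")
                if has_title or abstract: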
                    return {
                        'doi': doi,
                        'title': title,
                        'abstract': abstract
                    }

            return None

        except Exception as e:
            print(f"Error retrieving article for DOI {doi}: {str(e)}")
            return None

    def classify_disease_with_gemini(self, title: str, abstract: str) -> Dict:
        """
        Use Gemini to classify the disease mentioned in the article.

        Args:
            title (str): Article title
            abstract (str): Article abstract

        Returns:
            Dict: Dictionary containing disease name and confidence
        """
        prompt = f"""
        Analyze the following research article and identify the primary disease or medical condition being investigated.

        Title: {title}
        Abstract: {abstract}

        Please provide your response in the following JSON format. The "disease_name" should be the full, unabbreviated name and must not contain any acronyms or parentheses.
        {{
            "disease_name": "The specific disease or condition name",
            "confidence": "high/medium/low",
            "reasoning": "Brief explanation of why this disease was identified"
        }}

        Focus on the most specific disease name mentioned. If multiple diseases are mentioned, choose the primary one being investigated.
        """

        #  For the 'disease_name', provide only the full name, completely spelled out. For example, if the text mentions "Myocardial Infarction (MI)", the value should be "Myocardial Infarction", not "MI" or "Myocardial Infarction (MI)".
        try:
            response = self.model.generate_content(prompt)
            # Extract JSON from response
            response_text = response.text
            json_match = re.search(r'\{.*\}', response_text, re.DOTALL)

            if json_match:
                result = json.loads(json_match.group())
                return result
            else:
                return {
                    "disease_name": "Unknown",
                    "confidence": "low",
                    "reasoning": "Could not parse Gemini response"
                }

        except Exception as e:
            print(f"Error classifying disease: {str(e)}")
            return {
                "disease_name": "Unknown",
                "confidence": "low",
                "reasoning": f"Error: {str(e)}"
            }

    def search_icd11_code(self, disease_name: str) -> Dict:
        """
        Search for ICD-11 code and description for a given disease.

        Args:
            disease_name (str): Name of the disease

        Returns:
            Dict: Dictionary containing ICD-11 code and description
        """
        try:
            # Ensure we have a valid access token before making API calls
            self._ensure_valid_icd11_token()

            # Step 1: Search for foundation entities
            search_url = f"{self.icd11_api_base}/search"
            params = {
                'q': disease_name,
                # 'propertiesToBeSearched': 'Title,Definition,Exclusion,FullySpecifiedName',
                # 'useFlexisearch': 'true',
                'flatResults': 'true'
            }

            response = requests.get(search_url, params=params, headers=self.icd11_headers, timeout=30)
            response.raise_for_status()

            data = response.json()

            if data.get('destinationEntities'):
                entity = data['destinationEntities'][0]
                entity_url = entity.get('id')  # This is the full URL like "http://id.who.int/icd/entity/359051131"

                # Extract numeric ID from the entity URL
                entity_id_match = re.search(r'/entity/(\d+)$', entity_url)
                if not entity_id_match:
                    return {
                        'icd11_code': 'Error',
                        'icd11_description': 'Could not extract entity ID from URL',
                        'search_confidence': 'error'
                    }

                entity_id = entity_id_match.group(1)

                # Step 2: Get linearization data to get the actual ICD-11 code
                # Use latest MMS release (2025-01) and extract numeric ID from entity URI
                linearization_url = f"https://id.who.int/icd/release/11/2025-01/mms/{entity_id}"
                linearization_response = requests.get(linearization_url, headers=self.icd11_headers, timeout=30)

                if linearization_response.status_code == 200:
                    linearization_data = linearization_response.json()

                    icd11_code = linearization_data.get('code', '')
                    title_info = linearization_data.get('title', {})
                    icd11_description = title_info.get('@value', '') if isinstance(title_info, dict) else str(title_info)

                    return {
                        'icd11_code': icd11_code,
                        'icd11_description': icd11_description,
                        'search_confidence': 'high'
                    }

            return {
                'icd11_code': 'Not found',
                'icd11_description': 'No ICD-11 code found for this disease',
                'search_confidence': 'low'
            }

        except Exception as e:
            print(f"Error searching ICD-11 for {disease_name}: {str(e)}")
            return {
                'icd11_code': 'Error',
                'icd11_description': f'Error searching ICD-11: {str(e)}',
                'search_confidence': 'error'
            }

    def process_dois(self, doi_file_path: str, output_file: str = 'article_classification.csv'):
        """
        Process all DOIs in the file and save results to CSV.

        Args:
            doi_file_path (str): Path to file containing DOIs
            output_file (str): Output CSV file path
        """
        dois = self.read_dois_from_file(doi_file_path)
        results = []

        print(f"Processing {len(dois)} DOIs...")

        for i, doi in enumerate(dois, 1):
            print(f"Processing DOI {i}/{len(dois)}: {doi}")

            # Get article information
            article_info = self.get_article_info(doi)

            if article_info:
                # Classify disease
                disease_info = self.classify_disease_with_gemini(
                    article_info['title'],
                    article_info['abstract']
                )

                # Search ICD-11 code
                icd11_info = self.search_icd11_code(disease_info.get('disease_name', 'Unknown'))

                # Combine results (.get guards against keys missing from the Gemini reply)
                result = {
                    'doi': doi,
                    'title': article_info['title'],
                    'abstract': article_info['abstract'],
                    'disease_name': disease_info.get('disease_name', 'Unknown'),
                    'classification_confidence': disease_info.get('confidence', 'low'),
                    'classification_reasoning': disease_info.get('reasoning', ''),
                    'icd11_code': icd11_info['icd11_code'],
                    'icd11_description': icd11_info['icd11_description'],
                    'icd11_search_confidence': icd11_info['search_confidence']
                }

                results.append(result)

                # Add delay to avoid rate limiting
                time.sleep(1)
            else:
                print(f"Could not retrieve information for DOI: {doi}")

        # Save to CSV
        df = pd.DataFrame(results)
        df.to_csv(output_file, index=False)
        print(f"Results saved to {output_file}")

        return df


    @staticmethod
    def get_clean_article_title_from_xml(article_xml_element):
        """Extracts and cleans the article title from the XML."""
        if not isinstance(article_xml_element, ET.Element):
            return "N/A"
        try:
            title_element = article_xml_element.find('.//ArticleTitle')
            if title_element is not None:
                # Reconstruct text content, handling tags like <i>, <b>
                raw_title = "".join(title_element.itertext())
                # Clean up whitespace
                clean_title = re.sub(r'\s+', ' ', raw_title).strip()
                return clean_title
            return "N/A"
        except Exception as e:
            print(f"Error extracting title from XML: {e}")
            return "Error"

    @staticmethod
    def get_clean_abstract_from_xml(article_xml_element):
        """
        Robustly extracts the full abstract from XML.
        Handles multiple <AbstractText> sections and inline formatting tags (e.g., <i>).
        It preserves the section labels (e.g., BACKGROUND, METHODS) for better context.

        Args:
            article_xml_element (xml.etree.ElementTree.Element): The root XML element for the article.

        Returns:
            A single string containing the full, cleaned abstract, or an empty string if not found.
        """
        if not isinstance(article_xml_element, ET.Element):
            return ""

        abstract_element = article_xml_element.find('.//Abstract')
        if abstract_element is None:
            return ""

        abstract_parts = []
        # Iterate through all <AbstractText> nodes within the <Abstract> tag
        for abstract_text_node in abstract_element.findall('AbstractText'):
            # Get the section label (e.g., "BACKGROUND") if it exists
            label = abstract_text_node.get('Label', '').strip()

            # .itertext() correctly extracts all text, including from child tags like <i>
            text_content = "".join(abstract_text_node.itertext()).strip()

            if label:
                # Format with the label for better readability and context for the LLM
                abstract_parts.append(f"{label.upper()}: {text_content}")
            else:
                # If there's no label, just append the text
                abstract_parts.append(text_content)

        # Join all parts with a double newline to separate the sections
        return "\n\n".join(abstract_parts)
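
The abstract helper above joins every `<AbstractText>` node, upper-cases its `Label`, and keeps text nested in inline tags. The following is a minimal standalone sketch of that behaviour, not the class method itself. The XML record is hand-written for illustration (it is not real PubMed data) and the function name `join_abstract` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written record for illustration; not real PubMed data.
SAMPLE_XML = """
<PubmedArticle>
  <Article>
    <Abstract>
      <AbstractText Label="Background">Sleep varies with <i>Lactobacillus</i> levels.</AbstractText>
      <AbstractText Label="Methods">We sequenced 16S rRNA.</AbstractText>
      <AbstractText>Unlabelled closing remark.</AbstractText>
    </Abstract>
  </Article>
</PubmedArticle>
""".strip()


def join_abstract(article: ET.Element) -> str:
    """Mirror the labelled-section joining done by get_clean_abstract_from_xml."""
    abstract = article.find(".//Abstract")
    if abstract is None:
        return ""
    parts = []
    for node in abstract.findall("AbstractText"):
        label = node.get("Label", "").strip()
        # itertext() also yields the text inside child tags such as <i>
        text = "".join(node.itertext()).strip()
        parts.append(f"{label.upper()}: {text}" if label else text)
    return "\n\n".join(parts)


article = ET.fromstring(SAMPLE_XML)
abstract_text = join_abstract(article)
print(abstract_text)
```

Each labelled section becomes a `LABEL: text` paragraph, which is the form the Gemini prompt receives.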

In [72]:
# Initialize classifier
classifier = DOIClassifier(GEMINI_API_KEY, ICD11_CLIENT_ID, ICD11_CLIENT_SECRET)

# Process DOIs
doi_file = "dois.txt"  # Change this to your DOI file path
output_file = "article_classification.csv"

results_df = classifier.process_dois(doi_file, output_file)
print(f"Processing complete! Found information for {len(results_df)} articles.")

Successfully obtained ICD-11 API access token
Processing 423 DOIs...
Processing DOI 1/423: 10.1007/s10620-021-06857-y
Processing DOI 2/423: 10.1016/j.micres.2022.127010
Processing DOI 3/423: 10.1186/s12866-025-03849-0
Processing DOI 4/423: 10.1186/s40168-022-01400-1
Processing DOI 5/423: 10.1001/jamapsychiatry.2023.0685
Processing DOI 6/423: 10.1111/1462-2920.15441
Processing DOI 7/423: 10.1038/s41398-023-02325-5
Processing DOI 8/423: 10.21037/tp-2024-571
Processing DOI 9/423: 10.1126/scitranslmed.abk0855
Processing DOI 10/423: 10.1097/JU.0000000000002274
Processing DOI 11/423: 10.1038/s42003-023-05714-0
Processing DOI 12/423: 10.1016/j.ijantimicag.2019.09.009
Processing DOI 13/423: 10.1007/s10815-022-02688-6
Processing DOI 14/423: 10.1002/ohn.1014
Processing DOI 15/423: 10.1186/s12866-024-03709-3
Processing DOI 16/423: 10.3389/fcimb.2022.914749
Processing DOI 17/423: 10.1097/ICO.0000000000002940
Processing DOI 18/423: 10.1136/gutjnl-2020-322771
Processing DOI 19/423: 10.3389/fimmu.202
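
The token message at the top of this output comes from `_get_icd11_access_token`. Over a long run the class refreshes the token once the stored expiry time (lifetime minus a five-minute margin) has passed. The sketch below isolates that rule with an injectable clock so it can be checked without calling the WHO endpoint. `TokenCache`, `fake_fetch`, and the token strings are hypothetical names made up for this example.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Holds a bearer token and refetches it once the safety margin is reached."""

    def __init__(self, fetch: Callable[[], dict],
                 clock: Callable[[], float] = time.time,
                 margin: float = 300.0):
        self._fetch = fetch          # returns a dict like the OAuth token response
        self._clock = clock          # injectable so tests can control time
        self._margin = margin        # seconds subtracted from the token lifetime
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> Optional[str]:
        if self._token is None or self._clock() >= self._expires_at:
            info = self._fetch()
            self._token = info.get("access_token")
            self._expires_at = self._clock() + info.get("expires_in", 3600) - self._margin
        return self._token


# Demo with a fake clock and a fake token endpoint (both hypothetical).
now = [0.0]
calls = []

def fake_fetch() -> dict:
    calls.append(now[0])
    return {"access_token": f"token-{len(calls)}", "expires_in": 3600}

cache = TokenCache(fake_fetch, clock=lambda: now[0])
first = cache.get()       # no token yet, so this fetches
now[0] = 3000.0
second = cache.get()      # 3000 < 3300, cached token is reused
now[0] = 3300.0
third = cache.get()       # margin reached, token is refreshed
```

With a one-hour lifetime and a 300-second margin, the refresh happens at 3300 seconds rather than at 3600.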

In [73]:
results_df

Unnamed: 0,doi,title,abstract,disease_name,classification_confidence,classification_reasoning,icd11_code,icd11_description,icd11_search_confidence
0,10.1007/s10620-021-06857-y,Effects of Proton Pump Inhibitors on the Small...,BACKGROUND: Proton pump inhibitor (PPI) use is...,Small intestinal bacterial overgrowth,high,The abstract explicitly states that the study ...,Not found,No ICD-11 code found for this disease,low
1,10.1016/j.micres.2022.127010,The urinary microbiome and biological therapeu...,The discovery of microbial communities in the ...,Urinary tract infections,high,The title and abstract explicitly state that t...,GC08,"Urinary tract infection, site not specified",high
2,10.1186/s12866-025-03849-0,Full-length 16S rRNA Sequencing Reveals Gut Mi...,BACKGROUND: The gut microbiota plays a crucial...,Metabolic dysfunction-associated steatotic liv...,high,The abstract explicitly states that the study ...,Not found,No ICD-11 code found for this disease,low
3,10.1186/s40168-022-01400-1,Vaginal microbiome-host interactions modeled i...,BACKGROUND: A dominance of non-iners Lactobaci...,Bacterial vaginosis,high,The abstract explicitly mentions bacterial vag...,Not found,No ICD-11 code found for this disease,low
4,10.1001/jamapsychiatry.2023.0685,Interplay of Metabolome and Gut Microbiome in ...,IMPORTANCE: Metabolomics reflect the net effec...,Major Depressive Disorder,high,"The title, abstract, and main body of the rese...",6A70.3,"Single episode depressive disorder, severe, wi...",high
...,...,...,...,...,...,...,...,...,...
418,10.14740/wjon1587,Fluctuations in Gut Microbiome Composition Dur...,BACKGROUND: Immune checkpoint inhibitors (ICIs...,Non-small cell lung cancer,high,The abstract explicitly states that the study ...,Not found,No ICD-11 code found for this disease,low
419,10.3389/fonc.2022.837525,Distinct Functional Metagenomic Markers Predic...,BACKGROUND: Programmed death 1 (PD-1) and the ...,Non-small cell lung cancer,high,The title and abstract explicitly state that t...,Not found,No ICD-11 code found for this disease,low
420,10.1097/QAI.0b013e31824e4bdb,Altered vaginal microbiota are associated with...,BACKGROUND: Mother-to-child transmission (MTCT...,Human immunodeficiency virus infection,high,The entire study focuses on mother-to-child tr...,1C62,Human immunodeficiency virus disease without m...,high
421,10.3390/ijms242316626,Skin-Microbiome Assembly in Preterm Infants du...,The structure and function of infant skin is n...,Late-onset sepsis,high,The abstract explicitly states that the immatu...,Not found,No ICD-11 code found for this disease,low
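
Many of the rows displayed above carry the `Not found` sentinel rather than a code. A quick tally of lookup outcomes makes the hit rate explicit. This is a minimal sketch over a list of dicts shaped like the `results` list built in `process_dois`; the function name `tally_lookup_outcomes` is hypothetical, and the sample covers only the nine rows visible in the truncated display, not the full table.

```python
from collections import Counter

# Sentinel strings returned by search_icd11_code when no code is available
SENTINELS = {"Not found", "Error"}


def tally_lookup_outcomes(rows):
    """Count rows by ICD-11 lookup outcome: coded, empty, Not found, or Error."""
    counts = Counter()
    for row in rows:
        code = row.get("icd11_code", "")
        if code in SENTINELS:
            counts[code] += 1
        elif code:
            counts["coded"] += 1
        else:
            counts["empty"] += 1   # the lookup succeeded but returned no code
    return counts


# icd11_code values copied from the nine rows shown in the output above
shown_codes = ["Not found", "GC08", "Not found", "Not found", "6A70.3",
               "Not found", "Not found", "1C62", "Not found"]
outcome_counts = tally_lookup_outcomes({"icd11_code": c} for c in shown_codes)
print(dict(outcome_counts))
```

On the full DataFrame, `results_df['icd11_code'].value_counts()` gives a comparable breakdown.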


In [48]:
results_df.iloc[6]

Unnamed: 0,6
doi,10.1164/rccm.201911-2202OC
title,Metagenomics Reveals a Core Macrolide Resistom...
abstract,Rationale: Long-term antibiotic use for managi...
disease_name,Chronic respiratory disease
classification_confidence,high
classification_reasoning,The title and abstract explicitly state that t...
icd11_code,12
icd11_description,Diseases of the respiratory system
icd11_search_confidence,high
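
This row pairs a `high` search confidence with the code `12`, whose description is a chapter title rather than a specific category. One way to flag such rows is a shape check on the code. The pattern below is a heuristic assumption based on the category codes visible in this notebook's output (`GC08`, `6A70.3`, `1C62`): a four-character stem with an optional dotted extension. It is not an official validator and does not handle cluster codes.

```python
import re

# Heuristic: four-character stem, optionally followed by a dotted extension.
# This is an assumption drawn from the codes seen above, not an official rule.
STEM_CODE = re.compile(r"[0-9A-Z]{4}(\.[0-9A-Z]{1,2})?")


def is_category_code(code: str) -> bool:
    """Return True if the string looks like a category-level ICD-11 code."""
    return bool(STEM_CODE.fullmatch(code.strip()))


for candidate in ["12", "GC08", "6A70.3", "1C62", "Not found"]:
    print(candidate, is_category_code(candidate))
```

Rows that fail the check could have their `icd11_search_confidence` downgraded or be queued for manual review.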


In [70]:
# cl2 is created in the In [68] cell below; run that cell first.
cl2.search_icd11_code("Small intestinal bacterial overgrowth")

{'icd11_code': 'Not found',
 'icd11_description': 'No ICD-11 code found for this disease',
 'search_confidence': 'low'}
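
`search_icd11_code` reports `Not found` whenever the search returns no entities or the follow-up MMS request for the first entity returns anything other than HTTP 200. A possible alternative is to search the MMS linearization directly and take the first hit that already carries a code. The sketch below shows only the response-parsing half and makes several assumptions that should be checked against the WHO ICD API documentation: that a linearization search endpoint exists under `https://id.who.int/icd/release/11/2025-01/mms/search`, that each hit exposes `theCode` and `title`, and that titles may contain highlight markup. The payload and the code `ZZ99` are made-up placeholders.

```python
import re
from typing import Optional

TAG = re.compile(r"<[^>]+>")  # strips highlight markup such as <em class='found'>


def first_coded_match(search_payload: dict) -> Optional[dict]:
    """Return code and title of the first search hit that carries a code."""
    for entity in search_payload.get("destinationEntities") or []:
        code = entity.get("theCode")   # assumed field name, see lead-in
        if code:
            title = TAG.sub("", entity.get("title", "")).strip()
            return {"icd11_code": code, "icd11_description": title}
    return None


# Hand-made payload for illustration only; "ZZ99" is a placeholder, not a real lookup result.
fake_payload = {
    "destinationEntities": [
        {"title": "Grouping without a code", "theCode": ""},
        {"title": "<em class='found'>Example</em> condition", "theCode": "ZZ99"},
    ]
}
match = first_coded_match(fake_payload)
print(match)
```

Skipping hits with an empty code avoids returning groupings that have a title but no assignable code.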

In [68]:
cl2 = DOIClassifier(GEMINI_API_KEY, ICD11_CLIENT_ID, ICD11_CLIENT_SECRET)
cl2.classify_disease_with_gemini("Effects of Proton Pump Inhibitors on the Small Bowel and Stool Microbiomes.","""BACKGROUND: Proton pump inhibitor (PPI) use is extremely common. PPIs have been suggested to affect the gut microbiome, and increase risks of Clostridium difficile infection and small intestinal bacterial overgrowth (SIBO). However, existing data are based on stool analyses and PPIs act on the foregut.

AIMS: To compare the duodenal and stool microbiomes in PPI and non-PPI users.

METHODS: Consecutive subjects presenting for upper endoscopy without colonoscopy were recruited. Current antibiotic users were excluded. Subjects taking PPI were age- and gender-matched 1:2 to non-PPI controls. Subjects completed medical history questionnaires, and duodenal aspirates were collected using a validated protected catheter. A subset also provided stool samples. Duodenal and stool microbiomes were analyzed by 16S rRNA sequencing.

RESULTS: The duodenal microbiome exhibited no phylum-level differences between PPI (N = 59) and non-PPI subjects (N = 118), but demonstrated significantly higher relative abundances of families Campylobacteraceae (3.13-fold, FDR P value < 0.01) and Bifidobacteriaceae (2.9-fold, FDR P value < 0.01), and lower relative abundance of Clostridiaceae (88.24-fold, FDR P value < 0.0001), in PPI subjects. SIBO rates were not significantly different between groups, whether defined by culture (> 103 CFU/ml) or 16S sequencing, nor between subjects taking different PPIs. The stool microbiome exhibited significantly higher abundance of family Streptococcaceae (2.14-fold, P = 0.003), and lower Clostridiaceae (2.60-fold, FDR P value = 8.61E-13), in PPI (N = 22) versus non-PPI (N = 47) subjects.

CONCLUSIONS: These findings suggest that PPI use is not associated with higher rates of SIBO. Relative abundance of Clostridiaceae was reduced in both the duodenal and stool microbiomes, and Streptococcaceae was increased in stool. The clinical implications of these findings are unknown.""")

Successfully obtained ICD-11 API access token


{'disease_name': 'Small intestinal bacterial overgrowth',
 'confidence': 'high',
 'reasoning': 'The abstract explicitly states that the study aims to investigate the relationship between proton pump inhibitor use and small intestinal bacterial overgrowth (SIBO).  While the study also looks at the broader impact on the gut microbiome, SIBO is identified as a specific concern and a key outcome measure in the research.'}
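
The dictionary above is what `classify_disease_with_gemini` recovers from the model's reply: a greedy regex grabs everything from the first `{` to the last `}` and hands it to `json.loads`. The following is a minimal standalone sketch of that step with an explicit `None` on failure. The function name `extract_json_object` and the sample reply strings are hypothetical.

```python
import json
import re
from typing import Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Pull the outermost {...} span out of free text and parse it as JSON."""
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return None
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Hypothetical model reply with prose around the JSON object
reply = (
    'Here is the classification:\n'
    '{"disease_name": "Small intestinal bacterial overgrowth", "confidence": "high"}\n'
    'Let me know if you need anything else.'
)
parsed_reply = extract_json_object(reply)
print(parsed_reply)
```

Because the pattern is greedy, a reply containing two separate JSON objects is captured as one span and fails to parse, so the caller still needs the `Unknown` fallback used in the class.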