
# Disease Specific Identification from Abstracts
This notebook identifies disease-specific information from biomedical research abstracts.
It leverages NLP tools like **spaCy** and **SciSpaCy** to extract and analyze entities.

**Objective:**
- Process abstracts using NLP pipelines
- Identify disease-specific mentions
- Analyze entity frequency and distribution

In [3]:
!pip install -q spacy scispacy
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bc5cdr_md-0.5.1.tar.gz


In [4]:
# Importing required libraries
import pandas as pd
import spacy
import json
from collections import defaultdict

# **Model Selection**
The `en_ner_bc5cdr_md` model is a SciSpaCy biomedical NER model trained on the BioCreative V CDR (BC5CDR) corpus. It tags two entity types, `DISEASE` and `CHEMICAL`; this notebook keeps only the `DISEASE` entities. Because it is trained on biomedical literature, it is better suited to research abstracts than a general-purpose spaCy model, whose entity labels do not include diseases. It can also be applied to clinical notes and other biomedical text, although accuracy on those sources is not evaluated here.

In [5]:
# Load biomedical NLP model
nlp = spacy.load("en_ner_bc5cdr_md")



In [11]:
data = [
    {
        "abstract_id": "30884810",
        "title": "PFKFB2 Promoter Hypomethylation as Recurrence Predictive Marker in Well-Differentiated Thyroid Carcinomas.",
        "abstract": "Despite the low mortality rates, well-differentiated thyroid carcinomas (WDTC) frequently relapse. BRAF and TERT mutations have been extensively related to prognosis in thyroid cancer. In this study, the methylation levels of selected CpGs (5-cytosine-phosphate-guanine-3) comprising a classifier, previously reported by our group, were assessed in combination with BRAF and TERT mutations. We evaluated 121 WDTC, three poorly-differentiated/anaplastic thyroid carcinomas (PDTC/ATC), 22 benign thyroid lesions (BTL), and 13 non-neoplastic thyroid (NT) tissues. BRAF (V600E) and TERT promoter (C228T and C250T) mutations were tested by pyrosequencing and Sanger sequencing, respectively. Three CpGs mapped in PFKFB2, ATP6V0C, and CXXC5 were evaluated by bisulfite pyrosequencing. ATP6V0C hypermethylation and PFKFB2 hypomethylation were detected in poor-prognosis (PDTC/ATC and relapsed WDTC) compared with good-prognosis (no relapsed WDTC) and non-malignant cases (NT/BTL). CXXC5 was hypomethylated in both poor and good-prognosis cases. Shorter disease-free survival was observed in WDTC patients presenting lower PFKFB2 methylation levels (p = 0.004). No association was observed on comparing BRAF (60.7%) and TERT (3.4%) mutations and prognosis. Lower PFKFB2 methylation levels was an independent factor of high relapse risk (Hazard Ratio = 3.2; CI95% = 1.1-9.5). PFKFB2 promoter methylation analysis has potential applicability to better stratify WDTC patients according to the recurrence risk, independently of BRAF and TERT mutations."
    },

    {
        "abstract_id": "30885334",
        "title": "A study of ALK-positive pulmonary squamous-cell carcinoma: From diagnostic methodologies to clinical efficacy.",
        "abstract": "BACKGROUND: High concordance has been observed between Ventana D5F3 ALK immunohistochemistry (IHC) and fluorescence in-situ hybridization (FISH) in lung adenocarcinoma (LADC). However, whether a similar conclusion can be applied to lung squamous-cell carcinoma (LSCC) has remained unclear. We therefore evaluated the ALK (anaplastic lymphoma kinase) status and the therapeutic effect of an ALK tyrosine kinase inhibitor (TKI) in IHC- or FISH-positive LSCC. MATERIALS AND METHODS: A total of 2403 LSCC patients from three institutions were screened for ALK aberration by IHC. All IHC-positive cases were subjected to FISH (with an approximately equal number of negative cases as a control group) and next-generation sequencing (NGS). Clinical efficacy was evaluated for the patients who received TKI therapy. RESULTS: In 2403 cases of LSCC, 37 cases were identified as ALK-positive by IHC. After quality control, 28 cases were succeeded by FISH (six with insufficient tissue, three with lack of signals) and 13 by NGS (24 failed due to insufficient samples or poor DNA quality); the percentage of non-diagnostic tests was 24.3% (9/37) and 64.9% (24/37), respectively. Four cases (4/2394, 0.17%) analyzed by FISH were determined as ALK-positive. For the control group (40 ALK IHC), FISH demonstrated no samples with ALK gene fusion. The concordance between ALK IHC- and ALK FISH-positive results was 14.3% (4/28). In the 13 cases studied by NGS, two cases showed ALK-EML4 fusion (consistent with two FISH-positive results), and two cases were interpreted as harboring an ALK-association gene mutation. Among four patients (two FISH-positive and two IHC-positive only cases) receiving TKI therapy, two patients had stable disease and the other two had progressive disease. CONCLUSIONS: The positive concordance rate of ALK IHC and FISH in LSCC is far less than that reported for LADC. Therefore, ALK IHC detection in LSCC cannot be used as a diagnostic method for ALK rearrangement."
    },
    {
        "abstract_id": "30886395",
        "title": "Immunotherapy in colorectal cancer: rationale, challenges and potential.",
        "abstract": "Following initial successes in melanoma treatment, immunotherapy has rapidly become established as a major treatment modality for multiple types of solid cancers, including a subset of colorectal cancers (CRCs). Two programmed cell death 1 (PD1)-blocking antibodies, pembrolizumab and nivolumab, have shown efficacy in patients with metastatic CRC that is mismatch-repair-deficient and microsatellite instability-high (dMMR-MSI-H), and have been granted accelerated FDA approval. In contrast to most other treatments for metastatic cancer, immunotherapy achieves long-term durable remission in a subset of patients, highlighting the tremendous promise of immunotherapy in treating dMMR-MSI-H metastatic CRC. Here, we review the clinical development of immune checkpoint inhibition in CRC leading to regulatory approvals for the treatment of dMMR-MSI-H CRC. We focus on new advances in expanding the efficacy of immunotherapy to early-stage CRC and CRC that is mismatch-repair-proficient and has low microsatellite instability (pMMR-MSI-L) and discuss emerging approaches for targeting the immune microenvironment, which might complement immune checkpoint inhibition."
    },
    {
        "abstract_id": "30887763",
        "title": "Immunotherapy in endometrial cancer: new scenarios on the horizon.",
        "abstract": "This extensive review summarizes clinical evidence on immunotherapy and targeted therapy currently available for endometrial cancer (EC) and reports the results of the clinical trials and ongoing studies. The research was carried out collecting preclinical and clinical findings using keywords such as immune environment, tumor infiltrating lymphocytes, programmed death-1 (PD-1)/programmed death-ligand 1 (PD-L1) expression, immune checkpoint inhibitors, anti-PD-1/PD-L1 antibodies and others' on PubMed. Finally, we looked for the ongoing immunotherapy trials on ClinicalTrials.gov. EC is the fourth most common malignancy in women in developed countries. Despite medical and surgical treatments, survival has not improved in the last decade and death rates have increased for uterine cancer in women. Therefore, identification of clinically significant prognostic risk factors and formulation of new rational therapeutic regimens have great significance for enhancing the survival rate and improving the outcome in patients with advanced or metastatic disease. The identification of genetic alterations, including somatic mutations and microsatellite instability, and the definition of intracellular signaling pathways alterations that have a major role in in tumorigenesis is leading to the development of new therapeutic options for immunotherapy and targeted therapy."
    }
]

In [12]:
df = pd.DataFrame([{
    "ID": item["abstract_id"],
    "text": item["title"] + " " + item["abstract"]
} for item in data])

# Define extraction function
def extract_diseases(text, nlp):
    """Return the unique DISEASE mentions that the NER model finds in `text`."""
    doc = nlp(text)
    diseases = defaultdict(list)

    # Group mentions case-insensitively so "Thyroid Cancer" and
    # "thyroid cancer" count as one disease
    for ent in doc.ents:
        if ent.label_ == "DISEASE":
            normalized = ent.text.lower().strip()
            diseases[normalized].append(ent.text)

    # Keep one surface form per group (the longest; ties go to the first seen)
    results = []
    for variants in diseases.values():
        best = max(variants, key=len)
        results.append(best)

    return results
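
The grouping step can be checked without loading the model. The sketch below is a minimal, model-free illustration: it assumes entities arrive as hypothetical `(text, label)` pairs rather than spaCy spans, and `group_mentions` is an illustrative name that does not appear in the notebook. It shows that mentions differing only in case collapse to a single entry. Since such variants normally have the same length, `max(..., key=len)` in effect returns the first surface form seen.

```python
from collections import defaultdict


def group_mentions(entities):
    """Group DISEASE mentions case-insensitively; keep one surface form each.

    `entities` is a hypothetical list of (text, label) pairs standing in
    for the spans in `doc.ents`.
    """
    groups = defaultdict(list)
    for text, label in entities:
        if label == "DISEASE":
            groups[text.lower().strip()].append(text)
    # For ASCII text, variants that differ only by case have equal length,
    # so max() returns the first one encountered.
    return [max(variants, key=len) for variants in groups.values()]


# Made-up entity list for illustration only; not real model output.
sample = [
    ("Thyroid Carcinomas", "DISEASE"),
    ("BRAF", "CHEMICAL"),
    ("thyroid carcinomas", "DISEASE"),
    ("thyroid cancer", "DISEASE"),
]
print(group_mentions(sample))  # ['Thyroid Carcinomas', 'thyroid cancer']
```

Non-`DISEASE` labels are skipped, and the result preserves the order in which each group was first encountered.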

In [13]:

# Process DataFrame
def process_dataframe(df):
    results = []

    for _, row in df.iterrows():
        diseases = extract_diseases(row['text'], nlp)

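        # Deduplicate case-insensitively and keep only multi-word,
        # non-numeric mentions longer than three characters
        # (single-word mentions are dropped by the ' ' check)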
        final_diseases = []
        seen = set()
        for disease in diseases:
            key = disease.lower()
            if (key not in seen and len(disease) > 3
                and not disease.isdigit() and ' ' in disease):
                seen.add(key)
                final_diseases.append(disease)

        results.append({
            "abstract_id": str(row['ID']),
            "extracted_diseases": sorted(final_diseases)
        })

    # Display the output
    print(json.dumps(results, indent=2))
    return results
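
The filter inside the loop is easier to reason about as a standalone predicate. The following is only a sketch: `keep_mention` is a hypothetical helper that mirrors the four conditions used above, and the input list is made up for illustration. Note the `" " in mention` condition in particular. It discards every single-word mention, so a term such as "melanoma" (which occurs in abstract 30886395) would be dropped even if the model tags it.

```python
def keep_mention(mention, seen):
    """Return True if `mention` passes the notebook's filter conditions.

    `seen` is the set of lowercased mentions that were already accepted.
    """
    return (
        mention.lower() not in seen   # not a case-insensitive duplicate
        and len(mention) > 3          # longer than three characters
        and not mention.isdigit()     # not purely numeric
        and " " in mention            # multi-word mentions only
    )


seen = set()
kept = []
for mention in ["colorectal cancer", "melanoma", "CRC", "Colorectal Cancer"]:
    if keep_mention(mention, seen):
        seen.add(mention.lower())
        kept.append(mention)

print(kept)  # ['colorectal cancer']
```

If single-word diseases matter for the analysis, relaxing the last condition is the obvious change, at the cost of admitting more short abbreviations.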

# **Output**

In [14]:
# Run the processing
processed_data = process_dataframe(df)

[
  {
    "abstract_id": "30884810",
    "extracted_diseases": [
      "Thyroid Carcinomas",
      "thyroid cancer"
    ]
  },
  {
    "abstract_id": "30885334",
    "extracted_diseases": [
      "ALK-positive pulmonary squamous-cell carcinoma",
      "anaplastic lymphoma",
      "lung adenocarcinoma",
      "lung squamous-cell carcinoma"
    ]
  },
  {
    "abstract_id": "30886395",
    "extracted_diseases": [
      "colorectal cancer",
      "colorectal cancers"
    ]
  },
  {
    "abstract_id": "30887763",
    "extracted_diseases": [
      "endometrial cancer",
      "uterine cancer"
    ]
  }
]
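
The output still contains near-duplicates such as `colorectal cancer` and `colorectal cancers`, and the frequency analysis listed in the objectives is not shown above. One possible follow-up, sketched here under stated assumptions, lowercases each mention, strips a trailing `s` as a deliberately naive plural heuristic, and counts how many abstracts mention each resulting term. The names `normalize` and `abstract_frequency` are hypothetical, and the input is a subset of the output above.

```python
from collections import Counter


def normalize(mention):
    """Lowercase and strip one trailing 's'.

    This is a naive assumption, not a lemmatizer: it would also mangle
    singular terms that end in 's', such as "tuberculosis".
    """
    term = mention.lower().strip()
    if term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def abstract_frequency(records):
    """Count the number of abstracts in which each normalized term appears."""
    counts = Counter()
    for record in records:
        # A set ensures each abstract contributes at most once per term.
        counts.update({normalize(m) for m in record["extracted_diseases"]})
    return counts


records = [
    {"abstract_id": "30884810",
     "extracted_diseases": ["Thyroid Carcinomas", "thyroid cancer"]},
    {"abstract_id": "30886395",
     "extracted_diseases": ["colorectal cancer", "colorectal cancers"]},
    {"abstract_id": "30887763",
     "extracted_diseases": ["endometrial cancer", "uterine cancer"]},
]

freq = abstract_frequency(records)
print(freq["colorectal cancer"])  # 1
print(sorted(freq))
```

Because `processed_data` has the same structure as `records`, it could be passed to `abstract_frequency` directly. A lemmatizer or an ontology-based entity linker would be a more robust way to merge variants than suffix stripping.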
