# Clinical Notes NLP: Disease & Symptom Extraction and Classification

Clinical notes are essential for diagnosis and care but are time-consuming to process. This project aims to automate part of that process by:

- Extracting diseases and symptoms from the Subjective section

- Classifying each condition as present or absent

- Evaluating results against human annotations



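## Tools and Technologies

- INCEpTION (human annotation)
- SecTag (section header identification)
- spaCy and scispaCy (named entity recognition)
- Ollama with LangChain (LLM-based classification)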

## Load the Clinical Notes

In [None]:
import os
import pandas as pd

# Path to clinical notes
notes_dir = "sample_clinical_notes/"

# Load all notes into a dictionary
clinical_notes = {}
for file in os.listdir(notes_dir):
    if file.endswith(".txt"):
        with open(os.path.join(notes_dir, file), 'r', encoding='utf-8') as f:
            clinical_notes[file] = f.read()

## Load SecTag Header

In [None]:
import re

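def sectag_to_regex(header_file_path, seg_col='kmname', header_col='str'):
    """Build one line-anchored regex per SecTag header, plus the parallel list of section names."""
    header_df = pd.read_csv(header_file_path).drop_duplicates()
    headers = header_df[header_col].tolist()
    # find_segs matches against the lowercased note, so lowercase the headers too.
    # Each header must start a line and be followed by a colon or a newline.
    header_patterns = [f'^{re.escape(str(header).lower())}[\\n:]' for header in headers]
    return header_patterns, header_df[seg_col].tolist()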

header_patterns, seg_names = sectag_to_regex("SecTag/SecTag.csv")

## Extract Subjective Section Using SecTag

In [None]:
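def find_segs(note, header_patterns, seg_names):
    """Return [header_text, section_names, start, end] for each header found, in document order."""
    lowered = note.lower()  # lowercase once so matching is case-insensitive
    segs = {}
    for i, pattern in enumerate(header_patterns):
        for m in re.finditer(pattern, lowered, re.MULTILINE):
            # Key on (matched header text, start offset) so that several SecTag
            # rows matching the same span pool their section names.
            seg_head = (note[m.start():m.end()], m.start())
            segs.setdefault(seg_head, []).append(seg_names[i])

    segs = sorted(([k[0], v, k[1]] for k, v in segs.items()), key=lambda x: x[2])

    # A section runs until the next header starts, or to the end of the note.
    for i in range(len(segs)):
        end = segs[i + 1][2] if i + 1 < len(segs) else len(note)
        segs[i].append(end)

    return segs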

def extract_subjective(note, header_patterns, seg_names):
    segments = find_segs(note, header_patterns, seg_names)
    for header_text, section_labels, start, end in segments:
        if 'subjective' in [s.lower() for s in section_labels]:
            return note[start:end].strip()
    return ""

In [None]:
for filename, note in clinical_notes.items():
    subjective_section = extract_subjective(note, header_patterns, seg_names)
    print(f"--- {filename} ---")
    print(subjective_section)
    print("\n\n")

--- sample_365.txt ---
SUBJECTIVE:  The patient is in with several medical problems.  She complains of numbness, tingling, and a pain in the toes primarily of her right foot described as a moderate pain.  She initially describes it as a sharp quality pain, but is unable to characterize it more fully.  She has had it for about a year, but seems to be worsening.  She has little bit of paraesthesias in the left toe as well and seem to involve all the toes of the right foot.  They are not worse with walking.  It seems to be worse when she is in bed.  There is some radiation of the pain up her leg.  She also continues to have bilateral shoulder pains without sinus allergies.  She has hypothyroidism.  She has thrombocythemia, insomnia, and hypertension.



--- sample_214.txt ---
S:  The patient is here today with his mom for several complaints.  Number one, he has been having issues with his right shoulder.  Approximately 10 days ago he fell, slipping on ice, did not hit his head but fell st

SecTag is useful for locating and extracting specific sections of a clinical note, such as the "Subjective" section, so that later steps analyze only the relevant text. Narrowing the input this way can improve the accuracy of downstream tasks such as symptom or disease extraction. Its usefulness depends on how consistently section headers are written: the regex approach used here only matches a known header at the start of a line, followed by a colon or a newline, so unclear or non-standard headers are missed. It also only segments text and does not analyze meaning, so it has to be combined with other tools to interpret the content.
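
To make the header-matching behavior concrete, here is a minimal, self-contained sketch of the same idea on a toy note. The header table `toy_headers`, the function `split_sections`, and the note text are all made up for illustration; the real SecTag CSV is much larger, and this is not the notebook's `find_segs` implementation.

```python
import re

# Hypothetical header table: (header string, section label).
# These rows are invented for the example, not taken from SecTag.csv.
toy_headers = [
    ("subjective", "subjective"),
    ("s", "subjective"),
    ("objective", "objective"),
]

def split_sections(note, headers):
    """Return (label, text) pairs for each header found at the start of a line."""
    hits = []
    for header, label in headers:
        # A header must start a line and be followed by a colon or a newline.
        pattern = rf"^{re.escape(header)}[\n:]"
        for m in re.finditer(pattern, note, re.MULTILINE | re.IGNORECASE):
            hits.append((m.start(), label))
    hits.sort()
    sections = []
    for i, (start, label) in enumerate(hits):
        # A section ends where the next recognized header begins.
        end = hits[i + 1][0] if i + 1 < len(hits) else len(note)
        sections.append((label, note[start:end].strip()))
    return sections

toy_note = (
    "SUBJECTIVE: Patient reports knee pain.\n"
    "OBJECTIVE: Knee is swollen.\n"
    "Plan - rest and ice."
)
sections = split_sections(toy_note, toy_headers)
for label, text in sections:
    print(label, "->", text)
```

The last line of the toy note uses a non-standard header ("Plan - "), so it is not recognized and is absorbed into the preceding section. This is the failure mode described above.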

## Extracting Conditions from the Subjective Section: Run spaCy NER on Subjective Text

In [None]:
import scispacy
import spacy

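# Load the small scispaCy biomedical model
nlp = spacy.load("en_core_sci_sm")

def extract_conditions(text):
    """Return the text of every entity spaCy finds that is at most five words long."""
    doc = nlp(text)
    # The length cap drops long spans that are unlikely to be a single condition.
    return [ent.text for ent in doc.ents if len(ent.text.split()) <= 5]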


In [None]:
# Loop through subjective sections and extract conditions
for filename, note in clinical_notes.items():
    subjective = extract_subjective(note, header_patterns, seg_names)
    if not subjective:
        continue

    conditions = extract_conditions(subjective)

    print(f"--- {filename} ---")
    print(" Extracted Conditions:")
    if conditions:
        for cond in conditions:
            print(f" - {cond}")
    else:
        print(" - None found")
    print("\n\n")


--- sample_365.txt ---
 Extracted Conditions:
 - SUBJECTIVE
 - patient
 - medical problems
 - complains
 - numbness
 - tingling
 - pain
 - right foot
 - moderate pain
 - sharp quality
 - pain
 - year
 - worsening
 - paraesthesias
 - left toe
 - toes
 - right foot
 - walking
 - bed
 - radiation
 - pain
 - leg
 - bilateral shoulder pains
 - sinus
 - allergies
 - hypothyroidism
 - thrombocythemia
 - insomnia
 - hypertension



--- sample_214.txt ---
 Extracted Conditions:
 - S
 - patient
 - mom
 - complaints
 - issues
 - right shoulder
 - days
 - ice
 - head
 - issues
 - difficulties
 - head
 - intermittent numbness
 - fingers
 - night
 - anti-inflammatories
 - pain relievers
 - sore throat
 - exposure to
 - Strep
 - long history
 - strep throat
 - Denies
 - fevers
 - rashes
 - nausea
 - vomiting
 - diarrhea
 - constipation
 - ADHD
 - Dr.
 - Adderall
 - Zoloft
 - day
 - notice
 - medication
 - school
 - weight
 - medications
 - issues
 - anger
 - anger
 - outbursts
 - problem
 - mom
 - wi

spaCy's named entity recognition (NER) with a scispaCy model is a fast way to pull candidate medical terms, such as diseases and symptoms, from clinical notes. It integrates well with other Python tools, which makes it practical for large datasets. Its output is noisy, though: the lists above include non-clinical terms such as "patient", "mom", and "ice" alongside real findings like "hypothyroidism". Accuracy also depends on the model. Smaller models like en_core_sci_sm may miss complex or rare terms, while larger models require more resources. Finally, spaCy NER does not classify entities as present or absent, so additional logic or models are needed for contextual interpretation.
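
As an illustration of the "additional logic" mentioned above, the following is a heavily simplified, NegEx-style rule: a term is labeled ABSENT if a negation cue appears within a few words before it. This is only a sketch under stated assumptions and is not what this notebook does (the next section uses an LLM instead). The names `NEGATION_CUES`, `classify_mentions`, and the `window` parameter are hypothetical, and the cue list is deliberately tiny.

```python
import re

# Illustrative cue list only; a real negation lexicon is far larger.
NEGATION_CUES = ["denies", "without", "no", "not"]

def classify_mentions(sentence, terms, window=5):
    """Label each term PRESENT or ABSENT; None if the term is not in the sentence.

    A term is ABSENT when a negation cue occurs within `window` words before it.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    labels = {}
    for term in terms:
        term_tokens = term.lower().split()
        n = len(term_tokens)
        status = None
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == term_tokens:
                preceding = tokens[max(0, i - window):i]
                negated = any(cue in preceding for cue in NEGATION_CUES)
                status = "ABSENT" if negated else "PRESENT"
                break  # only the first mention is considered
        labels[term] = status
    return labels

# Sentence taken from sample_365.txt above.
sentence = "She also continues to have bilateral shoulder pains without sinus allergies."
labels = classify_mentions(sentence, ["bilateral shoulder pains", "sinus allergies"])
print(labels)
```

A fixed window is crude: in a phrase like "no fever but has cough", it would also mark "cough" as ABSENT. Fuller rule-based approaches handle scope boundaries such as "but", which this sketch omits.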

## Use Ollama LLM LangChain to classify symptoms

In [None]:
from langchain_ollama import ChatOllama
from langchain_core.prompts import ChatPromptTemplate

# Initialize the Ollama LLM
llm = ChatOllama(model="llama3.2")
# Define prompt
prompt = ChatPromptTemplate.from_template("""
You are a clinical AI assistant. Given the following subjective section:
{text}

Extract all diseases and symptoms mentioned. For each one, say whether it's PRESENT or ABSENT based on context.
Format: <condition>: PRESENT/ABSENT
""")

# Loop through notes
for filename, note in clinical_notes.items():
    subjective = extract_subjective(note, header_patterns, seg_names)
    if not subjective:
        continue

    # The spaCy result is used only as a gate here; the LLM re-extracts from the raw text.
    conditions = extract_conditions(subjective)

    print(f"--- {filename} ---")
    print("Extracted Conditions and Classification:")

    if conditions:
        try:
            llm_input = prompt.invoke({"text": subjective})  # Fill the template (returns a prompt value, not a plain string)
            response = llm.invoke(llm_input)  # Get model response
            print(response.content)
        except Exception as e:
            print(f"Error calling Ollama: {e}")
    else:
        print(" - No conditions found for classification.")
    print("\n\n")


--- sample_365.txt ---
Extracted Conditions and Classification:
Here are the extracted diseases and symptoms with their presence status:

1. Numbness: PRESENT
2. Tingling: PRESENT
3. Pain in the toes (moderate): PRESENT
4. Sharp quality pain: ABSENT (described, but not characterized further)
5. Paraesthesias in the left toe: PRESENT
6. Pain radiation up the leg: PRESENT
7. Bilateral shoulder pains: PRESENT
8. Sinus allergies: ABSENT
9. Hypothyroidism: PRESENT
10. Thrombocythemia: PRESENT
11. Insomnia: PRESENT
12. Hypertension: PRESENT

Note that some conditions (e.g., sinus allergies) are not explicitly mentioned as being present or absent, but rather stated to be ABSENT based on the context.



--- sample_214.txt ---
Extracted Conditions and Classification:
Here are the extracted conditions and symptoms with their presence status:

1. Shoulder injury (from fall): PRESENT
2. Intermittent numbness in fingers at night: PRESENT
3. Sore throat: PRESENT
4. Strep throat (history): ABSENT (no

Using Ollama with LangChain and an LLM such as Llama 3.2 provides flexible, context-aware extraction of medical information from clinical text. Unlike rule-based tools or traditional NER, the LLM can interpret nuanced language and infer whether a symptom or disease is present or absent, even when this is implied rather than explicitly stated. This makes it well suited to subjective clinical narratives. However, it depends on a reliable local setup: Ollama needs sufficient compute and a reachable local server. LLMs may also hallucinate or misclassify. In the output above, "Sharp quality pain" is marked ABSENT although the note says the patient describes it, and the response adds numbering and parenthetical comments that the requested `<condition>: PRESENT/ABSENT` format did not ask for. Prompt tuning and evaluation against annotations are therefore needed.
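
Because the model's response is free text, it has to be parsed before it can be compared with annotations. The sketch below shows one possible way to turn lines of the form `<condition>: PRESENT/ABSENT` into tuples while tolerating list numbering and trailing comments. The names `LINE_RE` and `parse_llm_output` are hypothetical, and the sample text is an excerpt of the sample_365.txt output shown above.

```python
import re

# Matches lines such as "1. Numbness: PRESENT" or
# "Sinus allergies: ABSENT (some comment)". Other lines are ignored.
LINE_RE = re.compile(
    r"^\s*(?:\d+\.\s*)?(?P<condition>.+?):\s*(?P<label>PRESENT|ABSENT)\b"
)

def parse_llm_output(text):
    """Return a list of (condition, label) tuples found in the LLM response."""
    pairs = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            pairs.append((m.group("condition").strip(), m.group("label")))
    return pairs

sample = """Here are the extracted diseases and symptoms with their presence status:

1. Numbness: PRESENT
4. Sharp quality pain: ABSENT (described, but not characterized further)
8. Sinus allergies: ABSENT
"""
parsed = parse_llm_output(sample)
print(parsed)
```

Lines that do not follow the pattern, such as the introductory sentence, are skipped silently. In practice it may be worth logging them so that format drift is noticed.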

## Evaluate Accuracy of LLM Model by Comparing to Annotations

#### First Clinical Note:



In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Predicted conditions (copied from the LLM output for sample_365.txt)
predicted_conditions = [
    ("Numbness", "PRESENT"),
    ("Tingling", "PRESENT"),
    ("Pain in the toes", "PRESENT"),
    ("Sharp quality pain", "ABSENT"),
    ("Paraesthesias in the left toe", "PRESENT"),
    ("Pain radiation up the leg", "PRESENT"),
    ("Bilateral shoulder pains", "PRESENT"),
    ("Sinus allergies", "ABSENT"),
    ("Hypothyroidism", "PRESENT"),
    ("Thrombocythemia", "PRESENT"),
    ("Insomnia", "PRESENT"),
    ("Hypertension", "PRESENT")
]

# Gold annotations (from INCEpTION)
annotations = [
    ("Numbness", "PRESENT"),
    ("Tingling", "PRESENT"),
    ("Pain in the toes", "PRESENT"),
    ("Sharp quality pain", "PRESENT"),
    ("Paraesthesias in the left toe", "PRESENT"),
    ("Pain radiation up the leg", "PRESENT"),
    ("Bilateral shoulder pains", "PRESENT"),
    ("Sinus allergies", "ABSENT"),
    ("Hypothyroidism", "PRESENT"),
    ("Thrombocythemia", "PRESENT"),
    ("Insomnia", "PRESENT"),
    ("Hypertension", "PRESENT")
]

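# Extract predicted and true labels. The two lists are compared position by
# position, so they must contain the same conditions in the same order.
predicted_labels = [condition[1] for condition in predicted_conditions]
true_labels = [condition[1] for condition in annotations]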

# Calculate metrics
precision = precision_score(true_labels, predicted_labels, pos_label='PRESENT')
recall = recall_score(true_labels, predicted_labels, pos_label='PRESENT')
f1 = f1_score(true_labels, predicted_labels, pos_label='PRESENT')

# Output the evaluation metrics
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")


Precision: 1.00
Recall: 0.91
F1 Score: 0.95


On this note, the model classified 11 of the 12 conditions the same way as the human annotation. Of the 11 conditions annotated as PRESENT, it labeled 10 as PRESENT and one ("Sharp quality pain") as ABSENT, and it correctly labeled "Sinus allergies" as ABSENT. With PRESENT as the positive class, that gives 10 true positives, 0 false positives, and 1 false negative: precision is 10/10 = 1.00, recall is 10/11 = 0.91, and F1 is 0.95. These numbers should be read with two caveats. First, they come from a single note with 12 conditions, so one changed label moves the scores noticeably. Second, the predicted and annotated lists here contain the same twelve conditions, so the scores measure only the PRESENT/ABSENT classification, not whether extraction found the right set of conditions. Within those limits, the result suggests the approach is promising, with the one error being a present symptom marked as absent.
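
The arithmetic above can be reproduced without scikit-learn, and matching by condition name rather than by list position avoids silent misalignment. The following is a minimal sketch; the function name `prf_by_name` and its return structure are hypothetical, and the labels are rebuilt from the two lists in the evaluation cell.

```python
def prf_by_name(predicted, gold, positive="PRESENT"):
    """Compute precision, recall and F1 over conditions present in both dicts."""
    shared = predicted.keys() & gold.keys()
    tp = sum(1 for c in shared if predicted[c] == positive and gold[c] == positive)
    fp = sum(1 for c in shared if predicted[c] == positive and gold[c] != positive)
    fn = sum(1 for c in shared if predicted[c] != positive and gold[c] == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall, "f1": f1,
        # Conditions found in only one list point to extraction differences.
        "only_predicted": sorted(predicted.keys() - gold.keys()),
        "only_gold": sorted(gold.keys() - predicted.keys()),
    }

names = [
    "Numbness", "Tingling", "Pain in the toes", "Sharp quality pain",
    "Paraesthesias in the left toe", "Pain radiation up the leg",
    "Bilateral shoulder pains", "Sinus allergies", "Hypothyroidism",
    "Thrombocythemia", "Insomnia", "Hypertension",
]
# Gold labels: everything PRESENT except sinus allergies.
gold = {name: "PRESENT" for name in names}
gold["Sinus allergies"] = "ABSENT"
# Predicted labels: same as gold except the one disagreement.
predicted = dict(gold)
predicted["Sharp quality pain"] = "ABSENT"

result = prf_by_name(predicted, gold)
print(f"TP={result['tp']} FP={result['fp']} FN={result['fn']}")
print(f"Precision: {result['precision']:.2f}")
print(f"Recall: {result['recall']:.2f}")
print(f"F1 Score: {result['f1']:.2f}")
```

Here `only_predicted` and `only_gold` are both empty because the two lists name the same conditions. With real LLM output they would show conditions the model invented or missed, which the position-based comparison cannot reveal.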