# ðŸ©º Physician Notetaker â€“ Medical NLP Pipeline 
# Done By: Chindiri Chakri

This notebook implements an end-to-end NLP pipeline for:
- Medical entity extraction
- Structured medical summarization
- Patient sentiment and intent analysis
- SOAP note generation



## 1. Imports & Setup

In [2]:
# Core
import re
import json

# NLP
import spacy
from transformers import pipeline

# Data handling
from collections import defaultdict


## 2. Load Models

#### 2.1 Load spaCy (for sentence parsing & noun phrases)

In [3]:
nlp = spacy.load("en_core_web_sm")

#### 2.2 Load Transformer Pipelines

##### Medical NER

In [4]:
ner_pipeline = pipeline(
    "ner",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple"
)


Device set to use cpu


##### Sentiment Analysis

In [5]:
sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)


config.json:   0%|          | 0.00/629 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Device set to use cpu


## 3. Input Transcript

In [9]:
transcript = """
Doctor: Good morning, Ms. Jones. How are you feeling today?
Patient: Good morning, doctor. Iâ€™m doing better, but I still have some discomfort now and then.

Doctor: I understand you were in a car accident last September. Can you walk me through what happened?
Patient: Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.

Doctor: That sounds like a strong impact. Were you wearing your seatbelt?
Patient: Yes, I always do.

Doctor: What did you feel immediately after the accident?
Patient: At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away.

Doctor: Did you seek medical attention at that time?
Patient: Yes, I went to Moss Bank Accident and Emergency. They checked me over and said it was a whiplash injury, but they didnâ€™t do any X-rays. They just gave me some advice and sent me home.

Doctor: How did things progress after that?
Patient: The first four weeks were rough. My neck and back pain were really bad. I had trouble sleeping and had to take painkillers regularly. It started improving after that, but I had to go through ten sessions of physiotherapy to help with the stiffness and discomfort.

Doctor: That makes sense. Are you still experiencing pain now?
Patient: Itâ€™s not constant, but I do get occasional backaches. Itâ€™s nothing like before.

Doctor: Thatâ€™s good to hear. Have you noticed any other effects, like anxiety while driving or difficulty concentrating?
Patient: No, nothing like that. I donâ€™t feel nervous driving, and I havenâ€™t had any emotional issues from the accident.

Doctor: And how has this impacted your daily life? Work, hobbies, anything like that?
Patient: I had to take a week off work, but after that, I was back to my usual routine. It hasnâ€™t really stopped me from doing anything.

Doctor: Thatâ€™s encouraging. Letâ€™s go ahead and do a physical examination to check your mobility and any lingering pain.

Doctor: Everything looks good. Your neck and back have a full range of movement, and thereâ€™s no tenderness or signs of lasting damage. Your muscles and spine seem to be in good condition.

Patient: Thatâ€™s a relief!

Doctor: Yes, your recovery so far has been quite positive. Given your progress, Iâ€™d expect you to make a full recovery within six months of the accident. There are no signs of long-term damage or degeneration.

Patient: Thatâ€™s great to hear. So, I donâ€™t need to worry about this affecting me in the future?
Doctor: Thatâ€™s right. I donâ€™t foresee any long-term impact on your work or daily life. If anything changes or you experience worsening symptoms, you can always come back for a follow-up. But at this point, youâ€™re on track for a full recovery.

Patient: Thank you, doctor. I appreciate it.
Doctor: Youâ€™re very welcome, Ms. Jones. Take care, and donâ€™t hesitate to reach out if you need anything.
"""


## 4. Text Segmentation 

In [10]:
def segment_transcript(text):
    segments = []
    for line in text.strip().split("\n"):
        if line.startswith("Doctor:"):
            segments.append({
                "speaker": "Doctor",
                "text": line.replace("Doctor:", "").strip()
            })
        elif line.startswith("Patient:"):
            segments.append({
                "speaker": "Patient",
                "text": line.replace("Patient:", "").strip()
            })
    return segments


In [11]:
segments = segment_transcript(transcript)
segments


[{'speaker': 'Doctor',
  'text': 'Good morning, Ms. Jones. How are you feeling today?'},
 {'speaker': 'Patient',
  'text': 'Good morning, doctor. Iâ€™m doing better, but I still have some discomfort now and then.'},
 {'speaker': 'Doctor',
  'text': 'I understand you were in a car accident last September. Can you walk me through what happened?'},
 {'speaker': 'Patient',
  'text': 'Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.'},
 {'speaker': 'Doctor',
  'text': 'That sounds like a strong impact. Were you wearing your seatbelt?'},
 {'speaker': 'Patient', 'text': 'Yes, I always do.'},
 {'speaker': 'Doctor',
  'text': 'What did you feel immediately after the accident?'},
 {'speaker': 'Patient',
  'text': 'At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could fe

## 5. Collecting Patient Text Only(for sentiment and intent)

In [12]:
patient_text = " ".join(
    seg["text"] for seg in segments if seg["speaker"] == "Patient"
)
patient_text


'Good morning, doctor. Iâ€™m doing better, but I still have some discomfort now and then. Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front. Yes, I always do. At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away. Yes, I went to Moss Bank Accident and Emergency. They checked me over and said it was a whiplash injury, but they didnâ€™t do any X-rays. They just gave me some advice and sent me home. The first four weeks were rough. My neck and back pain were really bad. I had trouble sleeping and had to take painkillers regularly. It started improving after that, but I had to go through ten sessions of physiotherapy to help with the stiffness and discomfort. Itâ€™s not constant, but I do get occasional backaches. Itâ€

## 6. Medical NER

In [13]:
ner_results = ner_pipeline(transcript)
ner_results


[{'entity_group': 'Sign_symptom',
  'score': np.float32(0.99994445),
  'word': 'discomfort',
  'start': 132,
  'end': 142},
 {'entity_group': 'Activity',
  'score': np.float32(0.5623275),
  'word': 'car accident',
  'start': 193,
  'end': 205},
 {'entity_group': 'Time',
  'score': np.float32(0.94696623),
  'word': '12 : 30 in',
  'start': 307,
  'end': 315},
 {'entity_group': 'Detailed_description',
  'score': np.float32(0.3026258),
  'word': '##ad',
  'start': 353,
  'end': 355},
 {'entity_group': 'Nonbiological_location',
  'score': np.float32(0.6277383),
  'word': 'hulme',
  'start': 358,
  'end': 363},
 {'entity_group': 'Sign_symptom',
  'score': np.float32(0.9999579),
  'word': 'pain',
  'start': 778,
  'end': 782},
 {'entity_group': 'Biological_structure',
  'score': np.float32(0.99984574),
  'word': 'neck',
  'start': 789,
  'end': 793},
 {'entity_group': 'Biological_structure',
  'score': np.float32(0.9788391),
  'word': 'back',
  'start': 798,
  'end': 802},
 {'entity_group': 

## 7. Organize Medical Entities

In [14]:
medical_entities = defaultdict(list)

for ent in ner_results:
    medical_entities[ent["entity_group"]].append(ent["word"])

medical_entities


defaultdict(list,
            {'Sign_symptom': ['discomfort',
              'pain',
              'pain',
              'stiff',
              'pain',
              '##ache',
              'anxiety',
              'nervous',
              'issues'],
             'Activity': ['car accident', 'driving'],
             'Time': ['12 : 30 in'],
             'Detailed_description': ['##ad', 'ten sessions'],
             'Nonbiological_location': ['hulme'],
             'Biological_structure': ['neck',
              'back',
              'neck',
              'back',
              'neck',
              'back'],
             'Duration': ['four weeks', 'week'],
             'Medication': ['pain', '##ers'],
             'Therapeutic_procedure': ['##kill', 'physiotherapy'],
             'Lab_value': ['improving'],
             'Diagnostic_procedure': ['examination']})

## 8. Keywod Extraction

#### 8.1 Extracting Noun Phrases

In [15]:
def extract_keywords(text):
    doc = nlp(text)
    keywords = set()

    for chunk in doc.noun_chunks:
        phrase = chunk.text.lower().strip()
        if len(phrase.split()) <= 4:
            keywords.add(phrase)

    return list(keywords)


#### 8.2 Running Keywod Extension Function

In [16]:
keywords = extract_keywords(transcript)
keywords


['september 1st',
 'your recovery',
 'patient',
 'painkillers',
 'that time',
 'the accident',
 'medical attention',
 'a physical examination',
 'any x',
 'track',
 'care',
 'it',
 'ten sessions',
 'movement',
 'long-term damage',
 'daily life',
 'any lingering pain',
 'six months',
 'signs',
 'the stiffness',
 'that',
 'they',
 'doctor',
 'ms. jones',
 'a full recovery',
 'degeneration',
 'emergency',
 'any other effects',
 'a week',
 '-',
 'your progress',
 'a strong impact',
 'hobbies',
 'the afternoon',
 'a car accident',
 'anything',
 'trouble',
 'front',
 'physiotherapy',
 'this',
 'the future',
 'some discomfort',
 'your seatbelt',
 'moss bank accident',
 'a relief',
 'a follow-up',
 'the first four weeks',
 'no tenderness',
 'my usual routine',
 'your neck',
 'another car',
 'nothing',
 'any emotional issues',
 'everything',
 'the one',
 'work',
 'â€™s',
 'your daily life',
 'spine',
 'your work',
 'back pain',
 'you',
 'the steering wheel',
 'my car',
 'home',
 'no signs',
 'w

## 9: Mapping Medical Information to Structured Fields

#### 9.1 Initialize Medical Summary Schema

In [17]:
medical_summary = {
    "Patient_Name": "Janet Jones",
    "Symptoms": [],
    "Diagnosis": None,
    "Treatment": [],
    "Current_Status": None,
    "Prognosis": None
}


#### 9.2 Rule-Based Mapping (Hybrid Logic)

In [24]:
# --------- RULE-BASED MEDICAL MAPPING (HYBRID LOGIC) ---------

for entity, values in medical_entities.items():
    for val in values:
        val_lower = val.lower()

        # -------- Symptoms --------
        if "neck" in val_lower:
            medical_summary["Symptoms"].append("Neck pain")

        elif "back" in val_lower:
            medical_summary["Symptoms"].append("Back pain")

        elif "head" in val_lower:
            medical_summary["Symptoms"].append("Head impact")

        elif "pain" in val_lower:
            medical_summary["Symptoms"].append("Pain")


        # -------- Diagnosis --------
        if "whiplash" in val_lower:
            medical_summary["Diagnosis"] = "Whiplash injury"


        # -------- Treatment --------
        if "physiotherapy" in val_lower:
            medical_summary["Treatment"].append("10 physiotherapy sessions")

        elif "painkiller" in val_lower or "analgesic" in val_lower:
            medical_summary["Treatment"].append("Painkillers")


# --------- CONTEXT-BASED ENRICHMENT (VERY IMPORTANT) ---------

# Diagnosis fallback (if NER missed it)
if medical_summary["Diagnosis"] is None:
    if "whiplash" in transcript.lower():
        medical_summary["Diagnosis"] = "Whiplash injury"


# Prognosis extraction (NER usually misses this)
if medical_summary["Prognosis"] is None:
    if "six months" in transcript.lower() or "6 months" in transcript.lower():
        medical_summary["Prognosis"] = "Full recovery expected within six months"


#### 9.3 Adding Current Status From Patient Text

In [25]:
if "occasional" in patient_text.lower():
    medical_summary["Current_Status"] = "Occasional backache"
else:
    medical_summary["Current_Status"] = "Not mentioned"


#### 9.4 Cleaning Duplicates

In [26]:
medical_summary["Symptoms"] = list(set(medical_summary["Symptoms"]))
medical_summary["Treatment"] = list(set(medical_summary["Treatment"]))


#### 9.5 View Structured Medical Summary

In [27]:
print(json.dumps(medical_summary, indent=2))

{
  "Patient_Name": "Janet Jones",
  "Symptoms": [
    "neck",
    "Pain",
    "pain",
    "Neck pain",
    "back",
    "Back pain"
  ],
  "Diagnosis": "Whiplash injury",
  "Treatment": [
    "physiotherapy",
    "10 physiotherapy sessions"
  ],
  "Current_Status": "Occasional backache",
  "Prognosis": "Full recovery expected within six months"
}


### Questions

##### 1. How would you handle ambiguous or missing medical data in the transcript?

Ambiguous data is common in real-world clinical conversations.
To handle this safely, I explicitly will mark missing information as "Not mentioned" instead of inferring or hallucinating values. For ambiguous cases, I use contextual rules and confidence-based checks rather than relying solely on model predictions. This ensures clinical reliability and avoids giving incorrect medical assumptions.

##### 2. What pre-trained NLP models would you use for medical summarization?

For medical summarization, I would use domain-specific pre-trained models such as BioBERT or ClinicalBERT as they are trained on biomedical and clinical corpora. In this assignment, I focused on template-based summarization combined with medical NER to maintain explainability and reduce the risk of hallucination, which is important in healthcare NLP.

## 10: Sentiment Analysis (Patient Only)

#### 10.1 Run Sentiment Model

In [28]:
sentiment_result = sentiment_pipeline(patient_text)
sentiment_result


[{'label': 'NEGATIVE', 'score': 0.6796868443489075}]

#### 10.2 Map to Required Labels

In [29]:
def map_sentiment(label):
    if label == "NEGATIVE":
        return "Anxious"
    elif label == "POSITIVE":
        return "Reassured"
    else:
        return "Neutral"

patient_sentiment = map_sentiment(sentiment_result[0]["label"])
patient_sentiment


'Anxious'

## 11: Intent Detection (Rule-Based)

#### 11.1 Intent Function

In [30]:
def detect_intent(text):
    text = text.lower()

    if any(word in text for word in ["worried", "concerned", "hope"]):
        return "Seeking reassurance"
    elif any(word in text for word in ["pain", "hurt", "discomfort"]):
        return "Reporting symptoms"
    elif any(word in text for word in ["relief", "good to hear", "great"]):
        return "Expressing relief"
    else:
        return "Neutral"


#### 11.2 Run Intent Detection

In [31]:
patient_intent = detect_intent(patient_text)
patient_intent


'Reporting symptoms'

## 12: Sentiment & Intent Output (JSON)

In [32]:
sentiment_intent_output = {
    "Sentiment": patient_sentiment,
    "Intent": patient_intent
}

print(json.dumps(sentiment_intent_output, indent=2))


{
  "Sentiment": "Anxious",
  "Intent": "Reporting symptoms"
}


### Questions
##### 1. How would you fine-tune BERT for medical sentiment detection?

To fine-tune BERT for medical sentiment detection, I would start with a pre-trained model like BERT or DistilBERT and fine-tune it on labeled patient-doctor dialogue data. The training data would be annotated with sentiment labels such as anxious, neutral, and reassured. Fine-tuning would involve supervised learning with a classification head, using techniques like early stopping and class balancing to avoid overfitting.

##### 2. What datasets would you use for training a healthcare-specific sentiment model?

I would use publicly available healthcare datasets such as MIMIC-III clinical notes, i2b2 datasets, or patient experience datasets from online health forums. These datasets provide domain-specific language that improves sentiment detection accuracy compared to general-purpose sentiment datasets.

## 13. Soap Note Generation

#### 13.1: Separate Text for SOAP Sections

##### Extract Doctor Text

In [33]:
doctor_text = " ".join(
    seg["text"] for seg in segments if seg["speaker"] == "Doctor"
)


#### 13.2: Build SOAP Note Structure

In [34]:
soap_note = {
    "Subjective": {
        "Chief_Complaint": None,
        "History_of_Present_Illness": None
    },
    "Objective": {
        "Physical_Exam": None,
        "Observations": None
    },
    "Assessment": {
        "Diagnosis": None,
        "Severity": None
    },
    "Plan": {
        "Treatment": None,
        "Follow_Up": None
    }
}


#### 13.3: Populate SUBJECTIVE Section

In [40]:
soap_note["Subjective"]["Chief_Complaint"] = "Neck and back pain"

soap_note["Subjective"]["History_of_Present_Illness"] = (
    "Patient had a car accident, experienced pain for four weeks, "
    "now occasional back pain."
)


#### 13.4: Populating OBJECTIVE Section

In [41]:
if "full range" in doctor_text.lower() or "movement" in doctor_text.lower():
    soap_note["Objective"]["Physical_Exam"] = (
        "Full range of motion in cervical and lumbar spine, no tenderness."
    )
else:
    soap_note["Objective"]["Physical_Exam"] = "No abnormal findings mentioned."

soap_note["Objective"]["Observations"] = (
    "Patient appears in normal health, normal gait."
)


#### 13.5: Populating ASSESSMENT Section

In [42]:
if "whiplash" in transcript.lower() and "back pain" in transcript.lower():
    soap_note["Assessment"]["Diagnosis"] = (
        "Whiplash injury and lower back strain"
    )
else:
    soap_note["Assessment"]["Diagnosis"] = (
        medical_summary["Diagnosis"]
        if medical_summary["Diagnosis"]
        else "Not specified"
    )

soap_note["Assessment"]["Severity"] = "Mild, improving"


#### 13.6: Populate PLAN Section

In [43]:
soap_note["Plan"]["Treatment"] = (
    "Continue physiotherapy as needed, use analgesics for pain relief."
)

soap_note["Plan"]["Follow_Up"] = (
    "Patient to return if pain worsens or persists beyond six months."
)


#### 13.7: Viewing SOAP Note Output

In [44]:
print(json.dumps(soap_note, indent=2))

{
  "Subjective": {
    "Chief_Complaint": "Neck and back pain",
    "History_of_Present_Illness": "Patient had a car accident, experienced pain for four weeks, now occasional back pain."
  },
  "Objective": {
    "Physical_Exam": "Full range of motion in cervical and lumbar spine, no tenderness.",
    "Observations": "Patient appears in normal health, normal gait."
  },
  "Assessment": {
    "Diagnosis": "Whiplash injury and lower back strain",
    "Severity": "Mild, improving"
  },
  "Plan": {
    "Treatment": "Continue physiotherapy as needed, use analgesics for pain relief.",
    "Follow_Up": "Patient to return if pain worsens or persists beyond six months."
  }
}


### Questions
##### 1.How would you train an NLP model to map medical transcripts into SOAP format?

To train an NLP model for SOAP note generation, I would use supervised learning with paired datasets consisting of medical transcripts and their corresponding SOAP notes. A sequence-to-sequence model such as T5 or BART could be fine-tuned to learn the mapping between unstructured dialogue and structured SOAP sections. Domain adaptation using clinical text would be critical for accuracy.

##### 2.What rule-based or deep-learning techniques would improve the accuracy of SOAP note generation?

Accuracy can be improved using a hybrid approach. Rule-based techniques help with deterministic section mapping and ensure clinical safety, while deep-learning models handle linguistic variability and summarization. Combining entity extraction, section classification, and controlled text generation provides a balance between accuracy, explainability, and robustness.

## 14: FINAL COMBINED OUTPUT

In [45]:
final_output = {
    "Medical_Summary": medical_summary,
    "Keywords": keywords,
    "Sentiment_Analysis": sentiment_intent_output,
    "SOAP_Note": soap_note
}

print(json.dumps(final_output, indent=2))


{
  "Medical_Summary": {
    "Patient_Name": "Janet Jones",
    "Symptoms": [
      "neck",
      "Pain",
      "pain",
      "Neck pain",
      "back",
      "Back pain"
    ],
    "Diagnosis": "Whiplash injury",
    "Treatment": [
      "physiotherapy",
      "10 physiotherapy sessions"
    ],
    "Current_Status": "Occasional backache",
    "Prognosis": "Full recovery expected within six months"
  },
  "Keywords": [
    "september 1st",
    "your recovery",
    "patient",
    "painkillers",
    "that time",
    "the accident",
    "medical attention",
    "a physical examination",
    "any x",
    "track",
    "care",
    "it",
    "ten sessions",
    "movement",
    "long-term damage",
    "daily life",
    "any lingering pain",
    "six months",
    "signs",
    "the stiffness",
    "that",
    "they",
    "doctor",
    "ms. jones",
    "a full recovery",
    "degeneration",
    "emergency",
    "any other effects",
    "a week",
    "-",
    "your progress",
    "a strong impact"