<a href="https://colab.research.google.com/github/data-with-shobhit/Physician-Notetaker/blob/main/Physician_Notetaker.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Medical NLP Summarization**

**Task:** Implement an NLP pipeline to **extract medical details** from the transcribed conversation.

A custom NER component was implemented with spaCy's EntityRuler, using hand-written patterns for the entities expected in this transcript.

In [1]:
import json

import spacy

# Start from a blank English pipeline (tokenizer only, no pretrained components)
nlp = spacy.blank("en")

# Add a rule-based entity recognizer; its patterns are registered below
ruler = nlp.add_pipe("entity_ruler")


patterns = [
    {"label": "SYMPTOM", "pattern": [{"lower": "back"}, {"lower": "pain"}]},
    {"label": "SYMPTOM", "pattern": [{"lower": "neck"}, {"lower": "pain"}]},
    {"label": "SYMPTOM", "pattern": [{"lower": "head"}, {"lower": "ache"}]},
    {"label": "SYMPTOM", "pattern": "neck and back pain"},
    {"label": "SYMPTOM", "pattern": "head ache"},
    {"label": "TREATMENT", "pattern": "painkillers"},
    {"label": "TREATMENT", "pattern": "physiotherapy"},
    {"label": "PERSON", "pattern": "Ms. Jones"},
    {"label": "CURRENT_STATUS", "pattern": "occasional backaches"},
    {"label": "DIAGNOSIS", "pattern": "whiplash injury"},
    {"label": "PROGNOSIS", "pattern": "full recovery within six months"}
]


ruler.add_patterns(patterns)


# print("Patterns in EntityRuler:", ruler.patterns)  # This should NOT be empty

transcript = """
Physician: Good morning, Ms. Jones. How are you feeling today?
Patient: Good morning, doctor. I’m doing better, but I still have some discomfort now and then.
Physician: I understand you were in a car accident last September. Can you walk me through what happened?
Patient: Yes, it was on September 1st, around 12:30 in the afternoon. I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front.
Physician: That sounds like a strong impact. Were you wearing your seatbelt?
Patient: Yes, I always do.
Physician: What did you feel immediately after the accident?
Patient: At first, I was just shocked. But then I realized I had hit my head on the steering wheel, and I could feel pain in my neck and back almost right away.
Physician: Did you seek medical attention at that time?
Patient: Yes, I went to Moss Bank Accident and Emergency. They checked me over and said it was a whiplash injury, but they didn’t do any X-rays. They just gave me some advice and sent me home.
Physician: How did things progress after that?
Patient: The first four weeks were rough. My neck and back pain were really bad—I had trouble sleeping and had to take painkillers regularly. It started improving after that, but I had to go through ten sessions of physiotherapy to help with the stiffness and discomfort.
Physician: That makes sense. Are you still experiencing pain now?
Patient: It’s not constant, but I do get occasional backaches. It’s nothing like before, though.
Physician: That’s good to hear. Have you noticed any other effects, like anxiety while driving or difficulty concentrating?
Patient: No, nothing like that. I don’t feel nervous driving, and I haven’t had any emotional issues from the accident.
Physician: And how has this impacted your daily life? Work, hobbies, anything like that?
Patient: I had to take a week off work, but after that, I was back to my usual routine. It hasn’t really stopped me from doing anything.
Physician: That’s encouraging. Let’s go ahead and do a physical examination to check your mobility and any lingering pain.
[Physical Examination Conducted]
Physician: Everything looks good. Your neck and back have a full range of movement, and there’s no tenderness or signs of lasting damage. Your muscles and spine seem to be in good condition.
Patient: That’s a relief!
Physician: Yes, your recovery so far has been quite positive. Given your progress, I’d expect you to make a full recovery within six months of the accident. There are no signs of long-term damage or degeneration.
Patient: That’s great to hear. So, I don’t need to worry about this affecting me in the future?
Physician: That’s right. I don’t foresee any long-term impact on your work or daily life. If anything changes or you experience worsening symptoms, you can always come back for a follow-up. But at this point, you’re on track for a full recovery.
Patient: Thank you, doctor. I appreciate it.
Physician: You’re very welcome, Ms. Jones. Take care, and don’t hesitate to reach out if you need anything.
"""

doc = nlp(transcript)


# for ent in doc.ents:
#     print(ent.text, ent.label_)

patient_info = {
    "Patient_Name": "",
    "Symptoms": [],
    "Diagnosis": "",
    "Treatment": [],
    "Current_Status": "",
    "Prognosis": ""
}

# Extract the entities from the processed text
for ent in doc.ents:
    if ent.label_ == "PERSON":
        patient_info["Patient_Name"] = ent.text
    elif ent.label_ == "SYMPTOM":
        patient_info["Symptoms"].append(ent.text)
    elif ent.label_ == "DIAGNOSIS":
        patient_info["Diagnosis"] = ent.text
    elif ent.label_ == "TREATMENT":
        patient_info["Treatment"].append(ent.text)
    elif ent.label_ == "CURRENT_STATUS":
        patient_info["Current_Status"] = ent.text
    elif ent.label_ == "PROGNOSIS":
        patient_info["Prognosis"] = ent.text

# Print the structured output
print(json.dumps(patient_info, indent=5))


{
     "Patient_Name": "Ms. Jones",
     "Symptoms": [
          "neck and back pain"
     ],
     "Diagnosis": "whiplash injury",
     "Treatment": [
          "painkillers",
          "physiotherapy"
     ],
     "Current_Status": "occasional backaches",
     "Prognosis": "full recovery within six months"
}


Text Summarization:

In [2]:
from transformers import pipeline

# Load a pre-trained summarization model
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize the transcript
summary = summarizer(transcript, max_length=200, min_length=25, do_sample=False)



print(summary[0]['summary_text'])


Patient: I was driving from Cheadle Hulme to Manchester when I had to stop in traffic. Out of nowhere, another car hit me from behind, which pushed my car into the one in front. I could feel pain in my neck and back almost right away. The first four weeks were rough.
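The `facebook/bart-large-cnn` checkpoint accepts at most 1,024 input tokens, so longer transcripts need to be shortened or split before summarization. Below is a minimal sketch of one way to split a dialogue on speaker-turn boundaries. The helper name `chunk_turns` and the `max_words` budget are hypothetical, and word count is used only as a rough stand-in for the model's token count.

```python
def chunk_turns(transcript: str, max_words: int = 300) -> list[str]:
    """Group consecutive speaker turns into chunks of roughly max_words words.

    A single turn longer than max_words still becomes its own (oversized) chunk.
    """
    chunks, current, count = [], [], 0
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        n = len(line.split())
        # Start a new chunk when adding this turn would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = (
    "Physician: How are you feeling today?\n"
    "Patient: I'm doing better, but I still have some discomfort.\n"
    "Physician: Are you still experiencing pain now?\n"
    "Patient: I do get occasional backaches.\n"
)
for i, chunk in enumerate(chunk_turns(sample, max_words=18)):
    print(i, repr(chunk))
```

Each chunk could then be passed to `summarizer(...)` separately and the partial summaries joined. Whether that yields a better summary than a single pass is something to check on real transcripts.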


Keyword Extraction: Identify important medical phrases (e.g., "whiplash injury," "physiotherapy sessions").

In [6]:
!pip install keybert



In [3]:
doc = transcript  # reuse the transcript string defined in the first cell

In [4]:
from keybert import KeyBERT

# Load the KeyBERT model
kw_model = KeyBERT()

# Define the text
# text = """The patient was diagnosed with whiplash injury and underwent ten physiotherapy sessions.
#           There was initial discomfort, but the condition improved over time."""

# Extract keywords
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), stop_words='english', top_n=5)

# Print extracted keywords
print("Extracted Keywords:", [kw[0] for kw in keywords])


Extracted Keywords: ['accident patient', 'whiplash injury', 'accident physician', 'car accident', 'physician impacted']
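Three of the five phrases above contain "patient" or "physician", words that occur in the transcript only as speaker labels. One possible remedy is to strip the labels before keyword extraction. The sketch below assumes the two label spellings used in this transcript; `strip_speaker_labels` is a hypothetical helper, not part of KeyBERT.

```python
import re

# Matches a leading "Physician:" or "Patient:" label at the start of a line.
SPEAKER_LABEL = re.compile(r"^(?:Physician|Patient):\s*", flags=re.MULTILINE)


def strip_speaker_labels(text: str) -> str:
    """Remove speaker labels and bracketed stage directions from a transcript."""
    text = SPEAKER_LABEL.sub("", text)
    # Blank out lines such as "[Physical Examination Conducted]".
    text = re.sub(r"^\[.*?\]\s*$", "", text, flags=re.MULTILINE)
    return text.strip()


sample = (
    "Physician: Did you seek medical attention at that time?\n"
    "[Physical Examination Conducted]\n"
    "Patient: They said it was a whiplash injury.\n"
)
print(strip_speaker_labels(sample))
```

The cleaned string can be passed to `kw_model.extract_keywords(...)` in place of the raw transcript.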


**📍 Questions:**

- How would you handle **ambiguous or missing medical data** in the transcript?

```
1. Context-Based Imputation

Use BERT-based models (e.g., BioBERT, ClinicalBERT) to infer missing details from the surrounding context.
Example: If the patient says, "I had therapy," but doesn’t specify the type, the model can infer physiotherapy from previous mentions.

2. Rule-Based Fallbacks

If specific medical details are missing (e.g., treatment duration), flag the entry as “Incomplete” instead of making incorrect assumptions.

3. Structured Data Extraction

Use NER models (e.g., scispaCy for diseases and other biomedical entities, Med7 for medication details) to extract symptoms, diagnoses, and treatments even when they are phrased vaguely.
Example: If the transcript mentions “stiffness after the accident”, NER can tag “stiffness” as a symptom, which a downstream rule or entity linker can then map to a musculoskeletal injury.
```
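The rule-based fallback in point 2 can be sketched in a few lines. This assumes a record shaped like the `patient_info` dictionary built earlier; the helper name `flag_incomplete` and the placeholder string are illustrative choices.

```python
def flag_incomplete(record: dict, placeholder: str = "Incomplete") -> dict:
    """Return a copy of record in which empty fields carry an explicit placeholder."""
    flagged = {}
    for field, value in record.items():
        # Empty strings, empty lists, and None all count as missing.
        if value in ("", [], None):
            flagged[field] = placeholder
        else:
            flagged[field] = value
    return flagged


record = {
    "Patient_Name": "Ms. Jones",
    "Symptoms": ["neck and back pain"],
    "Diagnosis": "whiplash injury",
    "Treatment": [],
    "Current_Status": "",
    "Prognosis": "full recovery within six months",
}
print(flag_incomplete(record))
```

Marking a gap explicitly keeps downstream consumers from mistaking an empty string for "nothing to report".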


- What **pre-trained NLP models** would you use for medical summarization?

```
1. BioBERT
  Pre-trained on biomedical texts (PubMed, PMC)
  Well suited to NER and question answering; as an encoder-only model it supports extractive rather than generative summarization
  Use Case: Extracting symptoms, diagnoses, and medications from transcripts

2. ClinicalBERT
  Trained on MIMIC-III (electronic health records)
  Handles clinical language & medical jargon (the training data is clinical notes, not conversations)
  Use Case: Summarizing patient history and clinical progress

3. SciBERT
  Pre-trained on scientific papers, largely biomedical
  Captures scientific and medical vocabulary better than general-domain BERT
  Use Case: Summarization of research papers & clinical trials

4. T5 (Text-to-Text Transfer Transformer) for Summarization
  T5 models fine-tuned on clinical narratives
  Can convert long patient dialogues into structured SOAP notes

  ```

## **2. Sentiment & Intent Analysis**

**Task:** Implement **sentiment analysis** to detect patient concerns and reassurance needs.

In [5]:
from transformers import pipeline

# NOTE: "distilbert-base-uncased" is a base checkpoint without a fine-tuned
# classification head, so its predictions carry generic labels such as
# LABEL_0/LABEL_1 and every input falls through to "Neutral" below.
# A sentiment checkpoint that emits POSITIVE/NEGATIVE is needed for the
# mapping to take effect.
sentiment_pipeline = pipeline("text-classification", model="distilbert-base-uncased")

def analyze_sentiment_intent(text):
    # Get sentiment prediction
    sentiment_result = sentiment_pipeline(text)

    # Rule-based intent detection (custom logic)
    lowered = text.lower()
    if "worried" in lowered or "concerned" in lowered:
        intent = "Seeking reassurance"
    elif "pain" in lowered:
        intent = "Reporting symptoms"
    else:
        intent = "Expressing concern"

    # Map the classifier's label onto the notebook's sentiment categories
    sentiment_label = sentiment_result[0]['label']
    if sentiment_label == "NEGATIVE":
        sentiment = "Anxious"
    elif sentiment_label == "POSITIVE":
        sentiment = "Reassured"
    else:
        sentiment = "Neutral"

    # Return a JSON-serializable dict
    return {
        "Sentiment": sentiment,
        "Intent": intent
    }

# Example
text = "I had a car accident. My neck and back hurt a lot for four weeks"
output = analyze_sentiment_intent(text)
print(output)





{'Sentiment': 'Neutral', 'Intent': 'Expressing concern'}
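The label mapping in `analyze_sentiment_intent` recognizes only `NEGATIVE` and `POSITIVE`; any other label name, including the generic `LABEL_0`/`LABEL_1` that a checkpoint without a fine-tuned head returns, silently becomes "Neutral". A more defensive mapping might look like the sketch below. The function name `map_sentiment`, the `min_score` threshold, and its default value are assumptions for illustration. The input mirrors the `{'label': ..., 'score': ...}` dictionaries returned by a `text-classification` pipeline.

```python
# Assumed mapping from a binary sentiment classifier's labels to the
# categories used in this notebook.
LABEL_TO_SENTIMENT = {"NEGATIVE": "Anxious", "POSITIVE": "Reassured"}


def map_sentiment(prediction: dict, min_score: float = 0.75) -> str:
    """Map one classifier prediction to Anxious / Neutral / Reassured."""
    label = prediction["label"].upper()
    if label not in LABEL_TO_SENTIMENT:
        # Fail loudly instead of reporting "Neutral" for an unknown label set.
        raise ValueError(f"Unexpected label {label!r}; check the model's label names")
    if prediction["score"] < min_score:
        # Low-confidence predictions are treated as neutral.
        return "Neutral"
    return LABEL_TO_SENTIMENT[label]


print(map_sentiment({"label": "NEGATIVE", "score": 0.98}))  # Anxious
print(map_sentiment({"label": "POSITIVE", "score": 0.55}))  # Neutral
```

Raising on unknown labels makes a mismatched checkpoint visible immediately instead of producing uniformly neutral results.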


- How would you fine-tune BERT for medical sentiment detection?

Fine-tuning BERT (or a variant such as BioBERT or ClinicalBERT) for medical sentiment detection means adapting a pre-trained model to a labeled, domain-specific dataset, such as medical conversations or clinical notes. Here’s a step-by-step approach:

```
Steps for Fine-tuning BERT for Medical Sentiment Detection:

1. Choose the Right BERT Model:

BioBERT and ClinicalBERT are pre-trained on biomedical and clinical text, making them suitable for fine-tuning on medical sentiment data.
The general BERT-base model also works, but domain-specific models usually cope better with medical vocabulary.

2. Prepare the Dataset:

A dataset of medical dialogues or clinical notes with sentiment labels (e.g., positive, negative, neutral) is needed. Labels can also be finer-grained emotions such as Anxiety, Anger, Reassurance, or Relief.

3. Preprocess the Data:

Tokenization: Convert the text into tokens using the same tokenizer used during pre-training (for example, the BioBERT tokenizer).
Padding & Truncation: Pad short sequences and truncate long ones to a common maximum length (BERT accepts at most 512 tokens) so that examples can be batched.
Label Encoding: Encode the sentiment labels (e.g., Anxious, Neutral, Reassured) into numerical values for classification.

4. Add a Classification Head:

Add a classification head to the pre-trained model. This typically means a fully connected layer on top of the BERT output.

5. Training:

Define the hyperparameters such as learning rate, batch size, and number of epochs.
Use cross-entropy loss for the multi-class sentiment classification task.
Fine-tune the model on the labeled medical sentiment dataset for a few epochs so that it adapts to the medical language and sentiment patterns.

6. Evaluation:

After training, evaluate the model on a held-out test set to check metrics like accuracy, precision, recall, and F1 score.
```
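Two of the steps above, label encoding and the cross-entropy loss, can be illustrated without any model. The sketch below uses only the standard library. The three label names come from the text, while the logits are made-up numbers standing in for a classification head's output.

```python
import math

# Label encoding: map each sentiment name to an integer id and back.
LABELS = ["Anxious", "Neutral", "Reassured"]
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_id):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[target_id])


logits = [2.0, 0.5, -1.0]  # hypothetical scores for one example
loss = cross_entropy(logits, label2id["Anxious"])
predicted = id2label[max(range(len(logits)), key=logits.__getitem__)]
print(predicted, round(loss, 4))
```

During fine-tuning, this loss is averaged over a batch and minimized with respect to both the classification head and the BERT weights.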

- What datasets would you use for training a healthcare-specific sentiment model?

When fine-tuning BERT for medical sentiment detection, a healthcare-specific dataset is crucial for obtaining meaningful results. Below are some datasets that can be used for training or fine-tuning a sentiment model:

```
Medical Sentiment Datasets:

MIMIC-III (Medical Information Mart for Intensive Care):

Description: A large, de-identified database of health records for ICU patients, including clinical notes. Access is free but requires credentialing through PhysioNet.
Usage: Sentiment can be derived from clinical notes and discharge summaries. The notes are written by clinicians; the database does not contain physician-patient conversation transcripts.
Challenges: Clinical notes are not labeled with sentiment, so they must be labeled manually or with heuristics.

Health Tweets Dataset:

Description: Collections of health-related tweets labeled as positive, negative, or neutral; label availability varies by corpus and should be checked.
Usage: Suitable for social media sentiment analysis related to healthcare.

The Clinical Trials Dataset:

Description: Patient reports from clinical trials with sentiment annotations, where such annotated collections are available.
Usage: Can be used for detecting sentiment in trial reports.

PHEME Dataset:

Description: A corpus of Twitter conversation threads about breaking news events, annotated for rumour detection and veracity rather than sentiment. It is not health-specific.
Usage: Useful mainly as social media text; sentiment labels would have to be added before it can train a sentiment model.

SST-2 (Sentiment Analysis):

Description: While not healthcare-specific, the Stanford Sentiment Treebank (SST-2) provides sentiment labels for movie reviews and is a common starting point for fine-tuning BERT on sentiment classification.
Usage: You can train the model first on SST-2, then fine-tune it on medical data.
```
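Where notes or transcripts carry no sentiment labels, heuristics can provide a first pass of weak labels for later manual review. The sketch below is a keyword-based example. The cue lists are illustrative assumptions, not a validated lexicon, and negation ("I don't feel nervous") is not handled.

```python
import re

# Illustrative cue lists (assumptions, not a validated lexicon).
ANXIOUS_CUES = {"worried", "concerned", "anxious", "afraid", "nervous"}
REASSURED_CUES = {"relief", "relieved", "better", "reassured", "glad"}


def weak_label(text: str) -> str:
    """Assign a provisional sentiment label by counting cue words."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    anxious = len(tokens & ANXIOUS_CUES)
    reassured = len(tokens & REASSURED_CUES)
    if anxious > reassured:
        return "Anxious"
    if reassured > anxious:
        return "Reassured"
    return "Neutral"


for sentence in [
    "I'm worried about my back pain.",
    "That's a relief!",
    "I went to the hospital.",
]:
    print(weak_label(sentence), "<-", sentence)
```

Labels produced this way are noisy, so a manually annotated sample is still needed to estimate their quality before training on them.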


