# Bangla Doctor-Patient Conversation Summarization

This notebook implements a pipeline to summarize Bangla doctor-patient conversations by extracting key entities like symptoms, medications, and duration, and then generating a summary.

In [None]:
# 1. Setup: install the required libraries
# Comment out the lines below if the packages are already installed.
# sentencepiece is needed by the mT5 tokenizer used in the summarization step.
%pip install transformers torch bnlp_toolkit spacy sentencepiece
# The English spaCy model is only an example; Bangla models may need different handling or a custom setup.
!python -m spacy download en_core_web_sm

In [None]:
# Import libraries
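# Note: the NER class name depends on the installed BNLP version; some releases expose it as BengaliNER instead of NER.
from bnlp import BasicTokenizer, NER as BNLP_NER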
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

# For Bangla-BERT NER (optional, requires fine-tuning or specific pre-trained model for medical entities)
# from transformers import AutoModelForTokenClassification

import warnings
warnings.filterwarnings("ignore")

In [None]:
# 2. Input: Example Bangla Doctor-Patient Conversation
conversation_text = """রোগী: ডাক্তার সাহেব, আমার গত তিন দিন ধরে খুব জ্বর ও শরীরে ব্যথা অনুভব করছি। সাথে ঠান্ডাও লেগেছে।
ডাক্তার: আচ্ছা, তাপমাত্রা মেপেছেন? অন্য কোনো উপসর্গ আছে, যেমন কাশি বা গলা ব্যথা?
রোগী: হ্যাঁ, তাপমাত্রা ১০২ ডিগ্রি ফারেনহাইট। হালকা কাশিও আছে। গলা ব্যথা নেই।
ডাক্তার: আমি আপনাকে কিছু ওষুধ দিচ্ছি। প্যারাসিটামল ৫০০মিগ্রা দিনে তিনবার খাবেন, ভরা পেটে। সাথে একটি অ্যান্টিহিস্টামিন দিচ্ছি, রাতে একটা করে খাবেন, সাত দিন। প্রচুর পানি পান করবেন ও বিশ্রাম নেবেন।
রোগী: ধন্যবাদ ডাক্তার সাহেব।
"""

print("Original Conversation:")
print(conversation_text)
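
The transcript labels each line with its speaker (`রোগী:` or `ডাক্তার:`), so later steps can look for symptoms in the patient's turns and prescriptions in the doctor's turns. The following is a minimal sketch of that split; `split_turns` is a hypothetical helper, not part of BNLP or Transformers, and it assumes one turn per line with a colon after the speaker label.

```python
def split_turns(text):
    """Split a transcript into (speaker, utterance) pairs, one per non-empty line."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, sep, utterance = line.partition(":")
        if sep:
            turns.append((speaker.strip(), utterance.strip()))
        elif turns:
            # A line without a speaker label is treated as a continuation of the previous turn.
            prev_speaker, prev_utterance = turns[-1]
            turns[-1] = (prev_speaker, prev_utterance + " " + line)
        # An unlabeled line before any labeled turn is dropped.
    return turns


sample = "রোগী: আমার জ্বর।\nডাক্তার: তাপমাত্রা মেপেছেন?\n"
turns = split_turns(sample)
patient_text = " ".join(utterance for speaker, utterance in turns if speaker == "রোগী")
print(turns)
print(patient_text)
```

Calling `split_turns(conversation_text)` on the example above would give one pair per line of the conversation.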

## 3. Preprocessing & Named Entity Recognition (NER)

We'll use BNLP for tokenization and its pre-trained NER model. For more advanced NER, fine-tuning a model like Bangla-BERT on medical conversations would be beneficial.

In [None]:
# Initialize BNLP tools
tokenizer = BasicTokenizer()
bnlp_ner = BNLP_NER()

# Tokenize the conversation. The tokens are for inspection only; BNLP's NER is given the raw text below.
tokens = tokenizer.tokenize(conversation_text)
# print("\nTokenized Text:")
# print(tokens)

# Perform NER using BNLP
# BNLP's NER may not provide medical tags such as "SYMPTOM", "MEDICATION", or "DURATION" out of the box.
# It typically identifies general entities like Person (PER), Location (LOC), Organization (ORG), Date (DATE), Time (TIME), Money (MONEY), Percent (PERCENT).
# Medical entities therefore have to be mapped or inferred from the general tags, or a custom NER model has to be trained.
# This example falls back on keywords and general tags where direct medical tags are missing.

print("\nBNLP NER Output (General Entities):")
entities_bnlp = bnlp_ner.tag(conversation_text)
print(entities_bnlp)

### 3.1. Custom Keyword-Based Extraction (Simpler Approach for Demo)
Since BNLP's default NER might not directly give us "SYMPTOM", "MEDICATION", "DURATION", we'll use a keyword-based approach for this demonstration. For a robust solution, a custom-trained NER model is recommended.

In [None]:
# Define keywords for extraction (this is a very basic approach)
symptom_keywords = ["জ্বর", "ব্যথা", "কাশি", "ঠান্ডা", "গলা ব্যথা", "মাথাব্যথা"]
medication_keywords = ["প্যারাসিটামল", "অ্যান্টিহিস্টামিন", "নাপা", "এসপিরিন"] # Add more as needed
duration_keywords = ["দিন", "সপ্তাহ", "মাস"] # Often associated with numbers

extracted_symptoms = []
extracted_medications = []
extracted_durations = []

# Simple keyword spotting (can be improved with regex and context analysis).
# This is a placeholder for more sophisticated NER or keyword extraction.
words = conversation_text.replace('\n', ' ').split()  # Split on any run of whitespace

for i, word in enumerate(words):
    # Strip surrounding punctuation, including the Bangla danda (।) and the colon after speaker labels
    cleaned_word = word.strip(',.?!।:')
    
    if cleaned_word in symptom_keywords:
        extracted_symptoms.append(cleaned_word)
    
    if cleaned_word in medication_keywords:
        # Try to capture dosage if available (e.g., "৫০০মিগ্রা")
        med_info = cleaned_word
        if i + 1 < len(words) and ("মিগ্রা" in words[i+1] or "mg" in words[i+1].lower()):
            med_info += " " + words[i+1]
        extracted_medications.append(med_info)
        
    if cleaned_word in duration_keywords:
        # Capture the quantity before "দিন", "সপ্তাহ", etc.: either digits or one of a few number words.
        # str.isdigit() also accepts Bangla digits such as "৩".
        # This can be made more robust with full Bangla number parsing.
        if i > 0 and (words[i-1].isdigit() or words[i-1] in ("এক", "দুই", "তিন")):
            extracted_durations.append(words[i-1] + " " + cleaned_word)
        
# Add durations found by BNLP, assuming it tags time expressions like "তিন দিন" as DATE.
# Adjust the comparison if your BNLP version emits prefixed tags (for example B-/I- style).
for entity, tag in entities_bnlp:
    # Keep only DATE entities that contain a duration keyword, skipping duplicates
    if (tag == "DATE"
            and any(d_keyword in entity for d_keyword in duration_keywords)
            and entity not in extracted_durations):
        extracted_durations.append(entity)


# Remove duplicates while keeping first-seen order (set() would not preserve the order)
extracted_symptoms = list(dict.fromkeys(extracted_symptoms))
extracted_medications = list(dict.fromkeys(extracted_medications))
extracted_durations = list(dict.fromkeys(extracted_durations))


print("\n--- Extracted Information (Keyword-Based) ---")
print(f"Symptoms: {extracted_symptoms}")
print(f"Medications: {extracted_medications}")
print(f"Durations: {extracted_durations}")
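
The comments above suggest improving the keyword spotting with regular expressions. One way this could look for durations is sketched below. `NUMBER_WORDS`, `DURATION_RE`, and `extract_durations` are hypothetical names introduced for illustration, and the number-word list is deliberately short and non-exhaustive. The pattern relies on Python's `\d` matching Bangla digits in `str` patterns, and it does no word-boundary checking, so it can over-match inside longer words.

```python
import re

# Hypothetical, non-exhaustive list of Bangla number words for this sketch.
NUMBER_WORDS = ["এক", "দুই", "তিন", "চার", "পাঁচ", "সাত", "আট", "দশ"]
DURATION_UNITS = ["দিন", "সপ্তাহ", "মাস"]

# A quantity (digits or a number word), optional whitespace, then a unit.
# In Python 3 str patterns, \d also matches Bangla digits such as ২.
DURATION_RE = re.compile(
    r"(\d+|" + "|".join(NUMBER_WORDS) + r")\s*(" + "|".join(DURATION_UNITS) + r")"
)


def extract_durations(text):
    """Return every 'quantity unit' phrase found in text, in order of appearance."""
    return [f"{quantity} {unit}" for quantity, unit in DURATION_RE.findall(text)]


sample = "আমার গত তিন দিন ধরে জ্বর। ওষুধ সাত দিন খাবেন। ২ সপ্তাহ পরে আসবেন।"
durations = extract_durations(sample)
print(durations)
```

Unlike the token loop, this approach does not depend on punctuation being stripped from each word first, since the pattern matches inside the raw text.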

### 3.2. (Optional) NER using Hugging Face Transformers (e.g., Bangla-BERT)
For better accuracy, you would fine-tune a model like `csebuetnlp/banglabert` on a Bangla medical NER dataset.
If a pre-trained Bangla medical NER model is available, you can use it directly.

In [None]:
# # Example placeholder for using a Hugging Face NER pipeline
# # You would need a model fine-tuned for Bangla medical NER.
# # A dedicated medical model (e.g. a hypothetical 'some-bangla-medical-ner-model') would be loaded the same way.
# try:
#     # aggregation_strategy="simple" replaces the deprecated grouped_entities=True
#     ner_pipeline_hf = pipeline("ner", model="sagorsarker/bangla-bert-ner", tokenizer="sagorsarker/bangla-bert-ner", aggregation_strategy="simple")
#     # The model above is a general Bangla NER model, so it might not have specific medical entity types.
#     # It might identify entity types like PER, LOC, ORG, etc.
#     # You would need to map these or use a model specifically trained for SYMPTOM, MEDICATION, DURATION.
#     hf_entities = ner_pipeline_hf(conversation_text)
#     print("\n--- Hugging Face NER Output (Example with sagorsarker/bangla-bert-ner) ---")
#     print(hf_entities)

#     # Process hf_entities to extract symptoms, medications, durations based on its entity types
#     # This part is highly dependent on the specific model's output and entity schema.
#     # e.g., if it had a 'MED' tag:
#     # extracted_medications_hf = [entity['word'] for entity in hf_entities if entity['entity_group'] == 'MED']
#     # print(f"Medications (from HF NER): {extracted_medications_hf}")

# except Exception as e:
#     print(f"Could not load Hugging Face NER model. This is an optional step. Error: {e}")
#     print("Skipping Hugging Face NER.")
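
When a token-classification model returns per-token tags in B-/I-/O style, the tokens have to be merged into entity spans before they can be mapped to symptoms, medications, or durations. The sketch below shows one way to do that merge in plain Python. `group_bio_tags` is a hypothetical helper, and the `MED` and `DUR` labels are illustrative only; no model mentioned in this notebook is known to emit them.

```python
def group_bio_tags(tagged_tokens):
    """Merge (token, tag) pairs using B-/I-/O tags into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        # Close the span being built, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in tagged_tokens:
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag that does not continue the open span, starts a new span.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" (or any unrecognized tag) ends the open span.
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans


# Illustrative tags only: MED and DUR are assumed labels for this sketch.
tagged = [
    ("প্যারাসিটামল", "B-MED"),
    ("৫০০মিগ্রা", "I-MED"),
    ("দিনে", "O"),
    ("তিন", "B-DUR"),
    ("দিন", "I-DUR"),
]
entities = group_bio_tags(tagged)
print(entities)
```

With `aggregation_strategy` set on a Hugging Face pipeline this grouping is done for you; the helper is mainly useful for taggers that return raw token-level tags.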

## 4. Conversational Summarization
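We will use a pre-trained sequence-to-sequence model (like mT5) for summarization, with an input string built from the extracted entities. Note that the `google/mt5-small` checkpoint is pre-trained only and has not been fine-tuned for summarization, so this cell demonstrates the wiring of the pipeline rather than usable summary quality.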

In [None]:
# Prepare input for the summarization model
# Using the keyword-based extracted entities for this demo
if not extracted_symptoms and not extracted_medications and not extracted_durations:
    summary_input_text = "রোগীর সাধারণ কথোপকথন।" # Fallback if no entities extracted
else:
    summary_input_text = "রোগীর লক্ষণসমূহ: " + ", ".join(extracted_symptoms) + \
                         "। পরামর্শকৃত ঔষধ: " + ", ".join(extracted_medications) + \
                         "। সময়কাল: " + ", ".join(extracted_durations) + "।"

print("\n--- Input for Summarization Model ---")
print(summary_input_text)

# Load a pre-trained sequence-to-sequence model (mT5-small, chosen for multilingual coverage and quick execution in a demo).
# This checkpoint is pre-trained only and not fine-tuned for summarization, so its raw output is not expected to be a usable summary.
# For real Bangla summarization, use a model fine-tuned on Bangla text or medical summaries.
summarizer_model_name = "google/mt5-small"
try:
    summarizer_tokenizer = AutoTokenizer.from_pretrained(summarizer_model_name)
    summarizer_model = AutoModelForSeq2SeqLM.from_pretrained(summarizer_model_name)
    
    # Create the summarization pipeline
    summarization_pipeline = pipeline("summarization", model=summarizer_model, tokenizer=summarizer_tokenizer)
    
    # Generate summary
    # Unlike the original T5, mT5 was pre-trained without supervised tasks, so a task prefix
    # such as "summarize: " only helps if the model was fine-tuned with that prefix.
    # max_length and min_length bound the summary length and can be adjusted.
    summary = summarization_pipeline(summary_input_text, max_length=100, min_length=10, do_sample=False)

    print("\n--- Generated Summary (mT5-small) ---")
    print(summary[0]['summary_text'])

except Exception as e:
    print(f"Error during summarization with {summarizer_model_name}: {e}")
    print("Please ensure the model name is correct and you have an internet connection.")
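
Because `google/mt5-small` is a pre-trained checkpoint that has not been fine-tuned for summarization, a deterministic template built from the extracted entities can serve as a fallback until a fine-tuned model is available. The sketch below is one possible version; `build_template_summary` and `FALLBACK_SUMMARY` are hypothetical names, and the Bangla labels mirror the ones used for the model input above. Unlike the plain string concatenation, it omits any field that has no extracted values.

```python
FALLBACK_SUMMARY = "কোনো তথ্য পাওয়া যায়নি।"  # "No information found."


def build_template_summary(symptoms, medications, durations):
    """Build a fixed-format Bangla summary, skipping fields with no extracted values."""
    fields = [
        ("রোগীর লক্ষণসমূহ", symptoms),
        ("পরামর্শকৃত ঔষধ", medications),
        ("সময়কাল", durations),
    ]
    parts = [f"{label}: {', '.join(values)}" for label, values in fields if values]
    if not parts:
        return FALLBACK_SUMMARY
    # Join the fields with the Bangla danda and close the last sentence with one.
    return "। ".join(parts) + "।"


template_summary = build_template_summary(["জ্বর", "কাশি"], [], ["তিন দিন"])
print(template_summary)
```

In the notebook, the call would be `build_template_summary(extracted_symptoms, extracted_medications, extracted_durations)`.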


## 5. Notes and Further Improvements
*   **NER Model**: The keyword-based extraction is very basic. For robust performance, fine-tune a transformer model (like `csebuetnlp/banglabert` or `ai4bharat/indic-bert`) on a custom Bangla medical NER dataset. The dataset should have annotations for SYMPTOM, MEDICATION, DURATION, etc.
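*   **Summarization Model**: `google/mt5-small` is a general multilingual model that is pre-trained only, so it needs fine-tuning before it produces usable summaries. For higher quality summaries, fine-tuning mT5 or mBART on Bangla conversational data, especially medical dialogues, is recommended. For mBART, choose a checkpoint whose language list includes Bengali, such as `facebook/mbart-large-50`; the 25 languages of `facebook/mbart-large-cc25` do not include it.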
*   **Contextual Understanding**: More advanced techniques can be used to link symptoms to specific medications or durations if multiple are mentioned.
*   **Preprocessing**: Advanced Bangla text normalization and cleaning can improve the performance of both NER and summarization.
*   **BNLP NER Tags**: The default BNLP NER tags might not directly map to medical entities. You might need to analyze its output for `DATE`, `NUMBER`, `MISC` tags and apply rules, or extend BNLP with custom rules/models.
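
As a starting point for the preprocessing note above, the sketch below applies two basic normalization steps: Unicode NFC normalization, so that letters with more than one possible encoding compare equal, and whitespace collapsing. It also converts Bangla digits to ASCII for numeric handling of values such as temperatures and dosages. `normalize_bangla` and `to_ascii_digits` are hypothetical helpers that use only the standard library; a dedicated Bangla normalizer would cover many more cases.

```python
import unicodedata

BANGLA_DIGITS = "০১২৩৪৫৬৭৮৯"
DIGIT_TABLE = str.maketrans(BANGLA_DIGITS, "0123456789")


def normalize_bangla(text):
    """Apply Unicode NFC normalization and collapse runs of whitespace to single spaces."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def to_ascii_digits(text):
    """Replace Bangla digits with their ASCII equivalents, leaving other characters as-is."""
    return text.translate(DIGIT_TABLE)


cleaned = normalize_bangla("জ্বর   ও \n কাশি")
temperature = to_ascii_digits("১০২ ডিগ্রি")
print(cleaned)
print(temperature)
```

Running the normalization before keyword matching helps ensure that the keyword lists and the conversation text use the same encoding for each letter.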