This notebook runs a script for medication–diagnosis matching and first-line drug identification using a Hugging Face LLM (DeepSeek-R1-Distill-Llama-8B).

It:
- sets environment variables and logging,
- loads the model/pipeline,
- reads `patients.json`,
- prompts the model to produce a strict **Markdown table** (Medication → Matched Diagnosis, Guideline Compliance, Citation, Notes, Optimisation Recommendation),
- saves raw outputs, and
- parses/saves **per-patient CSVs** for downstream analysis.

> **Use case:** support for managing polypharmacy and multimorbidity (Structured Medication Review).
>
> **Caveat:** LLM outputs should remain adjunctive to clinician judgment and authoritative guidelines (NICE/BNF).

## 1) Environment, Imports, and Logging
Turn off tokenizer parallelism warnings, import standard/third-party libraries, and configure logging.

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Standard-library imports
import time    # for measuring execution time
import logging # for structured logging
import json    # for reading/writing JSON files
import re      # for regular expressions (text cleanup)

# Third-party imports (Hugging Face Transformers)
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Setup logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s") # timestamp, log level, message
logger = logging.getLogger(__name__) # create logger for this script


## 2) Load Model and Build Text-Generation Pipeline
Initialize DeepSeek-R1-Distill-Llama-8B on CPU and create the generation pipeline.


In [3]:
# Choose model
MODEL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

# Load tokenizer (fast implementation enabled)
tokenizer = AutoTokenizer.from_pretrained(MODEL, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token  # pad with EOS token
tokenizer.padding_side = "right"           # pad on the right side

# Load model weights on CPU (can change to "cuda" if GPU available)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="cpu")
model.config.pad_token_id = tokenizer.eos_token_id  # align pad token id

# Create Hugging Face text-generation pipeline
pipe = pipeline(
    "text-generation",   # task type
    model=model,
    tokenizer=tokenizer,
    device_map="cpu",    # running on CPU
    framework="pt"       # use PyTorch backend
)
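
The loading cell notes that `"cpu"` can be swapped for `"cuda"` when a GPU is available. A minimal sketch of making that choice at runtime follows; `pick_device` is a hypothetical helper that is not part of the original notebook, and it falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper still works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Selected device: {device}")
```

The returned string could then be passed as `device_map=device` in place of the hard-coded `"cpu"`, assuming the installed Transformers/Accelerate versions accept a device string there, as the original comment suggests.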



## 3) Load Patients and Preview
Read `patients.json` and print a compact list to verify inputs.


In [None]:
# Open patient data JSON file
with open("patients.json") as f:
    patients = json.load(f)
    
# Display available patients for dictionary format
for i, (patient_id, patient_info) in enumerate(patients.items()):
    # Extract first diagnosis (before the first semicolon)
    first_diagnosis = patient_info['diagnosis'].split(';')[0].strip()
    # Print patient summary line
    print(f"{i+1}. {patient_info['patient_id']} - {patient_info['age']}yr {patient_info['sex']} - {first_diagnosis}")
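
The code in this notebook implies a particular shape for `patients.json`: a dictionary keyed by patient ID, where each record holds `patient_id`, `age`, `sex`, and semicolon-separated `diagnosis` and `medication` strings. A minimal sketch of that layout with a basic field check is shown below; the record is invented for illustration, and `missing_fields` is a hypothetical helper.

```python
import json

# Hypothetical record, made up to illustrate the layout the notebook reads.
sample_json = """
{
  "P001": {
    "patient_id": "P001",
    "age": 72,
    "sex": "F",
    "diagnosis": "Type 2 Diabetes; Hypertension",
    "medication": "Metformin (500 mg); Lisinopril (10 mg)"
  }
}
"""

REQUIRED_FIELDS = ("patient_id", "age", "sex", "diagnosis", "medication")


def missing_fields(record):
    """Return the required keys that are absent from one patient record."""
    return [field for field in REQUIRED_FIELDS if field not in record]


sample_patients = json.loads(sample_json)
for pid, record in sample_patients.items():
    missing = missing_fields(record)
    status = "ok" if not missing else f"missing {missing}"
    print(f"{pid}: {status}")
```

Running a check like this before generation would surface malformed records early rather than as a `KeyError` partway through the patient loop.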


## 4) Prompt Template (Few-Shot Prompt Focused on BNF/NICE Guidelines)
An instruction block with two worked examples that asks the model to return only a Markdown table with fixed columns.


In [None]:
# Prompt template instructing the model to act like a UK clinical pharmacist
# Strictly enforce Markdown table output with fixed columns
medication_diagnosis_prompt_template = """
You are a UK clinical pharmacist.

Review the patient's medications and diagnoses.
For each medication, determine which diagnosis it treats using BNF/NICE guideline only.
If no diagnosis matches, write "No matching diagnosis".
Also provide optimisation recommendations: identify drugs to continue, discontinue, dose-adjust, or switch.

Output ONLY a markdown table with these exact columns:

| Medication | Matched Diagnosis | Guideline Compliance | Citation | Notes | Optimisation Recommendation |
|------------|-------------------|----------------------|----------|-------|---------------------------|

Examples:
Patient 1:
Medications: Metformin; Atorvastatin; Lisinopril
Diagnoses: Type 2 Diabetes; Dyslipidemia; Hypertension

| Medication | Matched Diagnosis | Guideline Compliance | Citation | Notes | Optimisation Recommendation |
|------------|-------------------|----------------------|----------|-------|---------------------------|
| Metformin | Type 2 Diabetes | First-line | NICE NG28 | Standard first-line treatment | CONTINUE - appropriate first-line therapy |
| Atorvastatin | Dyslipidemia | First-line | NICE CG181 | High-intensity statin | CONTINUE - achieving lipid targets |
| Lisinopril | Hypertension | First-line | NICE NG136 | ACE inhibitor recommended | CONTINUE - well tolerated |

Patient 2:
Medications: Warfarin; Digoxin; Furosemide
Diagnoses: Atrial Fibrillation; Heart Failure; Chronic Pain

| Medication | Matched Diagnosis | Guideline Compliance | Citation | Notes | Optimisation Recommendation |
|------------|-------------------|----------------------|----------|-------|---------------------------|
| Warfarin | Atrial Fibrillation | First-line | NICE CG180 | Anticoagulation for stroke prevention | SWITCH - consider DOAC for convenience |
| Digoxin | Heart Failure | Second-line | NICE NG106 | For symptom control in severe HF | CONTINUE - monitor levels |
| Furosemide | Heart Failure | First-line | NICE NG106 | Loop diuretic for fluid management | DOSE-ADJUST - optimize based on symptoms |

Now analyze this patient:
Patient Data:
Medications: {medications}
Diagnoses: {diagnoses}

Do not include any text before or after the table.
Start your response with the table header line exactly as shown above.
"""


## 5) Helper Functions
Format patient fields for the prompt and save raw outputs.


In [None]:
def format_patient_data(patient):
    """Format patient medications and diagnoses for the prompt."""
    medication_list = patient['medication'].split(';')                                  # split on semicolons (spacing may vary)
    medications = [re.sub(r"\s*\(.*?\)", "", med).strip() for med in medication_list]   # drop parenthesised details, trim whitespace
    medications_str = "; ".join(med for med in medications if med)                      # skip empty entries, rejoin with semicolons
    diagnoses_str = patient['diagnosis']                                                # take diagnosis as-is
    return medications_str, diagnoses_str

def save_output(patient_id, output):
    """Write the raw model output for one patient to the results/ folder."""
    os.makedirs("results", exist_ok=True)  # create folder if missing
    with open(f"results/{patient_id}_medication_diagnosis_result.txt", "w") as f:
        f.write(output)
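
To see what the medication cleanup does, the same regular expression can be applied to a sample string. This is a sketch under stated assumptions: the input is invented, and `strip_parenthetical` is a hypothetical name used only here.

```python
import re


def strip_parenthetical(medication_field):
    """Drop "(...)" details from each medication and rejoin with "; "."""
    names = [
        re.sub(r"\s*\(.*?\)", "", med).strip()
        for med in medication_field.split(";")
    ]
    return "; ".join(name for name in names if name)


example = "Metformin (500 mg BD); Atorvastatin (20 mg ON); Aspirin"
print(strip_parenthetical(example))  # Metformin; Atorvastatin; Aspirin
```

Only the drug names reach the prompt, so dose and frequency details in parentheses are not available to the model when it makes recommendations.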


## 6) Generate LLM Outputs for Each Patient
Iterate through all patients, run the model, print each response, and save the raw Markdown tables to the `results/` folder.


In [1]:
# Loop through each patient and run clinical assessment
for patient_id, patient_data in patients.items():
    # Prepare patient-specific meds and diagnoses for the prompt
    medications_str, diagnoses_str = format_patient_data(patient_data)
    prompt = medication_diagnosis_prompt_template.format(
        medications=medications_str,
        diagnoses=diagnoses_str
    )
    logger.info(f"Running medication-diagnosis matching for patient {patient_id}...")
    start = time.time() # track start time
    raw_output = pipe(
        prompt,
        return_full_text=False,
        max_new_tokens=1400,                 # token budget
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id
    )[0]["generated_text"].strip()           # get text output and strip whitespace
    duration = time.time() - start
    logger.info(f"✅ Completed for patient {patient_id} in {duration:.2f}s")

    # Print output to console
    print(f"Medication-Diagnosis Analysis for {patient_id}:\n")
    print(raw_output)
    print("\n" + "="*80 + "\n")

    # Save raw output to file
    save_output(patient_id, raw_output)
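
The loop relies on the pipeline returning a list with one dict per prompt, whose `generated_text` key holds the completion. A minimal sketch of the same call-and-time pattern that runs without downloading the model is given below; `fake_pipe` and `run_once` are stand-ins written for this example, and the canned reply is not real model output.

```python
import time


def fake_pipe(prompt, **generation_kwargs):
    """Stand-in for the text-generation pipeline; returns one canned completion."""
    table = (
        "| Medication | Matched Diagnosis |\n"
        "|------------|-------------------|\n"
        "| Metformin | Type 2 Diabetes |"
    )
    return [{"generated_text": f"  {table}  "}]


def run_once(pipe_fn, prompt):
    """Call the pipeline, time it, and return (stripped text, seconds elapsed)."""
    start = time.time()
    result = pipe_fn(prompt, return_full_text=False, max_new_tokens=64, do_sample=False)
    text = result[0]["generated_text"].strip()
    return text, time.time() - start


output, seconds = run_once(fake_pipe, "Medications: Metformin\nDiagnoses: Type 2 Diabetes")
print(output)
print(f"took {seconds:.4f}s")
```

Swapping a stub like this in for `pipe` is one way to test the saving and parsing steps quickly, since generation with an 8B model on CPU can be slow.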



## 7) Parse Model Output and Save CSVs
Strip stray tags such as `</think>` from the raw output, parse the Markdown table, and export per-patient CSVs to `results_csv/`.


In [None]:
import pandas as pd

def clean_raw_output(text):
    """Remove stray XML-like tags such as </think>; any text between tags is kept."""
    return re.sub(r"</?\w+>", "", text).strip()

def parse_markdown_table(text):
    """Parse the Markdown table into a list of dict rows."""
    rows = []
    for line in text.strip().split("\n"):
        line = line.strip()  # tolerate whitespace around the row, e.g. after the last pipe
        if not re.match(r"\|\s*\w+", line):
            continue  # skips prose, blank lines, and the |----| separator row
        parts = [cell.strip() for cell in line.strip("|").split("|")]  # split columns
        if parts[0].lower() == "medication":
            continue  # skip the header row
        if len(parts) == 6:
            rows.append({
                "medication": parts[0],
                "matched_diagnosis": parts[1],
                "guideline_compliance": parts[2],
                "citation": parts[3],
                "notes": parts[4],
                "optimisation_recommendation": parts[5]
            })
    return rows

# Create output folder if not exists
os.makedirs("results_csv", exist_ok=True)

# Loop through all result text files and generate CSVs
for patient_id, patient_data in patients.items():
    result_path = f"results/{patient_id}_medication_diagnosis_result.txt"
    if not os.path.exists(result_path):
        logger.warning(f"Missing result file for {patient_id}, skipping.")
        continue
        
    with open(result_path, "r") as f:
        raw_output = f.read()
        
    cleaned_output = clean_raw_output(raw_output)             # remove unwanted tags
    structured_results = parse_markdown_table(cleaned_output) # parse Markdown
    if structured_results:
        df = pd.DataFrame(structured_results)                 # build DataFrame if valid rows exist
    else:
        # No parseable table rows were found: write one placeholder row so a CSV is still produced
        df = pd.DataFrame([{
            "medication": "",
            "matched_diagnosis": "",
            "guideline_compliance": "",
            "citation": "",
            "notes": "No matching diagnosis found",
            "optimisation_recommendation": ""
        }])
    csv_path = f"results_csv/{patient_id}_medication_diagnosis_structured.csv"
    df.to_csv(csv_path, index=False)
    logger.info(f"📁 Saved CSV for {patient_id} to {csv_path}")
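
To check the parsing logic without running the model, the cleaning and table-parsing steps can be exercised on a hand-written response. The sketch below re-implements both steps compactly and writes the rows with the standard-library `csv` module instead of pandas; the sample text is invented, and `parse_rows` is a hypothetical name.

```python
import csv
import io
import re

COLUMNS = [
    "medication", "matched_diagnosis", "guideline_compliance",
    "citation", "notes", "optimisation_recommendation",
]

# Invented response: a stray closing tag followed by a two-row table.
sample_output = """</think>
| Medication | Matched Diagnosis | Guideline Compliance | Citation | Notes | Optimisation Recommendation |
|------------|-------------------|----------------------|----------|-------|---------------------------|
| Metformin | Type 2 Diabetes | First-line | NICE NG28 | Standard treatment | CONTINUE |
| Aspirin | No matching diagnosis | N/A | N/A | No indication listed | REVIEW |
"""


def parse_rows(text):
    """Strip XML-like tags, then turn each six-cell table row into a dict."""
    text = re.sub(r"</?\w+>", "", text)
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not re.match(r"\|\s*\w+", line):
            continue  # skips blank lines and the |----| separator
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0].lower() != "medication" and len(cells) == len(COLUMNS):
            rows.append(dict(zip(COLUMNS, cells)))
    return rows


rows = parse_rows(sample_output)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Rows with a cell count other than six are dropped silently, so a response containing an extra `|` inside a cell would lose that row; counting parsed rows against the number of medications is one way to spot this.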


This code was used as part of **Olayemi Bakare's** MSc research work at Queen Mary University of London (Digital Environment and Innovation Lab).