**Basic pipeline**
1. Extract Information
From the .docx prescription, the loader extracts the text of every non-empty paragraph.

From the .json lab file, it pulls four key lab values:

- HbA1c
- Fasting Blood Sugar
- Postprandial Blood Sugar
- Creatinine
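
As a minimal sketch of the lab-value step, assuming the JSON layout that the notebook's `load_lab_data` and `extract_lab_features` expect (a top-level `lab_results` object whose entries each carry a `value` field). The sample record and the helper name `read_lab_values` are hypothetical, and the numbers are made up for illustration:

```python
import json

# Hypothetical sample record; the numbers are made up for illustration.
sample = """
{
  "lab_results": {
    "HbA1c": {"value": 7.2},
    "Fasting Blood Sugar": {"value": 130},
    "Creatinine": {"value": 1.1}
  }
}
"""

LAB_KEYS = ["HbA1c", "Fasting Blood Sugar", "Postprandial Blood Sugar", "Creatinine"]

def read_lab_values(raw_json, keys=LAB_KEYS, default=0.0):
    """Return the requested lab values in a fixed order, using `default` for missing tests."""
    results = json.loads(raw_json)["lab_results"]
    return [float(results.get(k, {}).get("value", default)) for k in keys]

values = read_lab_values(sample)
print(values)  # [7.2, 130.0, 0.0, 1.1]
```

The fixed key order matters: the network sees only positions, so every record must place the same test at the same index.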

2. Generate Features
The text is passed through a pre-trained BERT model (bert-base-uncased), with input truncated to 512 tokens. The embedding of the [CLS] token, a 768-dimensional vector, serves as the representation of the medical note.

The four lab values are packed into a float tensor of length 4; a missing value defaults to 0.
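
The notebook feeds the raw lab values to the network. Because the four values sit on different scales, one common option (not part of the notebook) is to rescale each value before building the tensor. The sketch below does min-max scaling in plain Python; `ASSUMED_RANGES` and `scale_labs` are hypothetical names, and the bounds are placeholders chosen for illustration, not clinical reference ranges:

```python
# Placeholder (low, high) bounds per lab test: illustrative only, NOT clinical reference ranges.
ASSUMED_RANGES = {
    "HbA1c": (4.0, 14.0),
    "Fasting Blood Sugar": (50.0, 300.0),
    "Postprandial Blood Sugar": (50.0, 400.0),
    "Creatinine": (0.2, 5.0),
}

def scale_labs(values_by_name, ranges=ASSUMED_RANGES):
    """Min-max scale each lab value into [0, 1], clipping values outside the assumed bounds."""
    scaled = []
    for name, (low, high) in ranges.items():
        raw = values_by_name.get(name, low)  # a missing test falls back to the lower bound
        fraction = (raw - low) / (high - low)
        scaled.append(min(1.0, max(0.0, fraction)))
    return scaled

features = scale_labs({"HbA1c": 9.0, "Fasting Blood Sugar": 175.0, "Creatinine": 9.9})
print(features)  # [0.5, 0.5, 0.0, 1.0]
```

The resulting list could be passed to `torch.tensor(features, dtype=torch.float32)` in place of the raw values.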

3. Classify Risk
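The text embedding and the lab tensor are each projected to a hidden vector (128 dimensions by default), and the two projections are concatenated inside a neural network.

Two further linear layers with ReLU activations map the concatenated vector to two logits, and the larger logit determines the prediction:

- 0 for Low Risk
- 1 for High Risk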

4. Train the Model
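A small simulated dataset is used here for demo purposes: three copies of the same file pair with randomly assigned labels, so the reported numbers say nothing about real performance.

The model is trained with a basic training loop for 5 epochs; the average loss and the accuracy on the training data are printed after each epoch.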
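
In the notebook's code, accuracy is computed on the same dataloader used for training, so the printed figure says little about generalization. A minimal sketch of holding out part of the file list for evaluation, assuming the `(docx_path, json_path, label)` tuples used in the notebook; `split_file_list` and the file names are hypothetical:

```python
import random

def split_file_list(file_list, val_fraction=0.2, seed=0):
    """Shuffle a copy of the list with a fixed seed and split it into train and validation parts."""
    items = list(file_list)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]

# Hypothetical file names and labels, for illustration only.
files = [(f"patient_{i}.docx", f"patient_{i}.json", i % 2) for i in range(10)]
train_files, val_files = split_file_list(files)
print(len(train_files), len(val_files))  # 8 2
```

Each part can then be wrapped in its own `RiskDataset` and `DataLoader`, with `evaluate_model` called on the validation loader.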

5. Make Predictions
After training, the model can:

- Take a new .docx + .json pair.
- Predict whether that patient is at Low or High Risk.
- Show the confidence (probabilities) for each class.
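
The confidence values come from applying softmax to the model's two output logits. A minimal plain-Python sketch of that last step; the logits are made-up numbers and `predict_from_logits` is a hypothetical helper, not a function from the notebook:

```python
import math

LABELS = ["Low Risk", "High Risk"]

def predict_from_logits(logits):
    """Turn raw logits into (label, probabilities) using a numerically stable softmax."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

label, probs = predict_from_logits([-0.5, 1.2])
print(label, [round(p, 3) for p in probs])
```

The probabilities always sum to 1, so with two classes the second value alone carries the full information; both are shown only for readability.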

In [None]:
!pip install python-docx
!pip install transformers
!pip install torch torchvision


In [None]:
# ===========================================
#  MULTI-MODAL MEDICAL RISK CLASSIFIER
# ===========================================

import json
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertModel
from docx import Document
import os
import random

# ===========================================
#  DATA LOADING UTILITIES
# ===========================================

def load_prescription_text(docx_path):
    """Extract clean text from prescription DOCX"""
    doc = Document(docx_path)
    return "\n".join([p.text for p in doc.paragraphs if p.text.strip()])


def load_lab_data(json_path):
    """Load and return lab result dictionary from JSON"""
    with open(json_path, 'r') as f:
        return json.load(f)["lab_results"]


def extract_lab_features(lab_data):
    """Convert specific lab values into a fixed-size tensor"""
    keys = ["HbA1c", "Fasting Blood Sugar", "Postprandial Blood Sugar", "Creatinine"]
    values = [lab_data.get(k, {}).get("value", 0) for k in keys]
    return torch.tensor(values, dtype=torch.float32)


# ===========================================
#  TOKENIZER & BERT
# ===========================================

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")

def get_text_embedding(text):
    """Returns CLS token embedding from BERT"""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert_model(**inputs)
    return outputs.last_hidden_state[:, 0, :]  # CLS token


# ===========================================
#  MULTI-MODAL MODEL
# ===========================================

class MultiModalRiskClassifier(nn.Module):
    def __init__(self, text_dim=768, lab_dim=4, hidden_dim=128):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.lab_proj = nn.Linear(lab_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2)  # Binary output
        )

    def forward(self, text_emb, lab_features):
        text_out = self.text_proj(text_emb)
        lab_out = self.lab_proj(lab_features)
        combined = torch.cat([text_out, lab_out], dim=1)
        return self.classifier(combined)


# ===========================================
#  DUMMY DATASET FOR TRAINING DEMO
# ===========================================

class RiskDataset(Dataset):
    def __init__(self, file_list):
        self.file_list = file_list

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        docx_path, json_path, label = self.file_list[idx]
        text = load_prescription_text(docx_path)
        labs = load_lab_data(json_path)
        lab_feats = extract_lab_features(labs)
        text_emb = get_text_embedding(text).squeeze(0)  # [768]
        return text_emb, lab_feats, torch.tensor(label, dtype=torch.long)


# ===========================================
#  TRAINING LOOP
# ===========================================

def train_model(model, dataloader, optimizer, criterion, device):
    model.train()
    total_loss = 0
    for text_emb, lab_feats, labels in dataloader:
        text_emb, lab_feats, labels = text_emb.to(device), lab_feats.to(device), labels.to(device)

        optimizer.zero_grad()
        logits = model(text_emb, lab_feats)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(dataloader)


def evaluate_model(model, dataloader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for text_emb, lab_feats, labels in dataloader:
            text_emb, lab_feats = text_emb.to(device), lab_feats.to(device)
            logits = model(text_emb, lab_feats)
            preds = torch.argmax(logits, dim=1)
            correct += (preds == labels.to(device)).sum().item()
            total += labels.size(0)
    return correct / total


# ===========================================
#  SIMULATED TRAINING
# ===========================================

# Simulate a dataset (replace this with real data paths)
simulated_dataset = [
    ("med.docx", "lab_record.json", random.randint(0, 1)),  # random label
    ("med.docx", "lab_record.json", random.randint(0, 1)),
    ("med.docx", "lab_record.json", random.randint(0, 1))
]

# Dataset and DataLoader
dataset = RiskDataset(simulated_dataset)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

# Initialize model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MultiModalRiskClassifier().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Train
print(" Starting Training...")
for epoch in range(5):
    loss = train_model(model, dataloader, optimizer, criterion, device)
    acc = evaluate_model(model, dataloader, device)  # measured on the training data, not a held-out set
    print(f"Epoch {epoch+1} | Loss: {loss:.4f} | Accuracy: {acc:.2f}")

# Inference on new data
model.eval()
text = load_prescription_text("med.docx")
labs = load_lab_data("lab_record.json")
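# Move inputs to the same device as the model; otherwise inference fails on CUDA
text_emb = get_text_embedding(text).to(device)
lab_feats = extract_lab_features(labs).unsqueeze(0).to(device)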
with torch.no_grad():
    logits = model(text_emb, lab_feats)
    probs = torch.softmax(logits, dim=1)
    pred = torch.argmax(probs, dim=1).item()
    print("\n Final Prediction:", ["Low Risk", "High Risk"][pred])
    print(" Probabilities:", probs.tolist())



 Starting Training...
Epoch 1 | Loss: 2.9469 | Accuracy: 0.67
Epoch 2 | Loss: 2.0008 | Accuracy: 0.67
Epoch 3 | Loss: 1.2119 | Accuracy: 0.67
Epoch 4 | Loss: 0.6418 | Accuracy: 0.33
Epoch 5 | Loss: 1.2849 | Accuracy: 0.33

 Final Prediction: High Risk
 Probabilities: [[0.14614054560661316, 0.8538594245910645]]
