# **Challenge: Using LLMs to Structure Clinical Data**

Please see Notebook 6_1 for information on setting up OntoGPT and Ollama.

# 🧠 Hackathon Challenge: Structuring Pathology Reports to Study Graft-versus-Host Disease (GvHD)

## 🧪 Scenario: A Pathologist’s Data Dilemma

You are a **pathologist** working with a large collection of **free-text pathology reports from cancer patients who have undergone bone marrow transplantation**. These reports are rich with clinical insights but currently unstructured and difficult to analyze at scale.

You are particularly interested in studying **Graft-versus-Host Disease (GvHD)** — a serious and potentially life-threatening condition that occurs **when donor immune cells attack the recipient’s tissues**. This complication commonly arises after **allogeneic hematopoietic stem cell transplantation**, a treatment for patients with cancers such as leukemia, lymphoma, and myeloma.

GvHD can affect multiple organs, especially the **skin, liver, and gastrointestinal (GI) tract**. Histopathological findings in reports often describe:

- Apoptosis in epithelial layers  
- Crypt destruction in the GI tract  
- Interface dermatitis in the skin  
- Bile duct injury in the liver  

To better understand disease patterns, treatment outcomes, and potential genomic correlations, **you need structured data**.

---

## 🎯 Your Mission

Use **OntoGPT** and **Ollama** to **extract and structure information** from these free-text pathology reports into a machine-readable format.

Your output should be:

- One **`.txt` file per report**
- In **YAML format**, as output by OntoGPT
- With structured fields that capture relevant clinical, anatomical, and pathological findings

## 🛠️ Tools & Hints

### 🧠 Step 1: Pick Your Ollama Model

Head to [Ollama's model search](https://ollama.com/search) and explore different LLMs.

✅ **Hint:**  
Consider:
- **Model size** (some larger models may be too slow or memory-intensive)
- **Performance on biomedical or reasoning tasks**
- **Instruction-following ability**  

Pull a model with:

```bash
ollama pull [model name]
```
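
Before starting a long batch run, it can help to confirm that the model you picked has actually been pulled. The sketch below is one possible check, not part of the official setup: it assumes `ollama list` prints a header row beginning with `NAME` followed by one model per line with the name in the first column. The helper names `parse_model_names` and `is_pulled` are made up for illustration.

```python
import subprocess


def parse_model_names(list_output):
    """Return the model names found in the text printed by `ollama list`."""
    lines = [ln for ln in list_output.splitlines() if ln.strip()]
    # Assumption: the first row is a header whose first column is "NAME".
    if lines and lines[0].split()[0] == "NAME":
        lines = lines[1:]
    return [ln.split()[0] for ln in lines]


def is_pulled(model, list_output=None):
    """Check whether `model` (e.g. "llama3" or "mistral:7b") is available locally."""
    if list_output is None:
        # Requires the Ollama CLI on your PATH and the Ollama service running.
        list_output = subprocess.run(
            ["ollama", "list"], capture_output=True, text=True, check=True
        ).stdout
    # Ollama uses the "latest" tag when no tag is given.
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in parse_model_names(list_output)
```

Calling `is_pulled("llama3")` queries the local Ollama installation; passing `list_output` lets you test the parsing without Ollama installed.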

#### 🚀 Optional: Using a Fine-Tuned Model (Advanced)

While **not required for this hackathon** (we are only using CPU resources), we want to highlight how **fine-tuning can improve performance** for this task.

A **fine-tuned version of LLaMA 3.1 8B**, specifically trained to generate **accurate pathological statements from medical text**, is available here:

🔗 [UoS-HGIG/MIMIC on Hugging Face](https://huggingface.co/UoS-HGIG/MIMIC)

This model has been fine-tuned on clinical data derived from MIMIC (a freely accessible database of de-identified medical records) and is tailored for tasks like extracting:

- Pathology statements
- Diagnostic phrases
- Structured clinical observations

You can download the fine-tuned weights and **merge them with the base LLaMA 3.1 8B model** to use it via your own infrastructure (GPU recommended).

🧠 **Reminder:** You do *not* need to use this model for the hackathon, but it's a great example of how domain-specific fine-tuning can boost performance in real-world biomedical NLP tasks.


### 🧬 Step 2: Choose the Best OntoGPT Template

Explore [OntoGPT templates](https://github.com/monarch-initiative/ontogpt/blob/main/src/ontogpt/templates).  
These define the schema for what gets extracted.

✅ **Hint:**  
Look for templates suitable for **histopathology**, **clinical reports**, or **disease findings**. 


### 🧪 Step 3: Write a Script to Process the Reports

Now that you’ve selected your Ollama model and OntoGPT template, it’s time to **automate the processing**.

📝 **Task:**
Write a Python script that:

1. Reads all `.txt` input files from the folder: `../CHIFIR_reports`
2. Sends the content of each file to OntoGPT using your chosen template and model
3. Saves the structured OntoGPT output as a new `.txt` file (in YAML format)  
   — use the same filename, but write it to a new folder (e.g., `structured_outputs/`)

✅ **Hint:**  
OntoGPT has a CLI you can call from Python using `subprocess`. For example:

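```bash
ontogpt extract -t histopathology -i input.txt -o output.yaml
```

The same call from Python:

```python
import subprocess

subprocess.run(
    [
        "ontogpt", "extract",
        "-t", "histopathology",       # template name
        "-i", "path/to/input.txt",    # report to process
        "-o", "path/to/output.yaml",  # where OntoGPT writes its output
    ],
    check=True,  # raise CalledProcessError if OntoGPT exits with an error
)
```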
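
The three steps of the task can be combined into a loop over the input folder. The following is a minimal sketch rather than a reference solution: the function names, the skip-if-already-done behaviour, and the `run` parameter (which lets you swap in a dummy runner for a dry run) are illustrative choices. The `-m` option and the `ollama/<model>` naming are assumptions about your OntoGPT version, so confirm them with `ontogpt extract --help` before relying on them.

```python
import subprocess
from pathlib import Path


def build_command(input_path, output_path, template, model=None):
    """Build the `ontogpt extract` call for a single report."""
    cmd = ["ontogpt", "extract", "-t", template,
           "-i", str(input_path), "-o", str(output_path)]
    if model:
        # Assumption: the model is selected with -m (e.g. "ollama/llama3").
        cmd += ["-m", model]
    return cmd


def process_reports(input_dir, output_dir, template, model=None,
                    run=subprocess.run):
    """Run OntoGPT on every .txt report; return the names of reports that failed."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for report in sorted(input_dir.glob("*.txt")):
        target = output_dir / report.name  # same filename, new folder
        if target.exists() and target.stat().st_size > 0:
            continue  # already processed, so an interrupted run can resume
        try:
            run(build_command(report, target, template, model), check=True)
        except subprocess.CalledProcessError:
            failed.append(report.name)  # keep going and retry these later
    return failed


# Example call (template and model names are placeholders):
# failed = process_reports("../CHIFIR_reports", "structured_outputs",
#                          template="histopathology", model="ollama/llama3")
```

Collecting failures instead of stopping at the first error matters on CPU-only machines, where a full run can take a long time and a single malformed report should not cost you the rest of the batch.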


## 🏆 How to Win

The team that produces the **most accurate, complete, and well-structured outputs** from the pathology reports will be crowned the winner!

### 🧪 Judging Criteria:
- 🧠 **Accuracy**: Are the key findings, anatomical sites, and diagnoses correctly extracted?
- 🧾 **Completeness**: Does the YAML output capture all relevant data from the report?

📦 Bonus points for:
- Comparative evaluation of different models
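
One way to make a model comparison concrete, assuming you hand-label the expected findings for a few reports yourself, is to score each model's extracted terms against those labels. The sketch below treats accuracy as precision and completeness as recall; that mapping is an interpretation for illustration, not the judges' official scoring, and `score_extraction` is a hypothetical helper name.

```python
def score_extraction(predicted, gold):
    """Compare extracted terms with hand-labeled terms for one report.

    Terms are matched exactly after lowercasing and trimming whitespace,
    so synonyms ("GI tract" vs "gastrointestinal tract") count as misses.
    """
    pred = {t.strip().lower() for t in predicted if t.strip()}
    ref = {t.strip().lower() for t in gold if t.strip()}
    true_pos = len(pred & ref)
    # Precision: share of extracted terms that are correct.
    precision = true_pos / len(pred) if pred else 0.0
    # Recall: share of expected terms that were found.
    recall = true_pos / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Example with made-up terms for a single report:
scores = score_extraction(
    predicted=["Skin", "apoptosis", "colon"],
    gold=["skin", "apoptosis", "liver", "bile duct injury"],
)
print(scores)
```

Averaging these scores over the same set of labeled reports for each model gives a simple table for comparing models side by side.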

Let the structuring begin — may the cleanest YAML win! 💪
