<center>
  <a href="https://escience.sdu.dk/index.php/ucloud/">
    <img src="https://escience.sdu.dk/wp-content/uploads/2020/03/logo_esc.svg" width="400" height="186" />
  </a>
</center>
<br>
<p style="font-size: 1.2em;">
  This notebook was tested against an instance of <strong>Label Studio v1.16.0</strong> running on UCloud.
</p>


# 02 - Curating a Medical Q&A Dataset with Label Studio

This tutorial guides you through the process of creating a high-quality dataset for a Generative AI model, with a focus on medical Q&A generation. 
You'll learn how to set up projects in Label Studio, import datasets, and configure tasks to streamline the annotation process.

### ✅ **Prerequisites**
- 🚀 Start a Label Studio instance on UCloud:
    - Import the `label-studio` folder in this repository as the Label Studio database directory.
    - [Connect](https://docs.cloud.sdu.dk/guide/submitting.html#connect-to-other-jobs) this Label Studio instance to the **Triton Inference Server** job serving the distributed **Llama 3.1 Nemotron Nano 4B v1.1** model.
        - Use `triton` as the hostname when selecting the job.
- 📘 Launch this notebook in an IDE on UCloud (e.g., JupyterLab or Coder):
    - Ensure the notebook can connect to the previously started Label Studio instance.
        - Use `label-studio` as the hostname when connecting to the job.

## 🛠️ Step 1: Install the Label Studio SDK

In [1]:
!pip install -q datasets==3.4.1 label-studio-sdk==1.0.11 python-dotenv tqdm

## 🛠️ Step 2: Set Up Label Studio

In [2]:
import os
from dotenv import load_dotenv
from label_studio_sdk.client import LabelStudio

# Load environment variables from label-studio/.env
dotenv_path = os.path.join("label-studio", ".env")
load_dotenv(dotenv_path)

# Label Studio URL (fixed by the job hostname) and API key (read from the environment file)
LABEL_STUDIO_URL = "http://label-studio:8080"
API_KEY = os.getenv("API_KEY")

if not API_KEY:
    raise ValueError("API_KEY not found in the environment file!")

# Connect to the Label Studio API
client = LabelStudio(base_url=LABEL_STUDIO_URL, api_key=API_KEY)

## 🛠️ Step 3: Question Generation with MeDAL

The [MeDAL dataset](https://huggingface.co/datasets/medal) is a large medical text dataset curated from over 14 million abstracts from PubMed publications.

We can leverage this dataset to establish context for generating a synthetic Q&A dataset. To begin, we'll set up a Label Studio project for question generation.

In [3]:
medal_question_config = """
<View className="root">
  <Style>
  .root {
    font-family: 'Roboto', sans-serif;
    line-height: 1.6;
    background-color: #f0f0f0;
  }
  .container {
    margin: 0 auto;
    padding: 20px;
    background-color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt {
    padding: 20px;
    background-color: #0084ff;
    color: #ffffff;
    border-radius: 5px;
    margin-bottom: 20px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt-input {
    flex-basis: 49%;
    padding: 20px;
    background-color: rgba(44, 62, 80, 0.9);
    color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    width: 100%;
    border: none;
    font-family: 'Roboto', sans-serif;
    font-size: 16px;
    outline: none;
  }
  .prompt-input:focus {
    outline: none;
  }
  .prompt-input:hover {
    background-color: rgba(52, 73, 94, 0.9);
    cursor: pointer;
    transition: all 0.3s ease;
  }
  .lsf-richtext__line:hover {
    background: unset;
  }
  </Style>
  <Text name="chat" value="$text" layout="dialogue"/>
  <Header value="Question prompt:"/>
  <View className="prompt">
    <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false"/>
  </View>
  <Header value="Proposed questions:"/>
  <TextArea name="response" toName="chat" rows="3" editable="true" maxSubmissions="1" showSubmitButton="false"/>
</View>
"""

medal_questions_project = client.projects.create(
    title='MeDAL Question Generation',
    color='#ECB800',
    description='',
    label_config=medal_question_config
)

Load the dataset and import it into Label Studio. Since the dataset is quite large, we'll upload only a subset of the training examples.

In [None]:
from datasets import load_dataset

medal_train_dataset = load_dataset("medal", split='train', cache_dir="datasets")
medal_validation_dataset = load_dataset("medal", split='validation', cache_dir="datasets")
medal_test_dataset = load_dataset("medal", split='test', cache_dir="datasets")

In [None]:
medal_train_dataset.num_rows

In [None]:
# Insert examples into Label Studio
from tqdm import tqdm

num_examples = 10000

for i in tqdm(range(num_examples), desc="Uploading tasks"):
    task = medal_train_dataset[i]
    client.tasks.create(
        project=medal_questions_project.id,
        data=task
    )
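
Creating tasks one request at a time can be slow for thousands of examples. As a rough sketch, the upload could instead be grouped into chunks and handed to a bulk-import call. The `chunked` and `upload_in_chunks` helpers and the `upload_chunk` callable are hypothetical names introduced for illustration; `upload_chunk` stands in for whatever bulk-import call your SDK version provides and is not part of the original notebook.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    chunk: List[dict] = []
    for item in items:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


def upload_in_chunks(tasks: Iterable[dict], size: int,
                     upload_chunk: Callable[[List[dict]], None]) -> int:
    """Hand tasks to `upload_chunk` chunk by chunk; return how many were sent."""
    sent = 0
    for chunk in chunked(tasks, size):
        upload_chunk(chunk)  # hypothetical: replace with a real bulk-import call
        sent += len(chunk)
    return sent


# Stand-in data and uploader so the sketch runs without a Label Studio server.
demo_tasks = [{"abstract_id": i, "text": f"abstract {i}"} for i in range(7)]
received: List[List[dict]] = []
total = upload_in_chunks(demo_tasks, 3, received.append)
print(total, [len(c) for c in received])
```

Here the uploader simply collects the chunks in a list, which makes the chunk boundaries easy to inspect before pointing the helper at a real server.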


Question quality depends heavily on the prompt. The following prompt is a useful starting point for generating medical questions from MeDAL examples.

```txt
Given a block of medical text, generate several direct, succinct, and unique questions that stand alone, focusing on extracting specific medical information such as symptoms, diagnosis, treatment options, or patient management strategies. Each question should aim to elicit precise and informative responses without requiring additional context. The questions should cover diverse aspects of the medical content to ensure a comprehensive understanding. Ensure each question is clear and formulated to be self-contained. Here are examples to guide your question generation:

What are the common symptoms associated with [specific condition]?
How is [specific condition] diagnosed?
What treatment options are available for [specific condition]?
What are the potential side effects of [specific medication]?
What preventive measures are recommended for [specific condition]?

Use these examples as a template, tailoring questions to different parts of the text to maximize the dataset's utility and accuracy. Questions must be separated by a new line without any markers or numbers. Do not output any text before or after the questions. Generate up to 5 questions.
```

🔧 To set up the ML Backend, open the Label Studio terminal interface and run the commands:
```bash
$ cd /work/label-studio/ml_backend
$ pip install -r requirements.txt
$ source setup_questions.sh
$ gunicorn _wsgi:app --bind 0.0.0.0:9090 --workers 10 --timeout 120 --graceful-timeout 30 --keep-alive 5
```
This will launch a new ML backend configured for generating questions. [Load the ML backend server](https://docs.cloud.sdu.dk/Apps/label-studio.html#load-the-model) into the Label Studio project.

In [7]:
from label_studio_sdk import Client

def get_project_task_ids(label_studio_host, api_token, project_id):
    client = Client(url=label_studio_host, api_key=api_token)
    project = client.get_project(project_id)
    task_ids = project.get_tasks_ids()
    return task_ids

project_id = medal_questions_project.id

tasks_ids = get_project_task_ids(LABEL_STUDIO_URL, API_KEY, project_id)

In [None]:
tasks_ids

In [None]:
import logging
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def retrieve_batch_predictions(project_id, task_ids, api_key):
    """Ask Label Studio to retrieve ML backend predictions for the given task IDs."""
    url = f"{LABEL_STUDIO_URL}/api/dm/actions?id=retrieve_tasks_predictions&project={project_id}"
    headers = {
        'Authorization': f'Token {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'selectedItems': {
            'all': False,
            'included': task_ids
        }
    }
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # surface HTTP errors so the failed batch gets logged
    return response.json()

batch_size   = 15 # compare with Triton Server MAX_BATCH_SIZE
start_index  = 0
end_index    = 10000
max_workers  = 8   # tune this based on how many parallel requests your server can handle
project_id = medal_questions_project.id

def submit_batches(ids):
    return retrieve_batch_predictions(project_id, ids, API_KEY)

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = []
    for i in range(start_index, end_index, batch_size):
        batch_ids = tasks_ids[i : min(i + batch_size, end_index)]
        futures.append(executor.submit(submit_batches, batch_ids))

    # collect results (or just wait for them)
    for future in as_completed(futures):
        try:
            result = future.result()
            # do something with result, e.g. log success
        except Exception as e:
            # handle per‑batch failures
            logging.exception(f"Batch failed: {e}")
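
The loop above slices `tasks_ids` in steps of `batch_size`. Below is a minimal sketch of that slicing, isolated as a pure function so it can be checked without a server. `make_batches` is a hypothetical helper, not part of the notebook or the SDK, and the task IDs are made up. Unlike the loop above, it clamps `end` to the number of available IDs, so no empty batches are produced when `end_index` is larger than the task list.

```python
from typing import List, Sequence


def make_batches(ids: Sequence[int], batch_size: int, start: int, end: int) -> List[List[int]]:
    """Split ids[start:end] into consecutive batches of at most `batch_size` IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    end = min(end, len(ids))  # never slice past the available IDs
    return [list(ids[i:min(i + batch_size, end)]) for i in range(start, end, batch_size)]


demo_ids = list(range(101, 141))  # 40 made-up task IDs
batches = make_batches(demo_ids, batch_size=15, start=0, end=10000)
print([len(b) for b in batches])
```

Each resulting batch could then be passed to `executor.submit(submit_batches, batch)` in the same way as in the cell above.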

>💬 **Note:**
>
>Use the Label Studio interface to **review and finalize annotations** for the predicted questions. These annotated questions will then be used to **automatically generate corresponding answers**.

## 🛠️ Step 4: Answer Generation with MeDAL

The final step involves setting up a project for answer generation using the questions created in the previous step.

We'll create a new project, export the questions generated in the previous section, and generate answers in Label Studio.

In [11]:
medal_answer_config = '''
<View className="root">
  <Style>
  .root {
    font-family: 'Roboto', sans-serif;
    line-height: 1.6;
    background-color: #f0f0f0;
  }
  .container {
    margin: 0 auto;
    padding: 20px;
    background-color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt {
    padding: 20px;
    background-color: #0084ff;
    color: #ffffff;
    border-radius: 5px;
    margin-bottom: 20px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
  }
  .prompt-input {
    flex-basis: 49%;
    padding: 20px;
    background-color: rgba(44, 62, 80, 0.9);
    color: #ffffff;
    border-radius: 5px;
    box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
    width: 100%;
    border: none;
    font-family: 'Roboto', sans-serif;
    font-size: 16px;
    outline: none;
  }
  .prompt-input:focus {
    outline: none;
  }
  .prompt-input:hover {
    background-color: rgba(52, 73, 94, 0.9);
    cursor: pointer;
    transition: all 0.3s ease;
  }
  .lsf-richtext__line:hover {
    background: unset;
  }
  </Style>
  <Text name="chat" value="$text" layout="dialogue"/>
  <Header value="Answer prompt:"/>
  <View className="prompt">
    <TextArea name="prompt" toName="chat" rows="4" editable="true" maxSubmissions="1" showSubmitButton="false"/>
  </View>
  <Header value="Proposed answer:"/>
  <TextArea name="response" toName="chat" rows="3" editable="true" maxSubmissions="1" showSubmitButton="false"/>
</View>
    '''


medal_answers_project = client.projects.create(
    title='MeDAL Answer Generation',
    color='#617ADA',
    description='',
    label_config=medal_answer_config
)

Export questions from our previous project.

In [12]:
from label_studio_sdk.data_manager import Filters, Column, Type, Operator

filters = Filters.create(Filters.AND, [
    Filters.item(
        Column.completed_at,
        Operator.EMPTY,
        Type.Boolean,
        Filters.value(False)
    )
])

In [13]:

view = client.views.create(
    project=medal_questions_project.id,
    data={
        'title': 'Annotated Tasks',
        'filters': filters
    }
)
tab = client.views.get(id=view.id)

In [None]:
# Download questions from Label Studio
annotated_tasks = list(
    client.tasks.list(
        view=tab.id,
        fields='all',
        page_size=100
    )
)

questions_tasks = annotated_tasks
print(len(questions_tasks))

In [None]:
questions_tasks[0].annotations[0]['result']

In [None]:
questions_tasks[0].annotations[0]['result'][0]['value']['text'][0].split('\n')

Format as a Hugging Face dataset.

In [18]:
import re
from datasets import Dataset

# Extract questions
def extract_questions_data(questions_tasks):
    data = []
    for task in questions_tasks:
        for result in task.annotations[0]['result']:
            if result['from_name'] == 'response':
                # Extract the abstract_id
                abstract_id = task.data['abstract_id']
                
                # Extract the question text and split by newlines to handle multiple questions
                questions = result['value']['text'][0].split('\n')
                
                # Store each question with its corresponding abstract_id
                for question in questions:
                    # Check if the question is not empty and contains at least one alphanumeric character
                    if question.strip() and re.search('[a-zA-Z0-9]', question):
                        data.append({'abstract_id': abstract_id, 'text': question})
                break
    return data

extracted_questions_data = extract_questions_data(questions_tasks)

questions_dataset = Dataset.from_dict({'abstract_id': [item['abstract_id'] for item in extracted_questions_data], 
                             'text': [item['text'] for item in extracted_questions_data]})
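
To make the filtering rule concrete, here is a small standalone sketch of the split-and-filter step applied to a made-up model response. `split_questions` is a hypothetical helper that mirrors the logic above, with one assumed extra step: it also strips surrounding whitespace from each question, which the cell above does not do.

```python
import re
from typing import List


def split_questions(raw: str) -> List[str]:
    """Split a newline-separated response into non-empty questions."""
    questions = []
    for line in raw.split("\n"):
        line = line.strip()
        # Keep only lines that contain at least one alphanumeric character.
        if line and re.search(r"[a-zA-Z0-9]", line):
            questions.append(line)
    return questions


# Made-up response: two questions, a blank line, and a separator line.
sample = "What causes anemia?\n\n  How is anemia diagnosed?  \n---\n"
print(split_questions(sample))
```

Blank lines and lines made only of punctuation, such as the `---` separator, are dropped, so each remaining entry can become its own task in the answers project.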


Review our dataset and insert it into our answers project.

In [None]:
questions_dataset

In [20]:
# Upload the dataset to our Answers Project
for question in questions_dataset: 
    client.tasks.create(
        project=medal_answers_project.id,
        data=question
    )

As with question curation, answer generation needs a strong prompt. Here is a sample prompt that can be used.

```txt
You are a medical expert. Answer the following question using only the information provided in the accompanying text. Follow these strict rules:

- Output only the final answer.
- Do not restate the question.
- Do not explain, elaborate, speculate, or add context.
- Do not add formatting, markdown, notes, or instructions.
- Only use content explicitly stated in the text.
```

🔧 To set up the ML Backend for answer generation, first stop the backend server that was previously used for question generation.
Then, open a terminal in the Label Studio environment and run the following commands:
```bash
$ cd /work/label-studio/ml_backend
$ source setup_answers.sh
$ gunicorn _wsgi:app --bind 0.0.0.0:9090 --workers 10 --timeout 120 --graceful-timeout 30 --keep-alive 5
```
This will launch a new ML backend configured for generating answers. Again, connect the backend to the project.

In [21]:
from label_studio_sdk import Client

def get_project_task_ids(label_studio_host, api_token, project_id):
    client = Client(url=label_studio_host, api_key=api_token)
    project = client.get_project(project_id)
    task_ids = project.get_tasks_ids()
    return task_ids

project_id = medal_answers_project.id

tasks_ids = get_project_task_ids(LABEL_STUDIO_URL, API_KEY, project_id)

In [None]:
tasks_ids

In [23]:
import logging
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

def retrieve_batch_predictions(project_id, task_ids, api_key):
    """Ask Label Studio to retrieve ML backend predictions for the given task IDs."""
    url = f"{LABEL_STUDIO_URL}/api/dm/actions?id=retrieve_tasks_predictions&project={project_id}"
    headers = {
        'Authorization': f'Token {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'selectedItems': {
            'all': False,
            'included': task_ids
        }
    }
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()  # surface HTTP errors so the failed batch gets logged
    return response.json()


batch_size   = 15  # compare with Triton Server MAX_BATCH_SIZE and INSTANCE_COUNT
start_index  = 0
end_index    = min(50000, len(tasks_ids))  # cap at 50,000 tasks; avoids submitting empty batches
max_workers  = 8   # tune this based on how many parallel requests your server can handle
project_id = medal_answers_project.id

def submit_batches(ids):
    return retrieve_batch_predictions(project_id, ids, API_KEY)

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = []
    for i in range(start_index, end_index, batch_size):
        batch_ids = tasks_ids[i : min(i + batch_size, end_index)]
        futures.append(executor.submit(submit_batches, batch_ids))

    # collect results (or just wait for them)
    for future in as_completed(futures):
        try:
            result = future.result()
            # do something with result, e.g. log success
        except Exception as e:
            # handle per‑batch failures
            logging.exception(f"Batch failed: {e}")

>💬 **Note:**
>
>Use the Label Studio interface to **review and validate the predicted answers**. The finalized annotations will be used to **assemble the synthetic Q&A dataset** for fine-tuning.

## 🛠️ Step 5: Curate the Q&A Dataset

Once question-answer pairs are generated and refined, download the synthetic dataset. 

In [24]:
from label_studio_sdk.data_manager import Filters, Column, Type, Operator

filters = Filters.create(Filters.AND, [
    Filters.item(
        Column.completed_at,
        Operator.EMPTY,
        Type.Boolean,
        Filters.value(False)
    )
])

view = client.views.create(
    project=medal_answers_project.id,
    data={
        'title': 'Annotated Tasks',
        'filters': filters
    }
)
tab = client.views.get(id=view.id)

In [None]:
# Download answers from Label Studio
answers_tasks = list(
    client.tasks.list(
        view=tab.id,
        fields='all',
        page_size=100
    )
)
print(len(answers_tasks))

In [26]:
# Create Q&A dataset

from datasets import Dataset

def extract_answers_data(answers_tasks):
    data = []
    for task in answers_tasks:
        for result in task.annotations[0]['result']:
            if result['from_name'] == 'response':
                # Extract the abstract_id
                abstract_id = task.data['abstract_id']
                
                # Extract the annotated answer and the question it belongs to
                answer = result['value']['text'][0]
                question = task.data['text']
                
                # Store each question-answer pair with its corresponding abstract_id
                data.append({'abstract_id': abstract_id, 'question': question, 'answer': answer})
    return data

extracted_answers_data = extract_answers_data(answers_tasks)

qa_dataset = Dataset.from_dict({'abstract_id': [item['abstract_id'] for item in extracted_answers_data], 
                             'question': [item['question'] for item in extracted_answers_data],
                             'answer': [item['answer'] for item in extracted_answers_data]})

In [None]:
qa_dataset[0]

In [None]:
# Export to a JSON Lines file
qa_dataset.to_json("datasets/medal-qa_synthetic_dataset_v1.jsonl", lines=True)
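
A JSON Lines file holds one JSON object per line, so the export can be checked with the standard library alone. The sketch below writes a couple of made-up rows to a temporary file, reads them back, and drops pairs whose answer is blank. The helper names (`write_jsonl`, `read_jsonl`, `drop_empty_answers`), the temporary file name, and the rows are all hypothetical and only illustrate the format; they are not part of the notebook.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_jsonl(path: str, rows: List[Dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: str) -> List[Dict]:
    """Read a JSON Lines file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def drop_empty_answers(rows: List[Dict]) -> List[Dict]:
    """Keep only rows whose answer contains non-whitespace text."""
    return [row for row in rows if row.get("answer", "").strip()]


# Made-up rows in the same shape as the Q&A dataset above.
records = [
    {"abstract_id": 1, "question": "What is X?", "answer": "X is Y."},
    {"abstract_id": 2, "question": "How is Z treated?", "answer": "  "},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "qa_demo.jsonl")
    write_jsonl(path, records)
    loaded = read_jsonl(path)

kept = drop_empty_answers(loaded)
print(len(loaded), len(kept))
```

The same `read_jsonl` call could be pointed at the exported file to spot-check a few rows before fine-tuning.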