In [6]:
import pandas as pd

data_path = "data/medquad.csv"
df = pd.read_csv(data_path)
df = df.dropna(subset=["answer"])
documents = df.to_dict(orient='records')

In [5]:
import hashlib

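def generate_document_id(doc):
    """Return a short, deterministic ID derived from the record's content."""
    combined = f"{doc['question']}-{doc['answer']}-{doc['focus_area']}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    # Keep only the first 9 hex characters; identical records get identical IDs.
    document_id = hash_hex[:9]
    return document_id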
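
Truncating the MD5 digest to 9 hex characters leaves 16**9 possible IDs, so two different records could in principle receive the same ID. The sketch below estimates that risk with the standard birthday approximation. The helper name `collision_probability` is hypothetical, and 16407 is the record count this notebook reports further down.

```python
import math

def collision_probability(n_items: int, hex_chars: int) -> float:
    """Birthday-bound estimate that at least two of n_items distinct
    inputs share the same truncated hash of hex_chars hex digits."""
    space = 16 ** hex_chars  # number of possible truncated IDs
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2 * space))

# 16407 records, 9 hex characters per ID
p = collision_probability(16407, 9)
print(f"Estimated collision probability: {p:.4%}")
```

Under this estimate an accidental collision is unlikely, so records that share an ID are most probably exact duplicates on question, answer and focus_area.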

In [9]:
documents[3]

{'question': 'What are the treatments for Glaucoma ?',
 'answer': 'Although open-angle glaucoma cannot be cured, it can usually be controlled. While treatments may save remaining vision, they do not improve sight already lost from glaucoma. The most common treatments for glaucoma are medication and surgery. Medications  Medications for glaucoma may be either in the form of eye drops or pills. Some drugs reduce pressure by slowing the flow of fluid into the eye. Others help to improve fluid drainage. (Watch the video to learn more about coping with glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) For most people with glaucoma, regular use of medications will control the increased fluid pressure. But, these drugs may stop working over time. Or, they may cause side effects. If a problem occurs, the eye care professional may select other drugs, change the dose, or suggest other ways to d

In [8]:
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [10]:
documents[3]

{'question': 'What are the treatments for Glaucoma ?',
 'answer': 'Although open-angle glaucoma cannot be cured, it can usually be controlled. While treatments may save remaining vision, they do not improve sight already lost from glaucoma. The most common treatments for glaucoma are medication and surgery. Medications  Medications for glaucoma may be either in the form of eye drops or pills. Some drugs reduce pressure by slowing the flow of fluid into the eye. Others help to improve fluid drainage. (Watch the video to learn more about coping with glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) For most people with glaucoma, regular use of medications will control the increased fluid pressure. But, these drugs may stop working over time. Or, they may cause side effects. If a problem occurs, the eye care professional may select other drugs, change the dose, or suggest other ways to d

In [12]:
from collections import defaultdict

In [14]:
defaultdict()

defaultdict(None, {})

In [15]:
hashes = defaultdict(list)

for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

In [19]:
len(hashes), len(documents)

(16360, 16407)

In [24]:
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))
        if len(values) == 8:
            print(values[0], values[-1])

ebbd3c188 4
a4391e582 4
2eaf79166 4
bfd59f470 4
b81340472 8
{'question': 'What causes Causes of Diabetes ?', 'answer': 'Other types of diabetes have a variety of possible causes.\n                \nGenetic Mutations Affecting Beta Cells, Insulin, and Insulin Action\n                \nSome relatively uncommon forms of diabetes known as monogenic diabetes are caused by mutations, or changes, in a single gene. These mutations are usually inherited, but sometimes the gene mutation occurs spontaneously. Most of these gene mutations cause diabetes by reducing beta cells ability to produce insulin.\n                \nThe most common types of monogenic diabetes are neonatal diabetes mellitus (NDM) and MODY. NDM occurs in the first 6 months of life. MODY is usually found during adolescence or early adulthood but sometimes is not diagnosed until later in life. More information about NDM and MODY is provided in the NIDDK health topic, Monogenic Forms of Diabetes.\n                \nOther rare gen

In [25]:
hashes['bfd59f470']

[{'question': 'What causes Causes of Diabetes ?',
  'answer': 'Insulin Resistance and Beta Cell Dysfunction\n                \nHormones produced by the placenta and other pregnancy-related factors contribute to insulin resistance, which occurs in all women during late pregnancy. Insulin resistance increases the amount of insulin needed to control blood glucose levels. If the pancreas cant produce enough insulin due to beta cell dysfunction, gestational diabetes occurs.\n                \nAs with type 2 diabetes, excess weight is linked to gestational diabetes. Overweight or obese women are at particularly high risk for gestational diabetes because they start pregnancy with a higher need for insulin due to insulin resistance. Excessive weight gain during pregnancy may also increase risk.\n                \nFamily History\n                \nHaving a family history of diabetes is also a risk factor for gestational diabetes, suggesting that genes play a role in its development. Genetics ma
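
Since the ID is built from question, answer and focus_area, records that share an ID repeat the same content. If the repeats are not wanted, one option is to keep only the first record per ID. This is a minimal sketch, not a step the notebook itself performs; `deduplicate_by_id` is a hypothetical helper and the sample records are made up for illustration.

```python
def deduplicate_by_id(docs):
    """Keep the first record for each 'id', preserving the original order."""
    seen = set()
    unique = []
    for doc in docs:
        if doc["id"] in seen:
            continue  # a record with this ID was already kept
        seen.add(doc["id"])
        unique.append(doc)
    return unique

# Illustrative records only (not taken from the dataset)
sample = [
    {"id": "a", "question": "q1"},
    {"id": "b", "question": "q2"},
    {"id": "a", "question": "q1"},
]
unique = deduplicate_by_id(sample)
print(len(sample), len(unique))
```

Applied to the full list, this would be called as `documents = deduplicate_by_id(documents)` before writing the JSON file.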

In [26]:
import json

In [27]:
with open('documents-with-ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)

In [28]:
!head documents-with-ids.json

[
  {
    "question": "What is (are) Glaucoma ?",
    "answer": "Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. While glaucoma can strike anyone, the risk is much greater for people over 60. How Glaucoma Develops  There are several different types of glaucoma. Most of these involve the drainage system within the eye. At the front of the eye there is a small space called the anterior chamber. A clear fluid flows through this chamber and bathes and nourishes the nearby tissues. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) In glaucoma, for still unknown reasons, the fluid drains too slowly out of the eye. As the fluid builds up, the pressure inside the eye rises. Unless this pressure is controlled, it may cause damage to the optic nerve and other parts of the eye and result in loss o

In [29]:
prompt_template = """
You emulate a patient who is suffering from a few diseases.
Formulate 5 questions this patient might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

focus_area: {focus_area}
question: {question}
text: {answer}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

In [30]:
import os

from dotenv import load_dotenv
from openai import OpenAI

# Create a .env file in the notebook's working directory containing:
# OPENAI_API_KEY="your_open_ai_key"
load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=api_key)

In [31]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model='gpt-5',
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response
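
A long batch of API calls can hit transient failures such as timeouts or rate limits. A generic retry wrapper with exponential backoff is one way to absorb them. The sketch below is library-agnostic and makes no assumptions about the OpenAI client. `call_with_retries` and the `flaky` demo function are hypothetical names, and catching every `Exception` is a simplification.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait base_delay * 2**(attempt - 1) seconds
    and try again, re-raising the last error after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Demo: a function that fails twice before succeeding.
attempts = {"count": 0}
delays = []  # records the requested waits instead of actually sleeping

def flaky(value):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporary failure")
    return value.upper()

result = call_with_retries(flaky, "ok", sleep=delays.append)
print(result, delays)
```

In the notebook this could be used as `questions = call_with_retries(generate_questions, doc)`.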

In [32]:
from tqdm.auto import tqdm

In [33]:
results = {}

In [55]:
for doc in tqdm(documents[:800]): 
    doc_id = doc['id']
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions

 72%|███████████████████████████████████████████████████████▍                     | 576/800 [9:05:58<3:32:19, 56.87s/it]


KeyboardInterrupt: 

Note (2025-10-08): the 576 GPT-5 requests completed before the interrupt cost $9.01.
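
The run above was interrupted after roughly nine hours with the results held only in memory. Writing them to disk periodically would let an interrupted run resume without repeating paid requests. This is a minimal sketch under assumptions: the function names, the `save_every` parameter and the checkpoint path are hypothetical, and `fake_generate` stands in for `generate_questions`.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return previously saved results, or an empty dict if none exist."""
    if not os.path.exists(path):
        return {}
    with open(path, "rt") as f_in:
        return json.load(f_in)

def save_checkpoint(path, results):
    """Write to a temporary file first, then replace, to avoid a half-written checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wt") as f_out:
        json.dump(results, f_out)
    os.replace(tmp_path, path)

def run_with_checkpoints(docs, generate, path, save_every=10):
    """Generate results for docs, skipping IDs already stored in the checkpoint."""
    results = load_checkpoint(path)
    try:
        for i, doc in enumerate(docs, start=1):
            if doc["id"] in results:
                continue
            results[doc["id"]] = generate(doc)
            if i % save_every == 0:
                save_checkpoint(path, results)
    finally:
        # Also runs on KeyboardInterrupt, so finished work is kept.
        save_checkpoint(path, results)
    return results

# Demo with a stand-in generator that records which IDs it was called for.
calls = []

def fake_generate(doc):
    calls.append(doc["id"])
    return '["q1"]'

docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.json")
    first = run_with_checkpoints(docs[:2], fake_generate, path)
    second = run_with_checkpoints(docs, fake_generate, path)

print(calls, sorted(second))
```

In the demo the second run calls the generator only for the record that was not yet in the checkpoint.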

In [60]:
len(results)

576

In [58]:
parsed_resulst = {}

for doc_id, json_questions in results.items():
    parsed_resulst[doc_id] = json.loads(json_questions)
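
The loop above raises on the first response that is not valid JSON, for example if the model wraps its answer in a code block despite the prompt. A more tolerant variant collects the failures so they can be inspected or regenerated. This is a sketch only; `parse_question_lists` and its `expected_count` parameter are hypothetical, and the sample responses are made up.

```python
import json

def parse_question_lists(raw_results, expected_count=5):
    """Split raw model responses into parsed question lists and failures."""
    parsed, failed = {}, {}
    for doc_id, raw in raw_results.items():
        try:
            questions = json.loads(raw)
        except (json.JSONDecodeError, TypeError) as err:
            # TypeError covers a missing (None) response
            failed[doc_id] = f"invalid JSON: {err}"
            continue
        if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
            failed[doc_id] = "not a list of strings"
            continue
        if len(questions) != expected_count:
            failed[doc_id] = f"expected {expected_count} questions, got {len(questions)}"
            continue
        parsed[doc_id] = questions
    return parsed, failed

# Illustrative responses only
sample = {
    "a": '["q1", "q2"]',
    "b": '```json\n["q1", "q2"]\n```',
    "c": '{"q": 1}',
    "d": None,
}
parsed, failed = parse_question_lists(sample, expected_count=2)
print(sorted(parsed), sorted(failed))
```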

In [59]:
parsed_resulst

{'641e393e8': ['Could you define glaucoma for me and explain how it can lead to vision loss?',
  'I live with other conditions; is this illness something anyone can get, and does being past 60 raise the odds?',
  'What goes wrong with the eye’s front chamber and fluid outflow that makes pressure climb?',
  'What exactly is open‑angle glaucoma, and how does a slow filter or mesh end up injuring the optic nerve?',
  'Is there a cure—can lost sight be restored, or do treatments mainly preserve the vision I still have?'],
 'dc2b69cd1': ['I am 65 and Latino; does that make me more likely to develop glaucoma?',
  'I am 45 and African American with no eye symptoms; should I still get a dilated exam to check for glaucoma?',
  'My parent had glaucoma; how much does family history raise my risk?',
  'I also have high blood pressure; can that worsen optic nerve damage in glaucoma, and should I manage it closely with my doctor?',
  'How does eye pressure influence glaucoma, and can an exam tell wh

In [61]:
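# Index only the documents that have generated questions, rather than
# relying on a hard-coded slice length.
doc_index = {d['id']: d for d in documents if d['id'] in results}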

In [62]:
doc_index

{'641e393e8': {'question': 'What is (are) Glaucoma ?',
  'answer': "Glaucoma is a group of diseases that can damage the eye's optic nerve and result in vision loss and blindness. While glaucoma can strike anyone, the risk is much greater for people over 60. How Glaucoma Develops  There are several different types of glaucoma. Most of these involve the drainage system within the eye. At the front of the eye there is a small space called the anterior chamber. A clear fluid flows through this chamber and bathes and nourishes the nearby tissues. (Watch the video to learn more about glaucoma. To enlarge the video, click the brackets in the lower right-hand corner. To reduce the video, press the Escape (Esc) button on your keyboard.) In glaucoma, for still unknown reasons, the fluid drains too slowly out of the eye. As the fluid builds up, the pressure inside the eye rises. Unless this pressure is controlled, it may cause damage to the optic nerve and other parts of the eye and result in los

In [63]:
final_results = []

for doc_id, questions in parsed_resulst.items():
    focus_area = doc_index[doc_id]['focus_area']
    for q in questions:
        final_results.append((q, focus_area, doc_id))

In [64]:
import pandas as pd

In [69]:
df = pd.DataFrame(final_results, columns=['question', 'focus_area', 'id'])

In [70]:
df.to_csv('ground-truth-data_576.csv', index=False)

In [73]:
!head ground-truth-data_576.csv

question,focus_area,id
Could you define glaucoma for me and explain how it can lead to vision loss?,Glaucoma,641e393e8
"I live with other conditions; is this illness something anyone can get, and does being past 60 raise the odds?",Glaucoma,641e393e8
What goes wrong with the eye’s front chamber and fluid outflow that makes pressure climb?,Glaucoma,641e393e8
"What exactly is open‑angle glaucoma, and how does a slow filter or mesh end up injuring the optic nerve?",Glaucoma,641e393e8
"Is there a cure—can lost sight be restored, or do treatments mainly preserve the vision I still have?",Glaucoma,641e393e8
I am 65 and Latino; does that make me more likely to develop glaucoma?,Glaucoma,dc2b69cd1
I am 45 and African American with no eye symptoms; should I still get a dilated exam to check for glaucoma?,Glaucoma,dc2b69cd1
My parent had glaucoma; how much does family history raise my risk?,Glaucoma,dc2b69cd1
"I also have high blood pressure; can that worsen optic nerve damage in glaucoma, and s

In [72]:
df

Unnamed: 0,question,focus_area,id
0,Could you define glaucoma for me and explain h...,Glaucoma,641e393e8
1,I live with other conditions; is this illness ...,Glaucoma,641e393e8
2,What goes wrong with the eye’s front chamber a...,Glaucoma,641e393e8
3,"What exactly is open‑angle glaucoma, and how d...",Glaucoma,641e393e8
4,"Is there a cure—can lost sight be restored, or...",Glaucoma,641e393e8
...,...,...,...
2875,"When I go upstairs or walk any distance, my le...",Peripheral Arterial Disease (P.A.D.),8c7e1b0c2
2876,"I get cramping in my buttocks, thighs, calves,...",Peripheral Arterial Disease (P.A.D.),8c7e1b0c2
2877,My foot pulses seem weak and I have toe sores ...,Peripheral Arterial Disease (P.A.D.),8c7e1b0c2
2878,The skin on my lower legs sometimes looks pale...,Peripheral Arterial Disease (P.A.D.),8c7e1b0c2
