# 4 Generating the Bank for the MicroTasks

To generate the bank of microtasks I will use an LLM through its API.
The output covers at most two core courses per programme: one broad question and six disambiguation questions for each course.

The prompts are built from the columns of the courses dataframe (`df_courses_tasks`), which describe each course.

We use two prompts: broad + disambiguation (Kenneth Style).

In [1]:
from pathlib import Path
import pandas as pd
#!pip install --upgrade openai
import os, re
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm  # optional progress bar, pip install tqdm


## 1. Load the data and keep at most two courses per programme

In [2]:
# load the CSV file with the courses for which we have to generate the tasks
silver = Path("../data_programmes_courses/silver")

df_courses_tasks = pd.read_csv(silver / "df_courses_tasks_silver.csv", encoding="utf-8-sig")
print("The shape of the courses tasks dataframe is:", df_courses_tasks.shape)

# keep only first two courses from each programme
df_courses_tasks = df_courses_tasks.groupby("programme_title").head(2).reset_index(drop=True)
print("After keeping only first two courses from each programme the shape is:", df_courses_tasks.shape)

The shape of the courses tasks dataframe is: (36, 21)
After keeping only first two courses from each programme the shape is: (28, 21)
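
To make the filtering step concrete, here is a minimal sketch on toy data (the programme titles and course codes are invented for illustration). `groupby(...).head(2)` keeps the first two rows of each group in their original order, so a programme with a single course keeps just that one.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "programme_title": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "code": ["a1", "a2", "a3", "b1", "c1", "c2"],
})

# Keep at most the first two courses per programme, in file order
toy_top2 = toy.groupby("programme_title").head(2).reset_index(drop=True)

print(toy_top2["code"].tolist())
print(toy_top2.shape)
```

Only the third course of `P1` is dropped here, which is the same effect that takes the real dataframe from 36 to 28 rows.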


## 2. Set up OpenAI client 

In [3]:

key_path = Path("../data_bank_microtasks") / "api_key.txt"

# Read the key and strip spaces and newlines
api_key = key_path.read_text(encoding="utf8").strip()

# Create the client using this key
client = OpenAI(api_key=api_key)

models = client.models.list()
#for m in models.data:
#    print(m.id)

model_gpt = "gpt-4.1-mini"  # smaller and cheaper model
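
As an optional variation on the key loading above, the sketch below prefers an environment variable and falls back to the key file. `load_api_key` is a hypothetical helper, not part of this notebook; `OPENAI_API_KEY` is the variable name that the `openai` client reads by default.

```python
import os
from pathlib import Path

def load_api_key(key_path, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from the file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(key_path)
    if not path.exists():
        raise FileNotFoundError(f"{env_var} is not set and there is no key file at {path}")
    key = path.read_text(encoding="utf8").strip()
    if not key:
        raise ValueError(f"Key file {path} is empty")
    return key

# Possible usage in this notebook (assumption, not executed here):
# client = OpenAI(api_key=load_api_key("../data_bank_microtasks/api_key.txt"))
```

Either way, the key file should stay out of version control.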


## 3. Define the Prompts
Here are the two system prompts. For each course, the broad prompt produces one six-option question, and the disambiguation prompt produces one three-option question per triple code.

In [4]:
SYSTEM_PROMPT_BROAD = """
You generate one broad RIASEC question for a playful study choice tool.

The question should feel like a first step in a real course task.

INPUT
You receive one JSON object with:
- programme_title
- course_code
- course_name
- course_objective
- course_content
- teaching_methods
- assessment

TASK
Create exactly one multiple choice question with six options.

Rules
1. Use the course information to imagine a realistic first year situation.
2. Write a short question string that describes the situation and ends with a question.
3. Create tiny_learn as a list of exactly three short sentences:
   a. definition of the key concept
   b. method reminder
   c. common mistake
4. Create six options labelled A to F.
5. Each option describes a plausible first action the student could take.
6. Each option must be tagged with one RIASEC code:
   R, I, A, S, E, or C.
7. Across the six options you must use each of the six RIASEC codes exactly once.
8. Do not mention RIASEC, personality, or profiles in the visible text.

OUTPUT
Return one JSON object with this shape:

{
  "question": string,
  "tiny_learn": [string, string, string],
  "options": {
    "A": {"text": string, "riasec": "R" | "I" | "A" | "S" | "E" | "C"},
    "B": {...},
    "C": {...},
    "D": {...},
    "E": {...},
    "F": {...}
  }
}

All strings must be single line strings. Do not insert raw newline characters in any value.
"""

SYSTEM_PROMPT_DISAMB = """
You generate one RIASEC disambiguation question for a playful study choice tool.

The question should feel like a first step in a real course task.

INPUT
You receive one JSON object with:
- programme_title
- course_code
- course_name
- course_objective
- course_content
- teaching_methods
- assessment
- triple_code   for example "RIA"

triple_code contains three distinct letters from R, I, A, S, E, C.

TASK
Create exactly one multiple choice question with three options.

Rules
1. Use the course information to imagine a realistic first year situation.
2. Write a short question string that describes the situation and ends with a question.
3. Create tiny_learn as a list of exactly three short sentences:
   a. definition of the key concept
   b. method reminder
   c. common mistake
4. Create three options labelled A, B, C.
5. The three options together must use exactly the three letters in triple_code, one per option.
   For example triple_code "RIA" means one R, one I, one A.
6. Each option describes a plausible first action the student could take.
7. Each option must be tagged with a riasec letter that is one of the letters in triple_code.
8. Do not mention RIASEC, personality, or profiles in the visible text.

OUTPUT
Return one JSON object with this shape:

{
  "question": string,
  "tiny_learn": [string, string, string],
  "options": {
    "A": {"text": string, "riasec": one letter from triple_code},
    "B": {...},
    "C": {...}
  }
}

All strings must be single line strings. Do not insert raw newline characters in any value.
"""
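
The prompts state rules that the model may not always follow, for example that the RIASEC codes are each used exactly once (rule 7 in the broad prompt, rule 5 in the disambiguation prompt). A possible check is sketched below; `validate_question` is a hypothetical helper that is not used elsewhere in this notebook.

```python
RIASEC_LETTERS = "RIASEC"

def validate_question(obj, expected_codes=RIASEC_LETTERS):
    """Return a list of problems found in one generated question (empty if none)."""
    problems = []
    question = obj.get("question")
    if not isinstance(question, str) or not question.strip().endswith("?"):
        problems.append("question must be a string ending with '?'")
    tiny = obj.get("tiny_learn")
    if not (isinstance(tiny, list) and len(tiny) == 3
            and all(isinstance(s, str) for s in tiny)):
        problems.append("tiny_learn must be a list of three strings")
    options = obj.get("options")
    if not isinstance(options, dict):
        problems.append("options must be an object")
        return problems
    # Labels are A..F for the broad prompt and A..C for a triple code
    expected_labels = [chr(ord("A") + i) for i in range(len(expected_codes))]
    if sorted(options) != expected_labels:
        problems.append(f"option labels must be {expected_labels}")
    codes = sorted(str(o.get("riasec", "")) for o in options.values()
                   if isinstance(o, dict))
    if codes != sorted(expected_codes):
        problems.append(f"riasec codes {codes} do not match {sorted(expected_codes)}")
    return problems

# Toy disambiguation answer, invented for illustration
example = {
    "question": "You get a first assignment. What do you do first?",
    "tiny_learn": ["Definition.", "Method.", "Mistake."],
    "options": {
        "A": {"text": "Build it.", "riasec": "R"},
        "B": {"text": "Read about it.", "riasec": "I"},
        "C": {"text": "Sketch it.", "riasec": "A"},
    },
}
print(validate_question(example, "RIA"))
```

Calling it with the default `expected_codes` checks a broad question; passing a triple code such as `"RIA"` checks a disambiguation question.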


## 4. Define the helper functions

In [5]:
import json
import re
from collections import defaultdict
from pathlib import Path
from tqdm import tqdm

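def truncate(text, max_chars=1200):
    """Shorten very long fields; missing values (None or NaN) become ""."""
    # pandas yields NaN for empty cells, and str(NaN) would send "nan" to the model
    if text is None or (isinstance(text, float) and text != text):
        return ""
    return str(text)[:max_chars]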

def build_course_payload(row):
    """Context that we pass to the model for one course."""
    return {
        "programme_title": str(row.get("programme_title", "")),
        "course_code": str(row.get("code", "")),
        "course_name": str(row.get("course_name", "")),
        "course_objective": truncate(row.get("course_objective", ""), 800),
        "course_content": truncate(row.get("course_content", ""), 800),
        "teaching_methods": truncate(row.get("additional_information_teaching_methods", ""), 400),
        "assessment": truncate(row.get("method_of_assessment", ""), 400),
    }

def safe_parse_json(raw_text: str):
    """
    Clean and parse model output as JSON.
    Removes newlines and obvious trailing commas, then parses.
    """
    if raw_text is None:
        raise ValueError("Model returned no text")

    text = raw_text.strip()
    if not text:
        raise ValueError("Model returned empty text")

    # normalise whitespace
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\t", " ")
    text = re.sub(r"\s+", " ", text)

    # remove trailing commas before closing braces or brackets
    text = re.sub(r",\s*([}\]])", r"\1", text)

    # try direct
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # try first and last brace
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        print("Could not find JSON object, preview:")
        print(text[:400])
        raise ValueError("No JSON object found")

    candidate = text[start : end + 1]
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        print("Still could not parse, preview:")
        print(candidate[:400])
        raise


def call_broad_question(row, model_name="gpt-4.1-mini"):
    payload = build_course_payload(row)

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT_BROAD,
        max_output_tokens=500,
    )

    raw = response.output_text
    print("BROAD PREVIEW:", repr((raw or "")[:160]))
    return safe_parse_json(raw)


def call_disamb_question(row, triple_code, model_name="gpt-4.1-mini"):
    payload = build_course_payload(row)
    payload["triple_code"] = triple_code

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT_DISAMB,
        max_output_tokens=400,
    )

    raw = response.output_text
    print(f"DISAMB {triple_code} PREVIEW:", repr((raw or "")[:160]))
    return safe_parse_json(raw)
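
To show what the cleaning in `safe_parse_json` buys us, here is a condensed restatement of its steps applied to a hypothetical model reply that wraps the JSON in extra prose and leaves trailing commas. `parse_loose_json` and the sample reply are illustrative only.

```python
import json
import re

def parse_loose_json(raw_text):
    """Condensed restatement of the cleaning steps in safe_parse_json."""
    text = re.sub(r"\s+", " ", raw_text.strip())   # collapse newlines and tabs
    text = re.sub(r",\s*([}\]])", r"\1", text)     # drop trailing commas
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}"
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("No JSON object found")
        return json.loads(text[start:end + 1])

# Hypothetical reply: prose around the object, plus two trailing commas
raw = 'Here is the question:\n{"question": "What first?",\n "tiny_learn": ["a", "b", "c",],\n}\nDone.'
parsed = parse_loose_json(raw)
print(parsed)
```

One limitation of this approach: both regular expressions run over the whole reply, so they also touch whitespace and comma-bracket sequences inside string values.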



## 5. Call the API with the prompts for each course

In [6]:
from collections import defaultdict
from pathlib import Path
import json
from tqdm import tqdm

TRIPLES = ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]

# final result:
# {
#   "Programme": {
#       "broad": [ {question_code, question, tiny_learn, options}, ... ],
#       "R": [ {question_code, question, tiny_learn, options}, ... ],
#       "I": [...], "A": [...], "S": [...], "E": [...], "C": [...]
#   }
# }
ml_structure = defaultdict(lambda: {
    "broad": [],
    "R": [],
    "I": [],
    "A": [],
    "S": [],
    "E": [],
    "C": [],
})

# track how many questions we have emitted per course
from collections import Counter
question_counter = Counter()

max_courses = 5   # start small, then increase

for i, (_, row) in enumerate(tqdm(df_courses_tasks.iterrows(), total=len(df_courses_tasks))):
    if i >= max_courses:
        break

    programme = str(row.get("programme_title", "UNKNOWN PROGRAMME"))
    course_code = str(row.get("code", ""))

    # 1) broad question for this course
    try:
        broad_obj = call_broad_question(row)
    except Exception as e:
        print(f"Problem making broad question for course {course_code}: {e}")
        continue

    question_counter[course_code] += 1
    broad_qcode = f"{course_code}_{question_counter[course_code]}"

    broad_entry = {
        "question_code": broad_qcode,
        "question": broad_obj["question"],
        "tiny_learn": broad_obj["tiny_learn"],
        "options": broad_obj["options"],
    }

    ml_structure[programme]["broad"].append(broad_entry)

    # 2) six disambiguation questions for this course
    for triple in TRIPLES:
        try:
            disamb_obj = call_disamb_question(row, triple)
        except Exception as e:
            print(f"Problem making disamb {triple} for course {course_code}: {e}")
            continue

        question_counter[course_code] += 1
        disamb_qcode = f"{course_code}_{question_counter[course_code]}"

        disamb_entry = {
            "question_code": disamb_qcode,
            "question": disamb_obj["question"],
            "tiny_learn": disamb_obj["tiny_learn"],
            "options": disamb_obj["options"],
        }

        # attach under each RIASEC letter in the triple
        for letter in set(triple):
            if letter in "RIASEC":
                ml_structure[programme][letter].append(disamb_entry)


  0%|          | 0/28 [00:00<?, ?it/s]

BROAD PREVIEW: '{\n  "question": "You have been assigned to examine a Roman coin from the Allard Pierson collection as your first research object. What is your initial step to b'
DISAMB RIA PREVIEW: '{\n  "question": "You are starting your research on a selected ancient object from the archaeological collection. What is the first step you should take to evalu'
DISAMB RIS PREVIEW: '{\n  "question": "You are starting your research on an ancient coin from the Allard Pierson collection. What should be your first step?",\n  "tiny_learn": [\n    "'
DISAMB REC PREVIEW: '{\n  "question": "You have just been assigned to research an ancient statue for your first paper. What should be your initial step?",\n  "tiny_learn": [\n    "Mate'
DISAMB IEC PREVIEW: '{\n  "question": "You found an ancient coin from the archaeological collection that interests you. What is your first step in using this object as a source?",\n  '
DISAMB ASE PREVIEW: '{\n  "question": "You have selected an ancient statue for y

  4%|▎         | 1/28 [00:29<13:25, 29.83s/it]

DISAMB ASC PREVIEW: '{\n  "question": "You have been assigned to research a statue from the ancient collection for your first writing assignment. What should you do first?",\n  "tiny_'
BROAD PREVIEW: '{\n  "question": "You are assigned to research a canonical item not yet discussed in class for your group presentation. What is your first step?",\n  "tiny_learn"'
DISAMB RIA PREVIEW: '{\n  "question": "You are tasked with choosing a canonical item for your group\'s research project. What should you do first?",\n  "tiny_learn": [\n    "A canonical'
DISAMB RIS PREVIEW: '{\n  "question": "You are starting your group research project on an ancient canonical item not covered in class. What should you do first to ensure a successful'
DISAMB REC PREVIEW: '{\n  "question": "You are starting your group project on a lesser-known classical monument. What is your first step to ensure a strong foundation for your resear'
DISAMB IEC PREVIEW: '{\n  "question": "You have been assigned to a group project

  7%|▋         | 2/28 [00:58<12:41, 29.31s/it]

DISAMB ASC PREVIEW: '{\n  "question": "You have to start preparing your group presentation on a canonical item for the seminar. What would be your first step?",\n  "tiny_learn": [\n   '
BROAD PREVIEW: '{\n  "question": "You have just attended your first lecture on communication theories that integrate psychological, sociological, and language perspectives. What'
DISAMB RIA PREVIEW: '{\n  "question": "You have just started the course and need to prepare for the first seminar by understanding key communication theories. What is the best first '
DISAMB RIS PREVIEW: '{\n  "question": "You have just been assigned your first seminar reading on psychological and sociological communication theories. Which approach will help you u'
DISAMB REC PREVIEW: '{\n  "question": "You have just started the Introduction to Communication Studies course and need to prepare for the first seminar. How do you begin your prepara'
DISAMB IEC PREVIEW: '{\n  "question": "You have just received the first reading ass

 11%|█         | 3/28 [01:31<12:55, 31.03s/it]

DISAMB ASC PREVIEW: '{\n  "question": "You have just attended your first seminar on communication theories. To prepare effectively for the next group discussion, what should you do f'
BROAD PREVIEW: '{\n  "question": "You have just received a set of beginner exercises on phonetics and language acquisition. What is your first step to start understanding the ba'
DISAMB RIA PREVIEW: '{\n  "question": "You are given a dataset of recorded speech sounds and asked to start analyzing its linguistic features. What is your first step?",\n  "tiny_lear'
DISAMB RIS PREVIEW: '{\n  "question": "You have just received a short recording of a child speaking as part of your first assignment. Which approach would you take first to analyze t'
DISAMB REC PREVIEW: '{\n  "question": "You are tasked with analyzing a short transcript of child speech to identify patterns in sound acquisition. What is your first step to approach'
DISAMB IEC PREVIEW: '{\n  "question": "You have been given a dataset of sentences in

 14%|█▍        | 4/28 [02:04<12:43, 31.81s/it]

DISAMB ASC PREVIEW: '{\n  "question": "You have just received a dataset containing recorded child speech samples to analyze their language development. What should you do first to st'
BROAD PREVIEW: '{\n  "question": "You encounter a challenging problem requiring you to prove the existence and uniqueness of solutions to an equation involving real functions. W'
DISAMB RIA PREVIEW: '{\n  "question": "You are starting to study the concept of limits in calculus. Which approach should you take first to best understand this topic?",\n  "tiny_lear'
DISAMB RIS PREVIEW: '{\n  "question": "You need to solve a problem involving the calculation of derivatives and then present your reasoning clearly. What is your first step?",\n  "tin'
DISAMB REC PREVIEW: '{\n  "question": "You are starting the course and encounter a challenging problem involving limits and proofs. What should you do first?",\n  "tiny_learn": [\n    '
DISAMB IEC PREVIEW: '{\n  "question": "You are starting the course Calculus and A

 18%|█▊        | 5/28 [02:34<11:52, 30.99s/it]

DISAMB ASC PREVIEW: '{\n  "question": "You encounter a problem about finding extreme values of a function with constraints. What is your first step?",\n  "tiny_learn": [\n    "Extreme '
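
A quick way to see how the six triples distribute over the RIASEC letters is to count them. The sketch below only uses the `TRIPLES` list from the cell above; it checks how many triples contain each letter and whether every pair of letters is contrasted in at least one triple.

```python
from collections import Counter
from itertools import combinations

TRIPLES = ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]

# In how many triples does each letter occur?
letter_counts = Counter(letter for triple in TRIPLES for letter in triple)

# Which pairs of letters appear together in at least one triple?
pair_counts = Counter(
    frozenset(pair) for triple in TRIPLES for pair in combinations(triple, 2)
)
all_pairs = {frozenset(pair) for pair in combinations("RIASEC", 2)}
missing_pairs = all_pairs - set(pair_counts)

print("triples per letter:", dict(letter_counts))
print("pairs never contrasted:", sorted("".join(sorted(p)) for p in missing_pairs))
```

Each letter occurs in three triples, so every course adds three disambiguation questions to each letter bucket. A programme with two processed courses therefore ends up with six questions per letter.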





## 6. Saving

In [7]:
ml_structure = dict(ml_structure)

output_dir = Path("../data_bank_microtasks")
output_dir.mkdir(parents=True, exist_ok=True)

out_path = output_dir / "microtasks_prototype.json"

with out_path.open("w", encoding="utf8") as f:
    json.dump(ml_structure, f, ensure_ascii=False, indent=2)

print("Saved ML structure to:", out_path)


Saved ML structure to: ..\data_bank_microtasks\microtasks_prototype.json
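
Because each disambiguation question is stored under all three letters of its triple, the saved file contains every such question three times. If a flat list of distinct questions is needed, one option is to index by `question_code`, as in the sketch below. `unique_questions` and the toy bank are illustrative, and the sketch assumes that question codes are unique across programmes.

```python
def unique_questions(bank):
    """Return {question_code: entry}, keeping the first copy of each question."""
    seen = {}
    for buckets in bank.values():          # one dict of buckets per programme
        for entries in buckets.values():   # "broad", "R", "I", ...
            for entry in entries:
                seen.setdefault(entry["question_code"], entry)
    return seen

# Toy bank in the same shape as the saved file; all contents are invented
shared = {"question_code": "X1_2", "question": "q?", "tiny_learn": [], "options": {}}
toy_bank = {
    "Toy Programme": {
        "broad": [{"question_code": "X1_1", "question": "q?",
                   "tiny_learn": [], "options": {}}],
        "R": [shared], "I": [shared], "A": [shared],
        "S": [], "E": [], "C": [],
    }
}

unique = unique_questions(toy_bank)
print(sorted(unique))
```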


In [8]:
with out_path.open("r", encoding="utf8") as f:
    data = json.load(f)

print("Programmes:", list(data.keys())[:5])

prog_name = next(iter(data.keys()))
p = data[prog_name]

print("\nProgramme:", prog_name)
print(" broad questions:", len(p["broad"]))
print(" R questions:", len(p["R"]))
print(" I questions:", len(p["I"]))
print(" A questions:", len(p["A"]))
print(" S questions:", len(p["S"]))
print(" E questions:", len(p["E"]))
print(" C questions:", len(p["C"]))

from pprint import pprint

print("\nExample broad question:")
pprint(p["broad"][0])

print("\nExample R disambiguation question:")
if p["R"]:
    pprint(p["R"][0])


Programmes: ['Ancient Studies', 'Communication and Information Studies', 'Econometrics and Operations Research']

Programme: Ancient Studies
 broad questions: 2
 R questions: 6
 I questions: 6
 A questions: 6
 S questions: 6
 E questions: 6
 C questions: 6

Example broad question:
{'options': {'A': {'riasec': 'R',
                   'text': "Visit the museum to closely observe the coin's "
                           'features and take detailed notes.'},
             'B': {'riasec': 'I',
                   'text': 'Search the university library catalog for academic '
                           'articles about Roman coins.'},
             'C': {'riasec': 'A',
                   'text': 'Sketch your interpretation of the coin’s imagery '
                           'to explore its symbolic meaning.'},
             'D': {'riasec': 'S',
                   'text': 'Discuss with fellow students how this coin might '
                           'reflect everyday life in ancient Rome.'},
        