# 4 Generating the Bank for the MicroTasks

To generate the bank of microtasks I will use an LLM API.
For each core course of each programme, the output is one broad question and six disambiguation questions.

The prompt uses the columns of df_courses_tasks to generate the microtasks.

It covers two task types: broad + disambiguation

In [2]:
from pathlib import Path
import pandas as pd
#!pip install --upgrade openai
import os, re
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm  # optional progress bar, pip install tqdm


## 1. Load the data and filter for max 2 courses

In [3]:
# load the csv file with the courses for which we have to generate the tasks
silver = Path("../data_programmes_courses/silver")

df_courses_tasks = pd.read_csv(silver / "df_courses_tasks_silver.csv", encoding="utf-8-sig")
print("The shape of the courses tasks dataframe is:", df_courses_tasks.shape)

# keep only first two courses from each programme
df_courses_tasks = df_courses_tasks.groupby("programme_title").head(2).reset_index(drop=True)
print("After keeping only first two courses from each programme the shape is:", df_courses_tasks.shape)

The shape of the courses tasks dataframe is: (36, 21)
After keeping only first two courses from each programme the shape is: (28, 21)
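
The drop from 36 to 28 rows comes from `groupby(...).head(2)`, which keeps at most the first two rows of every group in their original order. A minimal sketch on toy data (the programme names and course codes are made up for illustration) shows the behaviour, including the case of a programme that has only one course:

```python
import pandas as pd

# Toy data: hypothetical programmes and course codes, for illustration only
toy = pd.DataFrame(
    {
        "programme_title": ["P1", "P1", "P1", "P2", "P3", "P3"],
        "code": ["c1", "c2", "c3", "c4", "c5", "c6"],
    }
)

# head(2) is applied per group and preserves the original row order
first_two = toy.groupby("programme_title").head(2).reset_index(drop=True)
print(first_two["code"].tolist())

# Programmes that end up with fewer than two courses
counts = first_two["programme_title"].value_counts()
print(counts[counts < 2].index.tolist())
```

The same `value_counts` check can be run on df_courses_tasks to see which programmes contribute only one course.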


## 2. Set up OpenAI client 

In [4]:

key_path = Path("../data_bank_microtasks") / "api_key.txt"

# Read the key and strip spaces and newlines
api_key = key_path.read_text(encoding="utf8").strip()

# Create the client using this key
client = OpenAI(api_key=api_key)

models = client.models.list()
#for m in models.data:
#    print(m.id)

model_gpt = "gpt-4.1-mini"  # smaller and cheaper model
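
Reading the key from a text file works, but the file must never be committed. As a possible alternative, here is a small sketch that prefers an environment variable and falls back to the file. The helper name `load_api_key` is hypothetical; `OPENAI_API_KEY` is the variable the openai library looks for by default.

```python
import os
from pathlib import Path


def load_api_key(key_path, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from the file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key

    path = Path(key_path)
    if not path.exists():
        raise FileNotFoundError(f"No key in {env_var} and no file at {path}")

    # Same stripping as in the cell above, to drop spaces and newlines
    key = path.read_text(encoding="utf8").strip()
    if not key:
        raise ValueError(f"Key file {path} is empty")
    return key
```

With this helper the client line would become `client = OpenAI(api_key=load_api_key(key_path))`.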


## 3. Define the Prompts
Here is the prompt that generates the questions for each programme, based on its core courses.

In [5]:
SYSTEM_PROMPT = """
You generate microtasks for a playful study choice tool.

Students see tiny tasks that feel like doing a real course.
They choose what they would do first.
The system uses their choices to infer style signals and to show fit with programmes.

INPUT

You receive one JSON object with these fields:

programme_title
course_code
course_name
course_objective
course_content
additional_information_teaching_methods
method_of_assessment

Use the course and programme texts to infer what the course is about and what students actually do.
Focus on realistic first year situations and concrete thinking tasks.

SHARED RULES FOR ALL TASKS

1. Pick one specific concept or method that clearly belongs in this course.
2. Write a short stimulus. It describes a realistic situation for a first year student and ends with a question about what to do first.
3. Write a tiny_learn bubble with exactly three items. Each item is one short sentence.
   a. One line that defines the key concept.
   b. One line that reminds the student of a useful method or principle.
   c. One line that names a common mistake or confusion.
   Keep the total reading load under eighty words.
4. Use simple language. Avoid formulas unless the course is clearly very quantitative.
5. Do not mention RIASEC, personality, traits or any scoring logic in the task text.
6. Do not mention that this is a test or an assessment. It should feel like a normal study question.

BROAD TASK  first_step_approach

Goal

Give six plausible first actions that all fit the scenario but express different styles.
Each option maps to one of the six RIASEC profiles.

Use these profiles as style hints:

Realistic  practical, hands on, using tools, data or physical activity
Investigative  analytical, exploring patterns, questions and evidence
Artistic  creative representation, open ended design and expression
Social  interaction with people, explanation, collaboration and feedback
Enterprising  initiative, decision, persuasion, planning and pitching
Conventional  organising, documenting, structuring and checking details

Instructions

1. The question in the stimulus must clearly be about the first step. For example
   "What do you do first" or "Which first step makes most sense".
2. Create six options labeled from A to F.
3. Each option must describe a different first step that a student could reasonably take in this situation.
4. Map the options to the six RIASEC profiles as follows:
   exactly one option is Realistic
   exactly one option is Investigative
   exactly one option is Artistic
   exactly one option is Social
   exactly one option is Enterprising
   exactly one option is Conventional
5. Make the mapping clear by including a riasec_code for each option.
6. All options must be plausible. None of them should be obviously wrong or silly. They just reflect different ways to start.
7. The action in each option must be a method or behaviour, not a statement of fact or a vague attitude.

DISAMBIGUATION TASKS  RIASEC TRIPLES

Goal

Create six disambiguation tasks for this course.
Each task has three options and is tied to one triple of RIASEC codes.
The triples are:

RIA
RIS
REC
IEC
ASE
ASC

Each triple is used once. For example RIA means one option with R, one with I, one with A.

Instructions for every triple

1. Create one stimulus that fits the course and asks clearly about the first step in a realistic situation.
2. Create three options labeled A, B and C.
3. Each option must be a plausible first step in the situation.
4. Each option must carry a riasec_code that is one of the three letters in the triple.
5. The three options together must cover all three letters of the triple exactly once.
6. The options should feel different in style, aligned with their RIASEC codes, but none of them is obviously wrong.

OUTPUT

For the given course return one JSON object with this structure:

{
  "programme_title": string,
  "course_code": string,
  "course_name": string,
  "broad_task": {
    "task_type": "first_step_approach",
    "stimulus": string,
    "tiny_learn": [string, string, string],
    "options": [
      {
        "label": "A" | "B" | "C" | "D" | "E" | "F",
        "text": string,
        "riasec_code": "R" | "I" | "A" | "S" | "E" | "C"
      },
      ... five more options ...
    ]
  },
  "disambiguation_tasks": {
    "RIA": {
      "triple_code": "RIA",
      "stimulus": string,
      "tiny_learn": [string, string, string],
      "options": [
        {
          "label": "A" | "B" | "C",
          "text": string,
          "riasec_code": "R" | "I" | "A"
        },
        ... two more options ...
      ]
    },
    "RIS": { same shape, with codes R I S },
    "REC": { same shape, with codes R E C },
    "IEC": { same shape, with codes I E C },
    "ASE": { same shape, with codes A S E },
    "ASC": { same shape, with codes A S C }
  }
}

GENERAL OUTPUT RULES

1. Always return a single JSON object for the course.
2. Start the reply with "{" and end the reply with "}".
3. Do not wrap the JSON in code fences.
4. Do not add any commentary, headings or explanation outside the JSON.
"""

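The prompt asks for a strict structure, but nothing guarantees the model follows it. Below is a minimal sketch of a check that could be run on every parsed bundle. The function names `check_task` and `check_bundle` are my own, and the rules are only the ones stated in the prompt: three tiny_learn items under eighty words, and each RIASEC letter used exactly once per task.

```python
TRIPLES = ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]


def check_task(task, expected_codes):
    """Return a list of problems for one task dict (empty list means no problem found)."""
    problems = []

    tiny = task.get("tiny_learn", [])
    if len(tiny) != 3:
        problems.append(f"tiny_learn has {len(tiny)} items, expected 3")
    elif sum(len(str(s).split()) for s in tiny) >= 80:
        problems.append("tiny_learn is not under eighty words")

    # Every expected letter must appear exactly once among the options
    codes = [str(opt.get("riasec_code", "")) for opt in task.get("options", [])]
    if sorted(codes) != sorted(expected_codes):
        problems.append(f"codes {codes} do not cover {expected_codes} exactly once")

    return problems


def check_bundle(bundle):
    """Return {task_name: [problems]} for one course bundle; empty dict means it passed."""
    problems = {}

    found = check_task(bundle.get("broad_task", {}), "RIASEC")
    if found:
        problems["broad_task"] = found

    disamb = bundle.get("disambiguation_tasks", {})
    for triple in TRIPLES:
        if triple not in disamb:
            problems[triple] = ["missing triple"]
            continue
        found = check_task(disamb[triple], triple)
        if found:
            problems[triple] = found

    return problems
```

An empty dict from `check_bundle(bundle)` means the bundle respects these structural rules. It says nothing about whether the questions themselves are good.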

## 4. Define the helper functions

In [6]:
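# avoid very long texts
def truncate(text, max_chars=1200):
    """
    Simple helper to shorten long course texts.
    Missing values (None or NaN from pandas) become an empty string,
    so the model never receives the literal text "nan".
    """
    if text is None or pd.isna(text):
        return ""
    s = str(text)
    if len(s) <= max_chars:
        return s
    return s[:max_chars]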


def build_course_payload(row):
    """
    Prepare the JSON object that will be sent to the model for one course.
    """
    return {
        "programme_title": str(row.get("programme_title", "")),
        "course_code": str(row.get("code", "")),
        "course_name": str(row.get("course_name", "")),
        "course_objective": truncate(row.get("course_objective", ""), 1200),
        "course_content": truncate(row.get("course_content", ""), 1200),
        "additional_information_teaching_methods": truncate(
            row.get("additional_information_teaching_methods", ""), 800
        ),
        "method_of_assessment": truncate(row.get("method_of_assessment", ""), 800),
    }


def safe_parse_json(raw_text):
    """
    Try to parse the model output as JSON.
    If that fails, try to extract the first {...} block.
    Before parsing, normalise whitespace and try to remove common JSON glitches.
    If it still fails, write the candidate JSON to a debug file so it can be inspected.
    """

    if raw_text is None:
        raise ValueError("Model returned no text at all")

    # Strip and normalise whitespace
    text = raw_text.strip()
    if not text:
        raise ValueError("Model returned an empty string, no JSON to parse")

    # Replace all newline and tab characters with spaces
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\t", " ")

    # Helper to clean likely JSON issues
    def clean(s: str) -> str:
        # Collapse multiple spaces
        s = re.sub(r"\s+", " ", s)
        # Remove trailing commas before closing braces or brackets
        s = re.sub(r",\s*([}\]])", r"\1", s)
        return s

    text = clean(text)

    # First simple attempt on the full text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Second attempt, look for first and last curly brace
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        print("Could not find any JSON object in the text. Here is a preview:")
        print(text[:400])
        raise ValueError("No JSON object found in model output")

    candidate = text[start : end + 1]
    candidate = clean(candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError as e:
        # Write the broken candidate to a debug file so it can be opened in VS Code
        debug_path = Path("debug_candidate.json.txt")
        debug_path.write_text(candidate, encoding="utf8")
        print("\nFailed to parse JSON candidate. Wrote the text to:", debug_path)
        print("Open that file in VS Code and look around the position mentioned in the error.")
        raise e
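
To see why the second attempt (first `{` to last `}`) is useful, here is a stripped-down sketch of the same fallback on a reply that ignores the output rules and wraps the JSON in a code fence. The function `extract_json_object` and the course code "X1" are illustrative only; this is not a replacement for safe_parse_json.

```python
import json


def extract_json_object(reply):
    """Parse the reply directly, else parse the span from the first '{' to the last '}'."""
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        pass

    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found")
    return json.loads(reply[start : end + 1])


# Build a fenced reply without typing a literal fence inside this example
fence = "`" * 3
fenced_reply = (
    "Here are the tasks:\n"
    + fence + "json\n"
    + '{"course_code": "X1", "options": ["A", "B"]}\n'
    + fence
)

print(extract_json_object(fenced_reply))
# {'course_code': 'X1', 'options': ['A', 'B']}
```

One caveat for the full helper above: the trailing comma cleanup is a plain regex, so it would also change a comma followed by `}` or `]` inside a string value. That should be rare in task text, but it is worth knowing when inspecting the debug file.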




## 5. Function that calls the API and provides the prompt

In [7]:
def generate_tasks_for_course(row, model_name=model_gpt):
    """
    Call the Responses API for a single course.

    The model returns:
      broad_task with six RIASEC options
      disambiguation_tasks with six triples, each with three options

    Returns a Python dict that matches the JSON described in SYSTEM_PROMPT.
    """

    payload = build_course_payload(row)

    response = client.responses.create(
        model=model_name,            # choose a small text model available in your account
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT,
        max_output_tokens=2000,
    )

    raw_text = response.output_text

    print("RAW TEXT PREVIEW:")
    print(repr((raw_text or "")[:300]))
    print("END RAW TEXT PREVIEW\n")

    if raw_text is None or not raw_text.strip():
        print("Full response object for debugging:")
        print(response)
        raise ValueError("Model returned no text, check model name and prompt")

    bundle = safe_parse_json(raw_text)

    return bundle
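
A single call can fail for transient reasons (network, rate limit) or return text that does not parse. One option, sketched here under my own assumptions, is a small retry wrapper with exponential backoff. The name `call_with_retries` and its parameters are hypothetical, and catching every `Exception` is a simplification for the sketch.

```python
import time


def call_with_retries(func, *args, max_attempts=3, base_delay=2.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception with doubling delays."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as e:  # simplification: a real version might catch narrower errors
            last_error = e
            if attempt < max_attempts:
                # Delays are base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    raise last_error


# Possible usage with the function defined above:
# bundle = call_with_retries(generate_tasks_for_course, row, model_name=model_gpt)
```

The `sleep` parameter exists only so the wrapper can be tested without waiting. Note that every retry is a new paid API call.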





In [8]:
# test
test_row = df_courses_tasks.iloc[0]
test_bundle = generate_tasks_for_course(test_row, model_name=model_gpt)

print(json.dumps(test_bundle, indent=2, ensure_ascii=False))


RAW TEXT PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "broad_task": {\n    "task_type": "first_step_approach",\n    "stimulus": "You have been assigned an ancient artifact from the arch'
END RAW TEXT PREVIEW

{
  "programme_title": "Ancient Studies",
  "course_code": "L_AABAOHW115",
  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",
  "broad_task": {
    "task_type": "first_step_approach",
    "stimulus": "You have been assigned an ancient artifact from the archaeological collection to research for your first writing assignment. Before you dive into writing, what is the first step you should take to understand this object in its historical context?",
    "tiny_learn": [
      "Material culture refers to physical objects made or used by people in the past.",
      "Start by exploring both the object's featu

In [9]:
# save the test task bundle to a file
output_folder = Path("../data_bank_microtasks") / "test_output"
output_folder.mkdir(parents=True, exist_ok=True)
output_file = output_folder / f"{test_row['code']}_tasks.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(test_bundle, f, ensure_ascii=False, indent=2)

## 6. Loop over courses and save tasks

Now build a simple loop that calls the function for each course and collects the results.

In [10]:
from tqdm import tqdm
from pathlib import Path
import json

bundles = []

max_courses = 2  # for safety during first runs

for i, (_, row) in enumerate(tqdm(df_courses_tasks.iterrows(), total=len(df_courses_tasks))):
    if i >= max_courses:
        break

    try:
        bundle = generate_tasks_for_course(row, model_name=model_gpt)
        bundles.append(bundle)
    except Exception as e:
        print(f"Problem on row {i} with course {row.get('code')}: {e}")


  0%|          | 0/28 [00:00<?, ?it/s]

  4%|▎         | 1/28 [00:49<22:08, 49.20s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "broad_task": {\n    "task_type": "first_step_approach",\n    "stimulus": "You have just selected an ancient artifact from the univ'
END RAW TEXT PREVIEW



  7%|▋         | 2/28 [01:20<17:25, 40.23s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW101",\n  "course_name": "The Classical Canon I: The Heritage of Antiquity",\n  "broad_task": {\n    "task_type": "first_step_approach",\n    "stimulus": "You need to prepare your first group presentation on a classical canonical item n'
END RAW TEXT PREVIEW






In [11]:
# Group by programme so each programme appears only once
programmes = {}

for bundle in bundles:
    programme = bundle.get("programme_title", "UNKNOWN PROGRAMME")
    course_code = bundle.get("course_code", "")
    course_name = bundle.get("course_name", "")
    broad_task = bundle.get("broad_task", {})
    disamb = bundle.get("disambiguation_tasks", {})

    if programme not in programmes:
        programmes[programme] = {
            "programme_title": programme,
            "courses": []
        }

    programmes[programme]["courses"].append(
        {
            "course_code": course_code,
            "course_name": course_name,
            "broad_task": broad_task,
            "disambiguation_tasks": disamb,
        }
    )

# Optionally sort courses inside each programme by course code
for prog in programmes.values():
    prog["courses"].sort(key=lambda c: c["course_code"])

# Save to JSON in the nested format
microtasks_saving_path = Path("../data_bank_microtasks")
out_path = microtasks_saving_path / "microtasks_by_programme.json"

with out_path.open("w", encoding="utf8") as f:
    json.dump(programmes, f, ensure_ascii=False, indent=2)

print("Saved nested microtasks to:", out_path)


Saved nested microtasks to: ..\data_bank_microtasks\microtasks_by_programme.json
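
The grouping step above is easier to follow on toy data. This sketch uses made-up programme names and course codes, and `dict.setdefault` in place of the explicit `if programme not in programmes` check, which should behave the same way:

```python
# Toy bundles: hypothetical programmes and course codes, for illustration only
toy_bundles = [
    {"programme_title": "P1", "course_code": "b2"},
    {"programme_title": "P2", "course_code": "c1"},
    {"programme_title": "P1", "course_code": "a1"},
]

grouped = {}
for b in toy_bundles:
    # Create the programme entry on first sight, then reuse it
    entry = grouped.setdefault(
        b["programme_title"],
        {"programme_title": b["programme_title"], "courses": []},
    )
    entry["courses"].append({"course_code": b["course_code"]})

# Sort courses inside each programme by course code
for entry in grouped.values():
    entry["courses"].sort(key=lambda c: c["course_code"])

print({p: [c["course_code"] for c in e["courses"]] for p, e in grouped.items()})
# {'P1': ['a1', 'b2'], 'P2': ['c1']}
```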


In [12]:
# Transformation to the expected shape

In [None]:
from collections import defaultdict
from pathlib import Path
import json

# The final structure your colleague expects
# {
#   "Programme name": {
#       "broad": [ ... questions ... ],
#       "R": [ ... questions ... ],
#       "I": [ ... ],
#       "A": [ ... ],
#       "S": [ ... ],
#       "E": [ ... ],
#       "C": [ ... ]
#   },
#   "Another programme": { ... }
# }

ml_structure = defaultdict(lambda: {
    "broad": [],
    "R": [],
    "I": [],
    "A": [],
    "S": [],
    "E": [],
    "C": [],
})

def options_list_to_dict(options_list):
    """
    Turn our list of options into the expected dict:
    { "A": {"text": "...", "riasec": "R"}, ... }
    """
    options_dict = {}
    for opt in options_list:
        label = opt["label"]
        options_dict[label] = {
            "text": opt["text"],
            "riasec": opt["riasec_code"],
        }
    return options_dict

for bundle in bundles:
    prog = bundle.get("programme_title", "UNKNOWN PROGRAMME")

    # 1. Add the broad six option question for this course
    # The key returned by the model is "broad_task" (see the OUTPUT part of SYSTEM_PROMPT)
    broad = bundle.get("broad_task", {})
    if broad:
        broad_question = {
            "question": broad["stimulus"],
            "options": options_list_to_dict(broad["options"]),
        }
        ml_structure[prog]["broad"].append(broad_question)

    # 2. Add the disambiguation questions, grouped under R I A S E C
    # The key returned by the model is "disambiguation_tasks"
    disamb = bundle.get("disambiguation_tasks", {})

    for triple_code, triple_block in disamb.items():
        q = {
            "question": triple_block["stimulus"],
            "options": options_list_to_dict(triple_block["options"]),
        }

        # Each triple is something like "RIA"
        # We add this question once under each letter that appears in the triple
        for letter in set(triple_code):
            if letter in "RIASEC":
                ml_structure[prog][letter].append(q)

# Convert defaultdict to normal dict
ml_structure = dict(ml_structure)

# Save to the shared file
microtasks_saving_path = Path("../data_bank_microtasks")
out_path = microtasks_saving_path / "microtasks_copy.json"

with out_path.open("w", encoding="utf8") as f:
    json.dump(ml_structure, f, ensure_ascii=False, indent=2)

print("Saved ML structure to:", out_path)


Saved ML structure to: ..\data_bank_microtasks\microtasks_for_ml.json
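
Because every triple is added under each of its three letters, one course contributes several questions per letter. A small sketch with placeholder stimuli counts them; the function name `fan_out` is hypothetical and only mirrors the inner loop of the cell above:

```python
from collections import defaultdict


def fan_out(disambiguation_tasks):
    """Map each RIASEC letter to the stimuli of all triples that contain it."""
    by_letter = defaultdict(list)
    for triple_code, block in disambiguation_tasks.items():
        for letter in triple_code:
            by_letter[letter].append(block["stimulus"])
    return dict(by_letter)


# Placeholder tasks for the six triples used in the prompt
toy = {
    t: {"stimulus": f"question for {t}"}
    for t in ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]
}

counts = {letter: len(qs) for letter, qs in fan_out(toy).items()}
print(counts)
# {'R': 3, 'I': 3, 'A': 3, 'S': 3, 'E': 3, 'C': 3}
```

With these six triples each letter appears in exactly three of them, so a course with a complete bundle adds three disambiguation questions under every letter, and each question is stored three times in the file.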
