# 4 Generating the Bank for the MicroTasks

To generate the microtask bank I call an LLM through its API.
For each programme I keep at most two core courses and generate the tasks for each of them.

The prompts use the columns of the course dataframe to generate the microtasks.

We use two prompts: a broad prompt (one task with six options, one per RIASEC profile) and a disambiguation prompt (one task with three options for each of the six RIASEC triples).

In [1]:
from pathlib import Path
import pandas as pd
#!pip install --upgrade openai
import os, re
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm  # optional progress bar, pip install tqdm


## 1. Load the data and keep at most two courses per programme

In [2]:
# load the CSV file with the courses for which we have to generate the tasks
silver = Path("../data_programmes_courses/silver")

df_courses_tasks = pd.read_csv(silver / "df_courses_tasks_silver.csv", encoding="utf-8-sig")
print("The shape of the courses tasks dataframe is:", df_courses_tasks.shape)

# keep only first two courses from each programme
df_courses_tasks = df_courses_tasks.groupby("programme_title").head(2).reset_index(drop=True)
print("After keeping only first two courses from each programme the shape is:", df_courses_tasks.shape)

The shape of the courses tasks dataframe is: (36, 21)
After keeping only first two courses from each programme the shape is: (28, 21)
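
A small sketch of what `groupby(...).head(2)` does in the cell above, on a toy frame whose values are invented for illustration. It keeps the first two rows of each programme in their original row order, so "first two" depends on how the CSV is sorted, not on any ranking of the courses.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame(
    {
        "programme_title": ["A", "A", "A", "B", "C", "C"],
        "code": ["a1", "a2", "a3", "b1", "c1", "c2"],
    }
)

# head(2) keeps the first two rows of each group, in original row order
kept = toy.groupby("programme_title").head(2).reset_index(drop=True)
print(kept["code"].tolist())
```

Programmes with fewer than two courses simply keep what they have, which is why the row count above does not have to be an even multiple of the number of programmes.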


## 2. Set up OpenAI client 

In [3]:

key_path = Path("../data_bank_microtasks") / "api_key.txt"

# Read the key and strip spaces and newlines
api_key = key_path.read_text(encoding="utf8").strip()

# Create the client using this key
client = OpenAI(api_key=api_key)

models = client.models.list()
#for m in models.data:
#    print(m.id)

model_gpt = "gpt-4.1-mini"  # smaller and cheaper model
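
As an alternative to reading the key only from a file, the sketch below checks an environment variable first and falls back to the text file. `load_api_key` is a hypothetical helper, not part of the notebook, and the default variable name `OPENAI_API_KEY` is an assumption about how the key would be exported.

```python
import os
from pathlib import Path


def load_api_key(key_path, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from the text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(key_path)
    if not path.exists():
        raise FileNotFoundError(f"No key in {env_var} and no file at {path}")
    # Strip spaces and newlines, as in the cell above
    return path.read_text(encoding="utf8").strip()
```

It could then be used as `client = OpenAI(api_key=load_api_key(key_path))`, which keeps the key file optional on machines where the variable is set.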


## 3. Define the Prompts
Here are the two system prompts. Each call receives one course as a JSON object and returns one microtask as a JSON object.

In [4]:
SYSTEM_PROMPT_BROAD = """
You generate a single broad microtask for a playful study choice tool.

Students see tiny tasks that feel like doing a real course.
They choose what they would do first.

INPUT

You receive one JSON object with these fields:

programme_title
course_code
course_name
course_objective
course_content
additional_information_teaching_methods
method_of_assessment

Use the course and programme texts to infer what the course is about and what students actually do.

TASK

Create one broad first step task with six options, one for each RIASEC profile.

Rules

1. Pick one specific concept or method that clearly belongs in this course.
2. Write a short stimulus that ends with a question about what to do first.
3. Write tiny_learn with exactly three short sentences:
   a. definition of the key concept
   b. method reminder
   c. common mistake
4. Create six options labeled A to F.
5. Each option must describe a different first step that a student could reasonably take.
6. Map options to RIASEC:
   one option Realistic
   one Investigative
   one Artistic
   one Social
   one Enterprising
   one Conventional
7. Each option must have a riasec_code with one of R I A S E C.
8. Do not mention RIASEC or personality explicitly in the text.

OUTPUT

Return one JSON object with keys:

{
  "programme_title": string,
  "course_code": string,
  "course_name": string,
  "broad": {
    "task_type": "first_step_approach",
    "stimulus": string,
    "tiny_learn": [string, string, string],
    "options": [
      {
        "label": "A" | "B" | "C" | "D" | "E" | "F",
        "text": string,
        "riasec_code": "R" | "I" | "A" | "S" | "E" | "C"
      },
      ... five more options ...
    ]
  }
}

Rules for strings

All string values must be on a single line.
Do not insert raw newline characters inside stimulus, tiny_learn items or option text.
"""
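
The model may not always follow the rules in the broad prompt, so a small check on the parsed output can help. The sketch below is a hypothetical helper (`validate_broad_task` is not defined elsewhere in this notebook) that tests the structural rules only: three `tiny_learn` items, labels A to F, and each RIASEC letter used once. The question-mark test is a rough heuristic for "ends with a question".

```python
RIASEC_LETTERS = set("RIASEC")


def validate_broad_task(bundle):
    """Return a list of problems in a parsed broad-task bundle (empty if none)."""
    problems = []
    broad = bundle.get("broad", {})

    # Heuristic only: a stimulus that ends with "?" is assumed to end with a question
    if not str(broad.get("stimulus", "")).strip().endswith("?"):
        problems.append("stimulus does not end with a question mark")

    if len(broad.get("tiny_learn", [])) != 3:
        problems.append("tiny_learn must have exactly three items")

    options = broad.get("options", [])
    labels = [o.get("label") for o in options]
    codes = [o.get("riasec_code") for o in options]

    if len(labels) != 6 or set(labels) != set("ABCDEF"):
        problems.append("labels must be A to F, each used once")
    if len(codes) != 6 or set(codes) != RIASEC_LETTERS:
        problems.append("each RIASEC code must appear exactly once")
    return problems
```

An empty list means the bundle passed these checks. It says nothing about whether the option texts really fit their RIASEC letter, which still needs a human read.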


In [5]:
SYSTEM_PROMPT_TRIPLE = """
You generate one disambiguation microtask for a playful study choice tool.

INPUT

You receive one JSON object with these fields:

programme_title
course_code
course_name
course_objective
course_content
triple_code

triple_code is one of: "RIA","RIS","REC","IEC","ASE","ASC".

TASK

Create one first step task with three options.
Each option must correspond to one letter in triple_code.

Rules

1. Use the course information to write a realistic first year situation.
2. The stimulus must end with a question about what to do first.
3. Write tiny_learn with exactly three short sentences:
   a. definition of the key concept
   b. method reminder
   c. common mistake
4. Create three options labeled A B C.
5. The three options must together cover all three letters in triple_code exactly once.
   For example for RIA you have one R, one I, one A.
6. Each option must be a plausible first step in the situation.
7. Each option must have a riasec_code equal to one of the letters in triple_code.
8. Do not mention RIASEC or personality in the text.

OUTPUT

Return one JSON object with keys:

{
  "programme_title": string,
  "course_code": string,
  "course_name": string,
  "triple_code": string,
  "disamb_task": {
    "stimulus": string,
    "tiny_learn": [string, string, string],
    "options": [
      {
        "label": "A" | "B" | "C",
        "text": string,
        "riasec_code": one letter from triple_code
      },
      ... two more options ...
    ]
  }
}

Rules for strings

All string values must be on a single line.
Do not insert raw newline characters inside stimulus, tiny_learn items or option text.
"""
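
Rule 5 of the disambiguation prompt (the three options cover the letters of `triple_code` exactly once) is easy to check after parsing. The helper below is a hypothetical sketch, `validate_disamb_task` does not exist elsewhere in this notebook, and it only looks at structure.

```python
def validate_disamb_task(bundle):
    """Return a list of problems in a parsed disambiguation bundle (empty if none)."""
    problems = []
    triple = str(bundle.get("triple_code", ""))
    task = bundle.get("disamb_task", {})
    options = task.get("options", [])

    if len(triple) != 3:
        problems.append("triple_code must have three letters")

    # Compare the sorted option codes with the sorted letters of the triple
    codes = sorted(str(o.get("riasec_code", "")) for o in options)
    if codes != sorted(triple):
        problems.append("options must cover each letter of triple_code exactly once")

    labels = [o.get("label") for o in options]
    if len(labels) != 3 or set(labels) != set("ABC"):
        problems.append("labels must be A, B and C, each used once")

    if len(task.get("tiny_learn", [])) != 3:
        problems.append("tiny_learn must have exactly three items")
    return problems
```

For example, a bundle for `"RIA"` whose options are coded R, I, I would fail the coverage check even though every code is a letter of the triple.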


## 4. Define the helper functions

In [6]:
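# avoid very long texts
def truncate(text, max_chars=1200):
    """
    Simple helper to shorten long course texts.
    Missing values (None or pandas NaN) become an empty string.
    """
    # Without the NaN check, an empty CSV cell would be sent to the model as "nan"
    if text is None or (isinstance(text, float) and pd.isna(text)):
        return ""
    s = str(text)
    if len(s) <= max_chars:
        return s
    return s[:max_chars]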


def build_course_payload(row):
    return {
        "programme_title": str(row.get("programme_title", "")),
        "course_code": str(row.get("code", "")),
        "course_name": str(row.get("course_name", "")),
        "course_objective": truncate(row.get("course_objective", ""), 1200),
        "course_content": truncate(row.get("course_content", ""), 1200),
        "additional_information_teaching_methods": truncate(
            row.get("additional_information_teaching_methods", ""), 800
        ),
        "method_of_assessment": truncate(row.get("method_of_assessment", ""), 800),
    }


def safe_parse_json(raw_text):
    """
    Try to parse the model output as JSON.
    If that fails, try to extract the first {...} block.
    Before parsing, normalise whitespace and try to remove common JSON glitches.
    If it still fails, write the candidate JSON to a debug file so it can be inspected.
    """

    if raw_text is None:
        raise ValueError("Model returned no text at all")

    # Strip and normalise whitespace
    text = raw_text.strip()
    if not text:
        raise ValueError("Model returned an empty string, no JSON to parse")

    # Replace all newline and tab characters with spaces
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\t", " ")

    # Helper to clean likely JSON issues
    def clean(s: str) -> str:
        # Collapse multiple spaces
        s = re.sub(r"\s+", " ", s)
        # Remove trailing commas before closing braces or brackets
        s = re.sub(r",\s*([}\]])", r"\1", s)
        return s

    text = clean(text)

    # First simple attempt on the full text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Second attempt, look for first and last curly brace
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        print("Could not find any JSON object in the text. Here is a preview:")
        print(text[:400])
        raise ValueError("No JSON object found in model output")

    candidate = text[start : end + 1]
    candidate = clean(candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError as e:
        # Write the broken candidate to a debug file so it can be opened in VS Code
        debug_path = Path("debug_candidate.json.txt")
        debug_path.write_text(candidate, encoding="utf8")
        print("\nFailed to parse JSON candidate. Wrote the text to:", debug_path)
        print("Open that file in VS Code and look around the position mentioned in the error.")
        raise e
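
To make the two repair steps in `safe_parse_json` concrete, here is a standalone sketch that applies the same regular expressions to a made-up model reply (the reply text is invented, not real model output).

```python
import json
import re

# Hypothetical model reply: prose around the JSON plus trailing commas
raw = 'Here is the task:\n{"label": "A", "tags": ["x", "y",],}\nDone.'

# Collapse all whitespace, then drop commas that sit right before } or ]
flat = re.sub(r"\s+", " ", raw)
flat = re.sub(r",\s*([}\]])", r"\1", flat)

# Keep only the span from the first { to the last }
candidate = flat[flat.find("{") : flat.rfind("}") + 1]
parsed = json.loads(candidate)
print(parsed)
```

Both regular expressions work on the raw text and do not know where JSON strings start or end, so a comma followed by a closing bracket inside an option text would be altered as well. For short task texts this is a small risk, but it is worth keeping in mind when inspecting the debug file.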




## 5. Functions that call the API with each prompt

In [7]:
TRIPLES = ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]

def generate_broad_task_for_course(row, model_name="gpt-4.1-mini"):
    payload = build_course_payload(row)

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT_BROAD,
        max_output_tokens=600,
    )

    raw_text = response.output_text

    print("BROAD RAW PREVIEW:")
    print(repr((raw_text or "")[:200]))
    print()

    if raw_text is None or not raw_text.strip():
        raise ValueError("Model returned no text for broad task")

    bundle = safe_parse_json(raw_text)

    return bundle



def generate_disamb_for_course_and_triple(row, triple_code, model_name="gpt-4.1-mini"):
    payload = build_course_payload(row)
    payload["triple_code"] = triple_code

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT_TRIPLE,
        max_output_tokens=500,
    )

    raw_text = response.output_text

    print(f"DISAMB {triple_code} RAW PREVIEW:")
    print(repr((raw_text or "")[:200]))
    print()

    if raw_text is None or not raw_text.strip():
        raise ValueError(f"Model returned no text for triple {triple_code}")

    bundle = safe_parse_json(raw_text)

    return bundle
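
Both functions raise when the model returns empty or unparsable text. One possible way to handle that, sketched here as an assumption rather than something this notebook does, is a small retry wrapper. `call_with_retries` is a hypothetical helper and the default number of attempts and wait time are arbitrary.

```python
import time


def call_with_retries(fn, attempts=3, wait_seconds=2.0):
    """Call fn() up to `attempts` times and re-raise the last error if all fail."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:  # broad on purpose for a notebook sketch
            last_error = err
            print(f"Attempt {attempt} failed: {err}")
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise last_error
```

A call would look like `call_with_retries(lambda: generate_broad_task_for_course(row))`. Every retry is a new paid API call, so the number of attempts should stay small.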




In [8]:
def build_course_bundle(row, model_name="gpt-4.1-mini"):
    # broad task
    broad_bundle = generate_broad_task_for_course(row, model_name=model_name)

    programme_title = broad_bundle.get("programme_title", "")
    course_code = broad_bundle.get("course_code", "")
    course_name = broad_bundle.get("course_name", "")
    broad_task = broad_bundle.get("broad", {})

    disamb_block = {}

    # six triples
    for triple in TRIPLES:
        triple_bundle = generate_disamb_for_course_and_triple(
            row,
            triple_code=triple,
            model_name=model_name,
        )
        disamb_block[triple] = {
            "triple_code": triple,
            "stimulus": triple_bundle["disamb_task"]["stimulus"],
            "tiny_learn": triple_bundle["disamb_task"]["tiny_learn"],
            "options": triple_bundle["disamb_task"]["options"],
        }

    # final per course bundle
    return {
        "programme_title": programme_title,
        "course_code": course_code,
        "course_name": course_name,
        "broad": broad_task,
        "disambiguation": disamb_block,
    }


In [9]:
test_row = df_courses_tasks.iloc[0]
test_course_bundle = build_course_bundle(test_row, model_name="gpt-4.1-mini")
print(json.dumps(test_course_bundle, indent=2, ensure_ascii=False))

BROAD RAW PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "broad": {\n    "task_type": "'

DISAMB RIA RAW PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "triple_code": "RIA",\n  "disa'

DISAMB RIS RAW PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "triple_code": "RIS",\n  "disa'

DISAMB REC RAW PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "triple_code": "REC",\n  "disa'

DISAMB IEC RAW PREVIEW:
'{"programme_title":"Ancient Studies","course_cod

In [None]:
# save the test task bundle to a file
output_folder = Path("../data_bank_microtasks") / "test_output"
output_folder.mkdir(parents=True, exist_ok=True)
output_file = output_folder / f"{test_row['code']}_tasks.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(test_course_bundle, f, ensure_ascii=False, indent=2)

## 6. Loop over courses, build bundles, save tasks

Now loop over the courses, call `build_course_bundle` for each one, and collect the results.

In [None]:

# Full microtask generation

bundles = []

max_courses = 10  # for safety on first run

for i, (_, row) in enumerate(tqdm(df_courses_tasks.iterrows(), total=len(df_courses_tasks))):
    if i >= max_courses:
        break

    try:
        # one broad task plus six disambiguation tasks per course
        bundle = build_course_bundle(row, model_name=model_gpt)
        bundles.append(bundle)
    except Exception as e:
        print(f"Problem on row {i} with course {row.get('code')}: {e}")



In [None]:
microtasks_saving_path = Path("../data_bank_microtasks")
out_path = microtasks_saving_path / "microtasks_by_course.json"

with out_path.open("w", encoding="utf8") as f:
    json.dump(bundles, f, ensure_ascii=False, indent=2)

print("Saved course level tasks to:", out_path)
    

In [None]:

microtasks_saving_path = Path("../data_bank_microtasks")
in_path = microtasks_saving_path / "microtasks_by_course.json"
out_path = microtasks_saving_path / "microtasks_by_programme.json"

# 1. Load the list of course bundles
with in_path.open("r", encoding="utf8") as f:
    bundles = json.load(f)

# 2. Group courses by programme_title
programmes = {}

for bundle in bundles:
    programme = bundle.get("programme_title", "UNKNOWN PROGRAMME")
    course_code = bundle.get("course_code", "")
    course_name = bundle.get("course_name", "")
    # build_course_bundle stores the tasks under "broad" and "disambiguation"
    broad = bundle.get("broad", {})
    disambiguation = bundle.get("disambiguation", {})

    # If this programme is not yet in the dict, create the container
    if programme not in programmes:
        programmes[programme] = {
            "programme_title": programme,
            "courses": []
        }

    # Add this course to the courses list for that programme
    programmes[programme]["courses"].append(
        {
            "course_code": course_code,
            "course_name": course_name,
            "broad": broad,
            "disambiguation": disambiguation,
        }
    )

# 3. Save the nested structure as pretty JSON
with out_path.open("w", encoding="utf8") as f:
    json.dump(programmes, f, ensure_ascii=False, indent=2)

print("Saved nested programmes file to:", out_path)
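
For reference, the same grouping can be written more compactly with `collections.defaultdict`. This is only a sketch on toy bundles (the values are invented), assuming each bundle has the keys returned by `build_course_bundle`.

```python
from collections import defaultdict

# Toy bundles, invented for illustration; real ones come from build_course_bundle
toy_bundles = [
    {"programme_title": "P1", "course_code": "c1", "course_name": "One",
     "broad": {}, "disambiguation": {}},
    {"programme_title": "P2", "course_code": "c3", "course_name": "Three",
     "broad": {}, "disambiguation": {}},
    {"programme_title": "P1", "course_code": "c2", "course_name": "Two",
     "broad": {}, "disambiguation": {}},
]

grouped = defaultdict(list)
for b in toy_bundles:
    # Everything except the programme title belongs to the course entry
    course = {k: v for k, v in b.items() if k != "programme_title"}
    grouped[b.get("programme_title", "UNKNOWN PROGRAMME")].append(course)

programmes_demo = {
    title: {"programme_title": title, "courses": courses}
    for title, courses in grouped.items()
}
print(list(programmes_demo))
```

Courses stay in the order in which they appear in the input list, so the order inside each programme follows the order of the generation loop.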
