# 4 Generating the Bank for the MicroTasks

To generate the microtask bank, I call an LLM through its API.
For each programme I keep at most two core courses, and for each of those courses the model produces one broad question and six disambiguation questions.

The prompts use the columns of the courses dataframe (objective, content, teaching methods, assessment) as context for generating the microtasks.

We use two prompts: broad + disambiguation (Kenneth style).

In [1]:
from pathlib import Path
import pandas as pd
#!pip install --upgrade openai
import os, re
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm  # optional progress bar, pip install tqdm


## 1. Load the data and keep at most two courses per programme

In [2]:
# load the CSV file with the courses for which we have to generate the tasks
silver = Path("../data_programmes_courses/silver")

df_courses_tasks = pd.read_csv(silver / "df_courses_tasks_silver.csv", encoding="utf-8-sig")
print("The shape of the courses tasks dataframe is:", df_courses_tasks.shape)

# keep only first two courses from each programme
df_courses_tasks = df_courses_tasks.groupby("programme_title").head(2).reset_index(drop=True)
print("After keeping only first two courses from each programme the shape is:", df_courses_tasks.shape)

The shape of the courses tasks dataframe is: (36, 21)
After keeping only first two courses from each programme the shape is: (28, 21)
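
The filter above relies on `groupby(...).head(2)`, which keeps the first two rows of every group in their original order. A minimal sketch on toy data (the programme names and course codes below are made up for illustration, not taken from the real CSV) shows that a programme with fewer than two courses simply keeps what it has:

```python
import pandas as pd

# Toy data: hypothetical programmes and course codes, for illustration only.
toy = pd.DataFrame({
    "programme_title": ["History", "History", "History", "Mathematics"],
    "code": ["H1", "H2", "H3", "M1"],
})

# head(2) keeps the first two rows of each group, in the original row order.
first_two = toy.groupby("programme_title").head(2).reset_index(drop=True)

# Count how many courses survive per programme.
kept_per_programme = first_two["programme_title"].value_counts().to_dict()

print(first_two)
print(kept_per_programme)
```

Because the selection depends on row order, sorting the CSV differently would change which two courses represent a programme.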


## 2. Set up OpenAI client 

In [3]:

key_path = Path("../data_bank_microtasks") / "api_key.txt"

# Read the key and strip spaces and newlines
api_key = key_path.read_text(encoding="utf8").strip()

# Create the client using this key
client = OpenAI(api_key=api_key)

models = client.models.list()
#for m in models.data:
#    print(m.id)

model_gpt = "gpt-4.1-mini"  
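
As an alternative to reading the key only from a file, the sketch below checks an environment variable first and falls back to the text file. The helper name `load_api_key` is hypothetical; `OPENAI_API_KEY` is the variable the OpenAI SDK looks for by default, and the file path is whatever you pass in.

```python
import os
from pathlib import Path

def load_api_key(key_path, env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from a text file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    path = Path(key_path)
    if not path.is_file():
        raise FileNotFoundError(f"No key in ${env_var} and no file at {path}")
    # Strip spaces and newlines, as in the cell above.
    key = path.read_text(encoding="utf8").strip()
    if not key:
        raise ValueError(f"Key file {path} is empty")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error on the first API call.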


## 3. Define the Prompts
Here are the two system prompts that generate the questions for each programme from its core courses: one for the broad question and one for the disambiguation questions.

In [4]:
SYSTEM_PROMPT_BROAD = """
You generate one broad RIASEC question for a playful study choice tool.

The question should feel like a first step in a real course task.

INPUT
You receive one JSON object with:
- programme_title
- course_code
- course_name
- course_objective
- course_content
- teaching_methods
- assessment

TASK
Create exactly one multiple choice question with six options.

Rules
1. Use the course information to imagine a realistic first year situation.
2. Write a short question string that describes the situation and ends with a question.
3. Create tiny_learn as a list of exactly three short sentences:
   a. definition of the key concept
   b. method reminder
   c. common mistake
4. Create six options labelled A to F.
5. Each option describes a plausible first action the student could take.
6. Each option must be tagged with one RIASEC code:
   R, I, A, S, E, or C.
7. Across the six options you must use each of the six RIASEC codes exactly once.
8. Do not mention RIASEC, personality, or profiles in the visible text.

OUTPUT
Return one JSON object with this shape:

{
  "question": string,
  "tiny_learn": [string, string, string],
  "options": {
    "A": {"text": string, "riasec": "R" | "I" | "A" | "S" | "E" | "C"},
    "B": {...},
    "C": {...},
    "D": {...},
    "E": {...},
    "F": {...}
  }
}

All strings must be single line strings. Do not insert raw newline characters in any value.
"""

SYSTEM_PROMPT_DISAMB = """
You generate one RIASEC disambiguation question for a playful study choice tool.

The question should feel like a first step in a real course task.

INPUT
You receive one JSON object with:
- programme_title
- course_code
- course_name
- course_objective
- course_content
- teaching_methods
- assessment
- triple_code   for example "RIA"

triple_code contains three distinct letters from R, I, A, S, E, C.

TASK
Create exactly one multiple choice question with three options.

Rules
1. Use the course information to imagine a realistic first year situation.
2. Write a short question string that describes the situation and ends with a question.
3. Create tiny_learn as a list of exactly three short sentences:
   a. definition of the key concept
   b. method reminder
   c. common mistake
4. Create three options labelled A, B, C.
5. The three options together must use exactly the three letters in triple_code, one per option.
   For example triple_code "RIA" means one R, one I, one A.
6. Each option describes a plausible first action the student could take.
7. Each option must be tagged with a riasec letter that is one of the letters in triple_code.
8. Do not mention RIASEC, personality, or profiles in the visible text.

OUTPUT
Return one JSON object with this shape:

{
  "question": string,
  "tiny_learn": [string, string, string],
  "options": {
    "A": {"text": string, "riasec": one letter from triple_code},
    "B": {...},
    "C": {...}
  }
}

All strings must be single line strings. Do not insert raw newline characters in any value.
"""


## 4. Define the helper functions

In [5]:
import json
import re
from collections import defaultdict
from pathlib import Path
from tqdm import tqdm

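def truncate(text, max_chars=1200):
    """Shorten very long fields; missing values (None or NaN) become an empty string."""
    # pandas represents empty CSV cells as float NaN, which str() would turn into "nan"
    if text is None or (isinstance(text, float) and text != text):
        return ""
    s = str(text)
    if len(s) <= max_chars:
        return s
    return s[:max_chars]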

def build_course_payload(row):
    """Context that we pass to the model for one course."""
    return {
        "programme_title": str(row.get("programme_title", "")),
        "course_code": str(row.get("code", "")),
        "course_name": str(row.get("course_name", "")),
        "course_objective": truncate(row.get("course_objective", ""), 800),
        "course_content": truncate(row.get("course_content", ""), 800),
        "teaching_methods": truncate(row.get("additional_information_teaching_methods", ""), 400),
        "assessment": truncate(row.get("method_of_assessment", ""), 400),
    }

def safe_parse_json(raw_text: str):
    """
    Clean and parse model output as JSON.
    Removes newlines and obvious trailing commas, then parses.
    """
    if raw_text is None:
        raise ValueError("Model returned no text")

    text = raw_text.strip()
    if not text:
        raise ValueError("Model returned empty text")

    # normalise whitespace
    text = text.replace("\r\n", " ").replace("\n", " ").replace("\t", " ")
    text = re.sub(r"\s+", " ", text)

    # remove trailing commas before closing braces or brackets
    text = re.sub(r",\s*([}\]])", r"\1", text)

    # try direct
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # try first and last brace
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        print("Could not find JSON object, preview:")
        print(text[:400])
        raise ValueError("No JSON object found")

    candidate = text[start : end + 1]
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)

    try:
        return json.loads(candidate)
    except json.JSONDecodeError as e:
        print("Still could not parse, preview:")
        print(candidate[:400])
        raise e


def call_broad_question(row, model_name=model_gpt):
    payload = build_course_payload(row)

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT_BROAD,
        max_output_tokens=500,
    )

    raw = response.output_text
    print("BROAD PREVIEW:", repr((raw or "")[:160]))
    return safe_parse_json(raw)


def call_disamb_question(row, triple_code, model_name=model_gpt):
    payload = build_course_payload(row)
    payload["triple_code"] = triple_code

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT_DISAMB,
        max_output_tokens=400,
    )

    raw = response.output_text
    print(f"DISAMB {triple_code} PREVIEW:", repr((raw or "")[:160]))
    return safe_parse_json(raw)
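
Both call helpers raise when the request fails or when the output cannot be parsed as JSON. One possible way to make a run more robust is a small retry wrapper with exponential backoff. This is a sketch, not part of the notebook: the name `with_retries` and its parameters are hypothetical.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() until it succeeds or attempts run out, doubling the wait each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # On the last attempt, let the original error propagate.
            if attempt == attempts:
                raise
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait:.1f}s")
            sleep(wait)
```

In the notebook it could be used as `with_retries(lambda: call_broad_question(row))`. Since the model output is not deterministic, a second call can succeed where the first returned malformed JSON, although it will not help if the output is cut off by `max_output_tokens` every time.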



## 5. Call the API for each course and build the question bank

In [None]:
from collections import defaultdict
from pathlib import Path
import json
from tqdm import tqdm

TRIPLES = ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]

# final result:
# {
#   "Programme": {
#       "broad": [ {question_code, question, tiny_learn, options}, ... ],
#       "R": [ {question_code, question, tiny_learn, options}, ... ],
#       "I": [...], "A": [...], "S": [...], "E": [...], "C": [...]
#   }
# }
ml_structure = defaultdict(lambda: {
    "broad": [],
    "R": [],
    "I": [],
    "A": [],
    "S": [],
    "E": [],
    "C": [],
})

# track how many questions we have emitted per course
from collections import Counter
question_counter = Counter()

max_courses = 100   # start small, then increase

for i, (_, row) in enumerate(tqdm(df_courses_tasks.iterrows(), total=len(df_courses_tasks))):
    if i >= max_courses:
        break

    programme = str(row.get("programme_title", "UNKNOWN PROGRAMME"))
    course_code = str(row.get("code", ""))

    # 1) broad question for this course
    try:
        broad_obj = call_broad_question(row)
    except Exception as e:
        print(f"Problem making broad question for course {course_code}: {e}")
        continue

    question_counter[course_code] += 1
    broad_qcode = f"{course_code}_{question_counter[course_code]}"

    broad_entry = {
        "question_code": broad_qcode,
        "question": broad_obj["question"],
        "tiny_learn": broad_obj["tiny_learn"],
        "options": broad_obj["options"],
    }

    ml_structure[programme]["broad"].append(broad_entry)

    # 2) six disambiguation questions for this course
    for triple in TRIPLES:
        try:
            disamb_obj = call_disamb_question(row, triple)
        except Exception as e:
            print(f"Problem making disamb {triple} for course {course_code}: {e}")
            continue

        question_counter[course_code] += 1
        disamb_qcode = f"{course_code}_{question_counter[course_code]}"

        disamb_entry = {
            "question_code": disamb_qcode,
            "question": disamb_obj["question"],
            "tiny_learn": disamb_obj["tiny_learn"],
            "options": disamb_obj["options"],
        }

        # attach under each RIASEC letter in the triple
        for letter in set(triple):
            if letter in "RIASEC":
                ml_structure[programme][letter].append(disamb_entry)
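
The loop reads `broad_obj["question"]`, `["tiny_learn"]` and `["options"]` outside the `try` blocks, so a response that parses as JSON but lacks a key would raise a `KeyError` and stop the run. It also does not check that the RIASEC codes follow the rules in the prompts. A minimal validation sketch, assuming the schema described in the two prompts (the function name `validate_question` and its arguments are hypothetical):

```python
from collections import Counter

def validate_question(obj, expected_codes, labels):
    """Return a list of problems; an empty list means the object fits the schema."""
    problems = []
    if not isinstance(obj.get("question"), str) or not obj["question"].strip():
        problems.append("question must be a non-empty string")
    tiny = obj.get("tiny_learn")
    if not (isinstance(tiny, list) and len(tiny) == 3
            and all(isinstance(s, str) for s in tiny)):
        problems.append("tiny_learn must be a list of three strings")
    options = obj.get("options")
    if not isinstance(options, dict) or sorted(options) != sorted(labels):
        problems.append(f"options must have exactly the labels {list(labels)}")
        return problems
    # Each expected RIASEC code must be used exactly once across the options.
    codes = [str(opt.get("riasec")) if isinstance(opt, dict) else "?"
             for opt in options.values()]
    if Counter(codes) != Counter(expected_codes):
        problems.append(f"riasec codes {codes} do not match {list(expected_codes)}")
    for label, opt in options.items():
        if not isinstance(opt, dict) or not isinstance(opt.get("text"), str):
            problems.append(f"option {label} needs a text string")
    return problems

# Made-up disambiguation question for the triple "RIA".
example = {
    "question": "You receive a worn coin. What do you do first?",
    "tiny_learn": ["Definition.", "Method reminder.", "Common mistake."],
    "options": {
        "A": {"text": "Weigh and measure it.", "riasec": "R"},
        "B": {"text": "Read about its period.", "riasec": "I"},
        "C": {"text": "Sketch its symbols.", "riasec": "A"},
    },
}
print(validate_question(example, expected_codes="RIA", labels="ABC"))
```

For a broad question the call would use `expected_codes="RIASEC"` and `labels="ABCDEF"`; for a disambiguation question, the triple and `labels="ABC"`. Entries with a non-empty problem list could be skipped or regenerated before they are appended to `ml_structure`.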


## 6. Saving

In [8]:
ml_structure = dict(ml_structure)

output_dir = Path("../data_bank_microtasks")
output_dir.mkdir(parents=True, exist_ok=True)

out_path = output_dir / "microtasks_prototype.json"

with out_path.open("w", encoding="utf8") as f:
    json.dump(ml_structure, f, ensure_ascii=False, indent=2)

print("Saved ML structure to:", out_path)


Saved ML structure to: ..\data_bank_microtasks\microtasks_prototype.json


In [9]:
with out_path.open("r", encoding="utf8") as f:
    data = json.load(f)

print("Programmes:", list(data.keys())[:5])

prog_name = next(iter(data.keys()))
p = data[prog_name]

print("\nProgramme:", prog_name)
print(" broad questions:", len(p["broad"]))
print(" R questions:", len(p["R"]))
print(" I questions:", len(p["I"]))
print(" A questions:", len(p["A"]))
print(" S questions:", len(p["S"]))
print(" E questions:", len(p["E"]))
print(" C questions:", len(p["C"]))

from pprint import pprint

print("\nExample broad question:")
pprint(p["broad"][0])

print("\nExample R disambiguation question:")
if p["R"]:
    pprint(p["R"][0])


Programmes: ['Ancient Studies', 'Communication and Information Studies', 'Econometrics and Operations Research', 'History', 'Literature and Society']

Programme: Ancient Studies
 broad questions: 2
 R questions: 6
 I questions: 6
 A questions: 6
 S questions: 6
 E questions: 6
 C questions: 6

Example broad question:
{'options': {'A': {'riasec': 'A',
                   'text': 'Sketch the coin to capture its details and symbols '
                           'for further artistic analysis.'},
             'B': {'riasec': 'I',
                   'text': 'Visit the library to locate academic articles that '
                           "explain the coin's historical period and usage."},
             'C': {'riasec': 'S',
                   'text': 'Contact the museum curator to ask about the '
                           'conservation process and physical characteristics '
                           'of the coin.'},
             'D': {'riasec': 'R',
                   'text': 'Analyze the coin
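
The counts printed for this programme (2 broad questions and 6 questions per letter) follow directly from the `TRIPLES` list: every letter appears in three triples, and each disambiguation question is stored under all three letters of its triple. The sketch below recomputes the expected sizes, assuming every API call succeeds; `expected_bank_sizes` is a hypothetical helper.

```python
from collections import Counter

TRIPLES = ["RIA", "RIS", "REC", "IEC", "ASE", "ASC"]

# Count in how many triples each RIASEC letter appears.
triples_per_letter = Counter(letter for triple in TRIPLES for letter in triple)

def expected_bank_sizes(n_courses):
    """Expected list lengths per programme if no call fails."""
    sizes = {"broad": n_courses}
    for letter in "RIASEC":
        sizes[letter] = n_courses * triples_per_letter[letter]
    return sizes

print(expected_bank_sizes(2))
# {'broad': 2, 'R': 6, 'I': 6, 'A': 6, 'S': 6, 'E': 6, 'C': 6}
```

Note that the saved JSON therefore contains each disambiguation question three times, once per letter, so a programme with two courses has 12 unique disambiguation questions rather than 36.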

In [22]:
# the number and names of the programmes for which microtasks were generated
print("\nTotal programmes with microtasks generated:", len(data))
print("Programmes:", list(data.keys()))

# load the file with the programme vectors
programme_vectors_path = "../data_RIASEC/df_RIASEC_programmes_vectors.csv"
programme_vectors = pd.read_csv(programme_vectors_path, encoding="utf-8-sig")

# number of programmes with vectors
print("The number of vectors: ", len(programme_vectors))
print("Programme titles with vectors:", list(programme_vectors["programme_title"].unique()))

# difference between programmes with vectors and programmes with microtasks
programmes_with_microtasks = set(data.keys())
programmes_with_vectors = set(programme_vectors["programme_title"].unique())
programmes_difference = programmes_with_vectors - programmes_with_microtasks
print("Programmes with vectors but no microtasks:", programmes_difference)


Total programmes with microtasks generated: 14
Programmes: ['Ancient Studies', 'Communication and Information Studies', 'Econometrics and Operations Research', 'History', 'Literature and Society', 'Philosophy', 'Biomedical Sciences', 'Business Analytics', 'Computer Science', 'Economics and Business Economics', 'International Business Administration', 'Mathematics', 'Media, Art, Design and Architecture', 'Philosophy, Politics and Economics']
The number of vectors:  17
Programme titles with vectors: ['Ancient Studies', 'Archaeology', 'Artificial Intelligence', 'Biomedical Sciences', 'Business Analytics', 'Communication and Information Studies', 'Computer Science', 'Econometrics and Data Science', 'Econometrics and Operations Research', 'Economics and Business Economics', 'History', 'International Business Administration', 'Literature and Society', 'Mathematics', 'Media, Art, Design and Architecture', 'Philosophy', 'Philosophy, Politics and Economics']
Programmes with vectors but no micr
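
The cell above only checks one direction (vectors without microtasks). A small sketch that reports both directions, using made-up programme names; `compare_coverage` is a hypothetical helper:

```python
def compare_coverage(with_vectors, with_microtasks):
    """Report programmes that appear in only one of the two collections."""
    vectors, tasks = set(with_vectors), set(with_microtasks)
    return {
        "missing_microtasks": sorted(vectors - tasks),
        "missing_vectors": sorted(tasks - vectors),
    }

# Toy input, not the real programme lists.
report = compare_coverage(
    with_vectors=["Prog A", "Prog B", "Prog C"],
    with_microtasks=["Prog B", "Prog D"],
)
print(report)
```

With the notebook's variables, the call would be `compare_coverage(programme_vectors["programme_title"], data.keys())`.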