# 4 Generating the Bank for the MicroTasks

To generate the bank of microtasks I will call an LLM through its API.
The output will be two tasks for each core course of each programme. Each task carries:

1) Programme
2) Core course
3) Stimulus (the problem)
4) Tiny learn (three bullet points of reading material)
5) Options (six for the first-step task; the challenge task defined in the prompt below uses four)


In short, I will write a prompt that uses the columns of df_courses_tasks to generate the microtasks.


In [1]:
from pathlib import Path
import pandas as pd
#!pip install --upgrade openai
import os
from openai import OpenAI
import json
import numpy as np
from tqdm import tqdm  # optional progress bar, pip install tqdm


## 1. Load the data and keep at most two courses per programme

In [2]:
# load the csv file with the courses for which we have to generate the tasks
silver = Path("../data_programmes_courses/silver")

df_courses_tasks = pd.read_csv(silver / "df_courses_tasks_silver.csv", encoding="utf-8-sig")
print("The shape of the courses tasks dataframe is:", df_courses_tasks.shape)

# keep only first two courses from each programme
df_courses_tasks = df_courses_tasks.groupby("programme_title").head(2).reset_index(drop=True)
print("After keeping only first two courses from each programme the shape is:", df_courses_tasks.shape)

The shape of the courses tasks dataframe is: (21, 21)
After keeping only first two courses from each programme the shape is: (14, 21)


## 2. Set up the client 

In [3]:

# Create the OpenAI client
def get_openai_client():
    """
    Read the API key from a local text file and return a client.
    The key file must be gitignored so the key never reaches the repo.
    """
    key_path = Path("../data_bank_microtasks") / "api_key.txt"

    # Read the key and strip spaces and newlines
    api_key = key_path.read_text(encoding="utf8").strip()

    # Create the client using this key
    client = OpenAI(api_key=api_key)

    return client

# Create the client once
client = get_openai_client()
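
As an alternative to a key file alone, the key can be looked up in an environment variable first. The sketch below is a minimal version of that idea. `read_api_key` and its `env_var` parameter are hypothetical names that do not appear elsewhere in this notebook; `OPENAI_API_KEY` is the variable the OpenAI SDK reads by default.

```python
import os
from pathlib import Path


def read_api_key(key_path, env_var="OPENAI_API_KEY"):
    """
    Return the API key from the environment, else from a local file.
    Hypothetical helper, shown only as a sketch.
    """
    # Prefer the environment variable when it is set and non-empty
    from_env = os.environ.get(env_var, "").strip()
    if from_env:
        return from_env

    # Fall back to the key file, as get_openai_client does
    path = Path(key_path)
    if not path.exists():
        raise FileNotFoundError(
            f"No {env_var} in the environment and no key file at {path}"
        )
    return path.read_text(encoding="utf-8").strip()
```

The returned string could then be passed to `OpenAI(api_key=...)` in place of the file read inside `get_openai_client`.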

In [4]:
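# Optional sanity check: list the models available to this key
models = client.models.list()
#for m in models.data:
#    print(m.id)


# smaller and cheaper model; note that the calls below pass "gpt-4.1-mini" explicitly
model_gpt = "gpt-5-mini"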

## 3. Define the Prompt
The system prompt states all the project logic once, and we reuse it for every course.


CHANGE: base this on the programme vectors, such as A: best two, B: medium two and C: lowest two.

In [5]:
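# First draft of the prompt. The next cell reassigns SYSTEM_PROMPT, so only the revised version is sent to the model.
SYSTEM_PROMPT = """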
You generate microtasks for a playful study choice tool.

Students see tiny tasks that feel like doing a real course.
They choose what they would do first.
The system uses their choices to infer style signals and to show fit with programmes.

INPUT

You receive one JSON object with these fields:

course_code
course_name
course_objective
course_content
additional_information_teaching_methods
method_of_assessment
template_type

template_type is either "first_step_approach" or "challenge".

Use the course texts to infer what the course is about and what students actually do.
Focus on realistic first year situations and concrete thinking tasks.

SHARED RULES FOR ALL TASKS

1. Pick one specific concept or method that clearly belongs in this course.
2. Write a short stimulus. It describes a realistic situation for a first year student and ends with a question about what to do first.
3. Write a tiny_learn bubble with exactly three items. Each item is one short sentence.
   a. One line that defines the key concept.
   b. One line that reminds the student of a useful method or principle.
   c. One line that names a common mistake or confusion.
   Keep the total reading load under eighty words.
4. Use simple language. Avoid formulas unless the course is clearly very quantitative.
5. Do not mention RIASEC, personality, traits or any scoring logic in the task text.
6. Do not mention that this is a test or an assessment. It should feel like a normal study question.

TEMPLATE A  first_step_approach

Goal

Give six plausible first actions that all fit the scenario but express different styles.
Each option maps to one of the six RIASEC profiles.

Use these profiles:

Realistic: "lab","field","equipment","tools","build","repair","operate","install","measure",
          "laboratory","prototype","machinery","hardware","electronics","sample","specimen","safety",
          "construction","manual","physical","technician","maintenance","inspection","diagnose","weld"

Investigative: "analyze","theory","model","proof","derive","experiment","hypothesis","data",
          "research","statistics","algorithm","simulate","evidence","inference","mathematics","physics","logic",
          "quantitative","scientific","compute","computation","evaluate","study","investigate"

Artistic: "design","draw","sketch","compose","write","narrative","visual","media","art",
          "music","film","theatre","creative","story","photography","gallery","curation",
          "performance","aesthetic","illustrate","exhibit","craft","fashion","style"

Social: "help","support","advise","coach","teach","tutor","counsel","community","team",
          "care","wellbeing","interview","facilitate","mentor","outreach","collaborate","group","clients",
          "service","social","develop","train","educate"

Enterprising: "business","lead","manage","strategy","sales","marketing","finance","entrepreneurship",
          "pitch","negotiate","market","revenue","growth","product","stakeholder","budget","plan",
          "customer","commercial","operation","organisational","investor","network"

Conventional: "organize","detail","procedure","policy","regulation","compliance","audit","accounting",
          "schedule","record","document","database","spreadsheet","report","inventory","forms","workflow","quality",
          "administration","logistics","systematic","process","standard"

Instructions

1. The question in the stimulus must clearly be about the first step. For example
   "What do you do first" or "Which first step makes most sense".
2. Create six options labeled from A to F.
3. Each option must describe a different first step that a student could reasonably take in this situation.
4. Map the options to the six RIASEC profiles as follows:
   exactly one option is Realistic
   exactly one option is Investigative
   exactly one option is Artistic
   exactly one option is Social
   exactly one option is Enterprising
   exactly one option is Conventional
5. Make the mapping clear by including a riasec_code for each option.
6. All options must be plausible. None of them should be obviously wrong or silly. They just reflect different ways to start.
7. The action in each option must be a method or behaviour, not a statement of fact or a vague attitude.


TEMPLATE B  challenge

Goal

Give a short scenario and four first step options where one option is clearly the best according to the course concept or method.

Instructions

1. Use the same style of stimulus and tiny_learn bubble as above.
2. Create four options labeled from A to D.
3. All options must describe possible first steps in the situation.
4. Exactly one option should be the best answer for a student who understands the course concept.
5. The other three options must be attractive but not the best first step. For example they can:
   focus on a secondary method
   skip an important check
   mix up cause and effect
6. The task should test understanding or reasoning about the course content, not trivia or memory of small details.
7. The student should be able to answer using only the information implied by the course description and normal first year knowledge.

OUTPUT

Return one JSON object for the course. The keys depend on template_type.

Always include:

programme: string
course_code: string
course_name: string

If template_type is "first_step_approach", also include:

   task_type: "first_step_approach"
   stimulus: string
   tiny_learn: list of three short strings
   options: list of six option objects, each with:
      label: one of "A","B","C","D","E","F"
      text: short description of the first step
      riasec_code: one of "R","I","A","S","E","C"

If template_type is "challenge", also include:

   task_type: "challenge_mcq"
   stimulus: string
   tiny_learn: list of three short strings
   options: list of four option objects, each with:
      label: one of "A","B","C","D"
      text: short description of the first step
   correct_option: label of the best answer, one of "A","B","C","D"

GENERAL OUTPUT RULES

1. Always return a single JSON object.
2. Start the reply with "{" and end the reply with "}".
3. Do not wrap the JSON in code fences.
4. Do not add any commentary, headings or explanation outside the JSON.
"""


In [6]:
SYSTEM_PROMPT = """
You generate microtasks for a playful study choice tool.

Students see tiny tasks that feel like doing a real course.
They choose what they would do first.
The system uses their choices to infer style signals and to show fit with programmes.

INPUT

You receive one JSON object with these fields:

programme_title
course_code
course_name
course_objective
course_content
additional_information_teaching_methods
method_of_assessment

Use the course and programme texts to infer what the course is about and what students actually do.
Focus on realistic first year situations and concrete thinking tasks.

SHARED RULES FOR ALL TASKS

1. Pick one specific concept or method that clearly belongs in this course.
2. Write a short stimulus. It describes a realistic situation for a first year student and ends with a question about what to do first.
3. Write a tiny_learn bubble with exactly three items. Each item is one short sentence.
   a. One line that defines the key concept.
   b. One line that reminds the student of a useful method or principle.
   c. One line that names a common mistake or confusion.
   Keep the total reading load under eighty words.
4. Use simple language. Avoid formulas unless the course is clearly very quantitative.
5. Do not mention RIASEC, personality, traits or any scoring logic in the task text.
6. Do not mention that this is a test or an assessment. It should feel like a normal study question.

TEMPLATE A  first_step_approach

Goal

Give six plausible first actions that all fit the scenario but express different styles.
Each option maps to one of the six RIASEC profiles.

Use these profiles as style hints:

Realistic  practical, hands on, using tools, data or physical activity
Investigative  analytical, exploring patterns, questions and evidence
Artistic  creative representation, open ended design and expression
Social  interaction with people, explanation, collaboration and feedback
Enterprising  initiative, decision, persuasion, planning and pitching
Conventional  organising, documenting, structuring and checking details

Instructions

1. The question in the stimulus must clearly be about the first step. For example
   "What do you do first" or "Which first step makes most sense".
2. Create six options labeled from A to F.
3. Each option must describe a different first step that a student could reasonably take in this situation.
4. Map the options to the six RIASEC profiles as follows:
   exactly one option is Realistic
   exactly one option is Investigative
   exactly one option is Artistic
   exactly one option is Social
   exactly one option is Enterprising
   exactly one option is Conventional
5. Make the mapping clear by including a riasec_code for each option.
6. All options must be plausible. None of them should be obviously wrong or silly. They just reflect different ways to start.
7. The action in each option must be a method or behaviour, not a statement of fact or a vague attitude.

TEMPLATE B  challenge

Goal

Give a short scenario and four first step options where one option is clearly the best according to the course concept or method.

Instructions

1. Use the same style of stimulus and tiny_learn bubble as above.
2. Create four options labeled from A to D.
3. All options must describe possible first steps in the situation.
4. Exactly one option should be the best answer for a student who understands the course concept.
5. The other three options must be attractive but not the best first step. For example they can:
   focus on a secondary method
   skip an important check
   mix up cause and effect
6. The task should test understanding or reasoning about the course content, not trivia or memory of small details.
7. The student should be able to answer using only the information implied by the course description and normal first year knowledge.

OUTPUT

For the given course return one JSON object with this structure:

{
  "programme_title": string,
  "course_code": string,
  "course_name": string,
  "tasks": [
    {
      "task_type": "first_step_approach",
      "stimulus": string,
      "tiny_learn": [string, string, string],
      "options": [
        {
          "label": "A" | "B" | "C" | "D" | "E" | "F",
          "text": string,
          "riasec_code": "R" | "I" | "A" | "S" | "E" | "C"
        },
        ... five more options ...
      ]
    },
    {
      "task_type": "challenge",
      "stimulus": string,
      "tiny_learn": [string, string, string],
      "options": [
        {
          "label": "A" | "B" | "C" | "D",
          "text": string
        },
        ... three more options ...
      ],
      "correct_option": "A" | "B" | "C" | "D"
    }
  ]
}

GENERAL OUTPUT RULES

1. Always return a single JSON object.
2. Start the reply with "{" and end the reply with "}".
3. Do not wrap the JSON in code fences.
4. Do not add any commentary, headings or explanation outside the JSON.
"""
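
The structure described in the OUTPUT section of the revised prompt can be checked mechanically once a reply has been parsed. The sketch below is one possible check, not part of the pipeline that follows. `validate_bundle` and `RIASEC_CODES` are hypothetical names, and the rules it enforces are only the ones the prompt states (three tiny_learn items, six options with one RIASEC code each, four challenge options with a valid correct_option).

```python
RIASEC_CODES = {"R", "I", "A", "S", "E", "C"}


def validate_bundle(bundle):
    """Return a list of problems found in one course bundle (empty if none)."""
    problems = []
    tasks = bundle.get("tasks", [])
    by_type = {t.get("task_type"): t for t in tasks}

    # Every task needs exactly three tiny_learn items
    for task in tasks:
        if len(task.get("tiny_learn", [])) != 3:
            problems.append(f"{task.get('task_type')}: tiny_learn must have three items")

    first = by_type.get("first_step_approach")
    if first is None:
        problems.append("missing first_step_approach task")
    else:
        options = first.get("options", [])
        codes = [o.get("riasec_code") for o in options]
        # Six options covering all six codes means each code appears once
        if len(options) != 6 or set(codes) != RIASEC_CODES:
            problems.append("first_step_approach: need six options, one per RIASEC code")

    challenge = by_type.get("challenge")
    if challenge is None:
        problems.append("missing challenge task")
    else:
        labels = [o.get("label") for o in challenge.get("options", [])]
        if len(labels) != 4:
            problems.append("challenge: need four options")
        if challenge.get("correct_option") not in labels:
            problems.append("challenge: correct_option is not one of the option labels")

    return problems
```

An empty list means the bundle matches the requested shape; any other result names what to regenerate or fix by hand.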


## 4. Define the helper functions

In [7]:

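def truncate(text, max_chars=1200):
    """
    Simple helper to shorten long course texts.
    Missing values (None or pandas NaN) become an empty string.
    """
    # pd.isna also catches NaN cells, which str() would otherwise turn into "nan"
    if text is None or pd.isna(text):
        return ""
    s = str(text)
    if len(s) <= max_chars:
        return s
    return s[:max_chars]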


def build_course_payload(row):
    """
    Prepare the JSON object that will be sent to the model for one course.
    """
    return {
        "programme_title": str(row.get("programme_title", "")),
        "course_code": str(row.get("code", "")),
        "course_name": str(row.get("course_name", "")),
        "course_objective": truncate(row.get("course_objective", ""), 1200),
        "course_content": truncate(row.get("course_content", ""), 1200),
        "additional_information_teaching_methods": truncate(
            row.get("additional_information_teaching_methods", ""), 800
        ),
        "method_of_assessment": truncate(row.get("method_of_assessment", ""), 800),
    }



def safe_parse_json(raw_text):
    """
    Try to parse the model output as JSON.
    If that fails, try to extract the first {...} block.
    If that also fails, raise a clear error.
    """
    if raw_text is None:
        raise ValueError("Model returned no text at all")

    text = raw_text.strip()

    # If the model returned an empty string
    if not text:
        raise ValueError("Model returned an empty string, no JSON to parse")

    # First simple attempt
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Second attempt, look for first and last curly brace
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1 or end <= start:
        print("Could not find any JSON object in the text. Here is a preview:")
        print(text[:400])
        raise ValueError("No JSON object found in model output")

    candidate = text[start : end + 1]

    try:
        return json.loads(candidate)
    except json.JSONDecodeError as e:
        print("Failed to parse this JSON candidate:")
        print(candidate[:400])
        raise e
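
One failure that appears in the run log of this notebook is a trailing comma before a closing brace, which `json.loads` rejects. The sketch below shows a naive repair that could be tried before giving up. `parse_json_lenient` is a hypothetical helper, and the regular expression is an assumption that holds only when no string value itself contains a comma directly before `}` or `]`.

```python
import json
import re

# Matches a comma that is followed only by whitespace and then } or ]
_TRAILING_COMMA = re.compile(r",(\s*[}\]])")


def parse_json_lenient(text):
    """Parse JSON, retrying once with trailing commas removed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Naive repair: this would also alter a string value containing ",}" or ",]"
        repaired = _TRAILING_COMMA.sub(r"\1", text)
        return json.loads(repaired)
```

If the repaired text still fails to parse, the second `json.loads` raises as usual, so the caller sees a normal `JSONDecodeError`.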



## 5. API call function

In [8]:
def generate_tasks_for_course(row, model_name="gpt-4.1-mini"):
    """
    Call the Responses API for a single course.

    The model returns both tasks:
    one first_step_approach task and one challenge task.

    Returns a Python dict that matches the JSON described in SYSTEM_PROMPT.
    """

    payload = build_course_payload(row)

    response = client.responses.create(
        model=model_name,
        input=json.dumps(payload),
        instructions=SYSTEM_PROMPT,
        max_output_tokens=800,
    )

    raw_text = response.output_text

    print("RAW TEXT PREVIEW:")
    print(repr((raw_text or "")[:300]))
    print("END RAW TEXT PREVIEW\n")

    if raw_text is None or not raw_text.strip():
        print("Full response object for debugging:")
        print(response)
        raise ValueError("Model returned no text, check model name and prompt")

    task_bundle = safe_parse_json(raw_text)

    # Optionally attach any extra metadata here if needed
    return task_bundle
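
Because a single call can fail on a malformed reply or a transient API error, it may help to retry a few times before skipping a course. The sketch below wraps any callable with exponential backoff. `call_with_retries` and its parameters are hypothetical names, and catching every `Exception` is a deliberate simplification.

```python
import time


def call_with_retries(fn, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error = None
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as error:  # broad on purpose: parse and API errors alike
            last_error = error
            # Wait 1x, 2x, 4x ... base_delay, but not after the final attempt
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

A call such as `call_with_retries(generate_tasks_for_course, row, model_name="gpt-4.1-mini")` would then stand in for the direct call. Each retry is a new paid request, so `attempts` is best kept small.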




In [10]:
test_row = df_courses_tasks.iloc[0]
test_bundle = generate_tasks_for_course(test_row, model_name="gpt-4.1-mini")
print(json.dumps(test_bundle, indent=2, ensure_ascii=False))

RAW TEXT PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have just accessed the archaeological collectio'
END RAW TEXT PREVIEW

{
  "programme_title": "Ancient Studies",
  "course_code": "L_AABAOHW115",
  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",
  "tasks": [
    {
      "task_type": "first_step_approach",
      "stimulus": "You have just accessed the archaeological collection to study an ancient statue that caught your interest. You want to start your research effectively. What do you do first?",
      "tiny_learn": [
        "Material culture includes objects made or used by people in the past to tell us about history.",
        "Begin by identifying and describing the object's features and context in detail.",
     

In [11]:
# save the test task bundle to a file
output_folder = Path("../data_bank_microtasks") / "test_output"
output_folder.mkdir(parents=True, exist_ok=True)
output_file = output_folder / f"{test_row['code']}_tasks.json"
with open(output_file, "w", encoding="utf-8") as f:
    json.dump(test_bundle, f, ensure_ascii=False, indent=2)
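
The prompt says that a student's choices are used to infer style signals. How the tool scores them is not defined in this notebook, so the sketch below is only an assumption: it counts the `riasec_code` of each chosen option. `tally_riasec` is a hypothetical name, and challenge tasks contribute nothing because their options carry no code.

```python
from collections import Counter


def tally_riasec(choices):
    """Count RIASEC codes over (task, chosen_label) pairs."""
    counts = Counter()
    for task, chosen_label in choices:
        for option in task.get("options", []):
            # Only first_step_approach options carry a riasec_code
            if option.get("label") == chosen_label and "riasec_code" in option:
                counts[option["riasec_code"]] += 1
    return counts
```

For example, `tally_riasec([(test_bundle["tasks"][0], "B")])` would count the code attached to option B of the first task.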

## 6. Loop over courses and save tasks

Now build a simple loop that calls the function for each course and collects the results. Each call returns both task templates, so no template has to be chosen per course.

In [12]:

# Full microtask generation

bundles = []

max_courses = 10  # for safety on first run

for i, (_, row) in enumerate(tqdm(df_courses_tasks.iterrows(), total=len(df_courses_tasks))):
    if i >= max_courses:
        break

    try:
        bundle = generate_tasks_for_course(row, model_name="gpt-4.1-mini")
        bundles.append(bundle)
    except Exception as e:
        print(f"Problem on row {i} with course {row.get('code')}: {e}")



  7%|▋         | 1/14 [00:13<02:56, 13.60s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW115",\n  "course_name": "Objects in Context. An Interdisciplinary Perspective on the Ancient World",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have just started your first assignment analyzi'
END RAW TEXT PREVIEW



 14%|█▍        | 2/14 [00:25<02:28, 12.40s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Ancient Studies",\n  "course_code": "L_AABAOHW101",\n  "course_name": "The Classical Canon I: The Heritage of Antiquity",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have to start a group project about a classical canonical item not previ'
END RAW TEXT PREVIEW



 21%|██▏       | 3/14 [00:35<02:07, 11.62s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Archaeology",\n  "course_code": "L_AABAARC111",\n  "course_name": "Archaeological Sources",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have just received a box of various archaeological artefacts from a recent excavation. Your task is to'
END RAW TEXT PREVIEW



 29%|██▊       | 4/14 [00:46<01:53, 11.33s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Archaeology",\n  "course_code": "L_AABAARC101",\n  "course_name": "What is Archaeology?",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "As a first-year archaeology student, you need to start preparing for an assignment where you study an archae'
END RAW TEXT PREVIEW



 36%|███▌      | 5/14 [00:56<01:38, 10.91s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Communication and Information Studies",\n  "course_code": "L_AABACIW102",\n  "course_name": "Introduction to Communication Studies",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have just started the Introduction to Communication Studies c'
END RAW TEXT PREVIEW



 43%|████▎     | 6/14 [01:07<01:26, 10.75s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Communication and Information Studies",\n  "course_code": "L_AABACIW103",\n  "course_name": "Introduction to Linguistics",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have just started the Introduction to Linguistics course and need to an'
END RAW TEXT PREVIEW



 50%|█████     | 7/14 [01:35<01:55, 16.49s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Econometrics and Operations Research",\n  "course_code": "XB_0160",\n  "course_name": "Calculus and Analysis I",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You need to solve a problem involving finding the local maximum of a function with tw'
END RAW TEXT PREVIEW



 57%|█████▋    | 8/14 [01:44<01:24, 14.03s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "Econometrics and Operations Research",\n  "course_code": "XB_0099",\n  "course_name": "Introduction to Programming",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have to start your first programming assignment where you need to write a Pyt'
END RAW TEXT PREVIEW



 64%|██████▍   | 9/14 [01:54<01:04, 12.94s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "History",\n  "course_code": "L_AABAGESACS",\n  "course_name": "Academic Skills for Historians",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You are starting to write your first status quaestionis on a historical topic for your course. You hav'
END RAW TEXT PREVIEW

Failed to parse this JSON candidate:
{
  "programme_title": "History",
  "course_code": "L_AABAGESACS",
  "course_name": "Academic Skills for Historians",
  "tasks": [
    {
      "task_type": "first_step_approach",
      "stimulus": "You are starting to write your first status quaestionis on a historical topic for your course. You have a long list of books and articles but need to decide how to begin organizing your research. What d
Problem on row 8 with course L_AABAGESACS: Illegal trailing comma before end of object: line 66 column 102 (char 2763)


 71%|███████▏  | 10/14 [02:06<00:50, 12.68s/it]

RAW TEXT PREVIEW:
'{\n  "programme_title": "History",\n  "course_code": "L_AABAGESACV",\n  "course_name": "Academic Skills for Historians",\n  "tasks": [\n    {\n      "task_type": "first_step_approach",\n      "stimulus": "You have just received your historiographical theme for the status quaestionis assignment. You need to'
END RAW TEXT PREVIEW
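
In the run above, the course L_AABAGESACS failed to parse and is therefore missing from `bundles`. A loop that also records failures makes it easier to rerun only those courses. The sketch below is one way to do that. `collect_bundles` is a hypothetical helper, and it assumes each row offers a `.get` method, as both a dict and a pandas Series do.

```python
def collect_bundles(rows, generate, max_courses=None):
    """Run generate(row) for each row and return (bundles, failures)."""
    bundles, failures = [], []
    for i, row in enumerate(rows):
        if max_courses is not None and i >= max_courses:
            break
        try:
            bundles.append(generate(row))
        except Exception as error:
            # Keep enough context to rerun just this course later
            failures.append({"row": i, "code": row.get("code"), "error": str(error)})
    return bundles, failures
```

With the dataframe from this notebook, `rows` could be `(row for _, row in df_courses_tasks.iterrows())` and `generate` could be `generate_tasks_for_course`.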






In [13]:
microtasks_saving_path = Path("../data_bank_microtasks")
out_path = microtasks_saving_path / "microtasks_by_course.json"

with out_path.open("w", encoding="utf8") as f:
    json.dump(bundles, f, ensure_ascii=False, indent=2)

print("Saved course level tasks to:", out_path)
    

Saved course level tasks to: ..\data_bank_microtasks\microtasks_by_course.json


In [14]:

microtasks_saving_path = Path("../data_bank_microtasks")
in_path = microtasks_saving_path / "microtasks_by_course.json"
out_path = microtasks_saving_path / "microtasks_by_programme.json"

# 1. Load the list of course bundles
with in_path.open("r", encoding="utf8") as f:
    bundles = json.load(f)

# 2. Group courses by programme_title
programmes = {}

for bundle in bundles:
    programme = bundle.get("programme_title", "UNKNOWN PROGRAMME")
    course_code = bundle.get("course_code", "")
    course_name = bundle.get("course_name", "")
    tasks = bundle.get("tasks", [])

    # If this programme is not yet in the dict, create the container
    if programme not in programmes:
        programmes[programme] = {
            "programme_title": programme,
            "courses": []
        }

    # Add this course to the courses list for that programme
    programmes[programme]["courses"].append(
        {
            "course_code": course_code,
            "course_name": course_name,
            "tasks": tasks,
        }
    )

# 3. Save the nested structure as pretty JSON
with out_path.open("w", encoding="utf8") as f:
    json.dump(programmes, f, ensure_ascii=False, indent=2)

print("Saved nested programmes file to:", out_path)


Saved nested programmes file to: ..\data_bank_microtasks\microtasks_by_programme.json
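
For a quick inspection of the bank, the nested programme structure can be flattened back into one record per task. The sketch below assumes the same `programmes` dict built in the cell above. `flatten_programmes` and the `n_options` field are hypothetical additions.

```python
def flatten_programmes(programmes):
    """Turn the nested programme dict into one flat record per task."""
    records = []
    for programme in programmes.values():
        for course in programme.get("courses", []):
            for task in course.get("tasks", []):
                records.append({
                    "programme_title": programme.get("programme_title"),
                    "course_code": course.get("course_code"),
                    "course_name": course.get("course_name"),
                    "task_type": task.get("task_type"),
                    "stimulus": task.get("stimulus"),
                    # Number of answer options, handy for spotting malformed tasks
                    "n_options": len(task.get("options", [])),
                })
    return records
```

Passing the result to `pd.DataFrame(...)` gives a table that can be grouped by `programme_title` or `task_type` to check that every course has both tasks.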
