# Generation of Video Challenges

We want to generate a new task bank called `videochallenge_bank.json`.

Each entry is a micro challenge: for a given programme, show one YouTube Short, then ask one question with four answer options.

To keep the questions grounded, we reuse the course context that is already used for the aptitude tasks.

We call the YouTube API once, offline, to find candidate Shorts per programme.

We store those candidate videos in a small table with the programme, video id, URL, title, and possibly the duration.

Then we fetch the transcripts for those videos and store them in a second table.

Finally, we feed the programme context plus the transcript into an OpenAI prompt that generates the question and its options.
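
To make the target concrete, here is a minimal sketch of what one bank entry could look like. The class `VideoChallenge` and all of its field names are assumptions for illustration; the final schema of the bank may differ. The values are placeholders, not real generated content.

```python
from dataclasses import dataclass, asdict
from typing import List
import json


@dataclass
class VideoChallenge:
    # all field names here are assumptions, not the final bank schema
    programme_title: str
    video_id: str
    video_url: str
    question: str
    options: List[str]
    correct_index: int

    def validate(self) -> None:
        # each challenge is meant to have exactly four options
        if len(self.options) != 4:
            raise ValueError("a challenge needs exactly four options")
        if not 0 <= self.correct_index < len(self.options):
            raise ValueError("correct_index must point at one of the options")


# placeholder values only, to show the shape of one entry
example = VideoChallenge(
    programme_title="Ancient Studies",
    video_id="VIDEO_ID_PLACEHOLDER",
    video_url="https://www.youtube.com/watch?v=VIDEO_ID_PLACEHOLDER",
    question="Placeholder question?",
    options=["Option A", "Option B", "Option C", "Option D"],
    correct_index=0,
)
example.validate()

# the bank itself would then be a JSON list of such entries
bank_json = json.dumps([asdict(example)], indent=2, ensure_ascii=False)
```

Validating each entry before writing the file is one cheap way to catch generated items that come back with too few or too many options.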


## 1. Setup, imports, paths, and OpenAI client

In [1]:
from pathlib import Path
import json
import re

import pandas as pd
from tqdm import tqdm

# we import the OpenAI client
from openai import OpenAI

In [2]:
# here we define the folders that we use in the project
# we keep them consistent with the other notebooks
silver_dir = Path("../data_programmes_courses/silver")
bank_dir = Path("../data_bank_microtasks")

# we make sure that the bank folder exists
bank_dir.mkdir(parents=True, exist_ok=True)

In [3]:
# here we read the OpenAI API key from a local file
# we store the key in a simple text file that is not tracked by git
# for example we add data_bank_microtasks/api_key.txt to .gitignore
openai_key_path = bank_dir / "api_key.txt"
openai_api_key = openai_key_path.read_text(encoding="utf8").strip()

# we create the OpenAI client that we will use later
client = OpenAI(api_key=openai_api_key)

# we choose a default model name, the same we use in the other notebooks
model_gpt = "gpt-4.1-mini"

print("OpenAI client and model ready.")

OpenAI client and model ready.
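
If the key file is missing, `read_text` fails with a `FileNotFoundError` that does not say much about what to do next. The sketch below is one possible alternative, not what the notebook uses: the helper `load_api_key` is hypothetical, and it assumes we are happy to fall back to an environment variable when the file is absent.

```python
import os
from pathlib import Path


def load_api_key(key_path: Path, env_var: str) -> str:
    """Return the key stored in key_path, else the value of env_var."""
    if key_path.exists():
        key = key_path.read_text(encoding="utf-8").strip()
    else:
        # fallback is an assumption: the notebook itself only reads the file
        key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"No API key found in {key_path} or in ${env_var}")
    return key
```

With this helper, the cell above could call `load_api_key(bank_dir / "api_key.txt", "OPENAI_API_KEY")` and would raise a readable error when neither source provides a key.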


## 2. Load course data

In [4]:
# Here we load the course dataframe that we later use to describe each programme

# we load the courses that we already prepared for microtasks
df_courses_tasks = pd.read_csv(
    silver_dir / "df_courses_tasks_silver.csv",
    encoding="utf-8-sig"
)

print("Original courses tasks shape:", df_courses_tasks.shape)



# we keep only the first two courses per programme
# we do this to keep the context short and focused for the prompts
df_courses_tasks_small = (
    df_courses_tasks
    .groupby("programme_title", as_index=False)
    .head(2)
    .reset_index(drop=True)
)

print("After keeping two core courses per programme:", df_courses_tasks_small.shape)

# we list the programme names
programmes = sorted(df_courses_tasks_small["programme_title"].unique())
print("Number of programmes:", len(programmes))
print("First few programmes:", programmes[:5])

Original courses tasks shape: (36, 21)
After keeping two core courses per programme: (28, 21)
Number of programmes: 14
First few programmes: ['Ancient Studies', 'Biomedical Sciences', 'Business Analytics', 'Communication and Information Studies', 'Computer Science']
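
A small toy example shows what `groupby(...).head(2)` does here. The dataframe `toy` is made up for illustration. Note that `head(2)` keeps the first two rows in file order, so "core" only holds if the CSV is already sorted that way, and a programme with a single course would keep just one row.

```python
import pandas as pd

# toy data, made up for illustration: three programmes with 3, 2 and 1 courses
toy = pd.DataFrame({
    "programme_title": ["A", "A", "A", "B", "B", "C"],
    "course_name": ["a1", "a2", "a3", "b1", "b2", "c1"],
})

# head(2) keeps at most the first two rows of every group, in original order
first_two = toy.groupby("programme_title").head(2).reset_index(drop=True)

# number of rows that survived per programme
counts = first_two.groupby("programme_title").size().to_dict()
```

In the real data the shape goes from 36 to 28 rows for 14 programmes, so every programme there has at least two courses.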


## 3. Build a programme context helper

In [5]:
def build_programme_context(df_prog: pd.DataFrame) -> str:
    """
    With this function we build a compact text snippet for one programme.
    We use the two core courses that we kept above.

    Input:
        df_prog: small dataframe with two rows for one programme

    Output:
        context: string that describes the programme and the two courses
    """
    lines = []

    # we take the programme name from the first row
    prog_name = df_prog["programme_title"].iloc[0]
    lines.append(f"Programme: {prog_name}")

    # we loop over the two core courses
    for _, row in df_prog.iterrows():
        course_title = row.get("course_name", "")
        course_obj = row.get("course_objective", "")
        course_cont = row.get("course_content", "")

        if isinstance(course_title, str) and course_title.strip():
            lines.append(f"Course: {course_title.strip()}")

        if isinstance(course_obj, str) and course_obj.strip():
            # we keep the full objective text
            lines.append(f"Objectives: {course_obj.strip()}")

        if isinstance(course_cont, str) and course_cont.strip():
            # we keep only a short part of the content to control length
            snippet = course_cont.strip()[:400]
            lines.append(f"Content snippet: {snippet}")

    # we join all lines into a single context block
    context = "\n".join(lines)
    return context

# here we test the helper for one programme
test_prog = programmes[0]
df_test_prog = df_courses_tasks_small[
    df_courses_tasks_small["programme_title"] == test_prog
]
print("Context for test programme:")
print(build_programme_context(df_test_prog))

Context for test programme:
Programme: Ancient Studies
Course: Objects in Context. An Interdisciplinary Perspective on the Ancient World
Objectives: A distinct feature of Ancient Studies is the combined use of written and material sources. In this course the focus is on material culture, and how this can be used as a source of information about the past. You will familiarize yourself with some of the main categories of objects such as statues, reliefs, coins and utensils as well as with complex assemblages that are part of the sacred, domestic and funerary domain. An important issue is how we evaluate objects as sources of historical information about the past. What are the possibilities and limitations? What types of questions can we ask? For this we take a closer look at the historical context of material sources: we explore why material culture was created by people in the past, and how this affects our image of antiquity. Special attention will be paid to confronting material and l
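
The helper cuts `course_content` at exactly 400 characters, which can stop in the middle of a word. As a possible refinement, the sketch below backs up to the last space before the limit. The function `truncate_at_word` is hypothetical and is not part of the notebook.

```python
def truncate_at_word(text: str, max_chars: int = 400) -> str:
    """Cut text to at most max_chars characters without splitting a word."""
    text = text.strip()
    if len(text) <= max_chars:
        return text

    cut = text[:max_chars]
    # if the next character is whitespace, the cut already ends on a word
    if text[max_chars].isspace():
        return cut.rstrip()

    head, sep, _ = cut.rpartition(" ")
    # fall back to the hard cut when there is no space to back up to
    return head.rstrip() if sep else cut
```

Inside `build_programme_context`, this would replace the slice `course_cont.strip()[:400]` with `truncate_at_word(course_cont, 400)`.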

## 4. Set up YouTube API access

This section uses three code blocks:
- a: YouTube API setup
- b: search plus enrich with details
- c: add captions to build df_video

In [6]:
# STEP a
# Here we set up YouTube API access for searching and getting video details

import requests  # we use this to call the YouTube API

# we read the YouTube API key from a local file
# we keep youtube_api_key.txt out of git by adding it to .gitignore
yt_key_path = bank_dir / "youtube_api_key.txt"
yt_api_key = yt_key_path.read_text(encoding="utf8").strip()

# base URLs for YouTube Data API v3
YOUTUBE_SEARCH_URL = "https://www.googleapis.com/youtube/v3/search"
YOUTUBE_VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"

print("YouTube API key loaded.")

YouTube API key loaded.


In [11]:
# STEP b
# Here we define helpers to search for Shorts and then enrich them with more details

def search_shorts_for_programme(programme_name: str,
                                max_results: int = 10,
                                language: str = "en") -> list[dict]:
    """
    With this function we search YouTube for short videos related to one programme.

    Input:
        programme_name: name of the bachelor programme
        max_results: how many videos we request from the API
        language: main language that we prefer in the search

    Output:
        list of dict, each dict has basic info coming from the search endpoint
    """
    # here we build a simple search query, we can refine this later per programme
    query = f"Course of {programme_name}. Short educational video."

    params = {
        "key": yt_api_key,
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "videoDuration": "short",  # this means shorter than four minutes
        "relevanceLanguage": language,
        "safeSearch": "moderate",
    }

    # we send the request to the YouTube search endpoint
    # we set a timeout so that a stalled connection cannot hang the notebook
    resp = requests.get(YOUTUBE_SEARCH_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    results = []
    for item in data.get("items", []):
        vid = item["id"]["videoId"]
        snippet = item["snippet"]

        # here we extract basic fields from the snippet part
        title = snippet.get("title", "")
        description = snippet.get("description", "")
        channel_title = snippet.get("channelTitle", "")
        published_at = snippet.get("publishedAt", "")

        video_url = f"https://www.youtube.com/watch?v={vid}"

        results.append({
            "programme_title": programme_name,
            "video_id": vid,
            "video_url": video_url,
            "video_title": title,
            "video_description": description,
            "channel_title": channel_title,
            "published_at": published_at,
        })

    return results


def enrich_videos_with_details(video_rows: list[dict]) -> list[dict]:
    """
    With this function we take a list of basic video rows with video_id.
    We call the videos endpoint to get duration, caption flag, viewCount,
    likeCount, and topic categories, then we merge these back.

    Input:
        video_rows: list of dict from search_shorts_for_programme

    Output:
        new list of dict where each row has duration, statistics, and topics added
    """
    if not video_rows:
        return video_rows

    # here we collect the video ids into one comma separated string
    video_ids = [row["video_id"] for row in video_rows]
    id_str = ",".join(video_ids)

    params = {
        "key": yt_api_key,
        "part": "snippet,contentDetails,statistics,topicDetails",
        "id": id_str,
    }

    # we set a timeout so that a stalled connection cannot hang the notebook
    resp = requests.get(YOUTUBE_VIDEOS_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    # here we build a small index from video id to details
    details_by_id = {}
    for item in data.get("items", []):
        vid = item["id"]
        snippet = item.get("snippet", {})
        content = item.get("contentDetails", {})
        stats = item.get("statistics", {})
        topics = item.get("topicDetails", {})

        # duration is ISO 8601, for example PT45S
        duration = content.get("duration", "")
        caption_flag = content.get("caption", "")

        view_count = stats.get("viewCount")
        like_count = stats.get("likeCount")

        # topic categories is usually a list of URLs, we keep them as a simple list of strings
        topic_categories = topics.get("topicCategories", [])

        details_by_id[vid] = {
            "duration": duration,
            "caption_flag": caption_flag,
            "viewCount": view_count,
            "likeCount": like_count,
            "topicCategories": topic_categories,
        }

    # here we merge the details into the original rows
    enriched = []
    for row in video_rows:
        info = details_by_id.get(row["video_id"], {})
        new_row = row.copy()
        new_row.update(info)
        enriched.append(new_row)

    return enriched
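

`enrich_videos_with_details` sends all video ids in one request. That works for the ten results we ask for here. As far as we know, the `videos` endpoint accepts at most 50 ids per call, so a larger list would need batching. We treat that limit as an assumption to check against the API documentation. The helper `chunked` below is hypothetical and only sketches the batching step.

```python
from typing import List


def chunked(items: List[str], size: int = 50) -> List[List[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


# made-up ids, only to show the batching
video_ids = [f"id{i}" for i in range(120)]

# one comma separated string per request
id_strings = [",".join(batch) for batch in chunked(video_ids)]
batch_sizes = [len(batch) for batch in chunked(video_ids)]
```

Each string in `id_strings` could then be sent as the `id` parameter of one request, and the resulting `details_by_id` dictionaries merged before the final loop.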




## 5. Test the YouTube API search and enrichment

In [12]:
# here we test both helpers on one programme and wrap the results into a dataframe
test_prog = programmes[0]

basic_rows = search_shorts_for_programme(test_prog, max_results=10)
enriched_rows = enrich_videos_with_details(basic_rows)

df_video_test = pd.DataFrame(enriched_rows)
print("Test search and enrich for programme:", test_prog)
df_video_test.head()

Test search and enrich for programme: Ancient Studies


Unnamed: 0,programme_title,video_id,video_url,video_title,video_description,channel_title,published_at,duration,caption_flag,viewCount,likeCount,topicCategories
0,Ancient Studies,pFK9nC6iD6I,https://www.youtube.com/watch?v=pFK9nC6iD6I,THE ANCIENT AGE | Educational Videos for Kids.,"Learn what the Ancient Age is, when it begins ...",Happy Learning English,2023-11-02T11:00:48Z,PT3M,False,39877,182,[https://en.wikipedia.org/wiki/Knowledge]
1,Ancient Studies,IUZKg3KdtYo,https://www.youtube.com/watch?v=IUZKg3KdtYo,Ancient Greece | Educational Videos for Kids,PREMIERES! https://www.youtube.com/playlist?li...,Happy Learning English,2019-02-19T12:06:38Z,PT3M48S,False,970847,6190,[https://en.wikipedia.org/wiki/Knowledge]
2,Ancient Studies,b9bcohqsTGk,https://www.youtube.com/watch?v=b9bcohqsTGk,ROMAN EMPIRE | Educational Video for Kids.,PREMIERES! https://www.youtube.com/playlist?li...,Happy Learning English,2017-11-21T16:30:00Z,PT3M27S,False,1261368,6166,[https://en.wikipedia.org/wiki/Knowledge]
3,Ancient Studies,ce-A8bRc7ZA,https://www.youtube.com/watch?v=ce-A8bRc7ZA,Sinkhole reveals Lost Ancient Civilization,,History Piece,2024-01-23T15:27:04Z,PT55S,False,1698292,56719,[https://en.wikipedia.org/wiki/Knowledge]
4,Ancient Studies,jIufj6A0Sxo,https://www.youtube.com/watch?v=jIufj6A0Sxo,#Ancient #History #Timeline,Ancient History Timeline.,Padhlo Padhalo,2023-02-18T16:46:13Z,PT9S,False,81410,1030,[https://en.wikipedia.org/wiki/Knowledge]
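
The `duration` column holds ISO 8601 strings such as `PT3M48S`, and `videoDuration="short"` only guarantees videos under four minutes, so several results are longer than a typical Short. A minimal sketch for converting the durations to seconds and filtering on them follows. The helper name and the 60 second cutoff are assumptions, not something the notebook fixes.

```python
import re
from typing import Optional

# matches durations such as PT9S, PT3M48S, PT1H2M3S or P1DT2H
_DURATION_RE = re.compile(
    r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$"
)


def iso8601_duration_to_seconds(duration: str) -> Optional[int]:
    """Convert an ISO 8601 duration string to seconds, or None if unparsable."""
    match = _DURATION_RE.match(duration or "")
    if not match or duration == "P":
        return None
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# durations taken from the test output above
durations = ["PT3M", "PT3M48S", "PT3M27S", "PT55S", "PT9S"]
seconds = [iso8601_duration_to_seconds(d) for d in durations]
# seconds == [180, 228, 207, 55, 9]

# assumed cutoff: we treat videos of at most 60 seconds as Shorts
max_short_seconds = 60
short_enough = [d for d, s in zip(durations, seconds)
                if s is not None and s <= max_short_seconds]
# short_enough == ["PT55S", "PT9S"]
```

On a dataframe, the same conversion could be applied with `df_video_test["duration"].map(iso8601_duration_to_seconds)` before filtering.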
