<a href="https://colab.research.google.com/github/creation-extro/ai-nlp/blob/main/AI%20Content%20Analysis%20%26%20Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install moviepy whisper-timestamped spacy pydantic
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 126.9 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


# Task
Generate 3 to 4 multiple-choice quiz questions based on key concepts identified in the `arrays.mp4` video transcript. Each quiz consists of a question, one correct option derived from the segment's topic, and three incorrect options drawn from other noun chunks in the segment or from generic placeholders. All quizzes are saved to a single `all_quizzes.json` file.
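
As a rough illustration of the target record, the sketch below checks that a quiz dict carries four options with exactly one marked correct. The helper name `is_valid_quiz` and the sample record are hypothetical and are not part of the pipeline; only the field names (`question`, `options`, `text`, `is_correct`) follow the `Question` and `QuizOption` models used in this notebook.

```python
# Hypothetical helper: a plain-Python check, not part of the notebook's pipeline.
def is_valid_quiz(quiz: dict) -> bool:
    """Return True if the quiz has a question, 4 options, and exactly one correct option."""
    options = quiz.get("options", [])
    if not quiz.get("question") or len(options) != 4:
        return False
    correct = [opt for opt in options if opt.get("is_correct") is True]
    return len(correct) == 1


# Made-up sample record, for illustration only.
sample_quiz = {
    "question": "Which of the following best describes: the integer type?",
    "options": [
        {"text": "the integer type", "is_correct": True},
        {"text": "a separate array", "is_correct": False},
        {"text": "floating value", "is_correct": False},
        {"text": "similar type", "is_correct": False},
    ],
}

print(is_valid_quiz(sample_quiz))
```

A check like this could run just before the quizzes are written to disk, since the Pydantic models alone do not enforce the "exactly one correct option" rule.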

## Select Multiple Quiz Segments

### Subtask:
Modify the main execution block to iterate through the `topic_list` and select 3 to 4 distinct segments (or as many as available up to 4) that are suitable for quiz generation.


**Reasoning**:
The subtask requires modifying the main execution block to select multiple quiz segments from the `topic_list`. I need to iterate through the `topic_list`, select up to 4 segments, and then call the `map_llm_output_to_quiz_json` function for each selected segment.
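
One possible way to pick the segments is sketched below. It assumes each entry of `topic_list` is a dict with a `topic` key, as produced by `detect_segment_topics`; the helper name `select_quiz_segments` and the demo data are hypothetical. Skipping repeated topic strings is one reading of "distinct", not necessarily what the notebook's own cell does.

```python
# Hypothetical helper; the notebook's main block may select segments differently.
def select_quiz_segments(topic_list: list, max_segments: int = 4) -> list:
    """Return up to max_segments segments, skipping repeated topic strings."""
    selected = []
    seen_topics = set()
    for segment in topic_list:
        if len(selected) >= max_segments:
            break
        topic = segment["topic"]
        if topic in seen_topics:
            continue  # Same topic string already chosen: skip it
        seen_topics.add(topic)
        selected.append(segment)
    return selected


# Made-up segments, for illustration only.
demo_topic_list = [
    {"start_sec": 0, "topic": "a data structure", "text_chunk": "..."},
    {"start_sec": 7, "topic": "similar type", "text_chunk": "..."},
    {"start_sec": 9, "topic": "similar type", "text_chunk": "..."},
    {"start_sec": 14, "topic": "the integer type", "text_chunk": "..."},
    {"start_sec": 16, "topic": "floating value", "text_chunk": "..."},
    {"start_sec": 20, "topic": "a separate array", "text_chunk": "..."},
]

for seg in select_quiz_segments(demo_topic_list):
    print(seg["start_sec"], seg["topic"])
```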



**Reasoning**:
The previous instruction asked to modify the main execution block to select multiple quiz segments, which was already implemented in the provided code in `cell_id: f54347f1`. Since the system requested to 'generate the full fixed code again', I am providing the current, correct version of that cell.



**Reasoning**:
The subtask requires saving all generated quiz JSON objects into a single 'all_quizzes.json' file. I need to modify the `map_llm_output_to_quiz_json` function to return the quiz data instead of writing to individual files, and then collect all quizzes in the main execution block before writing them to a single file.
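
The collect-then-write pattern described here might look like the following minimal sketch. The helper name `save_all_quizzes`, the demo records, and the temporary output path are assumptions for illustration; the notebook itself writes to `all_quizzes.json` in the working directory.

```python
import json
import os
import tempfile


# Hypothetical helper: writes the whole list in one call instead of one file per quiz.
def save_all_quizzes(quizzes: list, path: str) -> int:
    """Write every quiz dict to a single JSON file and return how many were written."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(quizzes, f, indent=4)
    return len(quizzes)


# Collect first, write once. The records below are made-up stand-ins for real quizzes.
collected = []
for i, trigger_time in enumerate([0, 7, 14]):
    collected.append({"id": f"quiz-{trigger_time}-{i}", "trigger_time_sec": trigger_time})

demo_path = os.path.join(tempfile.gettempdir(), "all_quizzes_demo.json")
written = save_all_quizzes(collected, demo_path)
print(f"Wrote {written} quizzes to {demo_path}")
```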



# Task
Implement a new function `generate_rule_based_quiz(text_chunk: str, topic: str, all_noun_chunks: list)` to programmatically generate quiz questions and options based on the provided parameters. Additionally, extract all noun chunks from the entire `raw_transcript_data` to be used as a pool for incorrect options in the quiz generation.

## Implement Rule-Based Quiz Generation

### Subtask:
Create a new function, `generate_rule_based_quiz(text_chunk: str, topic: str, all_noun_chunks: list)`, that programmatically generates a quiz. The question will be framed around the `topic`, and the correct option will be derived from the `topic` or a prominent phrase in the `text_chunk`. Three incorrect options will be selected from `all_noun_chunks` (other significant noun chunks from the *entire transcript* or current segment) or simple generic placeholders to ensure variety. The function will return a dictionary conforming to the `Question` Pydantic model structure.


**Reasoning**:
The subtask requires implementing the `generate_rule_based_quiz` function. This function will take a text chunk, a topic, and a list of all noun chunks to create a rule-based quiz question with one correct and three incorrect options, conforming to the `Question` Pydantic model.
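
The distractor step can be sketched on its own, without spaCy or Pydantic. The function below is a compact, hypothetical variant (`pick_distractors` is not a name from the notebook) that samples incorrect options from a noun-chunk pool and tops up from fallback strings when the pool is too small. The case-insensitive comparison against the correct answer is an added assumption.

```python
import random


# Hypothetical helper illustrating the "pool first, placeholders second" rule.
def pick_distractors(pool, correct, fallbacks, k=3, rng=None):
    """Return up to k incorrect options that differ from the correct answer."""
    if rng is None:
        rng = random.Random()
    # dict.fromkeys removes duplicates while keeping the original order
    candidates = [c for c in dict.fromkeys(pool) if c.lower() != correct.lower()]
    chosen = rng.sample(candidates, min(k, len(candidates)))
    # Top up from the fallback list if the pool did not supply enough options
    for fallback in fallbacks:
        if len(chosen) >= k:
            break
        if fallback not in chosen and fallback.lower() != correct.lower():
            chosen.append(fallback)
    return chosen


# Made-up inputs, for illustration only.
demo_pool = ["the integer type", "floating value", "The Integer Type"]
demo_fallbacks = ["Space complexity is always O(1)", "Arrays are always dynamically sized"]
demo_distractors = pick_distractors(
    demo_pool, "the integer type", demo_fallbacks, rng=random.Random(0)
)
print(demo_distractors)
```

Passing a seeded `random.Random` keeps the selection reproducible between runs, which can help when comparing quiz output across notebook executions.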



**Reasoning**:
The previous code execution completed successfully after fixing the `NameError`, indicating that the task, which involved generating rule-based quizzes and saving them to a JSON file, has been accomplished.



In [13]:
import whisper_timestamped as whisper
from moviepy.editor import VideoFileClip
import json
import os
import spacy
from pydantic import BaseModel, Field
import random

# --- CONFIGURATION (UPDATE THIS) ---
# NOTE: Ensure this path points to a file you have uploaded or mounted in Colab.
VIDEO_FILE_PATH = 'arrays.mp4'

# Placeholder for Arya's LLM output. Unused in this run: the rule-based generator below produces the quizzes instead.
LLM_RAW_OUTPUT_STR = """
{
    "question": "Placeholder question",
    "options": [
        {"text": "Option A", "is_correct": false},
        {"text": "Option B", "is_correct": false},
        {"text": "Option C", "is_correct": false},
        {"text": "Option D", "is_correct": true}
    ]
}
"""
# --- END CONFIGURATION ---


# --- LLM OUTPUT STRUCTURE (Day 4 Requirement) ---
class QuizOption(BaseModel):
    text: str = Field(description="The text of the answer option.")
    is_correct: bool = Field(description="True if this is the correct answer.")

class Question(BaseModel):
    question: str = Field(description="The multiple-choice question text.")
    options: list[QuizOption] = Field(description="A list of 4 possible answers, with exactly one marked as correct.")

# --- INITIALIZE NLP MODEL (Day 3 Requirement) ---
try:
    # This assumes you ran the separate setup command: !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
except Exception as e:
    print(f"ERROR: spaCy model not loaded. Please run !python -m spacy download en_core_web_sm. Error: {e}")
    nlp = None


# --- FUNCTION 1: DAY 2 - Transcription with Whisper ---
def transcribe_video_with_timestamps(video_path: str, output_filename="transcript_raw.json") -> dict:
    """Extracts audio, transcribes, and saves the raw Whisper output."""

    if not os.path.exists(video_path):
        print(f"ERROR: Video file not found at path: {video_path}")
        return {}

    temp_audio_path = "temp_audio_mahesh.mp3"

    # 1. Extract audio from video
    print(f"DEBUG: Extracting audio from video: {video_path}...")
    try:
        video_clip = VideoFileClip(video_path)
        video_clip.audio.write_audiofile(temp_audio_path, logger=None)
        video_clip.close()
        print(f"DEBUG: Audio successfully extracted to {temp_audio_path}")
    except Exception as e:
        print(f"ERROR: Audio extraction failed. Check FFMPEG or video format. Error: {e}")
        return {}

    # 2. Load model and transcribe
    print("Day 2 Task: Loading Whisper model ('small') and transcribing...")
    result = {}
    try:
        # Using 'small' model. Change to 'base' if 'small' is too slow.
        model = whisper.load_model("small")
        result = whisper.transcribe(model, temp_audio_path, language="en", verbose=False)
    except Exception as e:
        print(f"ERROR: Whisper transcription failed. Error: {e}")
        result = {}

    # 3. SAVE OUTPUT TO FILE
    if result:
        with open(output_filename, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=4)
        print(f"\n✅ Day 2 Output SAVED to {output_filename}")

    # Clean up the temporary audio file
    if os.path.exists(temp_audio_path):
        os.remove(temp_audio_path)
        print(f"DEBUG: Cleaned up temporary file: {temp_audio_path}")

    return result


# --- FUNCTION 2: DAY 3 - NLP Topic Segmentation ---
def detect_segment_topics(transcript_data: dict, output_filename="topic_timestamps.json") -> tuple[list, list]:
    """Finds named entities and noun chunks per segment; returns (topic_segments, all_noun_chunks)."""
    if not nlp:
        print("ERROR: Cannot run detect_segment_topics. spaCy model not loaded.")
        return [], []  # Keep the two-value shape so the caller's tuple unpacking does not fail

    print("Day 3 Task: Running spaCy NER/Noun Chunk extraction for topic detection...")
    topic_segments = []
    all_noun_chunks = [] # To collect all noun chunks for incorrect options

    for segment in transcript_data.get('segments', []):
        text = segment['text'].strip()
        if not text:
            continue

        start_time = segment['start']
        doc = nlp(text)

        significant_noun_chunks = [
            chunk.text for chunk in doc.noun_chunks
            if not all(token.is_stop for token in chunk) and len(chunk.text.split()) > 1
        ]
        named_entities = [ent.text for ent in doc.ents]

        current_segment_noun_chunks = list(set(named_entities + significant_noun_chunks))
        all_noun_chunks.extend(current_segment_noun_chunks)

        if current_segment_noun_chunks:
            topic_segments.append({
                "start_sec": int(start_time),
                "topic": " | ".join(sorted(current_segment_noun_chunks)),
                "text_chunk": text
            })

    # Deduplicate the transcript-wide pool; it is returned (not written to disk) for use as incorrect options
    all_noun_chunks = list(set(all_noun_chunks))
    # SAVE OUTPUT TO FILE
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(topic_segments, f, indent=4)
    print(f"\n✅ Day 3 Output SAVED to {output_filename}")

    return topic_segments, all_noun_chunks # Return all_noun_chunks as well


# --- FUNCTION: RULE-BASED QUIZ GENERATION ---
def generate_rule_based_quiz(text_chunk: str, topic: str, all_noun_chunks: list) -> dict:
    """Programmatically generates a quiz question and options based on rules."""

    # 1. Formulate the question
    question_text = f"Which of the following best describes: {topic}?"

    # 2. Create the correct option
    # A combined topic looks like "a | b | c"; use its first part as the correct option.
    # split() returns the whole string when the separator is absent, so no extra check is needed.
    correct_option_text = topic.split(' | ')[0]
    # Fall back to a generic statement if that phrase does not appear in the segment text
    if correct_option_text.lower() not in text_chunk.lower():
        correct_option_text = f"Understanding the concept of {correct_option_text}"

    options = [QuizOption(text=correct_option_text, is_correct=True)]

    # 3. Generate three incorrect options
    potential_incorrect_options = [nc for nc in all_noun_chunks if nc != correct_option_text and nc not in topic.split(' | ')]
    random.shuffle(potential_incorrect_options)

    incorrect_count = 0
    used_incorrect_options = set()

    for opt in potential_incorrect_options:
        if incorrect_count < 3 and opt not in used_incorrect_options:
            options.append(QuizOption(text=opt, is_correct=False))
            used_incorrect_options.add(opt)
            incorrect_count += 1

    # If not enough distinct noun chunks, use generic placeholders
    generic_placeholders = [
        "Arrays are always dynamically sized",
        "Linked lists offer faster random access",
        "Space complexity is always O(1)",
        "All data structures have the same performance characteristics"
    ]
    random.shuffle(generic_placeholders)

    for placeholder in generic_placeholders:
        if incorrect_count < 3 and placeholder not in used_incorrect_options and placeholder != correct_option_text:
            options.append(QuizOption(text=placeholder, is_correct=False))
            used_incorrect_options.add(placeholder)
            incorrect_count += 1

    # Ensure exactly 4 options by adding more generic ones if necessary
    while len(options) < 4:
        # Pick a placeholder that is not already used and does not match the correct option
        for placeholder in generic_placeholders:
            if placeholder not in used_incorrect_options and placeholder != correct_option_text:
                options.append(QuizOption(text=placeholder, is_correct=False))
                used_incorrect_options.add(placeholder)
                break
        else:
            # Every placeholder is already used: add a numbered filler option
            options.append(QuizOption(text=f"Some other irrelevant fact {len(options)}", is_correct=False))


    random.shuffle(options) # Shuffle to mix correct and incorrect options

    # 4. Structure into Question Pydantic model
    quiz_data = Question(
        question=question_text,
        options=options
    )
    return quiz_data.model_dump() # Return as dict


# --- FUNCTION 3: DAY 4 - LLM Output Mapping ---
def map_llm_output_to_quiz_json(llm_output_text: str, trigger_time_sec: int, quiz_index: int) -> dict:
    """Parses raw LLM JSON output into the final application JSON structure."""

    print(f"Day 4 Task: Mapping LLM output to final structured JSON for quiz {quiz_index + 1}...")

    try:
        # 1. Parse the LLM's JSON string output
        llm_data = json.loads(llm_output_text)

        # 2. Validate against Pydantic model
        validated_question = Question(**llm_data)

        # 3. Create the final required structure
        final_quiz_data = {
            "id": f"quiz-{trigger_time_sec}-{quiz_index}",
            "trigger_time_sec": trigger_time_sec,
            "question": validated_question.question,
            # Use model_dump to convert Pydantic objects back to dicts
            "options": [opt.model_dump() for opt in validated_question.options]
        }

        return final_quiz_data

    except json.JSONDecodeError:
        print("ERROR: LLM output is not valid JSON. Check Arya's prompt structure.")
        return {}
    except Exception as e:
        print(f"ERROR: Pydantic validation or mapping failed: {e}")
        return {}


# --------------------------------------------------------------------------------
# --- MAIN EXECUTION PIPELINE (Run this section) ---
# --------------------------------------------------------------------------------
if __name__ == "__main__":

    print(f"--- Running Full Mahesh Pipeline on: {VIDEO_FILE_PATH} ---")

    # 1. DAY 2 EXECUTION: Get raw transcription data
    raw_transcript_data = transcribe_video_with_timestamps(VIDEO_FILE_PATH)

    if raw_transcript_data:
        # 2. DAY 3 EXECUTION: Find topic changes and collect all noun chunks
        topic_list, all_noun_chunks = detect_segment_topics(raw_transcript_data)

        if topic_list:
            # Select the first 4 topic segments (or fewer if the transcript yields fewer)
            selected_quiz_segments = topic_list[:4]

            print(f"\nSelected {len(selected_quiz_segments)} segments for quiz generation.")

            all_final_quizzes = []
            for i, segment in enumerate(selected_quiz_segments):
                trigger_time = segment['start_sec']
                segment_topic = segment['topic']
                segment_text_chunk = segment['text_chunk']

                print(f"\nProcessing quiz for segment at {trigger_time} seconds (Topic: {segment_topic}).")

                # Use the new rule-based quiz generation function
                generated_quiz = generate_rule_based_quiz(
                    segment_text_chunk,
                    segment_topic,
                    all_noun_chunks
                )

                if generated_quiz:
                    # Add id and trigger_time_sec to the generated quiz
                    generated_quiz["id"] = f"quiz-{trigger_time}-{i}"
                    generated_quiz["trigger_time_sec"] = trigger_time
                    all_final_quizzes.append(generated_quiz)
                else:
                    print(f"\n❌ RULE-BASED QUIZ GENERATION FAILED for quiz {i + 1}.")

            if all_final_quizzes:
                # Save all quizzes to a single file
                output_filename_all_quizzes = "all_quizzes.json"
                with open(output_filename_all_quizzes, 'w', encoding='utf-8') as f:
                    json.dump(all_final_quizzes, f, indent=4)
                print(f"\n✅ All {len(all_final_quizzes)} quizzes SAVED to {output_filename_all_quizzes}")

                print("\n--- All Final Quiz JSON Outputs (Preview) ---")
                print(json.dumps(all_final_quizzes, indent=4))
                print("\n\n✅ FULL PIPELINE COMPLETE. Check your Colab file explorer for the 'all_quizzes.json' file.")
            else:
                print("\n❌ PIPELINE FAILED: No quizzes were successfully generated.")
        else:
            print("\n❌ PIPELINE FAILED at Day 3: No key topics/entities were detected.")
    else:
        print("\n❌ PIPELINE FAILED at Day 2: Transcription failed or video file was not found.")


--- Running Full Mahesh Pipeline on: arrays.mp4 ---
DEBUG: Extracting audio from video: arrays.mp4...
DEBUG: Audio successfully extracted to temp_audio_mahesh.mp3
Day 2 Task: Loading Whisper model ('small') and transcribing...


100%|██████████| 5981/5981 [00:48<00:00, 122.06frames/s]


✅ Day 2 Output SAVED to transcript_raw.json
DEBUG: Cleaned up temporary file: temp_audio_mahesh.mp3
Day 3 Task: Running spaCy NER/Noun Chunk extraction for topic detection...

✅ Day 3 Output SAVED to topic_timestamps.json

Selected 4 segments for quiz generation.

Processing quiz for segment at 0 seconds (Topic: a data structure | firstly | the data | the form).

Processing quiz for segment at 7 seconds (Topic: 1 | 10,4,2,99 | 2 | number 2 point array | similar type).

Processing quiz for segment at 14 seconds (Topic: the integer type).

Processing quiz for segment at 16 seconds (Topic: a separate array | floating value).

✅ All 4 quizzes SAVED to all_quizzes.json

--- All Final Quiz JSON Outputs (Preview) ---
[
    {
        "question": "Which of the following best describes: a data structure | firstly | the data | the form?",
        "options": [
            {
                "text": "2",
                "is_correct": false
            },
            {
                "text": "a dat




**Reasoning**:
The previous code execution successfully generated the quizzes and saved them to `all_quizzes.json`. The task is complete.



In [16]:
import whisper_timestamped as whisper
from moviepy.editor import VideoFileClip
import json
import os
import spacy
from pydantic import BaseModel, Field
import random

# --- CONFIGURATION (UPDATE THIS) ---
# NOTE: Ensure this path points to a file you have uploaded or mounted in Colab.
VIDEO_FILE_PATH = 'arrays.mp4'

# Placeholder for Arya's LLM output. The main pipeline below does not use it;
# quizzes are produced by rule-based generation instead.
LLM_RAW_OUTPUT_STR = """
{
    "question": "Placeholder question",
    "options": [
        {"text": "Option A", "is_correct": false},
        {"text": "Option B", "is_correct": false},
        {"text": "Option C", "is_correct": false},
        {"text": "Option D", "is_correct": true}
    ]
}
"""
# --- END CONFIGURATION ---


# --- LLM OUTPUT STRUCTURE (Day 4 Requirement) ---
class QuizOption(BaseModel):
    text: str = Field(description="The text of the answer option.")
    is_correct: bool = Field(description="True if this is the correct answer.")

class Question(BaseModel):
    question: str = Field(description="The multiple-choice question text.")
    options: list[QuizOption] = Field(description="A list of 4 possible answers, with exactly one marked as correct.")

# --- INITIALIZE NLP MODEL (Day 3 Requirement) ---
try:
    # This assumes you ran the separate setup command: !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
except Exception as e:
    print(f"ERROR: spaCy model not loaded. Please run !python -m spacy download en_core_web_sm. Error: {e}")
    nlp = None


# --- FUNCTION 1: DAY 2 - Transcription with Whisper ---
def transcribe_video_with_timestamps(video_path: str, output_filename="transcript_raw.json") -> dict:
    """Extracts audio, transcribes, and saves the raw Whisper output."""

    if not os.path.exists(video_path):
        print(f"ERROR: Video file not found at path: {video_path}")
        return {}

    temp_audio_path = "temp_audio_mahesh.mp3"

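    # 1. Extract audio from video
    print(f"DEBUG: Extracting audio from video: {video_path}...")
    video_clip = None
    try:
        video_clip = VideoFileClip(video_path)
        if video_clip.audio is None:
            # A video without an audio track cannot be transcribed
            print("ERROR: Video has no audio track to transcribe.")
            return {}
        video_clip.audio.write_audiofile(temp_audio_path, logger=None)
        print(f"DEBUG: Audio successfully extracted to {temp_audio_path}")
    except Exception as e:
        print(f"ERROR: Audio extraction failed. Check FFMPEG or video format. Error: {e}")
        return {}
    finally:
        # Release the file handle even when extraction fails
        if video_clip is not None:
            video_clip.close()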

    # 2. Load model and transcribe
    print("Day 2 Task: Loading Whisper model ('small') and transcribing...")
    result = {}
    try:
        # Using 'small' model. Change to 'base' if 'small' is too slow.
        model = whisper.load_model("small")
        result = whisper.transcribe(model, temp_audio_path, language="en", verbose=False)
    except Exception as e:
        print(f"ERROR: Whisper transcription failed. Error: {e}")
        result = {}

    # 3. SAVE OUTPUT TO FILE
    if result:
        with open(output_filename, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=4)
        print(f"\n✅ Day 2 Output SAVED to {output_filename}")

    # Clean up the temporary audio file
    if os.path.exists(temp_audio_path):
        os.remove(temp_audio_path)
        print(f"DEBUG: Cleaned up temporary file: {temp_audio_path}")

    return result


# --- FUNCTION 2: DAY 3 - NLP Topic Segmentation ---
def detect_segment_topics(transcript_data: dict, output_filename="topic_timestamps.json") -> tuple[list, list]:
    """Analyzes each segment for named entities and noun chunks.

    Returns (topic_segments, all_noun_chunks) so callers can always unpack two values.
    """
    if not nlp:
        print("ERROR: Cannot run detect_segment_topics. spaCy model not loaded.")
        return [], []

    print("Day 3 Task: Running spaCy NER/Noun Chunk extraction for topic detection...")
    topic_segments = []
    all_noun_chunks = [] # To collect all noun chunks for incorrect options

    for segment in transcript_data.get('segments', []):
        text = segment['text'].strip()
        if not text:
            continue

        start_time = segment['start']
        doc = nlp(text)

        significant_noun_chunks = [
            chunk.text for chunk in doc.noun_chunks
            if not all(token.is_stop for token in chunk) and len(chunk.text.split()) > 1
        ]
        named_entities = [ent.text for ent in doc.ents]

        current_segment_noun_chunks = list(set(named_entities + significant_noun_chunks))
        all_noun_chunks.extend(current_segment_noun_chunks)

        if current_segment_noun_chunks:
            topic_segments.append({
                "start_sec": int(start_time),
                "topic": " | ".join(sorted(current_segment_noun_chunks)),
                "text_chunk": text
            })

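    # De-duplicate the noun chunks; they are returned (not written to disk) for use as incorrect options.
    # Sorting keeps the order stable between runs.
    all_noun_chunks = sorted(set(all_noun_chunks))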
    # SAVE OUTPUT TO FILE
    with open(output_filename, 'w', encoding='utf-8') as f:
        json.dump(topic_segments, f, indent=4)
    print(f"\n✅ Day 3 Output SAVED to {output_filename}")

    return topic_segments, all_noun_chunks # Return all_noun_chunks as well


# --- FUNCTION: RULE-BASED QUIZ GENERATION ---
def generate_rule_based_quiz(text_chunk: str, topic: str, all_noun_chunks: list) -> dict:
    """Programmatically generates a quiz question and options based on rules."""

    # 1. Formulate the question
    question_text = f"Which of the following best describes: {topic}?"

    # 2. Create the correct option
    # Use the first topic phrase as the correct answer. split() returns the whole
    # string when the topic contains no ' | ' separator.
    correct_option_text = topic.split(' | ')[0]
    # Fall back to a generic statement if that phrase does not appear verbatim in the segment text
    if correct_option_text.lower() not in text_chunk.lower():
        correct_option_text = f"Understanding the concept of {correct_option_text}"

    options = [QuizOption(text=correct_option_text, is_correct=True)]

    # 3. Generate three incorrect options
    potential_incorrect_options = [nc for nc in all_noun_chunks if nc != correct_option_text and nc not in topic.split(' | ')]
    random.shuffle(potential_incorrect_options)

    incorrect_count = 0
    used_incorrect_options = set()

    for opt in potential_incorrect_options:
        if incorrect_count < 3 and opt not in used_incorrect_options:
            options.append(QuizOption(text=opt, is_correct=False))
            used_incorrect_options.add(opt)
            incorrect_count += 1

    # If not enough distinct noun chunks, use generic placeholders
    generic_placeholders = [
        "Arrays are always dynamically sized",
        "Linked lists offer faster random access",
        "Space complexity is always O(1)",
        "All data structures have the same performance characteristics"
    ]
    random.shuffle(generic_placeholders)

    for placeholder in generic_placeholders:
        if incorrect_count < 3 and placeholder not in used_incorrect_options and placeholder != correct_option_text:
            options.append(QuizOption(text=placeholder, is_correct=False))
            used_incorrect_options.add(placeholder)
            incorrect_count += 1

    # Last resort: the loop above has already tried every generic placeholder,
    # so pad with numbered fillers until there are exactly 4 options
    while len(options) < 4:
        options.append(QuizOption(text=f"Some other irrelevant fact {len(options)}", is_correct=False))


    random.shuffle(options) # Shuffle to mix correct and incorrect options

    # 4. Structure into Question Pydantic model
    quiz_data = Question(
        question=question_text,
        options=options
    )
    return quiz_data.model_dump() # Return as dict


# --- FUNCTION 3: DAY 4 - LLM Output Mapping ---
def map_llm_output_to_quiz_json(llm_output_text: str, trigger_time_sec: int, quiz_index: int) -> dict:
    """Parses raw LLM JSON output into the final application JSON structure."""

    print(f"Day 4 Task: Mapping LLM output to final structured JSON for quiz {quiz_index + 1}...")

    try:
        # 1. Parse the LLM's JSON string output
        llm_data = json.loads(llm_output_text)

        # 2. Validate against Pydantic model
        validated_question = Question(**llm_data)

        # 3. Create the final required structure
        final_quiz_data = {
            "id": f"quiz-{trigger_time_sec}-{quiz_index}",
            "trigger_time_sec": trigger_time_sec,
            "question": validated_question.question,
            # Use model_dump to convert Pydantic objects back to dicts
            "options": [opt.model_dump() for opt in validated_question.options]
        }

        return final_quiz_data

    except json.JSONDecodeError:
        print("ERROR: LLM output is not valid JSON. Check Arya's prompt structure.")
        return {}
    except Exception as e:
        print(f"ERROR: Pydantic validation or mapping failed: {e}")
        return {}


# --------------------------------------------------------------------------------
# --- MAIN EXECUTION PIPELINE (Run this section) ---
# --------------------------------------------------------------------------------
if __name__ == "__main__":

    print(f"--- Running Full Mahesh Pipeline on: {VIDEO_FILE_PATH} ---")

    # 1. DAY 2 EXECUTION: Get raw transcription data
    raw_transcript_data = transcribe_video_with_timestamps(VIDEO_FILE_PATH)

    if raw_transcript_data:
        # 2. DAY 3 EXECUTION: Find topic changes and collect all noun chunks
        topic_list, all_noun_chunks = detect_segment_topics(raw_transcript_data)

        if topic_list:
            # Take the first 4 topic segments (slicing returns fewer if fewer exist)
            selected_quiz_segments = topic_list[:4]

            print(f"\nSelected {len(selected_quiz_segments)} segments for quiz generation.")

            all_final_quizzes = []
            for i, segment in enumerate(selected_quiz_segments):
                trigger_time = segment['start_sec']
                segment_topic = segment['topic']
                segment_text_chunk = segment['text_chunk']

                print(f"\nProcessing quiz for segment at {trigger_time} seconds (Topic: {segment_topic}).")

                # Use the new rule-based quiz generation function
                generated_quiz = generate_rule_based_quiz(
                    segment_text_chunk,
                    segment_topic,
                    all_noun_chunks
                )

                if generated_quiz:
                    # Add id and trigger_time_sec to the generated quiz
                    generated_quiz["id"] = f"quiz-{trigger_time}-{i}"
                    generated_quiz["trigger_time_sec"] = trigger_time
                    all_final_quizzes.append(generated_quiz)
                else:
                    print(f"\n❌ RULE-BASED QUIZ GENERATION FAILED for quiz {i + 1}.")

            if all_final_quizzes:
                # Save all quizzes to a single file
                output_filename_all_quizzes = "all_quizzes.json"
                with open(output_filename_all_quizzes, 'w', encoding='utf-8') as f:
                    json.dump(all_final_quizzes, f, indent=4)
                print(f"\n✅ All {len(all_final_quizzes)} quizzes SAVED to {output_filename_all_quizzes}")

                print("\n--- All Final Quiz JSON Outputs (Preview) ---")
                print(json.dumps(all_final_quizzes, indent=4))
                print("\n\n✅ FULL PIPELINE COMPLETE. Check your Colab file explorer for the 'all_quizzes.json' file.")
            else:
                print("\n❌ PIPELINE FAILED: No quizzes were successfully generated.")
        else:
            print("\n❌ PIPELINE FAILED at Day 3: No key topics/entities were detected.")
    else:
        print("\n❌ PIPELINE FAILED at Day 2: Transcription failed or video file was not found.")


--- Running Full Mahesh Pipeline on: arrays.mp4 ---
DEBUG: Extracting audio from video: arrays.mp4...
DEBUG: Audio successfully extracted to temp_audio_mahesh.mp3
Day 2 Task: Loading Whisper model ('small') and transcribing...


100%|██████████| 5981/5981 [00:51<00:00, 115.79frames/s]



✅ Day 2 Output SAVED to transcript_raw.json
DEBUG: Cleaned up temporary file: temp_audio_mahesh.mp3
Day 3 Task: Running spaCy NER/Noun Chunk extraction for topic detection...

✅ Day 3 Output SAVED to topic_timestamps.json

Selected 4 segments for quiz generation.

Processing quiz for segment at 0 seconds (Topic: a data structure | firstly | the data | the form).

Processing quiz for segment at 7 seconds (Topic: 1 | 10,4,2,99 | 2 | number 2 point array | similar type).

Processing quiz for segment at 14 seconds (Topic: the integer type).

Processing quiz for segment at 16 seconds (Topic: a separate array | floating value).

✅ All 4 quizzes SAVED to all_quizzes.json

--- All Final Quiz JSON Outputs (Preview) ---
[
    {
        "question": "Which of the following best describes: a data structure | firstly | the data | the form?",
        "options": [
            {
                "text": "fourth element",
                "is_correct": false
            },
            {
                "