# Project Overview Notebook

**Note:** This notebook is **not intended to run the full training loop** or to rebuild all models from scratch.  
Its purpose is to provide the reader with a **clear understanding of the repository workflow**, the **datasets**, and our **approach to solving the problem statement**.  

It serves as a guided explanation of what has been done so far, including preprocessing steps, model design, and evaluation, without executing the heavy computations.


## Dataset

For this project, we use the **Ego4D dataset**, a large-scale dataset consisting of **long first-person videos** along with **narrations describing the events happening in the videos**.  

- The dataset is **terabyte-scale**, making it **infeasible to use the full video data** given computational and storage constraints.  
- Our approach does **not require video data during the initial stages**; videos are only used at the inference stage.  

### What we are using:
- **File:** `em_train_narrations.pkl` (located in the `data/` directory of the repository)  
- This file contains all narrations from the dataset.  
- We use these narrations to:
  1. **Create a question-answer dataset** for our task.  
  2. Include the **temporal grounding annotations**, which serve as ground truth for training.  

By focusing on the narrations first, we can efficiently **train our model** without handling the massive video files until necessary.


## Sample Narrations from the Dataset

In this snippet, we load the `em_train_narrations.pkl` file and display the narrations for **one video clip**.  

- The file contains narrations for **10,777 clips**, each with multiple narrations describing the events that occur in it.  
- For demonstration, we pick one clip at random and print **all of its narrations** to show the type of information available.  
- Each narration entry contains:
  - `narration_text`: a textual description of an event.
  - `timestamps`: the start and end times (in seconds) of the event within the video.  

**Note:** In the narrations, **'C' represents the person performing the task or appearing in the video**, while **'O' represents another person**.  
These narrations form the basis for creating **question-answer pairs** in our dataset.
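
As a small offline illustration of the entry layout described above, the sketch below computes each event's duration from its `timestamps` pair. The sample clip is hand-written with the same keys as a real entry (timestamps rounded, clip ID a placeholder), and `summarize_clip` is a hypothetical helper, not part of the repository.

```python
def summarize_clip(clip):
    """Return (text, start, end, duration) tuples for one clip's narrations."""
    rows = []
    for narration in clip["narrations"]:
        start, end = narration["timestamps"]
        rows.append((narration["narration_text"], start, end, end - start))
    return rows


# Hand-written sample with the same keys as an entry in em_train_narrations.pkl
sample_clip = {
    "clip_uid": "example-clip",  # placeholder ID, not a real clip
    "narrations": [
        {"narration_text": "C opens a door to the garage with his right hand.",
         "timestamps": [0.25, 9.45]},
        {"narration_text": "C closes the door with his right hand.",
         "timestamps": [5.30, 14.51]},
    ],
}

for text, start, end, duration in summarize_clip(sample_clip):
    print(f"{start:6.2f}s -> {end:6.2f}s ({duration:5.2f}s)  {text}")
```

Note that the time windows of consecutive narrations can overlap, as they do in this sample.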


In [2]:
import pickle
import random

# Load the narrations file
with open('data/em_train_narrations.pkl', 'rb') as f:
    clip_narrations = pickle.load(f)

print(f"Total number of clips: {len(clip_narrations)}\n")

# Pick a random clip 
clip = random.choice(clip_narrations) 

print(f"Clip ID: {clip['clip_uid']}")
print(f"Narration Pass: {clip['narration_pass']}")
print(f"Number of narrations in this clip: {len(clip['narrations'])}\n")

# Print the narrations and their timestamps
for i, narration in enumerate(clip['narrations']):
    print(f"Narration {i+1}: {narration['narration_text']}")
    print(f"Timestamps: {narration['timestamps']}\n")


Total number of clips: 10777

Clip ID: 30723771-7198-49cd-a5e3-ae413c904490
Narration Pass: narration_pass_1
Number of narrations in this clip: 13

Narration 1: C opens a door to the garage with his right hand. .
Timestamps: [0.2458468333333332, 9.45211]

Narration 2: C closes the door with his right hand. .
Timestamps: [5.30399, 14.510253166666663]

Narration 3: C walks towards a bicycle parked outside. .
Timestamps: [9.824106833333335, 19.08162]

Narration 4: C mounts the bicycle. .
Timestamps: [14.88225, 24.13976316666667]

Narration 5: C rides the bicycle along the road. .
Timestamps: [22.042896833333334, 32.159183166666665]

Narration 6: C turns the bicycle right. .
Timestamps: [28.91194683333333, 36.35805]

Narration 7: C rides the bicycle along the road. .
Timestamps: [33.97009, 41.416193166666666]

Narration 8: C rides the bicycle along the road. .
Timestamps: [55.77895683333333, 65.89524316666666]

Narration 9: C rides the bicycle along the road. .
Timestamps: [99.105806833333

## Question Generation from Narrations

In this snippet, we demonstrate how a **single narration** is used to generate a **question-answer pair** using the **LLaMA-2 13B model**.  

**Key steps in the snippet:**
1. The model and tokenizer are loaded on **GPU** with **FP16** precision for faster inference.  
2. A **prompt** is prepared asking the model to generate a QA pair in **JSON format**, with constraints on the length of the question (≤10 words) and answer (≤5 words).  
3. A single narration from one video clip is fed into the model, and the output is parsed into a JSON QA pair.
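
The prompt-building and parsing steps above can be sketched without loading the model. `build_qa_prompt` and `parse_qa_pair` are hypothetical helper names. The parser assumes the model's reply contains one flat JSON object with keys `Q` and `A`, possibly surrounded by extra text, and its word-limit check mirrors the constraints in step 2.

```python
import json
import re

SYSTEM_PROMPT = (
    "You are an AI assistant. Generate one QA pair in JSON format: "
    '{"Q": <question>, "A": <answer>} based on the following narration. '
    "The question should be in past tense, within 10 words. "
    "The answer should be concise, within 5 words. "
    "'C' is the person performing the actions."
)


def build_qa_prompt(narration_text):
    """Wrap one narration in the Llama-2 chat format (single [INST] block)."""
    return f"<s>[INST] <<SYS>>\n{SYSTEM_PROMPT}\n<</SYS>>\n\n{narration_text} [/INST]"


def parse_qa_pair(generated_text, max_q_words=10, max_a_words=5):
    """Extract the first flat {...} object from the reply; return None if unusable."""
    match = re.search(r"\{[^{}]*\}", generated_text)
    if match is None:
        return None
    try:
        pair = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(pair.get("Q"), str) or not isinstance(pair.get("A"), str):
        return None
    if len(pair["Q"].split()) > max_q_words or len(pair["A"].split()) > max_a_words:
        return None
    return {"Q": pair["Q"].strip(), "A": pair["A"].strip()}


# A made-up reply with a short preamble before the JSON object
reply = 'Sure! {"Q": "What did C pour into the bowl?", "A": "hot water"}'
print(parse_qa_pair(reply))
```

Returning `None` for unusable replies lets a batch job skip or retry a narration instead of stopping on the first malformed output.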

**Note:**  
- In this example, we only generate a QA pair for one narration.  
- In practice, the same process is applied to **all narrations in the `em_train_narrations.pkl` file**.  
- In the lab, this is done efficiently on **multiple GPUs** using the `generate_open_qa.py` and `merge.py` scripts in the `utils/` directory of the repository, which handle batching, parallel processing, and merging of the generated QA pairs for the full dataset.
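
To illustrate the idea of splitting the work across GPUs and merging the results, here is a minimal sketch. It does not reproduce the logic of `generate_open_qa.py` or `merge.py`, whose contents are not shown in this notebook. `shard_items`, `merge_shards`, and the round-robin split are assumptions chosen for illustration.

```python
def shard_items(items, num_shards, shard_index):
    """Return the slice of `items` handled by one worker (round-robin split)."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return items[shard_index::num_shards]


def merge_shards(shard_outputs):
    """Concatenate per-worker result lists, dropping failed (None) entries."""
    merged = []
    for results in shard_outputs:
        merged.extend(r for r in results if r is not None)
    return merged


# Toy run: 7 narrations split across 3 hypothetical workers
narrations = [f"narration {i}" for i in range(7)]
shards = [shard_items(narrations, 3, k) for k in range(3)]
# Stand-in for the model output each worker would produce
fake_outputs = [[{"source": n} for n in shard] for shard in shards]
merged = merge_shards(fake_outputs)
print(len(merged), "QA entries after merging")
```

In such a setup each worker process would receive its own `shard_index` and write its results to a separate file, so the merge step only needs to read and concatenate them.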


In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from ast import literal_eval
import torch

# ---------------------------
# Load model and tokenizer
# ---------------------------
model_id = "meta-llama/Llama-2-13b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16  # FP16, as described above; halves memory versus FP32
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto"
)

# Ensure padding tokens are set
pipe.tokenizer.pad_token = pipe.tokenizer.eos_token
pipe.model.config.pad_token_id = pipe.model.config.eos_token_id

# ---------------------------
# Example narration
# ---------------------------
narration_text = "C pours hot water into the bowl and stirs it with a spoon."

# Prompt for question generation
# (Llama-2 chat format: one [INST] block, system prompt wrapped in <<SYS>> tags)
prompt = f"""
<s>[INST] <<SYS>>
You are an AI assistant. Generate one QA pair in JSON format: {{"Q": <question>, "A": <answer>}}
based on the following narration. The question should be in past tense, within 10 words.
The answer should be concise, within 5 words. 'C' is the person performing the actions.
<</SYS>>

{narration_text} [/INST]
"""

# ---------------------------
# Generate QA
# ---------------------------
output = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.5, return_full_text=False)
qa_pair = literal_eval(output[0]["generated_text"])

print("Generated QA pair:")
print(qa_pair)


Loading checkpoint shards: 100%|██████████| 3/3 [00:18<00:00,  6.23s/it]


In [None]:
# ---------------------------
# Reuse `model`, `tokenizer`, and the imports from the previous cell
# ---------------------------

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto"
)

pipe.tokenizer.pad_token = pipe.tokenizer.eos_token
pipe.model.config.pad_token_id = pipe.model.config.eos_token_id

# ---------------------------
# Example QA pair
# ---------------------------
question = "What did I pour in the bowl?"
answer = "boiling water"

# ---------------------------
# Prompt for generating wrong answers
# ---------------------------
prompt = f"""
<s>[INST] <<SYS>>
I'll provide a question and its correct answer. Generate three plausible, but incorrect, answers that closely resemble the correct one. Make it challenging to identify the right answer.
<</SYS>>

No preamble, get right to the three wrong answers and present them in a list format.
Question: {question} Correct Answer: {answer}. Wrong Answers: [/INST]
"""

# ---------------------------
# Generate wrong answers
# ---------------------------
output = pipe(
    prompt,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.5,
    top_k=10,
    return_full_text=False
)

wrong_answers = literal_eval(output[0]["generated_text"])
print("Generated wrong answers:")
print(wrong_answers)

# ---------------------------
# Combine into final QA entry
# ---------------------------
qa_entry = {
    "question": question,
    "answer": answer,
    "wrong_answers": wrong_answers
}

print("\nFinal QA entry with wrong answers:")
print(qa_entry)
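
Because the prompt asks for "a list format", the reply may come back as a numbered or bulleted list rather than a Python literal, in which case `literal_eval` would raise an exception. The sketch below shows one tolerant way to handle both forms. `parse_wrong_answers` is a hypothetical helper, and the accepted formats are assumptions about what the model may return.

```python
import re
from ast import literal_eval


def parse_wrong_answers(generated_text, expected=3):
    """Parse a Python-style list or a numbered/bulleted list into a list of strings."""
    text = generated_text.strip()
    try:
        parsed = literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            answers = [str(a).strip() for a in parsed]
        else:
            answers = []
    except (ValueError, SyntaxError, TypeError):
        answers = []
        for line in text.splitlines():
            # Strip leading markers such as "1.", "2)", "-", "*"
            cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
            if cleaned:
                answers.append(cleaned)
    # Return None when fewer than `expected` answers were recovered
    return answers[:expected] if len(answers) >= expected else None


# Made-up replies in the two formats
print(parse_wrong_answers('["hot milk", "cold water", "warm tea"]'))
print(parse_wrong_answers("1. hot milk\n2. cold water\n3. warm tea"))
```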


## OpenQA and ClosedQA Annotation Generation

In our project, we handle **both Open Question-Answering (OpenQA) and Closed Question-Answering (ClosedQA)**:

- **OpenQA:** Generates a question-answer pair directly from the narrations, using the LLaMA model to produce a meaningful question and its corresponding answer. This forms the core of our question-answer dataset.  
- **ClosedQA:** Takes an existing question and its correct answer, and generates **three plausible but incorrect answers**. These wrong answers are designed to be challenging and closely resemble the correct one, which is useful for training models that require multiple-choice style questions.  
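
To make the multiple-choice use of ClosedQA concrete, the sketch below shuffles a correct answer together with its three distractors and records the index of the correct option. `make_multiple_choice` and the output keys are hypothetical, and the distractors are made up; the repository may store options differently.

```python
import random


def make_multiple_choice(question, answer, wrong_answers, seed=None):
    """Shuffle the correct answer in with the distractors and record its position."""
    rng = random.Random(seed)  # local RNG so a fixed seed gives a reproducible order
    options = [answer] + list(wrong_answers)
    rng.shuffle(options)
    return {
        "question": question,
        "options": options,
        "correct_index": options.index(answer),
    }


item = make_multiple_choice(
    "What did I pour in the bowl?",
    "boiling water",
    ["boiling milk", "cold water", "hot oil"],  # made-up distractors for illustration
    seed=0,
)
print(item["options"], "->", item["correct_index"])
```

Shuffling matters because a fixed position for the correct answer would let a model learn the position instead of the content.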

The process is applied to all narrations across all video clips in the dataset. After running the pipeline for all files, the final output is stored as **`annotations.EgoTimeQA.json`** in the `data/` directory of the repository.  

This JSON file contains:
- The question generated from narrations.  
- The correct answer.  
- Three incorrect but plausible wrong answers.  
- Metadata such as video ID and timestamps for temporal grounding.  
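
As an illustration of the contents listed above, the sketch below builds one record and checks that it carries all four kinds of information. The key names (`video_id`, `question`, `answer`, `wrong_answers`, `start`, `end`) are assumptions made for this example. The actual schema of `annotations.EgoTimeQA.json` is not shown in this notebook and may differ.

```python
import json

# Hypothetical key names, chosen only for this example
REQUIRED_KEYS = {"video_id", "question", "answer", "wrong_answers", "start", "end"}


def validate_record(record):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = []
    missing = REQUIRED_KEYS - set(record)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
        return problems
    if len(record["wrong_answers"]) != 3:
        problems.append("expected exactly three wrong answers")
    if record["answer"] in record["wrong_answers"]:
        problems.append("correct answer appears among the wrong answers")
    if not record["start"] < record["end"]:
        problems.append("start must be earlier than end")
    return problems


example = {
    "video_id": "example-clip",  # placeholder, not a real ID
    "question": "What did I pour in the bowl?",
    "answer": "boiling water",
    "wrong_answers": ["boiling milk", "cold water", "hot oil"],  # made up
    "start": 12.0,  # seconds, made up
    "end": 20.5,
}
print(json.dumps(example, indent=2))
print(validate_record(example))
```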

This combined OpenQA + ClosedQA dataset serves as the foundation for training and evaluation in our task.


In [1]:
import torch
print(torch.cuda.is_available())

True
