# OpenEQA QuickStart

Welcome! This notebook is intended to provide a quick-start guide to OpenEQA and its workflow. Before you can start using this notebook, please perform a few basic installation steps.

## Packages and Dependencies
- Install [openeqa](https://github.com/facebookresearch/open-eqa/blob/main/setup.py) and [required packages](https://github.com/facebookresearch/open-eqa/blob/main/requirements.txt).
- You don't need the full dataset to run this notebook. Still, we highly encourage you to follow the instructions to [download the data](https://github.com/facebookresearch/open-eqa/blob/main/data/README.md) early: getting access to the dataset takes several steps, so this repo ships with a small "demo dataset" to give you a flavor of the benchmark.
- Get API access keys ready for different models (e.g. GPT, Claude) or host the model on your end (e.g. LLaMA, Mixtral).

## Data Visualization

Let's start by visualizing a demo dataset generated from a [public HM3D scene](https://aihabitat.org/datasets/hm3d/).

In [1]:
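import os

# Replace the placeholder with your own key, or export OPENAI_API_KEY before launching the notebook.
os.environ['OPENAI_API_KEY'] = '<OPENAI_API_KEY>'
# '1' selects the Azure OpenAI baselines in the cells below; remove this line to use the standard OpenAI API.
os.environ['OPENAI_AZURE_DEPLOYMENT'] = '1'

import matplotlib.pyplot as plt
from IPython.display import Video  # Image is imported from PIL in the next cell

# embed the demo episode video in the notebook
Video("data/videos/hm3d-v0/000-hm3d-BFRyYbPCCPE-0.mp4", embed=True, width=480)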

In [2]:
from PIL import Image
from tqdm import tqdm

from openeqa.utils.demo_utils import (
    decode_frames_from_video_path,
    get_equally_spaced_frames,
)

frames_per_traj = 15

# get the frames from the video
frames = decode_frames_from_video_path("data/videos/hm3d-v0/000-hm3d-BFRyYbPCCPE-0.mp4")

# extract equally spaced frames
frames = get_equally_spaced_frames(frames, frames_per_traj)

# display a few frames
_, axs = plt.subplots(1, 4, figsize=(12, 12))
for img, ax in zip(frames[::4], axs):
    ax.axis("off")
    ax.imshow(img)
plt.tight_layout()
plt.show()
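
To make the sampling step above concrete, here is a minimal sketch of one way to pick evenly spaced frame indices. The helper name `equally_spaced_indices` is hypothetical, and this is not necessarily how `get_equally_spaced_frames` is implemented in openeqa; it only illustrates the idea of spreading a fixed number of samples across a trajectory while keeping the first and last frames.

```python
def equally_spaced_indices(num_frames, num_samples):
    """Return up to `num_samples` indices spread evenly over range(num_frames).

    Hypothetical helper for illustration only; not part of openeqa.
    """
    if num_frames <= 0 or num_samples <= 0:
        return []
    # never ask for more samples than there are frames
    num_samples = min(num_samples, num_frames)
    if num_samples == 1:
        return [0]
    # spacing between consecutive samples, so that the last index is num_frames - 1
    step = (num_frames - 1) / (num_samples - 1)
    # round half up to the nearest integer index
    return [int(i * step + 0.5) for i in range(num_samples)]


# example: choose 5 of 100 frames
indices = equally_spaced_indices(100, 5)
```

Given a list of decoded frames, `[frames[i] for i in equally_spaced_indices(len(frames), frames_per_traj)]` would then select the subset to display.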

## Load the Questions

In [3]:
import json

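# load all questions, closing the file once it has been read
with open("data/open-eqa-v0.json") as f:
    eqa_data = json.load(f)

# keep only the questions whose episode comes from the HM3D split
hm3d_eqa_data = [x for x in eqa_data if 'hm3d-v0' in x['episode_history']]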
print(f'#HM3D EQA : {len(hm3d_eqa_data)} instances')
print('\nTop-5 Samples:')
print(json.dumps(hm3d_eqa_data[:5], indent=2))

#HM3D EQA : 557 instances

Top-5 Samples:
[
  {
    "question": "What is the white object on the wall above the TV?",
    "answer": "Air conditioning unit",
    "category": "object recognition",
    "question_id": "f2e82760-5c3c-41b1-88b6-85921b9e7b32",
    "episode_history": "hm3d-v0/000-hm3d-BFRyYbPCCPE"
  },
  {
    "question": "What material is the ceiling in the living room?",
    "answer": "Wood panel",
    "category": "attribute recognition",
    "question_id": "7447d782-d1a7-4c87-86dc-b5eafc5a0f76",
    "episode_history": "hm3d-v0/000-hm3d-BFRyYbPCCPE"
  },
  {
    "question": "What color is the staircase railing?",
    "answer": "Brown",
    "category": "attribute recognition",
    "question_id": "e2ccf6f4-22a9-47d1-ab8d-a05a13435b82",
    "episode_history": "hm3d-v0/000-hm3d-BFRyYbPCCPE"
  },
  {
    "question": "What is in between the two picture frames on the blue wall in the living room?",
    "answer": "The TV",
    "category": "spatial understanding",
    "question_id": 

## Ask the Model

Let's start with a text-only (or blind) LLM. We'll use a GPT-4 model (`gpt-4o` in the cell below) for illustration, but we encourage you to try out your own model!

First, set the `OPENAI_API_KEY` environment variable to your own OpenAI API key.
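
If you would rather not paste the key into a notebook cell, one option is to read it from the environment and prompt only when it is missing. The sketch below uses only the Python standard library; the helper name `ensure_api_key` is hypothetical, and treating a value that starts with `<` as unset is an assumption based on the `'<OPENAI_API_KEY>'` placeholder used earlier in this notebook.

```python
import os
from getpass import getpass


def ensure_api_key(name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in the environment, prompting once if it is unset.

    Hypothetical helper for illustration only; not part of openeqa.
    """
    key = os.environ.get(name, "").strip()
    # treat an empty value or a '<...>' placeholder as "not set"
    if not key or key.startswith("<"):
        key = prompt_fn(f"Enter {name}: ").strip()
        os.environ[name] = key
    return key
```

Calling `ensure_api_key()` before the cell below keeps the `assert "OPENAI_API_KEY" in os.environ` check satisfied without storing the key in the notebook file.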

In [4]:
assert "OPENAI_API_KEY" in os.environ
if os.environ.get('OPENAI_AZURE_DEPLOYMENT') == '1':
    from openeqa.baselines.gpt4_azure import ask_question as ask_blind_gpt4
else:
    from openeqa.baselines.gpt4 import ask_question as ask_blind_gpt4

verbose = True

for item in tqdm(hm3d_eqa_data[:5]):
    q = item["question"]
    a = item["answer"]
    blind_gpt4_a = ask_blind_gpt4(
        question=q,
        openai_model="gpt-4o",
    )
    item["blind_gpt4_answer"] = blind_gpt4_a
    if verbose:
        # print the question and the model's answer
        print("Q: {}\nA: {}\n".format(q, blind_gpt4_a))

Q: What is the white object on the wall above the TV?
A: A clock

Q: What material is the ceiling in the living room?
A: Drywall

Q: What color is the staircase railing?
A: brown

Q: What is in between the two picture frames on the blue wall in the living room?
A: a clock

Q: Is there room on the dining table to eat?
A: Yes






Next, let's try a multi-modal LLM. We'll use a vision-capable GPT-4 model (`gpt-4o` in the cell below), but other models are implemented in the open-eqa codebase.

In [5]:
assert "OPENAI_API_KEY" in os.environ
import glob

if os.environ.get('OPENAI_AZURE_DEPLOYMENT') == '1':
    from openeqa.baselines.gpt4o_azure import ask_question as ask_gpt4o
else:
    from openeqa.baselines.gpt4v import ask_question as ask_gpt4o

verbose = True

for item in tqdm(hm3d_eqa_data[:5]):
    q = item["question"]
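    image_paths = sorted(glob.glob(f"data/frames/{item['episode_history']}/*.png"))
    # take the first two images of every 16-image stride; assuming the sorted
    # listing alternates depth and RGB images, this keeps every 8th frame
    filt_image_paths = []
    for depth_img, rgb_img in zip(image_paths[::16], image_paths[1::16]):
        filt_image_paths.append(depth_img)
        filt_image_paths.append(rgb_img)
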
    gpt4o_a = ask_gpt4o(
        question=q,
        image_paths=filt_image_paths,
        openai_key=os.environ["OPENAI_API_KEY"],
        openai_model="gpt-4o",
    )
    item["gpt4o_answer"] = gpt4o_a
    if verbose:
        # print the question and the model's answer
        print("Q: {}\nA: {}\n".format(q, gpt4o_a))

Q: What is the white object on the wall above the TV?
A: The white object on the wall above the TV is an air conditioning unit.

Q: What material is the ceiling in the living room?
A: The ceiling in the living room appears to be made of wood.

Q: What color is the staircase railing?
A: The staircase railing is dark brown.

Q: What is in between the two picture frames on the blue wall in the living room?
A: There is an air conditioning unit in between the two picture frames on the blue wall in the living room.

Q: Is there room on the dining table to eat?
A: Yes, there appears to be room on the dining table to eat. The table is visible in the images and seems to have enough space for dining.
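
The frame selection in the cell above strides over the sorted file listing and passes both images of each selected pair to the model. As a variation, the sketch below keeps only the RGB image of every eighth frame. The helper name `subsample_rgb_frames` and the `-rgb.png` / `-depth.png` filename suffixes are assumptions made for illustration; check the files under `data/frames/` before relying on them.

```python
def subsample_rgb_frames(image_paths, every_n_frames=8):
    """Keep the RGB image of every `every_n_frames`-th frame.

    Hypothetical helper; assumes RGB files end in '-rgb.png'.
    """
    # drop everything that is not an RGB image, then sort by filename
    rgb_paths = sorted(p for p in image_paths if p.endswith("-rgb.png"))
    # stride over the remaining frames
    return rgb_paths[::every_n_frames]


# hypothetical listing of 16 frames, each with a depth and an RGB image
demo_paths = [
    f"{i:05d}-{kind}.png" for i in range(16) for kind in ("depth", "rgb")
]
selected = subsample_rgb_frames(demo_paths)
```

Whether dropping the depth images changes answer quality is something to measure rather than assume; the sketch only shows how the selection could be expressed.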






## Evaluate the Answers w/ LLM-Match

Finally, we'll evaluate the open-vocabulary answers produced by LLMs or multi-modal LLMs by comparing them against the ground-truth answer(s) using LLM-Match, which rates each answer on a scale of 1 to 5. See the paper for more details.
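
To report a single number over many questions, the per-question ratings can be averaged after rescaling. The sketch below assumes a linear mapping in which a rating of 1 contributes 0% and a rating of 5 contributes 100%; this is our reading of the aggregate correctness metric, so please check it against Eq. 1 in the paper. The helper name `llm_match_percentage` is hypothetical.

```python
def llm_match_percentage(scores):
    """Average 1-5 ratings after rescaling each one linearly to 0-100.

    Hypothetical helper; the linear rescaling is an assumption.
    """
    if not scores:
        raise ValueError("need at least one score")
    # guard against ratings outside the expected 1-5 range
    clipped = [min(max(s, 1), 5) for s in scores]
    return 100.0 * sum((s - 1) / 4 for s in clipped) / len(clipped)


# hypothetical ratings for two questions
example = llm_match_percentage([1, 5])
```

Once the scoring cell below has run, the same helper could be applied to the list of `item["gpt4o_score"]` values to summarize a model's performance.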

In [6]:
assert "OPENAI_API_KEY" in os.environ
from openeqa.evaluation.llm_match import get_llm_match_score

# evaluate GPT-4 answers
for item in tqdm(hm3d_eqa_data[:5]):
    c = get_llm_match_score(
        question=item["question"],
        answer=item["answer"],
        prediction=item["blind_gpt4_answer"],
    )  # correctness score (see Eq. 1 and Table 2 in paper)
    item["blind_gpt4_score"] = c

# evaluate GPT-4o answers
for item in tqdm(hm3d_eqa_data[:5]):
    c = get_llm_match_score(
        question=item["question"],
        answer=item["answer"],
        prediction=item["gpt4o_answer"],
    )  # correctness score (see Eq. 1 and Table 2 in paper)
    item["gpt4o_score"] = c



In [7]:
# print the final results
print(json.dumps(hm3d_eqa_data[:5], indent=2))

[
  {
    "question": "What is the white object on the wall above the TV?",
    "answer": "Air conditioning unit",
    "category": "object recognition",
    "question_id": "f2e82760-5c3c-41b1-88b6-85921b9e7b32",
    "episode_history": "hm3d-v0/000-hm3d-BFRyYbPCCPE",
    "blind_gpt4_answer": "A clock",
    "gpt4o_answer": "The white object on the wall above the TV is an air conditioning unit.",
    "blind_gpt4_score": 1,
    "gpt4o_score": 5
  },
  {
    "question": "What material is the ceiling in the living room?",
    "answer": "Wood panel",
    "category": "attribute recognition",
    "question_id": "7447d782-d1a7-4c87-86dc-b5eafc5a0f76",
    "episode_history": "hm3d-v0/000-hm3d-BFRyYbPCCPE",
    "blind_gpt4_answer": "Drywall",
    "gpt4o_answer": "The ceiling in the living room appears to be made of wood.",
    "blind_gpt4_score": 1,
    "gpt4o_score": 5
  },
  {
    "question": "What color is the staircase railing?",
    "answer": "Brown",
    "category": "attribute recognition",


