# Convert JSONL VQA Dataset to Parquet Format

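This notebook converts the RSVLM-QA JSONL dataset into two Parquet files:
- **Captions file** (`RSVLM-QA-captions.parquet`): one row per record, with the columns `id` (record identifier), `image` (image file path), and `caption` (the answer of the `vqa_pairs` entry whose `question_type` is `caption`)
- **QA pairs file** (`RSVLM-QA-questions.parquet`): one row per question-answer pair, with the columns `id`, `question_type`, `question`, and `answer`; caption entries are excluded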

## 1. Import Required Libraries

In [1]:
import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load JSONL Dataset

In [2]:
# Load the JSONL file
jsonl_file = "RSVLM-QA.jsonl"

data = []
with open(jsonl_file, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

print(f"Loaded {len(data)} records from {jsonl_file}")

Loaded 13820 records from RSVLM-QA.jsonl
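
The loading loop above passes every line to `json.loads`, so an empty line anywhere in the file would raise a `JSONDecodeError` without saying where. A slightly more defensive loader might look like the sketch below. The helper name `load_jsonl` and the demo file are hypothetical and only serve to illustrate the idea; they are not part of the dataset or the notebook.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. an extra newline at the end
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Point at the offending line instead of a bare decode error
                raise ValueError(f"{path}: invalid JSON on line {line_no}") from exc
    return records


# Tiny stand-in file with made-up content to exercise the helper
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text('{"id": "0"}\n\n{"id": "1"}\n', encoding="utf-8")
    demo_records = load_jsonl(demo_path)

print(f"Loaded {len(demo_records)} records")
```

With the real file, the call would simply be `data = load_jsonl(jsonl_file)`.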


## 3. Explore Dataset Structure

In [3]:
# Display first record structure
print("Sample record structure:")
print(f"Keys: {data[0].keys()}")
print(f"\nNumber of VQA pairs in first record: {len(data[0]['vqa_pairs'])}")
print(f"\nFirst VQA pair:")
print(data[0]['vqa_pairs'][0])
print(f"\nCaption question (question_type='caption'):")
caption_qa = [qa for qa in data[0]['vqa_pairs'] if qa['question_type'] == 'caption']
if caption_qa:
    print(f"Question: {caption_qa[0]['question']}")
    print(f"Answer (Caption): {caption_qa[0]['answer'][:200]}...")

Sample record structure:
Keys: dict_keys(['id', 'image', 'vqa_pairs', 'tags', 'relations'])

Number of VQA pairs in first record: 10

First VQA pair:
{'question_id': '1', 'question_type': 'spatial', 'question': 'Where is the highway interchange located in the image?', 'answer': 'The highway interchange is located in the central portion of the image.'}

Caption question (question_type='caption'):
Question: Generate a detailed caption for this image.
Answer (Caption): The image depicts a highly developed urban area characterized by a prominent highway interchange that dominates the central portion of the scene. Surrounding the highways are dense residential neighbo...
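
Looking at the first record only shows one example of the nesting. To get a picture of the whole dataset before flattening it, one option is to count question types and check for records without a caption entry. The sketch below does this on two made-up records that mimic the structure printed above; `sample_records` and its contents are illustrative assumptions, and on the real data the same expressions would run over `data`.

```python
from collections import Counter

# Made-up records that mimic the keys seen in the sample output
sample_records = [
    {"id": "0", "image": "a.tif", "vqa_pairs": [
        {"question_id": "1", "question_type": "spatial",
         "question": "Where is the road?", "answer": "In the center."},
        {"question_id": "2", "question_type": "caption",
         "question": "Generate a detailed caption for this image.",
         "answer": "A road crosses the scene."},
    ]},
    {"id": "1", "image": "b.tif", "vqa_pairs": [
        {"question_id": "1", "question_type": "spatial",
         "question": "Where is the park?", "answer": "On the left."},
    ]},
]

# How often each question_type occurs across all records
type_counts = Counter(
    qa["question_type"] for rec in sample_records for qa in rec["vqa_pairs"]
)

# Number of VQA pairs per record (useful for spotting outliers)
pairs_per_record = [len(rec["vqa_pairs"]) for rec in sample_records]

# Ids of records that have no entry with question_type == 'caption'
missing_caption = [
    rec["id"] for rec in sample_records
    if not any(qa["question_type"] == "caption" for qa in rec["vqa_pairs"])
]

print(type_counts)
print(f"Pairs per record: min={min(pairs_per_record)}, max={max(pairs_per_record)}")
print(f"Records without a caption: {missing_caption}")
```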


## 4. Extract and Flatten VQA Pairs

In [None]:
def process_record_captions(record):
    """
    Process a single JSONL record and extract id, image, and caption.
    Returns a dictionary with id, image, and caption.
    """
    processed = {
        'id': record['id'],
        'image': record['image']
    }
    
    # Extract caption from vqa_pairs (where question_type == 'caption')
    caption = None
    
    for qa in record['vqa_pairs']:
        if qa['question_type'] == 'caption':
            caption = qa['answer']
            break
    
    processed['caption'] = caption
    
    return processed

def process_record_qa(record):
    """
    Process a single JSONL record and extract all QA pairs.
    Returns a list of dictionaries, each containing id, question_type, question, and answer.
    """
    qa_list = []
    
    for qa in record['vqa_pairs']:
        if qa['question_type'] != 'caption':  # Exclude caption from QA pairs
            qa_list.append({
                'id': record['id'],
                'question_type': qa['question_type'],
                'question': qa['question'],
                'answer': qa['answer']
            })
    
    return qa_list

# Process all records for captions
captions_data = [process_record_captions(record) for record in data]

# Process all records for QA pairs
qa_data = []
for record in data:
    qa_data.extend(process_record_qa(record))

print(f"Processed {len(captions_data)} records for captions")
print(f"Processed {len(qa_data)} QA pairs total")

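
As an alternative to the two hand-written helpers `process_record_captions` and `process_record_qa`, the same flattening can probably be expressed with `pd.json_normalize`, which expands a nested list into rows and carries selected parent fields along. The sketch below runs on made-up records (`sample_records` is an assumption for illustration). One difference to keep in mind: a record without a caption entry simply does not appear in the captions table here, whereas `process_record_captions` keeps it with `caption` set to `None`.

```python
import pandas as pd

# Made-up records in the same shape as the JSONL records
sample_records = [
    {"id": "0", "image": "a.tif", "vqa_pairs": [
        {"question_id": "1", "question_type": "spatial",
         "question": "Where is the road?", "answer": "In the center."},
        {"question_id": "2", "question_type": "caption",
         "question": "Generate a detailed caption for this image.",
         "answer": "A road crosses the scene."},
    ]},
    {"id": "1", "image": "b.tif", "vqa_pairs": [
        {"question_id": "1", "question_type": "spatial",
         "question": "Where is the park?", "answer": "On the left."},
    ]},
]

# One row per VQA pair, with the parent record's id and image attached
flat = pd.json_normalize(sample_records, record_path="vqa_pairs", meta=["id", "image"])

# QA table: every pair except the caption entries
is_caption = flat["question_type"] == "caption"
qa_long = flat.loc[~is_caption, ["id", "question_type", "question", "answer"]]
qa_long = qa_long.reset_index(drop=True)

# Captions table: the caption answers, renamed to 'caption'
captions = (
    flat.loc[is_caption, ["id", "image", "answer"]]
    .rename(columns={"answer": "caption"})
    .reset_index(drop=True)
)

print(qa_long)
print(captions)
```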


## 5. Create Two DataFrames

In [None]:
# Create DataFrame for captions (id, image, caption)
df_captions = pd.DataFrame(captions_data)

print("Captions DataFrame:")
print(f"Shape: {df_captions.shape}")
print(f"Columns: {df_captions.columns.tolist()}")
print(f"\nFirst few rows:")
df_captions.head()



In [None]:
# Create DataFrame for QA pairs (id, question_type, question, answer)
df_qa = pd.DataFrame(qa_data)

print("QA Pairs DataFrame:")
print(f"Shape: {df_qa.shape}")
print(f"Columns: {df_qa.columns.tolist()}")
print(f"\nQuestion types distribution:")
print(df_qa['question_type'].value_counts())
print(f"\nFirst few rows:")
df_qa.head()

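
Because the captions and the QA pairs now live in separate tables linked only by `id`, it may be worth checking that the two agree before saving them. The following sketch shows a few such checks on small made-up frames; `df_captions_demo` and `df_qa_demo` are hypothetical stand-ins for `df_captions` and `df_qa`.

```python
import pandas as pd

# Hypothetical stand-ins for df_captions and df_qa
df_captions_demo = pd.DataFrame({
    "id": ["0", "1"],
    "image": ["a.tif", "b.tif"],
    "caption": ["A road crosses the scene.", None],
})
df_qa_demo = pd.DataFrame({
    "id": ["0", "0", "1"],
    "question_type": ["spatial", "spatial", "spatial"],
    "question": ["Where is the road?", "Where are the houses?", "Where is the park?"],
    "answer": ["In the center.", "In the north.", "On the left."],
})

# Each record should appear exactly once in the captions table
ids_unique = df_captions_demo["id"].is_unique

# QA rows whose id has no matching captions row
orphan_ids = set(df_qa_demo["id"]) - set(df_captions_demo["id"])

# Records for which no caption entry was found
n_missing_captions = int(df_captions_demo["caption"].isna().sum())

# Number of QA pairs per record
pairs_per_id = df_qa_demo.groupby("id").size()

print(f"Unique ids: {ids_unique}")
print(f"Orphan QA ids: {sorted(orphan_ids)}")
print(f"Records without caption: {n_missing_captions}")
print(pairs_per_id)
```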


## 6. Save to Two Parquet Files

In [None]:
# Save captions DataFrame to Parquet
captions_file = "RSVLM-QA-captions.parquet"
df_captions.to_parquet(captions_file, engine='pyarrow', compression='snappy', index=False)

print(f"✓ Saved {len(df_captions)} caption records to {captions_file}")
print(f"  File size: {Path(captions_file).stat().st_size / (1024*1024):.2f} MB")

# Save QA pairs DataFrame to Parquet
qa_file = "RSVLM-QA-questions.parquet"
df_qa.to_parquet(qa_file, engine='pyarrow', compression='snappy', index=False)

print(f"\n✓ Saved {len(df_qa)} QA pairs to {qa_file}")
print(f"  File size: {Path(qa_file).stat().st_size / (1024*1024):.2f} MB")



## 7. Verify Both Parquet Files

In [None]:
# Verify captions Parquet file
df_captions_verify = pd.read_parquet(captions_file)

print("=" * 60)
print("CAPTIONS PARQUET FILE")
print("=" * 60)
print(f"Shape: {df_captions_verify.shape}")
print(f"\nColumns: {df_captions_verify.columns.tolist()}")
print(f"\nData types:")
print(df_captions_verify.dtypes)
print(f"\nSample rows:")
df_captions_verify.head()



In [None]:
# Verify QA pairs Parquet file
df_qa_verify = pd.read_parquet(qa_file)

print("=" * 60)
print("QA PAIRS PARQUET FILE")
print("=" * 60)
print(f"Shape: {df_qa_verify.shape}")
print(f"\nColumns: {df_qa_verify.columns.tolist()}")
print(f"\nData types:")
print(df_qa_verify.dtypes)
print(f"\nQuestion types distribution:")
print(df_qa_verify['question_type'].value_counts())
print(f"\nSample rows:")
df_qa_verify.head(10)



In [None]:
# Display example from captions file
print("=" * 60)
print("EXAMPLE FROM CAPTIONS FILE")
print("=" * 60)
print(f"ID: {df_captions_verify.iloc[0]['id']}")
print(f"Image: {df_captions_verify.iloc[0]['image']}")
print(f"\nCaption:\n{df_captions_verify.iloc[0]['caption']}")

print("\n" + "=" * 60)
print("EXAMPLE QA PAIRS FOR SAME ID")
print("=" * 60)
sample_id = df_captions_verify.iloc[0]['id']
sample_qa = df_qa_verify[df_qa_verify['id'] == sample_id]
for idx, row in sample_qa.head(3).iterrows():
    print(f"\nType: {row['question_type']}")
    print(f"Q: {row['question']}")
    print(f"A: {row['answer']}")
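
The lookup above filters the QA table for one `id` at a time. When every question should carry its image path and caption, for example to build training samples, the two tables can be joined on `id` instead. The sketch below shows this with `DataFrame.merge` on small made-up frames; `captions_demo` and `qa_demo` are hypothetical stand-ins for the two reloaded files.

```python
import pandas as pd

# Hypothetical stand-ins for the captions and QA pairs tables
captions_demo = pd.DataFrame({
    "id": ["0", "1"],
    "image": ["a.tif", "b.tif"],
    "caption": ["A road crosses the scene.", "A park lies on the left."],
})
qa_demo = pd.DataFrame({
    "id": ["0", "0", "1"],
    "question_type": ["spatial", "spatial", "spatial"],
    "question": ["Where is the road?", "Where are the houses?", "Where is the park?"],
    "answer": ["In the center.", "In the north.", "On the left."],
})

# Left join keeps every QA row; validate raises if an id occurs
# more than once in the captions table
merged = qa_demo.merge(captions_demo, on="id", how="left", validate="many_to_one")

print(merged[["id", "image", "question", "answer"]])
```

Keeping the tables separate on disk avoids repeating each caption once per question; the join recreates the combined view only when it is needed.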