# Convert JSONL VQA Dataset to Parquet Format

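This notebook converts the RSVLM-QA JSONL dataset into two Parquet files linked by a shared record `id`:
- **Captions file** (`RSVLM-QA-captions.parquet`): one row per record with `id` (record identifier), `image` (image file path), and `caption` (the answer of the `vqa_pairs` entry whose `question_type` is `caption`)
- **QA file** (`RSVLM-QA-questions.parquet`): one row per remaining question-answer pair with `id`, `question_type`, `question`, and `answer`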

## 1. Import Required Libraries

In [10]:
import json
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Load JSONL Dataset

In [11]:
# Load the JSONL file
jsonl_file = "RSVLM-QA.jsonl"

data = []
with open(jsonl_file, 'r', encoding='utf-8') as f:
    for line in f:
        data.append(json.loads(line))

print(f"Loaded {len(data)} records from {jsonl_file}")

Loaded 13820 records from RSVLM-QA.jsonl
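
The loading loop above assumes every line of the file is a valid JSON object. A slightly more defensive variant is sketched below, assuming that blank lines should be skipped and that a malformed line should fail with its line number. The helper name `load_jsonl` is hypothetical and not part of the original notebook.

```python
import json

def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines (hypothetical helper)."""
    records = []
    with open(path, 'r', encoding='utf-8') as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # Tolerate blank or trailing empty lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Re-raise with the line number so the bad record is easy to locate
                raise ValueError(f"{path}: invalid JSON on line {line_no}: {exc}") from exc
    return records
```

With this helper, the cell above would reduce to `data = load_jsonl(jsonl_file)`.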


## 3. Explore Dataset Structure

In [12]:
# Display first record structure
print("Sample record structure:")
print(f"Keys: {data[0].keys()}")
print(f"\nNumber of VQA pairs in first record: {len(data[0]['vqa_pairs'])}")
print(f"\nFirst VQA pair:")
print(data[0]['vqa_pairs'][0])
print(f"\nCaption question (question_type='caption'):")
caption_qa = [qa for qa in data[0]['vqa_pairs'] if qa['question_type'] == 'caption']
if caption_qa:
    print(f"Question: {caption_qa[0]['question']}")
    print(f"Answer (Caption): {caption_qa[0]['answer'][:200]}...")

Sample record structure:
Keys: dict_keys(['id', 'image', 'vqa_pairs', 'tags', 'relations'])

Number of VQA pairs in first record: 10

First VQA pair:
{'question_id': '1', 'question_type': 'spatial', 'question': 'Where is the highway interchange located in the image?', 'answer': 'The highway interchange is located in the central portion of the image.'}

Caption question (question_type='caption'):
Question: Generate a detailed caption for this image.
Answer (Caption): The image depicts a highly developed urban area characterized by a prominent highway interchange that dominates the central portion of the scene. Surrounding the highways are dense residential neighbo...


## 4. Extract and Flatten VQA Pairs

In [13]:
def process_record_captions(record):
    """
    Process a single JSONL record and extract id, image, and caption.
    Returns a dictionary with id, image, and caption.
    """
    processed = {
        'id': record['id'],
        'image': record['image']
    }
    
    # Extract caption from vqa_pairs (where question_type == 'caption')
    caption = None
    
    for qa in record['vqa_pairs']:
        if qa['question_type'] == 'caption':
            caption = qa['answer']
            break
    
    processed['caption'] = caption
    
    return processed

def process_record_qa(record):
    """
    Process a single JSONL record and extract all QA pairs.
    Returns a list of dictionaries, each containing id, question_type, question, and answer.
    """
    qa_list = []
    
    for qa in record['vqa_pairs']:
        if qa['question_type'] != 'caption':  # Exclude caption from QA pairs
            qa_list.append({
                'id': record['id'],
                'question_type': qa['question_type'],
                'question': qa['question'],
                'answer': qa['answer']
            })
    
    return qa_list

# Process all records for captions
captions_data = [process_record_captions(record) for record in data]

# Process all records for QA pairs
qa_data = []
for record in data:
    qa_data.extend(process_record_qa(record))

print(f"Processed {len(captions_data)} records for captions")
print(f"Processed {len(qa_data)} QA pairs total")

Processed 13820 records for captions
Processed 148558 QA pairs total
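
`process_record_captions` keeps only the first `caption` entry it finds and leaves `caption` as `None` when a record has none. Before relying on the captions table, it may be worth checking how often either case occurs. The sketch below uses a hypothetical helper, `caption_stats`, and made-up records rather than real RSVLM-QA data.

```python
from collections import Counter

def caption_stats(records):
    """Count records with zero, one, or several 'caption' entries (hypothetical helper)."""
    stats = Counter()
    for record in records:
        n = sum(1 for qa in record['vqa_pairs'] if qa['question_type'] == 'caption')
        if n == 0:
            stats['missing'] += 1
        elif n == 1:
            stats['single'] += 1
        else:
            stats['multiple'] += 1
    return stats

# Made-up records for illustration only; not taken from the dataset
toy = [
    {'id': 'a', 'vqa_pairs': [{'question_type': 'caption', 'answer': 'A city.'}]},
    {'id': 'b', 'vqa_pairs': [{'question_type': 'spatial', 'answer': 'On the left.'}]},
    {'id': 'c', 'vqa_pairs': [{'question_type': 'caption', 'answer': 'A river.'},
                              {'question_type': 'caption', 'answer': 'A bridge.'}]},
]
print(caption_stats(toy))
```

Running `caption_stats(data)` on the loaded records would show whether the `break` in `process_record_captions` ever discards a second caption.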


## 5. Create Two DataFrames

In [14]:
# Create DataFrame for captions (id, image, caption)
df_captions = pd.DataFrame(captions_data)

print("Captions DataFrame:")
print(f"Shape: {df_captions.shape}")
print(f"Columns: {df_captions.columns.tolist()}")
print(f"\nFirst few rows:")
df_captions.head()

Captions DataFrame:
Shape: (13820, 3)
Columns: ['id', 'image', 'caption']

First few rows:


|   | id | image | caption |
|---|----|-------|---------|
| 0 | 0 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image depicts a highly developed urban are... |
| 1 | 1 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image presents a clear contrast between tw... |
| 2 | 2 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image primarily depicts a suburban neighbo... |
| 3 | 3 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The landscape is characterized by a prominent ... |
| 4 | 4 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image depicts a highly urbanized environme... |


In [15]:
# Create DataFrame for QA pairs (id, question_type, question, answer)
df_qa = pd.DataFrame(qa_data)

print("QA Pairs DataFrame:")
print(f"Shape: {df_qa.shape}")
print(f"Columns: {df_qa.columns.tolist()}")
print(f"\nQuestion types distribution:")
print(df_qa['question_type'].value_counts())
print(f"\nFirst few rows:")
df_qa.head()

QA Pairs DataFrame:
Shape: (148558, 4)
Columns: ['id', 'question_type', 'question', 'answer']

Question types distribution:
question_type
spatial                             39467
count                               27608
presence                            27608
overall                             18954
quantity                             7413
comparison                           5759
total_count                          5759
object                               5579
overall_features                     2935
objects                              1848
quantity_proportion                  1608
overall_feature                      1351
overall features                     1137
quantities_proportions                423
quantity/proportion                   218
quantities or proportions             169
feature                               120
image_feature                         117
quantity or proportion                 83
object_quantity                        58
overall image features

|   | id | question_type | question | answer |
|---|----|---------------|----------|--------|
| 0 | 0 | spatial | Where is the highway interchange located in th... | The highway interchange is located in the cent... |
| 1 | 0 | spatial | In which parts of the image are the residentia... | The houses are primarily found in the northern... |
| 2 | 0 | spatial | Where are the recreational facilities situated... | The recreational facilities are in the southwe... |
| 3 | 0 | object | What recreational facilities can be seen in th... | A running track and tennis courts are visible ... |
| 4 | 0 | overall | What kind of area does the image depict? | The image depicts a highly developed urban area. |
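
The distribution above contains several labels that differ only in spelling, such as `overall_features`, `overall_feature`, and `overall features`, or `quantity_proportion` and `quantity/proportion`. One possible way to consolidate them is sketched below. The function `normalize_question_type` and the `ALIASES` table are assumptions made for illustration: whether these labels really denote the same category should be confirmed against the dataset documentation before merging them.

```python
import re

# Assumed alias table: these groupings are guesses based on spelling alone
ALIASES = {
    'overall_feature': 'overall_features',
    'objects': 'object',
    'quantities_proportions': 'quantity_proportion',
    'quantities_or_proportions': 'quantity_proportion',
    'quantity_or_proportion': 'quantity_proportion',
}

def normalize_question_type(label):
    """Lowercase, unify separators, then apply the assumed alias table."""
    key = re.sub(r'[\s/]+', '_', label.strip().lower())
    return ALIASES.get(key, key)

for raw in ['overall features', 'quantity/proportion', 'objects', 'spatial']:
    print(f"{raw!r} -> {normalize_question_type(raw)!r}")
```

Applied to the DataFrame, this could be used as `df_qa['question_type'].map(normalize_question_type)`, ideally written to a new column so that the original labels stay available.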


## 6. Save to Two Parquet Files

In [16]:
# Save captions DataFrame to Parquet
captions_file = "RSVLM-QA-captions.parquet"
df_captions.to_parquet(captions_file, engine='pyarrow', compression='snappy', index=False)

print(f"✓ Saved {len(df_captions)} caption records to {captions_file}")
print(f"  File size: {Path(captions_file).stat().st_size / (1024*1024):.2f} MB")

# Save QA pairs DataFrame to Parquet
qa_file = "RSVLM-QA-questions.parquet"
df_qa.to_parquet(qa_file, engine='pyarrow', compression='snappy', index=False)

print(f"\n✓ Saved {len(df_qa)} QA pairs to {qa_file}")
print(f"  File size: {Path(qa_file).stat().st_size / (1024*1024):.2f} MB")

✓ Saved 13820 caption records to RSVLM-QA-captions.parquet
  File size: 3.79 MB

✓ Saved 148558 QA pairs to RSVLM-QA-questions.parquet
  File size: 5.19 MB



## 7. Verify Both Parquet Files

In [17]:
# Verify captions Parquet file
df_captions_verify = pd.read_parquet(captions_file)

print("=" * 60)
print("CAPTIONS PARQUET FILE")
print("=" * 60)
print(f"Shape: {df_captions_verify.shape}")
print(f"\nColumns: {df_captions_verify.columns.tolist()}")
print(f"\nData types:")
print(df_captions_verify.dtypes)
print(f"\nSample rows:")
df_captions_verify.head()

CAPTIONS PARQUET FILE
Shape: (13820, 3)

Columns: ['id', 'image', 'caption']

Data types:
id         str
image      str
caption    str
dtype: object

Sample rows:


|   | id | image | caption |
|---|----|-------|---------|
| 0 | 0 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image depicts a highly developed urban are... |
| 1 | 1 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image presents a clear contrast between tw... |
| 2 | 2 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image primarily depicts a suburban neighbo... |
| 3 | 3 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The landscape is characterized by a prominent ... |
| 4 | 4 | RSVLM-QA/INRIA-Aerial-Image-Labeling/train/ima... | The image depicts a highly urbanized environme... |


In [18]:
# Verify QA pairs Parquet file
df_qa_verify = pd.read_parquet(qa_file)

print("=" * 60)
print("QA PAIRS PARQUET FILE")
print("=" * 60)
print(f"Shape: {df_qa_verify.shape}")
print(f"\nColumns: {df_qa_verify.columns.tolist()}")
print(f"\nData types:")
print(df_qa_verify.dtypes)
print(f"\nQuestion types distribution:")
print(df_qa_verify['question_type'].value_counts())
print(f"\nSample rows:")
df_qa_verify.head(10)

QA PAIRS PARQUET FILE
Shape: (148558, 4)

Columns: ['id', 'question_type', 'question', 'answer']

Data types:
id               str
question_type    str
question         str
answer           str
dtype: object

Question types distribution:
question_type
spatial                             39467
count                               27608
presence                            27608
overall                             18954
quantity                             7413
comparison                           5759
total_count                          5759
object                               5579
overall_features                     2935
objects                              1848
quantity_proportion                  1608
overall_feature                      1351
overall features                     1137
quantities_proportions                423
quantity/proportion                   218
quantities or proportions             169
feature                               120
image_feature                     

|   | id | question_type | question | answer |
|---|----|---------------|----------|--------|
| 0 | 0 | spatial | Where is the highway interchange located in th... | The highway interchange is located in the cent... |
| 1 | 0 | spatial | In which parts of the image are the residentia... | The houses are primarily found in the northern... |
| 2 | 0 | spatial | Where are the recreational facilities situated... | The recreational facilities are in the southwe... |
| 3 | 0 | object | What recreational facilities can be seen in th... | A running track and tennis courts are visible ... |
| 4 | 0 | overall | What kind of area does the image depict? | The image depicts a highly developed urban area. |
| 5 | 0 | overall | How would you describe the balance between bui... | The built environment is significant and visua... |
| 6 | 0 | quantity | Are the recreational facilities concentrated i... | The recreational facilities are concentrated i... |
| 7 | 0 | count | How many buildings are there in the image? | There are 1588 buildings in the image. |
| 8 | 0 | presence | Are there any buildings in the image? | Yes, there are 1588 buildings. |
| 9 | 1 | spatial | Where is the dense urban grid located in the i... | The dense urban grid is located on the right s... |
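
Since the two files are linked only by `id`, a consumer will typically join them again. The sketch below shows one way to do that and to detect QA rows whose `id` has no caption row. It uses made-up frames (`df_cap_toy`, `df_qa_toy`) in place of the real data, and the variable names are illustrative.

```python
import pandas as pd

# Made-up stand-ins for the captions and QA tables
df_cap_toy = pd.DataFrame({
    'id': ['0', '1'],
    'image': ['a.tif', 'b.tif'],
    'caption': ['An urban area.', 'A river valley.'],
})
df_qa_toy = pd.DataFrame({
    'id': ['0', '0', '1', '2'],
    'question_type': ['spatial', 'count', 'presence', 'spatial'],
    'question': ['Where is the road?', 'How many buildings?', 'Is there a river?', 'Where is the lake?'],
    'answer': ['In the center.', 'Twelve.', 'Yes.', 'In the north.'],
})

# Left join keeps every QA row; validate checks that each id appears once in captions;
# indicator adds a '_merge' column recording whether a caption row was found
merged = df_qa_toy.merge(df_cap_toy, on='id', how='left',
                         validate='many_to_one', indicator=True)

# QA rows without a matching caption row
orphans = merged[merged['_merge'] == 'left_only']
print(f"{len(merged)} QA rows, {len(orphans)} without a caption record")
```

On the real files, the same call with `df_qa_verify` and `df_captions_verify` would be expected to yield no orphans, since both tables are derived from the same records.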


In [19]:
# Display example from captions file
print("=" * 60)
print("EXAMPLE FROM CAPTIONS FILE")
print("=" * 60)
print(f"ID: {df_captions_verify.iloc[0]['id']}")
print(f"Image: {df_captions_verify.iloc[0]['image']}")
print(f"\nCaption:\n{df_captions_verify.iloc[0]['caption']}")

print("\n" + "=" * 60)
print("EXAMPLE QA PAIRS FOR SAME ID")
print("=" * 60)
sample_id = df_captions_verify.iloc[0]['id']
sample_qa = df_qa_verify[df_qa_verify['id'] == sample_id]
for idx, row in sample_qa.head(3).iterrows():
    print(f"\nType: {row['question_type']}")
    print(f"Q: {row['question']}")
    print(f"A: {row['answer']}")

EXAMPLE FROM CAPTIONS FILE
ID: 0
Image: RSVLM-QA/INRIA-Aerial-Image-Labeling/train/images/austin11.tif

Caption:
The image depicts a highly developed urban area characterized by a prominent highway interchange that dominates the central portion of the scene. Surrounding the highways are dense residential neighborhoods with closely packed houses, particularly in the northern and eastern parts of the image. Several large, white-roofed buildings, likely commercial or institutional, are visible in the southern and southeastern areas. The southwest corner features recreational facilities, including a running track and tennis courts, adjacent to a river. While there are some green spaces and tree cover interspersed throughout, the built environment is a significant and visually dominant feature, supporting the claim of a notable but not overwhelming building presence in the landscape.

EXAMPLE QA PAIRS FOR SAME ID

Type: spatial
Q: Where is the highway interchange located in the image?
A: Th