# CSV File Verification Notebook

This notebook helps verify the correctness of the generated CSV files by comparing data from two sources:
1. **CSV files** generated by `generate_csv_from_data.py`
2. **Original JSON files** containing VLM captions and GPT-4o evaluations

## Purpose
- Verify that VLM captions were correctly extracted and aligned
- Confirm that GPT-4o evaluations and scores match between sources
- Ensure data integrity during the CSV generation process

## Usage
1. Set the parameters (`TEST_VIDEO_ID`, `TEST_CHUNK_ID`, `TEST_MODEL_NAME`) in the configuration cell
2. Run all cells to perform the verification
3. Review the comparison results at the end

In [1]:
# Import Required Libraries
import pandas as pd
import json
import os
from pathlib import Path
from typing import Dict, Any, Optional, Tuple

print("📚 Libraries imported successfully!")

📚 Libraries imported successfully!


## 🔧 Configuration Parameters

Set the test parameters below to verify specific data points:

In [2]:
# ===== CONFIGURATION PARAMETERS =====
# Modify these parameters to test different data points

TEST_VIDEO_ID = "74GSMhR6oI0"  # Example video ID
TEST_CHUNK_ID = 0               # Example chunk ID  
TEST_MODEL_NAME = "vila-1.5"    # Example model name

# File paths
CSV_OUTPUT_DIR = "csv_output"
CAPTIONS_DIR = "reports/minseok_prompt_audio_x_caption_sorted"
EVALUATIONS_DIR = "reports/caption_full_comparison_minseok_audio_x_caption_sorted_completeness"

print(f"🎯 Test Parameters:")
print(f"   Video ID: {TEST_VIDEO_ID}")
print(f"   Chunk ID: {TEST_CHUNK_ID}")
print(f"   Model: {TEST_MODEL_NAME}")
print(f"📂 Directories configured successfully!")

🎯 Test Parameters:
   Video ID: 74GSMhR6oI0
   Chunk ID: 0
   Model: vila-1.5
📂 Directories configured successfully!


## 📄 Load CSV File Data

Load the generated CSV files and examine their structure:

In [3]:
# Load CSV files
def load_csv_files():
    """Load both CSV files and return DataFrames."""
    
    captions_csv = Path(CSV_OUTPUT_DIR) / "vlm_captions.csv"
    evaluations_csv = Path(CSV_OUTPUT_DIR) / "gpt4o_evaluations.csv"
    
    if not captions_csv.exists():
        raise FileNotFoundError(f"Captions CSV not found: {captions_csv}")
    if not evaluations_csv.exists():
        raise FileNotFoundError(f"Evaluations CSV not found: {evaluations_csv}")
    
    print(f"📄 Loading: {captions_csv}")
    captions_df = pd.read_csv(captions_csv)
    
    print(f"📄 Loading: {evaluations_csv}")
    evaluations_df = pd.read_csv(evaluations_csv)
    
    print(f"✅ CSV files loaded successfully!")
    print(f"   📊 Captions CSV: {len(captions_df)} rows")
    print(f"   📊 Evaluations CSV: {len(evaluations_df)} rows")
    
    return captions_df, evaluations_df

# Load the CSV data
captions_df, evaluations_df = load_csv_files()

# Display basic info about the CSV files
print(f"\n📋 Captions CSV Columns: {list(captions_df.columns)}")
print(f"📋 Evaluations CSV Columns: {list(evaluations_df.columns)}")

# Show sample data
print(f"\n🔍 Sample Captions Data:")
print(captions_df.head(2))
print(f"\n🔍 Sample Evaluations Data:")
print(evaluations_df.head(2))

📄 Loading: csv_output/vlm_captions.csv
📄 Loading: csv_output/gpt4o_evaluations.csv
✅ CSV files loaded successfully!
   📊 Captions CSV: 747 rows
   📊 Evaluations CSV: 747 rows

📋 Captions CSV Columns: ['video_id', 'chunk_id', 'start_time', 'end_time', 'vila-1.5', 'nvila', 'cosmos_reason1', 'qwen3-vl-30b-a3b-instruct', 'gemini-2.5-pro']
📋 Evaluations CSV Columns: ['video_id', 'chunk_id', 'start_time', 'end_time', 'vila-1.5', 'vila-1.5_score', 'nvila', 'nvila_score', 'cosmos_reason1', 'cosmos_reason1_score', 'qwen3-vl-30b-a3b-instruct', 'qwen3-vl-30b-a3b-instruct_score']

🔍 Sample Captions Data:
      video_id  chunk_id  start_time  end_time  \
0  74GSMhR6oI0         0         0.0      20.0   
1  74GSMhR6oI0         1        20.0      40.0   

                                            vila-1.5  \
0  A man is talking about a soccer game and holdi...   
1  A woman wearing an orange jersey and black sho...   

                                               nvila  \
0
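Both CSVs report 747 rows, but equal row counts alone do not show that the two files cover the same `(video_id, chunk_id)` pairs, and the extraction step later in this notebook reads only the first matching row. The following is a minimal sketch of a key check, written against plain tuples so it runs without the data files. `compare_keys` and the sample keys are hypothetical and are not part of `generate_csv_from_data.py`.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def compare_keys(left: Iterable[Hashable], right: Iterable[Hashable]) -> Dict[str, List[Hashable]]:
    """Report duplicated keys and keys that appear on only one side."""
    left_counts, right_counts = Counter(left), Counter(right)
    return {
        "duplicates_left": sorted(k for k, n in left_counts.items() if n > 1),
        "duplicates_right": sorted(k for k, n in right_counts.items() if n > 1),
        "only_left": sorted(set(left_counts) - set(right_counts)),
        "only_right": sorted(set(right_counts) - set(left_counts)),
    }


# Made-up keys for illustration; real keys would be (video_id, chunk_id) pairs.
caption_keys = [("vidA", 0), ("vidA", 1), ("vidB", 0)]
evaluation_keys = [("vidA", 0), ("vidA", 1), ("vidA", 1)]

report = compare_keys(caption_keys, evaluation_keys)
print(report)
```

With the DataFrames loaded above, the keys could be produced with `zip(captions_df['video_id'], captions_df['chunk_id'])` and the equivalent for `evaluations_df`. A report in which every list is empty would suggest the two files are aligned on their keys.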

## 📁 Load Original JSON File Data

Load the original JSON files to compare against the CSV data:

In [4]:
# Load original JSON files
def load_original_json_files():
    """Load original VLM caption files and GPT-4o evaluation files."""
    
    # Load VLM caption file for the test model and video
    caption_file = Path(CAPTIONS_DIR) / TEST_MODEL_NAME / TEST_VIDEO_ID / f"vlm_captions_{TEST_VIDEO_ID}.json"
    
    if not caption_file.exists():
        raise FileNotFoundError(f"Caption file not found: {caption_file}")
    
    print(f"📄 Loading caption file: {caption_file}")
    with open(caption_file, 'r', encoding='utf-8') as f:
        caption_data = json.load(f)
    
    # Load GPT-4o evaluation results (multi-run file)
    eval_file = Path(EVALUATIONS_DIR) / "multi_run_caption_evaluations.json"
    
    if not eval_file.exists():
        raise FileNotFoundError(f"Evaluation file not found: {eval_file}")
    
    print(f"📄 Loading evaluation file: {eval_file}")
    with open(eval_file, 'r', encoding='utf-8') as f:
        eval_data = json.load(f)
    
    print(f"✅ Original JSON files loaded successfully!")
    
    return caption_data, eval_data

# Load the original JSON data
original_captions, original_evaluations = load_original_json_files()

print(f"\n📋 Original Caption Data Keys: {list(original_captions.keys())}")
print(f"📋 Original Evaluation Data Keys: {list(original_evaluations.keys())}")
print(f"📊 Number of caption chunks: {len(original_captions.get('chunk_responses', []))}")
print(f"📊 Number of evaluation runs: {len(original_evaluations.get('all_runs', []))}")

📄 Loading caption file: reports/minseok_prompt_audio_x_caption_sorted/vila-1.5/74GSMhR6oI0/vlm_captions_74GSMhR6oI0.json
📄 Loading evaluation file: reports/caption_full_comparison_minseok_audio_x_caption_sorted_completeness/multi_run_caption_evaluations.json
✅ Original JSON files loaded successfully!

📋 Original Caption Data Keys: ['id', 'created', 'model', 'media_info', 'usage', 'chunk_responses']
📋 Original Evaluation Data Keys: ['all_runs', 'averaged_results', 'num_runs', 'run_delay', 'ground_truth_model']
📊 Number of caption chunks: 159
📊 Number of evaluation runs: 3


## 🔍 Extract Data from CSV Files

Extract the specific data point from CSV files using the test parameters:

In [5]:
def extract_from_csv(video_id: str, chunk_id: int, model_name: str) -> Tuple[Optional[str], Optional[str], Optional[float]]:
    """
    Extract VLM caption and GPT-4o evaluation data from CSV files.
    
    Returns:
        Tuple of (vlm_caption, gpt4o_judgment, gpt4o_score)
    """
    
    # Filter data for the specific video and chunk
    caption_row = captions_df[
        (captions_df['video_id'] == video_id) & 
        (captions_df['chunk_id'] == chunk_id)
    ]
    
    eval_row = evaluations_df[
        (evaluations_df['video_id'] == video_id) & 
        (evaluations_df['chunk_id'] == chunk_id)
    ]
    
    if caption_row.empty:
        print(f"❌ No caption data found in CSV for {video_id}, chunk {chunk_id}")
        return None, None, None
    
    if eval_row.empty:
        print(f"❌ No evaluation data found in CSV for {video_id}, chunk {chunk_id}")
        return None, None, None
    
    # Extract VLM caption
    vlm_caption = caption_row.iloc[0][model_name] if model_name in caption_row.columns else None
    
    # Extract GPT-4o evaluation (judgment and score)
    gpt4o_judgment = eval_row.iloc[0][model_name] if model_name in eval_row.columns else None
    gpt4o_score = eval_row.iloc[0][f"{model_name}_score"] if f"{model_name}_score" in eval_row.columns else None
    
    return vlm_caption, gpt4o_judgment, gpt4o_score

# Extract data from CSV
print(f"🔍 Extracting data from CSV files...")
csv_caption, csv_judgment, csv_score = extract_from_csv(TEST_VIDEO_ID, TEST_CHUNK_ID, TEST_MODEL_NAME)

print(f"\n📄 CSV Results:")
print(f"   🤖 VLM Caption: {csv_caption}")
print(f"   🧠 GPT-4o Judgment: {csv_judgment}")
print(f"   📊 GPT-4o Score: {csv_score}")

🔍 Extracting data from CSV files...

📄 CSV Results:
   🤖 VLM Caption: A man is talking about a soccer game and holding a soccer ball.
   🧠 GPT-4o Judgment: The test caption is severely lacking in completeness. It only covers a small part of the ground truth, specifically the man holding a soccer ball and speaking. It omits the promotional video, the graphic showing team logos, the wide shot of the soccer field, and the players warming up.
   📊 GPT-4o Score: 3
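
One caveat applies to rows where a model produced no caption or judgment. By default `pd.read_csv` parses an empty field as `NaN`, whereas a JSON source typically yields `''` or `None`, and `NaN` never compares equal to anything, itself included. The sketch below shows a normalizer that could be applied to both sides before comparing. `normalize_missing` is a hypothetical helper, and treating blank strings as missing is an assumption about how `generate_csv_from_data.py` writes absent values.

```python
import math
from typing import Any, Optional


def normalize_missing(value: Any) -> Optional[Any]:
    """Return None for None, NaN, or blank strings; otherwise return the value unchanged."""
    if value is None:
        return None
    # numpy.float64 subclasses Python's float, so values read by pandas are covered too.
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str) and value.strip() == "":
        return None
    return value


# A raw comparison of a NaN cell against an empty JSON string fails...
print(float("nan") == "")
# ...while the normalized values agree that both sides are missing.
print(normalize_missing(float("nan")) == normalize_missing(""))
```

Normalizing both sides keeps a genuinely empty cell from being reported as a mismatch, while real text and numeric scores pass through untouched.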


## 📂 Extract Data from Original JSON Files

Extract the corresponding data from the original JSON files:

In [6]:
def extract_from_json(video_id: str, chunk_id: int, model_name: str) -> Tuple[Optional[str], Optional[str], Optional[float]]:
    """
    Extract VLM caption and GPT-4o evaluation data from original JSON files.
    
    Returns:
        Tuple of (vlm_caption, gpt4o_judgment, gpt4o_score)
    """
    
    # Extract VLM caption from original caption file
    vlm_caption = None
    chunks = original_captions.get('chunk_responses', [])
    
    for chunk in chunks:
        if chunk.get('chunk_id') == chunk_id:
            vlm_caption = chunk.get('content', '')
            break
    
    # Extract GPT-4o evaluation from original evaluation file (first run)
    gpt4o_judgment = None
    gpt4o_score = None
    
    if 'all_runs' in original_evaluations and original_evaluations['all_runs']:
        first_run = original_evaluations['all_runs'][0]
        evaluations = first_run.get('evaluations', [])
        
        for eval_data in evaluations:
            if (eval_data.get('video_id') == video_id and 
                eval_data.get('chunk_index') == chunk_id and 
                eval_data.get('model_name') == model_name):
                
                gpt4o_judgment = eval_data.get('judgment', '')
                gpt4o_score = eval_data.get('score', eval_data.get('overall_score'))
                break
    
    return vlm_caption, gpt4o_judgment, gpt4o_score

# Extract data from original JSON files
print(f"🔍 Extracting data from original JSON files...")
json_caption, json_judgment, json_score = extract_from_json(TEST_VIDEO_ID, TEST_CHUNK_ID, TEST_MODEL_NAME)

print(f"\n📄 JSON Results:")
print(f"   🤖 VLM Caption: {json_caption}")
print(f"   🧠 GPT-4o Judgment: {json_judgment}")
print(f"   📊 GPT-4o Score: {json_score}")

🔍 Extracting data from original JSON files...

📄 JSON Results:
   🤖 VLM Caption: A man is talking about a soccer game and holding a soccer ball.
   🧠 GPT-4o Judgment: The test caption is severely lacking in completeness. It only covers a small part of the ground truth, specifically the man holding a soccer ball and speaking. It omits the promotional video, the graphic showing team logos, the wide shot of the soccer field, and the players warming up.
   📊 GPT-4o Score: 3
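
`extract_from_json` scans the first run's `evaluations` list from the start on every lookup. When many data points are checked, building a dictionary once may be simpler and faster. The sketch below assumes the record fields used in the cell above (`video_id`, `chunk_index`, `model_name`). `build_eval_index` and the sample records are hypothetical.

```python
from typing import Any, Dict, List, Tuple

EvalKey = Tuple[Any, Any, Any]


def build_eval_index(evaluations: List[Dict[str, Any]]) -> Dict[EvalKey, Dict[str, Any]]:
    """Map (video_id, chunk_index, model_name) to the first matching record."""
    index: Dict[EvalKey, Dict[str, Any]] = {}
    for record in evaluations:
        key = (record.get("video_id"), record.get("chunk_index"), record.get("model_name"))
        # setdefault keeps the first record for a key, mirroring the `break` in the linear scan.
        index.setdefault(key, record)
    return index


# Made-up records for illustration only.
sample_evaluations = [
    {"video_id": "vidA", "chunk_index": 0, "model_name": "model-x", "judgment": "ok", "score": 7},
    {"video_id": "vidA", "chunk_index": 1, "model_name": "model-x", "judgment": "weak", "score": 3},
]

eval_index = build_eval_index(sample_evaluations)
record = eval_index.get(("vidA", 1, "model-x"))
print(record["score"] if record else None)
```

In the notebook, the list passed in would be `original_evaluations['all_runs'][0].get('evaluations', [])`, and a missing key simply returns `None` from `eval_index.get(...)`.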


## ⚖️ Compare and Verify Data

Perform detailed comparison between CSV and JSON data:

In [7]:
def compare_data(csv_data: Tuple, json_data: Tuple, test_params: Dict[str, Any]) -> Dict[str, bool]:
    """
    Compare data from CSV and JSON sources.
    
    Returns:
        Dictionary with comparison results for each data type
    """
    
    csv_caption, csv_judgment, csv_score = csv_data
    json_caption, json_judgment, json_score = json_data
    
    comparison_results = {
        'caption_match': False,
        'judgment_match': False,
        'score_match': False,
        'all_match': False
    }
    
    print(f"🔍 DETAILED COMPARISON")
    print(f"{'='*60}")
    
    # Compare VLM Caption
    print(f"\n📝 VLM CAPTION COMPARISON:")
    print(f"   CSV:  {repr(csv_caption)}")
    print(f"   JSON: {repr(json_caption)}")
    
    caption_match = csv_caption == json_caption
    comparison_results['caption_match'] = caption_match
    print(f"   ✅ Match: {caption_match}" if caption_match else f"   ❌ Mismatch: {caption_match}")
    
    # Compare GPT-4o Judgment
    print(f"\n🧠 GPT-4O JUDGMENT COMPARISON:")
    print(f"   CSV:  {repr(csv_judgment)}")
    print(f"   JSON: {repr(json_judgment)}")
    
    judgment_match = csv_judgment == json_judgment
    comparison_results['judgment_match'] = judgment_match
    print(f"   ✅ Match: {judgment_match}" if judgment_match else f"   ❌ Mismatch: {judgment_match}")
    
    # Compare GPT-4o Score
    print(f"\n📊 GPT-4O SCORE COMPARISON:")
    print(f"   CSV:  {csv_score}")
    print(f"   JSON: {json_score}")
    
    # Handle floating point comparison
    if csv_score is not None and json_score is not None:
        try:
            score_match = abs(float(csv_score) - float(json_score)) < 1e-10
        except (ValueError, TypeError):
            score_match = csv_score == json_score
    else:
        score_match = csv_score == json_score
    
    comparison_results['score_match'] = score_match
    print(f"   ✅ Match: {score_match}" if score_match else f"   ❌ Mismatch: {score_match}")
    
    # Overall match
    all_match = all([caption_match, judgment_match, score_match])
    comparison_results['all_match'] = all_match
    
    return comparison_results

# Perform the comparison
print(f"⚖️ Comparing data for:")
print(f"   Video: {TEST_VIDEO_ID}")
print(f"   Chunk: {TEST_CHUNK_ID}")
print(f"   Model: {TEST_MODEL_NAME}")

csv_data = (csv_caption, csv_judgment, csv_score)
json_data = (json_caption, json_judgment, json_score)
test_params = {
    'video_id': TEST_VIDEO_ID,
    'chunk_id': TEST_CHUNK_ID,
    'model_name': TEST_MODEL_NAME
}

comparison_results = compare_data(csv_data, json_data, test_params)

⚖️ Comparing data for:
   Video: 74GSMhR6oI0
   Chunk: 0
   Model: vila-1.5
🔍 DETAILED COMPARISON

📝 VLM CAPTION COMPARISON:
   CSV:  'A man is talking about a soccer game and holding a soccer ball.'
   JSON: 'A man is talking about a soccer game and holding a soccer ball.'
   ✅ Match: True

🧠 GPT-4O JUDGMENT COMPARISON:
   CSV:  'The test caption is severely lacking in completeness. It only covers a small part of the ground truth, specifically the man holding a soccer ball and speaking. It omits the promotional video, the graphic showing team logos, the wide shot of the soccer field, and the players warming up.'
   JSON: 'The test caption is severely lacking in completeness. It only covers a small part of the ground truth, specifically the man holding a soccer ball and speaking. It omits the promotional video, the graphic showing team logos, the wide shot of the soccer field, and the players warming up.'
   ✅ Match: True

📊 GPT-4O SCORE COMPARISON:
   CSV:  3
   JS
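
When a caption or judgment comparison fails, the boolean result does not say where the two strings diverge, and differences such as trailing whitespace are hard to spot in long judgments. The sketch below locates the first differing position. `first_difference` is a hypothetical helper and assumes both inputs are either `str` or `None`.

```python
from typing import Optional, Tuple


def first_difference(a: Optional[str], b: Optional[str], context: int = 20) -> Optional[Tuple[int, str, str]]:
    """Return (index, snippet_a, snippet_b) at the first divergence, or None if equal."""
    if a == b:
        return None
    # Treat a missing side as empty text so the snippets can still be sliced.
    a_text, b_text = a or "", b or ""
    limit = min(len(a_text), len(b_text))
    # First index where the characters differ; if one string is a prefix of
    # the other, the divergence sits at the end of the shorter string.
    index = next((i for i in range(limit) if a_text[i] != b_text[i]), limit)
    start = max(0, index - context)
    end = index + context
    return index, a_text[start:end], b_text[start:end]


# Illustrative strings: the second has a trailing space the first lacks.
csv_text = "A man is holding a soccer ball."
json_text = "A man is holding a soccer ball. "
print(first_difference(csv_text, json_text))
```

Printing the snippets with `repr` (as the comparison cell already does for the full strings) makes invisible characters such as trailing spaces or `\n` visible.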

## 📋 Final Verification Results

Display the final verification status and recommendations:

In [8]:
# Display final verification results
print(f"\n🎯 FINAL VERIFICATION RESULTS")
print(f"{'='*60}")

def display_verification_status(results: Dict[str, bool]):
    """Display verification results in a formatted way."""
    
    status_icon = "✅" if results['all_match'] else "❌"
    
    print(f"\n{status_icon} OVERALL STATUS: {'PASSED' if results['all_match'] else 'FAILED'}")
    print(f"\n📊 Individual Results:")
    print(f"   📝 VLM Caption:     {'✅ MATCH' if results['caption_match'] else '❌ MISMATCH'}")
    print(f"   🧠 GPT-4o Judgment: {'✅ MATCH' if results['judgment_match'] else '❌ MISMATCH'}")
    print(f"   📊 GPT-4o Score:    {'✅ MATCH' if results['score_match'] else '❌ MISMATCH'}")
    
    if results['all_match']:
        print(f"\n🎉 SUCCESS: CSV files are correctly generated!")
        print(f"   The data in CSV files perfectly matches the original JSON sources.")
        print(f"   You can confidently use the CSV files for your analysis.")
    else:
        print(f"\n⚠️  WARNING: Data mismatch detected!")
        print(f"   There are differences between CSV and JSON sources.")
        print(f"   Please check the CSV generation script or data processing logic.")
    
    print(f"\n💡 RECOMMENDATIONS:")
    if results['all_match']:
        print(f"   • CSV files are ready for use")
        print(f"   • Test additional data points to ensure consistency")
        print(f"   • Consider spot-checking random samples for extra validation")
    else:
        print(f"   • Investigate the cause of data mismatches")
        print(f"   • Check CSV generation logic in generate_csv_from_data.py")
        print(f"   • Verify data alignment between sources")
        print(f"   • Re-run CSV generation if necessary")

# Display the verification status
display_verification_status(comparison_results)

# Additional testing suggestion
print(f"\n🔄 ADDITIONAL TESTING:")
print(f"To test more data points, modify the test parameters in the configuration cell (In [2]) and re-run the notebook:")
print(f"   • Try different video_ids: {list(captions_df['video_id'].unique())}")
print(f"   • Try different chunk_ids: 0 to max_chunks_per_video")
print(f"   • Try different models: vila-1.5, nvila, cosmos_reason1, qwen3-vl-30b-a3b-instruct")


🎯 FINAL VERIFICATION RESULTS

✅ OVERALL STATUS: PASSED

📊 Individual Results:
   📝 VLM Caption:     ✅ MATCH
   🧠 GPT-4o Judgment: ✅ MATCH
   📊 GPT-4o Score:    ✅ MATCH

🎉 SUCCESS: CSV files are correctly generated!
   The data in CSV files perfectly matches the original JSON sources.
   You can confidently use the CSV files for your analysis.

💡 RECOMMENDATIONS:
   • CSV files are ready for use
   • Test additional data points to ensure consistency
   • Consider spot-checking random samples for extra validation

🔄 ADDITIONAL TESTING:
To test more data points, modify the test parameters in the configuration cell (In [2]) and re-run the notebook:
   • Try different video_ids: ['74GSMhR6oI0', 'Mhs73xQWo5g', 'WnzPCvaxYvs', 'WuFL2bJm2yo', 'aSHaM2GcjXY', 'hwxQXfHgLhI']
   • Try different chunk_ids: 0 to max_chunks_per_video
   • Try different models: vila-1.5, nvila, cosmos_reason1, qwen3-vl-30b-a3b-instruct


## 🚀 Quick Batch Testing (Optional)

Run this cell to test multiple random data points for comprehensive validation:

In [10]:
# Quick batch testing (optional)
def quick_batch_test(num_tests: int = 5) -> Dict[str, int]:
    """
    Test multiple random data points for comprehensive validation.
    
    Returns:
        Dictionary with test statistics
    """
    
    import random
    
    # Get available test parameters
    available_videos = list(captions_df['video_id'].unique())
    available_models = ['vila-1.5', 'nvila', 'cosmos_reason1', 'qwen3-vl-30b-a3b-instruct']
    
    results = {
        'total_tests': 0,
        'passed_tests': 0,
        'failed_tests': 0
    }
    
    print(f"🚀 Running {num_tests} random verification tests...")
    print(f"{'='*60}")
    
    for i in range(num_tests):
        # Select random parameters
        video_id = random.choice(available_videos)
        model_name = random.choice(available_models)
        
        # Get available chunks for this video
        video_chunks = captions_df[captions_df['video_id'] == video_id]['chunk_id'].tolist()
        chunk_id = random.choice(video_chunks)
        
        print(f"\n🧪 Test {i+1}: {video_id}, chunk {chunk_id}, {model_name}")
        
        try:
            # Load original JSON for this model/video
            caption_file = Path(CAPTIONS_DIR) / model_name / video_id / f"vlm_captions_{video_id}.json"
            if not caption_file.exists():
                print(f"   ⚠️  Skipping - caption file not found")
                continue
                
            with open(caption_file, 'r', encoding='utf-8') as f:
                caption_data = json.load(f)
            
            # Extract data from both sources
            csv_caption, csv_judgment, csv_score = extract_from_csv(video_id, chunk_id, model_name)
            
            # Extract from JSON
            json_caption = None
            chunks = caption_data.get('chunk_responses', [])
            for chunk in chunks:
                if chunk.get('chunk_id') == chunk_id:
                    json_caption = chunk.get('content', '')
                    break
            
            json_judgment = None
            json_score = None
            if 'all_runs' in original_evaluations and original_evaluations['all_runs']:
                first_run = original_evaluations['all_runs'][0]
                evaluations = first_run.get('evaluations', [])
                
                for eval_data in evaluations:
                    if (eval_data.get('video_id') == video_id and 
                        eval_data.get('chunk_index') == chunk_id and 
                        eval_data.get('model_name') == model_name):
                        
                        json_judgment = eval_data.get('judgment', '')
                        json_score = eval_data.get('score', eval_data.get('overall_score'))
                        break
            
            # Compare
            caption_match = csv_caption == json_caption
            judgment_match = csv_judgment == json_judgment
            
            if csv_score is not None and json_score is not None:
                try:
                    score_match = abs(float(csv_score) - float(json_score)) < 1e-10
                except (ValueError, TypeError):
                    score_match = csv_score == json_score
            else:
                score_match = csv_score == json_score
            
            all_match = all([caption_match, judgment_match, score_match])
            
            results['total_tests'] += 1
            if all_match:
                results['passed_tests'] += 1
                print(f"   ✅ PASSED")
            else:
                results['failed_tests'] += 1
                print(f"   ❌ FAILED (Caption: {caption_match}, Judgment: {judgment_match}, Score: {score_match})")
        
        except Exception as e:
            print(f"   ⚠️  Error: {str(e)}")
    
    return results

# Run batch testing on 5 random data points
batch_results = quick_batch_test(5)
print(f"\n📊 BATCH TEST SUMMARY:")
print(f"   Total: {batch_results['total_tests']}")
print(f"   Passed: {batch_results['passed_tests']} ✅")
print(f"   Failed: {batch_results['failed_tests']} ❌")
if batch_results['total_tests'] > 0:
    success_rate = (batch_results['passed_tests'] / batch_results['total_tests']) * 100
    print(f"   Success Rate: {success_rate:.1f}%")

print("💡 Re-run this cell to test a different random sample of data points.")

🚀 Running 5 random verification tests...

🧪 Test 1: WnzPCvaxYvs, chunk 56, cosmos_reason1
   ✅ PASSED

🧪 Test 2: WuFL2bJm2yo, chunk 46, nvila
   ✅ PASSED

🧪 Test 3: 74GSMhR6oI0, chunk 56, vila-1.5
   ✅ PASSED

🧪 Test 4: WuFL2bJm2yo, chunk 39, cosmos_reason1
   ✅ PASSED

🧪 Test 5: Mhs73xQWo5g, chunk 215, qwen3-vl-30b-a3b-instruct
   ✅ PASSED

📊 BATCH TEST SUMMARY:
   Total: 5
   Passed: 5 ✅
   Failed: 0 ❌
   Success Rate: 100.0%
💡 Re-run this cell to test a different random sample of data points.
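
Because `quick_batch_test` draws from the global `random` state, each run checks different points, so a failing combination cannot be replayed on demand. The sketch below shows reproducible sampling with a dedicated, seeded `random.Random` instance. `sample_test_points`, the seed value, and the sample points are hypothetical. Unlike the cell above, it samples uniformly over all `(video_id, chunk_id, model_name)` combinations and without replacement, so no point is tested twice.

```python
import random
from itertools import product
from typing import List, Sequence, Tuple

TestPoint = Tuple[str, int, str]


def sample_test_points(points: Sequence[TestPoint], num_tests: int, seed: int = 0) -> List[TestPoint]:
    """Pick up to num_tests distinct points; the same seed yields the same picks."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    count = min(num_tests, len(points))
    return rng.sample(list(points), count)


# Made-up combinations for illustration only.
all_points = list(product(["vidA", "vidB"], range(3), ["model-x", "model-y"]))

first = sample_test_points(all_points, 5, seed=42)
second = sample_test_points(all_points, 5, seed=42)
print(first == second)
```

Recording the seed alongside a batch summary would let a failed run be repeated exactly. Passing a `num_tests` at least as large as the number of points turns the sample into an exhaustive check.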
