# LLM-Generated Data Structure Analyzer

This notebook analyzes structural variation in the synthetically generated poem dataset.
Because the data is LLM-generated, rows may not share a perfectly consistent structure.

**Goal:** Identify all unique structures, categorize them, and help you decide which ones to keep, salvage, or discard.

In [1]:
import json
import os
from collections import defaultdict, Counter
from typing import Dict, Any, List, Set
import pandas as pd
from pathlib import Path

print("✅ Imports loaded")

✅ Imports loaded


## Cell 1: Configuration & Load Data

In [2]:
DATA_FILE = "../data/poem_finetune_13000.jsonl"

# Load all data
all_records = []
load_errors = []

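# Read as UTF-8 explicitly so non-ASCII characters in the poems decode the same on every platform
with open(DATA_FILE, 'r', encoding='utf-8') as f: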
    for line_no, line in enumerate(f, 1):
        try:
            record = json.loads(line)
            all_records.append(record)
        except json.JSONDecodeError as e:
            load_errors.append({"line": line_no, "error": str(e), "sample": line[:100]})

print(f"✅ Loaded {len(all_records)} records from {DATA_FILE}")
if load_errors:
    print(f"⚠️  {len(load_errors)} JSON parse errors")
    print("  First 3 errors:")
    for err in load_errors[:3]:
        print(f"    Line {err['line']}: {err['error']}")

✅ Loaded 13468 records from ../data/poem_finetune_13000.jsonl


## Cell 2: Utility Functions for Structure Analysis

In [3]:
def get_structure_signature(obj: Any) -> str:
    """Generate a signature representing the nested structure of an object."""
    if isinstance(obj, dict):
        keys = sorted(obj.keys())
        # Include each value's signature so differences in nested structure are visible
        values_sig = ",".join(f"{k}:{get_structure_signature(obj[k])}" for k in keys)
        return f"dict({values_sig})"
    elif isinstance(obj, list):
        if not obj:
            return "list(empty)"
        first_sig = get_structure_signature(obj[0])
        return f"list[{first_sig}]"
    else:
        return type(obj).__name__

def get_keys_at_path(obj: Any, path: str = "") -> Set[str]:
    """Recursively extract all keys at any nesting level."""
    keys = set()
    if isinstance(obj, dict):
        for k in obj.keys():
            keys.add(f"{path}.{k}" if path else k)
            keys.update(get_keys_at_path(obj[k], f"{path}.{k}" if path else k))
    elif isinstance(obj, list) and obj:
        keys.update(get_keys_at_path(obj[0], f"{path}[0]"))
    return keys

def validate_ideal_structure(record: Dict) -> tuple[bool, List[str]]:
    """
    Check if a record matches the ideal structure:
    {
        "poem_verse": str,
        "data": {
            "meaning": str,
            "queries": {
                "neutral": [str, str, str, str, str],
                "user": [str, str, str, str, str]
            }
        }
    }
    Returns: (is_valid, list_of_issues)
    """
    issues = []
    
    # Top level
    if "poem_verse" not in record:
        issues.append("Missing 'poem_verse' at top level")
    elif not isinstance(record.get("poem_verse"), str):
        issues.append(f"'poem_verse' should be str, got {type(record['poem_verse']).__name__}")
    
    if "data" not in record:
        issues.append("Missing 'data' at top level")
        return len(issues) == 0, issues
    
    data = record.get("data")
    if not isinstance(data, dict):
        issues.append(f"'data' should be dict, got {type(data).__name__}")
        return len(issues) == 0, issues
    
    # Data level
    if "meaning" not in data:
        issues.append("Missing 'data.meaning'")
    elif not isinstance(data.get("meaning"), str):
        issues.append(f"'data.meaning' should be str, got {type(data['meaning']).__name__}")
    
    if "queries" not in data:
        issues.append("Missing 'data.queries'")
        return len(issues) == 0, issues
    
    queries = data.get("queries")
    if not isinstance(queries, dict):
        issues.append(f"'data.queries' should be dict, got {type(queries).__name__}")
        return len(issues) == 0, issues
    
    # Queries level
    for key in ["neutral", "user"]:
        if key not in queries:
            issues.append(f"Missing 'data.queries.{key}'")
        else:
            val = queries[key]
            if not isinstance(val, list):
                issues.append(f"'data.queries.{key}' should be list, got {type(val).__name__}")
            elif len(val) != 5:
                issues.append(f"'data.queries.{key}' should have 5 items, got {len(val)}")
            elif not all(isinstance(item, str) for item in val):
                non_str = [i for i, item in enumerate(val) if not isinstance(item, str)]
                issues.append(f"'data.queries.{key}' has non-string items at indices: {non_str}")
    
    return len(issues) == 0, issues

print("✅ Utility functions defined")

✅ Utility functions defined
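
One limitation worth noting: `get_structure_signature` inspects only the first element of a list, so a list that mixes strings and dicts gets the same signature as a uniform one. Below is a minimal sketch of a variant that records every distinct element shape found in a list. The name `mixed_aware_signature` is hypothetical and not part of the original notebook.

```python
from typing import Any


def mixed_aware_signature(obj: Any) -> str:
    """Structure signature that reports every distinct element shape in a list.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    if isinstance(obj, dict):
        parts = [f"{k}:{mixed_aware_signature(obj[k])}" for k in sorted(obj)]
        return "dict(" + ",".join(parts) + ")"
    if isinstance(obj, list):
        if not obj:
            return "list(empty)"
        # Collect the distinct signatures of all elements, not just obj[0].
        element_sigs = sorted({mixed_aware_signature(item) for item in obj})
        return "list[" + "|".join(element_sigs) + "]"
    return type(obj).__name__


uniform = ["a", "b", "c"]
mixed = [{"query": "a"}, "b", {"query": "c"}]

print(mixed_aware_signature(uniform))  # list[str]
print(mixed_aware_signature(mixed))    # list[dict(query:str)|str]
```

Sorting the element signatures keeps the result stable regardless of the order in which the shapes appear in the list.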


## Cell 3: Analyze All Structures

In [4]:
# Analyze all records
structure_signatures = defaultdict(list)
ideal_records = []
non_ideal_records = defaultdict(list)

print("Analyzing all records...")
for idx, record in enumerate(all_records):
    is_ideal, issues = validate_ideal_structure(record)
    
    if is_ideal:
        ideal_records.append({"index": idx, "record": record})
    else:
        sig = get_structure_signature(record)
        structure_signatures[sig].append(idx)
        
        # Group by issue type
        for issue in issues:
            non_ideal_records[issue].append(idx)

print(f"\n{'='*60}")
print("STRUCTURE ANALYSIS SUMMARY")
print(f"{'='*60}")
print(f"✅ Ideal Records: {len(ideal_records)}")
print(f"⚠️  Non-Ideal Records: {len(all_records) - len(ideal_records)}")
print(f"   Percentage Ideal: {100 * len(ideal_records) / len(all_records):.1f}%")

print(f"\n{'='*60}")
print("NON-IDEAL RECORDS BY ISSUE TYPE")
print(f"{'='*60}")
for issue, indices in sorted(non_ideal_records.items(), key=lambda x: -len(x[1])):
    print(f"{len(indices):5d} records: {issue}")

Analyzing all records...

STRUCTURE ANALYSIS SUMMARY
✅ Ideal Records: 39
⚠️  Non-Ideal Records: 13429
   Percentage Ideal: 0.3%

NON-IDEAL RECORDS BY ISSUE TYPE
13409 records: 'data.queries.user' has non-string items at indices: [0, 1, 2, 3, 4]
   15 records: 'data.queries.user' has non-string items at indices: [0, 1, 3, 4]
    5 records: 'data.queries.user' has non-string items at indices: [0, 1, 2, 4]
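
The three issue lines above differ only in their index lists, so they describe a single underlying problem. Below is a sketch of collapsing such variants into one bucket by stripping the trailing index list with a regular expression. The helper name `issue_kind` is hypothetical, and the counts simply reuse the three figures printed above.

```python
import re
from collections import Counter

# Matches the trailing " at indices: [0, 1, 2]" part of an issue message.
_INDEX_SUFFIX = re.compile(r" at indices: \[[\d, ]*\]$")


def issue_kind(issue: str) -> str:
    """Drop the index list so variants of one issue share a key."""
    return _INDEX_SUFFIX.sub("", issue)


# Issue messages and record counts copied from the output above.
issue_counts = {
    "'data.queries.user' has non-string items at indices: [0, 1, 2, 3, 4]": 13409,
    "'data.queries.user' has non-string items at indices: [0, 1, 3, 4]": 15,
    "'data.queries.user' has non-string items at indices: [0, 1, 2, 4]": 5,
}

by_kind = Counter()
for issue, count in issue_counts.items():
    by_kind[issue_kind(issue)] += count

for kind, count in by_kind.most_common():
    print(f"{count:5d} records: {kind}")
```

With the figures above, all 13,429 non-ideal records land in one bucket, which suggests a single systematic deviation rather than many unrelated defects.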


## Cell 4: Detailed Examples of Non-Ideal Structures

In [5]:
# Show examples of non-ideal records by issue type
for issue in sorted(non_ideal_records.keys())[:10]:  # First 10 distinct issues
    indices = non_ideal_records[issue]
    print(f"\n{'='*60}")
    print(f"ISSUE: {issue}")
    print(f"Count: {len(indices)} records")
    print(f"{'='*60}")
    
    # Show first example
    idx = indices[0]
    record = all_records[idx]
    print(f"\nExample (index {idx}):")
    print(json.dumps(record, indent=2, ensure_ascii=False)[:500] + "...")
    print(f"\nOther affected indices: {indices[1:min(6, len(indices))]}")
    # The example plus up to five other indices have been shown, so six in total
    if len(indices) > 6:
        print(f"... and {len(indices) - 6} more")


ISSUE: 'data.queries.user' has non-string items at indices: [0, 1, 2, 3, 4]
Count: 13409 records

Example (index 0):
{
  "poem_verse": "and spite of nature tear her from thy soul",
  "data": {
    "meaning": "Even though it goes against human nature, you should force yourself to let go of someone or something that is deeply connected to your emotions or identity.",
    "queries": {
      "neutral": [
        "How can someone overcome an attachment that feels like it’s part of their very being?",
        "What does it mean to let go of something that feels essential to your identity?",
        "Explain the idea...

Other affected indices: [1, 2, 3, 4, 5]
... and 13403 more

ISSUE: 'data.queries.user' has non-string items at indices: [0, 1, 2, 4]
Count: 5 records

Example (index 2828):
{
  "poem_verse": "it cur'd diseases heal'd the bleeding wound",
  "data": {
    "meaning": "It had the power to treat illnesses and stop bleeding injuries, effectively restoring health and preventing f

## Cell 5: Query Count Analysis

In [6]:
# Analyze query counts in all records
neutral_query_counts = Counter()
user_query_counts = Counter()

for record in all_records:
    try:
        if isinstance(record.get("data"), dict) and isinstance(record["data"].get("queries"), dict):
            neutral = record["data"]["queries"].get("neutral", [])
            user = record["data"]["queries"].get("user", [])
            neutral_query_counts[len(neutral)] += 1
            user_query_counts[len(user)] += 1
    except TypeError:
        # A query field without a length (e.g. None or a number); skip the record
        continue

print(f"\n{'='*60}")
print("NEUTRAL QUERIES COUNT DISTRIBUTION")
print(f"{'='*60}")
for count in sorted(neutral_query_counts.keys()):
    pct = 100 * neutral_query_counts[count] / len(all_records)
    bar = "█" * int(pct / 2)
    print(f"{count:2d} queries: {neutral_query_counts[count]:5d} records ({pct:5.1f}%) {bar}")

print(f"\n{'='*60}")
print("USER QUERIES COUNT DISTRIBUTION")
print(f"{'='*60}")
for count in sorted(user_query_counts.keys()):
    pct = 100 * user_query_counts[count] / len(all_records)
    bar = "█" * int(pct / 2)
    print(f"{count:2d} queries: {user_query_counts[count]:5d} records ({pct:5.1f}%) {bar}")


NEUTRAL QUERIES COUNT DISTRIBUTION
 5 queries: 13468 records (100.0%) ██████████████████████████████████████████████████

USER QUERIES COUNT DISTRIBUTION
 5 queries: 13468 records (100.0%) ██████████████████████████████████████████████████


## Cell 6: Salvageable Records Analysis

Records that contain every required field but deviate in query count or in the type of the query items.


In [7]:
# Categorize all records
salvageable = []  # Has all fields but different counts
fixable = []      # Missing one or two fields that can be added/fixed
not_salvageable = []  # Too broken

for idx, record in enumerate(all_records):
    is_ideal, issues = validate_ideal_structure(record)
    
    if is_ideal:
        continue  # Already in ideal_records
    
    # Check if it has all top-level keys and is just missing/mismatched queries
    if "poem_verse" in record and "data" in record:
        data = record.get("data")
        if isinstance(data, dict) and "meaning" in data and "queries" in data:
            # Has all the structure, might just have different query counts
            salvageable.append({"index": idx, "issues": issues, "record": record})
        else:
            # Missing meaning or queries
            fixable.append({"index": idx, "issues": issues, "record": record})
    else:
        # Missing top-level fields
        not_salvageable.append({"index": idx, "issues": issues})

print(f"\n{'='*60}")
print("RECORD CATEGORIZATION")
print(f"{'='*60}")
print(f"✅ Ideal (perfect structure):           {len(ideal_records):5d} ({100*len(ideal_records)/len(all_records):5.1f}%)")
print(f"🔧 Salvageable (can trim/pad queries): {len(salvageable):5d} ({100*len(salvageable)/len(all_records):5.1f}%)")
print(f"⚠️  Fixable (missing minor fields):      {len(fixable):5d} ({100*len(fixable)/len(all_records):5.1f}%)")
print(f"❌ Not Salvageable (broken structure): {len(not_salvageable):5d} ({100*len(not_salvageable)/len(all_records):5.1f}%)")
print(f"\nTotal usable (Ideal + Salvageable + Fixable): {len(ideal_records) + len(salvageable) + len(fixable):5d} ({100*(len(ideal_records) + len(salvageable) + len(fixable))/len(all_records):5.1f}%)")


RECORD CATEGORIZATION
✅ Ideal (perfect structure):              39 (  0.3%)
🔧 Salvageable (can trim/pad queries): 13429 ( 99.7%)
⚠️  Fixable (missing minor fields):          0 (  0.0%)
❌ Not Salvageable (broken structure):     0 (  0.0%)

Total usable (Ideal + Salvageable + Fixable): 13468 (100.0%)


## Cell 7: Show Examples from Each Category

In [8]:
print(f"{'='*60}")
print("IDEAL RECORD EXAMPLE")
print(f"{'='*60}")
if ideal_records:
    example = ideal_records[0]["record"]
    print(json.dumps(example, indent=2, ensure_ascii=False)[:800])
    print("... (truncated)")

print(f"\n{'='*60}")
print("SALVAGEABLE RECORD EXAMPLES")
print(f"{'='*60}")
if salvageable:
    for i, item in enumerate(salvageable[:2]):
        print(f"\nExample {i+1} (index {item['index']}):")
        print(f"Issues: {item['issues']}")
        example = item["record"]
        if "data" in example and "queries" in example["data"]:
            q = example["data"]["queries"]
            print(f"  neutral queries: {len(q.get('neutral', []))} items")
            print(f"  user queries: {len(q.get('user', []))} items")
        print(json.dumps(example, indent=2, ensure_ascii=False)[:600])
        print("... (truncated)")

print(f"\n{'='*60}")
print("FIXABLE RECORD EXAMPLES")
print(f"{'='*60}")
if fixable:
    for i, item in enumerate(fixable[:2]):
        print(f"\nExample {i+1} (index {item['index']}):")
        print(f"Issues: {item['issues']}")
        example = item["record"]
        print(json.dumps(example, indent=2, ensure_ascii=False)[:600])
        print("... (truncated)")

print(f"\n{'='*60}")
print("NOT SALVAGEABLE EXAMPLES")
print(f"{'='*60}")
if not_salvageable:
    for i, item in enumerate(not_salvageable[:2]):
        print(f"\nExample {i+1} (index {item['index']}):")
        print(f"Issues: {item['issues']}")
        example = all_records[item['index']]
        print(json.dumps(example, indent=2, ensure_ascii=False)[:600])
        print("... (truncated)")

IDEAL RECORD EXAMPLE
{
  "poem_verse": "and grecian groves her long and lov'd abode",
  "data": {
    "meaning": "She lived for a long time in the peaceful, beautiful forests of ancient Greece, a place she cherished deeply.",
    "queries": {
      "neutral": [
        "What does 'Grecian groves her long and lov'd abode' mean in modern terms?",
        "How would you rephrase 'and grecian groves her long and lov'd abode' for a contemporary audience?",
        "Can you simplify the line 'and grecian groves her long and lov'd abode' into plain language?",
        "What imagery is conveyed by the phrase 'grecian groves her long and lov'd abode'?",
        "Explain the meaning behind the poetic line 'and grecian groves her long and lov'd abode' in a straightforward way."
      ],
      "user": [
        "Hey, I'm 
... (truncated)

SALVAGEABLE RECORD EXAMPLES

Example 1 (index 0):
Issues: ["'data.queries.user' has non-string items at indices: [0, 1, 2, 3, 4]"]
  neutral queries: 5 items
  u

## Cell 8: Summary & Next Steps

Based on this analysis, you can now decide:

1. **Ideal Records**: Use as-is (no modifications needed)
2. **Salvageable Records**: Decide if you want to:
   - Keep only records with 5 queries each (strict)
   - Pad with empty strings or trim to 5 (moderate)
   - Or skip them entirely (conservative)
3. **Fixable Records**: Decide if missing fields can be added or skipped
4. **Not Salvageable**: Discard entirely

Run the next cell to generate a decision template.
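
The three strategies listed for salvageable records can be sketched as a single function. This is an illustrative sketch, not the notebook's own processing code: the function name `apply_query_policy`, the policy labels, and the choice of empty strings for padding are all assumptions.

```python
from typing import List, Optional


def apply_query_policy(queries: List[str], policy: str, target: int = 5) -> Optional[List[str]]:
    """Return the adjusted query list, or None if the record should be skipped.

    Hypothetical helper; the policy names are assumptions made for this sketch.
    """
    if policy == "strict":
        # Keep the list only if it already has exactly `target` items.
        return list(queries) if len(queries) == target else None
    if policy == "moderate":
        # Trim long lists and pad short ones with empty strings.
        trimmed = list(queries[:target])
        return trimmed + [""] * (target - len(trimmed))
    if policy == "conservative":
        # Skip every salvageable record.
        return None
    raise ValueError(f"Unknown policy: {policy!r}")


print(apply_query_policy(["q1", "q2", "q3"], "moderate"))
print(apply_query_policy(["q1", "q2", "q3"], "strict"))
```

Padding with empty strings keeps the list length uniform but introduces empty prompts, so a downstream step would probably need to filter those out before training.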


In [9]:
# Decision template for the user
print(f"\n{'='*60}")
print("DECISION TEMPLATE - Fill this in to decide your strategy:")
print(f"{'='*60}")

def has_five_of_each(record: Dict) -> bool:
    """Return True if both query lists contain exactly 5 items."""
    queries = record["data"]["queries"]
    return isinstance(queries, dict) and all(
        isinstance(queries.get(key), list) and len(queries[key]) == 5
        for key in ("neutral", "user")
    )

# Salvageable records that already have exactly 5 queries in each list
exactly_5 = sum(has_five_of_each(s["record"]) for s in salvageable)

template = f"""
# STRUCTURE ANALYSIS DECISIONS

## 1. IDEAL RECORDS ({len(ideal_records)} records)
   Status: ✅ Perfect - Use all
   Action: ['use_all']

## 2. SALVAGEABLE RECORDS ({len(salvageable)} records)
   These have all required fields but different query counts.
   Options:
   - 'use_all': Keep all as-is (allows varied query counts)
   - 'trim_to_5': Use only first 5 queries from each
   - 'require_exactly_5': Skip if not exactly 5 queries
   - 'skip_all': Discard entirely
   Selected Action: ['require_exactly_5']  # <-- CHANGE THIS

## 3. FIXABLE RECORDS ({len(fixable)} records)
   These are missing 1-2 fields (meaning or queries).
   Options:
   - 'use_all': Try to fix and use
   - 'skip_all': Discard entirely
   Selected Action: ['skip_all']  # <-- CHANGE THIS

## 4. NOT SALVAGEABLE ({len(not_salvageable)} records)
   These are broken beyond repair.
   Status: ❌ Always discard
   Action: ['skip_all']

## SUMMARY OF YOUR CHOICES:
- If you keep all Ideal + Salvageable (as-is) + skip Fixable:
  Total usable: {len(ideal_records) + len(salvageable)} records
  
- If you keep Ideal + Salvageable with exactly 5 queries each + skip Fixable:
  Total usable: {len(ideal_records) + exactly_5} records
"""

print(template)
print("\nüìù Modify the 'Selected Action' values above, then use those decisions")
print("   to process and save your final dataset.")


DECISION TEMPLATE - Fill this in to decide your strategy:

# STRUCTURE ANALYSIS DECISIONS

## 1. IDEAL RECORDS (39 records)
   Status: ✅ Perfect - Use all
   Action: ['use_all']

## 2. SALVAGEABLE RECORDS (13429 records)
   These have all required fields but different query counts.
   Options:
   - 'use_all': Keep all as-is (allows varied query counts)
   - 'trim_to_5': Use only first 5 queries from each
   - 'require_exactly_5': Skip if not exactly 5 queries
   - 'skip_all': Discard entirely
   Selected Action: ['require_exactly_5']  # <-- CHANGE THIS

## 3. FIXABLE RECORDS (0 records)
   These are missing 1-2 fields (meaning or queries).
   Options:
   - 'use_all': Try to fix and use
   - 'skip_all': Discard entirely
   Selected Action: ['skip_all']  # <-- CHANGE THIS

## 4. NOT SALVAGEABLE (0 records)
   These are broken beyond repair.
   Status: ❌ Always discard
   Action: ['skip_all']

## SUMMARY OF YOUR CHOICES:
- If you keep all Ideal + Salvageable (as-is) + skip Fixable:
 

## Cell 9: Deep Dive - What are the non-string items in "user" queries?

In [10]:
# Let's examine what these non-string items actually are
from collections import defaultdict

non_string_types = defaultdict(int)
examples_by_type = {}

for record in all_records:
    try:
        user_queries = record["data"]["queries"]["user"]
        for i, item in enumerate(user_queries):
            if not isinstance(item, str):
                item_type = type(item).__name__
                non_string_types[item_type] += 1
                if item_type not in examples_by_type:
                    examples_by_type[item_type] = {
                        "value": item,
                        "index": i,
                        "from_record": record["poem_verse"][:50]
                    }
    except (KeyError, TypeError):
        # Record lacks data.queries.user or has an unexpected shape; skip it
        continue

print(f"\n{'='*60}")
print("NON-STRING ITEMS IN USER QUERIES - TYPE BREAKDOWN")
print(f"{'='*60}")
for type_name, count in sorted(non_string_types.items(), key=lambda x: -x[1]):
    print(f"{type_name:20s}: {count:6d} items")

print(f"\n{'='*60}")
print("EXAMPLES OF NON-STRING ITEMS")
print(f"{'='*60}")
for type_name, example in examples_by_type.items():
    print(f"\nType: {type_name}")
    print(f"Example value: {example['value']}")
    print(f"At query index: {example['index']}")
    print(f"From verse: ...{example['from_record']}...")
    print()


NON-STRING ITEMS IN USER QUERIES - TYPE BREAKDOWN
dict                :  67125 items

EXAMPLES OF NON-STRING ITEMS

Type: dict
Example value: {'persona': {'name': 'Alex', 'age': 28, 'profession': 'Grief counselor', 'context': 'Writing a self-help article on emotional detachment for clients struggling with loss.', 'tone': 'Reflective, slightly clinical, but empathetic'}, 'query': 'I’m working on a piece for my clients about how to move forward when grief feels like it’s woven into your soul. There’s this idea—almost painful—that sometimes you have to *tear* yourself away from the attachment, even if it feels unnatural. How would you phrase that in a way that resonates with someone who’s drowning in love or loss? It needs to sound raw but not cruel.'}
At query index: 0
From verse: ...and spite of nature tear her from thy soul...
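
Given the shape shown above, user queries can be brought back to plain strings by pulling out the `query` field, with the persona kept alongside if wanted. The sketch below assumes every non-string item is a dict with a string under `query`; only one example is printed above, so that assumption is worth verifying across the dataset. The function name `split_user_queries` is hypothetical, and the sample persona is abridged from the example above.

```python
from typing import Any, Dict, List, Optional, Tuple


def split_user_queries(items: List[Any]) -> Tuple[List[str], List[Optional[Dict[str, Any]]]]:
    """Split user query items into plain query strings and persona metadata.

    Strings pass through with persona None; dicts contribute their 'query'
    and 'persona' fields. Raises ValueError on any other shape.
    """
    queries: List[str] = []
    personas: List[Optional[Dict[str, Any]]] = []
    for position, item in enumerate(items):
        if isinstance(item, str):
            queries.append(item)
            personas.append(None)
        elif isinstance(item, dict) and isinstance(item.get("query"), str):
            queries.append(item["query"])
            personas.append(item.get("persona"))
        else:
            raise ValueError(f"Unexpected user query at position {position}: {item!r}")
    return queries, personas


sample = [
    {"persona": {"name": "Alex", "age": 28}, "query": "How would you phrase that?"},
    "A query that is already a plain string",
]
queries, personas = split_user_queries(sample)
print(queries)
print(personas)
```

Returning the personas as a parallel list lets the caller either discard them (plain-string output) or store them next to the queries, without committing to one layout here.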



## Cell 10: Summary & Recommendations

In [11]:
print(f"""
{'='*70}
ANALYSIS SUMMARY & RECOMMENDATIONS
{'='*70}

KEY FINDING:
============
Your dataset has TWO distinct structures:

1. IDEAL (39 records, 0.3%)
   ├─ Structure: poem_verse + data.meaning + data.queries.neutral + data.queries.user
   └─ All queries are plain strings (5 each)

2. SALVAGEABLE (13,429 records, 99.7%)
   ├─ Structure: poem_verse + data.meaning + data.queries.neutral + data.queries.user
   ├─ All neutral queries are plain strings (5 each) ✅
   └─ User queries are OBJECTS with two keys:
      ├─ 'persona': Contains persona metadata (dict)
      └─ 'query': Contains the actual query string

This is actually BETTER than the ideal structure!

{'='*70}
RECOMMENDATIONS:
{'='*70}

Option A: ENRICH THE DATASET (Recommended for better fine-tuning)
─────────────────────────────────────────────────────────────────
✓ Extract both 'persona' metadata AND the 'query' from user queries
✓ This gives you contextual information for each user query
✓ You can use personas to improve model diversity
✓ Result: 13,468 fully usable records with enriched metadata

Option B: SIMPLIFY THE DATASET (If personas aren't useful)
──────────────────────────────────────────────────────────
✓ Extract only the 'query' string from user query objects
✓ Flatten them to match neutral queries (plain strings)
✓ You lose persona metadata but gain simplicity
✓ Result: 13,468 fully usable records (all plain strings)

Option C: STRICT APPROACH (Conservative)
────────────────────────────────────────
✓ Keep only the 39 ideal records
✗ Discard 13,429 salvageable records (waste!)
✗ Not recommended unless you have specific reasons

{'='*70}
NEXT STEPS:
{'='*70}

1. Decide which approach you prefer (A, B, or C)
2. Create a data cleaning/transformation script
3. Apply chosen transformation
4. Validate cleaned dataset
5. Proceed to training with cleaned data

ALL RECORDS ARE SALVAGEABLE - No data loss needed!
{'='*70}
""")



ANALYSIS SUMMARY & RECOMMENDATIONS

KEY FINDING:
Your dataset has TWO distinct structures:

1. IDEAL (39 records, 0.3%)
   ├─ Structure: poem_verse + data.meaning + data.queries.neutral + data.queries.user
   └─ All queries are plain strings (5 each)

2. SALVAGEABLE (13,429 records, 99.7%)
   ├─ Structure: poem_verse + data.meaning + data.queries.neutral + data.queries.user
   ├─ All neutral queries are plain strings (5 each) ✅
   └─ User queries are OBJECTS with two keys:
      ├─ 'persona': Contains persona metadata (dict)
      └─ 'query': Contains the actual query string

This is actually BETTER than the ideal structure!

RECOMMENDATIONS:

Option A: ENRICH THE DATASET (Recommended for better fine-tuning)
─────────────────────────────────────────────────────────────────
✓ Extract both 'persona' metadata AND the 'query' from u