# Question Access and Filtering Methods

This notebook demonstrates the methods Karenina provides for accessing, filtering, searching, and counting benchmark questions.

## Overview

Karenina provides powerful methods to:
- **Access** questions with a harmonized API that returns full question objects
- **Filter** questions by system metadata (finished, has_template, author, etc.)
- **Filter** questions by custom metadata with flexible APIs
- **Search** questions with multi-term, regex, and multi-field support
- **Count** and analyze question distributions
- **Combine** methods for complex workflows

The custom-metadata methods work with **any custom metadata structure**: they make no assumptions about field names.

## Setup

In [None]:
from karenina.benchmark.benchmark import Benchmark

# Create a benchmark
benchmark = Benchmark.create(
    name="Question Access Demo",
    description="Demonstrating question access methods"
)

print(f"✓ Created benchmark: {benchmark.name}")

## Create Sample Questions

Add questions with various **custom metadata** to demonstrate filtering flexibility.

In [None]:
# Define questions with diverse metadata
questions_data = [
    {
        "question": "What is Python?",
        "raw_answer": "A high-level programming language",
        "finished": True,
        "custom_metadata": {
            "category": "programming",
            "difficulty": "easy",
            "tags": ["python", "basics"],
            "year": 2023
        }
    },
    {
        "question": "Explain quantum entanglement",
        "raw_answer": "A phenomenon where particles are correlated",
        "finished": True,
        "custom_metadata": {
            "category": "physics",
            "difficulty": "hard",
            "tags": ["quantum", "physics"],
            "year": 2024
        }
    },
    {
        "question": "What is machine learning?",
        "raw_answer": "AI algorithms that learn from data",
        "finished": False,
        "custom_metadata": {
            "category": "programming",
            "difficulty": "medium",
            "tags": ["ai", "ml"],
            "year": 2024
        }
    },
    {
        "question": "Describe DNA replication",
        "raw_answer": "Process of copying DNA molecules",
        "finished": True,
        "custom_metadata": {
            "category": "biology",
            "difficulty": "medium",
            "tags": ["biology", "dna"],
            "year": 2023
        }
    },
    {
        "question": "How does Python handle memory?",
        "raw_answer": "Through garbage collection and reference counting",
        "finished": False,
        "custom_metadata": {
            "category": "programming",
            "difficulty": "hard",
            "tags": ["python", "memory"],
            "year": 2024
        }
    },
]

# Add questions to benchmark
for q_data in questions_data:
    benchmark.add_question(
        question=q_data["question"],
        raw_answer=q_data["raw_answer"],
        finished=q_data["finished"],
        custom_metadata=q_data["custom_metadata"]
    )

print(f"✓ Added {len(benchmark)} questions to benchmark")

## 1. Basic Access

Retrieve every question, or a single question by its ID.

In [None]:
# Get all questions
all_questions = benchmark.get_all_questions()
print(f"Total questions: {len(all_questions)}")

# Get a specific question by ID
question_ids = benchmark.get_question_ids()
first_q = benchmark.get_question(question_ids[0])

print(f"\nFirst question: {first_q['question']}")
print(f"Category: {first_q['custom_metadata']['category']}")
print(f"Difficulty: {first_q['custom_metadata']['difficulty']}")

# get_all_questions() also supports the ids_only parameter
all_ids = benchmark.get_all_questions(ids_only=True)
print(f"\nAll question IDs: {len(all_ids)}")

## 2. Harmonized Access Methods

These methods return full question objects by default, which keeps them consistent with the filtering methods in the following sections.

In [None]:
# Get finished questions (returns question objects)
finished = benchmark.get_finished_questions()
print(f"Finished questions: {len(finished)}")

# Directly access question properties
for q in finished:
    print(f"  ✓ {q['question'][:50]}...")

# Get unfinished questions
unfinished = benchmark.get_unfinished_questions()
print(f"\nUnfinished questions: {len(unfinished)}")

for q in unfinished:
    print(f"  ○ {q['question'][:50]}...")

In [None]:
# If you need just IDs, use ids_only=True
finished_ids = benchmark.get_finished_questions(ids_only=True)
unfinished_ids = benchmark.get_unfinished_questions(ids_only=True)

print(f"Finished IDs: {len(finished_ids)}")
print(f"Unfinished IDs: {len(unfinished_ids)}")
print(f"Total: {len(finished_ids) + len(unfinished_ids)}")

## 3. Filtering by System Metadata

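Use `filter_questions()` to filter on built-in Karenina fields (`finished`, `has_template`, `has_rubric`, `author`), optionally combined with a `custom_filter` lambda.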

In [None]:
# Filter by finished status
finished_qs = benchmark.filter_questions(finished=True)
print(f"Finished: {len(finished_qs)} questions")

# Combine a system field with a custom lambda for more complex logic
finished_hard = benchmark.filter_questions(
    finished=True,
    custom_filter=lambda q: q.get("custom_metadata", {}).get("difficulty") == "hard"
)

print(f"\nFinished + Hard: {len(finished_hard)} questions")
for q in finished_hard:
    print(f"  - {q['question']}")
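
Since `custom_filter` takes a callable, longer conditions can be built from small reusable predicates instead of one long lambda. The sketch below is plain Python and does not call Karenina. `metadata_equals` and `all_of` are hypothetical helpers, and the sample list only mimics the dict shape of the question objects used above.

```python
from typing import Any, Callable

Question = dict[str, Any]
Predicate = Callable[[Question], bool]


def metadata_equals(key: str, value: Any) -> Predicate:
    """Build a predicate that checks one custom_metadata key (hypothetical helper)."""
    return lambda q: q.get("custom_metadata", {}).get(key) == value


def all_of(*predicates: Predicate) -> Predicate:
    """Combine predicates with AND logic (hypothetical helper)."""
    return lambda q: all(p(q) for p in predicates)


# Stand-in data shaped like the question dicts in this notebook.
sample = [
    {"question": "What is Python?",
     "custom_metadata": {"difficulty": "easy", "year": 2023}},
    {"question": "Explain quantum entanglement",
     "custom_metadata": {"difficulty": "hard", "year": 2024}},
    {"question": "No metadata here"},  # missing custom_metadata is tolerated
]

hard_2024 = all_of(
    metadata_equals("difficulty", "hard"),
    metadata_equals("year", 2024),
)
matches = [q["question"] for q in sample if hard_2024(q)]
print(matches)
```

A predicate such as `hard_2024` could then be passed as `custom_filter=hard_2024`, assuming `filter_questions()` calls it with one question dict at a time, as the lambda above suggests.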

## 4. Filtering by Custom Metadata

### Method 1: `filter_by_custom_metadata()` - Simple AND logic

In [None]:
# Filter by multiple criteria (AND logic)
prog_hard = benchmark.filter_by_custom_metadata(
    category="programming",
    difficulty="hard"
)

print(f"Programming AND Hard: {len(prog_hard)} questions")
for q in prog_hard:
    print(f"  - {q['question']}")
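
`filter_by_custom_metadata()` combines its criteria with AND. When OR logic is needed, one option is a plain list comprehension over question dicts. The sketch below is independent of Karenina: `matches_any` is a hypothetical helper, and the sample list only mimics the shape of the question objects above.

```python
def matches_any(question, **criteria):
    """Return True if at least one custom_metadata key equals its requested value."""
    metadata = question.get("custom_metadata", {})
    return any(metadata.get(key) == value for key, value in criteria.items())


# Stand-in data shaped like the question dicts in this notebook.
sample = [
    {"question": "What is Python?",
     "custom_metadata": {"category": "programming", "difficulty": "easy"}},
    {"question": "Explain quantum entanglement",
     "custom_metadata": {"category": "physics", "difficulty": "hard"}},
    {"question": "How does Python handle memory?",
     "custom_metadata": {"category": "programming", "difficulty": "hard"}},
    {"question": "Describe DNA replication",
     "custom_metadata": {"category": "biology", "difficulty": "medium"}},
]

# Physics OR hard: a question qualifies if either criterion holds.
physics_or_hard = [
    q for q in sample
    if matches_any(q, category="physics", difficulty="hard")
]
for q in physics_or_hard:
    print(f"  - {q['question']}")
```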

### Method 2: `filter_by_metadata()` - Generic with dot notation

In [None]:
# Exact match
programming = benchmark.filter_by_metadata("custom_metadata.category", "programming")
print(f"Programming questions: {len(programming)}")

# Contains match
bio_related = benchmark.filter_by_metadata(
    "custom_metadata.category",
    "bio",
    match_mode="contains"
)
print(f"Bio-related questions: {len(bio_related)}")

# List membership (in)
python_tagged = benchmark.filter_by_metadata(
    "custom_metadata.tags",
    "python",
    match_mode="in"
)
print(f"Python-tagged questions: {len(python_tagged)}")
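
To make the dot notation and the three match modes concrete, here is a minimal plain-Python sketch of how such a lookup could work. It is not Karenina's implementation: `resolve_path` and `matches` are hypothetical names, and details such as case sensitivity and the handling of missing fields are assumptions.

```python
_MISSING = object()  # sentinel that distinguishes "no such field" from a stored None


def resolve_path(obj, field_path):
    """Walk nested dicts along a dotted path such as 'custom_metadata.category'."""
    current = obj
    for part in field_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches(question, field_path, value, match_mode="exact"):
    """Compare the resolved field against value using the given match mode."""
    found = resolve_path(question, field_path)
    if found is _MISSING:
        return False  # assumption: a missing field never matches
    if match_mode == "exact":
        return found == value
    if match_mode == "contains":
        # Substring test on string fields (case-sensitive in this sketch).
        return isinstance(found, str) and value in found
    if match_mode == "in":
        # Membership test on list-like fields such as tags.
        return isinstance(found, (list, tuple, set)) and value in found
    raise ValueError(f"unknown match_mode: {match_mode!r}")


# Stand-in data shaped like the question dicts in this notebook.
sample = [
    {"question": "What is Python?",
     "custom_metadata": {"category": "programming", "tags": ["python", "basics"]}},
    {"question": "Describe DNA replication",
     "custom_metadata": {"category": "biology", "tags": ["biology", "dna"]}},
]

bio = [q["question"] for q in sample
       if matches(q, "custom_metadata.category", "bio", match_mode="contains")]
tagged = [q["question"] for q in sample
          if matches(q, "custom_metadata.tags", "python", match_mode="in")]
print(bio, tagged)
```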

## 5. Search Methods

The `search_questions()` method supports multi-term, regex, and multi-field searches.

In [None]:
# Multi-term AND search
ml_questions = benchmark.search_questions(
    ["machine", "learning"],
    match_all=True
)
print(f"'machine' AND 'learning': {len(ml_questions)} questions")

# Multi-term OR search
science = benchmark.search_questions(
    ["DNA", "quantum", "physics"],
    match_all=False
)
print(f"DNA OR quantum OR physics: {len(science)} questions")

In [None]:
# Regex search
what_questions = benchmark.search_questions(r"^What", regex=True)
print(f"Questions starting with 'What': {len(what_questions)}")
for q in what_questions:
    print(f"  - {q['question']}")
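
The parameters `match_all`, `fields`, `case_sensitive`, and `regex` interact, so a small stand-alone model of the semantics may help. The sketch below is plain Python and is not Karenina's implementation: the function `search`, its default values, and the per-field matching rule are all assumptions made for illustration.

```python
import re


def search(questions, query, match_all=True, fields=("question",),
           case_sensitive=False, regex=False):
    """Return questions whose fields match the query terms (illustrative only)."""
    terms = [query] if isinstance(query, str) else list(query)
    flags = 0 if case_sensitive else re.IGNORECASE
    # Plain terms are escaped so they match literally; regex terms are used as-is.
    patterns = [re.compile(t if regex else re.escape(t), flags) for t in terms]
    combine = all if match_all else any  # AND vs OR across terms

    def term_hits(pattern, question):
        # A term counts as matched if it appears in any of the searched fields.
        return any(pattern.search(str(question.get(f, ""))) for f in fields)

    return [q for q in questions if combine(term_hits(p, q) for p in patterns)]


# Stand-in data shaped like the question dicts in this notebook.
sample = [
    {"question": "What is Python?",
     "raw_answer": "A high-level programming language"},
    {"question": "Explain quantum entanglement",
     "raw_answer": "A phenomenon where particles are correlated"},
    {"question": "What is machine learning?",
     "raw_answer": "AI algorithms that learn from data"},
    {"question": "Describe DNA replication",
     "raw_answer": "Process of copying DNA molecules"},
    {"question": "How does Python handle memory?",
     "raw_answer": "Through garbage collection and reference counting"},
]

both_terms = search(sample, ["machine", "learning"], match_all=True)
any_term = search(sample, ["DNA", "quantum", "physics"], match_all=False)
starts_with_what = search(sample, r"^What", regex=True)
in_answers = search(sample, "garbage", fields=("question", "raw_answer"))
print(len(both_terms), len(any_term), len(starts_with_what), len(in_answers))
```

Matching each field separately, rather than joining fields into one string, keeps anchors such as `^` meaningful for every field.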

## 6. Statistics with `count_by_field()`

Get distribution statistics for any field using dot notation.

In [None]:
# Count by category
category_dist = benchmark.count_by_field("custom_metadata.category")
print("Category Distribution:")
for category, count in sorted(category_dist.items()):
    print(f"  {category}: {count} questions")

# Count by difficulty
difficulty_dist = benchmark.count_by_field("custom_metadata.difficulty")
print("\nDifficulty Distribution:")
for difficulty, count in sorted(difficulty_dist.items()):
    print(f"  {difficulty}: {count} questions")

In [None]:
# Count on a filtered subset
programming_qs = benchmark.filter_by_custom_metadata(category="programming")
prog_difficulty = benchmark.count_by_field(
    "custom_metadata.difficulty",
    questions=programming_qs
)

print("Programming Questions by Difficulty:")
for difficulty, count in sorted(prog_difficulty.items()):
    print(f"  {difficulty}: {count} questions")
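
Counting by a dotted field path amounts to resolving the path on each question and tallying the values. The sketch below shows that idea with `collections.Counter`. It is not Karenina's implementation: `get_path` and `count_by` are hypothetical names, and skipping missing fields and counting list elements individually are assumptions of this sketch.

```python
from collections import Counter


def get_path(obj, field_path, default=None):
    """Walk nested dicts along a dotted path; return default if any part is missing."""
    for part in field_path.split("."):
        if not isinstance(obj, dict) or part not in obj:
            return default
        obj = obj[part]
    return obj


def count_by(questions, field_path):
    """Tally the values found at field_path across the given questions."""
    counts = Counter()
    for q in questions:
        value = get_path(q, field_path)
        if value is None:
            continue  # assumption: questions lacking the field are skipped
        if isinstance(value, list):
            counts.update(value)  # assumption: each list element is counted once
        else:
            counts[value] += 1
    return dict(counts)


# Stand-in data shaped like the question dicts in this notebook.
sample = [
    {"question": "What is Python?",
     "custom_metadata": {"category": "programming", "tags": ["python", "basics"]}},
    {"question": "Explain quantum entanglement",
     "custom_metadata": {"category": "physics", "tags": ["quantum", "physics"]}},
    {"question": "How does Python handle memory?",
     "custom_metadata": {"category": "programming", "tags": ["python", "memory"]}},
    {"question": "Describe DNA replication",
     "custom_metadata": {"category": "biology", "tags": ["biology", "dna"]}},
]

by_category = count_by(sample, "custom_metadata.category")
by_tag = count_by(sample, "custom_metadata.tags")
for category, count in sorted(by_category.items()):
    print(f"  {category}: {count} questions")
```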

## 7. Realistic Workflow: Progressive Filtering

Combine multiple methods for complex analysis pipelines.

In [None]:
# Step 1: Get all programming questions
prog_qs = benchmark.filter_by_custom_metadata(category="programming")
print(f"Step 1: Found {len(prog_qs)} programming questions")

# Step 2: Analyze difficulty distribution
difficulty_counts = benchmark.count_by_field(
    "custom_metadata.difficulty",
    questions=prog_qs
)
print("\nStep 2: Difficulty distribution")
for diff, count in difficulty_counts.items():
    print(f"  {diff}: {count}")

# Step 3: Filter to hard programming questions
hard_prog = benchmark.filter_by_custom_metadata(
    category="programming",
    difficulty="hard"
)
print(f"\nStep 3: {len(hard_prog)} hard programming questions")

# Step 4: Narrow to Python-specific hard questions with a list comprehension
python_hard = [
    q for q in hard_prog
    if "python" in q["question"].lower()
]
print(f"\nStep 4: {len(python_hard)} hard Python questions")
for q in python_hard:
    print(f"  - {q['question']}")

## 8. Benchmark Summary

Generate an overview of the benchmark using the access and counting methods shown above.

In [None]:
print("=" * 60)
print("BENCHMARK SUMMARY")
print("=" * 60)

print(f"\nTotal Questions: {len(benchmark)}")

# Status breakdown
finished = benchmark.get_finished_questions()
unfinished = benchmark.get_unfinished_questions()
print(f"Finished: {len(finished)}")
print(f"Unfinished: {len(unfinished)}")

# Category distribution
print("\nBy Category:")
for cat, count in sorted(benchmark.count_by_field("custom_metadata.category").items()):
    print(f"  {cat}: {count}")

# Difficulty distribution
print("\nBy Difficulty:")
for diff, count in sorted(benchmark.count_by_field("custom_metadata.difficulty").items()):
    print(f"  {diff}: {count}")

# Year distribution
print("\nBy Year:")
for year, count in sorted(benchmark.count_by_field("custom_metadata.year").items()):
    print(f"  {year}: {count}")

## Key Takeaways

✅ **Harmonized API** - Access methods return question objects by default; pass `ids_only=True` to get IDs instead  
✅ **Generic APIs** - Work with any custom metadata structure  
✅ **Flexible Filtering** - Lambda functions for complex logic  
✅ **Powerful Search** - Multi-term, regex, multi-field support  
✅ **Statistics** - Count by any field with dot notation  
✅ **Composable** - Combine methods for complex workflows  

## Method Reference

### Access (Returns question objects by default)
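- `get_all_questions(ids_only=False)` - All questions (objects by default, IDs if `ids_only=True`)
- `get_question(id)` - Single question by ID
- `get_question_ids()` - All question IDs (convenience wrapper)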
- `get_finished_questions(ids_only=False)` - Finished questions
- `get_unfinished_questions(ids_only=False)` - Unfinished questions
- `get_missing_templates(ids_only=False)` - Questions without templates

### Filtering
- `filter_questions(finished, has_template, has_rubric, author, custom_filter)` - System metadata + lambda
- `filter_by_custom_metadata(**criteria)` - Simple AND logic
- `filter_by_metadata(field_path, value, match_mode)` - Generic with dot notation

### Search
- `search_questions(query, match_all, fields, case_sensitive, regex)` - Unified search API

### Statistics
- `count_by_field(field_path, questions)` - Count distribution by any field

See the [documentation](../docs/using-karenina/accessing-filtering.md) for more details!