deem-data/PrismaDV

PrismaDV

Task-Aware Data Validation using Language Models

PrismaDV is a framework that leverages Language Models to generate data validation unit tests based on downstream tasks, ensuring that data quality checks are tailored to prevent failures in specific tasks.

Architecture

PrismaDV analyzes downstream task code and sample data to automatically generate validation constraints.

[Figure: PrismaDV Architecture]
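
To make the idea of a task-aware constraint concrete, here is a minimal hand-written sketch. It does not use PrismaDV's API or a language model. The task function `average_order_value`, the `amount` column, and the `validate_batch` helper are all hypothetical names chosen for illustration. The point it tries to show is that each check is derived from what the downstream code actually does with the data.

```python
import math


# Hypothetical downstream task: it calls float() on "amount" and divides by
# the row count, so it fails on non-numeric values and on empty batches.
def average_order_value(rows):
    return sum(float(r["amount"]) for r in rows) / len(rows)


# Hypothetical task-aware checks, written by hand for illustration.
# Each one guards against a specific failure mode of the task above.
def validate_batch(rows):
    violations = []
    if not rows:
        violations.append("batch is empty (task would divide by zero)")
    for i, r in enumerate(rows):
        value = r.get("amount")
        try:
            number = float(value)
        except (TypeError, ValueError):
            violations.append(f"row {i}: amount {value!r} is not numeric")
            continue
        if math.isnan(number):
            violations.append(f"row {i}: amount is NaN (task would return NaN)")
    return violations


clean = [{"amount": "10.0"}, {"amount": "30.0"}]
dirty = [{"amount": "10.0"}, {"amount": "n/a"}, {"amount": None}]

print(validate_batch(clean))  # no violations, so the task is safe to run
print(validate_batch(dirty))  # two violations, reported before the task fails
```

A generic validator might flag many other properties of the batch (duplicates, unexpected categories) that this particular task never touches. Restricting checks to what the task depends on is the sense in which the validation is task-aware.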

SIFTA (Self-Improving Few-shot Task Adaptation) is our prompt optimization approach that adapts PrismaDV to specific datasets, achieving higher accuracy on new data batches through automated few-shot example selection and refinement.

[Figure: SIFTA Architecture]
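
As a rough intuition for few-shot example selection, the sketch below shows a generic greedy selection loop. This is not SIFTA's algorithm, which this README does not specify; `greedy_select`, `toy_score`, and the example names are hypothetical. The scoring function is a toy stand-in for measuring accuracy on a held-out data batch.

```python
from typing import Callable, List, Sequence


def greedy_select(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    k: int,
) -> List[str]:
    """Greedily pick up to k examples, adding whichever improves the score most."""
    selected: List[str] = []
    best = score(selected)
    for _ in range(k):
        gains = [(score(selected + [c]), c) for c in candidates if c not in selected]
        if not gains:
            break
        top_score, top_candidate = max(gains, key=lambda g: g[0])
        if top_score <= best:
            break  # no remaining example helps, so stop early
        selected.append(top_candidate)
        best = top_score
    return selected


# Toy stand-in for "accuracy on a held-out batch": an example set scores
# higher the more distinct error types it covers. A real scorer would run
# the validation pipeline with the chosen prompt examples.
COVERS = {
    "ex_nulls": {"missing"},
    "ex_types": {"type"},
    "ex_nulls_and_types": {"missing", "type"},
    "ex_ranges": {"range"},
}


def toy_score(examples: List[str]) -> float:
    covered = set().union(*(COVERS[e] for e in examples))
    return len(covered) / 3


print(greedy_select(list(COVERS), toy_score, k=3))
```

With this toy scorer the loop stops after two examples, because adding a third does not improve coverage.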


Benchmarks

PrismaDV is evaluated on two benchmarks for task-aware data validation:

  • ICDBench - Individual Constraint Discovery (63 test cases)
  • EIDBench - End-to-End Error Impact Detection (5 datasets, 60 tasks)

See the complete benchmarks overview for detailed comparison and documentation.

Paper Results & Experiments

Navigate to the experimental results corresponding to each section of the paper:

| Section | Description                               | Results                  |
|---------|-------------------------------------------|--------------------------|
| 8.1     | Constraint Discovery from Data-Code Pairs | ICDBench Experiments     |
| 8.2     | End-to-End Error Impact Detection         | EIDBench Experiments     |
| 8.3     | Optimizing PrismaDV with SIFTA            | SIFTA Workflow & Results |
| 8.4     | Ablation Studies                          | Ablation Experiments     |

Repository Structure

Benchmarks

  • benchmarks/ - Benchmark suite overview and router
    • ICDBench/ - Constraint discovery benchmark (63 cases)
    • EIDBench/ - Error impact detection benchmark (5 datasets, 60 tasks)

Core Implementations

  • prismadv/ - PrismaDV framework implementation
  • sifta/ - SIFTA prompt optimization implementation

Workflows & Experiments


Get Started

  1. Install Poetry (if not already installed):

    curl -sSL https://install.python-poetry.org | python3 -

  2. Install dependencies:

    poetry install --with test,gx

  3. Create a .env file with the required API keys and settings:

    OPENAI_API_KEY=your_openai_api_key
    HF_TOKEN=your_huggingface_token
    SPARK_VERSION=3.5

  4. Run tests:

    poetry run pytest
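
Before running the tests, it can help to confirm that the .env file defines every key listed in step 3. The following is a small standalone sketch using only the standard library; it is not part of the repository, and the helper names `parse_env` and `missing_keys` are hypothetical. It assumes a simple `KEY=value` file without multi-line values or variable expansion.

```python
from pathlib import Path

# Keys listed in the setup steps above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN", "SPARK_VERSION")


def parse_env(text):
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or set to an empty value."""
    values = parse_env(text)
    return [k for k in required if not values.get(k)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(text)
    print("missing keys:", missing if missing else "none")
```

Run it from the repository root so that the relative `.env` path resolves to the file created in step 3.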

Test Coverage

  • 124 unit tests covering the main codebase (all passing on Ubuntu)

CI/CD Pipeline

The GitHub Actions workflow automatically:

  • Sets up a Python 3.11 environment
  • Installs Poetry and dependencies
  • Runs the complete test suite

Intermediate Results

All task-aware data validation unit tests are saved in the data_processed/ directory. This directory contains unit tests for all tasks generated using different approaches.

Example: the repository includes the data unit tests that PrismaDV (GPT-5) generated for general_task_5.py from the IPL_win_prediction dataset in EIDBench.
