PrismaDV

Task-Aware Data Validation using Language Models

PrismaDV is a framework that uses language models to generate data validation unit tests from downstream tasks, so that data quality checks target the errors that would cause those specific tasks to fail.

Architecture

PrismaDV analyzes downstream task code and sample data to automatically generate validation constraints.
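To make the idea of a task-aware constraint concrete, here is a minimal sketch in plain Python. It is not PrismaDV's API: the `Constraint` type, the `validate` helper, the toy `downstream_task`, and the column names are all hypothetical and exist only for illustration. The point is that the constraints follow from what the task code does with the data (here, a division by `overs`), not from the data alone.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Constraint:
    """One data unit test applied to every row (hypothetical type, not PrismaDV's)."""
    column: str
    description: str
    check: Callable[[Any], bool]


def downstream_task(rows: list[Row]) -> list[float]:
    """Toy downstream task: run rate per row. Raises if overs is None or 0."""
    return [row["runs"] / row["overs"] for row in rows]


def is_number(value: Any) -> bool:
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Constraints a task-aware generator might emit after reading downstream_task.
# They are hand-written here; no language model is involved in this sketch.
TASK_CONSTRAINTS = [
    Constraint("overs", "overs is not null", lambda v: v is not None),
    Constraint("overs", "overs is greater than 0", lambda v: is_number(v) and v > 0),
    Constraint("runs", "runs is numeric", is_number),
]


def validate(rows: list[Row], constraints: list[Constraint]) -> list[str]:
    """Return the descriptions of constraints violated by at least one row."""
    failed = []
    for constraint in constraints:
        if not all(constraint.check(row.get(constraint.column)) for row in rows):
            failed.append(constraint.description)
    return failed


clean_batch = [{"runs": 180, "overs": 20}, {"runs": 95, "overs": 12.5}]
dirty_batch = [{"runs": 180, "overs": 0}, {"runs": 95, "overs": None}]

print(validate(clean_batch, TASK_CONSTRAINTS))  # no violations
print(validate(dirty_batch, TASK_CONSTRAINTS))  # both overs constraints fail
```

Running `validate` on a new batch before calling `downstream_task` flags the zero and null `overs` values that would otherwise surface as a `ZeroDivisionError` or `TypeError` inside the task.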

SIFTA (Self-Improving Few-shot Task Adaptation) is our prompt optimization approach that adapts PrismaDV to specific datasets, achieving higher accuracy on new data batches through automated few-shot example selection and refinement.
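The general idea of automated few-shot example selection can be sketched as a greedy search over candidate examples. This is an assumption-laden illustration, not the SIFTA algorithm: `select_few_shot`, `toy_score`, and the example names are hypothetical, and the scoring function is a stand-in for whatever accuracy measure an optimizer would compute on held-out data batches.

```python
from typing import Callable, Sequence


def select_few_shot(
    candidates: Sequence[str],
    score: Callable[[list[str]], float],
    k: int,
) -> list[str]:
    """Greedy forward selection: keep adding the candidate that most improves the score."""
    chosen: list[str] = []
    best = score(chosen)
    while len(chosen) < k:
        scored = [(score(chosen + [c]), c) for c in candidates if c not in chosen]
        if not scored:
            break
        top_score, top_candidate = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no remaining candidate improves the score
        chosen.append(top_candidate)
        best = top_score
    return chosen


# Hypothetical stand-in data: which constraint kinds each few-shot example demonstrates.
COVERAGE = {
    "example_nulls": {"not_null"},
    "example_ranges": {"range", "not_null"},
    "example_types": {"dtype"},
    "example_dupe": {"not_null"},
}
TARGET_KINDS = {"not_null", "range", "dtype"}


def toy_score(examples: list[str]) -> float:
    """Fraction of target constraint kinds covered by the chosen examples (toy metric)."""
    covered = set().union(*(COVERAGE[e] for e in examples))
    return len(covered & TARGET_KINDS) / len(TARGET_KINDS)


selected = select_few_shot(list(COVERAGE), toy_score, k=3)
print(selected)
```

In this toy setup the search stops after two examples, because a third adds no coverage. A real optimizer would replace `toy_score` with an evaluation of the generated constraints on held-out batches.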

Benchmarks

Performance is evaluated using two comprehensive benchmarks for task-aware data validation:

- ICDBench - Individual Constraint Discovery (63 test cases)
- EIDBench - End-to-End Error Impact Detection (5 datasets, 60 tasks)

See the benchmarks overview in benchmarks/ for a detailed comparison and documentation.

Paper Results & Experiments

Navigate to the experimental results corresponding to each section of the paper:

| Section | Description | Results |
|---------|-------------|---------|
| 8.1 | Constraint Discovery from Data-Code Pairs | ICDBench Experiments |
| 8.2 | End-to-End Error Impact Detection | EIDBench Experiments |
| 8.3 | Optimizing PrismaDV with SIFTA | SIFTA Workflow & Results |
| 8.4 | Ablation Studies | Ablation Experiments |

Repository Structure

Benchmarks

benchmarks/ - Benchmark suite overview and router
- ICDBench/ - Constraint discovery benchmark (63 cases)
- EIDBench/ - Error impact detection benchmark (5 datasets, 60 tasks)

Core Implementations

prismadv/ - PrismaDV framework implementation
sifta/ - SIFTA prompt optimization implementation

Workflows & Experiments

workflow_prismadv/ - PrismaDV inference and experiments
workflow_sifta/ - SIFTA optimization workflows

Get Started

Install Poetry (if not already installed):

curl -sSL https://install.python-poetry.org | python3 -

Install dependencies:
```
poetry install --with test,gx
```

Create a .env file with required API keys:

OPENAI_API_KEY=your_openai_api_key
HF_TOKEN=your_huggingface_token
SPARK_VERSION=3.5
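Before running the tests, it can help to confirm that the .env file defines every key listed above. The following is a small standard-library sketch, not part of the repository: `parse_env_text` and `missing_keys` are hypothetical helpers, and the parser handles only simple `KEY=value` lines.

```python
REQUIRED_KEYS = ("OPENAI_API_KEY", "HF_TOKEN", "SPARK_VERSION")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks, comments, and malformed lines."""
    values: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


# Sample content for illustration; the values are placeholders.
sample_env = """
# API credentials
OPENAI_API_KEY=your_openai_api_key
HF_TOKEN=
SPARK_VERSION=3.5
"""

parsed = parse_env_text(sample_env)
print(missing_keys(parsed))  # HF_TOKEN is empty in the sample, so it is reported
```

To check a real file, pass `pathlib.Path(".env").read_text()` to `parse_env_text` instead of the sample string.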

Run tests:
```
poetry run pytest
```

Test Coverage

124 unit tests covering the main codebase (all passing on Ubuntu)

CI/CD Pipeline

The GitHub Actions workflow automatically:

- Sets up a Python 3.11 environment
- Installs Poetry and dependencies
- Runs the complete test suite

Intermediate Results

All task-aware data validation unit tests are saved in the data_processed/ directory. This directory contains unit tests for all tasks generated using different approaches.

Example: the data unit tests that PrismaDV (GPT-5) generated for general_task_5.py from the IPL_win_prediction dataset in EIDBench are stored under data_processed/.
