A flexible and expressive pandas data validation library.

A data validation library for scientists, engineers, and analysts seeking correctness.


Project Status: Active. The project has reached a stable, usable state and is being actively developed.

pandas data structures contain information that pandera explicitly validates at runtime. This is useful in production-critical or reproducible research settings. With pandera, you can:

  1. Check the types and properties of columns in a DataFrame or values in a Series.
  2. Perform more complex statistical validation like hypothesis testing.
  3. Seamlessly integrate with existing data analysis/processing pipelines via function decorators.

pandera provides a flexible and expressive API for performing data validation on tidy (long-form) and wide data to make data processing pipelines more readable and robust.

Documentation

The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io

Install

Using pip:

pip install pandera

Using conda:

conda install -c conda-forge pandera

Quick Start

import pandas as pd
import pandera as pa


# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(pa.Int, checks=pa.Check.less_than_or_equal_to(10)),
    "column2": pa.Column(pa.Float, checks=pa.Check.less_than(-1.2)),
    "column3": pa.Column(pa.String, checks=[
        pa.Check.str_startswith("value_"),
        # custom checks are functions that take a Series as input and
        # output a boolean or a boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema.validate(df)
print(validated_df)

#     column1  column2  column3
#  0        1     -1.3  value_1
#  1        4     -1.4  value_2
#  2        0     -2.9  value_3
#  3       10    -10.1  value_2
#  4        9    -20.4  value_1

Development Installation

git clone https://github.com/pandera-dev/pandera.git
cd pandera
pip install -r requirements-dev.txt
pip install -e .

Tests

pip install pytest
pytest tests

Contributing to pandera

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

A detailed overview of how to contribute can be found in the contributing guide on GitHub.

Issues

Submit feature requests or bug reports through the project's issue tracker on GitHub.

Other Data Validation Libraries

Here are a few alternatives for validating Python data structures.

Generic Python object data validation

pandas-specific data validation

Other tools that include data validation

Why pandera?

  • pandas-centric data types, column nullability, and uniqueness are first-class concepts.
  • check_input and check_output decorators enable seamless integration with existing code.
  • Checks are flexible and performant because, by design, they give direct access to the pandas API.
  • Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
  • Checks and Hypothesis objects support both tidy and wide data validation.
  • Comprehensive documentation on key functionality.

Citation Information

@misc{niels_bantilan_2019_3385266,
  author       = {Niels Bantilan and
                  Nigel Markey and
                  Riccardo Albertazzi and
                  chr1st1ank},
  title        = {pandera-dev/pandera: 0.2.0 pre-release 1},
  month        = sep,
  year         = 2019,
  doi          = {10.5281/zenodo.3385266},
  url          = {https://doi.org/10.5281/zenodo.3385266}
}
