
Raising errors for pyspark dataframe validation #73

Closed
michal-mmm opened this issue Jun 7, 2024 · 4 comments

Comments

@michal-mmm
Contributor

Description

By default, pandera does not raise errors for pyspark DataFrames. Instead, it records validation errors in the df.pandera.errors attribute.

e.g.

df = metadata["pandera"]["schema"].validate(df)
df.pandera.errors
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x30ae9c550>, {'SCHEMA': defaultdict(<class 'list'>, {'WRONG_DATATYPE': [{'schema': 'IrisPySparkSchema', 'column': 'sepal_length', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'sepal_width', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_width' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_length', 'check': "dtype('StringType()')", 'error': "expected column 'petal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_width', 'check': "dtype('StringType()')", 'error': "expected column 'petal_width' to have type StringType(), got DoubleType()"}]})})
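For reference, a schema roughly like the following would produce the errors above. This is a minimal sketch using pandera's pyspark DataFrameModel API; the actual IrisPySparkSchema definition may differ.

import pandera.pyspark as pa
import pyspark.sql.types as T

# Sketch: declaring string columns so that a DataFrame with double
# columns triggers the WRONG_DATATYPE errors recorded above.
class IrisPySparkSchema(pa.DataFrameModel):
    sepal_length: T.StringType() = pa.Field()
    sepal_width: T.StringType() = pa.Field()
    petal_length: T.StringType() = pa.Field()
    petal_width: T.StringType() = pa.Field()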

As per pandera documentation:

This design decision is based on the expectation that most use cases for pyspark SQL dataframes entail a production ETL setting. In these settings, pandera prioritizes completing the production load and saving the data quality issues for downstream rectification.

Context

Currently, there is no way to fail fast on invalid pyspark DataFrames; validation issues can only be discovered by manually inspecting the df.pandera.errors attribute.

Possible Implementation

To enforce immediate error raising during validation, set lazy=False when calling the validation method: metadata["pandera"]["schema"].validate(data, lazy=False).
This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off entirely using the environment variable export PANDERA_VALIDATION_ENABLED=false, as mentioned in the docs and #27.
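A minimal sketch of the fail-fast behaviour, assuming the pyspark backend raises pandera.errors.SchemaError on the first failing check when lazy=False (as the pandas backend does):

from pandera.errors import SchemaError

try:
    # lazy=False stops at the first failing check and raises immediately,
    # instead of collecting errors in df.pandera.errors
    df = metadata["pandera"]["schema"].validate(df, lazy=False)
except SchemaError as err:
    # log the failure, then re-raise to fail the pipeline
    print(f"Validation failed: {err}")
    raise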

@felipemonroy

Hi @michal-mmm, I like the idea of adding lazy=False when calling the validation method. We should also evaluate including tests with a PySpark dataset (and even others like Polars) in order to check that errors are raised.

In the future, we should evaluate how to handle validations with lazy=True, for instance, with an after-pipeline-run hook.
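A hedged sketch of such a test; the schema, the Spark fixture setup, and the assumption that lazy=False raises SchemaError are illustrative and not taken from kedro-pandera's test suite:

import pytest
import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.errors import SchemaError
from pyspark.sql import SparkSession

class StringOnlySchema(pa.DataFrameModel):
    # hypothetical schema used only for this test
    value: T.StringType() = pa.Field()

def test_validate_raises_on_wrong_dtype():
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,)], schema="value double")
    # assumption: with lazy=False the pyspark backend raises instead of
    # recording errors in df.pandera.errors
    with pytest.raises(SchemaError):
        StringOnlySchema.validate(df, lazy=False)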

@felipemonroy

Hi @michal-mmm, could you make the PR with that change and see what @Galileo-Galilei thinks about it? I am happy to help if you can't.

@Galileo-Galilei
Owner

Hi, sorry for not responding earlier. I think we should go forward. I suggest we implement a general mechanism for passing kwargs to the validate function:

my_dataset: 
    type: ...
    filepath: ...
    metadata: 
        pandera: 
            schema: ...            
            validate_kwargs: 
                lazy: true

and then in the hook:

metadata["pandera"]["schema"].validate(data, **metadata["pandera"]["validate_kwargs"])

Feel free to open a PR, and to suggest a different design if you prefer.

@Galileo-Galilei
Owner

Closed by #78
