
Raising errors for pyspark dataframe validation #73

Closed
michal-mmm opened this issue Jun 7, 2024 · 4 comments

Comments

@michal-mmm
Contributor

Description

By default, pandera does not raise errors for pyspark DataFrames. Instead, it records validation errors in the df.pandera.errors attribute.

e.g.

df = metadata["pandera"]["schema"].validate(df)
df.pandera.errors
defaultdict(<function ErrorHandler.__init__.<locals>.<lambda> at 0x30ae9c550>, {'SCHEMA': defaultdict(<class 'list'>, {'WRONG_DATATYPE': [{'schema': 'IrisPySparkSchema', 'column': 'sepal_length', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'sepal_width', 'check': "dtype('StringType()')", 'error': "expected column 'sepal_width' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_length', 'check': "dtype('StringType()')", 'error': "expected column 'petal_length' to have type StringType(), got DoubleType()"}, {'schema': 'IrisPySparkSchema', 'column': 'petal_width', 'check': "dtype('StringType()')", 'error': "expected column 'petal_width' to have type StringType(), got DoubleType()"}]})})
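For reference, a schema roughly like the following would produce the errors above. This is a minimal sketch using pandera's pyspark DataFrameModel API; the actual IrisPySparkSchema definition may differ.

import pandera.pyspark as pa
import pyspark.sql.types as T

# Sketch: declaring string columns so that a DataFrame with double
# columns triggers the WRONG_DATATYPE errors recorded above.
class IrisPySparkSchema(pa.DataFrameModel):
    sepal_length: T.StringType() = pa.Field()
    sepal_width: T.StringType() = pa.Field()
    petal_length: T.StringType() = pa.Field()
    petal_width: T.StringType() = pa.Field()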

As per pandera documentation:

This design decision is based on the expectation that most use cases for pyspark SQL dataframes entail a production ETL setting. In these settings, pandera prioritizes completing the production load and saving the data quality issues for downstream rectification.

Context

Currently, there is no way to fail fast on invalid pyspark DataFrames; validation issues can only be discovered by manually inspecting the df.pandera.errors attribute.

Possible Implementation

To enforce immediate error raising during validation, set lazy=False when calling the validation method: metadata["pandera"]["schema"].validate(data, lazy=False).
This setting might be more suitable for machine learning tasks. Alternatively, validation can be toggled off entirely using the environment variable export PANDERA_VALIDATION_ENABLED=false, as mentioned in the docs and #27.
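A minimal sketch of the fail-fast behaviour, assuming the pyspark backend raises pandera.errors.SchemaError on the first failing check when lazy=False (as the pandas backend does):

from pandera.errors import SchemaError

try:
    # lazy=False stops at the first failing check and raises immediately,
    # instead of collecting errors in df.pandera.errors
    df = metadata["pandera"]["schema"].validate(df, lazy=False)
except SchemaError as err:
    # log the failure, then re-raise to fail the pipeline
    print(f"Validation failed: {err}")
    raise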

@felipemonroy

Hi @michal-mmm, I like the idea of adding lazy=False when calling the validation method. We should also evaluate including tests with a PySpark dataset (and even others like Polars) in order to check that errors are raised.

In the future, we should evaluate how to handle validations with lazy=True, for instance, with an after-pipeline-run hook.
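A hedged sketch of such a test; the schema, the Spark fixture setup, and the assumption that lazy=False raises SchemaError are illustrative and not taken from kedro-pandera's test suite:

import pytest
import pandera.pyspark as pa
import pyspark.sql.types as T
from pandera.errors import SchemaError
from pyspark.sql import SparkSession

class StringOnlySchema(pa.DataFrameModel):
    # hypothetical schema used only for this test
    value: T.StringType() = pa.Field()

def test_validate_raises_on_wrong_dtype():
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,)], schema="value double")
    # assumption: with lazy=False the pyspark backend raises instead of
    # recording errors in df.pandera.errors
    with pytest.raises(SchemaError):
        StringOnlySchema.validate(df, lazy=False)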

@felipemonroy

Hi @michal-mmm, could you make the PR with that change and see what @Galileo-Galilei thinks about it? I am happy to help if you can't.

@Galileo-Galilei
Owner

Hi, sorry for not responding earlier. I think we should go forward. I suggest we implement a general mechanism for passing kwargs to the validate function:

my_dataset: 
    type: ...
    filepath: ...
    metadata: 
        pandera: 
            schema: ...            
            validate_kwargs: 
                lazy: true

and then in the hook:

metadata["pandera"]["schema"].validate(data, **metadata["pandera"]["validate_kwargs"])

Feel free to open a PR, and to suggest a different design if you prefer.

@Galileo-Galilei
Owner

Closed by #78
