
Value Error When Applying Multiple Data Validation Decorators to the Same Function #950

Closed
rohithrockzz opened this issue Jun 12, 2024 · 7 comments · Fixed by #960
Labels
triage label for issues that need to be triaged.

Comments

@rohithrockzz

rohithrockzz commented Jun 12, 2024

Description:

I encountered an issue while trying to apply multiple data validation decorators to a single function in the Hamilton DAG framework. Specifically, I am trying to validate different columns of a DataFrame using multiple instances of the @check_output_custom decorator. However, I receive a ValueError indicating that the function cannot be defined more than once.

Steps to Reproduce:

  1. Define a function to process a DataFrame.
  2. Apply multiple @check_output_custom decorators to the function, each with different validation parameters.
  3. Attempt to run the decorated function.

Example code snippet:

1st issue code snippet

# Assumed imports for this snippet; check_output_custom and the validator
# classes are user-defined elsewhere.
from typing import List
from pyspark.sql import DataFrame

@check_output_custom(CompositePrimaryKeyValidatorPySparkDataFrame(columns=["OrderID", "ItemNumber"], importance="fail"))
@check_output_custom(CategoricalValuesValidatorPySparkDataFrame(column="CategoryID", allowed_values=[1, 2, 3], importance="fail"))
def process_order_data(order_data_config: dict, order_filter_template: List) -> DataFrame:
    # Function implementation
    pass

This raises the error:

ValueError: Cannot define function process_order_data_raw more than once. Already defined by function <function process_order_data

2nd issue code snippet

@check_output_custom(CategoricalValuesValidatorPySparkDataFrame(column="CategoryID", allowed_values=[1, 2, 3], importance="fail"))
@check_output_custom(CategoricalValuesValidatorPySparkDataFrame(column="ProductID", allowed_values=[10, 20, 30], importance="warn"))
def process_order_data(order_data_config: dict, order_filter_template: List) -> DataFrame:
    # Function implementation
    pass

This raises the error:

ValueError: Cannot define function process_order_data_CategoricalValuesValidator more than once. Already defined by function <function process_order_data

Expected Behavior

Applying multiple @check_output_custom decorators to a single function should allow for different validation checks on various columns of the DataFrame without raising a ValueError.

Actual Behavior

A ValueError is raised, indicating that the function cannot be defined more than once by the same validator.

Library & System Information

python version = 3.9.5
hamilton library version = 1.65.0

Additional Context:

This issue prevents the application of multiple validators to a single function, which is necessary for comprehensive data validation in our use case. It would be helpful if the framework could support multiple validators on the same function without raising errors.

Thank you for your attention to this issue.

@rohithrockzz added the triage label on Jun 12, 2024
@elijahbenizzy
Collaborator

Thanks for opening! This is a limitation, I think, e.g. two validators that end up with the same node name, plus another complexity. I think we can build a fix, but just to check: if you pass them both to the same decorator (e.g. as follows), does it work? My guess is not, but worth a try:

@check_output_custom(
    CompositePrimaryKeyValidatorPySparkDataFrame(columns=["OrderID", "ItemNumber"], importance="fail"),
    CategoricalValuesValidatorPySparkDataFrame(column="CategoryID", allowed_values=[1, 2, 3], importance="fail")
)

@skrawcz
Collaborator

skrawcz commented Jun 13, 2024

Another thought would be to add a custom validator that takes in multiple validators... 🤔

Otherwise, I think a potential avenue to scope would be to add a name_ kwarg to help name the node so it doesn't clash...
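The "validator that takes in multiple validators" idea above could look roughly like this. This is only a sketch: `CompositeValidator`, `ColumnNotNullValidator`, and the `validate()` protocol are hypothetical illustration names, not Hamilton's actual API.

```python
# Sketch of a "wrapper" validator that fans out to several child validators.
# All names here are hypothetical, for illustration only.

class ColumnNotNullValidator:
    """Toy validator: passes if the named column contains no None values."""

    def __init__(self, column):
        self.column = column

    def validate(self, dataset):
        return all(value is not None for value in dataset.get(self.column, []))


class CompositeValidator:
    """Runs every child validator under one node, avoiding name clashes."""

    def __init__(self, *validators):
        self.validators = list(validators)

    def validate(self, dataset):
        # One (column, passed?) pair per child validator.
        return [(v.column, v.validate(dataset)) for v in self.validators]


composite = CompositeValidator(
    ColumnNotNullValidator("CategoryID"),
    ColumnNotNullValidator("ProductID"),
)
results = composite.validate({"CategoryID": [1, 2, 3], "ProductID": [10, None]})
```

Because the composite registers as a single validator, only one node name is generated regardless of how many checks run inside it.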

@rohithrockzz
Author

rohithrockzz commented Jun 13, 2024

@elijahbenizzy
Yes, it's working. Thanks for your help. If this were mentioned in the documentation, it would be helpful for newcomers.
But the 2nd issue is still present: if we pass the same data validator twice, we get an error, as below:

@check_output_custom(
    CategoricalValuesValidatorPySparkDataFrame(column="ReportingId", allowed_values=[156], importance="fail"),
    CategoricalValuesValidatorPySparkDataFrame(column="CategoryID", allowed_values=[1, 2, 3], importance="fail")
)

skrawcz added a commit that referenced this issue Jun 14, 2024
Fixes #950. This fixes the case where someone wants to do

```python
@check_output_custom(
    CategoricalValuesValidatorPySparkDataFrame(column="ReportingId", allowed_values=[156], importance="fail"),
    CategoricalValuesValidatorPySparkDataFrame(column="CategoryID", allowed_values=[1, 2, 3], importance="fail")
)
```
The validator name is the same, but it's used differently, so the fix
appends a numeric suffix to make each node name unique.
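Conceptually, the numeric-suffix de-duplication described in the commit message can be sketched as follows; `deduplicate_names` is a hypothetical helper written for illustration, not Hamilton's actual code.

```python
# Sketch of numeric-suffix de-duplication: when two validators would generate
# the same node name, later occurrences get _1, _2, ... appended.
# deduplicate_names is a hypothetical helper, not Hamilton's implementation.

def deduplicate_names(names):
    seen = {}
    unique = []
    for name in names:
        if name not in seen:
            seen[name] = 0          # first occurrence keeps its original name
            unique.append(name)
        else:
            seen[name] += 1         # repeats get an incrementing suffix
            unique.append(f"{name}_{seen[name]}")
    return unique


names = deduplicate_names([
    "process_order_data_CategoricalValuesValidator",
    "process_order_data_CategoricalValuesValidator",
])
```

With this scheme, two `CategoricalValuesValidator` checks on the same function produce distinct node names instead of raising the ValueError.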
@skrawcz
Collaborator

skrawcz commented Jun 14, 2024

@rohithrockzz could you try installing
pip install sf-hamilton==1.66.1rc0
and see if that fixes your issue please?

@rohithrockzz
Author

@skrawcz
Yes, it worked. Thank you so much for the quick fix.

@skrawcz
Collaborator

skrawcz commented Jun 14, 2024

@rohithrockzz great thanks for verifying. I will publish a non-RC version in the morning.

@skrawcz
Collaborator

skrawcz commented Jun 15, 2024

@rohithrockzz this has been released under sf-hamilton==1.66.1
