Bugfix: Suggested hasDataType constraint works with non-complete data #87
Thanks a lot for this great PR! One thing I'm wondering about: Instead of changing the value for the assertion parameter in the suggested We could then suggest |
Hi Malcolm and Stefan, Thank you for pointing out this issue. I think testing based on the completeness of the column fails to catch cases where the completeness changed (e.g., more nulls have been introduced) but no values of a wrong data type have entered the column. In general, we tried to make our metrics ignore nulls, so that they only operate on the non-null cells. It seems we have failed to do this with the hasDataType metric. I think we should adjust the constraint internally so that the threshold for the assertion refers to the non-null values in the column, e.g., `hasDataType(Boolean, 1.0)` means that 100% of the non-null values are of type Boolean. If someone wants to check that there are no nulls in the column, they could do this via additional completeness checks. Let me know what you think of this approach.
Hi Sebastian, I agree, I think this is the cleanest solution and is the approach I would prefer, too, if changing the behaviour of the hasDataType-constraint is something we can do. |
Hey Stefan and Sebastian! I think the As a non library author, I very much appreciate this deeper understanding of Would this change require an update to all What do y'all think is the best design going forward? |
Hi Malcolm, In my view, the "nullable" option makes the semantics too complicated. I think we should just change the implementation of The corresponding code is here: The Does that make sense to you? Let me know if you need more information. This is a pretty complicated part of the code that we already rewrote a couple of times...
👋 I am not sure I follow 💯 : are you saying that the
Then, would we not need to have the |
If the above code snippet is indeed what's intended, then where are good places to test this change? Additionally, I think I'd want to make this definition Though, perhaps the test I have would be enough? It does test the thing we want -- generated
- Name change from `ratio` to `ratioTypes` & signature change to curried form
- If `ignoreUnk = false`, then behavior is unchanged.
- Otherwise, calculates the ratio of values of type `keyType` to rest of the distribution's non-null (specifically non `Unknown` typed) values.
- `dataTypeConstraint` calculates this non-null keyType ratio for everything except the unknown type (when it's asked to do so!)
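The commit notes above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Deequ's actual code: the `TypeRatioSketch` object, the `Map[String, Long]` histogram, the `"Unknown"` label, and the exact parameter order are all assumptions made for the example. Only the names `ratioTypes`, `ignoreUnk`, and `keyType` come from the commit message.

```scala
// Hypothetical stand-in for a data type histogram: type name -> number of values.
// None of these definitions are taken from Deequ; they only illustrate the idea.
object TypeRatioSketch {
  // Label assumed here for null / untyped values.
  val Unknown: String = "Unknown"

  /** Ratio of `keyType` values in `counts`, written in curried form.
    *
    * With `ignoreUnk = false` the denominator is the whole distribution.
    * With `ignoreUnk = true` the `Unknown` bucket is left out of the
    * denominator, except when the unknown type itself is being measured.
    */
  def ratioTypes(ignoreUnk: Boolean)(keyType: String)(counts: Map[String, Long]): Double = {
    val keyCount = counts.getOrElse(keyType, 0L)
    val denominator =
      if (ignoreUnk && keyType != Unknown) (counts - Unknown).values.sum
      else counts.values.sum
    // Guard against an empty (or all-Unknown) distribution.
    if (denominator == 0L) 0.0 else keyCount.toDouble / denominator
  }
}
```

For a histogram of 8 `Boolean` values and 2 `Unknown` values, this sketch yields 0.8 with `ignoreUnk = false` and 1.0 with `ignoreUnk = true`, which matches the intent that the threshold refers only to the non-null values.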
@stefan-grafberger and @sscdotopen -- I've updated the definition of The test I made -- a roundtrip constraint suggest, code compile, verify -- still works! Overall, this new set of changes is much more focused (all in the |
d493819
to
d7e1b56
Hi Malcolm,
The change looks great, I mostly have style comments. Thank you for your contribution!
src/test/scala/com/amazon/deequ/suggestions/ConstraintSuggestionRunnerTest.scala
6f6ebc8
to
a5cd830
@sscdotopen Addressed review comments -- anything I didn't quite get or change appropriately? |
Thank you for incorporating the feedback so quickly, there is just one typo left in a comment, then this PR should be ready to merge!
src/test/scala/com/amazon/deequ/suggestions/ConstraintSuggestionRunnerTest.scala
@sscdotopen pushed typo fix |
Malcolm, thank you again for your contribution!
You're most welcome! I appreciate the collaboration!! ❤️ |
Issue #, if available: #86

Description of changes:

This PR updates the generated code for the suggested `.hasDataType(...)` constraint to work with non-complete columns.

If a `ConstraintSuggestionRunner` is used to suggest a constraint on a `String`-valued column that consists of numeric values, the resulting suggestion's source code must take into consideration that the column it was suggested on may have missing values. Previously, this generated constraint code assumed that the respective column is always complete (its `assertion` was always `_ >= 1.0`).

The change in this PR is to use the calculated completeness in the `ColumnProfile` that is passed in to `RetainTypeRule`'s `candidate` method. Specifically, the `.hasDataType`'s `assertion` parameter is changed to `_ >= ${profile.completeness}`.

The `ConstraintSuggestionRunnerTest` class has been updated to include a test for this exact scenario: using an automatically suggested `.hasDataType` constraint. This test is called `"suggest retain type rule with completeness information"`. In order to support this test, the `scala-compiler` dependency was added to the project's test dependencies. This is because this new test compiles the generated code into a `DataFrame => VerificationResult` function, which is used at runtime in the test.

Due to this fix, the expected generated code in `ConstraintSuggestionResultTest` and `ConstraintRulesTest` had to be updated to include this explicit completeness assertion. Additionally, a mocked object in `ConstraintRulesTest` had to be updated to expect exactly one `completeness()` call.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.