Bugfix: Suggested hasDataType constraint works with non-complete data #87

Conversation

@malcolmgreaves (Contributor) commented Feb 2, 2019

Issue #, if available:
#86

Description of changes:
This PR updates the generated code for the suggested .hasDataType(...) constraint to work with non-complete columns.

If a ConstraintSuggestionRunner is used to suggest a constraint on a String-valued column that consists of numeric values, the resulting suggestion's source code must take into account that the column it was suggested on may have missing values. Previously, the generated constraint code assumed that the column is always complete (its assertion was always `_ >= 1.0`).

The change in this PR is to use the calculated completeness in the ColumnProfile that is passed in to RetainTypeRule's candidate method. Specifically, the .hasDataType assertion parameter is changed to `_ >= ${profile.completeness}`.
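For illustration, a minimal sketch of how the generated assertion changes (the helper and the Integral example type are made up for this sketch; the real strings come from deequ's code generation):

// Hypothetical sketch of the code string a suggestion could emit.
def suggestedCode(completeness: Double): String =
  s".hasDataType(ConstrainableDataTypes.Integral, _ >= $completeness)"

suggestedCode(1.0)   // old behavior: the assertion always assumed a complete column
suggestedCode(0.74)  // new behavior: e.g. a column profiled with 26% missing values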

The ConstraintSuggestionRunnerTest class has been updated to include a test for this exact scenario: using an automatically suggested .hasDataType constraint. This test is called "suggest retain type rule with completeness information". To support this test, the scala-compiler dependency was added to the project's test dependencies, because the test compiles the generated code into a DataFrame => VerificationResult function that is then invoked at runtime.
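As a sketch of the mechanics (this is an assumption about how such a round trip can be implemented, not a quote of the test), the scala-compiler toolbox can turn a generated code string into a callable function:

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

import org.apache.spark.sql.DataFrame
import com.amazon.deequ.VerificationResult

// Compile a generated code string into a DataFrame => VerificationResult function.
val toolbox = currentMirror.mkToolBox()
def compileCheck(code: String): DataFrame => VerificationResult =
  toolbox.eval(toolbox.parse(code)).asInstanceOf[DataFrame => VerificationResult]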

Due to this fix, the expected generated code in ConstraintSuggestionResultTest and ConstraintRulesTest had to be updated to include this explicit completeness assertion. Additionally, a mocked object in ConstraintRulesTest had to be updated to expect exactly one completeness() call.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@stefan-grafberger (Contributor) commented Feb 2, 2019

Thanks a lot for this great PR!

One thing I'm wondering about: instead of changing the value of the assertion parameter in the suggested hasDataType constraint, do we want to include a nullable option in the hasDataType function itself? (If it weren't for backward compatibility, we could even consider making nullable = true the default, as we have other constraints addressing completeness.)

We could then suggest hasDataType(..., nullable = true) instead, which may be more intuitive and make this constraint more useful. I imagine the nullable option is something that users may want to use outside of the constraint suggestion, too.

@sscdotopen (Contributor)

Hi Malcolm and Stefan,

Thank you for pointing out this issue. I think testing based on the completeness of the column fails to catch cases where the completeness changed (e.g., more nulls have been introduced) but no values of a wrong data type have entered the column.

In general, we tried to make our metrics ignore nulls, so that they only operate on the non-null cells. It seems we have failed to do this with the hasDataType metric.

I think we should adjust the constraint internally so that the threshold for the assertion refers to the non-null values in the column. E.g., `hasDataType(Boolean, 1.0)` would mean that 100% of the non-null values are of type Boolean. If someone wants to check that there are no nulls in the column, they could do this via additional completeness checks.
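For example, under these semantics a user could pair the two checks explicitly (a sketch assuming deequ's Check API; the column name "flag" is made up):

import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.ConstrainableDataTypes

// All non-null cells must be Boolean; nulls would be ignored by hasDataType.
val typeCheck = Check(CheckLevel.Error, "data type of flag")
  .hasDataType("flag", ConstrainableDataTypes.Boolean)

// Nulls are asserted separately, only if the user cares about them.
val completenessCheck = Check(CheckLevel.Error, "completeness of flag")
  .hasCompleteness("flag", _ >= 1.0)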

Let me know what you think of this approach.

@stefan-grafberger (Contributor) commented Feb 3, 2019

Hi Sebastian,

I agree. I think this is the cleanest solution, and it's the approach I would prefer, too, if changing the behaviour of the hasDataType constraint is something we can do.

@malcolmgreaves (Contributor, Author)

Hey Stefan and Sebastian!

I think the nullable concept makes the most sense. Especially for this PR, the idea is that the column is really an Option[Int], not a pure Int, and a notion of nullable captures this. For my specific use case (where I'm hitting all of these corner cases!), I have very dirty, poorly formatted data: it's almost always the case that I will have columns with missing values. I imagine it's more natural to express the idea that a column's data type is an Option of a single type than a mixture of data types (and thus needing a notion of completeness).

As a non-library-author, I very much appreciate this deeper look into deequ's design. However, I am not sure I know enough about the codebase to implement this change.

Would this change require updating all hasDataType call sites to include the new nullable parameter? Or would it be better to add another method definition with this parameter and update the existing defs to call this new "full" hasDataType def?

What do y'all think is the best design going forward?

@sscdotopen (Contributor)

Hi Malcolm,

In my view, the "nullable" option makes the semantics too complicated. I think we should just change the implementation of hasDataType to ignore the null values when computing the threshold ratio.

The corresponding code is here:

https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/constraints/Constraint.scala#L548

The Distribution object has the absolute counts and ratios of all data types and nulls encountered. I think we would just have to rescale the value for the particular data type that we want to check. E.g., if we had 4 integer values, 2 string values, and 2 nulls, then the raw ratio for integer would be 0.5. We would need to rescale it by the non-null count, i.e., 4 / (4 + 2) in this case, and it would become roughly 0.67 for the constraint.
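That rescaling arithmetic as a tiny self-contained sketch (plain Scala with the made-up counts from above, not deequ's actual Distribution type):

// Absolute counts per detected type; "Unknown" stands in for nulls here.
val counts = Map("Integral" -> 4L, "String" -> 2L, "Unknown" -> 2L)

val total   = counts.values.sum                        // 8
val nonNull = total - counts.getOrElse("Unknown", 0L)  // 6

val rawRatio      = counts("Integral").toDouble / total    // 0.5
val rescaledRatio = counts("Integral").toDouble / nonNull  // ≈ 0.667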

Does that make sense to you? Let me know if you need more information. This is a pretty complicated part of the code that we have already rewritten a couple of times...

@malcolmgreaves (Contributor, Author) commented Feb 4, 2019

👋

I am not sure I follow 💯: are you saying that the ratio method should be changed so that it doesn't count nulls? Would this new definition be correct / what you're looking for?

private[this] def ratio(keyType: DataTypeInstances.Value,
                        distribution: Distribution): Double = {
  // Absolute count of values detected as keyType (0 if the type never occurred).
  val keyTypeAbs = distribution.values
    .get(keyType.toString)
    .map { _.absolute }
    .getOrElse(0L)

  if (keyTypeAbs == 0L) {
    0.0
  } else {
    // Nulls are tracked in the distribution under the Unknown data type.
    val nullAbs = distribution.values
      .get(DataTypeInstances.Unknown.toString)
      .map { _.absolute }
      .getOrElse(0L)
    val allAbs = distribution.values.values.map { _.absolute }.sum
    // Rescale by the non-null count so that nulls don't drag the ratio down.
    val denom = allAbs - nullAbs
    if (denom <= 0L) 0.0 else keyTypeAbs.toDouble / denom.toDouble
  }
}

Then, would the .hasDataType-generating code still need to put in an explicit assertion? (I.e., could it just use the default `_ >= 1.0`?) Since we're not counting nulls in the ratio, we'd expect it to be 1.0 iff all non-null values are of the specified data type (correct?).

@malcolmgreaves (Contributor, Author) commented Feb 4, 2019

If the above code snippet is indeed what's intended, then where are good places to test this change? Additionally, I think I'd want to make this definition protected at the package level so that it can be tested (IIRC, private[this] means it may only be accessed by members within the defining class or object; since this is an object, I couldn't call it from a test).
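For reference, a minimal sketch of the visibility options under discussion (the names here are illustrative, not the actual deequ declarations):

package com.amazon.deequ.constraints

object Visibility {
  // private[this]: accessible only inside this object, so a test cannot call it.
  private[this] def objectPrivateRatio(): Double = 0.0

  // private[constraints]: accessible anywhere under com.amazon.deequ.constraints,
  // including a test class declared in the same package.
  private[constraints] def packagePrivateRatio(): Double = 0.0
}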

Though, perhaps the test I have would be enough? It does test the thing we want: that a generated hasDataType constraint works on the data it was generated from.

Malcolm Greaves added 2 commits February 5, 2019 10:56
- Name change from `ratio` to `ratioTypes` & signature change to curried form
- If `ignoreUnk = false`, then behavior is unchanged.
- Otherwise, calculates the ratio of values of type `keyType` to the rest of the distribution's non-null (specifically, non-`Unknown`-typed) values.
- `dataTypeConstraint` calculates this non-null keyType ratio for everything except the unknown type (when it's asked to do so!)
@malcolmgreaves (Contributor, Author)

@stefan-grafberger and @sscdotopen -- I've updated the definition of ratio in Constraint.scala to calculate the data-type ratio sans Unknown values. Specifically, my change makes it a callable option to perform the calculation with or without considering Unknown values. I rolled back my changes in the retain type rule class and the subsequent changes that were required in other test code.
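To make that shape concrete, the curried form described in the commit messages might look roughly like this (a sketch; the import paths are assumed and the merged signature may differ):

import com.amazon.deequ.analyzers.DataTypeInstances
import com.amazon.deequ.metrics.Distribution

// ignoreUnk toggles whether Unknown (null) values count in the denominator;
// with ignoreUnk = false the original behavior is preserved.
def ratioTypes(ignoreUnk: Boolean)(keyType: DataTypeInstances.Value,
                                   distribution: Distribution): Double = {
  val keyTypeAbs =
    distribution.values.get(keyType.toString).map(_.absolute).getOrElse(0L)
  val allAbs = distribution.values.values.map(_.absolute).sum
  val unknownAbs =
    if (ignoreUnk) {
      distribution.values.get(DataTypeInstances.Unknown.toString).map(_.absolute).getOrElse(0L)
    } else 0L
  val denom = allAbs - unknownAbs
  if (denom <= 0L) 0.0 else keyTypeAbs.toDouble / denom.toDouble
}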

The test I made -- a round trip of constraint suggestion, code compilation, and verification -- still works! Overall, this new set of changes is much more focused (all in the constraints package).

@malcolmgreaves force-pushed the mg_bugfix_suggested_constraint_with_completeness branch from d493819 to d7e1b56 on February 5, 2019, 19:25
@sscdotopen (Contributor) left a comment

Hi Malcolm,

The change looks great; I mostly have style comments. Thank you for your contribution!

@malcolmgreaves force-pushed the mg_bugfix_suggested_constraint_with_completeness branch from 6f6ebc8 to a5cd830 on February 5, 2019, 23:24
@malcolmgreaves (Contributor, Author)

@sscdotopen Addressed review comments -- anything I didn't quite get or change appropriately?

@sscdotopen (Contributor) left a comment

Thank you for incorporating the feedback so quickly! There is just one typo left in a comment; then this PR should be ready to merge.

@malcolmgreaves (Contributor, Author)

@sscdotopen pushed typo fix

@sscdotopen merged commit 3bdb294 into awslabs:master on Feb 7, 2019
@sscdotopen (Contributor)

Malcolm, thank you again for your contribution!

@malcolmgreaves (Contributor, Author)

You're most welcome! I appreciate the collaboration!! ❤️

@malcolmgreaves deleted the mg_bugfix_suggested_constraint_with_completeness branch on February 7, 2019, 19:07