Bugfix: Suggested hasDataType constraint works with non-complete data #87

Conversation

@malcolmgreaves (Contributor) commented Feb 2, 2019

Issue #, if available:
#86

Description of changes:
This PR updates the generated code for the suggested .hasDataType(...) constraint to work with non-complete columns.

If a ConstraintSuggestionRunner is used to suggest a constraint on a String-valued column that consists of numeric values, the resulting suggestion's source code must take into account that the column it was suggested on may have missing values. Previously, the generated constraint code assumed that the column is always complete (its assertion was always `_ >= 1.0`).

The change in this PR is to use the calculated completeness in the ColumnProfile that is passed in to RetainTypeRule's candidate method. Specifically, the .hasDataType assertion parameter is changed to `_ >= ${profile.completeness}`.
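For illustration, a minimal sketch of how the generated assertion changes (the helper and the Integral example type are made up for this sketch; the real strings come from deequ's code generation):

// Hypothetical sketch of the code string a suggestion could emit.
def suggestedCode(completeness: Double): String =
  s".hasDataType(ConstrainableDataTypes.Integral, _ >= $completeness)"

suggestedCode(1.0)   // old behavior: the assertion always assumed a complete column
suggestedCode(0.74)  // new behavior: e.g. a column profiled with 26% missing values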

The ConstraintSuggestionRunnerTest class has been updated to include a test for this exact scenario: using an automatically suggested .hasDataType constraint. This test is called "suggest retain type rule with completeness information". To support this test, the scala-compiler dependency was added to the project's test dependencies, because the test compiles the generated code into a DataFrame => VerificationResult function that is then invoked at runtime.
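As a sketch of the mechanics (this is an assumption about how such a round trip can be implemented, not a quote of the test), the scala-compiler toolbox can turn a generated code string into a callable function:

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

import org.apache.spark.sql.DataFrame
import com.amazon.deequ.VerificationResult

// Compile a generated code string into a DataFrame => VerificationResult function.
val toolbox = currentMirror.mkToolBox()
def compileCheck(code: String): DataFrame => VerificationResult =
  toolbox.eval(toolbox.parse(code)).asInstanceOf[DataFrame => VerificationResult]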

Due to this fix, the expected generated code in ConstraintSuggestionResultTest and ConstraintRulesTest had to be updated to include this explicit completeness assertion. Additionally, a mocked object in ConstraintRulesTest had to be updated to expect exactly one completeness() call.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@stefan-grafberger (Contributor) commented Feb 2, 2019

Thanks a lot for this great PR!

One thing I'm wondering about: instead of changing the value of the assertion parameter in the suggested hasDataType constraint, do we want to include a nullable option in the hasDataType function itself? (If it weren't for backward compatibility, we could even consider making nullable = true the default, as we have other constraints addressing completeness.)

We could then suggest hasDataType(..., nullable = true) instead, which may be more intuitive and make this constraint more useful. I imagine the nullable option is something that users may want to use outside of the constraint suggestion, too.

@sscdotopen (Contributor)

Hi Malcolm and Stefan,

Thank you for pointing out this issue. I think testing based on the completeness of the column fails to catch cases where the completeness changed (e.g., more nulls have been introduced) but no values of a wrong data type have entered the column.

In general, we tried to make our metrics ignore nulls, so that they only operate on the non-null cells. It seems we have failed to do this with the hasDataType metric.

I think we should adjust the constraint internally so that the threshold for the assertion refers to the non-null values in the column. E.g., `hasDataType(Boolean, 1.0)` would mean that 100% of the non-null values are of type Boolean. If someone wants to check that there are no nulls in the column, they could do this via additional completeness checks.
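For example, under these semantics a user could pair the two checks explicitly (a sketch assuming deequ's Check API; the column name "flag" is made up):

import com.amazon.deequ.checks.{Check, CheckLevel}
import com.amazon.deequ.constraints.ConstrainableDataTypes

// All non-null cells must be Boolean; nulls would be ignored by hasDataType.
val typeCheck = Check(CheckLevel.Error, "data type of flag")
  .hasDataType("flag", ConstrainableDataTypes.Boolean)

// Nulls are asserted separately, only if the user cares about them.
val completenessCheck = Check(CheckLevel.Error, "completeness of flag")
  .hasCompleteness("flag", _ >= 1.0)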

Let me know what you think of this approach.

@stefan-grafberger (Contributor) commented Feb 3, 2019

Hi Sebastian,

I agree. I think this is the cleanest solution, and it's the approach I would prefer, too, if changing the behaviour of the hasDataType constraint is something we can do.

@malcolmgreaves (Contributor, Author)

Hey Stefan and Sebastian!

I think the nullable concept makes the most sense. Especially for this PR, the idea is that the column is really an Option[Int], not a pure Int, and a notion of nullable captures this. For my specific use case (where I'm hitting all of these corner cases!), I have very dirty, poorly formatted data: it's almost always the case that I will have columns with missing values. I imagine it's more natural to express the idea that a column's data type is an Option of a single type than a mixture of data types (and thus needing a notion of completeness).

As a non-library-author, I very much appreciate this deeper look into deequ's design. However, I am not sure I know enough about the codebase to implement this change.

Would this change require updating all hasDataType call sites to include the new nullable parameter? Or would it be better to add another method definition with this parameter and update the existing defs to call this new "full" hasDataType def?

What do y'all think is the best design going forward?

@sscdotopen (Contributor)

Hi Malcolm,

In my view, the "nullable" option makes the semantics too complicated. I think we should just change the implementation of hasDataType to ignore the null values when computing the threshold ratio.

The corresponding code is here:

https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/constraints/Constraint.scala#L548

The Distribution object has the absolute counts and ratios of all data types and nulls encountered. I think we would just have to rescale the value for the particular data type that we want to check. E.g., if we had 4 integer values, 2 string values, and 2 nulls, then the raw ratio for integer would be 0.5. We would need to rescale it by the non-null count, i.e., 4 / (4 + 2) in this case, and it would become roughly 0.67 for the constraint.
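That rescaling arithmetic as a tiny self-contained sketch (plain Scala with the made-up counts from above, not deequ's actual Distribution type):

// Absolute counts per detected type; "Unknown" stands in for nulls here.
val counts = Map("Integral" -> 4L, "String" -> 2L, "Unknown" -> 2L)

val total   = counts.values.sum                        // 8
val nonNull = total - counts.getOrElse("Unknown", 0L)  // 6

val rawRatio      = counts("Integral").toDouble / total    // 0.5
val rescaledRatio = counts("Integral").toDouble / nonNull  // ≈ 0.667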

Does that make sense to you? Let me know if you need more information. This is a pretty complicated part of the code that we have already rewritten a couple of times...

@malcolmgreaves (Contributor, Author) commented Feb 4, 2019

👋

I am not sure I follow 💯: are you saying that the ratio method should be changed so that it doesn't count nulls? Would this new definition be correct / what you're looking for?

private[this] def ratio(keyType: DataTypeInstances.Value,
                        distribution: Distribution): Double = {
  // Absolute count of values detected as keyType (0 if the type never occurred).
  val keyTypeAbs = distribution.values
    .get(keyType.toString)
    .map { _.absolute }
    .getOrElse(0L)

  if (keyTypeAbs == 0L) {
    0.0
  } else {
    // Nulls are tracked in the distribution under the Unknown data type.
    val nullAbs = distribution.values
      .get(DataTypeInstances.Unknown.toString)
      .map { _.absolute }
      .getOrElse(0L)
    val allAbs = distribution.values.values.map { _.absolute }.sum
    // Rescale by the non-null count so that nulls don't drag the ratio down.
    val denom = allAbs - nullAbs
    if (denom <= 0L) 0.0 else keyTypeAbs.toDouble / denom.toDouble
  }
}

Then, would the .hasDataType-generating code still need to put in an explicit assertion? (I.e., could it just use the default `_ >= 1.0`?) Since we're not counting nulls in the ratio, we'd expect it to be 1.0 iff all non-null values are of the specified data type (correct?).

@malcolmgreaves (Contributor, Author) commented Feb 4, 2019

If the above code snippet is indeed what's intended, then where are good places to test this change? Additionally, I think I'd want to make this definition protected at the package level so that it can be tested (IIRC, private[this] means it may only be accessed by members within the defining class or object; since this is an object, I couldn't call it from a test).
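For reference, a minimal sketch of the visibility options under discussion (the names here are illustrative, not the actual deequ declarations):

package com.amazon.deequ.constraints

object Visibility {
  // private[this]: accessible only inside this object, so a test cannot call it.
  private[this] def objectPrivateRatio(): Double = 0.0

  // private[constraints]: accessible anywhere under com.amazon.deequ.constraints,
  // including a test class declared in the same package.
  private[constraints] def packagePrivateRatio(): Double = 0.0
}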

Though, perhaps the test I have would be enough? It does test the thing we want: that a generated hasDataType constraint works on the data it was generated from.

Malcolm Greaves added 2 commits February 5, 2019 10:56
- Name change from `ratio` to `ratioTypes` & signature change to curried form
- If `ignoreUnk = false`, then behavior is unchanged.
- Otherwise, calculates the ratio of values of type `keyType` to the rest of the distribution's non-null (specifically, non-`Unknown`-typed) values.
- `dataTypeConstraint` calculates this non-null keyType ratio for everything except the unknown type (when it's asked to do so!)
@malcolmgreaves (Contributor, Author)

@stefan-grafberger and @sscdotopen -- I've updated the definition of ratio in Constraint.scala to calculate the data-type ratio sans Unknown values. Specifically, my change makes it a callable option to perform the calculation with or without considering Unknown values. I rolled back my changes in the retain type rule class and the subsequent changes that were required in other test code.
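To make that shape concrete, the curried form described in the commit messages might look roughly like this (a sketch; the import paths are assumed and the merged signature may differ):

import com.amazon.deequ.analyzers.DataTypeInstances
import com.amazon.deequ.metrics.Distribution

// ignoreUnk toggles whether Unknown (null) values count in the denominator;
// with ignoreUnk = false the original behavior is preserved.
def ratioTypes(ignoreUnk: Boolean)(keyType: DataTypeInstances.Value,
                                   distribution: Distribution): Double = {
  val keyTypeAbs =
    distribution.values.get(keyType.toString).map(_.absolute).getOrElse(0L)
  val allAbs = distribution.values.values.map(_.absolute).sum
  val unknownAbs =
    if (ignoreUnk) {
      distribution.values.get(DataTypeInstances.Unknown.toString).map(_.absolute).getOrElse(0L)
    } else 0L
  val denom = allAbs - unknownAbs
  if (denom <= 0L) 0.0 else keyTypeAbs.toDouble / denom.toDouble
}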

The test I made -- a round trip of constraint suggestion, code compilation, and verification -- still works! Overall, this new set of changes is much more focused (all in the constraints package).

@malcolmgreaves force-pushed the mg_bugfix_suggested_constraint_with_completeness branch from d493819 to d7e1b56 on February 5, 2019, 19:25
@sscdotopen (Contributor) left a comment

Hi Malcolm,

The change looks great; I mostly have style comments. Thank you for your contribution!

@malcolmgreaves force-pushed the mg_bugfix_suggested_constraint_with_completeness branch from 6f6ebc8 to a5cd830 on February 5, 2019, 23:24
@malcolmgreaves (Contributor, Author)

@sscdotopen Addressed review comments -- anything I didn't quite get or change appropriately?

@sscdotopen (Contributor) left a comment

Thank you for incorporating the feedback so quickly! There is just one typo left in a comment; then this PR should be ready to merge.

@malcolmgreaves (Contributor, Author)

@sscdotopen pushed typo fix

@sscdotopen merged commit 3bdb294 into awslabs:master on Feb 7, 2019
@sscdotopen (Contributor)

Malcolm, thank you again for your contribution!

@malcolmgreaves (Contributor, Author)

You're most welcome! I appreciate the collaboration!! ❤️

@malcolmgreaves deleted the mg_bugfix_suggested_constraint_with_completeness branch on February 7, 2019, 19:07