Configurable RetainCompletenessRule #564
 */
-case class RetainCompletenessRule() extends ConstraintRule[ColumnProfile] {
+case class RetainCompletenessRule(minCompleteness: Double = 0.2, maxCompleteness: Double = 1.0) extends ConstraintRule[ColumnProfile] {
I decided not to parameterize the z-value and kept it as in the original implementation, because it is tied to a specific interval calculation technique. If possible, we can look into parameterizing the strategy used to calculate the interval in #563.
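For context, here is a minimal sketch of why the z-value is tied to the interval technique. It assumes the rule derives its threshold from the lower bound of a Wald (normal-approximation) interval with z = 1.96; the object and method names are hypothetical and not part of Deequ's API.

```scala
// Hypothetical stand-alone helper for illustration; not part of Deequ's API.
object WaldLowerBoundSketch {
  // z-value for a 95% two-sided confidence level, fixed rather than exposed as a parameter.
  val Z95: Double = 1.96

  /** Lower bound of the Wald interval for a proportion p observed over n records. */
  def lowerBound(p: Double, n: Long): Double = {
    require(n > 0, "n must be positive")
    require(p >= 0.0 && p <= 1.0, "p must lie in [0, 1]")
    // Clamp at 0 because the normal approximation can undershoot for small p or small n.
    math.max(0.0, p - Z95 * math.sqrt(p * (1 - p) / n))
  }
}
```

A different interval technique would use the z-value in a different formula, so exposing z alone would only make sense for this one approach.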
Thanks @zeotuan
Can you trim this line to below 120 characters? It is failing checkstyle and failing the build.
Can we also store the values 0.2 and 1.0 as constants?
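One way both review points could be addressed is sketched below: the literals move into named constants, and the signature is wrapped so each line stays under 120 characters. All names ending in `Sketch`, the constant names, and the strict-inequality check are assumptions for illustration, not the code that was merged.

```scala
// Simplified stand-in for Deequ's ColumnProfile; the real type carries more fields.
final case class ColumnProfileSketch(column: String, completeness: Double)

object RetainCompletenessRuleSketch {
  // Named defaults replace the literals 0.2 and 1.0 in the signature.
  val DefaultMinCompleteness: Double = 0.2
  val DefaultMaxCompleteness: Double = 1.0
}

// One parameter per line keeps the declaration under the 120-character limit.
final case class RetainCompletenessRuleSketch(
    minCompleteness: Double = RetainCompletenessRuleSketch.DefaultMinCompleteness,
    maxCompleteness: Double = RetainCompletenessRuleSketch.DefaultMaxCompleteness) {

  // Assumed semantics: apply the rule only when completeness lies strictly between the bounds.
  def shouldBeApplied(profile: ColumnProfileSketch): Boolean =
    profile.completeness > minCompleteness && profile.completeness < maxCompleteness
}
```

Callers that want the previous behavior can use `RetainCompletenessRuleSketch()`, while others can override a single bound, for example `RetainCompletenessRuleSketch(minCompleteness = 0.6)`.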
@rdsharma26 Hi, please help review this PR.
Thank you for addressing the feedback. LGTM.
* Configurable RetainCompletenessRule
* Add doc string
* Add default completeness const
* Configurable RetainCompletenessRule (#564)
  * Configurable RetainCompletenessRule
  * Add doc string
  * Add default completeness const
* Optional specification of instance name in CustomSQL analyzer metric. (#569)
  Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
* Adding Wilson Score Confidence Interval Strategy (#567)
  * Configurable RetainCompletenessRule
  * Add doc string
  * Add default completeness const
  * Add ConfidenceIntervalStrategy
  * Add Separate Wilson and Wald Interval Test
  * Add License information, Fix formatting
  * Add License information
  * formatting fix
  * Update documentation
  * Make WaldInterval the default strategy for now
  * Formatting import to per line
  * Separate group import to per line import
* CustomAggregator (#572)
  * Add support for EntityTypes dqdl rule
  * Add support for Conditional Aggregation Analyzer
  ---------
  Co-authored-by: Joshua Zexter <jzexter@amazon.com>
* fix typo (#574)
* Fix performance of building row-level results (#577)
  * Generate row-level results with withColumns
    Iteratively using withColumn (singular) causes performance issues when iterating over a large sequence of columns.
  * Add back UNIQUENESS_ID
* Replace 'withColumns' with 'select' (#582)
  'withColumns' was introduced in Spark 3.3, so it won't work for Deequ's <3.3 builds.
* Replace rdd with dataframe functions in Histogram analyzer (#586)
  Co-authored-by: Shriya Vanvari <svanvari@amazon.com>
* Updated version in pom.xml to 2.0.8-spark-3.4
---------
Co-authored-by: zeotuan <48720253+zeotuan@users.noreply.github.com>
Co-authored-by: tylermcdaniel0 <144386264+tylermcdaniel0@users.noreply.github.com>
Co-authored-by: Tyler Mcdaniel <tymcd@amazon.com>
Co-authored-by: Joshua Zexter <67130377+joshuazexter@users.noreply.github.com>
Co-authored-by: Joshua Zexter <jzexter@amazon.com>
Co-authored-by: bojackli <478378663@qq.com>
Co-authored-by: Josh <5685731+marcantony@users.noreply.github.com>
Co-authored-by: Shriya Vanvari <vanvari.shriya@gmail.com>
Co-authored-by: Shriya Vanvari <svanvari@amazon.com>
* Same commit list and co-authors as above, with "Match Breeze version with spark 3.3 (#562)" added and the final entry "Updated version in pom.xml to 2.0.8-spark-3.3"
* Same commit list and co-authors as above, ending with "Updated version in pom.xml to 2.0.8-spark-3.2"
* Same commit list and co-authors as above, ending with "Updated version in pom.xml to 2.0.8-spark-3.1"
Close #340
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.