Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/uniqueness row level results #471

Merged

Conversation

eycho-am
Copy link
Contributor

Issue #, if available:
N/A

Description of changes:
Building on this PR to add more analyzers that support row level results.

This PR adds row level results for the Uniqueness analyzer, which is used by several checks (isUnique, isPrimaryKey, hasUniqueness, uniqueValueRatio)

Row level results for other ScanShareableFrequencyBasedAnalyzers will be added in a future PR.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@eycho-am
Copy link
Contributor Author

Note: Tests for FileSystemMetricsRepositoryTest are currently failing. I'm working on a way to not serialize the fullColumn field of metrics, which is currently causing the issue:

Syntax Error
== SQL ==
CASE WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 1) THEN true WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 0) THEN NULL ELSE false END
        ----------^^^

@eycho-am
Copy link
Contributor Author

Note: Tests for FileSystemMetricsRepositoryTest are currently failing. I'm working on a way to not serialize the fullColumn field of metrics, which is currently causing the issue:

Syntax Error
== SQL ==
CASE WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 1) THEN true WHEN (count(att1) OVER (PARTITION BY att1, att2 ORDER BY att1 ASC NULLS FIRST unspecifiedframe$()) = 0) THEN NULL ELSE false END
        ----------^^^

Resolved this issue by not serializing the full column field in metrics for now

@@ -69,6 +71,10 @@ case class DoubleMetric(
extends Metric[Double] with FullColumn {

override def flatten(): Seq[DoubleMetric] = Seq(this)

override def getMetricWithoutFullColumn: DoubleMetric = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still required?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, removed

@@ -513,7 +513,7 @@ private[deequ] object MetricSerializer extends JsonSerializer[Metric[_]] {
result.addProperty("instance", doubleMetric.instance)
result.addProperty("name", doubleMetric.name)
result.addProperty("value", doubleMetric.value.getOrElse(null).asInstanceOf[Double])
doubleMetric.fullColumn.foreach(c => result.addProperty("fullColumn", c.expr.sql))
// doubleMetric.fullColumn.foreach(c => result.addProperty("fullColumn", c.expr.sql))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can create an issue in Github as well and link it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@eycho-am eycho-am merged commit 310699b into awslabs:master Apr 27, 2023
1 check passed
rdsharma26 pushed a commit that referenced this pull request Apr 27, 2023
- Added row-level results for Uniqueness
- Modified tests to compare strings because comparisons involving the same columns with count() result in false
- Remove serializing fullColumn for now
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
- Added row-level results for Uniqueness
- Modified tests to compare strings because comparisons involving the same columns with count() result in false
- Remove serializing fullColumn for now
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
- Added row-level results for Uniqueness
- Modified tests to compare strings because comparisons involving the same columns with count() result in false
- Remove serializing fullColumn for now
rdsharma26 pushed a commit that referenced this pull request Apr 16, 2024
- Added row-level results for Uniqueness
- Modified tests to compare strings because comparisons involving the same columns with count() result in false
- Remove serializing fullColumn for now
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants