Bug: MetricRepository cannot store metrics of Histogram analyzer with filter #271

Open
pwzhong opened this issue Aug 4, 2020 · 1 comment

pwzhong commented Aug 4, 2020

MetricRepository does not correctly store the metrics of a Histogram analyzer that has a filter (where). The metrics are stored as if they came from a Histogram analyzer without any filter.

Here is a small snippet that reproduces the issue, using the latest build, 1.0.4:

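    import com.amazon.deequ.analyzers.Histogram
    import com.amazon.deequ.analyzers.runners.{AnalysisRunner, AnalyzerContext}
    import com.amazon.deequ.repository.ResultKey
    import com.amazon.deequ.repository.fs.FileSystemMetricsRepository
    import org.apache.spark.sql.SparkSession

    val path = <repository_path>
    val spark: SparkSession = ...

    import spark.implicits._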
    val inputDF = Seq(("a", 1),("a", 1),("a", 2),("a", 3),("b", 1),("b", 2),("b", 2),("c",1)).toDF("id", "value")

    val repository = FileSystemMetricsRepository(spark, path)
    val resultKey = ResultKey(System.currentTimeMillis(), Map("tag" -> "test"))

    // collect Histogram metrics with filter and store in the repository
    val analysisResult = AnalysisRunner
      .onData(inputDF)
      .useRepository(repository)
      .addAnalyzers(Seq(
        Histogram("value", where = Some("id='a'")),
        Histogram("value", where = Some("id='b'")),
        Histogram("value", where = Some("id='c'"))))
      .saveOrAppendResult(resultKey)
      .run()

    // Print the collected metrics. The Histogram metrics with filters are collected correctly.
    AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult).show(false)

    // Print the metrics loaded from the repository. Here is the error: the filter is missing,
    // and the metrics are stored as Histogram metrics without any filter.
    println("Data stored in metric repository: ")
    repository.load()
      .withTagValues(Map("tag" -> "test"))
      .getSuccessMetricsAsDataFrame(spark)
      .show(false)

Result:

+------+--------+---------------------------------+-----+
|entity|instance|name                             |value|
+------+--------+---------------------------------+-----+
|Column|value   |Histogram.bins (where: id='a')   |3.0  |
|Column|value   |Histogram.abs.1 (where: id='a')  |2.0  |
|Column|value   |Histogram.ratio.1 (where: id='a')|0.25 |
|Column|value   |Histogram.abs.3 (where: id='a')  |1.0  |
|Column|value   |Histogram.ratio.3 (where: id='a')|0.125|
|Column|value   |Histogram.abs.2 (where: id='a')  |1.0  |
|Column|value   |Histogram.ratio.2 (where: id='a')|0.125|
|Column|value   |Histogram.bins (where: id='b')   |2.0  |
|Column|value   |Histogram.abs.2 (where: id='b')  |2.0  |
|Column|value   |Histogram.ratio.2 (where: id='b')|0.25 |
|Column|value   |Histogram.abs.1 (where: id='b')  |1.0  |
|Column|value   |Histogram.ratio.1 (where: id='b')|0.125|
|Column|value   |Histogram.bins (where: id='c')   |1.0  |
|Column|value   |Histogram.abs.1 (where: id='c')  |1.0  |
|Column|value   |Histogram.ratio.1 (where: id='c')|0.125|
+------+--------+---------------------------------+-----+

Data stored in metric repository:
+------+--------+-----------------+-----+-------------+----+
|entity|instance|name             |value|dataset_date |tag |
+------+--------+-----------------+-----+-------------+----+
|Column|value   |Histogram.bins   |1.0  |1596558816681|test|
|Column|value   |Histogram.abs.1  |1.0  |1596558816681|test|
|Column|value   |Histogram.ratio.1|0.125|1596558816681|test|
+------+--------+-----------------+-----+-------------+----+

Root cause:
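The filter (where) property of Histogram is dropped by AnalyzerSerializer when the result is written and is not restored by the corresponding deserializer when it is read back.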

  1. In /src/main/scala/com/amazon/deequ/repository/AnalysisResultSerde.scala, line 300:

        case histogram: Histogram if histogram.binningUdf.isEmpty =>
          result.addProperty(ANALYZER_NAME_FIELD, "Histogram")
          result.addProperty(COLUMN_FIELD, histogram.column)
          result.addProperty("maxDetailBins", histogram.maxDetailBins)

result.addProperty(WHERE_FIELD, histogram.where.orNull) is missing.

  2. In /src/main/scala/com/amazon/deequ/repository/AnalysisResultSerde.scala, line 433:

        case "Histogram" =>
          Histogram(
            json.get(COLUMN_FIELD).getAsString,
            None,
            json.get("maxDetailBins").getAsInt)

getOptionalWhereParam(json) is missing.
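
To illustrate why dropping the filter has this effect, here is a minimal, self-contained sketch of the round trip. It does not use deequ or Gson: HistogramSpec, WhereRoundTrip, the Map-based "JSON", and the field names are all hypothetical stand-ins, and 1000 is an arbitrary value for maxDetailBins. My assumption is that the three analyzers become indistinguishable once the filter is lost, which would explain why only one Histogram appears in the repository output above.

```scala
// Hypothetical, simplified stand-in for deequ's Histogram analyzer (not the real class).
final case class HistogramSpec(column: String, maxDetailBins: Int, where: Option[String] = None)

object WhereRoundTrip {
  // Mirrors the reported behaviour: the where property is never written.
  def serializeBuggy(h: HistogramSpec): Map[String, String] =
    Map(
      "analyzerName" -> "Histogram",
      "column" -> h.column,
      "maxDetailBins" -> h.maxDetailBins.toString)

  // Mirrors the reported behaviour: the where property is never read back.
  def deserializeBuggy(m: Map[String, String]): HistogramSpec =
    HistogramSpec(m("column"), m("maxDetailBins").toInt)

  // Sketch of the fix: write the filter only when one is defined.
  def serializeFixed(h: HistogramSpec): Map[String, String] = {
    val base = serializeBuggy(h)
    h.where.fold(base)(w => base + ("where" -> w))
  }

  // Sketch of the fix: read the filter back as an Option.
  def deserializeFixed(m: Map[String, String]): HistogramSpec =
    HistogramSpec(m("column"), m("maxDetailBins").toInt, m.get("where"))

  def main(args: Array[String]): Unit = {
    // Same three filters as in the reproduction snippet; 1000 is an arbitrary bin limit.
    val analyzers = Seq("a", "b", "c").map(id => HistogramSpec("value", 1000, Some(s"id='$id'")))

    val buggy = analyzers.map(a => deserializeBuggy(serializeBuggy(a))).toSet
    val fixed = analyzers.map(a => deserializeFixed(serializeFixed(a))).toSet

    println(s"distinct analyzers after buggy round trip: ${buggy.size}") // prints 1
    println(s"distinct analyzers after fixed round trip: ${fixed.size}") // prints 3
  }
}
```

In the real code, the equivalent change would presumably be the two missing calls named above, result.addProperty(WHERE_FIELD, histogram.where.orNull) on the write side and getOptionalWhereParam(json) on the read side.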

Could you please fix this bug, or should I fix it and submit a PR myself?

@pwzhong pwzhong changed the title MetricRepository cannot store metrics of Histogram analyzer with filter Bug: MetricRepository cannot store metrics of Histogram analyzer with filter Aug 4, 2020
@sscdotopen
Contributor

Thanks for catching this. It would be great if you could submit a PR.
