Wrongly split results when mapping "" in query #4301

Closed

turibe opened this issue May 19, 2017 · 8 comments
@turibe commented May 19, 2017

I want to map empty strings to "N/A" in query results. However, the results are strangely split between "null" and "N/A" (query examples below), as if the mapping were not applied in all cases.

When we don't use the mapping, there is only one category: "null". The raw data, by the way, has only empty strings for nulls.

Interestingly, the rows whose "" values are not mapped to "N/A" appear to be those with timestamps earlier than the first row that has a non-null entry; the corresponding segments contain only "" for that dimension.

This seems like a bug in Druid (tried both 0.8.2 and 0.9.2).

{
  "queryType": "groupBy",
  "dataSource": "my-datasource",
  "granularity": "all",
  "descending": "true",
  "dimensions": ["my-dimension",
  {
    "type": "extraction",
    "dimension": "mydimension",
    "outputName": "mappeddimension",
    "extractionFn": {
      "type": "lookup",
      "retainMissingValue": true,
      "lookup": {
        "type": "map",
        "map": {
          "": "N/A"
        }
      }
    }
  }
  ],
  "aggregations": [ {
      "name": "aggregatedmetric",
      "type": "doubleSum",
      "fieldName": "my_metric"
    }
  ],
  "intervals": [ "2017-03-10T00:00:00.000/2017-03-10T10:00:00.000" ]
}

The empty values are split in the result:

[
  {
    "version": "v1",
    "timestamp": "2017-03-10T00:00:00.000Z",
    "event": {
      "mydimension": null,
      "aggregatedmetric": 100130,
      "mappeddimension": null
    }
  },
  {
    "version": "v1",
    "timestamp": "2017-03-10T00:00:00.000Z",
    "event": {
      "mydimension": null,
      "aggregatedmetric": 46534,
      "mappeddimension": "N/A"
    }
  },

....
]

Remove the "": "N/A" map and change it to one that has no effect ("foobar" : "N/A"),
and the result groups all the null values:

[
  {
    "version": "v1",
    "timestamp": "2017-03-10T00:00:00.000Z",
    "event": {
      "mydimension": null,
      "aggregatedmetric": 146664,
      "mappeddimension": null
    }
  },

....

]

Note that 100130 + 46534 = 146664
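
To make the expected behavior concrete, here is a minimal sketch (plain Java, not Druid code; the class and method names are illustrative) of what the query above expects a lookup with retainMissingValue set to true to do. Treating null like "" is an assumption about how Druid handled nulls in string columns at the time:

import java.util.HashMap;
import java.util.Map;

public class LookupModel {
    // Model of a map lookup with retainMissingValue=true: keys found in the map
    // are replaced; everything else passes through unchanged.
    static String apply(Map<String, String> map, String value) {
        String key = value == null ? "" : value; // assumption: null treated like ""
        return map.containsKey(key) ? map.get(key) : value;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("", "N/A");
        System.out.println(apply(map, ""));      // N/A: the empty string is mapped
        System.out.println(apply(map, null));    // N/A: null treated like ""
        System.out.println(apply(map, "books")); // books: unmapped values pass through
    }
}

Under these semantics every empty or null value should come back as "N/A", so the split between "N/A" and null in the results above is what signals the bug.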

@gianm added the Bug label May 19, 2017
@gianm commented May 19, 2017

I bet it's related to different behavior for actual nulls in actual columns, vs. columns that don't exist.

@gianm added the Starter label May 23, 2017
@sascha-coenen commented

Yes, that's the reason. We've been seeing this as well in Druid 0.9.0, 0.9.1.1, 0.9.2, and 0.10.0.

@gianm commented May 31, 2017

If anyone is interested in fixing this, try starting by searching for calls to NullDimensionSelector.instance() (which Druid uses in some cases when it knows a column is missing) and double-checking that they are all appropriate. Probably it is sometimes being called when it really shouldn't be.

There should be some tests too, probably in GroupByQueryRunnerTest and TopNQueryRunnerTest. Those just check reading values, though. To check that filtering works too, try adding some tests to the filter test files like SelectorFilterTest and BoundFilterTest.
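
To make the hypothesis concrete, the following self-contained sketch (plain Java; none of these classes are Druid code, and the "fixed path" is only a guess at the shape of a patch) models how a missing-column shortcut that bypasses the extraction function splits the groups, and how pre-applying the function to null would merge them:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitGroupsModel {
    interface DimensionSelector { String get(int row); }

    // Lookup with retainMissingValue=true: map hits are replaced, misses pass through.
    static String lookup(Map<String, String> map, String value) {
        String key = value == null ? "" : value;
        return map.containsKey(key) ? map.get(key) : value;
    }

    static Map<String, Long> groupCounts(DimensionSelector... segments) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (DimensionSelector s : segments) {
            for (int row = 0; row < 3; row++) {
                counts.merge(String.valueOf(s.get(row)), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("", "N/A");

        // Segment A: the column exists and its rows hold "", so the lookup runs.
        DimensionSelector segmentA = row -> lookup(map, "");

        // Segment B, buggy path: the column is missing, and a NullDimensionSelector-style
        // shortcut returns null without ever consulting the extraction function.
        DimensionSelector segmentBBuggy = row -> null;

        // Segment B, fixed path: apply the extraction function to null once and
        // serve that constant value instead of short-circuiting.
        String constant = lookup(map, null);
        DimensionSelector segmentBFixed = row -> constant;

        System.out.println(groupCounts(segmentA, segmentBBuggy)); // {N/A=3, null=3} -- split
        System.out.println(groupCounts(segmentA, segmentBFixed)); // {N/A=6} -- merged
    }
}

In this model the buggy path reproduces the two-way split from the original report, and the fixed path collapses both segments into a single "N/A" group, matching 100130 + 46534 = 146664.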

@noppanit commented

Mind if I pick this up? I want to contribute.

@Jinchul81 commented

@gianm I am struggling to create a reproduction scenario for TDD.

You mentioned the root cause, but I don't understand the distinction you are drawing. I guess the column refers to the dimension that the extraction points at, and that column either exists or does not exist. Could you please explain this in more detail?

I bet it's related to different behavior for actual nulls in actual columns, vs. columns that don't exist.

I am trying to write a test case based on GroupByQueryRunnerTest.testGroupByWithExtractionDimFilterKeyisNull(). The result looks right to me, so this example does not reproduce the problem. Could you please guide me on how to reproduce it? Thanks.

  @Test
  public void testWronglySplitResults()
  {
    // Map the empty string to a marker value, mirroring the "" -> "N/A" lookup from the report.
    Map<String, String> extractionMap = new HashMap<>();
    extractionMap.put("", "NULLorEMPTY");

    MapLookupExtractor mapLookupExtractor = new MapLookupExtractor(extractionMap, false);
    // Arguments: extractor, retainMissingValue, replaceMissingValueWith, injective, optimize.
    LookupExtractionFn lookupExtractionFn = new LookupExtractionFn(mapLookupExtractor, false, null, false, false);

    GroupByQuery query = GroupByQuery.builder().setDataSource(QueryRunnerTestHelper.dataSource)
                                     .setQuerySegmentSpec(QueryRunnerTestHelper.firstToThird)
                                     .setDimensions(
                                         Lists.<DimensionSpec>newArrayList(
                                             new ExtractionDimensionSpec(
                                                 "partial_null_column",
                                                 "alias",
                                                 lookupExtractionFn
                                             )
                                         )
                                     )
                                     .setAggregatorSpecs(
                                         Arrays.asList(
                                             QueryRunnerTestHelper.rowsCount,
                                             new LongSumAggregatorFactory("idx", "index")
                                         )
                                     )
                                     .setGranularity(QueryRunnerTestHelper.dayGran)
                                     .build();
    Iterable<Row> results = GroupByQueryRunnerTestHelper.runQuery(factory, runner, query);
    for (Row r: results) {
      LOG.info(r.toString());
    }
  }

The index is loaded from druid.sample.numeric.tsv, and the result is:

MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=null, rows=2, idx=2193}}
MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=NULLorEMPTY, rows=11, idx=3634}}
MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=null, rows=2, idx=2193}}
MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=NULLorEMPTY, rows=11, idx=3634}}

This is the data loaded from the index file:

2011-04-02T00:00:00.000Z  spot  automotive  1000  10000.0 10000.0 100000  preferred a^Apreferred  147.425935
2011-04-02T00:00:00.000Z  spot  business  1100  11000.0 11000.0 110000  preferred b^Apreferred  112.987027
2011-04-02T00:00:00.000Z  spot  entertainment 1200  12000.0 12000.0 120000  preferred e^Apreferred  166.016049
2011-04-02T00:00:00.000Z  spot  health  1300  13000.0 13000.0 130000  preferred h^Apreferred  113.446008
2011-04-02T00:00:00.000Z  spot  mezzanine 1400  14000.0 14000.0 140000  preferred m^Apreferred  110.931934
2011-04-02T00:00:00.000Z  spot  news  1500  15000.0 15000.0 150000  preferred n^Apreferred  114.290141
2011-04-02T00:00:00.000Z  spot  premium 1600  16000.0 16000.0 160000  preferred p^Apreferred  135.301506
2011-04-02T00:00:00.000Z  spot  technology  1700  17000.0 17000.0 170000  preferred t^Apreferred  97.387433
2011-04-02T00:00:00.000Z  spot  travel  1800  18000.0 18000.0 180000  preferred t^Apreferred  126.411364
2011-04-02T00:00:00.000Z  total_market  mezzanine 1400  14000.0 14000.0 140000  preferred m^Apreferred  1193.556278
2011-04-02T00:00:00.000Z  total_market  premium 1600  16000.0 16000.0 160000  preferred p^Apreferred  1321.375057
2011-04-02T00:00:00.000Z  upfront mezzanine 1400  14000.0 14000.0 140000  preferred m^Apreferred  1144.342401 value
2011-04-02T00:00:00.000Z  upfront premium 1600  16000.0 16000.0 160000  preferred p^Apreferred  1049.738585 value

@github-actions bot commented

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@github-actions bot added the stale label May 31, 2023
@github-actions bot commented

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@github-actions bot closed this as not planned Jun 28, 2023