Wrongly split results when mapping "" in query #4301

Closed

turibe opened this issue May 19, 2017 · 8 comments
@turibe commented May 19, 2017

I want to map empty strings to "N/A" in query results. However, the results are strangely split between "null" and "N/A" (query examples below), as if the mapping were not applied in all cases.

When we don't use the mapping, there is only one category: "null". The raw data, by the way, has only empty strings for nulls.

Interestingly, the rows whose "" values are not mapped to "N/A" appear to be those with timestamps earlier than the first row that has a non-null entry; the corresponding segments contain only "" for that dimension.

This seems like a bug in Druid (tried both 0.8.2 and 0.9.2).

{
  "queryType": "groupBy",
  "dataSource": "my-datasource",
  "granularity": "all",
  "descending": "true",
  "dimensions": ["my-dimension",
  {
    "type": "extraction",
    "dimension": "mydimension",
    "outputName": "mappeddimension",
    "extractionFn": {
      "type": "lookup",
      "retainMissingValue": true,
      "lookup": {
        "type": "map",
        "map": {
          "": "N/A"
        }
      }
    }
  }
  ],
  "aggregations": [ {
      "name": "aggregatedmetric",
      "type": "doubleSum",
      "fieldName": "my_metric"
    }
  ],
  "intervals": [ "2017-03-10T00:00:00.000/2017-03-10T10:00:00.000" ]
}

The empty values are split in the result:

[
  {
    "version": "v1",
    "timestamp": "2017-03-10T00:00:00.000Z",
    "event": {
      "mydimension": null,
      "aggregatedmetric": 100130,
      "mappeddimension": null
    }
  },
  {
    "version": "v1",
    "timestamp": "2017-03-10T00:00:00.000Z",
    "event": {
      "mydimension": null,
      "aggregatedmetric": 46534,
      "mappeddimension": "N/A"
    }
  },

....
]

Remove the "": "N/A" map and change it to one that has no effect ("foobar" : "N/A"),
and the result groups all the null values:

[
  {
    "version": "v1",
    "timestamp": "2017-03-10T00:00:00.000Z",
    "event": {
      "mydimension": null,
      "aggregatedmetric": 146664,
      "mappeddimension": null
    }
  },

....

]

Note that 100130 + 46534 = 146664
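
To make the expected behavior concrete, here is a minimal sketch (plain Java, not Druid code; the class and method names are illustrative) of what the query above expects a lookup with retainMissingValue set to true to do. Treating null like "" is an assumption about how Druid handled nulls in string columns at the time:

import java.util.HashMap;
import java.util.Map;

public class LookupModel {
    // Model of a map lookup with retainMissingValue=true: keys found in the map
    // are replaced; everything else passes through unchanged.
    static String apply(Map<String, String> map, String value) {
        String key = value == null ? "" : value; // assumption: null treated like ""
        return map.containsKey(key) ? map.get(key) : value;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("", "N/A");
        System.out.println(apply(map, ""));      // N/A: the empty string is mapped
        System.out.println(apply(map, null));    // N/A: null treated like ""
        System.out.println(apply(map, "books")); // books: unmapped values pass through
    }
}

Under these semantics every empty or null value should come back as "N/A", so the split between "N/A" and null in the results above is what signals the bug.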

@gianm added the Bug label May 19, 2017
@gianm commented May 19, 2017

I bet it's related to different behavior for actual nulls in actual columns, vs. columns that don't exist.

@gianm added the Starter label May 23, 2017
@sascha-coenen commented

Yes, that's the reason. We've been seeing this as well in Druid 0.9.0, 0.9.1.1, 0.9.2, and 0.10.0.

@gianm commented May 31, 2017

If anyone is interested in fixing this, try starting by searching for calls to NullDimensionSelector.instance() (which Druid uses in some cases when it knows a column is missing) and double-checking that they are all appropriate. Probably it is sometimes being called when it really shouldn't be.

There should be some tests too, probably in GroupByQueryRunnerTest and TopNQueryRunnerTest. Those just check reading values, though. To check that filtering works too, try adding some tests to the filter test files like SelectorFilterTest and BoundFilterTest.
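
To make the hypothesis concrete, the following self-contained sketch (plain Java; none of these classes are Druid code, and the "fixed path" is only a guess at the shape of a patch) models how a missing-column shortcut that bypasses the extraction function splits the groups, and how pre-applying the function to null would merge them:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class SplitGroupsModel {
    interface DimensionSelector { String get(int row); }

    // Lookup with retainMissingValue=true: map hits are replaced, misses pass through.
    static String lookup(Map<String, String> map, String value) {
        String key = value == null ? "" : value;
        return map.containsKey(key) ? map.get(key) : value;
    }

    static Map<String, Long> groupCounts(DimensionSelector... segments) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (DimensionSelector s : segments) {
            for (int row = 0; row < 3; row++) {
                counts.merge(String.valueOf(s.get(row)), 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<>();
        map.put("", "N/A");

        // Segment A: the column exists and its rows hold "", so the lookup runs.
        DimensionSelector segmentA = row -> lookup(map, "");

        // Segment B, buggy path: the column is missing, and a NullDimensionSelector-style
        // shortcut returns null without ever consulting the extraction function.
        DimensionSelector segmentBBuggy = row -> null;

        // Segment B, fixed path: apply the extraction function to null once and
        // serve that constant value instead of short-circuiting.
        String constant = lookup(map, null);
        DimensionSelector segmentBFixed = row -> constant;

        System.out.println(groupCounts(segmentA, segmentBBuggy)); // {N/A=3, null=3} -- split
        System.out.println(groupCounts(segmentA, segmentBFixed)); // {N/A=6} -- merged
    }
}

In this model the buggy path reproduces the two-way split from the original report, and the fixed path collapses both segments into a single "N/A" group, matching 100130 + 46534 = 146664.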

@noppanit commented

Mind if I pick this up? I want to contribute.

@Jinchul81 commented

@gianm I am struggling to create a reproduction scenario for TDD.

You mentioned the root cause, but I don't understand the distinction you are drawing. I guess the column refers to the dimension that the extraction points at, and that column either exists or does not exist. Could you please explain this in more detail?

I bet it's related to different behavior for actual nulls in actual columns, vs. columns that don't exist.

I am trying to write a test case based on GroupByQueryRunnerTest.testGroupByWithExtractionDimFilterKeyisNull(). The result looks right to me, so this example does not reproduce the problem. Could you please guide me on how to reproduce it? Thanks.

  @Test
  public void testWronglySplitResults()
  {
    // Map the empty string to a marker value, mirroring the "" -> "N/A" lookup from the report.
    Map<String, String> extractionMap = new HashMap<>();
    extractionMap.put("", "NULLorEMPTY");

    MapLookupExtractor mapLookupExtractor = new MapLookupExtractor(extractionMap, false);
    // Arguments: extractor, retainMissingValue, replaceMissingValueWith, injective, optimize.
    LookupExtractionFn lookupExtractionFn = new LookupExtractionFn(mapLookupExtractor, false, null, false, false);

    GroupByQuery query = GroupByQuery.builder().setDataSource(QueryRunnerTestHelper.dataSource)
                                     .setQuerySegmentSpec(QueryRunnerTestHelper.firstToThird)
                                     .setDimensions(
                                         Lists.<DimensionSpec>newArrayList(
                                             new ExtractionDimensionSpec(
                                                 "partial_null_column",
                                                 "alias",
                                                 lookupExtractionFn
                                             )
                                         )
                                     )
                                     .setAggregatorSpecs(
                                         Arrays.asList(
                                             QueryRunnerTestHelper.rowsCount,
                                             new LongSumAggregatorFactory("idx", "index")
                                         )
                                     )
                                     .setGranularity(QueryRunnerTestHelper.dayGran)
                                     .build();
    Iterable<Row> results = GroupByQueryRunnerTestHelper.runQuery(factory, runner, query);
    for (Row r: results) {
      LOG.info(r.toString());
    }
  }

The index is loaded from druid.sample.numeric.tsv, and the result is:

MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=null, rows=2, idx=2193}}
MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=NULLorEMPTY, rows=11, idx=3634}}
MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=null, rows=2, idx=2193}}
MapBasedRow{timestamp=2011-04-02T00:00:00.000Z, event={alias=NULLorEMPTY, rows=11, idx=3634}}

This is the data loaded from the index file:

2011-04-02T00:00:00.000Z  spot  automotive  1000  10000.0 10000.0 100000  preferred a^Apreferred  147.425935
2011-04-02T00:00:00.000Z  spot  business  1100  11000.0 11000.0 110000  preferred b^Apreferred  112.987027
2011-04-02T00:00:00.000Z  spot  entertainment 1200  12000.0 12000.0 120000  preferred e^Apreferred  166.016049
2011-04-02T00:00:00.000Z  spot  health  1300  13000.0 13000.0 130000  preferred h^Apreferred  113.446008
2011-04-02T00:00:00.000Z  spot  mezzanine 1400  14000.0 14000.0 140000  preferred m^Apreferred  110.931934
2011-04-02T00:00:00.000Z  spot  news  1500  15000.0 15000.0 150000  preferred n^Apreferred  114.290141
2011-04-02T00:00:00.000Z  spot  premium 1600  16000.0 16000.0 160000  preferred p^Apreferred  135.301506
2011-04-02T00:00:00.000Z  spot  technology  1700  17000.0 17000.0 170000  preferred t^Apreferred  97.387433
2011-04-02T00:00:00.000Z  spot  travel  1800  18000.0 18000.0 180000  preferred t^Apreferred  126.411364
2011-04-02T00:00:00.000Z  total_market  mezzanine 1400  14000.0 14000.0 140000  preferred m^Apreferred  1193.556278
2011-04-02T00:00:00.000Z  total_market  premium 1600  16000.0 16000.0 160000  preferred p^Apreferred  1321.375057
2011-04-02T00:00:00.000Z  upfront mezzanine 1400  14000.0 14000.0 140000  preferred m^Apreferred  1144.342401 value
2011-04-02T00:00:00.000Z  upfront premium 1600  16000.0 16000.0 160000  preferred p^Apreferred  1049.738585 value

@github-actions bot commented

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@github-actions bot added the stale label May 31, 2023
@github-actions bot commented

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@github-actions bot closed this as not planned Jun 28, 2023