
Inner Query should build on sub query #1632

Merged
merged 3 commits on Aug 27, 2015

Conversation

Hailei
Contributor

@Hailei Hailei commented Aug 17, 2015

Consider the following query:

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' BREAK BY 'all' GROUP BY device, cast
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ;

The subquery doesn't return anything, which is not as expected.
Looking into the code path:

   final GroupByQuery innerQuery = new GroupByQuery.Builder(query)
          .setAggregatorSpecs(aggs)
          .setInterval(subquery.getIntervals())
          .setPostAggregatorSpecs(Lists.<PostAggregator>newArrayList())
          .build();

Inner Query should build on sub query

@nishantmonu51
Member

@Hailei, it would be great if you could also add a unit test to GroupByQueryRunnerTest for this.

@Hailei
Contributor Author

Hailei commented Aug 17, 2015

OK, I will add it tomorrow.

@drcrallen
Contributor

@Hailei What SQL interpreter are you using?

@Hailei
Contributor Author

Hailei commented Aug 18, 2015

@drcrallen SQL4D.
This SQL was compiled to the following JSON:

{
  "filter": {
    "type": "selector",
    "dimension": "cast",
    "value": "1"
  },
  "intervals": {
    "intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
    "type": "intervals"
  },
  "granularity": "all",
  "dataSource": {
    "query": {
      "filter": {
        "type": "selector",
        "dimension": "area",
        "value": "0A"
      },
      "intervals": {
        "intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
        "type": "intervals"
      },
      "granularity": "all",
      "dataSource": {
        "name": "dsp_report",
        "type": "table"
      },
      "aggregations": [
        {
          "fieldName": "pv_sum",
          "name": "pv",
          "type": "longSum"
        },
        {
          "fieldName": "uv",
          "name": "hyper_uv",
          "type": "hyperUnique"
        }
      ],
      "postAggregations": [],
      "queryType": "groupBy",
      "dimensions": [
        "device",
        "cast"
      ]
    },
    "type": "query"
  },
  "aggregations": [
    {
      "fieldName": "pv",
      "name": "pv",
      "type": "longSum"
    },
    {
      "fieldName": "hyper_uv",
      "name": "hyper_uv",
      "type": "hyperUnique"
    }
  ],
  "postAggregations": [],
  "queryType": "groupBy",
  "dimensions": ["device"]
}

@Hailei
Contributor Author

Hailei commented Aug 18, 2015

I think this is a bug, but this PR isn't a good solution.
The reason this query's output is empty: the inner query groups by two dimensions, 'cast' and 'device', while the outer query groups only by 'device' and also filters on the 'cast' dimension.
See the following code:

 // We need the inner incremental index to have all the columns required by the outer query
      final GroupByQuery innerQuery = new GroupByQuery.Builder(subquery)
          .setAggregatorSpecs(aggs)
          .setInterval(subquery.getIntervals())
          .setPostAggregatorSpecs(Lists.<PostAggregator>newArrayList())
          .build();

     IncrementalIndex index = makeIncrementalIndex(innerQuery, subqueryResult);

If the inner query is built with the outer query's GROUP BY, the incremental index only contains 'device', so it can't be filtered by cast = 1.

Using the inner query to build the incremental index is redundant when the two GROUP BY clauses are the same, so this PR isn't a good solution.
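The explanation above can be simulated outside Druid. The sketch below is plain Java, not Druid code; `SubqueryDimsSketch`, `indexRow`, and `matches` are hypothetical names for illustration. It builds a fake "index" of subquery rows that keeps only a given dimension list, then applies a selector-style filter cast = 1:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation (not Druid code) of why an intermediate index
// built with only the OUTER query's dimension list cannot satisfy a
// filter on 'cast'.
public class SubqueryDimsSketch {

    // "Index" one subquery result row, keeping only the listed dimensions,
    // mimicking how the incremental index only materializes known dimensions.
    static Map<String, String> indexRow(Map<String, String> row, List<String> dims) {
        Map<String, String> kept = new HashMap<>();
        for (String d : dims) {
            if (row.containsKey(d)) {
                kept.put(d, row.get(d));
            }
        }
        return kept;
    }

    // Count rows that survive the selector filter cast = 1 after indexing.
    static long matches(List<Map<String, String>> rows, List<String> dims) {
        return rows.stream()
                   .map(r -> indexRow(r, dims))
                   .filter(r -> "1".equals(r.get("cast")))
                   .count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> subqueryResult = List.of(
            Map.of("device", "0D", "cast", "1"),
            Map.of("device", "1D", "cast", "2")
        );

        // Dimensions taken from the outer query: 'cast' is dropped, filter matches nothing.
        System.out.println(matches(subqueryResult, List.of("device")));         // 0

        // Dimensions taken from the subquery: 'cast' is kept, filter matches one row.
        System.out.println(matches(subqueryResult, List.of("device", "cast"))); // 1
    }
}
```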

@drcrallen
Contributor

That shouldn't be an issue. The filter is applied FIRST. This is the reason there is dimension-extraction support for both the dimension specification AND filters, which must both be set. Specifying a dimension extraction in the dimension specification and then making a "normal" dimension selector on the extracted dimension will NOT work.

You can see the filter being applied at
io.druid.query.groupby.GroupByQueryEngine#process

    final Sequence<Cursor> cursors = storageAdapter.makeCursors(
        Filters.convertDimensionFilters(query.getDimFilter()),
        intervals.get(0),
        query.getGranularity()
    );

where storageAdapter comes from the incremental index created by the sub query, and query.getDimFilter() should be the filter you were mentioning.

That neither proves nor disproves this issue. A unit test should be able to reveal if there is an issue here.

@drcrallen
Contributor

If you issue the sub query as a thing on its own, is the result what you expect?

      "filter": {
        "type": "selector",
        "dimension": "area",
        "value": "0A"
      },
      "intervals": {
        "intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
        "type": "intervals"
      },
      "granularity": "all",
      "dataSource": {
        "name": "dsp_report",
        "type": "table"
      },
      "aggregations": [
        {
          "fieldName": "pv_sum",
          "name": "pv",
          "type": "longSum"
        },
        {
          "fieldName": "uv",
          "name": "hyper_uv",
          "type": "hyperUnique"
        }
      ],
      "postAggregations": [],
      "queryType": "groupBy",
      "dimensions": [
        "device",
        "cast"
      ]

@Hailei
Contributor Author

Hailei commented Aug 26, 2015

@drcrallen Issuing the subquery on its own returns 4000 rows, as expected, because the cardinality of cast is 1000 and the cardinality of device is 4.
Ordering by cast and limiting, as in the following SQL:

SELECT device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
FROM dsp_report        
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
 AND area = '0A' BREAK BY 'all' GROUP BY device, cast ORDER BY cast LIMIT 10;

SELECT device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv
FROM dsp_report
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
AND area = '0A' BREAK BY 'all' GROUP BY device, cast ORDER BY cast LIMIT 8;
+------------------------+----+----+------------------+------+
|timestamp |cast|pv |hyper_uv |device|
+------------------------+----+----+------------------+------+

... (some rows omitted)

+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |5443|3049.582786719047 |3D |
+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |6025|3577.616802569747 |1D |
+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |5715|3269.305274221978 |2D |
+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |6586|3920.265152070715 |0D |

Meanwhile, with this PR applied, issuing the nested SQL:

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' BREAK BY 'all' GROUP BY device, cast
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ;

the result is

+------------------------+----+-----------------+------+
|timestamp |pv |hyper_uv |device|
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|6586|3920.265152070715|0D |
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|6025|3577.616802569747|1D |
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|5715|3269.305274221978|2D |
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|5443|3049.582786719047|3D |
+------------------------+----+-----------------+------+

This is the same result as above.

@Hailei
Contributor Author

Hailei commented Aug 26, 2015

Look at GroupByQueryHelper, line 71:

 final List<String> dimensions = Lists.transform(
        query.getDimensions(),
        new Function<DimensionSpec, String>()
        {
          @Override
          public String apply(DimensionSpec input)
          {
            return input.getOutputName();
          }
        }
    );

line 118:

 Accumulator<IncrementalIndex, T> accumulator = new Accumulator<IncrementalIndex, T>()
    {
      @Override
      public IncrementalIndex accumulate(IncrementalIndex accumulated, T in)
      {

        if (in instanceof MapBasedRow) {
          try {
            MapBasedRow row = (MapBasedRow) in;
            accumulated.add(
                new MapBasedInputRow(
                    row.getTimestamp(),
                    dimensions,
                    row.getEvent()
                )
            );
          }
          // ... (catch block elided)

If the outer query's dimensions are used, only 'device' is added to the incremental index; it doesn't contain 'cast'.
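The transform at line 71 only collects the output names of the query's dimension specs. A minimal plain-Java equivalent (the `DimSpec` record here is a hypothetical stand-in for Druid's DimensionSpec, assumed for illustration) shows that when the outer query's specs are used, the resulting name list is just ["device"]:

```java
import java.util.List;

// Plain-Java sketch of the Lists.transform call at GroupByQueryHelper line 71.
// 'DimSpec' is a hypothetical stand-in for Druid's DimensionSpec.
public class DimensionNamesSketch {

    record DimSpec(String dimension, String outputName) {}

    // Equivalent of Lists.transform(query.getDimensions(), input -> input.getOutputName())
    static List<String> outputNames(List<DimSpec> specs) {
        return specs.stream().map(DimSpec::outputName).toList();
    }

    public static void main(String[] args) {
        // The outer query only declares 'device', so only 'device' becomes a
        // dimension of the incremental index; 'cast' gets no dimension column.
        List<DimSpec> outerQueryDims = List.of(new DimSpec("device", "device"));
        System.out.println(outputNames(outerQueryDims)); // [device]
    }
}
```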

@drcrallen
Contributor

I added a test that fails in master but passes with this patch: Hailei#1

Please confirm this is the case you are encountering.

@Hailei
Contributor Author

Hailei commented Aug 27, 2015

@drcrallen Yes, this case is similar to mine. The difference is that my inner query specifies two dimensions, "cast" and "device", while the outer query specifies only one dimension, "device", and filters by "cast".

@fjy
Contributor

fjy commented Aug 27, 2015

👍

@fjy
Contributor

fjy commented Aug 27, 2015

@drcrallen can we finish this one up?

drcrallen added a commit that referenced this pull request Aug 27, 2015
@drcrallen drcrallen merged commit c1388a1 into apache:master Aug 27, 2015
@ghost ghost mentioned this pull request Dec 7, 2015
@Hailei
Contributor Author

Hailei commented Feb 24, 2016

@fjy @drcrallen I think the pull request I submitted has defects: should the inner query really build on the sub query? Look at the following SQL:
SELECT  COUNT(*) as cc,LONG_SUM(click) as c from (select cast,LONG_SUM(click_sum) as click from dsp_report where interval BETWEEN  2016-01-20 AND 2016-01-21) where interval BETWEEN  2016-01-20 AND 2016-01-21;

compile to JSON

{
  "intervals": {
    "intervals": ["2016-01-20/2016-01-21"],
    "type": "intervals"
  },
  "granularity": "all",
  "dataSource": {
    "query": {
      "intervals": {
        "intervals": ["2016-01-20/2016-01-21"],
        "type": "intervals"
      },
      "granularity": "all",
      "dataSource": {
        "name": "dsp_report",
        "type": "table"
      },
      "aggregations": [{
        "fieldName": "click_sum",
        "name": "click",
        "type": "longSum"
      }],
      "postAggregations": [],
      "queryType": "groupBy",
      "dimensions": ["cast"]
    },
    "type": "query"
  },
  "aggregations": [{
    "name": "cc",
    "type": "count"
  }],
  "postAggregations": [],
  "queryType": "groupBy",
  "dimensions": ["click"]
}

The inner query's aggregation 'click' becomes a dimension of the outer query. With a build that includes this PR, that will return a wrong result, because the inner query doesn't have a 'click' dimension.
The SQL statements mentioned in these issues are misuses: as much as possible, the WHERE conditions and aggregations should be inside the inner query.
#1632

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' BREAK BY 'all' GROUP BY device, cast
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ;

proper usage:

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' and cast=1 BREAK BY 'all' GROUP BY device
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 BREAK BY 'all' GROUP BY device ;

#1825

  select COUNT(*) as rows from (select a, b from t1 group by a, b)

proper usage:

   select COUNT(*)  from t1 group by a,b

Issue #1036: can count() be inside the inner query?
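The 'click' defect above can be reduced to a membership check. This is a hypothetical sketch, not Druid code: an outer GROUP BY column resolves only if the intermediate index materialized it as a dimension, and 'click' is a subquery aggregation, not a subquery dimension.

```java
import java.util.List;

// Hypothetical sketch: a GROUP BY column can only be resolved if the
// intermediate index materialized it as a dimension column.
public class AggAsDimensionSketch {

    static boolean canGroupBy(List<String> indexDimensions, String outerDimension) {
        return indexDimensions.contains(outerDimension);
    }

    public static void main(String[] args) {
        // With this PR the index dimensions come from the subquery: ["cast"].
        // Grouping the outer query by 'click' then fails (wrong result).
        System.out.println(canGroupBy(List.of("cast"), "click"));  // false

        // Building on the outer query's dimensions (["click"]) would resolve it,
        // which is why neither choice alone covers every nesting.
        System.out.println(canGroupBy(List.of("click"), "click")); // true
    }
}
```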
