
Inner Query should build on sub query #1632

Merged
merged 3 commits on Aug 27, 2015

Conversation

Hailei
Contributor

@Hailei Hailei commented Aug 17, 2015

Consider the following query:

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' BREAK BY 'all' GROUP BY device, cast
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ;

The subquery doesn't return anything, which is not as expected.
Looking into the code path:

   final GroupByQuery innerQuery = new GroupByQuery.Builder(query)
          .setAggregatorSpecs(aggs)
          .setInterval(subquery.getIntervals())
          .setPostAggregatorSpecs(Lists.<PostAggregator>newArrayList())
          .build();

Inner Query should build on sub query

@nishantmonu51
Member

@Hailei, it would be great if you could also add a unit test to GroupByQueryRunnerTest for this.

@Hailei
Contributor Author

Hailei commented Aug 17, 2015

OK, I will add it tomorrow.

@drcrallen
Contributor

@Hailei What SQL interpreter are you using?

@Hailei
Contributor Author

Hailei commented Aug 18, 2015

@drcrallen SQL4D.
This SQL was compiled to the following JSON:

{
  "filter": {
    "type": "selector",
    "dimension": "cast",
    "value": "1"
  },
  "intervals": {
    "intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
    "type": "intervals"
  },
  "granularity": "all",
  "dataSource": {
    "query": {
      "filter": {
        "type": "selector",
        "dimension": "area",
        "value": "0A"
      },
      "intervals": {
        "intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
        "type": "intervals"
      },
      "granularity": "all",
      "dataSource": {
        "name": "dsp_report",
        "type": "table"
      },
      "aggregations": [
        {
          "fieldName": "pv_sum",
          "name": "pv",
          "type": "longSum"
        },
        {
          "fieldName": "uv",
          "name": "hyper_uv",
          "type": "hyperUnique"
        }
      ],
      "postAggregations": [],
      "queryType": "groupBy",
      "dimensions": [
        "device",
        "cast"
      ]
    },
    "type": "query"
  },
  "aggregations": [
    {
      "fieldName": "pv",
      "name": "pv",
      "type": "longSum"
    },
    {
      "fieldName": "hyper_uv",
      "name": "hyper_uv",
      "type": "hyperUnique"
    }
  ],
  "postAggregations": [],
  "queryType": "groupBy",
  "dimensions": ["device"]
}

@Hailei
Contributor Author

Hailei commented Aug 18, 2015

I think this is a bug, but this PR isn't a good solution.
The reason this query's output is empty: the inner query groups by two dimensions, 'cast' and 'device', while the outer query groups only by 'device' and also filters on the 'cast' dimension.
See the following code:

 // We need the inner incremental index to have all the columns required by the outer query
      final GroupByQuery innerQuery = new GroupByQuery.Builder(subquery)
          .setAggregatorSpecs(aggs)
          .setInterval(subquery.getIntervals())
          .setPostAggregatorSpecs(Lists.<PostAggregator>newArrayList())
          .build();

     IncrementalIndex index = makeIncrementalIndex(innerQuery, subqueryResult);

If the inner query is built with the outer query's GROUP BY, the incremental index only contains 'device', so it can't be filtered by cast = 1.

Using the inner query to build the incremental index is redundant when the two GROUP BY clauses are the same, so this PR isn't a good solution.
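The explanation above can be simulated outside Druid. The sketch below is plain Java, not Druid code; `SubqueryDimsSketch`, `indexRow`, and `matches` are hypothetical names for illustration. It builds a fake "index" of subquery rows that keeps only a given dimension list, then applies a selector-style filter cast = 1:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation (not Druid code) of why an intermediate index
// built with only the OUTER query's dimension list cannot satisfy a
// filter on 'cast'.
public class SubqueryDimsSketch {

    // "Index" one subquery result row, keeping only the listed dimensions,
    // mimicking how the incremental index only materializes known dimensions.
    static Map<String, String> indexRow(Map<String, String> row, List<String> dims) {
        Map<String, String> kept = new HashMap<>();
        for (String d : dims) {
            if (row.containsKey(d)) {
                kept.put(d, row.get(d));
            }
        }
        return kept;
    }

    // Count rows that survive the selector filter cast = 1 after indexing.
    static long matches(List<Map<String, String>> rows, List<String> dims) {
        return rows.stream()
                   .map(r -> indexRow(r, dims))
                   .filter(r -> "1".equals(r.get("cast")))
                   .count();
    }

    public static void main(String[] args) {
        List<Map<String, String>> subqueryResult = List.of(
            Map.of("device", "0D", "cast", "1"),
            Map.of("device", "1D", "cast", "2")
        );

        // Dimensions taken from the outer query: 'cast' is dropped, filter matches nothing.
        System.out.println(matches(subqueryResult, List.of("device")));         // 0

        // Dimensions taken from the subquery: 'cast' is kept, filter matches one row.
        System.out.println(matches(subqueryResult, List.of("device", "cast"))); // 1
    }
}
```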

@drcrallen
Contributor

That shouldn't be an issue. The filter is applied FIRST. This is the reason there is dimension-extraction support for both the dimension specification AND filters, which must both be set. Specifying a dimension extraction in the dimension specification and then making a "normal" dimension selector on the extracted dimension will NOT work.

You can see the filter being applied at
io.druid.query.groupby.GroupByQueryEngine#process

    final Sequence<Cursor> cursors = storageAdapter.makeCursors(
        Filters.convertDimensionFilters(query.getDimFilter()),
        intervals.get(0),
        query.getGranularity()
    );

where storageAdapter comes from the incremental index created by the sub query, and query.getDimFilter() should be the filter you were mentioning.

That neither proves nor disproves this issue. A unit test should be able to reveal if there is an issue here.

@drcrallen
Contributor

If you issue the sub query as a thing on its own, is the result what you expect?

      "filter": {
        "type": "selector",
        "dimension": "area",
        "value": "0A"
      },
      "intervals": {
        "intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
        "type": "intervals"
      },
      "granularity": "all",
      "dataSource": {
        "name": "dsp_report",
        "type": "table"
      },
      "aggregations": [
        {
          "fieldName": "pv_sum",
          "name": "pv",
          "type": "longSum"
        },
        {
          "fieldName": "uv",
          "name": "hyper_uv",
          "type": "hyperUnique"
        }
      ],
      "postAggregations": [],
      "queryType": "groupBy",
      "dimensions": [
        "device",
        "cast"
      ]

@Hailei
Contributor Author

Hailei commented Aug 26, 2015

@drcrallen Issuing the subquery on its own returns 4000 rows, as expected, because the cardinality of cast is 1000 and the cardinality of device is 4.
Ordering by cast and limiting, as in the following SQL:

SELECT device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
FROM dsp_report        
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
 AND area = '0A' BREAK BY 'all' GROUP BY device, cast ORDER BY cast LIMIT 10;

SELECT device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv
FROM dsp_report
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
AND area = '0A' BREAK BY 'all' GROUP BY device, cast ORDER BY cast LIMIT 8;
+------------------------+----+----+------------------+------+
|timestamp |cast|pv |hyper_uv |device|
+------------------------+----+----+------------------+------+

... (some rows omitted)

+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |5443|3049.582786719047 |3D |
+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |6025|3577.616802569747 |1D |
+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |5715|3269.305274221978 |2D |
+------------------------+----+----+------------------+------+
|2015-08-03T16:00:00.000Z|1 |6586|3920.265152070715 |0D |

Meanwhile, with this PR applied, issuing the nested SQL:

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' BREAK BY 'all' GROUP BY device, cast
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ;

the result is

+------------------------+----+-----------------+------+
|timestamp |pv |hyper_uv |device|
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|6586|3920.265152070715|0D |
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|6025|3577.616802569747|1D |
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|5715|3269.305274221978|2D |
+------------------------+----+-----------------+------+
|2015-08-03T16:00:00.000Z|5443|3049.582786719047|3D |
+------------------------+----+-----------------+------+

This is the same result as above.

@Hailei
Contributor Author

Hailei commented Aug 26, 2015

Look at GroupByQueryHelper, line 71:

 final List<String> dimensions = Lists.transform(
        query.getDimensions(),
        new Function<DimensionSpec, String>()
        {
          @Override
          public String apply(DimensionSpec input)
          {
            return input.getOutputName();
          }
        }
    );

line 118:

 Accumulator<IncrementalIndex, T> accumulator = new Accumulator<IncrementalIndex, T>()
    {
      @Override
      public IncrementalIndex accumulate(IncrementalIndex accumulated, T in)
      {

        if (in instanceof MapBasedRow) {
          try {
            MapBasedRow row = (MapBasedRow) in;
            accumulated.add(
                new MapBasedInputRow(
                    row.getTimestamp(),
                    dimensions,
                    row.getEvent()
                )
            );
          }
          // ... (catch block elided)

If the outer query's dimensions are used, only 'device' is added to the incremental index; it doesn't contain 'cast'.
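The transform at line 71 only collects the output names of the query's dimension specs. A minimal plain-Java equivalent (the `DimSpec` record here is a hypothetical stand-in for Druid's DimensionSpec, assumed for illustration) shows that when the outer query's specs are used, the resulting name list is just ["device"]:

```java
import java.util.List;

// Plain-Java sketch of the Lists.transform call at GroupByQueryHelper line 71.
// 'DimSpec' is a hypothetical stand-in for Druid's DimensionSpec.
public class DimensionNamesSketch {

    record DimSpec(String dimension, String outputName) {}

    // Equivalent of Lists.transform(query.getDimensions(), input -> input.getOutputName())
    static List<String> outputNames(List<DimSpec> specs) {
        return specs.stream().map(DimSpec::outputName).toList();
    }

    public static void main(String[] args) {
        // The outer query only declares 'device', so only 'device' becomes a
        // dimension of the incremental index; 'cast' gets no dimension column.
        List<DimSpec> outerQueryDims = List.of(new DimSpec("device", "device"));
        System.out.println(outputNames(outerQueryDims)); // [device]
    }
}
```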

@drcrallen
Contributor

I added a test that fails in master but passes with this patch: Hailei#1

Please confirm this is the case you are encountering.

@Hailei
Contributor Author

Hailei commented Aug 27, 2015

@drcrallen Yes, this case is similar to mine. The difference is that my inner query specifies two dimensions, "cast" and "device", while the outer query specifies only one dimension, "device", and filters by "cast".

@fjy
Contributor

fjy commented Aug 27, 2015

👍

@fjy
Contributor

fjy commented Aug 27, 2015

@drcrallen can we finish this one up?

drcrallen added a commit that referenced this pull request Aug 27, 2015
@drcrallen drcrallen merged commit c1388a1 into apache:master Aug 27, 2015
@ghost ghost mentioned this pull request Dec 7, 2015
@Hailei
Contributor Author

Hailei commented Feb 24, 2016

@fjy @drcrallen I think the pull request I submitted has defects: should the inner query really build on the sub query? Look at the following SQL:
SELECT  COUNT(*) as cc,LONG_SUM(click) as c from (select cast,LONG_SUM(click_sum) as click from dsp_report where interval BETWEEN  2016-01-20 AND 2016-01-21) where interval BETWEEN  2016-01-20 AND 2016-01-21;

compile to JSON

{
  "intervals": {
    "intervals": ["2016-01-20/2016-01-21"],
    "type": "intervals"
  },
  "granularity": "all",
  "dataSource": {
    "query": {
      "intervals": {
        "intervals": ["2016-01-20/2016-01-21"],
        "type": "intervals"
      },
      "granularity": "all",
      "dataSource": {
        "name": "dsp_report",
        "type": "table"
      },
      "aggregations": [{
        "fieldName": "click_sum",
        "name": "click",
        "type": "longSum"
      }],
      "postAggregations": [],
      "queryType": "groupBy",
      "dimensions": ["cast"]
    },
    "type": "query"
  },
  "aggregations": [{
    "name": "cc",
    "type": "count"
  }],
  "postAggregations": [],
  "queryType": "groupBy",
  "dimensions": ["click"]
}

The inner query's aggregation 'click' becomes a dimension of the outer query. With a build that includes this PR, that will return a wrong result, because the inner query doesn't have a 'click' dimension.
The SQL statements mentioned in these issues are misuses: as much as possible, the WHERE conditions and aggregations should be inside the inner query.
#1632

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' BREAK BY 'all' GROUP BY device, cast
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ;

proper usage:

SELECT 
   device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv 
FROM
  (SELECT
         device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv 
   FROM dsp_report        
   WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
   AND area = '0A' and cast=1 BREAK BY 'all' GROUP BY device
   ) 
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 BREAK BY 'all' GROUP BY device ;

#1825

  select COUNT(*) as rows from (select a, b from t1 group by a, b)

proper usage:

   select COUNT(*)  from t1 group by a,b

Issue #1036: can count() be inside the inner query?
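The 'click' defect above can be reduced to a membership check. This is a hypothetical sketch, not Druid code: an outer GROUP BY column resolves only if the intermediate index materialized it as a dimension, and 'click' is a subquery aggregation, not a subquery dimension.

```java
import java.util.List;

// Hypothetical sketch: a GROUP BY column can only be resolved if the
// intermediate index materialized it as a dimension column.
public class AggAsDimensionSketch {

    static boolean canGroupBy(List<String> indexDimensions, String outerDimension) {
        return indexDimensions.contains(outerDimension);
    }

    public static void main(String[] args) {
        // With this PR the index dimensions come from the subquery: ["cast"].
        // Grouping the outer query by 'click' then fails (wrong result).
        System.out.println(canGroupBy(List.of("cast"), "click"));  // false

        // Building on the outer query's dimensions (["click"]) would resolve it,
        // which is why neither choice alone covers every nesting.
        System.out.println(canGroupBy(List.of("click"), "click")); // true
    }
}
```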
