-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inner Query should build on sub query #1632
Inner Query should build on sub query #1632
Conversation
@Hailei, It would be great if you can also add a unit test to GroupByQueryRunnerTest for this. |
ok ,I will add it tomorrow |
@Hailei What SQL interpreter are you using? |
@drcrallen SQL4D {
"filter": {
"type": "selector",
"dimension": "cast",
"value": "1"
},
"intervals": {
"intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
"type": "intervals"
},
"granularity": "all",
"dataSource": {
"query": {
"filter": {
"type": "selector",
"dimension": "area",
"value": "0A"
},
"intervals": {
"intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
"type": "intervals"
},
"granularity": "all",
"dataSource": {
"name": "dsp_report",
"type": "table"
},
"aggregations": [
{
"fieldName": "pv_sum",
"name": "pv",
"type": "longSum"
},
{
"fieldName": "uv",
"name": "hyper_uv",
"type": "hyperUnique"
}
],
"postAggregations": [],
"queryType": "groupBy",
"dimensions": [
"device",
"cast"
]
},
"type": "query"
},
"aggregations": [
{
"fieldName": "pv",
"name": "pv",
"type": "longSum"
},
{
"fieldName": "hyper_uv",
"name": "hyper_uv",
"type": "hyperUnique"
}
],
"postAggregations": [],
"queryType": "groupBy",
"dimensions": ["device"]
} |
I think this is a bug,But this PR isn't good solution. // We need the inner incremental index to have all the columns required by the outer query
final GroupByQuery innerQuery = new GroupByQuery.Builder(subquery)
.setAggregatorSpecs(aggs)
.setInterval(subquery.getIntervals())
.setPostAggregatorSpecs(Lists.<PostAggregator>newArrayList())
.build();
IncrementalIndex index = makeIncrementalIndex(innerQuery, subqueryResult); If inner query group by outer query's GROUP BY,the increment index only contains 'cast',So can't filter by 'cast=1' Use inner query to make increment index is redundant,their GROUP BY is the same,So this PR isn't good solution |
That shouldn't be an issue. The filter is applied FIRST. This is the reason there are dimension extraction stuff for both the dimension specification AND filters, which must both be set. Specifying a dimension extraction in the dimension specification, then making a "normal" dimension selector on the extracted dimension will NOT work. You can see the filter being applied at final Sequence<Cursor> cursors = storageAdapter.makeCursors(
Filters.convertDimensionFilters(query.getDimFilter()),
intervals.get(0),
query.getGranularity()
); where That neither proves nor disproves this issue. A unit test should be able to reveal if there is an issue here. |
If you issue the sub query as a thing on its own, is the result what you expect? "filter": {
"type": "selector",
"dimension": "area",
"value": "0A"
},
"intervals": {
"intervals": ["2015-08-04T00:00:00/2015-08-05T00:00:00"],
"type": "intervals"
},
"granularity": "all",
"dataSource": {
"name": "dsp_report",
"type": "table"
},
"aggregations": [
{
"fieldName": "pv_sum",
"name": "pv",
"type": "longSum"
},
{
"fieldName": "uv",
"name": "hyper_uv",
"type": "hyperUnique"
}
],
"postAggregations": [],
"queryType": "groupBy",
"dimensions": [
"device",
"cast"
] |
@drcrallen issue sub query,will return 4000 rows,the result as except,because the cardinality of cast is 1000 and the cardinality of device is 4. SELECT device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv
FROM dsp_report
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
AND area = '0A' BREAK BY 'all' GROUP BY device, cast ORDER BY cast LIMIT 10;
in the meaning time,apply this PR,issue the nest sql: SELECT
device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv
FROM
(SELECT
device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv
FROM dsp_report
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
AND area = '0A' BREAK BY 'all' GROUP BY device, cast
)
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ; the result is
this is the same as above |
look into GroupByQueryHelper line 71 final List<String> dimensions = Lists.transform(
query.getDimensions(),
new Function<DimensionSpec, String>()
{
@Override
public String apply(DimensionSpec input)
{
return input.getOutputName();
}
}
); line 118: Accumulator<IncrementalIndex, T> accumulator = new Accumulator<IncrementalIndex, T>()
{
@Override
public IncrementalIndex accumulate(IncrementalIndex accumulated, T in)
{
if (in instanceof MapBasedRow) {
try {
MapBasedRow row = (MapBasedRow) in;
accumulated.add(
new MapBasedInputRow(
row.getTimestamp(),
dimensions,
row.getEvent()
)
);
} if use outer query's dimensions,only add device to increment index,don't contains cast. |
I added a test that fails in master but passes with this patch : Hailei#1 Please confirm this is the case you are encountering |
Add test for apache#1632
@drcrallen yes,this case is similar to mine.The difference is inner query specify two dimensions "cast" and "device",and outer query outer query only specify one dimension "device" and filter by "cast" |
👍 |
@drcrallen can we finish this one up? |
Inner Query should build on sub query
@fjy @drcrallen I think this pull request that I submitted have defects. Inner query really should build on sub query.look following SQL SELECT COUNT(*) as cc,LONG_SUM(click) as c from (select cast,LONG_SUM(click_sum) as click from dsp_report where interval BETWEEN 2016-01-20 AND 2016-01-21) where interval BETWEEN 2016-01-20 AND 2016-01-21; compile to JSON {
"intervals": {
"intervals": ["2016-01-20/2016-01-21"],
"type": "intervals"
},
"granularity": "all",
"dataSource": {
"query": {
"intervals": {
"intervals": ["2016-01-20/2016-01-21"],
"type": "intervals"
},
"granularity": "all",
"dataSource": {
"name": "dsp_report",
"type": "table"
},
"aggregations": [{
"fieldName": "click_sum",
"name": "click",
"type": "longSum"
}],
"postAggregations": [],
"queryType": "groupBy",
"dimensions": ["cast"]
},
"type": "query"
},
"aggregations": [{
"name": "cc",
"type": "count"
}],
"postAggregations": [],
"queryType": "groupBy",
"dimensions": ["click"]
}
inner query's aggregation 'click' turned to outer query's dimension,If use the build that merged this PR,will return wrong result,because inner query don't have 'click' dimension. SELECT
device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv
FROM
(SELECT
device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv
FROM dsp_report
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
AND area = '0A' BREAK BY 'all' GROUP BY device, cast
)
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 AND cast = 1 BREAK BY 'all' GROUP BY device ; proper usage: SELECT
device, LONG_SUM(pv) as pv, HYPER_UNIQUE(hyper_uv) as hyper_uv
FROM
(SELECT
device, cast, LONG_SUM(pv_sum) as pv, HYPER_UNIQUE(uv) as hyper_uv
FROM dsp_report
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00
AND area = '0A' and cast=1 BREAK BY 'all' GROUP BY device
)
WHERE interval BETWEEN 2015-08-04T00:00:00 AND 2015-08-05T00:00:00 BREAK BY 'all' GROUP BY device ; select COUNT(*) as rows from (select a b from t1 group by a,b) proper usage: select COUNT(*) from t1 group by a,b Issue #1036 |
Like following query:
don't return anything,not as except.
look into the code path:
Inner Query should build on sub query