Compile WHERE k IN (1,2,3) into batched get #7

jtaylor-sfdc · 2013-01-26T00:30:29Z

When an IN (or the equivalent OR) appears in a query using the leading row key columns, compile it into a batched get to more efficiently retrieve the query results. Currently, the scan key will be set to the min value and max value in the IN instead.

jtaylor-sfdc · 2013-02-28T02:38:00Z

This could be staged in two parts:

handle the case where the query ends up being a batched get of fully qualified row keys instead of a scan. There are two potential queries constructs over which this optimization could be done: IN and multiple ORs with an equality on the same column. For example, if your primary key is composed of a host column, the following could be optimized to become a batched get:
1. SELECT ... FROM ... WHERE host IN ('na1','eu1')
2. SELECT ... FROM ... WHERE host = 'na1' OR host = 'eu1'
handle the case where the query ends up being a set of sequential scans that would skip from row key range to row key range. This is the case where you don't have a set of keys, but they're not fully qualified.
- same as above, but say the row key is made up of host and something else, like date. In this case, you could have a scan with a start key of 'na1' and a filter that has a skip hint to 'eu1' after the 'na1' rows are exhausted.

My take would be to start with implementing just 1.1 - folks can always rewrite a query like 1.2 using the IN syntax. For 1.2, you'd need to detect when you can't optimize it, so it'd be a little more work.

My take on how 1.1 would be implemented:

Mostly isolated to changes in WhereOptimizer.
1. Change KeyExpressionVisitor.visitLeave(InListExpression node, List<KeyParts> childKeyExprs) to collect all of the values in the IN in the KeyParts returned. We're currently getting the min value and max value instead (which will cause the scan to have a start key and stop key from the smallest value to the largest value).
2. In WhereOptimizer.pushKeyExpressionsToScan()
3. Instead of calculating a single start and stop key, we need to track a start and stop List<List<byte[]>>. Today, we'd have a list of singleton lists where each byte[] is the key being concatenated together to form the start key and stop key. With this optimization, though, we'll have the possibility that we'll have multiple keys (one per value in the IN).
4. Get rid of the assert(keyExpr.getKeyRanges().size() == 1) and have an inner loop over the keyExpr.getKeyRanges() key ranges. Each one gets added to the List<byte[]> for that key slot.
5. Instead of the part at the end that creates a final start/stop key, we'd want to delay doing that and hold on to the List<List<byte[]>> on the context object. It'd be good to have a class encapsulate this list.
In AggregatePlan.newScanner() and QueryPlan.newScanner(), create and use a new implementation of ResultIterators for the batched get case. You wouldn't use the ParallelIterators class, but you might push the handling of the batched get into the SerialLimitingIterators class.
1. The SerialLimitingIterators class is the class that handles the case when a limit has been defined on the query (to stop the scan early). Maybe it makes sense for the batched get to be done per region until the limit is reached? Or do it for a group of the batch get that's as big as the limit (and then do another one if you don't get back rows for some of the gets).
2. Instead of ParallelIterators, you'd implement your own that did a batch get by having a nested loop over the List<List <byte[]>> structure, essentially expanding it to the complete combination of keys (for the case like this: SELECT ... FROM ... WHERE host IN ('a','b') AND feature IN ('c','d') you'd end up with 'ac','ad','bc', 'bd')

So somewhat involved, but somewhat isolated. I'm sure they'll be gotchas, though. If you're still up for it, let me know and we can talk further over a whiteboard.

jtaylor-sfdc · 2013-03-15T22:57:29Z

Fixed (by Ron)

ghost assigned ryang-sfdc Feb 27, 2013

jtaylor-sfdc mentioned this issue Mar 9, 2013

Support transparent salting of row key #74

Closed

jtaylor-sfdc closed this as completed Mar 15, 2013

jtaylor-sfdc unassigned ryang-sfdc Feb 9, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Compile WHERE k IN (1,2,3) into batched get #7

Compile WHERE k IN (1,2,3) into batched get #7

jtaylor-sfdc commented Jan 26, 2013

jtaylor-sfdc commented Feb 28, 2013

jtaylor-sfdc commented Mar 15, 2013

Compile WHERE k IN (1,2,3) into batched get #7

Compile WHERE k IN (1,2,3) into batched get #7

Comments

jtaylor-sfdc commented Jan 26, 2013

jtaylor-sfdc commented Feb 28, 2013

jtaylor-sfdc commented Mar 15, 2013