QUICKSTEP-70-71 Improve aggregation performance #179

jianqiao · 2017-02-05T05:03:06Z

This PR implements two features that improve aggregation performance:

1. Adds CollisionFreeVectorTable to support specialized high-performance aggregation.
2. Adds support for aggregation copy elision, so that we only materialize intermediate results for non-trivial expressions.

For feature 1, when the group-by attribute is a single range-bounded attribute of INT or LONG type, we can use a vector of type std::vector<std::atomic<StateT>> to store the aggregation states, where StateT is the aggregation state type (currently restricted to LONG and DOUBLE). During aggregation, for each tuple, we locate the aggregation state by using the group-by key's value as an index into the state vector, and concurrently update the state with C++ atomic primitives.
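The vector-based scheme for feature 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the class added by this PR: `CollisionFreeSumTable`, `upsert`, `get`, and `AtomicAdd` are hypothetical names, only SUM is shown, and the key range is assumed to be known up front.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a collision-free SUM table.
// Group-by keys are integers known to lie in [min_key, max_key].
class CollisionFreeSumTable {
 public:
  CollisionFreeSumTable(std::int64_t min_key, std::int64_t max_key)
      : min_key_(min_key),
        states_(static_cast<std::size_t>(max_key - min_key + 1)) {
    for (auto &state : states_) {
      state.store(0, std::memory_order_relaxed);
    }
  }

  // The key itself (shifted by min_key) is the slot index, so there is no
  // hashing and no collision handling. The atomic add makes concurrent
  // calls from several worker threads safe.
  void upsert(std::int64_t key, std::int64_t value) {
    const std::size_t index = static_cast<std::size_t>(key - min_key_);
    assert(index < states_.size());
    states_[index].fetch_add(value, std::memory_order_relaxed);
  }

  std::int64_t get(std::int64_t key) const {
    const std::size_t index = static_cast<std::size_t>(key - min_key_);
    assert(index < states_.size());
    return states_[index].load(std::memory_order_relaxed);
  }

 private:
  const std::int64_t min_key_;
  std::vector<std::atomic<std::int64_t>> states_;
};

// For DOUBLE states, fetch_add on std::atomic<double> is only available
// from C++20 on, so this sketch uses a compare-and-swap loop instead.
inline void AtomicAdd(std::atomic<double> *state, double value) {
  double expected = state->load(std::memory_order_relaxed);
  while (!state->compare_exchange_weak(expected, expected + value,
                                       std::memory_order_relaxed)) {
    // On failure, 'expected' now holds the current value; retry.
  }
}
```

In this sketch each worker thread would call `upsert` once per input tuple; because every group owns exactly one slot, no per-thread tables and no merge step are needed.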

For feature 2, note that the current implementation of aggregation always creates a ColumnVectorsValueAccessor to store the results of ALL the input expressions. However, we can avoid creating a column vector (and thus avoid copying values into it) if the aggregation is on a simple attribute, e.g. SUM(x). With this PR, when performing aggregation we prepare two input ValueAccessors: one BASE accessor that is created directly from the input relation's storage block, and one DERIVED accessor that is the temporary result ColumnVectorsValueAccessor. Each aggregation argument may come from the base accessor (meaning that it is a simple attribute) or from the derived accessor (meaning that it is a non-trivial expression that gets evaluated). The two accessors are then handled accordingly in the aggregation handles and aggregation hash tables.
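The base/derived split for feature 2 can be illustrated with a small sketch. Everything here is a simplified assumption for illustration: `Columns`, `ValueAccessorSource`, `ArgumentRef`, `SumArgument`, and `MaterializeSum` are hypothetical names, and plain vectors stand in for the real ValueAccessor classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical column store: columns[c][row].
using Columns = std::vector<std::vector<std::int64_t>>;

// Which accessor an aggregation argument is read from.
enum class ValueAccessorSource { kBase, kDerived };

// Describes where one aggregation argument lives.
struct ArgumentRef {
  ValueAccessorSource source;
  std::size_t column_id;
};

// Computes SUM over one argument, reading directly from whichever
// accessor holds it. A simple attribute is read from 'base' with no copy.
std::int64_t SumArgument(const ArgumentRef &arg,
                         const Columns &base,
                         const Columns &derived) {
  const Columns &columns =
      (arg.source == ValueAccessorSource::kBase) ? base : derived;
  assert(arg.column_id < columns.size());
  std::int64_t sum = 0;
  for (const std::int64_t value : columns[arg.column_id]) {
    sum += value;
  }
  return sum;
}

// Materializes the non-trivial expression (lhs + rhs) into a derived
// column. Only expressions like this one need a temporary column.
Columns MaterializeSum(const Columns &base, std::size_t lhs, std::size_t rhs) {
  assert(lhs < base.size() && rhs < base.size());
  assert(base[lhs].size() == base[rhs].size());
  std::vector<std::int64_t> result(base[lhs].size());
  for (std::size_t row = 0; row < result.size(); ++row) {
    result[row] = base[lhs][row] + base[rhs][row];
  }
  Columns derived;
  derived.push_back(std::move(result));
  return derived;
}
```

Under these assumptions, SUM(x) would use an `ArgumentRef` with `kBase` and touch no temporary storage, while SUM(x + y) would first call `MaterializeSum` and then use an `ArgumentRef` with `kDerived`.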

Main changes:
expressions/aggregation: Updated the aggregation handles to support copy elision. Also did some cleanups.
relational_operators: Added InitializeAggregationOperator to support parallel initialization of the aggregation state (just memset the memory to 0) -- because it takes a relatively long time to do the initialization with a single thread if the aggregation hash table is large.
storage: Added CollisionFreeVectorTable. Renamed FastHashTable to PackedPayloadHashTable, made it support copy elision, and cleaned up the class to remove unused methods. Refactored AggregationOperationState to support copy elision and support the new aggregation. Moved aggregation code out of StorageBlock.
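
The parallel initialization mentioned under relational_operators can be pictured as splitting the state memory into disjoint byte ranges and zeroing each range in its own work order. The sketch below is an assumption about the general idea only; `InitRange`, `PartitionForInit`, and `InitializeRange` are hypothetical names, not the operator's actual interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One contiguous byte range to be zeroed by a single work order.
struct InitRange {
  std::size_t offset;
  std::size_t length;
};

// Splits [0, total_bytes) into at most num_partitions contiguous ranges.
std::vector<InitRange> PartitionForInit(std::size_t total_bytes,
                                        std::size_t num_partitions) {
  assert(num_partitions > 0);
  std::vector<InitRange> ranges;
  // Round up so that the ranges cover every byte.
  const std::size_t chunk = (total_bytes + num_partitions - 1) / num_partitions;
  for (std::size_t offset = 0; offset < total_bytes; offset += chunk) {
    ranges.push_back({offset, std::min(chunk, total_bytes - offset)});
  }
  return ranges;
}

// Work done by one initialization work order. The ranges do not overlap,
// so different ranges can be zeroed by different threads at the same time.
void InitializeRange(void *memory, const InitRange &range) {
  std::memset(static_cast<char *>(memory) + range.offset, 0, range.length);
}
```

A scheduler would hand each `InitRange` to a separate worker; the sketch leaves out the threading itself.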

This PR significantly improves some TPC-H queries' performance. For example, it improves TPC-H Q18 from ~27.5s to ~3.5s, with scale factor 100 on a cloudlab machine.

Below shows the TPC-H performance (scale factor 100 on a cloudlab machine) with recently committed optimizations up to this point:

TPCH SF100	master (ms)	w/ optimizations (ms)
Q01	13629	11221
Q02	537	460
Q03	4824	4124
Q04	2185	2203
Q05	5517	5282
Q06	399	401
Q07	18563	3456
Q08	1746	899
Q09	7247	5586
Q10	6745	5665
Q11	1053	247
Q12	1713	1698
Q13	22896	15582
Q14	805	745
Q15	897	431
Q16	9942	9158
Q17	1588	1117
Q18	27459	3507
Q19	1711	1609
Q20	1204	1014
Q21	8671	7886
Q22	6178	724
Total	145509	83016

zuyu · 2017-02-05T19:23:43Z

Hi @jianqiao,

A quick question on Feature 1, the vector-based aggregation: for a group-by with a known bounded range (i.e., known min and max values), do we always choose this approach over the hash-based one, or does it depend on the range size (i.e., if the range is too wide, we fall back to the hash-based one)? Thanks!

jianqiao · 2017-02-05T21:15:37Z

It depends on the range size. Currently there is a gflag that sets the upper bound of the range, at ExecutionGenerator.cpp line 440.
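
The range-size check described in this reply might look roughly like the sketch below. It is an assumption for illustration: `CanUseCollisionFreeAggregation` here is a hypothetical free function, and `max_range_size` stands in for the value of the gflag, whose name is not given in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if the key range [min_key, max_key] is small enough for the
// vector-based table, and writes the number of slots to *num_entries.
// Otherwise the caller would fall back to hash-based aggregation.
bool CanUseCollisionFreeAggregation(std::int64_t min_key,
                                    std::int64_t max_key,
                                    std::uint64_t max_range_size,
                                    std::uint64_t *num_entries) {
  if (min_key > max_key) {
    return false;
  }
  // Subtract in unsigned arithmetic to avoid signed overflow when the
  // bounds are far apart (e.g. a very negative min and a very large max).
  const std::uint64_t range = static_cast<std::uint64_t>(max_key) -
                              static_cast<std::uint64_t>(min_key);
  // The table needs range + 1 slots; reject if that exceeds the limit.
  if (range >= max_range_size) {
    return false;
  }
  *num_entries = range + 1;
  return true;
}
```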

zuyu

My only concern regarding this PR is the way it deals with partitions. I guess we may merge PartitionedHashTablePool and HashTablePool so that the latter is a special case of the former with a single partition.

zuyu · 2017-02-05T21:07:19Z

expressions/aggregation/AggregationConcreteHandle.hpp

@@ -61,7 +61,7 @@ class HashTableStateUpserterFast {
   *        table. The corresponding state (for the same key) in the destination
   *        hash table will be upserted.
   **/
-  HashTableStateUpserterFast(const HandleT &handle,
+  HashTableStateUpserter(const HandleT &handle,
                             const std::uint8_t *source_state)


Align with the line above.

zuyu · 2017-02-05T21:32:49Z

query_optimizer/ExecutionGenerator.cpp

+                                       estimated_num_groups,
+                                       &max_num_groups);
+
+    if (can_use_collision_free_aggregation) {


Minor, we actually don't need this extra bool variable.

zuyu · 2017-02-05T23:45:51Z

query_optimizer/ExecutionGenerator.cpp

+#endif
+
+  // Supports only single group-by key.
+  if (aggregate->grouping_expressions().size() != 1) {


For multiple small group-by keys, I think we could create a multi-dimensional array to achieve the same goal as for the single key.

Yes, we will have a follow-up PR dealing with multiple group-by keys to improve TPC-H Q01.

zuyu · 2017-02-05T23:47:26Z

query_optimizer/ExecutionGenerator.cpp

+      break;
+    }
+    default:
+      return false;


Is overflow the reason for supporting only these types?

We can support more types later. For any type/any number of group-by keys, if we can define a one-to-one mapping function that maps the keys to range-bounded integers, then this aggregation is applicable.
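
One way to picture such a one-to-one mapping for several range-bounded integer keys is a row-major index, as in a multi-dimensional array. The sketch below is an assumption for illustration and is not taken from this PR or the planned follow-up; `KeyRange`, `TotalEntries`, and `MapKeysToIndex` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Inclusive value range of one group-by key.
struct KeyRange {
  std::int64_t min;
  std::int64_t max;
  std::size_t size() const { return static_cast<std::size_t>(max - min + 1); }
};

// Number of slots needed to cover every combination of key values.
std::size_t TotalEntries(const std::vector<KeyRange> &ranges) {
  std::size_t total = 1;
  for (const KeyRange &range : ranges) {
    total *= range.size();
  }
  return total;
}

// Maps a tuple of keys to a unique index in [0, TotalEntries(ranges)),
// using the same row-major layout as a multi-dimensional array.
std::size_t MapKeysToIndex(const std::vector<KeyRange> &ranges,
                           const std::vector<std::int64_t> &keys) {
  assert(ranges.size() == keys.size());
  std::size_t index = 0;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    assert(keys[i] >= ranges[i].min && keys[i] <= ranges[i].max);
    index = index * ranges[i].size() +
            static_cast<std::size_t>(keys[i] - ranges[i].min);
  }
  return index;
}
```

The resulting index could then be used as the slot position in the same kind of state vector that the single-key case uses.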

zuyu · 2017-02-05T23:50:35Z

query_optimizer/ExecutionGenerator.cpp

+    }
+
+    // TODO(jianqiao): Support AggregationID::AVG.
+    switch (agg_func->getAggregate().getAggregationID()) {


Refactor using QUICKSTEP_EQUALS_ANY_CONSTANT defined in utility/EqualsAnyConstant.hpp.
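
The intent of this suggestion is to replace a switch that only returns true or false with a single membership test. The sketch below shows that idea with a hypothetical variadic helper, `EqualsAny`; it is not the macro from utility/EqualsAnyConstant.hpp. The local `AggregationID` enum is a simplified stand-in, and the set of supported aggregates (COUNT and SUM) is an assumption for illustration.

```cpp
#include <cassert>

// Base case: no candidates left, so nothing matched.
template <typename T>
constexpr bool EqualsAny(const T &) {
  return false;
}

// Returns true if 'value' equals at least one of the listed candidates.
template <typename T, typename First, typename... Rest>
constexpr bool EqualsAny(const T &value, const First &first,
                         const Rest &... rest) {
  return value == first || EqualsAny(value, rest...);
}

// Simplified stand-in enum for this sketch.
enum class AggregationID { kAvg, kCount, kMax, kMin, kSum };

// Membership test replacing a switch with 'return false' in the default
// branch. Which IDs belong in the list is assumed here.
bool IsSupportedByFastPath(AggregationID id) {
  return EqualsAny(id, AggregationID::kCount, AggregationID::kSum);
}
```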

zuyu · 2017-02-06T04:50:35Z

storage/PackedPayloadHashTable.cpp

+
+  void *aligned_memory_start = this->blob_->getMemoryMutable();
+  std::size_t available_memory = num_storage_slots * kSlotSizeBytes;
+  if (align(alignof(Header),


Refactor using CHECK.

Similarly below.

zuyu · 2017-02-06T04:51:15Z

storage/PackedPayloadHashTable.cpp

+        "StorageBlob used to hold resizable "
+        "SeparateChainingHashTable is too small to meet alignment "
+        "requirements of SeparateChainingHashTable::Header.");
+  } else if (aligned_memory_start != this->blob_->getMemoryMutable()) {


Refactor using LOG_IF.

zuyu · 2017-02-06T04:52:34Z

storage/PackedPayloadHashTable.cpp

+  // Separate chaining ensures that any resized hash table with more buckets
+  // than the original table will be able to hold more entries than the
+  // original.
+  DEBUG_ASSERT(retry_num == 0);


Use DCHECK_EQ instead.

zuyu · 2017-02-06T04:53:03Z

storage/PackedPayloadHashTable.cpp

+      variable_storage_required;
+  const std::size_t resized_storage_slots =
+      this->storage_manager_->SlotsNeededForBytes(resized_memory_required);
+  if (resized_storage_slots == 0) {


Refactor using CHECK.

zuyu · 2017-02-06T04:58:21Z

storage/PackedPayloadHashTable.hpp

+    return true;
+  } else {
+    return false;
+  }


Minor, but we could flip the condition:

if (*entry_num >= header_->buckets_allocated.load(std::memory_order_relaxed)) {
  return false;
}
const char *bucket = static_cast<const char *>(buckets_) + (*entry_num) * bucket_size_;
*key = key_manager_.getKeyComponentTyped(bucket, 0);
*value = reinterpret_cast<const std::uint8_t *>(bucket + kValueOffset);
++(*entry_num);
return true;

Similarly below.

jianqiao · 2017-02-07T20:29:57Z

Regarding the question about PartitionedHashTablePool and HashTablePool: note that their use patterns are different, so perhaps it is not natural to merge them into one class.

PartitionedHashTablePool creates a fixed number of hash tables on its construction. The use pattern is that every AggregationWorkOrder updates all of these hash tables and every FinalizeAggregationWorkOrder finalizes one of these hash tables.

HashTablePool creates hash tables on demand. The current use pattern is that every AggregationWorkOrder exclusively checks out one hash table, updates it, and returns it to the pool. Then only one FinalizeAggregationWorkOrder is created to merge all the tables in the pool into the final hash table.
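
The on-demand pattern described for HashTablePool can be sketched as below. This is a simplified illustration under assumptions, not the real class: `OnDemandTablePool`, `checkOut`, `giveBack`, `mergeAll`, and `numCreated` are hypothetical names, and a `std::unordered_map` holding SUM states stands in for the aggregation hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for an aggregation hash table: group-by key -> running SUM.
using Table = std::unordered_map<std::int64_t, std::int64_t>;

class OnDemandTablePool {
 public:
  // Hands out a table for exclusive use, creating one only if no
  // previously returned table is available.
  std::unique_ptr<Table> checkOut() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_tables_.empty()) {
      ++num_created_;
      return std::unique_ptr<Table>(new Table());
    }
    std::unique_ptr<Table> table = std::move(free_tables_.back());
    free_tables_.pop_back();
    return table;
  }

  // Returns a table to the pool once a work order is done with it.
  void giveBack(std::unique_ptr<Table> table) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_tables_.push_back(std::move(table));
  }

  // Single finalization step: merges every pooled table into one result.
  Table mergeAll() {
    std::lock_guard<std::mutex> lock(mutex_);
    Table result;
    for (const std::unique_ptr<Table> &table : free_tables_) {
      for (const auto &entry : *table) {
        result[entry.first] += entry.second;
      }
    }
    return result;
  }

  std::size_t numCreated() {
    std::lock_guard<std::mutex> lock(mutex_);
    return num_created_;
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<Table>> free_tables_;
  std::size_t num_created_ = 0;
};
```

By contrast, the partitioned pattern would allocate a fixed number of tables at construction and finalize each one separately, which is why a merge step like `mergeAll` has no counterpart there.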

zuyu · 2017-02-07T21:05:49Z

Please resync with the master branch, and I will merge it. Thanks.

jianqiao · 2017-02-07T21:18:06Z

Just rebased on master.

jianqiao · 2017-02-07T21:28:17Z

There is a CMakeLists to be updated -- do not merge at this moment.

jianqiao · 2017-02-07T21:35:02Z

Updated, and tested locally.

pateljm · 2017-02-07T22:36:14Z

Very impressive algorithmic change @jianqiao !!

asfgit force-pushed the collision-free-agg branch 2 times, most recently from 7285f90 to fe2ec54 Compare February 5, 2017 18:11

asfgit force-pushed the collision-free-agg branch from b6a2059 to 0dce4d2 Compare February 5, 2017 21:11

asfgit force-pushed the collision-free-agg branch from 0dce4d2 to c70485b Compare February 6, 2017 01:47

zuyu suggested changes Feb 6, 2017

asfgit force-pushed the collision-free-agg branch from c70485b to 0dce4d2 Compare February 7, 2017 16:37

asfgit force-pushed the collision-free-agg branch from 0dce4d2 to 8dbac18 Compare February 7, 2017 21:16

asfgit force-pushed the collision-free-agg branch from 8dbac18 to 2d89e4f Compare February 7, 2017 21:29

- Adds CollisionFreeVectorTable to support specialized fast path aggr…

2d89e4f

…egation for range-bounded single integer group-by key. - Supports copy elision for aggregation.

asfgit merged commit 2d89e4f into master Feb 7, 2017

asfgit deleted the collision-free-agg branch February 7, 2017 22:35
