Optimize element_at for maps with complex type keys #7365

mbasmanova · 2023-11-01T13:04:05Z

Implementation of element_at(map, search) uses linear search to find matching
entry. This is slow when maps are large (100s of entries). This optimization
sorts map keys and uses binary search producing 4x speedup on a real-world
workload. The optimization applies only if there is a single map (constant or
dictionary encoded) since in this case the cost of sorting the map amortizes
across many searches.

There is a standard query optimization technique that generates element_at
over a single map.

A simple join with a small lookup table u:

SELECT * FROM t, u WHERE t.id = u.id

Can be written as

SELECT
   element_at(
      (SELECT map_agg(u.id) FROM u), t.id
   ) 
FROM t

This ensures that records from u will be collected into a single map and
broadcasted to all workers to be "joined" with t. The original join query may
run as partitioned join (if optimizer is not smart enough) and be very slow.

netlify · 2023-11-01T13:04:12Z

✅ Deploy Preview for meta-velox canceled.

Name	Link
🔨 Latest commit	`7c768af`
🔍 Latest deploy log	https://app.netlify.com/sites/meta-velox/deploys/6543c73a335272000863f796

facebook-github-bot · 2023-11-01T13:04:45Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Yuhta · 2023-11-01T13:42:45Z

How is this related to #7191? CC: @laithsakka

Yuhta · 2023-11-01T13:46:21Z

velox/functions/lib/SubscriptUtil.h

-      size_t offset = rawOffsets[mapIndex];
+    // Fast path for the case of a single map. It may be constant or dictionary
+    // encoded. Sort map keys, then use binary search.
+    if (baseMap->size() == 1) {


@laithsakka How does this criteria compare to your method to check identity between batches?

@mbasmanova Laith is using a method to check if map vector between batches are the same. Your method looks simpler and less error-prone, while his method covers more cases.

@Yuhta I'm aware of Laith's work. This optimization is complimentary as it covers maps with complex-type keys while #7191 applies only to maps with primitive type keys.

Also, #7191 applies to cases where there are many maps, not just one.

I think it's possible to expand this one to all key types and use Laith's triggering criteria to detect MapVector identity. Then we can use one single optimization to cover all cases.

@Yuhta @laithsakka Indeed, it should be straightforward to extend this optimization to primitive types. We can also optimize Nested Loop Join to detect the case of single-row build side and use Constant encoding for build-side columns + sort build-side maps. If we do that, we'll have maps coming to element_at already sorted and that will make this optimization even more effective.

Laith, what do you think?

@Yuhta @laithsakka There is another pattern where map is a constant defined in the query. In this case it is not produced by the NLJ. We can add logic to always sort constant map literals so that element_at receives maps sorted in that case too. If we do that, we won't need to make element_at stateful and can keep it stateless and relatively simple.

I thought about both ideas initially and discussed with @oerling and @kevinwilfong .
My concern was that we have to sort for every batch because I do not have so much confidence
in the isSorted flag is reset whenever a map is updated. we can do the same game of holding a reference though to make sure the map is not updated and hence assert that if it was sorted in one batch it will always be sorted when received.

I do not have string opinion for either, maybe run this opt on the query i am working on also and check the results?
if you strongly feel this is better you can take over and abandon the other PR but also make sure that you double check with @oerling

Yuhta · 2023-11-01T13:51:27Z

If we just sort everything (not just complex typed keys) and binary search if the input is a single map, how does the result compare to the one in #7191?

laithsakka · 2023-11-01T17:47:24Z

one more thing i recall from the discussions with orri is that those base maps can be shared across threads, we need to add locks in sorting if we are doing that. in the functions . unless we sort pre before distributing the vector

mbasmanova · 2023-11-01T21:22:09Z

If we just sort everything (not just complex typed keys) and binary search if the input is a single map, how does the result compare to the one in #7191?

@Yuhta I tried this on Laith's query and it didn't work very well. In his case, map has 70K entries and each batch has only 1K rows. It seems that sorting 70K entry map doesn't get amortized over 1K binary searches and cached hash table works better.

Yuhta · 2023-11-02T15:15:36Z

velox/functions/lib/SubscriptUtil.h

@@ -23,6 +23,22 @@

 namespace facebook::velox::functions {

+namespace {


No anonymous namespace in header. Either put it in detail or the main class SubscriptImpl

Yuhta · 2023-11-02T15:15:56Z

velox/functions/lib/SubscriptUtil.h

+  const vector_size_t baseIndex;
+  const vector_size_t index;
+
+  static bool lessThen(const MapKey& left, const MapKey& right) {


lessThan, or you can just overload operator< since it's a custom struct already

facebook-github-bot · 2023-11-02T16:00:06Z

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2023-11-02T19:33:49Z

@mbasmanova merged this pull request in 34aa5ac.

conbench-facebook · 2023-11-02T19:55:19Z

Conbench analyzed the 1 benchmark run on commit 34aa5ac9.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

mbasmanova requested review from laithsakka, oerling and xiaoxmeng November 1, 2023 13:04

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 1, 2023

Yuhta reviewed Nov 1, 2023

View reviewed changes

Yuhta approved these changes Nov 2, 2023

View reviewed changes

...

7c768af

mbasmanova force-pushed the opt-element-at branch from dc5b60f to 7c768af Compare November 2, 2023 15:58

facebook-github-bot closed this in 34aa5ac Nov 2, 2023

facebook-github-bot added the Merged label Nov 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize element_at for maps with complex type keys #7365

Optimize element_at for maps with complex type keys #7365

mbasmanova commented Nov 1, 2023

netlify bot commented Nov 1, 2023 •

edited

facebook-github-bot commented Nov 1, 2023

Yuhta commented Nov 1, 2023

Yuhta Nov 1, 2023

mbasmanova Nov 1, 2023

mbasmanova Nov 1, 2023

Yuhta Nov 1, 2023 •

edited

mbasmanova Nov 1, 2023

mbasmanova Nov 1, 2023

laithsakka Nov 1, 2023 •

edited

Yuhta commented Nov 1, 2023

laithsakka commented Nov 1, 2023 •

edited

mbasmanova commented Nov 1, 2023

Yuhta Nov 2, 2023

Yuhta Nov 2, 2023

facebook-github-bot commented Nov 2, 2023

facebook-github-bot commented Nov 2, 2023

conbench-facebook bot commented Nov 2, 2023

		@@ -23,6 +23,22 @@

		namespace facebook::velox::functions {

		namespace {

Optimize element_at for maps with complex type keys #7365

Optimize element_at for maps with complex type keys #7365

Conversation

mbasmanova commented Nov 1, 2023

netlify bot commented Nov 1, 2023 • edited

✅ Deploy Preview for meta-velox canceled.

facebook-github-bot commented Nov 1, 2023

Yuhta commented Nov 1, 2023

Yuhta Nov 1, 2023

Choose a reason for hiding this comment

mbasmanova Nov 1, 2023

Choose a reason for hiding this comment

mbasmanova Nov 1, 2023

Choose a reason for hiding this comment

Yuhta Nov 1, 2023 • edited

Choose a reason for hiding this comment

mbasmanova Nov 1, 2023

Choose a reason for hiding this comment

mbasmanova Nov 1, 2023

Choose a reason for hiding this comment

laithsakka Nov 1, 2023 • edited

Choose a reason for hiding this comment

Yuhta commented Nov 1, 2023

laithsakka commented Nov 1, 2023 • edited

mbasmanova commented Nov 1, 2023

Yuhta Nov 2, 2023

Choose a reason for hiding this comment

Yuhta Nov 2, 2023

Choose a reason for hiding this comment

facebook-github-bot commented Nov 2, 2023

facebook-github-bot commented Nov 2, 2023

conbench-facebook bot commented Nov 2, 2023

netlify bot commented Nov 1, 2023 •

edited

Yuhta Nov 1, 2023 •

edited

laithsakka Nov 1, 2023 •

edited

laithsakka commented Nov 1, 2023 •

edited