VectorHasher's value ID caching logic makes certain queries unnecessarily slow #10057
@zeodtr Thank you for reporting this issue with so much detail.
This sounds similar to #9843. CC: @Yuhta @xiaoxmeng
@Yuhta Jimmy, let's first find out which code produced such a dictionary vector. It might be better to change that code to avoid producing such vectors (similar to Unnest).
@mbasmanova Agree that we should find out the code producing this dictionary and the selectivity in this case (out of exchange), because peeling can be inefficient on this data as well. But in the general case I see this can happen legitimately, for example whenever we use a dictionary to filter rows (remaining filter / join filter). As in
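To make the "dictionary used to filter rows" idea concrete, here is a minimal toy model (my own simplification, not Velox's actual `DictionaryVector` API): a large base vector plus an index buffer that selects only the surviving rows, so filtering requires no copy of the base data.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Toy model of a dictionary-encoded vector: a (possibly huge) base vector
// plus indices that select which base rows are visible.
struct DictVec {
  std::vector<int64_t> base;     // base values; may be millions of entries
  std::vector<int32_t> indices;  // one entry per surviving row
};

// Wrap the rows of `base` that passed a filter in a dictionary instead of
// copying them out. Only the index buffer reflects the filter.
DictVec filterAsDictionary(std::vector<int64_t> base,
                           const std::vector<bool>& passed) {
  DictVec out;
  for (int32_t i = 0; i < static_cast<int32_t>(passed.size()); ++i) {
    if (passed[i]) {
      out.indices.push_back(i);
    }
  }
  out.base = std::move(base);  // base is kept whole, untouched by the filter
  return out;
}
```

With a highly selective filter this produces exactly the shape reported in this issue: a base of 1,066,500 values referenced by only 1,024 indices.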
…hen it is not beneficial Summary: Similar to facebookincubator#7150, when we only need to make IDs for a small number of rows fewer than dictionary values, using cache is slower and we should just compute the IDs directly. Related issue: facebookincubator#10057 Differential Revision: D58215380
@zeodtr #10084 is a fix for
@Yuhta Thank you very much for your fix. Since I am out of the office this week, I will try it next week. I have this for the plan printout for now (I have changed names and deleted the column names in
Also, I will try to diagnose the dictionary vector. However, given my current knowledge of Velox's internals, I may have difficulty locating the place. Creating a unit test case might not be possible due to the large amount of data in the tables. Thank you.
@zeodtr Thanks for the detail. Just a question: in which plan tree are you observing the slowness? Is it
@Yuhta I'm not entirely sure, but I guess it's:

{Driver: running Exchange(0)<xdb_cpu_executor_task_3:0.0 0x7f91a9896700> HashProbe(1)<xdb_cpu_executor_task_3:0.0 0x7f91a9845a00> FilterProject(2)<xdb_cpu_executor_task_3:0.0 0x7f91a9835c00> Aggregation(3)<xdb_cpu_executor_task_3:0.0 0x7f91a9834000> PartitionedOutput(4)<xdb_cpu_executor_task_3:0.0 0x7f91a9865900> CallbackSink(5)<xdb_cpu_executor_task_3:0.0 0x7f91a9896a80> {OpCallStatus: executing HashProbe::getOutput for 0ms}}

{Driver: running Exchange(0)<xdb_cpu_executor_task_3:0.0 0x7f91a9896700> HashProbe(1)<xdb_cpu_executor_task_3:0.0 0x7f91a9845a00> FilterProject(2)<xdb_cpu_executor_task_3:0.0 0x7f91a9835c00> Aggregation(3)<xdb_cpu_executor_task_3:0.0 0x7f91a9834000> PartitionedOutput(4)<xdb_cpu_executor_task_3:0.0 0x7f91a9865900> CallbackSink(5)<xdb_cpu_executor_task_3:0.0 0x7f91a9896a80> {OpCallStatus: executing Aggregation::addInput for 0ms}}

Each executor process spent most of its time on a different plan: (maybe) one process was busy executing HashProbe::getOutput, and the other Aggregation::addInput.

Thank you.
@Yuhta For your information, the version of Velox for my executor process is taken from the upstream code as of January 2024, specifically from the commit titled "Rename getDataType and getDataChannels funcs in HiveDataSink (#8404)" on January 17, 2024. |
Plan 2 has Aggregation and Join. It seems likely that the "bad" dictionary was produced by the Join and is causing trouble during Aggregation.
I think it's likely due to the HashProbe in plan[2] wrapping its input in a dictionary to filter out non-matching rows. The ratio 5600/2763772 has the same order of magnitude as 1024/1066500. @mbasmanova I think the solution should belong to the same story as #7801
@Yuhta I've tried #10084.
The execution times for the query are as follows:
Here are my thoughts:
Thank you.
The exact condition for using the cache is tricky and depends on the data. The current solution makes sure there is no large regression when we use a dictionary for filtering. For further investigation, it would be nice if you could find out what the dictionary vector is wrapping around. If it is wrapping probe-side rows, I would imagine it's used exclusively for filtering, so the cache should not improve performance here. Build-side rows are extracted from the row container, so they should not be in a dictionary. So it's a bit of a mystery why the cache is beneficial. Maybe the join duplicates the probe-side rows in some cases?
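My reading of the condition behind the #10084 fix can be sketched as follows (the function name and the exact threshold are my own assumptions, not Velox code): precomputing a value-ID cache touches every dictionary value, so it only pays off when at least that many rows will actually be hashed.

```cpp
#include <cstddef>

// Hedged sketch of the cache-vs-direct decision. Assumption: caching costs
// O(baseSize) per batch (clearing + filling the per-value cache), while
// computing IDs directly costs O(numSelectedRows). Cache only when the
// selection is at least as large as the dictionary's value set.
bool shouldUseValueIdCache(std::size_t numSelectedRows, std::size_t baseSize) {
  return numSelectedRows >= baseSize;
}
```

For the shapes in this issue: 1,024 selected rows over a 1,066,500-value base would skip the cache, while 1,024 rows over an 11,900-value base (the M1 shape) is borderline and the real heuristic may choose differently.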
I've investigated further. The sparse dictionary vectors are returned from

The shape of the data for the two tables (t1, t2) is as follows.
The queries for each count result (slightly modified to hide the real table and column names) are as follows:

-- 1.
SELECT count(name)
FROM t1
WHERE tm_col >= timestamp '2022-07-11 03:00:00'
AND tm_col < timestamp '2022-07-11 03:10:00'
;
-- 2.
SELECT count(DISTINCT name)
FROM t1
WHERE tm_col >= timestamp '2022-07-11 03:00:00'
AND tm_col < timestamp '2022-07-11 03:10:00'
;
-- 3.
SELECT count(name)
FROM t2
WHERE tm_col >= timestamp '2022-09-30 05:30:00'
AND tm_col < timestamp '2022-09-30 06:00:00'
;
-- 4.
SELECT count(DISTINCT name)
FROM t2
WHERE tm_col >= timestamp '2022-09-30 05:30:00'
AND tm_col < timestamp '2022-09-30 06:00:00'
;
-- 5., 6.
SELECT count(name), count(DISTINCT name)
FROM (
SELECT name AS name
FROM t1
WHERE tm_col >= timestamp '2022-07-11 03:00:00'
AND tm_col < timestamp '2022-07-11 03:10:00'
) JOIN (
SELECT ip, name AS name2
FROM t2
WHERE tm_col >= timestamp '2022-09-30 05:30:00'
AND tm_col < timestamp '2022-09-30 06:00:00'
) ON name = name2
;
-- 7., 8
SELECT count(name1), count(DISTINCT name1)
FROM (
SELECT name AS name1
FROM t1
WHERE tm_col >= timestamp '2022-07-11 03:00:00'
AND tm_col < timestamp '2022-07-11 03:10:00'
)
WHERE NOT EXISTS
(
SELECT 1
FROM t2
WHERE tm_col >= timestamp '2022-09-30 05:30:00'
AND tm_col < timestamp '2022-09-30 06:00:00'
AND name = name1
)
;
-- 9., 10
SELECT count(name1), count(DISTINCT name1)
FROM (
SELECT name AS name1
FROM t1
WHERE tm_col >= timestamp '2022-07-11 03:00:00'
AND tm_col < timestamp '2022-07-11 03:10:00'
)
WHERE EXISTS
(
SELECT 1
FROM t2
WHERE tm_col >= timestamp '2022-09-30 05:30:00'
AND tm_col < timestamp '2022-09-30 06:00:00'
AND name = name1
)
;
I see, so the build side is both duplicating and filtering. Agreed that the current solution should be enough, unless there is a very important use case that requires us to optimize for this data shape.
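"Both duplicating and filtering" can be illustrated with the same toy dictionary model as before (my own simplification, not Velox code): a join's output indices may skip some base rows entirely (filtering) while repeating others that matched several build rows (duplicating).

```cpp
#include <cstdint>
#include <vector>

// Materialize a dictionary over `base` given join-output `indices`.
// Indices may omit base rows (filtered out) and repeat rows (a probe row
// that matched multiple build rows appears once per match).
std::vector<int64_t> gatherJoinOutput(const std::vector<int64_t>& base,
                                      const std::vector<int32_t>& indices) {
  std::vector<int64_t> out;
  out.reserve(indices.size());
  for (int32_t i : indices) {
    out.push_back(base[i]);  // one lookup per output row, base untouched
  }
  return out;
}
```

For example, indices {1, 1, 3} over a four-row base both drop rows 0 and 2 and duplicate row 1, which is exactly the shape where a per-value ID cache could in principle help but a sparse selection makes it too expensive to maintain.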
Description
Hi,
(I believe this issue is more of a performance bug report rather than an enhancement suggestion. However, since it is not a functional bug, I have chosen to classify it under the 'enhancement' category.)
I am building an OLAP DBMS that uses Velox as its execution engine. In this scenario, two executor processes exchange intermediate results. The query, slightly modified to hide the real table name, is as follows:
The subqueries' resulting record counts are 2,763,772 and 5,600, respectively.
The query became very slow after applying #7404 to my local Velox repository.
I investigated it and found the source code line that causes the problem is in ExchangeQueue.cpp.
The line is as follows:
The code link is as follows: velox/velox/exec/ExchangeQueue.cpp, line 121 in 3a7f8a8.
After I changed the code to the following, the query became significantly faster (from 176 secs to 43 secs).
(I will refer to this modification as M1.)
The modified code effectively disables what #7404 tries to achieve. Strange.
So, I've run valgrind with callgrind on the original code and the resulting performance graph of one of the two executor processes was as follows:
memset() was taking a big portion of the runtime. It is called by std::fill(), which is called by VectorHasher::makeValueIdsDecoded(). The code link is as follows: velox/velox/exec/VectorHasher.cpp, line 252 in 3a7f8a8.
M1's valgrind result graph is as follows:

memset()'s portion is now negligible.

Upon further investigation, I found that the DecodedVector's base vector size becomes disproportionately large relative to the SelectivityVector's size in the unmodified code. For example, the DecodedVector's base vector size is 1,066,500, while the SelectivityVector's size is 1024. For M1, the DecodedVector's base vector size is 11,900, while the SelectivityVector's size is 1024. As a result, the cost of clearing the value ID cache outweighs the benefits of caching.
When I removed the caching logic, the query performed as quickly as M1.
Since I am working with a modified version of Velox's source code and not the current official source, I cannot be completely certain that this issue is present in the current version. However, I believe there is a high probability that it exists.
It would be nice if VectorHasher could be made more intelligent to avoid this kind of issue (for example, by disabling the caching logic when the DecodedVector's size is too big relative to the SelectivityVector's size). Thank you.