Closed

Stolb27 commented Apr 24, 2020
darthunix reviewed Apr 24, 2020
darthunix reviewed Apr 24, 2020
darthunix approved these changes Apr 27, 2020

Author (Collaborator) commented:
Request for test cancelled.
RekGRpth pushed a commit that referenced this pull request on Dec 3, 2025:
Fix Orca cost model to prefer hashing smaller tables
Previously it was possible for Orca to produce bad hash join plans that hashed the much bigger table. This happened because Orca's cost model charges a cost for the columns used in the join conditions, and this cost was smaller when tuples were hashed (the inner child) than when tuples were fed from the outer child. This doesn't really make sense, since with enough join conditions it could make Orca hash the bigger table, no matter how much bigger that table is. To make sure this never happens, increase the cost per join column for the inner child so that it is bigger than for the outer child (just as the per-byte cost already is).
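As a rough illustration of the problem, here is a toy sketch with made-up constants (not Orca's actual cost model): when the per-join-column cost is lower on the inner child than on the outer child, that term can dominate the per-byte cost once there are enough join conditions and flip the choice towards hashing the bigger table.
```python
# Toy cost sketch, NOT Orca's actual cost model; every constant here is made up.
# Cost of one side of a hash join: rows * (per_byte * width + per_col * join_cols).
def side_cost(rows, width, join_cols, per_byte, per_col):
    return rows * (per_byte * width + per_col * join_cols)

def plan_cost(outer, inner, join_cols, outer_per_col, inner_per_col):
    # The per-byte cost is already higher for the inner (hashed) child (2.0 vs 1.0),
    # which on its own steers the model towards hashing the smaller table.
    return (side_cost(*outer, join_cols, per_byte=1.0, per_col=outer_per_col)
            + side_cost(*inner, join_cols, per_byte=2.0, per_col=inner_per_col))

big, small = (1_000_000, 8), (100_000, 8)  # hypothetical (rows, width) pairs

# Old behaviour: the per-join-column cost is *lower* on the inner side (1.0 < 5.0).
# With 1 join column hashing the small table still wins, but with 10 join columns
# the per-column term dominates and hashing the big table looks cheaper.
for join_cols in (1, 10):
    hash_small = plan_cost(big, small, join_cols, outer_per_col=5.0, inner_per_col=1.0)
    hash_big   = plan_cost(small, big, join_cols, outer_per_col=5.0, inner_per_col=1.0)
    print(f"{join_cols} join col(s): hash small = {hash_small:.0f}, hash big = {hash_big:.0f}")

# The fix: make the inner per-join-column cost larger than the outer one (6.0 > 5.0),
# so that (for equal widths) the bigger table is never the cheaper side to hash.
fixed_small = plan_cost(big, small, 10, outer_per_col=5.0, inner_per_col=6.0)
fixed_big   = plan_cost(small, big, 10, outer_per_col=5.0, inner_per_col=6.0)
print(f"fixed, 10 join cols: hash small = {fixed_small:.0f}, hash big = {fixed_big:.0f}")
```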
Additionally, Orca increased the cost per join column for the outer child when spilling was predicted, which doesn't make sense either, since spilling does not add any extra hashing. The Postgres planner only imposes an additional per-byte (or rather per-page) cost when a hash join spills, so Orca should use the same per-join-column cost for both the spilling and non-spilling cases.
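A similarly hedged sketch of the spilling argument (again made-up constants, not the real Orca or Postgres formulas): spilling should only add a per-byte/per-page I/O term, because each tuple is hashed the same number of times whether or not the hash table spills.
```python
# Toy sketch with made-up constants, not the real Orca or Postgres cost functions.
def child_cost(rows, width, join_cols, spills,
               per_byte=2.0, per_col=6.0, spill_io_per_byte=4.0):
    cost = rows * (per_byte * width + per_col * join_cols)
    if spills:
        # Spilling writes tuples to disk and reads them back, so it should only
        # add a per-byte (per-page) I/O term; each tuple is still hashed exactly
        # once, so the per-join-column term stays the same.
        cost += rows * width * spill_io_per_byte
    return cost

print(child_cost(1_000_000, 8, 3, spills=False))  # 34,000,000
print(child_cost(1_000_000, 8, 3, spills=True))   # 66,000,000: only the I/O part grew
```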
A lot of tests are affected by this change, but for most of them only the costs change. For some, hash joins are reordered, swapping the inner and outer children, since Orca previously hashed the bigger child in some cases. In the case of LOJNullRejectingZeroPlacePredicates.mdp this actually restores the old plan specified in the comment. A new regress test is also added.
One common change in some tests is replacing a Hash Semi Join with a regular Hash Join plus Sort + GroupAggregate. There is only a Left Semi Join, so swapping the inner and outer children is impossible for semi joins. This means it is slightly cheaper in some cases to convert a Hash Semi Join into a regular Hash Join in order to be able to swap the children. The opposite conversion also takes place where a GroupAggregate was previously used.
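A toy illustration of why the conversion can pay off (hypothetical row counts and per-row costs, not Orca's actual numbers): if the semi join would be forced to hash the much bigger input, hashing the smaller input in a regular Hash Join and de-duplicating afterwards can still be cheaper.
```python
# Toy comparison with made-up constants, not Orca's costing. If a semi join is
# forced to hash its (bigger) inner input, a regular hash join that hashes the
# smaller input plus a de-duplication step (standing in for Sort + GroupAggregate)
# can come out cheaper overall.
HASH_COST_PER_ROW  = 3.0   # hypothetical cost to build one hash-table entry
DEDUP_COST_PER_ROW = 1.0   # hypothetical cost to sort/aggregate one joined row

big_rows, small_rows, joined_rows = 10_000_000, 50_000, 60_000

semi_join_cost  = HASH_COST_PER_ROW * big_rows                # must hash the big child
join_plus_dedup = HASH_COST_PER_ROW * small_rows + DEDUP_COST_PER_ROW * joined_rows
print(semi_join_cost, join_plus_dedup)  # ~30,000,000 vs ~210,000: the conversion wins here
```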
Another common change is that HashJoin(table1, Broadcast(table2)) gets replaced with HashJoin(Redistribute(table1), Redistribute(table2)), adding another slice. This happens because the cost of hashing is now slightly higher, so Orca prefers to split the hashing of table2 across all segments instead of having every segment hash all the rows, as it would with Broadcast.
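A back-of-the-envelope comparison of the total hashing volume under the two distribution choices (the segment count and row count below are hypothetical):
```python
# Back-of-the-envelope hashing volume, not Orca's actual costing; the segment
# count and row count are made up.
segments    = 16
table2_rows = 5_000_000   # hypothetical inner (hashed) table

broadcast_hashed    = table2_rows * segments  # every segment builds a full hash table
redistribute_hashed = table2_rows             # rows are split, each hashed on one segment
print(broadcast_hashed, redistribute_hashed)  # 80,000,000 vs 5,000,000
# The price of Redistribute is an extra Motion (another slice) for table1, but with
# a slightly higher per-row hashing cost the split build wins more often.
```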
Below are some notable changes in minidump files:
- ExtractPredicateFromDisjWithComputedColumns.mdp
This patch changes the join order from ((cust, sales), datedim) to ((datedim, sales), cust). All three tables are identical from Orca's point of view: they are all empty and all table scans are 24 bytes wide, so there is no reason for Orca to prefer one join order over another, since they all have the same cost.
- HAWQ-TPCH-Stat-Derivation.mdp
The only change in the plan is swapping the children of the 3rd Hash Join, the one joining lineitem_ao_column_none_level0 with HashJoin(partsupp_ao_column_none_level0, part_ao_column_none_level0).
lineitem_ao_column_none_level0 is predicted to have approximately 22 billion
rows and the hash join is predicted to have approximately 10 billion rows, so
making the hash join the inner child is good in this case, since the smaller
relation is hashed.
- Nested-Setops-2.mdp
Same here. Two swaps were performed between dept and emp in two different places. dept contains 1 row and emp contains 10001 rows, so it's better if dept is hashed. A Redistribute Motion was also replaced with a Broadcast Motion in both cases.
- TPCH-Q5.mdp
Probably the best improvement out of these plans. The previous plan had this
join order:
```
-> Hash Join (6,000,000 rows)
-> Hash Join (300,000,000 rows)
-> lineitem (1,500,000,000 rows)
-> Hash Join (500,000 rows)
-> supplier (2,500,000 rows)
-> Hash Join (5 rows)
-> nation (25 rows)
-> region (1 row)
-> Hash Join (100,000,000 rows)
-> customer (40,000,000 rows)
-> orders (100,000,000 rows)
```
which hashes 100 million rows twice (first orders, then its hash join with customer). The new plan has no such issue:
```
-> Hash Join (6,000,000 rows)
-> Hash Join (170,000,000 rows)
-> lineitem (1,500,000,000 rows)
-> Hash Join (20,000,000 rows)
-> orders (100,000,000 rows)
-> Hash Join (7,000,000 rows)
-> customer (40,000,000 rows)
-> Hash Join (5 rows)
-> nation (25 rows)
-> region (1 row)
-> supplier (2,500,000 rows)
```
This plan only hashes around 30 million rows in total, much better than 200
million.
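For reference, a quick tally of the row estimates of every hashed (inner) child in the two plans above reproduces those totals (a throwaway check, not part of the patch):
```python
# Summing the row estimates of every hashed (inner) child in the two plans above.
old_plan_hashed = [500_000, 5, 1, 100_000_000, 100_000_000]
#                  supplier subtree, nation/region join, region, customer/orders join, orders
new_plan_hashed = [20_000_000, 7_000_000, 5, 1, 2_500_000]
#                  orders subtree, customer subtree, nation/region join, region, supplier
print(sum(old_plan_hashed))  # 200,500,006 -- roughly 200 million rows hashed
print(sum(new_plan_hashed))  # 29,500,006  -- roughly 30 million rows hashed
```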
Ticket: ADBDEV-8413
According to https://github.com/greenplum-db/gpdb/pull/9985#issuecomment-618426312