 Hash Join Cost Update #1344
This PR focuses on improving the cost model estimate for Hash Joins. At first we wanted to tackle join reordering, but we realized the cost model itself needed work to produce more accurate estimates before join reordering would be worthwhile.
Along the way we encountered bugs in the statistics and cost code, which unfortunately took time to fix. The primary addition is the improvement for the cost estimate in
It also took time to research the best approach for estimating the cost, as there is surprisingly little literature on the topic. In the end, we decided to try the approach Postgres uses for its hash join cost estimate.
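For readers unfamiliar with that approach, here is a rough sketch of the CPU part of a Postgres-style hash join cost. It is a simplification for illustration, not the code in this PR: the function name and parameters are hypothetical, the constants are Postgres's default `cpu_tuple_cost` and `cpu_operator_cost`, and child plan costs, I/O, batching, and output-tuple costs are left out.

```python
# Assumed constants: PostgreSQL's default planner cost parameters.
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def hash_join_cpu_cost(outer_rows, inner_rows, num_hash_keys, bucket_fraction):
    """Hypothetical helper: approximate CPU cost of a hash join.

    bucket_fraction is the estimated fraction of inner rows that share
    a bucket with a given probe row (between 0 and 1).
    """
    # Build phase: hash every inner row and insert it into the table.
    build = (CPU_OPERATOR_COST * num_hash_keys + CPU_TUPLE_COST) * inner_rows

    # Probe phase: hash every outer row.
    probe_hash = CPU_OPERATOR_COST * num_hash_keys * outer_rows

    # Each probe compares against the rows in one bucket; on average
    # about half of the bucket is visited before the search stops.
    rows_per_bucket = max(1.0, inner_rows * bucket_fraction)
    compare = (CPU_OPERATOR_COST * num_hash_keys
               * outer_rows * rows_per_bucket * 0.5)

    return build + probe_hash + compare
```

The point of the sketch is that the estimate depends heavily on `bucket_fraction`: a skewed or low-cardinality join key makes buckets larger and the comparison term dominate.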
The first PR is #1285, for reference. It largely concerned the testing infrastructure, whose code is also included in this PR.
Overall the code looks good. I feel like there are some extra things that could be tried on the hash join cost model, like using statistics more to find correlations between columns (not sure if you tried some things that ended up not working).
I have a question that might be silly. From what I gathered, the bucket size estimation looks at all the join keys and picks the optimal one. Isn't that cost only accurate if the optimizer uses the same values to make that decision?
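To make the question concrete, here is a minimal sketch of that kind of bucket size estimation, loosely modeled on how Postgres estimates bucket size. The function names and the dictionary layout of the per-key statistics are assumptions for illustration and do not come from this PR.

```python
def bucket_fraction(ndistinct, nbuckets, mcv_freq=0.0):
    """Hypothetical helper: estimated fraction of inner rows that land
    in the same bucket as a probe row, for one join key."""
    ndistinct = max(float(ndistinct), 1.0)

    # With fewer distinct values than buckets, each value fills one
    # bucket; otherwise rows spread evenly over all buckets.
    frac = 1.0 / min(ndistinct, float(nbuckets))

    # Skew adjustment: if the most common value is more frequent than
    # the average value, the worst bucket is proportionally larger.
    avg_freq = 1.0 / ndistinct
    if mcv_freq > avg_freq:
        frac *= mcv_freq / avg_freq

    # Clamp to a sane range.
    return min(1.0, max(1e-6, frac))


def best_bucket_fraction(keys, nbuckets):
    """Pick the smallest (most selective) estimate over all join keys.

    keys is a list of dicts with 'ndistinct' and optional 'mcv_freq'
    (an assumed layout for this sketch)."""
    return min(
        bucket_fraction(k["ndistinct"], nbuckets, k.get("mcv_freq", 0.0))
        for k in keys
    )
```

Taking the minimum assumes the hash table spreads rows at least as well as the best single key would, so the estimate only holds if the hash table actually built at execution time behaves that way.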
I look forward to seeing how this new model performs on the perf tests you wrote for the first PR.
Also, I looked at the fixes from the first PR and they look good as well.