New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[15721] Hash Join Cost Update #1344

Open
wants to merge 47 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@GustavoAngulo
Member

GustavoAngulo commented May 5, 2018

This PR focuses on improving the cost model estimate for Hash Joins. At first we wanted to tackle join reordering, however, we realized the cost model itself needed work to provide more accurate estimates before we tackled join reordering.

Along the way we encountered bugs for the statistics and costs, which unfortunately took time to fix. The primary addition is the improvement for the cost estimate in cost_calculator.cpp.

It also took time to research the best approach for estimating the cost, as there is surprisingly little literature on the topic. In the end, we decided to try what Postgres does for their hash join cost estimate.

First PR is #1285 for reference. The first PR largely concerned the testing infrastructure, who's code is also included in this PR.

GustavoAngulo and others added some commits Mar 23, 2018

nappelson and others added some commits May 6, 2018

@gangsterveggies

Overall the code looks good. I feel like there are some extra things that could be tried on the hash join cost model, like using statistics more to find correlations between columns (not sure if you tried some things that ended up not working).

I have a question that might be silly, but the way the bucket size estimation is working, from what I gathered, is looking at all the join keys and picking the optimal, but this cost is only accurate if the optimizer is using the same values to make that decision.

I look forward to see how this new model performed on the perf tests you wrote for the first PR.

Also, I looked at the fixes from the first PR and they look good as well.

@@ -18,209 +18,168 @@
// CONNECTIONS
//===----------------------------------------------------------------------===//
// Peloton port
SETTING_int(port,

This comment has been minimized.

@gangsterveggies

gangsterveggies May 9, 2018

These changes look accidental, no need to change the indentation of this file.

output_cost_ = table_stats->num_rows * DEFAULT_TUPLE_COST;
}
void PostgresCostCalculator::Visit(
UNUSED_ATTRIBUTE const PhysicalIndexScan *op) {

This comment has been minimized.

@gangsterveggies

gangsterveggies May 9, 2018

The attribute op is being used, so no need for the UNUSED_ATTRIBUTE flag.

StatsStorage::GetInstance()->GetTableStats(
op->table_->GetDatabaseOid(), op->table_->GetTableOid(), txn_));
if (table_stats->GetColumnCount() == 0) {
output_cost_ = 1.f;

This comment has been minimized.

@gangsterveggies

gangsterveggies May 9, 2018

Why is this cost 1 if the column count is 0? We can consider that even though this is case is a no op since there are no columns this operation has some constant cost, but in that case the DummyScan and the empty IndexScan should have the same cost (currently it's 0). (I know this came from the original cost model, but we can still change it on the new code).

if (stats == nullptr) continue;
// TODO: A new hash join PR uses 256 as default so we're using this for now and hardcoding it here
auto num_buckets = 256.0;

This comment has been minimized.

@gangsterveggies

gangsterveggies May 9, 2018

Even though hardcoding this constant here seems fine, since this is tied to another variable used in the hash join execution/codegen the variables should be tied (for example, by setting some constant on the hash join implementation file or on the include/common) such that if someone in the future changes one, both change.

@qwkai

Changes look good overall. The modifications based on last pr comments have also been addressed.

output_cost_ = 0.f;
}
void PostgresCostCalculator::Visit(const PhysicalOrderBy *) { SortCost(); }

This comment has been minimized.

@qwkai

qwkai May 11, 2018

The parameter should be (UNUSED_ATTRIBUTE const PhysicalOrderBy *op) for consistency?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment