Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perfect Hash Join #1959

Merged
merged 122 commits into from Sep 7, 2021
Merged

Perfect Hash Join #1959

merged 122 commits into from Sep 7, 2021

Conversation

diegomestre2
Copy link
Contributor

@diegomestre2 diegomestre2 commented Jul 5, 2021

This PR provides the implementation of perfect hash join optimization.

For now, the optimization happens if the following statements are true:

  • Inner Join
  • One condition
  • Build side with no duplicates
  • Integer keys

This is done inside the physical hash join as an alternative path. In case it is not possible, it falls back to a normal hash join.

The perfect hash join copies the build data from the hash table after the build phase and stores it in a columnar manner.
The probe is done using the difference between the min value and the probe keys as indexes into the perfect hash table.
The result is a reference to the probe data and a dictionary vector containing the build keys.

Experiments with a fixed build side of 1024 and varying the probe side from 10k to 10M tuples achieve speedups of almost 8x.
Speed up over hash join

TODOs:

  • Fast path to reference both probes and build when the probe is part of the build domain.
  • Support inner joins with more conditions

@hannes
Copy link
Member

hannes commented Aug 13, 2021

Did you need help getting this sorted in multi-threading @diegomestre2 ?

@diegomestre2
Copy link
Contributor Author

Sorry for not finishing this PR. I spent some time on it and couldn't figure it out. But I will try it again and get back to you if I can't sort it out. I was just too busy during the previous weeks.

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes @pdet ! Some more comments:

TemplatedFillSelectionVectorBuild<int16_t>(source, sel_vec, seq_sel_vec, count);
break;
case PhysicalType::TIME32:
case PhysicalType::DATE32:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PhysicalType::TIME32/DATE32/TIME64/DATE64/TIMESTAMP do not exist in our system (they are arrow only)

TemplatedFillSelectionVectorProbe<int16_t>(source, build_sel_vec, probe_sel_vec, count, probe_sel_count);
break;
case PhysicalType::TIME32:
case PhysicalType::DATE32:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, these types aren't used

}
// with integral types
for (auto &&join_stat : op.join_stats) {
if (!join_stat->type.IsIntegral() || join_stat->type == LogicalTypeId::HUGEINT) {
Copy link
Collaborator

@Mytherin Mytherin Sep 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ Can we check the physical type here instead of the logical type so this also works for dates/times/etc?

Plus, can we add hugeint support? Unless that is too much work.

}
// with equality condition
for (auto &&condition : op.conditions) {
if (condition.comparison != ExpressionType::COMPARE_EQUAL) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check null_values_are_equal here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with adding the other data types in the future, but I think for this PR I just want to get Diego's stuff working.

@@ -95,8 +95,7 @@ class DictionaryBuffer : public VectorBuffer {
void SetSelVector(const SelectionVector &vector) {
this->sel_vector.Initialize(vector);
}

private:
// private:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this remain private?

@@ -3,7 +3,6 @@
# group: [inner]

statement ok
PRAGMA enable_verification
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be removed?

@@ -34,6 +72,9 @@ class PhysicalHashJoin : public PhysicalComparisonJoin {
vector<LogicalType> build_types;
//! Duplicate eliminated types; only used for delim_joins (i.e. correlated subqueries)
vector<LogicalType> delim_types;
//! Checks and execute perfect hash join optimization
unique_ptr<PerfectHashJoinExecutor> pjoin_executor {nullptr};
bool use_perfect_hash {false};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

#include "duckdb/parallel/thread_context.hpp"
#include "duckdb/storage/storage_manager.hpp"

#include <utility>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this include added on purpose?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grr, CLion again

@Mytherin
Copy link
Collaborator

Mytherin commented Sep 4, 2021

There still appears to be some failures remaining in the SQLancer tests. Could you have a look there?

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fixes. Looks good. One more comment:

statement ok
insert into null_equal_sec values (4),(5),(NULL)

#Unoptimized version can't handle this query
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? That seems like a bug. Optimizers should not be necessary for correctness.

Copy link
Member

@pdet pdet Sep 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It falls on a non-implemented type of the unoptimized join, if I remember correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unoptimized Result:
INTERNAL Error: Unimplemented comparison type for merge join!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the comparison type being:
COMPARE_NOT_DISTINCT_FROM

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a bug then. A merge join shouldn't be generated, but rather a hash join or a nested loop join, no? Probably a missed case in #1363.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I'll fix it here. :-)

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments w.r.t. code coverage:

@@ -32,6 +33,10 @@ static void TemplatedGatherLoop(Vector &rows, const SelectionVector &row_sel, Ve
data[col_idx] = Load<T>(row + col_offset);
ValidityBytes row_mask(row);
if (!row_mask.RowIsValid(row_mask.GetValidityEntry(entry_idx), idx_in_entry)) {
if (build_size > STANDARD_VECTOR_SIZE && col_mask.AllValid()) {
//! We need to initialize the mask with the vector size.
col_mask.Initialize(build_size);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not covered, could you add a test that covers this?

col.SetVectorType(VectorType::FLAT_VECTOR);
switch (col.GetType().InternalType()) {
case PhysicalType::UINT8:
TemplatedFullScanLoop<uint8_t>(rows, col, count, col_offset, col_no);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost none of these types are covered. Could you add tests for all these types?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did some type loop for the tests right? Can you point me out to one of these files?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See e.g. here

data[i] = Load<T>(row + col_offset);
ValidityBytes row_mask(row);
if (!row_mask.RowIsValid(row_mask.GetValidityEntry(entry_idx), idx_in_entry)) {
col_mask.SetInvalid(i);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is uncovered. Could you add a test that covers this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this line to be hit we need null values in the hash table.
However null comparisons are not done in this join, they are skipped in a phase prior to that.
I don't think we should remove this code, because at some point we might implement it anyway, but it is "unreachable" through testing right now. What you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps turn this into an internal exception for now if it is never supposed to be triggered currently, with the line of code commented out? If we need it in the future we can remove the exception again.

join_state.build_max = stats_build->max;
join_state.estimated_cardinality = op.estimated_cardinality;
if (build_range.type().id() == LogicalTypeId::DECIMAL) {
switch (build_range.type().InternalType()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines are uncovered. Could you add a test that covers this?

join_state.build_range = build_range.value_.bigint;
break;
default:
throw std::runtime_error("Invalid Physical Type for Decimals");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InternalException

case PhysicalType::INT32:
join_state.build_range = build_range.value_.integer;
break;
case PhysicalType::INT64:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decimals can also be HUGEINT

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but that physical type is ignored for this join, I can add it anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants