New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Perfect Hash Join #1959
Perfect Hash Join #1959
Conversation
Did you need help getting this sorted in multi-threading @diegomestre2 ? |
Sorry for not finishing this PR. I spent some time on it and couldn't figure it out. But I will try it again and get back to you if I can't sort it out. I was just too busy during the previous weeks. |
…ating selection vector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes @pdet ! Some more comments:
TemplatedFillSelectionVectorBuild<int16_t>(source, sel_vec, seq_sel_vec, count); | ||
break; | ||
case PhysicalType::TIME32: | ||
case PhysicalType::DATE32: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PhysicalType::TIME32/DATE32/TIME64/DATE64/TIMESTAMP do not exist in our system (they are arrow only)
TemplatedFillSelectionVectorProbe<int16_t>(source, build_sel_vec, probe_sel_vec, count, probe_sel_count); | ||
break; | ||
case PhysicalType::TIME32: | ||
case PhysicalType::DATE32: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, these types aren't used
} | ||
// with integral types | ||
for (auto &&join_stat : op.join_stats) { | ||
if (!join_stat->type.IsIntegral() || join_stat->type == LogicalTypeId::HUGEINT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ Can we check the physical type here instead of the logical type so this also works for dates/times/etc?
Plus, can we add hugeint support? Unless that is too much work.
} | ||
// with equality condition | ||
for (auto &&condition : op.conditions) { | ||
if (condition.comparison != ExpressionType::COMPARE_EQUAL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we check null_values_are_equal
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with adding the other data types in the future, but I think for this PR I just want to get Diego's stuff working.
@@ -95,8 +95,7 @@ class DictionaryBuffer : public VectorBuffer { | |||
void SetSelVector(const SelectionVector &vector) { | |||
this->sel_vector.Initialize(vector); | |||
} | |||
|
|||
private: | |||
// private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this remain private?
@@ -3,7 +3,6 @@ | |||
# group: [inner] | |||
|
|||
statement ok | |||
PRAGMA enable_verification |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be removed?
@@ -34,6 +72,9 @@ class PhysicalHashJoin : public PhysicalComparisonJoin { | |||
vector<LogicalType> build_types; | |||
//! Duplicate eliminated types; only used for delim_joins (i.e. correlated subqueries) | |||
vector<LogicalType> delim_types; | |||
//! Checks and execute perfect hash join optimization | |||
unique_ptr<PerfectHashJoinExecutor> pjoin_executor {nullptr}; | |||
bool use_perfect_hash {false}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^
#include "duckdb/parallel/thread_context.hpp" | ||
#include "duckdb/storage/storage_manager.hpp" | ||
|
||
#include <utility> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this include added on purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grr, CLion again
There still appears to be some failures remaining in the SQLancer tests. Could you have a look there? |
…tats builds are null to avoid null comparisons when optimizing for perfect Ht
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes. Looks good. One more comment:
statement ok | ||
insert into null_equal_sec values (4),(5),(NULL) | ||
|
||
#Unoptimized version can't handle this query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? That seems like a bug. Optimizers should not be necessary for correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It falls on a non-implemented type of the unoptimized join, if I remember correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unoptimized Result:
INTERNAL Error: Unimplemented comparison type for merge join!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the comparison type being:
COMPARE_NOT_DISTINCT_FROM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a bug then. A merge join shouldn't be generated, but rather a hash join or a nested loop join, no? Probably a missed case in #1363.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I'll fix it here. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some more comments w.r.t. code coverage:
@@ -32,6 +33,10 @@ static void TemplatedGatherLoop(Vector &rows, const SelectionVector &row_sel, Ve | |||
data[col_idx] = Load<T>(row + col_offset); | |||
ValidityBytes row_mask(row); | |||
if (!row_mask.RowIsValid(row_mask.GetValidityEntry(entry_idx), idx_in_entry)) { | |||
if (build_size > STANDARD_VECTOR_SIZE && col_mask.AllValid()) { | |||
//! We need to initialize the mask with the vector size. | |||
col_mask.Initialize(build_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is not covered, could you add a test that covers this?
col.SetVectorType(VectorType::FLAT_VECTOR); | ||
switch (col.GetType().InternalType()) { | ||
case PhysicalType::UINT8: | ||
TemplatedFullScanLoop<uint8_t>(rows, col, count, col_offset, col_no); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost none of these types are covered. Could you add tests for all these types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did some type loop for the tests right? Can you point me out to one of these files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See e.g. here
data[i] = Load<T>(row + col_offset); | ||
ValidityBytes row_mask(row); | ||
if (!row_mask.RowIsValid(row_mask.GetValidityEntry(entry_idx), idx_in_entry)) { | ||
col_mask.SetInvalid(i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is uncovered. Could you add a test that covers this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this line to be hit we need null values in the hash table.
However null comparisons are not done in this join, they are skipped in a phase prior to that.
I don't think we should remove this code, because at some point we might implement it anyway, but it is "unreachable" through testing right now. What you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps turn this into an internal exception for now if it is never supposed to be triggered currently, with the line of code commented out? If we need it in the future we can remove the exception again.
join_state.build_max = stats_build->max; | ||
join_state.estimated_cardinality = op.estimated_cardinality; | ||
if (build_range.type().id() == LogicalTypeId::DECIMAL) { | ||
switch (build_range.type().InternalType()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines are uncovered. Could you add a test that covers this?
join_state.build_range = build_range.value_.bigint; | ||
break; | ||
default: | ||
throw std::runtime_error("Invalid Physical Type for Decimals"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InternalException
case PhysicalType::INT32: | ||
join_state.build_range = build_range.value_.integer; | ||
break; | ||
case PhysicalType::INT64: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decimals can also be HUGEINT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but that physical type is ignored for this join, I can add it anyway.
This PR provides the implementation of perfect hash join optimization.
For now, the optimization happens if the following statements are true:
This is done inside the physical hash join as an alternative path. In case it is not possible, it falls back to a normal hash join.
The perfect hash join copies the build data from the hash table after the build phase and stores it in a columnar manner.
The probe is done using the difference between the min value and the probe keys as indexes into the perfect hash table.
The result is a reference to the probe data and a dictionary vector containing the build keys.
Experiments with a fixed build side of 1024 and varying the probe side from 10k to 10M tuples achieve speedups of almost 8x.
TODOs: