Perfect Hash Join #1959

Merged 122 commits on Sep 7, 2021
Changes from 98 commits
Commits
33d309f
Add test for hash join with perfect hash
diegomestre2 Apr 21, 2021
656c7f7
Add propagate stats to join for perfect hash
diegomestre2 Apr 22, 2021
9925b05
Adding includes
diegomestre2 Apr 23, 2021
5b833c5
Adding implementations for perfect hash join
diegomestre2 Apr 26, 2021
2634e69
uncommenting branch
diegomestre2 Apr 26, 2021
b31e717
Perfect hash join should occur inside hash join in case it falls back
diegomestre2 Apr 26, 2021
a545967
Removing unwanted files and adding flag to physical hash join constru…
diegomestre2 Apr 26, 2021
244a35a
Creating perfect hash join state object to store ranges and info abou…
diegomestre2 Apr 27, 2021
20fa952
Adding commented code
diegomestre2 Apr 28, 2021
46ea7ea
Retrieving left side and right side
diegomestre2 Apr 29, 2021
d373392
Probe right side using the difference of the minimum value
diegomestre2 Apr 30, 2021
1d61fc0
Joins and copies the right side in the perfect hash join
diegomestre2 May 2, 2021
f438825
Fetching right side from hashtable
diegomestre2 May 2, 2021
d568c5d
The build side should be fetched using the selection vector from the …
diegomestre2 May 3, 2021
e6df3b7
Adding checks for perfect hash table
diegomestre2 May 5, 2021
4acb1d2
Perfect hash table should be built right after hash table build is done
diegomestre2 May 6, 2021
48a3e0c
Check for perfect hash join only when columns are numeric
diegomestre2 May 6, 2021
1dd4063
Adding Vector constructor with variable size
diegomestre2 May 7, 2021
e2a8841
Build perfect hash table structure
diegomestre2 May 7, 2021
49b3000
Add gather methods to class definition
diegomestre2 May 7, 2021
07b5f4b
Fix for non numeric statistics
diegomestre2 May 7, 2021
6c3ddf1
Create scan over hash table to store offsets
diegomestre2 May 7, 2021
852d0c1
Build columnar hash table for perfect hash join
diegomestre2 May 9, 2021
724e320
Fix for nullptrs
diegomestre2 May 9, 2021
eca0065
Merge with master
diegomestre2 May 9, 2021
a20b1cc
Fix merge
diegomestre2 May 10, 2021
c3730cf
Adding changes in the hash table building
diegomestre2 May 10, 2021
1dc5e93
Replacing min-max checking with vectorized primitive
diegomestre2 May 11, 2021
cc5ed9f
Propagating build and probe stats to execution
diegomestre2 May 11, 2021
4508830
Check during the planning whether the probe min-max is part of the bu…
diegomestre2 May 11, 2021
5a4d82d
Building columnar hashtable
diegomestre2 May 13, 2021
688d721
Adding full scan for hashtable
diegomestre2 May 13, 2021
99d21c6
Getting key locations before scanning hashtable
diegomestre2 May 18, 2021
158138c
Init data to buffer in vector new constructor
diegomestre2 May 18, 2021
9004438
Slicing probe side
diegomestre2 May 18, 2021
ad31f83
MinMax should use source type to template function
diegomestre2 May 19, 2021
05c3c71
Partial probe perfect hash table
diegomestre2 May 19, 2021
b565ef6
generate selection vector to add build side to result
diegomestre2 May 20, 2021
93afa21
Working version
diegomestre2 May 22, 2021
5f29c25
Adding templated selection vector creation for invisible join
diegomestre2 May 22, 2021
26e7d44
Evaluating range and creating selection vector in one function for in…
diegomestre2 May 25, 2021
7a6ba61
Invisible join not possible for no integral type
diegomestre2 May 25, 2021
1b418ec
Fixes for duplicate checking on build side
diegomestre2 May 25, 2021
f8f66bf
Checking for duplicates in the build side
diegomestre2 May 26, 2021
8bad438
Replacing slice by reference in invisible join
diegomestre2 May 27, 2021
9347a2c
adding function to fill new build structure with its respective posit…
diegomestre2 May 28, 2021
a1de9ad
Working version without duplicate handling
diegomestre2 May 31, 2021
89b3fc7
Initial rework on invisible join, moving it to its own operator
diegomestre2 May 31, 2021
751eb08
Rework on invisible join
diegomestre2 Jun 1, 2021
62d8b34
Rework on invisible join, storing build columns in vectors
diegomestre2 Jun 1, 2021
a9e9c67
Rework on invisible join, getting build side keys
diegomestre2 Jun 2, 2021
6769619
Rework in invisible join, building columnar structure without hashtable
diegomestre2 Jun 3, 2021
7917f6c
We only cover invisible joins with one condition for now
diegomestre2 Jun 3, 2021
2cef8da
Update tests
diegomestre2 Jun 4, 2021
582f0bb
Get build values as selection vector
diegomestre2 Jun 6, 2021
4547b4c
Get build values as selection vector and use it for the probe
diegomestre2 Jun 6, 2021
88afbf1
Fix issues with larger than standard vector size problems
diegomestre2 Jun 7, 2021
48d9193
merge with upstream master
diegomestre2 Jun 7, 2021
2cbc951
More fixes for merge
diegomestre2 Jun 8, 2021
ce3019f
More fixes for merge
diegomestre2 Jun 8, 2021
ef41cca
Fix for build with bigger than vector size
diegomestre2 Jun 9, 2021
be6fec1
Current working repo
diegomestre2 Jun 10, 2021
20e4512
Including fast pass version
diegomestre2 Jun 11, 2021
00c4b47
Increase build threshold
diegomestre2 Jun 11, 2021
8716314
Add tests with variable num of columns
diegomestre2 Jun 11, 2021
96168a3
Add tests with variable num of columns
diegomestre2 Jun 11, 2021
d125648
Add tests with variable num of columns
diegomestre2 Jun 11, 2021
3472e16
Rework for pull request
diegomestre2 Jun 11, 2021
e128e7c
More perfect hash join refactory
diegomestre2 Jun 14, 2021
2694620
More perfect hash join refactory, moving into a new class
diegomestre2 Jun 14, 2021
a61f5c1
More perfect hash join refactory
diegomestre2 Jun 15, 2021
fb15fed
Moving perfect hash join code to executor
diegomestre2 Jun 16, 2021
4b038df
Converting static methods into non-static
diegomestre2 Jun 16, 2021
a3681e9
Including signed columns in perfect hash join
diegomestre2 Jun 17, 2021
2548c83
Removing code in vector that checks index limit for dictionary vectors
diegomestre2 Jun 17, 2021
d4f4813
Fix for negative keys
diegomestre2 Jun 17, 2021
6fcb961
Replacing perfect hash table allocation size by range rather than key…
diegomestre2 Jun 18, 2021
ce581e2
Allocating selection vector for probe only once
diegomestre2 Jun 18, 2021
571d8d0
More fixes for pull request
diegomestre2 Jun 19, 2021
241ba8a
More fixes for tests that are breaking
diegomestre2 Jun 20, 2021
b49c400
Support for duplicate check on perfect hash join
diegomestre2 Jun 21, 2021
3d9dc41
Fix for propagation after update
diegomestre2 Jun 21, 2021
67fa02f
Separating probe and build when making a selection vector in perfect …
diegomestre2 Jun 22, 2021
89d3e74
More fixes for tests breaking
diegomestre2 Jun 23, 2021
c4eb83d
More fixes for tests breaking and merge with master
diegomestre2 Jun 24, 2021
c575e25
More fixes for merge with main branch
diegomestre2 Jun 25, 2021
65823f9
Refining implementation
diegomestre2 Jun 26, 2021
a0e10ce
Merge
diegomestre2 Jun 26, 2021
b3045b1
Merge with master
diegomestre2 Jun 27, 2021
7cd22f3
Adding more tests for perfect hash join
diegomestre2 Jun 29, 2021
8bcf1ce
Merge branch 'master' of https://github.com/duckdb/duckdb
diegomestre2 Jun 29, 2021
5505012
Adding more tests to perfect hash join
diegomestre2 Jun 29, 2021
e4b7950
Fixes
diegomestre2 Jun 30, 2021
cfd1a75
Merge branch 'master' of https://github.com/duckdb/duckdb
diegomestre2 Jun 30, 2021
dfdbefd
Fixes for pull request
diegomestre2 Jul 5, 2021
33b9755
Merge branch 'master' of https://github.com/duckdb/duckdb
diegomestre2 Jul 5, 2021
166d699
Fix for deleted function in join hashtable object
diegomestre2 Jul 5, 2021
ef37a8c
Merge branch 'master' of https://github.com/duckdb/duckdb
diegomestre2 Jul 5, 2021
be3d70c
Merge with master
diegomestre2 Jul 18, 2021
0e8773c
Fixes for merge
diegomestre2 Jul 18, 2021
6464e9f
Merge with master
diegomestre2 Aug 28, 2021
1e9c77f
Fix for some tidy check errors
diegomestre2 Aug 28, 2021
f2f9bb6
Merge branch 'duckdb:master' into master
pdet Sep 1, 2021
0ee7b63
Fix for parallel issue on perfect HJ
pdet Sep 1, 2021
8f767c4
Fixing includes
pdet Sep 1, 2021
f7f9f23
Fixing the invalid key types on Perfect HJ
pdet Sep 2, 2021
9e2b49d
More fixes for Perfect HJ, avoiding null comparisons and properly cre…
pdet Sep 2, 2021
a754dee
Fixing validity mask creation in row gather function
pdet Sep 3, 2021
61de84c
PR Requests
pdet Sep 3, 2021
7d42096
Fixing Perfect HJ bugs related to null values and decimal types
pdet Sep 3, 2021
7a0e4e4
Fixing test join invisible probe
pdet Sep 3, 2021
8bfcc33
Hopefully fixing the build
pdet Sep 3, 2021
5d53052
Small fix when constructing nullmasks on row gather and checking if s…
pdet Sep 5, 2021
1da8d54
Only run perfect HJ when build size <= 10pow6
pdet Sep 5, 2021
93f6502
PR requests
pdet Sep 6, 2021
5e0a200
Merge branch 'master' of github.com:diegomestre2/duckdb
pdet Sep 6, 2021
ee8140c
Adding internal exceptions on code that should be unreachable at …
pdet Sep 6, 2021
2af736c
Commenting out unused line so git doesn't get angry with me
pdet Sep 6, 2021
a0f9a93
Update test_not_distinct_from.test
Mytherin Sep 6, 2021
a77f703
Codecov on unsigned types of Perfect HJ
pdet Sep 6, 2021
a6ba64b
Merge branch 'master' of github.com:diegomestre2/duckdb
pdet Sep 6, 2021
726173a
forgot the decimals
pdet Sep 7, 2021
4 changes: 4 additions & 0 deletions src/common/enums/join_type.cpp
@@ -38,4 +38,8 @@ bool IsRightOuterJoin(JoinType type) {
return type == JoinType::OUTER || type == JoinType::RIGHT;
}

bool IsInnerJoin(JoinType type) {
return type == JoinType::INNER;
}

} // namespace duckdb
76 changes: 76 additions & 0 deletions src/common/row_operations/row_gather.cpp
@@ -117,4 +117,80 @@ void RowOperations::Gather(Vector &rows, const SelectionVector &row_sel, Vector
}
}

template <class T>
static void TemplatedFullScanLoop(Vector &rows, Vector &col, idx_t count, idx_t col_offset, idx_t col_no) {
// Precompute mask indexes
idx_t entry_idx;
idx_t idx_in_entry;
ValidityBytes::GetEntryIndex(col_no, entry_idx, idx_in_entry);

auto ptrs = FlatVector::GetData<data_ptr_t>(rows);
auto data = FlatVector::GetData<T>(col);
auto &col_mask = FlatVector::Validity(col);

for (idx_t i = 0; i < count; i++) {
auto row = ptrs[i];
data[i] = Load<T>(row + col_offset);
ValidityBytes row_mask(row);
if (!row_mask.RowIsValid(row_mask.GetValidityEntry(entry_idx), idx_in_entry)) {
col_mask.SetInvalid(i);

Collaborator: This line is uncovered. Could you add a test that covers this?

Contributor: For this line to be hit, we need null values in the hash table.
However, null comparisons are not done in this join; they are skipped in an earlier phase.
I don't think we should remove this code, because at some point we might implement it anyway, but it is "unreachable" through testing right now. What do you think?

Collaborator: Perhaps turn this into an internal exception for now if it is never supposed to be triggered currently, with the line of code commented out? If we need it in the future we can remove the exception again.

}
}
}

void RowOperations::FullScanColumn(const RowLayout &layout, Vector &rows, Vector &col, idx_t count, idx_t col_no) {
const auto col_offset = layout.GetOffsets()[col_no];
col.SetVectorType(VectorType::FLAT_VECTOR);
switch (col.GetType().InternalType()) {
case PhysicalType::UINT8:
TemplatedFullScanLoop<uint8_t>(rows, col, count, col_offset, col_no);

Collaborator: Almost none of these types are covered. Could you add tests for all these types?

Contributor: You did some kind of type loop for the tests, right? Can you point me to one of these files?

Collaborator: See e.g. here

break;
case PhysicalType::UINT16:
TemplatedFullScanLoop<uint16_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::UINT32:
TemplatedFullScanLoop<uint32_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::UINT64:
TemplatedFullScanLoop<uint64_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::BOOL:
case PhysicalType::INT8:
TemplatedFullScanLoop<int8_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::INT16:
TemplatedFullScanLoop<int16_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::INT32:
TemplatedFullScanLoop<int32_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::INT64:
TemplatedFullScanLoop<int64_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::INT128:
TemplatedFullScanLoop<hugeint_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::FLOAT:
TemplatedFullScanLoop<float>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::DOUBLE:
TemplatedFullScanLoop<double>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::POINTER:
TemplatedFullScanLoop<uintptr_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::INTERVAL:
TemplatedFullScanLoop<interval_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::HASH:
TemplatedFullScanLoop<hash_t>(rows, col, count, col_offset, col_no);
break;
case PhysicalType::VARCHAR:
TemplatedFullScanLoop<string_t>(rows, col, count, col_offset, col_no);
break;
default:
throw NotImplementedException("Unimplemented type for RowOperations::FullScanColumn");
}
}

} // namespace duckdb
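
The full-scan loop above copies one fixed-width column out of row-major tuples into a flat column and marks NULLs from the validity bits stored at the start of each row. A minimal standalone sketch of that idea follows. It uses plain `std::vector` and `memcpy` rather than DuckDB's `Vector`, `RowLayout`, or `ValidityBytes`; the names `GatherColumn`, `ColumnResult`, and `MakeRow`, and the assumed layout of a single leading validity byte with one bit per column, are hypothetical and only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical result type: gathered values plus one validity flag per row.
template <class T>
struct ColumnResult {
	std::vector<T> data;
	std::vector<bool> valid;
};

// Gather the column stored at byte `col_offset` of every row.
// Assumed layout: each row begins with validity bytes, one bit per column (bit set = valid).
template <class T>
ColumnResult<T> GatherColumn(const std::vector<const uint8_t *> &rows, size_t col_offset, size_t col_no) {
	// Precompute which validity byte and which bit belong to this column
	const size_t entry_idx = col_no / 8;
	const uint8_t bit = static_cast<uint8_t>(1u << (col_no % 8));

	ColumnResult<T> result;
	result.data.resize(rows.size());
	result.valid.assign(rows.size(), true);
	for (size_t i = 0; i < rows.size(); i++) {
		const uint8_t *row = rows[i];
		// memcpy keeps the load well-defined even if the field is unaligned
		std::memcpy(&result.data[i], row + col_offset, sizeof(T));
		if ((row[entry_idx] & bit) == 0) {
			result.valid[i] = false;
		}
	}
	return result;
}

// Demo helper: one validity byte followed by an int32 column and an int64 column.
inline std::vector<uint8_t> MakeRow(uint8_t validity, int32_t a, int64_t b) {
	std::vector<uint8_t> row(1 + sizeof(a) + sizeof(b));
	row[0] = validity;
	std::memcpy(row.data() + 1, &a, sizeof(a));
	std::memcpy(row.data() + 1 + sizeof(a), &b, sizeof(b));
	return row;
}
```

Calling `GatherColumn<int64_t>(rows, 1 + sizeof(int32_t), 1)` on rows built with `MakeRow` returns the second column; a row whose validity byte has bit 1 cleared comes back with `valid[i] == false`.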
5 changes: 2 additions & 3 deletions src/common/types/vector.cpp
@@ -3,17 +3,17 @@
#include "duckdb/common/algorithm.hpp"
#include "duckdb/common/assert.hpp"
#include "duckdb/common/exception.hpp"
#include "duckdb/common/operator/comparison_operators.hpp"
#include "duckdb/common/pair.hpp"
#include "duckdb/common/printer.hpp"
#include "duckdb/common/serializer.hpp"
#include "duckdb/common/to_string.hpp"
#include "duckdb/common/types/chunk_collection.hpp"
#include "duckdb/common/types/null_value.hpp"
#include "duckdb/common/types/sel_cache.hpp"
#include "duckdb/common/types/vector_cache.hpp"
#include "duckdb/common/vector_operations/vector_operations.hpp"
#include "duckdb/storage/buffer/buffer_handle.hpp"
#include "duckdb/common/operator/comparison_operators.hpp"
#include "duckdb/common/types/vector_cache.hpp"

#include <cstring> // strlen() on Solaris

@@ -156,7 +156,6 @@ void Vector::Slice(const SelectionVector &sel, idx_t count) {
}
Vector child_vector(*this);
auto child_ref = make_buffer<VectorChildBuffer>(move(child_vector));

auto dict_buffer = make_buffer<DictionaryBuffer>(sel);
vector_type = VectorType::DICTIONARY_VECTOR;
buffer = move(dict_buffer);
28 changes: 24 additions & 4 deletions src/execution/join_hashtable.cpp
@@ -1,13 +1,13 @@
#include "duckdb/execution/join_hashtable.hpp"

#include "duckdb/storage/buffer_manager.hpp"

#include "duckdb/common/exception.hpp"
#include "duckdb/common/operator/comparison_operators.hpp"
#include "duckdb/common/row_operations/row_operations.hpp"
#include "duckdb/common/types/null_value.hpp"
#include "duckdb/common/types/row_data_collection.hpp"
#include "duckdb/common/row_operations/row_operations.hpp"
#include "duckdb/common/vector_operations/unary_executor.hpp"
#include "duckdb/common/vector_operations/vector_operations.hpp"
#include "duckdb/common/operator/comparison_operators.hpp"
#include "duckdb/storage/buffer_manager.hpp"

namespace duckdb {

@@ -827,4 +827,24 @@ void JoinHashTable::ScanFullOuter(DataChunk &result, JoinHTScanState &state) {
}
}

idx_t JoinHashTable::FillWithHTOffsets(data_ptr_t *key_locations, JoinHTScanState &state) {

// iterate over blocks
idx_t key_count = 0;
while (state.block_position < blocks.size()) {
auto &block = blocks[state.block_position];
auto handle = buffer_manager.Pin(block.block);
auto base_ptr = handle->node->buffer;
// go through all the tuples within this block
while (state.position < block.count) {
auto tuple_base = base_ptr + state.position * entry_size;
// store its locations
key_locations[key_count++] = tuple_base;
state.position++;
}
state.block_position++;
state.position = 0;
}
return key_count;
}
} // namespace duckdb
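
`FillWithHTOffsets` above walks every block of the hash table and records the address of each tuple, advancing a scan state as it goes so that a second call picks up where the first stopped. The sketch below shows the same traversal with in-memory buffers. `Block`, `ScanState`, and `CollectTupleAddresses` are hypothetical stand-ins; buffer pinning through the buffer manager is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a hash table block.
struct Block {
	std::vector<uint8_t> buffer; // raw tuple storage
	size_t count;                // number of tuples stored in this block
};

// Hypothetical stand-in for the scan state.
struct ScanState {
	size_t block_position = 0; // block currently being scanned
	size_t position = 0;       // tuple index within that block
};

// Append the address of every remaining tuple to `out`; return how many were added.
size_t CollectTupleAddresses(const std::vector<Block> &blocks, size_t entry_size, ScanState &state,
                             std::vector<const uint8_t *> &out) {
	size_t added = 0;
	// iterate over blocks
	while (state.block_position < blocks.size()) {
		const Block &block = blocks[state.block_position];
		const uint8_t *base_ptr = block.buffer.data();
		// go through all the tuples within this block
		while (state.position < block.count) {
			out.push_back(base_ptr + state.position * entry_size);
			state.position++;
			added++;
		}
		state.block_position++;
		state.position = 0;
	}
	return added;
}
```

Because the state is advanced in place, calling the function again after a full pass adds nothing.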
1 change: 1 addition & 0 deletions src/execution/operator/join/CMakeLists.txt
@@ -9,6 +9,7 @@ add_library_unity(
physical_index_join.cpp
physical_join.cpp
physical_nested_loop_join.cpp
perfect_hash_join_executor.cpp
physical_piecewise_merge_join.cpp)
set(ALL_OBJECT_FILES
${ALL_OBJECT_FILES} $<TARGET_OBJECTS:duckdb_operator_join>
221 changes: 221 additions & 0 deletions src/execution/operator/join/perfect_hash_join_executor.cpp
@@ -0,0 +1,221 @@
#include "duckdb/execution/operator/join/perfect_hash_join_executor.hpp"

#include "duckdb/common/types/row_layout.hpp"
#include "duckdb/execution/operator/join/physical_hash_join.hpp"

namespace duckdb {

PerfectHashJoinExecutor::PerfectHashJoinExecutor(PerfectHashJoinStats pjoin_stats_) : pjoin_stats(pjoin_stats_) {
}

bool PerfectHashJoinExecutor::CanDoPerfectHashJoin() {
return pjoin_stats.is_build_small;
}

void PerfectHashJoinExecutor::BuildPerfectHashTable(JoinHashTable *hash_table_ptr, JoinHTScanState &join_ht_state,
LogicalType key_type) {
// First, allocate memory for each build column
auto build_size = pjoin_stats.build_range + 1;
for (auto type : hash_table_ptr->build_types) {
perfect_hash_table.emplace_back(type, build_size);
}
// and for duplicate_checking
bitmap_build_idx = unique_ptr<bool[]>(new bool[build_size]);
memset(bitmap_build_idx.get(), 0, sizeof(bool) * build_size); // set false

// Now fill columns with build data
FullScanHashTable(join_ht_state, key_type, hash_table_ptr);
}

void PerfectHashJoinExecutor::FullScanHashTable(JoinHTScanState &state, LogicalType key_type,
JoinHashTable *hash_table) {
Vector tuples_addresses(LogicalType::POINTER, hash_table->size()); // allocate space for all the tuples
auto key_locations = FlatVector::GetData<data_ptr_t>(tuples_addresses); // get a pointer to vector data
// TODO: In a parallel finalize, one should lock exclusively and each thread should do one part of the code below.
// Go through all the blocks and fill the keys addresses
auto keys_count = hash_table->FillWithHTOffsets(key_locations, state);
// Scan the build keys in the hash table
Vector build_vector(key_type, keys_count);
RowOperations::FullScanColumn(hash_table->layout, tuples_addresses, build_vector, keys_count, 0);
// Now fill the selection vector using the build keys and create a sequential vector
// todo: add check for fast pass when probe is part of build domain
SelectionVector sel_build(keys_count + 1);
SelectionVector sel_tuples(keys_count + 1);
FillSelectionVectorSwitchBuild(build_vector, sel_build, sel_tuples, keys_count);
// early out
if (has_duplicates)
return;
if (unique_keys == pjoin_stats.build_range + 1 && !hash_table->has_null) {
pjoin_stats.is_build_dense = true;
}
keys_count = unique_keys; // do not consider keys out of the range
// Full scan the remaining build columns and fill the perfect hash table
for (idx_t i = 0; i < hash_table->build_types.size(); i++) {
auto &vector = perfect_hash_table[i];
D_ASSERT(vector.GetType() == hash_table->build_types[i]);
const auto col_no = hash_table->condition_types.size() + i;
const auto col_offset = hash_table->layout.GetOffsets()[col_no];
RowOperations::Gather(tuples_addresses, sel_tuples, vector, sel_build, keys_count, col_offset, col_no);
}
}

void PerfectHashJoinExecutor::FillSelectionVectorSwitchBuild(Vector &source, SelectionVector &sel_vec,
SelectionVector &seq_sel_vec, idx_t count) {
switch (source.GetType().id()) {
case LogicalTypeId::TINYINT:
TemplatedFillSelectionVectorBuild<int8_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::SMALLINT:
TemplatedFillSelectionVectorBuild<int16_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::INTEGER:
TemplatedFillSelectionVectorBuild<int32_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::BIGINT:
TemplatedFillSelectionVectorBuild<int64_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::UTINYINT:
TemplatedFillSelectionVectorBuild<uint8_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::USMALLINT:
TemplatedFillSelectionVectorBuild<uint16_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::UINTEGER:
TemplatedFillSelectionVectorBuild<uint32_t>(source, sel_vec, seq_sel_vec, count);
break;
case LogicalTypeId::UBIGINT:
TemplatedFillSelectionVectorBuild<uint64_t>(source, sel_vec, seq_sel_vec, count);
break;
default:
throw NotImplementedException("Type not supported");
break;
}
}
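
The switch above maps a runtime type id onto a template instantiation, so the inner loop is compiled once per integer width. A minimal sketch of that dispatch pattern follows, using a hypothetical `KeyType` enum and a simple range-counting kernel in place of DuckDB's `LogicalTypeId` and selection-vector code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical runtime type tag (not DuckDB's LogicalTypeId).
enum class KeyType { INT8, INT16, INT32, INT64, VARCHAR };

// Typed kernel: count the values that fall inside [min_value, max_value].
template <class T>
size_t CountInRange(const void *raw, size_t count, int64_t min_value, int64_t max_value) {
	auto data = static_cast<const T *>(raw);
	size_t in_range = 0;
	for (size_t i = 0; i < count; i++) {
		const int64_t value = static_cast<int64_t>(data[i]);
		if (min_value <= value && value <= max_value) {
			in_range++;
		}
	}
	return in_range;
}

// Runtime dispatch: choose the instantiation that matches the type tag.
size_t CountInRangeSwitch(KeyType type, const void *raw, size_t count, int64_t min_value, int64_t max_value) {
	switch (type) {
	case KeyType::INT8:
		return CountInRange<int8_t>(raw, count, min_value, max_value);
	case KeyType::INT16:
		return CountInRange<int16_t>(raw, count, min_value, max_value);
	case KeyType::INT32:
		return CountInRange<int32_t>(raw, count, min_value, max_value);
	case KeyType::INT64:
		return CountInRange<int64_t>(raw, count, min_value, max_value);
	default:
		// non-integral keys are rejected, mirroring the "Type not supported" branch
		throw std::invalid_argument("Type not supported");
	}
}
```

Unsupported tags throw instead of silently falling through.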

template <typename T>
void PerfectHashJoinExecutor::TemplatedFillSelectionVectorBuild(Vector &source, SelectionVector &sel_vec,
SelectionVector &seq_sel_vec, idx_t count) {
auto min_value = pjoin_stats.build_min.GetValue<T>();
auto max_value = pjoin_stats.build_max.GetValue<T>();
VectorData vector_data;
source.Orrify(count, vector_data);
auto data = reinterpret_cast<T *>(vector_data.data);
// generate the selection vector
for (idx_t i = 0, sel_idx = 0; i != count; ++i) {
auto data_idx = vector_data.sel->get_index(i);
auto input_value = data[data_idx];
// add index to selection vector if value in the range
if (min_value <= input_value && input_value <= max_value) {
auto idx = (idx_t)(input_value - min_value); // subtract min value to get the idx position
sel_vec.set_index(sel_idx++, idx);
if (bitmap_build_idx[idx]) {
has_duplicates = true;
break;
} else {
bitmap_build_idx[idx] = true;
unique_keys++;
}
}
seq_sel_vec.set_index(i, i);
}
}
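
The build step above uses `key - min` as the slot index, so no hashing or collision chain is needed, and a bitmap flags any slot that is hit twice. The sketch below shows that scheme for a single `int64_t` key and one payload column. `PerfectTable`, `BuildPerfectTable`, and `IsDense` are hypothetical names, and the assumption that a duplicate key simply aborts the build (so the caller can fall back to the regular hash join) is an interpretation of the early-out in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical build result: slot = key - min_key, one payload value per slot.
struct PerfectTable {
	int64_t min_key = 0;
	std::vector<bool> occupied;   // plays the role of the duplicate-check bitmap
	std::vector<int64_t> payload; // a single build column, indexed by slot
	size_t unique_keys = 0;
};

// Returns false when a duplicate key is found; the table is then unusable.
bool BuildPerfectTable(const std::vector<int64_t> &keys, const std::vector<int64_t> &values, int64_t min_key,
                       int64_t max_key, PerfectTable &table) {
	// allocate by range (max - min + 1), not by number of keys
	const size_t slots = static_cast<size_t>(max_key - min_key) + 1;
	table.min_key = min_key;
	table.occupied.assign(slots, false);
	table.payload.assign(slots, 0);
	table.unique_keys = 0;
	for (size_t i = 0; i < keys.size(); i++) {
		if (keys[i] < min_key || keys[i] > max_key) {
			continue; // outside the range given by the statistics
		}
		// subtract the minimum to get the slot position
		const size_t slot = static_cast<size_t>(keys[i] - min_key);
		if (table.occupied[slot]) {
			return false; // duplicate key
		}
		table.occupied[slot] = true;
		table.payload[slot] = values[i];
		table.unique_keys++;
	}
	return true;
}

// The build is dense when every slot in the range holds a key.
bool IsDense(const PerfectTable &table) {
	return table.unique_keys == table.occupied.size();
}
```

Sizing by range rather than by key count means a sparse key domain wastes slots, which is presumably why the PR caps the build size.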

bool PerfectHashJoinExecutor::ProbePerfectHashTable(ExecutionContext &context, DataChunk &result,
PhysicalHashJoinState *physical_state, JoinHashTable *ht_ptr,
PhysicalOperator *operator_child) {
do {
// fetch the chunk to join
operator_child->GetChunk(context, physical_state->child_chunk, physical_state->child_state.get());
if (physical_state->child_chunk.size() == 0) {
// no more keys to probe
return true;
}
// fetch the join keys from the chunk
physical_state->probe_executor.Execute(physical_state->child_chunk, physical_state->join_keys);
// select the keys that are in the min-max range
auto &keys_vec = physical_state->join_keys.data[0];
auto keys_count = physical_state->join_keys.size();
// todo: add check for fast pass when probe is part of build domain
FillSelectionVectorSwitchProbe(keys_vec, physical_state->build_sel_vec, physical_state->probe_sel_vec,
keys_count);
// If build is dense and probe is in build's domain, just reference probe
if (pjoin_stats.is_build_dense && keys_count == probe_sel_count) {
result.Reference(physical_state->child_chunk);
} else {
// otherwise, filter out the values that do not match
result.Slice(physical_state->child_chunk, physical_state->probe_sel_vec, probe_sel_count, 0);
}
// on the build side, we need to fetch the data and build dictionary vectors with the sel_vec
for (idx_t i = 0; i < ht_ptr->build_types.size(); i++) {
auto &result_vector = result.data[physical_state->child_chunk.ColumnCount() + i];
D_ASSERT(result_vector.GetType() == ht_ptr->build_types[i]);
auto &build_vec = perfect_hash_table[i];
result_vector.Reference(build_vec);
result_vector.Slice(physical_state->build_sel_vec, probe_sel_count);
}
probe_sel_count = 0;
} while (result.size() == 0);
return true;
}
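
In the probe loop above, each build column is referenced and then sliced with the build selection vector, so the output reads build values through indices rather than copying them. A rough standalone illustration of that dictionary-style view is sketched here. `DictionaryView` is a hypothetical class and not DuckDB's `Vector` or `DictionaryBuffer`; it only shows how a shared base column plus an index list can stand in for a materialized result.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical dictionary-style view: shares the base column and stores only indices.
template <class T>
class DictionaryView {
public:
	DictionaryView(std::shared_ptr<const std::vector<T>> base, std::vector<size_t> sel)
	    : base_(std::move(base)), sel_(std::move(sel)) {
	}

	// number of logical rows in the view
	size_t size() const {
		return sel_.size();
	}

	// row i of the view is row sel_[i] of the shared base column
	const T &operator[](size_t i) const {
		return (*base_)[sel_[i]];
	}

private:
	std::shared_ptr<const std::vector<T>> base_;
	std::vector<size_t> sel_;
};
```

Several views can share one base column, and the same slot may appear more than once in a view when several probe rows match the same build key.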

void PerfectHashJoinExecutor::FillSelectionVectorSwitchProbe(Vector &source, SelectionVector &build_sel_vec,
SelectionVector &probe_sel_vec, idx_t count) {
switch (source.GetType().id()) {
case LogicalTypeId::TINYINT:
TemplatedFillSelectionVectorProbe<int8_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::SMALLINT:
TemplatedFillSelectionVectorProbe<int16_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::INTEGER:
TemplatedFillSelectionVectorProbe<int32_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::BIGINT:
TemplatedFillSelectionVectorProbe<int64_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::UTINYINT:
TemplatedFillSelectionVectorProbe<uint8_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::USMALLINT:
TemplatedFillSelectionVectorProbe<uint16_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::UINTEGER:
TemplatedFillSelectionVectorProbe<uint32_t>(source, build_sel_vec, probe_sel_vec, count);
break;
case LogicalTypeId::UBIGINT:
TemplatedFillSelectionVectorProbe<uint64_t>(source, build_sel_vec, probe_sel_vec, count);
break;
default:
throw NotImplementedException("Type not supported");
break;
}
}

template <typename T>
void PerfectHashJoinExecutor::TemplatedFillSelectionVectorProbe(Vector &source, SelectionVector &build_sel_vec,
SelectionVector &probe_sel_vec, idx_t count) {
auto min_value = pjoin_stats.build_min.GetValue<T>();
auto max_value = pjoin_stats.build_max.GetValue<T>();
VectorData vector_data;
source.Orrify(count, vector_data);
auto data = reinterpret_cast<T *>(vector_data.data);

// build selection vector for non-dense build
for (idx_t i = 0, sel_idx = 0; i != count; ++i) {
// retrieve value from vector
auto data_idx = vector_data.sel->get_index(i);
auto input_value = data[data_idx];
// add index to selection vector if value in the range
if (min_value <= input_value && input_value <= max_value) {
auto idx = (idx_t)(input_value - min_value); // subtract min value to get the idx position
// check for matches in the build
if (bitmap_build_idx[idx]) {
build_sel_vec.set_index(sel_idx, idx);
probe_sel_vec.set_index(sel_idx++, i);
probe_sel_count++;
}
}
}
}

} // namespace duckdb
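
The probe step above range-checks each probe key, turns it into a slot with `key - min`, and records a pair of indices for every occupied slot: the probe row and the build slot. A minimal sketch of that pairing is given below for `int64_t` keys. `ProbeMatches` and `ProbePerfectTable` are hypothetical names, and the occupancy bitmap is passed in directly instead of living in an executor object.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical probe output: parallel arrays describing each match.
struct ProbeMatches {
	std::vector<size_t> probe_sel; // index of the matching probe row
	std::vector<size_t> build_sel; // slot in the perfect hash table (key - min_key)
};

// `occupied[slot]` is true when the build side contains the key `min_key + slot`.
ProbeMatches ProbePerfectTable(const std::vector<int64_t> &probe_keys, const std::vector<bool> &occupied,
                               int64_t min_key) {
	ProbeMatches result;
	if (occupied.empty()) {
		return result;
	}
	const int64_t max_key = min_key + static_cast<int64_t>(occupied.size()) - 1;
	for (size_t i = 0; i < probe_keys.size(); i++) {
		const int64_t key = probe_keys[i];
		// only keys inside the build range can match
		if (key < min_key || key > max_key) {
			continue;
		}
		const size_t slot = static_cast<size_t>(key - min_key);
		// check for a match in the build
		if (occupied[slot]) {
			result.build_sel.push_back(slot);
			result.probe_sel.push_back(i);
		}
	}
	return result;
}
```

When the build is dense and every probe key matched, `probe_sel` is just the identity sequence, which corresponds to the case where the diff references the probe chunk instead of slicing it.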