HAWQ-1607. This commit implements applying Bloom filter during Scan outer table #1360
…uter table, test cases will be added with HAWQ-1608.
1. Push down the Bloom filter structure to the outer table scan (only Parquet is supported).
2. Check whether each tuple from the outer table is found in the Bloom filter structure.
3. Add a GUC, hawq_hashjoin_bloomfilter_sampling_number. This GUC controls the Bloom filter sampling number: while scanning the outer table, if the ratio over the first N tuples is larger than hawq_hashjoin_bloomfilter_ratio, the remaining tuples will not be checked by the Bloom filter.
4. The Bloom filter won't be created in two cases: (1) there is any expression on the outer join keys other than a T_Var (projection), such as fact.c1 + 1 = dim.c1; (2) there are multiple join keys, e.g. fact.c1 = dim.c1 and fact.c2 = dim.c2. These cases involve pushing down expression and projection information to the scan, which will be implemented later.
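For readers following the sampling behaviour described for hawq_hashjoin_bloomfilter_sampling_number, here is a minimal sketch of one way such a cutoff could work. It is not the code from this commit. It assumes that "the ratio" means the fraction of sampled outer tuples that were found in the Bloom filter, and every name in it (SamplingState, sampling_init, sampling_record, sampling_should_check) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for the sampling window; not HAWQ's real struct. */
typedef struct SamplingState {
    uint64_t sample_limit;   /* stands in for hawq_hashjoin_bloomfilter_sampling_number */
    double   max_pass_ratio; /* stands in for hawq_hashjoin_bloomfilter_ratio */
    uint64_t checked;        /* tuples probed so far inside the window */
    uint64_t passed;         /* probed tuples that were found in the filter */
    bool     filter_enabled; /* false once the filter is judged ineffective */
} SamplingState;

void sampling_init(SamplingState *s, uint64_t sample_limit, double max_pass_ratio)
{
    assert(s != NULL);
    s->sample_limit = sample_limit;
    s->max_pass_ratio = max_pass_ratio;
    s->checked = 0;
    s->passed = 0;
    s->filter_enabled = true;
}

/* Record one probe result. Once the window is full, decide (once) whether
 * the remaining tuples should still be checked against the filter. */
void sampling_record(SamplingState *s, bool found_in_filter)
{
    if (!s->filter_enabled || s->checked >= s->sample_limit)
        return; /* window already closed, or filter already switched off */

    s->checked++;
    if (found_in_filter)
        s->passed++;

    if (s->checked == s->sample_limit) {
        /* Assumption: a high pass ratio means the filter rejects too little. */
        double ratio = (double) s->passed / (double) s->checked;
        if (ratio > s->max_pass_ratio)
            s->filter_enabled = false;
    }
}

bool sampling_should_check(const SamplingState *s)
{
    return s->filter_enabled;
}
```

With a window of 4 and a threshold of 0.5, three hits out of four would switch the check off for the rest of the scan under this reading, while one hit out of four would leave it on.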
src/include/nodes/execnodes.h
@@ -1522,6 +1540,9 @@ typedef struct ScanState
/* The type of the table that is being scanned */
TableType tableType;

/* Runtime filter */
struct RuntimeFilterState runtimeFilter;
Since ScanState sometimes needs to be copied, it's better to use a pointer to RuntimeFilterState in the struct and allocate the memory dynamically.
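One reading of this suggestion, as a rough sketch: if the struct holds a pointer, copying the struct copies only the pointer, so every copy sees the same filter state. The Demo* types and demo_filter_create below are hypothetical stand-ins, and calloc stands in for whatever allocator the executor would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for RuntimeFilterState. */
typedef struct DemoRuntimeFilterState {
    bool checkEnabled;
    long checkedTuples;
} DemoRuntimeFilterState;

/* Hypothetical, simplified stand-in for ScanState with a pointer member. */
typedef struct DemoScanState {
    int tableType;
    DemoRuntimeFilterState *runtimeFilter; /* allocated separately */
} DemoScanState;

/* Allocate a zeroed filter state; returns NULL if allocation fails. */
DemoRuntimeFilterState *demo_filter_create(void)
{
    return calloc(1, sizeof(DemoRuntimeFilterState));
}

/* Copy a scan state by value. Only the pointer is duplicated, so the
 * original and the copy keep referring to one shared filter state. */
DemoScanState demo_scan_copy(const DemoScanState *src)
{
    assert(src != NULL);
    return *src;
}
```

The copy stays small no matter how large the filter state grows, at the cost of having to decide who owns and frees the shared allocation.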
src/backend/executor/nodeHashjoin.c
memcpy(rf->hashfunctions, hjstate->hj_HashTable->hashfunctions, i*sizeof(FmgrInfo));
size_t size = offsetof(BloomFilterData, data) + hjstate->hj_HashTable->bloomfilter->data_size;
rf->bloomfilter = palloc0(size);
memcpy(rf->bloomfilter, hjstate->hj_HashTable->bloomfilter, size);
Why not just assign hjstate->hj_HashTable->bloomfilter to rf->bloomfilter?
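To make the trade-off behind this question concrete, here is a small sketch contrasting a deep copy with plain pointer assignment for a struct that ends in a variable-length data area. DemoBloomFilter and its helpers are hypothetical; the real BloomFilterData layout is only assumed to end in a data member sized by data_size, as the offsetof expression in the diff suggests, and malloc/calloc stand in for palloc0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical filter layout: a header followed by data_size bytes of bits. */
typedef struct DemoBloomFilter {
    size_t  data_size;
    uint8_t data[]; /* flexible array member */
} DemoBloomFilter;

/* Allocate a zeroed filter with room for data_size bytes of bit data. */
DemoBloomFilter *demo_bloom_create(size_t data_size)
{
    size_t size = offsetof(DemoBloomFilter, data) + data_size;
    DemoBloomFilter *bf = calloc(1, size);
    if (bf != NULL)
        bf->data_size = data_size;
    return bf;
}

/* Deep copy: header and bit data are duplicated into a new allocation,
 * mirroring the size computation plus memcpy pattern in the diff. */
DemoBloomFilter *demo_bloom_copy(const DemoBloomFilter *src)
{
    assert(src != NULL);
    size_t size = offsetof(DemoBloomFilter, data) + src->data_size;
    DemoBloomFilter *dst = malloc(size);
    if (dst != NULL)
        memcpy(dst, src, size);
    return dst;
}
```

Assigning the pointer instead of calling the copy helper avoids the extra allocation and memcpy, but it is only safe if the original filter outlives every reader and is not modified after it is handed over.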
src/backend/cdb/cdbparquetrowgroup.c
if(hawqAttrToParquetColNum[i] == 1)
int colReaderIndex = 0;
int16 proj[128];
It's better to use natts instead of 128.
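As an illustration of sizing the array by the attribute count rather than a fixed 128, a sketch along these lines might apply. What proj actually holds is not visible in the hunk, so the function below (demo_build_projection) and its meaning, collecting the indexes of attributes whose mapping value is 1, are assumptions made purely for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: collect the indexes i where attr_to_col_num[i] == 1.
 * The result array is sized from natts, so there is no fixed 128 limit.
 * The caller frees the returned array; NULL means allocation failed. */
int16_t *demo_build_projection(const int *attr_to_col_num, int natts, int *nproj)
{
    assert(attr_to_col_num != NULL && nproj != NULL);
    assert(natts >= 0 && natts <= INT16_MAX);

    /* Allocate at least one slot so a zero-attribute table stays valid. */
    size_t slots = (size_t) (natts > 0 ? natts : 1);
    int16_t *proj = malloc(slots * sizeof(int16_t));
    if (proj == NULL)
        return NULL;

    int count = 0;
    for (int i = 0; i < natts; i++) {
        if (attr_to_col_num[i] == 1)
            proj[count++] = (int16_t) i;
    }
    *nproj = count;
    return proj;
}
```

A table with 200 qualifying attributes would overflow a 128-entry stack array, whereas the natts-sized allocation handles it without special cases.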
LGTM
LGTM.
Merged into master.
Please review, thanks!