Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove index joins #9751

Merged
merged 5 commits into from Nov 22, 2023
Merged

Remove index joins #9751

merged 5 commits into from Nov 22, 2023

Conversation

taniabogatsch
Copy link
Contributor

After repeated issues with the index join, we decided that removing it is the best choice for duckdb as of its current state. Currently, the index join is slow, and the results are often incorrect. Making it both performant and correct requires significant effort and with the current state and all the past changes to our indexes, it might be faster to reimplement the index join than attempting to fix it.

I did more benchmarking as suggested by @pdet. Looking at the profiler, we primarily see a lot of idle time spent waiting to acquire locks. From this, I concluded that we should use both write and read locks for indexes to support fast index joins. If we only use indexes for constraints, we always need to acquire a write lock, as we cannot allow concurrent changes to the index while validating constraints. In this scenario, we do not need to distinguish between read and write locks.

Benchmarks

Here is an example of a scenario where an index join should beat a hash join. However, duckdb performs significantly worse when using index joins.

name Index Join ART High Selectivity
group art

load
SET threads=1;
PRAGMA force_index_join;
CREATE TABLE pk_tbl (j VARCHAR, i VARCHAR PRIMARY KEY, h INTEGER, k INTEGER, n VARCHAR);
INSERT INTO pk_tbl SELECT (range || 'abcdef' || range || 'abcdef' || range) AS j, j AS i, range, range, j AS n FROM range(800000);
CREATE TABLE fk_tbl (i VARCHAR REFERENCES pk_tbl(i));
INSERT INTO fk_tbl SELECT (range * 80000) || 'abcdef' || (range * 80000) || 'abcdef' || (range * 80000) FROM range(10);

run
SELECT * FROM pk_tbl JOIN fk_tbl USING (i);

Index Join

tania@motorbook duckdb % build/reldebug/benchmark/benchmark_runner benchmark/micro/index/join/high_selectivity_index_join.benchmark 
name	run	timing
benchmark/micro/index/join/high_selectivity_index_join.benchmark	1	0.047384
benchmark/micro/index/join/high_selectivity_index_join.benchmark	2	0.047003
benchmark/micro/index/join/high_selectivity_index_join.benchmark	3	0.046800
benchmark/micro/index/join/high_selectivity_index_join.benchmark	4	0.047088
benchmark/micro/index/join/high_selectivity_index_join.benchmark	5	0.046856

Hash Join

tania@motorbook duckdb % build/reldebug/benchmark/benchmark_runner benchmark/micro/index/join/high_selectivity_index_join.benchmark
name	run	timing
benchmark/micro/index/join/high_selectivity_index_join.benchmark	1	0.009318
benchmark/micro/index/join/high_selectivity_index_join.benchmark	2	0.009221
benchmark/micro/index/join/high_selectivity_index_join.benchmark	3	0.009251
benchmark/micro/index/join/high_selectivity_index_join.benchmark	4	0.009372
benchmark/micro/index/join/high_selectivity_index_join.benchmark	5	0.009366

@Mytherin Mytherin merged commit 0a18eb0 into duckdb:main Nov 22, 2023
42 of 43 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

@taniabogatsch taniabogatsch deleted the remove-idx-join branch November 22, 2023 09:36
krlmlr added a commit to duckdb/duckdb-r that referenced this pull request Dec 11, 2023
Merge pull request duckdb/duckdb#9751 from taniabogatsch/remove-idx-join
Merge pull request duckdb/duckdb#9735 from carlopi/wasm_threads
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants