Out of memory on exploding join #11752
@soerenwolfers thanks for this issue. On Linux (running in Docker), I get this:

    duckdb(n=200000, sorted=False) took 1.31s and used 0.0441GiB
        where size of join was: 8.02e+07rows (~2.39GiB)
    duckdb(n=205000, sorted=False) took 2.09s and used 2.89GiB
        where size of join was: 8.42e+07rows (~2.51GiB)
    duckdb(n=200000, sorted=True) took 0.714s and used 0.00505GiB
        where size of join was: 8.02e+07rows (~2.39GiB)
    duckdb(n=205000, sorted=True) took 0.672s and used 0.00216GiB
        where size of join was: 8.42e+07rows (~2.51GiB)

On macOS (M2 Pro), I get:

    duckdb(n=200000, sorted=False) took 1.28s and used 0.00197GiB
        where size of join was: 8.02e+07rows (~2.39GiB)
    duckdb(n=205000, sorted=False) took 1.61s and used 0.0136GiB
        where size of join was: 8.42e+07rows (~2.51GiB)
    duckdb(n=200000, sorted=True) took 0.653s and used 0.00102GiB
        where size of join was: 8.02e+07rows (~2.39GiB)
    duckdb(n=205000, sorted=True) took 0.626s and used 0.00639GiB
        where size of join was: 8.42e+07rows (~2.51GiB)

So this likely has something to do with the different allocators used. Are you running on AMD64 or arm64?
AMD64, specifically:
Thanks!
What happens?
Depending on the size of the dataframes involved, duckdb will either use essentially no memory or suddenly run entirely out of memory when I increase the size of the pandas dataframe inputs to the following query:
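(The query itself did not survive on this page. Given the "sparse matrix multiplication" framing below, a plausible shape is the following, where all table and column names, and the key cardinality of 500, are hypothetical:)

```python
import duckdb
import numpy as np
import pandas as pd

# Hypothetical inputs: two pandas dataframes sharing a low-cardinality
# integer "key" column, so that an equi-join on it explodes.
n = 200_000
df_l = pd.DataFrame({"key": np.random.randint(0, 500, n), "a": np.random.rand(n)})
df_r = pd.DataFrame({"key": np.random.randint(0, 500, n), "b": np.random.rand(n)})

# Plausible shape of the query: join on the shared key, reduce the products.
duckdb.sql("SELECT sum(l.a * r.b) FROM df_l l JOIN df_r r USING (key)").fetchall()
```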
I am running into this problem from code that essentially does sparse matrix multiplication. I was told in #11588, fairly enough, that I should use the right tool for the job, but this one feels like it might be an easy win (which would let me stick with duckdb more), since some threshold might just need tuning.
PS1: One workaround I found is sorting the input data before passing it to duckdb.
PS2: The problem doesn't occur for inputs generated purely within duckdb, e.g.
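(The original inline example was not preserved here; a hypothetical DuckDB-native equivalent of the pandas setup, with the same assumed key cardinality, would be:)

```python
import duckdb

# Hypothetical variant: build both inputs inside DuckDB instead of pandas.
n = 205_000
con = duckdb.connect()
con.execute(
    f"CREATE TABLE t1 AS "
    f"SELECT (random() * 500)::INT AS key, random() AS a FROM range({n})"
)
con.execute(
    f"CREATE TABLE t2 AS "
    f"SELECT (random() * 500)::INT AS key, random() AS b FROM range({n})"
)
# The same exploding join, reportedly without the memory spike.
con.execute("SELECT sum(t1.a * t2.b) FROM t1 JOIN t2 USING (key)").fetchall()
```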
To Reproduce
Using duckdb version 0.10.2 on Ubuntu 20.04.6 LTS with an 8-core Intel i5 and 16 GB RAM, run
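(The original script was not included on this page; the following is a hypothetical reconstruction. The key cardinality of 500 and the RSS-delta memory measurement are assumptions, chosen so that the printed join sizes roughly match the figures quoted here and in the comments above:)

```python
import time

import duckdb
import numpy as np
import pandas as pd
import psutil  # assumption: the original may have measured memory differently


def bench(n: int, sorted_input: bool) -> None:
    # ~500 distinct keys makes the self-join explode to roughly
    # 500 * (n / 500)**2 rows, matching the ~8e7 rows quoted for n=200_000.
    keys = np.random.randint(0, 500, size=n)
    if sorted_input:
        keys = np.sort(keys)  # the PS1 workaround: pre-sort the inputs
    df_l = pd.DataFrame({"key": keys, "a": np.random.rand(n)})
    df_r = pd.DataFrame({"key": keys, "b": np.random.rand(n)})

    # Exact size of the join: sum over keys of (left count * right count).
    join_rows = int((df_l["key"].value_counts() ** 2).sum())

    proc = psutil.Process()
    rss_before = proc.memory_info().rss
    t0 = time.perf_counter()
    duckdb.sql(
        "SELECT sum(l.a * r.b) FROM df_l l JOIN df_r r USING (key)"
    ).fetchall()
    elapsed = time.perf_counter() - t0
    # Rough proxy for memory use: growth in resident set size.
    used_gib = (proc.memory_info().rss - rss_before) / 2**30

    print(
        f"duckdb(n={n}, sorted={sorted_input}) took {elapsed:.3g}s "
        f"and used {used_gib:.3g}GiB\n"
        f"    where size of join was: {join_rows:.3g}rows"
    )


for n in (200_000, 205_000):
    for sorted_input in (False, True):
        bench(n, sorted_input)
```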
Note how increasing `n` from `200_000` to `205_000` increases memory usage from `0.0401GiB` to `3.42GiB`, despite increasing the number of rows in the join by only 5%. Increasing `n` further to `400_000` gets my Python process OOM-killed.
OS:
Linux
DuckDB Version:
0.10.2
DuckDB Client:
Python
Full Name:
Soeren Wolfers
Affiliation:
G-Research
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?