Out of memory on exploding join #11752

Open · 2 tasks done
soerenwolfers opened this issue Apr 21, 2024 · 3 comments

soerenwolfers commented Apr 21, 2024

What happens?

DuckDB uses almost no memory on the following query until the input pandas dataframes cross a size threshold, at which point it suddenly runs entirely out of memory:

SELECT 
    df1.key AS key1,
    df2.key AS key2,
    count(*) AS c
FROM df1
JOIN df2 
USING (match)
GROUP BY ALL

I am running into this problem from code that essentially does sparse matrix multiplication. I was previously told in #11588, fairly, that I should use the right tool for the job, but this one feels like it might be an easy win (which would let me stick with duckdb more), since it looks like some threshold just needs tuning.
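
For context, a minimal sketch of the "sparse matrix multiplication" reading of that query (illustrative only, the scipy detour and names are not my actual code): if A[i, k] counts the rows of df1 with key == i and match == k, and B does the same for df2, then the query computes C = A @ B.T.

import numpy as np
import scipy.sparse as sp

def as_sparse(df, m):
    # one entry per dataframe row; duplicate (key, match) pairs sum up
    return sp.coo_matrix(
        (np.ones(len(df)), (df['key'], df['match'])), shape=(m, m)
    ).tocsr()

# C[i, j] equals the query's c for key1=i, key2=j:
# C = as_sparse(df1, m) @ as_sparse(df2, m).T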

PS1: One workaround I found is sorting the input data before passing it to duckdb.
PS2: The problem doesn't occur for inputs generated purely within duckdb, e.g.

SELECT
    unnest(ARRAY(SELECT (random() * {m})::BIGINT FROM range({n}))) AS key,
    unnest(ARRAY(SELECT (random() * {m})::BIGINT FROM range({n}))) AS match
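
A hypothetical sketch of how that slots into the repro below (fake_data_duckdb is an illustrative name; it hands back a DuckDB relation instead of a pandas frame, which is my reading of "purely within duckdb"):

import duckdb

def fake_data_duckdb(m, n):
    # data generated purely within duckdb; no pandas round-trip
    return duckdb.query(f"""
        SELECT
            unnest(ARRAY(SELECT (random() * {m})::BIGINT FROM range({n}))) AS key,
            unnest(ARRAY(SELECT (random() * {m})::BIGINT FROM range({n}))) AS match
    """)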

To Reproduce

Using duckdb version 0.10.2 on Ubuntu 20.04.6 LTS with an 8-core Intel i5 and 16 GB RAM, run (the pip line in a shell, the rest in Python):

pip install duckdb psutil faker pandas pyarrow

import duckdb
import pandas as pd
import numpy as np
import timeit
import time
import threading
import itertools
import psutil

def used_memory():
    # System-wide memory in use, in GiB.
    vm = psutil.virtual_memory()
    return (vm.total - vm.available) / 2**30

class Timer:
    """Context manager reporting wall time and peak additional memory use."""

    def __init__(self, name):
        self.name = name
        self.stopped = False
        self.start_memory = None
        self.max_memory = None

    def __enter__(self):
        self.t1 = timeit.default_timer()
        self.start_memory = used_memory()
        self.max_memory = self.start_memory

        # Poll memory usage in the background while the body runs.
        def update_max_memory():
            while not self.stopped:
                self.max_memory = max(self.max_memory, used_memory())
                time.sleep(0.05)

        self.thread = threading.Thread(target=update_max_memory)
        self.thread.start()
        return self

    def __exit__(self, *args):
        self.stopped = True
        self.thread.join()  # ensure max_memory is final before reading it
        t2 = timeit.default_timer()
        print(f"{self.name} took {t2 - self.t1:.3g}s and used {self.max_memory - self.start_memory:.3g}GiB")


def fake_data(m, n) -> pd.DataFrame:
    # n rows with 'key' and 'match' drawn uniformly from [0, m).
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    key = rng.integers(0, m, n)
    match = rng.integers(0, m, n)
    return pd.DataFrame({'key': key, 'match': match})

m = 500
for sort, n in itertools.product([False, True], [200_000, 205_000]):
    df1 = fake_data(m, n)
    df2 = fake_data(m, n)
    if sort:
        df1.sort_values(['key', 'match'], inplace=True)
        df2.sort_values(['key', 'match'], inplace=True)
    with Timer(f"duckdb(n={n}, sorted={sort})"):
        res = duckdb.query("""
            SELECT
                df1.key AS key1,
                df2.key AS key2,
                count(*) AS c
            FROM df1
            JOIN df2
            USING (match)
            GROUP BY ALL
            """).arrow()
    n_rows = duckdb.query("SELECT sum(c) FROM res").fetchall()[0][0]
    n_bytes_per_row = 8 * 4  # rough estimate: four 8-byte values per joined row
    print(f"\twhere size of join was: {n_rows:.3g}rows (~{n_rows * n_bytes_per_row / 2**30:.3g}GiB)")
duckdb(n=200000, sorted=False) took 2.96s and used 0.0401GiB
	where size of join was: 8.02e+07rows (~2.39GiB)
duckdb(n=205000, sorted=False) took 4.55s and used 3.42GiB
	where size of join was: 8.42e+07rows (~2.51GiB)
duckdb(n=200000, sorted=True) took 1.34s and used 0.000141GiB
	where size of join was: 8.02e+07rows (~2.39GiB)
duckdb(n=205000, sorted=True) took 1.33s and used 0.00144GiB
	where size of join was: 8.42e+07rows (~2.51GiB)

Note how increasing n from 200_000 to 205_000 raises memory usage from 0.0401 GiB to 3.42 GiB, even though the join result grows by only about 5%.
Increasing n further, to 400_000, gets my Python process OOM-killed.
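
PS3: Possibly relevant, though I have not verified it against this exact repro: capping duckdb's memory and pointing it at a temp directory, so the join can spill to disk rather than grow unbounded, e.g.

import duckdb

con = duckdb.connect()  # in-memory database
con.execute("SET memory_limit = '4GB'")                # cap the buffer manager
con.execute("SET temp_directory = '/tmp/duckdb_tmp'")  # allow spilling to disk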

OS:

Linux

DuckDB Version:

0.10.2

DuckDB Client:

Python

Full Name:

Soeren Wolfers

Affiliation:

G-Research

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
szarnyasg (Collaborator) commented:

@soerenwolfers thanks for this issue. On Linux (running in Docker), I get this:

duckdb(n=200000, sorted=False) took 1.31s and used 0.0441GiB
        where size of join was: 8.02e+07rows (~2.39GiB)
duckdb(n=205000, sorted=False) took 2.09s and used 2.89GiB
        where size of join was: 8.42e+07rows (~2.51GiB)
duckdb(n=200000, sorted=True) took 0.714s and used 0.00505GiB
        where size of join was: 8.02e+07rows (~2.39GiB)
duckdb(n=205000, sorted=True) took 0.672s and used 0.00216GiB
        where size of join was: 8.42e+07rows (~2.51GiB)

On macOS (M2 Pro), I get:

duckdb(n=200000, sorted=False) took 1.28s and used 0.00197GiB
	where size of join was: 8.02e+07rows (~2.39GiB)
duckdb(n=205000, sorted=False) took 1.61s and used 0.0136GiB
	where size of join was: 8.42e+07rows (~2.51GiB)
duckdb(n=200000, sorted=True) took 0.653s and used 0.00102GiB
	where size of join was: 8.02e+07rows (~2.39GiB)
duckdb(n=205000, sorted=True) took 0.626s and used 0.00639GiB
	where size of join was: 8.42e+07rows (~2.51GiB)

So this likely has something to do with the different allocators used. Are you running on AMD64 or arm64?
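
For reference, a quick standard-library way to check (nothing DuckDB-specific):

import platform
print(platform.machine())  # e.g. 'x86_64' on AMD64, 'aarch64'/'arm64' on ARM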

soerenwolfers (Author) commented:

AMD64, specifically:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 140
model name	: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
stepping	: 1
microcode	: 0xb4
cpu MHz		: 2400.000
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 27
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 invpcid_single cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves split_lock_detect dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid movdiri movdir64b fsrm avx512_vp2intersect md_clear flush_l1d arch_capabilities
vmx flags	: vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple pml ept_mode_based_exec tsc_scaling
bugs		: spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb gds
bogomips	: 4838.40
clflush size	: 64
cache_alignment	: 64
address sizes	: 39 bits physical, 48 bits virtual
power management:

szarnyasg (Collaborator) commented:

Thanks!
