What happens?
Repeatedly inserting pandas DataFrames via connection.register() inside a single open transaction causes process RSS to grow linearly with each chunk, even after unregister(), del, and gc.collect().
Using SET python_scan_all_frames=true and referencing a local variable directly in the query, instead of explicitly calling register()/unregister(), avoids the problem.
To Reproduce
The following script, run unmodified, shows RSS growing with each chunk until the process exceeds 10 GB and the assertion fails.
Setting the USE_REGISTER flag to False skips register()/unregister() and queries the local variable directly, which bypasses the problem.
#!/usr/bin/env python
import gc
import os
import tempfile
import shutil
from pathlib import Path

import duckdb
import numpy as np
import pandas as pd
import psutil

CHUNKS = 100
MAX_RSS_GB = 10.0
# ~1 GiB/chunk:
# 8 int64 columns * 16,000,000 rows ~= 1.0 GiB, plus pandas overhead.
ROWS_PER_CHUNK = 16_000_000
COLS = 8
USE_REGISTER = True


def rss_gb() -> float:
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 3)


def make_chunk(chunk_idx: int) -> pd.DataFrame:
    base = np.random.randint(
        0,
        1_000_000,
        size=(ROWS_PER_CHUNK, COLS),
        dtype=np.int64,
    )
    return pd.DataFrame(
        base,
        columns=[f"c{i}" for i in range(COLS)],
    )


tmp = Path(tempfile.mkdtemp(prefix="duckdb-register-rss-repro-"))
try:
    con = duckdb.connect(str(tmp / "db.duckdb"))
    if not USE_REGISTER:
        con.execute("SET python_scan_all_frames=false;")
    con.execute(
        """
        CREATE TABLE t (
            c0 BIGINT,
            c1 BIGINT,
            c2 BIGINT,
            c3 BIGINT,
            c4 BIGINT,
            c5 BIGINT,
            c6 BIGINT,
            c7 BIGINT
        );
        """
    )
    con.begin()
    peak = rss_gb()
    print(f"duckdb={duckdb.__version__} start_rss={peak:.2f}GB", flush=True)
    for i in range(CHUNKS):
        chunk = make_chunk(i)
        if USE_REGISTER:
            con.register("input", chunk)
            con.execute("INSERT INTO t SELECT * FROM input;")
            con.unregister("input")
        else:
            con.execute("INSERT INTO t SELECT * FROM chunk;")
        del chunk
        gc.collect()
        current = rss_gb()
        peak = max(peak, current)
        print(
            f"chunk={i + 1}/{CHUNKS} rss={current:.2f}GB peak={peak:.2f}GB",
            flush=True,
        )
        assert current < MAX_RSS_GB, (
            f"RSS exceeded {MAX_RSS_GB}GB after chunk {i + 1}: "
            f"rss={current:.2f}GB peak={peak:.2f}GB"
        )
    con.rollback()
    print(f"done peak={peak:.2f}GB", flush=True)
finally:
    shutil.rmtree(tmp, ignore_errors=True)
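As a rough sanity check on the "~1 GiB/chunk" comment in the script, the raw NumPy payload of one chunk can be worked out from the script's own constants. This is a sketch only: BYTES_PER_INT64 and chunks_to_limit are names introduced here for illustration, and the last figure assumes (hypothetically) that each chunk's full buffer stays resident and nothing else contributes to RSS.

```python
# Back-of-the-envelope size of one chunk's raw data, using the constants
# from the reproduction script.
ROWS_PER_CHUNK = 16_000_000
COLS = 8
BYTES_PER_INT64 = 8  # illustrative name, not part of the script
MAX_RSS_GB = 10.0    # the script divides RSS by 1024 ** 3, so this is GiB

chunk_bytes = ROWS_PER_CHUNK * COLS * BYTES_PER_INT64
chunk_gib = chunk_bytes / (1024 ** 3)

# Hypothetical worst case: every chunk's buffer is retained in full.
# The first chunk count whose cumulative size exceeds the limit:
chunks_to_limit = int(MAX_RSS_GB // chunk_gib) + 1

print(f"chunk_bytes={chunk_bytes:,}")          # chunk_bytes=1,024,000,000
print(f"chunk_gib={chunk_gib:.3f}")            # chunk_gib=0.954
print(f"chunks_to_limit={chunks_to_limit}")    # chunks_to_limit=11
```

Under that assumption the assertion would be expected to trip at around chunk 11 of 100; the actual chunk depends on baseline RSS and pandas/DuckDB overhead.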
OS:
Ubuntu 24.04 LTS
DuckDB Package Version:
1.5.2
Python Version:
3.14
Full Name:
João Pedro Maia Rafael
Affiliation:
Upper Delta, Unipessoal LDA
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration to reproduce the issue?