Memory is not reclaimed with unregister() when inside a transaction #469

@jprafael

What happens?

Repeatedly inserting pandas DataFrames via connection.register() inside a single open transaction causes process RSS to grow linearly with each chunk, even after unregister(), del, and gc.collect().

Referencing the local DataFrame variable directly in the SQL (a replacement scan) with SET python_scan_all_frames=false, instead of calling register()/unregister() explicitly, avoids the problem.

To Reproduce

The following script, run unmodified, shows RSS growing with each chunk until it exceeds 10 GB, at which point the assertion fails.
Setting the USE_REGISTER flag to False skips register() entirely and avoids the problem.

#!/usr/bin/env python

import gc
import os
import tempfile
import shutil
from pathlib import Path

import duckdb
import numpy as np
import pandas as pd
import psutil


CHUNKS = 100
MAX_RSS_GB = 10.0

# ~1 GiB/chunk:
# 8 int64 columns * 16,000,000 rows ~= 1.0 GiB, plus pandas overhead.
ROWS_PER_CHUNK = 16_000_000
COLS = 8

USE_REGISTER = True


def rss_gb() -> float:
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 3)


def make_chunk(chunk_idx: int) -> pd.DataFrame:
    base = np.random.randint(
        0,
        1_000_000,
        size=(ROWS_PER_CHUNK, COLS),
        dtype=np.int64,
    )
    return pd.DataFrame(
        base,
        columns=[f"c{i}" for i in range(COLS)],
    )


tmp = Path(tempfile.mkdtemp(prefix="duckdb-register-rss-repro-"))

try:
    con = duckdb.connect(str(tmp / "db.duckdb"))
    
    if not USE_REGISTER:
        con.execute("SET python_scan_all_frames=false;")
    
    con.execute(
        """
        CREATE TABLE t (
            c0 BIGINT,
            c1 BIGINT,
            c2 BIGINT,
            c3 BIGINT,
            c4 BIGINT,
            c5 BIGINT,
            c6 BIGINT,
            c7 BIGINT
        );
        """
    )

    con.begin()

    peak = rss_gb()
    print(f"duckdb={duckdb.__version__} start_rss={peak:.2f}GB", flush=True)

    for i in range(CHUNKS):
        chunk = make_chunk(i)

        if USE_REGISTER:
            con.register("input", chunk)
            con.execute("INSERT INTO t SELECT * FROM input;")
            con.unregister("input")
        else:
            con.execute("INSERT INTO t SELECT * FROM chunk;")


        del chunk
        gc.collect()

        current = rss_gb()
        peak = max(peak, current)
        print(
            f"chunk={i + 1}/{CHUNKS} rss={current:.2f}GB peak={peak:.2f}GB",
            flush=True,
        )

        assert current < MAX_RSS_GB, (
            f"RSS exceeded {MAX_RSS_GB}GB after chunk {i + 1}: "
            f"rss={current:.2f}GB peak={peak:.2f}GB"
        )

    con.rollback()
    print(f"done peak={peak:.2f}GB", flush=True)

finally:
    shutil.rmtree(tmp, ignore_errors=True)
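
To put a number on "grows linearly", the per-chunk growth can be estimated from the lines the script prints. The following is a minimal sketch using only the standard library; the function name `rss_slope_gb_per_chunk` is hypothetical, and the sample lines are synthetic placeholders in the script's output format, not measurements from this report.

```python
import re

# Matches the per-chunk lines printed by the reproduction script,
# e.g. "chunk=3/100 rss=3.50GB peak=3.50GB".
LINE_RE = re.compile(r"chunk=(\d+)/\d+ rss=([\d.]+)GB")


def rss_slope_gb_per_chunk(log_text: str) -> float:
    """Least-squares slope of RSS (GB) against chunk index."""
    points = [
        (int(m.group(1)), float(m.group(2)))
        for m in LINE_RE.finditer(log_text)
    ]
    if len(points) < 2:
        raise ValueError("need at least two chunk lines to fit a slope")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


# Synthetic input: values are made up to exercise the parser only.
sample = """\
chunk=1/100 rss=1.50GB peak=1.50GB
chunk=2/100 rss=2.50GB peak=2.50GB
chunk=3/100 rss=3.50GB peak=3.50GB
"""
slope = rss_slope_gb_per_chunk(sample)
print(f"{slope:.2f} GB per chunk")
```

A slope close to the size of one chunk would suggest that each registered DataFrame is being retained for the lifetime of the transaction; a slope near zero would suggest memory is being reclaimed.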

OS:

Ubuntu 24.04 LTS

DuckDB Package Version:

1.5.2

Python Version:

3.14

Full Name:

João Pedro Maia Rafael

Affiliation:

Upper Delta, Unipessoal LDA

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
