
Arrow pycapsules use 2x the memory of fetch_arrow_table #137

@paultiq

Description


What happens?

DuckDBPyRelation::ToArrowCapsule consumes 2x the memory of ToArrowTableInternal: it fully materializes the result before constructing the Arrow table instead of leveraging a PhysicalArrowCollector.

Some rough test results (edit: for a 1GB Arrow Table):

| Test | Time | Peak | Table Size |
| --- | --- | --- | --- |
| Unchanged - ExecuteOrThrow(); | 5.818s | 2.36GB | 1.08GB |
| Streaming - ExecuteOrThrow(true); | 4.682s | 1.22GB | 1.08GB |
| PhysicalArrowCollector + Stream (code not shown) | 3.937s | 1.23GB | 1.08GB |
| FetchArrowTable (see ToArrowTableInternal) | 4.447s | 1.17GB | 1.08GB |
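
For context, the first three rows all exercise the Arrow C stream capsule path (pa.table(rel)) with different internal implementations, while the last row uses the existing fetch_arrow_table path backed by ToArrowTableInternal. A minimal sketch of the two Python-level entry points (measure one at a time, in separate processes, to get clean peak figures):

import duckdb
import pyarrow as pa

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"
rel = duckdb.sql(query)

# Capsule path: pyarrow consumes the relation via __arrow_c_stream__,
# i.e. DuckDBPyRelation::ToArrowCapsule.
capsule_table = pa.table(rel)

# Existing path: backed by ToArrowTableInternal.
fetched_table = rel.fetch_arrow_table()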
Streaming

The simplest change is switching the materialization to streaming, i.e. ExecuteOrThrow(true); this alone halves the peak memory consumption. The current, fully materializing call is:

ExecuteOrThrow();

PhysicalArrowCollector

A more complete change would be to refactor ToArrowCapsule and ToArrowTableInternal so that both share common PhysicalArrowCollector code. A light wrapper that constructs an ArrowArrayStream would be needed in DuckDBPyResult::FetchArrowCapsule.

Related optimization - ToPolars

I noticed this when looking at ToPolars memory consumption in duckdb/duckdb#19356. If the above issue is solved, the Polars DataFrame construction can be done directly, without the ToArrowTableInternal step at all.

A simple change like the below would significantly improve memory and performance:

if (!lazy) {
    auto arrow = ToArrowTableInternal(batch_size, true);
    return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}

becomes

if (!lazy) {
    auto polars_module = pybind11::module_::import("polars");
    return py::cast<PolarsDataFrame>(polars_module.attr("DataFrame")(*this));
}
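
From the Python side the same shortcut can already be exercised: Polars can consume the relation directly through the Arrow C stream capsule, with no intermediate pyarrow.Table. A rough sketch (assuming a Polars version whose DataFrame constructor accepts objects implementing __arrow_c_stream__, which is what the proposed C++ change relies on):

import duckdb
import polars as pl

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"
rel = duckdb.sql(query)

# Polars reads the relation via the Arrow C stream capsule,
# skipping the pyarrow.Table materialization entirely.
df = pl.DataFrame(rel)
print(df.shape)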

To Reproduce

import os
import time
from resource import getrusage, RUSAGE_SELF
import duckdb
import pyarrow as pa

query = "SELECT r, r||'foo' AS foo FROM range(100000000) t(r) WHERE r % 2 = 0"

start = time.perf_counter()
rel = duckdb.sql(query)
table = pa.table(rel)  # Direct import from Arrow C Stream capsule
elapsed = time.perf_counter() - start

peak_gb = getrusage(RUSAGE_SELF).ru_maxrss * 1024 / (1024**3)  # ru_maxrss is in KiB on Linux
table_gb = table.nbytes / (1024**3)

print(f"Time: {elapsed:.3f}s  Peak: {peak_gb:.2f}GB  Table: {table_gb:.2f}GB")

OS:

Linux

DuckDB Package Version:

1.4.1, 1.5.0+dev110

Python Version:

3.13

Full Name:

Paul T

Affiliation:

Iqmo

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
