c/driver/postgresql: adbc_ingest silently misaligns list/large_list/fixed_size_list rows when the source Arrow array is sliced (parent.offset > 0) #4319

@mediprtl

What happened?

PostgresCopyListFieldWriter::Write (c/driver/postgresql/copy/writer.h, both the IsFixedSize and variable-length branches) computes the child range for each row from the logical row index without adding array_view_->offset. When the parent List / LargeList / FixedSizeList array has offset > 0 (a sliced parent), the variable-length branch reads the wrong slot of the offsets buffer, and the fixed-size branch multiplies the wrong base index by the element size. The resulting child ranges are still valid indices into the unsliced child values buffer, so nothing fails; list elements simply end up attached to the wrong rows.

Practical impact: silent, per-row drift of list-column values when an Arrow table is sliced into multiple batches and ingested via adbc_ingest with mode="create" then mode="append". The first chunk (offset=0) is always correct; every subsequent chunk's list/array column is shifted by the chunk's parent.offset. Scalar columns are unaffected because their writers route through ArrowArrayViewGetInt*, which honors offset.

Reproduced end-to-end on the postgres-test service from this repo's compose.yaml against adbc-driver-postgresql 1.11.0, with pyarrow 23 and 24, for list<string>, large_list<string>, and fixed_size_list<string, 2>.

Stack Trace

No exception — silent data corruption.

How can we reproduce the bug?

docker compose up --detach --wait postgres-test, then:

import pyarrow as pa
from adbc_driver_postgresql import dbapi

URI = "postgresql://postgres:password@localhost:5432/postgres"
N, batch = 6, 3

def expected(i):
    # variable length so any drift breaks structure, not just values
    return [f"r{i}-a", f"r{i}-b"] if i % 2 == 0 else [f"r{i}-x"]

tbl = pa.table({
    "pk":   pa.array(list(range(N))),
    "tags": pa.array([expected(i) for i in range(N)],
                     type=pa.large_list(pa.string())),
})

with dbapi.connect(URI) as conn, conn.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS adbc_list_bug")
    for i, off in enumerate(range(0, N, batch)):
        cur.adbc_ingest("adbc_list_bug", tbl.slice(off, batch),
                        mode="create" if i == 0 else "append")
    cur.execute("SELECT pk, tags FROM adbc_list_bug ORDER BY pk")
    for pk, tags in cur.fetchall():
        print(pk, tags, "OK" if tags == expected(pk) else "DRIFTED")

Observed: pk 0–2 correct, pk 3–5 drifted. Repeats identically with pa.list_(pa.string()) and pa.list_(pa.string(), 2).

The variable-length structure also drifts — pk=3 (expected 1 element) receives the 2-element value from row 0, which nails the diagnosis to "the offsets buffer is being misread" rather than "child values are shifted independently."
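To make the misread concrete without a database, here is a toy model in plain Python. It is only a sketch: the list names and the helper `read_row_ignoring_offset` are made up for illustration and are not driver code. It models the variable-length layout as one offsets buffer plus one flat child buffer, and reads a slice starting at row 3 while ignoring the slice offset, as described above.

```python
# Toy model of a list<string> array: an offsets buffer plus a flat child buffer.
# Same data shape as the repro: even rows have 2 elements, odd rows have 1.
rows = [["r0-a", "r0-b"], ["r1-x"], ["r2-a", "r2-b"],
        ["r3-x"], ["r4-a", "r4-b"], ["r5-x"]]

child = [v for row in rows for v in row]
offsets = [0]
for row in rows:
    offsets.append(offsets[-1] + len(row))  # [0, 2, 3, 5, 6, 8, 9]


def read_row_ignoring_offset(index):
    """Hypothetical helper: reads offsets[index] without adding the slice offset."""
    return child[offsets[index]:offsets[index + 1]]


# A slice starting at row 3 shares the buffers above and only records offset=3.
slice_offset, slice_length = 3, 3
written = [read_row_ignoring_offset(i) for i in range(slice_length)]
wanted = rows[slice_offset:slice_offset + slice_length]

for pk, (got, want) in enumerate(zip(written, wanted), start=slice_offset):
    print(pk, got, "OK" if got == want else "DRIFTED")
```

In this model pk=3 comes back with row 0's two-element value, which is the same structural drift seen in the real repro.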

Root cause

In c/driver/postgresql/copy/writer.h, PostgresCopyListFieldWriter::Write uses the logical index directly in both template branches:

if constexpr (IsFixedSize) {
  start = index * array_view_->layout.child_size_elements;
  end   = start + array_view_->layout.child_size_elements;
} else {
  start = ArrowArrayViewListChildOffset(array_view_, index);
  end   = ArrowArrayViewListChildOffset(array_view_, index + 1);
}

ArrowArrayViewListChildOffset (nanoarrow inline_array.h) is, unlike its sibling ArrowArrayViewGetIntUnsafe, not offset-aware — it reads buffer_views[1].data.as_int32[i] (or as_int64) directly. And the fixed-size branch never references offset either. So both branches misbehave when array_view_->offset > 0.
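The fixed-size branch has no offsets buffer to misread, so it may help to see the same failure there. This is a plain-Python sketch under assumed names (`LIST_SIZE`, `fixed_range_ignoring_offset`), not the driver's code: the child range is computed as index times list size, with the slice offset left out.

```python
# Toy model of fixed_size_list<string, 2>: no offsets buffer, just a flat child
# buffer where row i occupies child[i * LIST_SIZE : (i + 1) * LIST_SIZE].
LIST_SIZE = 2
rows = [[f"r{i}-a", f"r{i}-b"] for i in range(6)]
child = [v for row in rows for v in row]


def fixed_range_ignoring_offset(index):
    """Hypothetical helper: base index is the logical index, offset never added."""
    start = index * LIST_SIZE
    return start, start + LIST_SIZE


# A slice starting at row 3 keeps the full child buffer and records offset=3.
slice_offset, slice_length = 3, 3
written = [child[slice(*fixed_range_ignoring_offset(i))]
           for i in range(slice_length)]
wanted = rows[slice_offset:slice_offset + slice_length]

for pk, (got, want) in enumerate(zip(written, wanted), start=slice_offset):
    print(pk, got, "OK" if got == want else "DRIFTED")
```

Every row of the slice receives the values of the row three positions earlier, i.e. the shift equals the slice offset.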

PyArrow's Table.slice(off, len) produces parent List / FixedSizeList arrays with array.offset = off, sharing the original offsets/child buffers, so any multi-batch adbc_ingest path (or any caller passing a sliced source) trips this.

Suggested fix

// Physical position in the unsliced parent: slice offset plus logical row index.
const int64_t physical = array_view_->offset + index;
if constexpr (IsFixedSize) {
  start = physical * array_view_->layout.child_size_elements;
  end   = start + array_view_->layout.child_size_elements;
} else {
  start = ArrowArrayViewListChildOffset(array_view_, physical);
  end   = ArrowArrayViewListChildOffset(array_view_, physical + 1);
}

I built the driver locally with that change and swapped the resulting .so into the unmodified wheel. All three list types then ingest correctly across multi-chunk slices, in the same venv where the unpatched wheel drifts.
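As a cheap sanity check of the index arithmetic in the suggested fix, independent of the driver build, the following plain-Python sketch models both layouts and checks that every possible slice reads back exactly the rows it covers. The function names `child_range` and `read_slice` are assumptions made for this illustration.

```python
def child_range(index, offset, offsets=None, list_size=None):
    """Return (start, end) into the child buffer for logical row `index`."""
    physical = offset + index  # position in the unsliced parent
    if list_size is not None:  # fixed-size layout
        start = physical * list_size
        return start, start + list_size
    return offsets[physical], offsets[physical + 1]  # variable-length layout


def read_slice(child, offset, length, **layout):
    """Read `length` rows of a slice that starts at `offset` in the parent."""
    return [child[slice(*child_range(i, offset, **layout))]
            for i in range(length)]


var_rows = [[f"r{i}-a", f"r{i}-b"] if i % 2 == 0 else [f"r{i}-x"]
            for i in range(6)]
fix_rows = [[f"r{i}-a", f"r{i}-b"] for i in range(6)]

var_child = [v for row in var_rows for v in row]
fix_child = [v for row in fix_rows for v in row]
var_offsets = [0]
for row in var_rows:
    var_offsets.append(var_offsets[-1] + len(row))

# Every (offset, length) slice should read back exactly the rows it covers.
for off in range(6):
    for length in range(6 - off + 1):
        assert read_slice(var_child, off, length,
                          offsets=var_offsets) == var_rows[off:off + length]
        assert read_slice(fix_child, off, length,
                          list_size=2) == fix_rows[off:off + length]
print("all slices round-trip")
```

This only exercises the arithmetic; it says nothing about nanoarrow's actual buffer handling, which is why the end-to-end check against the patched wheel still matters.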

Workaround (driver-user side)

Pass non-sliced inputs only. Table.combine_chunks() and pa.concat_tables([sliced]) are not sufficient — they short-circuit for a single-chunk slice and preserve offset > 0. Per-column ChunkedArray.combine_chunks() (or Table.from_arrays([c.combine_chunks() for c in t.columns], names=…)) does materialize and reset offsets to 0.

Environment/Setup

  • adbc-driver-postgresql 1.11.0 (also reproduces on main)
  • pyarrow 23.0.1 and 24.0.0
  • macOS arm64; the postgres-test container from this repo's compose.yaml
