[C++] IPC Stream Reader doesn't check if extra fields are present for RecordBatches #31566

@asfimport

Description

I looked through recent commits and I don't think this issue has been patched since:

import pyarrow as pa

# Example batches (definitions assumed here to make the repro runnable):
# rb2 has one extra field relative to rb1's schema.
rb1 = pa.record_batch([pa.array([1])], names=["c1"])
rb2 = pa.record_batch([pa.array([1]), pa.array([2.0])], names=["c1", "c2"])

with pa.output_stream("/tmp/f1") as sink:
  with pa.RecordBatchStreamWriter(sink, rb1.schema) as writer:
    writer.write(rb1)
    end_rb1 = sink.tell()

with pa.output_stream("/tmp/f2") as sink:
  with pa.RecordBatchStreamWriter(sink, rb2.schema) as writer:
    writer.write(rb2)
    start_rb2_only = sink.tell()
    writer.write(rb2)
    end_rb2 = sink.tell()

# Stitch together rb1's schema, rb1, and rb2 without its schema.
with pa.output_stream("/tmp/f3") as sink:
  with pa.input_stream("/tmp/f1") as inp:
    sink.write(inp.read(end_rb1))
  with pa.input_stream("/tmp/f2") as inp:
    inp.seek(start_rb2_only)
    sink.write(inp.read(end_rb2 - start_rb2_only))

with pa.ipc.open_stream("/tmp/f3") as reader:
  print(reader.read_all())

Yields:

pyarrow.Table
c1: int64
----
c1: [[1],[1]]

I would expect this to error because the second stitched-in record batch has more fields than the stream's schema declares, but it appears to load just fine.

Is this intended behavior?

Reporter: Micah Kornfield / @emkornfield

Note: This issue was originally created as ARROW-16160. Please see the migration documentation for further details.
