Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Importing an extension type without ARROW:extension:metadata crashes #41741

Closed
paleolimbot opened this issue May 20, 2024 · 5 comments
Closed
Assignees
Milestone

Comments

@paleolimbot
Copy link
Member

Describe the bug, including details regarding any error messages, version, and platform.

I ran into this when running tests in geoarrow-c, which includes an implementation of the GeoArrow extension types for Arrow C++ ( geoarrow/geoarrow-c#94 ). The sequence of events that triggers this is:

  • Extension type registered that contains both ARROW:extension:name and ARROW:extension:metadata
  • ImportType() from the C data interface with an extension type that only contains ARROW:extension:name .

The backtrace from C++ that throws the exception is:

libc++abi.dylib!__cxa_throw (Unknown Source:0)
libarrow.1500.2.0.dylib!std::__1::__throw_length_error[abi:ue170006](char const*) (Unknown Source:0)
libarrow.1500.2.0.dylib!std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__throw_length_error[abi:ue170006]() const (Unknown Source:0)
libarrow.1500.2.0.dylib!std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__append(unsigned long) (Unknown Source:0)
libarrow.1500.2.0.dylib!arrow::KeyValueMetadata::DeleteMany(std::__1::vector<long long, std::__1::allocator<long long>>) (Unknown Source:0)
libarrow.1500.2.0.dylib!arrow::(anonymous namespace)::SchemaImporter::DoImport() (Unknown Source:0)
libarrow.1500.2.0.dylib!arrow::ImportType(ArrowSchema*) (Unknown Source:0)
geoarrow_arrow_test!ArrowTest_ArrowTestExtensionTypeRegister_Test::TestBody() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/src/geoarrow/geoarrow_arrow_test.cc:116)
geoarrow_arrow_test!void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:2607)
geoarrow_arrow_test!void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:2643)
geoarrow_arrow_test!testing::Test::Run() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:2682)
geoarrow_arrow_test!testing::TestInfo::Run() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:2861)
geoarrow_arrow_test!testing::TestSuite::Run() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:3015)
geoarrow_arrow_test!testing::internal::UnitTestImpl::RunAllTests() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:5855)
geoarrow_arrow_test!bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:2607)
geoarrow_arrow_test!bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:2643)
geoarrow_arrow_test!testing::UnitTest::Run() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest.cc:5438)
geoarrow_arrow_test!RUN_ALL_TESTS() (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/include/gtest/gtest.h:2490)
geoarrow_arrow_test!main (/Users/deweydunnington/Desktop/rscratch/geoarrow-c/build/_deps/googletest-src/googletest/src/gtest_main.cc:52)
start (Unknown Source:0)

Reproducer in Python:

import pyarrow as pa

# pip install --extra-index-url https://pypi.fury.io/arrow-nightlies/ \
#     --prefer-binary --pre nanoarrow
import nanoarrow as na

class DummyExtType(pa.ExtensionType):
    def __init__(self):
        super().__init__(pa.null(), "arrow.test")

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        return DummyExtType()

    def __arrow_ext_serialize__(self):
        return b"{}"

pa.register_extension_type(DummyExtType())

# Works!
na_schema = na.extension_type(na.null(), "arrow.test", b"{}")
print(na_schema.metadata)
#> <nanoarrow._lib.SchemaMetadata>
#> - b'ARROW:extension:name': b'arrow.test'
#> - b'ARROW:extension:metadata': b'{}'
na_array = na.c_array([], na_schema)
pa.array(na_array)
#> <pyarrow.lib.ExtensionArray object at 0x105aee080>
#> 0 nulls

# Crashes
na_schema = na.extension_type(na.null(), "arrow.test")
print(na_schema.metadata)
#> <nanoarrow._lib.SchemaMetadata>
#> - b'ARROW:extension:name': b'arrow.test'
na_array = na.c_array([], na_schema)
# pa.array(na_array)
#> The Kernel crashed while executing code in the current cell or a previous cell.

Component(s)

C++

@jorisvandenbossche
Copy link
Member

A reproducer just with pyarrow as well (using your registered ext type example):

class DummyArray:
    def __init__(self):
        self._field = pa.field("", pa.null(), metadata={"ARROW:extension:name": "arrow.test"})
        self._array = pa.array([None, None], pa.null())

    def __arrow_c_array__(self, requested_schema=None):
        return self._field.__arrow_c_schema__(), self._array.__arrow_c_array__()[1]

pa.array(DummyArray())

which crashes, and when adding "ARROW:extension:metadata": "" to the dict, it works.

Now, apart from your fix (thanks for that!), I realize from the example that it is right now actually impossible to register an extension type without metadata (at least from Python).

I don't know it we should fix that (e.g. allow __arrow_ext_serialize__ to return None in addition to bytes?), or whether this is OK as long as we ensure this still works when actually receiving data with extension types without metadata (i.e. your fix for this issue)

@jorisvandenbossche
Copy link
Member

We should probably also check that this works in other places that imports extension types, like reading IPC.

@jorisvandenbossche
Copy link
Member

For IPC it seems to work:

table = pa.Table.from_arrays(
   [pa.array([None, None], pa.null())],
   schema=pa.schema([pa.field("a", pa.null(), metadata={"ARROW:extension:name": "arrow.test"})])
)
with pa.ipc.new_file("/tmp/test_ext_no_meta.arrow", table.schema) as f:
    f.write(table)

with pa.ipc.open_file("/tmp/test_ext_no_meta.arrow") as f:
    result = f.read_all()
print(result.schema)

pa.register_extension_type(DummyExtType())

with pa.ipc.open_file("/tmp/test_ext_no_meta.arrow") as f:
    result = f.read_all()
print(result.schema)

And the same for write/read Parquet (where we get the extension type from the ARROW:schema)

@paleolimbot paleolimbot changed the title [C++] Importing an extension type without ARROW:extension:metadata that was registered with metadata crashes [C++] Importing an extension type without ARROW:extension:metadata crashes May 22, 2024
@paleolimbot
Copy link
Member Author

Good call! It looks like the version of this that IPC and Parquet (which I believe uses the IPC Schema encoding) use is here and has more or less the same logic.

if (metadata != nullptr) {
// Look for extension metadata in custom_metadata field
int name_index = metadata->FindKey(kExtensionTypeKeyName);
if (name_index != -1) {
std::shared_ptr<ExtensionType> ext_type =
GetExtensionType(metadata->value(name_index));
if (ext_type != nullptr) {
int data_index = metadata->FindKey(kExtensionMetadataKeyName);
std::string type_data = data_index == -1 ? "" : metadata->value(data_index);
ARROW_ASSIGN_OR_RAISE(type, ext_type->Deserialize(type, type_data));
// Remove the metadata, for faithful roundtripping
if (data_index != -1) {
RETURN_NOT_OK(metadata->DeleteMany({name_index, data_index}));
} else {
RETURN_NOT_OK(metadata->Delete(name_index));
}
}
// NOTE: if extension type is unknown, we do not raise here and
// simply return the storage type.
}
}

(Also good call on the registration not mattering!)

paleolimbot added a commit that referenced this issue May 24, 2024
…ttempting to delete it (#41763)

### Rationale for this change

Neither Schema.fbs nor the Arrow C Data interface nor the columnar specification indicates that the ARROW:extension:metadata key must be present; however, the `ImportType()` implementation assumes that both `ARROW:extension:name` and `ARROW:extension:metadata` are both present and throws an exception if `ARROW:extension:metadata` is missing. This causes pyarrow to crash (see issue for reproducer).

### What changes are included in this PR?

This PR checks that the extension metadata is present before attempting to delete it.

### Are these changes tested?

Yes (test added).

### Are there any user-facing changes?

No.
* GitHub Issue: #41741

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
@paleolimbot paleolimbot added this to the 17.0.0 milestone May 24, 2024
@paleolimbot
Copy link
Member Author

Issue resolved by pull request 41763
#41763

vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024
…fore attempting to delete it (apache#41763)

### Rationale for this change

Neither Schema.fbs nor the Arrow C Data interface nor the columnar specification indicates that the ARROW:extension:metadata key must be present; however, the `ImportType()` implementation assumes that both `ARROW:extension:name` and `ARROW:extension:metadata` are both present and throws an exception if `ARROW:extension:metadata` is missing. This causes pyarrow to crash (see issue for reproducer).

### What changes are included in this PR?

This PR checks that the extension metadata is present before attempting to delete it.

### Are these changes tested?

Yes (test added).

### Are there any user-facing changes?

No.
* GitHub Issue: apache#41741

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants