
[Format] Add metadata for single and double precision complex numbers #16264

Open · asfimport opened this issue Mar 16, 2017 · 10 comments

@asfimport (Collaborator)

Numerical computing libraries like NumPy and TensorFlow feature complex64 and complex128 numbers.

Reporter: Wes McKinney / @wesm


Note: This issue was originally created as ARROW-638. Please see the migration documentation for further details.

@asfimport (Collaborator, Author)

Julien Le Dem / @julienledem:

I see several possibilities:

  • first-class complex types: but we have to curate types so that we don't create too many of them.
  • have two float columns: but I guess we want the two values next to each other.
  • have one float column where odd and even indices are the components of the complex value: which I guess would be the same binary representation as an array of numpy.complex values.
  • embed floats in a fixed_width_byte_array: we lose meaning in the metadata, but it also allows a zero-copy import of a numpy array.
  • define a generic row/struct/compound type which is a fixed-width representation of several values: basically the same binary representation as the previous one, but associating metadata with it.
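
For concreteness, here is a minimal numpy-only sketch (no Arrow API involved) of the layout equivalence behind the third and fourth options: interleaved float32 pairs are byte-for-byte identical to an array of numpy complex64 values.

import numpy as np

# Interleaved (real, imag) float32 pairs...
interleaved = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
# ...reinterpreted in place (zero-copy) as complex64 values.
as_complex = interleaved.view(np.complex64)
assert as_complex.tolist() == [1 + 2j, 3 + 4j]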

@asfimport (Collaborator, Author)

Wes McKinney / @wesm:
As a practical matter, complex64 and complex128 probably need to be compatible with std::complex, which is implemented as a C struct with the values next to each other in memory – this is used in TensorFlow for example:

https://github.com/tensorflow/tensorflow/blob/6dfa434252b1cc3f6c523edc329853b3e8e4911c/tensorflow/core/framework/numeric_types.h#L33

In C++, this boils down to

  template<typename _Tp>
    struct complex
    {
      <SNIP>

    private:
      _Tp _M_real;
      _Tp _M_imag;
    };
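
A quick numpy check (a sketch, using numpy as a stand-in for the C++ side) confirms that complex64 has the same byte layout as two contiguous float32 fields, mirroring the struct above:

import numpy as np

# A struct of two float32s has the same size and byte layout as complex64.
pair = np.dtype([("real", np.float32), ("imag", np.float32)])
assert pair.itemsize == np.dtype(np.complex64).itemsize == 8
z = np.array([1.5 - 2.5j], dtype=np.complex64)
assert z.tobytes() == np.array([(1.5, -2.5)], dtype=pair).tobytes()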

@asfimport (Collaborator, Author)

Wes McKinney / @wesm:
This might be simpler to handle as an ExtensionType in C++ to start.

@jorisvandenbossche (Member)

Mailing list discussion about this from 2021: https://www.mail-archive.com/dev@arrow.apache.org/msg23352.html

@kou (Member) commented Jun 10, 2024

@sjperkins (Contributor) commented Aug 26, 2024

Replying to @rok's comment here, as it's probably more relevant to this thread.

Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.

Oh, interesting approach. Are there other systems that do this? Would this approach be better suited to vectorization? I suppose it would be more efficient for Parquet.

I'm not aware of other systems that do this and haven't considered the benefits for vectorization. Mostly it satisfies the fixed-width requirements in FixedShapeTensor (and I guess VariableShapeTensor):

Result<std::shared_ptr<Tensor>> FixedShapeTensorType::MakeTensor(
    const std::shared_ptr<ExtensionScalar>& scalar) {
  const auto& ext_scalar = internal::checked_cast<const ExtensionScalar&>(*scalar);
  const auto& ext_type =
      internal::checked_cast<const FixedShapeTensorType&>(*scalar->type);
  if (!is_fixed_width(*ext_type.value_type())) {
    return Status::TypeError("Cannot convert non-fixed-width values to Tensor.");
  }
  const auto& array =
      internal::checked_cast<const FixedSizeListScalar*>(ext_scalar.value.get())->value;
  if (array->null_count() > 0) {
    return Status::Invalid("Cannot convert data with nulls to Tensor.");
  }

Then one could interpret the raw bytes as real and imaginary components. I guess one would have to consider endianness here.

From hazy memory, the previous attempt created numeric primitive complex64 and complex128 types based on std::complex<float> and std::complex<double>, but this resulted in extensive changes throughout the code base due to the need to support the relevant operations for those types. This probably has knock-on effects on binary size.

A FixedSizeBinary(64/128) might be a good compromise: it creates a fixed-width type without requiring every primitive operation to be implemented for complex numbers.

@rok (Member) commented Aug 26, 2024

@sjperkins, looking at the past discussion I agree an extension type is the way to go. I'm not sure about the storage type, though. It would be nice to have a zero-copy path to numpy, which, as per this, would need the fixed_size_list approach.

That said, wouldn't it be possible to interpret a Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 were added (and strides were done correctly, namely with the complex dimension having the smallest stride)? I think the memory layout in this case would match numpy's.
FixedShapeTensor can be zero-copied to numpy today and can probably be cast to complex in numpy. It would of course be better to add a new extension type so users don't have to do this manually.
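
A rough sketch of that idea using pyarrow's existing fixed_shape_tensor and to_numpy_ndarray APIs (the complex reinterpretation happens on the numpy side; the shapes here are just for illustration):

import numpy as np
import pyarrow as pa

# Each element is a (2, 2) float32 tensor whose trailing dimension of size 2
# holds (real, imag) pairs.
tensor_type = pa.fixed_shape_tensor(pa.float32(), (2, 2))
flat = pa.array(np.arange(8, dtype=np.float32))
storage = pa.FixedSizeListArray.from_arrays(flat, 4)
tensors = pa.ExtensionArray.from_storage(tensor_type, storage)

# Zero-copy to numpy, then view the trailing float32 pairs as complex64.
as_floats = tensors.to_numpy_ndarray()             # shape (2, 2, 2), float32
as_complex = as_floats.view(np.complex64)[..., 0]  # shape (2, 2), complex64
assert as_complex.dtype == np.complex64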

@sjperkins (Contributor) commented Aug 27, 2024

@sjperkins, looking at the past discussion I agree an extension type is the way to go. I'm not sure about the storage type, though. It would be nice to have a zero-copy path to numpy, which, as per this, would need the fixed_size_list approach.

/cc @maupardh1 who started #39754

I don't think using FixedSizeBinary precludes zero-copy -- the following, similar to @zeroshade's suggestion in #39753 (comment), seems to work (although I realise more work would be needed at the Cython layer to accept np.array(..., np.complex64) in pa.array):

import pyarrow as pa
import numpy as np
from numpy.testing import assert_array_equal

COMPLEX64_STORAGE_TYPE = pa.binary(8)

class ComplexFloatExtensionType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, COMPLEX64_STORAGE_TYPE, 'complex64')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized_data):
        return ComplexFloatExtensionType()

    def wrap_array(self, storage_array):
        return pa.ExtensionArray.from_storage(self, storage_array)

    def __arrow_ext_class__(self):
        return ComplexFloatExtensionArray

class ComplexFloatExtensionArray(pa.ExtensionArray):
    @classmethod
    def from_numpy(cls, array):
        if array.dtype != np.complex64:
            raise ValueError("Only complex64 dtype is supported")
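        # Wrap the numpy memory directly (no validity buffer), so the Arrow
        # storage array aliases the numpy array's buffer without copying.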
        storage_array = pa.FixedSizeBinaryArray.from_buffers(
            COMPLEX64_STORAGE_TYPE, len(array),
            [None, pa.py_buffer(array.view(np.uint8))]
        )
        return pa.ExtensionArray.from_storage(ComplexFloatExtensionType(), storage_array)

    def to_numpy(self, zero_copy_only=True, writeable=False):
        return np.frombuffer(self.storage.buffers()[1], dtype=np.complex64)

# Register the extension type with Arrow
pa.register_extension_type(ComplexFloatExtensionType())

data = np.array([1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j], dtype=np.complex64)
arrow_data = ComplexFloatExtensionArray.from_numpy(data)

# Arrow buffers use Numpy buffer
assert arrow_data.storage.buffers()[1].address == data.view(np.uint8).ctypes.data
roundtrip = arrow_data.to_numpy()
assert_array_equal(roundtrip, data)
# Final array uses original array buffers
assert roundtrip.ctypes.data == data.ctypes.data

data2 = pa.array(data, type=ComplexFloatExtensionType())
assert_array_equal(data, data2)

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

print(arrow_data)
print(tensor)

One downside might be that there isn't yet support for custom extension type output, so at the C++ Arrow layer the developer would be looking at a bunch of binary data. But it's an extension type, and it wouldn't preclude developers from interpreting the fixed-width buffers as std::complex<float> via the data_as/mutable_data_as methods.

That said, wouldn't it be possible to interpret a Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 were added (and strides were done correctly, namely with the complex dimension having the smallest stride)? I think the memory layout in this case would match numpy's. FixedShapeTensor can be zero-copied to numpy today and can probably be cast to complex in numpy. It would of course be better to add a new extension type so users don't have to do this manually.

Yes this should work -- I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method. However, wouldn't this create the need to special-case a lot of type handling? i.e. there may need to be:

  1. A basic ComplexFloat + ComplexDouble
  2. A FixedShapeTensor and a ComplexFixedShapeTensor (or possibly indicate complex in the serialized metadata?)
  3. Same for VariableShapeTensors
  4. and maybe other compound types.

If the above are valid concerns, my bias is towards the pa.binary(8/16) storage type. It's fixed width (so it works as a tensor type) and has the same binary format as std::complex<float/double> and np.complex64/128. To be fair, the underlying FixedSizeList buffer also has the same binary format, but I think it's not fixed width because nulls are possible.
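
To illustrate that last point, a small sketch: a FixedSizeList(float32, 2) built over the same values carries byte-identical child-buffer contents to the complex64 array.

import numpy as np
import pyarrow as pa

data = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)
floats = pa.array(data.view(np.float32))  # zero-copy float32 view of the same memory
as_list = pa.FixedSizeListArray.from_arrays(floats, 2)

# The flattened child values hold the same bytes as the complex array.
assert as_list.values.to_numpy().tobytes() == data.tobytes()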

@rok (Member) commented Aug 27, 2024

I don't think using FixedSizeBinary precludes zero-copy -- the following, similar to @zeroshade's suggestion in #39753 (comment), seems to work (although I realise more work would be needed at the Cython layer to accept np.array(..., np.complex64) in pa.array).

Agreed; my point was that an extension array with FixedSizeList storage doesn't preclude zero-copy either, and I'm not sure there are benefits to using it over FixedSizeBinary for storage. Introducing an extension type doesn't come with a requirement to introduce kernels or other machinery for it.

Yes this should work -- I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method. However, wouldn't this create the need to special-case a lot of type handling? i.e. there may need to be:

  1. A basic ComplexFloat + ComplexDouble
  2. A FixedShapeTensor and a ComplexFixedShapeTensor (or possibly indicate complex in the serialized metadata?)
  3. Same for VariableShapeTensors
  4. and maybe other compound types.

I think the question here is also: do we want a complex tensor extension array, or a complex extension array? I am not sure we can use a complex extension array as the storage of a FixedShapeTensorArray, though if we can, that would be best. Can an extension array be the storage of another extension array? (Or am I misunderstanding, and you mean we should introduce a primary complex type?)

@sjperkins (Contributor) commented Aug 27, 2024

I think the question here is also: do we want a complex tensor extension array, or a complex extension array? I am not sure we can use a complex extension array as the storage of a FixedShapeTensorArray,

I suspect this is possible because in this excerpt from the larger example above,

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

arrow_data is a Python extension array associated with a Python extension type whose storage type is pa.binary(8). Does your question relate to the underlying C++ extension type and array?

ipdb> type(arrow_data)
<class '__main__.ComplexFloatExtensionArray'>

Although I think wrapping it in a FixedSizeList is necessary for it to be accepted by the FixedShapeTensorType, which uses a FixedSizeList under the hood?

though if we can, that would be best. Can an extension array be the storage of another extension array?

The Python example above might suggest it's possible, but I'm admittedly not sure about the C++ layer.

(Or am I misunderstanding, and you mean we should introduce a primary complex type?)

I'm advocating for a C++ extension ComplexFloatType (complex64) with an associated extension ComplexFloatArray. The FixedShapeTensorType should accept the ComplexFloatType as a value type, and the associated FixedShapeTensorArray should be able to wrap the ComplexFloatArray. This could be used for a proper conversion to NumPy and Pandas. And the same again with a ComplexDoubleType (complex128). They would be exposed in the Python layer, much like the other canonical extension types.
