
[Format] Add metadata for single and double precision complex numbers #16264

Open · asfimport opened this issue Mar 16, 2017 · 10 comments

@asfimport (Collaborator)

Numerical computing libraries like NumPy and TensorFlow feature complex64 and complex128 numbers.

Reporter: Wes McKinney / @wesm


Note: This issue was originally created as ARROW-638. Please see the migration documentation for further details.

@asfimport (Collaborator, Author)

Julien Le Dem / @julienledem:

I see several possibilities:

  • first-class complex types: but we have to curate types so that we don't create too many of them.
  • have two float columns: but I guess we want the two values next to each other.
  • have one float column where odd and even indices are the components of the complex value: which I guess would be the same binary representation as an array of numpy.complex values.
  • embed floats in a fixed_width_byte_array: we lose meaning in the metadata, but it also allows a zero-copy import of a numpy array.
  • define a generic row/struct/compound type which is a fixed-width representation of several values: basically the same binary representation as the previous one, but associating metadata with it.
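
For concreteness, here is a minimal numpy-only sketch (no Arrow API involved) of the layout equivalence behind the third and fourth options: interleaved float32 pairs are byte-for-byte identical to an array of numpy complex64 values.

import numpy as np

# Interleaved (real, imag) float32 pairs...
interleaved = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
# ...reinterpreted in place (zero-copy) as complex64 values.
as_complex = interleaved.view(np.complex64)
assert as_complex.tolist() == [1 + 2j, 3 + 4j]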

@asfimport (Collaborator, Author)

Wes McKinney / @wesm:
As a practical matter, complex64 and complex128 probably need to be compatible with std::complex, which is implemented as a C struct with the values next to each other in memory – this is used in TensorFlow for example:

https://github.com/tensorflow/tensorflow/blob/6dfa434252b1cc3f6c523edc329853b3e8e4911c/tensorflow/core/framework/numeric_types.h#L33

In C++, this boils down to

  template<typename _Tp>
    struct complex
    {
      <SNIP>

    private:
      _Tp _M_real;
      _Tp _M_imag;
    };
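
A quick numpy check (a sketch, using numpy as a stand-in for the C++ side) confirms that complex64 has the same byte layout as two contiguous float32 fields, mirroring the struct above:

import numpy as np

# A struct of two float32s has the same size and byte layout as complex64.
pair = np.dtype([("real", np.float32), ("imag", np.float32)])
assert pair.itemsize == np.dtype(np.complex64).itemsize == 8
z = np.array([1.5 - 2.5j], dtype=np.complex64)
assert z.tobytes() == np.array([(1.5, -2.5)], dtype=pair).tobytes()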

@asfimport (Collaborator, Author)

Wes McKinney / @wesm:
This might be simpler to handle as an ExtensionType in C++ to start.

@jorisvandenbossche (Member)

Mailing list discussion about this from 2021: https://www.mail-archive.com/dev@arrow.apache.org/msg23352.html

@kou (Member) commented Jun 10, 2024

@sjperkins (Contributor) commented Aug 26, 2024

Replying to @rok's comment here, as it's probably more relevant to this thread.

Thanks for the ping @rok -- I really should re-propose a Complex number. I'm now thinking along the lines of ComplexFloat = FixedSizeBinary(64) and ComplexDouble = FixedSizeBinary(128), rather than the original FixedSizeListArray(float32(), 2) and FixedSizeListArray(float64(), 2) approach. I think the former will work better with FixedShapeTensor and VariableShapeTensor.

Oh, interesting approach. Are there other systems that do this? Would this approach be better suited to vectorization? I suppose it would be more efficient for Parquet.

I'm not aware of other systems that do this and haven't considered the benefits for vectorization. Mostly it satisfies the fixed-width requirements in FixedShapeTensor (and I guess VariableShapeTensor):

Result<std::shared_ptr<Tensor>> FixedShapeTensorType::MakeTensor(
    const std::shared_ptr<ExtensionScalar>& scalar) {
  const auto& ext_scalar = internal::checked_cast<const ExtensionScalar&>(*scalar);
  const auto& ext_type =
      internal::checked_cast<const FixedShapeTensorType&>(*scalar->type);
  if (!is_fixed_width(*ext_type.value_type())) {
    return Status::TypeError("Cannot convert non-fixed-width values to Tensor.");
  }
  const auto& array =
      internal::checked_cast<const FixedSizeListScalar*>(ext_scalar.value.get())->value;
  if (array->null_count() > 0) {
    return Status::Invalid("Cannot convert data with nulls to Tensor.");
  }

Then one could interpret the raw bytes as real and imaginary components. I guess one would have to consider endianness here.

From hazy memory, the previous attempt created numeric primitive complex64 and complex128 types based on std::complex<float> and std::complex<double>, but this resulted in extensive changes throughout the code base due to the need to support the relevant operations for those types. This probably has knock-on effects on binary size.

A FixedSizeBinary(64/128) might be a good compromise: it creates a fixed-width type without requiring every primitive operation to be implemented for complex numbers.

@rok (Member) commented Aug 26, 2024

@sjperkins, looking at the past discussion I agree an extension type is the way to go. I'm not sure about the storage type, though. It would be nice to have a zero-copy path to numpy, which, as per this, would need the fixed_size_list approach.

That said, wouldn't it be possible to interpret a Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 were added (and strides were done correctly, namely with the complex dimension having the smallest stride)? I think the memory layout in this case would match numpy's.
FixedShapeTensor can be zero-copied to numpy today and can probably be cast to complex in numpy. It would of course be better to add a new extension type so users don't have to do this manually.
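
A rough sketch of that idea using pyarrow's existing fixed_shape_tensor and to_numpy_ndarray APIs (the complex reinterpretation happens on the numpy side; the shapes here are just for illustration):

import numpy as np
import pyarrow as pa

# Each element is a (2, 2) float32 tensor whose trailing dimension of size 2
# holds (real, imag) pairs.
tensor_type = pa.fixed_shape_tensor(pa.float32(), (2, 2))
flat = pa.array(np.arange(8, dtype=np.float32))
storage = pa.FixedSizeListArray.from_arrays(flat, 4)
tensors = pa.ExtensionArray.from_storage(tensor_type, storage)

# Zero-copy to numpy, then view the trailing float32 pairs as complex64.
as_floats = tensors.to_numpy_ndarray()             # shape (2, 2, 2), float32
as_complex = as_floats.view(np.complex64)[..., 0]  # shape (2, 2), complex64
assert as_complex.dtype == np.complex64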

@sjperkins (Contributor) commented Aug 27, 2024

@sjperkins, looking at the past discussion I agree an extension type is the way to go. I'm not sure about the storage type, though. It would be nice to have a zero-copy path to numpy, which, as per this, would need the fixed_size_list approach.

/cc @maupardh1 who started #39754

I don't think using FixedSizeBinary precludes zero-copy -- the following, similar to @zeroshade's suggestion in #39753 (comment), seems to work (although I realise more work would be needed at the Cython layer to accept np.array(..., np.complex64) in pa.array):

import pyarrow as pa
import numpy as np
from numpy.testing import assert_array_equal

COMPLEX64_STORAGE_TYPE = pa.binary(8)

class ComplexFloatExtensionType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, COMPLEX64_STORAGE_TYPE, 'complex64')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized_data):
        return ComplexFloatExtensionType()

    def wrap_array(self, storage_array):
        return pa.ExtensionArray.from_storage(self, storage_array)

    def __arrow_ext_class__(self):
        return ComplexFloatExtensionArray

class ComplexFloatExtensionArray(pa.ExtensionArray):
    @classmethod
    def from_numpy(cls, array):
        if array.dtype != np.complex64:
            raise ValueError("Only complex64 dtype is supported")
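        # Wrap the numpy memory directly (no validity buffer), so the Arrow
        # storage array aliases the numpy array's buffer without copying.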
        storage_array = pa.FixedSizeBinaryArray.from_buffers(
            COMPLEX64_STORAGE_TYPE, len(array),
            [None, pa.py_buffer(array.view(np.uint8))]
        )
        return pa.ExtensionArray.from_storage(ComplexFloatExtensionType(), storage_array)

    def to_numpy(self, zero_copy_only=True, writeable=False):
        return np.frombuffer(self.storage.buffers()[1], dtype=np.complex64)

# Register the extension type with Arrow
pa.register_extension_type(ComplexFloatExtensionType())

data = np.array([1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j], dtype=np.complex64)
arrow_data = ComplexFloatExtensionArray.from_numpy(data)

# Arrow buffers use Numpy buffer
assert arrow_data.storage.buffers()[1].address == data.view(np.uint8).ctypes.data
roundtrip = arrow_data.to_numpy()
assert_array_equal(roundtrip, data)
# Final array uses original array buffers
assert roundtrip.ctypes.data == data.ctypes.data

data2 = pa.array(data, type=ComplexFloatExtensionType())
assert_array_equal(data, data2)

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

print(arrow_data)
print(tensor)

One downside might be that there isn't yet support for custom extension type output, so at the C++ Arrow layer the developer would be looking at a bunch of binary data. But it's an extension type, and it wouldn't preclude developers from interpreting the fixed-width buffers as std::complex<float> via the data_as/mutable_data_as methods.

That said, wouldn't it be possible to interpret a Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 were added (and strides were done correctly, namely with the complex dimension having the smallest stride)? I think the memory layout in this case would match numpy's. FixedShapeTensor can be zero-copied to numpy today and can probably be cast to complex in numpy. It would of course be better to add a new extension type so users don't have to do this manually.

Yes this should work -- I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method. However, wouldn't this create the need to special-case a lot of type handling? i.e. there may need to be:

  1. A basic ComplexFloat + ComplexDouble
  2. A FixedShapeTensor and a ComplexFixedShapeTensor (or possibly indicate complex in the serialized metadata?)
  3. Same for VariableShapeTensors
  4. and maybe other compound types.

If the above are valid concerns, my bias is towards the pa.binary(8/16) storage type. It's fixed width (so it works as a tensor type) and has the same binary format as std::complex<float/double> and np.complex64/128. To be fair, the underlying FixedSizeList buffer also has the same binary format, but I think it's not fixed width because nulls are possible.
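
To illustrate that last point, a small sketch: a FixedSizeList(float32, 2) built over the same values carries byte-identical child-buffer contents to the complex64 array.

import numpy as np
import pyarrow as pa

data = np.array([1 + 2j, 3 + 4j], dtype=np.complex64)
floats = pa.array(data.view(np.float32))  # zero-copy float32 view of the same memory
as_list = pa.FixedSizeListArray.from_arrays(floats, 2)

# The flattened child values hold the same bytes as the complex array.
assert as_list.values.to_numpy().tobytes() == data.tobytes()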

@rok (Member) commented Aug 27, 2024

I don't think using FixedSizeBinary precludes zero-copy -- the following, similar to @zeroshade's suggestion in #39753 (comment), seems to work (although I realise more work would be needed at the Cython layer to accept np.array(..., np.complex64) in pa.array).

Agreed; my point was that an extension array with FixedSizeList storage doesn't preclude zero-copy either, and I'm not sure there are benefits to using it over FixedSizeBinary for storage. Introducing an extension type doesn't come with a requirement to introduce kernels or other machinery for it.

Yes this should work -- I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method. However, wouldn't this create the need to special-case a lot of type handling? i.e. there may need to be:

  1. A basic ComplexFloat + ComplexDouble
  2. A FixedShapeTensor and a ComplexFixedShapeTensor (or possibly indicate complex in the serialized metadata?)
  3. Same for VariableShapeTensors
  4. and maybe other compound types.

I think the question here is also: do we want a complex tensor extension array, or a complex extension array? I am not sure we can use a complex extension array as the storage of a FixedShapeTensorArray, though if we can, that would be best. Can an extension array be the storage of another extension array? (Or am I misunderstanding, and you mean we should introduce a primary complex type?)

@sjperkins (Contributor) commented Aug 27, 2024

I think the question here is also: do we want a complex tensor extension array, or a complex extension array? I am not sure we can use a complex extension array as the storage of a FixedShapeTensorArray,

I suspect this is possible because in this excerpt from the larger example above,

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

arrow_data is a Python extension array associated with a Python extension type whose storage type is pa.binary(8). Does your question relate to the underlying C++ extension type and array?

ipdb> type(arrow_data)
<class '__main__.ComplexFloatExtensionArray'>

Although I think wrapping it in a FixedSizeList is necessary for it to be accepted by the FixedShapeTensorType, which uses a FixedSizeList under the hood?

though if we can, that would be best. Can an extension array be the storage of another extension array?

The Python example above might suggest it's possible, but I'm admittedly not sure about the C++ layer.

(Or am I misunderstanding, and you mean we should introduce a primary complex type?)

I'm advocating for a C++ extension ComplexFloatType (complex64) with an associated extension ComplexFloatArray. The FixedShapeTensorType should accept the ComplexFloatType as a value type, and the associated FixedShapeTensorArray should be able to wrap the ComplexFloatArray. This could be used for a proper conversion to NumPy and Pandas. And the same again with a ComplexDoubleType (complex128). They would be exposed in the Python layer, much like the other canonical extension types.
