[Format] Add metadata for single and double precision complex numbers #16264
I see several possibilities:
Wes McKinney / @wesm:
In C++, this boils down to

```cpp
template<typename _Tp>
struct complex
{
  <SNIP>
private:
  _Tp _M_real;
  _Tp _M_imag;
};
```
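For illustration, here is a minimal sketch (mine, not from the thread) showing the same two-field layout from Python: a struct of two float32 fields is bit-for-bit identical to NumPy's complex64, which is what makes a fixed-width representation plausible.

```python
import numpy as np

# A struct of two float32 fields, mirroring _M_real / _M_imag above.
pair = np.dtype([('real', '<f4'), ('imag', '<f4')])
values = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=pair)

# Same itemsize and byte layout as complex64, so a zero-copy view works.
assert pair.itemsize == np.dtype(np.complex64).itemsize == 8
assert values.view(np.complex64)[0] == 1 + 2j
```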
Wes McKinney / @wesm:
Mailing list discussion about this from 2021: https://www.mail-archive.com/dev@arrow.apache.org/msg23352.html

lists.apache.org URL: https://lists.apache.org/thread/bngbnnhyq7lkyx8cg7l2qs1msd0ngg82
Replying to @rok's comment here as it's probably more relevant to this thread.

I'm not aware of other systems that do this, and I haven't considered the benefits to vectorization. Mostly it satisfies the fixed-width requirements in FixedShapeTensor (and I guess VariableShapeTensor); see arrow/cpp/src/arrow/extension/fixed_shape_tensor.cc, lines 208 to 220 at 51e9f70.

Then one could interpret the raw bytes as real and imaginary components. I guess one would have to consider endianness here. From hazy memory, the previous attempt created numeric primitive complex64 and complex128 types based on `std::complex<float>` and `std::complex<double>`, but this resulted in extensive changes throughout the code base due to the need to support the relevant operations for those types. This probably has knock-on effects on binary size. A FixedSizeBinary(64/128) might be a good compromise: a fixed-width type without implementing every primitive operation for complex numbers.
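As a hedged sketch of that byte-reinterpretation idea (my illustration, not from the thread): each 8-byte FixedSizeBinary payload can be viewed as one complex64 value, provided the reader knows the byte order the producer used.

```python
import numpy as np

# 8 bytes per value: float32 real followed by float32 imaginary.
raw = np.array([1 + 2j], dtype='<c8').tobytes()

# Correct interpretation requires knowing the producer's byte order.
little = np.frombuffer(raw, dtype='<c8')  # little-endian complex64
big = np.frombuffer(raw, dtype='>c8')     # wrong byte order -> garbage values
assert little[0] == 1 + 2j
assert big[0] != little[0]
```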
@sjperkins looking at the past discussion, I agree an extension type is the way to go. I'm not sure about the storage type though. It would be nice to have a zero-copy path to NumPy which, as per this, would need the fixed_size_list approach. That said, wouldn't it be possible to interpret a Fixed/VariableShapeTensor as a complex tensor if an extra dimension of size 2 was added (and strides were done correctly, namely the complex dimension had the smallest stride)? I think the memory layout in this case would match NumPy's.
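A quick sketch of that "extra dimension of size 2" idea (my illustration, not from the thread): when the length-2 component axis is the fastest-varying one, a float32 tensor and a complex64 tensor share the same memory layout, so NumPy can reinterpret one as the other without copying.

```python
import numpy as np

# A (2, 2) float32 tensor whose last axis holds (real, imag) pairs.
floats = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)

# Zero-copy reinterpretation: the extra size-2 dimension collapses away.
cplx = floats.view(np.complex64)[:, 0]
assert cplx[0] == 1 + 2j and cplx[1] == 3 + 4j
assert cplx.ctypes.data == floats.ctypes.data  # same underlying buffer
```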
/cc @maupardh1 who started #39754. I don't think using

```python
import pyarrow as pa
import numpy as np
from numpy.testing import assert_array_equal

COMPLEX64_STORAGE_TYPE = pa.binary(8)


class ComplexFloatExtensionType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, COMPLEX64_STORAGE_TYPE, 'complex64')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized_data):
        return ComplexFloatExtensionType()

    def wrap_array(self, storage_array):
        return pa.ExtensionArray.from_storage(self, storage_array)

    def __arrow_ext_class__(self):
        return ComplexFloatExtensionArray


class ComplexFloatExtensionArray(pa.ExtensionArray):
    @classmethod
    def from_numpy(cls, array):
        if array.dtype != np.complex64:
            raise ValueError("Only complex64 dtype is supported")
        storage_array = pa.FixedSizeBinaryArray.from_buffers(
            COMPLEX64_STORAGE_TYPE, len(array),
            [None, pa.py_buffer(array.view(np.uint8))]
        )
        return pa.ExtensionArray.from_storage(ComplexFloatExtensionType(), storage_array)

    def to_numpy(self, zero_copy_only=True, writeable=False):
        return np.frombuffer(self.storage.buffers()[1], dtype=np.complex64)


# Register the extension type with Arrow
pa.register_extension_type(ComplexFloatExtensionType())

data = np.array([1 + 2j, 3 + 4j, 5 + 6j, 7 + 8j], dtype=np.complex64)
arrow_data = ComplexFloatExtensionArray.from_numpy(data)

# Arrow buffers use the NumPy buffer
assert arrow_data.storage.buffers()[1].address == data.view(np.uint8).ctypes.data

roundtrip = arrow_data.to_numpy()
assert_array_equal(roundtrip, data)

# Final array uses the original array's buffers
assert roundtrip.ctypes.data == data.ctypes.data

data2 = pa.array(data, type=ComplexFloatExtensionType())
assert_array_equal(data, data2)

tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)

print(arrow_data)
print(tensor)
```

One downside might be that there isn't yet support for custom extension type output, so at the C++ Arrow layer the developer would be looking at a bunch of binary data. But it's an extension type, and it wouldn't preclude developers from interpreting the fixed width buffers as `std::complex<float>`.
Yes, this should work -- I've used something similar with nested FixedSizeListArrays to represent complex arrays whose underlying buffers can simply be passed to the appropriate NumPy method. However, would this not create the need to special-case a lot of type handling? i.e. there may need to be:

If the above are valid concerns, my bias is towards the
Agreed, my point was that an extension array with
I think the question here is also: do we want a complex tensor extension array or a complex extension array? I am not sure we can use a complex extension array as the storage of FixedShapeTensorArray, though if we can, that would be best. Can an extension array be storage to another extension array? (Or am I misunderstanding and you mean we should introduce a primary complex type?)
I suspect this is possible, because in this excerpt from the larger example above

```python
tt = pa.fixed_shape_tensor(ComplexFloatExtensionType(), (2,))
storage = pa.FixedSizeListArray.from_arrays(arrow_data, 2)
assert len(storage) == 2
tensor = pa.ExtensionArray.from_storage(tt, storage)
```

the storage is built from the extension array:

```
ipdb> type(arrow_data)
<class '__main__.ComplexFloatExtensionArray'>
ipdb>
```

Although I think wrapping it in a FixedSizeList is necessary for it to be accepted by the FixedShapeTensorType, which uses a FixedSizeList under the hood?
The Python example above might suggest it's possible, but I'm admittedly not sure about the C++ layer.
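The "FixedSizeList under the hood" point can be checked directly from Python; a small verification sketch (mine), assuming pyarrow >= 12 where fixed_shape_tensor is available:

```python
import pyarrow as pa

# FixedShapeTensorType stores its data as a FixedSizeList.
tensor_type = pa.fixed_shape_tensor(pa.float32(), (2,))
print(tensor_type.storage_type)  # fixed_size_list<item: float>[2]
assert pa.types.is_fixed_size_list(tensor_type.storage_type)
```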
I'm advocating for a C++ extension ComplexFloatType (complex64) with an associated extension ComplexFloatArray. The FixedShapeTensorType should accept the ComplexFloatType as a value type, and the associated FixedShapeTensorArray should be able to wrap the ComplexFloatArray. This could be used for a proper conversion to NumPy and Pandas. And the same again with ComplexDoubleType (complex128). They would be exposed in the Python layer, much like the other canonical extension types.
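To make the complex128 half of that proposal concrete, here is a minimal sketch mirroring the complex64 example above (my illustration; names like ComplexDoubleExtensionType are assumptions, not an existing API): 16-byte fixed-size binary storage wrapped by an extension type.

```python
import numpy as np
import pyarrow as pa

COMPLEX128_STORAGE_TYPE = pa.binary(16)  # 2 x float64 per value


class ComplexDoubleExtensionType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, COMPLEX128_STORAGE_TYPE, 'complex128')

    def __arrow_ext_serialize__(self):
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized_data):
        return ComplexDoubleExtensionType()


pa.register_extension_type(ComplexDoubleExtensionType())

data = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
storage = pa.FixedSizeBinaryArray.from_buffers(
    COMPLEX128_STORAGE_TYPE, len(data),
    [None, pa.py_buffer(data.view(np.uint8))]
)
arr = pa.ExtensionArray.from_storage(ComplexDoubleExtensionType(), storage)

# Round-trips to NumPy without copying the payload buffer.
assert np.array_equal(
    np.frombuffer(arr.storage.buffers()[1], dtype=np.complex128), data
)
```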
Numerical computing libraries like NumPy and TensorFlow feature complex64 and complex128 numbers.
Reporter: Wes McKinney / @wesm
PRs and other links:
Note: This issue was originally created as ARROW-638. Please see the migration documentation for further details.