Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Implement pa.array() with type=union type #19157

Open
asfimport opened this issue Jul 1, 2018 · 7 comments
Open

[Python] Implement pa.array() with type=union type #19157

asfimport opened this issue Jul 1, 2018 · 7 comments

Comments

@asfimport
Copy link

asfimport commented Jul 1, 2018

It would be useful to be able to generate unions during type inference:

In [11]: pa.array([{'a': 1, 'b': 'string'}, {'b': 2}])
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-11-c554b698271b> in <module>()
----> 1 pa.array([{'a': 1, 'b': 'string'}, {'b': 2}])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    179         if mask is not None:
    180             raise ValueError("Masks only supported with ndarray-like inputs")
--> 181         return _sequence_to_array(obj, size, type, pool)
    182 
    183 

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     24         if size is None:
     25             with nogil:
---> 26                 check_status(ConvertPySequence(sequence, pool, &out))
     27         else:
     28             c_size = size

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: ../src/arrow/python/builtin_convert.cc:794 code: AppendPySequence(seq, size, real_type, builder.get())
../src/arrow/python/iterators.h:60 code: func(value)
../src/arrow/python/builtin_convert.cc:619 code: value_converters_[i]->AppendSingle(valueobj ? valueobj : Py_None)
../src/arrow/python/builtin_convert.cc:414 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:259 code: CheckPyError()
an integer is required (got type str)

In [12]: pa.array([5, 'str', False])
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-12-9e3343f08351> in <module>()
----> 1 pa.array([5, 'str', False])

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib.array()
    179         if mask is not None:
    180             raise ValueError("Masks only supported with ndarray-like inputs")
--> 181         return _sequence_to_array(obj, size, type, pool)
    182 
    183 

~/code/arrow/python/pyarrow/array.pxi in pyarrow.lib._sequence_to_array()
     24         if size is None:
     25             with nogil:
---> 26                 check_status(ConvertPySequence(sequence, pool, &out))
     27         else:
     28             c_size = size

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: ../src/arrow/python/builtin_convert.cc:794 code: AppendPySequence(seq, size, real_type, builder.get())
../src/arrow/python/iterators.h:60 code: func(value)
../src/arrow/python/builtin_convert.cc:414 code: internal::CIntFromPython(obj, &value)
../src/arrow/python/helpers.cc:259 code: CheckPyError()
an integer is required (got type str)

Reporter: Wes McKinney / @wesm

Related issues:

Note: This issue was originally created as ARROW-2774. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I'm not sure it's a good idea to do this by default, since it would hide problems with unsanitized input. Intuitively, I don't think real "union"-type data is frequent in the real world (as opposed to unsanitized data).

Also, we would have to choose a union kind (dense or sparse).

@asfimport
Copy link
Author

Uwe Korn / @xhochy:
This is definitely something users would like to see but I would also like to see this hidden behind a flag. Being able to deal with unsanitized input is often a typical pandas use case in exploratory data analysis but once you use this as part of a production pipeline, you rather want to have it error.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I agree that disabling by default is a good idea. Maybe we can have an allow_unions=True flag

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Unions will be really helpful when we get to working on CSV reading – handling messy CSV files with columns having non-standard strings or other markers in numeric columns has been a major issue for pandas over the years.

@asfimport
Copy link
Author

Antoine Pitrou / @pitrou:
I'm still not convinced this is a good idea. Consider pa.array([1, 2.3]). Should it return a union<int64, float64>?

cc @amol- for advice.

@asfimport
Copy link
Author

Joris Van den Bossche / @jorisvandenbossche:
Agreed that we shouldn't do that by default, but we can keep this issue about actually supporting it? Because now construction of a union array from a python sequence is not even supported when explicitly mentioning the type.

In [52]: typ = pa.union([pa.field("int", "int64"), pa.field("float", "float64")], mode="sparse")

In [53]: pa.array([1, 2.3], type=typ)
...
ArrowNotImplementedError: sparse_union
../src/arrow/util/converter.h:265  VisitTypeInline(*visitor.type, &visitor)
../src/arrow/python/python_to_arrow.cc:1015  (MakeConverter<PyConverter, PyConverterTrait>( options.type, options, pool))

@chairmank
Copy link

chairmank commented Apr 28, 2023

I would also like pyarrow.array to automatically convert Python values when a sparse union or dense union type is explicitly specified. I frequently use dense union types to represent data that originated in protocol buffers with oneof fields. It is inconvenient to have to implement special handling of this case when the target Arrow schema is known.

Also, I would like to politely observe that example code snippets in previous comments are misleading, because they do not distinguish between child fields that happen to have the same data type.

Antoine Pitrou / @pitrou: I'm still not convinced this is a good idea. Consider pa.array([1, 2.3]). Should it return a union<int64, float64>?

cc @amol- for advice.

Joris Van den Bossche / @jorisvandenbossche: Agreed that we shouldn't do that by default, but we can keep this issue about actually supporting it? Because now construction of a union array from a python sequence is not even supported when explicitly mentioning the type.

In [52]: typ = pa.union([pa.field("int", "int64"), pa.field("float", "float64")], mode="sparse")

In [53]: pa.array([1, 2.3], type=typ)
...
ArrowNotImplementedError: sparse_union
../src/arrow/util/converter.h:265  VisitTypeInline(*visitor.type, &visitor)
../src/arrow/python/python_to_arrow.cc:1015  (MakeConverter<PyConverter, PyConverterTrait>( options.type, options, pool))

As an example, consider the following union type:

>>> string_predicate_type = pa.dense_union([
...     pa.field("equals", pa.string(), False),
...     pa.field("regexp", pa.string(), False),
...     pa.field("is_null", pa.null()),
... ])
>>> string_predicate_type
DenseUnionType(dense_union<regexp: string not null=0, regexp: string not null=1, is_null: null=2>)

Both equals and regexp are string, but they are semantically distinct. For pyarrow.array to convert Python values to the correct child field type, the values ought to be tagged:

pa.array([{"equals": "foo"}, {"regexp": "[0-9a-f]{16}"}, {"is_null": None}], type=string_predicate_type)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants