Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] pyarrow.concat_tables did not really enable the unify_schemas, it only supply unify_schemas options but not actually enable it #38809

Open
braindevices opened this issue Nov 20, 2023 · 4 comments

Comments

@braindevices
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

Describe the enhancement requested

in R there is concat_tables(..., unify_schemas = TRUE)(https://arrow.apache.org/docs/r/reference/concat_tables.html)
In python this function does not allow concat table with different schema.

unify_schemas

If TRUE, the schemas of the tables will be first unified with fields of the same name being merged, then each table will be promoted to the unified schema before being concatenated. Otherwise, all tables should have the same schema.

I tried the promote_options="permissive" which does not seem to do the trick.

t1 = pa.Table.from_pydict({'a': [1., 2], 'b': [{'k1': 0, 'k2':1}, {'k3': 'ok'}]})
t2 = pa.Table.from_pydict({'a': [1., None], 'b': [{'k1': None}, {'k3': None}]})

pa.concat_tables([t1, t2], unify_schemas=True, promote_options="permissive")

It will have ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<k1: null, k3: null> output fields: struct<k1: int64, k2: int64, k3: string>

I expect this will be just ok

    cdef cppclass CConcatenateTablesOptions" arrow::ConcatenateTablesOptions":
        c_bool unify_schemas
        CField.CMergeOptions field_merge_options

We can see the bool field is already defined in there, but there is no way to set value in the actual function

def concat_tables(tables, MemoryPool memory_pool=None, str promote_options="none", **kwargs):

in the code you only set the field_merge_options which is option of unify_schema().
I think you either should initialize the unify_schemas to true, oh you need to actually handle the keyword unify_schemas. Otherwise, the field merge options actually not in effect at all.

Component(s)

Parquet, Python

@AlenkaF AlenkaF changed the title pyarrow.concat_tables did not really enable the unify_schemas, it only supply unify_schemas options but not actually enable it [Python] pyarrow.concat_tables did not really enable the unify_schemas, it only supply unify_schemas options but not actually enable it Nov 27, 2023
@AlenkaF
Copy link
Member

AlenkaF commented Nov 27, 2023

Looking at the last PR that was implementing the type promotion for concating pyarrow/Arrow Tables (see #36845) that there is a structure of which types can be promoted and how:

https://github.com/apache/arrow/pull/36846/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302

In Python promote options are passed to unify_schemas here:

arrow/python/pyarrow/table.pxi

Lines 5251 to 5253 in eb5de18

options.unify_schemas = promote_options != "none"
c_result_table = GetResultValue(
ConcatenateTables(c_tables, options, pool))

but note that unify_schemas=True in pa.concat_tables([t1, t2], unify_schemas=True, promote_options="permissive") doesn't have any effect.

With the MergeOptions set the C++ handles the promotion for different cases (no struct mentioned): https://github.com/apache/arrow/pull/36846/files#diff-34166c61c1bd83226ad1277c13dca57ba9afa62dc7fba5206146ef6d2abf623e

I am not sure but at first look it seems to me that struct type is not handled.

I suggest looking at the python tests in the linked PR to see what kind of type promotion is possible in concat_tables https://github.com/apache/arrow/pull/36846/files#diff-72bd29ba764c85f18fbf9e74898b72d47ed8a1b04be879c1a5b2a59382e2eaef.

@braindevices
Copy link
Author

@AlenkaF Thanks! I get it. So the options.unify_schemas is automatically set to true if the promote_options is set.
And it is unfortunate that the struct type is not handled here

@Fokko
Copy link
Contributor

Fokko commented Jan 25, 2024

I'm also hitting this issue, might provide a fix when I get the time

@braindevices
Copy link
Author

Hi @Fokko

I'm also hitting this issue, might provide a fix when I get the time

any update on this? I guess the key is to implement MergeStructTypes in type.cc

some thinking about this feature:

  • we should make a union set of the key names, whenever there is a new key the result struct should have it
  • when the key has the same name, try to promote the value type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants