Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-8342: [Python] Continue to return dict from "metadata" properties accessing KeyValueMetadata #6855

Closed
wants to merge 3 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Apr 6, 2020

This patch relegates the KeyValueMetadata wrapper to an implementation detail, so existing third party code should be unaffected, and we can decide later when and how to expose this object publicly in the future without needing to revert the ARROW-8079.

I also fixed the change related to the "pandas" metadata key. Relatedly, I changed the KeyValueMetadata ctor to raise a KeyError if a use of the mixed-argument constructor (merging a prior object with some **kwargs) would create duplicate keys

@github-actions
Copy link

github-actions bot commented Apr 7, 2020

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

@github-actions crossbow submit test-conda-python-3.7-kartothek-latest test-conda-python-3.7-kartothek-master test-conda-python-3.7-dask-master test-conda-python-3.7-dask-latest

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

@github-actions crossbow submit test-conda-python-3.7-kartothek-latest test-conda-python-3.7-kartothek-master test-conda-python-3.8-dask-master test-conda-python-3.7-dask-latest

@github-actions
Copy link

github-actions bot commented Apr 7, 2020

Revision: 612d2cb

Submitted crossbow builds: ursa-labs/crossbow @ actions-79

Task Status
test-conda-python-3.7-dask-latest CircleCI
test-conda-python-3.7-kartothek-latest CircleCI
test-conda-python-3.7-kartothek-master CircleCI
test-conda-python-3.8-dask-master CircleCI

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

The Karthothek tests failing with

        if ARROW_LARGER_EQ_0150:
            # https://issues.apache.org/jira/browse/ARROW-8142
            if len(table) == 0:
                df = table.to_pandas(date_as_object=True)
                new_types = {
                    col: df[col].cat.categories.dtype
                    for col in df.select_dtypes("category")
                }
                if new_types:
                    df = df.astype(new_types)
                    table = pa.Table.from_pandas(df)
            else:
                schema = table.schema
                for i in range(len(schema)):
                    field = schema[i]
                    if pa.types.is_dictionary(field.type):
                        new_field = pa.field(
                            field.name,
                            field.type.value_type,
                            field.nullable,
                            field.metadata,
                        )
                        schema = schema.remove(i).insert(i, new_field)
    
                table = table.cast(schema)
        else:
            for i in range(table.num_columns):
                col = table[i]
>               if col.name in exclude:
E               AttributeError: 'pyarrow.lib.ChunkedArray' object has no attribute 'name'

Seems that the "ARROW_LARGER_EQ_0150" logic may not have triggered properly with this development build. @xhochy advice?

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

Ah appears that the git tags aren't being fetched properly in the task

Copying pyarrow.egg-info to /opt/conda/envs/arrow/lib/python3.7/site-packages/pyarrow-0.1.dev6329+g612d2cb-py3.7.egg-info

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

@kszucs could you take a look at this if you know the solution?

@kszucs
Copy link
Member

kszucs commented Apr 7, 2020

You don't have the tags pushed to your fork. I'm updating the scripts to pass SETUPTOOLS_SCM_PRETEND_VERSION.

@kszucs
Copy link
Member

kszucs commented Apr 7, 2020

@wesm please push the tags to your fork then resubmit the builds. I'm working on the fix in a separate patch.

@pitrou
Copy link
Member

pitrou commented Apr 7, 2020

I suggest reverting the original changes (git revert) instead of trying to improve the current implementation. Also, the mapping not being mutable will still break some uses.

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

@pitrou In this patch the KeyValueMetadata object is no longer exposed publicly unless you go looking for it (it isn't accessible at all anymore on Field or Schema objects). So current use cases should not be impacted

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

@github-actions crossbow submit test-conda-python-3.7-kartothek-latest test-conda-python-3.7-kartothek-master test-conda-python-3.8-dask-master test-conda-python-3.7-dask-latest

@github-actions
Copy link

github-actions bot commented Apr 7, 2020

Revision: b2efc83

Submitted crossbow builds: ursa-labs/crossbow @ actions-83

Task Status
test-conda-python-3.7-dask-latest CircleCI
test-conda-python-3.7-kartothek-latest CircleCI
test-conda-python-3.7-kartothek-master CircleCI
test-conda-python-3.8-dask-master CircleCI

@jorisvandenbossche
Copy link
Member

The failure in dask seems to suggest that we are still putting the wrong or duplicated pandas metadata in write_to_dataset (but didn't check if this is actually the case).

Now I think of it, in the long term, we should check with dask how to fix this issue, because if we have dataset writing, write_to_dataset won't longer need to go through pandas for the groupby, which means the pandas_metadata will also change compared to what it is now.

@wesm
Copy link
Member Author

wesm commented Apr 7, 2020

OK, looking, evidently there are no tests on our side to validate

@wesm
Copy link
Member Author

wesm commented Apr 8, 2020

@github-actions crossbow submit test-conda-python-3.8-dask-master test-conda-python-3.7-dask-latest

@wesm
Copy link
Member Author

wesm commented Apr 8, 2020

@jorisvandenbossche there was a pesky bytes/unicode issue, fixed

@github-actions
Copy link

github-actions bot commented Apr 8, 2020

Revision: 32643e6

Submitted crossbow builds: ursa-labs/crossbow @ actions-86

Task Status
test-conda-python-3.7-dask-latest CircleCI
test-conda-python-3.8-dask-master CircleCI

@wesm
Copy link
Member Author

wesm commented Apr 8, 2020

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants