[CI][Python] Tests with pandas nightlies have been failing for the last few days with NotImplementedError on S3 dtype #39437

Closed
raulcd opened this issue Jan 3, 2024 · 10 comments · Fixed by #39498

Comments

@raulcd
Member

raulcd commented Jan 3, 2024

Describe the bug, including details regarding any error messages, version, and platform.

The following nightly job with the pandas nightlies has been failing for the last couple of weeks:

There seems to be an issue with the NumPy S3 (fixed-width bytes) dtype:

 >       raise NotImplementedError(dtype)
E       NotImplementedError: |S3

opt/conda/envs/arrow/lib/python3.10/site-packages/pandas/core/indexes/base.py:634: NotImplementedError
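For readers unfamiliar with the notation: `|S3` in the traceback is the NumPy dtype string for fixed-width byte strings of length 3, not a reference to Amazon S3 storage. A minimal sketch of where that string comes from (the array contents here are made up for illustration and are not taken from the failing test):

```python
import numpy as np

# Hypothetical sample data: two 3-byte values stored with a fixed-width bytes dtype.
arr = np.array([b"foo", b"bar"], dtype="S3")

# The dtype string is what appears in the NotImplementedError message.
print(arr.dtype.str)       # '|S3'
print(arr.dtype.kind)      # 'S' (byte string)
print(arr.dtype.itemsize)  # 3 bytes per element
```

The leading `|` indicates that byte order is not applicable to this dtype.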

Component(s)

Continuous Integration, Python

@raulcd
Member Author

raulcd commented Jan 3, 2024

cc @AlenkaF @jorisvandenbossche I am unsure if this was opened before but I haven't been able to find a related issue.

@raulcd raulcd added the Priority: Blocker Marks a blocker for the release label Jan 3, 2024
@raulcd raulcd added this to the 15.0.0 milestone Jan 4, 2024
@AlenkaF
Member

AlenkaF commented Jan 8, 2024

This is a known pandas regression, present since version 2.0.0:

if Version("2.0.0") <= Version(pd.__version__) < Version("2.2.0"):
    # TODO: regression in pandas, hopefully fixed in next version
    # https://issues.apache.org/jira/browse/ARROW-18394
    # https://github.com/pandas-dev/pandas/issues/50127

Will update the Version check (from 2.2.0 to 2.3.0 as the CI uses 2.3.0.dev)
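To illustrate why the upper bound matters, here is a minimal sketch, assuming the `packaging` library is installed. The helper name `regression_applies` and the version strings are hypothetical; this is not the actual pyarrow test code.

```python
from packaging.version import Version


def regression_applies(pandas_version: str, upper_bound: str) -> bool:
    """Return True if pandas_version falls in [2.0.0, upper_bound)."""
    return Version("2.0.0") <= Version(pandas_version) < Version(upper_bound)


# A released 2.1.x version is covered by the original check.
print(regression_applies("2.1.4", "2.2.0"))       # True
# A 2.3.0 dev build falls outside the original range, so the guard is skipped.
print(regression_applies("2.3.0.dev0", "2.2.0"))  # False
# Raising the bound to 2.3.0 covers it, since dev releases sort before 2.3.0.
print(regression_applies("2.3.0.dev0", "2.3.0"))  # True
```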

@AlenkaF
Member

AlenkaF commented Jan 8, 2024

The NumPy string dtype error is fixed with #39498, but there are still some issues:

  • two TestConvertMetadata tests are failing with DeprecationWarning: make_block is deprecated and will be removed in a future version. Use public APIs instead.
  • test_backwards_compatible_column_metadata_handling with AssertionError
  • test_categories_with_string_pyarrow_dtype with AssertionError

I think there was an open issue for the deprecation warning, but am not sure. Will look for it now.

@AlenkaF
Member

AlenkaF commented Jan 8, 2024

Didn't find anything. Will look for the issue and a fix.

@jorisvandenbossche
Member

The deprecation warning can be ignored for now (that's still being discussed on the pandas side), although we could update the test to not fail for it so our CI can be green otherwise.

The other two items seem like things we should investigate further.
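One possible way to keep a test from failing on such a warning is to filter it locally. The following is only a sketch using the standard `warnings` module; `legacy_call` and `call_ignoring_deprecation` are hypothetical stand-ins, not the change that was made in the pyarrow tests.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for code that triggers the pandas deprecation.
    warnings.warn(
        "make_block is deprecated and will be removed in a future version. "
        "Use public APIs instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return "ok"


def call_ignoring_deprecation(func):
    # Ignore only this specific deprecation; other warnings keep their behavior.
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            message="make_block is deprecated",
            category=DeprecationWarning,
        )
        return func()


# Even when warnings are escalated to errors, the filtered call succeeds.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    print(call_ignoring_deprecation(legacy_call))
```

The filter is scoped to the `with` block, so the rest of the test suite would still surface other deprecations.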

@AlenkaF
Member

AlenkaF commented Jan 8, 2024

👍
I am currently investigating the two remaining items and will change the test with the deprecation warning so that it does not fail.

@AlenkaF
Member

AlenkaF commented Jan 8, 2024

The error in the test_categories_with_string_pyarrow_dtype is due to a difference in the PyArrow array data type when being converted from pandas string[pyarrow]. For pandas version 2.1.4 I get:

>>> df1 = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
>>> df1 = df1.astype("category")

>>> df2 = pd.DataFrame({"x": ["foo", "bar", "foo"]})
>>> df2 = df2.astype("category")

>>> pa.array(df1["x"]).type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
>>> pa.array(df2["x"]).type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

and for the dev version I get:

>>> pa.array(df1["x"]).type
DictionaryType(dictionary<values=large_string, indices=int8, ordered=0>)
>>> pa.array(df2["x"]).type
DictionaryType(dictionary<values=string, indices=int8, ordered=0>)

meaning that with pandas dev the string[pyarrow] dtype gets converted to an Arrow large_string instead of string, so the two arrays no longer compare equal and the test fails. The PR that caused the change on the pandas side: pandas-dev/pandas#56220

Will update the test to reflect the change.

@jorisvandenbossche
Member

The Parquet failure (old metadata) seems to be a regression on the pandas side in how tz-aware data gets converted from Arrow to pandas: pandas-dev/pandas#56775

@raulcd
Member Author

raulcd commented Jan 8, 2024

@AlenkaF @jorisvandenbossche, do you think this is a blocker?

@jorisvandenbossche
Member

For CI purposes, I think it would be good to have the currently open PR in the release. But nothing in that PR is critical; it only contains test changes.

jorisvandenbossche added a commit that referenced this issue Jan 8, 2024
…CI build (#39498)

Update version checks and assertions of pyarrow array equality for pandas failing tests on the CI: [test-conda-python-3.10-pandas-nightly](https://github.com/ursacomputing/crossbow/actions/runs/7391976015/job/20109720695)

* Closes: #39437

Lead-authored-by: AlenkaF <frim.alenka@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
raulcd pushed a commit that referenced this issue Jan 9, 2024
clayburn pushed a commit to clayburn/arrow that referenced this issue Jan 23, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024