Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support pandas 1.4.0 #1881

Merged
merged 7 commits into from
Feb 9, 2022
Merged

Support pandas 1.4.0 #1881

merged 7 commits into from
Feb 9, 2022

Conversation

tamargrey
Copy link
Contributor

@tamargrey tamargrey commented Feb 4, 2022

Adds support for pandas 1.4.0

closes #1865

@codecov
Copy link

codecov bot commented Feb 4, 2022

Codecov Report

Merging #1881 (30e7a08) into main (9ae6dc7) will decrease coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1881      +/-   ##
==========================================
- Coverage   98.83%   98.76%   -0.08%     
==========================================
  Files         147      147              
  Lines       16291    16307      +16     
==========================================
+ Hits        16102    16106       +4     
- Misses        189      201      +12     
Impacted Files Coverage Δ
featuretools/__init__.py 73.91% <100.00%> (+1.82%) ⬆️
...ools/tests/primitive_tests/test_dask_primitives.py 61.53% <100.00%> (+3.20%) ⬆️
featuretools/tests/synthesis/test_dask_dfs.py 100.00% <100.00%> (ø)
featuretools/tests/synthesis/test_koalas_dfs.py 100.00% <100.00%> (ø)
featuretools/__main__.py 0.00% <0.00%> (-42.86%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9ae6dc7...30e7a08. Read the comment docs.

@tamargrey
Copy link
Contributor Author

tamargrey commented Feb 4, 2022

The test_add_dataframe_from_ks_df failure is coming from how koalas is treating null values in string dtype columns when ks.from_pandas is called: The pd.NA is getting replaced with net.razorvine.pickle.objects.ClassDictConstructor@18016668 (that string of numbers at the end seems to be some sort of hash that changes every single time I try and get the full value, even if the koalas object its coming from hasn't changed. It stays the same once you convert it to pandas, so that makes me think it's a value koalas is generating). That new value isn't getting treated as null by pandas' assert_frame_equal with check_like=True, so the assertion is failing.

I don't know what would have changed between pandas versions to cause this change in koalas. I've confirmed this happens when you switch to pandas 1.4.0 on both koalas 1.8.1 and 1.8.2, and I confirmed this only happens for the string dtype.This is also happening in other fixtures--ks_es for one, though it doesn't seem to have triggered any other failures.

dask_computed_fm = dask_fm.compute().set_index('id').loc[fm.index][fm.columns]
# update the type of the future index column so it doesn't conflict with the pandas fm
dask_fm = dask_fm.compute().astype({'id': 'int64'})
dask_computed_fm = dask_fm.set_index('id').loc[fm.index][fm.columns]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the failing tests were because of a changed (fixed, actually, I think) behavior in set_index. Previously, the following would result in the index's dtype no longer being Int64, and now it's retained as Int64, but that means it now will get caught in the situation that came from #1810, so updating the dtype to be int64 to begin with fixes this problem.

 df = pd.DataFrame({
     'a': pd.Series([1, 2, 3,4], dtype='int64'),
     'b': pd.Series([1, 2, 3,3], dtype='Int64'),
 })
df.set_index('b')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to revert some of the check_dtype=False changes that were introduced in #1810?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are columns beyond id that will have different dtypes from the pandas dataframe (values, for example, which is Integer for Pandas and IntegerNullable otherwise because of the changes in #1810 ). So we need to keep it in

@thehomebrewnerd
Copy link
Contributor

The test_add_dataframe_from_ks_df failure is coming from how koalas is treating null values in string dtype columns when ks.from_pandas is called: The pd.NA is getting replaced with net.razorvine.pickle.objects.ClassDictConstructor@18016668 (that string of numbers at the end seems to be some sort of hash that changes every single time I try and get the full value, even if the koalas object its coming from hasn't changed. It stays the same once you convert it to pandas, so that makes me think it's a value koalas is generating). That new value isn't getting treated as null by pandas' assert_frame_equal with check_like=True, so the assertion is failing.

I don't know what would have changed between pandas versions to cause this change in koalas. I've confirmed this happens when you switch to pandas 1.4.0 on both koalas 1.8.1 and 1.8.2, and I confirmed this only happens for the string dtype.This is also happening in other fixtures--ks_es for one, though it doesn't seem to have triggered any other failures.

We may be getting bit by koalas being in maintenance mode and no longer keeping up with pandas releases any more. Can we change the way we create the Koalas dataframe for that test to side-step this issue? If not, we could restrict the pandas version if koalas is installed like we are are doing in Woodwork right now.

@tamargrey
Copy link
Contributor Author

We may be getting bit by koalas being in maintenance mode and no longer keeping up with pandas releases any more. Can we change the way we create the Koalas dataframe for that test to side-step this issue? If not, we could restrict the pandas version if koalas is installed like we are are doing in Woodwork right now.

@thehomebrewnerd We can definitely find a way around this test's failure since the test is just checking that, if you add a dataframe into a koalas entityset, it matches the original pandas dataframe in the entityset. I'm thinking of several options to get around this issue:

  1. Check dataframe equality in a different way that handles these null value replacements
  2. Add a dataframe from the pd_es fixture that doesn't have nans instead
  3. Add a dataframe of our own creation that does have nans so we can check the values in a way that isn't impacted by the null value replacements

I think I like 2 the best from a simplicity standpoint, but 3 from a completeness standpoint and being able to document this behavior somewhere.

But I'm not convinced that just because we can means we should allow pandas 1.4.0 right now--not having null values maintained when converting from pandas to koalas seems like a big problem to me.

Since this change only showed up between pandas versions, maybe we can open an issue in pandas and, if they are able to put out a bug fix, we can restrict our pandas version to that one later on. If it's not something they can handle, then we can allow this pandas version. Thoughts @rwedge @gsheni ?

@tamargrey tamargrey marked this pull request as ready for review February 7, 2022 20:38
@tamargrey
Copy link
Contributor Author

Not sure what the codecov changes are: https://app.codecov.io/gh/alteryx/featuretools/compare/1881/changes

Seems like __main__.py isn't getting called anymore?

@rwedge
Copy link
Contributor

rwedge commented Feb 7, 2022

Not sure what the codecov changes are: https://app.codecov.io/gh/alteryx/featuretools/compare/1881/changes

Seems like __main__.py isn't getting called anymore?

@tamargrey not sure why this has changed but __main__.py should be getting touched by our cli tests, if only in a subprocess. Looking at them, our CLI tests could probably use some updating

@gsheni gsheni requested a review from dvreed77 February 7, 2022 22:49
@@ -17,6 +20,7 @@ Future Release
* Add ``__setitem__`` method to overload ``add_dataframe`` method on EntitySet (:pr:`1862`)
* Temporarily restrict woodwork max version (:pr:`1872`)
* Split Datetime and LatLong primitives into separate files (:pr:`1861`)
* Update to add support for pandas version 1.4.0 (:pr:`1881`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put this in the enhancement category

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

Copy link
Contributor

@rwedge rwedge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tamargrey tamargrey merged commit 7da0da1 into main Feb 9, 2022
@tamargrey tamargrey deleted the support-pandas-1.4 branch February 9, 2022 15:05
@dvreed77 dvreed77 mentioned this pull request Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update to support pandas 1.4.0
3 participants