Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure that df is immutable after from_pandas #10383

Merged
merged 3 commits into from Jul 4, 2023

Conversation

phofl
Copy link
Collaborator

@phofl phofl commented Jun 29, 2023

  • Tests added / passed
  • Passes pre-commit run --all-files

We had a similar problem in dask-expr

@@ -5592,7 +5592,7 @@ def test_attrs_series_in_dataframes():
# the attrs dict for series in a dataframe. Dask uses df.iloc[:0]
# when creating the _meta dataframe in make_meta_pandas(x, index=None).
# Should start xpassing when df.iloc works. Remove the xfail then.
assert df.A.attrs == ddf.A.attrs
assert {"unit": "KG"} == ddf.A.attrs
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting:

df2 = df.copy()

removes the attrs from df. Almost certainly a pandas bug

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this is not supposed to work and the fact that this worked at all was an accident (it's related to caching in pandas, we should definitely not rely on this especially since this will change with CoW).

@phofl phofl closed this Jun 30, 2023
@phofl phofl reopened this Jun 30, 2023
@hendrikmakait hendrikmakait self-requested a review July 4, 2023 14:51
# the attrs dict for series in a dataframe. Dask uses df.iloc[:0]
# when creating the _meta dataframe in make_meta_pandas(x, index=None).
# Should start xpassing when df.iloc works. Remove the xfail then.
assert df.A.attrs == ddf.A.attrs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double-check: We don't support attrs (anymore)? Should we document this somewhere?

Copy link
Collaborator Author

@phofl phofl Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but doing

df.A.attrs = ...

shouldn't work. df.A is a temporary object in a sense, which means that it's metadata should not propagate to df.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more context: This works because pandas uses caching under the hood so that

df.A
df.A

returns the same object. This is an implementation detail that we should not rely on (it will also change when Copy-on-Write becomes the default).

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @phofl!

@hendrikmakait hendrikmakait merged commit d15fd23 into dask:main Jul 4, 2023
44 of 45 checks passed
@phofl phofl deleted the ensure_immutablility branch July 4, 2023 15:31
j-bennet pushed a commit to j-bennet/dask that referenced this pull request Jul 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants