make_meta over a Dask Dataframe returns a reference, not a new object #10842

Open
albarji opened this issue Jan 22, 2024 · 0 comments
Labels
needs triage (Needs a response from a contributor)

albarji commented Jan 22, 2024

Describe the issue:

The documentation of make_meta states that

This method creates meta-data based on the type of x

so my understanding is that a new object is returned. However, calling make_meta on a Dask DataFrame returns a reference to the DataFrame's own meta object. Any change made to the returned meta therefore also modifies the meta of the DataFrame.

Minimal Complete Verifiable Example:

import dask.dataframe as dd
import pandas as pd
from dask.dataframe.utils import make_meta

df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col2": [4, 5, 6],
})
df = dd.from_pandas(df, npartitions=2)

print(df.columns)  # prints Index(['col1', 'col2'], dtype='object')

meta = make_meta(df)
# Only the returned meta is modified here; df is never assigned to directly.
meta["flag"] = pd.Series([], dtype="bool")
print(df.columns)  # prints Index(['col1', 'col2', 'flag'], dtype='object')

Anything else we need to know?:

In my experience, make_meta is a convenient way to obtain the current meta of a DataFrame and then update it with the changes needed to pass appropriate meta information to methods such as map_partitions or assign, so that Dask knows how you intend to change the structure of the DataFrame. But since make_meta returns a reference, we are apparently forced to copy the meta object first and modify the copy, which is inconvenient. Is there a design reason for returning a reference instead of a copy?

Environment:

  • Dask version: 2024.1.0
  • Python version: 3.12.1
  • Operating System: Ubuntu 22.04.3 LTS
  • Install method (conda, pip, source): pip

github-actions bot added the needs triage label on Jan 22, 2024