Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dask Dataframe index renaming doesn't work in place #8082

Closed
marcelned opened this issue Aug 24, 2021 · 5 comments
Closed

Dask Dataframe index renaming doesn't work in place #8082

marcelned opened this issue Aug 24, 2021 · 5 comments

Comments

@marcelned
Copy link
Contributor

Inplace parameter doesn't behave as expected:

Screenshot 2021-08-24 at 17 18 27

@jrbourbeau
Copy link
Member

Thanks for repoting @marcelned! I'm able to reproduce. My guess is that since we regenerate .index each time it's accessed:

dask/dask/dataframe/core.py

Lines 480 to 489 in ab69fe1

@property
def index(self):
"""Return dask Index instance"""
return self.map_partitions(
getattr,
"index",
token=self._name + "-index",
meta=self._meta.index,
enforce_metadata=False,
)

There's some underlying state that rename(..., inplace=True) isn't updating that it should. Is this something you're interested in working on? (no obligation though)

@marcelned
Copy link
Contributor Author

marcelned commented Aug 26, 2021

Using the following script to aid in finding the offending/missing logic:

# test_script.py
import pandas as pd

import dask
import dask.dataframe as dd

df = pd.DataFrame(
    {
        "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "Max Speed": [380.0, 370.0, 24.0, 26.0],
    }
)

ddf = dd.from_pandas(df, npartitions=2)


if __name__ == "__main__":

    # dask dataframe index inplace rename test
    ddf.index.rename("bar", inplace=True)

    try:
        assert ddf.index.name == "foo"
    except AssertionError:
        print(f"{ddf.index.name} != foo")

    # dask series inplace rename test
    ddf["Animal"].rename("Bird", inplace=True)

    try:
        assert ddf["Animal"].name == "Bird"
    except AssertionError:
        print(f'{ddf["Animal"].name} != Bird')

output:

python test_script.py
None != foo
Animal != Bird

So it isn't a problem with only the index, but renaming Dask Series as a whole.

Tracing the stack reveals that the name is indeed changed, but the new renamed Dask Series object gets passed to the ether:

def rename(self, index=None, inplace=False, sorted_index=False):

Screenshot 2021-08-26 at 12 09 53

I'm not too knowledgeable on the exact workings of core Dask to know how to fix this, but one could argue that removing this parameter is a possible solution, unless there is some sort of performance gain that users could benefit from by using an inplace renaming mechanism..

@jsignell
Copy link
Member

Whoa I had no idea that a series could have a different name than its column name. For anyone else reading this, this is what pandas does:

import pandas as pd

df = pd.DataFrame(
    {
        "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
        "Max Speed": [380.0, 370.0, 24.0, 26.0],
    }
)
df["Animal"].rename("Bird", inplace=True)
print(df["Animal"])
# 0    Falcon
# 1    Falcon
# 2    Parrot
# 3    Parrot
# Name: Bird, dtype: object

I think you are right to bring up the option of removing the inplace option from rename. That seems like the safest and best path forward. Are you interested in opening a pull request to implement that change?

@jsignell
Copy link
Member

Incidentally the inplace rename does work for a series in dask. The issue you are running into is that when you access the series by getting it off the dataframe like ddf["Animal"] you actually get a new object. So even if you then rename that inplace, you don't affect the real object. Here's where the new object gets created:

return new_dd_object(graph, name, meta, self.divisions)

I still like your idea of not allowing inplace as an option for rename, just trying to give more context about what's going on.

marcelned added a commit to marcelned/dask that referenced this issue Sep 10, 2021
marcelned added a commit to marcelned/dask that referenced this issue Sep 13, 2021
jcrist pushed a commit that referenced this issue Sep 23, 2021
* Deprecate 'inplace' argument for dask series renaming

See issue #8082

* Remove inplace dataframe renaming equality test

* Formatted commits with black
@GenevieveBuckley
Copy link
Contributor

Closed by #8136

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants