Higher than expected memory usage for repartition #3347

@syagev

Description

I find that repartitioning a dataframe by splitting causes higher-than-expected memory usage. I've encountered this in my project, where I do the "not so standard" thing of storing entire numpy arrays (and other objects) inside cells of the dataframe.

The problem is exemplified in this notebook: https://gist.github.com/syagev/4de6f6c1cb25d2b4e9e6a6d846d12793

When processing a large dataframe (where each partition just fits into worker memory) everything works well. When splitting (repartitioning) this dataframe, workers fail due to memory usage. However, when converting the objects inside the dataframe to bytes before repartitioning, all is well again. This led me to think that this is perhaps a serialization issue caused by the fact that there are objects inside cells of the dataframe.
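For reference, the workaround described above can be sketched as follows. This is a minimal, hypothetical example using pandas and pickle, not the actual notebook code; the column names and array sizes are made up for illustration:

```python
import pickle

import numpy as np
import pandas as pd

# A DataFrame whose "payload" cells each hold an entire numpy array,
# i.e. an object-dtype column of Python objects.
df = pd.DataFrame({
    "key": range(4),
    "payload": [np.zeros(1000, dtype=np.float64) for _ in range(4)],
})

# Workaround: serialize each object to bytes before repartitioning,
# so every cell is a flat bytes blob instead of a Python object graph.
df["payload"] = df["payload"].map(pickle.dumps)
assert isinstance(df["payload"].iloc[0], bytes)

# After the repartition/shuffle, the arrays can be recovered:
restored = df["payload"].map(pickle.loads)
assert np.array_equal(restored.iloc[0], np.zeros(1000))
```

In the dask setting, the same `map(pickle.dumps)` / `map(pickle.loads)` pair would be applied via `map_partitions` around the repartition step.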

Is this scenario supported? Should dataframes contain only numeric/string/bytes items?
