I find that repartitioning a dataframe by splitting causes higher-than-expected memory usage. I've encountered this in a project where I do the somewhat non-standard thing of storing entire NumPy arrays (and other objects) inside cells of the dataframe.
The problem is exemplified in this notebook: https://gist.github.com/syagev/4de6f6c1cb25d2b4e9e6a6d846d12793
When processing a large dataframe (where each partition just fits into worker memory), everything works. When splitting (repartitioning) this dataframe, workers fail due to memory usage. However, when I convert the objects inside the dataframe to bytes before repartitioning, all is well again. This led me to suspect a serialization issue caused by the presence of objects inside the dataframe's cells.
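For context, here is a minimal sketch (using plain pandas and NumPy, not the full Dask setup from the gist) of the two storage layouts I'm comparing: raw ndarray objects in cells versus cells serialized to bytes. The column name `arr` and the array shapes are just placeholders:

```python
import numpy as np
import pandas as pd

# Layout 1: cells hold entire ndarrays, so the column has object dtype.
df = pd.DataFrame({"arr": [np.ones(1000, dtype=np.float64) for _ in range(4)]})
assert df["arr"].dtype == object

# Layout 2 (the workaround): serialize each array to bytes
# before handing the dataframe to repartition.
df_bytes = df.assign(arr=df["arr"].map(lambda a: a.tobytes()))

# The arrays can be reconstructed losslessly afterwards.
restored = df_bytes["arr"].map(lambda b: np.frombuffer(b, dtype=np.float64))
assert all(np.array_equal(a, r) for a, r in zip(df["arr"], restored))
```

With layout 1 the repartition blows up worker memory; with layout 2 it does not, which is what makes me suspect serialization of the object cells.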
Is this scenario supported? Should dataframes contain only numeric/string/bytes items?