I find that repartitioning a dataframe by splitting causes higher-than-expected memory usage. I've encountered this in a project where I do the somewhat non-standard thing of storing entire NumPy arrays (and other objects) inside cells of the dataframe.
The problem is exemplified in this notebook: https://gist.github.com/syagev/4de6f6c1cb25d2b4e9e6a6d846d12793
When processing a large dataframe (where each partition just fits into worker memory), everything works. When splitting (repartitioning) this dataframe, workers fail due to memory usage. However, when I convert the objects inside the dataframe to bytes before repartitioning, all is well again. This led me to suspect a serialization issue caused by the presence of objects inside the dataframe's cells.
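For context, here is a minimal sketch (using plain pandas and NumPy, not the full Dask setup from the gist) of the two storage layouts I'm comparing: raw ndarray objects in cells versus cells serialized to bytes. The column name `arr` and the array shapes are just placeholders:

```python
import numpy as np
import pandas as pd

# Layout 1: cells hold entire ndarrays, so the column has object dtype.
df = pd.DataFrame({"arr": [np.ones(1000, dtype=np.float64) for _ in range(4)]})
assert df["arr"].dtype == object

# Layout 2 (the workaround): serialize each array to bytes
# before handing the dataframe to repartition.
df_bytes = df.assign(arr=df["arr"].map(lambda a: a.tobytes()))

# The arrays can be reconstructed losslessly afterwards.
restored = df_bytes["arr"].map(lambda b: np.frombuffer(b, dtype=np.float64))
assert all(np.array_equal(a, r) for a, r in zip(df["arr"], restored))
```

With layout 1 the repartition blows up worker memory; with layout 2 it does not, which is what makes me suspect serialization of the object cells.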
Is this scenario supported? Should dataframes contain only numeric/string/bytes items?