Is it possible to coerce a Pandas dataframe from an existing dask dataframe? e.g.
ddf.to_pandasdataframe(df, etc..)
Right now I'm dumping to a CSV and reading it back in later, which is slow and silly.
The use case here is a single-node, many-core machine, with data that fits in memory, and a CPU-intensive process that is embarrassingly parallel -- so I'm using ddf.groupby(ddf.index).apply(func) to speed up the work. This turns out to be an order of magnitude faster than multiprocessing, by the way. The result of the groupby.apply is a dask dataframe, but I need to do further work on it using a variety of pandas functions not currently available in dask.
Hi Matthew (@mrocklin), I have been following and trying to use dask for the last few months, and great work on that. I tried the suggestion above and it works; however, it recalculates the entire graph when I loop through each partition's .compute(). I have persisted the dataframe and can see in the UI that the execution happened, but when I loop through each partition to pass it as a dataframe to another function, the same graph executes again on every iteration of the loop.
My other function, applied to each partition, is pandas-based. When map_partitions passes the df, it seems to be passed as dask, and I am unable to change indexes etc.