
Coercing a Pandas Dataframe from a Dask Dataframe #1122

Closed
RedPandaHat opened this issue Apr 26, 2016 · 3 comments

@RedPandaHat

Is it possible to coerce a Pandas dataframe from an existing dask dataframe? e.g.

ddf.to_pandasdataframe(df, etc..)

Right now I'm dumping to a CSV and reading it back in later, which is slow and silly.

The use case here is a single-node, many-core machine, with data that fits in memory, and a CPU-intensive process that is embarrassingly parallel -- so I'm using ddf.groupby(ddf.index).apply(func) to speed up the work. This turns out to be an order of magnitude faster than multiprocessing, by the way. The result of the groupby.apply is a Dask DataFrame, but I need to do further work on it using a variety of pandas functions not currently available in Dask.
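
A rough sketch of the workflow described above, with hypothetical column names and a placeholder cpu_heavy function standing in for the actual work:

```python
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"group": [0, 1, 0, 1], "x": [1.0, 2.0, 3.0, 4.0]}).set_index("group")
ddf = dd.from_pandas(df, npartitions=2)

def cpu_heavy(part):
    # stand-in for the CPU-intensive pandas work done per group
    return part.assign(y=part["x"] ** 2)

# The result is still a Dask DataFrame; Dask may warn and ask for a meta= hint here.
result = ddf.groupby(ddf.index).apply(cpu_heavy)
```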

@mrocklin
Member

The .compute() method produces a Pandas DataFrame from a Dask DataFrame.
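
For example (the small DataFrame here is just illustrative):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3, 4]}), npartitions=2)
pdf = ddf.compute()   # runs the task graph and returns a pandas DataFrame
print(type(pdf))      # <class 'pandas.core.frame.DataFrame'>
```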

@onzo-naga

onzo-naga commented Jan 7, 2018

Hi Matthew (@mrocklin), I have been following and trying to use Dask for the last few months, and great work on that. I tried the suggestion above and it works; however, it re-calculates the entire graph when I loop through each partition and call .compute(). I have persisted the DataFrame and can see in the UI that execution happened, but when I loop through each partition to call another function and pass the partition in as a DataFrame, the same graph is executed again on every iteration of the loop.

My other function, which I apply to each partition, is pandas-based. When map_partitions passes the df, it is passed as a Dask object and I am unable to change indexes, etc.

Any thoughts, please? Thanks for your help.
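
A hedged reconstruction of the pattern described above, with pandas_only_work as a hypothetical placeholder for the pandas-based function:

```python
import dask.dataframe as dd
import pandas as pd

def pandas_only_work(part):
    # hypothetical stand-in for the pandas-based function mentioned above
    return part.reset_index(drop=True)

ddf = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)
ddf = ddf.persist()  # under the distributed scheduler this returns immediately

for i in range(ddf.npartitions):
    part = ddf.get_partition(i).compute()  # pull one partition back as pandas
    pandas_only_work(part)
```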

@TomAugspurger
Member

TomAugspurger commented Jan 7, 2018

`persist` is async. You might try `df = persist(df); wait(df)`.
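
A minimal sketch of that suggestion using the .persist() method form, assuming a dask.distributed Client (the local cluster is an assumption for the example):

```python
from dask.distributed import Client, wait
import dask.dataframe as dd
import pandas as pd

client = Client()  # local cluster; an assumption for the example

ddf = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)
ddf = ddf.persist()  # asynchronous under the distributed scheduler
wait(ddf)            # block until every partition is actually in memory

# Per-partition work now fetches finished results instead of re-running the graph.
pdf = ddf.get_partition(0).compute()
```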
