
Coercing a Pandas Dataframe from a Dask Dataframe #1122

Closed
RedPandaHat opened this issue Apr 26, 2016 · 3 comments

@RedPandaHat

Is it possible to coerce a Pandas dataframe from an existing dask dataframe? e.g.

ddf.to_pandasdataframe(df, etc..)

Right now I'm dumping to a CSV and reading it back in later, which is slow and silly.

The use case here is a single-node, many-core machine, with data that fits in memory, and a CPU-intensive process that is embarrassingly parallel -- so I'm using ddf.groupby(ddf.index).apply(func) to speed up the work. This turns out to be an order of magnitude faster than multiprocessing, by the way. The result of the groupby.apply is a Dask DataFrame, but I need to do further work on it using a variety of pandas functions not currently available in Dask.
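
A rough sketch of the workflow described above, with hypothetical column names and a placeholder cpu_heavy function standing in for the actual work:

```python
import dask.dataframe as dd
import pandas as pd

df = pd.DataFrame({"group": [0, 1, 0, 1], "x": [1.0, 2.0, 3.0, 4.0]}).set_index("group")
ddf = dd.from_pandas(df, npartitions=2)

def cpu_heavy(part):
    # stand-in for the CPU-intensive pandas work done per group
    return part.assign(y=part["x"] ** 2)

# The result is still a Dask DataFrame; Dask may warn and ask for a meta= hint here.
result = ddf.groupby(ddf.index).apply(cpu_heavy)
```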

@mrocklin
Member

The .compute() method produces a Pandas DataFrame from a Dask DataFrame.
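
For example (the small DataFrame here is just illustrative):

```python
import dask.dataframe as dd
import pandas as pd

ddf = dd.from_pandas(pd.DataFrame({"x": [1, 2, 3, 4]}), npartitions=2)
pdf = ddf.compute()   # runs the task graph and returns a pandas DataFrame
print(type(pdf))      # <class 'pandas.core.frame.DataFrame'>
```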

@onzo-naga

onzo-naga commented Jan 7, 2018

Hi Matthew (@mrocklin), I have been following and trying to use Dask for the last few months, and great work on that. I tried the suggestion above and it works; however, it re-calculates the entire graph when I loop through each partition and call .compute(). I have persisted the DataFrame and can see in the UI that execution happened, but when I loop through each partition to call another function and pass the partition in as a DataFrame, the same graph is executed again on every iteration of the loop.

My other function, which I apply to each partition, is pandas-based. When map_partitions passes the df, it is passed as a Dask object and I am unable to change indexes, etc.

Any thoughts, please? Thanks for your help.
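
A hedged reconstruction of the pattern described above, with pandas_only_work as a hypothetical placeholder for the pandas-based function:

```python
import dask.dataframe as dd
import pandas as pd

def pandas_only_work(part):
    # hypothetical stand-in for the pandas-based function mentioned above
    return part.reset_index(drop=True)

ddf = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)
ddf = ddf.persist()  # under the distributed scheduler this returns immediately

for i in range(ddf.npartitions):
    part = ddf.get_partition(i).compute()  # pull one partition back as pandas
    pandas_only_work(part)
```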

@TomAugspurger
Member

TomAugspurger commented Jan 7, 2018

`persist` is async. You might try `df = persist(df); wait(df)`.
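
A minimal sketch of that suggestion using the .persist() method form, assuming a dask.distributed Client (the local cluster is an assumption for the example):

```python
from dask.distributed import Client, wait
import dask.dataframe as dd
import pandas as pd

client = Client()  # local cluster; an assumption for the example

ddf = dd.from_pandas(pd.DataFrame({"x": range(8)}), npartitions=4)
ddf = ddf.persist()  # asynchronous under the distributed scheduler
wait(ddf)            # block until every partition is actually in memory

# Per-partition work now fetches finished results instead of re-running the graph.
pdf = ddf.get_partition(0).compute()
```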
