
Naively roundtrip parquet data from Spark #4096

Closed
mrocklin opened this issue Oct 13, 2018 · 11 comments · Fixed by #9156

@mrocklin
Member

Currently it is not easy to round-trip Parquet data between Spark and Dask DataFrame. There are a variety of details one needs to get right to do this well. This would be a good case study for improving our usability.

cc @martindurant

@mrocklin
Member Author

In particular I'm thinking that the following should probably just work for a variety of data types.

import dask.dataframe as dd

spark_df.write.parquet(fn)        # write from Spark
dask_df = dd.read_parquet(fn)     # read that output with Dask

dask_df.to_parquet(fn2)           # write from Dask
spark.read.parquet(fn2)           # read that output back with Spark

@martindurant
Member

Fastparquet does have a number of tests specifically for spark round-tripping: https://github.com/dask/fastparquet/blob/master/fastparquet/test/test_aroundtrips.py
(these are without dask's part, of course)

@mrocklin
Member Author

Understood. To be clear, I'm not speaking about fastparquet, I'm talking about dask.dataframe. My attempt to recreate the workflow above failed with dask dataframe. I think that this is a fairly important issue.

@martindurant
Member

Yes, agreed; I just thought I'd point out that the fastparquet tests might serve as a starting point. It would be easy to write a similar set of tests (probably not for CI, though) and then fix issues in fastparquet and Arrow as needed.

@mrocklin
Member Author

mrocklin commented Oct 13, 2018 via email

@martindurant
Member

It's on the list...

@xhochy
Contributor

xhochy commented Oct 16, 2018

We also have a flavor='spark' argument in pyarrow which handles some of the limitations that the Spark Parquet implementation has. Sadly there are still some features of Parquet that pyarrow uses but Spark cannot handle.

@mrocklin
Member Author

mrocklin commented Oct 16, 2018 via email

@jakirkham
Member

Is this still of interest?

@jakirkham jakirkham added the io label Apr 30, 2019
@martindurant
Member

Could be contemplated in the context of #4336, but interop with Arrow is more important now than Spark per se (plus pyspark's inherent limitations), so maybe roundtripping to Spark is something they should care about more than we do. In any case, I don't think there are any plans to work on this. I am ambivalent about whether to close.

@mrocklin
Member Author

mrocklin commented Apr 30, 2019 via email
