
Naively roundtrip parquet data from Spark #4096

Closed
mrocklin opened this issue Oct 13, 2018 · 11 comments · Fixed by #9156

@mrocklin
Member

Currently it is not easy to round-trip Parquet data between Spark and Dask DataFrame. There are a variety of details one needs to get right to do this well. This would be a good case study for improving our usability.

cc @martindurant

@mrocklin
Member Author

In particular I'm thinking that the following should probably just work for a variety of data types.

import dask.dataframe as dd

spark_df.write.parquet(fn)        # write from Spark
dask_df = dd.read_parquet(fn)     # read that output with Dask

dask_df.to_parquet(fn2)           # write from Dask
spark.read.parquet(fn2)           # read that output back with Spark

@martindurant
Member

Fastparquet does have a number of tests specifically for spark round-tripping: https://github.com/dask/fastparquet/blob/master/fastparquet/test/test_aroundtrips.py
(these are without dask's part, of course)

@mrocklin
Member Author

Understood. To be clear, I'm not speaking about fastparquet, I'm talking about dask.dataframe. My attempt to recreate the workflow above failed with dask dataframe. I think that this is a fairly important issue.

@martindurant
Member

Yes, agreed; I just thought I'd point out that the fastparquet tests might serve as a starting point. It would be easy to write a similar set of tests (probably not for CI, though) and then fix issues in fastparquet and Arrow as needed.

@mrocklin
Member Author

mrocklin commented Oct 13, 2018 via email

@martindurant
Member

It's on the list...

@xhochy
Contributor

xhochy commented Oct 16, 2018

We also have a flavor='spark' argument in pyarrow which handles some of the limitations that the Spark Parquet implementation has. Sadly there are still some features of Parquet that pyarrow uses but Spark cannot handle.

@mrocklin
Member Author

mrocklin commented Oct 16, 2018 via email

@jakirkham
Member

Is this still of interest?

@jakirkham jakirkham added the io label Apr 30, 2019
@martindurant
Member

Could be contemplated in the context of #4336, but interop with Arrow is more important now than Spark per se (plus pyspark's inherent limitations), so maybe roundtripping to Spark is something they should care about more than we do. In any case, I don't think there are any plans to work on this. I am ambivalent about whether to close.

@mrocklin
Member Author

mrocklin commented Apr 30, 2019 via email
