Test spark #4122
Basic roundtrip functions. from_spark succeeds, with a new structure in the fastparquet arm to deal with directories without _metadata. (Could do with a better error message on failure?) to_spark passes. The tests so far include only the very simplest data types.
This is a really reassuring test. Thank you! Do you have thoughts on expanding this in this PR? Should we wait for a future PR? Regardless, what else do you think should be done here? Should we add pyspark to one of the entries of the testing matrix on Travis-CI?
(failure is due to flake) This PR could be useful as-is, but I think it ought to have further types and structures - which will fail, I think. For example:
I would not recommend adding spark to the main test build. It tends to redefine system networking settings, and I've generally found problems when it interacts with other network-oriented testing. It could be a matrix element specifically for spark, which does usually conda-install without a problem.
There is a PySpark Docker image in Jupyter Docker Stacks. Might be a reasonable starting point for a CI build in a new matrix element.
path = paths[0].rstrip('/')
paths = (fs.glob(path + '/*.parq*')
         + fs.glob(path + '/*/*.parq*')
         + fs.glob(path + '/*/*/*.parq*'))
Is there possibly more nesting here? Also, is .parq standard or just a convention?
By convention, names are .parq or .parquet. Others are possible, maybe (this is not in the spec), but we want to avoid picking up the _metadata and _common_metadata files, _SUCCESS files, or any checksums that may be around.
There could be more levels of nesting; "**" would be useful here.
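For illustration, here is a minimal sketch of discovery at arbitrary nesting depth that keeps only files ending in .parq or .parquet and skips underscore- or dot-prefixed sidecars. It assumes a local filesystem and uses os.walk rather than the fs.glob call in the diff; find_parquet_files is a hypothetical helper name, not part of dask or fastparquet.

```python
import os

# Conventional data-file suffixes mentioned above; not mandated by the spec.
DATA_SUFFIXES = (".parq", ".parquet")


def find_parquet_files(root):
    """Return sorted paths of parquet data files under ``root`` at any depth.

    Hypothetical helper for local paths only. Names starting with "_" or "."
    (for example _metadata, _common_metadata, _SUCCESS, hidden checksum
    files) are skipped.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(("_", ".")):
                continue  # marker or sidecar file, not data
            if name.endswith(DATA_SUFFIXES):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Walking the tree removes the need to enumerate one glob pattern per nesting level, at the cost of listing every directory under the root.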
@martindurant what is the status here?
My summary above is still valid; I've been waylaid by other commitments, so could use some help filling out tests. Perhaps we can ask some pyarrow person; they may have similar tests somewhere. Some apparent fastparquet problems are being fixed (dask/fastparquet#379); some remain open (dask/fastparquet#375).
Not planning on working on this in the near future, so closing for now. Would be good to come back to at a later date, but suspect pyarrow should have filled the gaps by then.
flake8 dask