Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test spark #4122

Closed
wants to merge 2 commits into from
Closed

Test spark #4122

wants to merge 2 commits into from

Conversation

martindurant
Copy link
Member

  • Tests added / passed
  • Passes flake8 dask

Martin Durant added 2 commits October 18, 2018 11:49
Basic roundtrip functions.

from_spark succeeds, with new structure for fastparquet arm to deal with
directories without _metadata. (Could do with better error message on
failure?)

to_spark passes

The tests so far include only the very simplest data-types
@martindurant
Copy link
Member Author

Basic roundtrip functions.

from_spark succeeds, with new structure for fastparquet arm to deal with
directories without _metadata. (Could do with better error message on
failure?)

to_spark passes

The tests so far include only the very simplest data-types

@mrocklin
Copy link
Member

This is a really reassuring test. Thank you! Do you have thoughts on expanding this in this PR? Should we wait for a future PR? Regardless, what else do you think should be done here?

Should we add pyspark to one of the entries of the testing matrix on Travis-CI?

@martindurant
Copy link
Member Author

martindurant commented Oct 23, 2018

(failure is due to flake)

This PR could be useful as-is, but I think it ought to have further types and structures - which will fail, I think.

For example:

  • datetimes will be converted to local by pyspark, even though the parquet spec specifies that they are assumed to be UT. In fastparquet the spark roundtrip test specifically adds the timezone difference, but clearly this breaks the intent of the tests here.
  • i32 type did not round-trip correctly, and I don't know why. Fixing that will take an unknown amount of effort, but it would be nice to get everything fixed in this one place
  • bytes type is not roundtripping, which is definitely fixable for fastparquet, not sure about arrow. Fixed-length type is probably not necessary.

I would not recommend adding spark to the tests build. It will tend to redefine system networking things and generally I've found problems interacting with other network-oriented testing. It could be a matrix element specifically for spark, which does usually conda-install without a problem.

@jakirkham
Copy link
Member

There is a PySpark Docker image in Jupyter Docker Stacks. Might be a reasonable starting point for a CI build in a new matrix element.

path = paths[0].rstrip('/')
paths = (fs.glob(path + '/*.parq*')
+ fs.glob(path + '/*/*.parq*')
+ fs.glob(path + '/*/*/*.parq*'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there possibly more nesting here? Also, is .parq standard or just a convention?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention, names are .parq or .parquet. Others are possble, maybe (this is not in the spec), but we want not to get the _metadata and _common_metadata, SUCCESS files or any checksums that may be around.

There could be more levels of nesting, "**" would be useful here.

@mrocklin
Copy link
Member

mrocklin commented Nov 8, 2018

@martindurant what is the status here?

@martindurant
Copy link
Member Author

My summary above is still valid; I've been waylaid by other commitments, so could use some help filling out tests. Perhaps we can ask some pyarrow person, they may have similar tests somewhere.

Some fastparquet apparent problems are being fixed (dask/fastparquet#379) some remain open (dask/fastparquet#375).

@martindurant
Copy link
Member Author

Not planning on working on this in the near future, so closing for now. Would be good to come back to at a later date, but suspect pyarrow should have filled the gaps by then.

@martindurant martindurant deleted the test_spark branch February 9, 2021 19:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants