
Test round-tripping dataframe parquet I/O including pyspark #9156

Merged
jrbourbeau merged 16 commits into dask:main from ian-r-rose:roundtrip-spark
Jun 8, 2022

Conversation

@ian-r-rose
Collaborator

@ian-r-rose ian-r-rose commented Jun 2, 2022

This is a proposed fix for #4096. It adds a new test_pyspark_compat module to make sure that we can round-trip data from Spark. I've opted to create a new CI workflow for this to avoid the additional weight and pain of including Scala in our normal CI environments. It runs nightly and exercises just the pyspark compatibility tests. There could be other ways to set this up, however, so I'm open to discussion on that point.

I'm trying to take the "Naively" from the linked issue seriously. The goal of these tests is to do as little trickery as possible and to look as close to user code as I can (with the caveat that I'm not very familiar with Spark). So anything that involves additional data transformation or non-default arguments counts as a failure (and there are a few here). However, I mostly don't care whether the dataframe metadata are 100% identical after going through the round-trip process. So if we lose some information about the pandas index, or if the order of columns is different, that is probably okay. Instead, we should be testing that the data are faithfully round-tripped without loss or (excessive) coercion.

TODO

  • Figure out a workaround for timestamp-localization issues
  • Possibly trigger an issue if things fail (similar to the upstream builds)
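To illustrate the "data, not metadata" point above, the comparison could be sketched roughly like this (a hedged illustration; `assert_data_roundtrips` is a hypothetical helper, not the actual contents of the test_pyspark_compat module):

```python
import pandas as pd


def assert_data_roundtrips(original: pd.DataFrame, restored: pd.DataFrame, key: str) -> None:
    """Assert that `restored` carries the same data as `original`,
    ignoring row order, index metadata, and column order.
    """
    left = original.sort_values(key).reset_index(drop=True)
    # Reorder the restored frame's columns to match before comparing values.
    right = restored.sort_values(key).reset_index(drop=True)[left.columns]
    pd.testing.assert_frame_equal(left, right)
```

In a real round-trip test one would write the frame to parquet with Spark (e.g. `spark.createDataFrame(pdf).write.parquet(path)`), read it back with `dd.read_parquet(path).compute()`, and then compare with a check like the above rather than demanding byte-identical metadata.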

@ian-r-rose ian-r-rose changed the title Roundtrip spark Test round-tripping dataframe parquet I/O including pyspark Jun 2, 2022
@ian-r-rose ian-r-rose added the dataframe, io, and tests (Unit tests and/or continuous integration) labels Jun 2, 2022
@ian-r-rose
Collaborator Author

@jrbourbeau I took a look at what it would take to include Spark in the main CI environment in my most recent commit. There appear to be some pretty horrific thread leakages that I don't quite understand (which is the type of thing I was trying to avoid by putting it in its own workflow).

@github-actions github-actions bot removed the io label Jun 2, 2022
@ian-r-rose ian-r-rose marked this pull request as ready for review June 3, 2022 02:42
Member

@jrbourbeau jrbourbeau left a comment


Thanks @ian-r-rose -- this is looking good

)
yield spark

spark.stop()
Member


Could this be a context manager instead? FWIW what you have here is fine -- I'm just curious if we could make the session cleanup more concise

Collaborator Author


As far as I can tell, the pyspark session interface does not expose a context manager (indeed, I had to do the extra stuff with signal handlers because it generally does a poor job of cleaning up after itself)
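For what it's worth, a thin wrapper could supply the context-manager ergonomics even if the session itself doesn't (a hedged sketch; the `spark_session` helper is hypothetical, not code from this PR, and it wouldn't address the signal-handler cleanup mentioned above):

```python
from contextlib import contextmanager


@contextmanager
def spark_session(builder):
    """Start a session from a builder object and guarantee it is stopped.

    `builder` is anything with getOrCreate()/stop() semantics, such as
    pyspark.sql.SparkSession.builder (illustrative usage only).
    """
    session = builder.getOrCreate()
    try:
        yield session
    finally:
        # Runs even if the body raises, so the session is always stopped.
        session.stop()
```

Usage would then look like `with spark_session(SparkSession.builder.master("local")) as spark: ...`, which is about as concise as the fixture's yield-then-stop pattern but moves the cleanup into one reusable place.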

@jrbourbeau
Member

Also cc @MrPowers for visibility

@ian-r-rose
Collaborator Author

Thanks for the review @jrbourbeau!

Member

@jrbourbeau jrbourbeau left a comment


Thanks for your work on this @ian-r-rose! The changes here look like a nice addition to me. I'll plan to merge this in tomorrow unless others have feedback (cc @rjzamora @martindurant as you might enjoy looking at this)

@jrbourbeau jrbourbeau merged commit 22915dc into dask:main Jun 8, 2022
@rjzamora
Member

rjzamora commented Jun 8, 2022

Thanks for this @ian-r-rose !


Labels

dataframe, io, parquet, tests (Unit tests and/or continuous integration)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Naively roundtrip parquet data from Spark

4 participants