Test round-tripping dataframe parquet I/O including pyspark#9156
jrbourbeau merged 16 commits into dask:main from
Conversation
@jrbourbeau I took a look at what it would look like to include spark in the main CI environment in my most recent commit. It looks like there are some pretty horrific thread leakages that I don't quite understand (which is the type of thing I was trying to avoid by putting it in its own workflow).
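As an aside, one way to make thread leakage like this visible is to compare the set of live threads before and after a test. This is a hedged sketch using only the standard library; `run_with_thread_check` is a hypothetical helper for illustration, not part of this PR:

```python
import threading


def run_with_thread_check(test_fn):
    """Run test_fn and report any threads it left behind.

    Hypothetical helper: snapshots the live threads before the call and
    returns the names of any new threads still alive afterwards, which is
    roughly what pytest thread-leak checks do.
    """
    before = set(threading.enumerate())
    test_fn()
    leaked = set(threading.enumerate()) - before
    return [t.name for t in leaked]


def tidy_test():
    # A well-behaved test: the worker thread is joined before returning.
    t = threading.Thread(target=lambda: None, name="worker")
    t.start()
    t.join()


print(run_with_thread_check(tidy_test))  # []
```

A test that starts a daemon thread and never joins it would show up in the returned list instead of an empty one.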
jrbourbeau
left a comment
Thanks @ian-r-rose -- this is looking good
```python
    )
    yield spark

    spark.stop()
```
Could this be a context manager instead? FWIW what you have here is fine -- I'm just curious if we could make the session cleanup more concise
As far as I can tell, the pyspark session interface does not expose a context manager (indeed, I had to do the extra stuff with signal handlers because it generally does a poor job of cleaning up after itself)
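For reference, a generic wrapper can supply the missing context-manager behavior. This is a sketch only: the `stopping` helper and `FakeSession` class are illustrative, not the PR's actual fixture, and it works for any object exposing a `.stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield a session and guarantee its .stop() runs on exit.

    Hypothetical wrapper for any object with a .stop() method (e.g. a
    pyspark SparkSession, which the comment above notes does not expose
    a context manager of its own).
    """
    try:
        yield session
    finally:
        session.stop()


class FakeSession:
    # Minimal stand-in for a real session, for illustration only.
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


sess = FakeSession()
with stopping(sess):
    pass
print(sess.stopped)  # True
```

Note that this only tidies the happy path and normal exceptions; it would not help with the signal-handler cases mentioned above, since a hard kill bypasses `finally` blocks entirely.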
Also cc @MrPowers for visibility
Thanks for the review @jrbourbeau!
jrbourbeau
left a comment
Thanks for your work on this @ian-r-rose! The changes here look like a nice addition to me. I'll plan to merge this in tomorrow unless others have feedback (cc @rjzamora @martindurant as you might enjoy looking at this)
Thanks for this @ian-r-rose!
This is a proposed fix for #4096. It adds a new `test_pyspark_compat` module to make sure that we can round-trip data from spark.

I've opted to create a new CI workflow for this to prevent the additional weight and pain of including scala in our normal CI environments. It runs nightly on just the pyspark compatibility tests. There could be other ways to set this up, however, so I'm open to discussion on that point.

I'm trying to take the "Naively" from the linked issue seriously. The goal of these tests is to do as little trickery as possible and look as close to user code as I can (with the caveat that I'm not very familiar with spark). So anything that involves additional data transformation or non-default arguments counts as a failure (and there are a few here). However, I mostly don't care if the dataframe metadata are 100% identical after going through the round-trip process. If we lose some information about the pandas index, or if the order of columns is different, that is probably okay. Instead, we should be testing that the data is faithfully round-tripped without loss or (excessive) coercion.
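The "data fidelity, not metadata identity" idea above could be expressed as a tolerant comparison helper. A hedged sketch (the `assert_roundtrip_equal` name is hypothetical, not a function from this PR, and the "round trip" here is only simulated):

```python
import pandas as pd


def assert_roundtrip_equal(expected, result):
    """Check data fidelity while tolerating cosmetic differences.

    Hypothetical helper in the spirit of the tests described above:
    column order and the pandas index may legitimately change on a
    spark round-trip, so both frames are normalized (index dropped,
    columns sorted) before comparing values.
    """
    expected = expected.reset_index(drop=True)[sorted(expected.columns)]
    result = result.reset_index(drop=True)[sorted(result.columns)]
    pd.testing.assert_frame_equal(expected, result)


df = pd.DataFrame({"b": [1.0, 2.0], "a": ["x", "y"]})
# Simulate a round-trip that reordered the columns and reset the index:
back = df[["a", "b"]].reset_index(drop=True)
assert_roundtrip_equal(df, back)  # passes: same data, different layout
```

A lossy round-trip (dropped rows, coerced dtypes, mangled values) would still fail this check, which is the behavior the tests are after.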
- [x] `pre-commit run --all-files`

TODO:
- Possibly trigger an issue if things fail (similar to the upstream builds)