
Test round-tripping dataframe parquet I/O including pyspark #9156

Merged
jrbourbeau merged 16 commits into dask:main from ian-r-rose:roundtrip-spark
Jun 8, 2022

Conversation

@ian-r-rose
Collaborator

@ian-r-rose ian-r-rose commented Jun 2, 2022

This is a proposed fix for #4096. It adds a new test_pyspark_compat module to make sure that we can round-trip data from Spark. I've opted to create a new CI workflow for this to avoid the additional weight and pain of including Scala in our normal CI environments. It runs nightly and exercises just the pyspark compatibility tests. There could be other ways to set this up, however, so I'm open to discussion on that point.

I'm trying to take the "Naively" from the linked issue seriously. The goal of these tests is to do as little trickery as possible and to look as close to user code as I can (with the caveat that I'm not very familiar with Spark). So anything that involves additional data transformation or non-default arguments counts as a failure (and there are a few here). However, I mostly don't care whether the dataframe metadata are 100% identical after going through the round-trip process. So if we lose some information about the pandas index, or if the order of columns is different, that is probably okay. Instead, we should be testing that the data are faithfully round-tripped without loss or (excessive) coercion.

TODO

  • Figure out a workaround for timestamp-localization issues
  • Possibly trigger an issue if things fail (similar to the upstream builds)
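To illustrate the "data, not metadata" point above, the comparison could be sketched roughly like this (a hedged illustration; `assert_data_roundtrips` is a hypothetical helper, not the actual contents of the test_pyspark_compat module):

```python
import pandas as pd


def assert_data_roundtrips(original: pd.DataFrame, restored: pd.DataFrame, key: str) -> None:
    """Assert that `restored` carries the same data as `original`,
    ignoring row order, index metadata, and column order.
    """
    left = original.sort_values(key).reset_index(drop=True)
    # Reorder the restored frame's columns to match before comparing values.
    right = restored.sort_values(key).reset_index(drop=True)[left.columns]
    pd.testing.assert_frame_equal(left, right)
```

In a real round-trip test one would write the frame to parquet with Spark (e.g. `spark.createDataFrame(pdf).write.parquet(path)`), read it back with `dd.read_parquet(path).compute()`, and then compare with a check like the above rather than demanding byte-identical metadata.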

@ian-r-rose ian-r-rose changed the title Roundtrip spark Test round-tripping dataframe parquet I/O including pyspark Jun 2, 2022
@ian-r-rose ian-r-rose added the dataframe, io, and tests (Unit tests and/or continuous integration) labels Jun 2, 2022
@ian-r-rose
Collaborator Author

@jrbourbeau I took a look at what it would take to include Spark in the main CI environment in my most recent commit. There appear to be some pretty horrific thread leakages that I don't quite understand (which is the type of thing I was trying to avoid by putting it in its own workflow).

@github-actions github-actions bot removed the io label Jun 2, 2022
@ian-r-rose ian-r-rose marked this pull request as ready for review June 3, 2022 02:42
Member

@jrbourbeau jrbourbeau left a comment


Thanks @ian-r-rose -- this is looking good

)
yield spark

spark.stop()
Member


Could this be a context manager instead? FWIW what you have here is fine -- I'm just curious if we could make the session cleanup more concise

Collaborator Author


As far as I can tell, the pyspark session interface does not expose a context manager (indeed, I had to do the extra stuff with signal handlers because it generally does a poor job of cleaning up after itself)
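For what it's worth, a thin wrapper could supply the context-manager ergonomics even if the session itself doesn't (a hedged sketch; the `spark_session` helper is hypothetical, not code from this PR, and it wouldn't address the signal-handler cleanup mentioned above):

```python
from contextlib import contextmanager


@contextmanager
def spark_session(builder):
    """Start a session from a builder object and guarantee it is stopped.

    `builder` is anything with getOrCreate()/stop() semantics, such as
    pyspark.sql.SparkSession.builder (illustrative usage only).
    """
    session = builder.getOrCreate()
    try:
        yield session
    finally:
        # Runs even if the body raises, so the session is always stopped.
        session.stop()
```

Usage would then look like `with spark_session(SparkSession.builder.master("local")) as spark: ...`, which is about as concise as the fixture's yield-then-stop pattern but moves the cleanup into one reusable place.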

@jrbourbeau
Member

Also cc @MrPowers for visibility

@ian-r-rose
Collaborator Author

Thanks for the review @jrbourbeau!

Member

@jrbourbeau jrbourbeau left a comment


Thanks for your work on this @ian-r-rose! The changes here look like a nice addition to me. I'll plan to merge this in tomorrow unless others have feedback (cc @rjzamora @martindurant as you might enjoy looking at this)

@jrbourbeau jrbourbeau merged commit 22915dc into dask:main Jun 8, 2022
@rjzamora
Member

rjzamora commented Jun 8, 2022

Thanks for this @ian-r-rose !


Labels

dataframe, io, parquet, tests (Unit tests and/or continuous integration)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Naively roundtrip parquet data from Spark

4 participants