-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure examples stay updated in CI. #696
Comments
There are a couple of things we'll need to do that immediately come to mind:
I hadn't originally expected the examples to be used in this way, but it's a very good idea. |
I have a branch ready that corrects the examples to match the spec output. Most of it was numerical errors between float and decimal representation, and converting 3 months to 90, 91, or 92 days. When the interval is fixed upstream, we can make that less prone to error. There was one or two queries that needed an algorithmic update. I will hold off on putting in that PR until we first answer if we want to go all in on data validation as part of the example or not. I have this example where I pull the query up into a function and add a validation function. The down side of this is that it makes the example less readable to new users in my opinion. But it does come at the point of adding in both validation and preparing the result to be used by CI. I've asked this question on the discord server, but putting it here for traceability. |
After a little experimenting, I'm pretty sure this will work. Create a def test_q01_pricing_summary_report():
import q01_pricing_summary_report
df = q01_pricing_summary_report.df
# assertion code on `df`
|
We could make it stupid simple by making these snapshot tests and using https://pypi.org/project/pytest-snapshot/ (Which is valid because you've already verified the behavior for current behavior, we just want to ensure upgrades don't change the behavior) |
Ok, made some good progress on this. Will try to wrap up tomorrow. https://github.com/timsaucer/datafusion-python/blob/tsaucer/prepare_tpch_examples_for_ci/examples/tpch/_tests.py I do like your idea of switching this to use I am adding a small set of reduced reference data, which is total of about 1.1 Mb. This will allow users to run the examples without requiring them to generate a full TPC-H data set, but if they want to reproduce the spec result they will have to run the data generator. This is also such a small data set it'll speed up CI by not having to run generator every time. Most importantly, if we don't have a reference data set in the repo it would require all users to run the generator just to get their pyunit tests to pass, which I don't think people would like. |
My default would be not to add the data to the repo, but I'll let @andygrove decide that. Maybe we could add it to the test-data submodule https://github.com/apache/arrow-testing? If not, I can add whatever line/script to generate (and maybe cache) the test-data to the CI |
As an update, I've got the tests written as you describe. I removed the reference files from the repo. Now it's checking against the official answer files. There is one special case I mentioned in discord, but from my web searches it appears to be known issue (and my result matches the spec pdf). All that's left is to work through the CI process. I expect to move my PR to ready on Monday or Tuesday. |
Given the great work @timsaucer has done on the
examples/tpch
, I'd like to ensure they stay up to date and don't break when we upgradedatafusion
versions.Describe the solution you'd like
Re-run / validate the examples on PRs
Describe alternatives you've considered
Open to suggestions!
The text was updated successfully, but these errors were encountered: