-
Notifications
You must be signed in to change notification settings - Fork 45
feat: data to test java substrait consumer #89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: data to test java substrait consumer #89
Conversation
|
Could you remove |
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace would it be useful to have more Acero/Substrait testing data like this in apache/arrow?
Also: are there any licensing concerns with including TPC-H queries? (Though for that matter: why are the SQL queries here in the first place? We can't do anything with them.)
Can you include a README to explain the purpose of these files?
We have a lot of hard-coded JSON but its embedded in the test files themselves (e.g. serde_test.cc or test_substrait.py) and not in standalone files. The original concern around hard-coded JSON was that Substrait may evolve quickly and those JSON files would be difficult to maintain. For example, the JSON files in this PR are missing the version field (Isthmus does not yet populate this) and they don't have URIs for the extension functions (almost no one generates these yet). So they may need to change at some point. As a result, I have been waiting for the text format to be ready before I made any attempt to curate a large set of test queries (but that is still a few months off at least). I think SQL is probably a pretty good solution if you have a good SQL->Substrait library (that may be an advantage for Java). In that case I would suggest only storing the SQL and then generating the Substrait on the fly. I don't actually know what the legal ramifications are for TPC-H but it is a good question. |
|
Hmm, ok. Depending on Isthmus is a problem for Java since it requires a newer Java version than what we use, so we have to start doing acrobatics with the test setup. @davisusanibar would you like to start a mailing list discussion about requiring JDK11+ for development & testing in general, but building JARs targeting JDK8? Then we could depend on Isthmus and avoid embedding potentially unstable plans. (Currently, our docs state we can still build on JDK8.) |
|
Another option is maybe that we create some kind of docker / ci script in this repo that creates the proto / json files from SQL files? |
|
Ah, that would work, too. Though to start with I'm not sure if it even needs to be automated per se. |
Agreed. As long as the source SQL files are stored alongside the Substrait files, with a detailed README explaining how to regenerate the Substrait files, then I think we can proceed. |
|
OK, then I think what needs to happen here is just check in a Java file (maybe something that'll work with jshell?) showing how to generate the plans given the SQL files, and address the existing review feedback. |
|
Hi Team, Just close this PR. Let's start Java Substrait Consumer on the same way that Python/C++ Substrait plans were implemented (maintain the json plan on testing files). |
Related to apache/arrow#34227