Skip to content

Conversation

@davisusanibar
Copy link
Contributor

Related to apache/arrow#34227

@kou
Copy link
Member

kou commented Mar 22, 2023

Could you remove .DS_Stores?

Copy link
Member

@lidavidm lidavidm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@westonpace would it be useful to have more Acero/Substrait testing data like this in apache/arrow?

Also: are there any licensing concerns with including TPC-H queries? (Though for that matter: why are the SQL queries here in the first place? We can't do anything with them.)

Can you include a README to explain the purpose of these files?

@westonpace
Copy link
Member

westonpace commented Mar 23, 2023

@westonpace would it be useful to have more Acero/Substrait testing data like this in apache/arrow?

@lidavidm

We have a lot of hard-coded JSON but its embedded in the test files themselves (e.g. serde_test.cc or test_substrait.py) and not in standalone files. The original concern around hard-coded JSON was that Substrait may evolve quickly and those JSON files would be difficult to maintain. For example, the JSON files in this PR are missing the version field (Isthmus does not yet populate this) and they don't have URIs for the extension functions (almost no one generates these yet). So they may need to change at some point.

As a result, I have been waiting for the text format to be ready before I made any attempt to curate a large set of test queries (but that is still a few months off at least).

I think SQL is probably a pretty good solution if you have a good SQL->Substrait library (that may be an advantage for Java). In that case I would suggest only storing the SQL and then generating the Substrait on the fly.

I don't actually know what the legal ramifications are for TPC-H but it is a good question.

@lidavidm
Copy link
Member

Hmm, ok. Depending on Isthmus is a problem for Java since it requires a newer Java version than what we use, so we have to start doing acrobatics with the test setup.

@davisusanibar would you like to start a mailing list discussion about requiring JDK11+ for development & testing in general, but building JARs targeting JDK8? Then we could depend on Isthmus and avoid embedding potentially unstable plans. (Currently, our docs state we can still build on JDK8.)

@westonpace
Copy link
Member

Another option is maybe that we create some kind of docker / ci script in this repo that creates the proto / json files from SQL files?

@lidavidm
Copy link
Member

Ah, that would work, too. Though to start with I'm not sure if it even needs to be automated per se.

@westonpace
Copy link
Member

Though to start with I'm not sure if it even needs to be automated per se.

Agreed. As long as the source SQL files are stored alongside the Substrait files, with a detailed README explaining how to regenerate the Substrait files, then I think we can proceed.

@lidavidm
Copy link
Member

OK, then I think what needs to happen here is just check in a Java file (maybe something that'll work with jshell?) showing how to generate the plans given the SQL files, and address the existing review feedback.

@davisusanibar
Copy link
Contributor Author

Hi Team,

Just close this PR. Let's start Java Substrait Consumer on the same way that Python/C++ Substrait plans were implemented (maintain the json plan on testing files).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants