Adds support for SPARQL CDTs (lists and maps as literals) #2501
…jena into UnfoldAndFoldWithCompositeValues
…formed (because their lexical form has already been checked)
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
…nges necessary for the query engine to handle this new operator (but without invoking any actual functionality if it is present in a query)
…eate no assignment
…result in a relevant literal, create no assignment
…placeholders for adding special functionality for such types of literals
…terals, and proper integration into the Jena Graph API (via special LiteralLabel implementations)
…er special case for the equals (=) operator in SPARQL
… literals and cdt:Map literals
…hat LiteralLabel is not anymore an interface but a *final* class
…RDER BY of CDT literals didn't work correctly anymore)
…jena into UnfoldAndFoldWithCompositeValues
@afs I am done with rebasing now.
Excellent. If I set to "Rebase and Merge" there is something showing. Part of the conflict is
which suggests both parts are for this PR. Several I looked at are the same update, very similar via two different routes. The other GitHub merge strategies do allow a merge, although that does not guarantee git will pick the right option. It doesn't look like a "semantic" (!!) problem -- I'll investigate further.
The rebasing has made it clear which files have changed, and that makes reviewing easier - thanks. I haven't managed to do a clean "git rebase" against the current However, the rebasing on the PR has also made that unnecessary. When we're ready to merge, simply take a diff of the current PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo

Now, thinking about the contribution -- There are two considerations:
There is a realistic chance that SPARQL-CDTs will evolve and that may be in incompatible ways. By being clear this is "experimental", I don't think that is a problem if not using the features means no impact. I (non-PMC chair opinion) am keen to see new tech being available to users,

Performance: There is one place that needs investigation that I have found so far which is the mixes this - The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT optimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Functionality: While CDTs are experimental, and likely to change, Jena needs a way to test the functionality it ships and

For the stability, PRs need tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. The W3C RDF/SPARQL tests in the Jena repo have the same status - local copies.

Technical points:
1: It may be better to do CDTs as
2: When thinking about persistent storage, the bnode node datastructures may need
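The performance point in this comment, a subclass that checks quickly whether it can hand off to its superclass, can be illustrated with a small sketch. This is not Jena code: the class names, the tuple representation of literals, and the `example.org` datatype IRIs are all hypothetical placeholders, and Python timings say nothing about how the JVM JIT would treat the equivalent Java code. It only shows the shape of the extra check that sits on the hot path and one way to time both variants.

```python
import timeit

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Placeholder IRIs for illustration only; not the real CDT datatype IRIs.
HYPOTHETICAL_LIST_IRI = "http://example.org/cdt/List"
HYPOTHETICAL_MAP_IRI = "http://example.org/cdt/Map"
CDT_DATATYPES = frozenset({HYPOTHETICAL_LIST_IRI, HYPOTHETICAL_MAP_IRI})


class BaseLiteralFactory:
    """Hypothetical stand-in for the existing literal-creation path."""

    def create_literal(self, lexical, datatype):
        return ("literal", lexical, datatype)


class CdtAwareLiteralFactory(BaseLiteralFactory):
    """Hypothetical subclass: one cheap membership test, then delegate."""

    def create_literal(self, lexical, datatype):
        if datatype not in CDT_DATATYPES:
            # Common case: fall straight through to the superclass.
            return super().create_literal(lexical, datatype)
        # Rare case: CDT literals get their own representation.
        return ("cdt-literal", lexical, datatype)


def time_factory(factory, n=100_000):
    """Seconds taken to create n ordinary (non-CDT) literals."""
    return timeit.timeit(
        lambda: factory.create_literal("42", XSD_INTEGER), number=n
    )
```

Comparing `time_factory(BaseLiteralFactory())` with `time_factory(CdtAwareLiteralFactory())` gives a rough feel for the overhead of the extra check in this toy setting; the real measurement would have to be done against the actual parser on the JVM.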
I have some high-level comments
I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword However, we could use both kinds of types for many purposes. I want to list some
generic UNFOLD:
further links:
Appendix comparison of JenaX syntax vs. CDT
Further ideas present in JenaX:
Thanks for the pointer. I knew that Stardog had something similar, but I wasn't aware of this blog post. Looking at it now, I see some relevant differences: The Stardog approach changes the notion of a solution mapping by extending the co-domain of these mappings with two new kinds of elements (namely, solution mappings themselves, which makes the notion recursive, and some kind of array). While I cannot find any formal specification of this approach, I can see that extending the definition of solution mapping in the SPARQL spec in this way would have a wide-ranging impact on most of the formal parts of the spec (the definition of every operator that processes solution mappings would need to be adapted). In contrast, adding a new operator (UNFOLD) and a new aggregation function (FOLD), as done by our approach, would be a comparably modest change/extension to the SPARQL spec (and we already provide all relevant parts of this extension explicitly in our SPARQL CDTs spec, ready to be copied directly into the SPARQL spec; rather than having "only" an implementation of an idea described informally in a blog post). Moreover, the relationship between Stardog's approach (which seems to be only about SPARQL) and the RDF data model is also not clear to me from the blog post (e.g., what happens if a solution mapping that contains such an array or a nested solution mapping is passed to a CONSTRUCT pattern?), whereas our approach is simply based on RDF literals and, thus, is clearly within the realm of RDF as is.
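To make the contrast concrete, here is a rough sketch of the idea that UNFOLD and FOLD can work over ordinary solution mappings, without changing what a solution mapping is. It is a simplified model and not the formal definition from the SPARQL CDTs spec: solution mappings are plain Python dicts, a Python list stands in for the value of a cdt:List literal, and the function names and parameters are made up for illustration. The treatment of empty or non-list values (the solution is kept without a new binding) is an assumption of this sketch.

```python
def unfold(solutions, list_var, elem_var):
    """Yield one extended solution per element of the list bound to list_var."""
    for mu in solutions:
        value = mu.get(list_var)
        if not isinstance(value, list) or not value:
            # Assumption of this sketch: keep the solution, bind nothing new.
            yield dict(mu)
            continue
        for elem in value:
            extended = dict(mu)
            extended[elem_var] = elem
            yield extended


def fold(solutions, group_vars, value_var, out_var):
    """Group solutions by group_vars and collect value_var into one list per group."""
    groups = {}
    for mu in solutions:
        key = tuple(mu.get(v) for v in group_vars)
        # An unbound value_var is collected as None in this sketch.
        groups.setdefault(key, []).append(mu.get(value_var))
    result = []
    for key, values in groups.items():
        row = dict(zip(group_vars, key))
        row[out_var] = values
        result.append(row)
    return result


if __name__ == "__main__":
    sols = [{"s": "a", "list": [1, 2]}, {"s": "b", "list": [3]}]
    rows = list(unfold(sols, "list", "x"))
    print(rows)
    print(fold(rows, ["s"], "x", "xs"))
```

In this model every binding is still a single value, so the operators that consume solution mappings need no change; the list is just another literal value that `unfold` takes apart and `fold` builds.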
Some vendors may choose to support only one or the other. Moreover, a potential future addition may be a cdt:SortedMap datatype which would likely have the same lexical space as cdt:Map, or a cdt:Set (as you mention JenaX has).
Our spec contains all the relevant parts to define
While, also for this one, our spec contains some spec text relevant for defining a generic "multi-BIND" (à la w3c/sparql-dev#6), in contrast to the |
@afs apologies for the delay; after I came back from traveling, my university responsibilities required all my time.
The reason might be that some of the commits within the PR may be conflicting with one another. Right before creating the PR, I already rebased to the then-latest
Perfect.
Would it be an option to move the code of Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in Yet another option may also be to combine both of these ideas. What do you think?
Of course. I can copy these tests. I assume they would go into a new subdirectory under After I have created the copy, where do I have to add a pointer to them such that they are run automatically during a build? Currently, I run these tests manually using These four tests are list-functions/sameterm-03.rq, list-functions/sameterm-04.rq, map-functions/sameterm-03.rq, and map-functions/sameterm-04.rq. All of them check the behavior of |
@afs Additionally, related to the very last paragraph of my previous comment, I have changed the implementation such that the (I am still waiting for our legal folks to advise me on how to proceed with creating the Software Grant Agreement.) |
ASF now have a Software Grant for this PR.

"rebase and merge" and "merge commit" show conflicts. "Squash + merge" is showing as OK. Taking a diff and applying it (a sort of cheap and hacky "squash") is no longer working - in part, due to some early RDF 1.2 work :-)

The Apache Jena I've created a new branch in the Jena repo My plan is to squash-merge onto that new branch and see what the state is and we can all test it out. It may take more than one attempt to get this integrated.

@hartig - Please treat Don't throw away your copy!
Thanks Andy! Let me know if you need my help with anything. There is one thing remaining from the discussion above: You mentioned the following performance-related concerns regarding the
Related to this, I had a few proposals, which I am repeating here again: Would it be an option to move the code of Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in Yet another option may also be to combine both of these ideas. What do you think? |
We can't be sure until we have the CDT code combined with the current state of For now, a separate The next step to make the branch on It would also be good if the sparql-dev SEP could be done before finally adopting this code. |
Okay.
Sounds good!
I will ask @rdfguy again. |
Removed "This closes" from the description. This isn't a PR to the default branch anymore. |
There is a Software Grant from AWS on file at ASF.
Approved for merging the PR into a branch in the Jena repository.
The merge causes the PR to be closed. Please continue discussion on the issue #2518. |
Issue #2518.
The lack of built-in support for generic types of composite values such as lists and maps has been a long-standing issue for RDF and SPARQL. Together with a few other colleagues at the Amazon Neptune team we have developed an approach to represent lists and maps as literals in RDF data, and to extend SPARQL with features related to such literals. These extensions of SPARQL include:
We have created a complete formal specification of the approach and a comprehensive test suite for implementers, which can be found in our GitHub repo: https://github.com/awslabs/SPARQL-CDTs
... and I have implemented a complete integration of this approach into Jena, as can be found in this PR. We would like to contribute this implementation to Jena if you are interested, and I will be more than happy to assist you with getting the PR ready to be merged.
Perhaps before you dive into the aforementioned specification, you may take a look at our short paper, in which we provide a slightly more extensive motivation for this work and a (very!) brief summary of the approach. After that, Section 2 of the specification provides a more detailed informal description of the different parts of the approach.
I am happy to answer any questions that you may have, both about the approach in general and about the implementation in this PR. Also, if you have issues with some parts of the specification, feel free to create an issue in the aforementioned GitHub repo. (And in case you are wondering: yes, we are planning to file the approach as a SPARQL Enhancement Proposal (SEP) for the SPARQL-dev Community Group.)
By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.