Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds support for SPARQL CDTs (lists and maps as literals) #2501

Merged
merged 333 commits into from
Aug 28, 2024

Conversation

hartig
Copy link
Contributor

@hartig hartig commented May 28, 2024

Issue #2518.

The lack of built-in support for generic types of composite values such as lists and maps has been a long-standing issue for RDF and SPARQL. Together with a few other colleagues at the Amazon Neptune team we have developed an approach to represent lists and maps as literals in RDF data, and to extend SPARQL with features related to such literals. These extensions of SPARQL include:

  • an aggregation function to produce these composite values (FOLD),
  • functions to operate on these composite values in expressions, and
  • a new operator (UNFOLD) to unfold such composite values into their individual components.

We have created a complete formal specification of the approach and a comprehensive test suite for implementers, which can be found in our Github repo: https://github.com/awslabs/SPARQL-CDTs

... and I have implemented a complete integration of this approach into Jena, as can be found in this PR. We would like to contribute this implementation to Jena if you are interested, and I will be more than happy to assist you with getting the PR ready to be merged.

Perhaps before you dive into the aforementioned specification, you may take a look at our short paper, in which we provide a slightly more extensive motivation for this work and a (very!) brief summary of the approach. After that, Section 2 of the specification provides a more detailed informal description of the different parts of the approach.

I am happy to answer any questions that you may have, both about the approach in general and about the implementation in this PR. Also, if you have issues with some parts of the specification, feel free to create an issue in the aforementioned Github repo. (And in case you are wondering, yes we are planning to file the approach as a SPARQL Enhancement Proposal (SEP) for the SPARQL-dev Community Group).


  • Tests are included.
  • Documentation change and updates are provided for the Apache Jena website
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

hartig and others added 30 commits March 4, 2024 14:25
…formed (because their lexical form has already been checked)
…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4
…nges necessary for the query engine to handle this new operator (but without invoking any actual functionality if it is present in a query)
…result in a relevant literal, create no assignment
…placeholders for adding special functionality for such types of literals
…terals, and proper integration into the Jena Graph API (via special LiteralLabel implementations)
…er special case for the equals (=) operator in SPARQL
@hartig
Copy link
Contributor Author

hartig commented Jun 4, 2024

@afs I am done with rebasing now.

mvn clean install still runs without problems, and I can also run our SPARQL CDTs test suite.

@afs
Copy link
Member

afs commented Jun 5, 2024

@afs I am done with rebasing now.

Excellent.

If I set to "Rebase and Merge" there is something showing.
148 commits in (commit 9cc5647).

Part of the conflict is

both added:      jena-arq/src/main/java/org/apache/jena/sparql/syntax/ElementUnfold.java

which suggests both parts are for this PR. Several I looked at are the same update, very similar via two different routes.

The other github merge strategies do allow merge although it does not guarantee git will pick the right option.

It doesn't like a "semantic" (!!) problem -- I'll investigate further.

@afs
Copy link
Member

afs commented Jun 8, 2024

The rebasing has made it a clear which files have changed and that makes reviewing easier - thanks.

I haven't managed to do a clean "git rebase" against the current main (which is the "rebase and merge" github option) which isn't that surprising. The rebased current PR seem to have conflicts all seem to be not between main and the PR but within the PR's history - which I can't explain.

However, the rebasing on the PR has also made that unnecessary.

When we're ready to merge, simply take diff of the current the PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo git apply the patch and make a new PR - single commit, with you as author.

Now, thinking about the contribution --

There are two considerations:

  • The majority of Jena users are depending on Jena's stability, including performance.
  • There is a need to have ways to get RDF/SPARQL emerging ideas into the wider field for practical experience.

There is a realistic chance that SPARQL-CDTs will evolve and that may be in incompatible ways. By being clear this is "experimental", I don't think that is a problem if not using the features means no impact. I (non-PMC chair opinion) am keen to see new tech being available to users, LATERAL being a recent example. (Disclosure: Olaf and I are both involved in RDF 1.2 where issues of evolution are important.)

Performance:

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may
change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Functionality:

While CDTs are experimental, and likely to change, Jena needs a way to test teh functionality it ships and
it does that by running test suites every build. An Apache project is more than open source - the code needs to be able to build. Jena releases the repository source tree - the binary artifacts ("convenience binaries") are secondary.

For the stability, PRs needs tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. The W3C RDF/SPARQL tests in th ejena repo are the same status - local copies.

Technical points:
(this is for later - not required for accepting this PR).

1: It may be better to do CDT's as StreamRDF processing rather
than as a parse profile, if the blank node mapping, and while we're at it, the prefix
mapping, can be made available to the stream which they aren't at the moment.

2: When thinking about persistent storage, the bnode node datastructures may need
to a have a longer lifetime that the parser run including persistent storage.
"Test to create a cdt:List with blank nodes via STRDT" for example.

@SimonBin
Copy link
Contributor

I have some high-level comments

I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword FOLD and UNFOLD, whereas the first is an aggregate function operator with the additional support for ORDER BY, and the latter is a custom operator which seems to me like a special case implementation of multi-binding BIND (w3c/sparql-dev#6) for cdt:Lists.

However, we could use both kinds of types for many purposes. I want to list some

ORDER BY in aggregate functions:

generic UNFOLD:

further links:

Appendix

comparison of JenaX syntax vs. CDT

CDT JenaX
SELECT ?name (FOLD(?person) AS ?list) SELECT ?name (array:collect(?person ) AS ?list)
UNFOLD (?list AS ?element) ?list array:unnest ?element
BIND(cdt:get(?map, "address") AS ?addressSubMap) BIND(json:get(?map, "address") AS ?addressSubMap)
BIND(cdt:put(?addressSubMap, "number", 42) AS ?newAddressSubMap) BIND(json:set(?addressSubMap, "number", 42) AS ?newAddressSubMap)
"[ '1999-08-16'^^<http://www.w3.org/2001/XMLSchema#date>, 42 ]"^^cdt:List "'1999-08-16'^^<http://www.w3.org/2001/XMLSchema#date> 42"^^norse:array

Further ideas present in JenaX:

  • set datatype in addition to list/array
  • support for XML, XPath and JsonPath
  • explode map or array into many variables (not many bindings), shortcut for many BINDs
    Example:
     VALUES ?json { "{'a': 2, 'b': 5}"^^norse:json }
     ?json json:explode ("json_") .
     # now do something with ?json_a and ?json_b
    

@hartig
Copy link
Contributor Author

hartig commented Jun 20, 2024

[@SimonBin] I have some high-level comments

Thanks for the pointer. I knew that Stardog had something similar, but I wasn't aware of this blog post. Looking at it now, I see some relevant differences: The Stardog approach changes the notion of a solution mapping by extending the co-domain of these mappings with two new kinds of elements (namely, solution mappings themselves, which makes the notion recursive, and some kind of array). While I cannot find any formal specification of this approach, I can see that extending the definition of solution mapping in the SPARQL spec in this way would have a wide-ranging impact on most of the formal parts of the spec (the definition of every operator that processes solution mappings would need to be adapted). In contrast, adding a new operator (UNFOLD) and a new aggregation function (FOLD), as done by our approach, would be a comparably modest change/extension to the SPARQL spec (and we already provide all relevant parts of this extension explicitly in our SPARQL CDTs spec, ready to be copied directly into the SPARQL spec; rather than having "only" an implementation of an idea described informally in a blog post). Moreover, the relationship between Stardog's approach (which seems to be only about SPARQL) and the RDF data model is also not clear to me from the blog post (e.g., what happens if a solution mapping that contains such an array or a nested solution mapping is passed to a CONSTRUCT pattern?), whereas our approach is simply based on RDF literals and, thus, is clearly within the realm of RDF as is.

  • why do we need cdt:List and cdt:Map when the syntax already decides the object type (cf. in JenaX we have a single JSON data type)

Some vendors may choose to support only one or the other. Moreover, a potential future addition may be a cdt:SortedMap datatype which would likely have the same lexical space as cdt:Map, or a cdt:Set (as you mention JenaX has).

I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword FOLD and UNFOLD, whereas the first is an aggregate function operator with the additional support for ORDER BY, and the latter is a custom operator which seems to me like a special case implementation of multi-binding BIND (w3c/sparql-dev#6) for cdt:Lists.

However, we could use both kinds of types for many purposes. I want to list some

ORDER BY in aggregate functions: [...]

Our spec contains all the relevant parts to define ORDER BY as a generic addition to be used also by other aggregation functions, and the code in this PR has everything needed to implement such a generic ORDER BY for arbitrary aggregation functions. For details, see my earlier comment above.

generic UNFOLD: [...]

While, also for this one, our spec contains some spec text relevant for defining a generic "multi-BIND" (à la w3c/sparql-dev#6), in contrast to the ORDER BY case, doing so is a bit more complicated than directly copying over this spec text into the SPARQL spec. Yet, in the long run, I can certainly see the benefit of extending SPARQL with such a generic "multi-BIND" and, then, defining our UNFOLD as an application of the generic definition.

@hartig
Copy link
Contributor Author

hartig commented Jun 23, 2024

@afs apologies for the delay; after I came back from traveling, my university responsibilities required all my time.

I haven't managed to do a clean "git rebase" against the current main (which is the "rebase and merge" github option) which isn't that surprising. The rebased current PR seem to have conflicts all seem to be not between main and the PR but within the PR's history - which I can't explain.

The reason might be that some of the commits within the PR may be conflicting with one another. Right before creating the PR, I already rebased to the then-latest main and realized that some parts of the internal API of Jena had changed quite a bit (e.g., LiteralLabel is a final class now but was an interface in the version that I forked and extended), which meant that I had to change parts of my implementation.

When we're ready to merge, simply take diff of the current the PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo git apply the patch and make a new PR - single commit, with you as author.

Perfect.

Performance:

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may
change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Would it be an option to move the code of CDTAwareParserProfile into ParserProfileStd and, then, remove CDTAwareParserProfile altogether?

Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in ModLangParse, similar to --strict). If switched off, RiotLib would create a ParserProfileStd instance instead of a CDTAwareParserProfile instance.

Yet another option may also be to combine both of these ideas. What do you think?

Functionality:

[...] Jena needs a way to test teh functionality it ships and it does that by running test suites every build. [...]

For the stability, PRs needs tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. [...]

Of course. I can copy these tests.

I assume they would go into a new subdirectory under ./jena-arq/testing/, right?

After I have created the copy, where do I have to add a pointer to them such that they are run automatically during a build?

Currently, I run these tests manually using arq.rdftests. They all pass, except for four tests.

These four tests are list-functions/sameterm-03.rq, list-functions/sameterm-04.rq, map-functions/sameterm-03.rq, and map-functions/sameterm-04.rq. All of them check the behavior of SAMETERM. They did pass in my original implementation, but after I had to reimplement some things because the internal API of Jena had changed (the aforementioned change of LiteralLabel, in particular), they fail. I think it would be possible to make them pass again, but that would require adding a new constructor to LiteralLabel---namely, a constructor via which it is possible to provide both the lexical form and the value. Let me know if you want me to push a corresponding commit to this PR in order to demonstrate what exactly I mean, and if you don't like it, I can revert this commit. Alternatively, I may simply comment out the four failing tests for now.

@hartig
Copy link
Contributor Author

hartig commented Jul 5, 2024

@afs
I have now copied the manifest-driven tests from the SPARQL CDTs repo into this PR and integrated them to run automatically as unit tests (see commit 2549435). Of course, that increases the number of files in the PR even more ;-)

Additionally, related to the very last paragraph of my previous comment, I have changed the implementation such that the SAMETERM tests don't fail (see commit 9cb08f1). To this end, I had to add a new constructor to LiteralLabel and a corresponding create-function to LiteralLabelFactory. Please take a look and let me know whether you consider these changes appropriate.

(I am still waiting for our legal folks to advise me on how to proceed with creating the Software Grant Agreement.)

@afs afs changed the base branch from main to gh2518-cdt August 25, 2024 20:28
@afs
Copy link
Member

afs commented Aug 25, 2024

ASF now have a Software Grant for this PR.

"rebase and merge" and "merge commit" show conflicts. "Squash + merge" is showing as OK.

Taking a diff and applying it (a sort of cheap and hacky "squash") is no longer working - in part, due to some early RDF 1.2 work :-)

The Apache Jena main branch is protected - it does not allow forced pushes or uncommitting. Code needs to be finished and clean to put onto main.

I've created a new branch in the Jena repo gh2518-cdt which is the current state of main (2024-08-25) and changed the merge target of this PR branch to that branch (and "merge commit" has gone green which is strange).

My plan is to squash-merge onto that new branch and see what the state is and we can all test it out.

It may take more then one attempt to get this integrated.

@hartig - Please treat hartig:UnfoldAndFoldWithCompositeValues as frozen. If you want to make changes, clone of that branch in your repo.

Don't throw away your copy!

@hartig
Copy link
Contributor Author

hartig commented Aug 26, 2024

Thanks Andy! Let me know if you need my help with anything.

There is one thing remaining from the discussion above: You mentioned the following performance-related concerns regarding the CDTAwareParserProfile introduced in this PR.

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may
change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Related to this, I had a few proposals, which I am repeating here again:

Would it be an option to move the code of CDTAwareParserProfile into ParserProfileStd and, then, remove CDTAwareParserProfile altogether?

Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in ModLangParse, similar to --strict). If switched off, RiotLib would create a ParserProfileStd instance instead of a CDTAwareParserProfile instance.

Yet another option may also be to combine both of these ideas. What do you think?

@afs
Copy link
Member

afs commented Aug 26, 2024

We can't be sure until we have the CDT code combined with the current state of main.

For now, a separate CDTAwareParserProfile is better. This is an internal issue so it can change later.

The next step to make the branch on apache/jena then I'll ask you to review it.

It would also be good if the sparql-dev SEP could be done before finally adopting this code.

@hartig
Copy link
Contributor Author

hartig commented Aug 27, 2024

For now, a separate CDTAwareParserProfile is better. This is an internal issue so it can change later.

Okay.

The next step to make the branch on apache/jena then I'll ask you to review it.

Sounds good!

It would also be good if the sparql-dev SEP could be done before finally adopting this code.

I will ask @rdfguy again.

@afs
Copy link
Member

afs commented Aug 27, 2024

Removed "This closes" from the description. This isn't a PR to the default branch anymore.

Copy link
Member

@afs afs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a Software Grant from AWS on file at ASF.
Approved for merging the PR into a branch in the Jena repository.

@afs afs merged commit 037eb9b into apache:gh2518-cdt Aug 28, 2024
@afs
Copy link
Member

afs commented Aug 28, 2024

The merge causes the PR to be closed.

Please continue discussion on the issue #2518.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for SPARQL CDTs (lists and maps as literals)
5 participants