Adds support for SPARQL CDTs (lists and maps as literals) #2501

hartig · 2024-05-28T15:27:25Z

The lack of built-in support for generic types of composite values such as lists and maps has been a long-standing issue for RDF and SPARQL. Together with a few other colleagues at the Amazon Neptune team we have developed an approach to represent lists and maps as literals in RDF data, and to extend SPARQL with features related to such literals. These extensions of SPARQL include:

an aggregation function to produce these composite values (FOLD),
functions to operate on these composite values in expressions, and
a new operator (UNFOLD) to unfold such composite values into their individual components.

We have created a complete formal specification of the approach and a comprehensive test suite for implementers, which can be found in our Github repo: https://github.com/awslabs/SPARQL-CDTs

... and I have implemented a complete integration of this approach into Jena, as can be found in this PR. We would like to contribute this implementation to Jena if you are interested, and I will be more than happy to assist you with getting the PR ready to be merged.

Perhaps before you dive into the aforementioned specification, you may take a look at our short paper, in which we provide a slightly more extensive motivation for this work and a (very!) brief summary of the approach. After that, Section 2 of the specification provides a more detailed informal description of the different parts of the approach.

I am happy to answer any questions that you may have, both about the approach in general and about the implementation in this PR. Also, if you have issues with some parts of the specification, feel free to create an issue in the aforementioned Github repo. (And in case you are wondering, yes we are planning to file the approach as a SPARQL Enhancement Proposal (SEP) for the SPARQL-dev Community Group).

Tests are included.
Documentation change and updates are provided for the Apache Jena website
Commits have been squashed to remove intermediate development commit messages.
Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

…jena into UnfoldAndFoldWithCompositeValues

…formed (because their lexical form has already been checked)

…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4

…nges necessary for the query engine to handle this new operator (but without invoking any actual functionality if it is present in a query)

…eate no assignment

…result in a relevant literal, create no assignment

…ents

…placeholders for adding special functionality for such types of literals

…terals, and proper integration into the Jena Graph API (via special LiteralLabel implementations)

…er special case for the equals (=) operator in SPARQL

… literals and cdt:Map literals

…hat LiteralLabel is not anymore an interface but a *final* class

…RDER BY of CDT literals didn't work correctly anymore)

…o related to awslabs/SPARQL-CDTs#5

…jena into UnfoldAndFoldWithCompositeValues

hartig · 2024-06-04T22:39:14Z

@afs I am done with rebasing now.

mvn clean install still runs without problems, and I can also run our SPARQL CDTs test suite.

afs · 2024-06-05T12:40:53Z

@afs I am done with rebasing now.

Excellent.

If I set to "Rebase and Merge" there is something showing.
148 commits in (commit 9cc5647).

Part of the conflict is

both added:      jena-arq/src/main/java/org/apache/jena/sparql/syntax/ElementUnfold.java

which suggests both parts are for this PR. Several I looked at are the same update, very similar via two different routes.

The other github merge strategies do allow merge although it does not guarantee git will pick the right option.

It doesn't like a "semantic" (!!) problem -- I'll investigate further.

afs · 2024-06-08T20:17:31Z

The rebasing has made it a clear which files have changed and that makes reviewing easier - thanks.

I haven't managed to do a clean "git rebase" against the current main (which is the "rebase and merge" github option) which isn't that surprising. The rebased current PR seem to have conflicts all seem to be not between main and the PR but within the PR's history - which I can't explain.

However, the rebasing on the PR has also made that unnecessary.

When we're ready to merge, simply take diff of the current the PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo git apply the patch and make a new PR - single commit, with you as author.

Now, thinking about the contribution --

There are two considerations:

The majority of Jena users are depending on Jena's stability, including performance.
There is a need to have ways to get RDF/SPARQL emerging ideas into the wider field for practical experience.

There is a realistic chance that SPARQL-CDTs will evolve and that may be in incompatible ways. By being clear this is "experimental", I don't think that is a problem if not using the features means no impact. I (non-PMC chair opinion) am keen to see new tech being available to users, LATERAL being a recent example. (Disclosure: Olaf and I are both involved in RDF 1.2 where issues of evolution are important.)

Performance:

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may
change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Functionality:

While CDTs are experimental, and likely to change, Jena needs a way to test teh functionality it ships and
it does that by running test suites every build. An Apache project is more than open source - the code needs to be able to build. Jena releases the repository source tree - the binary artifacts ("convenience binaries") are secondary.

For the stability, PRs needs tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. The W3C RDF/SPARQL tests in th ejena repo are the same status - local copies.

Technical points:
(this is for later - not required for accepting this PR).

1: It may be better to do CDT's as StreamRDF processing rather
than as a parse profile, if the blank node mapping, and while we're at it, the prefix
mapping, can be made available to the stream which they aren't at the moment.

2: When thinking about persistent storage, the bnode node datastructures may need
to a have a longer lifetime that the parser run including persistent storage.
"Test to create a cdt:List with blank nodes via STRDT" for example.

SimonBin · 2024-06-19T16:12:50Z

I have some high-level comments

related prior art from Stardog: https://www.stardog.com/blog/extending-the-solution
why do we need cdt:List and cdt:Map when the syntax already decides the object type (cf. in JenaX we have a single JSON data type)

I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword FOLD and UNFOLD, whereas the first is an aggregate function operator with the additional support for ORDER BY, and the latter is a custom operator which seems to me like a special case implementation of multi-binding BIND (w3c/sparql-dev#6) for cdt:Lists.

However, we could use both kinds of types for many purposes. I want to list some

ORDER BY in aggregate functions:

ordering for GROUP_CONCAT
ordering for geof:aggCollect
ordering for geof:aggConcatLines

generic UNFOLD:

further links:

issue 6 : Keyword for functions that can produce multiple bindings w3c/sparql-dev#6
the SEP opened by @hartig : SPARQL CDTs: SPARQL Extensions for Composite Datatypes (Lists and Maps) w3c/sparql-dev#197

Appendix

comparison of JenaX syntax vs. CDT

CDT	JenaX
`SELECT ?name (FOLD(?person) AS ?list)`	`SELECT ?name (array:collect(?person ) AS ?list)`
`UNFOLD (?list AS ?element)`	`?list array:unnest ?element`
`BIND(cdt:get(?map, "address") AS ?addressSubMap)`	`BIND(json:get(?map, "address") AS ?addressSubMap)`
`BIND(cdt:put(?addressSubMap, "number", 42) AS ?newAddressSubMap)`	`BIND(json:set(?addressSubMap, "number", 42) AS ?newAddressSubMap)`
`"[ '1999-08-16'^^<http://www.w3.org/2001/XMLSchema#date>, 42 ]"^^cdt:List`	`"'1999-08-16'^^<http://www.w3.org/2001/XMLSchema#date> 42"^^norse:array`

Further ideas present in JenaX:

set datatype in addition to list/array
support for XML, XPath and JsonPath

explode map or array into many variables (not many bindings), shortcut for many BINDs
Example:

 VALUES ?json { "{'a': 2, 'b': 5}"^^norse:json }
 ?json json:explode ("json_") .
 # now do something with ?json_a and ?json_b

hartig · 2024-06-20T15:46:32Z

[@SimonBin] I have some high-level comments

related prior art from Stardog: https://www.stardog.com/blog/extending-the-solution

Thanks for the pointer. I knew that Stardog had something similar, but I wasn't aware of this blog post. Looking at it now, I see some relevant differences: The Stardog approach changes the notion of a solution mapping by extending the co-domain of these mappings with two new kinds of elements (namely, solution mappings themselves, which makes the notion recursive, and some kind of array). While I cannot find any formal specification of this approach, I can see that extending the definition of solution mapping in the SPARQL spec in this way would have a wide-ranging impact on most of the formal parts of the spec (the definition of every operator that processes solution mappings would need to be adapted). In contrast, adding a new operator (UNFOLD) and a new aggregation function (FOLD), as done by our approach, would be a comparably modest change/extension to the SPARQL spec (and we already provide all relevant parts of this extension explicitly in our SPARQL CDTs spec, ready to be copied directly into the SPARQL spec; rather than having "only" an implementation of an idea described informally in a blog post). Moreover, the relationship between Stardog's approach (which seems to be only about SPARQL) and the RDF data model is also not clear to me from the blog post (e.g., what happens if a solution mapping that contains such an array or a nested solution mapping is passed to a CONSTRUCT pattern?), whereas our approach is simply based on RDF literals and, thus, is clearly within the realm of RDF as is.

why do we need cdt:List and cdt:Map when the syntax already decides the object type (cf. in JenaX we have a single JSON data type)

Some vendors may choose to support only one or the other. Moreover, a potential future addition may be a cdt:SortedMap datatype which would likely have the same lexical space as cdt:Map, or a cdt:Set (as you mention JenaX has).

I am particularly curious about the changes to the SPARQL standard without providing a generic solution. Broadly speaking, the CDT proposal suggests two new keyword FOLD and UNFOLD, whereas the first is an aggregate function operator with the additional support for ORDER BY, and the latter is a custom operator which seems to me like a special case implementation of multi-binding BIND (w3c/sparql-dev#6) for cdt:Lists.

However, we could use both kinds of types for many purposes. I want to list some

ORDER BY in aggregate functions: [...]

Our spec contains all the relevant parts to define ORDER BY as a generic addition to be used also by other aggregation functions, and the code in this PR has everything needed to implement such a generic ORDER BY for arbitrary aggregation functions. For details, see my earlier comment above.

generic UNFOLD: [...]

While, also for this one, our spec contains some spec text relevant for defining a generic "multi-BIND" (à la w3c/sparql-dev#6), in contrast to the ORDER BY case, doing so is a bit more complicated than directly copying over this spec text into the SPARQL spec. Yet, in the long run, I can certainly see the benefit of extending SPARQL with such a generic "multi-BIND" and, then, defining our UNFOLD as an application of the generic definition.

hartig · 2024-06-23T17:19:59Z

@afs apologies for the delay; after I came back from traveling, my university responsibilities required all my time.

I haven't managed to do a clean "git rebase" against the current main (which is the "rebase and merge" github option) which isn't that surprising. The rebased current PR seem to have conflicts all seem to be not between main and the PR but within the PR's history - which I can't explain.

The reason might be that some of the commits within the PR may be conflicting with one another. Right before creating the PR, I already rebased to the then-latest main and realized that some parts of the internal API of Jena had changed quite a bit (e.g., LiteralLabel is a final class now but was an interface in the version that I forked and extended), which meant that I had to change parts of my implementation.

When we're ready to merge, simply take diff of the current the PR (the URL is https://github.com/apache/jena/pull/2501.diff), start a new branch in your repo git apply the patch and make a new PR - single commit, with you as author.

Perfect.

Performance:

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may
change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Would it be an option to move the code of CDTAwareParserProfile into ParserProfileStd and, then, remove CDTAwareParserProfile altogether?

Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in ModLangParse, similar to --strict). If switched off, RiotLib would create a ParserProfileStd instance instead of a CDTAwareParserProfile instance.

Yet another option may also be to combine both of these ideas. What do you think?

Functionality:

[...] Jena needs a way to test teh functionality it ships and it does that by running test suites every build. [...]

For the stability, PRs needs tests and there aren't any in this PR. I see that there are AWSlabs SPARQL-CDTs manifest-driven tests (Apache Licensed). One possibility is adding the current state of them to the PR (if they pass). To be clear -- it's a copy, nothing more; AWSlabs retains the defining material and manages its evolution. [...]

Of course. I can copy these tests.

I assume they would go into a new subdirectory under ./jena-arq/testing/, right?

After I have created the copy, where do I have to add a pointer to them such that they are run automatically during a build?

Currently, I run these tests manually using arq.rdftests. They all pass, except for four tests.

These four tests are list-functions/sameterm-03.rq, list-functions/sameterm-04.rq, map-functions/sameterm-03.rq, and map-functions/sameterm-04.rq. All of them check the behavior of SAMETERM. They did pass in my original implementation, but after I had to reimplement some things because the internal API of Jena had changed (the aforementioned change of LiteralLabel, in particular), they fail. I think it would be possible to make them pass again, but that would require adding a new constructor to LiteralLabel---namely, a constructor via which it is possible to provide both the lexical form and the value. Let me know if you want me to push a corresponding commit to this PR in order to demonstrate what exactly I mean, and if you don't like it, I can revert this commit. Alternatively, I may simply comment out the four failing tests for now.

…rwise SAMETERM is wrong

hartig · 2024-07-05T10:01:10Z

@afs
I have now copied the manifest-driven tests from the SPARQL CDTs repo into this PR and integrated them to run automatically as unit tests (see commit 2549435). Of course, that increases the number of files in the PR even more ;-)

Additionally, related to the very last paragraph of my previous comment, I have changed the implementation such that the SAMETERM tests don't fail (see commit 9cb08f1). To this end, I had to add a new constructor to LiteralLabel and a corresponding create-function to LiteralLabelFactory. Please take a look and let me know whether you consider these changes appropriate.

(I am still waiting for our legal folks to advise me on how to proceed with creating the Software Grant Agreement.)

afs · 2024-08-25T20:52:09Z

ASF now have a Software Grant for this PR.

"rebase and merge" and "merge commit" show conflicts. "Squash + merge" is showing as OK.

Taking a diff and applying it (a sort of cheap and hacky "squash") is no longer working - in part, due to some early RDF 1.2 work :-)

The Apache Jena main branch is protected - it does not allow forced pushes or uncommitting. Code needs to be finished and clean to put onto main.

I've created a new branch in the Jena repo gh2518-cdt which is the current state of main (2024-08-25) and changed the merge target of this PR branch to that branch (and "merge commit" has gone green which is strange).

My plan is to squash-merge onto that new branch and see what the state is and we can all test it out.

It may take more then one attempt to get this integrated.

@hartig - Please treat hartig:UnfoldAndFoldWithCompositeValues as frozen. If you want to make changes, clone of that branch in your repo.

Don't throw away your copy!

hartig · 2024-08-26T06:41:33Z

Thanks Andy! Let me know if you need my help with anything.

There is one thing remaining from the discussion above: You mentioned the following performance-related concerns regarding the CDTAwareParserProfile introduced in this PR.

There is one place that needs investigation that I have found so far which is the mixes this - CDTAwareParserProfile is on the critical path for all parsing whether CDT's are used or not.

The subclass quickly tests whether to invoke the superclass but parsing is sensitive to the JIT ptimizer. Parsing runs at about one/two microseconds per triple which isn't that many instructions and the new may
change what optimizations the JIT performs.

It may well make no measurable difference because the optimizer nowadays has ways to implement virtual method calls as direct function calls. The only way to find out is to try.

Related to this, I had a few proposals, which I am repeating here again:

Would it be an option to move the code of CDTAwareParserProfile into ParserProfileStd and, then, remove CDTAwareParserProfile altogether?

Another option may be to enable users to switch off support for CDT literals (e.g., via an argument in ModLangParse, similar to --strict). If switched off, RiotLib would create a ParserProfileStd instance instead of a CDTAwareParserProfile instance.

Yet another option may also be to combine both of these ideas. What do you think?

afs · 2024-08-26T20:21:31Z

We can't be sure until we have the CDT code combined with the current state of main.

For now, a separate CDTAwareParserProfile is better. This is an internal issue so it can change later.

The next step to make the branch on apache/jena then I'll ask you to review it.

It would also be good if the sparql-dev SEP could be done before finally adopting this code.

hartig · 2024-08-27T05:48:03Z

For now, a separate CDTAwareParserProfile is better. This is an internal issue so it can change later.

Okay.

The next step to make the branch on apache/jena then I'll ask you to review it.

Sounds good!

It would also be good if the sparql-dev SEP could be done before finally adopting this code.

I will ask @rdfguy again.

afs · 2024-08-27T08:08:37Z

Removed "This closes" from the description. This isn't a PR to the default branch anymore.

afs

There is a Software Grant from AWS on file at ASF.
Approved for merging the PR into a branch in the Jena repository.

afs · 2024-08-28T09:16:44Z

The merge causes the PR to be closed.

Please continue discussion on the issue #2518.

…2501)

hartig and others added 30 commits March 4, 2024 14:25

adds license comment at the begin oft he files

6e51262

Merge branch 'UnfoldAndFoldWithCompositeValues' of github.com:hartig/…

ad867da

…jena into UnfoldAndFoldWithCompositeValues

permanent URI for the CDT datatypes and functions

5329001

removes the InputStream versions of parseListLiteral and parseMapLiteral

2b29e61

adds the possibility to create CDT literals that are known to be ill-…

9476dec

…formed (because their lexical form has already been checked)

changes the parsing of CDT literals such that the scope of bnode iden…

05efa88

…tifiers extends to all CDT literals within a file as well as to the file itself; as per awslabs/SPARQL-CDTs#2 and awslabs/SPARQL-CDTs#4

Merge branch 'apache:main' into master

c49821c

initial addition of Unfold to the ARQ grammar, including all code cha…

9cc5647

…nges necessary for the query engine to handle this new operator (but without invoking any actual functionality if it is present in a query)

extends ElementTransform to cover ElementUnfold

920ed9f

extends ElementVisitor to cover ElementUnfold

4e12e3d

initial functional version of QueryIterUnfold

08e98d8

enable visiting of ElementUnfold

85a6d6e

fully working version of QueryIterUnfold

a705c97

bug fix: if evaluating the expression of an UNFOLD operator fails, cr…

78aa795

…eate no assignment

bug fix: if evaluating the expression of an UNFOLD operator does not …

73ac3a9

…result in a relevant literal, create no assignment

adds support for UNFOLDing of untyped maps

ff0fd1f

adds FOLD

736b435

extends the implementation of FOLD to actually support the type argum…

47d0b94

…ents

adds support for collection-related functions

54a8f2a

adds POC of a 'concat' function for lists

48ba8a3

creates Node_LiteralWithList and Node_LiteralWithMap, so far only as …

bca1338

…placeholders for adding special functionality for such types of literals

adds an actual parser for the lexical form of cdt:List and cdt:Map li…

e469c7c

…terals, and proper integration into the Jena Graph API (via special LiteralLabel implementations)

updating the URI prefix 'cdt:'

4613165

adds the constructor function cdt:List

4b887b9

adds comparison of cdt:List literals and of cdt:Map literals as anoth…

d5d1f5d

…er special case for the equals (=) operator in SPARQL

correct recursive implementation of 'equals' in 'CDTValue'

5c55d9e

adds the cdt:contains function

61d5108

remove debug printing

ec226f9

adds CDTLiteralParserBase to create special Node objects for cdt:List…

e2afad3

… literals and cdt:Map literals

bug fix: avoid NPE

bd68c99

hartig and others added 8 commits June 4, 2024 21:52

a few more fixes of issues that resulted from rebasing

aaff3ec

after rebasing, changes the CDT implementation to adapt to the fact t…

451a6d9

…hat LiteralLabel is not anymore an interface but a *final* class

further adaption to the reorganized NodeValue class after rebasing (O…

280bb15

…RDER BY of CDT literals didn't work correctly anymore)

a bit more cleaning up after rebasing

4dd0cc2

corrects the handling of = and < for CDT literals after rebasing, als…

6ca6d45

…o related to awslabs/SPARQL-CDTs#5

adds license header to CDTAwareParserProfile

b3bc222

Merge branch 'UnfoldAndFoldWithCompositeValues' of github.com:hartig/…

872d834

…jena into UnfoldAndFoldWithCompositeValues

fixes issues that resulted from rebasing

1670e39

hartig added 2 commits July 4, 2024 22:36

copies the test suite from the SPARQL-CDTs repo

2549435

CDT literals must be create using both their lex.form and value; othe…

9cb08f1

…rwise SAMETERM is wrong

afs changed the base branch from main to gh2518-cdt August 25, 2024 20:28

afs approved these changes Aug 28, 2024

View reviewed changes

afs merged commit 037eb9b into apache:gh2518-cdt Aug 28, 2024

afs pushed a commit that referenced this pull request Aug 28, 2024

GH-2518: Adds support for SPARQL CDTs (lists and maps as literals) (#…

a8db296

…2501)

afs pushed a commit that referenced this pull request Aug 28, 2024

GH-2518: Adds support for SPARQL CDTs (lists and maps as literals) (#…

094befd

…2501)

afs pushed a commit that referenced this pull request Sep 7, 2024

GH-2518: Adds support for SPARQL CDTs (lists and maps as literals) (#…

2dfa403

…2501)

afs pushed a commit that referenced this pull request Sep 10, 2024

GH-2518: Adds support for SPARQL CDTs (lists and maps as literals) (#…

b310680

…2501)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adds support for SPARQL CDTs (lists and maps as literals) #2501

Adds support for SPARQL CDTs (lists and maps as literals) #2501

hartig commented May 28, 2024 •

edited by afs

Loading

hartig commented Jun 4, 2024

afs commented Jun 5, 2024 •

edited

Loading

afs commented Jun 8, 2024 •

edited

Loading

SimonBin commented Jun 19, 2024

hartig commented Jun 20, 2024

hartig commented Jun 23, 2024

hartig commented Jul 5, 2024

afs commented Aug 25, 2024

hartig commented Aug 26, 2024

afs commented Aug 26, 2024 •

edited

Loading

hartig commented Aug 27, 2024

afs commented Aug 27, 2024

afs left a comment

afs commented Aug 28, 2024

Adds support for SPARQL CDTs (lists and maps as literals) #2501

Adds support for SPARQL CDTs (lists and maps as literals) #2501

Conversation

hartig commented May 28, 2024 • edited by afs Loading

hartig commented Jun 4, 2024

afs commented Jun 5, 2024 • edited Loading

afs commented Jun 8, 2024 • edited Loading

SimonBin commented Jun 19, 2024

hartig commented Jun 20, 2024

hartig commented Jun 23, 2024

hartig commented Jul 5, 2024

afs commented Aug 25, 2024

hartig commented Aug 26, 2024

afs commented Aug 26, 2024 • edited Loading

hartig commented Aug 27, 2024

afs commented Aug 27, 2024

afs left a comment

Choose a reason for hiding this comment

afs commented Aug 28, 2024

hartig commented May 28, 2024 •

edited by afs

Loading

afs commented Jun 5, 2024 •

edited

Loading

afs commented Jun 8, 2024 •

edited

Loading

afs commented Aug 26, 2024 •

edited

Loading