JENA-1430 #314

ajs6f · 2017-11-20T18:51:38Z

Includes #313, plus:

Extend testing to DatasetAssembler
Ensure that DatasetAssembler can also load quads
Correct ja:DatasetTxnMem => ja:MemoryDataset

afs · 2017-11-21T10:32:41Z

jena-fuseki2/examples/fuseki-in-mem-txn.ttl

+    fuseki:dataset <#dataset> ;
+.
+
+<#dataset> rdf:type ja:DatasetTxnMem;


This needs to change to ja:MemoryDataset.

Nice catch, thanks, fixed.

afs · 2017-11-21T11:22:24Z

jena-arq/src/main/java/org/apache/jena/sparql/core/assembler/DatasetAssembler.java

+            // Load data into the default graph or quads into the dataset.
+            multiValueAsString(root, data)
+                .forEach(dataURI -> read(ds, dataURI));
+


This interacts with the setting up of the dft graph/named graphs. It's going to add dft graph just made (which might have content via `ja:content) but also the named graphs. It's a dataset by links to the models where adding a model replaces the model.

I'm not sure what he best balance is here -- options include (1) warning and no action on seeing ja:data (with a note to use TIM) (2) do after all the named models are added because parsing is adding quads, not replacing models (3) allow this or specified models, which would be better done with TIM (4) it's ja:data or specify graphs, not a combination; if ja:data return a TIM.

(4) looks like a good way to encourage TIM usage. If (4) then a plain, empty ja:RDFDataset could be TIM as well.

I'm trying to think of a use case for "load quads into non-TIM" and the one that occurs to me is in an embedded or integrated situation where you have a lot of quads, like so many that you prefer the memory-parsimonious-ness of the general IM dataset, maybe because you have other processes running in the system. Sound likely enough to merit (2)?

That is to load specialised graphs e.g. inference? In which case the graphs themselves need to be pre-created before reading data. We already have a mechanism for graphs with ja:content on the graph description. Having two ways seems bad. Strange things will happen if a graph is in the quads and not explicitly created (it will be a plain graph when inference expected). It could be done be splitting the quads into separate graphs and generating the assembler text.

It might make sense to have some plain graphs and specialist graphs, that is a disjoint set, but to enforce this to stop accidents is not so easy without a lot of additional machinery to enforce disjoint.

How about doing this progressively - we make ja:RDFDataset create either TIM, or linked datasets. And for now, it is one or the other, not a combination. (We can have per graph data loading into TIM as it does at the moment with some care - look into the namedGhraph property to see if it has ja:data+name, and only those, if not, it's a general model).

We leave more exotic combinations until we get a request because that may help choose what to do with the corner cases like whether they overlap or not. If we provide now, the corner cases are fixed by compatibility.

Okay, if I get what you are saying, it's:

Check to see if quads are being loaded, if so, TIM.

Otherwise, check the named graphs. If they are all ja:data guys, then TIM again.

Otherwise, general dataset.

I'll get on this later today.

@afs What's a good idiom for switching to a new assembler? In other words, let's say the code tests for the presence of quads and finds them and it's time to pivot to TIM. Obviously, I could just new InMemDatasetAssembler(), but I'm thinking there must be a more elegant way. I looked at AssemblerUtils but only saw ways to register Assemblers, not retrieve them…

ajs6f · 2017-11-22T16:29:27Z

@afs What do you think of that? It's clearer, I think, along the lines you suggested.

afs · 2017-11-29T10:20:34Z

jena-core/src/main/java/org/apache/jena/assembler/assemblers/AssemblerBase.java

+     * @param root resource to check
+     * @param types types for which to check
+     */
+    protected void checkType( Resource root, Resource... types ) {


I don't understand this. An assembler resource must have one type so that there is a unique mapping to the implementation to call. I don't see in the code anywhere calling with 2+ arguments.

See the changes I made to InMemDatasetAssembler. That's where.

This happens because if you pass assembler RDF with RDFDataset, it could become a general-purpose dataset or it could become a TIM dataset, and so the TIM assembler has to be able to accept both types. IOW, "An assembler resource must have one type", yes, but the more important point is that the assembler code has to be able to accept more than one type.

We can change the semantics, instead, so that there is a one-to-one between types and assembler classes. and I am increasingly convinced we should, because if we are getting a bit confused about this, it's not going to be easy for users!

afs · 2017-11-29T10:20:45Z

jena-core/src/main/java/org/apache/jena/assembler/assemblers/AssemblerBase.java

+     * @param types types for which to check
+     */
+    protected void checkType( Resource root, Resource... types ) {
+        for (Resource type : types) if (root.hasProperty( RDF.type, type )) return;


Everything on one line is a bit confusing!

afs · 2017-11-29T10:47:11Z

jena-arq/src/main/java/org/apache/jena/sparql/core/assembler/InMemDatasetAssembler.java

    }

 	@Override
 	public Dataset open(final Assembler assembler, final Resource root, final Mode mode) {
-		checkType(root, DatasetAssemblerVocab.tDatasetTxnMem);


We ought to also check for use of the RDFDataset vocabulary on a MemoryDataset. e.g. ja:defaultGraph and then ja:namedGraph pointing to ja:graph and a ja:MemoryModel.

The reverse tests ought to be in general assembler.

One of my frustrations with this is that there were no tests of these assemblers at all when I started, so I'm not very sure about the current expected/guaranteed behavior. WRT to our comments above about the use of predicates, should we start migrating people away from ja:graph, or (probably better) instead restricting ja:data to only loading quads into a dataset?

afs · 2017-11-29T10:54:16Z

jena-arq/src/main/java/org/apache/jena/sparql/core/assembler/DatasetAssembler.java

        Dataset ds = createDataset(a, root, mode) ;
        return ds ;
    }

    public Dataset createDataset(Assembler a, Resource root, Mode mode) {
+        checkType(root, DatasetAssemblerVocab.tDataset);
+        // use TIM if quads are loaded or if all named Graphs are loaded via data property
+        final boolean allNamedGraphsLoadViaData = multiValueResource(root, pNamedGraph).stream().allMatch(g -> g.hasProperty(data));


A SPARQL query would be quite nicely here!

The test needs to include checking and rejecting mixed cases like ja:defaultGraph and ja:data on the dataset resource and both ja:graph ja:data from the object of ja:namedGraph.

Similarly, tMemoryDataset needs to check for general vocab.

That is, two tests: "hasGeneralDatasetVocab" and "isMemoryDatasetVocab" then check not true/true or false/false.

i don't understand the comment-- are you suggesting to use SPARQL instead of the API? If we are missing a test here, it means that I did not understand your plan for how to "distribute" the predicates amongst the types of dataset. In fact, I'm starting to wonder if that's what we should do after all, or whether it would be better to deprecate some of them, since they overlap, or confine the meanings of those that overlap so that they don't...

We can't change the predicates due to existing deployed assemblers. We can check for consistency e.g. currently, this works:

<#dataset> rdf:type ja:MemoryDataset ; ja:data <file:D.trig>; ja:namedGraph [ ]; .

(loops on properties of the object of namedGraph so if there are none ...)
as does this:

<#dataset> rdf:type ja:MemoryDataset ; ja:data <file:D.trig>; ja:defaultGraph [ a ja:MemoryModel ]; .

(ignores ja:defaultGraph)

afs · 2017-11-29T10:54:31Z

jena-arq/src/main/java/org/apache/jena/sparql/core/assembler/DatasetAssembler.java

+        checkType(root, DatasetAssemblerVocab.tDataset);
+        // use TIM if quads are loaded or if all named Graphs are loaded via data property
+        final boolean allNamedGraphsLoadViaData = multiValueResource(root, pNamedGraph).stream().allMatch(g -> g.hasProperty(data));
+        if (root.hasProperty(data) || allNamedGraphsLoadViaData) return new InMemDatasetAssembler().open(a, root, mode);


All on one line is really quite unclear. Have to live with Java.

Much clearer than otherwise, to my eye. This is style, and I'm happy to change this to be clearer for you, but it's not an objective question.

afs · 2017-11-29T10:57:59Z

jena-arq/src/main/java/org/apache/jena/sparql/core/assembler/DatasetAssemblerVocab.java

@@ -31,7 +31,7 @@
    // General dataset
    public static final Resource tDataset            = ResourceFactory.createResource(NS+"RDFDataset") ;
    // In-memory dataset
-    public static final Resource tDatasetTxnMem      = ResourceFactory.createResource(NS+"DatasetTxnMem") ;
+    public static final Resource tMemoryDataset      = ResourceFactory.createResource(NS+"MemoryDataset") ;


@ajs6f correctly pointed out that DatasetTxnMem was in the documentation so migration support would be good such as register InMemDatasetAssembler under both names and log a warning if called as ja:DatasetTxnMem

(And text for the release notes)

Yes, we can do that.

I've put back tDatasetTxnMem.

afs · 2017-11-29T11:00:12Z

jena-arq/src/test/java/org/apache/jena/sparql/core/assembler/TestDatasetAssembler.java

+import org.junit.Assert;
+import org.junit.Test;
+
+public class TestDatasetAssembler extends Assert {


Having one or two tests that works from a Turtle string or would make for clear tests of expected use.

With test for bad cases of mixed usage (general and TIM vocabulary together) text written test cases will help generating tests.

Sure, a couple of Turtle assembler files would make for nice examples. Do we have a standard practice about how to load test files from the classpath during a build?

I don't understand the second sentence. Can you restate that?

Generally, tests are loaded from a testing/ directory, not the classpath. Some tests need to read real, normal files and also the W3C SPARQL tests run from on-disk files and manifests. Once some tests do need files, there isn't much value in classpath loading as its extra, not instead of. Assemblers need to read from files, it being the common usecase.

I was lookign for tests that do both ja:data and ja:defaultGraph, which should be an error. I saw someone asking recently trying with ja:defaultGraph and TIM assemble, which just ignores unknown properties. Easy confusion to make - so deliberately testing making an error seems sensible.

afs · 2017-11-29T11:18:16Z

jena-arq/src/main/java/org/apache/jena/sparql/core/assembler/DatasetAssembler.java

-                    Log.warn(this, "Use of old vocabulary: use :graph not :graphData");
-                } else {
-                    throw new DatasetAssemblerException(root, "no graph for: " + gName);
+        Txn.executeWrite(ds, () -> {


Side note: Because the general dataset can't support abort, a Txn is nice but of little help; if anything goes wrong assembling, it's broken.

Yes, I just use it more or less for uniformity. I want our code to model best practices.

afs · 2017-12-07T14:21:03Z

Merged in order to progress Jena 3.6.0.

This closes #314.

ajs6f · 2017-12-07T15:18:40Z

Ouch! I realize I wasn't clear enough at all in my last comment: I was actually -1 on this. I think the confusion that ja:data now has in its semantics is too much, and I completely disagree with:

We can't change the predicates due to existing deployed assemblers

We change the public API when we must, we just do it carefully and after clear depreciation with a clear migration path. What's more, the docs have been wrong about the assembler type for TIM for a year or more, so I don't think there will be users waiting with pitchforks and torches!

I'd like to either revert this merge or add a further commit adding the use of ja:graph to TIM's assembler (just like the general dataset assembler) and depreciating the use of ja:data for anything but quads with the expectation that we will disallow it in a near-future version. I can do that after a release, if that is easier.

asfbot · 2017-12-07T15:25:11Z

Claude Warren on dev@jena.apache.org replies:
Q: What about when the tests are used by subclasses in a different
package. The test files would then need to be loaded from the class path.
I ran into this issue trying to run tests on graph implementations before.
However, it my not apply in this specific case with assemblers. So I am
asking, is there an issue here if the test is subclassed in a different
project and includes these tests via the test-jar packaging?

Claude

afs · 2017-12-07T15:50:40Z

@ajs6f I do not understand your point - it is too abstract talking about "predicates". Some examples maybe?

How do you deprecate in Turtle assembler files?

I don't understand about ja:data and quads, triples - that seems natural name. It would be simple to check the source if triples when loading a graph. The converse, loading quads should accept triples - a single graph in the dataset is the most common use case.

My original PR #313 was just to make {{ja:data "filename"}} work, based on experience with <file:> - that has relative name confusion.

ajs6f · 2017-12-07T16:06:01Z

I'm saying that ja:data is now used for loading quads into two different kinds of datasets, but loading triples into only one. That's confusing at best. I want to make it clearer by getting rid of the second use, because we already have ja:graph, and then what about ja:content? But I'm happy to hear other ways to clarify the semantics-- I just think that we've made things more powerful but more confusing with this PR. I want to keep the power (load quads easily, which we just did) while not making it harder to understand how to build assembler RDF (which I think we just did).

Re: deprecation: we can deprecate with comments and by throwing warnings from inside the assembler code. And again, we have good reason (the mistake in the docs that no one noticed enough to complain about) to think that it won't be too disruptive.

asfbot · 2017-12-07T16:08:33Z

Andy Seaborne on dev@jena.apache.org replies:
Claude - it isn't subclassed.

Some tests are of assemblers themselves reading real, normal RDF data
files because that is what the assembler is supplied to do. It isn't a
case of putting in a jar.

 Andy

afs · 2017-12-07T16:30:07Z

ja:data isn't mentioned in DatasetAssembler except to redirect to TIM. Which two different kinds of dataset?

Would you like to move public static final Property data = property( "data" ); out of JA.java into the assembler vocab definitions file?

We need a way to say:

<#dataset> a ja:MemoryDataset ;
  ja:data "File for the default graph - also data for dataset" ;
  ja:namedGraph [ ja:graphName "http://example/graph1"; ja:data "File for graph1" ] ;
  ja:namedGraph [ ja:graphName "http://example/graph2"; ja:data "File for graph2" ] ;
.

because that gets a fully transactional dataset and I'd hope our preferred way to do have plain data in a memory dataset. Having data-for-triples, data-for-quads gets weird when its an unknown URL to load from for the default graph.

ja:Content is for ja:Models in ja:RDFDataset only. It is too old (= documented and written about outside the project) to drop. It is needed for inference setups.

afs · 2017-12-08T14:17:27Z

Can we separate the different parts of this and unblock 3.6.0?

Is the following acceptable for 3.6.0:

JENA-1430: Read quads for ja:data by filename #313 fixed a specific problem - <file:filename> and "filename" behave differently for resolution when the assembler file is not in the current directory of the JVM as happens in Fuseki.
Make ja:MemoryDataset the class for the TIM assembler, with ja:DatasetTxnMem registered for backwards compatibility.
Add the Fuseki example for using ja:MemoryDataset.

ajs6f · 2017-12-08T19:59:13Z

The two different datasets are TIM and general. The fact that the code for the general dataset assembler doesn't mention ja:data is fine, but because TIM's assembler does understand ja:data and some assembler RDF with the general dataset type can get shunted to TIM, ja:data can end up being successfully used with well-defined semantics in assembler RDF with the general dataset type.

This is what I mean by the whole thing having gotten a bit confusing! 😁

I agree fully that we need to be able to do what you clearly intend to do with your example. I just want the the example to look like:

<#dataset> a ja:MemoryDataset ;
  ja:data "File for the default graph - also data for dataset" ;
  ja:namedGraph [ ja:graphName "http://example/graph1"; ja:graph "File for graph1" ] ;
  ja:namedGraph [ ja:graphName "http://example/graph2"; ja:graph "File for graph2" ] ;
.

instead so that TIM and general look the same from the assembler POV and so that the meanings of ja:data and ja:graph are as clear and simple as practical.

I don't want to block 3.6.0 on this, so what do you say to your three points above, plus (after 3.6.0)

We gracefully deprecate the use of ja:data for anything other than loading quads into datasets and make TIM's assembler RDF look like the general dataset assembler RDF modulo the dataset type.

Sound okay?

afs · 2017-12-08T20:51:08Z

I'm certainly willing to discuss it, that's what JENA-1445 is about and we can rename it to be more accurate.

I'm not able to agree to the details at this stage. I have several outstanding questions and also outstanding points raised with both its name and semantics. There is also the status of ja:externalContent playing into the mix which hasn't been explained.

ajs6f · 2017-12-08T20:58:04Z

Okay, then let's break this thing up, as you describe above. Your number 2 and 3 are uncontroversial. As for number 1, sure, I want that fix too, but this PR (#314) didn't do just that. It also conflated the assembler RDF for these two types of dataset and I now see that TIM's use of ja:data for two purposes is a big part of that. Let's revert this and merge #313 instead and we can take up JENA-1445 after 3.6.0.

afs · 2017-12-11T16:27:51Z

Required reverts done, #313 applied and ja:MemoryDataset made the primary type name for TIM assembler.

afs and others added 2 commits November 20, 2017 17:31

JENA-1430: Read quads for ja:data by filename

d174ec0

JENA-1430: Quad loading for in-memory assemblers

3e13dc6

afs reviewed Nov 21, 2017

View reviewed changes

afs mentioned this pull request Nov 21, 2017

JENA-1430: Read quads for ja:data by filename #313

Closed

ajs6f added 2 commits November 21, 2017 09:04

Correcting dataset type in Fuseki example

462b5a3

Better, clearer handling of mixed assembling

10ff0be

afs reviewed Nov 29, 2017

View reviewed changes

asfgit merged commit 10ff0be into apache:master Dec 7, 2017

asfgit pushed a commit that referenced this pull request Dec 7, 2017

JENA-1340: Merge commit 'refs/pull/314/head' of github.com:apache/jena

74cbc63

This closes #314.

JENA-1430 #314

JENA-1430 #314

Conversation

ajs6f commented Nov 20, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ajs6f commented Nov 22, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afs commented Dec 7, 2017

ajs6f commented Dec 7, 2017

asfbot commented Dec 7, 2017

afs commented Dec 7, 2017 • edited

ajs6f commented Dec 7, 2017

asfbot commented Dec 7, 2017

afs commented Dec 7, 2017

afs commented Dec 8, 2017 • edited

ajs6f commented Dec 8, 2017 • edited

afs commented Dec 8, 2017

ajs6f commented Dec 8, 2017

afs commented Dec 11, 2017

afs commented Dec 7, 2017 •

edited

afs commented Dec 8, 2017 •

edited

ajs6f commented Dec 8, 2017 •

edited