Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JENA-1430 #314

Merged
merged 4 commits into from Dec 7, 2017
Merged

JENA-1430 #314

merged 4 commits into from Dec 7, 2017

Conversation

ajs6f
Copy link
Member

@ajs6f ajs6f commented Nov 20, 2017

Includes #313, plus:

  • Extend testing to DatasetAssembler
  • Ensure that DatasetAssembler can also load quads
  • Correct ja:DatasetTxnMem => ja:MemoryDataset

fuseki:dataset <#dataset> ;
.

<#dataset> rdf:type ja:DatasetTxnMem;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to change to ja:MemoryDataset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, thanks, fixed.

// Load data into the default graph or quads into the dataset.
multiValueAsString(root, data)
.forEach(dataURI -> read(ds, dataURI));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This interacts with the setting up of the dft graph/named graphs. It's going to add dft graph just made (which might have content via `ja:content) but also the named graphs. It's a dataset by links to the models where adding a model replaces the model.

I'm not sure what he best balance is here -- options include (1) warning and no action on seeing ja:data (with a note to use TIM) (2) do after all the named models are added because parsing is adding quads, not replacing models (3) allow this or specified models, which would be better done with TIM (4) it's ja:data or specify graphs, not a combination; if ja:data return a TIM.

(4) looks like a good way to encourage TIM usage. If (4) then a plain, empty ja:RDFDataset could be TIM as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to think of a use case for "load quads into non-TIM" and the one that occurs to me is in an embedded or integrated situation where you have a lot of quads, like so many that you prefer the memory-parsimonious-ness of the general IM dataset, maybe because you have other processes running in the system. Sound likely enough to merit (2)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is to load specialised graphs e.g. inference? In which case the graphs themselves need to be pre-created before reading data. We already have a mechanism for graphs with ja:content on the graph description. Having two ways seems bad. Strange things will happen if a graph is in the quads and not explicitly created (it will be a plain graph when inference expected). It could be done be splitting the quads into separate graphs and generating the assembler text.

It might make sense to have some plain graphs and specialist graphs, that is a disjoint set, but to enforce this to stop accidents is not so easy without a lot of additional machinery to enforce disjoint.

How about doing this progressively - we make ja:RDFDataset create either TIM, or linked datasets. And for now, it is one or the other, not a combination. (We can have per graph data loading into TIM as it does at the moment with some care - look into the namedGhraph property to see if it has ja:data+name, and only those, if not, it's a general model).

We leave more exotic combinations until we get a request because that may help choose what to do with the corner cases like whether they overlap or not. If we provide now, the corner cases are fixed by compatibility.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, if I get what you are saying, it's:

  1. Check to see if quads are being loaded, if so, TIM.
  2. Otherwise, check the named graphs. If they are all ja:data guys, then TIM again.
  3. Otherwise, general dataset.

I'll get on this later today.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@afs What's a good idiom for switching to a new assembler? In other words, let's say the code tests for the presence of quads and finds them and it's time to pivot to TIM. Obviously, I could just new InMemDatasetAssembler(), but I'm thinking there must be a more elegant way. I looked at AssemblerUtils but only saw ways to register Assemblers, not retrieve them…

@ajs6f
Copy link
Member Author

ajs6f commented Nov 22, 2017

@afs What do you think of that? It's clearer, I think, along the lines you suggested.

* @param root resource to check
* @param types types for which to check
*/
protected void checkType( Resource root, Resource... types ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this. An assembler resource must have one type so that there is a unique mapping to the implementation to call. I don't see in the code anywhere calling with 2+ arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the changes I made to InMemDatasetAssembler. That's where.

This happens because if you pass assembler RDF with RDFDataset, it could become a general-purpose dataset or it could become a TIM dataset, and so the TIM assembler has to be able to accept both types. IOW, "An assembler resource must have one type", yes, but the more important point is that the assembler code has to be able to accept more than one type.

We can change the semantics, instead, so that there is a one-to-one between types and assembler classes. and I am increasingly convinced we should, because if we are getting a bit confused about this, it's not going to be easy for users!

* @param types types for which to check
*/
protected void checkType( Resource root, Resource... types ) {
for (Resource type : types) if (root.hasProperty( RDF.type, type )) return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything on one line is a bit confusing!

}

@Override
public Dataset open(final Assembler assembler, final Resource root, final Mode mode) {
checkType(root, DatasetAssemblerVocab.tDatasetTxnMem);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ought to also check for use of the RDFDataset vocabulary on a MemoryDataset. e.g. ja:defaultGraph and then ja:namedGraph pointing to ja:graph and a ja:MemoryModel.

The reverse tests ought to be in general assembler.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my frustrations with this is that there were no tests of these assemblers at all when I started, so I'm not very sure about the current expected/guaranteed behavior. WRT to our comments above about the use of predicates, should we start migrating people away from ja:graph, or (probably better) instead restricting ja:data to only loading quads into a dataset?

Dataset ds = createDataset(a, root, mode) ;
return ds ;
}

public Dataset createDataset(Assembler a, Resource root, Mode mode) {
checkType(root, DatasetAssemblerVocab.tDataset);
// use TIM if quads are loaded or if all named Graphs are loaded via data property
final boolean allNamedGraphsLoadViaData = multiValueResource(root, pNamedGraph).stream().allMatch(g -> g.hasProperty(data));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A SPARQL query would be quite nicely here!

The test needs to include checking and rejecting mixed cases like ja:defaultGraph and ja:data on the dataset resource and both ja:graph ja:data from the object of ja:namedGraph.

Similarly, tMemoryDataset needs to check for general vocab.

That is, two tests: "hasGeneralDatasetVocab" and "isMemoryDatasetVocab" then check not true/true or false/false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't understand the comment-- are you suggesting to use SPARQL instead of the API? If we are missing a test here, it means that I did not understand your plan for how to "distribute" the predicates amongst the types of dataset. In fact, I'm starting to wonder if that's what we should do after all, or whether it would be better to deprecate some of them, since they overlap, or confine the meanings of those that overlap so that they don't...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change the predicates due to existing deployed assemblers. We can check for consistency e.g. currently, this works:

<#dataset> rdf:type ja:MemoryDataset ;
   ja:data <file:D.trig>;
   ja:namedGraph [ ];
.

(loops on properties of the object of namedGraph so if there are none ...)
as does this:

<#dataset> rdf:type ja:MemoryDataset ;
   ja:data <file:D.trig>;
   ja:defaultGraph [ a ja:MemoryModel ];
.

(ignores ja:defaultGraph)

checkType(root, DatasetAssemblerVocab.tDataset);
// use TIM if quads are loaded or if all named Graphs are loaded via data property
final boolean allNamedGraphsLoadViaData = multiValueResource(root, pNamedGraph).stream().allMatch(g -> g.hasProperty(data));
if (root.hasProperty(data) || allNamedGraphsLoadViaData) return new InMemDatasetAssembler().open(a, root, mode);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All on one line is really quite unclear. Have to live with Java.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer than otherwise, to my eye. This is style, and I'm happy to change this to be clearer for you, but it's not an objective question.

@@ -31,7 +31,7 @@
// General dataset
public static final Resource tDataset = ResourceFactory.createResource(NS+"RDFDataset") ;
// In-memory dataset
public static final Resource tDatasetTxnMem = ResourceFactory.createResource(NS+"DatasetTxnMem") ;
public static final Resource tMemoryDataset = ResourceFactory.createResource(NS+"MemoryDataset") ;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ajs6f correctly pointed out that DatasetTxnMem was in the documentation so migration support would be good such as register InMemDatasetAssembler under both names and log a warning if called as ja:DatasetTxnMem

(And text for the release notes)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've put back tDatasetTxnMem.

import org.junit.Assert;
import org.junit.Test;

public class TestDatasetAssembler extends Assert {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having one or two tests that works from a Turtle string or would make for clear tests of expected use.

With test for bad cases of mixed usage (general and TIM vocabulary together) text written test cases will help generating tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, a couple of Turtle assembler files would make for nice examples. Do we have a standard practice about how to load test files from the classpath during a build?

I don't understand the second sentence. Can you restate that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, tests are loaded from a testing/ directory, not the classpath. Some tests need to read real, normal files and also the W3C SPARQL tests run from on-disk files and manifests. Once some tests do need files, there isn't much value in classpath loading as its extra, not instead of. Assemblers need to read from files, it being the common usecase.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was lookign for tests that do both ja:data and ja:defaultGraph, which should be an error. I saw someone asking recently trying with ja:defaultGraph and TIM assemble, which just ignores unknown properties. Easy confusion to make - so deliberately testing making an error seems sensible.

Log.warn(this, "Use of old vocabulary: use :graph not :graphData");
} else {
throw new DatasetAssemblerException(root, "no graph for: " + gName);
Txn.executeWrite(ds, () -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: Because the general dataset can't support abort, a Txn is nice but of little help; if anything goes wrong assembling, it's broken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just use it more or less for uniformity. I want our code to model best practices.

@afs
Copy link
Member

afs commented Dec 7, 2017

Merged in order to progress Jena 3.6.0.

@asfgit asfgit merged commit 10ff0be into apache:master Dec 7, 2017
asfgit pushed a commit that referenced this pull request Dec 7, 2017
@ajs6f
Copy link
Member Author

ajs6f commented Dec 7, 2017

Ouch! I realize I wasn't clear enough at all in my last comment: I was actually -1 on this. I think the confusion that ja:data now has in its semantics is too much, and I completely disagree with:

We can't change the predicates due to existing deployed assemblers

We change the public API when we must, we just do it carefully and after clear depreciation with a clear migration path. What's more, the docs have been wrong about the assembler type for TIM for a year or more, so I don't think there will be users waiting with pitchforks and torches!

I'd like to either revert this merge or add a further commit adding the use of ja:graph to TIM's assembler (just like the general dataset assembler) and depreciating the use of ja:data for anything but quads with the expectation that we will disallow it in a near-future version. I can do that after a release, if that is easier.

@asfbot
Copy link

asfbot commented Dec 7, 2017

Claude Warren on dev@jena.apache.org replies:
Q: What about when the tests are used by subclasses in a different
package. The test files would then need to be loaded from the class path.
I ran into this issue trying to run tests on graph implementations before.
However, it my not apply in this specific case with assemblers. So I am
asking, is there an issue here if the test is subclassed in a different
project and includes these tests via the test-jar packaging?

Claude

@afs
Copy link
Member

afs commented Dec 7, 2017

@ajs6f I do not understand your point - it is too abstract talking about "predicates". Some examples maybe?

How do you deprecate in Turtle assembler files?

I don't understand about ja:data and quads, triples - that seems natural name. It would be simple to check the source if triples when loading a graph. The converse, loading quads should accept triples - a single graph in the dataset is the most common use case.

My original PR #313 was just to make {{ja:data "filename"}} work, based on experience with <file:> - that has relative name confusion.

@ajs6f
Copy link
Member Author

ajs6f commented Dec 7, 2017

I'm saying that ja:data is now used for loading quads into two different kinds of datasets, but loading triples into only one. That's confusing at best. I want to make it clearer by getting rid of the second use, because we already have ja:graph, and then what about ja:content? But I'm happy to hear other ways to clarify the semantics-- I just think that we've made things more powerful but more confusing with this PR. I want to keep the power (load quads easily, which we just did) while not making it harder to understand how to build assembler RDF (which I think we just did).

Re: deprecation: we can deprecate with comments and by throwing warnings from inside the assembler code. And again, we have good reason (the mistake in the docs that no one noticed enough to complain about) to think that it won't be too disruptive.

@asfbot
Copy link

asfbot commented Dec 7, 2017

Andy Seaborne on dev@jena.apache.org replies:
Claude - it isn't subclassed.

Some tests are of assemblers themselves reading real, normal RDF data
files because that is what the assembler is supplied to do. It isn't a
case of putting in a jar.

 Andy

@afs
Copy link
Member

afs commented Dec 7, 2017

ja:data isn't mentioned in DatasetAssembler except to redirect to TIM. Which two different kinds of dataset?

Would you like to move public static final Property data = property( "data" ); out of JA.java into the assembler vocab definitions file?

We need a way to say:

<#dataset> a ja:MemoryDataset ;
  ja:data "File for the default graph - also data for dataset" ;
  ja:namedGraph [ ja:graphName "http://example/graph1"; ja:data "File for graph1" ] ;
  ja:namedGraph [ ja:graphName "http://example/graph2"; ja:data "File for graph2" ] ;
.

because that gets a fully transactional dataset and I'd hope our preferred way to do have plain data in a memory dataset. Having data-for-triples, data-for-quads gets weird when its an unknown URL to load from for the default graph.

ja:Content is for ja:Models in ja:RDFDataset only. It is too old (= documented and written about outside the project) to drop. It is needed for inference setups.

@afs
Copy link
Member

afs commented Dec 8, 2017

Can we separate the different parts of this and unblock 3.6.0?

Is the following acceptable for 3.6.0:

  1. JENA-1430: Read quads for ja:data by filename #313 fixed a specific problem - <file:filename> and "filename" behave differently for resolution when the assembler file is not in the current directory of the JVM as happens in Fuseki.
  2. Make ja:MemoryDataset the class for the TIM assembler, with ja:DatasetTxnMem registered for backwards compatibility.
  3. Add the Fuseki example for using ja:MemoryDataset.

@ajs6f
Copy link
Member Author

ajs6f commented Dec 8, 2017

The two different datasets are TIM and general. The fact that the code for the general dataset assembler doesn't mention ja:data is fine, but because TIM's assembler does understand ja:data and some assembler RDF with the general dataset type can get shunted to TIM, ja:data can end up being successfully used with well-defined semantics in assembler RDF with the general dataset type.

This is what I mean by the whole thing having gotten a bit confusing! 😁

I agree fully that we need to be able to do what you clearly intend to do with your example. I just want the the example to look like:

<#dataset> a ja:MemoryDataset ;
  ja:data "File for the default graph - also data for dataset" ;
  ja:namedGraph [ ja:graphName "http://example/graph1"; ja:graph "File for graph1" ] ;
  ja:namedGraph [ ja:graphName "http://example/graph2"; ja:graph "File for graph2" ] ;
.

instead so that TIM and general look the same from the assembler POV and so that the meanings of ja:data and ja:graph are as clear and simple as practical.

I don't want to block 3.6.0 on this, so what do you say to your three points above, plus (after 3.6.0)

  1. We gracefully deprecate the use of ja:data for anything other than loading quads into datasets and make TIM's assembler RDF look like the general dataset assembler RDF modulo the dataset type.

Sound okay?

@afs
Copy link
Member

afs commented Dec 8, 2017

I'm certainly willing to discuss it, that's what JENA-1445 is about and we can rename it to be more accurate.

I'm not able to agree to the details at this stage. I have several outstanding questions and also outstanding points raised with both its name and semantics. There is also the status of ja:externalContent playing into the mix which hasn't been explained.

@ajs6f
Copy link
Member Author

ajs6f commented Dec 8, 2017

Okay, then let's break this thing up, as you describe above. Your number 2 and 3 are uncontroversial. As for number 1, sure, I want that fix too, but this PR (#314) didn't do just that. It also conflated the assembler RDF for these two types of dataset and I now see that TIM's use of ja:data for two purposes is a big part of that. Let's revert this and merge #313 instead and we can take up JENA-1445 after 3.6.0.

@afs
Copy link
Member

afs commented Dec 11, 2017

Required reverts done, #313 applied and ja:MemoryDataset made the primary type name for TIM assembler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants