JENA-1430 #314
fuseki:dataset <#dataset> ;
.

<#dataset> rdf:type ja:DatasetTxnMem;
This needs to change to ja:MemoryDataset.
Nice catch, thanks, fixed.
// Load data into the default graph or quads into the dataset.
multiValueAsString(root, data)
    .forEach(dataURI -> read(ds, dataURI));
This interacts with the setting up of the default graph and named graphs. It's going to add to the default graph just made (which might have content via ja:content) but also to the named graphs. It's a dataset by links to the models, where adding a model replaces the model.
I'm not sure what the best balance is here. Options include: (1) warn and take no action on seeing ja:data (with a note to use TIM); (2) do it after all the named models are added, because parsing adds quads rather than replacing models; (3) allow this or specified models, which would be better done with TIM; (4) it's ja:data or specified graphs, not a combination; if ja:data, return a TIM.
(4) looks like a good way to encourage TIM usage. If (4), then a plain, empty ja:RDFDataset could be TIM as well.
I'm trying to think of a use case for "load quads into non-TIM" and the one that occurs to me is in an embedded or integrated situation where you have a lot of quads, like so many that you prefer the memory-parsimonious-ness of the general IM dataset, maybe because you have other processes running in the system. Sound likely enough to merit (2)?
That is to load specialised graphs, e.g. inference? In which case the graphs themselves need to be pre-created before reading data. We already have a mechanism for graphs with ja:content on the graph description. Having two ways seems bad. Strange things will happen if a graph is in the quads and not explicitly created (it will be a plain graph when inference is expected). It could be done by splitting the quads into separate graphs and generating the assembler text.
It might make sense to have some plain graphs and some specialist graphs, that is, a disjoint set, but enforcing this to stop accidents is not so easy without a lot of additional machinery.
How about doing this progressively: we make ja:RDFDataset create either TIM or linked datasets. And for now, it is one or the other, not a combination. (We can have per-graph data loading into TIM, as it does at the moment, with some care: look into the namedGraph property to see if it has ja:data + name, and only those; if not, it's a general model.)
We leave more exotic combinations until we get a request, because that may help choose what to do with the corner cases, like whether they overlap or not. If we provide them now, the corner cases are fixed by compatibility.
Okay, if I get what you are saying, it's:
- Check to see if quads are being loaded, if so, TIM.
- Otherwise, check the named graphs. If they are all ja:data guys, then TIM again.
- Otherwise, general dataset.
I'll get on this later today.
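A minimal sketch of that selection rule, in plain Java with no Jena types, might look like the following. The class DatasetChoice, the enum Kind and the boolean inputs are hypothetical stand-ins for "the dataset resource has ja:data" and "this named graph has ja:data"; they are not part of the PR.

```java
import java.util.List;

/** Illustrative only: models the three-step rule with plain booleans. */
final class DatasetChoice {
    enum Kind { TIM, GENERAL }

    /**
     * @param datasetHasData    true if the dataset resource itself carries ja:data
     * @param namedGraphHasData one entry per ja:namedGraph; true if that graph has ja:data
     */
    static Kind choose(boolean datasetHasData, List<Boolean> namedGraphHasData) {
        // Step 1: quads are being loaded into the dataset, so use TIM.
        if (datasetHasData) {
            return Kind.TIM;
        }
        // Step 2: every named graph is loaded via ja:data, so use TIM again.
        // allMatch is true on an empty stream, so "no named graphs" also lands here.
        boolean allViaData = namedGraphHasData.stream().allMatch(Boolean::booleanValue);
        if (allViaData) {
            return Kind.TIM;
        }
        // Step 3: otherwise, a general dataset.
        return Kind.GENERAL;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, List.of(false)));
        System.out.println(choose(false, List.of(true, true)));
        System.out.println(choose(false, List.of(true, false)));
    }
}
```

One consequence worth noticing: with no ja:data and no named graphs at all, the second step is vacuously true, so a plain, empty dataset description would come out as TIM under this rule.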
@afs What's a good idiom for switching to a new assembler? In other words, let's say the code tests for the presence of quads and finds them and it's time to pivot to TIM. Obviously, I could just new InMemDatasetAssembler(), but I'm thinking there must be a more elegant way. I looked at AssemblerUtils but only saw ways to register Assemblers, not retrieve them…
@afs What do you think of that? It's clearer, I think, along the lines you suggested.
 * @param root resource to check
 * @param types types for which to check
 */
protected void checkType( Resource root, Resource... types ) {
I don't understand this. An assembler resource must have one type so that there is a unique mapping to the implementation to call. I don't see in the code anywhere calling with 2+ arguments.
See the changes I made to InMemDatasetAssembler. That's where.
This happens because if you pass assembler RDF with RDFDataset, it could become a general-purpose dataset or it could become a TIM dataset, and so the TIM assembler has to be able to accept both types. IOW, "An assembler resource must have one type", yes, but the more important point is that the assembler code has to be able to accept more than one type.
We can change the semantics instead, so that there is a one-to-one mapping between types and assembler classes, and I am increasingly convinced we should, because if we are getting a bit confused about this, it's not going to be easy for users!
 * @param types types for which to check
 */
protected void checkType( Resource root, Resource... types ) {
    for (Resource type : types) if (root.hasProperty( RDF.type, type )) return;
Everything on one line is a bit confusing!
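For comparison, a multi-line form of the same varargs check could be sketched as below. This is plain Java with strings standing in for RDF types; the class TypeCheck, the Set-based signature and the exception thrown on a mismatch are assumptions for illustration, since the diff excerpt does not show what the real method does after the loop.

```java
import java.util.Set;

/** Illustrative only: strings stand in for RDF type resources. */
final class TypeCheck {
    /** Passes if the root has at least one of the accepted types; otherwise throws. */
    static void checkType(Set<String> rootTypes, String... accepted) {
        for (String type : accepted) {
            if (rootTypes.contains(type)) {
                return;
            }
        }
        // Assumed failure behaviour; the real assembler may report this differently.
        throw new IllegalArgumentException(
                "Wrong type: expected one of " + String.join(", ", accepted));
    }

    public static void main(String[] args) {
        // Accepted: the root has one of the two permitted types.
        checkType(Set.of("ja:RDFDataset"), "ja:MemoryDataset", "ja:RDFDataset");
        System.out.println("type check passed");
    }
}
```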
}

@Override
public Dataset open(final Assembler assembler, final Resource root, final Mode mode) {
    checkType(root, DatasetAssemblerVocab.tDatasetTxnMem);
We ought to also check for use of the RDFDataset vocabulary on a MemoryDataset, e.g. ja:defaultGraph, and then ja:namedGraph pointing to ja:graph and a ja:MemoryModel.
The reverse tests ought to be in the general assembler.
One of my frustrations with this is that there were no tests of these assemblers at all when I started, so I'm not very sure about the current expected/guaranteed behavior. WRT our comments above about the use of predicates, should we start migrating people away from ja:graph, or (probably better) instead restrict ja:data to only loading quads into a dataset?
    Dataset ds = createDataset(a, root, mode) ;
    return ds ;
}

public Dataset createDataset(Assembler a, Resource root, Mode mode) {
    checkType(root, DatasetAssemblerVocab.tDataset);
    // use TIM if quads are loaded or if all named Graphs are loaded via data property
    final boolean allNamedGraphsLoadViaData = multiValueResource(root, pNamedGraph).stream().allMatch(g -> g.hasProperty(data));
A SPARQL query would do quite nicely here!
The test needs to include checking and rejecting mixed cases, like ja:defaultGraph and ja:data on the dataset resource, and both ja:graph and ja:data on the object of ja:namedGraph.
Similarly, tMemoryDataset needs to check for general vocab.
That is, two tests: "hasGeneralDatasetVocab" and "isMemoryDatasetVocab", then check that the result is not true/true or false/false.
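The two-test idea can be sketched in plain Java as follows. Everything here is a simplification for illustration: the class VocabCheck is hypothetical, property names are bare local names, dataset-level and named-graph-level properties are flattened into one set, and the split into "general" (defaultGraph, graph) versus "memory" (data) vocabulary is an assumption drawn from the examples in this thread.

```java
import java.util.Set;

/** Illustrative only: checks that a description uses one vocabulary, not both. */
final class VocabCheck {
    // Assumed grouping of ja: property local names; not taken from the Jena source.
    static final Set<String> GENERAL_VOCAB = Set.of("defaultGraph", "graph");
    static final Set<String> MEMORY_VOCAB = Set.of("data");

    static boolean hasGeneralDatasetVocab(Set<String> props) {
        return props.stream().anyMatch(GENERAL_VOCAB::contains);
    }

    static boolean isMemoryDatasetVocab(Set<String> props) {
        return props.stream().anyMatch(MEMORY_VOCAB::contains);
    }

    /** Consistent means exactly one of the two tests holds (not true/true, not false/false). */
    static boolean isConsistent(Set<String> props) {
        return hasGeneralDatasetVocab(props) ^ isMemoryDatasetVocab(props);
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(Set.of("data", "namedGraph")));
        System.out.println(isConsistent(Set.of("data", "defaultGraph")));
    }
}
```

A real check would need to look at the dataset resource and at each ja:namedGraph object separately, since ja:data can appear at both levels.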
I don't understand the comment. Are you suggesting we use SPARQL instead of the API? If we are missing a test here, it means that I did not understand your plan for how to "distribute" the predicates amongst the types of dataset. In fact, I'm starting to wonder if that's what we should do after all, or whether it would be better to deprecate some of them, since they overlap, or confine the meanings of those that overlap so that they don't...
We can't change the predicates due to existing deployed assemblers. We can check for consistency, e.g. currently, this works:
<#dataset> rdf:type ja:MemoryDataset ;
    ja:data <file:D.trig>;
    ja:namedGraph [ ];
    .
(loops on properties of the object of namedGraph so if there are none ...)
as does this:
<#dataset> rdf:type ja:MemoryDataset ;
    ja:data <file:D.trig>;
    ja:defaultGraph [ a ja:MemoryModel ];
    .
(ignores ja:defaultGraph)
checkType(root, DatasetAssemblerVocab.tDataset);
// use TIM if quads are loaded or if all named Graphs are loaded via data property
final boolean allNamedGraphsLoadViaData = multiValueResource(root, pNamedGraph).stream().allMatch(g -> g.hasProperty(data));
if (root.hasProperty(data) || allNamedGraphsLoadViaData) return new InMemDatasetAssembler().open(a, root, mode);
All on one line is really quite unclear. Have to live with Java.
Much clearer than otherwise, to my eye. This is style, and I'm happy to change this to be clearer for you, but it's not an objective question.
@@ -31,7 +31,7 @@
// General dataset
public static final Resource tDataset = ResourceFactory.createResource(NS+"RDFDataset") ;
// In-memory dataset
public static final Resource tDatasetTxnMem = ResourceFactory.createResource(NS+"DatasetTxnMem") ;
public static final Resource tMemoryDataset = ResourceFactory.createResource(NS+"MemoryDataset") ;
@ajs6f correctly pointed out that DatasetTxnMem was in the documentation, so migration support would be good, such as registering InMemDatasetAssembler under both names and logging a warning if called as ja:DatasetTxnMem. (And text for the release notes.)
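As a rough illustration of that migration approach, the sketch below registers one implementation under a current and a deprecated type name and warns when the old name is used. It is plain Java: the class TypeRegistry, its methods and the use of java.util.logging are hypothetical and do not reflect how Jena's assembler registration actually works.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

/** Illustrative only: a toy registry mapping type names to assembler names. */
final class TypeRegistry {
    private static final Logger LOG = Logger.getLogger(TypeRegistry.class.getName());

    private final Map<String, String> assemblerByType = new HashMap<>();
    private final Map<String, String> deprecatedToCurrent = new HashMap<>();

    void register(String type, String assemblerName) {
        assemblerByType.put(type, assemblerName);
    }

    /** Makes oldType resolve to the same assembler as currentType, flagged as deprecated. */
    void registerDeprecated(String oldType, String currentType) {
        deprecatedToCurrent.put(oldType, currentType);
        assemblerByType.put(oldType, assemblerByType.get(currentType));
    }

    /** Returns the assembler name for the type, or null if the type is unknown. */
    String lookup(String type) {
        String current = deprecatedToCurrent.get(type);
        if (current != null) {
            LOG.warning("Use of old vocabulary: " + type + "; use " + current + " instead");
        }
        return assemblerByType.get(type);
    }

    public static void main(String[] args) {
        TypeRegistry registry = new TypeRegistry();
        registry.register("ja:MemoryDataset", "InMemDatasetAssembler");
        registry.registerDeprecated("ja:DatasetTxnMem", "ja:MemoryDataset");
        System.out.println(registry.lookup("ja:DatasetTxnMem"));
    }
}
```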
Yes, we can do that.
I've put back tDatasetTxnMem.
import org.junit.Assert;
import org.junit.Test;

public class TestDatasetAssembler extends Assert {
Having one or two tests that work from a Turtle string or file would make for clear tests of expected use.
With test for bad cases of mixed usage (general and TIM vocabulary together) text written test cases will help generating tests.
Sure, a couple of Turtle assembler files would make for nice examples. Do we have a standard practice about how to load test files from the classpath during a build?
I don't understand the second sentence. Can you restate that?
Generally, tests are loaded from a testing/ directory, not the classpath. Some tests need to read real, normal files, and the W3C SPARQL tests also run from on-disk files and manifests. Once some tests do need files, there isn't much value in classpath loading, as it's extra, not instead of. Assemblers need to read from files, that being the common use case.
I was looking for tests that do both ja:data and ja:defaultGraph, which should be an error. I saw someone recently asking about trying ja:defaultGraph with the TIM assembler, which just ignores unknown properties. It's an easy confusion to make, so deliberately testing that it raises an error seems sensible.
    Log.warn(this, "Use of old vocabulary: use :graph not :graphData");
} else {
    throw new DatasetAssemblerException(root, "no graph for: " + gName);
Txn.executeWrite(ds, () -> {
Side note: Because the general dataset can't support abort, a Txn is nice but of little help; if anything goes wrong assembling, it's broken.
Yes, I just use it more or less for uniformity. I want our code to model best practices.
Merged in order to progress Jena 3.6.0.
Ouch! I realize I wasn't clear enough at all in my last comment: I was actually -1 on this. I think the confusion that
We change the public API when we must; we just do it carefully and after clear deprecation with a clear migration path. What's more, the docs have been wrong about the assembler type for TIM for a year or more, so I don't think there will be users waiting with pitchforks and torches! I'd like to either revert this merge or add a further commit adding the use of
Claude Warren on dev@jena.apache.org replies: Claude
@ajs6f I do not understand your point - it is too abstract talking about "predicates". Some examples maybe? How do you deprecate in Turtle assembler files? I don't understand about My original PR #313 was just to make {{ja:data "filename"}} work, based on experience with
I'm saying that Re: deprecation: we can deprecate with comments and by throwing warnings from inside the assembler code. And again, we have good reason (the mistake in the docs that no one noticed enough to complain about) to think that it won't be too disruptive.
Andy Seaborne on dev@jena.apache.org replies: Some tests are of assemblers themselves reading real, normal RDF data
Would you like to move We need a way to say:
because that gets a fully transactional dataset and I'd hope our preferred way to do have plain data in a memory dataset. Having data-for-triples, data-for-quads gets weird when its an unknown URL to load from for the default graph.
Can we separate the different parts of this and unblock 3.6.0? Is the following acceptable for 3.6.0:
The two different datasets are TIM and general. The fact that the code for the general dataset assembler doesn't mention This is what I mean by the whole thing having gotten a bit confusing! 😁 I agree fully that we need to be able to do what you clearly intend to do with your example. I just want the example to look like:
instead so that TIM and general look the same from the assembler POV and so that the meanings of I don't want to block 3.6.0 on this, so what do you say to your three points above, plus (after 3.6.0)
Sound okay?
I'm certainly willing to discuss it, that's what JENA-1445 is about and we can rename it to be more accurate. I'm not able to agree to the details at this stage. I have several outstanding questions and also outstanding points raised with both its name and semantics. There is also the status of
Okay, then let's break this thing up, as you describe above. Your number 2 and 3 are uncontroversial. As for number 1, sure, I want that fix too, but this PR (#314) didn't do just that. It also conflated the assembler RDF for these two types of dataset and I now see that TIM's use of
Required reverts done, #313 applied and
Includes #313, plus:
- DatasetAssembler
- DatasetAssembler can also load quads
- ja:DatasetTxnMem => ja:MemoryDataset