Use a skip scan based iterator for listing graph names in TDB2 (GH-1639) #1655
It is great to see this but it is now very close to 4.7.0. |
Thanks for this work! I recently had to fight with PostgreSQL because it lacks native skip scans. A performance improvement from several seconds or minutes down to a few milliseconds is what I experienced there. Also, one of the Postgres extensions prominently advertised its skip scan implementation in this blog post; this may serve as inspiration for how to advertise this feature when it's done. |
Yeah, I think this one will have to be pushed out and given more time to be refined and settle down. The change breaks a few tests, so there are clearly some corner cases I am not catching correctly yet! |
very nice work!!
0.2 seconds
0.1 seconds
|
3793388
to
c809c00
Compare
jena-tdb2/src/main/java/org/apache/jena/tdb2/store/DatasetGraphTDB.java
didn't see any? |
Already fixed them with my force push |
the quick + dirty way to make select ?g { graph ?g {} } fast: AKSW@2b064df?w=1#diff-e82707e0fbbe7ca027b63ecbac2d24d5607432e492061ff8d1cc3f5c6318e2feR162-R167 not sure what to do about the filter though....... |
Imho there should also be some optimization for mapping the common query. For example,
SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } }
should be evaluated efficiently w.r.t. the presence of possibly empty graphs as
SELECT ?g {
  GRAPH ?g { }
  FILTER EXISTS { GRAPH ?g { ?s ?p ?o } } # Note: Won't expand infinitely because
                                          # we are not requesting DISTINCT ?g in the filter element
}
Of course this should then be managed as a follow-up issue to this one, but my questions right now are:
|
Let's not get ahead of ourselves here; there are still some bugs in this feature to be ironed out before it's ready for merging, i.e. DON'T expect it for 4.7.0. I've been working on some low-level test cases for the new skip scan, and it's definitely broken for some cases right now. Until that's been addressed you won't see this in |
Is it? Or is it a mistake for
Adding too many "maybe" optimizations slows down fast, small queries. (We know this from BSBM.)
The implementation exception might be In
so it looks like it is a matter of copying that to
The right thing to do is address in the implementations in a separate PR, starting with some test cases. |
I really meant
So the "portable" query is the spo variant. |
The SPARQL endpoint or your local load into TDB2 with this PR applied? |
On the SPARQL endpoint - the graph patterns in my post are links to DBpedia. |
Yes, it might be useful as an opt-in though; OpGraphNames typically results in listGraphNames so an algebra transform that injects OpGraphNames and |
It's not portable - it's a workaround. It might be a poor choice on another store. There's an issues list for Virtuoso and the user list for Virtuoso is on SourceForge - has it been reported? We're not here to fix DBpedia. It has several deviations from the specification. Jena is open and you can submit reports - that can be overused. Email would be better. |
f614bfe
to
7cf6dc8
Compare
Adds a new BPTreeDistinctKeyPrefixIterator that allows iterating only records which are considered distinct based on a portion of their key. This is used to improve performance of DatasetGraphTDB.listGraphNodes(), in some cases dramatically so.
Apply the usage of the new distinct by key iterator to SolverLibTDB.graphNames() path ensuring that the rest of the logical flow there continues to work as before. Added explanatory comments about the choices and optimisations involved. Moved repeated logic for selecting a suitable index actually into the TupleTable class and simplified some code as a result
Adds low level test cases for validating the behaviour of the distinct by key prefix iterator
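To illustrate the idea behind iterating only records that are distinct by key prefix, here is a minimal sketch. It is not the PR's `BPTreeDistinctKeyPrefixIterator`: it assumes a `java.util.TreeSet` of `long[]` keys as a stand-in for a B+Tree index, and the class and method names (`SkipScanSketch`, `index`, `distinctFirst`) are hypothetical. The point it shows is that after emitting one key, the scan seeks directly past every key sharing the same leading component instead of reading them one by one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Illustration only: a sorted set of fixed-length keys stands in for a B+Tree index. */
public final class SkipScanSketch {
    // Lexicographic order over the key components, as in an index sorted by its leading column.
    private static final Comparator<long[]> KEY_ORDER = (a, b) -> Arrays.compare(a, b);

    /** Builds the stand-in index from keys such as {graphId, subjectId, predicateId, objectId}. */
    public static NavigableSet<long[]> index(long[]... keys) {
        NavigableSet<long[]> set = new TreeSet<>(KEY_ORDER);
        set.addAll(Arrays.asList(keys));
        return set;
    }

    /** Returns the distinct values of the leading key component using one seek per distinct value. */
    public static List<Long> distinctFirst(NavigableSet<long[]> index) {
        List<Long> out = new ArrayList<>();
        long[] key = index.isEmpty() ? null : index.first();
        while (key != null) {
            out.add(key[0]);
            // Largest possible key with the same leading component.
            long[] upper = new long[key.length];
            Arrays.fill(upper, Long.MAX_VALUE);
            upper[0] = key[0];
            // Seek to the first key strictly after that bound, skipping the whole run.
            key = index.higher(upper);
        }
        return out;
    }
}
```

With keys whose leading components are 1, 1, 2, 4, 4, `distinctFirst` returns `[1, 2, 4]` after three seeks, regardless of how many keys share each leading value. A real B+Tree would do the seek by descending from the root rather than calling `higher()`.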
425c14b
to
538b2b3
Compare
jena-tdb2/src/main/java/org/apache/jena/tdb2/solver/SolverLibTDB.java
...oe-trans-data/src/test/java/org/apache/jena/dboe/trans/bplustree/TestBPTreeDistinctKeys.java
538b2b3
to
35e8743
Compare
Now 4.7.0 is out we should be able to get this reviewed and merged so that users have time to start testing the updated SNAPSHOTs with this improvement |
We have the skip scan in use in a dataset with around 1 billion triples and graph listings are super fast 👍 What would eventually be needed is to also make this feature publicly accessible in the various Tuple/DatasetGraph interfaces.
// D = domain tuple type (e.g. Quad or Tuple<NodeId>), C = component type (e.g. Node or NodeId)
interface TupleMatcher4<D, C> {
TupleStreamer<D, C> find(C g, C s, C p, C o, boolean distinct, int ... projectedColumns);
}
interface TupleStreamer<D, C> {
Iterator<C> asComponents(); // e.g. Node or NodeIds
Iterator<D> asDomainTuples(); // e.g. Quad
Iterator<Tuple<C>> asGenericTuples();
}
This way a request for e.g. distinct predicates:
SELECT DISTINCT ?p { GRAPH ?g { ?s ?p ?o } }
could then map to a
Furthermore, I wonder whether the way TDB indexes data would suit a skip scan for retrieving a resource's distinct predicates: SELECT DISTINCT ?p { GRAPH ?g { <concreteS> ?p ?o } } The background is that we have resources with 4 million+ statements (yeah, not that usual) where a scan for distinct predicates takes seconds. Maybe the skip scan could speed this case up as well? |
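As a rough illustration of the question above, the sketch below shows how a skip scan could, in principle, list the distinct predicates of one subject when the leading key component is bound. It assumes a hypothetical index sorted in subject, predicate, object order, modelled with a `java.util.TreeSet` of `long[]` keys. The names `PrefixSkipScanSketch`, `spoIndex` and `distinctPredicates` are invented for this example. Nothing in this thread confirms that the PR's iterator supports a bound prefix in this way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Illustration only: skip scan over the second key component for a fixed first component. */
public final class PrefixSkipScanSketch {

    /** Builds a stand-in index of {subjectId, predicateId, objectId} keys in lexicographic order. */
    public static NavigableSet<long[]> spoIndex(long[]... keys) {
        NavigableSet<long[]> set = new TreeSet<>((a, b) -> Arrays.compare(a, b));
        set.addAll(Arrays.asList(keys));
        return set;
    }

    /** Returns the distinct predicate ids used with the given subject id. */
    public static List<Long> distinctPredicates(NavigableSet<long[]> spo, long subject) {
        List<Long> out = new ArrayList<>();
        // Seek to the first key at or after the start of the subject's range.
        long[] key = spo.ceiling(new long[] {subject, Long.MIN_VALUE, Long.MIN_VALUE});
        while (key != null && key[0] == subject) {
            out.add(key[1]);
            // Jump past every object stored under this (subject, predicate) pair.
            key = spo.higher(new long[] {subject, key[1], Long.MAX_VALUE});
        }
        return out;
    }
}
```

The number of seeks depends on the number of distinct predicates rather than on the number of statements, which is why a subject with millions of statements but few predicates would be the favourable case.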
I don't suspect that this sort of low level execution optimisation is ever going to bubble up into the high level end-user APIs like I don't disagree that this iterator could be used to optimise execution of other query patterns but the goal here is to start small and incrementally improve.
As I used to tell a solution architect I worked closely with in a past $dayjob, optimisation is fundamentally a trade-off. The goal of a general purpose optimiser is to apply optimisations that are generally useful to most users and most data, yielding performance improvements for the general case. Detecting whether an optimisation is applicable has a non-zero cost, and for some unusual query/dataset patterns, e.g. a subject with 4 million statements, a general purpose optimiser shouldn't try to optimise because the pattern is so far outside its normal expectations. I would suggest that for this kind of optimisation, where you have a specific query pattern that runs poorly on your dataset(s), you consider creating your own custom optimiser (based off ARQ's default one) and adding optimisations for the specific query patterns you need. You can transform those into custom That gives you a way to experiment with some of these things outside of the Jena codebase and then potentially contribute them back later if they prove to be of more general value. But right now it seems like there's a lot of stuff that's very specific to your use cases and may not be generally applicable. |
One unused import.
...data/src/main/java/org/apache/jena/dboe/trans/bplustree/BPTreeDistinctKeyPrefixIterator.java
Moves some of the up front validation checks into the static create() method. Also does some short circuit checking for cases where it can return a null/singleton iterator immediately without needing to actually create an iterator.
35e8743
to
993fc6c
Compare
Adds a new BPTreeDistinctKeyPrefixIterator that allows iterating only records which are considered distinct based on a portion of their key. It is effectively a skip scan based iterator that can avoid reading portions of the B+Tree where all records share the same key prefix. This is used to improve performance of DatasetGraphTDB.listGraphNodes() and SolverLibTDB.graphNames(), in some cases dramatically so.
Used a couple of different test scenarios with calling listGraphNodes():
This resolves #1639