
Use a skip scan based iterator for listing graph names in TDB2 (GH-1639) #1655

Merged
merged 4 commits into apache:main on Mar 20, 2023

Conversation

@rvesse
Member

@rvesse rvesse commented Dec 1, 2022

Adds a new BPTreeDistinctKeyPrefixIterator that allows iterating only records which are considered distinct based on a portion of their key. It is effectively a skip scan based iterator that can avoid reading portions of the B+Tree where all records share the same key prefix. This is used to improve performance of DatasetGraphTDB.listGraphNodes() and SolverLibTDB.graphNames(), in some cases dramatically so.
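As a minimal illustration of the skip scan idea (not the actual B+Tree code; the class and method names below are invented for the example, and a sorted NavigableSet of composite string keys stands in for an index):

// Hypothetical sketch of a skip scan over a sorted index of composite keys
// (e.g. "graph|subject|predicate"). After emitting the first key for a prefix,
// it seeks straight past every other key sharing that prefix instead of
// scanning them one by one.
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class SkipScanSketch {
    // Returns the distinct key prefixes of the given length present in the index.
    static List<String> distinctPrefixes(NavigableSet<String> index, int prefixLen) {
        List<String> prefixes = new ArrayList<>();
        String current = index.isEmpty() ? null : index.first();
        while (current != null) {
            String prefix = current.substring(0, Math.min(prefixLen, current.length()));
            prefixes.add(prefix);
            // Skip: the smallest key >= prefix + '\uffff' cannot share this prefix,
            // so the remaining keys in the prefix range are never visited.
            current = index.ceiling(prefix + '\uffff');
        }
        return prefixes;
    }

    public static void main(String[] args) {
        NavigableSet<String> index = new TreeSet<>(List.of(
                "g1|s1|p1", "g1|s1|p2", "g1|s2|p1",
                "g2|s1|p1", "g3|s9|p4"));
        System.out.println(distinctPrefixes(index, 2)); // prints [g1, g2, g3]
    }
}

The real iterator works on B+Tree records and byte-wise key prefixes rather than strings, but the skipping principle is the same: one seek replaces a linear scan over every record sharing an already-seen prefix.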

Used a couple of different test scenarios calling listGraphNodes():

  • 100k named graphs comprising a total of 125 million quads
    • On 4.6.1 this takes an average of 21,480 milliseconds
    • With this patch this is reduced to 18,235 milliseconds
  • 10 named graphs comprising a total of 177 million quads

This resolves #1639

@afs
Member

afs commented Dec 1, 2022

It is great to see this but it is now very close to 4.7.0.

@Aklakan
Contributor

Aklakan commented Dec 1, 2022

Thanks for this work! I recently had to fight with PostgreSQL because native skip scans are a missing feature there. The performance improvement from several seconds/minutes down to a few milliseconds is what I experienced there. Also, one of the Postgres extensions heavily advertised their skip scan implementation in a blog post; this may be an inspiration for how to advertise this feature when it's done.

@rvesse
Member Author

rvesse commented Dec 1, 2022

It is great to see this but it is now very close to 4.7.0.

Yeah, I think this one will have to be pushed out and given more time to be refined and settled. The change breaks a few tests, so there are clearly some corner cases I am not catching correctly yet!

@SimonBin
Contributor

SimonBin commented Dec 1, 2022

very nice work!!

SELECT ?e {
  GRAPH ?g {
    ?e spatial:withinBoxGeom ("POLYGON((19.49 50.62,26.87 50.626,26.87 46.43,19.49 46.43,19.49 50.62))"^^geo:wktLiteral 10)
  }
}

0.2 seconds

SELECT (count(distinct ?thing) as ?count)
WHERE {
    graph ?g {
        ?thing a/rdfs:subClassOf* coy:Powerplant .
    }
}

0.1 seconds

select distinct ?g { graph ?g {} } is unchanged

@SimonBin
Contributor

SimonBin commented Dec 1, 2022

The change breaks a few tests

didn't see any?

@rvesse
Member Author

rvesse commented Dec 2, 2022

The change breaks a few tests

didn't see any?

Already fixed them with my force push

@SimonBin
Contributor

SimonBin commented Dec 5, 2022

the quick + dirty way to make select ?g { graph ?g {} } fast:

AKSW@2b064df?w=1#diff-e82707e0fbbe7ca027b63ecbac2d24d5607432e492061ff8d1cc3f5c6318e2feR162-R167

not sure what to do about the filter though...

@Aklakan
Contributor

Aklakan commented Dec 6, 2022

Imho there should also be some optimization for mapping the common query DISTINCT ?g { GRAPH ?g { ?s ?p ?o } } (and possibly common variants of it - such as counting distinct graph names) to effectively OpDatasetNames so that the skip scan can be leveraged further.

For example,

SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } }

should be evaluated efficiently w.r.t. the presence of possibly empty graphs as

SELECT ?g {
  GRAPH ?g { }
  FILTER EXISTS { GRAPH ?g { ?s ?p ?o } } # Note: Won't expand infinitely because
                                          # we are not requesting DISTINCT ?g in the filter element
}

Of course this should then be managed as a follow-up issue to this one, but my questions right now are:

  • Would it make sense to handle it as a query rewrite (alternatively it could be e.g. part of the OpExecutor but then the logic is less reusable)?
  • Where would be the best place for that? Maybe a new optimizer under org.apache.jena.sparql.algebra.optimize?
  • Is there a mechanism to ask a DatasetGraph's metadata for whether empty graphs are disallowed? In that case the FILTER EXISTS could be omitted although I suppose it'd only be a minor improvement and maybe not really worth the effort. Yet, maybe there is already a context attribute?

@rvesse
Member Author

rvesse commented Dec 6, 2022

Let's not get ahead of ourselves here; there are still some bugs in this feature to be ironed out before it's ready for merging, i.e. DON'T expect it for 4.7.0

I've been working on some low level test cases for the new skip scan and it's definitely broken for some cases right now and until that's been addressed you won't see this in main

@afs
Member

afs commented Dec 6, 2022

the common query DISTINCT ?g { GRAPH ?g { ?s ?p ?o } }

Is it? Or is it a mistake for GRAPH ?g {}?

Would it make sense to handle it as a query rewrite

Adding too many "maybe" optimizations slows down fast, small queries. (We know this from BSBM.)

GRAPH ?g {} is distinct graph names.

The implementation exception might be DatasetGraphMapLink and DynamicDatasetGraph.

In DatasetGraphMap.listGraphNodes:

    // Hide empty graphs.

so it looks like it is a matter of copying that to DatasetGraphMapLink (caveat inference graphs).

DynamicDatasetGraph might be able to do better: do listGraphNames on the wrapped dataset and filter. But this might be worse. Numbers matter.

The right thing to do is to address this in the implementations in a separate PR, starting with some test cases.
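(For illustration only, not code from Jena or this PR: the "hide empty graphs" behaviour described above boils down to filtering the listed graph names by emptiness, roughly as in the hypothetical sketch below, where graphs stands in for the internal Map<Node, Graph> of a map-backed dataset.)

import java.util.Iterator;
import java.util.Map;
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Node;

class HideEmptyGraphsSketch {
    private final Map<Node, Graph> graphs;

    HideEmptyGraphsSketch(Map<Node, Graph> graphs) { this.graphs = graphs; }

    // List only the names of graphs that currently contain at least one triple.
    Iterator<Node> listGraphNodes() {
        return graphs.entrySet().stream()
                .filter(e -> !e.getValue().isEmpty())   // hide empty graphs
                .map(Map.Entry::getKey)
                .iterator();
    }
}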

@Aklakan
Contributor

Aklakan commented Dec 6, 2022

Is it? Or is it a mistake for GRAPH ?g {}?

I really meant GRAPH ?g { ?s ?p ?o} . On *cough* DBpedia:

So the "portable" query is the spo variant.

@afs
Member

afs commented Dec 6, 2022

On cough DBpedia:

The SPARQL endpoint or your local load into TDB2 with this PR applied?

@Aklakan
Contributor

Aklakan commented Dec 6, 2022

The SPARQL endpoint or your local load into TDB2 with this PR applied?

On the SPARQL endpoint - the graph patterns in my post are links to DBpedia.

@Aklakan
Contributor

Aklakan commented Dec 6, 2022

Adding too many "maybe" optimizations slows down fast, small queries. (We know this from BSBM.)

Yes, it might be useful as an opt-in though; OpGraphNames typically results in listGraphNames, so an algebra transform that injects OpGraphNames and FILTER EXISTS to filter out empty graphs should perform well independent of the implementation - but I see that this is becoming a bigger discussion, so let's continue in a separate issue.

@afs
Member

afs commented Dec 6, 2022

So the "portable" query is the spo variant.

It's not portable - it's a workaround. It might be a poor choice on another store.

There's an issue tracker for Virtuoso and the user list for Virtuoso is on SourceForge - has it been reported?
The users list is where they ask for bug reports, judging by the SO responses of "use the mailing list".

We're not here to fix DBpedia. It has several deviations from the specification.

Jena is open and you can submit reports - that can be overused. Email would be better.
Time spent looking at Jena code was wasted. Sigh. That's another step to 4.7.0 delayed.

@rvesse rvesse changed the title from "Fast TDB2 graph listing prototype (GH-1639)" to "Use a skip scan based iterator for listing graph names in TDB2 (GH-1639)" on Dec 9, 2022
@rvesse rvesse self-assigned this Dec 9, 2022
@rvesse rvesse added the "enhancement" (Incrementally add new feature) label on Dec 9, 2022
  • Adds a new BPTreeDistinctKeyPrefixIterator that allows iterating only records which are considered distinct based on a portion of their key. This is used to improve performance of DatasetGraphTDB.listGraphNodes(), in some cases dramatically so.
  • Applies the new distinct-by-key iterator to the SolverLibTDB.graphNames() path, ensuring that the rest of the logical flow there continues to work as before. Adds explanatory comments about the choices and optimisations involved.
  • Moves the repeated logic for selecting a suitable index into the TupleTable class and simplifies some code as a result.
  • Adds low level test cases for validating the behaviour of the distinct-by-key-prefix iterator.
@rvesse rvesse marked this pull request as ready for review January 3, 2023 15:03
@rvesse
Member Author

rvesse commented Jan 3, 2023

Now that 4.7.0 is out, we should be able to get this reviewed and merged so that users have time to start testing the updated SNAPSHOTs with this improvement.

@Aklakan
Contributor

Aklakan commented Jan 29, 2023

We have the skip scan in use in a dataset with around 1 billion triples and graph listings are super fast 👍

What would eventually also be needed is to make this feature publicly accessible in the various Tuple/DatasetGraph interfaces.
My proposal looks like this, but maybe someone already has better ideas:

// D = domain tuple type (e.g. Quad or Tuple<NodeId>), C = component type (e.g. Node or NodeId)
interface TupleMatcher4<D, C> {
  TupleStreamer<D, C> find(C g, C s, C p, C o, boolean distinct, int ... projectedColumns);
}

interface TupleStreamer<D, C> {
  Iterator<C> asComponents(); // e.g. Node or NodeIds
  Iterator<D> asDomainTuples(); // e.g. Quad
  Iterator<Tuple<C>> asGenericTuples();
}

This way a request for e.g. distinct predicates:

SELECT DISTINCT ?p { GRAPH ?g { ?s ?p ?o } }

could then map to a find() call such as

Iterator<Node> distinctPredicates = datasetGraph.find(ANY, ANY, ANY, ANY, true, 2).asComponents();

Furthermore, I wonder if the way TDB indexes data would be suitable for a skip scan for the case of retrieving a resource's distinct predicates:

SELECT DISTINCT ?p { GRAPH ?g { <concreteS> ?p ?o } }

The background is that we have resources with 4 million+ statements (yeah, not that usual) where a scan for distinct predicates takes seconds; maybe with the skip scan it would also be possible to speed this case up?
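(To make the question concrete, a rough, hypothetical sketch rather than TDB code: with the subject bound, a skip scan over an SPO-ordered index can seek past all objects of each predicate in one step. Class and method names are invented for the example.)

import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class DistinctPredicatesSketch {
    // Distinct second components for a fixed first component over "s|p|o" keys.
    static List<String> distinctPredicates(NavigableSet<String> spoIndex, String subject) {
        List<String> predicates = new ArrayList<>();
        String current = spoIndex.ceiling(subject + "|");
        while (current != null && current.startsWith(subject + "|")) {
            String predicate = current.split("\\|")[1];
            predicates.add(predicate);
            // Skip past every remaining (subject, predicate, *) key in one seek.
            current = spoIndex.ceiling(subject + "|" + predicate + "|\uffff");
        }
        return predicates;
    }

    public static void main(String[] args) {
        NavigableSet<String> spo = new TreeSet<>(List.of(
                "s1|p1|o1", "s1|p1|o2", "s1|p2|o1", "s2|p3|o1"));
        System.out.println(distinctPredicates(spo, "s1")); // prints [p1, p2]
    }
}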

@rvesse
Member Author

rvesse commented Feb 1, 2023

We have the skip scan in use in a dataset with around 1 billion triples and graph listings are super fast 👍

What would eventually also be needed is to make this feature publicly accessible in the various Tuple/DatasetGraph interfaces. My proposal looks like this, but maybe someone already has better ideas:

// D = domain tuple type (e.g. Quad or Tuple<NodeId>), C = component type (e.g. Node or NodeId)
interface TupleMatcher4<D, C> {
  TupleStreamer<D, C> find(C g, C s, C p, C o, boolean distinct, int ... projectedColumns);
}

interface TupleStreamer<D, C> {
  Iterator<C> asComponents(); // e.g. Node or NodeIds
  Iterator<D> asDomainTuples(); // e.g. Quad
  Iterator<Tuple<C>> asGenericTuples();
}

I don't expect that this sort of low level execution optimisation is ever going to bubble up into the high level end-user APIs like DatasetGraph.

I don't disagree that this iterator could be used to optimise execution of other query patterns but the goal here is to start small and incrementally improve.

This way a request for e.g. distinct predicates:

SELECT DISTINCT ?p { GRAPH ?g { ?s ?p ?o } }

could then map to a find() call such as

Iterator<Node> distinctPredicates = datasetGraph.find(ANY, ANY, ANY, ANY, true, 2).asComponents();

Furthermore, I wonder if the way TDB indexes data would be suitable for a skip scan for the case of retrieving a resource's distinct predicates:

SELECT DISTINCT ?p { GRAPH ?g { <concreteS> ?p ?o } }

The background is that we have resources with 4 million+ statements (yeah, not that usual) where a scan for distinct predicates takes seconds; maybe with the skip scan it would also be possible to speed this case up?

So, as I used to tell a solution architect I worked closely with at a past $dayjob, optimisation is fundamentally a trade-off. The goal of a general purpose optimiser is to apply optimisations that are generally useful to most users and most data, yielding performance improvements for the general case.

Detecting whether an optimisation is applicable has a non-zero cost, and for unusual query/dataset patterns, e.g. a subject with 4 million statements, a general purpose optimiser shouldn't try to optimise for that because it's so far outside its normal expectations.

For this kind of optimisation, where you have a specific query pattern that runs poorly on your dataset(s), I would suggest you consider creating your own custom optimiser (based off ARQ's default one), adding in optimisations for the specific query patterns you need specialised handling for. You can transform those into custom Op instances (derived from ARQ's OpExt extension point) and provide suitable eval() implementations that call into the relevant low level APIs as appropriate.

That gives you a way to experiment with some of these things outside of the Jena codebase and then potentially contribute them back later if they prove to be of more general value. But right now it seems like there's a lot of stuff that's very specific to your use cases that may not be generally applicable.

Member

@afs afs left a comment


One unused import.

Moves some of the up-front validation checks into the static create() method. Also does some short-circuit checking for cases where it can return a null/singleton iterator immediately without needing to actually create an iterator.
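(As a rough illustration of the pattern this commit describes, not the actual class: a static create() that validates its arguments up front and short-circuits to an empty or singleton iterator where it can. All names here are invented for the example.)

import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class DistinctPrefixes {
    // Validate up front, short-circuit trivial cases, otherwise build the general iterator.
    static Iterator<String> create(List<String> sortedKeys, int prefixLen) {
        if (prefixLen <= 0)
            throw new IllegalArgumentException("prefix length must be positive");
        if (sortedKeys.isEmpty())
            return Collections.emptyIterator();          // no records: nothing to iterate
        if (sortedKeys.size() == 1)                       // one record: exactly one distinct prefix
            return Collections.singleton(prefix(sortedKeys.get(0), prefixLen)).iterator();
        // General case; a real implementation would skip within the B+Tree here
        // rather than scanning, this stream is just a stand-in.
        return sortedKeys.stream().map(k -> prefix(k, prefixLen)).distinct().iterator();
    }

    private static String prefix(String key, int len) {
        return key.substring(0, Math.min(len, key.length()));
    }
}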
@rvesse rvesse merged commit f3a229b into apache:main Mar 20, 2023