Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow property path query for disjunctive form #725

Closed
rubensworks opened this issue Aug 26, 2020 · 12 comments
Closed

Slow property path query for disjunctive form #725

rubensworks opened this issue Aug 26, 2020 · 12 comments

Comments

@rubensworks
Copy link
Member

rubensworks commented Aug 26, 2020

Issue type:

  • 馃悓 Performance issue

Description:

This property path query is very slow:
http://query.linkeddatafragments.org/#transientDatasources=http%3A%2F%2Ffragments.dbpedia.org%2F2016-04%2Fen&query=SELECT%20%3Fperson%0AWHERE%20%7B%0A%20%20%5B%20rdfs%3Alabel%20%22Bruce%20Willis%22%40en%20%5D%0A%20%20%20%20(dbpedia-owl%3Aspouse%7Cdbpedia-owl%3Achild)%20%5B%20rdfs%3Alabel%20%3Fperson%20%5D.%0A%7D

Related to #709.


Environment:

Comunica version: 1.16.2

@rubensworks rubensworks added this to To Do (prio:medium) in Maintenance via automation Aug 26, 2020
@rubensworks rubensworks moved this from To Do (prio:medium) to To Do (prio:high) in Maintenance Aug 26, 2020
@rubensworks rubensworks self-assigned this Sep 26, 2020
@rubensworks
Copy link
Member Author

Just looked into this some more, and it's not a problem after all.
The identification of file source is intended after the first page.
The issue is simply performance due to a sub-optimal query plan.
So I'm changing the scope of this issue.

@rubensworks rubensworks moved this from To Do (prio:high) to To Do (prio:low) in Maintenance Oct 27, 2020
@github-actions
Copy link

Thanks for reporting!

@github-actions
Copy link

Thanks for reporting!

@rubensworks rubensworks changed the title TPF sometimes incorrectly being identified as file sources Slow property path query for disjunctive form Oct 27, 2020
@rubensworks rubensworks removed their assignment Oct 27, 2020
@github-actions
Copy link

Thanks for reporting!

@sroze
Copy link

sroze commented Jun 14, 2021

The issue is simply performance due to a sub-optimal query plan.

@rubensworks I'm currently exploring using Comunica and I have sub-optimal query plans and can't find any documentation about anything that would enable me to tell Cumunica specific costs, is that even a thing? If yes/no, how would you advise I explore having more optimal plans?

To be more specific, I have the following query:

SELECT ?foo WHERE {
  ?foo edge:one ?something.
  ?identity edge:two ?something.
}

?identity is a bound variable. I'm expecting the following matches:

  • <identity-value> edge:two {null}
  • {null} edge:one {values-from-above}

However, I get the following first match instead:

  • {null} edge:two {null}.

Given the cardinality is very high, it results in very slow queries.

@rubensworks
Copy link
Member Author

@sroze Your issues sounds more related to #548.

Comunica internally has a cost-based plan optimizer using the totalItems cardinality estimate metadata.
This can not be influenced by end-users though.
So this issue essentially requires the implementation of one or more actors to more intelligently plan the query.

Currently, BGPs are always handled by this actor: https://github.com/comunica/comunica/tree/master/packages/actor-query-operation-bgp-left-deep-smallest

@sroze
Copy link

sroze commented Jun 15, 2021

@rubensworks thank you for your answer; I've managed to make it work well with cost metadata for the above query 馃帀

However, I end up with the same problem with the following query:

SELECT ?foo WHERE {
  ?foo edge:one ?something.
  ?identity edge:two | edge:three ?something.
}

Am I right thinking that this is because the join operation (between the 1st bgp and 2nd path) does not take cost into account? If so, is it a matter of implementing another join strategy like described in #552?

@rubensworks
Copy link
Member Author

I've managed to make it work well with cost metadata for the above query 馃帀

Nice, how did you end up doing it?

Am I right thinking that this is because the join operation (between the 1st bgp and 2nd path) does not take cost into account? If so, is it a matter of implementing another join strategy like described in #552?

Indeed, that join will likely be sub-optimal. #552 will indeed be required to optimize that (many good techniques exist already, so it's just a matter of implementing them here).

@sroze
Copy link

sroze commented Jun 15, 2021

Nice, how did you end up doing it?

I'm using a RDF/JS store so I had to implement the countQuads method on it (to return rough estimates, given I know the shape of my data) and it did the trick (as well as never starting to produce the stream until it's actually being read).

many good techniques exist already, so it's just a matter of implementing them here

Do you have anything to point me towards? I'd be very keen to work on such an implementation. In terms of concrete steps, shall I be looking at replacing ActorQueryOperationJoin with another implementation?

@rubensworks
Copy link
Member Author

I'm using a RDF/JS store so I had to implement the countQuads method on it (to return rough estimates, given I know the shape of my data) and it did the trick (as well as never starting to produce the stream until it's actually being read).

Great, that's a good way to do it indeed.

Do you have anything to point me towards? I'd be very keen to work on such an implementation.

Not sure what your background is, but this may be helpfull to get started in the domain of query optimization: https://www.rubensworks.net/raw/slides/2021/ugent-webfundamentals-sparqlengines/

I think a good first step would be create a new BGP actor that is able to work using different join algorithms (#427).

Happy to give you further guidance if your willing to take this up! (other communication mediums may be better in that case...)

@sroze
Copy link

sroze commented Jun 15, 2021

Happy to give you further guidance if your willing to take this up! (other communication mediums may be better in that case...)

@rubensworks that's wonderful, I reached out to you on Gitter.

@rubensworks rubensworks moved this from To Do (prio:low) to To Do (prio:high) in Maintenance Jul 2, 2021
rubensworks added a commit that referenced this issue Oct 12, 2021
This significantly improves performance of property path queries

Closes #725
@rubensworks
Copy link
Member Author

Will be much faster in Comunica 2.x

Maintenance automation moved this from To Do (prio:high) to Done Oct 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Maintenance
  
Done
Development

No branches or pull requests

2 participants