-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test the distributed meta-index before #38
Comments
We could have something like the following. It would tell if an instance have some users for some skills or cities. But this won't tell if these users are the same. If the different targeted instances have users for the same skills and cities this would likely be useless as we would still query the whole distributed federation. The index, for the instance https://instance1.example.org could look like: ? a ex:SourceSelectionIndex.
:php a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "php";
ex:instancesIn <path/to/php/index>;
ex:hasSource <https://instance1.example.org>.
:css a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "css";
ex:instancesIn <path/to/css/index>;
ex:hasSource <https://instance1.example.org>.
:javascript a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "javascript";
ex:instancesIn <path/to/javascript/index>;
ex:hasSource <https://instance1.example.org>.
:paris a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasLocation;
ex:forValue "paris";
ex:instancesIn <path/to/paris/index>
ex:hasSource <https://instance1.example.org>. The SPARQL query to find instances (and their indexes) having users with some skills or cities: SELECT ?source, ?indexPhp, ?indexCss, ?indexJavascript, ?indexParis WHERE {
?hasPhp a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "php";
ex:instancesIn ?indexPhp;
ex:hasSource ?source.
?hasCss a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "css";
ex:instancesIn ?indexCss;
ex:hasSource ?source.
?hasJavascript a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "javascript";
ex:instancesIn ?indexJavascript;
ex:hasSource ?source.
?hasCity1 a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasLocation;
ex:forValue "paris";
ex:instancesIn ?indexParis;
ex:hasSource ?source.
} |
what's the added value of |
We discussed about it today. The |
I think the link traversal plays a huge role in these bad perfs especially because it dereferences every indexes on every instance (the meta indexes contain a link to all these indexes and Comunica assumes it can find some interesting triples in these too)! So we should really use named graphs or at least split our query into several queries (to simulate named graphs waiting for Comunica to be fixed - by us).
|
Thanks @lecoqlibre for the tests, not a good news I would say !
Any ideas of the steps to achieve that ? |
My suggestion to @lecoqlibre (in a mattermost discussion) is to split the query into a list of queries, where each query (apart from the last one) provides the sources against which the next query will be executed. |
So I think there is already a blocker at the level of source selection. One hybrid approach (distributed/federated) could be to have a federated index to do source selection. This index would allow us to know which instances could potentially have valid results (for example which instances have users for the skill 23, 24 and 25 and for Paris). Here is an example of what that federated index could look like: @prefix : <#>.
@prefix ex: <http://example.org#>.
<> a ex:SourceSelectionIndex.
:23 a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "23";
ex:instancesIn <http://localhost:8001>, <http://localhost:8003>, <http://localhost:8013>, <http://localhost:8022>. We could even have something more detailed with some stats like the number of matched users: @prefix : <#>.
@prefix ex: <http://example.org#>.
<> a ex:SourceSelectionIndex.
:23 a ex:SourceSelectionIndexRegistration;
ex:forProperty ex:hasSkill;
ex:forValue "23";
ex:results
[
ex:instancesIn <http://localhost:8001>;
ex:hasCount "8"
],
[
ex:instancesIn <http://localhost:8003>;
ex:hasCount "43"
],
[
ex:instancesIn <http://localhost:8013>;
ex:hasCount "12"
],
[
ex:instancesIn <http://localhost:8022>;
ex:hasCount "1"
]. Note: I used predicates which might be changed. |
I fixed the graphical interface so now I get a new instance every 2 seconds (1 skill + 1 city) and 7 seconds (3 skills + 1 city) in the results. We should not wait for the end of a previous query to execute the next one. With streams we can start the distributed exploration as soon as we get a result. So as soon as I get a link to a selected instance from the source selection query I can launch a query on this instance to get the first users. I will test this but I suppose that the idea of having a federated source selection index could still improve the perfs. |
I succeeded (ae6b5b5) in testing the source selection without link traversal starting from the 32 distributed meta indexes (one on each instance):
Note: here we are testing in a distributed fashion only. The federated index to do source selection has not been tested. I tried on 1 skill (23) + 1 city (Paris). The total execution time was about 35 sec. I got the first result at 22 sec. and the second result at 34 sec. This can be explained by the fact that the two results are coming from the 7th and the 11th instance on the list (selected instances are 1, 8, 10, 12, 14, 16, 23, 25, 27, 29, 31). The list is currently sorted by instance number. To get the results faster we should have some mechanism to sort the instances by relevance. This way, the most relevant instances will be queried first and the results should come faster. Also using a federated index to do source selection like proposed above can help to be faster. We can even link to specific distributed indexes directly from there to save a traversal. Any thoughts @FabienGandon @pchampin @balessan ? |
Are you waiting for all the results of one instance before starting to query the next one? If that's what you are doing, I believe that you could get much better performances by running the queries in parallel, and showing results as they arrive, regardless of the source. This can be achievable by storing multiple promises in an array, and using |
No I'm (normally) not. Like I said, as soon as I get a source I can execute a query on this instance. The
I'm indeed using an array of promises and wait for all to resolve to end the global query using |
It does indeed look like you are running all "subqueries" in parallel, which is good. Improvement needs to come from somewhere else then :) |
We want to have a meta index on each instance to try to have faster distributed indexes querying.
The text was updated successfully, but these errors were encountered: