Test the distributed meta-index before #38

lecoqlibre · 2024-04-15T08:02:01Z

We want to have a meta index on each instance to try to have faster distributed indexes querying.

lecoqlibre · 2024-04-15T09:24:47Z

We could have something like the following. It would tell if an instance have some users for some skills or cities. But this won't tell if these users are the same. If the different targeted instances have users for the same skills and cities this would likely be useless as we would still query the whole distributed federation.

The index, for the instance https://instance1.example.org could look like:

? a ex:SourceSelectionIndex.

:php a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "php";
    ex:instancesIn <path/to/php/index>;
    ex:hasSource <https://instance1.example.org>.

:css a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "css";
    ex:instancesIn <path/to/css/index>;
    ex:hasSource <https://instance1.example.org>.

:javascript a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "javascript";
    ex:instancesIn <path/to/javascript/index>;
    ex:hasSource <https://instance1.example.org>.

:paris a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasLocation;
    ex:forValue "paris";
    ex:instancesIn <path/to/paris/index>
    ex:hasSource <https://instance1.example.org>.

The SPARQL query to find instances (and their indexes) having users with some skills or cities:

SELECT ?source, ?indexPhp, ?indexCss, ?indexJavascript, ?indexParis WHERE {
    ?hasPhp a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasSkill;
        ex:forValue "php";
        ex:instancesIn ?indexPhp;
        ex:hasSource ?source.

    ?hasCss a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasSkill;
        ex:forValue "css";
        ex:instancesIn ?indexCss;
        ex:hasSource ?source.

    ?hasJavascript a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasSkill;
        ex:forValue "javascript";
        ex:instancesIn ?indexJavascript;
        ex:hasSource ?source.

    ?hasCity1 a ex:SourceSelectionIndexRegistration;
        ex:forProperty ex:hasLocation;
        ex:forValue "paris";
        ex:instancesIn ?indexParis;
        ex:hasSource ?source.
}

pchampin · 2024-04-16T12:41:48Z

what's the added value of ex:hasSource? Given that the value of ex:instancesIn is an IRI (I assume!), it will also contain the source...

lecoqlibre · 2024-04-17T13:23:55Z

We discussed about it today. The ex:hasSource allows to do a join to get the results from the same instance. We can avoid it by rewriting the query to make a join based on the extraction of the source from the ex:instancesIn. So we have to choose between a more complex SPARQL query or a more verbose index.

lecoqlibre · 2024-05-22T15:58:59Z

I tried the distributed meta indexes with the link traversal (ebfcedf) and the perfs are very bad. It's completely inusable in term of elapsed time and consumed resources (CPU and memory).

I think the link traversal plays a huge role in these bad perfs especially because it dereferences every indexes on every instance (the meta indexes contain a link to all these indexes and Comunica assumes it can find some interesting triples in these too)!

So we should really use named graphs or at least split our query into several queries (to simulate named graphs waiting for Comunica to be fixed - by us).

Re-try with a simulation of named graphs by executing several queries sequentially.

@FabienGandon @pchampin @balessan

balessan · 2024-05-23T08:04:25Z

Thanks @lecoqlibre for the tests, not a good news I would say !

So we should really use named graphs or at least split our query into several queries (to simulate named graphs waiting for Comunica to be fixed - by us).

Any ideas of the steps to achieve that ?

pchampin · 2024-05-23T11:41:09Z

Any ideas of the steps to achieve that ?

My suggestion to @lecoqlibre (in a mattermost discussion) is to split the query into a list of queries, where each query (apart from the last one) provides the sources against which the next query will be executed.

lecoqlibre · 2024-05-24T11:46:57Z

In order to split our query into a list of queries I first executed a query to try source selection: to know which instances have the selected skills and cities. This gives bad perfs and we are not using the link traversal. Indeed, for a single skill and a single city it took 25 sec. to complete the execution. For 3 skills and one city it was about 90 sec. (see below).

Here is the SPARQL query:

SELECT DISTINCT ?source WHERE {
  ?23 a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasSkill>;
    <http://example.org#forValue> "23";
    <http://example.org#hasSource> ?source.

  ?24 a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasSkill>;
    <http://example.org#forValue> "24";
    <http://example.org#hasSource> ?source.

  ?25 a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasSkill>;
    <http://example.org#forValue> "25";
    <http://example.org#hasSource> ?source.

  ?paris a <http://example.org#SourceSelectionIndexRegistration>;
    <http://example.org#forProperty> <http://example.org#hasLocation>;
    <http://example.org#forValue> "paris";
    <http://example.org#hasSource> ?source.
} LIMIT 100

The results of this query are:

The targeted sources are the distributed meta indexes:

http://localhost:8001/indexes/root
http://localhost:8002/indexes/root
http://localhost:8003/indexes/root
http://localhost:8004/indexes/root
http://localhost:8005/indexes/root
http://localhost:8006/indexes/root
http://localhost:8007/indexes/root
http://localhost:8008/indexes/root
http://localhost:8009/indexes/root
http://localhost:8010/indexes/root
http://localhost:8011/indexes/root
http://localhost:8012/indexes/root
http://localhost:8013/indexes/root
http://localhost:8014/indexes/root
http://localhost:8015/indexes/root
http://localhost:8016/indexes/root
http://localhost:8017/indexes/root
http://localhost:8018/indexes/root
http://localhost:8019/indexes/root
http://localhost:8020/indexes/root
http://localhost:8021/indexes/root
http://localhost:8022/indexes/root
http://localhost:8023/indexes/root
http://localhost:8024/indexes/root
http://localhost:8025/indexes/root
http://localhost:8026/indexes/root
http://localhost:8027/indexes/root
http://localhost:8028/indexes/root
http://localhost:8029/indexes/root
http://localhost:8030/indexes/root
http://localhost:8031/indexes/root
http://localhost:8032/indexes/root

lecoqlibre · 2024-05-24T11:58:43Z

So I think there is already a blocker at the level of source selection.

One hybrid approach (distributed/federated) could be to have a federated index to do source selection. This index would allow us to know which instances could potentially have valid results (for example which instances have users for the skill 23, 24 and 25 and for Paris).

Here is an example of what that federated index could look like:

@prefix : <#>.
@prefix ex: <http://example.org#>.

<> a ex:SourceSelectionIndex.
    
:23 a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "23";
    ex:instancesIn <http://localhost:8001>, <http://localhost:8003>, <http://localhost:8013>, <http://localhost:8022>.

We could even have something more detailed with some stats like the number of matched users:

@prefix : <#>.
@prefix ex: <http://example.org#>.

<> a ex:SourceSelectionIndex.
    
:23 a ex:SourceSelectionIndexRegistration;
    ex:forProperty ex:hasSkill;
    ex:forValue "23";
    ex:results 
        [
            ex:instancesIn <http://localhost:8001>;
            ex:hasCount "8"
        ],
        [
            ex:instancesIn <http://localhost:8003>;
            ex:hasCount "43"
        ],
        [
            ex:instancesIn <http://localhost:8013>;
            ex:hasCount "12"
        ],
        [
            ex:instancesIn <http://localhost:8022>;
            ex:hasCount "1"
        ].

Note: I used predicates which might be changed.

@FabienGandon @pchampin @balessan

lecoqlibre · 2024-05-24T15:34:18Z

I fixed the graphical interface so now I get a new instance every 2 seconds (1 skill + 1 city) and 7 seconds (3 skills + 1 city) in the results. We should not wait for the end of a previous query to execute the next one. With streams we can start the distributed exploration as soon as we get a result. So as soon as I get a link to a selected instance from the source selection query I can launch a query on this instance to get the first users.

I will test this but I suppose that the idea of having a federated source selection index could still improve the perfs.

lecoqlibre · 2024-05-28T12:34:37Z

I succeeded (ae6b5b5) in testing the source selection without link traversal starting from the 32 distributed meta indexes (one on each instance):

one query is used to select the relevant indexes on the distributed instances;
a second query is executed on each relevant distributed instances (using the previously selected indexes as sources).

Note: here we are testing in a distributed fashion only. The federated index to do source selection has not been tested.

I tried on 1 skill (23) + 1 city (Paris). The total execution time was about 35 sec. I got the first result at 22 sec. and the second result at 34 sec. This can be explained by the fact that the two results are coming from the 7th and the 11th instance on the list (selected instances are 1, 8, 10, 12, 14, 16, 23, 25, 27, 29, 31). The list is currently sorted by instance number.

To get the results faster we should have some mechanism to sort the instances by relevance. This way, the most relevant instances will be queried first and the results should come faster.

Also using a federated index to do source selection like proposed above can help to be faster. We can even link to specific distributed indexes directly from there to save a traversal.

Any thoughts @FabienGandon @pchampin @balessan ?

pchampin · 2024-05-29T12:56:37Z

Are you waiting for all the results of one instance before starting to query the next one?

If that's what you are doing, I believe that you could get much better performances by running the queries in parallel, and showing results as they arrive, regardless of the source.

This can be achievable by storing multiple promises in an array, and using Promise.any to get the result from the first one to succeeds, and repeating this until none of them has anything on stock.

lecoqlibre · 2024-05-29T17:51:12Z

Are you waiting for all the results of one instance before starting to query the next one?

No I'm (normally) not. Like I said, as soon as I get a source I can execute a query on this instance. The bindingsStream.on('data', callback) of the used strategy executes a new query as soon as a source is found. So several queries are normally running in parallel. You can look at the previously mentioned commit ae6b5b5, especially the line 56 of the StrategyComunicaSourceSelection class.

This can be achievable by storing multiple promises in an array

I'm indeed using an array of promises and wait for all to resolve to end the global query using Promise.all, see line 61.

pchampin · 2024-05-29T20:02:43Z

It does indeed look like you are running all "subqueries" in parallel, which is good. Improvement needs to come from somewhere else then :)
Taking advantage of the hasCount, as suggested above, looks promising.

lecoqlibre mentioned this issue Apr 15, 2024

SIB Next-steps after demonstration #36

Open

18 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test the distributed meta-index before #38

Test the distributed meta-index before #38

lecoqlibre commented Apr 15, 2024 •

edited

Loading

lecoqlibre commented Apr 15, 2024 •

edited

Loading

pchampin commented Apr 16, 2024

lecoqlibre commented Apr 17, 2024

lecoqlibre commented May 22, 2024 •

edited

Loading

balessan commented May 23, 2024

pchampin commented May 23, 2024

lecoqlibre commented May 24, 2024 •

edited

Loading

lecoqlibre commented May 24, 2024 •

edited

Loading

lecoqlibre commented May 24, 2024 •

edited

Loading

lecoqlibre commented May 28, 2024

pchampin commented May 29, 2024

lecoqlibre commented May 29, 2024

pchampin commented May 29, 2024

Test the distributed meta-index before #38

Test the distributed meta-index before #38

Comments

lecoqlibre commented Apr 15, 2024 • edited Loading

lecoqlibre commented Apr 15, 2024 • edited Loading

pchampin commented Apr 16, 2024

lecoqlibre commented Apr 17, 2024

lecoqlibre commented May 22, 2024 • edited Loading

balessan commented May 23, 2024

pchampin commented May 23, 2024

lecoqlibre commented May 24, 2024 • edited Loading

lecoqlibre commented May 24, 2024 • edited Loading

lecoqlibre commented May 24, 2024 • edited Loading

lecoqlibre commented May 28, 2024

pchampin commented May 29, 2024

lecoqlibre commented May 29, 2024

pchampin commented May 29, 2024

lecoqlibre commented Apr 15, 2024 •

edited

Loading

lecoqlibre commented Apr 15, 2024 •

edited

Loading

lecoqlibre commented May 22, 2024 •

edited

Loading

lecoqlibre commented May 24, 2024 •

edited

Loading

lecoqlibre commented May 24, 2024 •

edited

Loading

lecoqlibre commented May 24, 2024 •

edited

Loading