
Bad JOIN ORDER when joining English Wikipedia names to entities (42 vs. 2 seconds) #310

Open · hannahbast opened this issue on Jan 18, 2020 · 1 comment

The following query is currently executed with a very suboptimal join order on Wikidata Full; more details below the query:

PREFIX schema: <http://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?sw ?so WHERE {
  ?s wdt:P17 ?o .
  ?sw schema:about ?s .
  ?sw schema:isPartOf <https://en.wikipedia.org/> .
  ?so schema:about ?o .
  ?so schema:isPartOf <https://en.wikipedia.org/> .
}

The optimal join order would be as in the following equivalent query, which forces that order by using subqueries and needs less than 2 seconds:

PREFIX schema: <http://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?sw ?so WHERE {
  ?s wdt:P17 ?o .
  { SELECT ?s ?sw WHERE {  
    ?sw schema:about ?s .
    ?sw schema:isPartOf <https://en.wikipedia.org/> . } }
  { SELECT ?o ?so WHERE {  
    ?so schema:about ?o .
    ?so schema:isPartOf <https://en.wikipedia.org/> . } }
}

Instead, QLever chooses the join order expressed by the following query, which takes 42 seconds to execute:

PREFIX schema: <http://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?sw ?so WHERE {
  { SELECT ?s ?sw ?o ?so WHERE {
      ?s wdt:P17 ?o .
      { SELECT ?s ?sw WHERE {
          ?sw schema:about ?s .
          ?sw schema:isPartOf <https://en.wikipedia.org/> . } }
      ?so schema:about ?o . } }
  ?so schema:isPartOf <https://en.wikipedia.org/> .
}
@hannahbast (Member, Author) commented:

Update from a meeting of @hannahbast and @joka921 on 22.01.2020:

We examined the inner workings of the query optimizer for the following query (we had initially looked at a slightly more complex query, but it turns out that this one already provides all the necessary insight):

PREFIX schema: <http://schema.org/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?s ?so WHERE {
  ?s wdt:P17 ?o .
  ?so schema:about ?o .
}

For the triple ?s wdt:P17 ?o, the number of matching triples is s_1 ≈ 12M, the number of distinct elements for the object is d_1 ≈ 2K, and hence the average multiplicity of an object is m_1 = s_1 / d_1 ≈ 6K.

For the triple ?so schema:about ?o, the number of matching triples is s_2 ≈ 147M, the number of distinct elements for the object is d_2 ≈ 76M, and hence the average multiplicity of an object is m_2 = s_2 / d_2 ≈ 2.

From this, the query optimizer estimates the result size of the join as s_12 = α · m_1 · m_2 · min(d_1, d_2) ≈ α · 6K · 2 · min(2K, 76M) = α · 24M ≈ 17M.

Note that α = 0.7 in the code; see src/engine/QueryPlanningCostFactors.cpp. This corresponds to the (rather crude) assumption that in a join operation, 70% of the smaller set of join-column elements are contained in the set of elements from the other join column (the worst case would be 100%; if the elements were randomly and independently distributed, the percentage would be much smaller).
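
For illustration, here is a minimal, self-contained sketch of this size estimate. This is not QLever's actual code; the struct and function names are made up, only the formula and the constants are taken from the discussion above.

#include <algorithm>
#include <iostream>

// Sketch of the multiplicity-based join size estimate described above.
// Illustrative only; not QLever's actual implementation.
struct ScanEstimate {
  double s;  // number of matching triples
  double d;  // number of distinct elements in the join column
  double multiplicity() const { return s / d; }  // m = s / d
};

// s_12 = alpha * m_1 * m_2 * min(d_1, d_2), with alpha = 0.7 as in
// src/engine/QueryPlanningCostFactors.cpp.
double estimateJoinSize(const ScanEstimate& a, const ScanEstimate& b,
                        double alpha = 0.7) {
  return alpha * a.multiplicity() * b.multiplicity() * std::min(a.d, b.d);
}

int main() {
  ScanEstimate p17{12e6, 2e3};      // ?s wdt:P17 ?o, join column ?o
  ScanEstimate about{147e6, 76e6};  // ?so schema:about ?o, join column ?o
  // Prints ≈ 1.6e7, i.e. the ≈ 17M from the calculation above.
  std::cout << estimateJoinSize(p17, about) << "\n";
}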

However, the actual query result has 76M rows, about 4.5 times the estimate. The reasons for this bad estimate are twofold. First, for this particular join, a value of α = 1.0 would have been appropriate, because the schema:about predicate contains a ?so for every ?o, not just for 70% of them. Second, it is true that the average ?o has only m_2 = 2 ?so in schema:about, but the ?o from the first triple are all countries, and these have an average of six ?so (namely: one meta-node and five Wikipedia articles); indeed, 12M triples times roughly six ?so per country roughly matches the observed 76M rows.

It is not possible to get a better estimate for this join from the currently available predicate metadata (number of triples, multiplicities, functional or not).

One relatively simple kind of information that would have helped here is that the schema:about predicate is almost "complete" on a given ID range, meaning that every object ID in that range has at least one subject. If the IDs from the other side of the join are contained in that range (which is easy to check), this could be used to set α = 1.0.
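
A minimal sketch of how such a check could choose α (hypothetical types and names; no such metadata exists in QLever yet):

#include <cstdint>

using Id = uint64_t;

// Hypothetical per-predicate metadata: within [first, last], every ID
// occurs at least once in the join column of this predicate.
struct CompleteRange {
  Id first;
  Id last;
};

// If all join-column IDs of the other side fall into a "complete" range,
// (almost) every element finds a join partner, so alpha = 1.0 is the
// appropriate choice; otherwise fall back to the generic constant from
// src/engine/QueryPlanningCostFactors.cpp.
double chooseAlpha(const CompleteRange& complete, Id otherMin, Id otherMax) {
  return (otherMin >= complete.first && otherMax <= complete.last) ? 1.0 : 0.7;
}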

What would also help here is sampling: if a sample from one join column is contained (to a large percentage) in the other join column, then with high probability the same holds for the whole column, and the sampled containment percentage could also be used to better estimate the multiplicities of the result.
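
Again a purely illustrative sketch (the sample size and the fully materialized columns are simplifications; a real implementation would sample from the index without materializing anything):

#include <algorithm>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

using Id = uint64_t;

// Draw a random sample of up to k elements from a join column.
std::vector<Id> randomSample(std::vector<Id> column, size_t k,
                             std::mt19937& rng) {
  std::shuffle(column.begin(), column.end(), rng);
  column.resize(std::min(k, column.size()));
  return column;
}

// Fraction of the sampled elements of column A that also occur in column B.
// If this fraction is high, with high probability a similar fraction of
// all of A's join-column elements has a partner in B, and the fraction
// can be used directly as the alpha of the size estimate above.
double containmentFraction(const std::vector<Id>& sampleA,
                           const std::vector<Id>& columnB) {
  std::unordered_set<Id> b(columnB.begin(), columnB.end());
  size_t hits = 0;
  for (Id id : sampleA) hits += b.count(id);
  return sampleA.empty() ? 0.0 : static_cast<double>(hits) / sampleA.size();
}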
