Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong skiplist index usage in edge collection #2876

Closed
5 tasks done
ThomasWeiss opened this issue Jul 26, 2017 · 14 comments
Closed
5 tasks done

Wrong skiplist index usage in edge collection #2876

ThomasWeiss opened this issue Jul 26, 2017 · 14 comments
Assignees
Labels
2 Fixed Resolution 3 AQL Query language related 3 Optimizer AQL query optimizer is involved performance
Milestone

Comments

@ThomasWeiss
Copy link

my environment running ArangoDB

I'm using the latest ArangoDB of the respective release series:

  • 3.1

Mode:

  • Single-Server

Storage-Engine:

  • mmfiles

On this operating system:

  • Windows, version: 10

this is an AQL-related issue:

I'm issuing AQL via:

  • web interface with this browser: running on this OS: Windows

  • In an edge collection named hasPublished, I create a skiplist index on the following fields: ['_from', 'publishedAt']

  • I explain the following query: for p in hasPublished filter p._from == 'from' sort p.publishedAt desc return p and get:

Execution plan:
 Id   NodeType          Est.   Comment
  1   SingletonNode        1   * ROOT
  8   IndexNode            1     - FOR p IN hasPublished   /* edge index scan */
  5   CalculationNode      1       - LET #3 = p.`publishedAt`   /* attribute expression */   /* collections used: p : hasPublished */
  6   SortNode             1       - SORT #3 DESC
  7   ReturnNode           1       - RETURN p

Indexes used:
 By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
  8   edge   hasPublished   false    false        76.09 %   [ `_from`, `_to` ]   (p.`_from` == "from")
  • I would expect to get the following:
Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  8   IndexNode          2     - FOR p IN hasPublished   /* reverse skiplist index scan */
  7   ReturnNode         2       - RETURN p

Indexes used:
 By   Type       Collection   Unique   Sparse   Selectivity  Fields                      Ranges
  8   skiplist   hasPublished false    false            n/a  [ `_from`, `publishedAt` ]  (s.`_from` == "user")
@mchacki
Copy link
Member

mchacki commented Jul 26, 2017

I would expect that too, thanks for reporting, managed to reproduce.
Need some time to investigate.

As a quick fix you could do the following:

FOR p IN hasPublished
  FILTER p._from == 'from'
  AND p.publishedAt >= null // << NOTE HERE, an always true condition
  SORT p.publishedAt DESC
  RETURN p

@mchacki mchacki self-assigned this Jul 26, 2017
@mchacki mchacki added 3 Optimizer AQL query optimizer is involved performance labels Jul 26, 2017
@ThomasWeiss
Copy link
Author

Your quick fix returns the same execution plan and doesn't use the right index either (3.1.21 on Windows)

@mchacki
Copy link
Member

mchacki commented Jul 26, 2017

Wierd...

arangod> db._version()
3.1.21
arangod> db._explain('FOR p IN col FILTER p._from == "v/1" && p.publishedAt >= null SORT p.publishedAt DESC RETURN p')
Query string:
 FOR p IN col FILTER p._from == "v/1" && p.publishedAt >= null SORT p.publishedAt DESC RETURN p

Execution plan:
 Id   NodeType        Est.   Comment
  1   SingletonNode      1   * ROOT
  8   IndexNode          2     - FOR p IN col   /* reverse skiplist index scan */
  7   ReturnNode         2       - RETURN p

Indexes used:
 By   Type       Collection   Unique   Sparse   Selectivity   Fields                       Ranges
  8   skiplist   col       false    false            n/a   [ `_from`, `publishedAt` ]   ((p.`_from` == "v/1") && (p.`publishedAt` >= null))

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index
  7   use-index-for-sort
  8   remove-unnecessary-calculations-2

Seems to work on my end...
Can you try and and modify the SORT statement:

SORT p._from DESC, p.publishedAt DESC

@mchacki
Copy link
Member

mchacki commented Jul 26, 2017

Ah is your Skiplist sparse?
If so than replace p.publishedAt >= null with p.publishedAt > null

@ThomasWeiss
Copy link
Author

Tried all your recommendations but for some reasons, the query is analyzed differently on my machine :)

Query string:
 FOR p IN hasPublished FILTER p._from == "v/1" && p.publishedAt >= null SORT p.publishedAt DESC 
 RETURN p

Execution plan:
 Id   NodeType          Est.   Comment
  1   SingletonNode        1   * ROOT
  8   IndexNode            1     - FOR p IN hasPublished   /* edge index scan */
  9   CalculationNode      1       - LET #1 = (p.`publishedAt` >= null)   /* simple expression */   /* collections used: p : hasPublished */
  4   FilterNode           1       - FILTER #1
  5   CalculationNode      1       - LET #3 = p.`publishedAt`   /* attribute expression */   /* collections used: p : hasPublished */
  6   SortNode             1       - SORT #3 DESC
  7   ReturnNode           1       - RETURN p

Indexes used:
 By   Type   Collection     Unique   Sparse   Selectivity   Fields               Ranges
  8   edge   hasPublished   false    false        76.09 %   [ `_from`, `_to` ]   (p.`_from` == "v/1")

Optimization rules applied:
 Id   RuleName
  1   move-calculations-up
  2   move-filters-up
  3   move-calculations-up-2
  4   move-filters-up-2
  5   use-indexes
  6   remove-filter-covered-by-index

@mchacki
Copy link
Member

mchacki commented Jul 26, 2017

OK thanks for checking, then i need time to investigate, no more quick fix ideas in my head sorry.

@ThomasWeiss
Copy link
Author

No worries, thanks a lot for your prompt reply and trying to find a workaround. I don't know if the query parser takes index selectivity into account? (which may explain the difference between your results and mine)

@mchacki
Copy link
Member

mchacki commented Jul 26, 2017

Yes it takes the selectivity into account, but afaik the SORTING should always overrule FILTER only.
Especially in the case where FILTER and SORT both use _from and publishedAt attributes.
Indexes always get an "upvote" for every attribute they cover.

@mchacki
Copy link
Member

mchacki commented Jul 26, 2017

So i get exactly your plan if and only if my skiplist is sparse
In this case it decides that it cannot be used for sorting.
Is this the case on your side?

@ThomasWeiss
Copy link
Author

Nope my skiplist isn't parse:
image

@jsteemann
Copy link
Contributor

I think which index is picked here may depend on the amount of documents in the collection and the selectivity of the indexes.
If these values are different in your setups, then different indexes may be picked. What is the actual number of documents in the collection, and how "unique" are "_from" and the combination of "_from" and "publishedAt" in this collection?

@ThomasWeiss
Copy link
Author

True that I was trying the query on my dev setup with a nearly empty DB, so:

  1. I tried on the staging DB, where the hasPublished collection contains 2,600+ documents and:
  • the [_from, _to] edge index selectivity is around 46%
  • most _from fields are the same
  • most publishedAt fields are different (they are timestamps)

-> so the combination of the 2 would be rather unique. Still, it uses the [_from, _to] edge index and not the skiplist index.

  1. I went back to my dev DB and purged the collection, which brought the [_from, _to] edge index selectivity to 100% and tried the query again -> this time, the skiplist index got used!

So it would appear that the skiplist index would only be used when the selectivity of the [_from, _to] edge index is very high?

jsteemann added a commit that referenced this issue Jul 28, 2017
@jsteemann jsteemann added the 3 AQL Query language related label Jul 28, 2017
@jsteemann jsteemann assigned jsteemann and unassigned mchacki Jul 28, 2017
@jsteemann jsteemann added this to the 3.2.1 milestone Jul 28, 2017
@jsteemann
Copy link
Contributor

Should be better with version 3.2.1.
However, the optimizer may still "guess wrong" when multiple indexes are candidates for the query, and at least one of them has no selectivity estimate, such as the skiplist index.
But it seems that skiplist indexes have been overly penalized before.

@jsteemann jsteemann added the 2 Fixed Resolution label Jul 28, 2017
fceller pushed a commit that referenced this issue Jul 30, 2017
ObiWahn added a commit that referenced this issue Aug 3, 2017
…ture/js_to_cpp_transaction_handler

* 'devel' of https://github.com/arangodb/arangodb:
  fixed some issues detected by coverity scan (#2915)
  remove dependency on MMFiles features from non-MMFiles files (#2925)
  Converted rest handler for explain from JS to C++. (#2907)
  added startsBefore() for ApplicationFeature (#2913)
  Converted a portion of the admin routing API from JS to C++ (#2919)
  Bug fix/predictable results data modifcation multiple fors (#2921)
  Feature/cpp aql char length (#2883)
  slightly move responsibility for recovery (#2922)
  Feature/issue 387 cluster index estimates (#2866)
  Converted endpoint handler from JS to C++ (#2905)
  switch to trusty
  fixed issue #2876 (#2896)
  fix a race on shutdown (#2897)
  fixed issue #2868: cname missing from logger-follow results in rocksdb (#2901)
  make the V8 feature depend on the authentication feature (#2902)
  change "parameters" to "bindVars" (#2772)
  Do not allow replication to create/drop collections (#2898)
  don't mask errors with fake OOM messages (#2872)
@jsteemann
Copy link
Contributor

3.2.1 has been released and is available for download.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 Fixed Resolution 3 AQL Query language related 3 Optimizer AQL query optimizer is involved performance
Projects
None yet
Development

No branches or pull requests

3 participants