DLS search performance/canMatch impact #46817

Closed · henningandersen opened this issue Sep 18, 2019 · 13 comments · Fixed by #47816
Labels
>enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC v7.3.0

Comments

henningandersen (Contributor) commented Sep 18, 2019

During IndexShard.acquireSearcher, the readerWrapper is applied, which populates the DLS bitsets. This causes a performance issue because IndexShard.acquireSearcher is also called during the canMatch phase. In that context the work seems unnecessary: we only need to know whether the shard could match at all, which does not require DLS to kick in. When searching a short time range across a large number of time-based indices, this adds avoidable overhead.
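
For context, the pattern that triggers this is a narrow time-range search fanned out over many time-based indices, which is exactly the case the canMatch pre-filter is meant to handle cheaply. A hedged illustration (the index pattern and timestamp field are assumptions, not taken from the report):

POST /logs-*/_search
{
    "query": {
        "range": {
            "@timestamp": {
                "gte": "now-15m"
            }
        }
    }
}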

Concrete observations: A query that takes 30ms can take 26s for a DLS filtered user. The user has many DLS roles applied, which seems to increase the impact of the issue. The hot threads are dominated by variants of this stack trace:

app//org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc(DisjunctionDISIApproximation.java:55)
app//org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc(DisjunctionDISIApproximation.java:55)
app//org.apache.lucene.util.BitSet.or(BitSet.java:95)
app//org.apache.lucene.util.FixedBitSet.or(FixedBitSet.java:271)
app//org.apache.lucene.util.BitSet.of(BitSet.java:41)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetBitsetCache.lambda$getBitSet$2(DocumentSubsetBitsetCache.java:153)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetBitsetCache$$Lambda$5134/0x000000080204e840.load(Unknown Source)
app//org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetBitsetCache.getBitSet(DocumentSubsetBitsetCache.java:135)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetReader.<init>(DocumentSubsetReader.java:160)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetReader.<init>(DocumentSubsetReader.java:34)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetReader$DocumentSubsetDirectoryReader$1.wrap(DocumentSubsetReader.java:120)
app//org.apache.lucene.index.FilterDirectoryReader$SubReaderWrapper.wrap(FilterDirectoryReader.java:62)
app//org.apache.lucene.index.FilterDirectoryReader.<init>(FilterDirectoryReader.java:91)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetReader$DocumentSubsetDirectoryReader.<init>(DocumentSubsetReader.java:116)
org.elasticsearch.xpack.core.security.authz.accesscontrol.DocumentSubsetReader.wrap(DocumentSubsetReader.java:38)
org.elasticsearch.xpack.core.security.authz.accesscontrol.SecurityIndexReaderWrapper.apply(SecurityIndexReaderWrapper.java:86)
org.elasticsearch.xpack.core.security.authz.accesscontrol.SecurityIndexReaderWrapper.apply(SecurityIndexReaderWrapper.java:42)
app//org.elasticsearch.index.shard.IndexShard.wrapSearcher(IndexShard.java:1262)
app//org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:1240)
app//org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:1224)
app//org.elasticsearch.search.SearchService.createSearchContext(SearchService.java:632)
app//org.elasticsearch.search.SearchService.canMatch(SearchService.java:1007)
app//org.elasticsearch.search.SearchService.canMatch(SearchService.java:1020)
app//org.elasticsearch.action.search.SearchTransportService.lambda$registerRequestHandler$14(SearchTransportService.java:396)
app//org.elasticsearch.action.search.SearchTransportService$$Lambda$3005/0x0000000801b4d840.messageReceived(Unknown Source)

These all run directly in the transport thread.

Encountered on ES v7.3.0, running on Linux.

henningandersen added the :Security/Authorization and v7.3.0 labels Sep 18, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-security

gwbrown (Contributor) commented Sep 18, 2019

I have seen this as well, with similar differences between filtered and unfiltered queries. The hot_threads stack traces look almost exactly the same. In the case I saw, the query filters on the role were fairly straightforward: most were a boolean query of two match queries, although there were a few with wildcard queries. I'll update here if there's anything more I can add.
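
For illustration only, a role query of the shape described here might look roughly like this (a sketch; the field names and values are made up, not taken from the actual roles):

{
    "bool": {
        "must": [
            { "match": { "department": "engineering" } },
            { "match": { "environment": "production" } }
        ]
    }
}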

henningandersen (Contributor, Author) commented:

We discussed this on another channel and while it seems there could be room for improvement here, it is not a bug. I have converted this into an enhancement request instead.

cpmoore commented Oct 8, 2019

I have seen similar slowness in our environment for users who have DLS roles applied. I would be very happy to see this enhancement in a future release.

jimczi (Contributor) commented Oct 9, 2019

I wonder if we can load the bitset eagerly, like we do for nested fields, for instance. This would slow down the refresh of readers, but since the cache is per segment it shouldn't make a difference unless a big merge happened. This wouldn't eliminate the possibility of the bitset being regenerated, since the security cache is bounded, but it could help in the majority of cases (where the number of role queries is under control).
It is really a bug that such a costly operation can happen during the can_match phase, but I don't see how to fix it nicely. We need to run the costly operation to get simple things like the number of docs and the number of deleted docs, and this information is used during the can_match phase, so we cannot skip the operation completely.
We could skip can_match entirely if indices have DLS enabled, but that doesn't feel right either.
@henningandersen can you explain why we don't consider this issue a bug?
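
For reference, the nested-field analogue mentioned above is the eager loading of fixed bitset filters, which, to the best of my recollection (treat the setting name as an assumption), is an index setting that already defaults to true. Shown here only to illustrate the mechanism, with a placeholder index name:

PUT /my-index
{
    "settings": {
        "index.load_fixed_bitset_filters_eagerly": true
    }
}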

cpmoore commented Oct 9, 2019

To give a better idea of the performance impact of DLS, I ran some simple queries as a test.
service.group is a keyword field in our mapping. I redacted the actual values in the query, but you get the gist.

First as my user without any access restrictions:

POST /filebeat-*/_search
{
    "query": {
       "match_all":{}
    }
}
{
    "took": 39,
     ...
}
POST /filebeat-*/_search
{
    "query": {
        "terms": {
            "service.group": [
                "group1",
                "group2",
                "group3",
                "group4"
            ]
        }
    }
}
{
    "took": 470,
     ...
}

Then as a test user, with the terms query from above applied as a DLS query:

POST /filebeat-*/_search
{
  "query":{
  	"match_all":{}
  }	
}
{
    "took": 5383,
     ...
}
POST /filebeat-*/_search
{
    "query": {
        "terms": {
            "service.group": [
                "group1",
                "group2",
                "group3",
                "group4"
            ]
        }
    }
}
{
    "took": 5677,
     ...
}

Am I wrong in believing that the terms query in my request should take roughly the same amount of time as a match_all query from a user who has that terms query as a DLS filter?
Instead it takes 5 seconds longer.
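
For completeness, a role with that terms query as its DLS filter would be defined roughly like this (a sketch; the role name and privileges are assumptions, only the query mirrors the one above):

PUT /_security/role/filebeat_dls_test
{
    "indices": [
        {
            "names": [ "filebeat-*" ],
            "privileges": [ "read" ],
            "query": {
                "terms": {
                    "service.group": [ "group1", "group2", "group3", "group4" ]
                }
            }
        }
    ]
}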

henningandersen (Contributor, Author) commented:

@jimczi I guess this is borderline. ES works fine if configured with a high enough xpack.security.dls.bitset.cache.size, though there is room for lowering some of the search tail latencies by not having to initialize this at search time as new segments are created or data is aged out of the cache.
On the other hand, this is doing extraordinarily long operations (not sure whether that includes IO?) on a transport thread, which could be regarded as a bug, but is not unprecedented.
I am open to either interpretation.

henningandersen (Contributor, Author) commented:

@cpmoore which release are you on? Above 7.3 you might want to set xpack.security.dls.bitset.cache.size to a value higher than the default 50MB when using DLS.
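
For example, a minimal sketch of raising it in elasticsearch.yml (as far as I know this is a static node setting, so each node needs a restart; the value is illustrative):

# elasticsearch.yml on each node holding DLS-protected data
xpack.security.dls.bitset.cache.size: 256mb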

jimczi (Contributor) commented Oct 9, 2019

On the other hand, this is doing extraordinarily long operations (not sure whether that includes IO?) on a transport thread, which could be regarded as a bug, but is not unprecedented.

That's the part that I care about. I think we should fix this, and I am working on a PR to propose a simple solution. To be clear, my worry is not that requests running with DLS are slower than normal requests. That is expected, and sizing the cache is important, even though 50MB (the default) gives some room to cache a fairly large number of documents/bitsets.
My big worry is that we always run the loading of this cache in the can_match phase, and that is unexpected.

Am I wrong in believing that the terms query in my request should take roughly the same amount of time as a match_all query from a user who has that terms query as a DLS filter?
Instead it takes 5 seconds longer.

It is expected, since the first execution of a DLS query eagerly builds the cached version of the role query. As explained above, the result is cached per segment, so the following executions should be comparable to the non-DLS case.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Oct 9, 2019
This change modifies the local execution of the `can_match` phase to not apply
the plugin's reader wrapper (if one is configured) when acquiring the searcher.
We must ensure that the phase runs quickly, and since we don't know the cost
of applying the wrapper it is preferable to avoid it entirely. The can_match
phase can afford false positives, so this is also safe for the builtin plugins
that use this functionality.

Closes elastic#46817
jimczi (Contributor) commented Oct 9, 2019

I opened #47816 to propose a solution, but I am open to alternatives. As a last resort (if we cannot find agreement), we should also consider not running the can_match phase at all when DLS is activated.

cpmoore commented Oct 9, 2019

@henningandersen

I'm on version 7.4.0, just upgraded this week.

I believe that if it were a problem with the cache size being too small, subsequent requests from the same user for the same query would be at least a little faster, as the DLS query should be in the cache having just been used. Right? However, each search consistently takes 5+ seconds.

henningandersen (Contributor, Author) commented:

@cpmoore That is a bit hard to conclude on based on the information available. If the hot_threads output while running the 5-second query looks similar to the stack trace above, I think increasing the cache size should help. I suggest trying it out, for instance with 256MB, if you have the heap space available for it.
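
If helpful, the hot threads can be captured while the slow search is running with something along these lines (the parameters are just a starting point):

GET /_nodes/hot_threads?threads=5&interval=500ms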

cpmoore commented Oct 9, 2019

@henningandersen
I tried this on our test cluster and it did seem to help. Granted, our test cluster is a lot smaller than production.
Am I right in assuming that the more documents the query matches, the faster the cache will fill up?

jimczi added a commit that referenced this issue Oct 14, 2019
This change modifies the local execution of the `can_match` phase to not apply
the plugin's reader wrapper (if one is configured) when acquiring the searcher.
We must ensure that the phase runs quickly, and since we don't know the cost
of applying the wrapper it is preferable to avoid it entirely. The can_match
phase can afford false positives, so this is also safe for the builtin plugins
that use this functionality.

Closes #46817