Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce the need to reload roles from index #33205

Closed
jaymode opened this issue Aug 28, 2018 · 6 comments
Closed

Reduce the need to reload roles from index #33205

jaymode opened this issue Aug 28, 2018 · 6 comments
Assignees
Labels
>enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC

Comments

@jaymode
Copy link
Member

jaymode commented Aug 28, 2018

If the roles.yml file is changed, the Security code takes the heavy handed approach of invalidating all cache entries rather than those that use changed roles. This should be a rare occurrence but in a busy cluster this is problematic since any roles stored in a index will need to be reloaded, which means a search (or get if only one name is requested) needs to be executed and shares the same threadpool/queue as other searches. If the search is rejected, then we use our negative lookup cache to prevent this role from being queried for again. This leads to poor behavior for the user as users do not have the expected permissions.

We should look at how we can improve this behavior through the need to reduce which roles get reloaded from an index. This may mean a change in our caching or how we deal with changes to the roles.yml and reloading.

@jaymode jaymode added >enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC labels Aug 28, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-security

@jaymode jaymode self-assigned this Aug 28, 2018
@jaymode
Copy link
Member Author

jaymode commented Sep 6, 2018

Another item to consider is to limit role combination loading to one per unique role combination. This would be similar to what was done for concurrent authentications of a user.

jaymode added a commit to jaymode/elasticsearch that referenced this issue Sep 7, 2018
In order to optimize the use of the role cache, when the roles.yml file
is reloaded we now calculate the names of removed, changed, and added
roles so that they may be passed to any listeners. This allows a
listener to selectively clear cache for only the roles that have been
modified. The CompositeRolesStore has been adapted to do exactly that
so that we limit the need to reload roles from sources such as the
native roles stores or external role providers.

See elastic#33205
jaymode added a commit to jaymode/elasticsearch that referenced this issue Sep 7, 2018
The native roles store previously would issue a search if attempting to
retrieve more than a single role. This is fine when we are attempting
to retrieve all of the roles to list them in the api, but could cause
issues when attempting to find roles for a user. The search is not
prioritized over other search requests, so heavy aggregations/searches
or overloaded nodes could cause roles to be cached as missing if the
search is rejected.

When attempting to load specific roles, we know the document id for the
role that we are trying to load, which allows us to use the multi-get
api for loading these roles. This change makes use of the multi-get api
when attempting to load more than one role by name. This api is also
constrained by a threadpool but the tasks in the GET threadpool should
be quicker than searches.

See elastic#33205
@s1monw s1monw self-assigned this Sep 13, 2018
s1monw added a commit to s1monw/elasticsearch that referenced this issue Sep 15, 2018
Today all searches happen on the search threadpool which is the correct
behavior in almost any case. Yet, there are exceptions where for instance
searches are required to succeed ie. in the case of a .security index.
These searches should not be rejected if a node is under load. There are other
more specialized usecases were searches should be passed through a single-thread
threadpool to reduce impact on a node.

Relates to elastic#33205
jaymode added a commit that referenced this issue Sep 26, 2018
In order to optimize the use of the role cache, when the roles.yml file
is reloaded we now calculate the names of removed, changed, and added
roles so that they may be passed to any listeners. This allows a
listener to selectively clear cache for only the roles that have been
modified. The CompositeRolesStore has been adapted to do exactly that
so that we limit the need to reload roles from sources such as the
native roles stores or external role providers.

See #33205
jaymode added a commit that referenced this issue Sep 26, 2018
In order to optimize the use of the role cache, when the roles.yml file
is reloaded we now calculate the names of removed, changed, and added
roles so that they may be passed to any listeners. This allows a
listener to selectively clear cache for only the roles that have been
modified. The CompositeRolesStore has been adapted to do exactly that
so that we limit the need to reload roles from sources such as the
native roles stores or external role providers.

See #33205
@jasontedor
Copy link
Member

jasontedor commented Sep 27, 2018

I have been thinking about this issue on and off since @s1monw opened the pull request to enable to route these requests to the generic thread pool. That led to a discussion about how we should avoid that because that thread pool is large and unbounded, and we do not want to be shuttling tasks from user-initiated actions to an unbounded thread pool. We reversed that portion of the change and I had proposed that we force put these requests on to the search thread pool, so that they can not be rejected. When @s1monw pinged me this morning that that is basically the same as allowing these to grow unbounded again, that triggered me to take another look at the underlying issue.

If the search is rejected, then we use our negative lookup cache to prevent this role from being queried for again.

I think this is the key point that we need to revisit. I do not think we should be entering role lookups that fail (not fail negatively, but fail exceptionally) into the negative lookup cache. Rather, an exceptional failure should be an indication that we do not know the result of the role lookup. So it's neither a positive hit, nor a negative hit, but we simply do not know and we should not be caching this result. In particular, rejections of role lookups on the search thread pool should cause the initial request to fail, we can notify the client who can retry the request.

Separately, I think that we can consider migrating these role lookups to a separate thread pool but still with a bounded queue so that they are not competing with search requests.

What do you think @jaymode and @s1monw?

@jaymode
Copy link
Member Author

jaymode commented Sep 27, 2018

I do not think we should be entering role lookups that fail (not fail negatively, but fail exceptionally) into the negative lookup cache.

I am already working on this change, so we are on the same page regarding this aspect. @tvernum and @albertzaharovits also agree #33531 (comment).

Separately, I think that we can consider migrating these role lookups to a separate thread pool but still with a bounded queue so that they are not competing with search requests.

I think that this will be helpful. I did not have in mind that the search on a different threadpool change would be the ultimate solution to the problem but it does help avoid security searches being stuck behind large aggregations. Rejections will still be possible so the handling definitely needs to be improved around that. We could also consider adding a built-in retry with exponential backoff, but I haven't given too much thought to that yet.

@jasontedor
Copy link
Member

We could also consider adding a built-in retry with exponential backoff, but I haven't given too much thought to that yet.

It is worth considering. Yet, for a first version I do not think that it is necessary, we can see how large of a problem it is in practice first after we make these initial changes before we commit to something like that.

atript8 pushed a commit to atript8/elasticsearch that referenced this issue Sep 28, 2018
In order to optimize the use of the role cache, when the roles.yml file
is reloaded we now calculate the names of removed, changed, and added
roles so that they may be passed to any listeners. This allows a
listener to selectively clear cache for only the roles that have been
modified. The CompositeRolesStore has been adapted to do exactly that
so that we limit the need to reload roles from sources such as the
native roles stores or external role providers.

See elastic#33205
jaymode added a commit to jaymode/elasticsearch that referenced this issue Oct 1, 2018
Security caches the result of role lookups and negative lookups are
cached indefinitely. In the case of transient failures this leads to a
bad experience as the roles could truly exist. The CompositeRolesStore
needs to know if a failure occurred in one of the roles stores in order
to make the appropriate decision as it relates to caching. In order to
provide this information to the CompositeRolesStore, the return type of
methods to retrieve roles has changed to a new class,
RoleRetrievalResult. This class provides the ability to pass back an
exception to the roles store. This exception does not mean that a
request should be failed but instead serves as a signal to the roles
store that missing roles should not be cached and neither should the
combined role if there are missing roles.

As part of this, the negative lookup cache was also changed from an
unbounded cache to a cache with a configurable limit.

Relates elastic#33205
jaymode added a commit to jaymode/elasticsearch that referenced this issue Oct 2, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

Relates elastic#33205
jaymode added a commit that referenced this issue Oct 15, 2018
Security caches the result of role lookups and negative lookups are
cached indefinitely. In the case of transient failures this leads to a
bad experience as the roles could truly exist. The CompositeRolesStore
needs to know if a failure occurred in one of the roles stores in order
to make the appropriate decision as it relates to caching. In order to
provide this information to the CompositeRolesStore, the return type of
methods to retrieve roles has changed to a new class,
RoleRetrievalResult. This class provides the ability to pass back an
exception to the roles store. This exception does not mean that a
request should be failed but instead serves as a signal to the roles
store that missing roles should not be cached and neither should the
combined role if there are missing roles.

As part of this, the negative lookup cache was also changed from an
unbounded cache to a cache with a configurable limit.

Relates #33205
jaymode added a commit that referenced this issue Oct 15, 2018
Security caches the result of role lookups and negative lookups are
cached indefinitely. In the case of transient failures this leads to a
bad experience as the roles could truly exist. The CompositeRolesStore
needs to know if a failure occurred in one of the roles stores in order
to make the appropriate decision as it relates to caching. In order to
provide this information to the CompositeRolesStore, the return type of
methods to retrieve roles has changed to a new class,
RoleRetrievalResult. This class provides the ability to pass back an
exception to the roles store. This exception does not mean that a
request should be failed but instead serves as a signal to the roles
store that missing roles should not be cached and neither should the
combined role if there are missing roles.

As part of this, the negative lookup cache was also changed from an
unbounded cache to a cache with a configurable limit.

Relates #33205
jaymode added a commit that referenced this issue Oct 16, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

Relates #33205
jaymode added a commit that referenced this issue Oct 16, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

Relates #33205
matriv pushed a commit to matriv/elasticsearch that referenced this issue Oct 22, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

This is the fixed version of elastic#34246, which was reverted.

Relates elastic#33205
jaymode added a commit that referenced this issue Oct 22, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

This is the fixed version of #34246, which was reverted.

Relates #33205
kcm pushed a commit that referenced this issue Oct 30, 2018
In order to optimize the use of the role cache, when the roles.yml file
is reloaded we now calculate the names of removed, changed, and added
roles so that they may be passed to any listeners. This allows a
listener to selectively clear cache for only the roles that have been
modified. The CompositeRolesStore has been adapted to do exactly that
so that we limit the need to reload roles from sources such as the
native roles stores or external role providers.

See #33205
kcm pushed a commit that referenced this issue Oct 30, 2018
Security caches the result of role lookups and negative lookups are
cached indefinitely. In the case of transient failures this leads to a
bad experience as the roles could truly exist. The CompositeRolesStore
needs to know if a failure occurred in one of the roles stores in order
to make the appropriate decision as it relates to caching. In order to
provide this information to the CompositeRolesStore, the return type of
methods to retrieve roles has changed to a new class,
RoleRetrievalResult. This class provides the ability to pass back an
exception to the roles store. This exception does not mean that a
request should be failed but instead serves as a signal to the roles
store that missing roles should not be cached and neither should the
combined role if there are missing roles.

As part of this, the negative lookup cache was also changed from an
unbounded cache to a cache with a configurable limit.

Relates #33205
kcm pushed a commit that referenced this issue Oct 30, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

Relates #33205
kcm pushed a commit that referenced this issue Oct 30, 2018
The security native stores follow a pattern where
`SecurityIndexManager#prepareIndexIfNeededThenExecute` wraps most calls
made for the security index. The reasoning behind this was to check if
the security index had been upgraded to the latest version in a
consistent manner. However, this has the potential side effect that a
read will trigger the creation of the security index or an updating of
its mappings, which can lead to issues such as failures due to put
mapping requests timing out even though we might have been able to read
from the index and get the data necessary.

This change introduces a new method, `checkIndexVersionThenExecute`,
that provides the consistent checking of the security index to make
sure it has been upgraded. That is the only check that this method
performs prior to running the passed in operation, which removes the
possible triggering of index creation and mapping updates for reads.

Additionally, areas where we do reads now check the availability of the
security index and can short circuit requests. Availability in this
context means that the index exists and all primaries are active.

This is the fixed version of #34246, which was reverted.

Relates #33205
jaymode added a commit that referenced this issue Oct 30, 2018
The native roles store previously would issue a search if attempting to
retrieve more than a single role. This is fine when we are attempting
to retrieve all of the roles to list them in the api, but could cause
issues when attempting to find roles for a user. The search is not
prioritized over other search requests, so heavy aggregations/searches
or overloaded nodes could cause roles to be cached as missing if the
search is rejected.

When attempting to load specific roles, we know the document id for the
role that we are trying to load, which allows us to use the multi-get
api for loading these roles. This change makes use of the multi-get api
when attempting to load more than one role by name. This api is also
constrained by a threadpool but the tasks in the GET threadpool should
be quicker than searches.

See #33205
jaymode added a commit that referenced this issue Oct 30, 2018
The native roles store previously would issue a search if attempting to
retrieve more than a single role. This is fine when we are attempting
to retrieve all of the roles to list them in the api, but could cause
issues when attempting to find roles for a user. The search is not
prioritized over other search requests, so heavy aggregations/searches
or overloaded nodes could cause roles to be cached as missing if the
search is rejected.

When attempting to load specific roles, we know the document id for the
role that we are trying to load, which allows us to use the multi-get
api for loading these roles. This change makes use of the multi-get api
when attempting to load more than one role by name. This api is also
constrained by a threadpool but the tasks in the GET threadpool should
be quicker than searches.

See #33205
@jaymode
Copy link
Member Author

jaymode commented Jan 15, 2019

I think we've made good progress here with improved logic when we encounter an error (#34197), not attempting to create index or put mapping on reads (#34246), switching to mget for roles instead of search (#33531), and calculating changed roles on file reloads (#33525). I'm going to close this issue and we can open another if there are more issues.

@jaymode jaymode closed this as completed Jan 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Security/Authorization Roles, Privileges, DLS/FLS, RBAC/ABAC
Projects
None yet
Development

No branches or pull requests

4 participants