
Allow Ignoring failures for indices that have just been created #65846

Open
benwtrent opened this issue Dec 3, 2020 · 8 comments · May be fixed by #66853
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@benwtrent
Member

Description of the problem including expected versus actual behavior:

When a search expands to a recently created index, and the index is not fully initialized, there is a window of time where the search will fail due to missing shards.

For many uses inside the machine learning plugin this is problematic, as we don't know whether it is a "true" failure or not.

Current ML failures caused by this scenario.

Steps to reproduce:

Reproducing this is exceptionally difficult.

Probably the best chance to reproduce this locally at the moment is with DeleteExpiredDataIT.testDeleteExpiredDataNoThrottle or DeleteExpiredDataIT.testDeleteExpiredDataWithStandardThrottle. However, it’s tricky: these tests are much more likely to fail if other tests have run before them, because then they start immediately after the cluster cleanup that runs between tests. You can unmute those two by reverting 309684f. Then, if you run all the tests in DeleteExpiredDataIT repeatedly, one of them will eventually run right after another’s cleanup.

This situation might be addressed by adding an option to wait for the primary shard to be assigned (if it is still initializing).
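
Until such an option exists, a minimal caller-side workaround is to wait for the index's primaries via the cluster health API before searching. A sketch only, assuming the internal Client API; the index name, timeout, and error handling are placeholders:

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: block until the index's primary shards are active (yellow health),
// then search.
static SearchResponse searchOncePrimariesActive(Client client, String index) {
    ClusterHealthResponse health = client.admin().cluster()
        .prepareHealth(index)
        .setWaitForYellowStatus()                      // yellow == all primaries assigned and started
        .setTimeout(TimeValue.timeValueSeconds(30))    // arbitrary bound on the wait
        .get();
    if (health.isTimedOut()) {
        throw new IllegalStateException("primaries for [" + index + "] did not become active in time");
    }
    return client.prepareSearch(index).get();
}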

@benwtrent benwtrent added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Dec 3, 2020
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent
Member Author

//CC @elastic/ml-core

@benwtrent
Member Author

Other related issues:

#45250

elastic/elasticsearch-net#4280

@droberts195
Contributor

/cc @henningandersen and @fcofdez because I think some work the distributed team is doing might be able to solve this

@fcofdez
Contributor

fcofdez commented Dec 9, 2020

Yes, we've been exploring different approaches to overcome these kinds of issues. I wrote a small POC (#64315) to explore the problem further, but we have been hesitant to change how search currently works and we're leaning towards a different model for async searches. We think that for blocking searches it makes sense to fail fast, as it does now.

Additionally, even if we end up implementing a way to wait for shards to be allocated, we should bound that wait with a timeout and there's no guarantee that these kinds of spurious failures won't happen again. 

Maybe what we can do is provide some kind of primitive for tests to wait until the shards are allocated, so failures caused by a delayed shard allocation are easier to find. 
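
For what it's worth, in integration tests something close to that primitive already exists via ESIntegTestCase; a tiny sketch (the index name is just an example):

// Inside an ESIntegTestCase subclass: ensureYellow(index) blocks until all
// primaries for the index are allocated, so a search immediately afterwards
// cannot hit the just-created window.
createIndex("ml-results-000001");   // example index name
ensureYellow("ml-results-000001");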

/cc @henningandersen @jimczi

@droberts195
Contributor

Maybe what we can do is provide some kind of primitive for tests to wait until the shards are allocated

But what about end users? It seems like we're going in the wrong direction if we're writing special functionality to make the tests pass. It's as though the reason Elasticsearch exists is to make the tests pass rather than to be a useful product for end users.

The scenario where it breaks is:

  1. Thread 1 wants to know if a particular document exists in a particular set of indices, which may or may not exist.
  2. Thread 2 creates an index that matches the pattern.
  3. Thread 1 searches for the document with ignore_unavailable=true&allow_no_indices=true.
  4. At this point the index exists but doesn’t have any shards allocated, so thread 1 gets an “all shards failed” exception with an empty list of shards that failed.
  5. An ML test fails, or, in production, a user gets an error response that doesn’t make sense.
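
A condensed sketch of that race, using the internal Client API (index names are made up; in the real scenario the create and the search happen on different threads, so here the create deliberately does not wait for an active primary in order to open the same window):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.ActiveShardCount;
import org.elasticsearch.action.support.IndicesOptions;
import org.elasticsearch.client.Client;

static SearchResponse reproduceRace(Client client) {
    // "Thread 2": create an index matching the pattern, without waiting for
    // its primary shard to become active.
    client.admin().indices().prepareCreate("ml-results-000002")
        .setWaitForActiveShards(ActiveShardCount.NONE)
        .get();

    // "Thread 1": lenient search over the pattern. lenientExpandOpen() implies
    // ignore_unavailable=true and allow_no_indices=true. While the new primary
    // is still unassigned or initializing, this can throw "all shards failed"
    // instead of simply returning no hits for the empty index.
    return client.prepareSearch("ml-results-*")
        .setIndicesOptions(IndicesOptions.lenientExpandOpen())
        .get();
}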

I think it will be hard to add test functionality that solves that. One way to solve it in tests is to proactively create all indices up-front, then wait for yellow status on the cluster before proceeding to the rest of the test code. But if that is not what the production system does then the test isn't testing what end users observe. Some pieces of ML functionality create indices only when they're needed, and if a different thread is doing searches against those indices not caring if they don't exist then that is where the problem arises.

it makes sense to fail fast as it does now

If it's going to fail fast, could it fail fast with a specific exception type meaning "all shards failed because this index has only just been created"? At the moment we get the search exception with "all shards failed", but we have no idea if this is because the index was only just created or because of some major problem in the cluster that caused both primaries and replicas to be inaccessible.

This would solve the problem for both tests and end users: ML code that doesn't care whether a search returns no results because the index being searched does not exist could then treat that specific exception type as meaning "no results".
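
As an illustration of why a dedicated exception type would help (a sketch only, not existing ML code): today the closest a caller can get is to inspect the shard failures and guess from NoShardAvailableActionException, and the empty-failure-list case remains ambiguous:

import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.action.NoShardAvailableActionException;
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.action.search.ShardSearchFailure;

// Sketch only: best-effort guess that an "all shards failed" error is due to a
// just-created index rather than a broken cluster.
static boolean looksLikeJustCreatedIndex(SearchPhaseExecutionException e) {
    ShardSearchFailure[] failures = e.shardFailures();
    if (failures.length == 0) {
        // An empty failure list is exactly the ambiguous case described above:
        // nothing tells us whether the index was just created or the cluster is broken.
        return false;
    }
    for (ShardSearchFailure failure : failures) {
        if (ExceptionsHelper.unwrapCause(failure.getCause())
                instanceof NoShardAvailableActionException == false) {
            return false;
        }
    }
    return true;
}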

@henningandersen
Contributor

I do think it is possible to solve this without waiting, since an empty index should result in no hits. I am mainly worried about the potential edge cases that could be the result of such a check up front (like if the coordinator is isolated, it may return no hits when there really are hits). I agree that we should not solve this for tests only (unless it turns out to be a test only issue). We could make the check when receiving the failure, which would mitigate at least some of that.
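
For illustration, a very rough sketch of what such a failure-time check could look at, assuming coordinator-side access to the cluster state (this is not the implementation in the linked PR):

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.routing.IndexShardRoutingTable;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.UnassignedInfo;

// Sketch only: on receiving a shard failure, decide whether the failed shard's
// primary is inactive purely because the index was just created, in which case
// the shard could be treated as empty (no hits) rather than failed.
static boolean primaryInactiveDueToIndexCreation(ClusterState state, String index, int shardId) {
    IndexShardRoutingTable shardTable = state.routingTable().shardRoutingTable(index, shardId);
    ShardRouting primary = shardTable.primaryShard();
    return primary.active() == false
        && primary.unassignedInfo() != null
        && primary.unassignedInfo().getReason() == UnassignedInfo.Reason.INDEX_CREATED;
}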

@fcofdez
Contributor

fcofdez commented Dec 9, 2020

If it's going to fail fast, could it fail fast with a specific exception type meaning "all shards failed because this index has only just been created"? At the moment we get the search exception with "all shards failed", but we have no idea if this is because the index was only just created or because of some major problem in the cluster that caused both primaries and replicas to be inaccessible.

I think this is related to some configuration on the HTTP layer, maybe http.detailed_errors.enabled, since the exception generated by the search phase provides information about the failed shard executions, e.g.:

{
   "type":"search_phase_execution_exception",
   "reason":"all shards failed",
   "phase":"query",
   "grouped":true,
   "failed_shards":[
      {
         "shard":0,
         "index":"bsvfqshqqy",
         "node":null,
         "reason":{
            "type":"no_shard_available_action_exception",
            "reason":null,
            "index_uuid":"hByB2qCKT8yXdRjrxDZqIw",
            "shard":"0",
            "index":"bsvfqshqqy"
         }
      }
   ]
}

I am mainly worried about the potential edge cases that could be the result of such a check up front (like if the coordinator is isolated, it may return no hits when there really are hits)

That should result in shard search failures too, as long as the connection is closed at some point (since the internal search requests don't have a timeout). But maybe I'm missing something here.

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Dec 29, 2020
When creating an index, the primary will be either unassigned or
initializing for a short while. This causes issues when concurrent
processes do searches hitting those indices, whether explicitly or
through patterns, aliases, or data streams. This commit changes the
behavior to disregard shards where the primary is inactive due to
having just been created.

Closes elastic#65846
@henningandersen henningandersen linked a pull request Dec 29, 2020 that will close this issue
@bpintea bpintea assigned bpintea and unassigned bpintea Nov 1, 2021