
Allow Ignoring failures for indices that have just been created #65846

Open
benwtrent opened this issue Dec 3, 2020 · 8 comments · May be fixed by #66853
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@benwtrent
Member

Description of the problem including expected versus actual behavior:

When a search expands to a recently created index, and the index is not fully initialized, there is a window of time where the search will fail due to missing shards.

For many uses inside the machine learning plugin this is problematic, as we don't know whether it is a "true" failure or not.

Current ML failures caused by this scenario.

Steps to reproduce:

Reproducing this is exceptionally difficult.

Probably the best chance to reproduce this locally at the moment is with DeleteExpiredDataIT.testDeleteExpiredDataNoThrottle or DeleteExpiredDataIT.testDeleteExpiredDataWithStandardThrottle. However, it’s tricky: these tests are much more likely to fail if other tests have run before them, because then they start immediately after the cluster cleanup that runs between tests. You can unmute those two by reverting 309684f. Then, if you run all the tests in DeleteExpiredDataIT repeatedly, one of them will eventually run right after another’s cleanup.

This situation might be addressed by adding an option to wait for the primary shard to be assigned (if it is still initializing).
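
Until such an option exists, a minimal caller-side workaround is to wait for the index's primaries via the cluster health API before searching. A sketch only, assuming the internal Client API; the index name, timeout, and error handling are placeholders:

import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.TimeValue;

// Sketch only: block until the index's primary shards are active (yellow health),
// then search.
static SearchResponse searchOncePrimariesActive(Client client, String index) {
    ClusterHealthResponse health = client.admin().cluster()
        .prepareHealth(index)
        .setWaitForYellowStatus()                      // yellow == all primaries assigned and started
        .setTimeout(TimeValue.timeValueSeconds(30))    // arbitrary bound on the wait
        .get();
    if (health.isTimedOut()) {
        throw new IllegalStateException("primaries for [" + index + "] did not become active in time");
    }
    return client.prepareSearch(index).get();
}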

@benwtrent benwtrent added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Dec 3, 2020
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Dec 3, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@benwtrent
Member Author

//CC @elastic/ml-core

@benwtrent
Member Author

Other related issues:

#45250

elastic/elasticsearch-net#4280

@droberts195
Contributor

/cc @henningandersen and @fcofdez because I think some work the distributed team is doing might be able to solve this

@fcofdez
Contributor

fcofdez commented Dec 9, 2020

Yes, we've been exploring different approaches to overcome these kinds of issues. I wrote a small POC (#64315) to explore the problem further, but we have been hesitant to change how search currently works and we're leaning towards a different model for async searches. We think that for blocking searches it makes sense to fail fast, as it does now.

Additionally, even if we end up implementing a way to wait for shards to be allocated, we should bound that wait with a timeout and there's no guarantee that these kinds of spurious failures won't happen again. 

Maybe what we can do is provide some kind of primitive for tests to wait until the shards are allocated, so failures caused by a delayed shard allocation are easier to find. 
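
For what it's worth, in integration tests something close to that primitive already exists via ESIntegTestCase; a tiny sketch (the index name is just an example):

// Inside an ESIntegTestCase subclass: ensureYellow(index) blocks until all
// primaries for the index are allocated, so a search immediately afterwards
// cannot hit the just-created window.
createIndex("ml-results-000001");   // example index name
ensureYellow("ml-results-000001");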

/cc @henningandersen @jimczi

@droberts195
Contributor

Maybe what we can do is provide some kind of primitive for tests to wait until the shards are allocated

But what about end users? It seems like we're going in the wrong direction if we're writing special functionality to make the tests pass. It's as though the reason Elasticsearch exists is to make the tests pass rather than to be a useful product for end users.

The scenario where it breaks is:

  1. Thread 1 wants to know if a particular document exists in a particular set of indices, which may or may not exist.
  2. Thread 2 creates an index that matches the pattern.
  3. Thread 1 searches for the document with ignore_unavailable=true&allow_no_indices=true.
  4. At this point the index exists but doesn’t have any shards allocated, so thread 1 gets an “all shards failed” exception with an empty list of shards that failed.
  5. An ML test fails, or, in production, a user gets an error response that doesn’t make sense.
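
A condensed sketch of that race, using the internal Client API (index names are made up; in the real scenario the create and the search happen on different threads, so here the create deliberately does not wait for an active primary in order to open the same window):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.support.ActiveShardCount;
import org.elasticsearch.action.support.IndicesOptions;
import org.elasticsearch.client.Client;

static SearchResponse reproduceRace(Client client) {
    // "Thread 2": create an index matching the pattern, without waiting for
    // its primary shard to become active.
    client.admin().indices().prepareCreate("ml-results-000002")
        .setWaitForActiveShards(ActiveShardCount.NONE)
        .get();

    // "Thread 1": lenient search over the pattern. lenientExpandOpen() implies
    // ignore_unavailable=true and allow_no_indices=true. While the new primary
    // is still unassigned or initializing, this can throw "all shards failed"
    // instead of simply returning no hits for the empty index.
    return client.prepareSearch("ml-results-*")
        .setIndicesOptions(IndicesOptions.lenientExpandOpen())
        .get();
}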

I think it will be hard to add test functionality that solves that. One way to solve it in tests is to proactively create all indices up-front, then wait for yellow status on the cluster before proceeding to the rest of the test code. But if that is not what the production system does then the test isn't testing what end users observe. Some pieces of ML functionality create indices only when they're needed, and if a different thread is doing searches against those indices not caring if they don't exist then that is where the problem arises.

it makes sense to fail fast as it does now

If it's going to fail fast, could it fail fast with a specific exception type meaning "all shards failed because this index has only just been created"? At the moment we get the search exception with "all shards failed", but we have no idea if this is because the index was only just created or because of some major problem in the cluster that caused both primaries and replicas to be inaccessible.

This would solve the problem for both tests and end users: ML code that doesn't care whether a search returns no results because the index being searched does not exist could then treat that specific exception type as meaning "no results".
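
As an illustration of why a dedicated exception type would help (a sketch only, not existing ML code): today the closest a caller can get is to inspect the shard failures and guess from NoShardAvailableActionException, and the empty-failure-list case remains ambiguous:

import org.elasticsearch.ExceptionsHelper;
import org.elasticsearch.action.NoShardAvailableActionException;
import org.elasticsearch.action.search.SearchPhaseExecutionException;
import org.elasticsearch.action.search.ShardSearchFailure;

// Sketch only: best-effort guess that an "all shards failed" error is due to a
// just-created index rather than a broken cluster.
static boolean looksLikeJustCreatedIndex(SearchPhaseExecutionException e) {
    ShardSearchFailure[] failures = e.shardFailures();
    if (failures.length == 0) {
        // An empty failure list is exactly the ambiguous case described above:
        // nothing tells us whether the index was just created or the cluster is broken.
        return false;
    }
    for (ShardSearchFailure failure : failures) {
        if (ExceptionsHelper.unwrapCause(failure.getCause())
                instanceof NoShardAvailableActionException == false) {
            return false;
        }
    }
    return true;
}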

@henningandersen
Contributor

I do think it is possible to solve this without waiting, since an empty index should result in no hits. I am mainly worried about the potential edge cases that could be the result of such a check up front (like if the coordinator is isolated, it may return no hits when there really are hits). I agree that we should not solve this for tests only (unless it turns out to be a test only issue). We could make the check when receiving the failure, which would mitigate at least some of that.
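
For illustration, a very rough sketch of what such a failure-time check could look at, assuming coordinator-side access to the cluster state (this is not the implementation in the linked PR):

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.routing.IndexShardRoutingTable;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.UnassignedInfo;

// Sketch only: on receiving a shard failure, decide whether the failed shard's
// primary is inactive purely because the index was just created, in which case
// the shard could be treated as empty (no hits) rather than failed.
static boolean primaryInactiveDueToIndexCreation(ClusterState state, String index, int shardId) {
    IndexShardRoutingTable shardTable = state.routingTable().shardRoutingTable(index, shardId);
    ShardRouting primary = shardTable.primaryShard();
    return primary.active() == false
        && primary.unassignedInfo() != null
        && primary.unassignedInfo().getReason() == UnassignedInfo.Reason.INDEX_CREATED;
}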

@fcofdez
Contributor

fcofdez commented Dec 9, 2020

If it's going to fail fast, could it fail fast with a specific exception type meaning "all shards failed because this index has only just been created"? At the moment we get the search exception with "all shards failed", but we have no idea if this is because the index was only just created or because of some major problem in the cluster that caused both primaries and replicas to be inaccessible.

I think this is related to some configuration on the HTTP layer, maybe http.detailed_errors.enabled, since the exception generated by the search phase provides information about the failed shard executions, e.g.:

{
   "type":"search_phase_execution_exception",
   "reason":"all shards failed",
   "phase":"query",
   "grouped":true,
   "failed_shards":[
      {
         "shard":0,
         "index":"bsvfqshqqy",
         "node":null,
         "reason":{
            "type":"no_shard_available_action_exception",
            "reason":null,
            "index_uuid":"hByB2qCKT8yXdRjrxDZqIw",
            "shard":"0",
            "index":"bsvfqshqqy"
         }
      }
   ]
}

I am mainly worried about the potential edge cases that could be the result of such a check up front (like if the coordinator is isolated, it may return no hits when there really are hits)

That should result in shard search failures too, as long as the connection is closed at some point (since the internal search requests don't have a timeout). But maybe I'm missing something here.

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Dec 29, 2020
When creating an index, the primary will be either unassigned or
initializing for a short while. This causes issues when concurrent
processes do searches hitting those indices, whether explicitly or
through patterns, aliases, or data streams. This commit changes the
behavior to disregard shards where the primary is inactive due to
having just been created.

Closes elastic#65846
@henningandersen henningandersen linked a pull request Dec 29, 2020 that will close this issue
@bpintea bpintea assigned bpintea and unassigned bpintea Nov 1, 2021