
Remove I/O pool blocking sniffing call from onFailure callback, add some logic around host exclusion #27985

Closed
wants to merge 1 commit into base: master

Conversation

6 participants

reembs commented Dec 25, 2017

Addressing: #27984

In addition to the bugfix, I tried making the mechanism a bit clearer in a multithreaded context. Previously the onFailure execution stack would call the sniffing code directly in the callback (bad). In that context it made some sense to exclude a single host, but since we're now delaying the execution (and because other requests might be running concurrently) it's clearer to use a list of hosts that need to be excluded.

This list expires its values after a certain number of sniffing cycles (sniffing calls) have passed. This number defaults to 1 to mimic the existing behavior more closely.
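A minimal sketch of the exclusion bookkeeping described above, assuming hypothetical names (`ExcludedHosts`, `filter`) rather than the actual PR code: hosts that fail are remembered and filtered out of sniff results until a configurable number of sniffing cycles has passed (defaulting to 1 as described):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

// Minimal sketch of the bookkeeping described above; class and method names
// are illustrative, not the actual PR code.
class ExcludedHosts {
    private final Map<String, Long> failedAtRound = new ConcurrentHashMap<>();
    private final AtomicLong currentRound = new AtomicLong();
    private final long expiryRounds; // the PR describes a default of 1

    ExcludedHosts(long expiryRounds) {
        this.expiryRounds = expiryRounds;
    }

    // called from the (possibly concurrent) failure callbacks
    void onFailure(String host) {
        failedAtRound.putIfAbsent(host, currentRound.get());
    }

    // called once per sniffing round: expire old entries, then filter the
    // freshly sniffed hosts against the ones that recently failed
    List<String> filter(List<String> sniffedHosts) {
        long round = currentRound.incrementAndGet();
        failedAtRound.entrySet().removeIf(e -> round - e.getValue() > expiryRounds);
        return sniffedHosts.stream()
                .filter(host -> !failedAtRound.containsKey(host))
                .collect(Collectors.toList());
    }
}
```

With `expiryRounds = 1`, a host that failed is excluded from exactly one sniffing round and comes back on the next one, which approximates the single-host behavior the PR replaces.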


elasticmachine commented Dec 25, 2017

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

1 similar comment

Contributor

DaveCTurner commented Dec 26, 2017

@elasticmachine please test this

@DaveCTurner DaveCTurner added the v7.0.0 label Dec 26, 2017

reembs commented Dec 27, 2017

Addressing an earlier issue from July reported by @ssesha: #25701

@DaveCTurner I fixed the styling issues and added documentation to the new public method. Can you trigger the build again?

Contributor

DaveCTurner commented Jan 3, 2018

@elasticmachine please test this

Contributor

DaveCTurner commented Jan 3, 2018

@reembs I can't recall what the styling issue was now, and it looks like you rewrote history to remove the commit that I looked at earlier so any comments I made are now lost. It makes it easier on reviewers if you just add commits to your branch. We squash the whole branch when it's merged so don't worry about making a noisy log at this stage. Please could you avoid rewriting history in future?

I have no further comments on the code, so over to @javanna for a proper review as he knows this area better than I do.

reembs commented Jan 3, 2018

@DaveCTurner sorry about that habit. Issue was line length, I think. Your comments were on the PR so they are all still here.

Contributor

s1monw commented Jan 15, 2018

@javanna or @tlrx can you take a look at this please? @reembs sorry for the delay

Member

tlrx commented Jan 16, 2018

Thanks a lot @reembs for this contribution and the detailed issue, it's greatly appreciated. And sorry for the time it took us to reply.

I looked at the changes and I feel like managing a map of failed hosts at the Sniffer's task level will be confusing, because the RestClient already takes care of handling failed hosts and of reusing them after a short or long period of time. I'm referring to the RestClient.nextHost() method and its usage of DeadHostState, where a host is retried shortly after it failed, but if it failed multiple times it will have to wait longer before being reused again. I think that RestClient correctly handles these cases, and it also correctly removes a failed host from the blacklist/failed hosts list once the host responds correctly to a request.

Instead, I think that the Sniffer.Task.sniffOnFailure() method could use its existing ScheduledExecutorService to schedule a new round of sniffing, but this time removing the failed host (as is already the case, but in a non-blocking way using the ScheduledExecutorService).

What do you think?
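The suggestion could look roughly like this; all names (`SnifferTaskSketch`, `awaitIdle`) are illustrative, not the actual client code. The point is that the failure callback only enqueues work and returns, so the I/O thread that delivered the failure is never blocked:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Rough sketch of scheduling the sniff round through the task's existing
// ScheduledExecutorService instead of running it inside the callback.
class SnifferTaskSketch {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    final List<String> excludedPerRound = Collections.synchronizedList(new ArrayList<>());

    void sniffOnFailure(String failedHost) {
        // schedule with zero delay: this call returns immediately, and the
        // sniff round runs on the scheduler thread
        scheduler.schedule(() -> sniff(failedHost), 0, TimeUnit.MILLISECONDS);
    }

    private void sniff(String excludedHost) {
        // a real implementation would fetch the node list here and set it on
        // the RestClient, leaving out excludedHost
        excludedPerRound.add(excludedHost);
    }

    void awaitIdle() throws InterruptedException {
        scheduler.shutdown();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```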

Member

tlrx commented Jan 17, 2018

I'd be happy to have @javanna's opinion too

reembs commented Jan 17, 2018

Instead, I think that the Sniffer.Task.sniffOnFailure() method could use its existing ScheduledExecutorService to schedule a new round of sniffing, but this time removing the failed host (as is already the case, but in a non-blocking way using the ScheduledExecutorService).

@tlrx I get what you're saying about moving the async ScheduledExecutorService call from Sniffer.sniffOnFailure() to Sniffer.Task.sniffOnFailure(), from an encapsulation perspective. Regarding the single/many host exclusion issue, I'm not sure I understand the reasoning behind excluding just a single host. If the REST client already has a dead-host concept, no host exclusion should be done at the Sniffer level. If we do choose to exclude hosts independently in the Sniffer, especially since the sniffing call on failure is now async, excluding just a single host seems arbitrary. The PR changes the behavior from "exclude the last host that failed" to "exclude all hosts that failed since the last sniffing cycle", which is a more reasonable approach IMO, considering that cluster sizes often change in increments larger than 1.

Member

tlrx commented Jan 17, 2018

@tlrx I get what you're saying about moving the async ScheduledExecutorService call from Sniffer.sniffOnFailure() to Sniffer.Task.sniffOnFailure(), from an encapsulation perspective.

Good! I think this is the main point of this pull request, i.e. fixing the bug that blocks the execution. It would be great if we could come up with a test for this too.

Regarding the single/many host exclusion issue, I'm not sure I understand the reasoning of excluding just a single host.

Sorry if I wasn't clear enough. I didn't mean to argue for excluding just a single host. It is true that the Sniffer adds back all the nodes it just sniffed except the one that just failed. But it is also true that the RestClient already has the logic for taking care of failing nodes and adding them back, so I think we must not duplicate similar logic at the Sniffer level. Instead we could (thinking out loud) expose a DeadHostStrategy/ReconnectPolicy/whatever that would indicate when a failed node can be retried, whether because a given time has elapsed or because it failed during N sniffing rounds, etc.

Member

tlrx commented Mar 13, 2018

@reembs Would you like to revisit this?

reembs commented Mar 14, 2018

I would like to get this merged, yes. Can we decide on the host exclusion logic in the context of this PR and bugfix?

Currently the PR introduces multiple host exclusions at the sniffer level.

Since the PR is a bugfix and isn't meant to modify behavior, a reasonable solution would be to revert back to single-host exclusion in the sniffer, although I find that strange in a multithreaded object like the sniffer.

The second option is the one in the PR currently: multiple exclusions for all failing hosts from the previous cycle.

A third option would be to remove host exclusion from the sniffer altogether.

Let's decide on one of these and I'll be happy to implement it.

IMO, more sophisticated methods are outside the scope of this bugfix.

@tlrx

Member

javanna commented Mar 30, 2018

heya @reembs, first of all sorry, once again, because it took me ages to get to this. I see the bug and your fix is good, I would love to get it in. Thanks a lot for taking the time to debug the problem and send a fix.

I just read the comments above. It is true that the REST client has its own blacklist, like @tlrx said above. When sniffing is enabled though, every time the sniffer runs it replaces the hosts with the sniffed ones and clears the blacklist. The idea is that when nodes are listed in the cluster state, they are available. The blacklist, though, is maintained based on the availability of nodes until the next sniffing round, assuming that nodes that are unresponsive will also fall out of the cluster.

If you sniff on failure, it may happen that you get a failure for a node that is still part of the cluster and hence returned when sniffing. You then want to filter that one node out, as you know that it just failed; otherwise it will be immediately retried, which causes more failures, which causes more sniffing rounds, and so on. Let's say that when using sniff on failure, the blacklist becomes less important, as we will re-sniff for each single failure, which means replacing the hosts and clearing the blacklist every time.

Why one single host? The assumption is that if that host fails, we want to make sure not to set it back on the client even though it is still part of the cluster state; it can only be set back later if the following sniff run (which will happen with a different delay, called sniffAfterFailureDelay) again returns that node, which should happen only once the node is back. If another node fails, the same happens, but the previously failed nodes should not be returned as part of the cluster state unless back alive, assuming that enough time has passed and the node has fallen out of the cluster. To summarize: because sniffing on failure kind of overrides the internal blacklist, filtering out the failed node immediately addresses the case where the cluster hasn't yet realized that the node is failing. But nothing more should be needed if you sniff at every failure.
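Reduced to a sketch (class and method names here are illustrative, not the actual client code), the behavior described above is just a filter over the freshly sniffed hosts:

```java
import java.util.List;
import java.util.stream.Collectors;

// After a failure: re-sniff, then set on the client every host the cluster
// returned except the one that just failed. The failed host can only come
// back through a later sniff round that returns it again.
class SniffOnFailureFilter {
    static List<String> hostsToSet(List<String> sniffedHosts, String failedHost) {
        return sniffedHosts.stream()
                .filter(host -> !host.equals(failedHost))
                .collect(Collectors.toList());
    }
}
```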

I would appreciate it if you could update your PR and remove the new list of hosts to be excluded. Also, although the sniffer is not well tested today (I am working on improving this), it would be great to have a test for this. Do you think you could reproduce the bug either in a unit test or by sending requests to a mock http server, like we do in other rest client tests?

@javanna

hi @reembs thanks again, in addition to my previous comment I reviewed your change to make the sniff on failure round async and left a couple of comments.

```java
        }
    } catch (Exception e) {
        logger.error("error while sniffing nodes", e);
        nextSniffDelay = sniffAfterFailureDelayMillis;
```


@javanna

javanna Mar 30, 2018

Member

The idea behind sniffAfterFailureDelay is that once a node fails, you want to re-sniff, but you may also want to check if the node has come back through a new sniffing round earlier than you'd usually do. So in general sniffAfterFailureDelay would be lower than sniffInterval. I see that here you are using now a different interval for cases where sniffing itself has failed, but that was not the purpose of the sniffAfterFailureDelay. It is not a bad idea to actually re-sniff earlier if something goes wrong, but it should be a different concept.


```java
    }

    /**
     * Triggers a new sniffing round and explicitly takes out the failed host provided as argument
     */
    public void sniffOnFailure(HttpHost failedHost) {
        this.task.sniffOnFailure(failedHost);
        this.task.failedHosts.putIfAbsent(failedHost, 0L);
        this.task.scheduleNextRun(0);
```


@javanna

javanna Mar 30, 2018

Member

I suspect that having to pass the next interval which differs from the usual interval is the reason why this was not async in the first place :) we need to find a way to do that though, we currently lose that behaviour with this change.

Another idea, maybe overkill, could be to change the hosts sniffer API to be async and accept a listener as an argument.
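A rough sketch of what such a listener-based sniffer API could look like; the interface shape and names are assumptions for illustration, not the actual API that was built:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Hypothetical asynchronous hosts sniffer: results are reported through a
// listener instead of being returned, so callers never block on the sniff.
interface AsyncHostsSniffer {
    void sniffHosts(Consumer<List<String>> onSuccess, Consumer<Exception> onFailure);
}

// Trivial synchronous test double; a real implementation would call the
// cluster's nodes endpoint without blocking the caller.
class FixedHostsSniffer implements AsyncHostsSniffer {
    private final List<String> hosts;

    FixedHostsSniffer(List<String> hosts) {
        this.hosts = hosts;
    }

    @Override
    public void sniffHosts(Consumer<List<String>> onSuccess, Consumer<Exception> onFailure) {
        onSuccess.accept(hosts);
    }
}
```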


Member

javanna commented Apr 20, 2018

@reembs thanks again for your PR and diving into this one!

Based on my comments above, and some recent work I did in this area, I would close this as it's now superseded by #29638 which fixes the original issue and adds tests for sniffing.

@javanna javanna closed this Apr 20, 2018

Member

javanna commented Apr 20, 2018

One more thing: I have incorporated some of your concerns around taking out one host when sniffing on failure. We discussed this internally and went for taking no hosts out after all, relying entirely on the nodes returned from the cluster. Taking one host out was proving tricky to test and debug, and probably not that useful either.

@reembs reembs deleted the reembs:fix-restclient-onfailure-io-loop-block branch May 6, 2018

javanna added a commit that referenced this pull request May 31, 2018

Refactor Sniffer and make it testable (#29638)
This commit reworks the Sniffer component to simplify it and make it possible to test it.

In particular, it no longer takes out the host that failed when sniffing on failure, but rather relies on whatever the cluster returns. This is the result of some valid comments from #27985. Taking out one single host is too naive, hard to test and debug.

A new Scheduler abstraction is introduced to abstract the tasks scheduling away and make it possible to plug in any test implementation and take out timing aspects when testing.

Concurrency aspects have also been improved, synchronized methods are no longer required. At the same time, we were able to take #27697 and #25701 into account and fix them, especially now that we can more easily add tests.

Last but not least, unit tests are added for the Sniffer component, long overdue.

Closes #27697
Closes #25701
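The Scheduler abstraction the commit message mentions might be sketched like this (the interface and the inline-executing test double are illustrative, not the actual #29638 code); an implementation that runs rounds inline lets tests drive sniffing deterministically, with timing taken out entirely:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical scheduling seam: the Sniffer asks this interface to run the
// next sniff round after a delay, so production code can back it with a
// ScheduledExecutorService while tests plug in a deterministic version.
interface Scheduler {
    Future<?> schedule(Runnable sniffRound, long delayMillis);
}

// Test double: runs the round inline and ignores the delay.
class ImmediateScheduler implements Scheduler {
    @Override
    public Future<?> schedule(Runnable sniffRound, long delayMillis) {
        sniffRound.run();
        return CompletableFuture.completedFuture(null);
    }
}
```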

bizybot added a commit that referenced this pull request Jun 1, 2018

Merge master into feature/kerberos (#31016)
* Remove AllocatedPersistentTask.getState() (#30858)

This commit removes the method AllocatedPersistentTask.getState() that
exposes the internal state of an AllocatedPersistentTask and replaces
it with a new isCompleted() method. Related to #29608.

* Improve allocation-disabling instructions (#30248)

Clarify the “one minute” in the instructions to disable the shard allocation
when doing maintenance to say that it is configurable.

* Replace several try-finally statements (#30880)

This change replaces some existing try-finally statements that close resources
in their finally block with the slightly shorter and safer try-with-resources
pattern.

* Move list tasks under Tasks namespace (#30906)

Our API spec define the tasks API as e.g. tasks.list, meaning that they belong to their own namespace. This commit moves them from the cluster namespace to their own namespace.

Relates to #29546

* Deprecate accepting malformed requests in stored script API (#28939)

The stored scripts API today accepts malformed requests instead of throwing an exception.
This PR deprecates accepting malformed put stored script requests (requests not using the official script format).

Relates to #27612

* Remove log traces in AzureStorageServiceImpl and fix test (#30924)

This commit removes some log traces in AzureStorageServiceImpl and also
fixes the AzureStorageServiceTests so that is uses the real
implementation to create Azure clients.

* Fix IndexTemplateMetaData parsing from xContent (#30917)

We failed to register "aliases" and "version" into the list of keywords
in the IndexTemplateMetaData; then fail to parse the following index
template.

```
{
    "aliases": {"log": {}},
    "index_patterns": ["pattern-1"]
}
```
This commit registers that missing keywords.

* [DOCS] Reset edit links (#30909)

* Limit the scope of BouncyCastle dependency (#30358)

Limits the scope of the runtime dependency on
BouncyCastle so that it can be eventually removed.

* Splits functionality related to reading and generating certificates
and keys in two utility classes so that reading certificates and
keys doesn't require BouncyCastle.
* Implements a class for parsing PEM Encoded key material (which also
adds support for reading PKCS8 encoded encrypted private keys).
* Removes BouncyCastle dependency for all of our test suites(except
for the tests that explicitly test certificate generation) by using
pre-generated keys/certificates/keystores.

* Upgrade to Lucene-7.4-snapshot-1cbadda4d3 (#30928)

This snapshot includes LUCENE-8328 which is needed to stabilize CCR builds.

* Moved keyword tokenizer to analysis-common module (#30642)

Relates to #23658

* [test] packaging test logging for suse distros

* Fix location of AbstractHttpServerTransport (#30888)

Currently AbstractHttpServerTransport is in a netty4 module. This is the
incorrect location. This commit moves it out of netty4 module.
Additionally, it moves unit tests that test AbstractHttpServerTransport
logic to server.

* [test] packaging: use shell when running commands (#30852)

When subprocesses are started with ProcessBuilder, they're forked by the
java process directly rather than from a shell, which can be surprising
for our use case here in the packaging tests which is similar to
scripting.

This commit changes the tests to run their subprocess commands in a
shell, using the bash -c <script> syntax for commands on linux and using
the powershell.exe -Command <script> syntax for commands on windows.
This syntax on windows is essentially what the tests were already doing.

* [DOCS] Adds missing TLS settings for auditing (#30822)

* stable filemode for zip distributions (#30854)

Applies default file and directory permissions to zip distributions
similar to how they're set for the tar distributions. Previously zip
distributions would retain permissions they had on the build host's
working tree, which could vary depending on its umask

For #30799

* Minor clean-up in InternalRange. (#30886)

* Make sure all instance variables are final.
* Make generateKey a private static method, instead of protected.
* Rename formatter -> format for consistency.
* Serialize bucket keys as strings as opposed to optional strings.
* Pull the stream serialization logic for buckets into the Bucket class.

* [DOCS] Remove reference to platinum Docker image (#30916)

* Use dedicated ML APIs in tests (#30941)

ML has dedicated APIs for datafeeds and jobs yet base test classes and
some tests were relying on the cluster state for this state. This commit
removes this usage in favor of using the dedicated endpoints.

* Update the version checks around range bucket keys, now that the change was backported.

* [DOCS] Fix watcher file location

* Rename methods in PersistentTasksService (#30837)

This commit renames methods in the PersistentTasksService, to 
make obvious that the methods send requests in order to change 
the state of persistent tasks. 

Relates to #29608.

* Rename index_prefix to index_prefixes (#30932)

This commit also adds index_prefixes tests to TextFieldMapperTests to ensure that cloning and wire-serialization work correctly

* Add missing_bucket option in the composite agg (#29465)

This change adds a new option to the composite aggregation named `missing_bucket`.
This option can be set by source and dictates whether documents without a value for the
source should be ignored. When set to true, documents without a value for a field emits
an explicit `null` value which is then added in the composite bucket.
The `missing` option that allows to set an explicit value (instead of `null`) is deprecated in this change and will be removed in a follow up (only in 7.x).
This commit also changes how the big arrays are allocated, instead of reserving
the provided `size` for all sources they are created with a small intial size and they grow
depending on the number of buckets created by the aggregation:
Closes #29380

* Fsync state file before exposing it (#30929)

With multiple data paths, we write the state files for index metadata to all data paths. We only properly fsync on the first location, though. For other locations, we possibly expose the file before its contents is properly fsynced. This can lead to situations where, after a crash, and where the first data path is not available anymore, ES will see a partially-written state file, preventing the node to start up.

* Fix AliasMetaData parsing (#30866)

AliasMetaData should be parsed more leniently so that the high-level REST client can support forward compatibility on it. This commit addresses this issue that was found as part of #28799 and adds dedicated XContent tests as well.

* Cross Cluster Search: do not use dedicated masters as gateways (#30926)

When we are connecting to a remote cluster we should never select
dedicated master nodes as gateway nodes, or we will end up loading them
with requests that should rather go to other type of nodes e.g. data
nodes or coord_only nodes.

This commit adds the selection based on the node role, to the existing
selection based on version and potential node attributes.

Closes #30687

* Fix missing option serialization after backport

Relates #29465

* REST high-level client: add synced flush API (2) (#30650)

Adds the synced flush API to the high level REST client.

Relates to #27205.

* Fix synced flush docs

They had some copy and paste errors that failed the docs build.

* Change ScriptException status to 400 (bad request) (#30861)

Currently failures to compile a script usually lead to a ScriptException, which
inherits the 500 INTERNAL_SERVER_ERROR from ElasticsearchException if it does
not contain another root cause. Instead, this should be a 400 Bad Request error.
This PR addresses script compilation errors more generally by changing
ScriptException to return 400 (bad request) as its status code.

Closes #12315

* Fix composite agg serialization error

Fix serialization after backport

Relates #29465

* Revert accidentally pushed changes in NoriAnalysisTests

* SQL: Remove log4j and joda from JDBC dependencies (#30938)

More cleanup of JDBC driver project

Relates to #29856

* [DOCS] Fixes kibana security file location

* [CI] Mute SamlAuthenticatorTests testIncorrectSigningKeyIsRejected

Tracked by #30970

* Add Verify Repository High Level REST API (#30934)

This commit adds Verify Repository, the associated docs and tests for
the high level REST API client. A few small changes to the Verify
Repository Response went into the commit as well.

Relates #27205

* Add “took” timing info to response for _msearch/template API (#30961)

Add “took” timing info to response for _msearch/template API
Closes #30957

* Mute FlushIT tests

We have identified the source causing these tests to fail.
This commit mutes them again until we have a proper fix.

Relates #29392

* [CI] Mute HttpSecretsIntegrationTests#testWebhookAction test

Tracked by #30094

* [Test] Prefer ArrayList over Vector (#30965)

Replaces some occurrences of the Vector class with ArrayList in
tests of the rank-eval module.

* Fix license on ActiveDirectorySIDUtil (#30972)

This code is from an Apache 2.0 licensed codebase and when we imported
it into our codebase it carried the Apache 2.0 license as well. However,
during the migration of the X-Pack codebase from the internal private
repository to the elastic/elasticsearch repository, the migration tool
mistakenly changed the license on this source file from the Apache 2.0
license to the Elastic license. This commit addresses this mistake by
reapplying the Apache 2.0 license.

* [CI] Mute Ml rolling upgrade tests

Tracked by #30982

* Make AllocatedPersistentTask.isCompleted() protected (#30949)

This commit changes the isCompleted() method to be protected so that
classes that extends AllocatedPersistentTask can use it.

Related to #30858

* [CI] Mute Ml rolling upgrade test for mixed cluster too

It can fail in either the mixed cluster or the upgraded cluster,
so it needs to be muted in both.

Tracked by #30982

* [Docs] Fix typo in Min Aggregation reference (#30899)

* Refactor Sniffer and make it testable (#29638)

This commit reworks the Sniffer component to simplify it and make it possible to test it.

In particular, it no longer takes out the host that failed when sniffing on failure, but rather relies on whatever the cluster returns. This is the result of some valid comments from #27985. Taking out one single host is too naive, hard to test and debug.

A new Scheduler abstraction is introduced to abstract the tasks scheduling away and make it possible to plug in any test implementation and take out timing aspects when testing.

Concurrency aspects have also been improved, synchronized methods are no longer required. At the same time, we were able to take #27697 and #25701 into account and fix them, especially now that we can more easily add tests.

Last but not least, unit tests are added for the Sniffer component, long overdue.

Closes #27697
Closes #25701
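
The Scheduler idea can be sketched as follows (names here are hypothetical, not the actual Sniffer code): by hiding scheduling behind a small interface, tests can substitute an implementation that runs tasks synchronously and records the requested delays, taking timing out of the tests entirely.

```java
import java.util.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a Scheduler abstraction that makes periodic sniffing testable.
public class SchedulerSketch {
    interface Scheduler {
        void schedule(Runnable task, long delayMillis);
    }

    // Test implementation: runs tasks immediately and remembers the delays.
    static class ImmediateScheduler implements Scheduler {
        final List<Long> delays = new ArrayList<>();
        public void schedule(Runnable task, long delayMillis) {
            delays.add(delayMillis);
            task.run();
        }
    }

    public static void main(String[] args) {
        ImmediateScheduler scheduler = new ImmediateScheduler();
        AtomicInteger rounds = new AtomicInteger();
        // a "sniff round" that reschedules itself a fixed number of times
        Runnable sniff = new Runnable() {
            public void run() {
                if (rounds.incrementAndGet() < 3) {
                    scheduler.schedule(this, 1000);
                }
            }
        };
        scheduler.schedule(sniff, 0);
        System.out.println(rounds.get() + " rounds, delays=" + scheduler.delays);
    }
}
```

A production implementation would back the interface with a `ScheduledExecutorService`; the test implementation above asserts on scheduling behavior without ever sleeping.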

* Deprecates indexing and querying a context completion field without context (#30712)

This change deprecates completion queries and documents without context that target a
context-enabled completion field. Querying without context degrades search
performance considerably (even when the number of indexed contexts is low).
This commit targets master, but the deprecation will take place in 6.x and the functionality
will be removed in 7 in a follow-up.

Closes #29222
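
To illustrate the non-deprecated pattern (field and context names here are made up), a suggest query against a context-enabled completion field should supply contexts, e.g.:

```json
{
  "suggest": {
    "place_suggestion": {
      "prefix": "tim",
      "completion": {
        "field": "suggest",
        "contexts": { "place_type": [ "cafe" ] }
      }
    }
  }
}
```

Omitting the `contexts` object against such a field is what now triggers the deprecation warning.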

* Core: Remove RequestBuilder from Action (#30966)

This commit removes the RequestBuilder generic type from Action. It was
needed to be used by the newRequest method, which in turn was used by
client.prepareExecute. Both of these methods are now removed, along with
the existing users of prepareExecute constructing the appropriate
builder directly.

* Ensure intended key is selected in SamlAuthenticatorTests (#30993)

* Ensure that a purposefully wrong key is used

Uses a specific keypair for tests that require a purposefully wrong
keypair, instead of selecting one randomly from the same pool from
which the correct one is selected. Entropy is low because of the
small space, and the same key can be randomly selected as both the
correct one and the wrong one, causing the tests to fail.
The purposefully wrong key is also used in
testSigningKeyIsReloadedForEachRequest and needs to be cleaned
up afterwards so the rest of the tests don't use it for signing.

Resolves #30970

* [DOCS] Update readme for testing x-pack code snippets (#30696)

* Remove version read/write logic in Verify Response (#30879)

Since master will always communicate with a >=6.4 node, the logic for
checking if the node is 6.4 and conditionally reading and writing based
on that can be removed from master. This logic will stay in 6.x as it is
the bridge to the cleaner response in master. This also unmutes the
failing test due to this bwc change.

Closes #30807

* HLRest: Allow caller to set per request options (#30490)

This modifies the high level rest client to allow calling code to
customize per request options for the bulk API. You do the actual
customization by passing a `RequestOptions` object to the API call
which is set on the `Request` that is generated by the high level
client. It also makes the `RequestOptions` a thing in the low level
rest client. For now that just means you use it to customize the
headers and the `httpAsyncResponseConsumerFactory` and we'll add
node selectors and per request timeouts in a follow up.

I only implemented this on the bulk API because it is the first one
in the list alphabetically and I wanted to keep the change small
enough to review. I'll convert the remaining APIs in a followup.
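
The underlying pattern can be sketched with a stand-in class (this is not the actual `RequestOptions` source, just an illustration of the design): an immutable shared default that callers copy via `toBuilder()` to customize a single request without affecting any other request.

```java
import java.util.*;

// Sketch of the immutable-default / per-request-override options pattern.
public class OptionsSketch {
    static final class RequestOptions {
        static final RequestOptions DEFAULT = new RequestOptions(Map.of());
        final Map<String, String> headers;
        private RequestOptions(Map<String, String> headers) {
            this.headers = Map.copyOf(headers); // immutable snapshot
        }
        Builder toBuilder() { return new Builder(headers); }
        static final class Builder {
            private final Map<String, String> headers;
            private Builder(Map<String, String> headers) {
                this.headers = new HashMap<>(headers);
            }
            Builder addHeader(String name, String value) {
                headers.put(name, value);
                return this;
            }
            RequestOptions build() { return new RequestOptions(headers); }
        }
    }

    public static void main(String[] args) {
        RequestOptions custom = RequestOptions.DEFAULT.toBuilder()
            .addHeader("Authorization", "Bearer token") // hypothetical header
            .build();
        // the shared default is untouched by the per-request copy
        System.out.println(RequestOptions.DEFAULT.headers.size() + " " + custom.headers.size());
    }
}
```

Because the default is immutable, it is safe to share across threads, and every customization is an explicit, request-scoped copy.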

* [DOCS] Clarify not all PKCS12 usable as truststores (#30750)

Although elasticsearch-certutil generates PKCS#12
files which are usable as both keystore and truststore,
this is uncommon in practice. Set these expectations
for users following our security guides.

* Transport client: Don't validate node in handshake (#30737)

This is related to #30141. Right now in the transport client we open a
temporary node connection and take the node information. This node
information is used to open a permanent connection that is used for the
client. However, we continue to use the configured transport address.
If the configured transport address is a load balancer, you might
connect to a different node for the permanent connection. This causes
the handshake validation to fail. This commit removes the handshake
validation for the transport client when it is in simple node sample mode.

* Remove unused query methods from MappedFieldType. (#30987)

* Remove MappedFieldType#nullValueQuery, as it is now unused.
* Remove MappedFieldType#queryStringTermQuery, as it is never overridden.

* Reuse expiration date of trial licenses (#30950)

* Retain the expiryDate for trial licenses

While updating the license signature to the new license spec, retain
the trial license's expiration date from the existing license.

Resolves #30882

* Watcher: Give test a little more time

Changes watcher's integration tests to wait 30 seconds when starting
watcher rather than 10 seconds because this build failed when starting
took 12 seconds:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.3+periodic/222/console

javanna added a commit that referenced this pull request Jun 1, 2018

Refactor Sniffer and make it testable (#29638)