Remove I/O pool blocking sniffing call from onFailure callback, add some logic around host exclusion #27985

Closed
@@ -29,8 +29,11 @@
import java.io.IOException;
import java.security.AccessController;
import java.security.PrivilegedAction;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadFactory;
@@ -53,15 +56,16 @@ public class Sniffer implements Closeable {

private final Task task;

Sniffer(RestClient restClient, HostsSniffer hostsSniffer, long sniffInterval, long sniffAfterFailureDelay) {
this.task = new Task(hostsSniffer, restClient, sniffInterval, sniffAfterFailureDelay);
Sniffer(RestClient restClient, HostsSniffer hostsSniffer, long sniffInterval, long sniffAfterFailureDelay, int maxExcludedRounds) {
this.task = new Task(hostsSniffer, restClient, sniffInterval, sniffAfterFailureDelay, maxExcludedRounds);
}

/**
* Triggers a new sniffing round and explicitly takes out the failed host provided as argument
*/
public void sniffOnFailure(HttpHost failedHost) {
this.task.sniffOnFailure(failedHost);
this.task.failedHosts.putIfAbsent(failedHost, 0L);
this.task.scheduleNextRun(0);
Member:
I suspect that having to pass a next interval that differs from the usual interval is the reason why this was not async in the first place :) We need to find a way to do that though; we currently lose that behaviour with this change.

Another idea, maybe overkill, could be to change the hosts sniffer API to be async and accept a listener as an argument.
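For illustration, the listener-based hosts sniffer suggested here could look roughly like the following sketch; the `AsyncHostsSniffer` and `Listener` names are hypothetical and not part of the client API.

```java
import java.util.List;

import org.apache.http.HttpHost;

// Hypothetical async variant of HostsSniffer: a sniffing round is started without
// blocking the calling thread, and its outcome is delivered to a listener.
public interface AsyncHostsSniffer {

    interface Listener {
        // invoked with the freshly sniffed hosts once the round completes
        void onHosts(List<HttpHost> hosts);

        // invoked if the sniffing round itself fails
        void onFailure(Exception exception);
    }

    void sniffHosts(Listener listener);
}
```

The failure-triggered path could then schedule its follow-up run from the listener callback, keeping the shorter `sniffAfterFailureDelay` behaviour without doing blocking I/O in the failure listener.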

}

@Override
@@ -75,15 +79,24 @@ private static class Task implements Runnable {

private final long sniffIntervalMillis;
private final long sniffAfterFailureDelayMillis;
private final int maxExcludedRounds;
private final ScheduledExecutorService scheduledExecutorService;
private final AtomicBoolean running = new AtomicBoolean(false);
private ScheduledFuture<?> scheduledFuture;
private ConcurrentHashMap<HttpHost, Long> failedHosts = new ConcurrentHashMap<>();

private Task(
HostsSniffer hostsSniffer,
RestClient restClient,
long sniffIntervalMillis,
long sniffAfterFailureDelayMillis,
int maxExcludedRounds) {

private Task(HostsSniffer hostsSniffer, RestClient restClient, long sniffIntervalMillis, long sniffAfterFailureDelayMillis) {
this.hostsSniffer = hostsSniffer;
this.restClient = restClient;
this.sniffIntervalMillis = sniffIntervalMillis;
this.sniffAfterFailureDelayMillis = sniffAfterFailureDelayMillis;
this.maxExcludedRounds = maxExcludedRounds;
SnifferThreadFactory threadFactory = new SnifferThreadFactory(SNIFFER_THREAD_NAME);
this.scheduledExecutorService = Executors.newScheduledThreadPool(1, threadFactory);
scheduleNextRun(0);
@@ -106,35 +119,63 @@ synchronized void scheduleNextRun(long delayMillis) {

@Override
public void run() {
sniff(null, sniffIntervalMillis);
}

void sniffOnFailure(HttpHost failedHost) {
sniff(failedHost, sniffAfterFailureDelayMillis);
sniff(sniffIntervalMillis);
}

void sniff(HttpHost excludeHost, long nextSniffDelayMillis) {
void sniff(long nextSniffDelayMillis) {
if (running.compareAndSet(false, true)) {
long nextSniffDelay = nextSniffDelayMillis;
try {
List<HttpHost> sniffedHosts = hostsSniffer.sniffHosts();
logger.debug("sniffed hosts: " + sniffedHosts);
if (excludeHost != null) {
sniffedHosts.remove(excludeHost);
}
if (sniffedHosts.isEmpty()) {

List<HttpHost> hostsFiltered = removeExcludedAndCycle(sniffedHosts);
logger.debug("sniffed hosts after filtering: " + hostsFiltered);

if (hostsFiltered.isEmpty()) {
logger.warn("no hosts to set, hosts will be updated at the next sniffing round");
} else {
this.restClient.setHosts(sniffedHosts.toArray(new HttpHost[sniffedHosts.size()]));
this.restClient.setHosts(hostsFiltered.toArray(new HttpHost[hostsFiltered.size()]));
}
} catch (Exception e) {
logger.error("error while sniffing nodes", e);
nextSniffDelay = sniffAfterFailureDelayMillis;
Member:
The idea behind sniffAfterFailureDelay is that once a node fails, you want to re-sniff, but you may also want to check whether the node has come back through a new sniffing round earlier than you usually would. So in general sniffAfterFailureDelay would be lower than sniffInterval. I see that here you are now using a different interval for cases where sniffing itself has failed, but that was not the purpose of sniffAfterFailureDelay. It is not a bad idea to actually re-sniff earlier if something goes wrong, but it should be a different concept.
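To make the distinction concrete, a minimal sketch under the assumption that a separate retry setting is introduced; the `SNIFF_RETRY_DELAY_MILLIS` name below is invented for illustration and exists neither in this change nor in the client.

```java
// Illustrative only: three different situations, three separate delays.
final class SniffTimings {
    // regular interval between sniffing rounds
    static final long SNIFF_INTERVAL_MILLIS = 5 * 60 * 1000;
    // a node failed a request: re-sniff sooner to pick up cluster changes
    static final long SNIFF_AFTER_FAILURE_DELAY_MILLIS = 60 * 1000;
    // the sniffing round itself failed: retry on its own schedule
    static final long SNIFF_RETRY_DELAY_MILLIS = 30 * 1000;
}
```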

} finally {
scheduleNextRun(nextSniffDelayMillis);
scheduleNextRun(nextSniffDelay);
running.set(false);
}
}
}

/**
* Removes excluded hosts from the list of all sniffed hosts and advances each excluded host's exclusion counter.
* Hosts remain excluded for {@link org.elasticsearch.client.sniff.Sniffer.Task#maxExcludedRounds} sniffing rounds
* before becoming eligible again
* @param allHosts the list of all sniffed hosts
* @return a new list containing the remaining hosts
*/
private List<HttpHost> removeExcludedAndCycle(List<HttpHost> allHosts) {
final List<HttpHost> excluded = Collections.list(failedHosts.keys());

if (excluded.isEmpty()) {
return allHosts;
}

try {
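// filter a copy so that the caller's list of sniffed hosts is left untouched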
List<HttpHost> copy = new ArrayList<>(allHosts);
copy.removeAll(excluded);
return copy;
} finally {
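// always count this round against every excluded host and drop hosts
// that have now sat out maxExcludedRounds rounds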
for (HttpHost host : excluded) {
long excludedCycles = failedHosts.get(host) + 1;
if (excludedCycles >= maxExcludedRounds) {
failedHosts.remove(host);
} else {
failedHosts.put(host, excludedCycles);
}
}
}
}

synchronized void shutdown() {
scheduledExecutorService.shutdown();
try {
@@ -30,10 +30,12 @@
public final class SnifferBuilder {
public static final long DEFAULT_SNIFF_INTERVAL = TimeUnit.MINUTES.toMillis(5);
public static final long DEFAULT_SNIFF_AFTER_FAILURE_DELAY = TimeUnit.MINUTES.toMillis(1);
public static final int DEFAULT_HOST_EXCLUDED_SNIFF_ROUNDS = 1;

private final RestClient restClient;
private long sniffIntervalMillis = DEFAULT_SNIFF_INTERVAL;
private long sniffAfterFailureDelayMillis = DEFAULT_SNIFF_AFTER_FAILURE_DELAY;
private int maxExcludedRounds = DEFAULT_HOST_EXCLUDED_SNIFF_ROUNDS;
private HostsSniffer hostsSniffer;

/**
@@ -79,13 +81,24 @@ public SnifferBuilder setHostsSniffer(HostsSniffer hostsSniffer) {
return this;
}

/**
* Sets the number of subsequent sniffing rounds from which a host that has failed a request will be excluded. Has no
* effect unless the client was built with a failure listener via
* {@link org.elasticsearch.client.RestClientBuilder#setFailureListener(RestClient.FailureListener)}
* @param maxExcludedRounds the number of sniffing rounds to exclude a failed host from
*/
public SnifferBuilder setMaxExcludedRounds(int maxExcludedRounds) {
this.maxExcludedRounds = maxExcludedRounds;
return this;
}

/**
* Creates the {@link Sniffer} based on the provided configuration.
*/
public Sniffer build() {
if (hostsSniffer == null) {
this.hostsSniffer = new ElasticsearchHostsSniffer(restClient);
}
return new Sniffer(restClient, hostsSniffer, sniffIntervalMillis, sniffAfterFailureDelayMillis);
return new Sniffer(restClient, hostsSniffer, sniffIntervalMillis, sniffAfterFailureDelayMillis, maxExcludedRounds);
}
}
@@ -82,8 +82,9 @@ public void testUsage() throws IOException {
.build();
Sniffer sniffer = Sniffer.builder(restClient)
.setSniffAfterFailureDelayMillis(30000) // <2>
.setMaxExcludedRounds(3) // <3>
.build();
sniffOnFailureListener.setSniffer(sniffer); // <3>
sniffOnFailureListener.setSniffer(sniffer); // <4>
//end::sniff-on-failure
}
{
3 changes: 2 additions & 1 deletion docs/java-rest/low-level/sniffer.asciidoc
@@ -98,7 +98,8 @@ normal and we want to detect that as soon as possible. Said interval can be
customized at `Sniffer` creation time through the `setSniffAfterFailureDelayMillis`
method. Note that this last configuration parameter has no effect in case sniffing
on failure is not enabled, as explained above.
<3> Set the `Sniffer` instance to the failure listener
<3> Set the number of sniffing rounds for which a failed host is excluded
<4> Set the `Sniffer` instance to the failure listener

The Elasticsearch Nodes Info api doesn't return the protocol to use when
connecting to the nodes but only their `host:port` key-pair, hence `http`
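On that last point, a minimal sketch of configuring the sniffer for nodes served over `https`, assuming the `ElasticsearchHostsSniffer(RestClient, long, Scheme)` constructor and `Scheme` enum available in this version of the client (imports and the surrounding method are omitted, matching the snippets above):

```java
RestClient restClient = RestClient.builder(
        new HttpHost("localhost", 9200, "https"))
        .build();

// The Nodes Info API only reports host:port, so the sniffer has to be told
// which scheme to use when rebuilding the host list.
HostsSniffer hostsSniffer = new ElasticsearchHostsSniffer(
        restClient,
        ElasticsearchHostsSniffer.DEFAULT_SNIFF_REQUEST_TIMEOUT,
        ElasticsearchHostsSniffer.Scheme.HTTPS);

Sniffer sniffer = Sniffer.builder(restClient)
        .setHostsSniffer(hostsSniffer)
        .build();
```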