
maintenance mode for Historical #6349

Merged — 12 commits merged into apache:master on Feb 5, 2019

Conversation

@egor-ryashin (Contributor) commented Sep 19, 2018

With this feature, a Historical node can be put in maintenance mode by adding it to historicalNodesInMaintenance in the Druid Coordinator dynamic config. The Coordinator will then move segments off such nodes with the specified priority nodesInMaintenancePriority. This makes it possible to drain segments from nodes that are about to be replaced, avoiding the availability issues that an abrupt shutdown could cause.
Issue #3247
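For illustration, a minimal sketch of the relevant part of the Coordinator dynamic config payload. The host:port values are made-up examples, the priority of 7 is the default mentioned later in this review, and the real dynamic config contains many other fields omitted here:

    {
      "historicalNodesInMaintenance": ["historical-host-1:8083", "historical-host-2:8083"],
      "nodesInMaintenancePriority": 7
    }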

This design makes the Coordinator the owner of the maintenance list, which requires minimal changes to the current codebase (DruidCoordinatorBalancer, LoadRule).
An alternative design, in which Historical nodes carry their own maintenance flag, would require roughly twice as much code, I expect, because a separate thread would have to fetch that state from Historical nodes via an HTTP endpoint. Introducing the flag into DruidNode also seems awkward, as DruidNode is defined only at service startup, and changing that would require even more code and unit-test coverage.
So this design delivers the same functionality with less effort. The second approach is effectively an extension of this one and can be implemented at a later stage.
The initial design has one flaw: if a Historical node is restarted, the resource manager may assign it a different port, which forces the operator to update the maintenance list manually. This is rare but possible.

forbidden api fix, config deserialization fix

logging fix, unit tests
@himanshug (Contributor)

LGTM, it's a useful feature.

Have you tested this on a largish cluster with, say, 10%, 20%, 30%... of the nodes in maintenance?

@leventov leventov self-requested a review September 25, 2018 22:01
@egor-ryashin (Contributor Author)

@himanshug Not yet; I've done local tests, and I expect to test it in production with a new release.

@JsonProperty("maxSegmentsInNodeLoadingQueue") int maxSegmentsInNodeLoadingQueue
@JsonProperty("maxSegmentsInNodeLoadingQueue") int maxSegmentsInNodeLoadingQueue,
@JsonProperty("maintenanceList") Object maintenanceList,
@JsonProperty("maintenanceModeSegmentsPriority") int maintenanceModeSegmentsPriority
Member

Shouldn't it be nullable for backward compatibility? It's strange to see both the class and the builder with @JsonCreator

Contributor Author (@egor-ryashin, Sep 26, 2018)

If the JSON doesn't contain the new fields, the object gets an empty list and a priority of 7; with an empty list the Coordinator ignores these parameters.

@@ -51,6 +51,8 @@
private final boolean emitBalancingStats;
private final boolean killAllDataSources;
private final Set<String> killDataSourceWhitelist;
private final Set<String> maintenanceList;
Member

It's confusing that a Set is called a "list". Suggest calling it "hostsInMaintenanceMode" or something similar.

The same applies to killDataSourceWhitelist and killPendingSegmentsSkipList; the old names could be preserved in the @JsonProperty annotations only. Additionally, with these two fields it's very unclear that they store data source names.

Contributor Author (@egor-ryashin, Sep 28, 2018)

Changed to historicalNodesInMaintenance

@@ -33,14 +33,22 @@
private static final Logger log = new Logger(ServerHolder.class);
private final ImmutableDruidServer server;
private final LoadQueuePeon peon;
private final boolean maintenance;
Member

Suggested "inMaintenanceMode". Also a comment or a reference to the PR/issue is appreciated, it might be very non-obvious for the reader what does "maintenance" mean here.


if (toMoveTo.size() <= 1) {
log.info("[%s]: One or fewer servers found. Cannot balance.", tier);
if (general.size() == 0) {
Member

Probably should add || (general.size() == 1 && maintenance.isEmpty())


@@ -95,6 +100,8 @@ public CoordinatorDynamicConfig(
this.killDataSourceWhitelist = parseJsonStringOrArray(killDataSourceWhitelist);
this.killPendingSegmentsSkipList = parseJsonStringOrArray(killPendingSegmentsSkipList);
this.maxSegmentsInNodeLoadingQueue = maxSegmentsInNodeLoadingQueue;
this.maintenanceList = parseJsonStringOrArray(maintenanceList);
this.maintenanceModeSegmentsPriority = Math.min(10, Math.max(0, maintenanceModeSegmentsPriority));
Member (@leventov, Sep 26, 2018)

I think it should check the bounds rather than saturate, to avoid silently swallowing configuration errors.


.withDruidCluster(secondCluster)
.withSegmentReplicantLookup(SegmentReplicantLookup.make(secondCluster))
.withBalancerReferenceTimestamp(DateTimes.of("2013-01-01"))
.withAvailableSegments(Lists.newArrayList(
Member

Could add a withAvailableSegments(DataSegment... segments) method for convenience.

Contributor Author

I don't think we should extend a production class's method list for test purposes only.

Member

This is what should be done. Test-writing convenience and succinctness are important. There are a lot of methods marked @VisibleForTesting in production classes.


.times(2);
EasyMock.replay(throttler, mockPeon, mockBalancerStrategy);

LoadRule rule = createLoadRule(ImmutableMap.of(
Member

Unnecessary breakdown


DruidCoordinatorRuntimeParams params = DruidCoordinatorRuntimeParams.newBuilder()
.withDruidCluster(druidCluster)
.withSegmentReplicantLookup(
Member

Could break .newBuilder() on the next line to avoid this far right alignment


@@ -760,4 +964,36 @@ private static LoadQueuePeon createLoadingPeon(List<DataSegment> segments)

return mockPeon;
}

private static int serverId = 0;
Member

Suggested AtomicInteger just in case


}

@Test
public void testBuilderDefaults()
{

CoordinatorDynamicConfig defaultConfig = CoordinatorDynamicConfig.builder().build();
assertConfig(defaultConfig, 900000, 524288000, 100, 5, 15, 10, 1, false, ImmutableSet.of(), false, 0);
assertConfig(defaultConfig, 900000, 524288000, 100, 5, 15, 10, 1, false, ImmutableSet.of(), false, 0, ImmutableSet.of(),
7);
Member

According to the Druid formatting rules, you should break either all of the arguments or none of them.


@clintropolis (Member)

Neat idea! I'm just getting started on the review, but I was wondering whether there is any reason for this to be part of the balancer (which is already reasonably complicated on its own) instead of introducing a new coordinator runnable dedicated to migrating segments off historicals that have been added to the list?

@@ -346,6 +348,13 @@ public Builder withDataSources(Collection<ImmutableDruidDataSource> dataSourcesC
return this;
}

@VisibleForTesting
public Builder withAvailableSegments(DataSegment... availableSegmentsCollection)
Member

Suggested just "availableSegments"


predicate = predicate.and(s -> !s.isMaintenance());

return queue.stream().filter(predicate).collect(Collectors.toList());
Predicate<ServerHolder> general = s -> !s.isInMaintenance();
Member

Suggested "isGeneral" or "notInMaintenance"

Contributor Author

isGeneral is too abstract, and as for notInMaintenance, having a negation in a name is usually not a good idea.

Member

isGeneral is the same as the current general, just a more conventional name for a Predicate object.

Contributor Author

ah, I got it ✅

@@ -105,14 +105,17 @@ private void balanceTier(
}

Map<Boolean, List<ServerHolder>> partitions = servers.stream()
.collect(Collectors.partitioningBy(ServerHolder::isMaintenance));
.collect(Collectors.partitioningBy(ServerHolder::isInMaintenance));
Member

.collect() is not aligned with .stream(). Suggested to break the whole expression:

Map<Boolean, List<ServerHolder>> partitions =
    servers.stream().collect(Collectors.partitioningBy(ServerHolder::isInMaintenance));


private static ImmutableMap<String, DataSegment> segmentsToMap(DataSegment... segments)
{
return ImmutableMap.copyOf(
Member

Could be Maps.uniqueIndex(Arrays.asList(segments), DataSegment::getIdentifier)
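For reference, the suggested form written out as a complete method (a sketch using Guava's Maps.uniqueIndex and the Druid DataSegment class from this test, keyed by segment identifier as in the original helper):

    import com.google.common.collect.ImmutableMap;
    import com.google.common.collect.Maps;
    import java.util.Arrays;

    // Builds an identifier -> segment map; Maps.uniqueIndex throws if two segments share an identifier.
    private static ImmutableMap<String, DataSegment> segmentsToMap(DataSegment... segments)
    {
      return Maps.uniqueIndex(Arrays.asList(segments), DataSegment::getIdentifier);
    }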


@@ -965,10 +935,10 @@ private static LoadQueuePeon createLoadingPeon(List<DataSegment> segments)
return mockPeon;
}

private static int serverId = 0;
private static AtomicInteger serverId = new AtomicInteger();
Member

Please add final


this.maintenanceModeSegmentsPriority = Math.min(10, Math.max(0, maintenanceModeSegmentsPriority));
this.historicalNodesInMaintenance = parseJsonStringOrArray(historicalNodesInMaintenance);
Preconditions.checkArgument(
nodesInMaintenancePriority >= 0 && nodesInMaintenancePriority <= 10,
Member

Does the priority of 0 make sense?

Contributor Author (@egor-ryashin, Oct 4, 2018)

It allows maintenance to be suppressed for a while without clearing the maintenance list. Segments won't be loaded onto maintenance servers, but they won't be moved off them either, which could help if some general servers get overloaded.

@@ -100,8 +101,12 @@ public CoordinatorDynamicConfig(
this.killDataSourceWhitelist = parseJsonStringOrArray(killDataSourceWhitelist);
Member

Could you please rename the killDataSourceWhitelist and killPendingSegmentsSkipList parameters, fields, and getters so that they don't mention "list", and so that the latter mentions "dataSource" (its actual elements)? The JSON property names could stay the same for compatibility.

Contributor Author

I'm going to change killDataSourceWhitelist -> killableDatasources and killPendingSegmentsSkipList -> protectedPendingSegmentDatasources. Does that make sense?

Member

"DataSources". Some explanation of what do those variables mean would also be nice.

Contributor Author

That is actually documented at http://druid.io/docs/latest/configuration/index.html (the coordinator dynamic config table; screenshot omitted). I could copy it from there, but I think that would be redundant.

Member

Source code should be understandable and self-contained on its own. To understand it now, somebody has to find this doc and then work out how the mentioned property maps back to source code concepts; that's very difficult and time-consuming when somebody just wants to read and understand the code.

Member

Also see #4861

@@ -192,10 +195,14 @@ public boolean isKillAllDataSources()
return killAllDataSources;
}

/**
* List of dataSources for which pendingSegments are NOT cleaned up
* if property druid.coordinator.kill.pendingSegments.on is true.
Member

Could you please reference druid.coordinator.kill.pendingSegments.on in source code terms? Like a boolean field in some class?

Contributor Author

There is no specific variable; it conditionally binds the class, as I understand it:

            ConditionalMultibind.create(
                properties,
                binder,
                DruidCoordinatorHelper.class,
                CoordinatorIndexingServiceHelper.class
            ).addConditionBinding(
                "druid.coordinator.merge.on",
                Predicates.equalTo("true"),
                DruidCoordinatorSegmentMerger.class
            ).addConditionBinding(
                "druid.coordinator.conversion.on",
                Predicates.equalTo("true"),
                DruidCoordinatorVersionConverter.class
            ).addConditionBinding(
                "druid.coordinator.kill.on",
                Predicates.equalTo("true"),
                DruidCoordinatorSegmentKiller.class
            ).addConditionBinding(
                "druid.coordinator.kill.pendingSegments.on",
                Predicates.equalTo("true"),

Member (@leventov, Oct 9, 2018)

Then it could say "list of data sources skipped in {@link DruidCoordinatorCleanupPendingSegments}"

@JsonProperty
public Set<String> getKillDataSourceWhitelist()
/**
* List of dataSources for which kill tasks are sent if property druid.coordinator.kill.on is true.
Member

Same

@leventov leventov added this to the 0.13.0 milestone Oct 10, 2018
* By leveraging the priority an operator can prevent general nodes from overload or decrease maintenance time
* instead.
*
* @return number in range [0, 10]
Contributor

Why not just 1-100 if it represents a percentage?

Contributor Author

The initial term was a priority, which is usually not a big number (e.g., Java thread priority).
The percentage framing only emerged while writing that description. Moreover, I don't expect anyone to want to specify 65%, 75%, or even 67%, 23%, and so on; usually 10, 20, ..., 80 is good enough, and appending a 0 every time seems redundant. The priority term is more concise.
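For a concrete sense of the scale (a worked example based on the maintenanceSegmentsToMove = (int) Math.ceil(maxSegmentsToMove * priority / 10.0) calculation quoted later in this review): with maxSegmentsToMove = 100, a priority of 7 offers up to ceil(100 * 7 / 10) = 70 of the segment moves in a coordination run to draining maintenance servers, with the rest of the budget going to general balancing; a priority of 0 drains nothing, and 10 offers the whole budget to maintenance servers first.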

@drcrallen (Contributor)

I'm reviewing this

@drcrallen (Contributor) left a comment

This is a cool feature.

Can you please add some docs in the code and some in the README for the new configs?

Also, can you please comment in the main PR description on the design tradeoff of having the coordinator be the central point for controlling whether a node is under maintenance, instead of having the historical nodes themselves store their "I'm in maintenance" state?

@@ -92,11 +98,17 @@ public CoordinatorDynamicConfig(
this.balancerComputeThreads = Math.max(balancerComputeThreads, 1);
this.emitBalancingStats = emitBalancingStats;
this.killAllDataSources = killAllDataSources;
this.killDataSourceWhitelist = parseJsonStringOrArray(killDataSourceWhitelist);
this.killPendingSegmentsSkipList = parseJsonStringOrArray(killPendingSegmentsSkipList);
this.killableDatasources = parseJsonStringOrArray(killableDatasources);
Contributor

Any idea why this is using a function in the constructor instead of a Jackson parser?

Contributor Author

I assume it's so that it can handle "killableDatasources": ["a","b"] as well as "killableDatasources": "a, b".
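For illustration, a minimal self-contained sketch of what such a constructor-side parser could look like (a hypothetical stand-in, not the PR's actual parseJsonStringOrArray implementation):

    import java.util.Collection;
    import java.util.LinkedHashSet;
    import java.util.Set;

    public class JsonStringOrArrayExample
    {
      // Accepts either a JSON array (already deserialized by Jackson into a Collection)
      // or a single comma-separated string, and normalizes both into a Set of trimmed names.
      static Set<String> parseJsonStringOrArray(Object value)
      {
        Set<String> result = new LinkedHashSet<>();
        if (value == null) {
          return result;
        }
        if (value instanceof Collection) {
          for (Object o : (Collection<?>) value) {
            result.add(o.toString().trim());
          }
        } else {
          for (String s : value.toString().split(",")) {
            if (!s.trim().isEmpty()) {
              result.add(s.trim());
            }
          }
        }
        return result;
      }

      public static void main(String[] args)
      {
        // Both forms yield the same set: [a, b]
        System.out.println(parseJsonStringOrArray(java.util.Arrays.asList("a", "b")));
        System.out.println(parseJsonStringOrArray("a, b"));
      }
    }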


if (this.killAllDataSources && !this.killDataSourceWhitelist.isEmpty()) {
if (this.killAllDataSources && !this.killableDatasources.isEmpty()) {
Contributor

can you retain the WhiteList descriptor here?

Member

I asked to change this here: #6349 (comment)

Member

@egor-ryashin could you please rename it to killableDataSources, which is the more conventional spelling?


* @return list of host:port entries
*/
@JsonProperty
public Set<String> getHistoricalNodesInMaintenance()
Contributor

If these are host-and-port entries, can they use HostAndPort objects?

Contributor Author

It's possible, but this code works intensively with DruidServerMetadata, which stores a plain String, and HostAndPort doesn't provide an efficient toString():

  /** Rebuild the host:port string, including brackets if necessary. */
  @Override
  public String toString() {
    StringBuilder builder = new StringBuilder(host.length() + 7);
    if (host.indexOf(':') >= 0) {
      builder.append('[').append(host).append(']');
    } else {
      builder.append(host);
    }
    if (hasPort()) {
      builder.append(':').append(port);
    }
    return builder.toString();
  }

this(server, peon, false);
}

public ServerHolder(ImmutableDruidServer server, LoadQueuePeon peon, boolean inMaintenance)
Contributor

Looking forward, how is "in maintenance" different from a blacklist?

For example, let's say a historical is able to detect that it is unhealthy and wants to shed its load in as safe a way as possible, so it declares its desire to have its load shed to other nodes. Would inMaintenance be the proper kind of thing for that?

Or suppose another problem arises where a node has fractional problems, like it can only operate at about 50% of where it should. Is it proper to have the node be inMaintenance, or would a "fractional operating level" be appropriate here?

What I'm curious about is whether it makes sense for the initial implementation to have an int here which represents an arbitrary "weight capacity" for the node, with "in maintenance" simply setting that weight capacity to 0.

Such a thing would open up other methodologies of shedding load in the future, or allow coordinator rules to take a "weight" into account for a server, where "weight" is intended to be dynamic and not a static value for the server.

Contributor Author (@egor-ryashin, Oct 10, 2018)

For example, let's say a historical is able to detect that it is unhealthy and wants to shed its load in as safe a way as possible, so it declares its desire to have its load shed to other nodes. Would inMaintenance be the proper kind of thing for that?
Or suppose another problem arises where a node has fractional problems, like it can only operate at about 50% of where it should. Is it proper to have the node be inMaintenance, or would a "fractional operating level" be appropriate here?

Well, I would make two separate types, like inMaintenance and Unhealthy, to make it clear for the operator; otherwise it could be confusing to see a manually managed list extending itself at random. Also, we would need to define a detection algorithm for the Unhealthy status.

What I'm curious about is whether it makes sense for the initial implementation to have an int here which represents an arbitrary "weight capacity" for the node, with "in maintenance" simply setting that weight capacity to 0.

Yes, inMaintenance and Unhealthy could carry a weight number that way. I assume the priority for each type could differ too, so an operator could, for example, have manually listed nodes drained faster.

Map<Boolean, List<ServerHolder>> partitions =
servers.stream().collect(Collectors.partitioningBy(ServerHolder::isInMaintenance));
final List<ServerHolder> maintenance = partitions.get(true);
final List<ServerHolder> general = partitions.get(false);
Contributor

suggest availableServers or similar


int maintenanceSegmentsToMove = (int) Math.ceil(maxSegmentsToMove * priority / 10.0);
log.info("Processing %d segments from servers in maintenance mode", maintenanceSegmentsToMove);
Pair<Integer, Integer> maintenanceResult = balanceServers(params, maintenance, general, maintenanceSegmentsToMove);
int generalSegmentsToMove = maxSegmentsToMove - maintenanceResult.lhs;
Contributor

max?

Contributor

This also posted to a weird line; it looks like my comments are off.

Contributor

suggest adding the max prefix here


final int maxSegmentsToMove = Math.min(params.getCoordinatorDynamicConfig().getMaxSegmentsToMove(), numSegments);
int priority = params.getCoordinatorDynamicConfig().getNodesInMaintenancePriority();
int maintenanceSegmentsToMove = (int) Math.ceil(maxSegmentsToMove * priority / 10.0);
Contributor

max?

Contributor

I mean, changing the name would help make it clearer

Contributor

but I think I missed the comment by a line :-P

Contributor Author

You mean changing maintenanceSegmentsToMove -> maxMaintenanceSegmentsToMove?

Contributor

yes


)
{
final NavigableSet<ServerHolder> queue = druidCluster.getHistoricalsByTier(tier);
if (queue == null) {
log.makeAlert("Tier[%s] has no servers! Check your cluster configuration!", tier).emit();
return Collections.emptyList();
}

return queue.stream().filter(predicate).collect(Collectors.toList());
Predicate<ServerHolder> isGeneral = s -> !s.isInMaintenance();
Contributor

suggest changing name to isNotInMaintenance

Member

Why not isAvailable or something along those lines, since the implication is that these servers are available to assign segments to? That would also go along with your other variable rename suggestion in the balancer logic.

Contributor

IMHO, since this is specifically about maintenance, the boolean methods should stay related to maintenance.

That being said, I'm curious whether this predicate is in the wrong place. Should it just be part of the passed-in predicate, leaving no "extra" or surprising predicate logic in this method?

Contributor Author

getFilteredHolders is always called to find servers that can be a target location for a segment, so I think it should never return maintenance servers.
Changed to isNotInMaintenance.

return Response.status(Response.Status.BAD_REQUEST)
.entity(ImmutableMap.of("error", setResult.getException()))
.entity(ImmutableMap.<String, Object>of("error", e.getMessage()))
Contributor

I think there's a Jetty util for cleaning the exceptions floating around somewhere

Contributor Author

I suppose it's ServletResourceUtils.jsonize?

Contributor

ServletResourceUtils.sanitizeException I believe


@fjy fjy modified the milestones: 0.13.0, 0.13.1 Oct 10, 2018
@fjy (Contributor) commented Oct 10, 2018

@leventov Given this PR still requires quite a bit of review and testing, I think we should slate it for 0.13.1

@clintropolis (Member) left a comment

Overall LGTM 🤘

I agree with @drcrallen on variable rename suggestions to make code a little clearer

I also highly recommend updating the logic in the computeCost method of org.apache.druid.server.coordinator.CostBalancerStrategy to take into account whether a server is in maintenance and return Double.POSITIVE_INFINITY as the cost of that server, as a failsafe in case anything that calls into the balancer to pick a server omits this check externally.
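A minimal sketch of that failsafe (the placement inside computeCost and the exact method signature are assumed from context here, not copied from the PR):

    // Inside CostBalancerStrategy: treat servers in maintenance mode as infinitely
    // expensive so no caller can accidentally pick them as a move/assignment target.
    protected double computeCost(final DataSegment proposalSegment, final ServerHolder server, final boolean includeCurrentServer)
    {
      if (server.isInMaintenance()) {
        return Double.POSITIVE_INFINITY;
      }
      // ... the existing cost computation would follow here; returning 0 keeps the sketch compilable ...
      return 0.0;
    }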


@dclim (Contributor) commented Oct 11, 2018

@egor-ryashin will you have time to address @drcrallen 's comments in the next few hours?

@egor-ryashin (Contributor Author)

@dclim It's unclear; I'm confused by some of the comments and am waiting for a reply.

@dclim (Contributor) commented Oct 12, 2018

Okay, I'm starting the release process for 0.13.0 now so we can keep the 0.13.1 milestone for this one.

@egor-ryashin (Contributor Author)

@clintropolis @drcrallen I wonder if I answered all your questions?

@clintropolis (Member)

Oops, will have a look today

@clintropolis (Member) left a comment

Very sorry for the delay! I've had another look and I'm good with this patch after conflicts are fixed and travis passes 👍

@egor-ryashin (Contributor Author)

@clintropolis I've resolved merge conflicts

@clintropolis (Member)

👍, @drcrallen do you have any more comments on this PR?

@glasser (Contributor) commented Jan 15, 2019

This seems like a great feature for us to use to implement k8s node upgrades (we use local disks on our k8s machines for historicals, but sometimes do need to upgrade node pools). We would put one machine at a time into maintenance mode, wait for its segments to be fully replicated, then replace the machine, and move on to the next one.

In order to do this automatically I'm curious what the impact on metrics/API is. All it does is cause the coordinator to try to move segments around: it doesn't affect the response to metrics or the sys SQL entries, right? So I could wait until the historical's segment/count metric goes to zero, or until the SQL query {"query":"SELECT count(*) AS segments FROM sys.server_segments WHERE server = '10.48.27.16:8083'"} goes to 0?

@egor-ryashin (Contributor Author)

@glasser

So I could wait until the historical's segment/count metric goes to zero, or until the SQL query {"query":"SELECT count(*) AS segments FROM sys.server_segments WHERE server = '10.48.27.16:8083'"} goes to 0?

Exactly.

@clintropolis (Member)

@egor-ryashin sorry for the delay getting this merged, could you fix up conflicts again? @drcrallen any more comments on this PR? I think it would be nice to get in 0.14.

@egor-ryashin (Contributor Author)

@clintropolis Resolved.

@jon-wei (Contributor) commented Feb 5, 2019

Going to go ahead and merge this for inclusion in 0.14.0 release since it already has 2 approvals

@jon-wei jon-wei merged commit 97b6407 into apache:master Feb 5, 2019
@glasser (Contributor) commented Feb 5, 2019

One thing that's slightly awkward about storing this list as an array in the dynamic config: you can't atomically add or remove an element from this list. If implementing something that uses this feature dynamically (say from a k8s historical's pre-stop hook) it would be nice if they could send a single request that is guaranteed to add them to the list without accidentally removing something else from the list.

Do you think it would be reasonable to add an optional query param to the /druid/coordinator/v1/config POST handler to allow you to specify the expected previous contents, so you can do a compare-and-swap write? One caveat is that you'd need to be sure that the value you sent is exactly the value stored in the DB, not corrupted by being round-tripped through Jackson...

Another option would be to add a specific HTTP endpoint to add an element to a list in the dynamic config. (There'd still need to be a SQL-level compare-and-swap.)

@clintropolis (Member)

@glasser I think that would be a useful follow up, I like the 2nd approach better myself since it could be done without first fetching the config and presumably you already know the host you want to add to the list. It would also lend itself to something like adding an http endpoint directly to the historical servers that could be used to have them add themselves to the maintenance list, similar in spirit to what /druid/worker/v1/enable and /druid/worker/v1/disable provide for middle managers.

@egor-ryashin (Contributor Author)

No argument there, as the PR is intended as an MVP. I plan to improve it after I've thoroughly tested it in a production environment, at least.

@@ -46,8 +46,9 @@ public CoordinatorStats run(DruidCoordinator coordinator, DruidCoordinatorRuntim
} else {
params.getDruidCluster().getAllServers().forEach(
eachHolder -> {
if (colocatedDataSources.stream()
.anyMatch(source -> eachHolder.getServer().getDataSource(source) != null)) {
if (!eachHolder.isInMaintenance()
Member

Shouldn't this !eachHolder.isInMaintenance() condition be in an outer if, rather than in this if-else chain? As it is written, if a node is in maintenance mode, the coordinator will delete segments from it rather than leave the node alone.
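A minimal sketch of the suggested restructuring (the names are taken from the diff context above; the branch bodies are placeholders for the rule's existing logic):

    params.getDruidCluster().getAllServers().forEach(
        eachHolder -> {
          if (eachHolder.isInMaintenance()) {
            // Leave nodes in maintenance mode alone entirely: neither load nor drop.
            return;
          }
          if (colocatedDataSources.stream()
                                  .anyMatch(source -> eachHolder.getServer().getDataSource(source) != null)) {
            // existing branch for servers that host a co-located data source
          } else {
            // existing branch for servers that do not (the one that drops segments)
          }
        }
    );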

@leventov leventov deleted the feature-3247 branch February 20, 2019 14:08
@leventov (Member) commented Feb 20, 2019

In retrospect, I think the "maintenance" term is too vague and somewhat misleading. It suggests that a node is (perhaps temporarily) "conserved", that is, the coordinator doesn't touch it, forever or for some time. I think something like "shuttingDown" or "windingDown" would be better. @egor-ryashin what do you think?

@glasser (Contributor) commented Feb 20, 2019

"draining"?

@leventov (Member)

I don't quite like "shuttingDown" because of false associations with Java's ExecutorService.shutdown/shutdownNow. "draining" sounds good.

@egor-ryashin (Contributor Author)

@leventov
"conserved" is a bit unusual for me as I initially worked on issue specification which was based on "maintenance" term. Besides Mesos also uses "maintenance" term for the same purpose (consequently Marathon, Spark). That term is more popular, I think we should stick to it too.

@leventov (Member)

I think clarity and avoiding developer confusion (I raised this issue after the term confused me when I was working on the code and spent a fair amount of time trying to understand why "maintenance" nodes are not in fact conserved) trumps the fact that the term is already adopted in some other projects for something similar. Are you also sure that it was not a naming mistake in those projects?

I've raised #7148 regarding this.
