@deardeng deardeng commented Dec 4, 2023

…ailed window to check be's health state

Proposed changes

Add two windows to detect the health status of BEs, and optimize the publish version and TableSchedule logic.
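
For readers unfamiliar with the idea, a per-BE health window could look roughly like the sketch below. This is not the PR's code: the class name `FailureWindow`, the window size, and the rule "unhealthy only when the window is full and every observation failed" are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Doris: remembers the most recent
// observations per backend in a fixed-size window.
final class FailureWindow {
    private final int size;
    private final Map<Long, Deque<Boolean>> windows = new HashMap<>();

    FailureWindow(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Record one observation (true = failed), evicting the oldest when full.
    synchronized void record(long backendId, boolean failed) {
        Deque<Boolean> window = windows.computeIfAbsent(backendId, id -> new ArrayDeque<>());
        if (window.size() == size) {
            window.removeFirst();
        }
        window.addLast(failed);
    }

    // Assumed rule: unhealthy only when the window is full and every entry failed.
    synchronized boolean isUnhealthy(long backendId) {
        Deque<Boolean> window = windows.get(backendId);
        return window != null && window.size() == size && !window.contains(Boolean.FALSE);
    }
}
```

One such window could track publish version accumulation and another clone failures, so that a single bad report does not immediately mark a BE as unhealthy.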


@deardeng deardeng marked this pull request as draft December 4, 2023 12:42
@deardeng deardeng marked this pull request as ready for review December 4, 2023 12:42
@deardeng deardeng marked this pull request as draft December 4, 2023 12:43
}

if (Config.create_new_replica_in_health_backends && !healthBes.isEmpty()
&& (Env.getCurrentSystemInfo().isLastPublishVersionAccumulated(be.getId())

There is no need to consider publish version task accumulation here. Just check whether clone has failed for many replicas.

* database lock should be held.
*/
public void chooseDestReplicaForVersionIncomplete(Map<Long, PathSlot> backendsWorkingSlots)
public void chooseDestReplicaForVersionIncomplete(Map<Long, PathSlot> backendsWorkingSlots, List<Long> healthBes)

@yujun777 yujun777 Dec 6, 2023

Use a boolean here, e.g. 'skipAlwaysCloneFail'.

stat.counterReplicaVersionMissingErr.incrementAndGet();
try {
tabletCtx.chooseDestReplicaForVersionIncomplete(backendsWorkingSlots);
Set<Long> bes = Sets.newHashSet();

boolean skipAlwaysCloneFail = Config.create_new_replica_in_health_backends
        && backends.stream().anyMatch(be -> be.isSchedAvailable
                && be not in tablet.backends
                && be.getTag() == tablet.getTag()
                && be has a disk with the tablet's storage medium);
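
The pseudocode above can be sketched as runnable Java under some simplifying assumptions. `BackendInfo` and `CloneTargetCheck` are hypothetical stand-ins, not Doris classes; tags and storage mediums are modeled as plain strings, and the config flag is passed in as a parameter.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified view of a backend; not the real Doris Backend class.
final class BackendInfo {
    final long id;
    final boolean schedAvailable;
    final String tag;
    final Set<String> storageMediums;

    BackendInfo(long id, boolean schedAvailable, String tag, Set<String> storageMediums) {
        this.id = id;
        this.schedAvailable = schedAvailable;
        this.tag = tag;
        this.storageMediums = storageMediums;
    }
}

final class CloneTargetCheck {
    // True when the flag is on and at least one backend that does not already
    // hold the tablet is schedulable, has the same tag, and has a disk of the
    // tablet's storage medium.
    static boolean skipAlwaysCloneFail(boolean createNewReplicaInHealthBackends,
                                       List<BackendInfo> backends,
                                       Set<Long> tabletBackendIds,
                                       String tabletTag,
                                       String tabletMedium) {
        return createNewReplicaInHealthBackends
                && backends.stream().anyMatch(be -> be.schedAvailable
                        && !tabletBackendIds.contains(be.id)
                        && be.tag.equals(tabletTag)
                        && be.storageMediums.contains(tabletMedium));
    }
}
```

Computing the flag once in the scheduler and passing a boolean keeps the signature of chooseDestReplicaForVersionIncomplete independent of how healthy backends are determined.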

tabletCtx.setErrMsg(e.getMessage());
if (e.getStatus() == Status.RUNNING_FAILED) {
tabletCtx.increaseFailedRunningCounter();
Env.getCurrentSystemInfo().updateControlMaps(tabletCtx.getSrcBackendId(),

Can we check whether the cloneTask return code is OK at line 1718?

backendIdLastTimesIsAccumulated = ImmutableMap.copyOf(copiedMap);
}

public void updateControlMaps(Long backendId, Map<Long, Set<Long>> map) {

Could this set grow too big?

Comment on lines +157 to +161
if (!tasks.containsRow(backendId) || !runningTasks.containsKey(TTaskType.PUBLISH_VERSION)) {
return;
}
Env.getCurrentSystemInfo().updateLastPublishVersionFailedMap(backendId,
runningTasks.get(TTaskType.PUBLISH_VERSION).size() > Config.publish_version_queued_limit_number);

Delete the tasks.containsRow(backendId) check.
When a txn finishes, the FE removes all of the BEs' publish tasks from the agent queue, so the tasks held in the FE may be empty.
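
A minimal sketch of the check after this suggestion, assuming the BE's reported running tasks are available as a map from task type to task signatures. `PublishQueueCheck` and its parameters are hypothetical; the task type is modeled as a string and the limit stands in for Config.publish_version_queued_limit_number.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical helper, not Doris code: decides from the BE's report alone
// whether its publish version queue is accumulated.
final class PublishQueueCheck {
    static final String PUBLISH_VERSION = "PUBLISH_VERSION";

    // Empty when the report carries no publish version entry; otherwise
    // whether the number of queued tasks exceeds the limit.
    static Optional<Boolean> isAccumulated(Map<String, Set<Long>> runningTasks, int queuedLimit) {
        Set<Long> publishTasks = runningTasks.get(PUBLISH_VERSION);
        if (publishTasks == null) {
            return Optional.empty();
        }
        return Optional.of(publishTasks.size() > queuedLimit);
    }
}
```

The result depends only on what the BE reports, so an empty FE-side task table no longer short-circuits the update.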

Set<Long> slowBes = Sets.newHashSet();
AtomicBoolean hasBackendAliveAndUnfinishedTask = new AtomicBoolean(false);
transactionState.getPublishVersionTasks().forEach((beId, task) -> {
if (task.isFinished()) {

Combine this code:
boolean unfinishedTaskIsDeadOrPublishSlow = false;
if (task.isFinished()) {
    finishNum++;
    // other ...
} else {
    if (be.isDead() || be.isPublishSlow()) {
        unfinishedTaskIsDeadOrPublishSlow = true;
    }
}
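
The combined loop could be sketched as self-contained Java like this. `PublishTaskState`, `PublishSummary`, and `PublishTaskScan` are hypothetical types introduced only for the example; the real code iterates transactionState.getPublishVersionTasks() and looks the backend up separately.

```java
import java.util.Map;

// Hypothetical snapshot of one publish task and the state of its backend.
final class PublishTaskState {
    final boolean finished;
    final boolean backendDead;
    final boolean backendPublishSlow;

    PublishTaskState(boolean finished, boolean backendDead, boolean backendPublishSlow) {
        this.finished = finished;
        this.backendDead = backendDead;
        this.backendPublishSlow = backendPublishSlow;
    }
}

// Hypothetical result of a single pass over all publish tasks.
final class PublishSummary {
    final int finishNum;
    final boolean unfinishedTaskIsDeadOrPublishSlow;

    PublishSummary(int finishNum, boolean unfinishedTaskIsDeadOrPublishSlow) {
        this.finishNum = finishNum;
        this.unfinishedTaskIsDeadOrPublishSlow = unfinishedTaskIsDeadOrPublishSlow;
    }
}

final class PublishTaskScan {
    // One pass: count finished tasks and note whether any unfinished task
    // belongs to a backend that is dead or slow to publish.
    static PublishSummary scan(Map<Long, PublishTaskState> tasksByBackend) {
        int finishNum = 0;
        boolean unfinishedTaskIsDeadOrPublishSlow = false;
        for (PublishTaskState task : tasksByBackend.values()) {
            if (task.finished) {
                finishNum++;
            } else if (task.backendDead || task.backendPublishSlow) {
                unfinishedTaskIsDeadOrPublishSlow = true;
            }
        }
        return new PublishSummary(finishNum, unfinishedTaskIsDeadOrPublishSlow);
    }
}
```

A plain for loop also avoids the AtomicBoolean that the lambda-based forEach needs in order to mutate a flag.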

@deardeng deardeng closed this Dec 25, 2023