[improvement](publish) Add publish task cumulation window and clone f… #27968
…ailed window to check be's health state
}

if (Config.create_new_replica_in_health_backends && !healthBes.isEmpty()
        && (Env.getCurrentSystemInfo().isLastPublishVersionAccumulated(be.getId())
No need to consider publish version task accumulation here. Just check whether the backend has had many replica clone failures.
 * database lock should be held.
 */
public void chooseDestReplicaForVersionIncomplete(Map<Long, PathSlot> backendsWorkingSlots)
public void chooseDestReplicaForVersionIncomplete(Map<Long, PathSlot> backendsWorkingSlots, List<Long> healthBes)
Use a boolean parameter instead, e.g. 'skipAlwaysCloneFail'.
stat.counterReplicaVersionMissingErr.incrementAndGet();
try {
    tabletCtx.chooseDestReplicaForVersionIncomplete(backendsWorkingSlots);
    Set<Long> bes = Sets.newHashSet();
Something like this (pseudocode):
boolean skipAlwaysCloneFail = Config.create_new_replica_in_health_backends
        && backends.stream().anyMatch(be -> be.isSchedAvailable
                && be is not in tablet.backends
                && be.getTag() == tablet.getTag()
                && be has a disk with the tablet's storage medium);
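The predicate suggested in this comment could be sketched as runnable Java along the following lines. This is only an illustration under stated assumptions, not Doris code: `BackendInfo`, `TabletInfo`, `CloneDestPolicy`, and all of their fields are hypothetical stand-ins for the real backend and tablet classes, and the `createNewReplicaInHealthBackends` argument stands in for `Config.create_new_replica_in_health_backends`.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a backend; not the real Doris Backend class.
final class BackendInfo {
    final long id;
    final boolean schedAvailable;
    final String tag;
    final Set<String> storageMediums;

    BackendInfo(long id, boolean schedAvailable, String tag, Set<String> storageMediums) {
        this.id = id;
        this.schedAvailable = schedAvailable;
        this.tag = tag;
        this.storageMediums = storageMediums;
    }
}

// Hypothetical stand-in for the tablet whose replica is being repaired.
final class TabletInfo {
    final Set<Long> replicaBackendIds;
    final String tag;
    final String storageMedium;

    TabletInfo(Set<Long> replicaBackendIds, String tag, String storageMedium) {
        this.replicaBackendIds = replicaBackendIds;
        this.tag = tag;
        this.storageMedium = storageMedium;
    }
}

final class CloneDestPolicy {
    // True when the flag is on and some other schedulable backend with the same tag
    // and a matching storage medium could host a new replica.
    static boolean skipAlwaysCloneFail(boolean createNewReplicaInHealthBackends,
            List<BackendInfo> backends, TabletInfo tablet) {
        return createNewReplicaInHealthBackends
                && backends.stream().anyMatch(be -> be.schedAvailable
                        && !tablet.replicaBackendIds.contains(be.id)
                        && be.tag.equals(tablet.tag)
                        && be.storageMediums.contains(tablet.storageMedium));
    }
}
```

Computing the boolean once at the call site keeps `chooseDestReplicaForVersionIncomplete` free of any backend list parameter, which appears to be the point of the review comment.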
tabletCtx.setErrMsg(e.getMessage());
if (e.getStatus() == Status.RUNNING_FAILED) {
    tabletCtx.increaseFailedRunningCounter();
    Env.getCurrentSystemInfo().updateControlMaps(tabletCtx.getSrcBackendId(),
Can we check whether the cloneTask return code is OK at line 1718?
    backendIdLastTimesIsAccumulated = ImmutableMap.copyOf(copiedMap);
}

public void updateControlMaps(Long backendId, Map<Long, Set<Long>> map) {
Could this set become too big?
if (!tasks.containsRow(backendId) || !runningTasks.containsKey(TTaskType.PUBLISH_VERSION)) {
    return;
}
Env.getCurrentSystemInfo().updateLastPublishVersionFailedMap(backendId,
        runningTasks.get(TTaskType.PUBLISH_VERSION).size() > Config.publish_version_queued_limit_number);
Delete the tasks.containsRow(backendId) check. When a txn finishes, FE removes all of the BEs' publish tasks from the agent queue, so the tasks held in FE may be empty.
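A minimal sketch of the check with that condition dropped, so the decision depends only on the running tasks the backend reports and not on FE's own task table. `PublishQueueCheck`, the string task-type key, and the `queuedLimit` parameter are hypothetical; the real code uses `TTaskType.PUBLISH_VERSION` and `Config.publish_version_queued_limit_number`.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

final class PublishQueueCheck {
    // Hypothetical key; stands in for TTaskType.PUBLISH_VERSION.
    static final String PUBLISH_VERSION = "PUBLISH_VERSION";

    // Returns empty when the backend reported no publish tasks (nothing to update),
    // otherwise whether the reported queue is longer than the limit.
    static Optional<Boolean> isPublishAccumulated(Map<String, Set<Long>> runningTasks, int queuedLimit) {
        Set<Long> publishTasks = runningTasks.get(PUBLISH_VERSION);
        if (publishTasks == null) {
            return Optional.empty();
        }
        return Optional.of(publishTasks.size() > queuedLimit);
    }
}
```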
Set<Long> slowBes = Sets.newHashSet();
AtomicBoolean hasBackendAliveAndUnfinishedTask = new AtomicBoolean(false);
transactionState.getPublishVersionTasks().forEach((beId, task) -> {
    if (task.isFinished()) {
Combine this code (pseudocode):
boolean unfinishedTaskIsDeadOrPublishSlow = false;
if (task.isFinished()) {
    finishNum++;
    // other ...
} else {
    if (be.isDead or be.isPublishSlow()) {
        unfinishedTaskIsDeadOrPublishSlow = true;
    }
}
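The combined loop proposed in this comment might look roughly like the sketch below. `TaskView` and `PublishTaskSummary` are hypothetical types introduced for illustration; in the real code the finished flag comes from the publish task and the dead or slow state from the backend.

```java
import java.util.Map;

final class PublishTaskSummary {
    // Hypothetical flattened view of one backend's publish task and backend state.
    static final class TaskView {
        final boolean finished;
        final boolean backendDead;
        final boolean backendPublishSlow;

        TaskView(boolean finished, boolean backendDead, boolean backendPublishSlow) {
            this.finished = finished;
            this.backendDead = backendDead;
            this.backendPublishSlow = backendPublishSlow;
        }
    }

    final int finishNum;
    final boolean unfinishedTaskIsDeadOrPublishSlow;

    private PublishTaskSummary(int finishNum, boolean unfinishedTaskIsDeadOrPublishSlow) {
        this.finishNum = finishNum;
        this.unfinishedTaskIsDeadOrPublishSlow = unfinishedTaskIsDeadOrPublishSlow;
    }

    // Single pass: count finished tasks, and flag any unfinished task
    // whose backend is dead or slow to publish.
    static PublishTaskSummary summarize(Map<Long, TaskView> tasksByBackend) {
        int finishNum = 0;
        boolean unfinishedTaskIsDeadOrPublishSlow = false;
        for (TaskView task : tasksByBackend.values()) {
            if (task.finished) {
                finishNum++;
            } else if (task.backendDead || task.backendPublishSlow) {
                unfinishedTaskIsDeadOrPublishSlow = true;
            }
        }
        return new PublishTaskSummary(finishNum, unfinishedTaskIsDeadOrPublishSlow);
    }
}
```

A plain `for` loop lets the counter and flag be ordinary locals, so no `AtomicBoolean` is needed to mutate state from inside a lambda.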
Proposed changes
Add two windows to detect the health status of a BE, and optimize the publish version and TabletScheduler logic.
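As a rough illustration of what such a window could look like, the sketch below keeps the last N observations for one backend and reports it as unhealthy only when the window is full and every observation was bad. `HealthWindow`, its method names, and the "all bad" rule are assumptions made for this example; the PR's actual windows and thresholds may differ.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical fixed-size window of recent observations for one backend.
final class HealthWindow {
    private final int size;
    private final Deque<Boolean> recent = new ArrayDeque<>();

    HealthWindow(int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.size = size;
    }

    // Record one observation, e.g. "publish queue over the limit" or "clone failed".
    synchronized void record(boolean bad) {
        if (recent.size() == size) {
            recent.removeFirst(); // drop the oldest observation
        }
        recent.addLast(bad);
    }

    // Unhealthy only when the window is full and no observation in it was good.
    synchronized boolean isUnhealthy() {
        return recent.size() == size && !recent.contains(Boolean.FALSE);
    }
}
```

One instance per backend and per signal (publish accumulation, clone failure) would give the two windows; a single good observation clears the unhealthy state until the window fills with bad ones again.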