[IOTDB-860] Emend the async log applier #1635

Merged

Conversation

@neuyilan (Member)

No description provided.

@jt2594838 jt2594838 added the Module - Cluster PRs for the cluster module label Aug 20, 2020
@jt2594838 (Contributor) left a comment

The fix itself has a potential problem: when new logs keep coming in, it is very likely to time out.
Now I see you want to make sure that when the snapshot is taken, all previous logs have been applied. This side effect had been noticed, but back then AsyncApplier was merely at an experimental stage. Since you have fixed one of the problems, I would like to offer some advice to make AsyncApplier complete (a sketch of items 1-3 follows the list):

  1. When starting to take a snapshot, record the current commitIndex.
  2. Block until all logs whose indices <= the recorded commitIndex are applied. (Use RaftLogManager to do so instead of LogApplier.)
  3. Prevent the log cleaner thread from cleaning logs that are not yet applied.
  4. Change StorageEngine.getInstance().syncCloseAllProcessor(); in takeSnapshot() to send a flush plan within the group. (So the file sequence will not be broken by snapshots.)
  5. When committed logs are recovered during start-up, re-apply all of them. (Note that operation sequences in IoTDB are idempotent.)
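As a reference for items 1-3, a minimal, self-contained sketch of the waiting step is shown below. The nested RaftLogManager interface and the accessor names (getCommitLogIndex(), getAppliedIndex()) are simplified assumptions for illustration, not the actual cluster-module API.

```java
// Sketch of items 1-2 with assumed names, not the real IoTDB classes.
public class SnapshotBarrierSketch {

  /** Simplified stand-in for RaftLogManager: only the two accessors this sketch needs. */
  interface RaftLogManager {
    long getCommitLogIndex(); // highest committed log index
    long getAppliedIndex();   // highest applied log index
  }

  private static final long TIMEOUT_MS = 60_000;
  private static final long CHECK_INTERVAL_MS = 100;

  /** Returns true once all logs up to the recorded commitIndex are applied, false on timeout. */
  static boolean waitUntilApplied(RaftLogManager logManager) throws InterruptedException {
    long snapshotCommitIndex = logManager.getCommitLogIndex(); // item 1: record the commitIndex
    long deadline = System.currentTimeMillis() + TIMEOUT_MS;
    // item 2: block until every log with index <= snapshotCommitIndex has been applied
    while (logManager.getAppliedIndex() < snapshotCommitIndex) {
      if (System.currentTimeMillis() > deadline) {
        return false; // caller decides whether to abort or retry the snapshot
      }
      Thread.sleep(CHECK_INTERVAL_MS);
    }
    // item 3 (not shown): the log cleaner must be prevented from removing logs whose
    // index is above getAppliedIndex() while this wait is in progress.
    return true;
  }
}
```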

@neuyilan (Member, Author)

> The fix itself has a potential problem: when new logs keep coming in, it is very likely to time out.
> Now I see you want to make sure that when the snapshot is taken, all previous logs have been applied. This side effect had been noticed, but back then AsyncApplier was merely at an experimental stage. Since you have fixed one of the problems, I would like to offer some advice to make AsyncApplier complete:
>
>   1. When starting to take a snapshot, record the current commitIndex.
>   2. Block until all logs whose indices <= the recorded commitIndex are applied. (Use RaftLogManager to do so instead of LogApplier.)
>   3. Prevent the log cleaner thread from cleaning logs that are not yet applied.
>   4. Change StorageEngine.getInstance().syncCloseAllProcessor(); in takeSnapshot() to send a flush plan within the group. (So the file sequence will not be broken by snapshots.)
>   5. When committed logs are recovered during start-up, re-apply all of them. (Note that operation sequences in IoTDB are idempotent.)

Great suggestion. I think items 1-3 are meant to ensure that while a snapshot is being taken, no new logs can be added and the snapshot task does not take too long. However, the current implementation already blocks new incoming logs: taking a snapshot requires the logManager lock, and committing a log also requires that lock, so new logs cannot be committed during the snapshot. We only need to wait for all previously committed logs to be applied when taking the snapshot (see the sketch below).

I'm going to rethink suggestions 4 and 5.
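A self-contained illustration of this locking argument is sketched below; the class and method names are simplified stand-ins, not the actual LogManager code. Because both committing a log and taking a snapshot need the same logManager lock, new logs cannot be committed while the snapshot runs, and the snapshot only has to wait for the logs that were committed before it started.

```java
// Simplified stand-in (not the real IoTDB LogManager) showing why holding the
// logManager lock during a snapshot keeps new logs from being committed.
class LogManagerLockSketch {
  private final Object logManagerLock = new Object();
  private long commitIndex = -1;
  private volatile long appliedIndex = -1; // advanced by the single async applier thread

  void commitLog(long index) {
    synchronized (logManagerLock) { // committing a log needs the logManager lock ...
      commitIndex = Math.max(commitIndex, index);
    }
  }

  void takeSnapshot() throws InterruptedException {
    synchronized (logManagerLock) { // ... and so does the snapshot, so new commits block here
      long target = commitIndex;    // only logs committed before the snapshot matter
      while (appliedIndex < target) { // the applier does not need the lock, so it can catch up
        Thread.sleep(100);
      }
      // build the snapshot from the applied state here
    }
  }

  // called by the async applier thread after each committed log is applied
  void onLogApplied(long index) {
    appliedIndex = Math.max(appliedIndex, index);
  }
}
```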

@neuyilan neuyilan changed the title [Distributed] fix async applier bug when do snapshpot [Distributed] fix async applier bug when do snapshpot[in progress] Aug 20, 2020
@neuyilan neuyilan force-pushed the apache_cluster_new_0819_async_applier branch 2 times, most recently from b82d6cc to 8cf4003 on August 21, 2020 12:29
@neuyilan neuyilan changed the title [Distributed] fix async applier bug when do snapshpot[in progress] [Distributed] remend async applier Aug 24, 2020
@neuyilan neuyilan changed the title [Distributed] remend async applier [Distributed] emend async applier Aug 24, 2020
@neuyilan neuyilan changed the title [Distributed] emend async applier [Distributed] emend async applier[in progress] Aug 24, 2020
@neuyilan neuyilan changed the title [Distributed] emend async applier[in progress] [Distributed] emend async applier Aug 25, 2020
@neuyilan neuyilan changed the title [Distributed] emend async applier [Distributed] Emend the async applier Aug 25, 2020
@neuyilan neuyilan changed the title [Distributed] Emend the async applier [Distributed] Emend the async applier[in progress] Aug 26, 2020
@neuyilan neuyilan marked this pull request as draft August 26, 2020 02:01
@neuyilan neuyilan changed the title [Distributed] Emend the async applier[in progress] [Distributed] Emend the async applier Aug 26, 2020
@neuyilan neuyilan marked this pull request as ready for review August 26, 2020 02:44
@neuyilan neuyilan force-pushed the apache_cluster_new_0819_async_applier branch from d639d8b to 5c3f239 on August 27, 2020 05:40
Comment on lines 62 to 76
public void syncFlushAllProcessor() {
  logger.info("{}: Start flush all storage group processor in one data group", getName());
  ConcurrentHashMap<String, StorageGroupProcessor> processorMap = StorageEngine.getInstance()
      .getProcessorMap();
  if (processorMap.size() == 0) {
    logger.info("{}: no need to flush processor", getName());
    return;
  }
  // collect every storage group currently open in the storage engine
  List<Path> storageGroups = new ArrayList<>();
  for (Map.Entry<String, StorageGroupProcessor> entry : processorMap.entrySet()) {
    Path path = new Path(entry.getKey());
    storageGroups.add(path);
  }
  // flush all of them via a FlushPlan sent within the data group
  FlushPlan plan = new FlushPlan(null, true, storageGroups);
  dataGroupMember.flushFileWhenDoSnapshot(plan);
}
@jt2594838 (Contributor)
I am afraid that we cannot flush the whole storage group when time partitioning is enabled, because in that case a storage group will be managed by several data groups: if you flush one storage group without notifying the other data groups, the file integrity of those data groups will be broken.
So you should either flush the other related groups (which are rather hard to find), or only flush the partitions that belong to this data group.

@neuyilan (Member, Author)
Sure, thanks for your advice. I'd like to flush only the partitions that belong to the data group (see the sketch below).
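For reference, a minimal, self-contained sketch of that partition-scoped flush is shown below. The partition-table interface and method names are simplified stand-ins; the exact FlushPlan constructor and partition lookup differ in the real code.

```java
import java.util.*;

// Sketch (simplified stand-in types, not the real IoTDB classes): collect, per storage
// group, only the time partitions this data group owns, instead of whole storage groups.
class PartitionScopedFlushSketch {

  /** Stand-in for the partition table: which time partitions of a storage group this data group owns. */
  interface PartitionTable {
    List<Long> ownedTimePartitions(String storageGroup);
  }

  /** Builds storageGroup -> owned partition ids, the payload a partition-scoped flush plan would carry. */
  static Map<String, List<Long>> collectFlushTargets(Collection<String> storageGroups,
                                                     PartitionTable table) {
    Map<String, List<Long>> targets = new HashMap<>();
    for (String sg : storageGroups) {
      List<Long> owned = table.ownedTimePartitions(sg);
      if (!owned.isEmpty()) {
        targets.put(sg, owned); // flush only these partitions, not the whole group
      }
    }
    return targets;
  }
}
```

Flushing only these partitions keeps the file sequence of other data groups that share the same storage group intact.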

@neuyilan neuyilan changed the title [Distributed] Emend the async applier [IOTDB-860] Emend the async log applier Sep 1, 2020
@jt2594838 jt2594838 merged commit 8fda671 into apache:cluster_new Sep 4, 2020