init dummyIndex after restart cluster #3939

Merged: 5 commits merged into apache:master from cigarl:di_fix on Oct 11, 2021
Conversation

@cigarl (Contributor) commented Sep 9, 2021

In the current logic, if we restart the cluster and applyIndex is smaller than commitIndex, three problems occur.

  • First, we cannot append un-applied logs to commitManager. Like 103 and 104 in the picture, they cannot be appended to commitEntries in commitManager.
  • Second, unCommitManager cannot append a new log. For example, log 0 cannot be appended to unCommitEntries in unCommitManager.
  • Third, un-applied logs cannot be re-applied, which means 103 and 104 could be lost.

This all happens because dummyIndex is initialized to its default value, so it cannot return to its pre-reboot state.

PS: The information in the picture assumes none of these problems exists.
[picture omitted]

This PR tries to resolve this problem, but we may also need a processor to keep logIndex from growing too large: dummyIndex would need to be re-initialized if logIndex is going to become bigger than we expect.
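For illustration, here is a minimal, self-contained sketch of the recovery idea (all class and field names are hypothetical, not IoTDB's actual API): on restart, dummyIndex and commitIndex are restored from the persisted state instead of being left at their defaults, so committed-but-unapplied entries such as 103 and 104 land back in the committed manager and new entries can still be appended.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RestartRecoverySketch {
  long dummyIndex;      // index just before the first entry we still hold
  long commitIndex;
  final Deque<Long> commitEntries = new ArrayDeque<>();
  final Deque<Long> unCommitEntries = new ArrayDeque<>();

  // Restore the pre-reboot state: commitIndex comes from the persisted meta,
  // dummyIndex from the first persisted log index (or commitIndex if no log
  // entries survived on disk).
  void recover(long persistedCommitIndex, long[] persistedLogIndexes) {
    commitIndex = persistedCommitIndex;
    dummyIndex = persistedLogIndexes.length == 0
        ? persistedCommitIndex
        : persistedLogIndexes[0] - 1;
    for (long index : persistedLogIndexes) {
      if (index <= commitIndex) {
        commitEntries.add(index);     // e.g. committed-but-unapplied 103, 104
      } else {
        unCommitEntries.add(index);   // still waiting to be committed
      }
    }
  }
}
```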

@coveralls commented Sep 9, 2021

Coverage increased (+0.3%) to 67.748% when pulling 9449a6b on cigarl:di_fix into 601df56 on apache:master.

@cigarl marked this pull request as ready for review September 10, 2021 06:04
@HTHou added the Module - Cluster (PRs for the cluster module) label Sep 13, 2021
@OneSizeFitsQuorum (Contributor) left a comment

To solve these problems completely, I think raftLogManager needs a complete refactoring, including but not limited to better concurrency control, a better persistence strategy, etc.

When I first joined the community in my senior year, the first thing I did was implement raftLogManager with reference to the storage and unstable interfaces in etcd. Because I didn't have a very deep understanding of Raft and etcd at that time, I eventually realized that this was probably an awful implementation, and I apologize for that. There are two main reasons:

  • For Raft, uncommitted logs also need to be persisted (i.e., logs on disk may also need to be truncated) in order to ensure correctness after a reboot.
  • etcd has an event-driven architecture, so its internals are entirely lock-free. However, in our architecture, raftLogManager can be accessed concurrently by multiple threads, so we have added a lot of patches for concurrency control; this is an area we could implement better.

Thank you very much for your great contributions. This week I will focus on the PR you raised against the cluster- branch and a bug at hand. After that, I would like to discuss the refactoring of raftLogManager with you and the community. And of course, you're welcome to start designing the new raftLogManager with us right now if you are free.

@cigarl (Contributor, Author) commented Sep 13, 2021

> To solve these problems completely, I think raftLogManager needs a complete refactoring, including but not limited to better concurrency control, a better persistence strategy, etc. […]

IMO, we need the following two approaches in parallel:

  • First, we should fix this problem within the current code structure and submit it to the master and rel/0.12.x branches. When a user hits a problem like mine, it is hard to solve immediately, which means some data could be lost and the cluster may enter an unstable state. That is a disaster for a production environment, so we need to provide users with a patch that avoids it.
  • Second, we can refactor this part on the cluster- branch. Once the first problem is solved, we can spend more time refactoring this part properly, and we don't have to push it to users immediately; it can go into the next release as a better implementation.

I'd be happy to do it together : )

@OneSizeFitsQuorum (Contributor)

> I'd be happy to do it together : )

OK, then here are my reviews~ At the moment, it seems that getAllEntriesAfterCommittedIndex() will always return an empty list in the current design? What's your opinion?

@cigarl (Contributor, Author) commented Sep 13, 2021

> OK, then here are my reviews~ At the moment, it seems that getAllEntriesAfterCommittedIndex() will always return an empty list in the current design? What's your opinion?

See #3930.
In my environment, it is not empty; this is caused by the meta information not being flushed synchronously.

Comment on lines +240 to +250
/**
 * When raft log files are flushed, the meta is not flushed synchronously. So data that has already
 * been flushed to disk may still count as uncommitted according to the persistent LogManagerMeta
 * (the meta's info is stale). We need to recover these already-persistent logs.
 *
 * <p>For example, commitIndex is 5 in the persistent LogManagerMeta, but the log file has actually
 * been flushed up to 7; when we restart the cluster, we need to recover 6 and 7.
 *
 * <p>Maybe we could extract getAllEntriesAfterAppliedIndex and getAllEntriesAfterCommittedIndex
 * into a single getAllEntriesByIndex, but right now too many test cases use them.
 */
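For illustration, a minimal sketch of the recovery this comment describes (names are hypothetical, not the actual method in this PR): scan the flushed log for entries beyond the stale persisted commitIndex and hand them back to the log manager on restart.

```java
import java.util.ArrayList;
import java.util.List;

class StaleMetaRecoverySketch {
  // The meta's commitIndex may lag the log file (e.g. meta says 5, disk holds
  // up to 7), so collect every flushed entry beyond it for recovery.
  static List<Long> entriesAfterCommittedIndex(
      long staleCommitIndex, long[] flushedLogIndexes) {
    List<Long> toRecover = new ArrayList<>();
    for (long index : flushedLogIndexes) {
      if (index > staleCommitIndex) {
        toRecover.add(index);   // indexes 6 and 7 in the example above
      }
    }
    return toRecover;
  }
}
```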
Member:

From this comment, it seems the log's meta file is not flushed synchronously with the raft log; however, the persisted committed raft log may already have been applied.
We should not commit the entry again on restart.

Contributor Author:

The problem here is that we don't know the actual applyIndex.
Fortunately, in IoTDB, repeating an apply operation doesn't matter too much, but losing data would be unacceptable.
That is only my view.

Member:

As we have so many kinds of plans, I am not sure all of them are idempotent.

Contributor:

According to my understanding, re-applying the same sequence of plans (the order is guaranteed by the log index) won't break the correctness of IoTDB. And it looks like we don't have a better choice.
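To make the trade-off concrete, a hypothetical sketch (illustrative names only): since the true applyIndex is unknown after a restart, replay committed entries from the possibly stale persisted applyIndex in strict log-index order, accepting that some may be applied twice rather than risking lost data.

```java
import java.util.Comparator;
import java.util.List;

class ReplaySketch {
  static final class LogEntry {
    final long index;
    final Runnable plan;
    LogEntry(long index, Runnable plan) { this.index = index; this.plan = plan; }
  }

  // Re-apply in log-index order from a possibly stale applyIndex. Entries
  // between the stale index and the true one are applied a second time, which
  // the discussion above deems acceptable; skipping them could lose data.
  static void replay(long persistedApplyIndex, List<LogEntry> committedLogs) {
    committedLogs.stream()
        .filter(e -> e.index > persistedApplyIndex)
        .sorted(Comparator.comparingLong(e -> e.index))
        .forEach(e -> e.plan.run());
  }
}
```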

@chengjianyun (Contributor) commented Sep 13, 2021

> To solve these problems completely, I think raftLogManager needs a complete refactoring, including but not limited to better concurrency control, a better persistence strategy, etc. […]

Thanks for your implementation, which let us enjoy the first cluster version of IoTDB :). The Raft algorithm is really hard to implement; there are too many corner cases to consider. Making the algorithm work correctly in most cases within a short time is terrific.

In my opinion, we would have to spend too much effort guaranteeing the correctness of Raft if we keep working on the current cluster implementation. To my mind, the cluster module lacks a whole-system design.

My other suggestion is that we gradually separate the Raft implementation from the Raft client (the application), just as etcd did. The goal is that, eventually, we can bring in Ratis to help manage the Raft state. Maybe it's time to put Ratis on the agenda. I strongly suggest someone spend some time investigating Ratis and evaluating the cost of integration. Instead of worrying about Raft correctness, we could then focus on improving the core engine of IoTDB and its ecosystem.

@neuyilan (Member) commented Sep 13, 2021

> My other suggestion is that we gradually separate the Raft implementation from the Raft client (the application), just as etcd did. […]

Actually, I have the same idea. I think we can integrate Apache Ratis into our project, so that at least we can guarantee the correctness of Raft and spend our energy on IoTDB's own features. At present, I think correctness matters far more than performance.
But it is a really time-consuming task; we need to work together to finish it.

@chengjianyun (Contributor)

> Actually, I have the same idea. I think we can integrate Apache Ratis into our project […] But it is a really time-consuming task; we need to work together to finish it.

Very glad to work together. As I will be working on the Prometheus integration later this month, I only have a little bandwidth for this, but I'd like to propose a design for discussion sooner or later, after some investigation.

@OneSizeFitsQuorum (Contributor)

> My other suggestion is that we gradually separate the Raft implementation from the Raft client (the application), just as etcd did. […]

In fact, over the last few months I have studied the Raft algorithm carefully and completed the 6.824 labs. That is just an instructional Raft implementation, yet it made me deeply appreciate how difficult it is to implement a correct consensus algorithm, and a production-level Raft requires much more. So I think maybe it's time for us to weigh the options and make a decision. Choosing a mature implementation of the Raft algorithm would free up our productivity; it may take a long time at first, but could be hugely beneficial once done.

But I also want to list two concerns:

  • Production-level Raft implementations generally guarantee linearizability, the strictest consistency model in distributed systems. Our current Raft implementation may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time its performance may be better. What would we say if we migrated the Raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?

  • At the moment the Raft implementation and the business logic are mixed together, which buys a certain amount of performance. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously, even though at the bottom they could be executed in parallel. So we implemented a parallel asynchronous apply optimization, so that plans from the same Raft group but different storage groups can be applied in parallel (see the sketch after this list). The performance gain from this optimization is significant. How should we handle this case after migrating the Raft implementation? Add a parallel apply feature to the Raft library? Change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?
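A self-contained sketch of the parallel-apply idea from the second concern (all names hypothetical): plans from one Raft group are dispatched to per-storage-group single-threaded executors, preserving order within a storage group while different groups apply in parallel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelApplySketch {
  // One single-threaded executor per storage group: entries for the same
  // group stay in log order, entries for different groups run concurrently.
  private final Map<String, ExecutorService> applyExecutors = new ConcurrentHashMap<>();

  void apply(String storageGroup, Runnable plan) {
    applyExecutors
        .computeIfAbsent(storageGroup, sg -> Executors.newSingleThreadExecutor())
        .execute(plan);
  }
}
```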

Any discussions are welcomed~

@neuyilan (Member)

> is linearizability really necessary for OLAP databases like time-series databases?

As I see it, Raft is a consensus algorithm, which may not have much to do with linearizability per se; I'll just put the post [1] here for reference.

> we currently put data from multiple storage groups into one Raft group to be executed synchronously.

As far as I know, the apply function is user-defined, so we could still implement a parallel apply function keyed by storage group within one Raft log.

What is certain is that using a Raft library will limit our optimization work compared with the current implementation (which mixes the Raft framework and business logic), but I think availability and correctness matter far more than performance for now.
[1] https://zhuanlan.zhihu.com/p/47117804

@chengjianyun (Contributor)

> As I see it, Raft is a consensus algorithm, which may not have much to do with linearizability per se […] I think availability and correctness matter far more than performance for now.

Let's move the Ratis-related discussion to #3954 so that more people can join.

@cigarl closed this Sep 16, 2021
@cigarl reopened this Sep 16, 2021
@cigarl (Contributor, Author) commented Sep 16, 2021

> but we may also need a processor to keep logIndex from growing too large: dummyIndex would need to be re-initialized if logIndex is going to become bigger than we expect.

This does not seem to be a problem in practice. I calculated that, even handling 100 million requests per second, the cluster could run for about 2,924 years before logIndex overflows (see the check below).
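A quick sanity check of that arithmetic, assuming logIndex is a signed 64-bit value:

```java
public class IndexLifetime {
  public static void main(String[] args) {
    // Long.MAX_VALUE indexes at 1e8 appends per second, in 365-day years.
    double years = Long.MAX_VALUE / 1e8 / (365.0 * 24 * 3600);
    System.out.printf("~%.1f years%n", years);  // prints ~2924.7 years
  }
}
```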

Sorry, I closed this PR by mistake; I have reopened it.

@neuyilan (Member)

Any problems with this PR? If not, I think we can merge it.

@neuyilan merged commit 40397fd into apache:master Oct 11, 2021
@cigarl deleted the di_fix branch October 11, 2021 09:00
neuyilan pushed a commit to neuyilan/iotdb that referenced this pull request Oct 11, 2021
mychaow pushed a commit that referenced this pull request Oct 11, 2021
Co-authored-by: lisijia <44458757+cigarl@users.noreply.github.com>
cornmonster pushed a commit to cornmonster/iotdb that referenced this pull request Oct 25, 2021
Labels: Module - Cluster (PRs for the cluster module)
6 participants