init dummyIndex after restart cluster #3939

Merged: 5 commits merged into apache:master from cigarl:di_fix on Oct 11, 2021
Conversation

@cigarl (Contributor) commented Sep 9, 2021

In the current logic, if we restart the cluster and applyIndex is smaller than commitIndex, three problems occur.

  • First, we cannot append un-applied logs to commitManager. Like 103 and 104 in the picture, they cannot be appended to commitEntries in commitManager.
  • Second, unCommitManager cannot append a new log. For example, log 0 cannot be appended to unCommitEntries in unCommitManager.
  • Third, un-applied logs cannot be re-applied, which means 103 and 104 could be lost.

This all happens because dummyIndex is initialized to its default value, so it cannot return to its pre-reboot state.

PS: The information in the picture assumes none of these problems exists.
[picture omitted]

This PR tries to resolve this problem, but we may also need a processor to keep logIndex from growing too large: dummyIndex would need to be re-initialized if logIndex is going to become bigger than we expect.
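For illustration, here is a minimal, self-contained sketch of the recovery idea (all class and field names are hypothetical, not IoTDB's actual API): on restart, dummyIndex and commitIndex are restored from the persisted state instead of being left at their defaults, so committed-but-unapplied entries such as 103 and 104 land back in the committed manager and new entries can still be appended.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class RestartRecoverySketch {
  long dummyIndex;      // index just before the first entry we still hold
  long commitIndex;
  final Deque<Long> commitEntries = new ArrayDeque<>();
  final Deque<Long> unCommitEntries = new ArrayDeque<>();

  // Restore the pre-reboot state: commitIndex comes from the persisted meta,
  // dummyIndex from the first persisted log index (or commitIndex if no log
  // entries survived on disk).
  void recover(long persistedCommitIndex, long[] persistedLogIndexes) {
    commitIndex = persistedCommitIndex;
    dummyIndex = persistedLogIndexes.length == 0
        ? persistedCommitIndex
        : persistedLogIndexes[0] - 1;
    for (long index : persistedLogIndexes) {
      if (index <= commitIndex) {
        commitEntries.add(index);     // e.g. committed-but-unapplied 103, 104
      } else {
        unCommitEntries.add(index);   // still waiting to be committed
      }
    }
  }
}
```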

@coveralls commented Sep 9, 2021

Coverage increased (+0.3%) to 67.748% when pulling 9449a6b on cigarl:di_fix into 601df56 on apache:master.

@cigarl marked this pull request as ready for review September 10, 2021 06:04
@HTHou added the Module - Cluster (PRs for the cluster module) label Sep 13, 2021
@OneSizeFitsQuorum (Contributor) left a comment

To solve these problems completely, I think raftLogManager needs a complete refactoring, including but not limited to better concurrency control, a better persistence strategy, etc.

When I first joined the community in my senior year, the first thing I did was implement raftLogManager with reference to the storage and unstable interfaces in etcd. Because I didn't have a very deep understanding of Raft and etcd at that time, I eventually realized that this was probably an awful implementation, and I apologize for that. There are two main reasons:

  • For Raft, uncommitted logs also need to be persisted (i.e., logs on disk may also need to be truncated) in order to ensure correctness after a reboot.
  • etcd has an event-driven architecture, so its internals are entirely lock-free. However, in our architecture, raftLogManager can be accessed concurrently by multiple threads, so we have added a lot of patches for concurrency control; this is an area we could implement better.

Thank you very much for your great contributions. This week I will focus on the PR you raised against the cluster- branch and a bug at hand. After that, I would like to discuss the refactoring of raftLogManager with you and the community. And of course, you're welcome to start designing the new raftLogManager with us right now if you are free.

@cigarl (Contributor, Author) commented Sep 13, 2021

> To solve these problems completely, I think raftLogManager needs a complete refactoring, including but not limited to better concurrency control, a better persistence strategy, etc. […]

IMO, we need the following two approaches in parallel:

  • First, we should fix this problem within the current code structure and submit it to the master and rel/0.12.x branches. When a user hits a problem like mine, it is hard to solve immediately, which means some data could be lost and the cluster may enter an unstable state. That is a disaster for a production environment, so we need to provide users with a patch that avoids it.
  • Second, we can refactor this part on the cluster- branch. Once the first problem is solved, we can spend more time refactoring this part properly, and we don't have to push it to users immediately; it can go into the next release as a better implementation.

I'd be happy to do it together : )

@OneSizeFitsQuorum (Contributor)

> I'd be happy to do it together : )

OK, then here are my reviews~ At the moment, it seems that getAllEntriesAfterCommittedIndex() will always return an empty list in the current design? What's your opinion?

@cigarl (Contributor, Author) commented Sep 13, 2021

> OK, then here are my reviews~ At the moment, it seems that getAllEntriesAfterCommittedIndex() will always return an empty list in the current design? What's your opinion?

See #3930.
In my environment, it is not empty; this is caused by the meta information not being flushed synchronously.

Comment on lines +240 to +250
/**
 * When raft log files are flushed, the meta is not flushed synchronously. So data that has already
 * been flushed to disk may still count as uncommitted according to the persistent LogManagerMeta
 * (the meta's info is stale). We need to recover these already-persistent logs.
 *
 * <p>For example, commitIndex is 5 in the persistent LogManagerMeta, but the log file has actually
 * been flushed up to 7; when we restart the cluster, we need to recover 6 and 7.
 *
 * <p>Maybe we could extract getAllEntriesAfterAppliedIndex and getAllEntriesAfterCommittedIndex
 * into a single getAllEntriesByIndex, but right now too many test cases use them.
 */
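For illustration, a minimal sketch of the recovery this comment describes (names are hypothetical, not the actual method in this PR): scan the flushed log for entries beyond the stale persisted commitIndex and hand them back to the log manager on restart.

```java
import java.util.ArrayList;
import java.util.List;

class StaleMetaRecoverySketch {
  // The meta's commitIndex may lag the log file (e.g. meta says 5, disk holds
  // up to 7), so collect every flushed entry beyond it for recovery.
  static List<Long> entriesAfterCommittedIndex(
      long staleCommitIndex, long[] flushedLogIndexes) {
    List<Long> toRecover = new ArrayList<>();
    for (long index : flushedLogIndexes) {
      if (index > staleCommitIndex) {
        toRecover.add(index);   // indexes 6 and 7 in the example above
      }
    }
    return toRecover;
  }
}
```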
Member:

From this comment, it seems the log's meta file is not flushed synchronously with the raft log; however, the persisted committed raft log may already have been applied.
We should not commit the entry again on restart.

Contributor Author:

The problem here is that we don't know the actual applyIndex.
Fortunately, in IoTDB, repeating an apply operation doesn't matter too much, but losing data would be unacceptable.
That is only my view.

Member:

As we have so many kinds of plans, I am not sure all of them are idempotent.

Contributor:

According to my understanding, re-applying the same sequence of plans (the order is guaranteed by the log index) won't break the correctness of IoTDB. And it looks like we don't have a better choice.
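To make the trade-off concrete, a hypothetical sketch (illustrative names only): since the true applyIndex is unknown after a restart, replay committed entries from the possibly stale persisted applyIndex in strict log-index order, accepting that some may be applied twice rather than risking lost data.

```java
import java.util.Comparator;
import java.util.List;

class ReplaySketch {
  static final class LogEntry {
    final long index;
    final Runnable plan;
    LogEntry(long index, Runnable plan) { this.index = index; this.plan = plan; }
  }

  // Re-apply in log-index order from a possibly stale applyIndex. Entries
  // between the stale index and the true one are applied a second time, which
  // the discussion above deems acceptable; skipping them could lose data.
  static void replay(long persistedApplyIndex, List<LogEntry> committedLogs) {
    committedLogs.stream()
        .filter(e -> e.index > persistedApplyIndex)
        .sorted(Comparator.comparingLong(e -> e.index))
        .forEach(e -> e.plan.run());
  }
}
```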

@chengjianyun (Contributor) commented Sep 13, 2021

> To solve these problems completely, I think raftLogManager needs a complete refactoring, including but not limited to better concurrency control, a better persistence strategy, etc. […]

Thanks for your implementation, which let us enjoy the first cluster version of IoTDB :). The Raft algorithm is really hard to implement; there are too many corner cases to consider. Making the algorithm work correctly in most cases within a short time is terrific.

In my opinion, we would have to spend too much effort guaranteeing the correctness of Raft if we keep working on the current cluster implementation. To my mind, the cluster module lacks a whole-system design.

My other suggestion is that we gradually separate the Raft implementation from the Raft client (the application), just as etcd did. The goal is that, eventually, we can bring in Ratis to help manage the Raft state. Maybe it's time to put Ratis on the agenda. I strongly suggest someone spend some time investigating Ratis and evaluating the cost of integration. Instead of worrying about Raft correctness, we could then focus on improving the core engine of IoTDB and its ecosystem.

@neuyilan (Member) commented Sep 13, 2021

> My other suggestion is that we gradually separate the Raft implementation from the Raft client (the application), just as etcd did. […]

Actually, I have the same idea. I think we can integrate Apache Ratis into our project, so that at least we can guarantee the correctness of Raft and spend our energy on IoTDB's own features. At present, I think correctness matters far more than performance.
But it is a really time-consuming task; we need to work together to finish it.

@chengjianyun (Contributor)

> Actually, I have the same idea. I think we can integrate Apache Ratis into our project […] But it is a really time-consuming task; we need to work together to finish it.

Very glad to work together. As I will be working on the Prometheus integration later this month, I only have a little bandwidth for this, but I'd like to propose a design for discussion sooner or later, after some investigation.

@OneSizeFitsQuorum (Contributor)

> My other suggestion is that we gradually separate the Raft implementation from the Raft client (the application), just as etcd did. […]

In fact, over the last few months I have studied the Raft algorithm carefully and completed the 6.824 labs. That is just an instructional Raft implementation, yet it made me deeply appreciate how difficult it is to implement a correct consensus algorithm, and a production-level Raft requires much more. So I think maybe it's time for us to weigh the options and make a decision. Choosing a mature implementation of the Raft algorithm would free up our productivity; it may take a long time at first, but could be hugely beneficial once done.

But I also want to list two concerns:

  • Production-level Raft implementations generally guarantee linearizability, the strictest consistency model in distributed systems. Our current Raft implementation may not be complete, so strictly speaking it does not guarantee linearizability, but at the same time its performance may be better. What would we say if we migrated the Raft implementation and performance dropped? Furthermore, is linearizability really necessary for OLAP databases like time-series databases?

  • At the moment the Raft implementation and the business logic are mixed together, which buys a certain amount of performance. For example, we currently put data from multiple storage groups into one Raft group to be executed synchronously, even though at the bottom they could be executed in parallel. So we implemented a parallel asynchronous apply optimization, so that plans from the same Raft group but different storage groups can be applied in parallel (see the sketch after this list). The performance gain from this optimization is significant. How should we handle this case after migrating the Raft implementation? Add a parallel apply feature to the Raft library? Change our partitioning pattern so that a Raft group serves only one storage group? Or just let them execute inefficiently?
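A self-contained sketch of the parallel-apply idea from the second concern (all names hypothetical): plans from one Raft group are dispatched to per-storage-group single-threaded executors, preserving order within a storage group while different groups apply in parallel.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelApplySketch {
  // One single-threaded executor per storage group: entries for the same
  // group stay in log order, entries for different groups run concurrently.
  private final Map<String, ExecutorService> applyExecutors = new ConcurrentHashMap<>();

  void apply(String storageGroup, Runnable plan) {
    applyExecutors
        .computeIfAbsent(storageGroup, sg -> Executors.newSingleThreadExecutor())
        .execute(plan);
  }
}
```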

Any discussions are welcomed~

@neuyilan (Member)

> is linearizability really necessary for OLAP databases like time-series databases?

As I see it, Raft is a consensus algorithm, which may not have much to do with linearizability per se; I'll just put the post [1] here for reference.

> we currently put data from multiple storage groups into one Raft group to be executed synchronously.

As far as I know, the apply function is user-defined, so we could still implement a parallel apply function keyed by storage group within one Raft log.

What is certain is that using a Raft library will limit our optimization work compared with the current implementation (which mixes the Raft framework and business logic), but I think availability and correctness matter far more than performance for now.
[1] https://zhuanlan.zhihu.com/p/47117804

@chengjianyun (Contributor)

> As I see it, Raft is a consensus algorithm, which may not have much to do with linearizability per se […] I think availability and correctness matter far more than performance for now.

Let's move the Ratis-related discussion to #3954 so that more people can join.

@cigarl closed this Sep 16, 2021
@cigarl reopened this Sep 16, 2021
@cigarl (Contributor, Author) commented Sep 16, 2021

> but we may also need a processor to keep logIndex from growing too large: dummyIndex would need to be re-initialized if logIndex is going to become bigger than we expect.

This does not seem to be a problem in practice. I calculated that, even handling 100 million requests per second, the cluster could run for about 2,924 years before logIndex overflows (see the check below).
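A quick sanity check of that arithmetic, assuming logIndex is a signed 64-bit value:

```java
public class IndexLifetime {
  public static void main(String[] args) {
    // Long.MAX_VALUE indexes at 1e8 appends per second, in 365-day years.
    double years = Long.MAX_VALUE / 1e8 / (365.0 * 24 * 3600);
    System.out.printf("~%.1f years%n", years);  // prints ~2924.7 years
  }
}
```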

Sorry, I closed this PR by mistake; I have reopened it.

@neuyilan (Member)

Any problems with this PR? If not, I think we can merge it.

@neuyilan merged commit 40397fd into apache:master Oct 11, 2021
@cigarl deleted the di_fix branch October 11, 2021 09:00
neuyilan pushed a commit to neuyilan/iotdb that referenced this pull request Oct 11, 2021
mychaow pushed a commit that referenced this pull request Oct 11, 2021
Co-authored-by: lisijia <44458757+cigarl@users.noreply.github.com>
cornmonster pushed a commit to cornmonster/iotdb that referenced this pull request Oct 25, 2021
Labels: Module - Cluster (PRs for the cluster module)
6 participants