HDDS-6582. ContainerRecoveryStore for ec containers under recovery. #3361
Conversation
Hi @adoroszlai @umamaheswararao please help review this, thanks~
Thank you for your great work here @guihecheng. I will take a look.
@guihecheng This API seems to be locally creating a container and writing to it, but transferring remotely still needs to be exposed through a different protocol. On the other hand, @sodonnel has a point about having a recovery state and creating the container in the same directory. Can we discuss this option as well? It looks like SCM needs to handle the recovery state a little differently, but that does not seem difficult.
There are two things we discussed. So, the next question for discussion here: while generating locally, do we use the existing protocols with a different container state, or the new interfaces proposed in this patch?
I think we can cache the reconstructed data and allow retries within a sliding window when streaming data to the target DNs. Overall, I agree with implementing the simple approach first and doing optimizations later.
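A minimal sketch of the sliding-window idea mentioned above, assuming a hypothetical bounded cache of reconstructed chunks that are kept until the target DN acknowledges them; class and method names are illustrative, not part of this PR:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sliding-window cache for reconstructed EC chunks.
 * Chunks stay cached until acknowledged by the target DN, so a transient
 * stream failure can be retried without re-decoding the data.
 */
public class ReconstructedChunkWindow {
  /** Holder for a reconstructed chunk and its offset within the block. */
  public static final class CachedChunk {
    final long offset;
    final byte[] data;
    CachedChunk(long offset, byte[] data) {
      this.offset = offset;
      this.data = data;
    }
  }

  private final int maxChunks;                       // window size (unacked chunks)
  private final Deque<CachedChunk> window = new ArrayDeque<>();

  public ReconstructedChunkWindow(int maxChunks) {
    this.maxChunks = maxChunks;
  }

  /** Returns false if the window is full; the caller must wait for acks first. */
  public synchronized boolean offer(long offset, byte[] data) {
    if (window.size() >= maxChunks) {
      return false;
    }
    window.addLast(new CachedChunk(offset, data));
    return true;
  }

  /** Drop every cached chunk up to and including the acknowledged offset. */
  public synchronized void ack(long ackedOffset) {
    while (!window.isEmpty() && window.peekFirst().offset <= ackedOffset) {
      window.removeFirst();
    }
  }

  /** Chunks to replay after a transient failure, oldest first. */
  public synchronized Iterable<CachedChunk> unacked() {
    return new ArrayDeque<>(window);
  }
}
```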
Yes, Uma, I think we need an open discussion around these ideas. I mean that the simple approach doesn't seem that simple, actually, and if it is not the final ideal solution, why bother cooking it?
Thanks @guihecheng for pointing out the issue. There is one more problem: for Ratis replicas, each DN in the pipeline should persist the same container, but for EC, the CoordinatorDN should not persist recovered containers that do not belong to it.
Still, there are some arguments that generating locally may give some advantage in the form of fewer failures while transferring.
We had some offline discussion on these approaches and tried to capture most of the points from the discussion here and offline. I will share that shortly.
Implementation Level Design Choices for Recovered Containers Storage

Discussion with @guihecheng, @sodonnel, @kaijchen and @umamaheswararao. In today's offline (online Zoom :-)) discussion, we discussed the following. Based on the discussion so far, we have a few options for storing the recovered containers at the DN (a sketch of the "Recovering" state idea follows this list).

1. Creating the recovered containers locally in a tempStore service.
Advantages:
Disadvantages:
2. Creating containers remotely and using a different state, "Recovering".
Advantages:
Disadvantages:
3. Using the "TempStore" service, but still transferring the replicas remotely.
Advantages:
Disadvantages:
4. Creating containers locally, but using the "Recovering" state and creating containers with the name replicaIndex.
Advantages:
Disadvantages:
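As a rough illustration of options 2 and 4, a container state with a RECOVERING value could keep such replicas out of normal reporting. This is a sketch only; the actual Ozone container state enum and reporting path are not shown:

```java
/**
 * Illustrative container lifecycle states; RECOVERING marks an EC replica
 * that is still being reconstructed by a coordinator. Not the real Ozone
 * container state enum.
 */
public enum SketchContainerState {
  OPEN, CLOSING, CLOSED, QUASI_CLOSED, DELETED, RECOVERING;

  /** Recovering replicas are skipped in ICRs and full container reports. */
  public boolean reportableToScm() {
    return this != RECOVERING;
  }
}
```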
Additional Discussions: another argument we have is that on a machine with a good NW configuration we don't need to worry about the probability of NW failures; but in today's average cluster deployments we still see NW issues.

Further details on the Recovering state on DNs. Potentially this could work as follows. The coordinator issues a "createContainer" call with a flag to indicate that it is a recovering container. The DN stores this container in the usual place, but keeps track of it in an ECRecoveryMonitor. We skip sending an ICR for it or including it in any container reporting. If the DN is restarted while a container is recovering, we should just remove any recovering containers, as the coordinator will have failed anyway.

The coordinator then writes to the container as usual and the DN stores the chunks as usual. When all chunks are recovered, the coordinator issues a new "completeRecovery" call to the datanode. This will trigger an ICR etc.

If the coordinator fails for some reason, we need to clean up. The ECRecoveryMonitor on the datanode can scan the set of recovering containers looking for progress. E.g., if the last write was longer ago than some time threshold, assume the coordinator has failed and remove the container. If the coordinator comes back and tries to write, it will get a "container does not exist" error. Another edge case is that the recovery coordinator fails and is rescheduled, picking the same host as a target. It tries to create the recovering container and finds it already exists. In that case, we can just remove the recovering container and create a new empty one instead. A further enhancement may allow us to read what is in the partly recovered container and restart recovery from there, but that is probably too complex for day 1.

If we build it as described above, there are no changes needed on SCM. We just need to handle a few areas where ICRs are sent, and create the new ECRecoveryMonitor in the Datanodes.
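A minimal sketch of the stale-recovery cleanup described above, assuming a hypothetical ECRecoveryMonitor that only tracks container IDs and last-write timestamps; the real datanode container and handler types are not shown and the names are not this PR's API:

```java
import java.time.Clock;
import java.time.Duration;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongConsumer;

/**
 * Illustrative monitor for containers in the "Recovering" state.
 * Records the last write time per recovering container and removes
 * containers whose coordinator appears to have died (no writes within
 * a configurable threshold).
 */
public class ECRecoveryMonitorSketch {
  private final Map<Long, Long> lastWriteMillis = new ConcurrentHashMap<>();
  private final Duration staleThreshold;
  private final Clock clock;

  public ECRecoveryMonitorSketch(Duration staleThreshold, Clock clock) {
    this.staleThreshold = staleThreshold;
    this.clock = clock;
  }

  /** Called when a recovering container is created on this DN. */
  public void trackRecoveringContainer(long containerId) {
    lastWriteMillis.put(containerId, clock.millis());
  }

  /** Called on every chunk write into a recovering container. */
  public void recordWrite(long containerId) {
    lastWriteMillis.computeIfPresent(containerId, (id, old) -> clock.millis());
  }

  /** Called when the coordinator issues "completeRecovery"; stops tracking. */
  public void completeRecovery(long containerId) {
    lastWriteMillis.remove(containerId);
  }

  /** Whether ICRs / container reports should skip this container. */
  public boolean isRecovering(long containerId) {
    return lastWriteMillis.containsKey(containerId);
  }

  /**
   * Periodic scan: drop recovering containers with no recent writes, on the
   * assumption that their coordinator has failed. Deleting the on-disk data
   * is left to the caller via the removeContainer callback.
   */
  public void scanForStaleRecoveries(LongConsumer removeContainer) {
    long now = clock.millis();
    Iterator<Map.Entry<Long, Long>> it = lastWriteMillis.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<Long, Long> e = it.next();
      if (now - e.getValue() > staleThreshold.toMillis()) {
        it.remove();
        removeContainer.accept(e.getKey());
      }
    }
  }
}
```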
Update: Today we had a small group discussion with the folks who were involved in the review of the above PR task.
@umamaheswararao Thanks for driving this to a conclusion. After the agreement on Option 2, I think this PR can be closed, and new ones should be opened once we have a simple outlined design for Option 2; we'll help drive it.
What changes were proposed in this pull request?
ContainerRecoveryStore for EC containers under recovery.
A design doc: https://docs.google.com/document/d/1CW73NSIWmrzobVyMvtGQtj6-mLyLlpNennQE8yYMM4I/edit?usp=sharing
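A rough sketch of what such a per-replica recovery store might expose, with hypothetical method names; the actual interface is defined in this PR and the design doc:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

/**
 * Illustrative shape of a temporary store for EC containers under recovery.
 * Recovered chunks are staged here by the coordinator and either consumed
 * (streamed to target DNs) or cleaned up if recovery is abandoned.
 */
public interface RecoveryStoreSketch {

  /** Stage a recovered chunk for the given container replica. */
  void writeChunk(long containerId, int replicaIndex,
      long chunkOffset, ByteBuffer data) throws IOException;

  /** Read back a staged chunk, e.g. to stream it to a target DN. */
  ByteBuffer readChunk(long containerId, int replicaIndex,
      long chunkOffset, int length) throws IOException;

  /** Drop everything staged for a replica, on success or on failure. */
  void cleanup(long containerId, int replicaIndex) throws IOException;
}
```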
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-6582
How was this patch tested?
New UT.