
HDDS-6582. ContainerRecoveryStore for ec containers under recovery. #3361

Closed
wants to merge 1 commit

Conversation

guihecheng
Contributor

@guihecheng guihecheng commented Apr 28, 2022

What changes were proposed in this pull request?

ContainerRecoveryStore for ec containers under recovery.

A design doc: https://docs.google.com/document/d/1CW73NSIWmrzobVyMvtGQtj6-mLyLlpNennQE8yYMM4I/edit?usp=sharing

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-6582

How was this patch tested?

New UT.

@guihecheng
Contributor Author

Hi @adoroszlai @umamaheswararao please help review this, thanks~

@umamaheswararao

Thank you for your great work here @guihecheng. I will take a look. Thank you.

@umamaheswararao

@guihecheng This API seems to be creating the container locally and writing to it, but transferring it remotely still needs to be exposed through a different protocol.

On the other hand, @sodonnel has a point about having a recovery state and creating the container in the same directory. Can we discuss this option as well?

It looks like SCM needs to handle the recovery state a little differently, but that does not seem difficult.
If SCM handles this different state smartly, we don't need to worry about a temp store; the container can live in the actual directory itself. If we just want to move ahead with the temp store, I feel it's too early to discard the other idea.
I am also thinking a bit about handling failure cases: is it easy to change the state of a container? For example, we create recovery-state containers and the coordinator crashes. Then we are left with orphaned recovery-state containers. When can they be cleared? cc: @sodonnel

@umamaheswararao

There are two things we discussed:
Generating containers locally first and transferring them to the remaining targets. This gives some advantage on the failure-handling side; Arpit and I had a short discussion on this.
If we stream the container content to avoid local storage consumption, the failure probability is higher, as we spend more time in network IO (prone to failures). Whereas if we generate the containers locally, we can transfer each whole container as a single step, and even if a transfer fails, we can retry. So, it would be a good idea to start with generating containers locally and transferring them once all recovery is done.

So, the next question for discussion here is: while generating locally, do we use the existing protocols with a different container state, or the new interfaces proposed in this patch?

@kaijchen

kaijchen commented May 4, 2022

If we stream the container content to avoid local storage consumption, the failure probability is more as we may have longer time spending in NW IO ( prone to failures). Where if we generate them locally, we can transfer the whole container as single step and even if failed to transfer, we could retry. So, it would be good idea to start with generating container locally and transfer them once all recovery done.

I think we can cache the reconstructed data and allow retries within a sliding window when streaming data to the TargetDNs.
However, if any TargetDN fails and needs to be replaced, we have to start over.

Overall, I agree with implementing the simple approach first and doing optimizations later.
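The sliding-window retry idea above could look roughly like the following minimal Java sketch. All names here (`SlidingChunkWindow`, `offer`, `ack`, `replay`) are hypothetical, not Ozone APIs: the coordinator caches each reconstructed chunk until the target acknowledges it, so a transient send failure can be retried from the last acknowledged chunk instead of recomputing everything.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch (not Ozone API): a bounded cache of reconstructed
// chunks awaiting acknowledgment from a TargetDN.
class SlidingChunkWindow {
  private final int capacity;
  private final Deque<byte[]> chunks = new ArrayDeque<>();
  private long headIndex = 0;  // global chunk index of the window head

  SlidingChunkWindow(int capacity) {
    this.capacity = capacity;
  }

  /** Cache a chunk before streaming it; false means the window is full
   *  and the caller must wait for acks (backpressure). */
  boolean offer(byte[] chunk) {
    if (chunks.size() >= capacity) {
      return false;
    }
    chunks.addLast(chunk);
    return true;
  }

  /** Target acknowledged all chunks below ackIndex: drop them. */
  void ack(long ackIndex) {
    while (headIndex < ackIndex && !chunks.isEmpty()) {
      chunks.removeFirst();
      headIndex++;
    }
  }

  /** On a transient failure, resend all unacknowledged chunks. */
  List<byte[]> replay() {
    return new ArrayList<>(chunks);
  }

  long headIndex() {
    return headIndex;
  }
}
```

As noted above, this only survives transient failures of the same target; if a TargetDN is replaced, nothing before the window is recoverable and reconstruction must start over.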

@guihecheng

There two things we discussed: Generating containers locally first and transfer them to the remaining targets. This will give some advantage on failure handling side. Me and Arpit had a short discussion on this. If we stream the container content to avoid local storage consumption, the failure probability is more as we may have longer time spending in NW IO ( prone to failures). Where if we generate them locally, we can transfer the whole container as single step and even if failed to transfer, we could retry. So, it would be good idea to start with generating container locally and transfer them once all recovery done.

So, the next question for discussion here, while generating locally, whether to use existing protocols and have different container state or use this patch proposed new interfaces.

Yes, Uma, I think we need an open discussion around these ideas.
For the "generate containers locally then transfer" idea, there is an issue to consider:
For Ratis replication, on a healthy DN, we have an existing container to export as a tarball; other DNs then come to read it and import it locally.
For EC offline recovery, on a CoordinatorDN, we'll have several container replicas on disk with the same containerID, so we can't keep them all in the normal place as with Ratis replication. That means we can't easily reuse the exportContainer API; we would still need to put these container replicas in a temp location and do a special export for them (perhaps via some refactoring and reuse around the existing exportContainer API).

I mean that the simple approach doesn't actually seem that simple, and if it is not the final ideal solution, why bother building it?
For now I tend to support Stephen's idea: we could introduce a new "RECOVERING" state for containers. It may not be easy and will need some changes around existing logic, but once done it should be very useful and clean, as I imagine it. Let's discuss these ideas further.

@guihecheng guihecheng marked this pull request as draft May 5, 2022 02:35
@kaijchen

kaijchen commented May 5, 2022

For Ratis Replication, on a Healthy DN, we have a existing container to export as a tarball, then other DNs come to read them and then import locally.
For EC Offline Recovery, on a CoordinatorDN, we'll have several container replicas on disk with the same containerID, so we can't have them all in a normal place as Ratis Replication, then we can't reuse the exportContainer API easily

Thanks @guihecheng for pointing out the issue. There is one more problem:

For Ratis replicas, each DN in the pipeline should persist the same container.

For EC, the CoordinatorDN should not persist recovered containers that do not belong to it,
unless we allow a degraded state in which some DNs may hold more than one container in a container group.

@umamaheswararao

I mean that the simple approach doesn't seems that simple actually, and if it is not the final ideal solution, why bother cook it?

Still, there is an argument that generating locally may give some advantage in the form of fewer failures while transferring.

For EC, CoordinatorDN should not persist the recovered containers that does not belong to it.
It's a different state we are discussing. The new state is "Recovering"; with this state the container should not be treated as a regular container.

We had some offline discussion on these approaches and tried to capture most of the points, both from here and from the offline discussion. I will share that shortly.

@umamaheswararao

umamaheswararao commented May 6, 2022

Implementation Level Design Choices for Recovered Containers Storage

Discussion with @guihecheng , @sodonnel , @kaijchen and @umamaheswararao

In today's offline (well, online Zoom :-) ) discussion, we covered the following.

With the current discussions, we have a few options for storing the recovered containers at the DN.

1. Creating the recovered containers locally in a tempStore service:

Advantages:

  1. This option gives some advantage as we don't need to stream the data to remote nodes. All containers are first recovered locally and then transferred, so there is less failure probability.
  2. When transferring, if a target node fails, then with additional buffer targets we can transfer the same replica to another target.

Disadvantages:

  1. It would be slow.
  2. More disk IO, as we need to write locally and read back.
  3. Replica downloaders need to understand the tempStore container structure when downloading.

2. Creating containers remotely and using the new "Recovering" state

Advantages:

  1. We can reuse more code in the write path at the DN.

Disadvantages:

  1. Failure handling is tricky. We need additional daemon services or other mechanisms to clean up containers that previously failed and are still in the "Recovering" state.
  2. Possibility of more failures due to network transfer during recovery. We may need to discard the work done on failures.

  Refer to the section below [Further details on the Recovering state on DNs] for additional information.

3. Using the "TempStore" service, but still transferring the replicas remotely

Advantages:

  1. We can have our own metadata structure, e.g. in memory.

Disadvantages:

  1. Some new interfaces and new implementations, so we need more test coverage.
  2. Same as in #2.
  3. We may not reuse the existing write paths.

4. Creating containers locally, but using the "Recovering" state and naming containers by replicaIndex

Advantages:

  1. We may be able to reuse the code and reach a stable state quickly.
  2. Less probability of intermittent network failures.
  3. The ReplicaDownloader may work with fewer modifications.

Disadvantages:

  1. We need modifications to allow containers to have different names (e.g. container-1-replica1).
  2. We still need a special scrubber service to clean up failed containers. This is relatively easy, as only the Coordinator creates those containers, so if it restarts it can either catch up or we can discard them.
  3. Same as 1.1: it would be slow.
  4. Same as 1.2: more disk IO, as we need to write locally and read back.
Additional discussions:

Another argument is that on machines with a good network configuration we don't need to worry about the network failure probability; but in today's average cluster deployments we still see network issues.
We should also be able to transfer the replicas as-is, similar to Ratis replication, as we may need this for decommission. But I think if we schedule a ReplicateContainer command instead of a ReconstructECCommand, that should work.

Further details on the Recovering State on DNs.

Potentially this could work as follows. The coordinator issues a “createContainer” call with a flag to indicate that it is a recovering container.

The DN stores this container in the usual place, but keeps track of it in an ECRecoveryMonitor. We skip sending an ICR or including it in any container reporting.

If the DN is restarted when a container is recovering, we should just remove any recovering containers, as the coordinator will have failed anyway.

The coordinator then writes to the container as usual and the DN stores the chunks as usual.

When all chunks are recovered, the coordinator issues a new “completeRecovery” call to the datanode. This will trigger an ICR etc.

If the coordinator fails for some reason, we need to clean up. The ECRecoveryMonitor on the datanode can scan the set of recovering containers looking for progress. E.g., if the last write was longer ago than some time threshold, assume the coordinator has failed and remove the container. If the coordinator comes back and tries to write, it will get a "container does not exist" error.

Another edge case is that the recovery coordinator fails and is rescheduled, picking the same host as the target. It tries to create the recovering container and finds it already exists. In that case, we can just remove the recovering container and create a new empty one instead.

A further enhancement may allow us to read what is in the partly recovered container and restart recovery from there, but that is probably too complex for day 1.

If we build as described above, there are no changes needed on SCM. We just need to handle a few areas where ICRs are sent, and create the new ECRecoveryMonitor in the Datanodes.
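The ECRecoveryMonitor lifecycle described above (create with a recovering flag, refresh on each chunk write, untrack on completeRecovery, reap on staleness) can be sketched as a small Java class. This is an illustration of the idea only; the class, method names, and the explicit `Instant` parameters (used so the staleness logic is testable) are all hypothetical, not the actual Ozone implementation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not Ozone code) of the ECRecoveryMonitor idea:
// track recovering containers by their last-write time and reap those
// whose coordinator appears to have failed.
class ECRecoveryMonitor {
  private final Map<Long, Instant> lastWrite = new ConcurrentHashMap<>();
  private final Duration staleAfter;

  ECRecoveryMonitor(Duration staleAfter) {
    this.staleAfter = staleAfter;
  }

  // createContainer(recovering=true): start tracking; no ICR is sent.
  void onCreateRecoveringContainer(long containerId, Instant now) {
    lastWrite.put(containerId, now);
  }

  // Each chunk write refreshes the progress timestamp.
  void onChunkWrite(long containerId, Instant now) {
    lastWrite.put(containerId, now);
  }

  // completeRecovery: the container becomes a normal one (ICR fires),
  // so stop tracking it here.
  void onCompleteRecovery(long containerId) {
    lastWrite.remove(containerId);
  }

  // Periodic scan: containers with no progress for longer than the
  // threshold are assumed abandoned; return their ids so the caller
  // can delete them on disk.
  List<Long> reapStale(Instant now) {
    List<Long> reaped = new ArrayList<>();
    Instant cutoff = now.minus(staleAfter);
    for (Iterator<Map.Entry<Long, Instant>> it =
             lastWrite.entrySet().iterator(); it.hasNext();) {
      Map.Entry<Long, Instant> e = it.next();
      if (e.getValue().isBefore(cutoff)) {
        reaped.add(e.getKey());
        it.remove();
      }
    }
    return reaped;
  }

  boolean isTracked(long containerId) {
    return lastWrite.containsKey(containerId);
  }
}
```

A stale coordinator that writes after its container is reaped would then naturally get the "container does not exist" error mentioned above.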

@umamaheswararao

umamaheswararao commented May 10, 2022

Update:

Today we had a small-group discussion with the folks involved in the review of this PR.
Attendees: @guihecheng, @sodonnel, @kaijchen, @adoroszlai, @umamaheswararao
After discussing the above options, we all agreed to go with Option 2, mainly because it could be quicker, as most of the needed APIs are available in ContainerProtocolCalls.
The disadvantages were also discussed:
When two coordinators somehow end up doing the same work, how do we prevent one from interfering with the other? There could be an attempt ID assigned when SCM schedules the recovery, so that target nodes can check the attempt ID on writes, respect the right coordinator, and reject others. The remaining solutions were already discussed in the PR comments.
For network failures, we could still do local container writes and transfers if the current approach really becomes an issue in realistic clusters. With the current approach and an improved retry mechanism, we could survive network glitches or simple restarts anyway.
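The attempt-ID fencing discussed above could work roughly like this Java sketch (the class and method names are illustrative assumptions, not the agreed API): each target DN remembers the highest attempt id it has seen per container and rejects writes carrying an older one, so a stale coordinator cannot corrupt a rescheduled recovery.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not Ozone code) of attempt-ID fencing on a
// target DN: only the highest attempt id seen for a container may write.
class RecoveryAttemptFence {
  private final Map<Long, Long> highestAttempt = new ConcurrentHashMap<>();

  /** Admit a write only if it carries the highest attempt id seen so
   *  far for this container; a newer attempt fences out older ones. */
  boolean admitWrite(long containerId, long attemptId) {
    long current = highestAttempt.merge(containerId, attemptId, Math::max);
    return current == attemptId;
  }
}
```

This assumes SCM stamps each scheduled reconstruction with a monotonically increasing attempt id, which is exactly the open question the comment raises.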

@guihecheng

@umamaheswararao Thanks for driving this to a conclusion. After the agreement on Option 2, I think this PR can be closed, and new ones should be opened once we have a simple outlined design for Option 2; we'll help drive it.
I'll keep this open for a few more days so that more developers can see the decision here, and maybe close it next week.

@guihecheng guihecheng closed this May 16, 2022