Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-9192. Update Ratis to 3.0.0. #5205

Merged
merged 5 commits into from
Nov 22, 2023
Merged

Conversation

szetszwo
Copy link
Contributor

What changes were proposed in this pull request?

This is to update Ratis to the latest 3.0.0 version, although it is currently unreleased. We should start testing it with the 3.0.0 snapshots.

What is the link to the Apache JIRA

HDDS-9192

How was this patch tested?

Using existing tests.

@guohao-rosicky
Copy link
Contributor

Thanks @szetszwo work on this, there are some unit test failures.
By looking at the CI it looks like the problem is with the Failed to FORMAT directories

@szetszwo
Copy link
Contributor Author

@guohao-rosicky , you are right. The failures are due to RATIS-1677. I believe that RATIS-1871 has reduced the number of failures. Let me check the failures tests.

@szetszwo
Copy link
Contributor Author

@guohao-rosicky , it seems the Failed to FORMAT problem is fixed. However, there are still quite a few failures. See if you have time to help fixing them. Thank you in advance.

@guohao-rosicky
Copy link
Contributor

hi, @szetszwo , I'll take a look and try to fix the part.

@errose28
Copy link
Contributor

This is to update Ratis to the latest 3.0.0 version, although it is currently unreleased

If we are just testing, would this be better done on a fork or branch until Ratis 3.0 is released? We are planning the Ozone 1.4.0 release and cannot release that with a snapshot Ratis version. It is probably safer to wait until Ratis 3.0 is officially released rather than bring master into an unreleasable state blocked on the Ratis release. cc @symious

@errose28 errose28 marked this pull request as draft August 24, 2023 21:35
@errose28
Copy link
Contributor

Converting to draft. This looks like it will burn significant CI time and is better debugged on a fork until it is green.

@symious
Copy link
Contributor

symious commented Aug 25, 2023

@errose28 Thanks for the information. I will try to test this snapshot version on Ozone first.

@szetszwo
Copy link
Contributor Author

If we are just testing, would this be better done on a fork or branch until Ratis 3.0 is released? ...

@errose28 , sure, running the test in szetszwo#5

... It is probably safer to wait until Ratis 3.0 is officially released ...

If we are not in a hurry, let's wait for Ratis 3.0.0 before releasing Ozone 1.4.0. We should start releasing Ratis 3.0.0 once it can pass the Ozone test.

@szetszwo
Copy link
Contributor Author

... I will try to test this snapshot version on Ozone first.

@symious , thanks for helping out. It can pass almost all the tests; see szetszwo#5.

A few tests are failing with NullPointerException in GrpcLogAppender.resetClient. Will fix them in https://issues.apache.org/jira/browse/RATIS-1876 .

@guohao-rosicky
Copy link
Contributor

Hi @szetszwo, This pr(https://issues.apache.org/jira/browse/RATIS-1876) has been merged to ratis master, after updating this pr ozone, whether it will pass ci, we also need to test, whether we can HDDS-9192, create a separate branch, we are in the branch, to deal with upgrading ratis-3.0.0 need to deal with things.

@guohao-rosicky
Copy link
Contributor

guohao-rosicky commented Aug 28, 2023

Hi @szetszwo, If you find out what parts of upgrading raits-3.0.0 need to be changed, let me know to share some of the work.

@szetszwo
Copy link
Contributor Author

@guohao-rosicky , it seems all the tests are passing except for TestOMRatisSnapshots; see https://issues.apache.org/jira/browse/HDDS-8876

Let me sync the branch with master and then run it again.

@guohao-rosicky
Copy link
Contributor

hi @szetszwo, I observe that you have modified and made compatible ratis 3.0.0, should we go ahead and open this pr.
your branch: https://github.com/szetszwo/ozone/commits/HDDS-9192c

@szetszwo
Copy link
Contributor Author

@guohao-rosicky , let me update my branch and then open this pr.

@szetszwo
Copy link
Contributor Author

@guohao-rosicky , updated the branch. On a second thought, we should wait for Ratis 3.0.0 release before reopening this.

@ivandika3
Copy link
Contributor

ivandika3 commented Sep 28, 2023

Hi @szetszwo, I saw that Ratis 3.0.0 will include RATIS-1569 (https://issues.apache.org/jira/browse/RATIS-1569) patch which might introduce incompatibility between client and server with different versions.

May I know whether this is only for client and server that uses DataStream API (write pipeline v2) or will Async API (write pipeline v1) also affected? In other words, can Async API client with lower version (e.g. 2.4.1) communicate with Async API server in Ratis 3.0.0?

@szetszwo
Copy link
Contributor Author

@ivandika3 , only DataStream API (write pipeline v2) is affected. The Async API will not be affected. A 2.4.1 client can use Async API to communicate to a server running 3.0.0

Thanks for checking it!

@ivandika3
Copy link
Contributor

@szetszwo Thanks for the clarification.

@ChenSammi ChenSammi marked this pull request as ready for review October 8, 2023 03:39
@adoroszlai
Copy link
Contributor

@ChenSammi please check CI in fork before marking ready for review. No need to run the same tests for the PR if they are failing.

@adoroszlai adoroszlai marked this pull request as draft October 8, 2023 07:38
@adoroszlai adoroszlai added the dependencies Pull requests that update a dependency file label Oct 8, 2023
@ChenSammi
Copy link
Contributor

Hi @szetszwo , thanks for working on this. Looks like it is still in draft state. Do we know how much gap is there to make it ready?

@szetszwo
Copy link
Contributor Author

szetszwo commented Oct 9, 2023

@ChenSammi , @adoroszlai , the tests were passing some time back. We were just waiting for the official Rats 3.0.0 release.

It seems that some tests are failing now. Let me take a look.

@symious
Copy link
Contributor

symious commented Oct 11, 2023

@szetszwo We'd like to include this PR in the Ozone 1.4.0 release, hope to hear the good news from Ratis side soon.

cc: @ChenSammi @guohao-rosicky

@szetszwo
Copy link
Contributor Author

@symious , sure, we should include this for Ozone 1.4.0. Let me fix the test failures and then start rolling a Ratis 3.0.0 release.

@szetszwo
Copy link
Contributor Author

This is the JIRA filed for the test failures: https://issues.apache.org/jira/browse/RATIS-1902

@symious
Copy link
Contributor

symious commented Oct 13, 2023

@szetszwo Thanks for the update.

Have tested locally with the changes in RATIS-1902.

org.apache.hadoop.hdds.scm.ha.TestSCMHAManagerImpl
org.apache.hadoop.hdds.upgrade.TestScmHAFinalization
org.apache.hadoop.ozone.om.TestAddRemoveOzoneManager
org.apache.hadoop.ozone.om.TestOzoneManagerHAKeyDeletion

The above 4 cases are passed, but the following one still got some issues. Could you help to confirm?

org.apache.hadoop.ozone.om.TestOMRatisSnapshots

@guohao-rosicky
Copy link
Contributor

https://github.com/szetszwo/ozone/commits/HDDS-9192c

Thanks @symious for the test, is this the part of the code used for the test?

@symious
Copy link
Contributor

symious commented Oct 16, 2023

is this the part of the code used for the test?

@guohao-rosicky , I tested with HDDS-9192c, seems the results are the same.

@szetszwo
Copy link
Contributor Author

@symious , thanks for checking. Let me sync the branch and check TestOMRatisSnapshots.

@szetszwo
Copy link
Contributor Author

TestOMRatisSnapshots was a timing sensitive test. When we changed a (unlrelated) Ratis timeout in HDDS-9232, it failed. HDDS-9416 fixed it by resetting the timeout.

I guess updating to Ratis to 3.0.0 also changes the timing behavior so that TestOMRatisSnapshots is failing again.

@kerneltime
Copy link
Contributor

TestOMRatisSnapshots was a timing sensitive test. When we changed a (unlrelated) Ratis timeout in HDDS-9232, it failed. HDDS-9416 fixed it by resetting the timeout.

I guess updating to Ratis to 3.0.0 also changes the timing behavior so that TestOMRatisSnapshots is failing again.
@hemantk-12 @prashantpogde @smengcl can we take a look at which test is timing sensitive?

@szetszwo
Copy link
Contributor Author

TestOMRatisSnapshots is failing in current master branch without any changes in my machine. So it probably is not related to updating to Ratis 3.0.0.

@adoroszlai
Copy link
Contributor

TestOMRatisSnapshots is failing in current master branch without any changes in my machine.

@szetszwo is it HDDS-8876? If not, can you please file a bug?

@szetszwo
Copy link
Contributor Author

@adoroszlai , yes, It looks like the same as HDDS-8876 -- it also timed out in my machine.

@szetszwo szetszwo marked this pull request as ready for review November 20, 2023 20:15
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @szetszwo for the patch, LGTM.

@adoroszlai
Copy link
Contributor

@guohao-rosicky would you like to take a look?

@szetszwo
Copy link
Contributor Author

@adoroszlai , @smengcl , thanks a lot for reviewing this!

I will wait for another day before merging this. See if anyone like to take a look.

Copy link
Contributor

@guohao-rosicky guohao-rosicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@szetszwo szetszwo merged commit 904a916 into apache:master Nov 22, 2023
33 checks passed
@szetszwo
Copy link
Contributor Author

@guohao-rosicky , thanks for reviewing this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
9 participants