Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-4440. HDDS-4440-s3-performance branch Merge to Master PR - draft #3297

Closed
wants to merge 34 commits into from

Conversation

neils-dev
Copy link
Contributor

What changes were proposed in this pull request?

Draft PR for feature branch merge to master.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-4440

How was this patch tested?

CI workflow.

neils-dev and others added 29 commits July 29, 2021 15:22
…rt (apache#2485)

* Commit for create specific OmTransportFactory initialized with META-INF/services api.  Modifications made for green build targeted to s3 grpc incrementental PRs.  Integration tests are now independent of s3g implementation (through pom file changes).  Smoketests s3g specific are temp not run, delayed until further implementation submitted to support functionality.
)

* Initial commit for s3g with om grpc server to process OzoneManagerProtocol s3 requests.
…he#2580)

* Initial commit with s3 gRPC client and associated unit tests.
* Added configurable OZONE_OM_GRPC_MAXIMUM_RESPONSE_LENGTH to config keys for s3g om OmRequest OmResponses over gRPC.
* Added suggested changes to unit tests from review.  Including updating test timeout to appropriate test timeout value from 5min to 30s and using custom values (to reflect server response) in OMResponse to validate gRPC transport of Ozone Manager Protocol.
* Added configuration setting in ozone-default for ozone.om.grpc.maximum.response.length.
…tion transport (apache#2655)

* Modifications made as per reviewers comments.  Clean up code, remove unnecessary code (not used or required).  Changed OzoneManagerProtocol OmRequest from using multiple optional single variable entries for s3 auth to using an optional S3Authentication structure containing s3 auth info.  Added default enabling of s3 Grpc on om through configuration setting in compose ozone cluster and re-enabled s3 smoke tests for that cluster.
Co-authored-by: Neil Joshi <neil2021.gres@gmail.com>
…ster branch with HDDS-5881 changes INTO HDDS-4400-s3-performance feature branch. Merge allows s3 gateway gRPC to work with s3 gateway persistent connection changes made in HDDS-5881 found in master branch.
…ch green builds. Merging with Master branch impacts HDDS-4440 PRs in progress. This merge also requires partial changes that were initially in PR for HDDS-5545 to be included. Namely, the patch to allow switching between OmTransports, GrpcOmTransport and Hadoop3OmTransport through ozone configuration in the OmConfigKeys.
…nsport for s3 gateway - TestOzoneConfigurationFields (added config key not in xml).
…ansport support (apache#3074)

* HDDS-6149. Remove unused keytabs (apache#2960)

* HDDS-6094. Some unit tests are skipped due to using JUnit4 runner (apache#2909)

* HDDS-6075. OzoneConfiguration constructor overrides input configuration keys. (apache#2921)

* HDDS-4177. SCM Container DB bootstrap on Recon startup (apache#2942)

* HDDS-6086. Compute MD5MD5CRC file checksum using chunk checksums from DataNodes (apache#2919)

* HDDS-6148. Validate ContainerBalancerConfiguration before start ContainerBalancer (apache#2957)

* HDDS-6161. SCM StateMachine failing to reinitialize doesn't terminate the process. (apache#2971)

* HDDS-6134. Move replication-specific config to ReplicationServer (apache#2943)

* HDDS-4010. S3G startup fails when multiple service ids are configured. (apache#2976)

* HDDS-6170. Add metrics to replication manager to track container health states (apache#2975)

* HDDS-3231. Cleanup KeyManagerImpl (apache#2961)

* HDDS-5927. Improve defaults in ContainerBalancerConfiguration (apache#2892)

* HDDS-6157. More consistent synchronization in InputStreams (apache#2965)

* HDDS-4743. [FSO] Add FSO variant of ITestOzoneContractDistcp. (apache#2980)

* HDDS-6114. Intermittent error due to Failed to init RocksDB (apache#2947)

* HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient. (apache#2981)

* HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient.

* HDDS-5740. Enable ratis by default for SCM. (apache#2637)

* HDDS-6183. Intermittent failure in TestKeyDeletingService.checkIfDeleteServiceWithFailingSCM. (apache#2991)

* HDDS-4190. Intermittent failure in TestOMVolumeSetOwnerRequest and TestOMVolumeSetQuotaRequest. (apache#2982)

* HDDS-6120. Compute block checksum using chunk checksums (apache#2930)

* HDDS-6147. Add ability in OM to get limited delta updates (apache#2956)

* HDDS-6195. Remove unused jmh-core dependency. (apache#2997)

* HDDS-6167. Update ozone-runner version to 20211202-1 (apache#2969)

* HDDS-6171. Create an API to change Bucket Owner (apache#2988)

* HDDS-6163. Fix PATH in docker image (apache#2967)

* HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (apache#2998)

* HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (apache#2972)

* HDDS-6109. Preserve the underlying exception raised in client lib. (apache#2989)

* HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (apache#2983)

* HDDS-6203. CleanUp incomplete gz files during Container move (apache#3000)

* HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (apache#3011)

* HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (apache#3015)

* HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (apache#2987)

* HDDS-6177. Extend container info command to include replica details  (apache#2995)

* HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (apache#3007)

* HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (apache#3014)

* HDDS-6192. feature/Observability.md translated to Chinese (apache#2994)

* HDDS-6205. Add CLI command to display the latest Replication Manager report (apache#3013)

* HDDS-6227. Test helpers should observe naming conditions (apache#3020)

* HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (apache#3029)

* HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (apache#3005)

* HDDS-5529. For any IOexception from @REPLicated method we should throw it (apache#2788)

* HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (apache#2992)

* HDDS-6206. Application errors must not flood system log (apache#3001)

* HDDS-6245. Add BucketLayout logging to Audit Logs (apache#3040)

* HDDS-6238 Reduce memory requirements for list keys. (apache#3032)

* HDDS-2919. Intermittent failure in TestRDBStore (apache#3028)

* HDDS-6253. Unnecessary duplicate smoketest after defaulting to FSO (apache#3036)

* HDDS-6204. Cleanup handling malformed authorization header (apache#2999)

* HDDS-6169. Selective checks: skip junit tests on ozone-runner image update (apache#2974)

* HDDS-6270. Use a dedicated file instead of /etc/passwd for xcompat acceptance test (apache#3050)

* HDDS-6273. Amend doc SecuringTDE.md (apache#3047)

* HDDS-6140. Selective checks: skip unit check for integration-test changes (apache#2948)

* HDDS-6215. Recon get limited delta updates from OM (apache#3009)

* HDDS-6125. Recon get limited delta updates from OM

* HDDS-6215. Fix unit test

* trigger new CI check

* HDDS-6215. Fix typo

* trigger new CI check

Co-authored-by: Symious <yiyang0203@gmail.com>

* HDDS-6226. Run tests for selective CI checks in CI (apache#3019)

* HDDS-6247. Avoid logging stack trace for user input problems (apache#3039)

* HDDS-6208. New checkstyle: WhitespaceAround (apache#3003)

* HDDS-6289. Upgrade acceptance test log flooded with parse error (apache#3063)

Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>

* Trigger Build

* Fix integration test for added configuation field for selecting OmTransport for s3 gateway - TestOzoneConfigurationFields (added config key not in xml).

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
Co-authored-by: Lokesh Jain <ljain@apache.org>
Co-authored-by: Aswin Shakil Balasubramanian <aswinshakilbalu@gmail.com>
Co-authored-by: Wei-Chiu Chuang <weichiu@apache.org>
Co-authored-by: Symious <yiyang0203@foxmail.com>
Co-authored-by: Bharat Viswanadham <bharat@apache.org>
Co-authored-by: Stephen O'Donnell <stephen.odonnell@gmail.com>
Co-authored-by: GeorgeJahad <github@blackbirdsystems.net>
Co-authored-by: Siddhant Sangwan <siddhantsangwan027@gmail.com>
Co-authored-by: Jyotinder Singh <jyotindrsingh@gmail.com>
Co-authored-by: Shashikant Banerjee <shashikant@apache.org>
Co-authored-by: Tsz-Wo Nicholas Sze <szetszwo@apache.org>
Co-authored-by: Kiyoshi Mizumaru <kiyoshi.mizumaru@gmail.com>
Co-authored-by: Ritesh H Shukla <kerneltime@gmail.com>
Co-authored-by: Nibiru <axcsd3692@qq.com>
Co-authored-by: Kaijie Chen <chen@kaijie.org>
Co-authored-by: Zita Dombi <50611074+dombizita@users.noreply.github.com>
Co-authored-by: Istvan Fajth <pifta@cloudera.com>
Co-authored-by: steinsgateted <71066027+steinsgateted@users.noreply.github.com>
Co-authored-by: Gui Hecheng <markgui@tencent.com>
Co-authored-by: Jackson Yao <jacksonyao@tencent.com>
Co-authored-by: Keyi Song <72794035+sky76093016@users.noreply.github.com>
Co-authored-by: UENISHI Kota <kuenishi@users.noreply.github.com>
Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
Co-authored-by: Symious <yiyang0203@gmail.com>
… current master. Resolved conflicts affecting feature versioning PR#3155, propagating s3 exception stack trace HDDS-6257 PR#3066 and performFailover PR#3160.
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @neils-dev and everyone who worked on this feature.

I have a few questions about the code change.

…ed constant for max len messages. Changes made to integration test for configuration test removing redundent statement. Small change to s3g Grpc unit tests for client and server to not only print out exception stack on error but to also assert on error ensuring test fails on error.'
…rpcOzoneManagerServer from definition in ConfigGroup for GrpcOmTransport.

<property>
<name>ozone.om.grpc.maximum.response.length</name>
<value>134217728</value>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we validated that list keys will never violate this max length (if I understand this correctly)?
How is this value selected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kerneltime for reviewing this and for your comments.
This config defines the maximum message size limit that we send / receive over the gRPC channel.
The value was selected based on the max size defined for Hadoop RPC messages :ipc.maximum.response.length with default size 128 * 1024 * 1024.

If a list keys message exceeds this maximum limit for the channel, then it will not be processed and the RPC will fail with RESOURCE_EXHAUSTED. The exception should be raised from the channel and propagated to the user client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this enforced at the client side or server side. As in will the server hit this when returning more than the max number of bytes?

Copy link
Contributor Author

@neils-dev neils-dev Apr 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is enforced by the both the client and server as we define the limit, max message size. In the case of bucket or key listing, the server will feed the payload within the OMResponse and the length of the message is enforced by the client in this case. The client raises a RESOURCE_EXHAUSTED - message length exceed io.grpc exception. This occurs when the message passed from the server to the client exceeds the incoming max message length.

Thanks @kerneltime, in looking into handling this exception I found that it needs to be isolated by the failover proxy provider retryPolicy to fail (not retry) on calling the shouldFailover method. The failover provider and retryPolicy shouldFailover is currently refactored and will be included in HDDS-6433 that will be commited to the Master after the feature branch merge.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @neils-dev for updating the patch. Seems like there is a conflict due to HDDS-6346.

…rvice. Resync with master to resolve merge conflicts.
@neils-dev
Copy link
Contributor Author

Thanks @neils-dev for updating the patch. Seems like there is a conflict due to HDDS-6346.

Yes, I added that to the master so that the patch was not lost. It was for the master branch.
I'm updating with a new commit to simplify the unit test for start/stopping the grpc service and resync with Master.
Should be resolved with latest commit. - Thanks!

… OmClientProtocol for S3Authentication messages changing required fields to optional within message.
@adoroszlai
Copy link
Contributor

Thanks @neils-dev for addressing my comments and updating the branch. I'm fine with the change, let's see if @kerneltime (or anyone else) has further comments.

@kerneltime
Copy link
Contributor

Why is this marked as a draft? Also, please rebase.

@neils-dev
Copy link
Contributor Author

Why is this marked as a draft? Also, please rebase.

Latest commit to re-sync with master branch resolving conflicts.
I think we started the PR 'marked' as draft to provide a PR for discussion on the feature prior to merging to master.

Thanks @adoroszlai and @kerneltime for your comments and help!

@adoroszlai
Copy link
Contributor

Why is this marked as a draft?

Because we want to avoid a merge via this PR, as that would squash commits. The PR is only for discussion about code changes. Feature branch should be merged via command-line. CI checks are run in Neil's fork.

@adoroszlai
Copy link
Contributor

HDDS-4440 feature branch was merged, this can be closed.

@adoroszlai adoroszlai closed this May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants