Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-5927. Improve defaults in ContainerBalancerConfiguration #2892

Merged
merged 10 commits into from Jan 12, 2022

Conversation

siddhantsangwan
Copy link
Contributor

What changes were proposed in this pull request?

Making the defaults smarter and more readable, such as expressing threshold in percentage, making maxSizeEnteringTarget equal the size of five containers etc.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-5927

How was this patch tested?

TestContainerBalancer unit tests.

@siddhantsangwan
Copy link
Contributor Author

Hey @lokeshj1703 @JacksonYao287
We can discuss the balancer defaults here. I've made some initial changes:

  1. Changed idleIterations to iterations to make it clear that this configuration means the number of iterations that balancer will run for.
  2. Changed maxSizeEnteringTarget and maxSizeLeavingSource to allow the size of five containers, by default.

Proposed changes:

  1. Change the definitions of configurations from ratio to percentage to make them more intuitive. For example, threshold can be "A percentage in the range of 0 to 100..."
  2. Make the default for maxSizeToMovePerIteration something like maxDatanodesRatioToInvolvePerIteration * number of healthy, in service DNs * maxSizeEnteringTarget.

Copy link
Contributor

@errose28 errose28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really a review, I just skimmed the PR and wanted to make sure the proto.lock files aren't modified.

@@ -1334,7 +1334,7 @@
},
{
"id": 3,
"name": "idleiterations",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't EVER change proto lock files! These are automatically updated by protolock only on each major or minor release to ensure compatibility. This is an easy mistake to make when proto-backwards-compatability-check fails though, and it would definitely be helpful if our CI could help the devs out here : ) @adoroszlai is there an easy check we can put in CI to fail PRs that that modify anything called proto.lock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. This seems to have sneaked in while I was refactoring this field's name across the project. CI got skipped since this is still a draft PR.

BTW, from what I understand, it's compatible to change the name of a proto field as long as the ID and type remain the same, and JSON is not involved. The hadoop policy that's linked here also says it's compatible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the rename in the proto itself is safe, lets just make sure the proto lock stays in its original state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing only the proto file fails the backwards compatibility check here:

[INFO] CONFLICT: "StartContainerBalancerRequestProto" field: "idleiterations" has been removed, but is not reserved [ScmAdminProtocol.proto]
[INFO] CONFLICT: "StartContainerBalancerRequestProto" field: "iterations" ID: 3 has an updated name, previously "idleiterations" [ScmAdminProtocol.proto]

With the proto lock file in its original state.

# Conflicts:
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerConfiguration.java
…tainerBalancer constructor.

This change is required for ContainerBalancer#toString() to be able to call ContainerBalancerConfiguration#toString().
@siddhantsangwan siddhantsangwan marked this pull request as ready for review December 23, 2021 06:06
@siddhantsangwan
Copy link
Contributor Author

siddhantsangwan commented Dec 23, 2021

A summary of the changes:

  1. Deprecate the proto fields idleiterations and maxDatanodesRatioToInvolvePerIteration and refactor them to iterations and maxDatanodesPercentageToInvolvePerIteration.
  2. Represent threshold as a percentage instead of a ratio. Since ratios are used in calculations, add a method getThresholdAsRatio in ContainerBalancerConfiguration.
  3. Change the default values of configurations along with their javadocs and related methods.
  4. This PR also pulls in changes from the related PR HDDS-6070. ContainerBalancerConfig doesn't read config from ozone-site.xml #2893 , so let's get that PR merged first.

@siddhantsangwan
Copy link
Contributor Author

@lokeshj1703 @JacksonYao287 @errose28 please take a look.

Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @siddhantsangwan for this work , the changes overall looks good ! i will take a deep review after #2893 is merged

Comment on lines +482 to +483
optional int32 idleiterations = 3 [deprecated = true];
optional double maxDatanodesRatioToInvolvePerIteration = 4 [deprecated =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just replace them with the new. Compatibility may be not considered for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backwards compatibility goal fails during build if I just replace them.

" of the entire cluster) no more than the threshold value.")
private String threshold = "0.1";
" of the entire cluster) no more than the threshold.")
private String threshold = "10";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also make threshold int not String?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's what I wanted to do. But we'll need to deprecate threshold in the proto message and add a new field with the int32 type. Should I go ahead with that?

# Conflicts:
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancerConfiguration.java
#	hadoop-hdds/server-scm/src/test/java/org/apache/hadoop/hdds/scm/container/balancer/TestContainerBalancer.java
Copy link
Contributor

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan Thanks for working on this! The changes look good to me. Just a minor comment inline.

if(request.hasIdleiterations()) {
idleiterations = Optional.of(request.getIdleiterations());

if (request.hasIterations()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also handle idleIterations here?

# Conflicts:
#	hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java
@siddhantsangwan
Copy link
Contributor Author

@lokeshj1703 Thanks for the review! I've updated the PR.

Copy link
Contributor

@JacksonYao287 JacksonYao287 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @siddhantsangwan for this work! LGTM +1

@siddhantsangwan
Copy link
Contributor Author

Thanks for the review @JacksonYao287 !

Copy link
Contributor

@lokeshj1703 lokeshj1703 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@siddhantsangwan Thanks for updating the PR! The changes look good to me. +1.

@lokeshj1703 lokeshj1703 merged commit ac99b47 into apache:master Jan 12, 2022
@lokeshj1703
Copy link
Contributor

@siddhantsangwan Thanks for the contribution! @errose28 @JacksonYao287 Thanks for the reviews! I have merged the PR to master branch.

arp7 pushed a commit that referenced this pull request Feb 17, 2022
…ansport support (#3074)

* HDDS-6149. Remove unused keytabs (#2960)

* HDDS-6094. Some unit tests are skipped due to using JUnit4 runner (#2909)

* HDDS-6075. OzoneConfiguration constructor overrides input configuration keys. (#2921)

* HDDS-4177. SCM Container DB bootstrap on Recon startup (#2942)

* HDDS-6086. Compute MD5MD5CRC file checksum using chunk checksums from DataNodes (#2919)

* HDDS-6148. Validate ContainerBalancerConfiguration before start ContainerBalancer (#2957)

* HDDS-6161. SCM StateMachine failing to reinitialize doesn't terminate the process. (#2971)

* HDDS-6134. Move replication-specific config to ReplicationServer (#2943)

* HDDS-4010. S3G startup fails when multiple service ids are configured. (#2976)

* HDDS-6170. Add metrics to replication manager to track container health states (#2975)

* HDDS-3231. Cleanup KeyManagerImpl (#2961)

* HDDS-5927. Improve defaults in ContainerBalancerConfiguration (#2892)

* HDDS-6157. More consistent synchronization in InputStreams (#2965)

* HDDS-4743. [FSO] Add FSO variant of ITestOzoneContractDistcp. (#2980)

* HDDS-6114. Intermittent error due to Failed to init RocksDB (#2947)

* HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient. (#2981)

* HDDS-6175. Use s3Auth during proxy during decrypt in RpcClient.

* HDDS-5740. Enable ratis by default for SCM. (#2637)

* HDDS-6183. Intermittent failure in TestKeyDeletingService.checkIfDeleteServiceWithFailingSCM. (#2991)

* HDDS-4190. Intermittent failure in TestOMVolumeSetOwnerRequest and TestOMVolumeSetQuotaRequest. (#2982)

* HDDS-6120. Compute block checksum using chunk checksums (#2930)

* HDDS-6147. Add ability in OM to get limited delta updates (#2956)

* HDDS-6195. Remove unused jmh-core dependency. (#2997)

* HDDS-6167. Update ozone-runner version to 20211202-1 (#2969)

* HDDS-6171. Create an API to change Bucket Owner (#2988)

* HDDS-6163. Fix PATH in docker image (#2967)

* HDDS-6202. Avoid using jmh-generator-annprocess since it is GPL2.0. (#2998)

* HDDS-6135. SCM Container DB bootstrap on Recon startup for SCM HA. (#2972)

* HDDS-6109. Preserve the underlying exception raised in client lib. (#2989)

* HDDS-3408. Rename ChunkLayOutVersion to ContainerLayoutVersion. (#2983)

* HDDS-6203. CleanUp incomplete gz files during Container move (#3000)

* HDDS-6216. Move OMOpenKeysDeleteRequest to package om.request.key (#3011)

* HDDS-6191. Intermittent failure in TestDeleteWithSlowFollower (#3015)

* HDDS-6128. CLI tool that downloads all the block replicas and creates a manifest file (#2987)

* HDDS-6177. Extend container info command to include replica details  (#2995)

* HDDS-6211. [Docs] Image styling on deployed site does not replicate local builds. (#3007)

* HDDS-6219. Switch to RATIS ReplicationType from STAND_ALONE in our tests. (#3014)

* HDDS-6192. feature/Observability.md translated to Chinese (#2994)

* HDDS-6205. Add CLI command to display the latest Replication Manager report (#3013)

* HDDS-6227. Test helpers should observe naming conditions (#3020)

* HDDS-6239. ozonesecure-mr failing with No URLs in mirrorlist (#3029)

* HDDS-6201. Fix NPE for DataScanner with scanned container deleted by others. (#3005)

* HDDS-5529. For any IOexception from @REPLicated method we should throw it (#2788)

* HDDS-6181. Change SCMHAInvocationHandler#invokeRatis() logging to TRACE (#2992)

* HDDS-6206. Application errors must not flood system log (#3001)

* HDDS-6245. Add BucketLayout logging to Audit Logs (#3040)

* HDDS-6238 Reduce memory requirements for list keys. (#3032)

* HDDS-2919. Intermittent failure in TestRDBStore (#3028)

* HDDS-6253. Unnecessary duplicate smoketest after defaulting to FSO (#3036)

* HDDS-6204. Cleanup handling malformed authorization header (#2999)

* HDDS-6169. Selective checks: skip junit tests on ozone-runner image update (#2974)

* HDDS-6270. Use a dedicated file instead of /etc/passwd for xcompat acceptance test (#3050)

* HDDS-6273. Amend doc SecuringTDE.md (#3047)

* HDDS-6140. Selective checks: skip unit check for integration-test changes (#2948)

* HDDS-6215. Recon get limited delta updates from OM (#3009)

* HDDS-6125. Recon get limited delta updates from OM

* HDDS-6215. Fix unit test

* trigger new CI check

* HDDS-6215. Fix typo

* trigger new CI check

Co-authored-by: Symious <yiyang0203@gmail.com>

* HDDS-6226. Run tests for selective CI checks in CI (#3019)

* HDDS-6247. Avoid logging stack trace for user input problems (#3039)

* HDDS-6208. New checkstyle: WhitespaceAround (#3003)

* HDDS-6289. Upgrade acceptance test log flooded with parse error (#3063)

Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>

* Trigger Build

* Fix integration test for added configuation field for selecting OmTransport for s3 gateway - TestOzoneConfigurationFields (added config key not in xml).

Co-authored-by: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
Co-authored-by: Lokesh Jain <ljain@apache.org>
Co-authored-by: Aswin Shakil Balasubramanian <aswinshakilbalu@gmail.com>
Co-authored-by: Wei-Chiu Chuang <weichiu@apache.org>
Co-authored-by: Symious <yiyang0203@foxmail.com>
Co-authored-by: Bharat Viswanadham <bharat@apache.org>
Co-authored-by: Stephen O'Donnell <stephen.odonnell@gmail.com>
Co-authored-by: GeorgeJahad <github@blackbirdsystems.net>
Co-authored-by: Siddhant Sangwan <siddhantsangwan027@gmail.com>
Co-authored-by: Jyotinder Singh <jyotindrsingh@gmail.com>
Co-authored-by: Shashikant Banerjee <shashikant@apache.org>
Co-authored-by: Tsz-Wo Nicholas Sze <szetszwo@apache.org>
Co-authored-by: Kiyoshi Mizumaru <kiyoshi.mizumaru@gmail.com>
Co-authored-by: Ritesh H Shukla <kerneltime@gmail.com>
Co-authored-by: Nibiru <axcsd3692@qq.com>
Co-authored-by: Kaijie Chen <chen@kaijie.org>
Co-authored-by: Zita Dombi <50611074+dombizita@users.noreply.github.com>
Co-authored-by: Istvan Fajth <pifta@cloudera.com>
Co-authored-by: steinsgateted <71066027+steinsgateted@users.noreply.github.com>
Co-authored-by: Gui Hecheng <markgui@tencent.com>
Co-authored-by: Jackson Yao <jacksonyao@tencent.com>
Co-authored-by: Keyi Song <72794035+sky76093016@users.noreply.github.com>
Co-authored-by: UENISHI Kota <kuenishi@users.noreply.github.com>
Co-authored-by: Siyao Meng <50227127+smengcl@users.noreply.github.com>
Co-authored-by: Symious <yiyang0203@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants