HDDS-1574 Average out pipeline allocation on datanodes and add metrics/test #291

Merged
xiaoyuyao merged 4 commits into apache:HDDS-1564 from the HDDS-1574 branch on Dec 18, 2019

timmylicheng

What changes were proposed in this pull request?

  1. Fix pipeline allocation in non-topology environments and prevent pipelines from sharing the same set of datanodes.
  2. Add metrics and logs that record, for future reference, when the pipeline policy chooses the same set of datanodes for different pipelines.
  3. Add tests for basic reports, and mock what pipeline allocation would do given different numbers of nodes.
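
As an illustration of item 2, here is a minimal sketch of how an overlap between pipelines could be detected and counted. It is not the code in this patch: `PipelineOverlapTracker`, `record`, and `getOverlapCount` are hypothetical names, datanodes are reduced to plain string IDs, and the counter is a stand-in for a real metrics object.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: counts pipelines placed on an already-used datanode set. */
final class PipelineOverlapTracker {
  // Number of pipelines recorded so far for each distinct set of datanodes.
  private final Map<Set<String>, Integer> pipelinesPerNodeSet = new HashMap<>();
  // Stand-in for a metrics counter (assumption, not the actual SCM metric).
  private final AtomicLong overlapCount = new AtomicLong();

  /** Records a pipeline; returns true if its datanode set was already in use. */
  synchronized boolean record(List<String> datanodeIds) {
    // A set makes the comparison independent of node order.
    Set<String> key = new HashSet<>(datanodeIds);
    int previous = pipelinesPerNodeSet.merge(key, 1, Integer::sum) - 1;
    if (previous > 0) {
      overlapCount.incrementAndGet();
      System.out.println("WARN: datanode set " + key
          + " is already used by " + previous + " pipeline(s)");
      return true;
    }
    return false;
  }

  long getOverlapCount() {
    return overlapCount.get();
  }

  public static void main(String[] args) {
    PipelineOverlapTracker tracker = new PipelineOverlapTracker();
    tracker.record(Arrays.asList("dn1", "dn2", "dn3"));
    tracker.record(Arrays.asList("dn3", "dn1", "dn2")); // same set, new order
    tracker.record(Arrays.asList("dn1", "dn2", "dn4")); // different set
    System.out.println("overlaps = " + tracker.getOverlapCount()); // overlaps = 1
  }
}
```

Keying on a set rather than a list means two pipelines with the same members in a different order still count as an overlap, which is the case the metric is meant to surface.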

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-1574

How was this patch tested?

Unit tests (UT).

@timmylicheng force-pushed the HDDS-1574 branch 2 times, most recently from e5cb6e6 to d144941, on December 5, 2019 07:03
@timmylicheng force-pushed the HDDS-1574 branch 2 times, most recently from 40dc52d to b746e50, on December 6, 2019 02:29
@timmylicheng force-pushed the HDDS-1574 branch 4 times, most recently from 628cb14 to 9ff7b85, on December 10, 2019 13:56

@xiaoyuyao left a comment

LGTM overall, a few minor issues commented inline.

@timmylicheng

The acceptance test is fixed in https://issues.apache.org/jira/browse/HDDS-2650

@timmylicheng

testCloseContainerEventWithRatis does not show the same error locally as it does on CI.

@xiaoyuyao

Thanks @timmylicheng for the update. The latest change LGTM.

Regarding the failures in testCloseContainerEventWithRatis: I spent some time debugging it.
Now that we allow the background pipeline thread to create more than one pipeline per datanode, OZONE_SCM_PIPELINE_NUMBER_LIMIT should always be set properly in the test. Otherwise, pipeline creation runs indefinitely and times out the event queue processing for tests with a tight timeout (e.g., 1s in this case). Also, when the limit is reached, we throw exceptions, which produces a lot of false-alarm error logs and increments the pipeline failure counter. We can fix that in follow-up JIRAs.
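
To make the effect of the limit concrete, here is a rough sketch of a background creation loop with and without a limit. It is a simplified model, not the SCM code: `BoundedPipelineCreator` and `createUntilLimit` are hypothetical, a limit of 0 is assumed to mean "unrestricted", and `maxAttempts` exists only so the demo terminates.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a background pipeline creator bounded by a limit. */
final class BoundedPipelineCreator {
  // Assumption: 0 means "no limit", so creation never stops on its own.
  private final int pipelineLimit;
  private final List<String> pipelines = new ArrayList<>();

  BoundedPipelineCreator(int pipelineLimit) {
    this.pipelineLimit = pipelineLimit;
  }

  /**
   * Tries to create pipelines until the limit is hit or maxAttempts runs out.
   * maxAttempts is only a safety cap for this demo.
   *
   * @return the number of pipelines created by this call
   */
  int createUntilLimit(int maxAttempts) {
    int created = 0;
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      if (pipelineLimit > 0 && pipelines.size() >= pipelineLimit) {
        break; // limit reached: stop cleanly instead of looping on
      }
      pipelines.add("pipeline-" + pipelines.size());
      created++;
    }
    return created;
  }

  public static void main(String[] args) {
    // With a limit of 16 the loop stops after 16 pipelines.
    System.out.println(new BoundedPipelineCreator(16).createUntilLimit(1000));
    // With a limit of 0 only the demo's safety cap stops it.
    System.out.println(new BoundedPipelineCreator(0).createUntilLimit(1000));
  }
}
```

With the limit unset, the only thing that stops the loop is the artificial cap, which mirrors why a test with a 1s event-queue timeout can be starved by ongoing pipeline creation.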

Here is a proposed fix for testCloseContainerEventWithRatis, which you can include in this PR.

TestCloseContainerEventHandler.java

@@ -24,6 +24,7 @@
 import org.apache.hadoop.hdds.HddsConfigKeys;
 import org.apache.hadoop.hdds.protocol.DatanodeDetails;
 import org.apache.hadoop.hdds.protocol.proto.HddsProtos;
+import org.apache.hadoop.hdds.scm.ScmConfigKeys;
 import org.apache.hadoop.hdds.scm.TestUtils;
 import org.apache.hadoop.hdds.scm.pipeline.MockRatisPipelineProvider;
 import org.apache.hadoop.hdds.scm.pipeline.PipelineProvider;
@@ -67,22 +68,25 @@ public static void setUp() throws Exception {
         .getTestDir(TestCloseContainerEventHandler.class.getSimpleName());
     configuration
         .set(HddsConfigKeys.OZONE_METADATA_DIRS, testDir.getAbsolutePath());
+    configuration.setInt(
+        ScmConfigKeys.OZONE_SCM_PIPELINE_NUMBER_LIMIT, 16);
+
     nodeManager = new MockNodeManager(true, 10);
     eventQueue = new EventQueue();
     pipelineManager =
         new SCMPipelineManager(configuration, nodeManager, eventQueue);
     PipelineProvider mockRatisProvider =
         new MockRatisPipelineProvider(nodeManager,
-            pipelineManager.getStateManager(), configuration);
+            pipelineManager.getStateManager(), configuration, eventQueue);
     pipelineManager.setPipelineProvider(HddsProtos.ReplicationType.RATIS,
         mockRatisProvider);
     containerManager = new
         SCMContainerManager(configuration, nodeManager,
         pipelineManager, new EventQueue());
-    pipelineManager.triggerPipelineCreation();
     eventQueue.addHandler(CLOSE_CONTAINER,
         new CloseContainerEventHandler(pipelineManager, containerManager));
     eventQueue.addHandler(DATANODE_COMMAND, nodeManager);
+    pipelineManager.triggerPipelineCreation();
     // Move all pipelines created by background from ALLOCATED to OPEN state
     Thread.sleep(2000);
     TestUtils.openAllRatisPipelines(pipelineManager);
@@ -93,6 +97,9 @@ public static void tearDown() throws Exception {
     if (containerManager != null) {
       containerManager.close();
     }
+    if (pipelineManager != null) {
+      pipelineManager.close();
+    }
     FileUtil.fullyDelete(testDir);
   }

@timmylicheng

> +    if (pipelineManager != null) {
> +      pipelineManager.close();
> +    }

Thanks for the efforts!
Looks like this test failure was not caught in the previous PR, when I enabled PipelinePlacementPolicy. I'm now able to run the test to completion locally.

@xiaoyuyao

Thanks @timmylicheng for the update. The latest change LGTM, +1. Not sure why the acceptance test is failing here. Will take another look tomorrow before merge.

Also, can you open two follow up JIRAs for the other issues?

@timmylicheng

> Thanks @timmylicheng for the update. The latest change LGTM, +1. Not sure why the acceptance test is failing here. Will take another look tomorrow before merge.
>
> Also, can you open two follow up JIRAs for the other issues?

Are you referring to:

  1. Use either OZONE_SCM_PIPELINE_NUMBER_LIMIT or OZONE_DATANODE_MAX_PIPELINE_ENGAGEMENT to limit pipeline creation for all tests.

  2. Use a mechanism other than throwing an exception to report that pipeline creation failed because the max limit was exceeded?
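
For item 2, one possible shape is to return an empty result when the limit is reached and to count those rejections separately from real failures. This is only a sketch of the idea, not what the follow-up JIRA implemented: `LimitAwareAllocator`, `tryCreate`, and `getLimitRejections` are hypothetical names.

```java
import java.util.Optional;

/** Hypothetical allocator that reports "limit reached" without throwing. */
final class LimitAwareAllocator {
  private final int limit;
  private int openPipelines;
  // Kept apart from any failure counter so hitting the limit is not an error.
  private int limitRejections;

  LimitAwareAllocator(int limit) {
    this.limit = limit;
  }

  /** Returns the new pipeline's name, or empty if the limit is reached. */
  synchronized Optional<String> tryCreate() {
    if (openPipelines >= limit) {
      limitRejections++;
      return Optional.empty(); // expected condition, so no exception and no error log
    }
    openPipelines++;
    return Optional.of("pipeline-" + openPipelines);
  }

  synchronized int getLimitRejections() {
    return limitRejections;
  }

  public static void main(String[] args) {
    LimitAwareAllocator allocator = new LimitAwareAllocator(2);
    System.out.println(allocator.tryCreate().isPresent()); // true
    System.out.println(allocator.tryCreate().isPresent()); // true
    System.out.println(allocator.tryCreate().isPresent()); // false
    System.out.println(allocator.getLimitRejections());    // 1
  }
}
```

The caller can then log at debug level or simply skip the attempt, so reaching the limit no longer looks like a pipeline creation failure in logs or metrics.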

@timmylicheng

> Thanks @timmylicheng for the update. The latest change LGTM, +1. Not sure why the acceptance test is failing here. Will take another look tomorrow before merge.
>
> Also, can you open two follow up JIRAs for the other issues?

https://issues.apache.org/jira/browse/HDDS-2756 has been created to track the logging issue.

@xiaoyuyao

bq. Use either OZONE_SCM_PIPELINE_NUMBER_LIMIT or OZONE_DATANODE_MAX_PIPELINE_ENGAGEMENT to limit pipeline creation for all tests.

Yes. I think the default value of 0 will be problematic, as it will keep sending createPipeline commands to DNs without restriction. A better default should be provided for production use.

@timmylicheng

timmylicheng commented Dec 18, 2019

OZONE_DATANODE_MAX_PIPELINE_ENGAGEMENT

https://issues.apache.org/jira/browse/HDDS-2772 is created to track this. @xiaoyuyao

@xiaoyuyao xiaoyuyao merged commit ad1617d into apache:HDDS-1564 Dec 18, 2019
@xiaoyuyao

The acceptance test failures are unrelated and are tracked by HDDS-2774.

@timmylicheng timmylicheng deleted the HDDS-1574 branch December 19, 2019 02:24
timmylicheng added a commit to timmylicheng/hadoop-ozone that referenced this pull request Dec 23, 2019
timmylicheng added a commit to timmylicheng/hadoop-ozone that referenced this pull request Jan 7, 2020
timmylicheng added a commit to timmylicheng/hadoop-ozone that referenced this pull request Feb 10, 2020
timmylicheng added a commit to timmylicheng/hadoop-ozone that referenced this pull request Feb 12, 2020
anuengineer pushed a commit that referenced this pull request Feb 19, 2020
* HDDS-1577. Add default pipeline placement policy implementation. (#1366)



(cherry picked from commit b640a5f6d53830aee4b9c2a7d17bf57c987962cd)

* HDDS-1571. Create an interface for pipeline placement policy to support network topologies. (#1395)

(cherry picked from commit 753fc6703a39154ed6013e44dbae572391748906)

* HDDS-2089: Add createPipeline CLI. (#1418)

(cherry picked from commit 326b5acd4a63fe46821919322867f5daff30750c)

* HDDS-1569 Support creating multiple pipelines with same datanode. Contributed by Li Cheng. 

This closes #28

* HDDS-1572 Implement a Pipeline scrubber to clean up non-OPEN pipeline. (#237)

* Rebase Fix

* HDDS-2650 Fix createPipeline CLI. (#340)

* HDDS-2035 Implement datanode level CLI to reveal pipeline relation. (#348)

* Revert "HDDS-2650 Fix createPipeline CLI. (#340)"

This reverts commit 7c71710.

* HDDS-2650 Fix createPipeline CLI and make it message based. (#370)

* HDDS-1574 Average out pipeline allocation on datanodes and add metrcs/test (#291)

* Resolve rebase conflict.

* HDDS-2756. Handle pipeline creation failure in different way when it exceeds pipeline limit

Closes #401

* HDDS-2115 Add acceptance test for createPipeline CLI and datanode list CLI (#375)

* HDDS-2115 Add acceptance test for createPipeline CLI and datanode list CLI.

* HDDS-2772 Better management for pipeline creation limitation. (#410)

*  HDDS-2913 Update config names and CLI for multi-raft feature. (#462)

* HDDS-2924. Fix Pipeline#nodeIdsHash collision issue. (#478)

* HDDS-2923 Add fall-back protection for rack awareness in pipeline creation. (#516)

* HDDS-3007 Fix CI test failure for TestSCMNodeManager. (#550)

Co-authored-by: Sammi Chen <sammichen@apache.org>
Co-authored-by: Xiaoyu Yao <xyao@apache.org>