HDDS-4914. Failure injection and validating HDDS upgrade. #1998

Merged
avijayanhwx merged 10 commits into apache:HDDS-3698-nonrolling-upgrade from prashantpogde:HDDS-4914
Apr 14, 2021

@prashantpogde
Contributor

What changes were proposed in this pull request?

The goal of this PR is to write a comprehensive framework that will:

  • Drive SCM finalization.
  • Inject failures in both DataNodes and SCM at every state change in either of them.
  • Validate that SCM and DataNodes eventually finalize and that the upgrade succeeds.

The HDDS upgrade model can be thought of as a state machine {states, transitions}, where:

  • states are specific stages in upgrade finalization, either on the SCM node or on the individual DataNodes
  • transitions are events that trigger a state change

The HDDS upgrade stages, for both DataNodes and SCM, are defined as:

  • BeforePreFinalizeUpgrade
  • AfterPreFinalizeUpgrade
  • BeforeCompleteFinalization
  • AfterCompleteFinalization
  • AfterPostFinalizeUpgrade
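
As an illustration of the state-machine view, here is a minimal sketch of a driver that walks the five stages in order and runs an injected hook at each one. The names `UpgradeStage`, `FinalizationSketch`, `injectAt` and `finalizeWithRetries` are hypothetical and are not taken from the patch; a hook that throws stands in for a node failure, and the retry loop stands in for restarting the node until it finalizes.

```java
import java.util.EnumMap;
import java.util.Map;

/** Upgrade stages named in this PR description, in finalization order. */
enum UpgradeStage {
  BEFORE_PRE_FINALIZE_UPGRADE,
  AFTER_PRE_FINALIZE_UPGRADE,
  BEFORE_COMPLETE_FINALIZATION,
  AFTER_COMPLETE_FINALIZATION,
  AFTER_POST_FINALIZE_UPGRADE
}

/** Hypothetical driver: walks the stages and runs an injected hook at each. */
class FinalizationSketch {
  // Hooks keyed by stage; a hook that throws simulates a node failure there.
  private final Map<UpgradeStage, Runnable> hooks =
      new EnumMap<>(UpgradeStage.class);
  private int attempts = 0;

  void injectAt(UpgradeStage stage, Runnable hook) {
    hooks.put(stage, hook);
  }

  /** One finalization attempt; returns false if a hook "crashed" the node. */
  private boolean runOnce() {
    attempts++;
    for (UpgradeStage stage : UpgradeStage.values()) {
      Runnable hook = hooks.get(stage);
      if (hook != null) {
        try {
          hook.run();
        } catch (RuntimeException simulatedFailure) {
          return false; // node "failed" at this stage; the caller restarts it
        }
      }
    }
    return true;
  }

  /** Retries, as a restarted node would, until finalization completes. */
  int finalizeWithRetries(int maxAttempts) {
    while (attempts < maxAttempts) {
      if (runOnce()) {
        return attempts;
      }
    }
    throw new IllegalStateException(
        "did not finalize in " + maxAttempts + " attempts");
  }
}
```

A hook that fails only on its first invocation lets a test check that finalization still completes on the next attempt, which is the "eventually finalize" property the framework validates.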

This validation framework will trigger all possible combinations of failures while the nodes are in different possible states. The combinations include:

  • One Node failures - Fail SCM in the middle of SCM upgrade while the SCM is at a specific state.
    • Try this for all possible SCM-upgrade states
  • One Node failures - Fail DataNode in the middle of SCM upgrade while the SCM is at a specific state.
    • Try this for all possible SCM-upgrade states
  • One Node failures - Fail SCM in the middle of DataNode upgrade while the DataNode is at a specific state.
    • Try this for all possible DataNode-upgrade states
  • One Node failures - Fail DataNode in the middle of DataNode upgrade while the same DataNode is at a specific state.
    • Try this for all possible DataNode-upgrade states
  • Two Node failures - Fail SCM as well as a DataNode in the middle of SCM upgrade while the SCM is at a specific state.
    • Try this for all possible SCM-upgrade states
  • Two Node failures - Fail SCM as well as a DataNode in the middle of the DataNode upgrade while the same DataNode is at a specific state.
    • Try this for all possible DataNode-upgrade states
  • Two Node failures - Fail SCM at a specific upgrade state in the SCM thread context. Fail DataNode at a specific upgrade state in the DataNode upgrade thread context.
    • Try this for all permutations of SCM-upgrade states and DataNode-upgrade states
  • Multi-node failure - Fail all the DataNodes at a specific SCM-upgrade state
    • Try this for all possible SCM-upgrade states
  • Multi-node failure - Fail all the DataNodes at a specific DataNode-upgrade state
    • Try this for all possible DataNode-upgrade states
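
To get a feel for the size of this matrix, the combinations above can be enumerated mechanically. The sketch below is illustrative only: `Stage`, `FailureScenarios` and the label format are hypothetical names, not code from the patch. It treats each scenario as either (failed node set, upgrade context, stage) or, for the paired case, (SCM stage, DataNode stage).

```java
import java.util.ArrayList;
import java.util.List;

/** Upgrade stages from the PR description (constant names are illustrative). */
enum Stage {
  BEFORE_PRE_FINALIZE_UPGRADE,
  AFTER_PRE_FINALIZE_UPGRADE,
  BEFORE_COMPLETE_FINALIZATION,
  AFTER_COMPLETE_FINALIZATION,
  AFTER_POST_FINALIZE_UPGRADE
}

/** Hypothetical helper that lists every failure scenario as a readable label. */
class FailureScenarios {
  static List<String> enumerate() {
    List<String> scenarios = new ArrayList<>();
    // Whose upgrade state machine decides when the failure fires.
    String[] contexts = {"SCM", "DataNode"};
    // Which node(s) are failed at that moment.
    String[] targets = {"SCM", "DataNode", "SCM+DataNode", "all DataNodes"};

    // Single-context scenarios: contexts x targets x stages.
    for (String context : contexts) {
      for (String target : targets) {
        for (Stage stage : Stage.values()) {
          scenarios.add(
              "fail " + target + " while " + context + " is at " + stage);
        }
      }
    }
    // Paired scenarios: SCM fails at its own stage, DataNode at its own stage.
    for (Stage scmStage : Stage.values()) {
      for (Stage dnStage : Stage.values()) {
        scenarios.add(
            "fail SCM at " + scmStage + " and DataNode at " + dnStage);
      }
    }
    return scenarios;
  }
}
```

With five stages this yields 2 × 4 × 5 = 40 single-context scenarios plus 5 × 5 = 25 paired ones, 65 in total, which gives some intuition for why the full run is slow.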

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-4914

How was this patch tested?

By running the newly introduced integration tests.

@swagle
Contributor

swagle commented Mar 11, 2021

@adoroszlai These fault injection tests are adding significant time to CI runs. Is there any way to 1) run them conditionally on layout version upgrade or 2) speed up the tests by running them as separate processes?

@swagle swagle requested a review from adoroszlai March 11, 2021 23:49
Contributor

@adoroszlai adoroszlai left a comment

> These fault injection tests are adding significant time to complete CI runs

We can skip it in regular integration tests by adding an <exclude> for TestHDDSUpgrade in pom.xml:

ozone/pom.xml

Lines 2175 to 2191 in 2ce0594

<profile>
  <id>filesystem-hdds</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <includes>
            <include>org.apache.hadoop.fs.ozone.**</include>
            <include>org.apache.hadoop.hdds.**</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>

If the pre-existing test case(s) or any future test cases need to run as regular integration tests, then the new injection test cases should be moved into a separate class.
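
For illustration, the `<exclude>` mentioned above might look like the fragment below when added to the `<configuration>` of that profile. This is a sketch, not the change made in this PR: the `**/TestHDDSUpgrade.java` pattern is an assumption and should be adjusted to wherever the class actually lives.

```xml
<configuration>
  <includes>
    <include>org.apache.hadoop.fs.ozone.**</include>
    <include>org.apache.hadoop.hdds.**</include>
  </includes>
  <excludes>
    <!-- assumed pattern: skip the long-running failure injection tests -->
    <exclude>**/TestHDDSUpgrade.java</exclude>
  </excludes>
</configuration>
```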

> is there any way to 1) run them conditionally on layout version upgrade

We can introduce a new workflow for failure injection tests. It can be scheduled with lower frequency.

Is "layout version upgrade" indicated by changes to specific source files, which we could use as a trigger? Also, are you sure that upgrade functionality is not affected by other code changes?

> 2) speed up the tests by running as separate processes?

We can override surefire fork parameters for this separate workflow.
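
As a rough sketch of what such an override could look like, a dedicated profile can set Surefire's `forkCount` and `reuseForks` parameters. The profile id `failure-injection` and the include pattern are hypothetical. Note that forking parallelizes across test classes, so it only helps if the injection tests are split over several classes, and whether they can safely run concurrently is not established in this thread.

```xml
<profile>
  <!-- hypothetical profile id, activated with: mvn verify -Pfailure-injection -->
  <id>failure-injection</id>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <!-- run each test class in a fresh JVM, up to one JVM per CPU core -->
          <forkCount>1C</forkCount>
          <reuseForks>false</reuseForks>
          <includes>
            <!-- assumed pattern for the failure injection test class -->
            <include>**/TestHDDSUpgrade.java</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
</profile>
```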

@prashantpogde
Contributor Author

@adoroszlai @avijayanhwx @swagle Addressing all CI failures. The long-running failure injection test is disabled by default. We need to find a way to run it less frequently.

@prashantpogde
Contributor Author

The one failure in CI is unrelated to the changes.

Contributor

@avijayanhwx avijayanhwx left a comment

Thanks for working on this @prashantpogde. This will be useful in the future where complex finalization/rollback scenarios can be tested. I have some comments on the abstractions.

I have yet to review the actual test code that has been added. I will post more comments if needed after reviewing the injected failure testing.

@prashantpogde
Contributor Author

For now we have disabled the long-running tests, so we do not need to change pom.xml yet. But we do need a way to run these tests less frequently, e.g. every 50th commit or so.

@prashantpogde prashantpogde requested a review from adoroszlai April 9, 2021 22:02
@avijayanhwx
Contributor

@prashantpogde Can we resolve the merge conflicts? That will trigger CI which will actually run the added tests.

@prashantpogde
Contributor Author

> @prashantpogde Can we resolve the merge conflicts? That will trigger CI which will actually run the added tests.

Done

@avijayanhwx
Contributor

Thanks for working on this @prashantpogde. I am merging this, with HDDS-5108 as a follow-up item.

@avijayanhwx avijayanhwx merged commit 7266f32 into apache:HDDS-3698-nonrolling-upgrade Apr 14, 2021
errose28 added a commit to errose28/ozone that referenced this pull request Apr 19, 2021
* HDDS-3698-nonrolling-upgrade:
  HDDS-5086. Add pre-finalize validation action for SCM HA. (apache#2143)
  HDDS-4914. Failure injection and validating HDDS upgrade. (apache#1998)
  HDDS-5014. Move upgrade user flow to 'feature' folder.
  HDDS-5014. Upgrade usage primer documentation. (apache#2133)
  HDDS-4181. Add acceptance tests for upgrade, finalization and downgrade. (apache#2056)
  HDDS-4828. SCM should go into "safe mode" until there is at least 1 pipeline to work with after finalization. (apache#2101)