
FLINK-4253 - Rename "recovery.mode" config key to "high-availability" (Ram) #2342

Closed
wants to merge 4 commits

Conversation

ramkrish86
Contributor

Thanks for contributing to Apache Flink. Before you open your pull request, please take the following check list into consideration.
If your changes take all of the items into account, feel free to open your pull request. For more information and/or questions please refer to the How To Contribute guide.
In addition to going through the list, please provide a meaningful description of your changes.

  • General
    • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
    • The pull request addresses only one issue
    • Each commit in the PR has a meaningful commit message (including the JIRA id)
  • Documentation
    • Documentation has been added for new functionality
    • Old documentation affected by the pull request has been updated
    • JavaDoc for public methods has been added
  • Tests & Build
    • Functionality added by the pull request is covered by tests
    • mvn clean verify has been executed successfully locally or a Travis build has passed

I ran mvn clean verify. All the tests in flink-runtime passed, but these two tests failed:
Failed tests:
BlobServerDeleteTest.testDeleteAll:157 DELETE operation failed: Server side error: Unable to delete directory C:\Users\rsvasude\AppData\Local\Temp\blobStore-18502f30-ee19-4c4c-9cb7-7b51c9bdeffb\job_801e21ed42b26de3c813cfe4917d029d.
LeaderChangeStateCleanupTest.testReelectionOfSameJobManager:244 TaskManager should not be able to register at JobManager.
I think it is an environment issue. I ran the other tests changed as part of this PR and they all seem to pass.
Backward compatibility is handled if the config file still has the older config key recovery.mode. The same has been handled in the config.sh script as well.
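
For illustration, here is a minimal sketch of the fallback behaviour described above (not the actual PR code; the helper class and the literal key/default strings are illustrative, while the real change uses the ConfigConstants entries shown in the diff below):

import org.apache.flink.configuration.Configuration;

// Hypothetical helper illustrating the backward-compatibility lookup:
// prefer the new "high-availability" key, fall back to the deprecated
// "recovery.mode" key, and use the new default "none" otherwise.
public class HighAvailabilityModeResolver {

    private static final String HA_KEY = "high-availability";      // new key
    private static final String DEPRECATED_KEY = "recovery.mode";  // old key
    private static final String DEFAULT_HA_MODE = "none";          // new default

    public static String resolveMode(Configuration config) {
        if (config.containsKey(HA_KEY)) {
            // The new key always wins, even if the old key is also set.
            return config.getString(HA_KEY, DEFAULT_HA_MODE);
        }
        if (config.containsKey(DEPRECATED_KEY)) {
            // Old configs used 'standalone' where the new scheme uses 'none'.
            String legacy = config.getString(DEPRECATED_KEY, "standalone");
            return "standalone".equalsIgnoreCase(legacy) ? DEFAULT_HA_MODE : legacy;
        }
        return DEFAULT_HA_MODE;
    }
}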

Suggestions/feedback welcome.

@@ -502,7 +503,8 @@ object TestingUtils {
configuration: Configuration)
: ActorGateway = {

configuration.setString(ConfigConstants.HIGH_AVAILABILITY, ConfigConstants.DEFAULT_HIGH_AVAILABILTY)
configuration.setString(ConfigConstants.HIGH_AVAILABILITY,
Contributor Author

After wrapping here:

[INFO] force-shading ...................................... SUCCESS [ 11.954 s]
[INFO] flink .............................................. SUCCESS [01:01 min]
[INFO] flink-annotations .................................. SUCCESS [ 7.724 s]
[INFO] flink-shaded-hadoop ................................ SUCCESS [ 0.364 s]
[INFO] flink-shaded-hadoop2 ............................... SUCCESS [ 15.982 s]
[INFO] flink-shaded-include-yarn-tests .................... SUCCESS [ 16.367 s]
[INFO] flink-shaded-curator ............................... SUCCESS [ 0.278 s]
[INFO] flink-shaded-curator-recipes ....................... SUCCESS [ 1.953 s]
[INFO] flink-shaded-curator-test .......................... SUCCESS [ 0.738 s]
[INFO] flink-metrics ...................................... SUCCESS [ 0.326 s]
[INFO] flink-metrics-core ................................. SUCCESS [ 8.961 s]
[INFO] flink-test-utils-parent ............................ SUCCESS [ 0.302 s]
[INFO] flink-test-utils-junit ............................. SUCCESS [ 7.145 s]
[INFO] flink-core ......................................... SUCCESS [01:27 min]
[INFO] flink-java ......................................... SUCCESS [01:03 min]
[INFO] flink-runtime ...................................... SUCCESS [03:31 min]
[INFO] flink-optimizer .................................... SUCCESS [ 43.211 s]
[INFO] flink-clients ...................................... SUCCESS [ 33.062 s]
[INFO] flink-streaming-java ............................... SUCCESS [01:11 min]
[INFO] flink-test-utils ................................... SUCCESS [ 29.337 s]
[INFO] flink-scala ........................................ SUCCESS [01:48 min]
[INFO] flink-runtime-web .................................. SUCCESS [ 28.635 s]
[INFO] flink-examples ..................................... SUCCESS [ 3.177 s]
[INFO] flink-examples-batch ............................... SUCCESS [01:09 min]
[INFO] flink-contrib ...................................... SUCCESS [ 0.847 s]
[INFO] flink-statebackend-rocksdb ......................... SUCCESS [ 11.992 s]
[INFO] flink-tests ........................................ SUCCESS [03:27 min]
[INFO] flink-streaming-scala .............................. SUCCESS [01:21 min]
[INFO] flink-streaming-connectors ......................... SUCCESS [ 0.532 s]
[INFO] flink-connector-flume .............................. SUCCESS [ 8.794 s]
[INFO] flink-libraries .................................... SUCCESS [ 0.285 s]
[INFO] flink-table ........................................ SUCCESS [02:22 min]
[INFO] flink-metrics-jmx .................................. SUCCESS [ 6.208 s]
[INFO] flink-connector-kafka-base ......................... SUCCESS [ 14.832 s]
[INFO] flink-connector-kafka-0.8 .......................... SUCCESS [ 10.985 s]
[INFO] flink-connector-kafka-0.9 .......................... SUCCESS [ 7.464 s]
[INFO] flink-connector-elasticsearch ...................... SUCCESS [ 11.186 s]
[INFO] flink-connector-elasticsearch2 ..................... SUCCESS [ 11.327 s]
[INFO] flink-connector-rabbitmq ........................... SUCCESS [ 5.484 s]
[INFO] flink-connector-twitter ............................ SUCCESS [ 6.641 s]
[INFO] flink-connector-nifi ............................... SUCCESS [ 6.188 s]
[INFO] flink-connector-cassandra .......................... SUCCESS [ 11.955 s]
[INFO] flink-connector-redis .............................. SUCCESS [ 7.246 s]
[INFO] flink-connector-filesystem ......................... SUCCESS [ 8.218 s]
[INFO] flink-batch-connectors ............................. SUCCESS [ 0.283 s]
[INFO] flink-avro ......................................... SUCCESS [ 9.166 s]
[INFO] flink-jdbc ......................................... SUCCESS [ 7.117 s]
[INFO] flink-hadoop-compatibility ......................... SUCCESS [ 9.769 s]
[INFO] flink-hbase ........................................ SUCCESS [ 8.976 s]
[INFO] flink-hcatalog ..................................... SUCCESS [ 15.990 s]
[INFO] flink-examples-streaming ........................... SUCCESS [ 35.892 s]
[INFO] flink-gelly ........................................ SUCCESS [ 21.147 s]
[INFO] flink-gelly-scala .................................. SUCCESS [ 35.202 s]
[INFO] flink-gelly-examples ............................... SUCCESS [ 31.837 s]
[INFO] flink-python ....................................... SUCCESS [ 8.998 s]
[INFO] flink-ml ........................................... SUCCESS [02:34 min]
[INFO] flink-cep .......................................... SUCCESS [ 21.447 s]
[INFO] flink-cep-scala .................................... SUCCESS [ 46.998 s]
[INFO] flink-scala-shell .................................. SUCCESS [ 50.330 s]
[INFO] flink-quickstart ................................... SUCCESS [ 4.723 s]
[INFO] flink-quickstart-java .............................. SUCCESS [ 2.686 s]
[INFO] flink-quickstart-scala ............................. SUCCESS [ 0.957 s]
[INFO] flink-storm ........................................ SUCCESS [ 24.464 s]
[INFO] flink-storm-examples ............................... SUCCESS [03:51 min]
[INFO] flink-streaming-contrib ............................ SUCCESS [ 18.222 s]
[INFO] flink-tweet-inputformat ............................ SUCCESS [ 9.515 s]
[INFO] flink-operator-stats ............................... SUCCESS [ 8.018 s]
[INFO] flink-connector-wikiedits .......................... SUCCESS [ 6.316 s]
[INFO] flink-yarn ......................................... SUCCESS [ 26.015 s]
[INFO] flink-dist ......................................... SUCCESS [01:04 min]
[INFO] flink-metrics-dropwizard ........................... SUCCESS [ 8.947 s]
[INFO] flink-metrics-ganglia .............................. SUCCESS [ 5.176 s]
[INFO] flink-metrics-graphite ............................. SUCCESS [ 4.891 s]
[INFO] flink-metrics-statsd ............................... SUCCESS [ 7.830 s]
[INFO] flink-fs-tests ..................................... SUCCESS [ 3.448 s]
[INFO] flink-java8 ........................................ SUCCESS [ 14.214 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39:09 min

@ramkrish86
Contributor Author

Looks like only this build failed:
https://travis-ci.org/apache/flink/jobs/150992610, and that was due to a cassandra-connector test case. It should be unrelated to this PR.

@StephanEwen
Contributor

Looks good in general. I think this could use a test case that ensures that, for now, the configuration parsing accepts both the old and the new config key.

Can you add such a test?

@uce
Contributor

uce commented Aug 11, 2016

I think we need to change more than just that single variable. The other variables should match, e.g.
recovery.jobmanager.port => high-availability.jobmanager.port and also recovery.zookeeper.* => high-availability.zookeeper.*, etc.

@ramkrish86
Contributor Author

Can you add such a test?

Sure, I can. The existing test cases helped to ensure that if there are no old configs we are able to fetch from the new config. I can add a test case to ensure that both configs are accepted.
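
For instance, a rough sketch of such a test (illustrative only; it uses the hypothetical HighAvailabilityModeResolver helper sketched in the PR description above, not the actual class added by this PR):

import static org.junit.Assert.assertEquals;

import org.apache.flink.configuration.Configuration;
import org.junit.Test;

public class HighAvailabilityBackwardCompatibilityTest {

    @Test
    public void testDeprecatedRecoveryModeKeyIsStillAccepted() {
        Configuration config = new Configuration();
        // Only the old key is set, as in a pre-rename flink-conf.yaml.
        config.setString("recovery.mode", "zookeeper");

        // The deprecated key should still resolve to the 'zookeeper' HA mode.
        assertEquals("zookeeper", HighAvailabilityModeResolver.resolveMode(config));
    }
}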

The other variables should match, e.g.
recovery.jobmanager.port => high-availability.jobmanager.port and also recovery.zookeeper.* => high-availability.zookeeper.*, etc.

I see. I had this doubt, but since the PR was about this specific param I thought that would be enough. +1 to changing all the relevant ones. I can update this in the next revision of the PR.

Thanks all for the comments.

@ramkrish86
Contributor Author

Just a query: if both the old and new configs are available, we should give priority to the new one, right? Even if the values for the old and new configs are different?

@uce
Contributor

uce commented Aug 12, 2016

Regarding the priority: Yes, I think so.

@ramkrish86
Contributor Author

Updated the PR. It handles all the other configs and renames them from 'recovery.*' to 'high-availability.*'.
Also added a test case and suitably modified the code to ensure that if both old and new configs are present, only the new value is taken. If neither is set, then the default is taken. Feedback and suggestions are welcome.
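
As an illustration of the precedence rules described above, here is a sketch of the corresponding assertions (again using the hypothetical HighAvailabilityModeResolver from the earlier sketch, not the actual test code in the PR):

import static org.junit.Assert.assertEquals;

import org.apache.flink.configuration.Configuration;
import org.junit.Test;

public class HighAvailabilityKeyPrecedenceTest {

    @Test
    public void testNewKeyTakesPriorityWhenBothAreSet() {
        Configuration config = new Configuration();
        config.setString("recovery.mode", "standalone");     // old key
        config.setString("high-availability", "zookeeper");  // new key

        // The new key's value wins, even though the old key is also present.
        assertEquals("zookeeper", HighAvailabilityModeResolver.resolveMode(config));
    }

    @Test
    public void testDefaultIsUsedWhenNeitherKeyIsSet() {
        // Neither key is set: fall back to the new default 'none'.
        assertEquals("none", HighAvailabilityModeResolver.resolveMode(new Configuration()));
    }
}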

@ramkrish86
Contributor Author

The GitHub build shows two failures:

org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase.org.apache.flink.test.checkpointing.EventTimeWindowCheckpointingITCase
org.apache.flink.test.checkpointing.StreamCheckpointingITCase.org.apache.flink.test.checkpointing.StreamCheckpointingITCase

Both seem to be unrelated.
Two Travis builds hung; the rest of them passed.

https://travis-ci.org/apache/flink/jobs/151785550
https://travis-ci.org/apache/flink/jobs/151785556

Not sure of the problem.

@StephanEwen
Contributor

There are some test instabilities that we are trying to address. They seem unrelated to your changes at first glance. We'll look at the tests again anyway before merging.

@ramkrish86
Contributor Author

I ran those two tests individually and they seem to pass, but the RocksDB version failed due to some initialization error. Thank you @StephanEwen for your comments.

@ramkrish86
Contributor Author

@uce
Updated with changes so that all the configs are renamed from 'recovery' to 'high-availability'.

@@ -285,7 +285,7 @@ of the JobManager, because the same ActorSystem is used. Its not possible to use

## High Availability Mode

- `recovery.mode`: (Default 'standalone') Defines the recovery mode used for the cluster execution. Currently, Flink supports the 'standalone' mode where only a single JobManager runs and no JobManager state is checkpointed. The high availability mode 'zookeeper' supports the execution of multiple JobManagers and JobManager state checkpointing. Among the group of JobManagers, ZooKeeper elects one of them as the leader which is responsible for the cluster execution. In case of a JobManager failure, a standby JobManager will be elected as the new leader and is given the last checkpointed JobManager state. In order to use the 'zookeeper' mode, it is mandatory to also define the `recovery.zookeeper.quorum` configuration value.
- `high-availability`: (Default 'none') Defines the recovery mode used for the cluster execution. Currently, Flink supports the 'none' mode where only a single JobManager runs and no JobManager state is checkpointed. The high availability mode 'zookeeper' supports the execution of multiple JobManagers and JobManager state checkpointing. Among the group of JobManagers, ZooKeeper elects one of them as the leader which is responsible for the cluster execution. In case of a JobManager failure, a standby JobManager will be elected as the new leader and is given the last checkpointed JobManager state. In order to use the 'zookeeper' mode, it is mandatory to also define the `recovery.zookeeper.quorum` configuration value. Previously this config was named 'recovery.mode' and the default config was 'standalone'.

- `recovery.zookeeper.quorum`: Defines the ZooKeeper quorum URL which is used to connet to the ZooKeeper cluster when the 'zookeeper' recovery mode is selected
Contributor

Docs not updated (same for jobmanager_high_availability.md)

Contributor Author

Oh yes. Will change them too.

@ramkrish86
Contributor Author

@uce
Updated the PR with suitable doc updates and also renamed the enum RecoveryMode to HighAvailabilityMode.
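
For orientation, a hedged sketch of what the renamed enum could look like (the actual HighAvailabilityMode in Flink may differ in values and helper methods; the key names and defaults here are illustrative):

import org.apache.flink.configuration.Configuration;

public enum HighAvailabilityMode {
    NONE,       // replaces the old RecoveryMode.STANDALONE
    ZOOKEEPER;

    // Illustrative parser: prefer the new key and fall back to the deprecated one.
    public static HighAvailabilityMode fromConfig(Configuration config) {
        String value = config.getString("high-availability",
                config.getString("recovery.mode", "none"));
        if ("zookeeper".equalsIgnoreCase(value)) {
            return ZOOKEEPER;
        }
        // 'standalone' (old default) and 'none' (new default) both map to NONE.
        return NONE;
    }
}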

@ramkrish86
Contributor Author

@uce
The test case failures seem to be unrelated. Do you want to have a look at the updated PR?

@ramkrish86
Contributor Author

@uce
I saw that there were conflicts after yesterday's updates to master, so I have rebased my PR and force-pushed the changes. Thanks.

@ramkrish86
Contributor Author

@uce
Got a green build. Any feedback/reviews here? I know you guys are busy; just a gentle reminder.

@ramkrish86
Contributor Author

@uce
Got some time to check this? A gentle reminder!

@uce
Contributor

uce commented Aug 19, 2016

I will check this later today or on Monday. You will have to wait a little, sorry.

@ramkrish86
Contributor Author

@uce
Thanks. No problem. I can wait.

@uce
Contributor

uce commented Aug 22, 2016

I had to address some things when reviewing this: https://github.com/uce/flink/tree/ram

In general the changes were good, but they overlooked many occurrences that are not possible to catch with automatic refactoring tools. I went through those manually. I will merge this after my local Travis build passes.

@ramkrish86
Contributor Author

@uce
Thank you very much for checking it out. I can see that in the branch you created, you have updated some doc comments I had added (changing 'config' to 'key') and covered some places I had missed. Good on you. Thanks once again. Do let me know if you want me to help out if you feel any more places are missing.

@uce
Contributor

uce commented Aug 23, 2016

Builds are passing here: https://travis-ci.org/uce/flink/builds/154236174

I'm going to merge this later today. Thanks for starting this renaming.
