Release 1.6 #8406

magic20191 · 2019-05-10T10:06:19Z

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

The TaskInfo is stored in the blob store on job creation time as a persistent artifact
Deployments RPC transmits only the blob storage reference
TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end deployment with large payloads (100MB)
Extended integration test for recovery after master (JobManager) failure
Added test that validates that TaskInfo is transferred only once across recoveries
Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Sometimes Netty does not properly initialize the channel pipeline when sending a request from the RestClient. In this situation, we need to fail the response so that the caller is notified about the unsuccessful call.
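The idea of failing the response instead of leaving the caller waiting can be sketched as follows. This is a minimal illustration only, not the RestClient code: `FakeChannel`, `hasResponseHandler` and `sendRequest` are hypothetical names, and no Netty classes are used.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class FailResponseSketch {

    /** Hypothetical stand-in for a channel whose pipeline may lack the response handler. */
    public static final class FakeChannel {
        private final boolean handlerInstalled;

        public FakeChannel(boolean handlerInstalled) {
            this.handlerInstalled = handlerInstalled;
        }

        public boolean hasResponseHandler() {
            return handlerInstalled;
        }
    }

    /** Returns a future that is failed immediately when the pipeline is not set up. */
    public static CompletableFuture<String> sendRequest(FakeChannel channel) {
        CompletableFuture<String> response = new CompletableFuture<>();
        if (!channel.hasResponseHandler()) {
            // Without this, nobody would ever complete the future and the caller would hang.
            response.completeExceptionally(
                    new IllegalStateException("channel pipeline was not initialized"));
            return response;
        }
        // Assumption for the sketch: a healthy channel answers right away.
        response.complete("ok");
        return response;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sendRequest(new FakeChannel(true)).get());
        try {
            sendRequest(new FakeChannel(false)).get();
        } catch (ExecutionException e) {
            System.out.println("failed: " + e.getCause().getMessage());
        }
    }
}
```

The point of the sketch is only that the error path completes the future exceptionally, so the caller sees a failure rather than a request that never returns.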

This adds tests to verify that jobs can be cancelled properly when there are standby masters. To enable this, we added tests to install Flink as a standalone cluster. The first two DB nodes will be running the master processes, while the others are running the TaskManagers. All Flink processes are supervised by runit – this allows the nemesis to kill Flink processes. The client now implements a cancel operation. The model used by the checker had to be rewritten to address the fact that the job can be canceled. The cancel operation is "reliable", i.e., it either cancels the job successfully, or it fails the whole test fatally. This way we can be sure that the job should eventually not be running if the cancel operation completes successfully. This closes #6712.

Dynamic arguments should be prefixed with -D. Besides that, the config key is not needed because it is already set in flink-conf.yaml.

Previously the Mesos tests were disabled due to FLINK-9936, which is resolved now. There are currently no known issues with Mesos so it is justified to enable the tests.

This closes #6766.

…daloneJobClusterEntrypoint This closes #6733.

Update the common methods to the new testing harness.

…ycle of ScalarFunction This closes #6771.

…ints by configuration [FLINK-10371][tests] Regenerate configuration docs [FLINK-10371] Adapt to code review This closes #6727.

…ompletedCheckpointStore" This reverts commit 6f570e7. This closes #6704

…ink/docker-entrypoint.sh

…ger.sh script. jobmanager.sh script syntax changed in Flink 1.5 as documented here: https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.5.html#changed-syntax-of-jobmanagersh-script

… API

…L example for Java. This closes #6790.

…untimeInfo

…eChannel

…dation. This closes #6775.

Logging configuration was set only for scala-shell in yarn mode. This commit sets the configuration for local and remote mode in start-scala-shell.sh script as well.

…location This commit removes container requests after containers have been allocated. This prevents us from requesting more and more containers from Yarn in case of a recovery. Since we cannot rely on the reported container Resource, we remove the container request by using the requested Resource. This is due to Yarn's DefaultResourceCalculator, which neglects the number of vCores when allocating containers.
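Why the lookup has to use the requested Resource can be shown with a small sketch. It does not use the Yarn client API: `Resource` and `ContainerRequestSketch` below are hypothetical stand-ins, and the reported value of 1 vCore is an assumption chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class ContainerRequestSketch {

    /** Hypothetical stand-in for a container resource (memory plus vCores). */
    public static final class Resource {
        final int memoryMb;
        final int vcores;

        public Resource(int memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Resource)) {
                return false;
            }
            Resource other = (Resource) o;
            return memoryMb == other.memoryMb && vcores == other.vcores;
        }

        @Override
        public int hashCode() {
            return Objects.hash(memoryMb, vcores);
        }
    }

    // Number of outstanding requests per requested resource.
    private final Map<Resource, Integer> pending = new HashMap<>();

    public void request(Resource resource) {
        pending.merge(resource, 1, Integer::sum);
    }

    /** Removes one pending request for the given resource; returns whether one was found. */
    public boolean remove(Resource resource) {
        Integer count = pending.get(resource);
        if (count == null) {
            return false;
        }
        if (count == 1) {
            pending.remove(resource);
        } else {
            pending.put(resource, count - 1);
        }
        return true;
    }

    public int pendingCount() {
        return pending.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        ContainerRequestSketch tracker = new ContainerRequestSketch();
        Resource requested = new Resource(1024, 4);
        // Assumed: the allocated container is reported with a different vCore count.
        Resource reported = new Resource(1024, 1);

        tracker.request(requested);
        System.out.println(tracker.remove(reported));  // false: the request would stay pending
        System.out.println(tracker.remove(requested)); // true: the request is removed
        System.out.println(tracker.pendingCount());    // 0
    }
}
```

If the request were matched against the reported resource, it would never be removed and would be issued again on every recovery, which is the growth the commit message describes.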

…in job definition This closes #7436

This closes #7436

Added a check whether minikube is running. If it is not, we try to start it a couple of times. If we do not succeed, we fail with a descriptive message.
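The check-then-retry logic described above can be sketched generically. The actual test is a shell script that is not shown here, so the Java below is only an illustration: `ensureRunning`, its parameters and the attempt count are hypothetical, and the status check and start command are passed in as callbacks instead of invoking minikube.

```java
import java.util.function.BooleanSupplier;

public class RetrySketch {

    /**
     * Checks whether a service is running and, if not, tries to start it up to
     * maxAttempts times. Throws with a descriptive message if it never comes up.
     */
    public static void ensureRunning(BooleanSupplier isRunning, Runnable start, int maxAttempts) {
        if (isRunning.getAsBoolean()) {
            return;
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            start.run();
            if (isRunning.getAsBoolean()) {
                return;
            }
        }
        throw new IllegalStateException(
                "service did not start after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        // Simulated service that only comes up after the second start attempt.
        int[] starts = {0};
        ensureRunning(() -> starts[0] >= 2, () -> starts[0]++, 3);
        System.out.println("started after " + starts[0] + " attempts");
    }
}
```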

…Large State` This closes #7603. (cherry picked from commit 3abb3de)

…linkKafkaConsumerBase

…cle verifications

…before closed when concurrently accessed

…LifeCycle Split #testConsumerLifeCycle into two methods which represent the two if-else branches. This closes #7606.

…izedTaskInformation in class TaskDeploymentDescriptor

This closes #7532.

…ng file in Hadoop.

…g file in Hadoop.

…ough finished batch query This closes #7265.

…b cancellation.

The cause of the instability seems to be a not-so-rare timing issue: the thread that calls `interrupt()` on the main thread is still running after its original test finishes, and calls `interrupt()` during the execution of the next test. This interrupts the normal execution (or `sleep()` in this case) of that next test.
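One general way to keep such an interrupt from leaking into the next test is to wait for the interrupting thread before the test returns. The sketch below illustrates that idea only; it is not taken from the commit, and `InterruptSketch` and its methods are hypothetical names.

```java
public class InterruptSketch {

    /** Sleeps and reports whether the sleep finished without being interrupted. */
    public static boolean sleepUninterrupted(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            return false;
        }
    }

    /**
     * Simulates a test step that expects to be interrupted by a helper thread.
     * The helper is joined before returning, so it cannot interrupt later code.
     */
    public static boolean runInterruptedStep() {
        Thread testThread = Thread.currentThread();
        Thread interrupter = new Thread(testThread::interrupt);
        interrupter.start();

        boolean wasInterrupted = !sleepUninterrupted(10_000);

        try {
            // Wait until the helper thread has finished before moving on.
            interrupter.join();
        } catch (InterruptedException e) {
            // The interrupt arrived while joining instead of while sleeping.
            wasInterrupted = true;
        }
        // Clear a possibly pending interrupt flag so it does not affect later code.
        wasInterrupted |= Thread.interrupted();
        return wasInterrupted;
    }

    public static void main(String[] args) {
        System.out.println("first step interrupted: " + runInterruptedStep());
        // Stands in for the next test: its sleep should now complete normally.
        System.out.println("next sleep completed: " + sleepUninterrupted(50));
    }
}
```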

flinkbot · 2019-05-10T10:06:39Z

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

❓ 1. The [description] looks good.
❓ 2. There is [consensus] that the contribution should go into Flink.
❓ 3. Needs [attention] from.
❓ 4. The change fits into the overall [architecture].
❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details

The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
@flinkbot approve all to approve all aspects
@flinkbot approve-until architecture to approve everything until architecture
@flinkbot attention @username1 [@username2 ..] to require somebody's attention
@flinkbot disapprove architecture to remove an approval you gave earlier

klion26 · 2019-05-10T11:27:04Z

@magic20191 Could you please close this PR? It tries to merge release-1.6 into master, which is not needed.

tillrohrmann and others added 30 commits September 27, 2018 18:15

[hotfix][tests] Remove wrong -rest.port=8081 from mesos start arguments.

58538b7

Dynamic arguments should be prefixed with -D. Besides that, the config key is not needed because it is already set in flink-conf.yaml.

[hotfix][tests] Extract mesos appmaster command to separate function.

500786f

[hotfix][tests] Enable building uberjar.

3c4ba46

[hotfix][docs, tests] Fix formatting in jepsen README.md

079d6fa

[hotfix][tests] Update default Flink distribution to 1.6

bb58785

[hotfix][tests] Reformat code in nemesis.clj

ff8b169

[hotfix][tests] Enable Mesos Jepsen tests.

1e2584f

Previously the Mesos tests were disabled due to FLINK-9936, which is resolved now. There are currently no known issues with Mesos so it is justified to enable the tests.

[hotfix][tests] Stop all services supervised by runit when tearing down.

5c47ec3

[hotfix][tests] Assert there is only one applicationId in ZooKeeper.

2b5ce0f

[hotfix] [connectors] Remove unused BulkProcessorIndexer class

679a304

[hotfix][docs] Improve documentation of savepoints

057394c

This closes #6766.

[FLINK-10291] Generate JobGraph with fixed/configurable JobID in Stan…

3ee66af

…daloneJobClusterEntrypoint This closes #6733.

[hotfix] Fix quickstarts end-to-end test

b50a7a7

Update the common methods to the new testing harness.

[FLINK-10451] [table] TableFunctionCollector should handle the life c…

7fb980e

…ycle of ScalarFunction This closes #6771.

[FLINK-10371] Allow to enable SSL mutual authentication on REST endpo…

5e7237a

…ints by configuration [FLINK-10371][tests] Regenerate configuration docs [FLINK-10371] Adapt to code review This closes #6727.

[FLINK-10354] Revert "[FLINK-6328] [chkPts] Don't add savepoints to C…

6f8b43f

…ompletedCheckpointStore" This reverts commit 6f570e7. This closes #6704

[FLINK-10345][docs] Added note with warning about removing savepoints

938cfa3

[hotfix] Remove unused cluster parameter from flink-contrib/docker-fl…

0a0ef88

…ink/docker-entrypoint.sh

[docs] Update cluster setup docs to reflect the new syntax of jobmana…

005142c

…ger.sh script. jobmanager.sh script syntax changed in Flink 1.5 as documented here: https://ci.apache.org/projects/flink/flink-docs-stable/release-notes/flink-1.5.html#changed-syntax-of-jobmanagersh-script

[FLINK-10312][rest] Propagate exception from server to client in REST…

f132f22

… API

[FLINK-10487] [docs] Fix table conversion example and add runnable SQ…

54bba8e

…L example for Java. This closes #6790.

[hotfix][tests] Extend MockEnvironmentBuilder to support TaskManagerR…

e2f1363

…untimeInfo

[FLINK-10242][tests] Split StreamSourceOperatorTest

e58fb7f

[FLINK-10242][metrics] Make latency interval configurable

6f30e8b

[FLINK-10243][metrics] Make latency metrics granularity configurable

0600c17

[FLINK-10465][tests] Do not stop sshd if it is supervised by runit.

f3865aa

[FLINK-10469][core] make sure to always write the whole buffer to Fil…

f266975

…eChannel

[FLINK-5542][yarn] Use YarnCluster vcores setting to do MaxVCore vali…

5b3211c

…dation. This closes #6775.

zjffdu and others added 26 commits January 11, 2019 10:27

[FLINK-11224][scala-shell] Log is missing in scala-shell

99a89d2

Logging configuration was set only for scala-shell in yarn mode. This commit sets the configuration for local and remote mode in start-scala-shell.sh script as well.

[FLINK-11304][docs][table] Fix typos in time attributes doc

c9c2fa4

[hotfix][docs] Fix complex Table API example bug

9b7af86

[hotfix][build] Append shade-plugin transformers in child modules

4a97ac1

[FLINK-11289][examples] Rework examples to account for licensing

8e10177

[FLINK-11071][core] add support for dynamic proxy classes resolution …

4321bd3

…in job definition This closes #7436

[FLINK-11071][core] Improved proxy class serialization test

bdc29b9

This closes #7436

[FLINK-10910][e2e] Hardened Kubernetes e2e test.

01e3c72

Added a check whether minikube is running. If it is not, we try to start it a couple of times. If we do not succeed, we fail with a descriptive message.

[FLINK-11469][docs] Update documentation for `Tuning Checkpoints and …

de3772b

…Large State` This closes #7603. (cherry picked from commit 3abb3de)

[FLINK-10774] Rework lifecycle management of partitionDiscoverer in F…

402f235

…linkKafkaConsumerBase

[FLINK-10774] [tests] Refactor Kafka tests to have consistent life cy…

2a5d97d

…cle verifications

[FLINK-10774] [tests] Test that Kafka partition discoverer is wokeup …

c215064

…before closed when concurrently accessed

[FLINK-10774][tests] Refactor FlinkKafkaConsumerBaseTest#testConsumer…

1b9c464

…LifeCycle Split #testConsumerLifeCycle into two methods which represent the two if-else branches. This closes #7606.

[FLINK-11389] Fix Incorrectly use job information when call getSerial…

941ed4d

…izedTaskInformation in class TaskDeploymentDescriptor

[FLINK-11389][tests] Refactor TaskDeploymentDescriptorTest

ea90666

This closes #7532.

[FLINK-11419][filesystem] Wait until lease is revoked before truncati…

7966c88

…ng file in Hadoop.

[FLINK-11419][filesystem] Wait for lease to be revoked when truncatin…

4e78d58

…g file in Hadoop.

[hotfix][docs] Add space in self-closing linebreak tag

b3f9dde

[FLINK-11584][docs][tests] Fix linebreak parsing

2840106

[FLINK-11585][docs] Fix prefix matching

1cfa90c

[FLINK-11628][travis] Cache maven

a268ae1

[hotfix][travis] Remove stray slash

de1560d

[FLINK-10964][sql-client] SQL Client throws exception when paging thr…

9663323

…ough finished batch query This closes #7265.

[FLINK-11745][State TTL][E2E] Restore from the savepoint after the jo…

950657b

…b cancellation.

rmetzger added the review=description? label May 10, 2019

zentol closed this May 10, 2019

20 participants
