
Conversation

@sxganapa

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4-node cluster with 2 JobManagers and 4 TaskManagers, running a stateful streaming program, killing one JobManager and two TaskManagers during the execution, and verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

twalthr and others added 30 commits February 25, 2019 12:49
The test had not actually been running since the class was refactored to use JUnit's Parameterized runner: it always ran into an NPE, and the NPE was then silently swallowed in a shutdown catch-block.

(cherry picked from commit 168660a)
The cause of the instability seems to be a not-so-rare timing issue: the thread that calls `interrupt()` on the main thread can still be running after its original test finishes, and then calls `interrupt()` during execution of the next test. This causes the next test's normal execution (or `sleep()` in this case) to be interrupted.
…ocksDBIncrementalRestoreOperation

(cherry picked from commit 6b9ec27)
… local state

This corrects a problem that was introduced with the refactorings in FLINK-10043.

This closes #7841.

(cherry picked from commit 7a078a6)
Add a dedicated onStart method to the RpcEndpoint, which is called when the RpcEndpoint
is started via the start() method. With this change it is no longer necessary for
users to override the start() method, which was error prone because it always required
calling super.start(). Now this contract is enforced explicitly. Moreover, it allows
the setup logic to be executed in the RpcEndpoint's main thread.
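The lifecycle contract described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not Flink's actual RpcEndpoint: the `Endpoint` class, its single-threaded executor, and the `close()` method are assumptions made for the example; only the idea of a final `start()` dispatching to an overridable `onStart()` in the main thread comes from the text.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified sketch of the lifecycle-hook pattern; not Flink code.
abstract class Endpoint {
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // start() is final, so subclasses cannot forget to call super.start().
    // Their setup logic goes into onStart(), which runs in the endpoint's main thread.
    public final void start() {
        mainThread.execute(this::onStart);
    }

    protected abstract void onStart();

    public void close() throws InterruptedException {
        mainThread.shutdown();
        mainThread.awaitTermination(5, TimeUnit.SECONDS);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean setupRan = new AtomicBoolean();
        Endpoint endpoint = new Endpoint() {
            @Override
            protected void onStart() {
                setupRan.set(true); // runs in the endpoint's single main thread
            }
        };
        endpoint.start();
        endpoint.close();
        System.out.println("onStart executed: " + setupRan.get());
    }
}
```

Making `start()` final turns the "remember to call super" convention into a compile-time guarantee.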
…ecovery

Wait until the Dispatcher has been started before adding new JobGraphs to the SubmittedJobGraphStore
…gnature

Prior to this commit, the CompositeTypeSerializerSnapshot class
signature was a bit confusing and contained raw types. Moreover, it
required subclasses to always erase types and re-cast.

This closes #7818.
…ot field / method names in InternalTimersSnapshot

This renaming corresponds to the fact that TypeSerializerConfigSnapshot
is now deprecated, and is fully replaced by TypeSerializerSnapshot.
…lization compatibility APIs for key / namespace serializer checks

This commit lets the InternalTimerServiceImpl properly use
TypeSerializerSchemaCompatibility /
TypeSerializerSnapshot#resolveSchemaCompatibility when attempting to
check the compatibility of new key and namespace serializers.

This also fixes the fact that this check was previously broken, in that the
key / namespace serializers were not reassigned to the reconfigured ones.
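The reassignment bug described above can be sketched in miniature. All the types here (`Serializer`, `Compatibility`, `TimerService`, `applyKeySerializerCheck`) are simplified stand-ins invented for this example, not Flink's actual TypeSerializer / TypeSerializerSchemaCompatibility API; only the core point, that the service must keep the reconfigured serializer after a compatible-after-reconfiguration result, comes from the commit message.

```java
// Illustrative stand-ins only; not Flink's actual serializer-compatibility API.
interface Serializer<T> {}

class Compatibility<T> {
    final boolean compatible;
    final Serializer<T> reconfigured; // non-null when reconfiguration was required
    Compatibility(boolean compatible, Serializer<T> reconfigured) {
        this.compatible = compatible;
        this.reconfigured = reconfigured;
    }
}

class TimerService<K> {
    Serializer<K> keySerializer;

    void applyKeySerializerCheck(Serializer<K> newSerializer, Compatibility<K> result) {
        if (!result.compatible) {
            throw new IllegalStateException("incompatible key serializer");
        }
        // The fix: keep the *reconfigured* serializer when one was produced,
        // rather than silently continuing with the unreconfigured one.
        keySerializer = result.reconfigured != null ? result.reconfigured : newSerializer;
    }
}

public class Main {
    public static void main(String[] args) {
        Serializer<String> proposed = new Serializer<String>() {};
        Serializer<String> reconfigured = new Serializer<String>() {};
        TimerService<String> timers = new TimerService<>();
        timers.applyKeySerializerCheck(proposed, new Compatibility<>(true, reconfigured));
        System.out.println("kept reconfigured: " + (timers.keySerializer == reconfigured));
    }
}
```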
…uld not be serializing timers' key / namespace serializers anymore

The changes made to managed state, so that we no longer Java-serialize
serializers and only write the serializer snapshot, were not reflected in how
we snapshot timers. This was mainly because timers were not handled by state
backends in the past (and were therefore not managed state), but were instead
handled in an isolated manner by the InternalTimerServiceSerializationProxy.

This closes #7849.
…in CompositeTypeSerializerConfigSnapshot

We often want to get only the restored serializer snapshots from a
legacy CompositeTypeSerializerConfigSnapshot when attempting to redirect
compatibility checks to new snapshots. This commit adds a
getNestedSerializerSnapshots utility method for that purpose.
…lity method with SelfResolvingTypeSerializer implementation
… method using SelfResolvingTypeSerializer interface

Only the TtlSerializer needs to implement the
SelfResolvingTypeSerializer interface, because all other subclasses of
CompositeSerializer are test serializers.
carp84 and others added 24 commits April 29, 2019 10:24
…the incremental checkpoint code path

This closes #8297.

(cherry picked from commit 9aeb4e5)
…state loss for chained keyed operators

- Change
Changes the local data path from
`.../local_state_root/allocation_id/job_id/jobvertex_id_subtask_id/chk_id/rocksdb`
to
`.../local_state_root/allocation_id/job_id/jobvertex_id_subtask_id/chk_id/operator_id`

When preparing the local directory, Flink deletes the subtask's local directory if it already exists.
If more than one stateful operator is chained in a single task, they all share the same local
directory path, so the directory gets deleted unexpectedly and we get data loss.

This closes #8263.

(cherry picked from commit ee60846)
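The directory-collision fix above boils down to appending a per-operator component to the local state path. The following sketch is illustrative only: the class and method names, and the concrete path components, are assumptions made for this example, not Flink's actual local-state API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: builds the per-operator local state directory described above,
// so two chained operators in one task no longer collide on a shared "rocksdb" leaf.
class LocalStatePaths {
    static Path forOperator(Path localStateRoot, String allocationId, String jobId,
                            String jobVertexAndSubtask, long checkpointId, String operatorId) {
        return localStateRoot
                .resolve(allocationId)
                .resolve(jobId)
                .resolve(jobVertexAndSubtask)
                .resolve("chk_" + checkpointId)
                .resolve(operatorId); // previously a fixed "rocksdb" leaf shared by the chain
    }
}

public class Main {
    public static void main(String[] args) {
        Path root = Paths.get("/tmp/local_state_root");
        // Two stateful operators chained in the same subtask and checkpoint:
        Path op1 = LocalStatePaths.forOperator(root, "alloc-1", "job-1", "vertex-1_0", 42, "op-a");
        Path op2 = LocalStatePaths.forOperator(root, "alloc-1", "job-1", "vertex-1_0", 42, "op-b");
        System.out.println("distinct per-operator dirs: " + (!op1.equals(op2)));
    }
}
```

With a shared leaf, preparing the second operator's directory would wipe the first operator's state; distinct leaves make the delete-if-exists step safe.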
Jar caching is not required since the jars are rebuilt in the test profiles anyway.
Run dependency convergence in the main compile run;
invoking Maven once per module requires significant time.
Run convergence in the install phase (i.e., after the shade plugin) to work against dependency-reduced POMs.
- fix find -mindepth parameter
- pass PROFILE to maven to prevent downloads of modules that weren't built beforehand
- add -maxdepth parameter for pom.xml searches
…r#jobReachedGloballyTerminalState fails

FutureUtils#assertNoException will assert that the given future has not been completed
exceptionally. If it has been completed exceptionally, then it will call the
FatalExitExceptionHandler.

This commit uses assertNoException to assert that the Dispatcher#jobReachedGloballyTerminalState
method has not failed.

This closes #8334.
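The guard described above can be sketched as a small helper. This is a hedged approximation of the idea, not Flink's actual FutureUtils: the real assertNoException escalates to FatalExitExceptionHandler, whereas here the handler is injected so the behavior is observable without exiting the JVM.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Sketch: if the guarded future completes exceptionally, escalate to a fatal handler.
final class FutureGuards {
    static <T> void assertNoException(CompletableFuture<T> future,
                                      Consumer<Throwable> fatalHandler) {
        future.whenComplete((value, error) -> {
            if (error != null) {
                fatalHandler.accept(error); // Flink would call FatalExitExceptionHandler here
            }
        });
    }
}

public class Main {
    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();

        CompletableFuture<Void> guarded = new CompletableFuture<>();
        FutureGuards.assertNoException(guarded, fatal::set);
        guarded.completeExceptionally(new IllegalStateException("terminal-state handling failed"));

        System.out.println("fatal handler invoked: " + (fatal.get() != null));
    }
}
```

Failing loudly here is deliberate: an exceptionally completed future in this code path indicates a bug that should terminate the process rather than be swallowed.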
…ontainer requests

Flink's YarnResourceManager sets a faster heartbeat interval when it is requesting containers
from Yarn's ResourceManager. Since requests and responses are transported via heartbeats, this
speeds up requests. However, it can also put additional load on Yarn due to excessive container
requests. Therefore, this commit introduces a config option to control this heartbeat interval.
We now use Scala reflection because it correctly deals with Scala
language features.
@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

hwanju and others added 2 commits May 14, 2019 08:56
…eout and race

TaskExecutor registration is an asynchronous process, which allows a retry
issued after a timeout to be processed ahead of the earlier request. Such a
delayed, timed-out request can accidentally unregister a valid task manager,
whose slots are then permanently not reported to the job manager. This patch
introduces ongoing task executor futures to prevent this race.
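The "ongoing futures" guard can be sketched as follows. This is a hypothetical simplification, not the actual ResourceManager code: `RegistrationTracker`, its method names, and the string task-executor id are assumptions made for the example; only the idea of tracking the current in-flight attempt so a stale timeout cannot unregister a valid task executor comes from the text.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch (not Flink code): each new registration attempt replaces
// the previous in-flight future, so a timeout firing for a stale attempt is
// ignored instead of unregistering a task executor that has since re-registered.
class RegistrationTracker {
    private final ConcurrentHashMap<String, CompletableFuture<Void>> ongoing =
            new ConcurrentHashMap<>();

    CompletableFuture<Void> startAttempt(String taskExecutorId) {
        CompletableFuture<Void> attempt = new CompletableFuture<>();
        CompletableFuture<Void> previous = ongoing.put(taskExecutorId, attempt);
        if (previous != null) {
            previous.cancel(false); // a late timeout of the old attempt becomes a no-op
        }
        return attempt;
    }

    boolean onTimeout(String taskExecutorId, CompletableFuture<Void> attempt) {
        // Unregister only if this attempt is still the current one.
        if (ongoing.get(taskExecutorId) == attempt
                && attempt.completeExceptionally(new TimeoutException())) {
            ongoing.remove(taskExecutorId, attempt);
            return true; // caller would unregister the task executor here
        }
        return false;
    }
}

public class Main {
    public static void main(String[] args) {
        RegistrationTracker tracker = new RegistrationTracker();
        CompletableFuture<Void> stale = tracker.startAttempt("tm-1");
        CompletableFuture<Void> current = tracker.startAttempt("tm-1"); // retry in flight
        boolean unregistered = tracker.onTimeout("tm-1", stale); // stale timeout arrives late
        System.out.println("stale timeout ignored: " + !unregistered
                + ", current attempt still pending: " + !current.isDone());
    }
}
```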
…ayedRegisterTaskExecutor

Use latches instead of timeouts/sleeps to test problematic thread interleaving.

This closes #8415.
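The latch-based test coordination mentioned above follows a common pattern, sketched here as a generic example rather than the actual test in the PR: block on a `CountDownLatch` instead of sleeping and hoping the other thread has reached the interesting point.

```java
import java.util.concurrent.CountDownLatch;

// Generic sketch of latch-based coordination: the interleaving becomes
// deterministic, with no sleeps and no timing assumptions.
public class Main {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch reachedRacyPoint = new CountDownLatch(1);
        CountDownLatch mayProceed = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            reachedRacyPoint.countDown(); // signal: we are at the problematic point
            try {
                mayProceed.await();       // hold here until the test is ready
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        reachedRacyPoint.await();  // instead of Thread.sleep(...)
        // ... perform the action that must interleave exactly here ...
        mayProceed.countDown();
        worker.join();
        System.out.println("interleaving exercised deterministically");
    }
}
```

Sleep-based tests pass or fail depending on scheduler timing; latches make the problematic interleaving reproducible on every run.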
@klion26
Member

klion26 commented May 15, 2019

@sxganapa could you please close this PR, which attempts to merge release-1.8 into master?

@zentol zentol closed this May 16, 2019