
Conversation

@sxganapa

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4-node cluster with 2 JobManagers and 4 TaskManagers, running a stateful streaming program, killing one JobManager and two TaskManagers during the execution, and verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

twalthr and others added 30 commits February 25, 2019 12:49
The test had not actually been running since the class was refactored to use JUnit's Parameterized runner: it always ran into an NPE, and the NPE was then silently swallowed in a shutdown catch-block.

(cherry picked from commit 168660a)
The cause of the instability seems to be a not-so-rare timing issue: the thread that calls `interrupt()` on the main thread can still be running after its original test finishes, and then calls `interrupt()` during execution of the next test. This causes the next test's normal execution (or `sleep()` in this case) to be interrupted.
…ocksDBIncrementalRestoreOperation

(cherry picked from commit 6b9ec27)
… local state

This corrects a problem that was introduced with the refactorings in FLINK-10043.

This closes #7841.

(cherry picked from commit 7a078a6)
Add a dedicated onStart method to the RpcEndpoint, which is called when the RpcEndpoint
is started via the start() method. With this change it is no longer necessary for
users to override the start() method, which was error prone because it always required
calling super.start(). Now this contract is enforced explicitly. Moreover, it allows
the setup logic to be executed in the RpcEndpoint's main thread.
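The lifecycle contract described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not Flink's actual RpcEndpoint: the `Endpoint` class, its single-threaded executor, and the `close()` method are assumptions made for the example; only the idea of a final `start()` dispatching to an overridable `onStart()` in the main thread comes from the text.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified sketch of the lifecycle-hook pattern; not Flink code.
abstract class Endpoint {
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();

    // start() is final, so subclasses cannot forget to call super.start().
    // Their setup logic goes into onStart(), which runs in the endpoint's main thread.
    public final void start() {
        mainThread.execute(this::onStart);
    }

    protected abstract void onStart();

    public void close() throws InterruptedException {
        mainThread.shutdown();
        mainThread.awaitTermination(5, TimeUnit.SECONDS);
    }
}

public class Main {
    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean setupRan = new AtomicBoolean();
        Endpoint endpoint = new Endpoint() {
            @Override
            protected void onStart() {
                setupRan.set(true); // runs in the endpoint's single main thread
            }
        };
        endpoint.start();
        endpoint.close();
        System.out.println("onStart executed: " + setupRan.get());
    }
}
```

Making `start()` final turns the "remember to call super" convention into a compile-time guarantee.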
…ecovery

Wait until the Dispatcher has been started before adding new JobGraphs to the SubmittedJobGraphStore
…gnature

Prior to this commit, the CompositeTypeSerializerSnapshot class
signature was a bit confusing and contained raw types. Moreover, it
required subclasses to always erase types and re-cast.

This closes #7818.
…ot field / method names in InternalTimersSnapshot

This renaming corresponds to the fact that TypeSerializerConfigSnapshot
is now deprecated, and is fully replaced by TypeSerializerSnapshot.
…lization compatibility APIs for key / namespace serializer checks

This commit lets the InternalTimerServiceImpl properly use
TypeSerializerSchemaCompatibility /
TypeSerializerSnapshot#resolveSchemaCompatibility when attempting to
check the compatibility of new key and namespace serializers.

This also fixes the fact that this check was previously broken, in that the
key / namespace serializers were not reassigned to the reconfigured ones.
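The reassignment bug described above can be sketched in miniature. All the types here (`Serializer`, `Compatibility`, `TimerService`, `applyKeySerializerCheck`) are simplified stand-ins invented for this example, not Flink's actual TypeSerializer / TypeSerializerSchemaCompatibility API; only the core point, that the service must keep the reconfigured serializer after a compatible-after-reconfiguration result, comes from the commit message.

```java
// Illustrative stand-ins only; not Flink's actual serializer-compatibility API.
interface Serializer<T> {}

class Compatibility<T> {
    final boolean compatible;
    final Serializer<T> reconfigured; // non-null when reconfiguration was required
    Compatibility(boolean compatible, Serializer<T> reconfigured) {
        this.compatible = compatible;
        this.reconfigured = reconfigured;
    }
}

class TimerService<K> {
    Serializer<K> keySerializer;

    void applyKeySerializerCheck(Serializer<K> newSerializer, Compatibility<K> result) {
        if (!result.compatible) {
            throw new IllegalStateException("incompatible key serializer");
        }
        // The fix: keep the *reconfigured* serializer when one was produced,
        // rather than silently continuing with the unreconfigured one.
        keySerializer = result.reconfigured != null ? result.reconfigured : newSerializer;
    }
}

public class Main {
    public static void main(String[] args) {
        Serializer<String> proposed = new Serializer<String>() {};
        Serializer<String> reconfigured = new Serializer<String>() {};
        TimerService<String> timers = new TimerService<>();
        timers.applyKeySerializerCheck(proposed, new Compatibility<>(true, reconfigured));
        System.out.println("kept reconfigured: " + (timers.keySerializer == reconfigured));
    }
}
```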
…uld not be serializing timers' key / namespace serializers anymore

The changes made to managed state, so that we no longer Java-serialize
serializers and only write the serializer snapshot, were not reflected in how
we snapshot timers. This was mainly because timers were not handled by state
backends in the past (and were therefore not managed state), but were instead
handled in an isolated manner by the InternalTimerServiceSerializationProxy.

This closes #7849.
…in CompositeTypeSerializerConfigSnapshot

We often want to get only the restored serializer snapshots from a
legacy CompositeTypeSerializerConfigSnapshot when attempting to redirect
compatibility checks to new snapshots. This commit adds a
getNestedSerializerSnapshots utility method for that purpose.
…lity method with SelfResolvingTypeSerializer implementation
… method using SelfResolvingTypeSerializer interface

Only the TtlSerializer needs to implement the
SelfResolvingTypeSerializer interface, because all other subclasses of
CompositeSerializer are test serializers.
carp84 and others added 24 commits April 29, 2019 10:24
…the incremental checkpoint code path

This closes #8297.

(cherry picked from commit 9aeb4e5)
…state loss for chained keyed operators

- Change
Changes the local data path from
`.../local_state_root/allocation_id/job_id/jobvertex_id_subtask_id/chk_id/rocksdb`
to
`.../local_state_root/allocation_id/job_id/jobvertex_id_subtask_id/chk_id/operator_id`

When preparing the local directory, Flink deletes the subtask's local directory if it already exists.
If more than one stateful operator is chained in a single task, they all share the same local
directory path, so the directory gets deleted unexpectedly and we get data loss.

This closes #8263.

(cherry picked from commit ee60846)
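The directory-collision fix above boils down to appending a per-operator component to the local state path. The following sketch is illustrative only: the class and method names, and the concrete path components, are assumptions made for this example, not Flink's actual local-state API.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: builds the per-operator local state directory described above,
// so two chained operators in one task no longer collide on a shared "rocksdb" leaf.
class LocalStatePaths {
    static Path forOperator(Path localStateRoot, String allocationId, String jobId,
                            String jobVertexAndSubtask, long checkpointId, String operatorId) {
        return localStateRoot
                .resolve(allocationId)
                .resolve(jobId)
                .resolve(jobVertexAndSubtask)
                .resolve("chk_" + checkpointId)
                .resolve(operatorId); // previously a fixed "rocksdb" leaf shared by the chain
    }
}

public class Main {
    public static void main(String[] args) {
        Path root = Paths.get("/tmp/local_state_root");
        // Two stateful operators chained in the same subtask and checkpoint:
        Path op1 = LocalStatePaths.forOperator(root, "alloc-1", "job-1", "vertex-1_0", 42, "op-a");
        Path op2 = LocalStatePaths.forOperator(root, "alloc-1", "job-1", "vertex-1_0", 42, "op-b");
        System.out.println("distinct per-operator dirs: " + (!op1.equals(op2)));
    }
}
```

With a shared leaf, preparing the second operator's directory would wipe the first operator's state; distinct leaves make the delete-if-exists step safe.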
Jar caching is not required since the jars are rebuilt in the test profiles anyway.
Run dependency convergence in the main compile run;
invoking Maven once per module requires significant time.
Run convergence in the install phase (i.e., after the shade plugin) to work against dependency-reduced POMs.
- fix find -mindepth parameter
- pass PROFILE to maven to prevent downloads of modules that weren't built beforehand
- add -maxdepth parameter for pom.xml searches
…r#jobReachedGloballyTerminalState fails

FutureUtils#assertNoException will assert that the given future has not been completed
exceptionally. If it has been completed exceptionally, then it will call the
FatalExitExceptionHandler.

This commit uses assertNoException to assert that the Dispatcher#jobReachedGloballyTerminalState
method has not failed.

This closes #8334.
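The guard described above can be sketched as a small helper. This is a hedged approximation of the idea, not Flink's actual FutureUtils: the real assertNoException escalates to FatalExitExceptionHandler, whereas here the handler is injected so the behavior is observable without exiting the JVM.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

// Sketch: if the guarded future completes exceptionally, escalate to a fatal handler.
final class FutureGuards {
    static <T> void assertNoException(CompletableFuture<T> future,
                                      Consumer<Throwable> fatalHandler) {
        future.whenComplete((value, error) -> {
            if (error != null) {
                fatalHandler.accept(error); // Flink would call FatalExitExceptionHandler here
            }
        });
    }
}

public class Main {
    public static void main(String[] args) {
        AtomicReference<Throwable> fatal = new AtomicReference<>();

        CompletableFuture<Void> guarded = new CompletableFuture<>();
        FutureGuards.assertNoException(guarded, fatal::set);
        guarded.completeExceptionally(new IllegalStateException("terminal-state handling failed"));

        System.out.println("fatal handler invoked: " + (fatal.get() != null));
    }
}
```

Failing loudly here is deliberate: an exceptionally completed future in this code path indicates a bug that should terminate the process rather than be swallowed.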
…ontainer requests

Flink's YarnResourceManager sets a faster heartbeat interval when it is requesting containers
from Yarn's ResourceManager. Since requests and responses are transported via heartbeats, this
speeds up requests. However, it can also put additional load on Yarn due to excessive container
requests. Therefore, this commit introduces a config option to control this heartbeat interval.
We now use Scala reflection because it correctly deals with Scala
language features.
@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

hwanju and others added 2 commits May 14, 2019 08:56
…eout and race

TaskExecutor registration is an asynchronous process, which allows a retry
issued after a timeout to be processed ahead of the earlier request. Such a
delayed, timed-out request can accidentally unregister a valid task manager,
whose slots are then permanently not reported to the job manager. This patch
introduces ongoing task executor futures to prevent this race.
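The "ongoing futures" guard can be sketched as follows. This is a hypothetical simplification, not the actual ResourceManager code: `RegistrationTracker`, its method names, and the string task-executor id are assumptions made for the example; only the idea of tracking the current in-flight attempt so a stale timeout cannot unregister a valid task executor comes from the text.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch (not Flink code): each new registration attempt replaces
// the previous in-flight future, so a timeout firing for a stale attempt is
// ignored instead of unregistering a task executor that has since re-registered.
class RegistrationTracker {
    private final ConcurrentHashMap<String, CompletableFuture<Void>> ongoing =
            new ConcurrentHashMap<>();

    CompletableFuture<Void> startAttempt(String taskExecutorId) {
        CompletableFuture<Void> attempt = new CompletableFuture<>();
        CompletableFuture<Void> previous = ongoing.put(taskExecutorId, attempt);
        if (previous != null) {
            previous.cancel(false); // a late timeout of the old attempt becomes a no-op
        }
        return attempt;
    }

    boolean onTimeout(String taskExecutorId, CompletableFuture<Void> attempt) {
        // Unregister only if this attempt is still the current one.
        if (ongoing.get(taskExecutorId) == attempt
                && attempt.completeExceptionally(new TimeoutException())) {
            ongoing.remove(taskExecutorId, attempt);
            return true; // caller would unregister the task executor here
        }
        return false;
    }
}

public class Main {
    public static void main(String[] args) {
        RegistrationTracker tracker = new RegistrationTracker();
        CompletableFuture<Void> stale = tracker.startAttempt("tm-1");
        CompletableFuture<Void> current = tracker.startAttempt("tm-1"); // retry in flight
        boolean unregistered = tracker.onTimeout("tm-1", stale); // stale timeout arrives late
        System.out.println("stale timeout ignored: " + !unregistered
                + ", current attempt still pending: " + !current.isDone());
    }
}
```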
…ayedRegisterTaskExecutor

Use latches instead of timeouts/sleeps to test problematic thread interleaving.

This closes #8415.
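The latch-based test coordination mentioned above follows a common pattern, sketched here as a generic example rather than the actual test in the PR: block on a `CountDownLatch` instead of sleeping and hoping the other thread has reached the interesting point.

```java
import java.util.concurrent.CountDownLatch;

// Generic sketch of latch-based coordination: the interleaving becomes
// deterministic, with no sleeps and no timing assumptions.
public class Main {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch reachedRacyPoint = new CountDownLatch(1);
        CountDownLatch mayProceed = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            reachedRacyPoint.countDown(); // signal: we are at the problematic point
            try {
                mayProceed.await();       // hold here until the test is ready
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();

        reachedRacyPoint.await();  // instead of Thread.sleep(...)
        // ... perform the action that must interleave exactly here ...
        mayProceed.countDown();
        worker.join();
        System.out.println("interleaving exercised deterministically");
    }
}
```

Sleep-based tests pass or fail depending on scheduler timing; latches make the problematic interleaving reproducible on every run.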
@klion26
Member

klion26 commented May 15, 2019

@sxganapa could you please close this PR, which attempts to merge release-1.8 into master?

@zentol zentol closed this May 16, 2019