@magic20191

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store on job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

tillrohrmann and others added 30 commits September 27, 2018 18:15
Sometimes it can happen that Netty does not properly initialize the channel
pipeline when sending a request from the RestClient. In this situation, we
need to fail the response so that the caller will be notified about the
unsuccessful call.
This adds tests to verify that jobs can be cancelled properly when there are
standby masters. To enable this, we added tests to install Flink as a
standalone cluster. The first two DB nodes will be running the master
processes, while the others are running the TaskManagers. All Flink processes
are supervised by runit, which allows the nemesis to kill Flink
processes.

The client now implements a cancel operation. The model used by the checker
had to be rewritten to address the fact that the job can be canceled. The
cancel operation is "reliable", i.e., it either cancels the job successfully,
or it fails the whole test fatally. This way we can be sure that the job will
eventually not be running if the cancel operation completes successfully.

This closes #6712.
Dynamic arguments should be prefixed with -D. Apart from that, the config key is not
needed because it is already set in the flink-conf.yaml.
Previously the Mesos tests were disabled due to FLINK-9936, which is resolved
now. There are currently no known issues with Mesos so it is justified to enable
the tests.
Update the common methods to the new testing harness.
…ints by configuration

[FLINK-10371][tests] Regenerate configuration docs

[FLINK-10371] Adapt to code review

This closes #6727.
…ompletedCheckpointStore"

This reverts commit 6f570e7.

This closes #6704
zjffdu and others added 26 commits January 11, 2019 10:27
Logging configuration was set only for scala-shell in yarn mode. This commit sets the configuration for local and remote mode in start-scala-shell.sh script as well.
…location

This commit removes container requests after containers have been allocated. This prevents us
from requesting more and more containers from Yarn in case of a recovery.

Since we cannot rely on the reported container Resource, we remove the container request by
using the requested Resource. This is due to Yarn's DefaultResourceCalculator, which neglects the
number of vCores when allocating containers.
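
The matching problem described in this commit message can be illustrated with a minimal sketch. It is not Flink's or Yarn's actual code: `Resource`, `PendingRequests`, and the concrete memory and vCore values are hypothetical names and numbers chosen for illustration. It only assumes that the Resource reported for an allocated container may differ in vCores from the Resource that was requested.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical stand-in for a container resource (not the Yarn class)."""
    memory_mb: int
    vcores: int


class PendingRequests:
    """Hypothetical bookkeeping of outstanding container requests."""

    def __init__(self):
        self._pending = []

    def add(self, requested):
        self._pending.append(requested)

    def remove(self, resource):
        """Remove one pending request equal to `resource`; report success."""
        if resource in self._pending:
            self._pending.remove(resource)
            return True
        return False

    def __len__(self):
        return len(self._pending)


# Assumed values: we asked for 4 vCores, but the allocated container
# reports 1 because the calculator ignored vCores during allocation.
requested = Resource(memory_mb=1024, vcores=4)
reported = Resource(memory_mb=1024, vcores=1)

pending = PendingRequests()
pending.add(requested)

# Matching on the reported Resource finds nothing, so the request would linger.
removed_by_reported = pending.remove(reported)
# Matching on the requested Resource removes the request as intended.
removed_by_requested = pending.remove(requested)
```

In this sketch a request that is never removed stays pending, which corresponds to the ever-growing number of container requests that the commit message describes.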
Added a check whether minikube is running. If it is not, we try to start it a couple of times. If we do not succeed, we fail with a descriptive message.
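
The check-then-retry behavior described here could look roughly like the following sketch. It is not the actual test script: `ensure_running`, its parameters, the retry count, and the error text are all assumptions, and the real check and start commands are passed in as callables rather than invoked directly.

```python
import time


def ensure_running(is_running, start, max_attempts=3, delay_seconds=0.0):
    """Return the number of start attempts that were needed (0 if already up).

    `is_running` and `start` are caller-supplied callables (hypothetical);
    in a real script they would wrap the actual status and start commands.
    """
    if is_running():
        return 0
    for attempt in range(1, max_attempts + 1):
        start()
        if is_running():
            return attempt
        time.sleep(delay_seconds)  # give the cluster time before retrying
    raise RuntimeError(
        f"Could not start minikube after {max_attempts} attempts; "
        "please check the local minikube installation."
    )
```

Passing the check and start actions as arguments keeps the retry logic testable without a real cluster.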
…Large State`

This closes #7603.

(cherry picked from commit 3abb3de)
…LifeCycle

Split #testConsumerLifeCycle into two methods which represent the two if-else
branches.

This closes #7606.
…izedTaskInformation in class TaskDeploymentDescriptor
The cause of the instability seems to be a not-so-rare timing issue: the thread
that calls `interrupt()` on the main thread can still be running after its
original test finishes, and then calls `interrupt()` during the execution of
the next test. This interrupts the normal execution (or `sleep()` in this case).
@flinkbot
Collaborator

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@klion26
Member

klion26 commented May 10, 2019

@magic20191 Could you please close this PR? It tries to merge release-1.6 into master, which is not needed.

@zentol zentol closed this May 10, 2019