
Conversation

@yangsongbai

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • Deployment RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache (see the sketch below)
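
A minimal sketch of this example scenario, with entirely hypothetical names (Flink's real blob-server API is different):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of the example above: upload once, ship only a small key over RPC,
 * resolve the key on the receiving side. All names here are invented and do
 * not reflect Flink's actual blob-server API.
 */
final class BlobDeploymentSketch {
    private final Map<String, byte[]> blobStore = new ConcurrentHashMap<>();

    /** At job creation time: persist the serialized TaskInfo once. */
    String uploadTaskInfo(byte[] serializedTaskInfo) {
        String key = UUID.randomUUID().toString();
        blobStore.put(key, serializedTaskInfo);
        return key;
    }

    /** On each (re)deployment: the RPC message carries only this key. */
    byte[] fetchTaskInfo(String key) {
        // A real implementation would check a local blob cache first and
        // download from the blob server only on a cache miss.
        return blobStore.get(key);
    }
}
```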

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4-node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

azagrebin and others added 30 commits November 16, 2018 21:14
…tests

This avoids issues with eventual consistency/visibility on S3.
In order to not leak secrets we should not print the configuration in docker-entrypoint.sh.
Wait until all Executions reach the state DEPLOYING instead of having a resource assigned.
…onfigOptions from documentation

The annotation Documentation.ExcludeFromDocumentation can be used to annotate ConfigOptions
that should not be included in the documentation.
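
A hedged usage sketch of the annotation; the option below is invented for illustration, and the exact builder API may differ by Flink version:

```java
import org.apache.flink.annotation.docs.Documentation;
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;

public class InternalOptions {
    // Hypothetical option; the annotation keeps it out of the generated docs.
    @Documentation.ExcludeFromDocumentation
    public static final ConfigOption<String> HIDDEN_OPTION =
        ConfigOptions.key("internal.hidden-option").noDefaultValue();
}
```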
…rom documentation

This commit excludes the JobManagerOptions#EXECUTION_FAILOVER_STRATEGY from Flink's
configuration documentation.
This commit removes the Scala-based Kafka010Example from the flink-end-to-end-tests/
flink-streaming-kafka010-test module. Moreover, it adds relative paths to the
parent poms and cleans up the flink-streaming-kafka*/pom.xml files.
…link-connector-kafka

In order to satisfy dependency convergence we need to exclude the kafka-clients from the base
flink-connector-kafka-x dependency in every flink-connector-kafka-y module.
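
A hedged pom.xml sketch of such an exclusion; the artifact id and version below are illustrative:

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-base_2.11</artifactId>
  <version>${project.version}</version>
  <exclusions>
    <!-- Exclude the transitively pulled-in client so that this module's own
         kafka-clients version wins and dependency convergence is satisfied. -->
    <exclusion>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka-clients</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```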
This commit fixes the wrong result type of interval joins in
the DataStream Scala API. The API did not call the Scala type
extraction stack but was using the default one offered by the
DataStream Java API.

This closes #7141.
Because state clean up happens in processing time, it might be
the case that retractions are arriving after the state has
been cleaned up. Before these changes, a new accumulator was
created and invalid retraction messages were emitted. This
change drops retraction messages for which no accumulator
exists. It also refactors the tests and adds more test
cases and explanations.

This closes #7147.
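
A minimal sketch of the dropping logic, using a plain map instead of Flink's keyed state and invented names throughout:

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch (not Flink's actual aggregation code) of dropping late retractions. */
public class RetractionDropSketch {
    private final Map<String, Long> accumulators = new HashMap<>();

    /** Processes a record; isAccumulate=false means a retraction message. */
    public void processElement(String key, long value, boolean isAccumulate) {
        Long acc = accumulators.get(key);
        if (acc == null) {
            if (!isAccumulate) {
                // State was cleaned up in processing time; a retraction that
                // arrives afterwards has nothing to retract, so drop it rather
                // than creating a new accumulator and emitting invalid results.
                return;
            }
            acc = 0L;
        }
        accumulators.put(key, isAccumulate ? acc + value : acc - value);
    }
}
```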
…actSet

Instead of including every dependency and then limiting the set of included
files via a filter condition of the maven-shade-plugin, this commit defines
an artifact set of included dependencies. That way we will properly include
all files belonging to the listed dependencies (e.g. also the NOTICE file).

This closes #7176.
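
A hedged maven-shade-plugin sketch of the artifactSet approach; the included coordinate is a placeholder:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <!-- List the dependencies to bundle explicitly, instead of including
             everything and filtering files out afterwards. This keeps all
             files of each listed artifact, e.g. its NOTICE file. -->
        <artifactSet>
          <includes>
            <include>com.example:bundled-dependency</include>
          </includes>
        </artifactSet>
      </configuration>
    </execution>
  </executions>
</plugin>
```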
Do not use the /tmp directory for log files or the HDFS data directory.

Reconfigure dfs.replication to 1 because file availability is irrelevant in
tests.

Increase heap size of HDFS DataNodes and NameNode.

Change the find-files! function to not fail if the directory does not exist.
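
For illustration, a hedged hdfs-site.xml fragment matching the replication change:

```xml
<!-- hdfs-site.xml (sketch): a single replica suffices for test clusters,
     since availability of the test data does not matter. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```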
dawidwys and others added 27 commits March 5, 2019 11:13
The test had not actually run since the class was refactored to use JUnit's Parameterized runner: it always ran into an NPE, which was then silently swallowed in a shutdown catch-block.
…nd MiniCluster

The io executor is responsible for running io operations like discarding checkpoints.
By using the io executor, we don't risk that the RpcService is blocked by blocking
io operations.

This closes #7926.
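
A minimal sketch of the pattern with a plain JDK executor; the real change wires Flink's io executor into the Dispatcher and MiniCluster:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Sketch: offload blocking io so the RPC main thread stays responsive. */
final class IoExecutorSketch {
    // Dedicated pool for blocking io such as discarding checkpoint files.
    private final ExecutorService ioExecutor = Executors.newFixedThreadPool(4);

    CompletableFuture<Void> discardCheckpointAsync(Runnable discardAction) {
        // The potentially slow filesystem operation runs on the io pool, so
        // the RpcService can keep processing other messages meanwhile.
        return CompletableFuture.runAsync(discardAction, ioExecutor);
    }
}
```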
… CoGroupedStreams.UnionSerializer

UnionSerializer did not perform a proper duplication of inner serializers. It also violated the assumption that createInstance never produces null.
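
A hedged sketch of the duplication contract on an invented wrapper serializer (not the actual UnionSerializer): inner serializers must be deep-duplicated, and this may only be returned if all duplicates are identical to the originals:

```java
import org.apache.flink.api.common.typeutils.TypeSerializer;

/** Sketch of proper duplication for a serializer wrapping two inner ones. */
final class PairSerializerSketch<A, B> {
    private final TypeSerializer<A> first;
    private final TypeSerializer<B> second;

    PairSerializerSketch(TypeSerializer<A> first, TypeSerializer<B> second) {
        this.first = first;
        this.second = second;
    }

    PairSerializerSketch<A, B> duplicate() {
        TypeSerializer<A> firstDup = first.duplicate();
        TypeSerializer<B> secondDup = second.duplicate();
        // Share this instance only if both inner serializers are stateless,
        // i.e. their duplicate() returned the very same objects.
        if (firstDup == first && secondDup == second) {
            return this;
        }
        return new PairSerializerSketch<>(firstDup, secondDup);
    }
}
```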
The original intent was to never reach this code path except on programmer errors, but it has turned into an accepted code path for unhandled exceptions.
We've seen quite some flakiness in the end-to-end tests on Travis
lately. In most tests it takes about 5-8 seconds for the dispatcher to come
up, so 10 seconds might be too low.
…shipCall

Fix the race condition between executing EmbeddedLeaderService#GrantLeadershipCall
and a concurrent shutdown of the leader service by making GrantLeadershipCall not
access mutable state outside of a lock.

This closes #7937.
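
A minimal sketch of the locking discipline, with hypothetical names rather than the EmbeddedLeaderService code:

```java
/** Sketch: check the shutdown flag and mutate state under one lock. */
final class GrantLeadershipSketch {
    private final Object lock = new Object();
    private boolean running = true;
    private String currentLeaderId;

    void grantLeadership(String leaderId) {
        synchronized (lock) {
            // Reading 'running' and writing the leader id inside the same
            // lock closes the race with a concurrent shutdown().
            if (!running) {
                return;
            }
            currentLeaderId = leaderId;
        }
    }

    void shutdown() {
        synchronized (lock) {
            running = false;
            currentLeaderId = null;
        }
    }
}
```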
Due to changes how the slot futures are completed and due to the fact that the
ResultConjunctFuture does not maintain the order in which the futures were specified,
it could happen that executions were not deployed in topological order. This commit
fixes this problem by changing the ResultConjunctFuture so that it maintains the order
of the specified futures in its result collection.

This closes #8065.
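
A hedged sketch of an order-preserving conjunct future built on plain CompletableFuture; the names are invented:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Sketch: results are collected by input position, not completion order. */
final class OrderPreservingConjunct {
    static <T> CompletableFuture<List<T>> combineAll(List<CompletableFuture<T>> futures) {
        CompletableFuture<Void> all =
            CompletableFuture.allOf(futures.toArray(new CompletableFuture<?>[0]));
        return all.thenApply(ignored -> {
            // Iterating the input list keeps the original (e.g. topological)
            // order in the result, whatever order the futures completed in.
            List<T> results = new ArrayList<>(futures.size());
            for (CompletableFuture<T> future : futures) {
                results.add(future.join());
            }
            return results;
        });
    }
}
```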
1. Rename the section title from `standalone` to `Flink session cluster on Mesos`.
2. Add one new section `Flink job cluster on Mesos` which describes the usage of mesos-appmaster-job.sh.

This closes #8084.
…ault

This commit updates the flink-conf.yaml to contain the new rest options and comments
out rest.port by default.
This commit changes YarnEntrypointUtils#loadConfiguration so that it only
sets RestOptions.BIND_PORT to 0 if it has not been specified. This allows
explicitly setting a port range for Yarn applications which run behind a
firewall, for example.
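
A hedged sketch of the set-only-if-absent logic; the surrounding class and method are invented, while Configuration and RestOptions.BIND_PORT are Flink's real types:

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;

final class BindPortDefaulting {
    static void applyDefault(Configuration configuration) {
        // Fall back to an ephemeral port only when the user has not set an
        // explicit port or port range (needed e.g. behind firewalls on Yarn).
        if (!configuration.contains(RestOptions.BIND_PORT)) {
            configuration.setString(RestOptions.BIND_PORT, "0");
        }
    }
}
```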

flinkbot commented Apr 6, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier
