
Conversation

@Paper-plane123

What is the purpose of the change

(For example: This pull request makes task deployment go through the blob server, rather than through RPC. That way we avoid re-transferring them on each deployment (during recovery).)

Brief change log

(for example:)

  • The TaskInfo is stored in the blob store at job creation time as a persistent artifact
  • Deployments RPC transmits only the blob storage reference
  • TaskManagers retrieve the TaskInfo from the blob cache
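
A rough illustration of that example workflow follows; all names here are hypothetical stand-ins for illustration, not real Flink APIs:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    final class BlobDeploymentSketch {

        // Stand-in for a persistent blob store keyed by an opaque reference.
        static final Map<String, byte[]> BLOB_STORE = new ConcurrentHashMap<>();

        // On job creation: persist the serialized TaskInfo once; afterwards
        // only this small key travels over RPC, not the payload itself.
        static String storeTaskInfo(byte[] serializedTaskInfo) {
            String blobKey = "task-info-" + java.util.Arrays.hashCode(serializedTaskInfo);
            BLOB_STORE.put(blobKey, serializedTaskInfo);
            return blobKey;
        }

        // On the TaskManager: resolve the reference from the (cached) blob store.
        static byte[] fetchTaskInfo(String blobKey) {
            return BLOB_STORE.get(blobKey); // a real cache would fetch on miss
        }
    }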

Verifying this change

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4-node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

StephanEwen and others added 30 commits July 26, 2019 19:16
FLINK-13249 was a bug where a deadlock occurred when the network thread got blocked on a lock
while requesting partitions to be read by remote channels. The test mimics that situation
to guard the fix applied in an earlier commit.
…stUtils

The method Unsafe.defineClass() is removed in Java 11. To support Java 11, we rework the method
"CommonTestUtils.createClassNotInClassPath()" to use a different mechanism.

This commit now writes the class byte code out to a temporary file and creates a new URLClassLoader that
loads the class from that file. That solution is not a complete drop-in replacement, because it cannot
add the class to an existing class loader, but can only create a new pair of (classloader & new-class-in-that-classloader).
Because of that, the commit also adjusts the existing tests to work with that new mechanism.

This closes #9251
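
A hedged sketch of the temp-file mechanism described above (the class and method names are illustrative, not the actual CommonTestUtils code):

    import java.io.IOException;
    import java.net.URL;
    import java.net.URLClassLoader;
    import java.nio.file.Files;
    import java.nio.file.Path;

    final class TempFileClassLoading {

        // Writes the class byte code to a temp directory laid out like a
        // class path, then builds a fresh URLClassLoader over it. The class
        // ends up in a NEW class loader; it cannot be injected into an
        // existing one.
        static ClassLoader writeAndLoad(String className, byte[] byteCode) throws IOException {
            Path tempDir = Files.createTempDirectory("generated-classes");
            Path classFile = tempDir.resolve(className.replace('.', '/') + ".class");
            Files.createDirectories(classFile.getParent());
            Files.write(classFile, byteCode);
            return new URLClassLoader(new URL[] {tempDir.toUri().toURL()});
        }
    }
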
…ing partition request

On the producer side, the netty handler receives the CancelPartitionRequest for releasing the SubpartitionView resource.
In the previous implementation we tried to find the corresponding view via the available queue in PartitionRequestQueue. But
in reality the view does not always stay in that queue, so the view would never be released.

Furthermore, the release of ResultPartition/ResultSubpartitions is based on the reference counter in ReleaseOnConsumptionResultPartition,
but while handling the CancelPartitionRequest in PartitionRequestQueue, the ReleaseOnConsumptionResultPartition is never
notified of the consumed subpartition. That means the reference counter would never decrease to 0 to trigger the partition release,
which would cause a file resource leak in the case of BoundedBlockingSubpartition.

To fix the above two issues, the corresponding view is now released via the queue of all readers instead, which also calls
ReleaseOnConsumptionResultPartition#onConsumedSubpartition and thereby solves this bug.
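
A minimal sketch of the reference-counting release described above (illustrative only, not the actual ReleaseOnConsumptionResultPartition code):

    import java.util.concurrent.atomic.AtomicInteger;

    final class RefCountedPartition {

        private final AtomicInteger pendingSubpartitions;
        private final Runnable releaseAction;

        RefCountedPartition(int numSubpartitions, Runnable releaseAction) {
            this.pendingSubpartitions = new AtomicInteger(numSubpartitions);
            this.releaseAction = releaseAction;
        }

        // Must be called once per consumed (or cancelled) subpartition. If a
        // cancellation path skips this call, the counter never reaches zero
        // and the partition's resources (e.g. the files of a blocking
        // subpartition) leak.
        void onConsumedSubpartition() {
            if (pendingSubpartitions.decrementAndGet() == 0) {
                releaseAction.run();
            }
        }
    }
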
Currently, test cases fail when trying to close the output stream if all data has been written
but a ClosedByInterruptException occurs in the ending phase. This commit fixes that.

This closes #9235
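
A small sketch of the kind of close-path tolerance the commit describes, assuming a FileChannel-based stream (illustrative, not the actual fix):

    import java.io.IOException;
    import java.nio.channels.ClosedByInterruptException;
    import java.nio.channels.FileChannel;

    final class InterruptTolerantClose {

        // If the writing thread is interrupted while closing, the channel
        // throws ClosedByInterruptException even though every byte was
        // already written. Treat that as a successful close and restore the
        // interrupt flag for the caller.
        static void close(FileChannel channel) throws IOException {
            try {
                channel.close();
            } catch (ClosedByInterruptException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
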
Only kill the Yarn application if it does not terminate properly.

This closes #9175.
…ed memory size into wrong configuration instance.

[FLINK-13241][yarn][test] Update YarnResourceManagerTest#testCreateSlotsPerWorker to compute tmCalculatedResourceProfile based on the RM altered configuration.

[FLINK-13241][yarn][test] Update YarnConfigurationITCase to verify that TMs are started with correct managed memory size.

[FLINK-13241][runtime] Calculate and set the managed memory size outside of the ResourceManager.

[FLINK-13241][runtime/yarn][test] Move YarnResourceManagerTest#testCreateSlotsPerWorker to ResourceManagerTest#testCreateWorkerSlotProfiles, and update it to verify slot profile calculation with a determinate managed memory size.

[FLINK-13241][runtime] Move getResourceManagerConfiguration from ResourceManagerFactory to ResourceManagerUtil.

This closes #9246.
docete and others added 26 commits August 9, 2019 11:05
… function and DIV(), DIV_INT() function from blink planner

This commit removes the BITAND, BITOR, BITNOT, BITXOR scalar functions because they are not standard.
It also removes DIV() and DIV_INT() because we already have the "/" and "/INT" operators.
… keep it compatible with old planner

The AVG aggregate function in the blink planner always returned a double/decimal type, which is not standard.
…de of "explainTerms" to generate operator names
…to keep it compatible with old planner

CONCAT(string1, string2, ...) should return NULL if any argument is NULL.
CONCAT_WS(sep, string1, string2, ...) should return NULL if sep is NULL, and it automatically skips NULL arguments.
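
A minimal Java sketch of these NULL semantics (illustrative, not the planner's implementation):

    import java.util.StringJoiner;

    final class ConcatSemantics {

        // CONCAT returns NULL if any argument is NULL.
        static String concat(String... args) {
            StringBuilder sb = new StringBuilder();
            for (String a : args) {
                if (a == null) {
                    return null;
                }
                sb.append(a);
            }
            return sb.toString();
        }

        // CONCAT_WS returns NULL only if the separator is NULL; NULL
        // arguments are skipped rather than propagated.
        static String concatWs(String sep, String... args) {
            if (sep == null) {
                return null;
            }
            StringJoiner joiner = new StringJoiner(sep);
            for (String a : args) {
                if (a != null) {
                    joiner.add(a);
                }
            }
            return joiner.toString();
        }
    }
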
…RING type instead of BINARY

This fixes the behavior of FROM_BASE64() to align with the old planner.
…ALUE(), SUBSTR() builtin functions which are not standard.

LENGTH, SUBSTR, KEYVALUE can be covered by existing functions, e.g. CHAR_LENGTH, SUBSTRING, STR_TO_MAP(str)[key].
…nCallResolver for class name more meaningful.

This closes #9281
… stream group aggregate in FlinkRelMdColumnInterval

This closes #9346
…nt toString method to explain more info

This closes #9347
…base crashes sql-client

Avoid crashing the sql-client when switching to a non-existing catalog or database.

This closes #9399.
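
A minimal sketch of that defensive behavior; the CatalogManager interface here is a hypothetical stand-in, not the actual sql-client code:

    final class UseCatalogCommand {

        // Hypothetical stand-in for the catalog API the CLI talks to;
        // setCurrentCatalog throws if the catalog is unknown.
        interface CatalogManager {
            void setCurrentCatalog(String name);
        }

        // Report the failure to the user and keep the CLI loop alive instead
        // of letting the exception propagate and crash the client.
        static void callUseCatalog(CatalogManager catalogManager, String name) {
            try {
                catalogManager.setCurrentCatalog(name);
            } catch (RuntimeException e) {
                System.err.println("Could not switch catalog: " + e.getMessage());
            }
        }
    }
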
Hive documentation is currently spread across a number of pages and fragmented. In particular:

- An example was added to getting-started/examples; however, that section is being removed
- There is a dedicated page on Hive integration, but a lot of Hive-specific information also lives on the catalog page

This closes #9308.
Fix the issue that Flink cannot access Hive tables with decimal columns.

This closes #9390.
…in blink planner to fix TPC-H e2e test failed

This closes #9427
@flinkbot
Collaborator

flinkbot commented Nov 20, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 5a5966b (Wed Dec 04 15:52:27 UTC 2019)

Warnings:

  • 148 pom.xml files were touched: Check for build and licensing issues.
  • Invalid pull request title: No valid Jira ID provided

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.

Details
The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis to re-run the last Travis build
