
[BEAM-9295] Add Flink 1.10 build target and Make FlinkRunner compatible with Flink 1.10 #10945

Merged
merged 1 commit into apache:master on Mar 11, 2020

Conversation

sunjincheng121
Member

Add Flink 1.10 build target and Make FlinkRunner compatible with Flink 1.10

Post-Commit Tests Status (on master branch)

Build status badge table (rows: Go, Java, Python, XLang; columns: SDK, Apex, Dataflow, Flink, Gearpump, Samza, Spark); badge images omitted.

Pre-Commit Tests Status (on master branch)

Build status badge table (rows: Non-portable, Portable; columns: Java, Python, Go, Website); badge images omitted.

See .test-infra/jenkins/README for the trigger phrases, status, and links of all Jenkins jobs.

@sunjincheng121
Member Author

Run CommunityMetrics PreCommit

@sunjincheng121
Member Author

Run Python2_PVR_Flink PreCommit

@iemejia iemejia requested a review from mxm February 24, 2020 08:29
Contributor

@mxm mxm left a comment

There is a large amount of code duplication. Is that really necessary? AFAIK the changes between 1.9 and 1.10 should be minimal. Could you please revise this or explain why it is necessary?

@sunjincheng121
Member Author

There is a large amount of code duplication. Is that really necessary? AFAIK the changes between 1.9 and 1.10 should be minimal. Could you please revise this or explain why it is necessary?

There are a lot of changes related to the job client API (https://issues.apache.org/jira/browse/FLINK-14392, https://issues.apache.org/jira/browse/FLINK-14376) in 1.10. I have added a comment at the header of each test case copied to 1.10 explaining the reason for making a copy of it. Could you take a look and see whether the reasoning makes sense to you? :)

@sunjincheng121
Member Author

Hi @mxm, I found that the test case test_large_elements failed with the exception:

java.io.IOException: Cannot write record to fresh sort buffer. Record too large.
        at org.apache.flink.runtime.operators.chaining.SynchronousChainedCombineDriver.collect(SynchronousChainedCombineDriver.java:176)
        at org.apache.flink.runtime.operators.util.metrics.CountingCollector.collect(CountingCollector.java:35)
        at org.apache.flink.runtime.operators.MapDriver.run(MapDriver.java:103)
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)  

I guess it's related to the feature "Unified Memory Configuration for TaskExecutors" (https://issues.apache.org/jira/browse/FLINK-13980), which was introduced in Flink 1.10. Before 1.10, the memory managed by Flink's MemoryManager was calculated dynamically if not configured, and I have checked that it would be about 2500 MB on my local machine. Since 1.10, it defaults to 128 MB if not configured (taskmanager.memory.managed.size).

I have performed a simple test, and the failing test test_large_elements passes after adding the following code:

flinkConfiguration.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("2048m"));

I'm still investigating the best way to address this issue on the Beam side. I would appreciate any suggestions on this :)
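
For reference, a minimal standalone sketch (not the Beam code path; the class name and the toy pipeline are made up for illustration) of raising the managed memory of a local Flink 1.10 MiniCluster before creating a batch environment:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

public class LocalManagedMemoryExample {

  public static void main(String[] args) throws Exception {
    // Since Flink 1.10 a local MiniCluster defaults to 128 MB of managed memory
    // (FLINK-15763), which is too small for very large records.
    Configuration conf = new Configuration();
    conf.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("2048m"));

    // The configuration is picked up by the embedded MiniCluster.
    ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);
    env.fromElements(1, 2, 3).print();
  }
}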

@@ -67,6 +69,7 @@ static ExecutionEnvironment createBatchExecutionEnvironment(

// depending on the master, create the right environment.
if ("[local]".equals(flinkMasterHostPort)) {
flinkConfiguration.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("2048m"));
Member Author

The default managed memory size was set to 128 MB for the MiniCluster in https://issues.apache.org/jira/browse/FLINK-15763. I have set it to a larger value when the master host is [local]. I would appreciate any suggestions on a better way to address this issue.
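
One possible refinement (only a sketch; the extra guard is my assumption, not necessarily what this PR does, and it reuses the flinkMasterHostPort and flinkConfiguration variables from the diff above) would be to raise the default only when the user has not configured the option explicitly:

// Only raise the MiniCluster default (128 MB since FLINK-15763) for local execution,
// and keep any managed memory size the user has configured explicitly.
if ("[local]".equals(flinkMasterHostPort)
    && !flinkConfiguration.contains(TaskManagerOptions.MANAGED_MEMORY_SIZE)) {
  flinkConfiguration.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, MemorySize.parse("2048m"));
}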

@sunjincheng121
Member Author

Run Java PreCommit

@sunjincheng121
Member Author

Run Python PreCommit

@sunjincheng121
Member Author

Test failure is not caused by the current PR.

Contributor

@mxm mxm left a comment

How about removing 1.7 first and then revising this PR? I don't feel comfortable right now with the amount of code duplication. Also, I suspect that this PR will break support for 1.7 due to the RemoteEnvironment removal.

@sunjincheng121
Member Author

@mxm Thanks for the review. I have updated the PR according to your comments. Currently it only copies 3 tests, and I think even if we drop 1.7 support, we would still need to copy these tests. What are your thoughts?

@sunjincheng121
Member Author

Run Spotless PreCommit

@sunjincheng121
Member Author

Run CommunityMetrics PreCommit

@mxm mxm self-requested a review March 5, 2020 14:29
Contributor

@mxm mxm left a comment

Thank you @sunjincheng121. It generally looks good to me. IMHO the test code duplication is not optimal and I would like to change that, but this could also be done in a follow-up. Could you squash the commits?

@mxm
Contributor

mxm commented Mar 9, 2020

retest this please

@sunjincheng121
Member Author

sunjincheng121 commented Mar 10, 2020

Thanks for the review @mxm! The suggestion about the test cases is great :), I have updated the PR accordingly :)

@@ -77,6 +67,7 @@ static ExecutionEnvironment createBatchExecutionEnvironment(

// depending on the master, create the right environment.
if ("[local]".equals(flinkMasterHostPort)) {
flinkConfiguration.setString("taskmanager.memory.managed.size", "2048m");
Contributor

I have the feeling this won't be reliable enough. Why not use taskmanager.memory.managed.fraction instead?

Member Author

The MiniCluster will set taskmanager.memory.managed.size to 128 MB if it's not set, so I think setting taskmanager.memory.managed.fraction doesn't take effect here. Thoughts? :)

Contributor

I'm hesitant with this default because it will always pre-allocate 2GB of memory which won't be used most of the time, except for the one large record test case you mentioned.

I'd go for something like https://github.com/apache/flink/blob/42a56f4c75693773e21fa2dea45df640c2d7f9da/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/TaskExecutorProcessUtils.java#L287 based on the memory available.

Actually, that is what the Flink 1.8 code used to do: https://github.com/apache/flink/blob/60d9b96456f142f8d18d5882016840a00159403e/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskManagerServices.java#L296

So let's just check the free memory and use a fraction of it for managed memory by default. What do you think?
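
For illustration, a rough sketch of that idea (the class and method names and the 0.5 fraction are assumptions for this example, not the actual Beam change):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.configuration.TaskManagerOptions;

final class LocalManagedMemoryDefaults {

  // Hypothetical fraction of the currently free JVM heap to hand to Flink's MemoryManager.
  private static final double MANAGED_MEMORY_FRACTION = 0.5;

  // Derive a managed memory default from the available heap instead of a fixed 2 GB,
  // but never override a size the user has configured explicitly.
  static void applyLocalManagedMemoryDefault(Configuration flinkConfiguration) {
    if (!flinkConfiguration.contains(TaskManagerOptions.MANAGED_MEMORY_SIZE)) {
      Runtime rt = Runtime.getRuntime();
      long freeHeapBytes = rt.maxMemory() - (rt.totalMemory() - rt.freeMemory());
      long managedBytes = (long) (freeHeapBytes * MANAGED_MEMORY_FRACTION);
      flinkConfiguration.set(TaskManagerOptions.MANAGED_MEMORY_SIZE, new MemorySize(managedBytes));
    }
  }

  private LocalManagedMemoryDefaults() {}
}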

Member Author

Thanks @mxm, sounds good to me ;)

Contributor

Cool, thanks for the changes.

@mxm
Contributor

mxm commented Mar 10, 2020

Thanks for making the adjustments. A couple more comments but otherwise it looks very good! Great work.

@sunjincheng121
Member Author

Run Python PreCommit

1 similar comment
@mxm
Contributor

mxm commented Mar 10, 2020

Run Python PreCommit

@sunjincheng121
Member Author

Run Python2_PVR_Flink PreCommit

@sunjincheng121
Member Author

Run Python PreCommit

@sunjincheng121
Member Author

Run Python2_PVR_Flink PreCommit

1 similar comment
@sunjincheng121
Member Author

Run Python2_PVR_Flink PreCommit

@sunjincheng121
Member Author

Run Java PreCommit

@mxm
Contributor

mxm commented Mar 11, 2020

Run Python2_PVR_Flink PreCommit

@mxm
Contributor

mxm commented Mar 11, 2020

Run Java PreCommit

@mxm
Contributor

mxm commented Mar 11, 2020

Run Flink ValidatesRunner

@mxm
Contributor

mxm commented Mar 11, 2020

Run Java Flink PortableValidatesRunner Streaming

@mxm
Contributor

mxm commented Mar 11, 2020

Run Java Flink PortableValidatesRunner Batch

@mxm
Contributor

mxm commented Mar 11, 2020

Run Java PreCommit

Contributor

@mxm mxm left a comment

Thank you @sunjincheng121. I'll merge this once the tests pass.

@mxm mxm merged commit be8b4ff into apache:master Mar 11, 2020
@sunjincheng121 sunjincheng121 deleted the BEAM-9285-PR branch March 12, 2020 01:29