[FLINK-1908] JobManager startup delay isn't considered when using start-cluster.sh script by lukasraska · Pull Request #609 · apache/flink

lukasraska · 2015-04-18T11:29:17Z

Creates dependency on netcat package (nc)

… 30 seconds timeout
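
For readers without the diff at hand, the idea of waiting for the JobManager's RPC port before starting the TaskManagers can be sketched as follows. This is a minimal illustration in Python, not the netcat-based change from this PR; the function name `wait_for_port` and its parameters are hypothetical.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connect to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of Flink or of this PR.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Connect and close immediately. No payload is sent, so this is
            # comparable to a plain "is the port open" check.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A start script could call something like `wait_for_port("localhost", 6123)` and only launch the TaskManagers when it returns `True`; port 6123 assumes the JobManager RPC port has been left at its default.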

mxm · 2015-04-20T09:07:47Z

Thanks for the pull request. Seems to work fine. I was wondering, shouldn't the task managers repeatedly try to establish a connection to the job manager? To me, that seems like a nicer way to solve this problem. That way, the startup script doesn't need to be aware of the job manager's RPC port.

tillrohrmann · 2015-04-20T12:44:42Z

The TaskManager uses an exponential backoff strategy to resolve connection
problems with the JobManager.
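
As a rough illustration of what an exponential backoff schedule looks like, here is a small sketch in Python. The initial delay, growth factor and cap are made-up values, and `backoff_delays` is a hypothetical name; none of this is taken from the TaskManager code.

```python
import itertools


def backoff_delays(initial=0.5, factor=2.0, max_delay=30.0):
    """Yield an endless sequence of wait times that grow geometrically.

    All defaults are illustrative assumptions, not Flink's actual settings.
    """
    delay = initial
    while True:
        yield delay
        # Grow the delay, but never beyond the configured cap.
        delay = min(delay * factor, max_delay)


# First five waits with a 4-second cap: 0.5, 1.0, 2.0, 4.0, 4.0
first_five = list(itertools.islice(backoff_delays(0.5, 2.0, 4.0), 5))
```

Each failed connection attempt waits longer than the previous one, so a slow JobManager is not flooded with retries while it is still starting up.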


rmetzger · 2015-04-20T12:50:23Z

flink-dist/src/main/flink-bin/bin/start-cluster.sh

Is Akka logging anything for these requests? (I suspect it's logging a WARNING that an invalid client tried to connect.)

@rmetzger
Since "-z" only checks whether the port is open (no actual payload is sent), nothing shows up in the logs. If you do send some data, it's correctly logged as WARN "incorrect header" by org.apache.flink.runtime.ipc.Server.

tillrohrmann · 2015-04-20T12:51:30Z

The TaskManager's maximum registration duration is configured by the config value taskmanager.maxRegistrationDuration. The default value is set to infinity for a dedicated Flink cluster.
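
To make the semantics of a maximum registration duration concrete, here is a hedged sketch in Python. Only the config key `taskmanager.maxRegistrationDuration` comes from this thread; the function `register_until`, its parameters, and the fixed retry interval are hypothetical and do not mirror the TaskManager implementation.

```python
import math
import time


def register_until(try_register, max_duration=math.inf, retry_interval=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Retry try_register() until it succeeds or max_duration would be exceeded.

    With max_duration=math.inf (the "infinity" case) the loop never gives up.
    clock and sleep are injectable so the behaviour can be checked without
    real waiting. The pacing is kept constant to focus on the duration limit.
    """
    start = clock()
    while True:
        if try_register():
            return True
        # Give up if the next wait would push us past the allowed duration.
        if clock() - start + retry_interval > max_duration:
            return False
        sleep(retry_interval)
```

With an infinite limit, a TaskManager started before the JobManager simply keeps retrying; with a finite limit it stops once the limit is reached, which is the only case in which termination would be expected.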

Therefore, I'm wondering what exactly the problem with the startup delay is? @DarkKnightCZ maybe you can elaborate a little bit more on the problem you had.

rmetzger · 2015-04-20T12:51:41Z

As far as I understood it, the goal of the change is to wait until the JM has been started before starting the TMs.
So the TMs would not start if the JM failed to start.

lukasraska · 2015-04-20T13:30:29Z

@tillrohrmann
The problem that occurred was that JM bound the IP:PORT with some delay, so TMs failed to start, since they couldn't connect.

When I tried it in a 5-node environment, sometimes 2 or 3 TMs failed because the JM wasn't ready yet. No subsequent checking was done; the TMs just stopped. I agree that the TM should indeed check several times whether the JM is available, so I will look into that as well.

tillrohrmann · 2015-04-20T22:13:47Z

@DarkKnightCZ that sounds strange. The TM should not terminate itself if it cannot connect to the JM unless the maximum registration duration has been configured. Could you link the log file of one of the failed TMs? That would allow us to investigate the problem more thoroughly.

StephanEwen · 2015-04-21T19:44:33Z

Forwarding comments from JIRA:

I think @DarkKnightCZ is using version 0.8.x and Till Rohrmann is talking about 0.9.
The startup is handled very differently in 0.9 and should actually fix the issue. The selection of the communication interface is in a backoff loop and should happen for many minutes before the TaskManager falls back to heuristics.

I don't think that this issue will be fixed in 0.8.x.

@DarkKnightCZ Can you verify whether 0.9 works for you?

lukasraska · 2015-04-25T11:46:48Z

@StephanEwen: Hi, yes, you're correct. In 0.9 it works as it should (i.e. tries connecting to JM several times)

So I guess this PR can be closed.

uce · 2015-04-29T13:24:49Z

@DarkKnightCZ, only you can close the PR. Could you do so?

[FLINK-1908] Use netcat to check if JobManager is accessible via RPC,…

a48eea0

… 30 seconds timeout

rmetzger reviewed Apr 20, 2015

lukasraska closed this Apr 29, 2015

rmetzger added the component=Runtime/Coordination label Mar 14, 2019

lukasraska wants to merge 1 commit into apache:master from lukasraska:FLINK-1908

6 participants