[FLINK-1908] JobManager startup delay isn't considered when using start-cluster.sh script #609

Closed
lukasraska wants to merge 1 commit into apache:master from lukasraska:FLINK-1908

Conversation

@lukasraska

Creates dependency on netcat package (nc)

@mxm
Contributor

mxm commented Apr 20, 2015

Thanks for the pull request. Seems to work fine. I was wondering, shouldn't the task managers repeatedly try to establish a connection to the job manager? To me, that seems like a nicer way to solve this problem. That way, the startup script doesn't need to be aware of the job manager's RPC port.

@tillrohrmann
Contributor

The TaskManager uses an exponential backoff strategy to resolve connection
problems with the JobManager.
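
For readers unfamiliar with the pattern, exponential backoff can be sketched as follows. This is a minimal illustration in Python, not Flink's actual implementation; the names `backoff_delays` and `retry_with_backoff` and all parameter values are hypothetical.

```python
import time
from typing import Callable, Iterator


def backoff_delays(initial: float, maximum: float, factor: float = 2.0) -> Iterator[float]:
    """Yield delays that grow by `factor` each time, capped at `maximum`."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def retry_with_backoff(
    attempt: Callable[[], bool],
    max_attempts: int,
    initial: float = 0.5,   # hypothetical value, not a Flink default
    maximum: float = 30.0,  # hypothetical value, not a Flink default
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `attempt` until it returns True or `max_attempts` is reached."""
    delays = backoff_delays(initial, maximum)
    for i in range(max_attempts):
        if attempt():
            return True
        # Do not sleep after the final failed attempt.
        if i < max_attempts - 1:
            sleep(next(delays))
    return False
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller that wants to retry indefinitely would loop without an attempt limit instead.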


Contributor

Is Akka logging anything for these requests? (I suspect it's logging a WARNING that an invalid client tried to connect?)

Author

@rmetzger
Since "-z" only does port pinging (no actual payload is sent), nothing is visible in the logs. (If you send some data, it's correctly logged as WARN "incorrect header" by org.apache.flink.runtime.ipc.Server.)
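
To illustrate what a payload-free port check amounts to, here is a rough Python sketch of the same idea: open a TCP connection and close it again without writing any bytes. The function name `port_is_open` and the timeout value are assumptions made for illustration; the PR itself uses `nc -z` from the shell script.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    The socket is closed right after the handshake and nothing is written
    to it, so the server never sees a (malformed) request body.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```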

@tillrohrmann
Contributor

The TaskManager's maximum registration duration is configured by the config value taskmanager.maxRegistrationDuration. The default value is set to infinity for a dedicated Flink cluster.

Therefore, I'm wondering what exactly the problem with the startup delay is? @DarkKnightCZ maybe you can elaborate a little bit more on the problem you had.

@rmetzger
Contributor

As far as I understood it, the goal of the change is to wait until the JM has been started before starting the TMs.
So the TMs would not start if the JM failed to start.
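
That gating behaviour can be sketched roughly as below: poll the JM port until a deadline, and launch the TMs only if the port came up. This is an illustrative Python sketch, not the PR's shell code; `wait_for_port`, `start_cluster`, the `start_task_managers` callback, and the timeout values are all hypothetical.

```python
import socket
import time
from typing import Callable


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll host:port until it accepts a TCP connection or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # Not listening yet; check the deadline and retry.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def start_cluster(
    host: str,
    port: int,
    start_task_managers: Callable[[], None],  # hypothetical launcher callback
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Start the TMs only once the JM port is reachable; otherwise start nothing."""
    if not wait_for_port(host, port, timeout, interval):
        return False
    start_task_managers()
    return True
```

The trade-off raised earlier in the thread still applies: the launcher has to know the JM's RPC port, which a retry loop inside the TM avoids.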

@lukasraska
Author

@tillrohrmann
The problem that occurred was that the JM bound to IP:PORT with some delay, so the TMs failed to start, since they couldn't connect.

When I tried it in a 5-node environment, sometimes 2 or 3 TMs failed because the JM wasn't ready at that point. No subsequent checks were made; the TMs just stopped. I agree that the TM should indeed check several times whether the JM is available, so I will look into that as well.

@tillrohrmann
Contributor

@DarkKnightCZ that sounds strange. The TM should not terminate itself if it cannot connect to the JM unless the maximum registration duration has been configured. Could you link the log file of one of the failed TMs? That would allow us to investigate the problem more thoroughly.

@StephanEwen
Contributor

Forwarding comments from JIRA:

I think @DarkKnightCZ is using version 0.8.x and Till Rohrmann is talking about 0.9.
The startup is handled very differently in 0.9 and should actually fix the issue. The selection of the communication interface is in a backoff loop and should happen for many minutes before the TaskManager falls back to heuristics.

I don't think that this issue will be fixed in 0.8.x.

@DarkKnightCZ Can you verify whether 0.9 works for you?

@lukasraska
Author

@StephanEwen: Hi, yes, you're correct. In 0.9 it works as it should (i.e. it tries connecting to the JM several times).

So I guess this PR can be closed.

@uce
Contributor

uce commented Apr 29, 2015

@DarkKnightCZ, only you can close the PR. Could you do so?
