[FLINK-1908] JobManager startup delay isn't considered when using start-cluster.sh script #609
lukasraska wants to merge 1 commit into apache:master
… 30 seconds timeout
|
Thanks for the pull request. Seems to work fine. I was wondering, shouldn't the task managers repeatedly try to build up a connection to the job manager? To me, that seems like a nicer way to solve this problem. That way, the startup script doesn't need to be aware of the job manager's RPC port. |
|
The TaskManager uses an exponential backoff strategy to resolve connection On Mon, Apr 20, 2015 at 11:07 AM, Max notifications@github.com wrote:
|
Is akka logging anything for these requests? (I suspect it's logging a WARNING that an invalid client tried to connect?)
@rmetzger
Since "-z" only does port-pinging (no actual payload is sent), nothing is visible in the logs (if you send some data, it's correctly logged as WARN "incorrect header" by org.apache.flink.runtime.ipc.Server)
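For readers following the thread, the port-pinging idea can be sketched roughly as below. This is not the patch from this PR, only a minimal illustration: the function name `wait_for_port`, the `PROBE_CMD` override, and the argument order are assumptions made for the example. The 30-second default mirrors the timeout mentioned in the PR description, and `nc -z` is the probe discussed above.

```shell
# Hypothetical helper, not the code from this pull request.
# Polls host:port once per second until the port accepts a connection
# or the timeout (in seconds) is reached.
wait_for_port() {
    wfp_host="$1"
    wfp_port="$2"
    wfp_timeout="${3:-30}"   # default mirrors the 30 seconds mentioned in the PR
    wfp_waited=0
    # PROBE_CMD is an assumption added so the probe can be swapped out;
    # by default, "nc -z" only checks that the port is open and sends no payload.
    until ${PROBE_CMD:-nc -z} "$wfp_host" "$wfp_port" >/dev/null 2>&1; do
        if [ "$wfp_waited" -ge "$wfp_timeout" ]; then
            return 1         # port never became reachable in time
        fi
        sleep 1
        wfp_waited=$((wfp_waited + 1))
    done
    return 0
}
```

A start script could then call something like `wait_for_port "$JM_HOST" "$JM_PORT" 30` before launching the TaskManagers, where both variable names are placeholders for however the script obtains the JobManager address.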
|
The TaskManager's maximum registration duration is configured by the config value Therefore, I'm wondering what exactly the problem with the startup delay is? @DarkKnightCZ maybe you can elaborate a little bit more on the problem you had. |
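To make the behaviour under discussion concrete: retrying with exponential backoff until an overall maximum duration is exceeded can be sketched as follows. This is purely illustrative and is not Flink's TaskManager code. The function name `retry_with_backoff`, the `SLEEP_CMD` override, the 1-second initial delay, and the 30-second cap are all assumptions chosen for the example.

```shell
# Illustrative sketch only; names and constants are assumptions.
# Usage: retry_with_backoff MAX_SECONDS command [args...]
retry_with_backoff() {
    rb_max_duration="$1"     # give up once this many seconds of waiting have passed
    shift
    rb_delay=1               # initial pause between attempts (assumed value)
    rb_max_delay=30          # upper bound for a single pause (assumed value)
    rb_elapsed=0             # counts only the time spent sleeping
    until "$@"; do
        if [ "$rb_elapsed" -ge "$rb_max_duration" ]; then
            return 1         # maximum duration exceeded, stop retrying
        fi
        ${SLEEP_CMD:-sleep} "$rb_delay"
        rb_elapsed=$((rb_elapsed + rb_delay))
        rb_delay=$((rb_delay * 2))          # double the pause after each failure
        if [ "$rb_delay" -gt "$rb_max_delay" ]; then
            rb_delay="$rb_max_delay"
        fi
    done
    return 0
}
```

With these assumed values the pauses grow as 1, 2, 4, 8, 16, 30, 30, ... seconds, so a slow JobManager start costs a few retries rather than a failed TaskManager.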
|
As far as I understood it, the goal of the change is to wait until the JM has been started before starting the TMs. |
|
@tillrohrmann When I tried it in a 5-node environment, sometimes 2 or 3 TMs failed because the JM wasn't ready yet. There was no subsequent checking done; the TMs just stopped. I agree that the TM should indeed check several times whether the JM is available, so I will try to look at that as well. |
|
@DarkKnightCZ that sounds strange. The TM should not terminate itself if it cannot connect to the JM unless the maximum registration duration has been configured. Could you link the log file of one of the failed TMs? That would allow us to investigate the problem more thoroughly. |
|
Forwarding comments from JIRA: I think @DarkKnightCZ is using version 0.8.x and Till Rohrmann is talking about 0.9. I don't think that this issue will be fixed in 0.8.x. @DarkKnightCZ Can you verify whether 0.9 works for you? |
|
@StephanEwen: Hi, yes, you're correct. In 0.9 it works as it should (i.e. it tries connecting to the JM several times), so I guess this PR can be closed. |
|
@DarkKnightCZ, only you can close the PR. Could you do so? |
Creates dependency on netcat package (nc)