STORM-1516 Fixed issue in writing pids with distributed cluster mode. #1084
Conversation
- ;; because in local mode, its not a separate
- ;; process. supervisor will register it in this case
- (when (= :distributed (ConfigUtils/clusterMode conf))
+ ;; if (ConfigUtils/isLocalMode conf) returns false then it is in distributed mode.
+ (when-not (ConfigUtils/isLocalMode conf)
How is this different from checking state == "distributed"?
Satish already explained why this occurs in a JIRA comment.
Easy explanation:
=> (= :dist "dist")
false
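To spell the mismatch out a little more (a hedged sketch, not from the thread; that ConfigUtils/clusterMode returns a Java String such as "distributed" is inferred from the diff above): a Clojure keyword never equals a string, so the old guard was always false and the pid was never written in distributed mode.

=> (= :distributed "distributed")
false
=> (= "distributed" "distributed")
true

The fix sidesteps the keyword/string comparison entirely by asking ConfigUtils/isLocalMode directly.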
Thanks @HeartSaVioR. Got it now.
@satishd Is this happening with all topologies?
@arunmahadevan Right, this issue happens with all topologies. There is nothing specific about the topology mentioned in the JIRA.
+1
The code looks fine to me, but for some reason nimbus-test is failing fairly consistently. Also storm-hdfs is very unstable right now.
I just did some poking around and it looks like these failures are likely unrelated.
@satishd Can you upmerge this, and also open another PR for 1.x-branch if we want this there as well? Thanks.
@harshach Upmerged and resolved conflicts. We do not need this on 1.x.
+1
Which version was this fixed in? I'm seeing the same thing in 1.0.1.
@ndtreviv This patch is only for 2.0.0. You might be hitting STORM-1934, which is fixed in 1.0.2. There are lots of fixes in 1.0.2, so you're encouraged to give it a try.
@HeartSaVioR I'm not sure that's the one. I'm pretty sure that I'm seeing this issue. I can see the supervisor.log saying that it can't find the worker file in workers-users. As a result, the worker processes don't get shut down. It's not a race condition, either, as I've killed the topology and waited for it all to settle before re-deploying, and done this three times over, but the worker processes are still related to the very first topology that was run.
@ndtreviv What you're describing sounds like STORM-1879, which occurs after the supervisor hits STORM-1934 and deletes the workers (worker root) directory instead of a single worker's directory.
@HeartSaVioR Perfect. Thanks.
Whenever a topology is submitted, it creates the respective workers on the supervisor(s). These worker pids are stored as files in ${storm-localdir}/workers/{worker-id}/pids/ on the supervisor. But there was an issue in storing worker pids, so the supervisor could not find the respective worker pids when a topology was killed. Workers from subsequent topology deployments then failed because the earlier workers were still alive and bound to the respective ports.
Fixed worker.clj to have the right checks while writing pids to the respective locations.
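For illustration, here is a minimal sketch of the corrected guard. Only ConfigUtils/isLocalMode and when-not come from the actual diff; write-worker-pid! and current-process-pid are hypothetical stand-ins for the surrounding worker.clj code.

;; In distributed mode the worker runs as its own process and must
;; record its own pid; in local mode the supervisor registers the
;; worker, so no pid file is written.
(when-not (ConfigUtils/isLocalMode conf)
  ;; both helpers below are hypothetical stand-ins for the real
  ;; worker.clj code: current-process-pid returns this JVM's pid, and
  ;; write-worker-pid! stores it as a file under
  ;; ${storm-localdir}/workers/{worker-id}/pids/
  (write-worker-pid! conf worker-id (current-process-pid)))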