
STORM-1516 Fixed issue in writing pids with distributed cluster mode. #1084

Merged — 1 commit, Feb 18, 2016

Conversation

@satishd (Member) commented Feb 5, 2016

Whenever a topology is submitted, it creates workers on one or more supervisors. These worker pids are stored as files in ${storm-localdir}/workers/{worker-id}/pids/ on the supervisor. But there was an issue in storing worker pids, so the supervisor could not find the respective worker pids when a topology was killed. Workers for subsequent topology deployments then failed because the earlier workers were still alive and bound to their ports.

Fixed worker.clj to apply the right checks while writing pids to their respective locations.

The change in worker.clj, shown as a diff:

```diff
 ;; because in local mode, its not a separate
 ;; process. supervisor will register it in this case
-(when (= :distributed (ConfigUtils/clusterMode conf))
+;; if (ConfigUtils/isLocalMode conf) returns false then it is in distributed mode.
+(when-not (ConfigUtils/isLocalMode conf)
```
Contributor

How is this different from checking state == "distributed"?


@HeartSaVioR (Contributor)

Easy explanation:

```clojure
=> (= :dist "dist")
false
```
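The point above is that `(ConfigUtils/clusterMode conf)` apparently returns a string, while the old check compared it against the Clojure keyword `:distributed`, so the equality could never hold and the pid-writing branch was dead code. A rough analogy in plain Java — a hedged sketch with hypothetical names, not Storm's actual API — is comparing an enum constant to a `String`:

```java
// Hypothetical sketch: comparing values of two different types is never
// equal, just as (= :distributed "distributed") is always false in Clojure.
public class ModeCheck {
    enum ClusterMode { DISTRIBUTED, LOCAL }

    public static void main(String[] args) {
        String mode = "DISTRIBUTED";                    // what the config call returns
        ClusterMode expected = ClusterMode.DISTRIBUTED; // what the check compares against

        // An enum constant never equals a String, so this condition is always false:
        System.out.println(expected.equals(mode));      // prints false

        // Comparing like with like behaves as intended:
        System.out.println(ClusterMode.valueOf(mode) == expected); // prints true
    }
}
```

The fix sidesteps the type mismatch entirely by calling a boolean predicate (`isLocalMode`) instead of comparing a keyword against a string.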

Contributor

Thanks @HeartSaVioR. Got it now.

@arunmahadevan (Contributor)

@satishd Is it happening with all topologies?

@satishd (Member, Author) commented Feb 5, 2016

@arunmahadevan Right, this issue happens with all topologies. There is nothing specific about the topology mentioned in the JIRA.

@arunmahadevan (Contributor)

+1

@revans2 (Contributor) commented Feb 8, 2016

The code looks fine to me, but for some reason nimbus-test is failing fairly consistently. Also, storm-hdfs is very unstable right now.

@revans2 (Contributor) commented Feb 8, 2016

I just did some poking around and it looks like these failures are likely unrelated.

@harshach (Contributor)

@satishd Can you upmerge this, and also open another PR for the 1.x branch if we want it there as well? Thanks.

@satishd (Member, Author) commented Feb 17, 2016

@harshach Upmerged and resolved conflicts. We do not need this on 1.x.

@harshach (Contributor)

+1

@asfgit merged commit 8749523 into apache:master on Feb 18, 2016
@ndtreviv

Which version was this fixed in? I'm seeing the same thing in 1.0.1

@HeartSaVioR (Contributor) commented Aug 24, 2016

@ndtreviv This patch is only for 2.0.0. You might be hitting STORM-1934, which is fixed in 1.0.2. Lots of things have been fixed in 1.0.2, so you're encouraged to give it a try.

@ndtreviv

@HeartSaVioR I'm not sure that's the one; I'm pretty sure I'm seeing this issue. supervisor.log says it can't find the worker file in workers-users, and as a result the worker processes don't get shut down. It's not a race condition either: I've killed the topology and waited for everything to settle before re-deploying, and done this three times over, but the worker processes still belong to the very first topology that was run.

@HeartSaVioR (Contributor) commented Aug 24, 2016

@ndtreviv
This bug was in code ported only to master (2.0.0), so it is not related to 1.x.

What you're describing sounds like STORM-1879, which occurs after the supervisor hits STORM-1934 and deletes the workers (worker root) directory instead of just one worker's directory.

@ndtreviv

@HeartSaVioR Perfect. Thanks

8 participants