Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Handle build_wait_timeout >= 0 consistently #2141

Conversation

@aelsabbahy
Copy link
Contributor

commented Apr 19, 2016

WIP, looking for feedback on approach.

Fixes #3528

@mention-bot

This comment has been minimized.

Copy link

commented Apr 19, 2016

By analyzing the blame information on this pull request, we identified @tomprince, @djmitche and @rutsky to be potential reviewers

@tomprince

This comment has been minimized.

Copy link
Member

commented Apr 19, 2016

I don't think this a correct fix. I think the intent of build_wait_timeout == 0 is that the worker will only run a single build. I don't think this change preserves that behavior.

@aelsabbahy

This comment has been minimized.

Copy link
Contributor Author

commented Apr 19, 2016

Hmm, that's definitely not what I want then. Since it's very important for us that we only have single use workers.

Would a d.addCallback(self._soft_disconnect) right after the d = self.insubstantiate() preserve the behavior while properly disconnecting?

@codecov-io

This comment has been minimized.

Copy link

commented Apr 19, 2016

Current coverage is 84.26%

Merging #2141 into master will not affect coverage as of 2fadb6c

@@            master   #2141   diff @@
======================================
  Files          333     333       
  Stmts        32026   32024     -2
  Branches         0       0       
  Methods          0       0       
======================================
- Hit          26987   26985     -2
  Partial          0       0       
  Missed        5039    5039       

Review entire Coverage Diff as of 2fadb6c


Uncovered Suggestions

  1. +0.11% via ...ot/status/builder.py#384...417
  2. +0.09% via ...ot/status/builder.py#456...484
  3. +0.09% via ...dbot/changes/mail.py#480...507
  4. See 7 more...

Powered by Codecov. Updated on successful CI builds.

@tomprince

This comment has been minimized.

Copy link
Member

commented Apr 19, 2016

Would a d.addCallback(self._soft_disconnect) right after the d = self.insubstantiate() preserve the behavior while properly disconnecting?

I'm not sure. The code here is somewhat subtle and I suspect has a number of undocumented dependencies and interactions. I don't think I would be comfortable with any changes that don't explain why the change is correct.

It certainly isn't clear to me that this is any better (or worse) than the existing code.

@aelsabbahy

This comment has been minimized.

Copy link
Contributor Author

commented Apr 19, 2016

K, I'm taking a slightly different approach which might be a little more clear and direct, which is basically to ensure that a worker is disconnected after calling d.insubstantiate().

For some reason, our EC2 workers aren't disconnecting properly with the code that's in master as of 20cac3f

To reproduce our issue:

  • Build queued
  • Worker starts
  • Worker terminated (but not disconnected)
  • Worker shows up as status: 1 connection, even though the EC2 instance is terminated
  • Future builds are hung

@sa2ajj sa2ajj added the Do Not Merge label Apr 19, 2016

@@ -781,6 +781,8 @@ def buildFinished(self, sb):
if not self.building:
if self.build_wait_timeout == 0:
d = self.insubstantiate()
d.addCallback(lambda _:
AbstractWorker.disconnect(self))

This comment has been minimized.

Copy link
@tardyp

tardyp Apr 20, 2016

Member

I would believe insubstantiate is supposed to do the diconnection.

This comment has been minimized.

Copy link
@aelsabbahy

aelsabbahy Apr 20, 2016

Author Contributor

Is that something insubstantiate should always do, or should it be a kwarg defaulting to False?

I didn't put it in insubstantiate, because I wasn't sure on the intended usage of the method and didn't want to break something else unintentionally.

This comment has been minimized.

Copy link
@tomprince

tomprince Apr 20, 2016

Member

It isn't clear that anybody actually knows what any of this is supposed to do.

This comment has been minimized.

Copy link
@aelsabbahy

aelsabbahy Apr 20, 2016

Author Contributor

This change makes sense to me to be added to insubstantiate, just not 100% sure if it should be kwarg flagged or not.

Also, I noticed insubstantiate() is calling self.botmaster.maybeStartBuildsForWorker(self.name) at the end, and we turn around and call that method again here. If we move the disconnect logic to insubstantiate can we drop the call to self.botmaster.maybeStartBuildsForWorker(self.workername) from this method?

This comment has been minimized.

Copy link
@tomprince

tomprince Apr 20, 2016

Member

This change makes sense to me to be added to insubstantiate, [...]

It might. But I'd be inclined to reject any change to this code that is "I think this makes sense ..." instead of "This makes sense because a) ... b) ... c) ...". I suspect the code is in the state that it is largely because of changes like the former.

@aelsabbahy aelsabbahy changed the title Handle build_wait_timeout >= 0 consistently [WIP] Handle build_wait_timeout >= 0 consistently May 2, 2016

@aelsabbahy aelsabbahy force-pushed the aelsabbahy:issues/3528-EC2LatentWorker-hang branch from 9f0d091 to 96c4fbb May 10, 2016

@aelsabbahy

This comment has been minimized.

Copy link
Contributor Author

commented May 10, 2016

For what it's worth, I've updated this pull request with what we're currently using in our CI environment. Without this change build_wait_timeout=0 is causing instances to terminate but not disconnect or ever relaunch so our server just hangs.

@tardyp

This comment has been minimized.

Copy link
Member

commented Jul 28, 2016

@aelsabbahy How can the instance terminate with the tcp connection being closed as well?

@tardyp

This comment has been minimized.

Copy link
Member

commented Aug 5, 2016

@aelsabbahy

This comment has been minimized.

Copy link
Contributor Author

commented Aug 5, 2016

Sorry for the late response, I think this PR can be closed out all together.

I believe some of the changes @tomprince made conflicted with the changes here. There were some initial discussions and @tomprince had some thoughts on how to handle this use-case properly/thoroughly, but I'm not 100% sure if any action was taken since then.

@aelsabbahy aelsabbahy closed this Aug 5, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
7 participants
You can’t perform that action at this time.