Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Handle build_wait_timeout >= 0 consistently #2141

Conversation

aelsabbahy
Copy link
Contributor

@aelsabbahy aelsabbahy commented Apr 19, 2016

WIP, looking for feedback on approach.

Fixes #3528

@mention-bot
Copy link

@mention-bot mention-bot commented Apr 19, 2016

By analyzing the blame information on this pull request, we identified @tomprince, @djmitche and @rutsky to be potential reviewers

@tomprince
Copy link
Member

@tomprince tomprince commented Apr 19, 2016

I don't think this a correct fix. I think the intent of build_wait_timeout == 0 is that the worker will only run a single build. I don't think this change preserves that behavior.

@aelsabbahy
Copy link
Contributor Author

@aelsabbahy aelsabbahy commented Apr 19, 2016

Hmm, that's definitely not what I want then. Since it's very important for us that we only have single use workers.

Would a d.addCallback(self._soft_disconnect) right after the d = self.insubstantiate() preserve the behavior while properly disconnecting?

@codecov-io
Copy link

@codecov-io codecov-io commented Apr 19, 2016

Current coverage is 84.26%

Merging #2141 into master will not affect coverage as of 2fadb6c

@@            master   #2141   diff @@
======================================
  Files          333     333       
  Stmts        32026   32024     -2
  Branches         0       0       
  Methods          0       0       
======================================
- Hit          26987   26985     -2
  Partial          0       0       
  Missed        5039    5039       

Review entire Coverage Diff as of 2fadb6c


Uncovered Suggestions

  1. +0.11% via ...ot/status/builder.py#384...417
  2. +0.09% via ...ot/status/builder.py#456...484
  3. +0.09% via ...dbot/changes/mail.py#480...507
  4. See 7 more...

Powered by Codecov. Updated on successful CI builds.

@tomprince
Copy link
Member

@tomprince tomprince commented Apr 19, 2016

Would a d.addCallback(self._soft_disconnect) right after the d = self.insubstantiate() preserve the behavior while properly disconnecting?

I'm not sure. The code here is somewhat subtle and I suspect has a number of undocumented dependencies and interactions. I don't think I would be comfortable with any changes that don't explain why the change is correct.

It certainly isn't clear to me that this is any better (or worse) than the existing code.

@aelsabbahy
Copy link
Contributor Author

@aelsabbahy aelsabbahy commented Apr 19, 2016

K, I'm taking a slightly different approach which might be a little more clear and direct, which is basically to ensure that a worker is disconnected after calling d.insubstantiate().

For some reason, our EC2 workers aren't disconnecting properly with the code that's in master as of 20cac3f

To reproduce our issue:

  • Build queued
  • Worker starts
  • Worker terminated (but not disconnected)
  • Worker shows up as status: 1 connection, even though the EC2 instance is terminated
  • Future builds are hung

@@ -781,6 +781,8 @@ def buildFinished(self, sb):
if not self.building:
if self.build_wait_timeout == 0:
d = self.insubstantiate()
d.addCallback(lambda _:
AbstractWorker.disconnect(self))
Copy link
Member

@tardyp tardyp Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would believe insubstantiate is supposed to do the diconnection.

Copy link
Contributor Author

@aelsabbahy aelsabbahy Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that something insubstantiate should always do, or should it be a kwarg defaulting to False?

I didn't put it in insubstantiate, because I wasn't sure on the intended usage of the method and didn't want to break something else unintentionally.

Copy link
Member

@tomprince tomprince Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It isn't clear that anybody actually knows what any of this is supposed to do.

Copy link
Contributor Author

@aelsabbahy aelsabbahy Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me to be added to insubstantiate, just not 100% sure if it should be kwarg flagged or not.

Also, I noticed insubstantiate() is calling self.botmaster.maybeStartBuildsForWorker(self.name) at the end, and we turn around and call that method again here. If we move the disconnect logic to insubstantiate can we drop the call to self.botmaster.maybeStartBuildsForWorker(self.workername) from this method?

Copy link
Member

@tomprince tomprince Apr 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes sense to me to be added to insubstantiate, [...]

It might. But I'd be inclined to reject any change to this code that is "I think this makes sense ..." instead of "This makes sense because a) ... b) ... c) ...". I suspect the code is in the state that it is largely because of changes like the former.

@aelsabbahy aelsabbahy changed the title Handle build_wait_timeout >= 0 consistently [WIP] Handle build_wait_timeout >= 0 consistently May 2, 2016
@aelsabbahy aelsabbahy force-pushed the issues/3528-EC2LatentWorker-hang branch from 9f0d091 to 96c4fbb Compare May 10, 2016
@aelsabbahy
Copy link
Contributor Author

@aelsabbahy aelsabbahy commented May 10, 2016

For what it's worth, I've updated this pull request with what we're currently using in our CI environment. Without this change build_wait_timeout=0 is causing instances to terminate but not disconnect or ever relaunch so our server just hangs.

@tardyp
Copy link
Member

@tardyp tardyp commented Jul 28, 2016

@aelsabbahy How can the instance terminate with the tcp connection being closed as well?

@tardyp
Copy link
Member

@tardyp tardyp commented Aug 5, 2016

@aelsabbahy
Copy link
Contributor Author

@aelsabbahy aelsabbahy commented Aug 5, 2016

Sorry for the late response, I think this PR can be closed out all together.

I believe some of the changes @tomprince made conflicted with the changes here. There were some initial discussions and @tomprince had some thoughts on how to handle this use-case properly/thoroughly, but I'm not 100% sure if any action was taken since then.

@aelsabbahy aelsabbahy closed this Aug 5, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants