
DataSourceOpenStack: retry also when first waiting #1529

Closed
wants to merge 1 commit

Conversation

david-caro
Contributor

Proposed Commit Message

summary: DataSourceOpenStack: retry also when first waiting

There's a first step that waits for and filters out non-responsive metadata
services, and the `retries` option is not taken into account there.
This patch adds the retrying logic to that first step too.

LP: #1979049

Test Steps

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

This depends on #1527 (CLA signing)

Signed-off-by: David Caro <me@dcaro.es>
Signed-off-by: David Caro <dcaro@wikimedia.org>

@andrewbogott
Contributor

Yes please! This would help with overloaded metadata servers and could also help if we're rotating through a set of servers, some of which are unresponsive.

@TheRealFalcon
Member

Depends on #1527

@TheRealFalcon TheRealFalcon self-assigned this Jun 17, 2022
@TheRealFalcon
Member

@david-caro, we should already be retrying until we hit the max_wait. See my inline comment. Does that sound right to you?

On some level, I think that if the datasource isn't ready within 10 seconds, that's really an issue that should be fixed on the datasource side, but I understand why longer timeouts can help. Historically, we've been hesitant to increase the default timeouts in openstack because on certain architectures that could lead to longer timeouts for non-openstack datasources. See #501 for context.

That said, those concerns aren't as relevant anymore as most other datasources do get reported correctly during early boot.

Can the issues you're seeing be fixed by updating the configuration for your particular openstack installation? Configuration can be set in /etc/cloud/cloud.cfg, /etc/cloud/cloud.cfg.d, or as vendordata to all of your instances.
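
As a rough illustration only (the file name below is made up, and the exact keys should be checked against the OpenStack datasource documentation for the cloud-init version in use), such an override could be dropped into /etc/cloud/cloud.cfg.d/ like this:

```yaml
# /etc/cloud/cloud.cfg.d/91-openstack-metadata.cfg  (illustrative file name)
datasource:
  OpenStack:
    max_wait: 120   # total seconds to keep waiting for a metadata URL
    timeout: 10     # per-request timeout in seconds
    retries: 5      # retries used when reading the metadata after the wait
```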

@david-caro
Contributor Author

> @david-caro, we should already be retrying until we hit the max_wait. See my inline comment. Does that sound right to you?

Hmm, I think I understand the logic now; tell me if I'm right:

So when doing wait_for_metadata_service, the helper function it uses, url_helper.wait_for_url, will try the URLs passed, potentially one by one, with the given timeout for each connection, up to a maximum elapsed time of max_wait_seconds. So in order to make it retry the URLs that are passed, expecting each of them to time out after timeout_seconds, we need max_wait_seconds > timeout_seconds * num_urls, is that so?

If so, I can try tweaking the configuration, yes.

> On some level, I think that if the datasource isn't ready within 10 seconds, that's really an issue that should be fixed on the datasource side, but I understand why longer timeouts can help. Historically, we've been hesitant to increase the default timeouts in openstack because on certain architectures that could lead to longer timeouts for non-openstack datasources. See #501 for context.

I agree with that, but we are trying to minimize the impact until we fix it properly.

> That said, those concerns aren't as relevant anymore as most other datasources do get reported correctly during early boot.

> Can the issues you're seeing be fixed by updating the configuration for your particular openstack installation? Configuration can be set in /etc/cloud/cloud.cfg, /etc/cloud/cloud.cfg.d, or as vendordata to all of your instances.

@TheRealFalcon
Member

> so in order to make it retry the URLs that are passed, expecting each of them to time out after timeout_seconds, we need max_wait_seconds > timeout_seconds * num_urls, is that so?

Yes, but note that the timeout is only relevant if the server is blocking on returning a response. If the request returns an error immediately, then timeout_seconds isn't really relevant.
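
A minimal sketch of those semantics, as a simplified model only (this is not the actual cloudinit.url_helper.wait_for_url code; the function name, defaults, and use of requests are illustrative):

```python
import time

import requests  # stand-in for cloud-init's internal URL helper


def wait_for_url_sketch(urls, max_wait=120, timeout=10, sleep_time=1):
    """Keep cycling through urls until one answers or max_wait elapses."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        for url in urls:
            try:
                # `timeout` only matters when the server hangs; a request
                # that errors out immediately moves on to the next URL.
                resp = requests.get(url, timeout=timeout)
                if resp.ok:
                    return url
            except requests.RequestException:
                pass
        time.sleep(sleep_time)
    return None
```

Under this model a second pass over the URL list only happens when max_wait is larger than roughly timeout * len(urls); for example, with three URLs and a 10-second per-request timeout, max_wait has to exceed about 30 seconds before any URL is retried.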

@github-actions

github-actions bot commented Jul 6, 2022

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr label (Pull request is stale; will be auto-closed soon) on Jul 6, 2022
@david-caro
Contributor Author

Oh yes, changing the config seemed to work, thanks!

I'll close the PR/issue.

@david-caro david-caro closed this Jul 6, 2022
@david-caro david-caro deleted the lp_1979049 branch July 6, 2022 08:54