
DataSourceOpenStack: retry also when first waiting #1529

Closed
wants to merge 1 commit

Conversation

david-caro
Contributor

Proposed Commit Message

summary: DataSourceOpenStack: retry also when first waiting

There's a first step that waits for and filters out non-responsive metadata
services, and the `retries` option is not taken into account there.
This patch adds the retrying logic to that first step too.

LP: #1979049

Test Steps

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

This depends on #1527 (CLA signing)

Signed-off-by: David Caro <me@dcaro.es>
Signed-off-by: David Caro <dcaro@wikimedia.org>

@andrewbogott
Contributor

Yes please! This would help with overloaded metadata servers and could also help if we're rotating through a set of servers, some of which are unresponsive.

@TheRealFalcon
Member

Depends on #1527

@TheRealFalcon TheRealFalcon self-assigned this Jun 17, 2022
@TheRealFalcon
Member

@david-caro, we should already be retrying until we hit the max_wait. See my inline comment. Does that sound right to you?

On some level, I think that if the datasource isn't ready within 10 seconds, that's really an issue that should be fixed on the datasource side, but I understand why longer timeouts can help. Historically, we've been hesitant to increase the default timeouts in openstack because on certain architectures that could lead to longer timeouts for non-openstack datasources. See #501 for context.

That said, those concerns aren't as relevant anymore as most other datasources do get reported correctly during early boot.

Can the issues you're seeing be fixed by updating the configuration for your particular openstack installation? Configuration can be set in /etc/cloud/cloud.cfg, /etc/cloud/cloud.cfg.d, or as vendordata to all of your instances.
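
As a rough illustration only (the file name below is made up, and the exact keys should be checked against the OpenStack datasource documentation for the cloud-init version in use), such an override could be dropped into /etc/cloud/cloud.cfg.d/ like this:

```yaml
# /etc/cloud/cloud.cfg.d/91-openstack-metadata.cfg  (illustrative file name)
datasource:
  OpenStack:
    max_wait: 120   # total seconds to keep waiting for a metadata URL
    timeout: 10     # per-request timeout in seconds
    retries: 5      # retries used when reading the metadata after the wait
```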

@david-caro
Contributor Author

> @david-caro, we should already be retrying until we hit the max_wait. See my inline comment. Does that sound right to you?

Hmm, I think I understand the logic now; tell me if I'm right:

So when doing wait_for_metadata_service, the helper function it uses, url_helper.wait_for_url, will try the URLs passed, potentially one by one, with the given timeout for each connection, up to a maximum elapsed time of max_wait_seconds. So in order to make it retry the URLs that are passed, expecting each of them to time out after timeout_seconds, we need max_wait_seconds > timeout_seconds * num_urls, is that so?

If so, I can try tweaking the configuration, yes.

> On some level, I think that if the datasource isn't ready within 10 seconds, that's really an issue that should be fixed on the datasource side, but I understand why longer timeouts can help. Historically, we've been hesitant to increase the default timeouts in openstack because on certain architectures that could lead to longer timeouts for non-openstack datasources. See #501 for context.

I agree with that, but we are trying to minimize the impact until we fix it properly.

> That said, those concerns aren't as relevant anymore as most other datasources do get reported correctly during early boot.

> Can the issues you're seeing be fixed by updating the configuration for your particular openstack installation? Configuration can be set in /etc/cloud/cloud.cfg, /etc/cloud/cloud.cfg.d, or as vendordata to all of your instances.

@TheRealFalcon
Member

> so in order to make it retry the URLs that are passed, expecting each of them to time out after timeout_seconds, we need max_wait_seconds > timeout_seconds * num_urls, is that so?

Yes, but note that the timeout is only relevant if the server is blocking on returning a response. If the request returns an error immediately, then timeout_seconds isn't really relevant.
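
A minimal sketch of those semantics, as a simplified model only (this is not the actual cloudinit.url_helper.wait_for_url code; the function name, defaults, and use of requests are illustrative):

```python
import time

import requests  # stand-in for cloud-init's internal URL helper


def wait_for_url_sketch(urls, max_wait=120, timeout=10, sleep_time=1):
    """Keep cycling through urls until one answers or max_wait elapses."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        for url in urls:
            try:
                # `timeout` only matters when the server hangs; a request
                # that errors out immediately moves on to the next URL.
                resp = requests.get(url, timeout=timeout)
                if resp.ok:
                    return url
            except requests.RequestException:
                pass
        time.sleep(sleep_time)
    return None
```

Under this model a second pass over the URL list only happens when max_wait is larger than roughly timeout * len(urls); for example, with three URLs and a 10-second per-request timeout, max_wait has to exceed about 30 seconds before any URL is retried.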

@github-actions

github-actions bot commented Jul 6, 2022

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon.

(If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)

@github-actions github-actions bot added the stale-pr label (Pull request is stale; will be auto-closed soon) on Jul 6, 2022
@david-caro
Contributor Author

Oh yes, changing the config seemed to work, thanks!

I'll close the PR/issue.

@david-caro david-caro closed this Jul 6, 2022
@david-caro david-caro deleted the lp_1979049 branch July 6, 2022 08:54