Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Azure WireServer endpoint not found #94

Open
jsturtevant opened this issue Apr 28, 2022 · 37 comments
Open

Azure WireServer endpoint not found #94

jsturtevant opened this issue Apr 28, 2022 · 37 comments

Comments

@jsturtevant
Copy link

Overview

We are seeing random VM timeouts in our k8s cluster api tests, an example is: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows/1519329033286848512

We are using the UserData plugin to execute run commands and set up ssh keys. Here is an example output from a successful run. If understand correctly, everything that is executed during UserData is running during Azure's "Creating" phase of the vm because it runs during Azures VM provisioning phase.

I've found that this VM is timing out during this Azure VM provisioning phase. When the time out occurs the response from azure is the following and the vm never finishes provision so we are able to get logs.

{\"code\":\"OSProvisioningTimedOut\",\"message\":\"OS Provisioning for VM 'capz-conf-jjd9j' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle.\"}]}}",

what is happening

Was able to get a vm that was in this state and found:

2022-04-28 15:24:14.315 4296 DEBUG cloudbaseinit.metadata.services.azureservice [-] Azure WireServer endpoint not found load C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:457
2022-04-28 15:24:14.316 4296 ERROR cloudbaseinit.init [-] No metadata service found: cloudbaseinit.exception.MetadataNotFoundException: No available service found
2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Plugins execution done
2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Stopping Cloudbase-Init service
2022-04-28 15:24:17.316 4296 DEBUG cloudbaseinit.osutils.windows [-] Stopping service cloudbase-init stop_service C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1140

This looks to be triggering from

def load(self):
try:
wire_server_endpoint = self._get_wire_server_endpoint_address()
self._base_url = "http://%s" % wire_server_endpoint
except Exception:
LOG.debug("Azure WireServer endpoint not found")
return False

which is calling a failing to reach the endpoint over dhcp:

def _get_wire_server_endpoint_address(self):
total_time = 300
poll_time = 5
retries = total_time / poll_time
while True:
try:
options = dhcp.get_dhcp_options()
endpoint = (options or {}).get(WIRESERVER_DHCP_OPTION)
if not endpoint:
raise exception.MetadataNotFoundException(
"Cannot find Azure WireServer endpoint address")
return socket.inet_ntoa(endpoint)
except Exception:
if not retries:
raise
time.sleep(poll_time)
retries -= 1

It seems because of this exception the service is never setting the failed/succeed of the provisioning step:

def provisioning_completed(self):
self._post_health_status(HEALTH_STATE_READY)
def provisioning_failed(self):
self._post_health_status(
HEALTH_STATE_NOT_READY, HEALTH_SUBSTATE_PROVISIONING_FAILED,
"Provisioning failed")

questions

The metadata service ip is well known at 169.254.169.254 why is it querying for it via dhcp? Also it get the wireservice ip of 168.63.129.16 in the success case.

If it can't get it via dhcp, then could if fall back to the well known ips for last try?

When this exception happens shouldn't the failed status be set on the VM provisioning? This way it doesn't get stuck in the Azure Creating step.

Thanks for the help!

additional info

full logs:

cloudbase-init.log

``` 2022-04-28 15:14:06.975 4244 WARNING cloudbaseinit.utils.log [-] Serial port: COM1,115200,N,8 could not be opened: serial.serialutil.SerialException: could not open port 'COM1': FileNotFoundError(2, 'The system cannot find the file specified.', None, 2) 2022-04-28 15:14:06.976 4244 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:07.190 4244 DEBUG cloudbaseinit.osutils.windows [-] Checking if service exists: cloudbase-init check_service_exists C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1092 2022-04-28 15:14:07.193 4244 DEBUG cloudbaseinit.osutils.windows [-] Getting service username: cloudbase-init get_service_username C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1225 2022-04-28 15:14:07.193 4244 DEBUG cloudbaseinit.osutils.windows [-] Resetting password for service user: .\cloudbase-init reset_service_password C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1248 2022-04-28 15:14:07.240 4244 DEBUG cloudbaseinit.osutils.windows [-] Setting service credentials: cloudbase-init set_service_credentials C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1208 2022-04-28 15:14:07.250 4244 INFO cloudbaseinit.init [-] Respawning current process with updated credentials. 2022-04-28 15:14:07.250 4244 DEBUG cloudbaseinit.osutils.windows [-] Creating logon session for user: .\cloudbase-init create_user_logon_session C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:649 2022-04-28 15:14:07.255 4244 DEBUG cloudbaseinit.osutils.windows [-] Executing process as user, command line: ['C:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\Python\\Scripts\\cloudbase-init', '--config-file', 'C:\\Program Files\\Cloudbase Solutions\\Cloudbase-Init\\conf\\cloudbase-init.conf', '--noreset_service_password'] execute_process_as_user C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:703 2022-04-28 15:14:07.791 4296 WARNING cloudbaseinit.utils.log [-] Serial port: COM1,115200,N,8 could not be opened: serial.serialutil.SerialException: could not open port 'COM1': FileNotFoundError(2, 'The system cannot find the file specified.', None, 2) 2022-04-28 15:14:07.791 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:07.997 4296 INFO cloudbaseinit.init [-] Cloudbase-Init version: 1.1.2 2022-04-28 15:14:07.998 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.createuser.CreateUserPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.085 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.setuserpassword.SetUserPasswordPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.171 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.extendvolumes.ExtendVolumesPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.224 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.userdata.UserDataPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.346 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.ephemeraldisk.EphemeralDiskPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.369 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.azureguestagent.AzureGuestAgentPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.391 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.409 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.457 4296 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_NETWORKING': 2022-04-28 15:14:08.458 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.createuser.CreateUserPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.458 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.setuserpassword.SetUserPasswordPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.458 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.extendvolumes.ExtendVolumesPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.459 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.userdata.UserDataPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.459 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.ephemeraldisk.EphemeralDiskPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.459 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.azureguestagent.AzureGuestAgentPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.460 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.460 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.460 4296 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_METADATA_DISCOVERY': 2022-04-28 15:14:08.460 4296 INFO cloudbaseinit.init [-] Executing plugin 'MTUPlugin' 2022-04-28 15:14:08.461 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.474 4296 DEBUG cloudbaseinit.plugins.common.mtu [-] Could not obtain the MTU configuration via DHCP for interface "60:45:BD:0F:3D:F9" execute C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\plugins\common\mtu.py:46 2022-04-28 15:14:08.475 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.metadata.services.azureservice.AzureService' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:14:08.800 4296 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:24:14.315 4296 DEBUG cloudbaseinit.metadata.services.azureservice [-] Azure WireServer endpoint not found load C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:457 2022-04-28 15:24:14.316 4296 ERROR cloudbaseinit.init [-] No metadata service found: cloudbaseinit.exception.MetadataNotFoundException: No available service found 2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Plugins execution done 2022-04-28 15:24:14.316 4296 INFO cloudbaseinit.init [-] Stopping Cloudbase-Init service 2022-04-28 15:24:17.316 4296 DEBUG cloudbaseinit.osutils.windows [-] Stopping service cloudbase-init stop_service C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1140 ```

cloudbase-init-unattend.log

``` 2022-04-28 15:13:46.682 1476 WARNING cloudbaseinit.utils.log [-] Serial port: COM1,115200,N,8 could not be opened: serial.serialutil.SerialException: could not open port 'COM1': FileNotFoundError(2, 'The system cannot find the file specified.', None, 2) 2022-04-28 15:13:46.682 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.318 1476 DEBUG cloudbaseinit.osutils.windows [-] Checking if service exists: cloudbase-init check_service_exists C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1092 2022-04-28 15:13:49.318 1476 DEBUG cloudbaseinit.osutils.windows [-] Getting service username: cloudbase-init get_service_username C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1225 2022-04-28 15:13:49.318 1476 DEBUG cloudbaseinit.osutils.windows [-] Resetting password for service user: .\cloudbase-init reset_service_password C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1248 2022-04-28 15:13:49.365 1476 DEBUG cloudbaseinit.osutils.windows [-] Setting service credentials: cloudbase-init set_service_credentials C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\osutils\windows.py:1208 2022-04-28 15:13:49.381 1476 DEBUG cloudbaseinit.init [-] No need to respawn process. Current user: capz-conf-r7bm9$. Service user: cloudbase-init _reset_service_password_and_respawn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\init.py:141 2022-04-28 15:13:49.381 1476 INFO cloudbaseinit.init [-] Cloudbase-Init version: 1.1.2 2022-04-28 15:13:49.381 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.428 1476 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_NETWORKING': 2022-04-28 15:13:49.428 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.428 1476 INFO cloudbaseinit.init [-] Executing plugins for stage 'PRE_METADATA_DISCOVERY': 2022-04-28 15:13:49.428 1476 INFO cloudbaseinit.init [-] Executing plugin 'MTUPlugin' 2022-04-28 15:13:49.428 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.osutils.windows.WindowsUtils' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.428 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.metadata.services.base.EmptyMetadataService' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.475 1476 INFO cloudbaseinit.init [-] Metadata service loaded: 'EmptyMetadataService' 2022-04-28 15:13:49.475 1476 DEBUG cloudbaseinit.init [-] Instance id: None configure_host C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\init.py:202 2022-04-28 15:13:49.475 1476 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27 2022-04-28 15:13:49.475 1476 INFO cloudbaseinit.init [-] Executing plugins for stage 'MAIN': 2022-04-28 15:13:49.475 1476 INFO cloudbaseinit.init [-] Plugins execution done ```

@ader1990
Copy link
Member

Hello @jsturtevant,

I think we can go forward with solving this issue by using a fallback magic IP in case the DHCP option is not present in the DHCP response.

Although I am afraid this might lead to the same end result, as the OS state might not be ready (basically this issue might happen because of a lazy first boot) or some other networking issue with Azure or the Windows image itself.

We can try a very broad solving approach, where we increase the retry count, improve logging and use the known fallback magic IP.

I can prepare a patch for this purpose as soon as possible. Are you able to build the MSI / new image with a new patch?

Thank you,
Adrian

@jsturtevant
Copy link
Author

We can try a very broad solving approach, where we increase the retry count, improve logging and use the known fallback magic IP.

sounds good

as the OS state might not be ready (basically this issue might happen because of a lazy first boot) or some other networking issue with Azure or the Windows image itself.

I was wondering the same thing my self. Might even be a race in same way... It only started happening frequently in the last month or so but has gotten to a point where it is troublesome.

We are doing the kubeadm bootstrap work at provisioning time. Is there another approach with cloudbase init, that we could do it just after provision finishes so that it would be avoid doing that work during normal Azure boot process? We also leverage the things like ssh, user creation, password setting from cloudbase init which seems to rely on this Service so even if we move that work out this could still happen if I understand correctly.

Are you able to build the MSI / new image with a new patch?

Are there instructions on how to build the msi? I can easily build a new azure image with a URL to the MSI.

@jsturtevant
Copy link
Author

@claudiubelu thought maybe adding a networking service dependency might help kubernetes/kubernetes#109444 (comment)

As to why it didn't detect it needs the Azure Metadata Service: basically, it tries to detect if it can contact the Azure Wire Server by making a DHCP request, but the socket timeout is 5 seconds, and I assume that timed out. I also notice that this is the only dependency for the cloudbase-init: DEPENDENCIES : Winmgmt, which is not Networking related. I think there's a possibility that cloudbase-init started before the networking started, resulting in this issue. A solution might be to add another networking dependency to cloudbase-init, ensuring that it will start afterwards. Other than that, the 5 second timeout could be touched, IMO (maybe the Azure Security Group rules are not applied yet before the VM starts?).

ader1990 added a commit to ader1990/cloudbase-init-1 that referenced this issue May 3, 2022
* Added more retries for wireserver discovery
* Added logging to wireserver discovery errors
* Fallback to the well known wireserver IP in case of discovery failure

Github Issue: cloudbase#94

Change-Id: I91460f9ad674294a83c9b3c3f05bc04ab61de56f
@ader1990
Copy link
Member

ader1990 commented May 3, 2022

Hello @jsturtevant,

I have fixed the github actions in order to produce a cloudbase-init MSI on request if there is the need for it.
You can find in the artifacts tab of this github action - https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2262765578, the MSI with the following patches:

In short, I have added more retries, more logging in case we need it, a fallback to the default wireserver IP and to have no chance of transient lazy first boot failures, delayed-start on the cloudbase-init service.

Please let me know how it works and if all in order, I can clean up the patches for upstream merge.

Thank you,
Adrian.

@jsturtevant
Copy link
Author

Thanks! I am going to build a new vm image for our WS2022 tests where we have seen this so we can test this and the other mitigation I've added to capz.

I will run it for the next week or so and report back.

@jsturtevant
Copy link
Author

Just an update, I haven't seen this issue since I've updated the cloudbase-init to use this patch. We've had a few other unrelated issues in CI so going to continue to run for a while longer.

@jsturtevant
Copy link
Author

@ader1990 Just got an instance of the issue. This patch seems to have helped but the time out before cutting over to the fall back was longer than our timeout in the tests. The VM didn't get suck in Azure's Creating step.

Job that failed: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-capz-master-containerd-windows-2022/1527248958894641152

You can see it fails and starts to collect logs at 12:07:08.273: Capz suite:

�[1mSTEP�[0m: Waiting for the workload nodes to exist
[AfterEach] Conformance Tests
  /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:231
�[1mSTEP�[0m: Dumping logs from the "capz-conf-ew9isz" workload cluster
�[1mSTEP�[0m: Dumping workload cluster capz-conf-ew9isz/capz-conf-ew9isz logs
May 19 12:07:08.273: INFO: Collecting logs for Linux node capz-conf-ew9isz-control-plane-pdqw7 in cluster capz-conf-ew9isz in namespace capz-conf-ew9isz

May 19 12:07:20.205: INFO: Collecting boot logs for AzureMachine capz-conf-ew9isz-control-plane-pdqw7

May 19 12:07:22.425: INFO: Unable to collect logs as node doesn't have addresses
May 19 12:07:22.425: INFO: Collecting logs for Windows node capz-conf-ew9isz-md-win-sj9hc in cluster capz-conf-ew9isz in namespace capz-conf-ew9isz

May 19 12:07:29.660: INFO: Collecting boot logs for AzureMachine capz-conf-ew9isz-md-win-sj9hc

�[1mSTEP�[0m: Redacting sensitive information from logs

�[91m�[1m• Failure [1898.993 seconds]�[0m
Conformance Tests
�[90m/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:48�[0m
  �[91m�[1mconformance-tests [Measurement]�[0m
  �[90m/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:103�[0m

  �[91mTimed out after 1500.000s.
  Expected
      <int>: 0
  to equal
      <int>: 2�[0m

  /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/test@v1.1.2/framework/machinedeployment_helpers.go:121

  �[91mFull Stack Trace�[0m
  sigs.k8s.io/cluster-api/test/framework.WaitForMachineDeploymentNodesToExist({0x3a3f8a8?, 0xc00013e008}, {{0x7f27b0f7f208?, 0xc000343810?}, 0xc00088aa80?, 0xc0006c0290?}, {0xc0008c12e0, 0x2, 0x2})

In the cloudbase-init logs you can see the DCHP fail and the fallback successfully at 2022-05-19 12:12:20.363 which lets the vm provision properly but not in time for the test which has a generous time out for nodes to start up (25mins)

022-05-19 12:12:20.363 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] Cannot find Azure WireServer endpoint address _get_wire_server_endpoint_address C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:83
2022-05-19 12:12:20.363 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] Azure WireServer endpoint not found. Using default endpoint 168.63.129.16. load C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:460
2022-05-19 12:12:20.363 4428 DEBUG cloudbaseinit.metadata.services.base [-] Executing http request GET at http://168.63.129.16?comp=Versions _http_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\base.py:315
2022-05-19 12:12:20.473 4428 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 168.63.129.16:80 _new_conn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:230
2022-05-19 12:12:20.473 4428 DEBUG urllib3.connectionpool [-] http://168.63.129.16:80 "GET /?comp=Versions HTTP/1.1" 200 510 _make_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:442
2022-05-19 12:12:20.535 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] ovs-env path: E:\ovf-env.xml _get_ovf_env_path C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:341
2022-05-19 12:12:20.551 4428 INFO cloudbaseinit.init [-] Metadata service loaded: 'AzureService'
2022-05-19 12:12:20.551 4428 INFO cloudbaseinit.init [-] Reporting provisioning started
2022-05-19 12:12:20.551 4428 DEBUG cloudbaseinit.metadata.services.base [-] Executing http request GET at http://168.63.129.16/machine?comp=goalstate _http_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\base.py:315
2022-05-19 12:12:20.551 4428 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 168.63.129.16:80 _new_conn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:230
2022-05-19 12:12:20.561 4428 DEBUG urllib3.connectionpool [-] http://168.63.129.16:80 "GET /machine?comp=goalstate HTTP/1.1" 200 2154 _make_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:442
2022-05-19 12:12:20.567 4428 DEBUG cloudbaseinit.metadata.services.azureservice [-] Health data: b"<?xml version='1.0' encoding='utf-8'?>\n<Health><GoalStateIncarnation>1</GoalStateIncarnation><Container><ContainerId>d478084b-e06f-419b-9a36-9ae68ca62648</ContainerId><RoleInstanceList><Role><InstanceId>0c07e5b6-255e-4f12-9fd0-e6f42d491599._capz-conf-sj9hc</InstanceId><Health><State>NotReady</State><Details><SubStatus>Provisioning</SubStatus><Description>Cloudbase-Init is preparing your computer for first use...</Description></Details></Health></Role></RoleInstanceList></Container></Health>" _post_health_status C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\azureservice.py:210
2022-05-19 12:12:20.567 4428 DEBUG cloudbaseinit.metadata.services.base [-] Executing http request POST at http://168.63.129.16/machine?comp=health _http_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\metadata\services\base.py:315
2022-05-19 12:12:20.567 4428 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 168.63.129.16:80 _new_conn C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:230
2022-05-19 12:12:20.567 4428 DEBUG urllib3.connectionpool [-] http://168.63.129.16:80 "POST /machine?comp=health HTTP/1.1" 200 0 _make_request C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\urllib3\connectionpool.py:442
2022-05-19 12:12:20.567 4428 DEBUG cloudbaseinit.init [-] Instance id: 0c07e5b6-255e-4f12-9fd0-e6f42d491599._capz-conf-sj9hc configure_host C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\init.py:202
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.createuser.CreateUserPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.setuserpassword.SetUserPasswordPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.extendvolumes.ExtendVolumesPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.userdata.UserDataPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.ephemeraldisk.EphemeralDiskPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.windows.azureguestagent.AzureGuestAgentPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.582 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.mtu.MTUPlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27
2022-05-19 12:12:20.598 4428 DEBUG cloudbaseinit.utils.classloader [-] Loading class 'cloudbaseinit.plugins.common.sethostname.SetHostNamePlugin' load_class C:\Program Files\Cloudbase Solutions\Cloudbase-Init\Python\lib\site-packages\cloudbaseinit\utils\classloader.py:27

full log here: https://gist.github.com/jsturtevant/2ed46f8cc9f52fe2a767a2f698e45377

@jsturtevant
Copy link
Author

/cc @marosset

@ader1990
Copy link
Member

Hello,

The wait time now is around 20 minutes just for the retries to get the dhcp option, I can make another patch with way faster falling back to the default IP (maybe /4 the time to 5 minutes?). Would that be preferable for your case?

@jsturtevant
Copy link
Author

What would be the long term fix for cloudbase-init? It seems like the delay-start for the service helped. Would taking a dependency on a networking service as suggested with kubernetes/kubernetes#109444 (comment) be viable?

My main concern with using a 5 minute timeout is that this seems the dhcp has a failure mode that is not recoverable the way it is executed now, so if we do hit this edge case we have an outlying VM startup time that will take 5+ minutes to provision which isn't a great experience for the end user.

@jsturtevant
Copy link
Author

@ader1990 any thoughts here?

@marosset
Copy link

I think taking a dependency on is a good idea.
It seems safer and more deterministic vs waits + retries.

@ader1990
Copy link
Member

ader1990 commented Jun 2, 2022

Hello,

I have created a new MSI installer using the github actions, downloadable from here https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2426984970 .

Implemented changes:

  • Reverted automatic delayed start to automatic start for the cloudbase-init service
  • Decreased wait time for the DHCP option that should contain the wireserver endpoint
  • Added cloudbase-init service dependencies: nis and profsvc ("Network Store Interface Service", "User Profile Service")

Branches:

@jsturtevant we have already given this version of the MSI some runs on Azure and it works as expected.
Once we get a confirmation it works well, we can merge the two fixes upstream asap.

Thank you,
Adrian Vladu

@jsturtevant
Copy link
Author

I will get an image into testing and report back but otherwise looks good to me

@jsturtevant
Copy link
Author

The CI now has the patch, we can let it run over the weekend and check in.

@jsturtevant
Copy link
Author

@ader1990 We are still seeing failures with node coming online with the latest patches. It is happening more frequently then the prior patch with had the delayed start to the service. Seems that might be the best approach?

I am trying to collect logs but it seems the serial com port logging is not working. Is this the correct configuration kubernetes-sigs/image-builder#875? Also can you confirm the api where we can get the logs? We are using https://docs.microsoft.com/en-us/azure/virtual-machines/boot-diagnostics via https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/0ecbf62ecca79af77d1d902a9e8d46c0a0ba38c1/test/e2e/azure_logcollector.go#L416

thanks for the help!

@ader1990
Copy link
Member

Hello @jsturtevant,

First of all, maybe there are some other issues that we need to investigate, there might be a posibility the entire Windows networking is not available till a point later in the boot sequence, but we need logs to get to the bottom of it.

COM2 is the port opened for serial logging, and it should work AFAIK, so should the azure collector logs.

What we can do is to try to reproduce this bug in isolation. In order to do this, can you please share the VM flavor being used and the region of Azure, if it s possible.

In the meantime, maybe you can leverage the older MSI / image with higher timeouts, but if I remember correctly, there was the need to increase a timeout in the testing framework too?

Thank you,
Adrian Vladu

@jsturtevant
Copy link
Author

We run the tests in the following random regions: "canadacentral" "centralus" "eastus" "eastus2" "northeurope" "uksouth" "westeurope" "westus2" "westus3". I've seen it in all the regions but recent instances where in eastus, eastus2, centralus, and westus3

vm size: Standard_D4s_v3

In the meantime, maybe you can leverage the older MSI / image with higher timeouts, but if I remember correctly, there was the need to increase a timeout in the testing framework too?

Yes I am going to try this.

@jsturtevant
Copy link
Author

COM2 is the port opened for serial logging, and it should work AFAIK, so should the azure collector logs.

I am not seeing the logs show up with by using that API, when you do some further testing let me know if you are seeing them there or if I am using the wrong API.

@ader1990
Copy link
Member

I have also created another MSI with automatic delayed start re-enabled. I think this hybrid approach would be best while we work to reproduce the issue in isolation. https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2576044017 .

I will try to reproduce this issue and will verify the COM port behavious asap.

Thanks.

@jsturtevant
Copy link
Author

@ader1990 any luck reproducing?

Since taking the build from https://github.com/ader1990/cloudbase-init-installer-1/actions/runs/2576044017 I haven't seen any provisioning errors related to this.

@marosset
Copy link

@jsturtevant @ader1990 We've been running with these changes in CAPZ for about a month now and have seen a reduction in cluster provisioning errors. Can we get a release with these changes in it?
Thanks!

@ader1990
Copy link
Member

ader1990 commented Sep 7, 2022

Hello,

There is work in progress to get the cloudbase-init patch merged here:

https://review.opendev.org/c/x/cloudbase-init/+/856306

If you can take a look and give a +1, it would be great. I expect no other functional additions to it, maybe some nitpicking on the code review, which should not take long.

Thank you,
Adrian Vladu

@jsturtevant
Copy link
Author

I don't seem to be able to plus one there but I'll give it here. We've been running this patch for a while now and no longer see the issues.

@marosset
Copy link

marosset commented Sep 7, 2022

Same, +1 for me!

ader1990 added a commit to ader1990/cloudbase-init-1 that referenced this issue Sep 19, 2022
* Added more retries for wireserver discovery
* Added logging to wireserver discovery errors
* Fallback to the well known wireserver IP in case of discovery failure

Github Issue: cloudbase#94

Change-Id: I91460f9ad674294a83c9b3c3f05bc04ab61de56f
ociuhandu pushed a commit that referenced this issue Sep 20, 2022
* Added more retries for wireserver discovery
* Added logging to wireserver discovery errors
* Fallback to the well known wireserver IP in case of discovery failure

Github Issue: #94

Change-Id: I91460f9ad674294a83c9b3c3f05bc04ab61de56f
@ader1990
Copy link
Member

Hello @jsturtevant @marosset,

The wireserver patch has been merged, also a few other patches.
We are switching to Python 3.10, I will update here once the new MSI containing the patches and Py3.10 is released.

cloudbase/cloudbase-init-installer#28

Thank you,
Adrian Vladu

@jsturtevant
Copy link
Author

@ader1990 Any update on the release?

@yashwanth-mannem
Copy link

yashwanth-mannem commented Jan 23, 2023

@ader1990, @jsturtevant How could I download the .msi for the above patch? The github actions artifact is expired on the above mentioned links

@ader1990
Copy link
Member

Hello @yashwanth-mannem, the beta cloudbase-init installer contains the stable Python version [3.10.7] and the required patch for this issue: https://www.cloudbase.it/downloads/CloudbaseInitSetup_x64.msi

@jsturtevant can you please check that the latest version works for you?

Thank you,
Adrian Vladu

@jsturtevant
Copy link
Author

We ran it in our test suite and looks to be passing! Let's ship it 🚢!

@yashwanth-mannem
Copy link

@ader1990 The beta cloudbase-init installer provided above is failing to provision the vm successfully as the vmagent on the vm is in 'NotReady' status. Do you have insights into how to fix the issue?

@jsturtevant
Copy link
Author

looks like this got released in https://github.com/cloudbase/cloudbase-init/releases/tag/1.1.4

@ader1990 can we close this?

@marosset
Copy link

marosset commented Mar 2, 2023

+1 for closing this issue!

@rahulgupta999
Copy link

The issue is frequently reproducible in my environment. I am using v1.1.4 with Windows Server 2022 core on Azure.
The CAPZ extension timeout is 300s, which is lower than what Cloud-init retries take to fetch the wire server address. Each attempt by cloud-init is taking 7sec.
I talked to Azure support and they mentioned that these retries should not be required. The wire server address is fixed.

@jsturtevant
Copy link
Author

@rahulgupta999 I don't know that we are seeing this issue any more. Is it something different?

@rahulgupta999
Copy link

@jsturtevant I could see this issue reproducing around 4/10 times with cloudbase-init v1.1.4 build.
Finally, as a workaround, I modified the cloudbase-init installation to reduce the number of wire server endpoint detection. This way the fallback to the known IP happens faster and the CAPZ bootstrap VM extension does not timeout. After this change, I have not seen this issue pop up in over 100 VM launches.
I am not sure why this logic exists in the first place. If there is a need we should look for some other alternative. One may be simply query/ping the fixed wire server address.

I also debugged this issue with Azure Premier support and they suggested no need for such logic to detect the wire server endpoint as that is a fixed IP.

@ader1990
Copy link
Member

Hello,

If the problem persists, I can prepare a new patch that removes the logic to detect the wire server endpoint. The logic was there due to legacy reasons / compatibility with cloud-init implementation / flexibility for testing.

Thank you,
Adrian Vladu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants