CI: Conformance AKS workflow broken due to az tooling issue #32038

Closed
joestringer opened this issue Apr 18, 2024 · 4 comments
Labels
area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me!

Comments

@joestringer
Member

CI failure

Example failure: https://github.com/cilium/cilium/actions/runs/8730231562/job/23953682540

Run # Create group
  # Create group
  az group create \
    --name cilium-cilium-8730231562-1-3 \
    --location eastus2 \
    --tags usage=cilium-cilium owner=32036
  
  # Create AKS cluster
  az aks create \
    --resource-group cilium-cilium-8730231562-1-3 \
    --name cilium-cilium-8730231562-1-3 \
    --location eastus2 \
    --kubernetes-version 1.28 \
    --network-plugin none \
    --node-count 2 \
    --node-vm-size Standard_B2s --node-osdisk-size 30 \
    --generate-ssh-keys
  shell: /usr/bin/bash -e {0}
  env:
    name: cilium-cilium-8730231562-1-3
    cost_reduction: --node-vm-size Standard_B2s --node-osdisk-size 30
    cilium_cli_ci_version: 
    job_name: Installation and Connectivity Test
    QUAY_ORGANIZATION: cilium
    QUAY_ORGANIZATION_DEV: cilium
    QUAY_CHARTS_ORGANIZATION_DEV: cilium-charts-dev
    EGRESS_GATEWAY_HELM_VALUES: --helm-set=egressGateway.enabled=true
    CILIUM_CLI_RELEASE_REPO: cilium/cilium-cli
    CILIUM_CLI_VERSION: v0.16.4
    PUSH_TO_DOCKER_HUB: true
    GCP_PERF_RESULTS_BUCKET: gs://cilium-scale-results
    KIND_VERSION: v0.22.0
    KIND_K8S_IMAGE: kindest/node:v1.29.2@sha256:51a1434a5397193442f0be2a297b488b6c919ce8a3931be0ce822606ea5ca245
    KIND_K8S_VERSION: v1.29.2
{
  "id": "/subscriptions/986ec55c-1e77-4e2e-9ca5-dcbb34c7a110/resourceGroups/cilium-cilium-8730231562-1-3",
  "location": "eastus2",
  "managedBy": null,
  "name": "cilium-cilium-8730231562-1-3",
  "properties": {
    "provisioningState": "Succeeded"
  },
  "tags": {
    "owner": "32036",
    "usage": "cilium-cilium"
  },
  "type": "Microsoft.Resources/resourceGroups"
}
WARNING: The behavior of this command has been altered by the following extension: aks-preview
WARNING: SSH key files '/home/runner/.ssh/id_rsa' and '/home/runner/.ssh/id_rsa.pub' have been generated under ~/.ssh to allow SSH access to the VM. If using machines without permanent storage like Azure Cloud Shell without an attached file share, back up your keys to a safe location
ERROR: The command failed with an unexpected error. Here is the traceback:
ERROR: Invalid HTTP date
Traceback (most recent call last):
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_utils.py", line 57, in parse_retry_after
    delay = int(retry_after)
            ^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '159,120'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/az/lib/python3.11/site-packages/knack/cli.py", line 233, in invoke
    cmd_result = self.invocation.execute(args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/core/commands/__init__.py", line 664, in execute
    raise ex
  File "/opt/az/lib/python3.11/site-packages/azure/cli/core/commands/__init__.py", line 731, in _run_jobs_serially
    results.append(self._run_job(expanded_arg, cmd_copy))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/core/commands/__init__.py", line 701, in _run_job
    result = cmd_copy(params)
             ^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/core/commands/__init__.py", line 334, in __call__
    return self.handler(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/core/commands/command_operation.py", line 121, in handler
    return op(**command_args)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/az/azcliextensions/aks-preview/azext_aks_preview/custom.py", line 680, in aks_create
    return aks_create_decorator.create_mc(mc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/command_modules/acs/managed_cluster_decorator.py", line 6818, in create_mc
    cluster = self.put_mc(mc)
              ^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/command_modules/acs/managed_cluster_decorator.py", line 6796, in put_mc
    cluster = sdk_no_wait(
              ^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/cli/core/util.py", line 710, in sdk_no_wait
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/tracing/decorator.py", line 76, in wrapper_use_tracer
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/azcliextensions/aks-preview/azext_aks_preview/vendored_sdks/azure_mgmt_preview_aks/v2024_02_02_preview/operations/_managed_clusters_operations.py", line 2110, in begin_create_or_update
    raw_result = self._create_or_update_initial(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/azcliextensions/aks-preview/azext_aks_preview/vendored_sdks/azure_mgmt_preview_aks/v2024_02_02_preview/operations/_managed_clusters_operations.py", line 1988, in _create_or_update_initial
    pipeline_response: PipelineResponse = self._client._pipeline.run(  # pylint: disable=protected-access
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/_base.py", line 213, in run
    return first_node.send(pipeline_request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/_base.py", line 70, in send
    response = self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^
  [Previous line repeated 2 more times]
  File "/opt/az/lib/python3.11/site-packages/azure/mgmt/core/policies/_base.py", line 47, in send
    response = self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_redirect.py", line 181, in send
    response = self.next.send(request)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_retry.py", line 471, in send
    self.sleep(retry_settings, request.context.transport, response=response)
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_retry.py", line 440, in sleep
    slept = self._sleep_for_retry(response, transport)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_retry.py", line 407, in _sleep_for_retry
    retry_after = self.get_retry_after(response)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_retry.py", line 141, in get_retry_after
    return _utils.get_retry_after(response)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_utils.py", line 76, in get_retry_after
    return parse_retry_after(retry_after)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_utils.py", line 60, in parse_retry_after
    retry_date = _parse_http_date(retry_after)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/az/lib/python3.11/site-packages/azure/core/pipeline/policies/_utils.py", line 43, in _parse_http_date
    raise ValueError("Invalid HTTP date")
ValueError: Invalid HTTP date
To check existing issues, please visit: https://github.com/Azure/azure-cli/issues
Error: Process completed with exit code 1.
@joestringer joestringer added area/CI Continuous Integration testing issue or flake ci/flake This is a known failure that occurs in the tree. Please investigate me! labels Apr 18, 2024
@marseel
Contributor

marseel commented Apr 22, 2024

Fixed by #32118

@marseel marseel closed this as completed Apr 22, 2024
@marseel marseel reopened this Apr 23, 2024
@marseel
Contributor

marseel commented Apr 23, 2024

CI failures always seem to occur around the same time. I also observed a similar issue in the cilium-cli GitHub Action, but with a slightly different error:

Client assertion is not within its valid time range. Current time: 2024-04-20T18:29:43.3183451Z, assertion valid from 2024-04-20T18:02:11.0000000Z, expiry time of assertion **2024-04-20T18:07:11.0000000Z.**

This led me to find this issue: Azure/login#372 (comment)

@marseel
Contributor

marseel commented Apr 23, 2024

Okay, I dug a bit deeper.

TL;DR: the Azure API returns retry-after as a float with a comma 🤷
There was a recent fix in azure-sdk-for-python for handling floats, but it won't work for floats with commas like the one we have here.

So I've opened a new issue in azure-sdk-for-python: Azure/azure-sdk-for-python#35314
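
To illustrate the failure mode: `int()` rejects the value `'159,120'` seen in the traceback, and the HTTP-date fallback rejects it as well. Below is a minimal Python sketch of a more lenient parser. `parse_retry_after_lenient` is a hypothetical helper, not part of azure-core, and reading the comma as a decimal separator is an assumption based on the description above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after_lenient(value: str, now: Optional[datetime] = None) -> float:
    """Return a delay in seconds for a Retry-After value (hypothetical helper)."""
    text = value.strip()
    try:
        # Assumption: a comma is a decimal separator, so '159,120' means 159.120.
        return max(0.0, float(text.replace(",", ".")))
    except ValueError:
        pass
    # Otherwise expect an HTTP date such as 'Wed, 21 Oct 2015 07:28:00 GMT'.
    # parsedate_to_datetime raises if the text is not a parsable date.
    retry_date = parsedate_to_datetime(text)
    if retry_date.tzinfo is None:
        # Treat a date without zone information as UTC.
        retry_date = retry_date.replace(tzinfo=timezone.utc)
    current = now or datetime.now(timezone.utc)
    return max(0.0, (retry_date - current).total_seconds())
```

With this sketch, the value from the log parses to a delay of roughly 159 seconds instead of raising `ValueError`.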

It will probably take quite a while for an upstream fix to reach us, so I will work on these two ideas:

  • Increase some quota in Azure so we don't get retry-after for API calls
  • Spread cluster creations over time
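
As an illustration of the second idea, one simple way to spread cluster creations is to give each job an evenly spaced start offset inside a fixed window. This is only a sketch: `stagger_delay`, the job index, and the window length are hypothetical and not taken from the actual workflow change.

```python
def stagger_delay(job_index: int, total_jobs: int, window_seconds: float) -> float:
    """Return how long job `job_index` should wait before creating its cluster.

    Jobs are spaced evenly across `window_seconds`, so they do not all call
    the cloud API at the same moment. All names here are illustrative.
    """
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    if not 0 <= job_index < total_jobs:
        raise ValueError("job_index must be in range(total_jobs)")
    return window_seconds * job_index / total_jobs
```

Each job would sleep for its offset before running `az aks create`, which lowers the number of simultaneous API calls and so the chance of being throttled.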

marseel added a commit that referenced this issue Apr 23, 2024
This causes a few issues with cloud-providers based workflows:
- GKE - we were hitting quota issues: https://github.com/cilium/cilium/actions/runs/8746299915/job/24002950173
- AKS - we are hitting similar throttling on API in Azure, which is triggering #32038

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
github-merge-queue bot pushed a commit that referenced this issue Apr 23, 2024
This causes a few issues with cloud-providers based workflows:
- GKE - we were hitting quota issues: https://github.com/cilium/cilium/actions/runs/8746299915/job/24002950173
- AKS - we are hitting similar throttling on API in Azure, which is triggering #32038

Signed-off-by: Marcel Zieba <marcel.zieba@isovalent.com>
@marseel
Contributor

marseel commented Apr 24, 2024

Okay, I tried to find out whether the Azure quota for such requests can be increased, but apparently it can't.

Anyway, spreading the workflows out worked like a charm.
All runs passed, apart from a few that hit genuine connectivity issues.

@marseel marseel closed this as completed Apr 24, 2024