Context Deadline Exceeded only on certain resources #1401

Closed
micsport13 opened this issue Jun 17, 2024 · 11 comments
Labels
🐛 bug Something isn't working topic:import

Comments

@micsport13

Describe the bug
I'm unable to make changes to resources because I get an error stating that there was an issue retrieving the status of the VM/container. The only information I get back is "context deadline exceeded".

To Reproduce
Steps to reproduce the behavior:

  1. Create 2 VM resources
  2. Edit one of the VM resources to trigger an update
  3. Terraform Apply
  4. See Error

Please also provide a minimal Terraform configuration that reproduces the issue.

resource "proxmox_virtual_environment_vm" "TestArch" {
  acpi            = true
  bios            = "seabios"
  description     = <<-EOF
  A test arch vm box
  EOF
  keyboard_layout = null
  kvm_arguments   = null
  mac_addresses   = ["3A:88:CC:4D:B3:A3"]
  name            = "TestArch"
  node_name       = "server"
  protection      = false
  scsi_hardware   = "virtio-scsi-single"
  started         = false
  tablet_device   = true
  tags            = []
  template        = false
  vm_id           = 102

  cpu {
    affinity     = null
    architecture = "x86_64"
    cores        = 4
    flags        = []
    hotplugged   = 0
    limit        = 0
    numa         = false
    sockets      = 1
    type         = "qemu64"
    units        = 1024
  }

  cdrom {
    enabled   = true
    file_id   = "lvm:iso/archlinux-2023.03.01-x86_64.iso"
    interface = "ide2"
  }

  disk {
    aio               = "io_uring"
    backup            = true
    cache             = "none"
    datastore_id      = "Data_Pool"
    discard           = "ignore"
    file_format       = "raw"
    file_id           = null
    interface         = "scsi0"
    iothread          = true
    path_in_datastore = "vm-102-disk-0"
    replicate         = true
    serial            = null
    size              = 32
    ssd               = false
  }

  memory {
    dedicated      = 4096
    floating       = 0
    hugepages      = null
    keep_hugepages = false
    shared         = 0
  }

  network_device {
    bridge       = "vmbr0"
    disconnected = false
    enabled      = true
    firewall     = true
    mac_address  = "3A:88:CC:4D:B3:A3"
    model        = "virtio"
    mtu          = 0
    queues       = 0
    rate_limit   = 0
    trunks       = null
    vlan_id      = 0
  }

  operating_system {
    type = "l26"
  }

}

resource "proxmox_virtual_environment_vm" "arch-test" {
  acpi          = true
  bios          = "seabios"
  description   = "Arch latest, generated on 2023-12-19T05:51:38Z"
  name          = "arch-test"
  node_name     = "server"
  protection    = false
  scsi_hardware = "lsi"
  started       = false
  tablet_device = true
  tags          = ["arch", "vm", "testing"]
  template      = false
  vm_id         = 105

  agent {
    enabled = true
    timeout = "15m"
    trim    = false
  }

  cpu {
    affinity     = null
    architecture = "x86_64"
    cores        = 4
    hotplugged   = 0
    limit        = 0
    numa         = false
    sockets      = 1
    type         = "kvm64"
    units        = 1024
  }

  disk {
    aio               = "io_uring"
    backup            = true
    cache             = "none"
    datastore_id      = "Data_Pool"
    discard           = "ignore"
    file_format       = "raw"
    file_id           = null
    interface         = "virtio0"
    iothread          = false
    path_in_datastore = "vm-105-disk-0"
    replicate         = true
    serial            = null
    size              = 50
    ssd               = false
  }



  memory {
    dedicated      = 4096
    floating       = 0
    keep_hugepages = false
    shared         = 0
  }

  network_device {
    bridge       = "vmbr0"
    disconnected = false
    enabled      = true
    firewall     = false
    model        = "e1000"
    mtu          = 0
    queues       = 0
    rate_limit   = 0
    vlan_id      = 0
  }

}

Expected behavior
Both instances can be modified

Additional context
All of my resources were imported. Somehow, modifying resources in place doesn't cause issues, but adding/destroying does. Since some of the configuration changes didn't allow for in-place modification, they trigger a replace, and from there I just get the generic "context deadline exceeded". It appears that modification doesn't trigger the API failure, but recreating does. Additionally, I ran pvesh get and it didn't return any errors; nothing appeared out of order.
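
For reference, the kind of check I ran looked roughly like this (run on the PVE node itself; the IDs and paths match the ones in the error logs below):

  pvesh get /nodes/server/lxc/101/status/current
  pvesh get /nodes/server/qemu/105/status/current

Both returned a normal status payload with no errors.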

  • Single or clustered Proxmox: Single
  • Proxmox version: 8.2.2
  • Provider version (ideally it should be the latest version): 0.60.0
  • Terraform/OpenTofu version: v1.8.5
  • OS (where you run Terraform/OpenTofu from):
  • Debug logs (TF_LOG=DEBUG terraform apply):
    Snippet of logs:
2024-06-17T00:05:03.712-0600 [ERROR] provider.terraform-provider-proxmox_v0.60.0: Response contains error diagnostic: diagnostic_severity=ERROR tf_req_id=e77696ef-cc05-f29f-5da2-ba40bd02351f tf_resource_type=proxmox_virtual_environment_container @caller=github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov6/internal/diag/diagnostics.go:58 @module=sdk.proto diagnostic_detail="" diagnostic_summary="error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/104/status/current) - Reason: Get \"https://10.0.0.100:8006/api2/json/nodes/server/lxc/104/status/current\": context deadline exceeded" tf_proto_version=6.6 tf_provider_addr=registry.terraform.io/bpg/proxmox tf_rpc=ApplyResourceChange timestamp=2024-06-17T00:05:03.712-0600
2024-06-17T00:05:03.712-0600 [ERROR] provider.terraform-provider-proxmox_v0.60.0: Failed to parse request bytes for logging: error="context deadline exceeded" tf_http_trans_id=24afb185-bba1-14ed-b759-f89ac5e43b7e tf_provider_addr=registry.terraform.io/bpg/proxmox tf_req_id=35e04b98-b007-e8f3-6138-8b59671f33cc tf_rpc=ApplyResourceChange tf_mux_provider=tf5to6server.v5tov6Server tf_resource_type=proxmox_virtual_environment_container @caller=github.com/hashicorp/terraform-plugin-sdk/v2@v2.34.0/helper/logging/logging_http_transport.go:170 @module=proxmox timestamp=2024-06-17T00:05:03.712-0600
2024-06-17T00:05:03.712-0600 [ERROR] provider.terraform-provider-proxmox_v0.60.0: Response contains error diagnostic: diagnostic_severity=ERROR tf_proto_version=6.6 tf_provider_addr=registry.terraform.io/bpg/proxmox tf_req_id=43d1151d-b471-f796-9f4c-b79455962645 @caller=github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov6/internal/diag/diagnostics.go:58 @module=sdk.proto diagnostic_detail="" diagnostic_summary="error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/101/status/current) - Reason: Get \"https://10.0.0.100:8006/api2/json/nodes/server/lxc/101/status/current\": context deadline exceeded" tf_resource_type=proxmox_virtual_environment_container tf_rpc=ApplyResourceChange timestamp=2024-06-17T00:05:03.712-0600
2024-06-17T00:05:03.713-0600 [ERROR] provider.terraform-provider-proxmox_v0.60.0: Response contains error diagnostic: @caller=github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov6/internal/diag/diagnostics.go:58 @module=sdk.proto diagnostic_detail="" diagnostic_severity=ERROR diagnostic_summary="error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/103/status/current) - Reason: Get \"https://10.0.0.100:8006/api2/json/nodes/server/lxc/103/status/current\": context deadline exceeded" tf_proto_version=6.6 tf_provider_addr=registry.terraform.io/bpg/proxmox tf_req_id=35e04b98-b007-e8f3-6138-8b59671f33cc tf_resource_type=proxmox_virtual_environment_container tf_rpc=ApplyResourceChange timestamp=2024-06-17T00:05:03.713-0600
2024-06-17T00:05:03.714-0600 [ERROR] provider.terraform-provider-proxmox_v0.60.0: Failed to parse request bytes for logging: @caller=github.com/hashicorp/terraform-plugin-sdk/v2@v2.34.0/helper/logging/logging_http_transport.go:170 @module=proxmox tf_mux_provider=tf5to6server.v5tov6Server tf_provider_addr=registry.terraform.io/bpg/proxmox error="context deadline exceeded" tf_http_trans_id=e0fc6211-1252-2c90-1c33-70aa826efd6f tf_req_id=3dbed1ba-fc44-3ac6-0331-629836be7f90 tf_resource_type=proxmox_virtual_environment_vm tf_rpc=ApplyResourceChange timestamp=2024-06-17T00:05:03.714-0600
2024-06-17T00:05:03.715-0600 [ERROR] provider.terraform-provider-proxmox_v0.60.0: Response contains error diagnostic: diagnostic_detail="" diagnostic_severity=ERROR tf_resource_type=proxmox_virtual_environment_vm @module=sdk.proto diagnostic_summary="error retrieving VM status: failed to perform HTTP GET request (path: nodes/server/qemu/105/status/current) - Reason: Get \"https://10.0.0.100:8006/api2/json/nodes/server/qemu/105/status/current\": context deadline exceeded" tf_proto_version=6.6 tf_provider_addr=registry.terraform.io/bpg/proxmox tf_req_id=3dbed1ba-fc44-3ac6-0331-629836be7f90 tf_rpc=ApplyResourceChange @caller=github.com/hashicorp/terraform-plugin-go@v0.23.0/tfprotov6/internal/diag/diagnostics.go:58 timestamp=2024-06-17T00:05:03.715-0600
2024-06-17T00:05:03.741-0600 [DEBUG] State storage *statemgr.Filesystem declined to persist a state snapshot
2024-06-17T00:05:03.741-0600 [ERROR] vertex "proxmox_virtual_environment_container.postgres (destroy)" error: error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/101/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/lxc/101/status/current": context deadline exceeded
2024-06-17T00:05:03.775-0600 [DEBUG] State storage *statemgr.Filesystem declined to persist a state snapshot
2024-06-17T00:05:03.775-0600 [ERROR] vertex "proxmox_virtual_environment_container.syncthing (destroy)" error: error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/104/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/lxc/104/status/current": context deadline exceeded
2024-06-17T00:05:03.808-0600 [DEBUG] State storage *statemgr.Filesystem declined to persist a state snapshot
2024-06-17T00:05:03.808-0600 [ERROR] vertex "proxmox_virtual_environment_vm.arch-test (destroy)" error: error retrieving VM status: failed to perform HTTP GET request (path: nodes/server/qemu/105/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/qemu/105/status/current": context deadline exceeded
2024-06-17T00:05:03.833-0600 [DEBUG] State storage *statemgr.Filesystem declined to persist a state snapshot
2024-06-17T00:05:03.833-0600 [ERROR] vertex "proxmox_virtual_environment_container.mariadb (destroy)" error: error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/103/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/lxc/103/status/current": context deadline exceeded
╷
│ Error: error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/101/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/lxc/101/status/current": context deadline exceeded
│
│
╵
╷
│ Error: error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/103/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/lxc/103/status/current": context deadline exceeded
│
│
╵
╷
│ Error: error retrieving container status: failed to perform HTTP GET request (path: nodes/server/lxc/104/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/lxc/104/status/current": context deadline exceeded
│
│
╵
╷
│ Error: error retrieving VM status: failed to perform HTTP GET request (path: nodes/server/qemu/105/status/current) - Reason: Get "https://10.0.0.100:8006/api2/json/nodes/server/qemu/105/status/current": context deadline exceeded
│
│
╵

@micsport13 micsport13 added the 🐛 bug Something isn't working label Jun 17, 2024
@bpg
Owner

bpg commented Jun 18, 2024

I'm not quite sure what's going on here. The log does not match the description of the use case.

According to the log, Terraform is trying to delete four resources simultaneously: containers 101, 103, and 104, and VM 105.

Are they all provisioned on the same datastore, Data_Pool? What type of datastore is it? Do you see any significant spikes in the IO delay metrics on the node?
Do you see anything suspicious in the syslog?

You can also try running the apply with -parallelism=1 to update one resource at a time and see if it reduces contention.
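
For example (adjust as needed):

  terraform apply -parallelism=1

That way only one API operation is in flight at a time.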

@bpg bpg added the ⌛ pending author's response Requested additional information from the reporter label Jun 18, 2024
@micsport13
Author

It's a ZFS pool. I'll have to take a look and see if that's what's happening. Is it documented anywhere what forces a container recreation?

@bpg
Owner

bpg commented Jun 18, 2024

You can run terraform plan first; it will explain what operations Terraform is going to perform and, if any resource is being re-created, which attribute changes triggered it.
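
For example, to quickly spot which attributes are forcing re-creation, something along these lines works:

  terraform plan -no-color | grep -B2 "forces replacement"

Each offending attribute is flagged with a "# forces replacement" comment in the plan output.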

@bpg
Owner

bpg commented Jun 18, 2024

Also, take a look at #995; there are some interesting bits there. If IO is a bottleneck, you may get better performance by tweaking the VM storage / interface types. I noticed you're using a mix of scsi and virtio in your VMs.

@micsport13
Author

Still no luck. I tried it with -parallelism=1 and I don't see any spikes in the IO delay metrics. Is there a specific set of logs I can look at that might help pin down the problem?

@bpg
Owner

bpg commented Jun 18, 2024

"context deadline exceeded" is a suspicious error, though. It usually occurs when there are connectivity issues between the client and the server. Are the PVE node and the host you're running Terraform from on the same network?

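A quick way to rule out basic reachability from the host running Terraform (host and port taken from your logs) would be something along these lines:

  curl -k -sS -o /dev/null -w '%{http_code}\n' --max-time 10 https://10.0.0.100:8006/

If that hangs or regularly takes several seconds to respond, the deadline errors are more likely a network/TLS problem than a provider bug.
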
@micsport13
Author

Yes, but what's odd is that I can modify resources just fine. It's just the destroy and recreate that is failing.

@bpg
Owner

bpg commented Jun 18, 2024

Could you run just terraform plan for your resources, and post the output here?

@micsport13
Author


Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # proxmox_virtual_environment_container.mariadb must be replaced
-/+ resource "proxmox_virtual_environment_container" "mariadb" {
      ~ id             = "103" -> (known after apply)
      + start_on_boot  = true
        tags           = []
      + timeout_clone  = 1800
      + timeout_create = 1800
      + timeout_delete = 60
      + timeout_start  = 300
      + timeout_update = 1800
      + unprivileged   = true # forces replacement
      + vm_id          = 103
        # (4 unchanged attributes hidden)

      ~ initialization {
            # (1 unchanged attribute hidden)

          - ip_config {
              - ipv4 {
                  - address = "10.0.0.103/24" -> null
                  - gateway = "10.0.0.1" -> null
                }
            }
        }

      ~ operating_system {
          + template_file_id = "local:vztmpl/archlinux-base_2023-06-08-1_amd64.tar.zst" # forces replacement
            # (1 unchanged attribute hidden)
        }

        # (5 unchanged blocks hidden)
    }

  # proxmox_virtual_environment_container.postgres must be replaced
-/+ resource "proxmox_virtual_environment_container" "postgres" {
      ~ id             = "101" -> (known after apply)
      + start_on_boot  = true
        tags           = []
      + timeout_clone  = 1800
      + timeout_create = 1800
      + timeout_delete = 60
      + timeout_start  = 300
      + timeout_update = 1800
      + unprivileged   = true # forces replacement
      + vm_id          = 101
        # (4 unchanged attributes hidden)

      ~ initialization {
            # (1 unchanged attribute hidden)

          - ip_config {
              - ipv4 {
                  - address = "10.0.0.101/24" -> null
                  - gateway = "10.0.0.1" -> null
                }
            }
        }

      ~ operating_system {
          + template_file_id = "local:vztmpl/archlinux-base_2023-06-08-1_amd64.tar.zst" # forces replacement
            # (1 unchanged attribute hidden)
        }

        # (5 unchanged blocks hidden)
    }

  # proxmox_virtual_environment_container.syncthing must be replaced
-/+ resource "proxmox_virtual_environment_container" "syncthing" {
      ~ id             = "104" -> (known after apply)
      + start_on_boot  = true
      ~ tags           = [
          + "arch",
          + "sycnthing",
        ]
      + timeout_clone  = 1800
      + timeout_create = 1800
      + timeout_delete = 60
      + timeout_start  = 300
      + timeout_update = 1800
      + unprivileged   = false # forces replacement
      + vm_id          = 104
        # (4 unchanged attributes hidden)

      ~ operating_system {
          + template_file_id = "local:vztmpl/archlinux-base_2023-06-08-1_amd64.tar.zst" # forces replacement
            # (1 unchanged attribute hidden)
        }

        # (5 unchanged blocks hidden)
    }

@bpg
Owner

bpg commented Jun 18, 2024

OK, the imported state is clearly messed up. I suspect the issue is with the timeouts: they were not defined in the initial import, and now Terraform is trying to add them. That means the current timeout value during apply can be ridiculously small, like a nanosecond, which could explain the "context deadline" error.

But regardless, after the apply all of your resources are going to be re-created because of the discrepancies between the imported state and what is currently defined in the config.
The import functionality of the provider is pretty much untested, as it's not something I'm actively using. I don't have a good solution for your current situation, except perhaps manually editing the Terraform state to reconcile it with the current config, which is not straightforward by any means.
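
If you do want to poke at it, a rough sketch of that approach would be something like the following (addresses taken from your plan output; treat it as illustrative and keep a backup of the state):

  # see what the imported state actually contains, e.g. whether the timeouts are populated
  terraform state show proxmox_virtual_environment_container.postgres

  # pull a copy, hand-edit a working copy to reconcile it with the config, then push it back
  terraform state pull > state-backup.json
  terraform state push <edited-state-file>

But again, this is fiddly and easy to get wrong.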

@bpg bpg added topic:import and removed ⌛ pending author's response Requested additional information from the reporter labels Jun 18, 2024
@micsport13
Author

If you want, I can at least post the state for one of the containers, so if this comes up again you might have some insight into what the problem is. I'll just remove them from the state and recreate them.

resource "proxmox_virtual_environment_container" "postgres" {
    description = null
    id          = "101"
    node_name   = "server"
    started     = false
    tags        = []
    template    = false

    console {
        enabled   = true
        tty_count = 2
        type      = "tty"
    }

    cpu {
        architecture = "amd64"
        cores        = 2
        units        = 1024
    }

    disk {
        datastore_id = "Data_Pool"
        size         = 15
    }

    initialization {
        hostname = "postgres"

        ip_config {
            ipv4 {
                address = "10.0.0.101/24"
                gateway = "10.0.0.1"
            }
        }
    }

    memory {
        dedicated = 512
        swap      = 512
    }

    network_interface {
        bridge      = "vmbr0"
        enabled     = true
        firewall    = true
        mac_address = "<mac_address>"
        mtu         = 0
        name        = "eth0"
        rate_limit  = 0
        vlan_id     = 0
    }

    operating_system {
        template_file_id = null
        type             = "archlinux"
    }
}
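
For completeness, dropping them from state before re-creating them is just something along these lines (same resource addresses as in the plan above):

  terraform state rm proxmox_virtual_environment_container.postgres
  terraform state rm proxmox_virtual_environment_container.mariadb
  terraform state rm proxmox_virtual_environment_container.syncthing
  terraform state rm proxmox_virtual_environment_vm.arch-test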
