
Race condition when creating machines? #402

Closed

abhinavdahiya opened this issue Sep 7, 2018 · 13 comments

Labels
libvirt bug Upstream bugs in libvirt

Comments

@abhinavdahiya
Contributor

Version Reports:

Distro version of host:

Fedora 28

Terraform Version Report

Terraform v0.11.8

Libvirt version

4.1.0

terraform-provider-libvirt plugin version (git-hash)

6c9b294


Description of Issue/Question

Setup

provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "tectonic_net" {
  name = "tectonic"

  mode   = "nat"
  bridge = "tt0"

  domain = "tt.testing"

  addresses = ["192.168.124.0/24"]

  dns = [{
    local_only = true
  }]

  autostart = true
}

locals {
  master_ips = ["192.168.124.11", "192.168.124.12"]
  worker_ips = ["192.168.124.51", "192.168.124.52"]
}

resource "libvirt_domain" "master" {
  count = "2"

  name = "master${count.index}"

  memory = "2048"
  vcpu   = "2"

  console {
    type        = "pty"
    target_port = 0
  }

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-master-${count.index}"
    addresses  = ["${local.master_ips[count.index]}"]
  }
}

resource "libvirt_domain" "worker" {
  count = "2"

  name   = "worker${count.index}"
  memory = "1024"
  vcpu   = "2"

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-worker-${count.index}"
    addresses  = ["${local.worker_ips[count.index]}"]
  }
}

Steps to Reproduce Issue

terraform init
terraform apply

Terraform should have created the tectonic network, 2 master machines, and 2 worker machines.

But terraform exits with this error:

Error: Error applying plan:

2 error(s) occurred:

* libvirt_domain.worker[0]: 1 error(s) occurred:

* libvirt_domain.worker.0: Error creating libvirt domain: virError(Code=43, Domain=19, Message='Network not found: no network with matching name 'tectonic'')
* libvirt_domain.master[0]: 1 error(s) occurred:

* libvirt_domain.master.0: Error creating libvirt domain: virError(Code=43, Domain=19, Message='Network not found: no network with matching name 'tectonic'')

Complete output

Plan: 5 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

libvirt_network.tectonic_net: Creating...
  addresses.#:      "" => "1"
  addresses.0:      "" => "192.168.124.0/24"
  autostart:        "" => "true"
  bridge:           "" => "tt0"
  dns.#:            "" => "1"
  dns.0.local_only: "" => "true"
  domain:           "" => "tt.testing"
  mode:             "" => "nat"
  name:             "" => "tectonic"
libvirt_network.tectonic_net: Creation complete after 5s (ID: 7bcaf736-3b2d-4509-ab1e-7680d6e17fbf)
libvirt_domain.worker[1]: Creating...
  arch:                             "" => "<computed>"
  emulator:                         "" => "<computed>"
  machine:                          "" => "<computed>"
  memory:                           "" => "1024"
  name:                             "" => "worker1"
  network_interface.#:              "" => "1"
  network_interface.0.addresses.#:  "" => "1"
  network_interface.0.addresses.0:  "" => "192.168.124.52"
  network_interface.0.hostname:     "" => "adahiya-worker-1"
  network_interface.0.mac:          "" => "<computed>"
  network_interface.0.network_id:   "" => "7bcaf736-3b2d-4509-ab1e-7680d6e17fbf"
  network_interface.0.network_name: "" => "<computed>"
  qemu_agent:                       "" => "false"
  running:                          "" => "true"
  vcpu:                             "" => "2"
libvirt_domain.worker[0]: Creating...
  arch:                             "" => "<computed>"
  emulator:                         "" => "<computed>"
  machine:                          "" => "<computed>"
  memory:                           "" => "1024"
  name:                             "" => "worker0"
  network_interface.#:              "" => "1"
  network_interface.0.addresses.#:  "" => "1"
  network_interface.0.addresses.0:  "" => "192.168.124.51"
  network_interface.0.hostname:     "" => "adahiya-worker-0"
  network_interface.0.mac:          "" => "<computed>"
  network_interface.0.network_id:   "" => "7bcaf736-3b2d-4509-ab1e-7680d6e17fbf"
  network_interface.0.network_name: "" => "<computed>"
  qemu_agent:                       "" => "false"
  running:                          "" => "true"
  vcpu:                             "" => "2"
libvirt_domain.master[1]: Creating...
  arch:                             "" => "<computed>"
  console.#:                        "" => "1"
  console.0.target_port:            "" => "0"
  console.0.type:                   "" => "pty"
  emulator:                         "" => "<computed>"
  machine:                          "" => "<computed>"
  memory:                           "" => "2048"
  name:                             "" => "master1"
  network_interface.#:              "" => "1"
  network_interface.0.addresses.#:  "" => "1"
  network_interface.0.addresses.0:  "" => "192.168.124.12"
  network_interface.0.hostname:     "" => "adahiya-master-1"
  network_interface.0.mac:          "" => "<computed>"
  network_interface.0.network_id:   "" => "7bcaf736-3b2d-4509-ab1e-7680d6e17fbf"
  network_interface.0.network_name: "" => "<computed>"
  qemu_agent:                       "" => "false"
  running:                          "" => "true"
  vcpu:                             "" => "2"
libvirt_domain.master[0]: Creating...
  arch:                             "" => "<computed>"
  console.#:                        "" => "1"
  console.0.target_port:            "" => "0"
  console.0.type:                   "" => "pty"
  emulator:                         "" => "<computed>"
  machine:                          "" => "<computed>"
  memory:                           "" => "2048"
  name:                             "" => "master0"
  network_interface.#:              "" => "1"
  network_interface.0.addresses.#:  "" => "1"
  network_interface.0.addresses.0:  "" => "192.168.124.11"
  network_interface.0.hostname:     "" => "adahiya-master-0"
  network_interface.0.mac:          "" => "<computed>"
  network_interface.0.network_id:   "" => "7bcaf736-3b2d-4509-ab1e-7680d6e17fbf"
  network_interface.0.network_name: "" => "<computed>"
  qemu_agent:                       "" => "false"
  running:                          "" => "true"
  vcpu:                             "" => "2"
libvirt_domain.worker[1]: Creation complete after 2s (ID: 72dae79b-dc84-4e3a-b6d0-65f478f53038)
libvirt_domain.master[1]: Creation complete after 2s (ID: ed8f694e-3c22-4d36-9e77-63187ee68ba1)

Error: Error applying plan:

2 error(s) occurred:

* libvirt_domain.worker[0]: 1 error(s) occurred:

* libvirt_domain.worker.0: Error creating libvirt domain: virError(Code=43, Domain=19, Message='Network not found: no network with matching name 'tectonic'')
* libvirt_domain.master[0]: 1 error(s) occurred:

* libvirt_domain.master.0: Error creating libvirt domain: virError(Code=43, Domain=19, Message='Network not found: no network with matching name 'tectonic'')

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
@abhinavdahiya
Contributor Author

abhinavdahiya commented Sep 7, 2018

I suspect it is a race condition because this main.tf completes successfully:

provider "libvirt" {
  uri = "qemu:///system"
}

resource "libvirt_network" "tectonic_net" {
  name = "tectonic"

  mode   = "nat"
  bridge = "tt0"

  domain = "tt.testing"

  addresses = ["192.168.124.0/24"]

  dns = [{
    local_only = true
  }]

  autostart = true
}

locals {
  master_ips = ["192.168.124.11", "192.168.124.12"]
  worker_ips = ["192.168.124.51", "192.168.124.52"]
}

resource "libvirt_domain" "master" {
  count = "2"

  name = "master${count.index}"

  memory = "2048"
  vcpu   = "2"

  console {
    type        = "pty"
    target_port = 0
  }

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-master-${count.index}"
    addresses  = ["${local.master_ips[count.index]}"]
  }
}

resource "libvirt_domain" "worker" {
  count = "2"

  name   = "worker${count.index}"
  memory = "1024"
  vcpu   = "2"

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-worker-${count.index}"
    addresses  = ["${local.worker_ips[count.index]}"]
  }

  depends_on = ["libvirt_domain.master"]
}

The depends_on entry in the worker resource forces the workers to be created only after the masters have been created, serializing the whole thing.

@wking
Contributor

wking commented Sep 7, 2018

It's surprising that serializing the domains helps with what seems (from the error message) to be a domain/network race. Does adding depends_on = ["libvirt_network.tectonic_net"] to both the master and worker domains have any effect?
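
For reference, that suggestion would look roughly like this (a sketch against the config posted above; everything except the added depends_on line is unchanged):

resource "libvirt_domain" "master" {
  count = "2"

  name = "master${count.index}"

  memory = "2048"
  vcpu   = "2"

  console {
    type        = "pty"
    target_port = 0
  }

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-master-${count.index}"
    addresses  = ["${local.master_ips[count.index]}"]
  }

  # Explicit dependency on the network; the implicit dependency through
  # network_id should normally be enough, so this only changes anything
  # if the ordering between network startup and domain creation matters.
  depends_on = ["libvirt_network.tectonic_net"]
}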

@wking
Contributor

wking commented Sep 7, 2018

It's surprising that serializing the domains helps with what seems (from the error message) to be a domain/network race.

Ah, the error message is just busted. Your full log shows libvirt_network.tectonic_net's full creation and the injection of its ID as network_interface.0.network_id in all four domains.

@wking
Contributor

wking commented Sep 7, 2018

I cannot reproduce this on RHEL 7.5's 3.10.0-891.el7.x86_64 kernel with Terraform 0.11.7, libvirt 3.9.0, QEMU 2.9.0, and terraform-provider-libvirt 7e52bbe or 6c9b294. I also failed to reproduce this with Terraform 0.11.8 and terraform-provider-libvirt 6c9b294. Maybe it's a kernel/libvirt/QEMU bug?

@MalloZup
Collaborator

MalloZup commented Sep 7, 2018

Hi @wking @abhinavdahiya, thanks for reporting.

@abhinavdahiya, can you attach the output of

TF_LOG=debug terraform apply

so we have more debugging information?

@wking, for the libvirt logs see https://wiki.libvirt.org/page/DebugLogs (it could be interesting if you find something in the libvirt logs).

Just from your logs, this looks like a race condition to me: the network is created but not yet visible to the domains, so they don't see it. But I'm curious to see the Terraform debug logs.

Thanks in advance.

@steveej

steveej commented Sep 7, 2018

I'm able to reproduce it in about 9/10 cases.

Running this on:

# system
Linux steveej-laptop 4.17.19 #1-NixOS SMP Fri Aug 24 11:07:17 UTC 2018 x86_64 GNU/Linux

# libvirt
Compiled against library: libvirt 3.10.0
Using library: libvirt 3.10.0
Using API: QEMU 3.10.0
Running hypervisor: QEMU 2.11.2

Here are the terraform and libvirt logs.

@wking
Contributor

wking commented Sep 7, 2018

Here is the libvirt bug for "Hash operation not allowed during iteration": rhbz#1576464.

@MalloZup
Collaborator

MalloZup commented Sep 7, 2018

@wking @steveej thank you for the info.

@wking yeah, this is a known and annoying bug on the libvirt side. I have also encountered it many times, and we are impacted in other projects as well... 🎶

@abhinavdahiya I don't think that reverting on your project would help in this case (from my pov).

The actual solution is to upgrade the libvirt package to one containing the patch (this is the real fix, and we will also upgrade our systems since the patch is available). We might also add a note in a future version recommending a libvirt version that contains the fix for this bug.

Afaik there are workarounds on the user side (see the sketch at the end of this comment):

  • use the parallelism option in terraform, e.g. terraform apply -parallelism=1 (I personally don't like this one)
  • use depends_on on the terraform resources (see earlier in this issue)

On the codebase:

  • use mutexes, as we already do (maybe we could find other places to apply them where possible, but this could also slow down performance)

  • do other research on it (but I think there is not much we can do against this 🐑 )
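
For completeness, a sketch of the two user-side workarounds (resource names are taken from the config posted above; the parallelism option is a CLI flag, shown here only as a comment):

# Workaround 1 (CLI): limit how many resources Terraform creates in parallel:
#   terraform apply -parallelism=1

# Workaround 2 (config): make the worker domains wait for the masters,
# exactly as in the config abhinavdahiya posted earlier in this issue:
resource "libvirt_domain" "worker" {
  count = "2"

  name   = "worker${count.index}"
  memory = "1024"
  vcpu   = "2"

  network_interface {
    network_id = "${libvirt_network.tectonic_net.id}"
    hostname   = "adahiya-worker-${count.index}"
    addresses  = ["${local.worker_ips[count.index]}"]
  }

  depends_on = ["libvirt_domain.master"]
}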

MalloZup added the "libvirt bug (Upstream bugs in libvirt)" label and removed the "need info" label Sep 7, 2018
@wking
Contributor

wking commented Sep 7, 2018

I'm running RHEL's libvirt 3.9.0-14.el7_5.6, which has a patch:

$ rpm -q --changelog libvirt | head -n3
* Tue Jun 05 2018 Jiri Denemark <jdenemar@redhat.com> - 3.9.0-14.el7_5.6
- logging: Don't inhibit shutdown in system daemon (rhbz#1573268)
- util: don't check for parallel iteration in hash-related functions (rhbz#1581364)

rhbz#1581364 is a backport to RHEL 7.5 of the rhbz#1576464 I linked earlier. The upstream commit fixing this (linked from rhbz#1576464) is 4d7384eb, which punts serialization up to the callers. I'm not sure if the issue we're seeing here is because:

a. The old check (before libvirt/libvirt@4d7384eb) was overly strict.
b. Folks seeing problems are running versions of libvirt with internal racy virhash.c consumers.
c. Folks seeing problems are running libvirt callers with racy virhash.c consumers.

(a) and (b) would be addressable by patching libvirt. (c) might be an issue with this repository.

@MalloZup, does that make sense? Do you know which case applies?

abhinavdahiya added a commit to abhinavdahiya/installer that referenced this issue Sep 7, 2018
We are seeing race-like symptoms when creating multiple domain sets in parallel.
For more info: dmacvicar/terraform-provider-libvirt#402
The issue suggests that it's most probably a libvirt bug that we can avoid by
serializing the master and worker domain sets.
@MalloZup
Collaborator

MalloZup commented Sep 10, 2018

@wking thank you for your precise comment.

From my pov it is great news that you cannot reproduce this with a patched libvirt (#402 (comment)).

I plan to update my setup to the next libvirt-devel version containing that patch, so I can verify that we don't have other issues.

I would 98.99% (the 0.99 is just for effect 😄 ) exclude hypothesis (c), i.e. race conditions caused by this repository, because of the locking mechanism we use:

the mutex mechanism works well at the pool level when we create volumes,

client.poolMutexKV.Lock(poolName)

and we always lock before refreshing pools.

waitForSuccess("error refreshing pool for volume", func() error {

So to me, once the libvirt version is high enough, the problem should disappear (I will test this this week).

For any info/questions feel free to ping me, and thanks for your collaboration and info 👍

@steveej

steveej commented Sep 10, 2018

@wking

I'm not sure if the issue we're seeing here is because:

a. The old check (before libvirt/libvirt@4d7384e) was overly strict.
b. Folks seeing problems are running versions of libvirt with internal racy virhash.c consumers.
c. Folks seeing problems are running libvirt callers with racy virhash.c consumers.

Bug 1581364, which you linked as the fix, says:

Prior to this update, guest virtual machine actions that use a python library in some cases failed and "Hash operation not allowed during iteration" error messages were logged. Several redundant thread access checks have been removed, and the problem no longer occurs.

which reads to me as case (a), and I'm not sure whether (b) is also the case because the call chain isn't obvious to me yet. I'm wondering if there's anything we can do to diminish the problem on the client side.

@wking
Contributor

wking commented Sep 11, 2018

I'm wondering if there's anything we can do to diminish the problem on the client side.

openshift/installer#226 is a client-side workaround (using the depends_on approach @abhinavdahiya floated earlier). With the libvirt patch out for months, the easiest approach is probably "bump your libvirt to pick up the (backported) patch."

@MalloZup
Collaborator

MalloZup commented Dec 15, 2018

Closing since this is resolved. Thanks to all contributors and Linux lovers, no matter which distro, for sharing the info. See you in the next issue or PR :)
