
RFC: revisit how disks are added to domains #46

Closed
flavio opened this issue Jun 30, 2016 · 4 comments

@flavio
Collaborator

flavio commented Jun 30, 2016

The current state

Right now, to add a disk to a domain, we must:

  1. declare a volume
  2. create a disk entry linked with the volume
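
For reference, a minimal sketch of the current two-step pattern (resource and file names are just illustrative):

resource "libvirt_volume" "leap" {
  name   = "leap.qcow2"
  source = "opensuse-leap.qcow2"
}

resource "libvirt_domain" "leap" {
  name = "leap"
  disk {
    volume_id = "${libvirt_volume.leap.id}"
  }
}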

The problems

These are the minor issues/annoyances:

  • accidentally sharing a volume between different domains
  • using count on a domain introduces quite a bit of complexity: you also have to use count inside the volume definition, and you have to be careful when writing the link (the terraform element syntax is not exactly user friendly); see the sketch after this list
  • a lot of typing: for each domain you have to define a dedicated volume
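
To illustrate the count point above, a rough sketch of the pattern (names are made up):

resource "libvirt_volume" "disk" {
  count  = 3
  name   = "node-${count.index}.qcow2"
  source = "opensuse-leap.qcow2"
}

resource "libvirt_domain" "node" {
  count = 3
  name  = "node-${count.index}"
  disk {
    # the element() lookup is the part that is easy to get wrong
    volume_id = "${element(libvirt_volume.disk.*.id, count.index)}"
  }
}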

The biggest issue from my POV is that the volume life cycle is somewhat disconnected from that of the domain.

Doesn't play well when rebuilding a domain

When you change a domain in a way that forces its rebuild, the following actions will happen:

  • the domain is deleted
  • the volume is not deleted
  • a new domain is created and reuses the old volume
  • the new domain is started
  • the domain's provisioners are re-executed

Bad things can happen during the last step because most of the time the provisioners assume they are running on a "vanilla" system, even though they have already run against that volume.

Doesn't play well with changes

Imagine a domain has already been provisioned and the user changes its disks. The next apply will unlink the old volumes from the domain and link the freshly created ones.

This has two problems:

  • there's no way to trigger the execution of the provisioners
  • the old volume is left inside of the pool until the user removes its definition (which might be inside of another .tf file) or until the whole infrastructure is torn down

At the same time changing a linked volume won't cause the domain provisioners to be executed again. This is IMHO a pretty big issue.

How to fix that

Terraform original design

Terraform was designed to handle cloud workloads. In this context there are two types of storage devices: ephemeral and persistent ones.

Each instance always has an ephemeral disk. The life cycle of this disk is coupled with the instance: a change that forces the rebuilding of the instance will cause the disk to be deleted, and the new instance will start with a clean slate.

Persistent storage is usually called a volume and can be attached to an instance. It is not really tied to the life cycle of an instance, hence it can be used to store really important data that has to be preserved between instance rebuilds.

The proposal

There's no way to "taint" a linked resource: we cannot mark all the volumes linked to a domain for deletion whenever the domain is rebuilt. I think the feature is missing because it would not make sense in the original context for which Terraform was designed.
We basically have to implement ephemeral and persistent storage with libvirt.

The persistent one is the simplest: we just have to use the volume resource to implement it. However, we should introduce new directives to allow the user to create a volume from nothing (like by specifying just the size). This can be useful outside of the context of this issue as well.

The ephemeral one is a bit tricky, especially from the user's point of view. We could extend the disk type to have the following attributes:

  • pool: target pool to use
  • source: image to use. This is going to be uploaded to the pool and used directly (no linked disk underneath)
  • base_volume_id: link against an already defined volume
  • base_volume_pool: look for a base volume inside of that pool
  • base_volume_name: link against a volume that is identified by a name instead of an internal terraform ID

Defining a domain would look something like this:

resource "libvirt_domain" "chameleon" {
  name = "chameleon"
  disk {
    source = "opensuse-leap.qcow2"
  }
}

This would cause the opensuse-leap.qcow2 image to be uploaded to the default pool as a volume named <domain name>.img, hence chameleon.img.

Everything is still handled as a volume from the libvirt POV; however, the domain resource takes full responsibility for managing this volume. It's just a matter of extending the create, delete and update functions of the domain resource.

The usability question

Ephemeral and persistent volumes still have to be added as disks to a domain. How can we differentiate between what is going to survive domain rebuilds and what is going to be nuked every time?

We could define persistent volumes in this way:

disk {
  volume_id = "${...}"
}

And ephemeral ones in this way:

disk {
  base_volume_{id|name} = "..."
}

However I think it could lead to confusion among users...

Feedback

What do you think about the whole thing?

If you are interested I can share a POC I made (it supports create and delete of ephemeral disks created with source).

@moio
Collaborator

moio commented Jul 1, 2016

When you change a domain in a way that it forces its rebuild the following [...] will happen
There's no way to "taint" a linked resource.
changing a linked volume won't cause the domain provisioners to be executed again

I am not sure what you mean here. In our project we taint the disk whenever we change the domain or the disk, and that actually triggers the rebuild of both the disk and the domain, which makes total sense given Terraform concepts. Eg.:

# this triggers a rebuild of the disk and its domain, including provisioners
terraform taint -module=clisles12 libvirt_volume.main_disk

Bad things can happen during the last step because most of the time the provisioners assume they are running on a "vanilla" system, even though they have already run against that volume.

We use Salt for provisioning so it is safe to execute provisioners multiple times. So if we change the domain (eg. change the amount of RAM) the provisioners will run but nothing bad happens; if we change the disk the whole domain will be rebuilt from scratch, which is the expected behaviour.

introduce new directives to allow the user to create a volume from nothing (like by specifying just the size)

That's already possible, see below.

How can we differentiate between what is going to survive domain rebuilds and what is going to be nuked every time?

In our project, as you know, we have the problem that the package-mirror node has one "base" disk for the OS and another one for packages. We simply use lifecycle { prevent_destroy = true }, which is supported by any resource, on the latter to prevent accidental destruction.

IOW users can taint the domain or the main_disk but not the data_disk.
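
A rough sketch of that pattern (names simplified, the size is illustrative and in bytes):

resource "libvirt_volume" "data_disk" {
  name = "data_disk.qcow2"
  size = 107374182400 # 100 GiB

  # terraform will error out on any plan that tries to destroy this
  # volume, so a taint of the domain or main_disk cannot take it down
  lifecycle {
    prevent_destroy = true
  }
}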

accidentally sharing a volume between different domains

Sharing of disks is actually a useful feature in some (albeit uncommon) clustered filesystem setups. I blogged about the manual steps to implement a test Oracle RAC cluster, and that's something that will at some point come to our project. From my blog:

RAC nodes will share one direct-attached virtual disk with ASM as a clustered file system on top of it. This is much easier to implement over a true SAN solution like iSCSI and serves well as a test setup.

About this last point:

a lot of typing: for each domain you have to define a dedicated volume

We solve this with Terraform modules.
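
For example, something along these lines (the module layout and variable names are hypothetical):

module "package_mirror" {
  # the module declares the libvirt_volume/libvirt_domain pair once
  # and is reused for every host
  source = "./modules/libvirt_host"
  name   = "package-mirror"
  image  = "opensuse-leap.qcow2"
}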

Bottom line: I am personally happy with the current implementation, it does all I need. I am still not sure if and why you have specific corner cases which are not presently covered. I am not against this proposal, although other items would be more important FMPOV, eg. bridged networking!

@flavio
Collaborator Author

flavio commented Jul 1, 2016

I am not sure what you mean here. In our project we taint the disk whenever we change the domain or the disk, and that actually triggers the rebuild of both the disk and the domain, which makes total sense given Terraform concepts. Eg.:

this triggers a rebuild of the disk and its domain, including provisioners

terraform taint -module=clisles12 libvirt_volume.main_disk

This is a command you issue by hand. I was talking about extending the update function of the domain to also taint all the ephemeral disks. Unfortunately, it looks like there's no way to do that from the code.

When using terraform against a cloud you don't have to do this extra step because the volume is a "built-in" attribute of the instance (like CPU/RAM). Hence a rebuild of the instance will automatically invalidate the disk.
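
For comparison, a hedged AWS sketch of what "built-in" means here (values are placeholders): the root disk is declared inline on the instance, so recreating the instance recreates the disk too.

resource "aws_instance" "example" {
  ami           = "ami-12345678" # placeholder
  instance_type = "t2.micro"

  # the root disk is part of the instance definition; it is deleted
  # and recreated together with the instance
  root_block_device {
    volume_size = 10 # GiB
  }
}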

Bad things can happen during the last step because most of the time the provisioners assume they are running on a "vanilla" system, even though they have already run against that volume.

We use Salt for provisioning so it is safe to execute provisioners multiple times. So if we change the domain (eg. change the amount of RAM) the provisioners will run but nothing bad happens; if we change the disk the whole domain will be rebuilt from scratch, which is the expected behaviour.

This is not the expected behaviour for someone managing cloud instances with terraform (see the explanation above).

As for rerunning the provisioners, I agree that Salt is the way to go (I use it too, as you know), but that should not be a requirement for everyone.
Yes, bash scripts could be made more robust to support multiple consecutive runs; however, I think terraform plays really well with the concept of phoenix deployments. That, from my POV, also explains why the provisioners are not rerun when some "secondary" attributes of the node are changed.

That also leads to the big issue I mentioned in the issue description: if you change a linked volume the domain is not rebuilt and hence the provisioners are not run.

Finally, there are some domain attributes (like RAM) that, once changed, require the node to be powered off and restarted in order to take effect. This is how libvirt works; there's no way to circumvent it.
Terraform has no concept of reboots: once you run apply, the whole infrastructure is supposed to look exactly like you requested. In these cases it would be more consistent to recreate the node from scratch. That's something worth another issue...

introduce new directives to allow the user to create a volume from nothing (like by specifying just the size)

That's already possible, see below.

Sorry, I missed the size attribute from the volume definition.
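
For anyone else who missed it, a minimal sketch (the size is in bytes, the value is illustrative):

resource "libvirt_volume" "blank" {
  name = "blank.qcow2"
  size = 10737418240 # 10 GiB
}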

How can we differentiate between what is going to survive domain rebuilds and what is going to be nuked every time?

In our project, as you know, we have the problem that the package-mirror node has one "base" disk for the OS and another one for packages. We simply use lifecycle { prevent_destroy = true }, which is supported by any resource, on the latter to prevent accidental destruction.

The question was about the new semantics we would have to introduce to have ephemeral and persistent disks.

Right now my POC implements ephemeral disks by duplicating part of the volume attributes inside of the disk resource. Persistent disks, on the other hand, can be implemented by creating a disk that points to a volume (which is the current implementation). However, I'm not particularly fond of this approach, hence the request for feedback and ideas ;)

Bottom line: I am personally happy with the current implementation, it does all I need. I am still not sure if and why you have specific corner cases which are not presently covered. I am not against this proposal, although other items would be more important FMPOV, eg. bridged networking!

I'm really happy about this project too! I would have liked to work on bridged networking but @inercia was already doing it :)

The point is: terraform has been created thinking about cloud instances, which are slightly different from VMs. We have to try to reach a good compromise between the two worlds, and I think @dmacvicar did a really good job at recreating the same "feeling"/user experience.

I'm just exposing some corner cases that might affect someone used to managing cloud instances with terraform.

@dmacvicar
Owner

I do agree this is a tricky issue and at some point I thought about making the full disk definition part of the domain.

I am not completely against it, but I don't see you convinced that this is the right solution either.

If there is a way we can reuse part of the code from the volume handling, and it is more or less elegant, I would be ok with allowing the volume definition to be "embedded" in the disk definition so that it is completely part of the domain.

Now, where I disagree with you is in the proposal:

source: image to use. This is going to be uploaded to the pool and used directly (no linked disk underneath)
base_volume_id: link against an already defined volume
base_volume_pool: look for a base volume inside of that pool
base_volume_name: link against a volume that is identified by a name instead of an internal terraform ID

What is the point of source? If you have an ephemeral disk there are only 2 possibilities:

  • You want to use a base volume (which is completely independent). In this case the .qcow2 name is private, handled by the resource, and destroyed with the domain. The user should not even know about it; size is determined by the base image.
  • You don't want to use a base volume (eg. preinstalled OS), then the only relevant attribute is size. The volume itself is created and destroyed with the domain and the user should not even know much about it.

In both cases the user may want to say in which pool the volume is, but I am not even sure about this. The provider could create a private volume only for these disks if we want to make things simpler.

So only the base_volume[id|pool|name] or size are relevant IMHO.
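
To make the two cases concrete, a rough sketch of what the disk block could look like (these attributes follow the proposal in this thread and are not implemented):

resource "libvirt_domain" "chameleon" {
  name = "chameleon"

  # case 1: ephemeral disk backed by a base volume; the backing .qcow2
  # is private to the resource and destroyed with the domain
  disk {
    base_volume_name = "opensuse-leap"
  }

  # case 2: no base volume, only the size matters (bytes, illustrative)
  disk {
    size = 10737418240
  }
}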

@MalloZup
Collaborator

Closing this issue since we didn't come up with a real pragmatic implementation.

Feel free to reopen in case we have a plan in the short term.
