adding update to /etc/hosts (#16)
* summary of changes

- We have a strategy for /etc/hosts
- _One_ curve cert is generated by the main pod and shared with the other pods via a persistent volume
  - If we use a cloud K8s we might be able to use a persistent volume claim, and we should be able to switch between the two
- A flux user is created in each container; this assumes a particular user-creation command, which might not exist depending on the base OS (we might need different templates for different OSes and ask the user as part of the MiniCluster config)
- Events are moved into events.go and can be deleted if we don't need them. If this is the case we can also move the Client from an attribute on the Reconciler to be inherited (e.g., r.Get vs r.Client.Get)
- My first attempt at a podExec function is moved into extra.go - if we need/want this we can debug further, otherwise it's not used and safe to delete.
- Having the wait / startup script generate the certificate (given only the main pod hostname) made the initContainers redundant, so I removed them.
- Configs/templates are moved into templates.go so they are easier to find.
- GetHostfileConfig is now GetConfigMap (and more generalized)
- Listed pods are now sorted by name so they are returned consistently
- A ConfigMap volume at `/flux_operator` is where we are writing the entrypoint script (wait.sh), the start script, and the update_hosts.sh script.

* adding update to /etc/hosts

this is a bit of a hack that adds a script wrapper to the pod start;
the wrapper waits until it sees a file populated with ip addresses
(or more specifically, with echoes that update /etc/hosts). I think
we can do this because when the pod is re-created, the ip address
does not change! While the wrapper is waiting, a config map is
updated with the (now known) ip addresses. This seems to allow
/etc/hosts to be correctly populated (determined by ping working),
and I think next I need to debug why the broker still thinks it
is waiting.

* clean up unused functions

* separating main node (to generate cert and start) from workers

my flux still is not connecting ("Unable to connect to Flux"),
so I need to debug this. But I think it is more correct that
only one of the nodes runs the start command and generates the
certificate. Technically, if this node knows about the other
nodes it can use them, so we should not need to run the command
multiple times.

* ensure we use a persistent volume for curve

with the emptyDir strategy each node had its own mount. If
we use a persistent volume (or claim), each node has access to
the same certificate, and we do not need to worry about a race
because it is written by just one hostname.
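
The generated start logic is not shown here (see controllers/flux/templates.go),
but the single-writer idea is roughly the following sketch, where `MAIN_HOST`
is a hypothetical stand-in for the main pod hostname (typically `<name>-0`):

```bash
#!/bin/bash
# Sketch only: the main pod generates the curve certificate on the
# shared volume; every other pod waits for it to appear.
cert=/mnt/curve/curve.cert
if [ "$(hostname)" = "${MAIN_HOST}" ]; then
    flux keygen "${cert}"
else
    while [ ! -f "${cert}" ]; do
        echo "$(hostname) is sleeping waiting for main flux node"
        sleep 5
    done
fi
```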

* good state - we have flux almost running, quorum delayed

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Sep 12, 2022
1 parent 260fd38 commit 6f54a39
Showing 14 changed files with 691 additions and 385 deletions.
15 changes: 15 additions & 0 deletions Makefile
@@ -106,11 +106,26 @@ vet: ## Run go vet against code.
test: manifests generate fmt vet envtest ## Run tests.
	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) -p path)" go test ./... -coverprofile cover.out

.PHONY: list
list:
	kubectl get -n flux-operator pods

.PHONY: reset
reset:
	minikube stop
	minikube delete
	minikube start
	kubectl create namespace flux-operator
	make install
	make redo

.PHONY: clean
clean:
	kubectl delete -n flux-operator svc --all
	kubectl delete -n flux-operator secret --all
	kubectl delete -n flux-operator cm --all
	kubectl delete -n flux-operator pvc --all
	kubectl delete -n flux-operator pv --all
	kubectl delete -n flux-operator pods --all
	kubectl delete -n flux-operator jobs --all
	kubectl delete -n flux-operator MiniCluster --all
84 changes: 82 additions & 2 deletions README.md
@@ -36,8 +36,8 @@ And you can find the following:

- A **MiniCluster** is an [indexed job](https://kubernetes.io/docs/tasks/job/indexed-parallel-processing-static/) so we can create N copies of the "same" base containers (each with flux, and the connected workers in our cluster)
- The flux config is written to a volume at `/etc/flux/config` (created via a config map) as a brokers.toml file.
- We use an [initContainer](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) with an Empty volume (shared between init and worker) to generate the curve certificates (`/mnt/curve/curve.cert`). The broker sees them via the definition of that path in the broker.toml in our config directory mentioned above.
- TODO we need to figure out how the pods can see one another (TBA)
- We use an [initContainer](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) with an emptyDir volume (shared between init and worker) to generate the curve certificates (`/mnt/curve/curve.cert`). The broker sees them via the definition of that path in the broker.toml in our config directory mentioned above. Currently every container generates its own curve.cert, so this needs to be updated to have just one.
- Networking is a bit of a hack - we have a wrapper start script that essentially waits until a file is populated with hostnames. While it waits, the pods are being created and allocated IP addresses, and we write the addresses to this update file (which will echo into `/etc/hosts`). When a Pod is re-created with the same IP address, the second time around the file is run to update the hosts, and then we submit the job.

## Quick Start

@@ -90,6 +90,26 @@ And this is also:
$ make log
```

List running pods (each pod is part of the batch job):

```bash
$ make list
```

And shell into one with the helper script:

```bash
./shell.sh flux-sample-0-b5rw6
```
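
`shell.sh` itself is not shown in this excerpt, but it is presumably just a thin
wrapper around `kubectl exec`, along the lines of:

```bash
#!/bin/bash
# Hypothetical helper: open an interactive shell in the named pod
# inside the flux-operator namespace.
pod="${1}"
kubectl exec -it -n flux-operator "${pod}" -- /bin/bash
```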

### Starting Fresh

If you want to blow up your minikube and start fresh (pulling the container again too):

```bash
make reset
```

## Using the Operator

If you aren't starting from scratch, then you can use the code here to see how things work!
@@ -156,6 +176,66 @@ And then:
$ minikube stop
```

## What is Happening?

If you follow the commands above, you'll see a lot of terminal output, and it might not be clear what
is happening. Let's talk about it here. Generally, you'll first see the config maps and supporting resources
being created. Since we are developing (for the time being) on a local machine, instead of a persistent volume
claim (which requires a Kubernetes cluster with a provisioner) you'll get a persistent volume
written to `/tmp` in the job namespace. If you try to use a claim locally, it typically freezes.

The first time the pods are created, they won't have IPs (yet), so you'll see an empty list in the logs.
Once the pods are created and assigned IPs, you'll see the same output but with a
lookup of hostnames to IP addresses, and after that it will tell you the cluster is ready.

```
1.6629325562267003e+09 INFO minicluster-reconciler 🌀 Mini Cluster is Ready!
```
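
If you want to watch the pods acquire their IP addresses yourself, a plain
`kubectl` command (not part of the Makefile) also works:

```bash
# Watch the indexed-job pods; -o wide includes the assigned IP for each pod
kubectl get pods -n flux-operator -o wide --watch
```
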
While you are waiting, if you run `make log` in a separate terminal you'll see output from one of the pods
in the job. For the first bit of time it will typically be waiting:

```bash
$ make log
kubectl logs -n flux-operator job.batch/flux-sample
Found 6 pods, using pod/flux-sample-0-njnnd
Host updating script not available yet, waiting...
```
It's waiting for the `/flux_operator/update_hosts.sh` script. When this is available, it will be found
and the job setup will continue: first the found hosts are added to `/etc/hosts`, and then (for the main node,
which is typically `<name>-0`) the certificate is generated and the broker starts. When the hosts are added,
you'll see the hosts file cat to the screen:

```bash
Host updating script not available yet, waiting...
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
172.17.0.4 flux-sample-1.flux-sample.flux-operator.svc.cluster.local flux-sample-1
172.17.0.2 flux-sample-0-flux-sample.flux-operator.svc.cluster.local flux-sample-0
172.17.0.4 flux-sample-1-flux-sample.flux-operator.svc.cluster.local flux-sample-1
172.17.0.6 flux-sample-2-flux-sample.flux-operator.svc.cluster.local flux-sample-2
172.17.0.7 flux-sample-3-flux-sample.flux-operator.svc.cluster.local flux-sample-3
172.17.0.5 flux-sample-4-flux-sample.flux-operator.svc.cluster.local flux-sample-4
172.17.0.8 flux-sample-5-flux-sample.flux-operator.svc.cluster.local flux-sample-5
flux-sample-1 is sleeping waiting for main flux node
```

And then final configs are created, the flux user is created, and the main
node creates the certificate and we start the cluster. You can look at
[controllers/flux/templates.go](controllers/flux/templates.go)
for all the scripts and logic that are run. It's not perfectly figured out
but we are close! The current state is that the nodes are waiting for one
another:

```bash
2022-09-12T02:25:21.793030Z broker.err[0]: quorum delayed: waiting for flux-sample-[1-5] (rank 1-5)
```

Probably because I misconfigured something - I've never been a Flux admin before!

## Making the operator

This section will walk through some of the steps that @vsoch took to create the controller using the operator-sdk, and challenges she faced.
9 changes: 7 additions & 2 deletions TODO.md
@@ -2,15 +2,20 @@

### Design 3

- [ ] figure out where to put flux hostname / config - volume needs write
- [x] figure out where to put flux hostname / config - volume needs write
- [ ] I don't know what "cores" means - added to the MiniCluster config but maybe is just automatically derived?
- [ ] debug nodes finding one another (see How it works in README.md)
- [ ] can (and should) we use generics to reduce redundancy of code? (e.g., the `get<X>` functions)
- [ ] I think if a pod dies the IP address might change, so eventually we want to test that (and may need more logic for re-updating /etc/hosts)
- [x] debug pod containers not seeing config again (e.g., mounts not creating)
- [ ] Should there be a min/max size for the MiniCluster CRD?
- [x] Should the secondary (non-driver) pods have a different start command? (answer is no - with the Indexed job it's all the same command)
- [ ] MiniCluster - how should we handle deletion / update?
- [ ] Do we want to be able to launch additional tasks? (e.g., after the original job started)
- [ ] Currently we have no representation of quota - we need to be able to set (and check) hard limits from the scheduler (or maybe we get that out of the box)?
- [ ] Details for etc-hosts (or will this just work?)
- [x] Details for etc-hosts (or will this just work? - no it won't just work)
- [ ] klog can be changed to add V(2) to handle verbosity from the command line, see https://pkg.go.dev/k8s.io/klog/v2
- [ ] At some point we want more intelligent use of labels/selectors (I haven't really read about them yet)

### Design 2 (not currently working on)

5 changes: 5 additions & 0 deletions api/v1alpha1/minicluster_types.go
@@ -35,6 +35,11 @@ type MiniClusterSpec struct {
// +optional
Size int32 `json:"size"`

// Cores to be used per node for flux resources
// +kubebuilder:default=3
// +optional
Cores int32 `json:"cores"`

// Working directory to run command from
// +optional
WorkingDir string `json:"workingDir"`
5 changes: 5 additions & 0 deletions config/crd/bases/flux-framework.org_miniclusters.yaml
@@ -41,6 +41,11 @@ spec:
            command:
              description: Single user executable to provide to flux start
              type: string
            cores:
              default: 3
              description: Cores to be used per node for flux resources
              format: int32
              type: integer
            image:
              default: fluxrm/flux-sched:focal
              description: Container image must contain flux and flux-sched install
3 changes: 3 additions & 0 deletions config/rbac/role.yaml
@@ -60,6 +60,7 @@ rules:
verbs:
- create
- delete
- exec
- get
- list
- patch
@@ -72,6 +73,7 @@ rules:
verbs:
- create
- delete
- exec
- get
- list
- patch
@@ -108,6 +110,7 @@ rules:
verbs:
- create
- delete
- exec
- get
- list
- patch
96 changes: 96 additions & 0 deletions controllers/flux/events.go
@@ -0,0 +1,96 @@
/*
Copyright 2022 Lawrence Livermore National Security, LLC
(c.f. AUTHORS, NOTICE.LLNS, COPYING)
This is part of the Flux resource manager framework.
For details, see https://github.com/flux-framework.
SPDX-License-Identifier: Apache-2.0
*/

package controllers

// Events are added to the Reconciler directly. If we don't need them:
// 1. Delete this file
// 2. Delete the AddEventFilter(r)
// 3. (Optionally) the Reconciler Client can be inherited directly

import (
jobctrl "flux-framework/flux-operator/pkg/job"

api "flux-framework/flux-operator/api/v1alpha1"

"k8s.io/klog/v2"
"sigs.k8s.io/controller-runtime/pkg/event"
)

// Notify watchers (the FluxSetup) that we have a new job request
func (r *MiniClusterReconciler) notifyWatchers(job *api.MiniCluster) {
for _, watcher := range r.watchers {
watcher.NotifyMiniClusterUpdate(job)
}
}

// Called when a new job is created
func (r *MiniClusterReconciler) Create(e event.CreateEvent) bool {

// Only respond to job events!
job, match := e.Object.(*api.MiniCluster)
if !match {
return true
}

// Add conditions - they should never exist for a new job
job.Status.Conditions = jobctrl.GetJobConditions()

// We will tell FluxSetup there is a new job request
defer r.notifyWatchers(job)
r.log.Info("🌀 MiniCluster create event", "Name:", job.Name)

// Continue to creation event
r.log.Info("🌀 MiniCluster was added!", "Name:", job.Name, "Condition:", jobctrl.GetCondition(job))
return true
}

func (r *MiniClusterReconciler) Delete(e event.DeleteEvent) bool {

job, match := e.Object.(*api.MiniCluster)
if !match {
return true
}

defer r.notifyWatchers(job)
log := r.log.WithValues("job", klog.KObj(job))
log.Info("🌀 MiniCluster delete event")

// TODO should trigger a delete here
// Reconcile should clean up resources now
return true
}

func (r *MiniClusterReconciler) Update(e event.UpdateEvent) bool {
oldMC, match := e.ObjectOld.(*api.MiniCluster)
if !match {
return true
}

// Figure out the state of the old job
mc := e.ObjectNew.(*api.MiniCluster)

r.log.Info("🌀 MiniCluster update event")

// If the job hasn't changed, continue reconcile
// There aren't any explicit updates beyond conditions
if jobctrl.JobsEqual(mc, oldMC) {
return true
}

// TODO: check if ready or running, shouldn't be able to update
// OR if we want update, we need to completely delete and recreate
return true
}

func (r *MiniClusterReconciler) Generic(e event.GenericEvent) bool {
r.log.V(3).Info("Ignore generic event", "obj", klog.KObj(e.Object), "kind", e.Object.GetObjectKind().GroupVersionKind())
return false
}
58 changes: 58 additions & 0 deletions controllers/flux/extra.go
@@ -0,0 +1,58 @@
package controllers

// This file has extra (not used) functions that might be useful
// (and I didn't want to delete just yet)

import (
"context"
"os"

corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/client-go/tools/remotecommand"

api "flux-framework/flux-operator/api/v1alpha1"
)

// podExec executes a command to a named pod
// This is not currently in use. It seems to run but I don't see the expected output
func (r *MiniClusterReconciler) podExec(pod corev1.Pod, ctx context.Context, cluster *api.MiniCluster) error {

command := []string{
"/bin/sh",
"-c",
// note: "sh -c" takes the whole command as one string; passing "echo",
// "hello", "world" as separate items leaves them as unused positional
// parameters, which is likely why no output appeared
"echo hello world",
}

// Prepare a request to execute to the pod in the statefulset
execReq := r.RESTClient.Post().Namespace(cluster.Namespace).Resource("pods").
Name(pod.Name).
Namespace(cluster.Namespace).
SubResource("exec").
VersionedParams(&corev1.PodExecOptions{
Command: command,
Container: pod.Spec.Containers[0].Name,
Stdin: true,
Stdout: true,
Stderr: true,
TTY: true,
}, runtime.NewParameterCodec(r.Scheme))

exec, err := remotecommand.NewSPDYExecutor(r.RESTConfig, "POST", execReq.URL())
if err != nil {
r.log.Error(err, "🌀 Error preparing command to execute to pod", "Name:", pod.Name)
return err
}

// This is just for debugging for now :)
err = exec.Stream(remotecommand.StreamOptions{
Stdin: os.Stdin,
Stdout: os.Stdout,
Stderr: nil,
Tty: true,
})
r.log.Info("🌀 PodExec", "Container", pod.Spec.Containers[0].Name)
return err
}
