adding update to /etc/hosts (#16)
* summary of changes

- We have a strategy for /etc/hosts
- _One_ curve cert is generated by the main pod and shared with the other pods via a persistent volume
  - If we use a cloud K8s we might be able to use a persistent volume claim, and we should be able to switch between the two
- A flux user is created in each container; this assumes a particular user-creation command, which might not exist depending on the base OS (we might need different templates for different OSes and ask the user as part of the MiniCluster config)
- Events are moved into events.go and can be deleted if we don't need them. If this is the case we can also move the Client from an attribute on the Reconciler to be inherited (e.g., r.Get vs r.Client.Get)
- My first attempt at a podExec function is moved into extra.go - if we need/want this we can debug further, otherwise it's not used and safe to delete.
- Having the wait / startup script generate the certificate (given only the main pod hostname) made the initContainers redundant, so I removed them.
- Configs/templates are moved into templates.go so they are easier to find.
- GetHostfileConfig is now GetConfigMap (and more generalized)
- Listed pods are now sorted by name so they are returned consistently
- A ConfigMap volume at `/flux_operator` is where we are writing the entrypoint script (wait.sh), the start script, and the update_hosts.sh script.

* adding update to /etc/hosts

this is a bit of a hack that adds a script wrapper to the pod start;
the wrapper waits until it sees a file populated with ip addresses
(or more specifically, with echoes that update /etc/hosts). I think
we can do this because when the pod is re-created, the ip address
does not change! While the wrapper is waiting, a config map is
updated with the (now known) ip addresses. This seems to allow
/etc/hosts to be correctly populated (determined by ping working),
and I think next I need to debug why the broker still thinks it
is waiting.

* clean up unused functions

* separating main node (to generate cert and start) from workers

my flux still is not connecting ("Unable to connect to Flux"),
so I need to debug this. But I think it is more correct that
only one of the nodes runs the start command and generates the
certificate. Technically, if this node knows about the other
nodes it can use them, so we should not need to run the command
multiple times.

* ensure we use a persistent volume for curve

with the emptyDir strategy each node had its own mount. If
we use a persistent volume (or claim), each node has access to
the same certificate, and we do not need to worry about a race
because it is written by just one hostname.
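
The generated start logic is not shown here (see controllers/flux/templates.go),
but the single-writer idea is roughly the following sketch, where `MAIN_HOST`
is a hypothetical stand-in for the main pod hostname (typically `<name>-0`):

```bash
#!/bin/bash
# Sketch only: the main pod generates the curve certificate on the
# shared volume; every other pod waits for it to appear.
cert=/mnt/curve/curve.cert
if [ "$(hostname)" = "${MAIN_HOST}" ]; then
    flux keygen "${cert}"
else
    while [ ! -f "${cert}" ]; do
        echo "$(hostname) is sleeping waiting for main flux node"
        sleep 5
    done
fi
```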

* good state - we have flux almost running, quorum delayed

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Sep 12, 2022
1 parent 260fd38 commit 6f54a39
Showing 14 changed files with 691 additions and 385 deletions.
15 changes: 15 additions & 0 deletions Makefile
@@ -106,11 +106,26 @@ vet: ## Run go vet against code.
test: manifests generate fmt vet envtest ## Run tests.
	KUBEBUILDER_ASSETS="$(shell $(ENVTEST) use $(ENVTEST_K8S_VERSION) -p path)" go test ./... -coverprofile cover.out

.PHONY: list
list:
	kubectl get -n flux-operator pods

.PHONY: reset
reset:
	minikube stop
	minikube delete
	minikube start
	kubectl create namespace flux-operator
	make install
	make redo

.PHONY: clean
clean:
	kubectl delete -n flux-operator svc --all
	kubectl delete -n flux-operator secret --all
	kubectl delete -n flux-operator cm --all
	kubectl delete -n flux-operator pvc --all
	kubectl delete -n flux-operator pv --all
	kubectl delete -n flux-operator pods --all
	kubectl delete -n flux-operator jobs --all
	kubectl delete -n flux-operator MiniCluster --all
84 changes: 82 additions & 2 deletions README.md
@@ -36,8 +36,8 @@ And you can find the following:

- A **MiniCluster** is an [indexed job](https://kubernetes.io/docs/tasks/job/indexed-parallel-processing-static/) so we can create N copies of the "same" base containers (each with flux, and the connected workers in our cluster)
- The flux config is written to a volume at `/etc/flux/config` (created via a config map) as a brokers.toml file.
- We use an [initContainer](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) with an Empty volume (shared between init and worker) to generate the curve certificates (`/mnt/curve/curve.cert`). The broker sees them via the definition of that path in the broker.toml in our config directory mentioned above.
- TODO we need to figure out how the pods can see one another (TBA)
- We use an [initContainer](https://kubernetes.io/docs/concepts/workloads/pods/init-containers/) with an emptyDir volume (shared between init and worker) to generate the curve certificates (`/mnt/curve/curve.cert`). The broker sees them via the definition of that path in the broker.toml in our config directory mentioned above. Currently every container generates its own curve.cert, so this needs to be updated to have just one.
- Networking is a bit of a hack - we have a wrapper start script that essentially waits until a file is populated with hostnames. While it waits, the pods are being created and allocated IP addresses, and we write the addresses to this update file (which will echo into `/etc/hosts`). When a Pod is re-created with the same IP address, the second time around the file is run to update the hosts, and then we submit the job.

## Quick Start

@@ -90,6 +90,26 @@ And this is also:
$ make log
```

List running pods (each pod is part of the batch job):

```bash
$ make list
```

And shell into one with the helper script:

```bash
./shell.sh flux-sample-0-b5rw6
```
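
`shell.sh` itself is not shown in this excerpt, but it is presumably just a thin
wrapper around `kubectl exec`, along the lines of:

```bash
#!/bin/bash
# Hypothetical helper: open an interactive shell in the named pod
# inside the flux-operator namespace.
pod="${1}"
kubectl exec -it -n flux-operator "${pod}" -- /bin/bash
```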

### Starting Fresh

If you want to blow up your minikube and start fresh (pulling the container again too):

```bash
make reset
```

## Using the Operator

If you aren't starting from scratch, then you can use the code here to see how things work!
@@ -156,6 +176,66 @@ And then:
$ minikube stop
```

## What is Happening?

If you follow the commands above, you'll see a lot of terminal output, and it might not be clear what
is happening. Let's talk about it here. Generally, you'll first see the config maps and supporting resources
being created. Since we are developing (for the time being) on a local machine, instead of a persistent volume
claim (which requires a Kubernetes cluster with a provisioner) you'll get a persistent volume
written to `/tmp` in the job namespace. If you try to use a claim locally, it typically freezes.

The first time the pods are created, they won't have IPs (yet), so you'll see an empty list in the logs.
Once the pods are created and assigned IPs, you'll see the same output but with a
lookup of hostnames to IP addresses, and after that it will tell you the cluster is ready.

```
1.6629325562267003e+09 INFO minicluster-reconciler 🌀 Mini Cluster is Ready!
```
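
If you want to watch the pods acquire their IP addresses yourself, a plain
`kubectl` command (not part of the Makefile) also works:

```bash
# Watch the indexed-job pods; -o wide includes the assigned IP for each pod
kubectl get pods -n flux-operator -o wide --watch
```
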
While you are waiting, if you run `make log` in a separate terminal you'll see output from one of the pods
in the job. For the first bit of time it will typically be waiting:

```bash
$ make log
kubectl logs -n flux-operator job.batch/flux-sample
Found 6 pods, using pod/flux-sample-0-njnnd
Host updating script not available yet, waiting...
```
It's waiting for the `/flux_operator/update_hosts.sh` script. When this is available, it will be found
and the job setup will continue: first the found hosts are added to `/etc/hosts`, and then (for the main node,
which is typically `<name>-0`) the certificate is generated and the broker starts. When the hosts are added,
you'll see the hosts file cat to the screen:

```bash
Host updating script not available yet, waiting...
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
172.17.0.4 flux-sample-1.flux-sample.flux-operator.svc.cluster.local flux-sample-1
172.17.0.2 flux-sample-0-flux-sample.flux-operator.svc.cluster.local flux-sample-0
172.17.0.4 flux-sample-1-flux-sample.flux-operator.svc.cluster.local flux-sample-1
172.17.0.6 flux-sample-2-flux-sample.flux-operator.svc.cluster.local flux-sample-2
172.17.0.7 flux-sample-3-flux-sample.flux-operator.svc.cluster.local flux-sample-3
172.17.0.5 flux-sample-4-flux-sample.flux-operator.svc.cluster.local flux-sample-4
172.17.0.8 flux-sample-5-flux-sample.flux-operator.svc.cluster.local flux-sample-5
flux-sample-1 is sleeping waiting for main flux node
```

And then final configs are created, the flux user is created, and the main
node creates the certificate and we start the cluster. You can look at
[controllers/flux/templates.go](controllers/flux/templates.go)
for all the scripts and logic that are run. It's not perfectly figured out
but we are close! The current state is that the nodes are waiting for one
another:

```bash
2022-09-12T02:25:21.793030Z broker.err[0]: quorum delayed: waiting for flux-sample-[1-5] (rank 1-5)
```

Probably because I misconfigured something - I've never been a Flux admin before!

## Making the operator

This section will walk through some of the steps that @vsoch took to create the controller using the operator-sdk, and challenges she faced.
9 changes: 7 additions & 2 deletions TODO.md
@@ -2,15 +2,20 @@

### Design 3

- [ ] figure out where to put flux hostname / config - volume needs write
- [x] figure out where to put flux hostname / config - volume needs write
- [ ] I don't know what "cores" means - added to the MiniCluster config but maybe is just automatically derived?
- [ ] debug nodes finding one another (see How it works in README.md)
- [ ] can (and should) we use generics to reduce redundancy of code? (e.g., the `get<X>` functions)
- [ ] I think if a pod dies the IP address might change, so eventually we want to test that (and may need more logic for re-updating /etc/hosts)
- [x] debug pod containers not seeing config again (e.g., mounts not creating)
- [ ] Should there be a min/max size for the MiniCluster CRD?
- [x] Should the secondary (non-driver) pods have a different start command? (answer is no - with the Indexed job it's all the same command)
- [ ] MiniCluster - how should we handle deletion / update?
- [ ] Do we want to be able to launch additional tasks? (e.g., after the original job started)
- [ ] Currently we have no representation of quota - we need to be able to set (and check) hard limits from the scheduler (or maybe we get that out of the box)?
- [ ] Details for etc-hosts (or will this just work?)
- [x] Details for etc-hosts (or will this just work? - no it won't just work)
- [ ] klog can be changed to add V(2) to handle verbosity from the command line, see https://pkg.go.dev/k8s.io/klog/v2
- [ ] At some point we want more intelligent use of labels/selectors (I haven't really read about them yet)

### Design 2 (not currently working on)

5 changes: 5 additions & 0 deletions api/v1alpha1/minicluster_types.go
@@ -35,6 +35,11 @@ type MiniClusterSpec struct {
// +optional
Size int32 `json:"size"`

// Cores to be used per node for flux resources
// +kubebuilder:default=3
// +optional
Cores int32 `json:"cores"`

// Working directory to run command from
// +optional
WorkingDir string `json:"workingDir"`
5 changes: 5 additions & 0 deletions config/crd/bases/flux-framework.org_miniclusters.yaml
@@ -41,6 +41,11 @@ spec:
            command:
              description: Single user executable to provide to flux start
              type: string
            cores:
              default: 3
              description: Cores to be used per node for flux resources
              format: int32
              type: integer
            image:
              default: fluxrm/flux-sched:focal
              description: Container image must contain flux and flux-sched install
3 changes: 3 additions & 0 deletions config/rbac/role.yaml
@@ -60,6 +60,7 @@ rules:
verbs:
- create
- delete
- exec
- get
- list
- patch
@@ -72,6 +73,7 @@ rules:
verbs:
- create
- delete
- exec
- get
- list
- patch
@@ -108,6 +110,7 @@ rules:
verbs:
- create
- delete
- exec
- get
- list
- patch
96 changes: 96 additions & 0 deletions controllers/flux/events.go
@@ -0,0 +1,96 @@
/*
Copyright 2022 Lawrence Livermore National Security, LLC
(c.f. AUTHORS, NOTICE.LLNS, COPYING)
This is part of the Flux resource manager framework.
For details, see https://github.com/flux-framework.
SPDX-License-Identifier: Apache-2.0
*/

package controllers

// Events are added to the Reconciler directly. If we don't need them:
// 1. Delete this file
// 2. Delete the AddEventFilter(r)
// 3. (Optionally) the Reconciler Client can be inherited directly

import (
jobctrl "flux-framework/flux-operator/pkg/job"

api "flux-framework/flux-operator/api/v1alpha1"

"k8s.io/klog/v2"
"sigs.k8s.io/controller-runtime/pkg/event"
)

// Notify watchers (the FluxSetup) that we have a new job request
func (r *MiniClusterReconciler) notifyWatchers(job *api.MiniCluster) {
for _, watcher := range r.watchers {
watcher.NotifyMiniClusterUpdate(job)
}
}

// Called when a new job is created
func (r *MiniClusterReconciler) Create(e event.CreateEvent) bool {

// Only respond to job events!
job, match := e.Object.(*api.MiniCluster)
if !match {
return true
}

// Add conditions - they should never exist for a new job
job.Status.Conditions = jobctrl.GetJobConditions()

// We will tell FluxSetup there is a new job request
defer r.notifyWatchers(job)
r.log.Info("🌀 MiniCluster create event", "Name:", job.Name)

// Continue to creation event
r.log.Info("🌀 MiniCluster was added!", "Name:", job.Name, "Condition:", jobctrl.GetCondition(job))
return true
}

func (r *MiniClusterReconciler) Delete(e event.DeleteEvent) bool {

job, match := e.Object.(*api.MiniCluster)
if !match {
return true
}

defer r.notifyWatchers(job)
log := r.log.WithValues("job", klog.KObj(job))
log.Info("🌀 MiniCluster delete event")

// TODO should trigger a delete here
// Reconcile should clean up resources now
return true
}

func (r *MiniClusterReconciler) Update(e event.UpdateEvent) bool {
oldMC, match := e.ObjectOld.(*api.MiniCluster)
if !match {
return true
}

// Figure out the state of the old job
mc := e.ObjectNew.(*api.MiniCluster)

r.log.Info("🌀 MiniCluster update event")

// If the job hasn't changed, continue reconcile
// There aren't any explicit updates beyond conditions
if jobctrl.JobsEqual(mc, oldMC) {
return true
}

// TODO: check if ready or running, shouldn't be able to update
// OR if we want update, we need to completely delete and recreate
return true
}

func (r *MiniClusterReconciler) Generic(e event.GenericEvent) bool {
r.log.V(3).Info("Ignore generic event", "obj", klog.KObj(e.Object), "kind", e.Object.GetObjectKind().GroupVersionKind())
return false
}
58 changes: 58 additions & 0 deletions controllers/flux/extra.go
@@ -0,0 +1,58 @@
package controllers

// This file has extra (not used) functions that might be useful
// (and I didn't want to delete just yet)

import (
"context"
"os"

corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/client-go/tools/remotecommand"

api "flux-framework/flux-operator/api/v1alpha1"
)

// podExec executes a command to a named pod
// This is not currently in use. It seems to run but I don't see the expected output
func (r *MiniClusterReconciler) podExec(pod corev1.Pod, ctx context.Context, cluster *api.MiniCluster) error {

command := []string{
"/bin/sh",
"-c",
// note: "sh -c" takes the whole command as one string; passing "echo",
// "hello", "world" as separate items leaves them as unused positional
// parameters, which is likely why no output appeared
"echo hello world",
}

// Prepare a request to execute to the pod in the statefulset
execReq := r.RESTClient.Post().Namespace(cluster.Namespace).Resource("pods").
Name(pod.Name).
Namespace(cluster.Namespace).
SubResource("exec").
VersionedParams(&corev1.PodExecOptions{
Command: command,
Container: pod.Spec.Containers[0].Name,
Stdin: true,
Stdout: true,
Stderr: true,
TTY: true,
}, runtime.NewParameterCodec(r.Scheme))

exec, err := remotecommand.NewSPDYExecutor(r.RESTConfig, "POST", execReq.URL())
if err != nil {
r.log.Error(err, "🌀 Error preparing command to execute to pod", "Name:", pod.Name)
return err
}

// This is just for debugging for now :)
err = exec.Stream(remotecommand.StreamOptions{
Stdin: os.Stdin,
Stdout: os.Stdout,
Stderr: nil,
Tty: true,
})
r.log.Info("🌀 PodExec", "Container", pod.Spec.Containers[0].Name)
return err
}
