docs updates -- traces, hooks, execute commands

acrlabs · Jun 26, 2024 · 9811d98
1 parent fe18047

Showing 15 changed files with 187 additions and 292 deletions.
diff --git a/Cargo.toml b/Cargo.toml
@@ -36,6 +36,7 @@ rstest = { version = "0.18.2", optional = true }
 
 anyhow = { version = "1.0.75", features = ["backtrace"] }
 async-recursion = "1.0.5"
+async-trait = "0.1.80"
 bytes = "1.5.0"
 chrono = "0.4.26"
 clap = { version = "4.3.21", features = ["cargo", "derive", "string"] }
@@ -62,7 +63,6 @@ tracing = "0.1.37"
 tracing-log = "0.1.3"
 tracing-subscriber = { version = "0.3.17", features = ["env-filter"] }
 url = "2.4.1"
-async-trait = "0.1.80"
 
 [dependencies.kube]
 version = "0.85.0"

diff --git a/README.md b/README.md
@@ -36,14 +36,14 @@ This package provides the following components:
 
 ## Documentation
 
-Full [documentation for SimKube](https://appliedcomputing.io/simkube/index.html) is available on Applied
+Full [documentation for SimKube](https://appliedcomputing.io/simkube/) is available on Applied
 Computing's website.  Here are some quick links to select topics:
 
-- [Installation](https://appliedcomputing.io/simkube/docs/intro/installation.html)
-- [Autoscaling](http://appliedcomputing.io/simkube/docs/adv/autoscaling.html)
-- [Metrics Collection](http://appliedcomputing.io/simkube/docs/adv/metrics.html)
-- [Component Reference](http://appliedcomputing.io/simkube/docs/components/sk-ctrl.html)
-- [Developing SimKube](http://appliedcomputing.io/simkube/docs/dev/contributing.html)
+- [Installation](https://appliedcomputing.io/simkube/docs/intro/installation/)
+- [Autoscaling](http://appliedcomputing.io/simkube/docs/adv/autoscaling/)
+- [Metrics Collection](http://appliedcomputing.io/simkube/docs/adv/metrics/)
+- [Component Reference](http://appliedcomputing.io/simkube/docs/components/skctl/)
+- [Developing SimKube](http://appliedcomputing.io/simkube/docs/dev/contributing/)
 
 ## Contributing
 

diff --git a/docs/adv/autoscaling.md b/docs/adv/autoscaling.md
@@ -85,8 +85,8 @@ are some initial instructions in there for installing the karpenter+KWOK binary
 it will automatically use KWOK to scale up nodes in the cluster just like Cluster Autoscaler.  As with Cluster
 Autoscaler, KWOK applies the `kwok-provider=true:NoSchedule` taint to the nodes it creates.
 
-> [!NOTE]
-> Unlike Cluster Autoscaler, karpenter does not take in a list of Kubernetes Node specs to determine what instances it
-> launches.  Instead, it uses a hard-coded list of "generic" instance types which roughly map to standard instance
-> offerings by the major cloud providers.  There is an [open PR](https://github.com/kubernetes-sigs/karpenter/pull/1048)
-> to enable configuring node types via an injected file.
+Unlike Cluster Autoscaler, Karpenter does not take in a list of Kubernetes Node specs to determine what instances it
+launches.  Instead, it uses a hard-coded list of "generic" instance types which roughly map to standard instance
+offerings by the major cloud providers.  If you want to run Karpenter with a different set of configured instances, you
+need to modify the [embedded `instance_types.json`](https://github.com/kubernetes-sigs/karpenter/blob/main/kwok/cloudprovider/instance_types.json)
+file and rebuild Karpenter.
diff --git a/docs/adv/hooks.md b/docs/adv/hooks.md
@@ -0,0 +1,60 @@
+<!--
+project: SimKube
+template: docs.html
+-->
+
+# Simulation hooks
+
+SimKube supports running arbitrary setup or cleanup scripts at a number of different points during the simulation
+process.  The general method for configuring hooks is the same at each extension point: simply inject the following
+command into the Simulation custom resource:
+
+```yaml
+- cmd: echo           # required
+  args: ["foo"]       # required
+  ignoreFailure: true # optional, will not abort the simulation on failure
+  sendSim: true       # optional, will send the Simulation resource to the hook as JSON over stdin
+```
+
+## Extension points
+
+There are four places where hooks can be injected:
+
+### preStartHooks
+
+Pre-start hooks run once before any other simulation setup; you can use these hooks to create additional namespaces, set
+up monitoring, etc.
+
+### postStopHooks
+
+Similarly, post-stop hooks run once after _all_ simulation iterations have completed and after all other cleanup tasks
+are complete.  They can be used to clean up any resources or do additional reporting on the simulation results
+(extracting logs from relevant pods, for example).
+
+### preRunHooks
+
+Pre-run hooks run before _every_ iteration of the simulation, and can be used to re-create resources that should be
+"fresh" at the beginning of each iteration.  They are the first thing the SimKube driver runs, before executing any
+other setup.
+
+### postRunHooks
+
+Lastly, post-run hooks run at the end of _every_ simulation iteration, and can be used to clean up resources that might
+pollute future simulation iterations.  They are the last thing the SimKube driver runs.
+
+## Injecting hooks
+
+If you are using `skctl` to run your simulation, you can provide a set of hooks via a YAML file similar to the
+following, using the `--hooks` CLI argument:
+
+```bash exec="on" result="yaml"
+cat simkube/examples/hooks/example.yml
+```
+
+Otherwise, you can specify the hooks as part of the Simulation custom resource object.
+
+## Running hooks
+
+All executables needed to run hooks must be present and on the path in the `sk-ctrl` pod (for pre-start and post-stop
+hooks) or in the `sk-driver` pod (for pre-run and post-run hooks).  The standard Docker images built for SimKube include
+`kubectl`, `curl`, and `jq` for this purpose.
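The hook semantics in `hooks.md` are: run `cmd` with `args`, optionally send the Simulation to stdin as JSON, and abort unless `ignoreFailure` is set. The Python sketch below illustrates those semantics. It is a hypothetical example and not SimKube's own (Rust) implementation. Several details are assumptions:

- `run_hook`, `run_hooks`, and `HookFailed` are made-up names.
- Running hooks in list order is an assumption.
- Giving the hook an empty stdin when `sendSim` is off is a choice made here so that a hook which reads stdin cannot hang.

```python
from __future__ import annotations

import json
import subprocess
from typing import Any


class HookFailed(RuntimeError):
    """Raised when a hook exits non-zero and ignoreFailure is not set."""


def run_hook(hook: dict[str, Any], sim: dict[str, Any]) -> subprocess.CompletedProcess[str]:
    """Run a single hook entry with the fields shown in the hooks YAML (hypothetical helper)."""
    if hook.get("sendSim", False):
        # sendSim: send the Simulation resource to the hook as JSON over stdin
        io_args: dict[str, Any] = {"input": json.dumps(sim)}
    else:
        # Otherwise give the hook an empty stdin so it cannot block waiting for input
        io_args = {"stdin": subprocess.DEVNULL}
    result = subprocess.run(
        [hook["cmd"], *hook["args"]],  # cmd and args are both required
        capture_output=True,
        text=True,
        **io_args,
    )
    if result.returncode != 0 and not hook.get("ignoreFailure", False):
        # Without ignoreFailure, a failing hook aborts the simulation
        raise HookFailed(f"hook {hook['cmd']!r} exited with code {result.returncode}")
    return result


def run_hooks(hooks: list[dict[str, Any]], sim: dict[str, Any]) -> None:
    """Run every hook for one extension point, in list order (an assumption)."""
    for hook in hooks:
        run_hook(hook, sim)
```

Raising an exception, rather than returning a status flag, lets a caller that runs several hooks in a row stop at the first failure that is not ignored.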
diff --git a/docs/adv/traces.md b/docs/adv/traces.md
@@ -0,0 +1,46 @@
+<!--
+project: SimKube
+template: docs.html
+-->
+
+# Traces
+
+The SimKube tracer collects timeseries data about the events happening in a live Kubernetes cluster and exports that
+data to a trace file for future replay and analysis.  These trace files can then be stored in a cloud provider or
+downloaded locally.  We describe configuration options for each of these use cases.
+
+## Cloud storage
+
+We support exporting traces to Amazon S3, Google Cloud Storage, and Microsoft Azure Storage through the
+[object\_store](https://docs.rs/object_store/latest/object_store/) crate.  The `sk-tracer` and `sk-driver` pods need to
+be configured with the correct permissions to write and read data to your chosen cloud storage.  One option is to inject
+environment variables into the pod that object\_store understands.
+
+- Amazon S3: use the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables
+- Google Cloud Storage: use the `GOOGLE_SERVICE_ACCOUNT` environment variable and inject your service account JSON file
+  into the `sk-tracer` and `sk-driver` pods
+- Microsoft Azure: use the `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_STORAGE_ACCOUNT_KEY` environment variables
+
+The object\_store crate will try other authentication/authorization methods if these environment variables are not set
+(for example, it will try to get credentials from the instance metadata endpoint for AWS), so these are not the only
+ways to grant permissions to the tracer and the driver.  Configuring these permissions is beyond the scope of this
+documentation, and we encourage you to consult the IAM documentation for your chosen cloud provider(s).
+
+## Local storage
+
+If you do not have access to (or do not want to use) cloud storage, you can also save a trace file to local storage
+using, for example, `skctl export -o file:///path/to/trace`.  However, using this trace file in the simulator is a bit
+more complicated; it will need to be injected into the node(s) where your Simulation driver pods will run, and then
+volume-mounted into the driver pod.  If you are running locally via `kind`, you can add the following block to your
+`kind` config to mount the trace file directory on your laptop into the kind nodes:
+
+```yaml
+  - role: worker
+    extraMounts:
+      - hostPath: /tmp/kind-node-data
+        containerPath: /data
+```
+
+From there, when you run a simulation, you need to specify the trace data using `skctl run --trace-path
+file:///data/trace`.  This path refers to the location _inside the kind node's Docker container_, not inside the driver
+pod.  SimKube will automatically construct the appropriate volume mounts so that the driver pod can reference the trace.
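The `--trace-path` value refers to a path inside the kind node. It therefore helps to know where the trace file has to be placed on the host. The helper below is a hypothetical Python sketch that maps a `file://` trace URL back through the `extraMounts` entries from the kind config above. It assumes those entries have been loaded as a list of dicts with `hostPath` and `containerPath` keys.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from urllib.parse import urlparse


def host_path_for_trace(trace_url: str, extra_mounts: list[dict[str, str]]) -> str:
    """Map a file:// trace path inside a kind node to the host path backing it (hypothetical helper)."""
    parsed = urlparse(trace_url)
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URL, got {trace_url!r}")
    node_path = PurePosixPath(parsed.path)
    for mount in extra_mounts:
        container = PurePosixPath(mount["containerPath"])
        # Compare whole path components so that /database does not match a /data mount
        if node_path == container or container in node_path.parents:
            return str(PurePosixPath(mount["hostPath"]) / node_path.relative_to(container))
    raise ValueError(f"{node_path} is not under any extraMounts containerPath")
```

With the mount shown above, `host_path_for_trace("file:///data/trace", mounts)` points at `/tmp/kind-node-data/trace`, so that is where the exported trace would be copied on the laptop.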
diff --git a/docs/components/sk-ctrl.md b/docs/components/sk-ctrl.md
@@ -12,56 +12,34 @@ actually perform the Simulation.
 
 ## Usage
 
-```
-Usage: sk-ctrl [OPTIONS]
-
-Options:
-      --use-cert-manager
-      --cert-manager-issuer <CERT_MANAGER_ISSUER>  [default: ]
-  -v, --verbosity <VERBOSITY>                      [default: info]
-  -h, --help                                       Print help
+```bash exec="on" result="plain"
+sk-ctrl --help
 ```
 
 ## Details
 
 The Simulation Controller does the following on receipt of a new Simulation:
 
-0. Verifies that all the expected pre-existing objects are present in the cluster
-1. Creates a SimulationRoot object to hang all of the simulated objects off of
-2. Creates the namespace for the simulation driver to run in
-3. Creates custom resources for the [Prometheus operator](https://prometheus-operator.dev) to configure metrics
+0. Runs all preStart hooks
+1. Verifies that all the expected pre-existing objects are present in the cluster
+2. Creates a SimulationRoot "meta" object to own any objects that should persist for the whole simulation
+3. Creates the namespace for the simulation driver to run in
+4. Creates custom resources for the [Prometheus operator](https://prometheus-operator.dev) to configure metrics
    collection
-4. Creates a MutatingWebhookConfiguration for the simulation driver
-5. Creates a Service for the simulation driver
-6. Sets up certificates for the simulation driver mutating webhook (currently requires the use of
+5. Creates a MutatingWebhookConfiguration for the simulation driver
+6. Creates a Service for the simulation driver
+7. Sets up certificates for the simulation driver mutating webhook (currently requires the use of
    [cert-manager](https://cert-manager.io)).
-7. Creates the simulation driver Job
+8. Creates the simulation driver Job
+9. Waits for the driver to complete
+10. Cleans up all "meta" resources
+11. Runs all postStop hooks
 
 ## Simulation Custom Resource
 
-Here is an example Simulation object:
-
-```yaml
-apiVersion: simkube.io/v1
-kind: Simulation
-metadata:
-  name: testing
-spec:
-  driverNamespace: simkube
-  metricsConfig:
-    namespace: monitoring
-    serviceAccount: prometheus-k8s
-    remoteWriteConfigs:
-      - url: http://prom2parquet-svc.monitoring:1234/receive
-  trace: file:///data/trace
-```
-
-The `SimulationSpec` contains three fields, the location of the trace file which we want to use for the simulation,
-configuration for metrics collection, and the namespace to launch the driver into.  Currently the only trace location
-supported is `file:///`, i.e., the trace file already has to be present on the driver node at the specified location.
-In the future we will support downloading from an S3 bucket or other persistent storage.
-
-The Simulation CR is cluster-namespaced, because it must create SimulationRoots.
+Simulations are controlled by a Simulation custom resource object, which specifies, among other things, how to configure
+the Simulation driver, metrics collection, and any hooks.  The Simulation CR is cluster-scoped, because it must
+create SimulationRoots.
 
 ## SimulationRoot Custom Resource
 
@@ -73,6 +51,10 @@ owned by the SimulationRoot, so that users can still see the results and logs fr
 
 ## Configuring Metrics Collection
 
+> [!NOTE]
+> In the future we may move metrics collection out of SimKube proper and instead run it as a standard "hook".  If you do
+> not want to use Prometheus for metrics collection, or wish to configure it differently, you can disable metrics collection using `skctl --disable-metrics` and configure your own metrics solution with a preStart hook.
+
 SimKube depends on the [Prometheus operator](https://prometheus-operator.dev) being installed in your simulation
 cluster, as it creates custom resources understood by this operator.  The `metricsConfig` section of the Simulation spec
 controls how this is set up.  The `namespace` and the `serviceAccount` fields are the namespace and service account that

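The sk-ctrl steps listed above, combined with the hook extension points, fire in a fixed order. The Python sketch below models only that ordering and does no real work. `simulation_timeline` is a hypothetical name. The sketch also assumes that each simulation iteration is one pass through the sk-driver steps, with preRun hooks first and postRun hooks last.

```python
from __future__ import annotations


def simulation_timeline(iterations: int) -> list[str]:
    """Return the order in which hooks and lifecycle phases run (hypothetical model)."""
    if iterations < 1:
        raise ValueError("a simulation needs at least one iteration")
    timeline = ["preStartHooks"]  # sk-ctrl step 0: once, before any other setup
    timeline.append("controller setup")  # sk-ctrl steps 1-8, ending with the driver Job
    for i in range(iterations):
        timeline.append(f"preRunHooks[{i}]")  # first thing the driver runs
        timeline.append(f"replay trace[{i}]")  # driver webhook, SimulationRoot, replay, cleanup
        timeline.append(f"postRunHooks[{i}]")  # last thing the driver runs
    timeline.append("controller cleanup")  # sk-ctrl step 10: clean up "meta" resources
    timeline.append("postStopHooks")  # sk-ctrl step 11: once, after all other cleanup
    return timeline
```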
diff --git a/docs/components/sk-driver.md b/docs/components/sk-driver.md
@@ -12,32 +12,28 @@ selectors, and tolerations to ensure that the simulated pods end up on virtual n
 
 ## Usage
 
-```
-Usage: sk-driver [OPTIONS] --sim-name <SIM_NAME> --sim-root <SIM_ROOT> --virtual-ns-prefix <VIRTUAL_NS_PREFIX> \
-    --cert-path <CERT_PATH> --key-path <KEY_PATH> --trace-mount-path <TRACE_MOUNT_PATH>
-
-Options:
-      --sim-name <SIM_NAME>
-      --sim-root <SIM_ROOT>
-      --virtual-ns-prefix <VIRTUAL_NS_PREFIX>
-      --admission-webhook-port <ADMISSION_WEBHOOK_PORT>  [default: 8888]
-      --cert-path <CERT_PATH>
-      --key-path <KEY_PATH>
-      --trace-mount-path <TRACE_MOUNT_PATH>
-  -v, --verbosity <VERBOSITY>                            [default: info]
-  -h, --help                                             Print help
+```bash exec="on" result="plain"
+sk-driver --help
 ```
 
 ## Details
 
-The driver is launched by the [Simulation Controller](./sk-ctrl.md) when a new simulation is started.  On startup, it
-reads the cluster trace from the specified `--trace-path` and then replays all the events in the trace.  The driver
-shuts down when the trace is finished.
-
-The driver also exposes a `/mutate` endpoint on the specified `--admission-webhook-port`, which is called by the
-Kubernetes control plane whenever a new pod is created.  The mutation endpoint checks to see if the Pod is owned by any
-of the simulated resources, and if so, adds the following mutations to the object to ensure that it is scheduled on the
-virtual cluster:
+The driver is launched by the [Simulation Controller](./sk-ctrl.md) when a new simulation is started.  The driver
+performs the following steps:
+
+0. Runs all preRun hooks
+1. Creates the mutating webhook listener endpoint
+2. Creates a SimulationRoot object to hang all simulation objects off of
+3. Reads the trace from the specified location
+4. Replays the trace events
+5. Cleans up the SimulationRoot
+6. Shuts down the mutating webhook listener
+7. Runs all postRun hooks
+
+The driver exposes a `/mutate` endpoint on the specified `--admission-webhook-port`, which is called by the Kubernetes
+control plane whenever a new pod is created.  The mutation endpoint checks to see if the Pod is owned by any of the
+simulated resources, and if so, adds the following mutations to the object to ensure that it is scheduled on the virtual
+cluster:
 
 ```yaml
 labels:

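The driver's `/mutate` endpoint first checks whether a new Pod is owned by any of the simulated resources. The Python sketch below illustrates that check. Pods are often owned indirectly (Deployment to ReplicaSet to Pod), so this sketch follows `ownerReferences` transitively. Whether SimKube does the same, and how it looks up owners, is not stated here. `is_simulated_pod` and the in-memory `owner_graph` are hypothetical.

```python
from __future__ import annotations

from typing import Any


def is_simulated_pod(
    pod: dict[str, Any],
    simulated_uids: set[str],
    owner_graph: dict[str, list[str]],
) -> bool:
    """Return True if the pod is (transitively) owned by a simulated resource.

    owner_graph is a hypothetical lookup from an object's UID to the UIDs of
    its owners, e.g. ReplicaSet UID -> [Deployment UID].
    """
    # Start from the pod's direct owners
    stack = [ref["uid"] for ref in pod.get("metadata", {}).get("ownerReferences", [])]
    seen: set[str] = set()
    while stack:
        uid = stack.pop()
        if uid in simulated_uids:
            return True
        if uid in seen:
            continue  # guard against malformed, cyclic ownership
        seen.add(uid)
        stack.extend(owner_graph.get(uid, []))
    return False
```

Only a Pod for which this check returns True would receive the labels, node selectors, and tolerations that steer it onto the virtual nodes. All other Pods would pass through unchanged.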
diff --git a/docs/components/sk-tracer.md b/docs/components/sk-tracer.md
@@ -11,14 +11,8 @@ all or a portion of the trace to persistent storage so that it can be replayed l
 
 ## Usage
 
-```
-Usage: sk-tracer [OPTIONS] --config-file <CONFIG_FILE> --server-port <SERVER_PORT>
-
-Options:
-  -c, --config-file <CONFIG_FILE>
-      --server-port <SERVER_PORT>
-  -v, --verbosity <VERBOSITY>      [default: info]
-  -h, --help                       Print help
+```bash exec="on" result="plain"
+sk-tracer --help
 ```
 
 ## Config File Format