[wip] second design for metrics operator (#63)
* WIP to refactor

This is going to be a huge refactor that removes the hard-coded application/storage
"legos" and replaces them with a more flexible setup: there is one base metric set
(no subtypes), metrics generate the replicated jobs (as many as they like, however
they please), and addons are then provided to them. Addons can range from additional
volumes to containers (that provide volumes) to any kind of customization. This
is not ready for any kind of testing, but I am mostly concerned about my computer
blowing up and losing the work, so I am saving for good measure :) Also, yay today! :D


* definitely making bad life decisions
* very satisfying deletion of things.
* lammps ran!
* amg is back
* bdas is back
* add back hpl

we did not get this completely working before (likely
the spack mpi install, as a basic hostname does not work),
so a basic conversion is sufficient

* add back kripke
* laghos
* test signing again
* add back nekbone
* add back pennant
* add back quicksilver

also simplify logic of applications - the launcher worker
pattern is generic and can be shared

* workflow format bug
* add back fio
* add back host volume example
* add back ior
* add back osu benchmarks!
* add back chatterbug

it is accepted that this does not fully work; we need to
come back to it.

* add back netmark
* sysstat and lammps working again
* hpctoolkit design at least works

but shared libraries are failing to load. HPCToolkit
you are a jerk. I am laughing. And crying. And mostly
crying.

* clean up docs a little bit
* addon documentation is good
* hopefully fix bug
* fixing workingdir bug!
* update to v1alpha2
* bugfix
* a single touch marker at the end of the copy is more reliable than a file that is part of it!
* support to customize container for any metric, and for hpctoolkit to run post commands
* support for custom container
* add print at end of post analysis for hpctoolkit
* fixing bug with internal crd state

if we do not make a copy (reflect) of the interface,
the state seems to change (and persist) between runs. While
I am still worried about this design, this at least seems
to fix that bug. I am also wondering about garbage collection
(e.g., if making the copies means they stay around and the
operator will use increasing memory) but that is to be
explored.
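
To illustrate the state bug described above, here is a minimal sketch of copying a registered metric through reflection before each run. Every name in it (`Metric`, `demoMetric`, `registry`, `freshCopy`) is hypothetical and is not taken from the operator's code; it only assumes that metrics are registered once as pointers and then mutated per run.

```go
package main

import (
	"fmt"
	"reflect"
)

// Metric is a hypothetical stand-in for the operator's metric interface.
type Metric interface {
	Name() string
	SetOption(key, value string)
	Options() map[string]string
}

// demoMetric is an illustrative implementation, not a type from the repository.
type demoMetric struct {
	name    string
	options map[string]string
}

func (m *demoMetric) Name() string { return m.name }

func (m *demoMetric) SetOption(key, value string) {
	if m.options == nil {
		m.options = map[string]string{}
	}
	m.options[key] = value
}

func (m *demoMetric) Options() map[string]string { return m.options }

// registry holds one long-lived template per metric name (assumed layout).
var registry = map[string]Metric{
	"app-lammps": &demoMetric{name: "app-lammps"},
}

// freshCopy allocates a new value of the template's concrete type and
// shallow-copies the template into it, so per-run mutations land on the copy.
// Maps, slices, and pointers already set on the template would still be shared.
func freshCopy(template Metric) Metric {
	v := reflect.ValueOf(template)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return template // value-typed or nil implementations are returned as is
	}
	fresh := reflect.New(v.Elem().Type()) // *T pointing at a zero T
	fresh.Elem().Set(v.Elem())            // copy the struct by value
	return fresh.Interface().(Metric)
}

func main() {
	first := freshCopy(registry["app-lammps"])
	first.SetOption("workdir", "/opt/lammps")

	// A second run starts from the untouched template, not from first's state.
	second := freshCopy(registry["app-lammps"])
	fmt.Println(len(first.Options()), len(second.Options()))
}
```

On the garbage collection worry: in a sketch like this, each copy becomes collectable once nothing references it, so memory should only grow if the copies are stored somewhere long-lived. Whether that holds for the operator depends on where it keeps them.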

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Sep 24, 2023
1 parent 24980db commit 67ad62f
Showing 134 changed files with 4,604 additions and 4,683 deletions.
28 changes: 14 additions & 14 deletions .github/workflows/main.yaml
@@ -16,7 +16,7 @@ jobs:
- name: Check Spelling
uses: crate-ci/typos@7ad296c72fa8265059cc03d1eda562fbdfcd6df2 # v1.9.0
with:
-files: ./README.md ./config/samples ./docs/*.md ./docs/*/*.md
+files: ./README.md ./docs/*.md ./docs/*/*.md ./docs/*/*/*.md

- name: Lint and format Python code
run: |
@@ -66,19 +66,19 @@ jobs:
strategy:
fail-fast: false
matrix:
-test: [["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # performance test
-["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60], # storage test
-["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120], # storage test
-["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120], # storage test
-# ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120], # network app test
-["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120], # standalone app test
-# ["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120], # standalone app test
-["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120], # standalone app test
-["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120], # standalone app test
-["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120], # standalone app test
-["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120], # standalone app test
-["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120], # standalone app test
-["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120]] # standalone app test
+test: [["app-lammps", "ghcr.io/converged-computing/metric-lammps:latest", 120],
+["perf-hello-world", "ghcr.io/converged-computing/metric-sysstat:latest", 60],
+["io-host-volume", "ghcr.io/converged-computing/metric-sysstat:latest", 60],
+["io-fio", "ghcr.io/converged-computing/metric-fio:latest", 120],
+["io-ior", "ghcr.io/converged-computing/metric-ior:latest", 120],
+## ["network-chatterbug", "ghcr.io/converged-computing/metric-chatterbug:latest", 120],
+["app-nekbone", "ghcr.io/converged-computing/metric-nekbone:latest", 120],
+["app-ldms", "ghcr.io/converged-computing/metric-ovis-hpc:latest", 120],
+["app-amg", "ghcr.io/converged-computing/metric-amg:latest", 120],
+["app-kripke", "ghcr.io/converged-computing/metric-kripke:latest", 120],
+["app-pennant", "ghcr.io/converged-computing/metric-pennant:latest", 120],
+["app-bdas", "ghcr.io/converged-computing/metric-bdas:latest", 120],
+["app-quicksilver", "ghcr.io/converged-computing/metric-quicksilver:latest", 120]]

steps:
- name: Clone the code
2 changes: 1 addition & 1 deletion .github/workflows/python.yaml
@@ -27,7 +27,7 @@ jobs:
run: |
export PATH="/usr/share/miniconda/bin:$PATH"
source activate mo
-cd sdk/python/v1alpha1
+cd sdk/python/v1alpha2
pip install .
pip install seaborn pandas
2 changes: 1 addition & 1 deletion .github/workflows/release.yaml
@@ -106,7 +106,7 @@ jobs:
run: |
export PATH="/usr/share/miniconda/bin:$PATH"
source activate mo
-cd sdk/python/v1alpha1/
+cd sdk/python/v1alpha2/
pip install -e .
python setup.py sdist bdist_wheel
cd dist
3 changes: 2 additions & 1 deletion Makefile
@@ -323,7 +323,8 @@ helm: manifests kustomize helmify

.PHONY: docs-data
docs-data:
-go run hack/docs-gen/main.go docs/_static/data/metrics.json
+go run hack/metrics-gen/main.go docs/_static/data/metrics.json
+go run hack/addons-gen/main.go docs/_static/data/addons.json

.PHONY: pre-push
pre-push: generate build-config-arm build-config docs-data
4 changes: 2 additions & 2 deletions PROJECT
@@ -17,6 +17,6 @@ resources:
controller: true
domain: flux-framework.org
kind: MetricSet
-path: github.com/converged-computing/metrics-operator/api/v1alpha1
-version: v1alpha1
+path: github.com/converged-computing/metrics-operator/api/v1alpha2
+version: v1alpha2
version: "3"
1 change: 1 addition & 0 deletions README.md
@@ -12,6 +12,7 @@ To learn more:

## Dinosaur TODO

+- Figure out issue with errors.IsNotFound not working...
- We need a way for the entrypoint command to monitor (based on the container) to differ (potentially)
- For larger metric collections, we should have a log streaming mode (and not wait for Completed/Successful)
- For services we are measuring, we likely need to be able to kill after N seconds (to complete job) or to specify the success policy on the metrics containers instead of the application
@@ -14,10 +14,10 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

-// Package v1alpha1 contains API Schema definitions for the v1alpha1 API group
+// Package v1alpha2 contains API Schema definitions for the v1alpha2 API group
// +kubebuilder:object:generate=true
// +groupName=flux-framework.org
-package v1alpha1
+package v1alpha2

import (
"k8s.io/apimachinery/pkg/runtime/schema"
@@ -26,7 +26,7 @@ import (

var (
// GroupVersion is group version used to register these objects
-GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha1"}
+GroupVersion = schema.GroupVersion{Group: "flux-framework.org", Version: "v1alpha2"}

// SchemeBuilder is used to add go types to the GroupVersionKind scheme
SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}