Refactor memory graph (#32)
* feat: basic io match is working

There is still more work to be done with jobspec-go
and parsing from raw values, and also checking the
other match types, but this is a start.

* refactor: memory graph database

Problem: we currently do not have a good model to support traversal
of more than one scheduled slot (a group of resources) and checking
of requires within and outside of the slot.
Solution: Jobspec nextgen provides a function to expose schedulable
slots. A slot does not necessarily start at the top - it can have
some set of resources at the top level (with requirements) and then
the slot is below it. This means that the graph database's recursive
algorithm needs to first traverse into a vertex to find the slot,
but along the way check the subsystem requirements for types. For
example, even if we want N nodes, we should not continue the search
if a node does not have an attribute we are interested in. Once
we find a slot, we create what is akin to a traverser, and the
traverser carries with it a resource counter. The resource
counter holds the count of needed slots vs. found slots, and
then is able to return as soon as we have found as many as we need.
It also holds the current state (status) of a current search,
meaning we decrement either a resource or subsystem count
when we find it somewhere in the subgraph of the slot. This
is just the early prototype, and so far just working for the
simple case of submitting a job with some need for cores and
nodes. I am next going to go back through the more specific
IO cases and ensure that they still work, with the goal to
get back to the spack case. I am going back to sleep for a
bit first, kind of tired.
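
A minimal sketch of the traversal described above, not the code in this commit. All names here (`Vertex`, `SlotRequest`, `counter`, `findSlots`, `exampleCluster`) are hypothetical, and the sketch simplifies by checking subsystem requirements only inside the slot's subgraph. It descends until it reaches a vertex of the slot type, creates a per-slot counter, decrements resource and subsystem needs as they are found, and stops once enough slots are matched.

```go
package main

import "fmt"

// Vertex is a hypothetical graph node with optional subsystem attributes.
type Vertex struct {
	Type       string
	Count      int
	Subsystems map[string]string // e.g. "io" -> "shm"
	Children   []*Vertex
}

// SlotRequest is a hypothetical description of one schedulable slot.
type SlotRequest struct {
	SlotType  string            // vertex type where the slot starts
	Replicas  int               // number of slots needed
	Resources map[string]int    // resources needed inside each slot
	Requires  map[string]string // subsystem requirements per slot
}

// counter holds what a single slot still needs during a search.
type counter struct {
	resources map[string]int
	requires  map[string]string
}

func newCounter(req SlotRequest) *counter {
	c := &counter{resources: map[string]int{}, requires: map[string]string{}}
	for k, v := range req.Resources {
		c.resources[k] = v
	}
	for k, v := range req.Requires {
		c.requires[k] = v
	}
	return c
}

func (c *counter) satisfied() bool {
	for _, n := range c.resources {
		if n > 0 {
			return false
		}
	}
	return len(c.requires) == 0
}

// visit decrements the needs met at v, then descends until satisfied.
func (c *counter) visit(v *Vertex) {
	if n, ok := c.resources[v.Type]; ok && n > 0 {
		c.resources[v.Type] = n - v.Count
	}
	for name, want := range c.requires {
		if v.Subsystems[name] == want {
			delete(c.requires, name)
		}
	}
	for _, child := range v.Children {
		if c.satisfied() {
			return // return early, the slot has everything it needs
		}
		c.visit(child)
	}
}

// findSlots walks down to slot vertices and counts those that satisfy
// the request, stopping as soon as enough have been found.
func findSlots(v *Vertex, req SlotRequest, found int) int {
	if found >= req.Replicas {
		return found
	}
	if v.Type == req.SlotType {
		c := newCounter(req)
		c.visit(v)
		if c.satisfied() {
			found++
		}
		return found
	}
	for _, child := range v.Children {
		found = findSlots(child, req, found)
	}
	return found
}

// exampleCluster builds cluster -> rack -> two nodes; only one node has io=shm.
func exampleCluster() *Vertex {
	cores := func() []*Vertex {
		return []*Vertex{{Type: "core", Count: 1}, {Type: "core", Count: 1}}
	}
	n1 := &Vertex{Type: "node", Count: 1, Subsystems: map[string]string{"io": "shm"}, Children: cores()}
	n2 := &Vertex{Type: "node", Count: 1, Children: cores()}
	rack := &Vertex{Type: "rack", Count: 1, Children: []*Vertex{n1, n2}}
	return &Vertex{Type: "cluster", Count: 1, Children: []*Vertex{rack}}
}

func main() {
	req := SlotRequest{
		SlotType:  "node",
		Replicas:  1,
		Resources: map[string]int{"core": 2},
		Requires:  map[string]string{"io": "shm"},
	}
	fmt.Printf("Slots found %d/%d\n", findSlots(exampleCluster(), req, 0), req.Replicas)
}
```

Keeping the remaining needs in a counter per slot, rather than on the vertices, lets each candidate slot be evaluated independently and discarded when it falls short.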

* io example is working

This example is needing to search both compatibility requirements
and look for resources within a slot.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed May 10, 2024
1 parent a79a19d commit 502bfbb
Showing 37 changed files with 851 additions and 895 deletions.
1 change: 0 additions & 1 deletion .devcontainer/Dockerfile
@@ -25,7 +25,6 @@ ENV GO_VERSION=1.20.14
unzip && \
apt-get clean -y && apt -y autoremove

# Install go 19.10
RUN wget https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz && tar -xvf go${GO_VERSION}.linux-amd64.tar.gz && \
mv go /usr/local && rm go${GO_VERSION}.linux-amd64.tar.gz

1 change: 1 addition & 0 deletions README.md
@@ -13,6 +13,7 @@ For more information:

## TODO

- match/equals can have repeated fields, so we need to honor that list.
- cypher: when we have another cypher graph, move the memgraph cypher logic into the graph match algorithm, add an endpoint to return cypher. Currently the match algorithms (beyond basic containment) are not implemented
- subsystems
- make also a function to delete subsystems
1 change: 0 additions & 1 deletion cmd/rainbow/rainbow.go
@@ -15,7 +15,6 @@ import (

// Register database backends and selection algorithms
_ "github.com/converged-computing/rainbow/plugins/algorithms/match"
_ "github.com/converged-computing/rainbow/plugins/algorithms/range"
_ "github.com/converged-computing/rainbow/plugins/backends/memgraph"
_ "github.com/converged-computing/rainbow/plugins/backends/memory"
_ "github.com/converged-computing/rainbow/plugins/selection/constraint"
2 changes: 1 addition & 1 deletion cmd/rainbow/submit/submit.go
@@ -5,7 +5,7 @@ import (
"fmt"
"log"

js "github.com/compspec/jobspec-go/pkg/jobspec/experimental"
js "github.com/compspec/jobspec-go/pkg/nextgen/v1"
"github.com/converged-computing/rainbow/pkg/client"
"github.com/converged-computing/rainbow/pkg/config"
jscli "github.com/converged-computing/rainbow/pkg/jobspec"
1 change: 0 additions & 1 deletion cmd/server/server.go
@@ -12,7 +12,6 @@ import (

// Register database backends
_ "github.com/converged-computing/rainbow/plugins/algorithms/match"
_ "github.com/converged-computing/rainbow/plugins/algorithms/range"
_ "github.com/converged-computing/rainbow/plugins/backends/memgraph"
_ "github.com/converged-computing/rainbow/plugins/backends/memory"
_ "github.com/converged-computing/rainbow/plugins/selection/constraint"
1 change: 1 addition & 0 deletions docs/advanced.md
@@ -23,5 +23,6 @@ go run ./cmd/rainbow/rainbow.go submit --config-path ./docs/examples/scheduler/r
```

The above demonstrates using a more advanced selection algorithm.
Note that this cluster state requires further discussion and thinking about where and how to accommodate it - it currently uses the old design with attributes on the level of the Jobspec, and while this works, we likely want to be using the attributes on the level of schedule-able unit.

[home](/README.md#rainbow-scheduler)
8 changes: 2 additions & 6 deletions docs/algorithms.md
@@ -141,11 +141,7 @@ task:
max: "0.5.5"
```

This is the most realisic use case I think.

### Equals

The "equals" type is going to look exactly at some exact value for a field in the metadata. It will return true (match) if it matches what the subsystem needs. For example, given this task:
This is the most realisic use case I think. The above demonstrates the match "equals" and range types. Using just match (or equals) is shown below:

```yaml
task:
@@ -297,7 +293,7 @@ scheduler:
- select: random
```

The above is saying for first priority, filter down to clusters that have nodes free. Then calculate an estimate of the cost for the build. Here is the logic. If we have a linaer model (Y = mX + b) to describe memory and runtime, so `runtime = (slope * memory) + intercept` and here our intercept is some value we can derive on the level of the package (and write into the jobspec) and seconds_per_gb is the slope of the line, then we can get an estimated runtime (in seconds) with `seconds_per_gb * memory_per_node`. If we multiply by 60 we get minutes, and again we get hours. So the piece of the equation `memory_per_node * seconds_per_gb)/60/60` is giving us an estimated runtime in hours based on the package being built. If we multiply that by the cost per node hour, then we get an estimate of the cost for the build.
The above is saying for first priority, filter down to clusters that have nodes free. Then calculate an estimate of the cost for the build. Here is the logic. If we have a linear model (Y = mX + b) to describe memory and runtime, so `runtime = (slope * memory) + intercept` and here our intercept is some value we can derive on the level of the package (and write into the jobspec) and seconds_per_gb is the slope of the line, then we can get an estimated runtime (in seconds) with `seconds_per_gb * memory_per_node`. If we multiply by 60 we get minutes, and again we get hours. So the piece of the equation `memory_per_node * seconds_per_gb)/60/60` is giving us an estimated runtime in hours based on the package being built. If we multiply that by the cost per node hour, then we get an estimate of the cost for the build.

The "select" field is saying how to choose the final cluster from the set that remain. Options here can be first, last, or random.
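
As a worked illustration of the cost estimate above, here is a minimal Go sketch. The function name `estimateCost` and the input values for memory per node and cost per node hour are assumptions for illustration only; `seconds_per_gb: 0.4` is the value used in the example jobspec. The sketch follows the expression in the text, so it leaves out the intercept, and it converts seconds to hours by dividing by 60 twice.

```go
package main

import "fmt"

// estimateCost returns the estimated cost of a build:
// runtime in hours (memory_per_node * seconds_per_gb / 60 / 60)
// multiplied by the cost per node hour.
// The intercept of the linear model is omitted, as in the expression above.
func estimateCost(memoryPerNodeGB, secondsPerGB, costPerNodeHour float64) float64 {
	runtimeSeconds := secondsPerGB * memoryPerNodeGB
	runtimeHours := runtimeSeconds / 60 / 60
	return runtimeHours * costPerNodeHour
}

func main() {
	// Hypothetical cluster values: 128 GB per node, 3.0 cost units per node hour.
	cost := estimateCost(128, 0.4, 3.0)
	fmt.Printf("estimated cost: %.4f\n", cost)
}
```

A cluster with less memory per node or a lower cost per node hour yields a lower estimate, which is what a priority that sorts on this expression would act on.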

42 changes: 21 additions & 21 deletions docs/commands.md
@@ -495,23 +495,25 @@ The new portion from the above is seeing that the subsystem "io" is satisfied at

```console
...
🔍️ Exploring cluster keebler deeper with depth first search

👀️ Looking for 'node' in cluster keebler
=> Checking vertex 'cluster' (count=1) for 'node' (need=2)
=> Checking vertex 'cluster' (count=1) for 'node' (need=2)
=> Checking vertex 'rack' (count=1) for 'node' (need=2)
=> Checking vertex 'node' (count=1) for 'node' (need=2)
=> Checking vertex 'node' (count=1) for 'node' (need=2)
⏳️ keebler still contender, 3/2 of needed node satisfied
🍇️ Satisfy request to Graph 🍇️
jobspec: {"version":1,"resources":{"ior":{"type":"node","replicas":1,"with":[{"type":"core","count":2,"attributes":{}}],"requires":[{"field":"type","match":"shm","name":"io"}],"attributes":{}}}}
🎰️ Resources that that need to be satisfied with matcher match
node: (slot) 1
requires
field: type
match: shm
name: io

👀️ Looking for 'slot' in cluster keebler
=> Assessing needs for subsystem io
=> Resource 'node' satisfies subsystem io shm
🎯️ dfs: we found 1 clusters to satisfy the request
2024/03/09 13:49:09 SELECT * from clusters WHERE name LIKE "keebler" LIMIT 1: keebler
2024/03/09 13:49:09 📝️ received job ior for 1 contender clusters
2024/03/09 13:49:09 📝️ job ior is assigned to cluster keebler
🔍️ Exploring cluster keebler deeper with depth first search
=> Searching for resource type core from parent contains->rack
=> Searching for resource type core from parent contains->node
Found subsystem edge for io with type shm
Minimum slot needs are satisfied at node for io at shm, returning early.
slotNeeds are satisfied, returning 1 slots matched
Slots found 1/1 for vertex cluster
match: ✅️ there are 1 matches with sufficient resources
2024/05/07 19:40:56 📝️ received job app for 1 contender clusters
2024/05/07 19:40:56 📝️ job app is assigned to cluster [keebler]
```

And the work is still assigned to the cluster.
@@ -539,7 +541,7 @@ go run cmd/rainbow/rainbow.go register cluster --cluster-name spack-builder --no
go run cmd/rainbow/rainbow.go register subsystem --subsystem spack --nodes-json ./docs/examples/match-algorithms/range/spack-subsystem.json --config-path ./docs/examples/match-algorithms/range/rainbow-config.yaml

# Submit a job that asked for a valid range
go run ./cmd/rainbow/rainbow.go submit --config-path ./docs/examples/match-algorithms/range/rainbow-config.yaml --jobspec ./docs/examples/match-algorithms/range/jobspec-valid-range.yaml --match-algorithm range
go run ./cmd/rainbow/rainbow.go submit --config-path ./docs/examples/match-algorithms/range/rainbow-config.yaml --jobspec ./docs/examples/match-algorithms/range/jobspec-valid-range.yaml
```
For the above job, you'll see it's satisfied:

@@ -552,8 +554,8 @@ For the above job, you'll see it's satisfied:
Try submitting a job that can't be satisfied for the range.

```bash
# Submit a job that asked for a valid range
go run ./cmd/rainbow/rainbow.go submit --config-path ./docs/examples/match-algorithms/range/rainbow-config.yaml --jobspec ./docs/examples/match-algorithms/range/jobspec-invalid-range.yaml --match-algorithm range
# Submit a job that asked for an invalid range
go run ./cmd/rainbow/rainbow.go submit --config-path ./docs/examples/match-algorithms/range/rainbow-config.yaml --jobspec ./docs/examples/match-algorithms/range/jobspec-invalid-range.yaml
```
```console
Slots found 0/1 for vertex cluster
@@ -594,6 +596,4 @@ Awesome! Next we can put that logic in a flux instance (from the Python grpc to
accept some number of them. The response back to the rainbow scheduler will be those to accept, which will then be removed from the database. For another day.




[home](/README.md#rainbow-scheduler)
31 changes: 13 additions & 18 deletions docs/examples/match-algorithms/range/jobspec-invalid-range.yaml
@@ -1,23 +1,18 @@
version: 1
resources:
- count: 2
type: node
with:
- count: 1
label: default
type: slot
spack:
replicas: 2
type: node
requires:
- name: spack
field: version
min: "0.7.1"
max: "0.7.5"

with:
- count: 2
type: core
task:
command:
- spack
slot: default
count:
per_slot: 1
resources:
spack:
range:
- field: version
min: "0.7.1"
max: "0.7.5"

tasks:
- command: [ior]
resources: spack
31 changes: 13 additions & 18 deletions docs/examples/match-algorithms/range/jobspec-valid-range.yaml
@@ -1,23 +1,18 @@
version: 1
resources:
- count: 2
type: node
with:
- count: 1
label: default
type: slot
spack:
replicas: 2
type: node
requires:
- name: spack
field: version
min: "0.5.1"
max: "0.5.5"

with:
- count: 2
type: core
task:
command:
- ior
slot: default
count:
per_slot: 1
resources:
spack:
range:
- field: version
min: "0.5.1"
max: "0.5.5"

tasks:
- command: [ior]
resources: spack
2 changes: 1 addition & 1 deletion docs/examples/match-algorithms/range/rainbow-config.yaml
@@ -8,7 +8,7 @@ scheduler:
name: match
cluster:
name: spack-builder
secret: 37e5b798-189f-4c38-bc1c-0a14877acbcf
secret: 594c79ea-fc65-4d82-93bb-5e4dc3469276
graphdatabase:
name: memory
host: 127.0.0.1:50051
23 changes: 10 additions & 13 deletions docs/examples/scheduler/jobspec-constraint.yaml
@@ -1,20 +1,17 @@

version: 1
resources:
- count: 2
type: node
with:
- count: 1
label: default
type: slot
spack:
replicas: 2
type: node
with:
- count: 2
type: core
task:
command:
- ior
slot: default
count:
per_slot: 1

tasks:
- command: [ior]
resources: spack

attributes:
parameter:
seconds_per_gb: 0.4
seconds_per_gb: 0.4
24 changes: 8 additions & 16 deletions docs/examples/scheduler/jobspec-io.yaml
@@ -1,22 +1,14 @@
version: 1
resources:
- count: 2
type: node
with:
- count: 1
label: default
type: slot
ior:
type: node
replicas: 1
requires:
- name: io
match: shm
field: type
with:
- count: 2
type: core
task:
command:
- ior
slot: default
count:
per_slot: 1
resources:
io:
match:
- field: type
value: shm
command: [ior]
2 changes: 1 addition & 1 deletion docs/examples/scheduler/rainbow-config.yaml
@@ -8,7 +8,7 @@ scheduler:
name: match
cluster:
name: keebler
secret: df4d1009-a95a-4fe6-8d8d-ae3cf9f016cd
secret: c1971ac9-8350-440f-8f5d-b64d97e929a4
graphdatabase:
name: memory
host: 127.0.0.1:50051
2 changes: 1 addition & 1 deletion go.mod
@@ -6,7 +6,7 @@ require (
github.com/Knetic/govaluate v3.0.0+incompatible
github.com/Masterminds/semver/v3 v3.2.1
github.com/akamensky/argparse v1.4.0
github.com/compspec/jobspec-go v0.0.0-20240406210339-886aab99ffbe
github.com/compspec/jobspec-go v0.0.0-20240510054255-ee02cdc7d3d4
github.com/converged-computing/jsongraph-go v0.0.0-20240229082022-c6887a5a00fe
github.com/fatih/color v1.16.0
github.com/google/uuid v1.6.0
4 changes: 2 additions & 2 deletions go.sum
@@ -4,8 +4,8 @@ github.com/Masterminds/semver/v3 v3.2.1 h1:RN9w6+7QoMeJVGyfmbcgs28Br8cvmnucEXnY0
github.com/Masterminds/semver/v3 v3.2.1/go.mod h1:qvl/7zhW3nngYb5+80sSMF+FG2BjYrf8m9wsX0PNOMQ=
github.com/akamensky/argparse v1.4.0 h1:YGzvsTqCvbEZhL8zZu2AiA5nq805NZh75JNj4ajn1xc=
github.com/akamensky/argparse v1.4.0/go.mod h1:S5kwC7IuDcEr5VeXtGPRVZ5o/FdhcMlQz4IZQuw64xA=
github.com/compspec/jobspec-go v0.0.0-20240406210339-886aab99ffbe h1:AMgW4uL//FX/Rl0lVP0bjvr0s/tjJqUSdxd1enFvMp4=
github.com/compspec/jobspec-go v0.0.0-20240406210339-886aab99ffbe/go.mod h1:BaJyxaOhESe2DD4lqBdwTEWOw0TaTZVJGPrFh6KyXQM=
github.com/compspec/jobspec-go v0.0.0-20240510054255-ee02cdc7d3d4 h1:4MaTp3OcUmp6HFEojeI//GthUt7GMYnB8K5OSZdKxZA=
github.com/compspec/jobspec-go v0.0.0-20240510054255-ee02cdc7d3d4/go.mod h1:BaJyxaOhESe2DD4lqBdwTEWOw0TaTZVJGPrFh6KyXQM=
github.com/converged-computing/jsongraph-go v0.0.0-20240229082022-c6887a5a00fe h1:Tk//RW3uKn4A7N8gpHRXs+ZGlR7Fxkwh+4/Iml0GBV4=
github.com/converged-computing/jsongraph-go v0.0.0-20240229082022-c6887a5a00fe/go.mod h1:+DhVyLXGVfBsfta4185jd33jqa94inshCcdvsXK2Irk=
github.com/fatih/color v1.16.0 h1:zmkK9Ngbjj+K0yRhTVONQh1p/HknKYSlNT+vZCzyokM=
2 changes: 1 addition & 1 deletion pkg/client/client.go
@@ -4,7 +4,7 @@ import (
"context"
"log"

js "github.com/compspec/jobspec-go/pkg/jobspec/experimental"
js "github.com/compspec/jobspec-go/pkg/nextgen/v1"

pb "github.com/converged-computing/rainbow/pkg/api/v1"
"github.com/converged-computing/rainbow/pkg/config"
2 changes: 1 addition & 1 deletion pkg/client/endpoint.go
@@ -7,7 +7,7 @@ import (
"os"
"time"

js "github.com/compspec/jobspec-go/pkg/jobspec/experimental"
js "github.com/compspec/jobspec-go/pkg/nextgen/v1"
pb "github.com/converged-computing/rainbow/pkg/api/v1"
"github.com/converged-computing/rainbow/pkg/config"
"github.com/converged-computing/rainbow/pkg/graph"
4 changes: 1 addition & 3 deletions pkg/graph/algorithm/algorithm.go
@@ -6,7 +6,6 @@ import (
"fmt"
"log"

v1 "github.com/compspec/jobspec-go/pkg/jobspec/experimental"
"github.com/converged-computing/rainbow/pkg/types"
)

@@ -23,8 +22,7 @@ type MatchAlgorithm interface {
Init(map[string]string) error

// A MatchAlgorithm needs to take a slot and determine if it matches
GetSlotResourceNeeds(slot *v1.Task) *types.SlotResourceNeeds
CheckSubsystemEdge(slotNeeds *types.SlotResourceNeeds, edge *types.Edge, vtx *types.Vertex)
CheckSubsystemEdge(slotNeeds *types.MatchAlgorithmNeeds, edge *types.Edge)
}

// List returns known algorithms
2 changes: 1 addition & 1 deletion pkg/graph/backend/backend.go
@@ -4,7 +4,7 @@ import (
"fmt"
"log"

js "github.com/compspec/jobspec-go/pkg/jobspec/experimental"
js "github.com/compspec/jobspec-go/pkg/nextgen/v1"

"github.com/converged-computing/jsongraph-go/jsongraph/v2/graph"
"github.com/converged-computing/rainbow/pkg/graph/algorithm"