Skip to content
This repository has been archived by the owner on Oct 26, 2023. It is now read-only.

Adds API for creating Ray clusters #2

Merged
merged 110 commits into from
Mar 10, 2021
Merged

Adds API for creating Ray clusters #2

merged 110 commits into from
Mar 10, 2021

Conversation

sonnysideup
Copy link
Contributor

along with other scaffolded files and types (i.e. controller)

along with other scaffolded files and types (i.e. controller)
Copy link
Contributor

@ddl-kevin ddl-kevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

make target requires the use of `source' command
since this uses "/bin/sh" by default. if that path doesn't point to
bash, then the scripting will error out because the "source" command is
unavailable
Copy link
Contributor

@ddl-kevin ddl-kevin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of notes / questions. I can't think of anything else I would include right now, but I'm sure we will find some things.

api/v1alpha1/raycluster_types.go Outdated Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Outdated Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Outdated Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Outdated Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Outdated Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Show resolved Hide resolved
generating CRDs that support the latest and greatest
also sets fields with default values as "optional" since the v1 CRD spec
requires this distinction. "omitempty" accomplishes the same thing but
preference is lent to the former annotation since it clearly conveys our
intention.

update sample ray CR with a real field value; more will come
for use by all resources
and scaffolds structure for creating all the resources need to build a
ray cluster. serviceaccount is implemented. tests will come
these will be used to mount additional volumes into cluster pods
fixes default repo for ray image
adds NodeSelector and Affinity fields
adds realistic ObjectStoreMemoryBytes min validation in accordance with
the app's requirements. Points to OCIImageDefinition object
naive in the sense the controller creates them w/o check, but the
generation of the deployment structs are well tested.

also adds a head service and empty stub files for other required
resources.
this is a bit cleaner, more descriptive
adds ClientServerPort to RayClusterSpec
exposes proper ports in head service for ray
changes default ray image to standard image (w/o extra ML libs)
refactors ray resource lib funcs and adds more unit tests
creates network policy
creates and binds pod security policy
adds watches to controller for "owned" resources (still naive)
changes ObjectStoreMemoryBytes type to account for nil
should be int64 so we can represent more than 2.2Gi
fixes bug with cmd args passing value instead of pointer addr
while we can create a PSP for every cluster, the logistics behind
managing both cluster and namespace scoped resources are gross. these
changes make it possible for users to provide a psp "name". the psp will
be verified and then use will be enabled via RBAC resources.
enables primitive autoscaling
separates head/worker resource requests (different load profiles)
adds printer columns for `kubectl get' output
uses "recreate" strategy to manage head deployment
updates example RayCluster YAML
Copy link
Contributor

@steved steved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing controversial 👍

api/v1alpha1/common_types.go Show resolved Hide resolved
api/v1alpha1/raycluster_types.go Outdated Show resolved Hide resolved
pkg/resources/ray/horizontalpodautoscaler.go Outdated Show resolved Hide resolved
pkg/resources/ray/networkpolicy.go Show resolved Hide resolved
exposes ScaleDownStabilizationWindowSeconds for HPA
adds ImagePullSecrets to spec for pulling private images
makes RayCluster.Spec field required
controller-runtime 0.7.0 was causing API submissions to error out
whenever no default values were applied. this was fixed in version 0.7.1

kubernetes-sigs/controller-runtime#1329
reworks HPA creation behavior
adds integration tests for webhook logic
we already default this field and validate use input. making this
optional allows the user to specify other worker fields (e.g. resources)
without having to explicitly provide a replica count as well.
@sonnysideup sonnysideup marked this pull request as ready for review March 7, 2021 18:16
@sonnysideup sonnysideup requested a review from a team March 7, 2021 18:16
@sonnysideup
Copy link
Contributor Author

@steved This is about 99% complete; I just need to add some more logging to the controller. I'd appreciate it if you would give this another look.

sonnysideup and others added 14 commits March 7, 2021 13:30
version 0.8 allows us to excluded error stack traces in logs. this does
3 things:
- produces less noise in production logs
- forces us to add context to errors with wrapping
- still allows us to trace errors in development when fixing bugs
removes "development mode" default
this helps us log type information when creating owned resources
minor changes moving from 1.19 to 1.20
logging create/update/delete of owned components
debug logging for status updates and object patches
debug++ logging for reconciliation loop triggers
when updates to a stale object occur, we just requeue the key so it can
be reprocessed with the latest version of the RayCluster object.
…ommon name (which represents image, chart, and deployment names)
0.8.3 removes deprecation warnings that were muddling the logs
change MINIKUBE_PROFILE to evaluate as $MINIKUBE_PROFILE
adds extra vars with defaults to allow users override capabilities
steved
steved previously approved these changes Mar 9, 2021
Copy link
Contributor

@steved steved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, but I don't think any are a blocking issue.

Makefile Show resolved Hide resolved
api/v1alpha1/raycluster_webhook.go Outdated Show resolved Hide resolved
api/v1alpha1/raycluster_webhook.go Show resolved Hide resolved
controllers/raycluster_controller.go Show resolved Hide resolved
Comment on lines +289 to +292
if modified, ok := controlled.(*corev1.Service); ok {
current := found.(*corev1.Service)
modified.Spec.ClusterIP = current.Spec.ClusterIP
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ... seems a little odd and scary (i.e. feels like we should either have immutable services or somehow copy all the stateful-ish fields)

pkg/crd/crd.go Outdated Show resolved Hide resolved
pkg/manager/manager.go Show resolved Hide resolved
pkg/manager/manager.go Show resolved Hide resolved
pkg/resources/ray/deployment.go Outdated Show resolved Hide resolved
pkg/resources/ray/ray.go Show resolved Hide resolved
now this can be used by any reconciler or other types that require the
need to propagate a contextual logger
this was only added to enrich log statements. instead, we can grab the
type metadata directly from the scheme and add it to the logs prior to
create/update operations.
indicates the reason why the shape of this operation is slightly
different from the other reconcile functions.
using ifs to assert whether we need to take one action or another, or
return an err
default image definition is added when completely missing. if a user
provides an image, then we valid that the minimum required fields are
present.
@sonnysideup sonnysideup merged commit 6b28e91 into main Mar 10, 2021
@sonnysideup sonnysideup deleted the ray-cluster branch March 10, 2021 01:55
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
5 participants