A Kubernetes third-party resource and controller to manage TensorFlow training clusters.

elsonrodriguez/tensorsets

# TensorSets

TensorSets are a third-party resource for managing TensorFlow training clusters running in Kubernetes.

Note: this is a duct-tape proof of concept (POC). Using it in production will result in multiple RGEs (resume-generating events).

Walkthrough

First we define our ThirdPartyResource. This declares a new Kubernetes object type called "TensorSet".

kubectl create -f kubernetes/tensorset-tpr-v0.yaml
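Before moving on, it can help to confirm that the ThirdPartyResource was registered. The following is a minimal sketch, not part of this repository: `tpr_registered` is a hypothetical helper, and the `tensor-set` prefix used below is an assumption based on the usual ThirdPartyResource naming scheme (the actual name is whatever `kubernetes/tensorset-tpr-v0.yaml` declares).

```shell
# Hypothetical helper: succeeds if any registered ThirdPartyResource
# name starts with the given prefix (assumed here: "tensor-set").
tpr_registered() {
  prefix="$1"
  # List the names of all registered ThirdPartyResources, space separated.
  names=$(kubectl get thirdpartyresources \
    --output=jsonpath='{.items[*].metadata.name}')
  for name in $names; do
    case "$name" in
      "$prefix"*) return 0 ;;
    esac
  done
  return 1
}
```

For example, `tpr_registered tensor-set && echo "TPR is registered"` prints the message only once a matching resource exists.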

Next, we deploy our TensorSet controller. The controller is a small app that performs actions based on TensorSet objects.

kubectl create -f kubernetes/tensorset-controller-v0.yaml

Now we create our first TensorSet.

kubectl create -f kubernetes/cluster1-ts-v0.yaml

The TensorSet controller will create your training cluster, and eventually you will see a bunch of pods in your current namespace.
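One rough way to check whether the pods have come up is to poll their phase. The sketch below is an illustration rather than something shipped with this repository: `all_pods_running` is a hypothetical helper, and it treats the `Running` phase as an approximation of "ready", which is weaker than a real readiness check. It assumes the pods carry the `ts-cluster-name` label used later in this walkthrough.

```shell
# Hypothetical helper: succeeds only when at least one pod exists for the
# given cluster and every such pod reports phase "Running".
all_pods_running() {
  cluster="$1"
  phases=$(kubectl get pods --selector="ts-cluster-name=${cluster}" \
    --output=jsonpath='{.items[*].status.phase}')
  # No pods yet means the controller has not created them so far.
  [ -n "$phases" ] || return 1
  for phase in $phases; do
    [ "$phase" = "Running" ] || return 1
  done
  return 0
}
```

A simple wait loop could then be `until all_pods_running cluster1; do sleep 5; done`.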

Once they are all ready, start a training job:

kubectl create -f kubernetes/cluster1-job-v0.yaml

To see the progress of your Job:

pods=$(kubectl get pods --selector=ts-cluster-name=cluster1 --output=jsonpath='{.items..metadata.name}')
kubectl logs -f $pods
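If the selector matches more than one pod, `kubectl logs` will not treat the extra names as additional pods, so it may be easier to loop over them. This is a sketch under that assumption: `cluster_logs` is a hypothetical helper, and it prints each pod's logs once (without `-f`) so that the loop can move on to the next pod.

```shell
# Hypothetical helper: print the current logs of every pod that carries
# the given ts-cluster-name label, with a header line before each pod.
cluster_logs() {
  cluster="$1"
  for pod in $(kubectl get pods --selector="ts-cluster-name=${cluster}" \
      --output=jsonpath='{.items[*].metadata.name}'); do
    echo "==> ${pod} <=="
    kubectl logs "$pod"
  done
}
```

Running `cluster_logs cluster1` then dumps the logs pod by pod.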

Once done with your training cluster:

kubectl delete tensorset cluster1

And your cluster will be gone!

Roadmap
