Distributed TensorFlow with Estimators
Follow the instructions below to run the tutorial code locally and on Clusterone.
To run the code locally, you need:
- Python 3.6
- TensorFlow 1.5 or higher. Install it like this:
pip install tensorflow
- The Clusterone Python library. Install it with
pip install clusterone
To run this project on Clusterone, you need:
- A Clusterone account. Create a free account on https://clusterone.com/.
That's all you need! Add a project by linking this GitHub repository (clusterone/clusterone-tutorials) to your account:
Follow the Set Up section of the Get Started guide to add your GitHub personal access token to your Clusterone account.
Then follow the Create a project section to add the clusterone-tutorials project. Use the
clusterone/clusterone-tutorials repository instead of the one shown in the guide.
You can run the tutorial code either on your local machine or on the Clusterone deep learning platform, even distributed over multiple GPUs. No code changes are necessary to switch between these modes.
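One common way for a single script to serve both modes is to read the cluster layout from the environment and fall back to single-machine training when none is present. The sketch below assumes the layout arrives in the `TF_CONFIG` environment variable, which is the variable `tf.estimator.RunConfig` reads. Whether the tutorial script relies on exactly this mechanism is an assumption, and `describe_run_mode` is a hypothetical helper, not part of TensorFlow or the Clusterone library.

```python
import json
import os


def describe_run_mode(environ=None):
    """Summarize the run mode implied by TF_CONFIG (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("TF_CONFIG", "")
    if not raw:
        # No cluster description: train on the local machine only.
        return {"mode": "local", "task_type": None, "task_index": None,
                "workers": 0, "ps": 0}

    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "mode": "distributed" if cluster else "local",
        "task_type": task.get("type"),
        "task_index": task.get("index"),
        # The chief also trains, so it is counted among the workers.
        "workers": len(cluster.get("chief", [])) + len(cluster.get("worker", [])),
        "ps": len(cluster.get("ps", [])),
    }


if __name__ == "__main__":
    # On a local machine without TF_CONFIG set, this reports the local mode.
    print(describe_run_mode())
```

Because the decision depends only on the environment, the same training script can be started unchanged on a laptop or inside a distributed job.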
Run the code locally
Start out by cloning this repository onto your local machine.
git clone https://github.com/clusterone/clusterone-tutorials
Then navigate to the directory with
Make sure you have installed all the requirements listed above. Assuming all packages are installed correctly, you can run the script with
python mnist.py. The script will download the MNIST dataset and then start training. You can view the training results in TensorBoard.
Run on Clusterone
These instructions use the
just command line tool, which is installed automatically with the Clusterone Python library.
If you have used the Clusterone library before with a different Clusterone installation, make sure it is connected to the correct endpoint by running
just config endpoint https://clusterone.com.
Log in to your Clusterone account by running
just login and entering your login information.
First, let's make sure that you have the project. Execute the command
just get projects to see all your projects. You should see something like this:
>> just get projects
All projects:
| # | Project                       | Created at          | Description |
|---|-------------------------------|---------------------|-------------|
| 0 | username/clusterone-tutorials | 2018-11-20T00:00:00 |             |
username should be your Clusterone account name.
Let's create a job. Make sure to replace
username with your username.
just create job distributed \
  --project username/clusterone-tutorials \
  --name distributed-mnist-job \
  --worker-replicas 2 \
  --worker-type aws-t2-small \
  --docker-image tensorflow-1.11.0-cpu-py35 \
  --ps-replicas 1 \
  --ps-type aws-t2-small \
  --ps-docker-image tensorflow-1.11.0-cpu-py35 \
  --time-limit 1h \
  --command "python tf-estimator/main.py" \
  --setup_command "pip install -r tf-estimator/requirements.txt"
This creates a job with 2 worker nodes and 1 parameter server. See our documentation for more information on how to change the number and instance types of workers and parameter servers.
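To make the layout concrete, here is a minimal sketch of how 2 workers and 1 parameter server could be described in the `TF_CONFIG` format that `tf.estimator` reads. The host names and the `build_tf_config` helper are hypothetical, and how Clusterone actually passes the layout to the job is not covered here. `tf.estimator` expects one `chief` task in a cluster description, so the sketch treats the first worker as the chief; that mapping is an assumption.

```python
import json


def build_tf_config(worker_hosts, ps_hosts, task_type, task_index):
    """Build a TF_CONFIG JSON string for one task (hypothetical helper)."""
    if not worker_hosts:
        raise ValueError("at least one worker host is required")
    # Assumption: the first worker acts as the chief, the rest as plain workers.
    cluster = {"chief": [worker_hosts[0]]}
    if len(worker_hosts) > 1:
        cluster["worker"] = list(worker_hosts[1:])
    if ps_hosts:
        cluster["ps"] = list(ps_hosts)
    task = {"type": task_type, "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Hypothetical host names for 2 worker nodes and 1 parameter server.
workers = ["worker-0:2222", "worker-1:2222"]
parameter_servers = ["ps-0:2222"]

if __name__ == "__main__":
    # Each node would receive the same cluster section but its own task entry.
    print(build_tf_config(workers, parameter_servers, "chief", 0))
    print(build_tf_config(workers, parameter_servers, "worker", 0))
    print(build_tf_config(workers, parameter_servers, "ps", 0))
```

Every node sees the same cluster description and differs only in its task entry, which is how each process knows its own role.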
Now the final step is to start the job:
just start job -p clusterone-tutorials/distributed-mnist-job
That's it! You can monitor its progress on the command line using
just get events. More elaborate monitoring is available on the Matrix, Clusterone's graphical web interface.
For further information on this example, take a look at the tutorial based on this repository on the Clusterone Blog.
If you have any further questions, don't hesitate to reach out on Slack!
MIT © Clusterone Inc.
The MNIST dataset has been created and curated by Corinna Cortes, Christopher J.C. Burges, and Yann LeCun.