
Add PyTorch example returning trained model #22

Merged · 6 commits into main · Jul 21, 2023
Conversation

scharlottej13
Contributor

@scharlottej13 scharlottej13 commented Jul 19, 2023

Adding an example that trains and returns a model (see #20 (comment))

This is close, but I'm having some deserialization issues. Explaining this in terms of the two new files:

  • run/pytorch-test.py: this works! It's a good minimal example of how to return a model from a function running on a remote GPU, save it locally, and then load the CPU version.
  • run/pytorch-train.py: this is the real example. I'm getting a deserialization error, and I think it's related to loading the MNIST dataset, since the traceback includes a ModuleNotFoundError for torchvision (cluster here). Full traceback:
ModuleNotFoundError                       Traceback (most recent call last)
File /opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py:4297, in update_graph()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/serialize.py:432, in deserialize()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/serialize.py:98, in pickle_loads()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/pickle.py:96, in loads()

File /opt/coiled/env/lib/python3.11/site-packages/cloudpickle/cloudpickle.py:649, in subimport()

ModuleNotFoundError: No module named 'torchvision'

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[26], line 1
----> 1 model = train_all_epochs()

File ~/mambaforge/envs/pytorch/lib/python3.11/site-packages/coiled/run.py:62, in Function.__call__(self, *args, **kwargs)
     61 def __call__(self, *args, **kwargs):
---> 62     return self.client.submit(self.function, *args, **kwargs).result()

File ~/mambaforge/envs/pytorch/lib/python3.11/site-packages/distributed/client.py:319, in Future.result(self, timeout)
    317 if self.status == "error":
    318     typ, exc, tb = result
--> 319     raise exc.with_traceback(tb)
    320 elif self.status == "cancelled":
    321     raise result

RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

Any debugging tips would be much appreciated! cc @mrocklin @ntabris
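(For reference, the round trip that run/pytorch-test.py demonstrates can be sketched roughly like this; the Linear layer is a hypothetical stand-in for the trained network, and the in-memory buffer stands in for the file you'd save locally.)

```python
import io
import torch

# Hypothetical stand-in for the trained network (the real example trains a
# Fashion-MNIST classifier on a remote GPU).
model = torch.nn.Linear(4, 2)

# Serialize the model's parameters, as you would before shipping them back
# from the remote worker and saving them locally.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# On a machine without a GPU, map any CUDA tensors onto the CPU at load time.
state = torch.load(buffer, map_location=torch.device("cpu"))

local_model = torch.nn.Linear(4, 2)
local_model.load_state_dict(state)
```

The `map_location` argument is what makes a model trained on `cuda` loadable on a CPU-only machine.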

@ntabris
Member

ntabris commented Jul 19, 2023

Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild)

I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13

but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson

@scharlottej13
Contributor Author

> Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild)
>
> I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13
>
> but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson

Ah yeah that's it, thank you Nat! (I even looked at the pytorch env and just didn't see it there 🤦‍♀️)

@scharlottej13 scharlottej13 changed the title from "WIP - add PyTorch example returning trained model" to "Add PyTorch example returning trained model" on Jul 19, 2023
@scharlottej13 scharlottej13 marked this pull request as ready for review July 19, 2023 21:28
@scharlottej13
Contributor Author

@mrocklin @ntabris this is ready for review, thanks for your help!

software="pytorch", # Our software environment defined above
region="us-west-2", # We find GPUs are easier to get here
)
def train_all_epochs():
Member


I would have expected somewhere in here a PyTorch line specifying that we're using the cuda backend. My guess is that we're just using CPU here. If you search the previous example for cuda you'll find the relevant line.
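(The usual PyTorch pattern for this, as a hedged sketch; the tensor shapes here are illustrative, not taken from the example.)

```python
import torch

# Select the CUDA backend when a GPU is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Both the model's parameters and the input batch must live on that device.
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(8, 4, device=device)
y = model(x)
```

Without the `.to(device)` / `device=` calls, everything silently stays on the CPU even when a GPU is present.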

Member


I pushed f5df826 with code changes so that the model trains on the GPU. @scharlottej13 if you want to revert that / do something different / make changes, go ahead (and I hope you don't mind me jumping in; I wanted to do something fun this morning).

You can see the logging I added for the cluster I ran: https://staging.coiledhq.com/clusters/93357/information?account=nat-tabris-staging&tab=Logs&filterPattern=instance%3Ascheduler+cuda

Contributor Author


Thanks @ntabris!


Member


I'm running this now and GPU util is pretty sad (but also non-zero), pretty much sitting at 13–14%.

Contributor Author


I'm not seeing any GPUs w/ less than 4 vCPU, maybe we could do some heavier computation in the model training?

Member


I think low CPU utilization is fine here; I was noting that GPU utilization was low. I actually wonder if we're CPU bound, since I do see CPU pegged at the equivalent of a single core.

These are things you'd have to worry about if you actually cared about training a model on a GPU; I don't think it's a serious concern for this example. (Obviously efficient training would be nicer, but personally it doesn't feel necessary: I'd rather have an inefficient training example than none.)


Yeah agreed, I think the most important part is that we show that you can train a model on the GPU. I don't think that potential users will care much about the specific model
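(If someone did want to chase the suspected CPU bottleneck later, the usual first knobs are on the DataLoader; this is a generic sketch with synthetic data, not code from this PR.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in dataset (the real example uses Fashion-MNIST).
dataset = TensorDataset(torch.randn(256, 4), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,    # load/transform batches in parallel worker processes
    pin_memory=True,  # page-locked host memory speeds up copies to the GPU
)
batches = list(loader)
```

With single-process loading (`num_workers=0`, the default), data preparation competes with everything else on one core, which matches the "CPU pegged at the equivalent of a single core" observation above.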


@phofl phofl left a comment


very small nit

run/pytorch-train.py (review thread, resolved)
@scharlottej13
Contributor Author

One thing I'm kind of curious about with this example: given the time it takes to move the data to the GPU, is it still faster to use a GPU vs. a CPU for training? Here's the same example but with c6i.xlarge: https://cloud.coiled.io/clusters/245459/information?account=sarah-johnson&tab=Metrics

@mrocklin
Member

mrocklin commented Jul 20, 2023 via email

@ntabris
Member

ntabris commented Jul 21, 2023

It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)

I'm +1 on merging as is. It's a toy case (a Fashion-MNIST model), but I think the point we're trying to make with this example is that you can easily offload to a GPU in the cloud, and also easily return your model to a local machine that doesn't have a GPU.

@scharlottej13
Contributor Author

> It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)
>
> I'm +1 on merging as is. It's a toy case (fashion mnist model) but I think the point we're trying to make with this example is that you can easily offload to GPU in the cloud, and also easily return your model back to local machine that doesn't have GPU.

Sounds good, I'm going to merge this then!

@scharlottej13 scharlottej13 merged commit dec6abf into main Jul 21, 2023
@scharlottej13 scharlottej13 deleted the sarah/pytorch branch July 21, 2023 18:13