
Add PyTorch example returning trained model #22

Merged · 6 commits into main · Jul 21, 2023
Conversation

scharlottej13
Contributor

@scharlottej13 scharlottej13 commented Jul 19, 2023

Adding an example that trains and returns a model (see #20 (comment))

This is close, but I'm having some deserialization issues. Explaining this in terms of the two new files:

  • run/pytorch-test.py: this works! It's a good minimal example of how to return a model from a function running on a remote GPU, save it locally, and then load the CPU version.
  • run/pytorch-train.py: this is the real example. I'm getting a deserialization error, and I think it's related to loading the MNIST dataset, since the traceback includes a ModuleNotFoundError for torchvision (cluster here). Full traceback:
ModuleNotFoundError                       Traceback (most recent call last)
File /opt/coiled/env/lib/python3.11/site-packages/distributed/scheduler.py:4297, in update_graph()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/serialize.py:432, in deserialize()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/serialize.py:98, in pickle_loads()

File /opt/coiled/env/lib/python3.11/site-packages/distributed/protocol/pickle.py:96, in loads()

File /opt/coiled/env/lib/python3.11/site-packages/cloudpickle/cloudpickle.py:649, in subimport()

ModuleNotFoundError: No module named 'torchvision'

The above exception was the direct cause of the following exception:

RuntimeError                              Traceback (most recent call last)
Cell In[26], line 1
----> 1 model = train_all_epochs()

File ~/mambaforge/envs/pytorch/lib/python3.11/site-packages/coiled/run.py:62, in Function.__call__(self, *args, **kwargs)
     61 def __call__(self, *args, **kwargs):
---> 62     return self.client.submit(self.function, *args, **kwargs).result()

File ~/mambaforge/envs/pytorch/lib/python3.11/site-packages/distributed/client.py:319, in Future.result(self, timeout)
    317 if self.status == "error":
    318     typ, exc, tb = result
--> 319     raise exc.with_traceback(tb)
    320 elif self.status == "cancelled":
    321     raise result

RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

Any debugging tips would be much appreciated! cc @mrocklin @ntabris
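(For reference, the round trip that run/pytorch-test.py demonstrates can be sketched roughly like this; the Linear layer is a hypothetical stand-in for the trained network, and the in-memory buffer stands in for the file you'd save locally.)

```python
import io
import torch

# Hypothetical stand-in for the trained network (the real example trains a
# Fashion-MNIST classifier on a remote GPU).
model = torch.nn.Linear(4, 2)

# Serialize the model's parameters, as you would before shipping them back
# from the remote worker and saving them locally.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# On a machine without a GPU, map any CUDA tensors onto the CPU at load time.
state = torch.load(buffer, map_location=torch.device("cpu"))

local_model = torch.nn.Linear(4, 2)
local_model.load_state_dict(state)
```

The `map_location` argument is what makes a model trained on `cuda` loadable on a CPU-only machine.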

@ntabris
Member

ntabris commented Jul 19, 2023

Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild)

I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13

but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson

@scharlottej13
Contributor Author

> Maybe you just need to rebuild the Coiled software environment? (If it's not rebuilding you could try using --force-rebuild)
>
> I see torchvision in https://github.com/coiled/examples/blob/main/pytorch.yml#L13
>
> but not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson

Ah yeah that's it, thank you Nat! (I even looked at the pytorch env and just didn't see it there 🤦‍♀️)

@scharlottej13 scharlottej13 changed the title from "WIP - add PyTorch example returning trained model" to "Add PyTorch example returning trained model" on Jul 19, 2023
@scharlottej13 scharlottej13 marked this pull request as ready for review July 19, 2023 21:28
@scharlottej13
Contributor Author

@mrocklin @ntabris this is ready for review, thanks for your help!

software="pytorch", # Our software environment defined above
region="us-west-2", # We find GPUs are easier to get here
)
def train_all_epochs():
Member


I would have expected somewhere in here a PyTorch line specifying that we're using the cuda backend. My guess is that we're just using CPU here. If you search the previous example for cuda you'll find the relevant line.
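(The usual PyTorch pattern for this, as a hedged sketch; the tensor shapes here are illustrative, not taken from the example.)

```python
import torch

# Select the CUDA backend when a GPU is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Both the model's parameters and the input batch must live on that device.
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(8, 4, device=device)
y = model(x)
```

Without the `.to(device)` / `device=` calls, everything silently stays on the CPU even when a GPU is present.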

Member


I pushed f5df826 with code changes so that the model trains on the GPU. @scharlottej13 if you want to revert that / do something different / make changes, go ahead (and I hope you don't mind me jumping in; I wanted to do something fun this morning).

You can see the logging I added for the cluster I ran: https://staging.coiledhq.com/clusters/93357/information?account=nat-tabris-staging&tab=Logs&filterPattern=instance%3Ascheduler+cuda

Contributor Author


Thanks @ntabris!


Member


I'm running this now and GPU util is pretty sad (but also non-zero), pretty much sitting at 13–14%.

Contributor Author


I'm not seeing any GPUs w/ less than 4 vCPU, maybe we could do some heavier computation in the model training?

Member


I think low CPU utilization is fine here; I was noting that GPU utilization was low. I actually wonder if we're CPU bound, since I do see CPU pegged at the equivalent of a single core.

These are things you'd have to worry about if you actually cared about training a model on a GPU; I don't think it's a serious concern for this example. (Obviously efficient training would be nicer, but personally it doesn't feel necessary: I'd rather have an inefficient training example than none.)


Yeah agreed, I think the most important part is that we show that you can train a model on the GPU. I don't think that potential users will care much about the specific model
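(If someone did want to chase the suspected CPU bottleneck later, the usual first knobs are on the DataLoader; this is a generic sketch with synthetic data, not code from this PR.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in dataset (the real example uses Fashion-MNIST).
dataset = TensorDataset(torch.randn(256, 4), torch.randint(0, 2, (256,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=2,    # load/transform batches in parallel worker processes
    pin_memory=True,  # page-locked host memory speeds up copies to the GPU
)
batches = list(loader)
```

With single-process loading (`num_workers=0`, the default), data preparation competes with everything else on one core, which matches the "CPU pegged at the equivalent of a single core" observation above.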


@phofl phofl left a comment


very small nit

run/pytorch-train.py (review thread, resolved)
@scharlottej13
Contributor Author

One thing I'm kind of curious about with this example: given the time it takes to move the data to the GPU, is it still faster to use a GPU vs. a CPU for training? Here's the same example but with c6i.xlarge: https://cloud.coiled.io/clusters/245459/information?account=sarah-johnson&tab=Metrics

@mrocklin
Member

mrocklin commented Jul 20, 2023 via email

@ntabris
Member

ntabris commented Jul 21, 2023

It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)

I'm +1 on merging as is. It's a toy case (a Fashion-MNIST model), but I think the point we're trying to make with this example is that you can easily offload to a GPU in the cloud, and also easily return your model to a local machine that doesn't have a GPU.

@scharlottej13
Contributor Author

> It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on)
>
> I'm +1 on merging as is. It's a toy case (fashion mnist model) but I think the point we're trying to make with this example is that you can easily offload to GPU in the cloud, and also easily return your model back to local machine that doesn't have GPU.

Sounds good, I'm going to merge this then!

@scharlottej13 scharlottej13 merged commit dec6abf into main Jul 21, 2023
@scharlottej13 scharlottej13 deleted the sarah/pytorch branch July 21, 2023 18:13