Add PyTorch example returning trained model #22
Conversation
Maybe you just need to rebuild the Coiled software environment? I see it's not listed in https://cloud.coiled.io/software/alias/26488/build/21361?account=sarah-johnson
Ah yeah, that's it, thank you Nat! (I even looked at the pytorch env and just didn't see it there 🤦♀️)
```python
    software="pytorch",  # Our software environment defined above
    region="us-west-2",  # We find GPUs are easier to get here
)
def train_all_epochs():
```
I would have expected somewhere in here a PyTorch line specifying that we're using the cuda backend. My guess is that we're just using CPU here. If you search the previous example for cuda you'll find the relevant line.
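The line in question typically looks something like the following. This is a minimal sketch of the usual device-selection idiom, not the exact code from this PR; the model here is a stand-in:

```python
import torch

# Pick the CUDA backend when a GPU is available, falling back to CPU
# so the same script also runs on a machine without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the example's model; .to(device) moves its parameters.
model = torch.nn.Linear(28 * 28, 10).to(device)

# Inside the training loop, each batch must be moved to the device too.
x = torch.randn(32, 28 * 28, device=device)
y = model(x)
```

Without the `.to(device)` calls, everything silently stays on CPU even when a GPU is present.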
I pushed f5df826 with code changes so that model trains on GPU. @scharlottej13 if you want to revert that / do something different / make changes, go ahead (and hope you don't mind me jumping in, I wanted to do something fun this morning).
You can see the logging I added for the cluster I ran: https://staging.coiledhq.com/clusters/93357/information?account=nat-tabris-staging&tab=Logs&filterPattern=instance%3Ascheduler+cuda
Thanks @ntabris!
made a small change, looking good https://cloud.coiled.io/clusters/245450/information?account=sarah-johnson&tab=Metrics
I'm running this now and GPU util is pretty sad (but non-zero), sitting at 13–14%.
I'm not seeing any GPU instances w/ less than 4 vCPU; maybe we could do some heavier computation in the model training?
I think low CPU util is fine here, I was noting that GPU util was low. I actually wonder if we're CPU bound here, since I do see CPU pegged at equiv of single core.
These are things you'd have to worry about if you actually cared about training a model on GPU; I don't think it's a serious concern for this example. (Obviously efficient training would be nicer, but personally it doesn't feel necessary: I'd rather have an inefficient training example than none.)
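If training really were CPU-bound on data preparation, the usual first knobs to turn are on the DataLoader. A hedged sketch, not what this PR does; the tensors here are stand-ins for the real dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset: 1024 fake 28x28 images with integer labels.
dataset = TensorDataset(
    torch.randn(1024, 1, 28, 28),
    torch.randint(0, 10, (1024,)),
)

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=2,    # prepare batches in parallel worker processes
    pin_memory=True,  # page-locked host memory speeds host-to-GPU copies
)

for x, y in loader:
    # In a real loop: x = x.to(device, non_blocking=True)
    pass
```

With `num_workers=0` (the default), a single process both prepares batches and drives the GPU, which can leave the GPU waiting on the CPU.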
Yeah, agreed. I think the most important part is that we show you can train a model on the GPU. I don't think potential users will care much about the specific model.
very small nit
One thing I'm kind of curious about w/ this example: given the time it takes to move the data to the GPU, is it still faster to use a GPU vs. CPU for training? Here's the same example but w/ c6i.xlarge: https://cloud.coiled.io/clusters/245459/information?account=sarah-johnson&tab=Metrics
It would be nice to have an example that was more computationally intense. It's not necessarily worth spending a bunch of time on to make this great though. Maybe this gets the point across and other things would be higher value (or not, I'm a bit out of the loop about all that's going on).
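One empirical way to settle the GPU-vs-CPU question is to time a few training steps on each available device. A hypothetical sketch; the model and batch sizes are stand-ins, not the example's actual Fashion-MNIST network:

```python
import time
import torch

def time_steps(device: torch.device, steps: int = 10) -> float:
    """Time a few SGD steps of a small MLP on the given device."""
    model = torch.nn.Sequential(
        torch.nn.Linear(784, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 10),
    ).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    x = torch.randn(256, 784, device=device)
    y = torch.randint(0, 10, (256,), device=device)

    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    if device.type == "cuda":
        torch.cuda.synchronize()  # CUDA ops are async; wait before timing
    return time.perf_counter() - start

cpu_time = time_steps(torch.device("cpu"))
if torch.cuda.is_available():
    gpu_time = time_steps(torch.device("cuda"))
```

For a model this small the GPU often loses once transfer overhead is counted, which is consistent with the low utilization seen above.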
I'm +1 on merging as is. It's a toy case (a Fashion-MNIST model) but I think the point we're trying to make with this example is that you can easily offload to a GPU in the cloud, and also easily return your model back to a local machine that doesn't have a GPU.
Sounds good, I'm going to merge this then!
Adding an example that trains and returns a model (see #20 (comment))
This is close, but I'm having some deserialization issues. Explaining this in terms of the two new files:
- run/pytorch-test.py: this works! This is a good minimal example of how to return a model from a function running on a remote GPU, save it locally, and then load the CPU version.
- run/pytorch-train.py: this is the real example. I'm getting a deserialization error, and I think it's related to loading the mnist dataset, since the traceback includes a ModuleNotFoundError for torchvision (cluster here), full traceback:

Any debugging tips would be much appreciated! cc @mrocklin @ntabris
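One common way to sidestep this class of deserialization error is to return the model's state_dict (plain tensors) rather than the pickled model object, since unpickling a full model requires its defining modules (e.g. torchvision) to be importable locally. A hypothetical sketch, not necessarily the fix used in this PR:

```python
import io
import torch

# "Remote" side: train a model (stand-in here), then serialize only
# the weights. A state_dict is just tensors, so no torchvision (or
# other remote-only module) is needed to deserialize it.
model = torch.nn.Linear(4, 2)
buf = io.BytesIO()
torch.save(model.state_dict(), buf)

# "Local" side: rebuild the architecture from local code, then load
# the weights onto CPU regardless of where they were trained.
buf.seek(0)
local_model = torch.nn.Linear(4, 2)
local_model.load_state_dict(torch.load(buf, map_location="cpu"))
```

`map_location="cpu"` is what lets weights trained on a GPU worker load on a CPU-only laptop.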