-
We need to do more investigation, involving feedback from more users. Great job!!!
-
What surprises me the most is the difference between the default and tf32 results in the first two rows. I wonder if `allow_tf32` is set to true in PyTorch by default (it didn't use to be the case, but maybe that's changed). Update: No, it's still `False` by default.
-
Please file a bug vis-à-vis
-
It may be worth trying the TorchScript version with TorchSharp again, using v0.101.1.
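For anyone retrying this, here is a minimal sketch of what loading a TorchScript module from TorchSharp can look like. The file name and input shape are placeholders (the file would be exported on the Python side with `torch.jit.script(model).save(...)`), and the generic `torch.jit.load<Tensor, Tensor>` overload assumes the scripted model maps a single tensor to a single tensor:

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// "nanogpt_scripted.pt" is a placeholder for a file produced on the Python side
// with torch.jit.script(model).save("nanogpt_scripted.pt").
var scripted = torch.jit.load<Tensor, Tensor>("nanogpt_scripted.pt");
scripted.to(CUDA);
scripted.eval();

// Placeholder input: a batch of 64 token ids drawn from a 65-symbol vocabulary.
using var input = torch.randint(0, 65, new long[] { 1, 64 }, device: CUDA);
using var logits = scripted.forward(input);
Console.WriteLine(string.Join(", ", logits.shape));
```

Whether this works can depend on the TorchSharp/libtorch versions involved; if it still crashes on 0.101.1, that detail would be useful in the bug report.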
-
I have retried my tests with `v0.101.1`.
I have also tried TorchScript, but the only combination that worked was pure PyTorch with the script loaded from a file.
In TorchSharp, loading the script didn't work at all (I'll investigate some more what's happening - it seems that
-
One thing about benchmarking Python vs. .NET is the latter's JIT: it's important to do a warmup run of the .NET code (like completing a first batch) before starting to measure, if the comparison is going to be apples-to-apples.
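A minimal sketch of what that could look like in TorchSharp. The tiny `Sequential` model, synthetic batches and `TrainStep` helper are placeholders, not the code from the benchmark, and `torch.cuda.synchronize()` is assumed to be available to flush pending GPU work so it doesn't leak into or out of the timed region:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using TorchSharp;
using static TorchSharp.torch;

// Placeholder model, optimizer and data standing in for the real training code.
var model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64));
model.to(CUDA);
var optimizer = optim.AdamW(model.parameters());
var lossFn = nn.MSELoss();

void TrainStep(Tensor x, Tensor y)
{
    optimizer.zero_grad();
    using var pred = model.forward(x);
    using var loss = lossFn.forward(pred, y);
    loss.backward();
    optimizer.step();
}

var batches = Enumerable.Range(0, 100)
    .Select(_ => (x: randn(new long[] { 32, 64 }, device: CUDA),
                  y: randn(new long[] { 32, 64 }, device: CUDA)))
    .ToArray();

// Warmup: one full step so .NET JIT compilation, CUDA context creation and
// kernel autotuning all happen before the clock starts.
TrainStep(batches[0].x, batches[0].y);
cuda.synchronize();

var sw = Stopwatch.StartNew();
foreach (var (x, y) in batches)
    TrainStep(x, y);
cuda.synchronize();   // wait for queued GPU kernels to finish before stopping
sw.Stop();

Console.WriteLine($"Measured training time: {sw.Elapsed}");
```

The same warmup-then-measure structure applies on the Python side too, where CUDA context creation and kernel autotuning (though not a JIT) make the first batch unrepresentative.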
-
Do you have your model source code available in a repo somewhere?
-
@NiklasGustafsson I've uploaded the project to GitHub:
-
@NiklasGustafsson
-
@pkese Did you implement nanoGPT in TorchSharp natively, or only through TorchScript? I'm asking because I'm trying to get https://github.com/ejhg/llama-torchsharp (a fork of someone else's initial implementation) to match the performance of the reference llama PyTorch implementation (on an MPS device). Currently, my fork is running 2X slower on MPS, and barely faster than CPU, which is suspicious. If you have a TorchSharp port of nanoGPT, that would be really useful. On a separate thread, if anyone can share any tips on performance profiling for TorchSharp, that would be awesome...
-
If anyone is interested...
I made a small language model inspired by https://github.com/karpathy/nanoGPT in both PyTorch and TorchSharp.
The model has 2 layers of transformers totalling 150k parameters and is trained on Shakespeare's text.
I found out that going to smaller data types improves training time, as does PyTorch's `jit.compile`, which is not available in TorchSharp.

Here are some benchmarks of model training times (minutes and seconds) with CUDA on a small GPU (RTX 3070).
For `bf16` I used:

I couldn't achieve the same `bf16` functionality with TorchSharp.
I don't quite understand why default TorchSharp code is slower than default PyTorch code.
After I set `torch.backends.cuda.matmul.allow_tf32 = true` in both Python and TorchSharp, I get comparable performance (see first vs second column of results).

If someone is interested I can publish the code.
(I was trying to also get TorchScript models to work on both sides, which messed up the code quite a bit ... and I might wish to revert that.)
BTW, the TorchScript model was 1% slower to train in PyTorch and crashed in TorchSharp.
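For reference, the TF32 setting mentioned above is a one-liner on the TorchSharp side as well (a minimal sketch; the Python equivalent is the same statement with `True`):

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Allow TF32 tensor-core math for float32 matmuls (Ampere and newer GPUs).
// It trades a bit of matmul precision for a large speedup, and is the setting
// that made the default TorchSharp and PyTorch timings comparable here.
torch.backends.cuda.matmul.allow_tf32 = true;

Console.WriteLine($"allow_tf32 = {torch.backends.cuda.matmul.allow_tf32}");
```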