-
We need to do more investigation, involving feedback from more users. Great job!!!
-
What surprises me the most is the difference between the default and tf32 results in the first two rows. I wonder if `allow_tf32` is set to true in PyTorch by default (it didn't use to be the case, but maybe that's changed). Update: No, it's still `False` by default.
-
Please file a bug vis-à-vis
-
It may be worth trying the TorchScript version with TorchSharp again, using v0.101.1.
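For anyone retrying this, here is a minimal sketch of what loading a TorchScript module from TorchSharp can look like. The file name and input shape are placeholders (the file would be exported on the Python side with `torch.jit.script(model).save(...)`), and the generic `torch.jit.load<Tensor, Tensor>` overload assumes the scripted model maps a single tensor to a single tensor:

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// "nanogpt_scripted.pt" is a placeholder for a file produced on the Python side
// with torch.jit.script(model).save("nanogpt_scripted.pt").
var scripted = torch.jit.load<Tensor, Tensor>("nanogpt_scripted.pt");
scripted.to(CUDA);
scripted.eval();

// Placeholder input: a batch of 64 token ids drawn from a 65-symbol vocabulary.
using var input = torch.randint(0, 65, new long[] { 1, 64 }, device: CUDA);
using var logits = scripted.forward(input);
Console.WriteLine(string.Join(", ", logits.shape));
```

Whether this works can depend on the TorchSharp/libtorch versions involved; if it still crashes on 0.101.1, that detail would be useful in the bug report.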
-
I have retried my tests with `v0.101.1`.
I have also tried TorchScript, but the only combination that worked was pure PyTorch with the script loaded from a file.
In TorchSharp, loading the script didn't work at all (I'll investigate some more what's happening - it seems that
-
One thing about benchmarking Python vs. .NET is the latter's JIT: it's important to do a warmup run of the .NET code (like completing a first batch) before starting to measure, if the comparison is going to be apples-to-apples.
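A minimal sketch of what that could look like in TorchSharp. The tiny `Sequential` model, synthetic batches and `TrainStep` helper are placeholders, not the code from the benchmark, and `torch.cuda.synchronize()` is assumed to be available to flush pending GPU work so it doesn't leak into or out of the timed region:

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using TorchSharp;
using static TorchSharp.torch;

// Placeholder model, optimizer and data standing in for the real training code.
var model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64));
model.to(CUDA);
var optimizer = optim.AdamW(model.parameters());
var lossFn = nn.MSELoss();

void TrainStep(Tensor x, Tensor y)
{
    optimizer.zero_grad();
    using var pred = model.forward(x);
    using var loss = lossFn.forward(pred, y);
    loss.backward();
    optimizer.step();
}

var batches = Enumerable.Range(0, 100)
    .Select(_ => (x: randn(new long[] { 32, 64 }, device: CUDA),
                  y: randn(new long[] { 32, 64 }, device: CUDA)))
    .ToArray();

// Warmup: one full step so .NET JIT compilation, CUDA context creation and
// kernel autotuning all happen before the clock starts.
TrainStep(batches[0].x, batches[0].y);
cuda.synchronize();

var sw = Stopwatch.StartNew();
foreach (var (x, y) in batches)
    TrainStep(x, y);
cuda.synchronize();   // wait for queued GPU kernels to finish before stopping
sw.Stop();

Console.WriteLine($"Measured training time: {sw.Elapsed}");
```

The same warmup-then-measure structure applies on the Python side too, where CUDA context creation and kernel autotuning (though not a JIT) make the first batch unrepresentative.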
-
Do you have your model source code available in a repo somewhere?
-
@NiklasGustafsson I've uploaded the project to GitHub:
-
@NiklasGustafsson
-
@pkese Did you implement nanoGPT in TorchSharp natively, or only through TorchScript? I'm asking because I'm trying to get https://github.com/ejhg/llama-torchsharp (a fork of someone else's initial implementation) to match the performance of the reference llama PyTorch implementation (on an MPS device). Currently, my fork is running 2X slower on MPS, and barely faster than CPU, which is suspicious. If you have a TorchSharp port of nanoGPT, that would be really useful. On a separate thread, if anyone can share any tips on performance profiling for TorchSharp, that would be awesome...
-
If anyone is interested...
I made a small language model inspired by https://github.com/karpathy/nanoGPT in both PyTorch and TorchSharp.
The model has 2 layers of transformers totalling 150k parameters and is trained on Shakespeare's text.
I found out that going to smaller data types improves training time, as does PyTorch's `jit.compile`, which is not available in TorchSharp.

Here are some benchmarks of model training times (minutes and seconds) with CUDA on a small GPU (RTX 3070).
For `bf16` I used:

I couldn't achieve the same `bf16` functionality with TorchSharp.
I don't quite understand why default TorchSharp code is slower than default PyTorch code.
After I set `torch.backends.cuda.matmul.allow_tf32 = true` in both Python and TorchSharp, I get comparable performance (see first vs second column of results).

If someone is interested I can publish the code.
(I was trying to also get TorchScript models to work on both sides, which messed up the code quite a bit ... and I might wish to revert that.)
BTW, the TorchScript model was 1% slower to train in PyTorch and crashed in TorchSharp.
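For reference, the TF32 setting mentioned above is a one-liner on the TorchSharp side as well (a minimal sketch; the Python equivalent is the same statement with `True`):

```csharp
using System;
using TorchSharp;
using static TorchSharp.torch;

// Allow TF32 tensor-core math for float32 matmuls (Ampere and newer GPUs).
// It trades a bit of matmul precision for a large speedup, and is the setting
// that made the default TorchSharp and PyTorch timings comparable here.
torch.backends.cuda.matmul.allow_tf32 = true;

Console.WriteLine($"allow_tf32 = {torch.backends.cuda.matmul.allow_tf32}");
```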