[rough draft] Cython-based file copying; aligned tensors #120
This is a rough draft, not intended for merging. Parts of this can be used for 3.x.
Loading a kernel-cached fp16 20B model from NVMe, I can get 24.5GB/s.
On a non-kernel-cached, NFS-backed system with O_DIRECT, I'm getting 8.5GB/s; an equivalent fio job lands somewhere in the 8-9GB/s range.
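For reference, here's a rough Python analogue of that comparison (hypothetical; the actual fio job parameters aren't reproduced here) that times sequential O_DIRECT reads. O_DIRECT requires page-aligned buffers, offsets, and sizes, hence the anonymous mmap and the truncated read length:

```python
import mmap
import os
import time

BS = 16 * 1024 * 1024  # 16MB reads, matching the bounce buffer size used below

# An anonymous mmap gives a page-aligned buffer, which O_DIRECT requires.
buf = mmap.mmap(-1, BS)
# Hypothetical path; any large file on the filesystem under test works.
fd = os.open("/mnt/nfs/model.tensors", os.O_RDONLY | os.O_DIRECT)
size = os.fstat(fd).st_size & ~4095  # O_DIRECT also wants aligned sizes

start = time.perf_counter()
done = 0
while done < size:
    done += os.preadv(fd, [buf], done)
elapsed = time.perf_counter() - start
os.close(fd)
print(f"{done / elapsed / 1e9:.2f} GB/s")
```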
Cython-based CUDA copy
`_cuda_file.pyx` is similar to the POSIX compatibility mode of cuFileRead, but written in Cython without the cuFile dependency. It does a sliding-window `pread` into a CUDA address, using an arbitrarily sized registered buffer (16MB in this case) as the bounce buffer. `copy_to_device()` can be used with or without O_DIRECT, for cases where O_DIRECT may be faster. To be O_DIRECT compatible, it aligns reads to page boundaries for both offsets and sizes. Unlike cuFile, it won't switch over automatically to use O_DIRECT; it'll use whatever file descriptor you pass in.
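A minimal Python sketch of the bounce-buffer loop (the real implementation is Cython with a registered buffer and raw cudaMemcpy; the names here are illustrative, and pinned PyTorch memory stands in for the registered buffer):

```python
import os
import torch

WINDOW = 16 * 1024 * 1024  # the registered bounce buffer size

def copy_to_device_sketch(fd: int, offset: int, size: int) -> torch.Tensor:
    """Read `size` bytes from `fd` at `offset` into a single CUDA buffer."""
    bounce = torch.empty(WINDOW, dtype=torch.uint8, pin_memory=True)
    view = memoryview(bounce.numpy())
    dest = torch.empty(size, dtype=torch.uint8, device="cuda")
    done = 0
    while done < size:
        want = min(WINDOW, size - done)
        # Slide the window: pread the next chunk into the bounce buffer...
        got = os.preadv(fd, [view[:want]], offset + done)
        if got <= 0:
            raise EOFError("short read")
        # ...then copy that chunk up to its place in the CUDA buffer.
        dest[done:done + got].copy_(bounce[:got])
        done += got
    return dest
```

The host-to-device copy here is synchronous, so the window can be reused immediately; a real implementation would want to double-buffer so the next `pread` overlaps the in-flight copy.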
Alignment & whole-hog reading
We read the entire segment of tensors, headers and all, straight into one large CUDA buffer. While that read runs in a thread, we skip through the headers and allocate PyTorch tensors over the locations where their data will land in CUDA memory. The tensor data is aligned to 8-byte boundaries so that any possible native type is well-aligned; e.g. fp16 data can't start on an odd address.
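The padding rule itself is just rounding each tensor's data offset up to the next multiple of 8 (illustrative snippet, not the serializer's actual code):

```python
def align8(offset: int) -> int:
    """Round a file/buffer offset up to the next 8-byte boundary."""
    return (offset + 7) & ~7

assert align8(0) == 0
assert align8(1) == 8
assert align8(42) == 48  # e.g. fp16 data at offset 42 would start mid-word
```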
Test models
The padding necessarily changes the serialization format. Here are a couple of models you can use for testing:
http://bchess.object.las1.coreweave.com/gpt-neox-20b-padded.tensors
http://bchess.object.las1.coreweave.com/gpt-j-6B-padded.tensors
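A possible smoke test, assuming this branch's `TensorDeserializer` reads the padded format and that the existing HTTP-streaming and dict-style APIs are unchanged:

```python
from tensorizer import TensorDeserializer

with TensorDeserializer(
    "http://bchess.object.las1.coreweave.com/gpt-j-6B-padded.tensors",
    device="cuda",
) as tensors:
    for name, tensor in tensors.items():
        # Every tensor should honor the 8-byte alignment guarantee above.
        assert tensor.data_ptr() % 8 == 0
```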