

@bchess bchess commented Mar 28, 2024

This is a rough draft, not intended for merging. Parts of this can be used for 3.x

Loading a kernel-cached fp16 20B model from NVMe, I can get 24.5 GB/s.
On a non-kernel-cached NFS-backed system with O_DIRECT, I get 8.5 GB/s; an equivalent fio job lands somewhere in the 8-9 GB/s range.

Cython-based CUDA copy

_cuda_file.pyx is similar to the POSIX compatibility mode of cuFileRead, but written in Cython without the cuFile dependency. It does a sliding-window pread into a CUDA address, using an arbitrarily-sized registered buffer (16 MB in this case) as the bounce buffer. A rough sketch of the approach follows.
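For illustration, here is a minimal pure-Python sketch of the sliding-window bounce-buffer copy (the function name and structure here are mine, not the actual _cuda_file.pyx; the real implementation registers the buffer with CUDA and overlaps the pread with the device copy):

```python
import os
import torch

BOUNCE_BUFFER_SIZE = 16 << 20  # 16 MiB, matching the buffer size above

def copy_to_device_sketch(fd: int, offset: int, size: int,
                          dest: torch.Tensor) -> None:
    # dest is a flat uint8 CUDA tensor with at least `size` elements.
    bounce = torch.empty(BOUNCE_BUFFER_SIZE, dtype=torch.uint8,
                         pin_memory=True)
    view = memoryview(bounce.numpy())  # writable buffer for preadv
    done = 0
    while done < size:
        want = min(BOUNCE_BUFFER_SIZE, size - done)
        got = os.preadv(fd, [view[:want]], offset + done)
        if got <= 0:
            raise EOFError("short read")
        # Blocking host-to-device copy for simplicity; the Cython version
        # keeps the window sliding by overlapping the next pread with
        # the in-flight copy.
        dest[done:done + got].copy_(bounce[:got])
        done += got
```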

copy_to_device() can be used with or without O_DIRECT, for cases where O_DIRECT may be faster. To stay O_DIRECT-compatible, it aligns reads to page boundaries for both offsets and sizes. Unlike cuFile, it won't automatically switch over to O_DIRECT; it uses whatever file descriptor you pass in. The alignment math is sketched below.
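The page alignment looks roughly like this (a hypothetical helper, not the PR's actual code; the 4096-byte page size is an assumption, and a real implementation would query it at runtime):

```python
PAGE_SIZE = 4096  # assumed; query os.sysconf("SC_PAGESIZE") in real code

def align_for_o_direct(offset: int, size: int) -> tuple[int, int, int]:
    # Widen [offset, offset + size) so that both the start and the length
    # are page-aligned, as O_DIRECT requires; head_slack is how many
    # leading bytes of the aligned read to discard afterward.
    aligned_offset = offset & ~(PAGE_SIZE - 1)
    head_slack = offset - aligned_offset
    aligned_end = (offset + size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)
    return aligned_offset, aligned_end - aligned_offset, head_slack
```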

Alignment & whole-hog reading

We read the entire segment of tensors, headers and all, straight into one large CUDA buffer. While that read runs in a thread, we skip through the data, parsing the headers and allocating PyTorch tensors around where their data will land in CUDA memory. Tensor data is aligned to 8-byte boundaries so that any possible native type is well-aligned; e.g. fp16 can't start at an odd address.
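As a sketch of what "allocating tensors around the data" means (a hypothetical helper, not the PR's actual code): PyTorch itself enforces the alignment, since Tensor.view(dtype) requires the storage offset to be divisible by the element size, which the 8-byte padding guarantees.

```python
import math
import torch

ALIGNMENT = 8  # serializer pads each tensor's data to 8 bytes

def tensor_at(buf: torch.Tensor, offset: int, shape: tuple,
              dtype: torch.dtype) -> torch.Tensor:
    # buf is the one large flat uint8 CUDA buffer; return a zero-copy
    # view of the tensor whose data starts at byte `offset`.
    assert offset % ALIGNMENT == 0
    nbytes = math.prod(shape) * torch.empty(0, dtype=dtype).element_size()
    # view(dtype) would raise if `offset` weren't a multiple of the
    # element size; this is why odd-address fp16 is impossible.
    return buf[offset:offset + nbytes].view(dtype).view(shape)
```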

Test models

The padding necessarily changes the serialization format. Here are a couple of models you can use for testing:
http://bchess.object.las1.coreweave.com/gpt-neox-20b-padded.tensors
http://bchess.object.las1.coreweave.com/gpt-j-6B-padded.tensors

@bchess bchess requested review from Eta0 and wbrown March 28, 2024 18:55
@bchess bchess changed the title from Cython-based file copying; aligned tensors to [rough draft] Cython-based file copying; aligned tensors on Mar 28, 2024
@bchess bchess force-pushed the bchess/cuda_file branch from 32567ad to e6ded25 on March 28, 2024 20:52