[rough draft] Cython-based file copying; aligned tensors #120
This is a rough draft, not intended for merging. Parts of this can be used for 3.x.
Loading a kernel-cached fp16 20B model from NVMe, I can get 24.5GB/s.
On a non-kernel-cached, NFS-backed system with O_DIRECT, I'm getting 8.5GB/s; an equivalent fio job lands somewhere in the 8-9GB/s range.
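For reference, here's a rough Python analogue of that comparison (hypothetical; the actual fio job parameters aren't reproduced here) that times sequential O_DIRECT reads. O_DIRECT requires page-aligned buffers, offsets, and sizes, hence the anonymous mmap and the truncated read length:

```python
import mmap
import os
import time

BS = 16 * 1024 * 1024  # 16MB reads, matching the bounce buffer size used below

# An anonymous mmap gives a page-aligned buffer, which O_DIRECT requires.
buf = mmap.mmap(-1, BS)
# Hypothetical path; any large file on the filesystem under test works.
fd = os.open("/mnt/nfs/model.tensors", os.O_RDONLY | os.O_DIRECT)
size = os.fstat(fd).st_size & ~4095  # O_DIRECT also wants aligned sizes

start = time.perf_counter()
done = 0
while done < size:
    done += os.preadv(fd, [buf], done)
elapsed = time.perf_counter() - start
os.close(fd)
print(f"{done / elapsed / 1e9:.2f} GB/s")
```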
Cython-based CUDA copy
`_cuda_file.pyx` is similar to the POSIX compatibility mode of cuFileRead, but written in Cython without the cuFile dependency. It does a sliding-window `pread` into a CUDA address, using an arbitrarily sized registered buffer (16MB in this case) as the bounce buffer. `copy_to_device()` can be used with or without O_DIRECT, for cases where O_DIRECT may be faster. To be O_DIRECT compatible, it aligns reads to page boundaries for both offsets and sizes. Unlike cuFile, it won't switch over automatically to use O_DIRECT; it'll use whatever file descriptor you pass in.
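A minimal Python sketch of the bounce-buffer loop (the real implementation is Cython with a registered buffer and raw cudaMemcpy; the names here are illustrative, and pinned PyTorch memory stands in for the registered buffer):

```python
import os
import torch

WINDOW = 16 * 1024 * 1024  # the registered bounce buffer size

def copy_to_device_sketch(fd: int, offset: int, size: int) -> torch.Tensor:
    """Read `size` bytes from `fd` at `offset` into a single CUDA buffer."""
    bounce = torch.empty(WINDOW, dtype=torch.uint8, pin_memory=True)
    view = memoryview(bounce.numpy())
    dest = torch.empty(size, dtype=torch.uint8, device="cuda")
    done = 0
    while done < size:
        want = min(WINDOW, size - done)
        # Slide the window: pread the next chunk into the bounce buffer...
        got = os.preadv(fd, [view[:want]], offset + done)
        if got <= 0:
            raise EOFError("short read")
        # ...then copy that chunk up to its place in the CUDA buffer.
        dest[done:done + got].copy_(bounce[:got])
        done += got
    return dest
```

The host-to-device copy here is synchronous, so the window can be reused immediately; a real implementation would want to double-buffer so the next `pread` overlaps the in-flight copy.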
Alignment & whole-hog reading
We read the entire segment of tensors, headers and all, straight into one large CUDA buffer. While that read runs in a thread, we skip through the headers and allocate PyTorch tensors over the locations where their data will land in CUDA memory. The tensor data is aligned to 8-byte boundaries so that any possible native type is well-aligned; e.g. fp16 data can't start on an odd address.
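The padding rule itself is just rounding each tensor's data offset up to the next multiple of 8 (illustrative snippet, not the serializer's actual code):

```python
def align8(offset: int) -> int:
    """Round a file/buffer offset up to the next 8-byte boundary."""
    return (offset + 7) & ~7

assert align8(0) == 0
assert align8(1) == 8
assert align8(42) == 48  # e.g. fp16 data at offset 42 would start mid-word
```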
Test models
The padding necessarily changes the serialization format. Here are a couple of models you can use for testing:
http://bchess.object.las1.coreweave.com/gpt-neox-20b-padded.tensors
http://bchess.object.las1.coreweave.com/gpt-j-6B-padded.tensors
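A possible smoke test, assuming this branch's `TensorDeserializer` reads the padded format and that the existing HTTP-streaming and dict-style APIs are unchanged:

```python
from tensorizer import TensorDeserializer

with TensorDeserializer(
    "http://bchess.object.las1.coreweave.com/gpt-j-6B-padded.tensors",
    device="cuda",
) as tensors:
    for name, tensor in tensors.items():
        # Every tensor should honor the 8-byte alignment guarantee above.
        assert tensor.data_ptr() % 8 == 0
```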