Move pytorch shuffling out of main thread #1847
Conversation
This can more than double the performance of a simple PyTorch micro-benchmark.
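The actual diff is not reproduced in this thread; purely as an illustration of the idea, a shuffle buffer serviced by a background thread could look roughly like the sketch below (hypothetical code with made-up names such as `threaded_shuffle`, not the Hub implementation):

```python
# Hypothetical sketch of "shuffling in a background thread", not the actual PR code.
import random
from queue import Queue
from threading import Thread

_SENTINEL = object()

def threaded_shuffle(samples, buffer_size=1024, prefetch=8):
    """Yield `samples` in shuffled order; the shuffle buffer is maintained
    by a daemon thread so the consuming (training) loop never blocks on it."""
    out = Queue(maxsize=prefetch)

    def worker():
        buffer = []
        for sample in samples:
            buffer.append(sample)
            if len(buffer) >= buffer_size:
                # Pick a random element, swap it to the end, and emit it.
                i = random.randrange(len(buffer))
                buffer[i], buffer[-1] = buffer[-1], buffer[i]
                out.put(buffer.pop())
        random.shuffle(buffer)          # drain whatever is left at the end
        for sample in buffer:
            out.put(sample)
        out.put(_SENTINEL)

    Thread(target=worker, daemon=True).start()
    while (item := out.get()) is not _SENTINEL:
        yield item
```

With this shape the training loop only ever blocks on a `Queue.get`, so the shuffling work overlaps with the GPU step instead of running in series with it.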
Hi @hakanardo! Thanks a lot for your contribution. Do you mind signing the CLA while we review the PR? Also, if you haven't already, please consider joining our community on Slack.
Hi @hakanardo, thanks for the contribution. Can you provide the code and results for the benchmarks mentioned in the PR description?
Sure, they'll have to be extracted from a bigger experimental tangle, but I'll get back to you with a stand-alone micro-benchmark...
Hi, here is the benchmark, with the results included as comments at the bottom.
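(The attached script itself is not reproduced in this thread; a minimal timing harness of the kind described could look roughly like the sketch below. The commented-out `ds.pytorch(...)` calls are assumptions about the setup, not the exact benchmark.)

```python
import time

def batches_per_second(dataloader, warmup=5, num_batches=100):
    """Rough throughput measurement for any iterable dataloader.
    Assumes the loader yields at least warmup + num_batches batches."""
    it = iter(dataloader)
    for _ in range(warmup):                 # let worker threads/processes spin up
        next(it)
    start = time.perf_counter()
    for _ in range(num_batches):
        next(it)
    return num_batches / (time.perf_counter() - start)

# Illustrative usage (assumed Hub-style API, adjust to the real script):
# print("shuffle=False:", batches_per_second(ds.pytorch(batch_size=32, shuffle=False, num_workers=8)))
# print("shuffle=True :", batches_per_second(ds.pytorch(batch_size=32, shuffle=True,  num_workers=8)))
```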
@hakanardo thank you very much for these; they would be extremely helpful for us internally to improve Hub. I'll let @farizrahman4u and @levongh provide any additional updates once they have them. Have a nice start of the week in the meantime!
Hey @hakanardo! Thanks for your PR. I was looking into the benchmark file you sent but was unable to recreate your results. The speed seems to be almost the same on main vs on your branch. Any specifics you can share about your machine that might help us recreate this would be helpful. On another note, we have recently introduced a new experimental dataloader implemented in C++ to Hub (it only works on Linux for now). In the near future we'll be deprecating the existing dataloader. If you're interested, you can take a look at this Getting started with Deep Lake notebook and the associated docs.
I use a "Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz" system with an "GeForce
RTX 2080 Ti" GPU. Thanx for the pointer, I will have a look!
I've benchmarked the deeplake loader using the attached benchmark. It performs slightly better than my branch with shuffle=True, but significantly worse than the python loader with shuffle=False. Am I doing something wrong? The exact numbers are at the bottom of the script. Also, the distribution of the indices in the produced batches looks a lot more uniform. That is very nice!
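(A hypothetical way to quantify that index-distribution observation, assuming each batch exposes the original sample indices under a made-up 'index' key; the real benchmark may record this differently.)

```python
from collections import Counter

def index_spread(dataloader, num_batches=100, bucket=1000):
    """Count how many samples from each block of `bucket` consecutive indices
    show up in the first `num_batches` batches; a shuffle that mixes the whole
    dataset should spread the counts fairly evenly across buckets."""
    counts = Counter()
    for i, batch in enumerate(dataloader):
        if i >= num_batches:
            break
        for idx in batch['index'].flatten().tolist():   # assumed 'index' field
            counts[int(idx) // bucket] += 1
    return counts
```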
Hey @hakanardo, can you please share
Thanks @hakanardo! You might want to use num_workers=0 for deeplake (we're spinning up threads for fetching and decompression; num_workers spins up separate processes for transformation and collation), as in the current implementation there's an interprocess communication overhead that we'll be working on fixing shortly.
Thanks for the tip, num_workers=0 improves things, but I still only get around 80% GPU utilization in my trainings, as opposed to about 98% with the shuffle_thread branch. I also tried to move the entire data loading to a separate process by wrapping it in another dataloader, similar to how the old loader works:

```python
from torch.utils.data import DataLoader, IterableDataset

class IterableDatasetWrapper(IterableDataset):
    def __init__(self, dl):
        self.dl = dl

    def __iter__(self):
        for b in self.dl:
            yield b

dataloader = DataLoader(IterableDatasetWrapper(dataloader), num_workers=1,
                        collate_fn=lambda a: a[0])
```

But that seems to deadlock.
See also #1888
I think using
That worked, but it requires the dataloader to be picklable, which is a bit annoying. It seems to be enough to wrap it in a separate thread, though:

```python
from queue import Queue
from threading import Thread

class DONE:
    pass

class ThreadedIterator(Thread):
    def __init__(self, iterable):
        super().__init__(daemon=True)
        self.iterable = iterable
        self.queue = Queue(2)
        self.start()

    def run(self) -> None:
        while True:
            for val in self.iterable:
                val['images'] = val['images'].pin_memory()
                self.queue.put(val)
            self.queue.put(DONE)

    def __iter__(self):
        while True:
            val = self.queue.get()
            if val is DONE:
                break
            yield val
```

This allows me to train at full GPU utilization again. Would you consider including something like this as an option in the new deeplake dataloader?
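(A usage sketch for context; `dataloader`, `num_epochs` and the training step are placeholders, not code from the thread.)

```python
# Usage sketch: `dataloader` stands in for any iterable of batches carrying an
# 'images' tensor (e.g. the deeplake pytorch dataloader); the training step is elided.
loader = ThreadedIterator(dataloader)
num_epochs = 10                                   # placeholder
for epoch in range(num_epochs):
    for batch in loader:
        images = batch['images'].cuda(non_blocking=True)
        # ... forward / backward / optimizer step ...
```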
Hey @hakanardo! Thanks for the suggestion, I tried it but couldn't really get a performance boost while iterating. Do you think that the better performance here is due to pinning memory in the separate thread that you're running?
I think the main effect comes from doing the data loading on the CPU in parallel with the training on the GPU instead of in series with it. That way the entire data-loading time can be hidden, as the data is already available when the GPU needs it. Are you benchmarking with a dataset big enough not to fit in the cache, to make sure there is an IO time cost to the data loading?
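(A back-of-the-envelope illustration of that point, with made-up timings.)

```python
# Made-up numbers: 30 ms to load a batch on the CPU, 50 ms to train on it on the GPU.
t_load, t_train = 0.030, 0.050
serial    = t_load + t_train       # 80 ms/batch when loading blocks the training loop
pipelined = max(t_load, t_train)   # 50 ms/batch when loading overlaps the GPU step
print(f"GPU utilization: serial {t_train / serial:.0%}, pipelined {t_train / pipelined:.0%}")
# -> GPU utilization: serial 62%, pipelined 100%
```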
Got it. Yup, I was using ImageNet. The thing is that the experimental dataloader is already spinning up threads in C++ to prefetch and decompress data in parallel with training on the GPU. The collation and transform are, however, done serially; perhaps that could be the difference in our results. Were you using some transform or collate? Is it the same script as the one you sent before?
Closing this for now as we didn't see a lot of performance changes. |