
Conversation

@hakanardo
Contributor

This can more than double the performance of a simple
PyTorch micro-benchmark.

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

@CLAassistant

CLAassistant commented Sep 6, 2022

CLA assistant check
All committers have signed the CLA.

This can more than double the performance of a simple
PyTorch micro-benchmark.
@tatevikh
Contributor

tatevikh commented Sep 6, 2022

Hi @hakanardo! Thanks a lot for your contribution. Do you mind signing the CLA while we review the PR? Also, if you haven't already, please consider joining our community on Slack.

@farizrahman4u
Contributor

Hi @hakanardo, thanks for the contribution. Can you provide the code and results for the benchmarks mentioned in the PR description?

@hakanardo
Contributor Author

hakanardo commented Sep 8, 2022 via email

@hakanardo
Contributor Author

hakanardo commented Sep 12, 2022 via email

@hakanardo
Contributor Author

hub_bench.zip

@mikayelh
Contributor

@hakanardo thank you very much for these; they will be extremely helpful for us internally in improving Hub. I'll let @farizrahman4u and @levongh provide any additional updates once they have them. Have a nice start to the week in the meantime!

@AbhinavTuli
Contributor

Hey @hakanardo! Thanks for your PR. I was looking into the benchmark file you sent but was unable to reproduce your results; the speed seems to be almost the same on main as on your branch. Any specifics you can share about your machine that might help us reproduce this would be appreciated.

On another note, we have recently introduced a new experimental dataloader, implemented in C++, to Hub (it only works on Linux for now). In the near future we'll be deprecating the existing dataloader. If you're interested, you can take a look at this Getting started with Deep Lake notebook and the associated docs.

@hakanardo
Contributor Author

hakanardo commented Sep 15, 2022 via email

@hakanardo
Contributor Author

I've benchmarked the deeplake loader using the attached benchmark. It performs slightly better than my branch with shuffle=True, but significantly worse than the Python loader with shuffle=False. Am I doing something wrong? The exact numbers are at the bottom of the script.

Also, the distribution of the indexes in the produced batches looks a lot more uniform. That is very nice!

deeplake_bench.zip

@levongh
Contributor

levongh commented Sep 15, 2022

Hey @hakanardo, can you please share the hub and deeplake versions you're using?

@AbhinavTuli
Contributor

AbhinavTuli commented Sep 15, 2022

Thanks @hakanardo! You might want to use num_workers=0 for deeplake (we spin up threads for fetching and decompression; num_workers spins up separate processes for transformation and collation), as the current implementation has an interprocess communication overhead that we'll be fixing shortly.
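For illustration only, a minimal sketch of the suggested usage; build_deeplake_loader below is a hypothetical placeholder, since the exact constructor of the experimental deeplake dataloader isn't shown in this thread:

# Hypothetical sketch: build_deeplake_loader stands in for however the
# experimental C++-backed loader is actually constructed. The point is to keep
# num_workers=0 so no extra Python processes are spawned for transform/collate,
# avoiding the interprocess communication overhead mentioned above.
loader = build_deeplake_loader(ds, batch_size=64, shuffle=True, num_workers=0)

for batch in loader:
    images = batch['images']  # fetching/decompression already run in C++ threads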

@hakanardo
Contributor Author

Thanks for the tip, num_workers=0 improves things, but I still only get around 80% GPU utilization during training, as opposed to about 98% with the shuffle_thread branch. I also tried moving the entire dataloading to a separate process by wrapping it in another dataloader, similar to how the old loader works:

from torch.utils.data import DataLoader, IterableDataset


class IterableDatasetWrapper(IterableDataset):
    """Wraps an existing dataloader so it can be driven from a worker process."""

    def __init__(self, dl):
        self.dl = dl

    def __iter__(self):
        for b in self.dl:
            yield b


# Run the wrapped loader in a single worker process; the lambda unwraps the
# batch-of-one produced by the outer DataLoader.
dataloader = DataLoader(IterableDatasetWrapper(dataloader), num_workers=1, collate_fn=lambda a: a[0])

But that seems to deadlock on the api.dataset call on line 60 of convert_to_hub3.py.

@hakanardo
Contributor Author

See also #1888

@AbhinavTuli
Contributor

I think using multiprocessing.set_start_method("spawn", force=True) at the top should fix the issue of it getting stuck.
There's an issue with forking the Hub3Dataset object that we're aware of.
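A minimal sketch of that suggestion (the surrounding script structure is assumed, not taken from this thread):

import multiprocessing

# Force the "spawn" start method before any DataLoader workers are created, so
# worker processes don't fork (and inherit) the already-open dataset state.
multiprocessing.set_start_method("spawn", force=True)

# ...build the dataset and the wrapped DataLoader (num_workers=1) after this point...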

@hakanardo
Contributor Author

That worked, but it requires the dataloader to be picklable, which is a bit annoying. It seems to be enough, though, to wrap it in a separate thread:

from queue import Queue
from threading import Thread


class DONE:
    """Sentinel marking the end of one pass over the wrapped iterable."""


class ThreadedIterator(Thread):
    def __init__(self, iterable):
        super().__init__(daemon=True)
        self.iterable = iterable
        self.queue = Queue(2)  # small buffer: prefetch at most two batches ahead
        self.start()

    def run(self) -> None:
        # Keep iterating the wrapped loader in the background, one epoch after
        # another, pinning the image tensors so host-to-device copies can overlap.
        while True:
            for val in self.iterable:
                val['images'] = val['images'].pin_memory()
                self.queue.put(val)
            self.queue.put(DONE)

    def __iter__(self):
        # Yield batches until the sentinel for the current epoch arrives.
        while True:
            val = self.queue.get()
            if val is DONE:
                break
            yield val
This allows me to train at full GPU utilization again. Would you consider including something like this as an option in the new deeplake dataloader?
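For context, a minimal sketch of how the wrapper above could be used in a training loop; the loader construction and the model step are assumed, not taken from this thread:

wrapped = ThreadedIterator(dataloader)  # dataloader: the deeplake/hub loader built earlier

for batch in wrapped:
    # Batches arrive with 'images' already pinned, so the copy to GPU can be non-blocking.
    images = batch['images'].cuda(non_blocking=True)
    # ...forward/backward pass on `images` goes here...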

@AbhinavTuli
Contributor

Hey @hakanardo! Thanks for the suggestion. I tried it but couldn't really get a performance boost while iterating. Do you think the better performance here is due to pinning memory in the separate thread that you're running?
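For reference, a minimal PyTorch sketch of the pinning idea being asked about; nothing here comes from hub or deeplake, it only shows why pinned host memory plus a non-blocking copy lets the transfer overlap with GPU work:

import torch

batch = torch.randn(64, 3, 224, 224)        # stand-in for a decoded image batch
pinned = batch.pin_memory()                 # move into page-locked host memory
gpu_batch = pinned.cuda(non_blocking=True)  # asynchronous host-to-device copy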

@hakanardo
Contributor Author

hakanardo commented Oct 4, 2022 via email

@AbhinavTuli
Contributor

AbhinavTuli commented Oct 5, 2022

Got it. Yup, I was using ImageNet. The thing is that the experimental dataloader already spins up threads in C++ to prefetch and decompress data in parallel with training on the GPU. The collation and transform are, however, done serially; perhaps that could be the difference in our results. Were you using some transform or collate? Is it the same script as the one you sent before?
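To make the serial part concrete, a hedged sketch of the kind of per-sample transform and collate function being asked about (generic PyTorch-style callables, not the actual benchmark code from this thread):

import torch

def transform(sample):
    # Applied per sample; runs serially in the experimental loader.
    sample['images'] = sample['images'].float() / 255.0
    return sample

def collate(samples):
    # Also serial: stacks per-sample tensors into a single batch dict.
    return {'images': torch.stack([s['images'] for s in samples])}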

@tatevikh
Contributor

Closing this for now as we didn't see a lot of performance changes.

@tatevikh tatevikh closed this Mar 31, 2023