
Conversation

@hakanardo
Contributor

This can more than double the performance of a simple
PyTorch micro-benchmark.

🚀 🚀 Pull Request

Checklist:

  • My code follows the style guidelines of this project and the Contributing document
  • I have commented my code, particularly in hard-to-understand areas
  • I have kept the coverage-rate up
  • I have performed a self-review of my own code and resolved any problems
  • I have checked to ensure there aren't any other open Pull Requests for the same change
  • I have described and made corresponding changes to the relevant documentation
  • New and existing unit tests pass locally with my changes

Changes

@CLAassistant

CLAassistant commented Sep 6, 2022

CLA assistant check
All committers have signed the CLA.

This can more than double the performance of a simple
PyTorch micro-benchmark.
@tatevikh
Contributor

tatevikh commented Sep 6, 2022

Hi @hakanardo! Thanks a lot for your contribution. Do you mind signing the CLA while we review the PR? Also, if you haven't already, please consider joining our community on Slack.

@farizrahman4u
Contributor

Hi @hakanardo, thanks for the contribution. Can you provide the code and results for the benchmarks mentioned in the PR description?

@hakanardo
Contributor Author

hakanardo commented Sep 8, 2022 via email

@hakanardo
Contributor Author

hakanardo commented Sep 12, 2022 via email

@hakanardo
Contributor Author

hub_bench.zip

@mikayelh
Contributor

@hakanardo thank you very much for these; they will be extremely helpful for us internally in improving Hub. I'll let @farizrahman4u and @levongh provide any additional updates once they have them. Have a nice start to the week in the meantime!

@AbhinavTuli
Contributor

Hey @hakanardo! Thanks for your PR. I was looking into the benchmark file you sent but was unable to reproduce your results; the speed seems to be almost the same on main as on your branch. Any specifics you can share about your machine that might help us reproduce this would be appreciated.

On another note, we have recently introduced a new experimental dataloader, implemented in C++, to Hub (it only works on Linux for now). In the near future we'll be deprecating the existing dataloader. If you're interested, you can take a look at this Getting started with Deep Lake notebook and the associated docs.

@hakanardo
Contributor Author

hakanardo commented Sep 15, 2022 via email

@hakanardo
Contributor Author

I've benchmarked the deeplake loader using the attached benchmark. It performs slightly better than my branch with shuffle=True, but significantly worse than the Python loader with shuffle=False. Am I doing something wrong? The exact numbers are at the bottom of the script.

Also, the distribution of the indexes in the produced batches looks a lot more uniform. That is very nice!

deeplake_bench.zip

@levongh
Contributor

levongh commented Sep 15, 2022

Hey @hakanardo, can you please share the hub and deeplake versions you're using?

@AbhinavTuli
Contributor

AbhinavTuli commented Sep 15, 2022

Thanks @hakanardo! You might want to use num_workers=0 for deeplake (we spin up threads for fetching and decompression; num_workers spins up separate processes for transformation and collation), as the current implementation has an interprocess communication overhead that we'll be fixing shortly.
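For illustration only, a minimal sketch of the suggested usage; build_deeplake_loader below is a hypothetical placeholder, since the exact constructor of the experimental deeplake dataloader isn't shown in this thread:

# Hypothetical sketch: build_deeplake_loader stands in for however the
# experimental C++-backed loader is actually constructed. The point is to keep
# num_workers=0 so no extra Python processes are spawned for transform/collate,
# avoiding the interprocess communication overhead mentioned above.
loader = build_deeplake_loader(ds, batch_size=64, shuffle=True, num_workers=0)

for batch in loader:
    images = batch['images']  # fetching/decompression already run in C++ threads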

@hakanardo
Contributor Author

Thanks for the tip, num_workers=0 improves things, but I still only get around 80% GPU utilization during training, as opposed to about 98% with the shuffle_thread branch. I also tried moving the entire dataloading to a separate process by wrapping it in another dataloader, similar to how the old loader works:

from torch.utils.data import DataLoader, IterableDataset


class IterableDatasetWrapper(IterableDataset):
    """Wraps an existing dataloader so it can be driven from a worker process."""

    def __init__(self, dl):
        self.dl = dl

    def __iter__(self):
        for b in self.dl:
            yield b


# Run the wrapped loader in a single worker process; the lambda unwraps the
# batch-of-one produced by the outer DataLoader.
dataloader = DataLoader(IterableDatasetWrapper(dataloader), num_workers=1, collate_fn=lambda a: a[0])

But that seems to deadlock on the api.dataset call on line 60 of convert_to_hub3.py.

@hakanardo
Contributor Author

See also #1888

@AbhinavTuli
Contributor

I think using multiprocessing.set_start_method("spawn", force=True) at the top should fix the issue of it getting stuck.
There's an issue with forking the Hub3Dataset object that we're aware of.
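A minimal sketch of that suggestion (the surrounding script structure is assumed, not taken from this thread):

import multiprocessing

# Force the "spawn" start method before any DataLoader workers are created, so
# worker processes don't fork (and inherit) the already-open dataset state.
multiprocessing.set_start_method("spawn", force=True)

# ...build the dataset and the wrapped DataLoader (num_workers=1) after this point...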

@hakanardo
Contributor Author

That worked, but it requires the dataloader to be picklable, which is a bit annoying. It seems to be enough, though, to wrap it in a separate thread:

from queue import Queue
from threading import Thread


class DONE:
    """Sentinel marking the end of one pass over the wrapped iterable."""


class ThreadedIterator(Thread):
    def __init__(self, iterable):
        super().__init__(daemon=True)
        self.iterable = iterable
        self.queue = Queue(2)  # small buffer: prefetch at most two batches ahead
        self.start()

    def run(self) -> None:
        # Keep iterating the wrapped loader in the background, one epoch after
        # another, pinning the image tensors so host-to-device copies can overlap.
        while True:
            for val in self.iterable:
                val['images'] = val['images'].pin_memory()
                self.queue.put(val)
            self.queue.put(DONE)

    def __iter__(self):
        # Yield batches until the sentinel for the current epoch arrives.
        while True:
            val = self.queue.get()
            if val is DONE:
                break
            yield val
This allows me to train at full GPU utilization again. Would you consider including something like this as an option in the new deeplake dataloader?
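For context, a minimal sketch of how the wrapper above could be used in a training loop; the loader construction and the model step are assumed, not taken from this thread:

wrapped = ThreadedIterator(dataloader)  # dataloader: the deeplake/hub loader built earlier

for batch in wrapped:
    # Batches arrive with 'images' already pinned, so the copy to GPU can be non-blocking.
    images = batch['images'].cuda(non_blocking=True)
    # ...forward/backward pass on `images` goes here...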

@AbhinavTuli
Contributor

Hey @hakanardo! Thanks for the suggestion. I tried it but couldn't really get a performance boost while iterating. Do you think the better performance here is due to pinning memory in the separate thread that you're running?
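For reference, a minimal PyTorch sketch of the pinning idea being asked about; nothing here comes from hub or deeplake, it only shows why pinned host memory plus a non-blocking copy lets the transfer overlap with GPU work:

import torch

batch = torch.randn(64, 3, 224, 224)        # stand-in for a decoded image batch
pinned = batch.pin_memory()                 # move into page-locked host memory
gpu_batch = pinned.cuda(non_blocking=True)  # asynchronous host-to-device copy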

@hakanardo
Contributor Author

hakanardo commented Oct 4, 2022 via email

@AbhinavTuli
Contributor

AbhinavTuli commented Oct 5, 2022

Got it. Yup, I was using ImageNet. The thing is that the experimental dataloader already spins up threads in C++ to prefetch and decompress data in parallel with training on the GPU. The collation and transform are, however, done serially; perhaps that could be the difference in our results. Were you using some transform or collate? Is it the same script as the one you sent before?
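To make the serial part concrete, a hedged sketch of the kind of per-sample transform and collate function being asked about (generic PyTorch-style callables, not the actual benchmark code from this thread):

import torch

def transform(sample):
    # Applied per sample; runs serially in the experimental loader.
    sample['images'] = sample['images'].float() / 255.0
    return sample

def collate(samples):
    # Also serial: stacks per-sample tensors into a single batch dict.
    return {'images': torch.stack([s['images'] for s in samples])}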

@tatevikh
Contributor

Closing this for now as we didn't see a lot of performance changes.

@tatevikh tatevikh closed this Mar 31, 2023