EfficientNet extractor slow for large number of points per image. #31
Comments
@qiminchen : I found some surprising runtime results for the EfficientNet extractor. Do you mind taking a look? |
Hmm, this is interesting. No problem, I will take a look at this. |
Thanks. LMK if you are able to reproduce the problem. |
I pulled the
Seems to work fine? This is really confusing me.. |
Hmm. Yeah, those results look great. Indeed, they look amazing. 10 points in 0.47 seconds -- are you sure you are not using the GPU? How much RAM have you allocated to your Docker Engine? And how many CPU cores? I have mine capped at 4GB and 1 core. I tried upping it to 8GB and 2 CPU cores just now, but I'm not seeing much of a difference. |
Here is a detailed log and summary using 2 CPU cores and 8GB RAM. A bit better, but nowhere near what you got. What host system are you using?
|
No, I am not using the GPU. Actually, our code doesn't support the GPU so far, since we haven't moved any models or data to the GPU. How do you specify the resources, like RAM and the number of cores, when running Docker? I am not very familiar with Docker. |
Ubuntu 18.04.5 with AMD® Ryzen 7 3700x 8-core processor × 16 |
Hey @qiminchen . Re. GPU: got it -- that is what I thought. Re. setting resources: I haven't done it on Ubuntu, but there are plenty of resources online, like this one: https://hostadvice.com/how-to/how-to-limit-a-docker-containers-resources-on-ubuntu-18-04/. Just as a sanity check, can you try running again using the exact same image that I used? You can pull it down using |
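For reference, limiting a container's CPU and RAM is typically done with the --cpus and --memory flags on docker run. A minimal sketch, shelling out from Python; the image name below is a placeholder, not the exact image referenced in this thread:

import subprocess

subprocess.run(
    [
        "docker", "run", "--rm",
        "--cpus", "1",       # cap the container at one CPU core
        "--memory", "4g",    # cap the container at 4 GB of RAM
        "some-org/pyspacer:latest",            # placeholder image name
        "python", "scripts/docker/runtimes.py",
    ],
    check=True,
)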
@StephenChan : would you mind running this also so we get a third data point? |
I ran again with 4 cores and 16GB RAM -- basically my whole laptop. Slightly better yet again:
|
I pulled the exact same image and ran a couple of tests, and here is the full comparison. I'm pretty sure I have the RAM limit correct, but I'm not sure about the CPU cores, as the only option I found in the official documentation that might be related to CPU cores is
It seems like VGG16 benefits much more from the increased RAM than EfficientNet-b0 does. I also think the performance might be heavily affected by the CPU type, as well as by the relative performance of PyTorch vs. Caffe. |
OK, so to summarize: the runs below all use 1 CPU core and 4GB RAM. @qiminchen 's results using an AMD® Ryzen 7 3700X 8-core processor × 16 on Ubuntu:
My results using 2.4 GHz Dual-Core Intel Core i7 on OSX:
My results using m4.large AWS instance with a 2.3 GHz Intel Xeon® E5-2686 v4 CPU:
A few notes.
So there is a clear outlier in @qiminchen 's torch runtimes. I did some googling but didn't find any evidence that AMD would be so much faster than Intel for PyTorch. It just doesn't make sense to me. Let's keep digging. |
@qiminchen : I think I found it. For some reason the timing measurements in your log don't add up. E.g.
I'll take a look and see what might have gone wrong there. |
Looks like this is the number from the actual extraction |
You are right. That first measurement is longer than |
This one from your log seems weird: the actual extraction took 19.459403 seconds, but
|
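One way to chase down that kind of mismatch is to time each stage explicitly and compare the sum against the wall-clock total. A minimal sketch with stand-in stages (time.sleep placeholders), not pyspacer's actual code:

import time

def timed(fn, *args, **kwargs):
    # Run fn and return (result, elapsed seconds).
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# Stand-ins for the real stages (cropping patches, running the network, ...).
stage_a = lambda: time.sleep(0.2)
stage_b = lambda: time.sleep(0.3)

t_start = time.perf_counter()
_, t_a = timed(stage_a)
_, t_b = timed(stage_b)
t_total = time.perf_counter() - t_start

# If t_a + t_b is far below t_total, the missing time is being spent outside
# the stages that are logged, which is the kind of gap discussed above.
print(f"stages: {t_a + t_b:.2f}s, total: {t_total:.2f}s")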
I didn't get around to testing this today. I can do it tomorrow if the problem hasn't been figured out by then. |
Again, I pulled the exact same image and boto3 branch and tested the runtime on my laptop: 2.3 GHz Quad-Core Intel Core i5 on MacOS. I was able to reproduce the issue on my laptop. Here are the results (image_size, nbr_points, extractor, runtime in seconds):
1. 1 core + 4GB RAM
1000, 10, efficientnet_b0_ver1: 8.29; 1000, 10, vgg16_coralnet_ver1: 17.51
1000, 50, efficientnet_b0_ver1: 33.34; 1000, 50, vgg16_coralnet_ver1: 27.32
1000, 100, efficientnet_b0_ver1: 64.95; 1000, 100, vgg16_coralnet_ver1: 49.93
1000, 200, efficientnet_b0_ver1: 135.88; 1000, 200, vgg16_coralnet_ver1: 101.42
2. 2 cores + 8GB RAM
1000, 10, efficientnet_b0_ver1: 4.54; 1000, 10, vgg16_coralnet_ver1: 13.04
1000, 50, efficientnet_b0_ver1: 17.50; 1000, 50, vgg16_coralnet_ver1: 17.93
1000, 100, efficientnet_b0_ver1: 33.82; 1000, 100, vgg16_coralnet_ver1: 33.23
1000, 200, efficientnet_b0_ver1: 67.42; 1000, 200, vgg16_coralnet_ver1: 62.74
3. 4 cores + 16GB RAM
1000, 10, efficientnet_b0_ver1: 2.93; 1000, 10, vgg16_coralnet_ver1: 11.38
1000, 50, efficientnet_b0_ver1: 10.55; 1000, 50, vgg16_coralnet_ver1: 14.87
1000, 100, efficientnet_b0_ver1: 19.42; 1000, 100, vgg16_coralnet_ver1: 27.51
1000, 200, efficientnet_b0_ver1: 37.40; 1000, 200, vgg16_coralnet_ver1: 49.96
I guess this might be because of AMD vs. Intel. Is it possible for you to test the runtime on an AWS instance with an AMD CPU instead of an Intel one? |
Thanks. That’s a good data point. I should be able to set that up on AWS.
Let’s also try to google it. The only thing I’m finding is the opposite, e.g.
pytorch/pytorch#32008
|
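For what it's worth, one way to see which CPU backend and how many threads PyTorch is using inside the container is to print its build configuration. A quick diagnostic sketch, not part of pyspacer:

import torch

print(torch.__version__)
print("threads:", torch.get_num_threads())              # intra-op threads PyTorch will use
print("MKL available:", torch.backends.mkl.is_available())
print(torch.__config__.show())                          # full build/backend configuration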
VirtualBox on my laptop with Intel CPU. The VM is Ubuntu Server 20.04, set to 1 core + 4 GB RAM:
|
@StephenChan @qiminchen . I set up a compute environment on Batch with only AMD instances. Lo and behold: it worked.
|
It is still unclear to me why this happens, and I haven't found any clues online. But I'll consider this problem solved for now. ;) |
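A rough sketch of how a Batch compute environment can be restricted to AMD (EPYC) instance families such as m5a/c5a using boto3; the environment name, subnets, security groups, and role ARNs below are placeholders, not the actual setup used here:

import boto3

batch = boto3.client("batch")

batch.create_compute_environment(
    computeEnvironmentName="amd-only-extraction",     # placeholder name
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 16,
        "instanceTypes": ["m5a.large", "c5a.large"],  # AMD EPYC families only
        "subnets": ["subnet-PLACEHOLDER"],
        "securityGroupIds": ["sg-PLACEHOLDER"],
        "instanceRole": "arn:aws:iam::123456789012:instance-profile/PLACEHOLDER",
    },
    serviceRole="arn:aws:iam::123456789012:role/PLACEHOLDER",
)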
Wow. That is a HUGE difference. It's 20.84 sec vs. 568 sec, a 25x difference.
It is really fortunate that Qimin happened to have an AMD machine and had produced the good timings initially; otherwise, we may have gone down a very different path. It would still be fruitful to understand why there's a difference so it doesn't bite us downstream. Of course, the solution of just using AMD instances works for now; let's just hope that it doesn't regress downstream and start running 25x slower. OK to debug then, but would we notice?
D
|
Yeah, I agree. If I get to it, I'll add a monitor to our admin tools so we can see the runtime per point for each extraction job.
|
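A minimal sketch of what such a runtime-per-point monitor could look like; the job records and alert threshold are made up for illustration, with the runtimes echoing the AMD vs. Intel figures above:

def runtime_per_point(runtime_sec, nbr_points):
    # Seconds of extraction time per point, guarding against empty jobs.
    return runtime_sec / max(nbr_points, 1)

# Hypothetical job records; nbr_points values are arbitrary.
jobs = [
    {"job_id": "amd-run", "runtime_sec": 20.84, "nbr_points": 200},
    {"job_id": "intel-run", "runtime_sec": 568.0, "nbr_points": 200},
]

ALERT_THRESHOLD = 0.5  # seconds per point; arbitrary level for illustration
for job in jobs:
    rpp = runtime_per_point(job["runtime_sec"], job["nbr_points"])
    flag = "  <-- unusually slow" if rpp > ALERT_THRESHOLD else ""
    print(f"{job['job_id']}: {rpp:.3f} s/point{flag}")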
I ran some more thorough timing tests and found, to my surprise, that while the EfficientNet extractor is (as expected) faster for a low number of points per image, it is slower for a large (>100) number of points.
This issue is reproducible both locally in Docker and in the cloud. To reproduce, run
scripts/docker/runtimes.py
inside the Docker container. Here is what I got in Docker:
legend: image_size, nbr_points, extractor, runtime (s)
Note that these measurements are quite noisy since they ran on my local system, but it is clear that VGG16 is faster for 200 points while EfficientNet is faster for fewer points. I'm also attaching some more robust stats from AWS.
In the first plot, note how the EfficientNet extractors are all slower for nbr_points > 400 or so.
In this plot we see that (a) the effect of image size is minimal, (b) runtime is linear in the number of points, and (c) EfficientNet is actually slower on average (across all runs) since it is so slow for large numbers of points.
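A minimal sketch of the kind of timing sweep behind these numbers and plots; the extractor callables below are sleep-based stand-ins, and the extract(image, points) interface is assumed for illustration rather than being pyspacer's actual API:

import random
import time

def time_extractor(extract, image, points):
    # Wall-clock time for a single extraction call.
    t0 = time.perf_counter()
    extract(image, points)
    return time.perf_counter() - t0

# Stand-ins; a real run would load the actual extractors and a test image.
extractors = {
    "efficientnet_b0_ver1": lambda img, pts: time.sleep(0.001 * len(pts)),
    "vgg16_coralnet_ver1": lambda img, pts: time.sleep(0.002 * len(pts)),
}
image = None          # placeholder for a loaded test image
image_size = 1000

for nbr_points in (10, 50, 100, 200, 400, 800):
    points = [(random.randrange(image_size), random.randrange(image_size))
              for _ in range(nbr_points)]
    for name, extract in extractors.items():
        runtime = time_extractor(extract, image, points)
        # Same format as the legend above: image_size, nbr_points, extractor, runtime (s)
        print(f"{image_size}, {nbr_points}, {name}: {runtime:.2f}")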