Misleading benchmarks? #54
Variations in execution time based on batch size:
Measured manually with the iPhone Timer app, so results may deviate from the actual values by ~2 seconds. |
I guess the benchmarks aren't entirely wrong. The throughput for batched images is 16 seconds/image, likely lower than Apple's 18 seconds because I disabled the NSFW filtering model. However, Apple should warn users about the ~20-second static overhead. This matters for people making one-off images, where the ~40-second feedback loop is their bottleneck, not absolute batched throughput. |
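For illustration, the effect of that fixed overhead on effective per-image latency can be sketched as follows (the ~20 s overhead and 16 s/image figures are the measurements above; the helper name is mine):

```python
def effective_latency(batch_size, overhead_s=20.0, per_image_s=16.0):
    """Average seconds per image when a fixed startup overhead is
    amortized over a batch (figures taken from this thread)."""
    return (overhead_s + per_image_s * batch_size) / batch_size

# A one-off image pays the full overhead...
print(effective_latency(1))   # 36.0 s/image
# ...while a larger batch amortizes it away.
print(effective_latency(8))   # 18.5 s/image
```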
Curious what your setting is for the compute units. Try setting it to |
It worked! I had compiled the attention implementation to be GPU-friendly. Latencies: 4 sec, 1 sec, 19 sec. I'll switch back to v1.5 and provide an updated table of latencies, along with performance when optimizing attention for the ANE. Meanwhile, here are the power consumption metrics during the sampling stage:
Sampling is too quick to prove with 100% certainty whether it's actually utilizing the ANE, or is just late to report that it started inferencing. And here are the metrics with the other compute-unit setting:
|
Note that if you try to re-run the command for generating a CoreML model, it will silently fail. You have to purge the
With attention set to
This seems to have marginally slower batched throughput (20 sec vs 16 sec), but about half the power consumption (15 W vs 36 W). Overall, it seems better than the previous setting.

With attention set to
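One way to put those two measurements on a single axis is energy per batched image (power × time per image); a quick sketch using only the figures quoted above:

```python
# Figures quoted above: 16 s/image at ~36 W vs. 20 s/image at ~15 W.
def joules_per_image(seconds_per_image, watts):
    """Energy cost of one batched image, in joules."""
    return seconds_per_image * watts

faster_config = joules_per_image(16, 36)   # 576 J/image
low_power     = joules_per_image(20, 15)   # 300 J/image
print(faster_config, low_power)
```

So the lower-power configuration costs ~25% more wall-clock time per image but roughly halves the energy per image.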
With attention set to
|
I've predicted the likely (actual) fastest implementation on each M1 model and adjusted the numbers to match CLI latencies.
Regarding battery life on the M1 Max, there's a tradeoff between latency and power efficiency. You may want to use the Neural Engine when on battery. I assumed 3 W during load and sample, except for 1.5 W (sampling,
This assumes a 100 watt-hour battery at 90% health, or 324,000 joules, drained from 90% to 10%: a typical real-world scenario. |
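The arithmetic behind that figure, plus the implied runtime at the assumed power draws, sketched in Python (variable names are mine; the wattages are the assumptions stated above):

```python
WH_TO_J = 3600
nominal_wh = 100          # battery nameplate capacity
health = 0.90             # 90% battery health
usable_fraction = 0.80    # drained from 90% down to 10%

capacity_j = nominal_wh * health * WH_TO_J   # 324,000 J, as stated above
usable_j = capacity_j * usable_fraction      # 259,200 J actually available

for watts in (3.0, 1.5):  # assumed draws during load/sample
    hours = usable_j / watts / 3600
    print(f"{watts} W -> {hours:.0f} h of continuous generation")
```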
The benchmark is still misleading. They said they could generate an image on the M1 Ultra's 48-core GPU within 13 seconds. And they didn't even use the Swift package or the Neural Engine!
|
How do you obtain these detailed mW readings of a running process? |
|
powermetrics
|
Wow! I recall trying to use something like that a couple of years ago, but it didn't seem to exist on Macs. Was this recently re-added?
> On Dec 20, 2022, at 8:33 AM, Philip Turner wrote: powermetrics
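`powermetrics` ships with macOS and requires root. A minimal sketch of invoking it from a script; the sampler names here are from memory, so verify against `man powermetrics` before relying on them:

```python
import shutil
import subprocess

# Sampler names (cpu_power, gpu_power) are an assumption; check
# `man powermetrics` on your macOS release for the exact list.
cmd = ["sudo", "powermetrics",
       "--samplers", "cpu_power,gpu_power",
       "-i", "1000",   # sample every 1000 ms
       "-n", "5"]      # take 5 samples, then exit

if shutil.which("powermetrics"):
    subprocess.run(cmd, check=False)
else:
    print("powermetrics not found (macOS only)")
```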
|
The benchmarks only include inference latency, but the actual latency is much larger. For example, they say it takes 18 seconds on the 32-core M1 Max, which I have validated. However, there's an additional ~22-second latency before that, where it says `Sampling...`. I pulled it up in Activity Monitor, and here's what's happening:
- `Loading resources and creating pipeline` - 2 seconds, because I've already run the model several times
- `Sampling...` - 99% CPU, ~0% GPU, which means one CPU core is utilized through this entire step (not multi-core), 22 seconds
- `Step 50 of 50 [mean: 0.99, median: 1.56, last 1.55] step/sec` - ~0% CPU, 88% GPU, which means the actual model is running, 18 seconds

Is anyone else getting these weird results? Is it the same, or much larger than 22 seconds? I don't know whether it's because I used the Swift CLI instead of the Python CLI. I cannot get the Python CLI to work: #43 (comment).
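The per-stage breakdown above was read off Activity Monitor; a simple way to capture the same split programmatically is to wall-clock each stage yourself instead of trusting the reported inference-only number. A minimal sketch (the stage names are the pipeline messages quoted above; the harness and the `time.sleep` stand-ins are mine):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def stage(name):
    """Record wall-clock seconds for one pipeline stage."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - t0

# Stand-ins for the real pipeline stages observed above:
with stage("Loading resources and creating pipeline"):
    time.sleep(0.01)
with stage("Sampling..."):
    time.sleep(0.02)
with stage("Step 50 of 50"):
    time.sleep(0.01)

total = sum(timings.values())
for name, secs in timings.items():
    print(f"{name}: {secs:.3f} s")
print(f"end-to-end: {total:.3f} s")
```

Wrapping the actual load, `Sampling...`, and denoising steps this way would show whether the 22-second gap reproduces on other machines.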