Benchmark: CNN proposal #25
These are my results on a MacBook Air 2020 M1 8GB.
Key results are:
On my Mac mini 2020 M1 16GB:
Ran this on my MacBook Pro (16-inch, 2019), 2.3 GHz 8-Core Intel Core i9, AMD Radeon Pro 5500M 8GB:
My results with a MacBook Pro M1, 16 GB of RAM:
One thing to note is that there must be a bottleneck somewhere. I was monitoring the GPU usage in Activity Monitor and it never went above 60%.
@Willian-Zhang Thank you for providing a reproducible test case. We will take a look.
MacBook Pro (13-inch, 2017), i5, 8GB, Intel Iris 640, Apple-compiled TensorFlow
pip version (tf 2.3.1)
tf compiled with FMA, AVX, AVX2, SSE4.1, SSE4.2 flags
It's interesting to see that Apple's optimized version of TensorFlow is slower than the pip version. Looking at the warning, I think the performance loss comes either from Intel's oneAPI or from the x86 instruction sets that are (or aren't) supported. I tried binaries compiled with FMA, AVX, AVX2, SSE4.1, and SSE4.2 support to see whether instruction-set support was the cause, but that run throws a warning (due to exhausted data; why only this run? batch size -> 118?). Anyhow, it would be nice if Apple provided more documentation about their own version of TF. Please let me know if I am the only one who found tensorflow-macos slower than pip TensorFlow (-> request for documentation / request for instruction-set feature support?).
Running Apple's Mac-optimized build on a 2019 16-inch MacBook Pro with AMD Radeon Pro 5500M:
And here is the GPU performance after the first epochs have started. I suspect the slack in the GPU is due to the comparatively low batch size relative to the GPU memory capacity. When I increase the batch size, I see the following GPU usage. Note that each epoch now takes 27s, less than half the previous epoch time.
To echo @dkgaraujo, I can run this at around 24s per epoch on a MacBook Pro 16" 2019 with Radeon Pro 5300M if I increase the batch size. With low batch sizes, epochs are considerably slower.
@anhornsby On a MacBook Air 2020 M1 8GB, I get:
On a Mac mini 2020 M1 16GB:
Results on my Mac mini 2020 M1 16GB: GPU = 22s per epoch, CPU = 17s per epoch, Any = 28s per epoch (weird!). The best results came from commenting out both the code that disables eager execution and the code that selects the GPU; just don't set these and I get the best results.
Highlights are: 15s/epoch
Some more results, running the same code as before:
Commenting out the line that disables eager execution seems helpful: 20s per epoch.
Interestingly, when I removed the line that disables eager execution, my system just ended up hanging. Did you change anything else other than commenting that out, @anhornsby?
@danielmbradley Nope, same code as above, using the recommended virtualenv.
MacBook Pro M1, 16 GB of RAM, batch size 128: 23s/epoch, 45ms/step, 98.98% final accuracy, GPU usage ~55%
@anhornsby Interesting. There must be some difference in how eager execution is implemented between Intel Macs and M1 Macs; mine completely falls over when that line is missing. I did find that increasing the batch size significantly increased processing speed, though (oddly, the time printed in the terminal was wrong once it hit 22 seconds).
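From the discussion, the lines being commented out appear to be the script's eager-execution and ML Compute device settings. Here is a sketch of both with a fallback, so it also runs on stock TensorFlow; note that the `mlcompute` module exists only in Apple's fork, and `configure` is a hypothetical helper name, not part of the benchmark script:

```python
def configure(device_name="gpu", disable_eager=True):
    """Apply the benchmark's device/eager settings; return a list of what stuck."""
    applied = []
    try:
        import tensorflow as tf
        if disable_eager:
            # The line some commenters remove for better (or worse!) results:
            tf.compat.v1.disable_eager_execution()
            applied.append("eager_disabled")
        # ML Compute device pinning exists only in the tensorflow_macos fork:
        from tensorflow.python.compiler.mlcompute import mlcompute
        mlcompute.set_mlc_device(device_name=device_name)  # 'cpu', 'gpu', or 'any'
        applied.append("mlc:" + device_name)
    except ImportError:
        pass  # stock TensorFlow (or none at all): default device placement
    return applied

print(configure("gpu"))
```

On stock TensorFlow the `mlcompute` import fails and placement falls through to the default, which matches the reports above that removing these lines can still run (and sometimes run faster).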
Just for fun I wanted to try running this on a Windows 10 laptop with a mobile GTX 1060 (6GB), i7-7700HQ, and 16GB RAM:
469/469 [==============================] - 5s 11ms/step - batch: 234.0000 - size: 1.0000 - loss: 0.1642 - accuracy: 0.9517 - val_loss: 0.0566 - val_accuracy: 0.9817 Highlights are:
48/48 [==============================] - 4s 78ms/step - batch: 23.5000 - size: 1.0000 - loss: 0.5046 - accuracy: 0.8650 - val_loss: 0.1678 - val_accuracy: 0.9517 Highlights are:
Highlights are:
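The step counts in logs like `469/469` above are just the number of batches per epoch, ceil(training_samples / batch_size), assuming MNIST's 60,000 training images. A quick sanity check:

```python
import math

def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    # Keras runs ceil(n_samples / batch_size) batches per epoch
    return math.ceil(n_samples / batch_size)

# MNIST has 60,000 training images; batch size 128 gives the
# 469 steps per epoch seen in the logs above.
print(steps_per_epoch(60_000, 128))  # -> 469
```

This also explains why larger batch sizes shorten the progress bar so dramatically: the per-step overhead is amortized over fewer, bigger steps.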
MacBook Air 2020 M1 with 16 GB - Same as others with an M1 MacBook
Windows, GeForce GTX 1080 Ti, Intel 5820K, using tensorflow-gpu 2.3.1. I had to comment out these lines:
Results: Log:
I ran again with batch size = 512 since I have a lot of memory on this GPU. Results: Log:
Question please: is there a way to get this script to use the "Neural Engine", i.e. the dedicated ML hardware, instead? That is far more interesting to benchmark.
Tested on Ubuntu 20.04.1, RTX 3070, TensorFlow container.
With batch size = 4096, the first three epochs took longer: 9, 5, and 2 seconds respectively (620, 363, and 90 ms per step).
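Warm-up effects like the slower first epochs reported above are easy to quantify with a per-epoch timer. A dependency-free sketch; in a real Keras script this class would subclass `tf.keras.callbacks.Callback` and be passed via `model.fit(callbacks=[...])`:

```python
import time

class EpochTimer:
    """Records wall-clock time per epoch (Callback-style method names)."""

    def __init__(self):
        self.times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._t0 = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.times.append(time.perf_counter() - self._t0)

# Simulated use: time two fake "epochs".
timer = EpochTimer()
for epoch in range(2):
    timer.on_epoch_begin(epoch)
    time.sleep(0.01)  # stand-in for actual training work
    timer.on_epoch_end(epoch)
print([f"{t:.3f}s" for t in timer.times])
```

Comparing the first epoch's time against the steady-state epochs separates one-time graph/kernel compilation cost from per-epoch throughput.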
Tested on a MacBook Pro (13-inch, M1, 2020) with 8 GB RAM
No changes to the script. Not sure why my numbers aren't comparable to the other M1 numbers.
A 3070 running TensorFlow: how did you do it? I thought you needed CUDA 11 on a 3070 and that there were problems with CUDA 11 and the nightly. I guess the difference is Windows vs Ubuntu. One thing I hope is that, with support for Apple's ML Compute, this fork "just works" with faster/better Apple Silicon as the M line of Apple chips evolves, rather than needing an endless series of patches. The CUDA/cuDNN install dance on Windows never fails to thwart me.
I believe the Neural Engine is designed to accelerate inference/prediction for trained Core ML models; as far as I can tell, it's not used in training. There doesn't seem to be any API to use it other than Core ML.
Oh, I didn't think of that. Do you have any source on this?
I'm not sure how true that is. I've never had any issue with speed when making predictions on non-ML-specific hardware; it's always been the training that's slow.
Information on the Neural Engine isn't great. Core ML is definitely a way to run trained models on device, and this repo talks about what we know about the Neural Engine. The impressive speedup of the super-resolution scaling in Pixelmator 2 cites the Neural Engine as helping on M1 Macs. It's notable that the write-up on this branch of TensorFlow talks about using ML Compute to speed up training on the CPU and GPU, but doesn't mention the Neural Engine itself. It would be great if we could use it to train! Perhaps that's coming some day?
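As the comments above suggest, the only public route to the Neural Engine today is to export a trained model to Core ML for inference. A hedged sketch using `coremltools`, a separate Apple package that is not part of this fork; the unified `ct.convert` call is from coremltools 4 and the helper name `to_coreml` is invented for illustration:

```python
def to_coreml(keras_model, out_path="model.mlmodel"):
    """Convert a trained Keras model to Core ML so inference can be
    scheduled on the Neural Engine. Returns the saved path, or None
    if coremltools is not installed."""
    try:
        import coremltools as ct  # assumption: coremltools >= 4
    except ImportError:
        return None
    mlmodel = ct.convert(keras_model)  # unified converter accepts Keras models
    mlmodel.save(out_path)
    return out_path
```

Training-time use of the Neural Engine has no public API as of this thread, so this only moves the prediction side onto the dedicated hardware.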
MacBook Pro, 16 GB RAM, 500 GB HD, same script but without disabling eager execution. Epoch 1/12
Just install a CUDA 11.1-compatible driver (455 for now) and use the aforementioned container.
Tested on MacBook Air (13-inch, Early 2015, 1.6GHz Intel Core i5, Intel HD Graphics 6000) with 8GB RAM
No changes to the script: |
MacBook Air 2020 M1 with 8 GB, connected to power. No real difference from the others with M1.
No changes to the script: |
Device: MacBook Pro (13-inch, 2019), 2.4 GHz Quad-Core Intel Core i5, 8GB RAM, Radeon RX 5700 XT 8 GB
Summary:
Desktop: Ryzen 2400G, 16GB, Windows (Conda)
Device: Mac Pro Late 2013 (3.7 GHz Quad-Core Intel Xeon E5, 2x AMD FirePro D300 2 GB, 64GB).
In the screenshot below, GPU slot 2 is connected to the display and slot 1 is the spare. Surprisingly, when I ran the code from issue #39 it switched to using the idle GPU with ~80% utilization.
Mac Pro Late 2013 (3.5 GHz 6-Core Intel Xeon E5, 2x AMD FirePro D500 3 GB, 32GB).
Can anyone explain how to install TensorFlow on a MacBook M1 2020? I am getting the error `zsh: illegal hardware instruction python` under the virtual environment (tensorflow_macos_venv) when I try to import TensorFlow.
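`illegal hardware instruction` usually means a binary built for one architecture is running on (or under) the other. A quick diagnostic sketch, not the fix itself, to see what your Python actually is:

```python
import platform
import sys

# A native Apple Silicon Python reports 'arm64'; a Python running under
# Rosetta 2 reports 'x86_64'. The tensorflow_macos arm64 wheel can crash
# with "illegal hardware instruction" if imported by a mismatched Python.
print("machine:", platform.machine())
print("python :", sys.version)
```

If the machine string doesn't match the wheel you installed, recreating the virtualenv with a matching Python is the usual first step.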
Thank you @Willian-Zhang for creating this! I used it (code unchanged from above) to benchmark a few of my Macs + a GPU-powered Google Colab instance:
Specs:
Very interesting to see the M1 MacBook Air performing on par with, or better than, the M1 MacBook Pro. The 16-inch I used is almost top-spec too (barely a year old)... incredible how performant Apple's new M1 chip is. I also did a few more tests on each machine, namely:
See the results from the above on my blog. I also made a video running through each of them on YouTube.
i5-8400T. I just disabled these two lines since I don't have a GPU:
Result:
MacBook Pro (16-inch, 2019)
Results:
Tested on a 2014 15-inch MacBook Pro, 2.2 GHz Quad-Core Intel Core i7, Intel Iris Pro Graphics. I also observed that the Mac-optimized version seems slower than the non-optimized version (similar to the results of @rnogy).
Non-macOS-optimized TensorFlow (
2020-12-30 14:50:04.896932: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
Had quite some fan noise.
GPU name: Tesla T4, 16GB VRAM. Epoch 12/12
Seeing all these amazing results, you might not want to bother with this machine from 2016. 😅 Anyway, I got the following. System: MacBook Pro (13-inch, 2016, Four Thunderbolt 3 Ports)
Key results:
iMac Pro 2017, 3 GHz 10-Core Intel Xeon W, 32 GB 2666 MHz DDR4, Radeon Pro Vega 64 16 GB On GPU:
On CPU:
On pip-provided TensorFlow 2.4 (after removing the two mlcompute lines from the script), it is twice as fast:
This code is too hot! I think I just toasted the GPU on my 16'' MBP by running this benchmark. Make sure your warranty hasn't expired before experimenting.
MacBook Air (M1, 2020) 7 Core GPU Process finished with exit code 0 |
This was one of the factors that helped me choose between two laptops priced the same. I ran the benchmark on both devices at the store and was surprised how capable the Apple M1 is; even though it couldn't beat the MSI, it gave a more respectable result than the similarly priced HP. In the end I bought the MSI as it gave me more options. Here are my results: Epoch 1/12, each epoch 4s.
Tested on a MacBook Pro (13-inch, M1, 2020) with 8 GB RAM
The following code implements @ylecun LeCun's original CNN architecture, with `Dropout` commented out due to an issue. Packages required to run: