# Lab 3: Compile on C5 and launch a load test run on an Inf1 instance

Please complete Lab 2 and clean up by following Lab 2's last step. If you are using the DLAMI Conda environment, please update to the latest Neuron software for this lab.

This lab walks through an example load test using an FP16 model derived from the Keras ResNet50 model and compiled for Inferentia with experimental performance flags. For this lab, please use the C5 instance from Lab 1 and the inf1.2xlarge instance from Lab 2.

## Lab 3 Section 1: Compile on C5

3.1.1 Download and unpack the ResNet50 performance package on the C5 instance:

```bash
wget https://reinventinf1.s3.amazonaws.com/keras_fp16_benchmarking_db.tgz
tar -xzf keras_fp16_benchmarking_db.tgz
cd keras_fp16_benchmarking_db
```

3.1.2 Activate the virtual environment and install the Neuron Compiler if you have not already done so. Also install the pillow module for the test scripts:

```bash
source test_env_p36/bin/activate
pip install neuron-cc
pip install pillow
```

3.1.3 Extract Keras ResNet50 FP32, optimize for inference, and convert to FP16.

Extract Keras ResNet50 FP32 (resnet50_fp32_keras.pb will be generated):

```bash
python gen_resnet50_keras.py
```

Optimize the extracted Keras ResNet50 FP32 graph for inference before casting (resnet50_fp32_keras_opt.pb will be generated):

```bash
python optimize_for_inference.py --graph resnet50_fp32_keras.pb --out_graph resnet50_fp32_keras_opt.pb
```

Convert the full graph to FP16 (resnet50_fp16_keras_opt.pb will be generated):

```bash
python fp32tofp16.py --graph resnet50_fp32_keras_opt.pb --out_graph resnet50_fp16_keras_opt.pb
```
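The FP32-to-FP16 cast halves the storage of every weight tensor at the cost of reduced precision. The script above rewrites the frozen graph itself; the numeric effect can be illustrated with a small NumPy sketch (the array values here are illustrative and not part of the lab scripts):

```python
import numpy as np

# A toy "weight tensor" in FP32, standing in for values extracted
# from the frozen graph.
weights_fp32 = np.array([0.1234567, 1e-5, 1000.125], dtype=np.float32)

# Casting to FP16 halves the storage per element (4 bytes -> 2 bytes) ...
weights_fp16 = weights_fp32.astype(np.float16)
print(weights_fp32.nbytes, "->", weights_fp16.nbytes)  # 12 -> 6

# ... but keeps only about 3 decimal digits of precision.
print(weights_fp16)
```

For inference workloads such as this ResNet50 benchmark, the small precision loss typically has negligible effect on accuracy while roughly doubling effective memory bandwidth.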

3.1.4 Compile the ResNet50 frozen graph using the provided pb2sm_compile.py script on the C5 instance. NOTE: please ensure that the Neuron Compiler is up to date by following the setup steps in Lab 1 Section 1.

We optimized this model with a compile-time batch size of 5. To optimize throughput, use a runtime batch size that is a multiple of 5 (50 in this case). This step takes about 6 minutes.

```bash
time python pb2sm_compile.py
```
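Because the batch size is baked in at compile time, a runtime batch of 50 is served by slicing it into compiled-size batches of 5. A minimal sketch of that slicing (the variable names here are illustrative; the actual batching is handled inside the benchmark scripts):

```python
COMPILED_BATCH = 5   # batch size baked into the compiled model
RUNTIME_BATCH = 50   # user-facing batch size, must be a multiple of 5

assert RUNTIME_BATCH % COMPILED_BATCH == 0

# Split the runtime batch into compiled-size chunks.
images = list(range(RUNTIME_BATCH))          # stand-in for 50 images
chunks = [images[i:i + COMPILED_BATCH]
          for i in range(0, RUNTIME_BATCH, COMPILED_BATCH)]

print(len(chunks))     # 10 inference calls per runtime batch
print(len(chunks[0]))  # 5 images per call
```

Choosing a runtime batch that is an exact multiple of the compiled batch avoids padding partial batches, which would waste NeuronCore cycles.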

At the end of this step, you will see a zipped saved model, rn50_fp16_compiled_batch5.zip, which you will need to copy to your Inf1 instance (the PEM key was set up during Lab 2 Section 3):

```bash
scp -i ~/ee-default-keypair.pem ./rn50_fp16_compiled_batch5.zip ubuntu@<instance DNS>:~/     # Ubuntu image (default)
#scp -i ~/ee-default-keypair.pem ./rn50_fp16_compiled_batch5.zip ec2-user@<instance DNS>:~/  # if you are on Amazon Linux 2
```

## Lab 3 Section 2: Launch a load test run on Inf1

3.2.1 Download and unpack the ResNet50 performance package again, this time on the Inf1 instance:

```bash
wget https://reinventinf1.s3.amazonaws.com/keras_fp16_benchmarking_db.tgz
tar -xzf keras_fp16_benchmarking_db.tgz
cd keras_fp16_benchmarking_db
```

Unzip the saved model that was transferred from the C5 instance into the current directory:

```bash
unzip ~/rn50_fp16_compiled_batch5.zip
```

3.2.2 Run the load test using the provided infer_resnet50_keras_loadtest.py script on the Inf1 instance (please make sure this is inf1.2xlarge).

There are a total of 4 NeuronCores on inf1.2xlarge. The script runs 4 sessions of ResNet50, each session bound to one NeuronCore, with 4 threads per session:

```bash
time python infer_resnet50_keras_loadtest.py
```
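The 16-thread figure in the output below comes from 4 sessions × 4 threads per session. The request pattern can be sketched with a plain thread pool (the `infer` stub below stands in for a TensorFlow session call and is not part of the lab scripts):

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SESSIONS = 4            # one ResNet50 session per NeuronCore on inf1.2xlarge
THREADS_PER_SESSION = 4
NUM_LOOPS_PER_THREAD = 3    # the real script uses 100

def infer(session_id):
    # Stand-in for a session.run() loop feeding batches of images
    # to the NeuronCore bound to this session.
    return NUM_LOOPS_PER_THREAD

with ThreadPoolExecutor(max_workers=NUM_SESSIONS * THREADS_PER_SESSION) as pool:
    futures = [pool.submit(infer, s)
               for s in range(NUM_SESSIONS)
               for _ in range(THREADS_PER_SESSION)]
    total_calls = sum(f.result() for f in futures)

print(total_calls)  # 16 threads x 3 loops = 48 inference calls
```

Running several threads per session keeps each NeuronCore busy: while one thread waits on data transfer, another can have a batch in flight.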

Output:

```
NUM THREADS:  16
NUM_LOOPS_PER_THREAD:  100
USER_BATCH_SIZE:  50
current throughput: 0 images/sec
current throughput: 0 images/sec
current throughput: 700 images/sec
current throughput: 800 images/sec
current throughput: 1700 images/sec
current throughput: 1800 images/sec
current throughput: 1850 images/sec
current throughput: 1800 images/sec
current throughput: 1850 images/sec
current throughput: 1700 images/sec
current throughput: 1850 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1950 images/sec
current throughput: 1750 images/sec
current throughput: 1850 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1850 images/sec
current throughput: 1750 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1850 images/sec
current throughput: 1800 images/sec
current throughput: 1850 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1750 images/sec
current throughput: 1850 images/sec
current throughput: 1800 images/sec
current throughput: 1800 images/sec
current throughput: 1900 images/sec
current throughput: 1800 images/sec
current throughput: 850 images/sec
current throughput: 250 images/sec

real    0m54.746s
user    1m39.552s
sys     0m7.787s
```
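The steady-state figure of ~1800 images/sec can be sanity-checked against the run parameters printed at the top of the output. A quick back-of-the-envelope calculation (pure arithmetic, not part of the lab scripts):

```python
NUM_THREADS = 16
NUM_LOOPS_PER_THREAD = 100
USER_BATCH_SIZE = 50
WALL_TIME_S = 54.746        # "real" time reported by `time`

total_images = NUM_THREADS * NUM_LOOPS_PER_THREAD * USER_BATCH_SIZE
print(total_images)                       # 80000 images processed
print(round(total_images / WALL_TIME_S))  # 1461 images/sec average
```

The whole-run average (~1461 images/sec) sits below the ~1800 images/sec steady-state readings because the wall time also includes the ramp-up and ramp-down phases visible at the start and end of the output.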

NOTE: If you see lower throughput, please make sure that the Inf1 instance is inf1.2xlarge.

3.2.3 While the load test is running, you can watch utilization with the neuron-top tool in a separate terminal (it takes about a minute to load; also, running neuron-top will lower the throughput to around 1200 images/sec):

```bash
/opt/aws/neuron/bin/neuron-top
```

Note: Please go back to the home directory /home/ubuntu before continuing:

```bash
cd ~/
```

Go To Lab 4