
NEURON_PROFILE does not appear to work when running Reinvent lab3 #57

Closed
cprice404-aws opened this issue Jan 2, 2020 · 4 comments

cprice404-aws (Contributor) commented Jan 2, 2020

I've been using the information in this doc to enable profiling and view profiling data via TensorBoard:

https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-tools/getting-started-tensorboard-neuron.md

This has been working well for me so far when running my own inference code.

However, when I set the NEURON_PROFILE environment variable and then run the inference load test from re:Invent lab 3 ( https://github.com/awshlabs/reinvent19Inf1Lab/blob/master/3.%20benchmark%20run.md ), no profile data is generated.

I do see a .pb file and a .neff file in the directory I specified via NEURON_PROFILE, but there is no trace data. When I try to start tensorboard_neuron, it says:

WARNING: no profile data found in ./neuron_profile

It would be really useful to be able to look at the trace in TensorBoard to understand how the load test is utilizing all of the NeuronCores.
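
For reference, this is roughly how I am enabling profiling in my own inference code, following the guide above (the paths and the surrounding model-loading code are just placeholders for my actual script):

    import os

    # Point NEURON_PROFILE at a writable directory before loading the model
    # and running inference; the profiler writes its output (.pb/.neff and
    # trace data) into this directory.
    os.environ["NEURON_PROFILE"] = "./neuron_profile"

    # ... load the compiled Neuron saved model and run inference as usual ...

    # Afterwards, start tensorboard_neuron with its profile directory pointed
    # at ./neuron_profile, as described in the getting-started guide linked
    # above, and it picks up the trace data.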

mrnikwaws (Contributor) commented:

Thanks for your question. You are correct that profiling does not work on this specific tutorial example. The example uses a feature that varies batch sizes dynamically, which is not compatible with profiling at the moment.

mrnikwaws (Contributor) commented Jan 3, 2020

On further investigation: if you update the file infer_resnet50_keras_loadtest.py and change the line:

USER_BATCH_SIZE = 50

to

USER_BATCH_SIZE = 5

then profiling should work for you. Currently the compilation batch size and the inference batch size must match for profiling to run correctly.

There is a lot of data in the load test, and tensorboard_neuron will take a long time to process all of it on startup. I recommend using a much smaller data set, which will still give you useful profiling information.

Consider changing:

NUM_LOOPS_PER_THREAD = 100

to

NUM_LOOPS_PER_THREAD = 5
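
Putting both edits together, the relevant constants in infer_resnet50_keras_loadtest.py end up looking roughly like this (only these two lines change; the surrounding script is untouched and may differ slightly from this excerpt):

    # infer_resnet50_keras_loadtest.py (excerpt)

    # Must match the batch size the model was compiled with; otherwise the
    # profiler does not produce trace data.
    USER_BATCH_SIZE = 5

    # Keep the profiled run short so tensorboard_neuron can process the
    # resulting trace data in a reasonable time on startup.
    NUM_LOOPS_PER_THREAD = 5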

cprice404-aws (Contributor, Author) commented:

Excellent, thanks for the information!

Do you expect profiling to be available for dynamic batch sizes in the future?

mrnikwaws (Contributor) commented:

Thanks for the input. I have made the team aware of the request; we've documented it and will consider adding it in a future release. For now I am going to close this issue (since you can now run the profiler), but please reopen it if you feel that this specific feature is a priority to track in GitHub.

aws-mesharma pushed a commit that referenced this issue Sep 22, 2020
Release notes updated for July 16, 2020 release of Neuron SDK.