
NEURON_PROFILE does not appear to work when running Reinvent lab3 #57

Closed
cprice404-aws opened this issue Jan 2, 2020 · 4 comments

cprice404-aws (Contributor) commented Jan 2, 2020

I've been using the information in this doc to enable profiling and view profiling data via TensorBoard:

https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-tools/getting-started-tensorboard-neuron.md

This has been working well for me so far when running my own inference code.

However, when I set the NEURON_PROFILE environment variable and then run the inference load test from re:Invent lab 3 ( https://github.com/awshlabs/reinvent19Inf1Lab/blob/master/3.%20benchmark%20run.md ), no profile data is generated.

I do see a .pb file and a .neff file in the directory I specified via NEURON_PROFILE, but there is no trace data. When I try to start tensorboard_neuron, it says:

WARNING: no profile data found in ./neuron_profile

It would be really useful to be able to look at the trace in TensorBoard to understand how the load test is utilizing all of the NeuronCores.
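
For reference, this is roughly how I am enabling profiling in my own inference code, following the guide above (the paths and the surrounding model-loading code are just placeholders for my actual script):

    import os

    # Point NEURON_PROFILE at a writable directory before loading the model
    # and running inference; the profiler writes its output (.pb/.neff and
    # trace data) into this directory.
    os.environ["NEURON_PROFILE"] = "./neuron_profile"

    # ... load the compiled Neuron saved model and run inference as usual ...

    # Afterwards, start tensorboard_neuron with its profile directory pointed
    # at ./neuron_profile, as described in the getting-started guide linked
    # above, and it picks up the trace data.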

mrnikwaws (Contributor) commented:

Thanks for your question. You are correct that profiling does not work on this specific tutorial example. The example uses a feature that varies batch sizes dynamically, which is not compatible with profiling at the moment.

mrnikwaws (Contributor) commented Jan 3, 2020

On further investigation: if you update the file infer_resnet50_keras_loadtest.py and change the line:

USER_BATCH_SIZE = 50

to

USER_BATCH_SIZE = 5

then profiling should work for you. Currently the compilation batch size and the inference batch size must match for profiling to run correctly.

There is a lot of data in the load test, and tensorboard_neuron will take a long time to process all of it on startup. I recommend using a much smaller data set, which will still give you useful profiling information.

Consider changing:

NUM_LOOPS_PER_THREAD = 100

to

NUM_LOOPS_PER_THREAD = 5
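
Putting both edits together, the relevant constants in infer_resnet50_keras_loadtest.py end up looking roughly like this (only these two lines change; the surrounding script is untouched and may differ slightly from this excerpt):

    # infer_resnet50_keras_loadtest.py (excerpt)

    # Must match the batch size the model was compiled with; otherwise the
    # profiler does not produce trace data.
    USER_BATCH_SIZE = 5

    # Keep the profiled run short so tensorboard_neuron can process the
    # resulting trace data in a reasonable time on startup.
    NUM_LOOPS_PER_THREAD = 5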

cprice404-aws (Contributor, Author) commented:

Excellent, thanks for the information!

Do you expect profiling to be available for dynamic batch sizes in the future?

mrnikwaws (Contributor) commented:

Thanks for the input. I have made the team aware of the request; we've documented it and will consider adding it in a future release. For now I am going to close this issue (since you can now run the profiler), but please reopen it if you feel that this specific feature is a priority to track in GitHub.

aws-mesharma pushed a commit that referenced this issue Sep 22, 2020
Release notes updated for July 16, 2020 release of Neuron SDK.