
Issues running AML on amperecomputingai/onnxruntime:1.8.0 image #207

Closed
binh-ampere opened this issue Oct 27, 2023 · 2 comments

@binh-ampere

Hi team, the following issues were observed while trying out AML on the onnxrt-aio:1.8.0 image. Please advise whether these can be ignored or worked around.

1. Error while executing git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git

...
Receiving objects: 100% (559/559), 1.30 MiB | 45.85 MiB/s, done.
Resolving deltas: 100% (337/337), done.
Cloning into '/home/azureuser/ampere_model_library/text_to_image/stable_diffusion/stablediffusion'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:AmpereComputingAI/stablediffusion.git' into submodule path '/home/azureuser/ampere_model_library/text_to_image/stable_diffusion/stablediffusion' failed
Failed to clone 'text_to_image/stable_diffusion/stablediffusion' a second time, aborting
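
The submodule failure looks like an SSH-key problem: text_to_image/stable_diffusion/stablediffusion is registered with an SSH URL (git@github.com:...), which fails when no key is available in the container. A possible workaround, assuming HTTPS access to GitHub works from inside the image, is to rewrite SSH URLs to HTTPS before cloning:

# git config --global url."https://github.com/".insteadOf "git@github.com:"
# git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git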

2. Error while executing bash setup_deb.sh

...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
test-tube 0.7.5 requires torch>=1.1.0, which is not installed.
streamlit 1.28.0 requires altair<6,>=4.0, which is not installed.
streamlit 1.28.0 requires blinker<2,>=1.0.0, which is not installed.
streamlit 1.28.0 requires gitpython!=3.1.19,<4,>=3.0.7, which is not installed.
streamlit 1.28.0 requires pydeck<1,>=0.8.0b4, which is not installed.
streamlit 1.28.0 requires tenacity<9,>=8.1.0, which is not installed.
streamlit 1.28.0 requires toml<2,>=0.10.1, which is not installed.
streamlit 1.28.0 requires tzlocal<6,>=1.1, which is not installed.
streamlit 1.28.0 requires validators<1,>=0.2, which is not installed.
streamlit 1.28.0 requires watchdog>=2.1.5; platform_system != "Darwin", which is not installed.
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.75 requires seaborn>=0.11.0, which is not installed.
ultralytics 8.0.75 requires sentry-sdk, which is not installed.
ultralytics 8.0.75 requires torchvision>=0.8.1, which is not installed.
open-clip-torch 2.7.0 requires torchvision, which is not installed.
nnunet 1.7.0 requires dicom2nifti, which is not installed.
nnunet 1.7.0 requires sklearn, which is not installed.
batchgenerators 0.21 requires unittest2, which is not installed.
pytorch-lightning 1.9.1 requires torchmetrics>=0.7.0, but you have torchmetrics 0.6.0 which is incompatible.
nnunet 1.7.0 requires batchgenerators>=0.23, but you have batchgenerators 0.21 which is incompatible.
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.75 requires seaborn>=0.11.0, which is not installed.
ultralytics 8.0.75 requires sentry-sdk, which is not installed.
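
These resolver warnings may be benign if the missing packages are not needed by the model being benchmarked. After setup, the remaining conflicts can be listed with pip's own checker, and missing packages installed selectively (a sketch; the package names below are examples only):

# pip check
# pip install seaborn torchvision   # only if the target model actually imports them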

3. Attempting to get the default ONNX fp32 result:

# AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort

FAIL: this model seems to be unsupported in a specified precision: fp32

It looks like the ort backend supports only fp16 in AML? See https://github.com/AmpereComputingAI/ampere_model_library/blob/main/computer_vision/classification/resnet_50_v15/run.py#L187
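
A quick way to list the framework/precision combinations the script accepts (the vars(args) call in the tracebacks below suggests a standard argparse interface):

# python3 run.py -h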

4. Attempting to get the default ONNX fp16 result:

# AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp16 -f ort

Intraop parallelism set to 16 threads


Running with ONNX Runtime

  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 127, in run_ort_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_runs, timeout)
  File "/aml/utils/benchmark.py", line 229, in run_model
    single_pass_func(runner, dataset)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 114, in run_single_pass
    output = ort_runner.run(batch_size)
  File "/aml/utils/ort.py", line 36, in run
    outputs = self.session.run(self._output_names, self._feed_dict)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float16)) , expected: (tensor(float))
  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]
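
The traceback shows fp16 tensors being fed to a graph whose input is declared as tensor(float), i.e. resnet50_v1.onnx itself is an fp32 model. The declared input types can be verified with a one-off check (diagnostic only, not part of AML):

# python3 -c "import onnxruntime as ort; s = ort.InferenceSession('resnet50_v1.onnx'); print([(i.name, i.type) for i in s.get_inputs()])"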

5. Attempting to get the ONNX-AIO fp16 result:

# AIO_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp16 -f ort


Intraop parallelism set to 16 threads


Running with ONNX Runtime

  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 127, in run_ort_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_runs, timeout)
  File "/aml/utils/benchmark.py", line 229, in run_model
    single_pass_func(runner, dataset)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 114, in run_single_pass
    output = ort_runner.run(batch_size)
  File "/aml/utils/ort.py", line 36, in run
    outputs = self.session.run(self._output_names, self._feed_dict)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float16)) , expected: (tensor(float))
  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]

The ONNX model is https://zenodo.org/record/2592612/files/resnet50_v1.onnx

6. Trying to download the fp16-precision ONNX Runtime model described in README.md (https://www.dropbox.com/s/r80ndhbht7tixn5/resnet_50_v1.5_fp16.onnx) and run it:

# python3 run.py -m resnet_50_v1.5_fp16.onnx -p fp16 -f ort

Intraop parallelism set to 16 threads

Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 125, in run_ort_fp
    runner = OrtRunner(model_path)
  File "/aml/utils/ort.py", line 26, in __init__
    self.session = ort.InferenceSession(model, session_options, providers=ort.get_available_providers())
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 397, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from resnet_50_v1.5_fp16.onnx failed:Protobuf parsing failed.
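
INVALID_PROTOBUF usually means the file on disk is not a valid ONNX model; with Dropbox share links, wget/curl without the dl=1 parameter tends to save the HTML preview page instead of the binary. Worth verifying before anything else (assuming the link is still live):

# file resnet_50_v1.5_fp16.onnx
# wget -O resnet_50_v1.5_fp16.onnx "https://www.dropbox.com/s/r80ndhbht7tixn5/resnet_50_v1.5_fp16.onnx?dl=1"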

@jan-grzybek-ampere (Member) commented Oct 30, 2023

Hi, fp16 with AIO should be used in implicit mode.
Please set:

AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*"

and run standard fp32 inference from the AML level.
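
Combined with the earlier invocations in this thread, that would look like (a sketch, not verified here):

# AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort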

@binh-ampere (Author)

Thanks Jan, I still got the same error after setting AIO_IMPLICIT_FP16_TRANSFORM_FILTER.
