
Issues running AML on amperecomputingai/onnxruntime:1.8.0 image #207

Closed
binh-ampere opened this issue Oct 27, 2023 · 2 comments

@binh-ampere

Hi team, the following issues were observed while trying out AML on the onnxrt-aio:1.8.0 image. Please advise whether these can be ignored or worked around.

1. Error while executing git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git

...
Receiving objects: 100% (559/559), 1.30 MiB | 45.85 MiB/s, done.
Resolving deltas: 100% (337/337), done.
Cloning into '/home/azureuser/ampere_model_library/text_to_image/stable_diffusion/stablediffusion'...
git@github.com: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:AmpereComputingAI/stablediffusion.git' into submodule path '/home/azureuser/ampere_model_library/text_to_image/stable_diffusion/stablediffusion' failed
Failed to clone 'text_to_image/stable_diffusion/stablediffusion' a second time, aborting
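
The submodule failure looks like an SSH-key problem: text_to_image/stable_diffusion/stablediffusion is registered with an SSH URL (git@github.com:...), which fails when no key is available in the container. A possible workaround, assuming HTTPS access to GitHub works from inside the image, is to rewrite SSH URLs to HTTPS before cloning:

# git config --global url."https://github.com/".insteadOf "git@github.com:"
# git clone --recursive https://github.com/AmpereComputingAI/ampere_model_library.git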

2. Error while executing bash setup_deb.sh

...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
test-tube 0.7.5 requires torch>=1.1.0, which is not installed.
streamlit 1.28.0 requires altair<6,>=4.0, which is not installed.
streamlit 1.28.0 requires blinker<2,>=1.0.0, which is not installed.
streamlit 1.28.0 requires gitpython!=3.1.19,<4,>=3.0.7, which is not installed.
streamlit 1.28.0 requires pydeck<1,>=0.8.0b4, which is not installed.
streamlit 1.28.0 requires tenacity<9,>=8.1.0, which is not installed.
streamlit 1.28.0 requires toml<2,>=0.10.1, which is not installed.
streamlit 1.28.0 requires tzlocal<6,>=1.1, which is not installed.
streamlit 1.28.0 requires validators<1,>=0.2, which is not installed.
streamlit 1.28.0 requires watchdog>=2.1.5; platform_system != "Darwin", which is not installed.
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.75 requires seaborn>=0.11.0, which is not installed.
ultralytics 8.0.75 requires sentry-sdk, which is not installed.
ultralytics 8.0.75 requires torchvision>=0.8.1, which is not installed.
open-clip-torch 2.7.0 requires torchvision, which is not installed.
nnunet 1.7.0 requires dicom2nifti, which is not installed.
nnunet 1.7.0 requires sklearn, which is not installed.
batchgenerators 0.21 requires unittest2, which is not installed.
pytorch-lightning 1.9.1 requires torchmetrics>=0.7.0, but you have torchmetrics 0.6.0 which is incompatible.
nnunet 1.7.0 requires batchgenerators>=0.23, but you have batchgenerators 0.21 which is incompatible.
...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
ultralytics 8.0.75 requires seaborn>=0.11.0, which is not installed.
ultralytics 8.0.75 requires sentry-sdk, which is not installed.
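
These resolver warnings may be benign if the missing packages are not needed by the model being benchmarked. After setup, the remaining conflicts can be listed with pip's own checker, and missing packages installed selectively (a sketch; the package names below are examples only):

# pip check
# pip install seaborn torchvision   # only if the target model actually imports them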

3. Attempting to get the default ONNX fp32 result:

# AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort

FAIL: this model seems to be unsupported in a specified precision: fp32

It looks like the ort backend supports only fp16 in AML? See https://github.com/AmpereComputingAI/ampere_model_library/blob/main/computer_vision/classification/resnet_50_v15/run.py#L187
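
A quick way to list the framework/precision combinations the script accepts (the vars(args) call in the tracebacks below suggests a standard argparse interface):

# python3 run.py -h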

4. Attempting to get the default ONNX fp16 result:

# AIO_PROCESS_MODE=0 OMP_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp16 -f ort

Intraop parallelism set to 16 threads


Running with ONNX Runtime

  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 127, in run_ort_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_runs, timeout)
  File "/aml/utils/benchmark.py", line 229, in run_model
    single_pass_func(runner, dataset)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 114, in run_single_pass
    output = ort_runner.run(batch_size)
  File "/aml/utils/ort.py", line 36, in run
    outputs = self.session.run(self._output_names, self._feed_dict)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float16)) , expected: (tensor(float))
  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]
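
The traceback shows fp16 tensors being fed to a graph whose input is declared as tensor(float), i.e. resnet50_v1.onnx itself is an fp32 model. The declared input types can be verified with a one-off check (diagnostic only, not part of AML):

# python3 -c "import onnxruntime as ort; s = ort.InferenceSession('resnet50_v1.onnx'); print([(i.name, i.type) for i in s.get_inputs()])"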

5. Attempting to get the ONNX-AIO fp16 result:

# AIO_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp16 -f ort


Intraop parallelism set to 16 threads


Running with ONNX Runtime

  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 127, in run_ort_fp
    return run_model(run_single_pass, runner, dataset, batch_size, num_runs, timeout)
  File "/aml/utils/benchmark.py", line 229, in run_model
    single_pass_func(runner, dataset)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 114, in run_single_pass
    output = ort_runner.run(batch_size)
  File "/aml/utils/ort.py", line 36, in run
    outputs = self.session.run(self._output_names, self._feed_dict)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(float16)) , expected: (tensor(float))
  0%|                                                                                                                     | 0/60 [00:00<?, ?it/s]

The ONNX model is https://zenodo.org/record/2592612/files/resnet50_v1.onnx

6. Trying to download the fp16-precision ONNX Runtime model described in README.md (https://www.dropbox.com/s/r80ndhbht7tixn5/resnet_50_v1.5_fp16.onnx) and run it:

# python3 run.py -m resnet_50_v1.5_fp16.onnx -p fp16 -f ort

Intraop parallelism set to 16 threads

Traceback (most recent call last):
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 199, in <module>
    main()
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 188, in main
    run_ort_fp16(**vars(args))
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 151, in run_ort_fp16
    return run_ort_fp(model_path, batch_size, num_runs, timeout, images_path, labels_path)
  File "/aml/computer_vision/classification/resnet_50_v15/run.py", line 125, in run_ort_fp
    runner = OrtRunner(model_path)
  File "/aml/utils/ort.py", line 26, in __init__
    self.session = ort.InferenceSession(model, session_options, providers=ort.get_available_providers())
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/usr/local/lib/python3.10/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 397, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from resnet_50_v1.5_fp16.onnx failed:Protobuf parsing failed.
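
INVALID_PROTOBUF usually means the file on disk is not a valid ONNX model; with Dropbox share links, wget/curl without the dl=1 parameter tends to save the HTML preview page instead of the binary. Worth verifying before anything else (assuming the link is still live):

# file resnet_50_v1.5_fp16.onnx
# wget -O resnet_50_v1.5_fp16.onnx "https://www.dropbox.com/s/r80ndhbht7tixn5/resnet_50_v1.5_fp16.onnx?dl=1"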

@jan-grzybek-ampere (Member) commented Oct 30, 2023

Hi, fp16 with AIO should be used in implicit mode.
Please set:

AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*"

and run standard fp32 inference from the AML level.
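
Combined with the earlier invocations in this thread, that would look like (a sketch, not verified here):

# AIO_IMPLICIT_FP16_TRANSFORM_FILTER=".*" AIO_NUM_THREADS=16 python3 run.py -m resnet50_v1.onnx -p fp32 -f ort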

@binh-ampere (Author)

Thanks Jan, I still got the same error after setting AIO_IMPLICIT_FP16_TRANSFORM_FILTER.
