Could not load library libcudnn_ops_infer.so.8. #212

Closed
kadirnar opened this issue Sep 16, 2022 · 18 comments · Fixed by #525
Labels
bug Something isn't working

Comments

@kadirnar

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Code:

import yolov5

def pip_load():
    return yolov5.load('yolov5s.pt')

pip_load()

Error Message:

Writing profile results into speed/memray-pip_speed.py.15691.bin
Memray WARNING: Correcting symbol for aligned_alloc from 0x7f6d26393cc0 to 0x7f6d279a4250
Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory
Please make sure libcudnn_ops_infer.so.8 is in your library path!

Expected Behavior

No response

Steps To Reproduce

pip install memray

Code:

memray run file.py 

Memray Version

1.3.1

Python Version

3.8

Operative System

Linux

Anything else?

No response

@kadirnar kadirnar added the bug Something isn't working label Sep 16, 2022
@pablogsal
Member

pablogsal commented Sep 16, 2022

Thanks, @kadirnar, for opening an issue.

This doesn't seem like a memray error. This error seems to be coming from the library itself, because you are missing a shared object.

Can you paste here the output of the program when you run python file.py and python -m memray file.py?

@kadirnar
Author

I solved the problem, thank you.

@ash2703

ash2703 commented Aug 22, 2023

@kadirnar Could you please post the solution? I'm facing the same issue when profiling a pytorch model.

@bnawras

bnawras commented Dec 9, 2023

@pablogsal

  • Ubuntu 20.04.5
  • python3.8
  • memray 1.11.0
$ python3.8 -m memray train.py

Memray WARNING: Correcting symbol for malloc from 0x425490 to 0x7fd52516e0e0
Memray WARNING: Correcting symbol for free from 0x425910 to 0x7fd52516e6d0
Memray WARNING: Correcting symbol for aligned_alloc from 0x7fd524941ca0 to 0x7fd52516f250
Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory
Aborted (core dumped)

@pablogsal
Member


As I mentioned in my previous comment, this doesn't look like an issue with memray but an issue with your environment or the packages you are using.

In order for us to check what's going on, can you please provide the contents of "train.py" and all the dependencies you are using?

@bnawras

bnawras commented Dec 9, 2023

@pablogsal

Dockerfile

FROM nvidia/cuda:11.1.1-runtime-ubuntu20.04

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update
RUN apt-get upgrade -y
RUN apt-get install -y \
        build-essential git python3 python3-pip \
        ffmpeg libsm6 libxext6 libxrender1 libglib2.0-0

WORKDIR /app
COPY requirements.txt .
RUN pip install --ignore-installed -r requirements.txt

RUN mkdir /datasets

CMD jupyter lab --ip 0.0.0.0 --port 1110 --allow-root

requirements

absl-py==2.0.0
aiohttp==3.8.6
aiosignal==1.3.1
albumentations==1.3.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.2.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==22.2.0
augraphy==8.2.4
azure-core==1.29.4
azure-storage-blob==12.18.3
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==5.3.1
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.0.1
clearml==1.13.1
cloudpickle==3.0.0
coloredlogs==15.0.1
comm==0.1.2
contourpy==1.0.7
cryptography==41.0.4
cssutils==2.9.0
cycler==0.11.0
dataframe-image==0.2.2
dbus-python==1.2.16
debugpy==1.6.5
decorator==5.1.1
defusedxml==0.7.1
efficientnet-pytorch==0.7.1
entrypoints==0.4
executing==1.2.0
fastjsonschema==2.16.2
filelock==3.12.4
filprofiler==2023.3.1
flatbuffers==23.5.26
fonttools==4.38.0
frozenlist==1.4.0
fsspec==2023.9.2
furl==2.1.3
google-auth==2.23.3
google-auth-oauthlib==1.0.0
grpcio==1.59.0
html2image==2.0.4.3
huggingface-hub==0.18.0
humanfriendly==10.0
idna==3.4
imageio==2.25.0
importlib-metadata==6.0.0
importlib-resources==5.10.2
ipykernel==6.20.2
ipython==8.8.0
ipython-genutils==0.2.0
ipywidgets==8.0.4
isodate==0.6.1
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.11
jsonschema==4.17.3
jupyter-client==7.4.9
jupyter-core==5.1.3
jupyter-events==0.6.3
jupyter-server==2.1.0
jupyter-server-terminals==0.4.4
jupyterlab==3.5.2
jupyterlab-pygments==0.2.2
jupyterlab-server==2.19.0
jupyterlab-widgets==3.0.5
kiwisolver==1.4.4
llvmlite==0.41.0
lmdb==0.94
lxml==4.9.3
Markdown==3.5
markdown-it-py==3.0.0
MarkupSafe==2.1.2
matplotlib==3.6.3
matplotlib-inline==0.1.6
mdurl==0.1.2
memory-profiler==0.61.0
memray==1.11.0
mistune==2.0.4
mpmath==1.3.0
multidict==6.0.4
munch==4.0.0
nbclassic==0.4.8
nbclient==0.7.2
nbconvert==7.2.8
nbformat==5.7.3
nest-asyncio==1.5.6
networkx==3.0
notebook==6.5.2
notebook-shim==0.2.2
numba==0.58.0
numpy==1.24.1
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauthlib==3.2.2
onnxruntime-gpu==1.9.0
opencv-python==4.8.1.78
opencv-python-headless==4.7.0.68
orderedmultidict==1.0.1
packaging==23.0
pandas==1.5.3
pandocfilters==1.5.0
parso==0.8.3
pathlib2==2.3.7.post1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.4.0
pkgutil-resolve-name==1.3.10
platformdirs==2.6.2
pretrainedmodels==0.7.4
prometheus-client==0.15.0
prompt-toolkit==3.0.36
protobuf==4.24.4
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
Pygments==2.14.0
PyGObject==3.36.0
PyJWT==2.4.0
pynvml==11.4.1
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.4
pytz==2022.7.1
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.0
qudida==0.0.4
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.0
rsa==4.9
safetensors==0.4.0
scalene==1.5.31.1
scikit-image==0.19.3
scikit-learn==1.2.0
scipy==1.10.0
seaborn==0.13.0
segmentation-models-pytorch==0.3.3
Send2Trash==1.8.0
shapely==2.0.0
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
stack-data==0.6.2
svgpathtools==1.3.3
svgwrite==1.4.3
sympy==1.12
tensorboard==2.14.0
tensorboard-data-server==0.7.1
terminado==0.17.1
textual==0.44.1
threadpoolctl==3.1.0
tifffile==2023.1.23.1
timm==0.9.2
tinycss2==1.2.1
tomli==2.0.1
torch==1.13.1
torchvision==0.14.1
tornado==6.2
tqdm==4.64.1
traitlets==5.8.1
typing-extensions==4.4.0
urllib3==1.26.14
wcwidth==0.2.6
webencodings==0.5.1
websocket-client==1.4.2
werkzeug==3.0.0
widgetsnbextension==4.0.5
yarl==1.9.2
zipp==3.11.0

code

import torch
import torchvision

model = torchvision.models.regnet_x_1_6gf()
model.cuda()

output = model(torch.rand(3, 3, 200, 200).cuda())

@bnawras

bnawras commented Dec 9, 2023

@pablogsal

This code works without memray; it also works with scalene.

@pablogsal
Member

It seems that this is because pytorch is doing something weird with their dlopen handles:

pytorch/pytorch@198a3e4

See also voicepaw/so-vits-svc-fork#364

I think you need to bring this to the pytorch developers as their "workaround" conflicts with dlopen interposition.

@pablogsal
Member

@godlygeek I can confirm that not patching torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8 and torch/lib/libtorch_cuda.so fixes the problem, but it's unclear if this is something we should do.

@pablogsal
Member

Hmm, when libcudnn_ops_infer.so.8 is first dlopen-ed by cudnnCreate in lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8, this is the linker load:

   968020:
    968020:     file=libcudnn_ops_infer.so.8 [0];  dynamically loaded by /home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8 [0]
    968020:     find library=libcudnn_ops_infer.so.8 [0]; searching
    968020:      search path=/opt/cuda/lib64:glibc-hwcaps/x86-64-v4:glibc-hwcaps/x86-64-v3:glibc-hwcaps/x86-64-v2:              (LD_LIBRARY_PATH)
    968020:       trying file=/opt/cuda/lib64/libcudnn_ops_infer.so.8
    968020:       trying file=glibc-hwcaps/x86-64-v4/libcudnn_ops_infer.so.8
    968020:       trying file=glibc-hwcaps/x86-64-v3/libcudnn_ops_infer.so.8
    968020:       trying file=glibc-hwcaps/x86-64-v2/libcudnn_ops_infer.so.8
    968020:       trying file=libcudnn_ops_infer.so.8
    968020:      search path=/home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib           (RPATH from file /home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/libtorch_global_deps.so)
    968020:       trying file=/home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_ops_infer.so.8
    968020:

but when loaded via memray in memray::intercept::dlopen, this is the linker load:

    990508:
    990508:     file=libcudnn_ops_infer.so.8 [0];  dynamically loaded by /home/pablogsal/github/memray/src/memray/_memray.cpython-311-x86_64-linux-gnu.so [0]
    990508:     find library=libcudnn_ops_infer.so.8 [0]; searching
    990508:      search path=/opt/cuda/lib64:glibc-hwcaps/x86-64-v4:glibc-hwcaps/x86-64-v3:glibc-hwcaps/x86-64-v2:              (LD_LIBRARY_PATH)
    990508:       trying file=/opt/cuda/lib64/libcudnn_ops_infer.so.8
    990508:       trying file=glibc-hwcaps/x86-64-v4/libcudnn_ops_infer.so.8
    990508:       trying file=glibc-hwcaps/x86-64-v3/libcudnn_ops_infer.so.8
    990508:       trying file=glibc-hwcaps/x86-64-v2/libcudnn_ops_infer.so.8
    990508:       trying file=libcudnn_ops_infer.so.8
    990508:      search path=/home/pablogsal/.pyenv/versions/3.11.1/lib         (RUNPATH from file /home/pablogsal/.pyenv/versions/3.11.1/envs/memray/bin/python)
    990508:       trying file=/home/pablogsal/.pyenv/versions/3.11.1/lib/libcudnn_ops_infer.so.8
    990508:      search cache=/etc/ld.so.cache
    990508:      search path=/usr/lib/glibc-hwcaps/x86-64-v4:/usr/lib/glibc-hwcaps/x86-64-v3:/usr/lib/glibc-hwcaps/x86-64-v2:/usr/lib           (system search path)
    990508:       trying file=/usr/lib/glibc-hwcaps/x86-64-v4/libcudnn_ops_infer.so.8
    990508:       trying file=/usr/lib/glibc-hwcaps/x86-64-v3/libcudnn_ops_infer.so.8
    990508:       trying file=/usr/lib/glibc-hwcaps/x86-64-v2/libcudnn_ops_infer.so.8
    990508:       trying file=/usr/lib/libcudnn_ops_infer.so.8

Somehow the RPATH of /home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/libtorch_global_deps.so was not considered.

@pablogsal
Member

Ah, it's not being considered because the dlopen happens in /home/pablogsal/github/memray/src/memray/_memray.cpython-311-x86_64-linux-gnu.so and not in /home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn.so.8.

Here is a reproducer:

$ cat mypreload.c

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef void* (*dlopen_t)(const char* filename, int flags);

static void* (*real_dlopen)(const char* filename, int flags) = NULL;

void*
dlopen(const char* filename, int flags)
{
    if (!real_dlopen) {
        real_dlopen = (dlopen_t)dlsym(RTLD_NEXT, "dlopen");
        if (!real_dlopen) {
            fprintf(stderr, "Error: Unable to find real dlopen function\n");
            return NULL;
        }
    }

    printf("Intercepted: Loading library: %s\n", filename);

    return real_dlopen(filename, flags);
}

$ gcc -shared -fPIC -ldl mypreload.c -o mypreload.so

$ LD_PRELOAD=./mypreload.so python example.py

...
...
Intercepted: Loading library: libcudnn_ops_infer.so.8
Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory

@pablogsal
Member

This means that RPATHs won't be considered when dlopen calls are intercepted. This affects other profilers as well, for example:

$ heaptrack /home/pablogsal/.pyenv/versions/memray/bin/python example.py
heaptrack output will be written to "/home/pablogsal/github/memray/heaptrack.python.1180606.zst"
starting application, this might take some time...
...
Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8: cannot open shared object file: No such file or directory
/usr/bin/heaptrack: line 361: 1180621 Aborted                 (core dumped) LD_PRELOAD="$LIBHEAPTRACK_PRELOAD${LD_PRELOAD:+:$LD_PRELOAD}" DUMP_HEAPTRACK_OUTPUT="$pipe" "$client" "$@"
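
For context, here is a minimal sketch (this is not memray's actual fix, and the names and output are purely illustrative) of one way an interposing dlopen could at least discover which shared object made the call, by applying dladdr to the return address. With that information the interposer could, in principle, honor the caller's RPATH/RUNPATH instead of its own:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef void* (*dlopen_t)(const char* filename, int flags);

static dlopen_t real_dlopen = NULL;

void*
dlopen(const char* filename, int flags)
{
    if (!real_dlopen) {
        real_dlopen = (dlopen_t)dlsym(RTLD_NEXT, "dlopen");
        if (!real_dlopen) {
            fprintf(stderr, "Error: Unable to find real dlopen function\n");
            return NULL;
        }
    }

    /* Identify the shared object containing the call site: its
     * DT_RPATH/DT_RUNPATH is what the dynamic loader would normally
     * consult to resolve a bare name like libcudnn_ops_infer.so.8. */
    Dl_info info;
    void* caller = __builtin_return_address(0);
    if (dladdr(caller, &info) && info.dli_fname) {
        fprintf(stderr, "dlopen(%s) requested by %s\n",
                filename ? filename : "(self)", info.dli_fname);
        /* A real fix would read that object's RPATH/RUNPATH entries and
         * try those directories before falling back to the default search. */
    }

    return real_dlopen(filename, flags);
}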

@albertz

albertz commented Feb 8, 2024

Is there a workaround? Or should I just wait for PR #525?

@pablogsal
Member

pablogsal commented Feb 8, 2024

Is there a workaround? Or should I just wait for PR #525?

There is a workaround while we work on merging PR #525. You need to set the LD_LIBRARY_PATH environment variable to also include the rpath of the library. For example, on my system I can do:

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/home/pablogsal/.pyenv/versions/3.11.1/envs/memray/lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib
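
If the exact path differs on your machine, an untested sketch of a more portable form of the same workaround (assuming a pip-installed torch that bundles the nvidia-cudnn wheel, as in the RPATH shown above) is to derive the directory from torch's own install location:

# Locate site-packages/nvidia/cudnn/lib relative to the torch package.
CUDNN_LIB_DIR="$(python -c 'import os, torch; print(os.path.join(os.path.dirname(os.path.dirname(torch.__file__)), "nvidia", "cudnn", "lib"))')"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${CUDNN_LIB_DIR}"
memray run train.py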

@pablogsal
Member

@albertz can you confirm this works for you?

@albertz

albertz commented Feb 8, 2024

export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:.../virtualenv/.../lib/python3.11/site-packages/torch/lib/../../nvidia/cudnn/lib

Thanks, yes, that works.

@pablogsal
Member

pablogsal commented Feb 10, 2024

@albertz you mentioned here https://news.ycombinator.com/reply?id=39327452&goto=item%3Fid%3D39325983%2339327452 that there seem to be some things that don't make sense in your experience when you used memray.

Could you give us a small reproducer or explain the problem a bit so we can look into it?

@albertz

albertz commented Feb 10, 2024

I will add some details in a separate issue: #547
