Summary of Changes
- Use `dlopen` to open the Tensorflow library for TF+ZenDNN

Motivation
An upcoming change in the Tensorflow version used by `tfzendnn` creates a symbol conflict between it and the inference server due to mismatching protobuf symbols. In the change, the correct protobuf symbols are located in another TF library, but these symbols aren't found because the server already links protobuf, so its copy is resolved first.

Implementation
@amuralee-amd and I explored many options to find a workable solution to the symbol conflict. For future reference, here's what I tried:
- Using `RTLD_DEEPBIND` for all workers. This is a good idea in theory: this library version mismatch occurred now with tfzendnn, but it can happen again with other workers, and loading all workers with this option should isolate them. Unfortunately, this creates a number of problems:
  - `std::cout` stops working in the loaded shared library and certain functions in `libstdc++` raise `bad_cast` exceptions.
  - Not linking the workers against `libamdinfer.so` was done in part to address another issue with using `RTLD_DEEPBIND`, which resulted in some global symbols like the logger not being correctly initialized in the loaded library. By not linking it, the worker would refer back to the version in the global scope instead.
- Using `dlmopen` instead of `dlopen` to load the library in a different namespace creates different problems. For example, `gdb` can't easily peer into the loaded library. There are also other posts online discussing the various issues around using `dlmopen`.
- Using `-nodefaultlibs` and similar flags to avoid linking the standard library (and hopefully let it resolve from the main scope) also didn't work at compile time. Manually editing the `.dynamic` section to remove libraries with patchelf also didn't work (though removing `libc` did have an effect in that `free()` stopped working).
- `libtensorflow_cc.so` took too long to generate.
- `sys.setdlopenflags()` that I needed to use to change the flags used by Python to work with libraries that have been opened with `RTLD_DEEPBIND`.
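
As a rough illustration of the two mechanisms mentioned above (loading a library with `RTLD_DEEPBIND`, and changing the flags Python uses when it loads extension modules), here is a minimal Python sketch. It is not the code from this PR: the helper names `load_isolated` and `set_import_flags` are hypothetical, and which flags the Python side actually needs is an assumption that depends on how the bindings are built.

```python
import ctypes
import os
import sys

# RTLD_DEEPBIND is glibc-specific; fall back to 0 where Python does not expose it.
RTLD_DEEPBIND = getattr(os, "RTLD_DEEPBIND", 0)


def load_isolated(path):
    """dlopen() a library so that it prefers its own symbols over global ones.

    Hypothetical helper: mirrors a C-level
    dlopen(path, RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND).
    """
    return ctypes.CDLL(path, mode=os.RTLD_NOW | os.RTLD_LOCAL | RTLD_DEEPBIND)


def set_import_flags(extra_flags):
    """OR extra flags into the ones Python uses for later extension imports.

    Returns the previous flags so the caller can restore them afterwards.
    """
    previous = sys.getdlopenflags()
    sys.setdlopenflags(previous | extra_flags)
    return previous
```

A caller might wrap an import with `old = set_import_flags(RTLD_DEEPBIND)`, import the extension module, and then call `sys.setdlopenflags(old)` to restore the defaults. The caveats listed above for `RTLD_DEEPBIND` (for example `std::cout` misbehaving in the loaded library) apply equally to anything loaded this way.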