Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RTLD_DEEPBIND for TF+ZenDNN #111

Merged
merged 4 commits into from Jan 12, 2023
Merged

Use RTLD_DEEPBIND for TF+ZenDNN #111

merged 4 commits into from Jan 12, 2023

Conversation

varunsh-xilinx
Copy link
Member

@varunsh-xilinx varunsh-xilinx commented Jan 12, 2023

Summary of Changes

  • Use dlopen to open Tensorflow library for TF+ZenDNN
  • Remove extra linkages in libraries
  • Fix shutdown race condition with gRPC

Motivation

An upcoming change in the Tensorflow version used by tfzednn creates a symbol conflict between it and the inference server due to mismatching protobuf symbols. In the change, the correct protobuf symbols are located in another TF library but these symbols aren't found because the server is already linking protobuf.

Implementation

I and @amuralee-amd explored many options to find a workable solution to resolve the symbol conflict. For future reference, here's what I tried:

  1. Using RTLD_DEEPBIND for all workers. This is a good idea in theory because this library version mismatch occurred now with tfzendnn but it can happen again with other workers. Using this option for loading all workers should isolate them. Unfortunately, this creates a number of problems. std::cout stops working in the loaded shared library and certain functions in libstdc++ raise bad_cast exceptions.
  2. Not linking the workers to libamdinfer.so was done in part to address another issue with using RTLD_DEEPBIND which resulted in some global symbols like the logger not being correctly initialized in the loaded library. By not linking it, the worker would refer back to the version in the global scope instead.
  3. Using dlmopen instead of dlopen to load the library in a different namespace creates different problems. For example, gdb can't easily peer into the loaded library. There are also other posts online discussing the various issues around using dlmopen
  4. Certain fixes I tried worked in some cases but not others. Currently, there are the Python examples (which may or may not start the server from Python), the C++ examples, and the tests. I don't know enough about how the load-time process works in C++ to begin to compare it to how Python is doing it to wrapped library made with Pybind11.
  5. Building the workers with -nodefaultlibs and similar flags to avoid linking the standard library (and hopefully let it resolve from the main scope) also didn't work at compile time. Manually editing the .dynamic section to remove libraries with patchelf also didn't work (though removing libc did have an effect in that free() stopped working).
  6. Using a trampoline utility like Implib.so didn't work either. You get missing symbols possibly related to the vtables but adding the vtables for libtensorflow_cc.so took too long to generate.
  7. I found a Python command os.setdlopenflags() that I needed to use to change the flags used by Python to work with libraries that have been opened with RTLD_DEEPBIND.
  8. Having TF produce a single library instead of two would require a lot of changes to the TF build system and issues with the legal scan - @amuralee-amd

Signed-off-by: Varun Sharma <varun.sharma@amd.com>
Signed-off-by: Varun Sharma <varun.sharma@amd.com>
@gbuildx
Copy link
Collaborator

gbuildx commented Jan 12, 2023

Build failed!

Signed-off-by: Varun Sharma <varun.sharma@amd.com>
@gbuildx
Copy link
Collaborator

gbuildx commented Jan 12, 2023

Build failed!

Signed-off-by: Varun Sharma <varun.sharma@amd.com>
@gbuildx
Copy link
Collaborator

gbuildx commented Jan 12, 2023

Build failed!

@varunsh-xilinx
Copy link
Member Author

This passed after running again

@varunsh-xilinx varunsh-xilinx merged commit 5f31443 into main Jan 12, 2023
@varunsh-xilinx varunsh-xilinx deleted the use-deepbind-for-tf branch January 12, 2023 23:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants