This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Torchscript casting error #5

Closed
ravihammond opened this issue Feb 2, 2022 · 7 comments

@ravihammond

ravihammond commented Feb 2, 2022

Following the discussion in the issue I posted, where I was experiencing silent deadlocks when running the obl1.sh script, @hengyuan-hu suggested that I try a new CUDA version. I decided to throw a Hail Mary and try the latest version of PyTorch.

Here are the details of my new software setup inside a docker container:

  • Ubuntu 20.04
  • CUDA 11.3
  • Python 3.7.4
  • Pytorch 1.10.2
  • Pybind "stable" @9b4f71d12de4f9

It compiled successfully, but when I run the script, I'm experiencing a new torchscript error:

Traceback (most recent call last):
  File "selfplay.py", line 237, in <module>
    belief_model,
  File "/app/pyhanabi/act_group.py", line 45, in __init__
    runner = rela.BatchRunner(agent.clone(dev), dev)
RuntimeError: Unable to cast Python instance of type <class 'torch._C.ScriptModule'> to C++ type 'torch::jit::Module'

I suspect that I'm experiencing this error because torchscript has changed in the latest PyTorch. I will investigate further and report back here once I've figured out some more information.

If you have any idea what might be causing this issue, I'd be very happy to hear your thoughts!

@hengyuan-hu
Contributor

Hi, this is the error I observed when I tried to compile with PyTorch 1.7.1 (I haven't tried your script yet; this is the old problem that basically stopped me from upgrading to a newer PyTorch). Unfortunately, we don't have a solution to this yet. It is likely caused by a mismatch between the pybind used for this repo and the pybind used by the PyTorch release build.

Any luck building PyTorch from scratch? That may be the easiest fix to get you started.

@hengyuan-hu
Contributor

Just found the version info for pybind in pytorch in the commit message
https://github.com/pytorch/pytorch/tree/master/third_party

I tried using the same version but the casting still does not work. Will let you know if I have any progress.

@hengyuan-hu
Contributor

hengyuan-hu commented Feb 2, 2022

First, check out the pybind version used by PyTorch. For the latest one, it should be
git checkout v2.6.2

Then adding
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\"")
on this line

should fix the problem.
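
The three values in that flag line (`_gcc`, `_libstdcpp`, `_cxxabi1011`) need to match whatever the installed PyTorch binary was built with. As a rough sketch, they could be read from PyTorch and formatted for CMake as below. This assumes the PyTorch build exposes the private attributes `torch._C._PYBIND11_COMPILER_TYPE`, `torch._C._PYBIND11_STDLIB` and `torch._C._PYBIND11_BUILD_ABI`; these are not a public API and may be missing or `None` on some versions. The helper names `cmake_abi_flags` and `read_torch_tags` are made up for illustration and are not part of this repo.

```python
# Sketch: derive the pybind11 ABI flags from the installed PyTorch build.
# Assumption: torch._C exposes private _PYBIND11_* attributes. They are not
# a public API, so every lookup falls back to None if an attribute is absent.

# Escaped quote exactly as it appears in the CMake line quoted in this thread.
ESCAPED_QUOTE = r'\\\"'
TAG_NAMES = ("COMPILER_TYPE", "STDLIB", "BUILD_ABI")


def cmake_abi_flags(tags):
    """Format a {tag name: value} dict as -DPYBIND11_<TAG>=... definitions.

    Tags that are missing, None, or empty are skipped.
    """
    parts = []
    for name in TAG_NAMES:
        value = tags.get(name)
        if value:
            parts.append(f"-DPYBIND11_{name}={ESCAPED_QUOTE}{value}{ESCAPED_QUOTE}")
    return " ".join(parts)


def read_torch_tags():
    """Read the ABI tags from torch._C, or return {} if torch is unavailable."""
    try:
        import torch
    except ImportError:
        return {}
    return {name: getattr(torch._C, f"_PYBIND11_{name}", None) for name in TAG_NAMES}


if __name__ == "__main__":
    flags = cmake_abi_flags(read_torch_tags())
    print(flags or "no pybind11 ABI tags found")
```

If the printed flags differ from the ones hard-coded in the CMake line above, the printed ones are probably the values to use, since they describe the PyTorch build the extension is loaded next to.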

@ravihammond
Author

I've just tried it, and the error is gone. Thank you so much!
Now I'll see if this new version of CUDA gives me silent deadlocks.

@hengyuan-hu
Contributor

With this fix we will also switch to a newer version internally. Let's see if we can reproduce the deadlock problem ourselves.

@ravihammond
Author

ravihammond commented Feb 4, 2022

Okay, I've successfully run obl1.sh and obl2.sh without any deadlocks, illegal move errors, or casting errors. Thanks so much for your help @hengyuan-hu!

@hengyuan-hu
Contributor

Resolved.
