Torchscript casting error #5

ravihammond · 2022-02-02T00:40:21Z

Following the discussion in the issue I posted earlier, where I was experiencing silent deadlocks when running the obl1.sh script, @hengyuan-hu suggested that I try a new CUDA version. I decided to throw a hail mary and also try the latest version of PyTorch.

Here are the details of my new software setup inside a docker container:

Ubuntu 20.04
CUDA 11.3
Python 3.7.4
Pytorch 1.10.2
Pybind "stable" @9b4f71d12de4f9

It compiled successfully, but when I run the script, I'm experiencing a new torchscript error:

Traceback (most recent call last):
  File "selfplay.py", line 237, in <module>
    belief_model,
  File "/app/pyhanabi/act_group.py", line 45, in __init__
    runner = rela.BatchRunner(agent.clone(dev), dev)
RuntimeError: Unable to cast Python instance of type <class 'torch._C.ScriptModule'> to C++ type 'torch::jit::Module'

I suspect that I'm experiencing this error because torchscript has changed in the latest PyTorch. I will investigate further and report back here once I've figured out some more information.

If you have any idea what might be causing this issue, I'd be very happy to hear your thoughts!


hengyuan-hu · 2022-02-02T01:26:59Z

Hi, this is the same error I observed when I tried to compile with PyTorch 1.7.1 (I haven't tried your script yet; this is the old problem that basically stopped me from upgrading to a newer PyTorch). Unfortunately we don't have a solution for it yet. It is likely caused by a mismatch between the pybind used for this repo and the pybind used by the PyTorch release build.

Any luck building PyTorch from scratch? That may be the easiest fix to get you started.
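
One way to check which pybind ABI tags an installed PyTorch build reports is to read them from `torch._C`. The sketch below assumes the private attributes `_PYBIND11_COMPILER_TYPE`, `_PYBIND11_STDLIB`, and `_PYBIND11_BUILD_ABI` are available (they are what `torch.utils.cpp_extension` reads, but since they are private they may be missing or `None` in some builds). The helper name `pybind_abi_tags` is hypothetical and not part of this repo.

```python
def pybind_abi_tags(torch_c):
    """Collect the pybind11 ABI tags exposed by a torch._C-like module.

    Attributes that are missing or set to None are skipped.
    """
    tags = {}
    for key in ("COMPILER_TYPE", "STDLIB", "BUILD_ABI"):
        # Private attribute names; assumed, may differ between releases.
        value = getattr(torch_c, "_PYBIND11_" + key, None)
        if value is not None:
            tags[key] = value
    return tags


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        print(torch.__version__, pybind_abi_tags(torch._C))
```

If the tags printed here differ from the ones the extension in this repo is compiled with, that would be consistent with the mismatch described above.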

hengyuan-hu · 2022-02-02T02:12:37Z

I just found the version of pybind that PyTorch uses; it is in the commit message here:
https://github.com/pytorch/pytorch/tree/master/third_party

I tried using the same version but the casting still does not work. Will let you know if I have any progress.

hengyuan-hu · 2022-02-02T04:09:42Z

First, check out the pybind version used by PyTorch. For the latest release, it should be
git checkout v2.6.2

Then adding
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DPYBIND11_COMPILER_TYPE=\\\"_gcc\\\" -DPYBIND11_STDLIB=\\\"_libstdcpp\\\" -DPYBIND11_BUILD_ABI=\\\"_cxxabi1011\\\"")
on this line (off-belief-learning/CMakeLists.txt, line 8 in 73e734d) should fix the problem.
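
For a build that reports different tags, the same CMake line can be generated rather than edited by hand. This is only a sketch: the function name `cmake_pybind_flags` is hypothetical, the defaults are the three values quoted above, and the `\\\"` escaping is copied from that line rather than derived independently.

```python
# Escaped quote exactly as it appears in the CMake line above:
# three backslashes followed by a double quote.
QUOTE = r'\\\"'


def cmake_pybind_flags(compiler_type="_gcc",
                       stdlib="_libstdcpp",
                       build_abi="_cxxabi1011"):
    """Build the set(CMAKE_CXX_FLAGS ...) line for the given pybind ABI tags."""
    defines = {
        "PYBIND11_COMPILER_TYPE": compiler_type,
        "PYBIND11_STDLIB": stdlib,
        "PYBIND11_BUILD_ABI": build_abi,
    }
    parts = " ".join(
        "-D{}={}{}{}".format(name, QUOTE, value, QUOTE)
        for name, value in defines.items()
    )
    return 'set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ' + parts + '")'


if __name__ == "__main__":
    print(cmake_pybind_flags())
```

Called with no arguments, it reproduces the line quoted above; other tag values can be passed in if the PyTorch build in use was compiled differently.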

ravihammond · 2022-02-02T04:20:59Z

I've just tried it, and the error is gone. Thank you so much!
Now I'll see if this new version of CUDA gives me silent deadlocks.

hengyuan-hu · 2022-02-02T05:19:48Z

With this fix we will also switch to a newer version internally. Let's see if we can reproduce the deadlock problem ourselves.

ravihammond · 2022-02-04T01:19:51Z

Okay, I've successfully run obl1.sh and obl2.sh without any deadlocks, illegal move errors, or casting errors. Thanks so much for your help @hengyuan-hu!

hengyuan-hu · 2022-02-04T16:28:12Z

Resolved.

hengyuan-hu closed this as completed Feb 4, 2022
