Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization failure with MSVC and version 1.84 #309

Closed
johnwason opened this issue Mar 14, 2024 · 29 comments
Closed

Serialization failure with MSVC and version 1.84 #309

johnwason opened this issue Mar 14, 2024 · 29 comments

Comments

@johnwason
Copy link

We are seeing a new issue with serialization using MSVC and version 1.84. Unit tests are failing, and it looks like the serialization is being corrupted with binary archives, and some tests are failing with XML archives. The serialization has worked with previous versions of serialization. I did a matrix test using version 1.82, 1.83, and 1.84 to confirm. 1.82 and 1.83 passed, while 1.84 failed. Here is the workflow run:

https://github.com/johnwason/tesseract/actions/runs/8242279639

The upstream source repository is here: https://github.com/tesseract-robotics/tesseract

Both tesseract and boost-serialization are large projects so I am not sure where the problem is. Were there any major changes in boost-serialization that would explain the problem?

@Levi-Armstrong

@robertramey
Copy link
Member

I looked at the links cited. I'm not sure how to understand the build script and how/if it points to the boost serialization library.

@LowLevelMahn
Copy link

im currently upgrading to 1.84 - so just commenting here to be informed by github if there is a real issue

@johnwason
Copy link
Author

I have narrowed down the problem to the "command" class tests, and tests that use commands.

An example of the test failure is EnvironmentCommandsSerializeUnit.AddLinkCommand in the test function testSerializationDerivedClass() located here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/test/tesseract_environment_serialization_unit.cpp#L145

Within testSerializationDerivedClass() the failure is here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_common/include/tesseract_common/unit_test_utils.h#L145

As far as I can tell this looks like a failure related to polymorphic types. This error was not occurring in previous boost serialization versions.

The base class Command is defined here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/include/tesseract_environment/command.h#L81

And the example subclass AddLinkCommand is located here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/include/tesseract_environment/commands/add_link_command.h#L45

The implementation for the commands is located here: https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/src/command.cpp#L35

https://github.com/tesseract-robotics/tesseract/blob/ea3d5c37d466d55068a41e75de24ce2655f5c648/tesseract_environment/src/commands/add_link_command.cpp#L41

There are many other tests failing that all use the Command base class.

@Levi-Armstrong

@johnwason
Copy link
Author

Note that the serialization to string and XML works correctly so I don't think the issue is in the tesseract library.

@johnwason
Copy link
Author

I did more tinkering and was able to find this difference between the failing unit test and an isolated example that seems to work correctly. ChangeJointPositionLimitsCommand2.binary is from the failing test, and ChangeJointPositionLimitsCommand3.binary is from the working test. The only difference is in like 60:

$ xxd ChangeJointPositionLimitsCommand2.binary
00000000: 1600 0000 0000 0000 7365 7269 616c 697a  ........serializ
00000010: 6174 696f 6e3a 3a61 7263 6869 7665 1400  ation::archive..
00000020: 0404 0408 0100 0000 0001 0000 0002 0020  ............... 
00000030: 0000 0000 0000 0043 6861 6e67 654a 6f69  .......ChangeJoi
00000040: 6e74 506f 7369 7469 6f6e 4c69 6d69 7473  ntPositionLimits
00000050: 436f 6d6d 616e 6401 0000 0000 0000 0000  Command.........
00000060: 0000 0000 0001 0000 000c 0000 0000 0000  ................
00000070: 0000 0100 0000 0000 0000 0800 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0006 0000 0000  ................
00000090: 0000 006a 6f69 6e74 3600 0000 0000 182d  ...joint6......-
000000a0: 4454 fb21 09c0 182d 4454 fb21 0940       DT.!...-DT.!.@

$ xxd ChangeJointPositionLimitsCommand3.binary
00000000: 1600 0000 0000 0000 7365 7269 616c 697a  ........serializ
00000010: 6174 696f 6e3a 3a61 7263 6869 7665 1400  ation::archive..
00000020: 0404 0408 0100 0000 0001 0000 0002 0020  ...............
00000030: 0000 0000 0000 0043 6861 6e67 654a 6f69  .......ChangeJoi
00000040: 6e74 506f 7369 7469 6f6e 4c69 6d69 7473  ntPositionLimits
00000050: 436f 6d6d 616e 6401 0000 0000 0000 0000  Command.........
00000060: 0100 0000 0001 0000 000c 0000 0000 0000  ................
00000070: 0000 0100 0000 0000 0000 0800 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0006 0000 0000  ................
00000090: 0000 006a 6f69 6e74 3600 0000 0000 182d  ...joint6......-
000000a0: 4454 fb21 09c0 182d 4454 fb21 0940       DT.!...-DT.!.@

The serialization step is here: https://github.com/johnwason/tesseract/blob/51a62b5bf3d844f8605b76879ac0f4c38b7d059b/tesseract_environment/src/commands/change_joint_position_limits_command.cpp#L81 and the class is defined here: https://github.com/johnwason/tesseract/blob/51a62b5bf3d844f8605b76879ac0f4c38b7d059b/tesseract_environment/include/tesseract_environment/commands/change_joint_position_limits_command.h#L45

Maybe someone with more knowledge of the serialization format can explain what could cause this difference.

@johnwason
Copy link
Author

johnwason commented Apr 18, 2024

I have been able to isolate and reproduce the problem.

Here is the repo with the test code: https://github.com/johnwason/boost_serialization_1_84_bug

And here is the action run: https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991

What I have discovered is that if the classes being serialized is in a shared library, then the serialization will fail. The failure only occurs for binary archives. The action run tests version 1.83 and version 1.82 of boost serialization, and it appears to work correctly.

Here is the cmake file. The tests for test_no_lib_prog works, while test_lib_prog fails.

cmake_minimum_required(VERSION 3.15.0)
project(boost_serialization_1_84_bug LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 17)

find_package(Boost REQUIRED COMPONENTS system filesystem serialization)
find_package(GTest REQUIRED)

add_library(test_lib SHARED cmd.cpp cmd2.cpp)
target_link_libraries(test_lib Boost::boost
Boost::system
Boost::filesystem
Boost::serialization
GTest::GTest GTest::Main)
set_target_properties(test_lib PROPERTIES WINDOWS_EXPORT_ALL_SYMBOLS TRUE)

add_executable(test_lib_prog test.cpp)
target_link_libraries(test_lib_prog test_lib Boost::boost
Boost::system
Boost::filesystem
Boost::serialization
GTest::GTest GTest::Main)

add_executable(test_no_lib_prog test.cpp cmd.cpp cmd2.cpp)
target_link_libraries(test_no_lib_prog Boost::boost
Boost::system
Boost::filesystem
Boost::serialization
GTest::GTest GTest::Main)

enable_testing()

add_test(NAME test_no_lib COMMAND test_no_lib_prog)
add_test(NAME test_lib COMMAND test_lib_prog)

Here is the ctest log:

test 1
    Start 1: test_no_lib

1: Test command: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\build\Release\test_no_lib_prog.exe
1: Working Directory: D:/a/boost_serialization_1_84_bug/boost_serialization_1_84_bug/build
1: Test timeout computed to be: 10000000
1: [==========] Running 6 tests from 2 test suites.
1: [----------] Global test environment set-up.
1: [----------] 3 tests from serialization_direct_test
1: [ RUN      ] serialization_direct_test.xml
1: [       OK ] serialization_direct_test.xml ([20](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:21) ms)
1: [ RUN      ] serialization_direct_test.binary
1: [       OK ] serialization_direct_test.binary (0 ms)
1: [ RUN      ] serialization_direct_test.xml_string
1: [       OK ] serialization_direct_test.xml_string (0 ms)
1: [----------] 3 tests from serialization_direct_test ([21](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:22) ms total)
1: 
1: [----------] 3 tests from serialization_derived_test
1: [ RUN      ] serialization_derived_test.xml
1: [       OK ] serialization_derived_test.xml (1 ms)
1: [ RUN      ] serialization_derived_test.binary
1: [       OK ] serialization_derived_test.binary (0 ms)
1: [ RUN      ] serialization_derived_test.xml_string
1: [       OK ] serialization_derived_test.xml_string (0 ms)
1: [----------] 3 tests from serialization_derived_test (2 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 6 tests from 2 test suites ran. ([24](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:25) ms total)
1: [  PASSED  ] 6 tests.
1/2 Test #1: test_no_lib ......................   Passed    0.08 sec
test 2
    Start 2: test_lib

2: Test command: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\build\Release\test_lib_prog.exe
2: Working Directory: D:/a/boost_serialization_1_84_bug/boost_serialization_1_84_bug/build
2: Test timeout computed to be: 10000000
2: [==========] Running 6 tests from 2 test suites.
2: [----------] Global test environment set-up.
2: [----------] 3 tests from serialization_direct_test
2: [ RUN      ] serialization_direct_test.xml
2: [       OK ] serialization_direct_test.xml (1 ms)
2: [ RUN      ] serialization_direct_test.binary
2: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\unit_test_utils.h(73): error: Value of: *object_derived != *nobject_derived
2:   Actual: true
2: Expected: false
2: 
2: [  FAILED  ] serialization_direct_test.binary (0 ms)
2: [ RUN      ] serialization_direct_test.xml_string
2: [       OK ] serialization_direct_test.xml_string (0 ms)
2: [----------] 3 tests from serialization_direct_test (2 ms total)
2: 
2: [----------] 3 tests from serialization_derived_test
2: [ RUN      ] serialization_derived_test.xml
2: [       OK ] serialization_derived_test.xml (1 ms)
2: [ RUN      ] serialization_derived_test.binary
2: D:\a\boost_serialization_1_84_bug\boost_serialization_1_84_bug\unit_test_utils.h(73): error: Value of: *object_derived != *nobject_derived
2:   Actual: true
2: Expected: false
2: 
2: [  FAILED  ] serialization_derived_test.binary (1 ms)
2: [ RUN      ] serialization_derived_test.xml_string
2: [       OK ] serialization_derived_test.xml_string (0 ms)
2: [----------] 3 tests from serialization_derived_test (3 ms total)
2: 
2: [----------] Global test environment tear-down
2: [==========] 6 tests from 2 test suites ran. (6 ms total)
2: [  PASSED  ] 4 tests.
2: [  FAILED  ] 2 tests, listed below:
2: [  FAILED  ] serialization_direct_test.binary
2: [  FAILED  ] serialization_derived_test.binary
2: 
2:  2 FAILED TESTS
2/2 Test #2: test_lib .........................***Failed    0.02 sec

[50](https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/8745401644/job/24000327991#step:6:51)% tests passed, 1 tests failed out of 2

Total Test time (real) =   0.10 sec

The following tests FAILED:
	  2 - test_lib (Failed)
Errors while running CTest

@johnwason
Copy link
Author

I am still seeing this failure in boost 1.85.

@johnwason
Copy link
Author

Are there any suggestions where I can begin troubleshooting?

@correaa
Copy link

correaa commented Jun 7, 2024

I can’t help you, I don’t have experience with Windows. but I am curious what you do mean by “classes being serialized is in a shared library”? do you mean that especial members are linked in a dynamic library? dll?

@robertramey
Copy link
Member

Sorry I haven't been more responsive. As far as Boost is concerned, I've been bogged down in build/test issues which have nothing to do with the serialization library itself.

@robertramey
Copy link
Member

You've done some real work here. I would agree that your experiments with different archives suggest that it's an issue with binary archive suggest that the problem lies there. That narrows things down a lot.

Could you send the source code to your tests - both the failing and passing one?

@johnwason
Copy link
Author

The source code for the isolated test is here: https://github.com/johnwason/boost_serialization_1_84_bug . It includes an actions workflow that does a parametric test of different boost versions.

The original problem was discovered in the Tesseract robot motion planner: https://github.com/tesseract-robotics/tesseract . The isolated test was extracted from this source.

@correaa
Copy link

correaa commented Jun 7, 2024

The source code for the isolated test is here: https://github.com/johnwason/boost_serialization_1_84_bug . It includes an actions workflow that does a parametric test of different boost versions.

Excuse my ignorance, but what is the meaning of "&=" for an Archive in cmd.cpp (line 15 for example) ?

@robertramey
Copy link
Member

LOL - good question! I'd be curious to hear the answer as well.

@Levi-Armstrong
Copy link

It looks like it is a typo. There are 656 lines of serialization code within Tesseract and only two lines of code has the syntax ar&= which is clearly not correct.

@Levi-Armstrong
Copy link

@robertramey and @correaa Any thoughts why this syntax works at all?

@robertramey
Copy link
Member

I'm mystified by this. I did look into it but didn't see wow it would not produce a syntax error. I'm going to run your test case with the debugger. I expect this will address the error you found and also shed light on this.

@johnwason
Copy link
Author

I changed &= to & and it appears to have worked?? https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/9454574752/job/26042427949

I am getting the warning "warning C4552: '&': result of expression not used" now. Can this be ignored, or is there something else we are missing?

I am going to test the main tesseract library to see if the test pass.

@correaa
Copy link

correaa commented Jun 10, 2024

I guess that &= was converting either side of the expression to int. I am unaware that such conversions are possible, but it is worth checking. Also compiling with all conversion warnings on.

@johnwason
Copy link
Author

My isolated test case is fixed, but I am still getting crashes and failures in the main library. I am only seeing this issue for the binary archive so I don't think it is something in the client code.

@correaa
Copy link

correaa commented Jun 10, 2024

Isolate the error again?

@robertramey
Copy link
Member

  • Thanks to all of for your hard work! Here's what I do with these tough cases:
  1. create test case. You've done that. I'm not familiar with GTEST so I might change the test case again.
  2. verify that the test case builds and still fails on my development system.
  3. use the debugger in my IDE to step through the test case. Since you've got a passing AND a failing instance, it should be fairly easily to compare the two and isolate the error. It is tedious though. But it does work.
  4. I would do the same with the &= issue. This should permit one to trace into any conversions which occur. when you find something, looking at the stack in the debugger is very useful.

As I've said, I don't see the operator &=defined for an archive type - so I would expect a syntax error.

@johnwason
Copy link
Author

It looks like I nerfed the test rather than fixing the error. Here is the corrected run showing that it is still failing:

https://github.com/johnwason/boost_serialization_1_84_bug/actions/runs/9458596594

@johnwason
Copy link
Author

One possibility I noticed is that there is a singleton template that is being initialized with separate static values in the different libraries. This would mean that the base class is seeing a different singleton than the derived class when trying to serialize. Was something involving singletons changed in the more recent versions?

@Levi-Armstrong
Copy link

@robertramey and @correaa This is still failing after the fix.

@Levi-Armstrong
Copy link

I found the other issue. The cpp was missing BOOST_CLASS_EXPORT_IMPLEMENT(tesseract_environment1::Command).

@johnwason
Copy link
Author

johnwason commented Jul 27, 2024

The fix by @Levi-Armstrong fixed the test case. I am working on the full library test now.

... Why did this work before for other boost versions and different OS??

@johnwason
Copy link
Author

This appears to be fixed with tesseract-robotics/tesseract#1035

Closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants