Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gtsam doesn't build on i386 or armhf #1605

Open
dkogan opened this issue Aug 14, 2023 · 40 comments
Open

gtsam doesn't build on i386 or armhf #1605

dkogan opened this issue Aug 14, 2023 · 40 comments

Comments

@dkogan
Copy link

dkogan commented Aug 14, 2023

Hi. I pushed gtsam into Debian, and I see (among other things) that it doesn't
build on armhf and i386. Can one of you please take a look? It'll take yall much
less time to detangle this than it would take me.

Here's the i386 build log: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=i386&ver=4.2%7E9%2Bdfsg-4&stamp=1692035308&raw=0

The punchline is that when building gtsam/linear/tests/testSparseEigen.cpp
this happens:

In file included from ./gtsam/base/FastSet.h:28,
                 from ./gtsam/inference/Key.h:22,
                 from ./gtsam/inference/Ordering.h:23,
                 from ./gtsam/inference/EliminateableFactorGraph.h:26,
                 from ./gtsam/linear/GaussianFactorGraph.h:24,
                 from gtsam/linear/tests/testSparseEigen.cpp:21:
./gtsam/base/Testable.h: In instantiation of 'bool gtsam::assert_equal(const V&, const V&, double) [with V = int]':
gtsam/linear/tests/testSparseEigen.cpp:44:3:   required from here
./gtsam/base/Testable.h:99:26: error: incomplete type 'gtsam::traits<int>' used in nested name specifier
   99 |     if (traits<V>::Equals(actual,expected, tol))
      |         ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
./gtsam/base/Testable.h:102:21: error: incomplete type 'gtsam::traits<int>' used in nested name specifier
  102 |     traits<V>::Print(expected,"expected:\n");
      |     ~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
./gtsam/base/Testable.h:103:21: error: incomplete type 'gtsam::traits<int>' used in nested name specifier
  103 |     traits<V>::Print(actual,"actual:\n");

It's reproducible standalone, without cmake like this:

< cmake/dllexport.h.in sed \
' s/#cmakedefine/#define/g;
  s/@library_name@/GTSAM/g;
' > gtsam/dllexport.h

< gtsam/config.h.in sed \
' s/#cmakedefine/#define/g;
  s/@GTSAM_VERSION_MAJOR@/4/g;
  s/@GTSAM_VERSION_MINOR@/2/g;
  s/@GTSAM_VERSION_PATCH@/0/g;
  s/@GTSAM_VERSION_NUMERIC@/40200/g;
  s/@GTSAM_VERSION_STRING@/4.2a9/g;
  s/@GTSAM_EIGEN_VERSION_WORLD@/3/g;
  s/@GTSAM_EIGEN_VERSION_MAJOR@/4/g;
  s/.*define .*_USE.*_MKL.*//g;
  s/.*define GTSAM_EIGEN_VERSION_MINOR.*//g;
  s/.*define GTSAM_ALLOCATOR_BOOSTPOOL.*//g;
  s/.*define GTSAM_ALLOCATOR_STL.*//g;
  s/.*define GTSAM_SLOW_BUT_CORRECT_BETWEENFACTOR.*//g;
  s/.*define GTSAM_USE_QUATERNIONS.*//g;
' > gtsam/config.h

g++ \
  -DBOOST_ALL_NO_LIB \
  -DBOOST_ATOMIC_DYN_LINK \
  -DBOOST_CHRONO_DYN_LINK \
  -DBOOST_DATE_TIME_DYN_LINK \
  -DBOOST_FILESYSTEM_DYN_LINK \
  -DBOOST_PROGRAM_OPTIONS_DYN_LINK \
  -DBOOST_REGEX_DYN_LINK \
  -DBOOST_SERIALIZATION_DYN_LINK \
  -DBOOST_SYSTEM_DYN_LINK \
  -DBOOST_THREAD_DYN_LINK \
  -DBOOST_TIMER_DYN_LINK \
  -DNDEBUG \
  -I"." \
  -I"CppUnitLite" \
  -isystem /usr/include/eigen3 \
  -fstack-protector-strong \
  -Wformat \
  -Werror=format-security \
  -Wno-deprecated-declarations \
  -Wdate-time \
  -D_FORTIFY_SOURCE=2 \
  -DNDEBUG \
  -Wall \
  -fPIC \
  -Wreturn-local-addr \
  -Werror=return-local-addr \
  -Wreturn-type \
  -Werror=return-type \
  -Wformat \
  -Werror=format-security \
  -Wsuggest-override \
  -Wno-unused-local-typedefs \
  -o /tmp/tst.o \
  -c "gtsam/linear/tests/testSparseEigen.cpp"

Thanks much

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 15, 2023

I need to spin up an armhf box first, thanks for reporting! Also, do you also build the Python wrapper? Currently I think they need to be built together.

@dkogan
Copy link
Author

dkogan commented Aug 15, 2023 via email

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 16, 2023

@dkogan I think you only need to change the EXPECT(assert_equal()) to EXPECT_LONGS_EQUAL()

@dkogan
Copy link
Author

dkogan commented Aug 16, 2023

Thanks much. Appears to work. Doing another upload; let's see how it does...

@dkogan
Copy link
Author

dkogan commented Aug 17, 2023

OK, this issue appears to be fixed, but now the tests fail on armhf: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=armhf&ver=4.2%7E9%2Bdfsg-5&stamp=1692186176&raw=0

The other arches still have problems too: https://buildd.debian.org/status/package.php?p=gtsam

I'm going to have to come back to this later. If you want to work on some of these issues in the meantime, that would be great. The package tree is here: https://buildd.debian.org/status/package.php?p=gtsam

@dkogan dkogan closed this as completed Aug 17, 2023
@ProfFan
Copy link
Collaborator

ProfFan commented Aug 17, 2023

@dkogan I think we currently only support 64-bit platforms, is this supported on Debian?

@dkogan
Copy link
Author

dkogan commented Aug 17, 2023 via email

@dkogan
Copy link
Author

dkogan commented Aug 21, 2023

Hi. It looks like the failure mode on i686 and armhf is the same one: the build completes, but the tests fail: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=i386&ver=4.2%7E9%2Bdfsg-5&stamp=1692266835&raw=0

Since you don't need any new hardware on i686, can I get you to look into that? You'll be far more effective at debugging this than I. Thanks.

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 21, 2023

@dkogan I can look at it this week

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 21, 2023

Tagging @jlblancoc for this is a Debian-related issue

@jlblancoc
Copy link
Member

First, thanks a lot @dkogan for finally being brave enough to move this forward! :-)

All these errors are caused by assumptions and/or never-happened-yet situations about the different sizes of "int", "long", "size_t" in those "uncommon" archs, it's not related directly with Debian packaging or flags per se...

@jlblancoc
Copy link
Member

jlblancoc commented Aug 22, 2023

PS: The comment above was for build errors.

For failing tests, I cannot see a clear reason from the buildd logs. In the past, reasons for tests to fail in other projects on "uncommon" archs have been:

  • Undefined / uninitialized memory in some archs. I recommend running failing tests against valgrind. I added targets named make testFoo.valgrind to help debugging that. Also, temporarily, @dkogan , it would be great to add valgrind to Build-Depends and running make check_valgrind instead of make check to see all the valgrind reports from the build farm. Or, if you have Debian Developer access to the build boxes (I don't), try that locally without involving a complete DD upload...
  • Thresholds in numerical algorithms: many numerical methods where residuals, traces, etc. are checked (e.g. rank checking of matrix) have significant different numbers when run on i386 or other archs in comparison to amd64 / arm64, and thresholds may need to be updated or adapted, for all cases, or with conditional #if for those archs.

It's still a long track ahead, but it's worth it! 👍

@dkogan
Copy link
Author

dkogan commented Aug 22, 2023 via email

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 23, 2023

I think I know what is the issue, it's because we expect Key to be the same as Eigen::Index which is 32 bit in some archs. I would say now we have a failure case for #1522

@dkogan
Copy link
Author

dkogan commented Aug 23, 2023 via email

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 23, 2023

Fan Jiang @.***> writes:
I think I know what is the issue, it's because we expect Key to be the same as Eigen::Index which is 32 bit in some archs. I would say now we have a failure case for #1522
That looks like a good candidate. Are the patches in that PR more or less done? Should I try applying them to the Debian builds?

@dkogan Yes that would be a good test, let's apply and see what happens

@dkogan
Copy link
Author

dkogan commented Aug 25, 2023

The patches in #1522 fixed most of the test failures but not all. I just tried on armhf, and there's one failure still:

 69/221 Test  #69: testSymbolicFactorGraph ............***Failed    0.03 sec
Not equal:
expected:
: cliques: 4, variables: 6
expected:
- P( e0 l0 b0)
expected:
| - P( s0 | b0 l0)
expected:
| - P( t0 | e0 l0)
expected:
| - P( x0 | e0)
actual:
: cliques: 4, variables: 6
actual:
- P( e0 l0 b0)
actual:
| - P( t0 | e0 l0)
actual:
| - P( x0 | e0)
actual:
| - P( s0 | b0 l0)
./gtsam/symbolic/tests/testSymbolicFactorGraph.cpp:93: Failure: "assert_equal(asiaBayesTree, actual2)" 
There were 1 failures

Yall should test this all yourselves I think. I would start with i686. I suggest using a Debian install on your native arch (presumably amd64), crossing to i686. This is trivial to set up, and works without emulation. Let me know if you need help.

@dellaert
Copy link
Member

If you can root-cause it and suggest a fix/PR that I think would be the most helpful. Might be some cross-platform non-determinism that we have not yet encountered.

@dkogan
Copy link
Author

dkogan commented Aug 31, 2023

This has a lot of gtsam-specific complexity, and you can debug and fix this much faster than I can. Building on i686 is trivial for you to do (as described above), and you should start there. Let me know if you need help setting this up.

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 31, 2023

This seems to be just an order issue in the comparison of Bayes tree, these two trees are equivalent. @dkogan I think you can mask this test in i386 first and we can work on a fix to the comparison

@dkogan
Copy link
Author

dkogan commented Aug 31, 2023 via email

@ProfFan
Copy link
Collaborator

ProfFan commented Aug 31, 2023

Yes :)

@dkogan
Copy link
Author

dkogan commented Sep 1, 2023

Alright, I tried it. Build logs are here: https://buildd.debian.org/status/package.php?p=gtsam

Click on the red "Build-Attempted" to get the log for each particular arch. Notes:

  • It looks like there are two separate sets of tests. If the first set doesn't all pass, it doesn't even try to run the second set. This is a bug: it should run all the tests always so that we can get a clear idea of what's failing
  • armhf and armel pass all of the first set of tests (all 221 of them), but fails some of the second set. I don't know enough to comment about what the failures mean
  • i386 fails one of the first set of tests: it looks like it just barely misses some thresholds, so it's probably not a failure. But since it failed something in the first set, we don't know if any of the second set of tests would have failed

The other arches either succeeded, or haven't been built yet.

Can somebody please take a look? Thanks

@jlblancoc
Copy link
Member

  • The two sets of tests seem to be the C++ and the Python parts.
  • The threshold issues are handled in another PR now.
  • The python errors look to me like this, not caused by binder in this case, but just perhaps due to the use of int or other non-portable types (but perhaps I'm wrong, not familiar enough with gtsam python wrapper).

@ProfFan
Copy link
Collaborator

ProfFan commented Sep 5, 2023

Yep, seems like abuse of size_t when it should be uint64_t

@dkogan
Copy link
Author

dkogan commented Sep 5, 2023

OK. Once some form of #1625 is merged, I'll pull that into my patch stack.

And now that all the builds are done, I can see that some other arches failed in different ways. I think I fixed what mips64el is complaining about. Can somebody please look at the test failures for hppa and x32:

@ProfFan
Copy link
Collaborator

ProfFan commented Sep 6, 2023

@dkogan It's near conference deadline recently so no bandwidth for me, but after ICRA sub deadline I can look into the Python failures.

@dkogan
Copy link
Author

dkogan commented Sep 6, 2023 via email

@dkogan
Copy link
Author

dkogan commented Sep 8, 2023

#1625 was just merged, so this round of i386 issues should be fixed. Thanks! The armel, armhf ones would be unaffected though.

@berndpfrommer
Copy link
Collaborator

Today our automated packaging broke because of the introduction of libcephes-gtsam.so into the development branch. As the name suggests, it's a private build of libcephes since no debian package exists.
Didn't we get rid of a privately built libmetis because it would preclude gtsam from inclusion into debian?

@dkogan
Copy link
Author

dkogan commented Jan 1, 2024

Hi. If you're including more dependencies you didn't write, it would be great if you had a clear switch in the code to avoid using them.

@dkogan
Copy link
Author

dkogan commented Jan 9, 2024

Hi. I pushed updated the Debian package to the new 4.2.0 release. And the 32-bit architectures still fail tests. For i386: https://buildd.debian.org/status/fetch.php?pkg=gtsam&arch=i386&ver=4.2.0%2Bdfsg-1&stamp=1704797861&raw=0

The first failure:

======================================================================
ERROR: test_VisualISAMExample (testEssentialMatrixConstraint.TestVisualISAMExample.test_VisualISAMExample)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/obj-i686-linux-gnu/python/gtsam/tests/testEssentialMatrixConstraint.py", line 32, in test_VisualISAMExample
    factor = EssentialMatrixConstraint(poseKey1, poseKey2, E, model)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. gtsam.gtsam.EssentialMatrixConstraint(key1: int, key2: int, measuredE: gtsam.gtsam.EssentialMatrix, model: gtsam.gtsam.noiseModel.Base)

Invoked with: 8646911284551352321, 8646911284551352322, R:
 [
	0.969061, 0.218325, 0.115126;
	-0.196438, 0.964625, -0.175815;
	-0.149438, 0.14776, 0.977668
]
d: : 0.333333
-0.666667
 0.666667
, isotropic dim=5 sigma=0.25

This looks to me like it couldn't run the test code, not that it failed some threshold. Can one of you please take a look?

Thanks.

@ProfFan
Copy link
Collaborator

ProfFan commented Jan 10, 2024

@varunagrawal Is libcephes a header-only library? If so maybe it does not need to be built at all. We can make all function inline and remove the lib.

@dkogan You are right, this error is potentially a wrong dynamic_cast. There is a recent issue #1685 about similar stuff. I can look into this error this week.

@varunagrawal
Copy link
Collaborator

@varunagrawal Is libcephes a header-only library? If so maybe it does not need to be built at all. We can make all function inline and remove the lib.

Unfortunately it isn't and I don't have the bandwidth to go deep into updating it to be header-only (especially since there is no test suite).

@ProfFan
Copy link
Collaborator

ProfFan commented Jan 10, 2024

@varunagrawal

/**
 * @brief Compute the quantile function of the Chi-Squared distribution.
 *
 * The quantile function of the Chi-squared distribution is the quantile of
 * the specific (inverse) incomplete Gamma distribution.
 *
 * @param dofs Degrees of freedom
 * @param alpha Quantile value
 * @return double
 */
double chi_squared_quantile(const double dofs, const double alpha) {
  return 2 * igami(dofs / 2, alpha);
}

From a search this is the only place using cephes, is it correct?

@varunagrawal
Copy link
Collaborator

Yes, at least for now.

@varunagrawal
Copy link
Collaborator

I am not sure I understand what the issue with Cephes is here. @dkogan's latest comment seems to be referring to something else.

@ProfFan
Copy link
Collaborator

ProfFan commented Jan 10, 2024

Just thinking about how can we make packaging easier :)

@berndpfrommer
Copy link
Collaborator

Sorry I think I somewhat hijacked this issue, which was about gtsam not building on i386/armhf. The point I'm raising about the introduction of libcephes is in so far relevant that it may go counter to @dkogan 's efforts to make gtsam an acceptable Debian package (which the present issue is also concerned with). It was my understanding that shipping binaries of other libraries along with gtsam is less than ideal, see
the discussion related to libmetis.
But again, sorry for taking this thread off-topic.

@varunagrawal
Copy link
Collaborator

Just thinking about how can we make packaging easier :)

The two seem orthogonal? The reasoning for including cephes is to enable functionality (specifically of the GNC optimizer), so sacrificing functionality for the sake of packaging seems to be akin to getting a poor gift but really expensive wrapping paper. There is a single function call but behind the scenes, multiple files and functions are required to make it work.

That being said, if there was an easier way to enable GNC without Cephes, I would do it. Unfortunately, re-implementing code provided by Cephes in a manner that keeps up with the quality of GTSAM is out of the question for me since I have been working on it by myself without any help for many months, and I have other things on my plate. Unless someone is willing to step up and help with development, it feels a bit disingenuous to ask for changes to make others lives easier while simultaneously making mine harder.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants