Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rebuild for v18.2.x #16

Closed
15 of 16 tasks
bazaah opened this issue Jun 23, 2023 · 18 comments
Closed
15 of 16 tasks

Rebuild for v18.2.x #16

bazaah opened this issue Jun 23, 2023 · 18 comments
Labels

Comments

@bazaah
Copy link
Owner

bazaah commented Jun 23, 2023

This is a tracking issue for me to collect my thoughts / notes around the push for v18

Most notably, I likely will not be actually pushing a v18 build to AUR for at least a few patch versions, as typically Ceph finds+fixes serious issues shortly after a public release of the new version.


v18.1.x (TEST)

  • Build
  • Check
    • Skipping for now, seems even the upstream test suite isn't passing at the moment
  • Package

v18.2.x (RELEASE)

  • Build
  • Check
  • Package

Experiments

  • Switch to ninja for cmake -G (upstream has since v17.2.5)
    • Encountered some infinite loop 3 times when attempting to build. Not sure if I'm missing something, but holding off on further testing for now
  • Enable -DWITH_RBD_RWL=ON for the writeback cache (Build RBD persistent cache plugin #18)

Fixes

  • Include lvm2 in pkg.ceph-volume (1)
  • Investigate missing dependencies of pkg.ceph-mgr (1, 2)
    • This one is strange, I should probably take a look at the dep tree of my current ceph nodes as I don't seem to experience these issues myself, meaning I'm transitively fixing the problems somehow
  • Fix /etc/sudoers.d/ fakeroot perms (1)

Tests

  • run-tox-mgr-dashboard-lint
  • run-tox-cephadm
  • check-generated.sh
  • unittest_erasure_code_shec_arguments
  • unittest_bluefs
@bazaah bazaah added enhancement New feature or request upstream/v18 labels Jun 23, 2023
@bazaah
Copy link
Owner Author

bazaah commented Jun 23, 2023

Got my first successful build of v18.1.1 with the moral equivalent of ceph/ceph#52119 + ceph/ceph#51737, minus lots of intree patches that are no longer relevant

@bazaah
Copy link
Owner Author

bazaah commented Jun 23, 2023

Check failed, but not too worried about that for the moment

@bazaah
Copy link
Owner Author

bazaah commented Jul 7, 2023

Switching to ninja seems to trigger some sort of infinite loop in the build somewhere, continuously reading something. Not sure what is going on, but leaving that alone for now

@bazaah
Copy link
Owner Author

bazaah commented Aug 8, 2023

First stable version has released: https://github.com/ceph/ceph/tree/v18.2.0

@bazaah
Copy link
Owner Author

bazaah commented Aug 11, 2023

Ran into a fmt compile error, seems I need to implement the fmtlib specialization for ceph_le<T>. Need to investigate this some more, maybe find prior art I can use

@bazaah
Copy link
Owner Author

bazaah commented Aug 13, 2023

ran into a lot of fmt compile errors this weekend. Got to 81% in the build, but more work yet to be done.

@bazaah
Copy link
Owner Author

bazaah commented Aug 15, 2023

first ever successful build of v18.2 just completed. Likely there going to be lots of "fun" tests to fix, but I'm happy to say that a build with -DWITH_RBD_RWL=ON completed.

@bazaah
Copy link
Owner Author

bazaah commented Aug 16, 2023

The following tests FAILED:
         10 - run-tox-mgr-dashboard-lint (Failed)
         22 - run-tox-cephadm (Failed)
        142 - check-generated.sh (Failed)
        161 - unittest_erasure_code_shec_arguments (Failed)
        179 - unittest_bluefs (Subprocess aborted)

The last two are the most troubling. The first two seem like entirely failed lints (from the newer pylint), and the 3rd I'm not sure of yet

@bazaah
Copy link
Owner Author

bazaah commented Aug 20, 2023

I have fixes for check-generated.sh, and "fixed" (re-disabled) the lints in the first two.

However, I think the last two are serious, and caused by some change either in gcc or boost. Need more time to investigate them.

@bazaah
Copy link
Owner Author

bazaah commented Aug 21, 2023

As a side note, if anyone else is interested I've pushed a cleaned up patch for the fmtlib fixes in aa4476a, so you can now build v18.2.0 from the feature/v18.2.0-1 branch yourself.

@bazaah
Copy link
Owner Author

bazaah commented Aug 25, 2023

So, either ceph/ceph@844260f or ceph/ceph@2595143 cause the regression in unittest_erasure_code_shec_arguments. Unsure which; and maybe its both somehow. Confirmed to be the second, not sure what the issue is, yet


EDIT:

iff --git a/src/test/erasure-code/TestErasureCodeShec_arguments.cc b/src/test/erasure-code/TestErasureCodeShec_arguments.cc
index 075c6383eed..74403eaf6ed 100644
--- a/src/test/erasure-code/TestErasureCodeShec_arguments.cc
+++ b/src/test/erasure-code/TestErasureCodeShec_arguments.cc
@@ -86,12 +86,12 @@ void create_table_shec432() {
           continue;
         }
         if (std::popcount(avails) == 4) {
-         auto a = to_array<std::initializer_list<int>>({
+         std::vector<std::initializer_list<int>> a = {
              {0,1,2,3}, {0,1,2,4}, {0,1,2,6}, {0,1,3,4}, {0,1,3,6}, {0,1,4,6},
              {0,2,3,4}, {0,2,3,5}, {0,2,4,5}, {0,2,4,6}, {0,2,5,6}, {0,3,4,5},
              {0,3,4,6}, {0,3,5,6}, {0,4,5,6}, {1,2,3,4}, {1,2,3,5}, {1,2,4,5},
              {1,2,4,6}, {1,2,5,6}, {1,3,4,5}, {1,3,4,6}, {1,3,5,6}, {1,4,5,6},
-             {2,3,4,5}, {2,4,5,6}, {3,4,5,6}});
+             {2,3,4,5}, {2,4,5,6}, {3,4,5,6}};
           if (ranges::any_of(a, std::bind_front(cmp_equal<uint, int>, avails),
                             getint)) {
            vec.push_back(avails);

As it turns out, trying to cast an std::initializer_list to an array is undefined behavior. std::vector actually has a constructor for this, so use it instead.

@bazaah
Copy link
Owner Author

bazaah commented Aug 25, 2023

Promising solution in https://tracker.ceph.com/issues/58759 for unittest_bluefs

@bazaah
Copy link
Owner Author

bazaah commented Aug 27, 2023

Right, I'm moving to integration testing (= upgrading from v17 + standing up a new v18 cluster).

@bazaah
Copy link
Owner Author

bazaah commented Aug 28, 2023

Found this issue pyca/cryptography#9016, and it seems to be a problem beyond ceph: somehow python-cryptography (and other modules?) are attempting to initialize the rust bindings (?) multiple times which has been disallowed for soundness (?) reasons.

@bazaah
Copy link
Owner Author

bazaah commented Aug 28, 2023

I don't know if this is even fixable on my end as it doesn't seem to be a ceph specific issue. I'd have to completely isolate the python stack (e.g build + somehow run in a venv)

@bazaah
Copy link
Owner Author

bazaah commented Aug 28, 2023

The NOTIFY_TYPES messages seem legit... the modules don't define an attr like that in v18.2.0, mostly. It also doesn't seem to technically be an issue, as the code that checks for this has it's error ignored... so not sure what's up.

@bazaah
Copy link
Owner Author

bazaah commented Aug 28, 2023

Found this issue pyca/cryptography#9016, and it seems to be a problem beyond ceph: somehow python-cryptography (and other modules?) are attempting to initialize the rust bindings (?) multiple times which has been disallowed for soundness (?) reasons.

Following up on this, it seems the PyO3 maintainer has effectively decided to flat out restrict usage of PyO3 modules in embedded / multi interpreter contexts, like exists in the ceph-mgr machinery, per PyO3/pyo3#2346 (comment). This is somewhat irritating and effectively turns any module with PyO3 in its dep tree in a bomb.

@bazaah
Copy link
Owner Author

bazaah commented Aug 28, 2023

So. I'm likely going to lift all of this context into its own issue, and move forward with the v18 release, as I do not see a realistic method for fixing this myself.

I'd need to (either):

  1. Remove python-cryptography from all mgr related python code
  2. Create and maintain an extension to the ceph build for a stable venv, and ensure we use old enough versions of the affected modules to avoid hitting PyO3/pyo3@78ba70d
  3. Somehow work with the PyO3 maintainer (or python-cryptography) to fix this on their end

1 and 2 ultimately run into the same issue: eventually I will be forced to upgrade to some version of something that depends on >=0.17.0 of PyO3, and 3 seems untenable (PyO3/pyo3#2346 (comment)):

... The extensive redesign seems intractable ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant