Make LDC master build itself, using ninja #275
Conversation
Thanks for your pull request, @joakim-noah!
buildkite/build_project.sh
Outdated
@@ -218,10 +218,14 @@ case "$REPO_FULL_NAME" in
;;

ldc-developers/ldc)
git submodule update --init
git submodule update --init --depth 1 runtime/druntime runtime/phobos
This should take much less bandwidth for the checkout in every CI run.
ansible/ci.yml
Outdated
@@ -118,6 +118,7 @@
- llvm
- mongodb-server
- moreutils # sponge et.al.
- ninja
FYI: I installed ninja-build on the respective machines.
buildkite/build_project.sh
Outdated
@@ -218,10 +218,14 @@ case "$REPO_FULL_NAME" in
;;

ldc-developers/ldc)
git submodule update --init
git submodule update --init --depth 10 runtime/druntime runtime/phobos
Are we sure that --depth 10 is always going to work and not result in "fatal: reference is not a tree: 41e9dece3fc6c2bc966cfa327ae8160f67a4c569", as with --depth 1?
No, I've had it fail with depths up to 15-25 too; I don't know git well enough to say why.
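For what it's worth, a minimal sketch of the failure mode, assuming the usual shallow-fetch behavior: the submodule checkout aborts whenever the commit pinned by the superproject is not within the shallow history that was fetched.

```sh
# Illustrative only: a shallow submodule fetch brings in just the last N commits
# of the remote branch, so checking out an older pinned commit can abort with
# "fatal: reference is not a tree: <hash>".
git clone https://github.com/ldc-developers/ldc && cd ldc
# Works only while the pinned druntime/phobos commits are still within the last
# 10 commits of their remote branches:
git submodule update --init --depth 10 runtime/druntime runtime/phobos
```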
Ah, then maybe just remove --depth for now and we can figure this out in a different PR?
Installed git versions on the clients are 2.7.4 and 2.11.0
So this attempts to improve robustness by checking that the LDC built with the tested DMD is able to build itself? I don't think that's needed; the regular C++ interop tests in the DMD testsuite should catch these potential issues (and many more). I don't recall any such badly miscompiled LDC with a host compiler passing the C++ interop tests (and would have added a test otherwise). Linking errors due to mangling changes have occurred though. Unfortunately, I don't recall any of those happening on Linux, but on OSX and Windows, so testing those two other platforms too would be nice in the long run.
I was referring to your comments when merging the dmd 2.081 frontend into ldc: "Wrt. crashing self-compiled LDC, I guess that's due to the numerous extern(C++) changes (vtable?), and apparently something not covered by dmd-testsuite yet. :/"
Heh yeah right, that was due to a breaking change of semantics (dtors of extern(C++) classes now virtual unless explicitly …).
Thinking about this some more, this is not quite the same as your original situation, where you were building the latest ldc with the latest ldc, whereas this buildkite CI builds the last ldc tag, 1.10 right now, with the latest dmd. So the self-bootstrap here uses the old …

In that case, this ldc 1.10 bootstrap simply becomes a stress test to make sure everything's working, arguably excessive beyond just building the ldc 1.10 stdlib and tests. However, it also presents no merging issues for dmd commits and won't require adapting ldc, as dmd master should always produce a working ldc 1.10. What the dmd testsuite covers is irrelevant, as the entire point of putting ldc on this buildkite CI is to potentially catch incompatibilities that may have slipped through the dmd testsuite. I think this stress test is worth it to make sure of that, but I can scale it back if you guys think otherwise. Let me know.

I don't know what's up with the CI not testing this pull anymore.
Try rebasing to master. I did a few improvements yesterday that among other things will always merge the PR with upstream/master. Unfortunately this wasn't done before.
FYI: it's still failing with:
…
Heh, have we found our first regression from adding LDC to the CI? ;)
Yep, that's what I meant. The dtor messing up the vtable was/is a (not really needed) LDC-specific addition in the D frontend, so compiling LDC with that dtor-semantics change would break LDC at runtime, and testing the built LDC would be necessary to detect it. Building itself again is one way of doing that (particularly useful for C++ interop), taking relatively little time, but of course by no means extensive.
Nope, that is the virtual-dtor breaking change, I remember it quite well - this kind of strange error can happen when the vtbl is messed up. ;) LDC 1.11 (with the dtor in question now being …)
OK, I see what you mean about ldc changing now: sometimes changes in dmd master require modifying the buildkite projects and not just delaying the dmd pull for later, something ldc doesn't really do since it almost never puts out point releases. I just tested building the latest stable release of ldc, 1.10, with the last point release of dmd, 2.081.1, on linux/x64 and reproduced this error, which shows we don't really care about building ldc with dmd.

However, this buildkite CI error does exhibit how we can flag such regressions early, so you don't have to go digging through dmd commits to figure out what happened when 2.081 is released, but will be notified early when that dmd change was made. We can then add a …

I suggest we hold off on this pull until the upcoming 1.11 release and, if it passes, merge this pull then, to give both the dmd and ldc teams greater visibility on these errors much earlier in the development process.
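For reference, a rough sketch of that local reproduction (the tag name, install-script usage and build steps are my assumptions, not copied from this thread):

```sh
# Hypothetical local repro: build ldc 1.10 with dmd 2.081.1 as the host compiler.
curl -fsS https://dlang.org/install.sh | bash -s dmd-2.081.1
source ~/dlang/dmd-2.081.1/activate      # puts dmd 2.081.1 on PATH
git clone --recursive --branch v1.10.0 https://github.com/ldc-developers/ldc.git
mkdir -p ldc/build && cd ldc/build
cmake -G Ninja ..                        # should pick up the dmd on PATH as the host D compiler
ninja
```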
Using lines 32 to 34 in 4f7c781 (embedded snippet):

We typically try to avoid checking out …
Yeah, that'd be nice indeed. Other common regressions (non-adapted C++ headers after modifying a D file) sadly won't be caught this way. Building LDC master would probably be the most feasible option.
Absolutely. Self-bootstrapping is CI-tested for Linux (Ubuntu 14.04), OSX and Windows, so that should be pretty robust.
At the time 1.10 was released, DMD was at 2.080.1, and all was good. At the time DMD v2.081 was released, LDC master was already at v1.11/2.081 incl. the vtable fix. So our 2 Travis jobs (Linux/OSX) using latest DMD as host compiler worked and still work.
OK, updated this pull to build master, let's see what the CI says.
Alright, finally got ldc master to build itself and added the doubled build time to …
Nice work!
buildkite/build_project.sh
Outdated
git submodule update --init
cd runtime && git clone --depth 1 https://github.com/ldc-developers/druntime
git clone --depth 1 https://github.com/ldc-developers/phobos
cd .. && mkdir bootstrap && cd bootstrap
Wait, so you're using the latest ldc heads for the 2 submodules? That's not really master and might be problematic at some point (because untested in this particular combination).
Not sure what you mean; the last commit of the ldc branches of druntime and phobos is usually what ldc master checks out. Rather than have them checked out by git submodule, I'm checking out each ldc branch manually so we can shallow clone them and save bandwidth.
That's the point, not always, and we know how unnerving unrelated CI breakages are. Plus I think there's a way to shallow clone with …
I thought we were going to keep those submodule hashes updated from now on? ;) As for using …
Heh yeah, hopefully we are - but still, there's the chance they are out of sync for some time (serial pushes to the different repos, while an ldc commit itself is atomic, e.g., submodules being updated while the main ldc repo was being cloned). Downloading a zipped archive from GitHub by commit hash could work, e.g., …
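As a sketch of that archive idea (the exact URL in the comment is elided; this just uses GitHub's generic archive endpoint, and ignores whether the build later needs real git checkouts for the submodules):

```sh
# Hypothetical sketch: download a snapshot of a submodule at the exact commit
# pinned by the ldc superproject, so no git history is transferred at all.
pinned=$(git ls-tree HEAD runtime/druntime | awk '{print $3}')   # the gitlink hash
curl -fsSL "https://github.com/ldc-developers/druntime/archive/${pinned}.tar.gz" | tar xz
rm -rf runtime/druntime && mv "druntime-${pinned}" runtime/druntime
```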
git submodule update --init
cd runtime && git clone --depth 5 https://github.com/ldc-developers/druntime
git clone --depth 5 https://github.com/ldc-developers/phobos
cd .. && git submodule update
Alright, this should address your concerns.
Unrelated CI failure, LDC builds fine.
The failure was due to a regression that managed to sneak into Phobos - see e.g. #286.
I restarted the build.
Btw, this company provides a free build node with unlimited builds for OSS projects (builds cannot run concurrently), including one running Windows. Don't know if it can tie into Buildkite, but @kinke set up their linux/AArch64 node for LDC and it's working well.
Ping, this should be ready to go, especially since ldc master is primed for the 1.11 tag.
I'm not really convinced about --depth 5. Sorry.
@@ -206,10 +210,16 @@ case "$REPO_FULL_NAME" in
;;

ldc-developers/ldc)
git submodule update --init
cd runtime && git clone --depth 5 https://github.com/ldc-developers/druntime
git clone --depth 5 https://github.com/ldc-developers/phobos
If you push more than five commits to LDC's druntime or Phobos fork, this will fail.
AFAICT it's possible to tell git to do shallow submodule clones (e.g. https://stackoverflow.com/questions/2144406/how-to-make-shallow-git-submodules), but if that doesn't work, let's rather do the normal submodule checkout and waste those 2 seconds than run into random failures from time to time.
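For reference, the two shallow-submodule variants from that link look roughly like this (the .gitmodules key names assume the submodule names match their paths):

```sh
# Variant 1: pass --depth directly to the submodule update (as in earlier
# revisions of this PR).
git submodule update --init --depth 1 runtime/druntime runtime/phobos

# Variant 2 (sketch): mark the submodules as shallow in .gitmodules so later
# updates default to a depth-1 fetch.
git config -f .gitmodules submodule.runtime/druntime.shallow true
git config -f .gitmodules submodule.runtime/phobos.shallow true
git submodule update --init
```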
If you push more than five commits to LDC's druntime or Phobos fork, this will fail.
No idea what this means; they have a lot more than five commits. What this does is simply pull the last five commits; it has nothing to do with pushing.
AFAICT it's possible to tell git to do shallow clones
Read your own SO link from before, which I saw before you pointed me to it: it doesn't work that well. It kinda, sorta works if you're only checking out HEAD or a tag or something (I lost track), but people finally resort to what I did here, cloning each submodule you want separately and manually.
waste those 2 seconds
Time is not the issue; it's bandwidth, as I've said before in this thread. Shallow cloning saves something like 97% of the bandwidth needed for the submodules, which adds up across thousands of CI runs, which is presumably why you shallow clone all these other CI repos too.
If this setup ends up being flaky, we can always stabilize it later.
No idea what this means; they have a lot more than five commits. What this does is simply pull the last five commits; it has nothing to do with pushing.
But I think it could happen that LDC master uses different checked-out sub-repository commits? If not, that's great, but then --depth 1 would be even better.
No, the ldc forks of druntime and phobos each have an ldc branch, which is the only branch that ldc master checks out from, and it is the branch that will be checked out for each submodule here. --depth 1 may not work if someone commits to one of the submodule ldc branches and ldc master tries to check out an older submodule commit before the submodule hash is updated, so 5 tries to account for that. See the discussion above.
may not work if someone commits to one of the submodule ldc branches and ldc master tries to check out an older submodule commit before the submodule hash is updated, so 5 tries to account for that.
Yes, this is my point: if the Phobos repository gets updated to a new version, it might contain a lot of commits, e.g. Phobos currently has about 200 commits between v2.081.0 and stable (dlang/phobos@v2.081.0...stable). This means that as long as the frontend update PR at the LDC repository hasn't been merged yet, the PR build will only clone the latest five commits and fail because the pinned commit hash can no longer be found.
Hence, I don't like the --depth 5 idea and would prefer to do such an optimization only if it really turns out to be a huge issue. Yes, bandwidth is a concern, but the real gain and savings can be made if we start to cache the git repositories on the build agents.
I'm not really convinced about --depth 5. Sorry.
Me neither, I guess it's just a question of time until it breaks. If I understood that --depth setting correctly, the clone doesn't work if the commit isn't one of the last 5 (on any branch?). So if the main LDC repo is cloned while I'm merging a new frontend + libs in the meantime, the druntime/Phobos ldc heads will have been advanced by something like 100 commits, and cloning the submodules this way afterwards will fail.
You merge a new frontend once every two months; are we really going to worry about this?! 99.999% of the time, probably even during merges, it is within 5 commits of the ldc HEAD.
That would be best, of course.
Huh? Aren't the frontend merge PRs open for one to two weeks?
To me, everything that makes a CI build flaky is pure hatred, so if we know in advance this can happen, especially with the many DMD CI builds, it should definitely be avoided IMO, even if the chance is limited to a few seconds every 2 months.
AFAICT they provide just a CI infrastructure like Travis, CircleCI etc. OTOH, as we're reaching the limits of most free or open source tiers, Buildkite gives us an inexpensive option to increase the parallelism, as just paying for a few boxes is a lot cheaper than the tiers of the CI companies (for example, 6 of the 10 agents run on two very cheap VPS boxes for which I paid $39/year each).
And your point is? None of that time was on the …
That is flat-out nuts, and we don't even know if the merges will cost those few seconds.
I'm not offended. I lose way too much time checking faulty CI runs, and every little bit counts. If you're so keen on saving bandwidth, then do it properly by downloading the archive as suggested earlier.
Good, because that wasn't my intent. It's just an impossible standard, particularly given that the CI is down much more often already.
Your time is worth a lot more than bandwidth or CI time, no doubt, but you're not running this CI, so it will cost you zero. ;) For Seb or whoever is responsible, I'd like to see that suggested flakiness demonstrated, which I don't think it will be.
That seems like such a kludge, especially for such negligible concerns with the current approach.
Buildkite jobs automatically restart themselves up to three times if they fail initially, which does reduce a lot of flakiness. If it's just a few minutes we're talking about here, I would be fine, but AFAICT, and as mentioned earlier, it would be more like two weeks for the frontend upgrades, no?
No, check the branches used in those pulls you linked. Do you think we keep master broken for two weeks? :)
Ah, that's alright then. It should be at most 1 minute every 2 months, as we use a temporary …
Ah, fair enough. I wasn't aware of this, but then it could be …
There are some PRs with a couple of druntime/Phobos advancements, so to account for those, 5 seems reasonable.
Ah okay, now I fully understand. However, if that's the case I would feel more comfortable with …
No, 50 is ridiculous given the numbers I gave you above. If we find it to be a problem, we can always bump it up later.
Well, that's always going to be harder to debug and find out about. See https://buildkite.com/dlang/dmd/builds/667#beb00ca1-866c-44e0-ae1e-656e064a0a2c. And now imagine that we have about 50 projects in the Project Tester (plus a few of our own targets).
FYI: Temporarily disabled LDC and opened a revert PR, so that we can dig into the failure without affecting the CI infrastructure: #299
No, it isn't: you just bump it up to 10, then 20, etc. till it's fixed.
As explained before, you won't have to dig into anything every two months. You seem to have problems reading what is written, so I'm done repeatedly explaining all this to you. Pull or don't pull, I don't care at this point. I'm not responding anymore, as I just don't want to deal with these nonsense objections anymore.
@joakim-noah There's no need to get confrontational. Everyone involved here sees clearly the benefit of doing more LDC-related testing, and we all want to help make it happen as it benefits everyone: LDC devs won't have to deal with certain types of regressions after the fact, and dmd/druntime/phobos devs will have more assurance that PRs won't cause unexpected breaking changes. (Of course, no finite amount of testing can prove the absence of bugs, but it does help.)

The reason @wilzbach seems to act more conservatively than necessary from your PoV is that during the last year there have been a number of breakages to the project tester, and when they're not fixed in a short time frame, developers start doubting the project tester's authority and begin ignoring it, and then regressions start sneaking in, making it even harder to restore the project tester to a green state. The probability of a random project failing is still super high (just look at the build history of the past several months) and we can't afford not to take steps to reduce this probability.

We're all contributing in our free time, and given that @wilzbach has done the majority of the work by himself during the past several months to keep the project tester green, I think it makes sense to respect his wish to take extra steps to ensure the system runs hands-off. (Of course, not to underestimate Martin, who put the foundations in place, and other part-time contributors like yourself.)
I agree that the issue under discussion is minor. What I find frustrating is that I keep repeating the same things and they're not registering. I repeatedly noted that frontend merges are not worked on in the master branch, yet Seb kept acting as though they were, until Martin K. fully spelled it out for him.

You guys are acting as though git issues will take a bunch of debugging. They won't, they're trivial. You simply see that a checkout didn't work, you reproduce it yourself with a shallow clone, then bump the number of commits till it works. That's it. And that's if it causes problems at all, which I'm saying it won't. If ldc causes any problems, put it on me to fix them. I'm fine with taking responsibility, …
It's not that it may be something hard to debug for people aware of the CI infrastructure, but that issues may appear when builds are triggered from totally unrelated pull requests to other repos. We want to minimize the time a (potentially first-time) contributor's PR fails for no apparent reason and blocks their work. Even if you assume that most dmd/druntime/phobos contributors know all about the repos they're contributing to (in practice I think most people are oblivious to how the build or CI systems are set up), you can't expect them to know how each project (of the growing Buildkite project list) is set up, and even less so to go ahead and debug the issue when no one is around to fix it at the moment. And even if they do, their contributing experience becomes something like this: https://www.youtube.com/watch?v=AbSehcT19u0 😄
I don't doubt that you'll be responsive and help fix any issues; it's just that I still think the discussion here is getting blown out of proportion, and I'm fine with merging the PR as it is. I'm just pointing out that @wilzbach has the full right to be conservative and hard to convince, as a higher priority for him is making sure that the system works as much as possible without his (or anyone's) intervention.
All this is answered above: it's not going to break. I've given you the reasons why; simply asserting that it might is nonsense. On the 0.001% chance it does, it's an easy fix. And all this hand-wringing is frankly ridiculous considering that the Buildkite logs aren't available in the first place. That's a much bigger usability issue for contributors to be dealt with than this (I know you can only wait for Buildkite, but it is still much bigger).
Ping, there is no reason to delay this pull because of unreasonable concerns about the LDC stdlib falling out of sync. The LDC stdlib receives very few pulls, maybe 5-6 per month, as most work is directed upstream. Concern that it will break this CI is misplaced.
Since this pull is being ignored, closing.
Thanks, if this pull causes flakiness, I will fix it. Sorry if I dug in on the bandwidth issue, but I hate all the automated waste that goes on these days.
@kinke noted when updating ldc that the issue was with a self-compiled ldc crashing because of some C++ interfacing issues, so I tried to recreate that as much as possible. This will make this test run more than twice as long, but I'm not sure the previous version would have caught C++ interfacing issues.