configuration loss after upgrade from v2018.1 to v2018.1.x (ea9a69f) #1496
Comments
does this happen as well if you autoupdate to the same version of 2018.x again and again? like backdating /lib/gluon/release and then autoupdate -f
|
confirmed for "gluon-v2018.1-7-gea9a69f" wget http://192.168.1.1/cgi-bin/config in other words: it lost its node name. (and is in config-mode) |
to be more precise:
|
i just automated the testing and did not use the autoupdater this time, but only "sysupgrade". i tested this with qemu/x86. |
I am unable to reproduce this on ea9a69f when
|
But why after 96 tries? Maybe it only happens if the connection to the download server is bad |
should not happen. if the entire file is not downloaded -> no valid signature -> no sysupgrade should start. In other words: if that were the issue, we would really have a problem. (and no, i do not think so.) |
not able to reproduce on an 841v9 after 357 autoupdate -f |
i rebuilt my x86 image today on the "original" build machine (but manually, instead of jenkins - shouldn't matter though?) and i can't reproduce the issue anymore. |
despite the fact that i cannot reproduce it in a lab environment, last night i rebuilt gluon-v2018.1-7-gea9a69f and deployed it via the autoupdater webserver to 15 nodes. i lost the configs on more than 50% of the nodes (and i really do not know why, since exactly the same image, for example the one for the 841v9, did the job on a different machine >100 times flawlessly).
all those devices did the update from 2016.1.x to 2018.1.7 without problems. my hypothesis for the moment is:
|
reproduced "by accident" while autoupdating from 2018.1 to 2018.1 (same version)
[..]
|
I have a very similar problem with my mesh-independent autoupdater. At first I assumed there was a bug in my code, but closer inspection revealed what's going on: when the autoupdater invokes the sysupgrade script, the script creates a tgz archive of all files that should be preserved and stores it in ramfs. But under certain circumstances this fails. The busybox tar fails with the error message "gzip: short write". After patching busybox to print the actual
is the culprit. I'm not really sure why it fails with This seems to happen only with busybox tar. Including a gnu tar in my build "fixed" the issue for me. @Adorfer I'm seeing this error a lot on devices even running simple sysupgrades. I don't think it's related to the current problem. My devices always boot fine even with this error message. |
currently i am looking |
@TobleMiner have you experienced your problem with multiple Gluon versions or which one in particular? also, which of the error messages do you see a lot? |
@rotanid I can reproduce the problem pretty reliably with my mesh-independent autoupdater. Every single time the router updates without mesh connectivity it loses its config. It doesn't when updating with mesh connectivity though. I'm not seeing any error messages in particular. All interesting output is usually suppressed by |
@TobleMiner ok, so at least in your case the problem definitely is, that errors are suppressed instead of aborting or trying again... i guess this should be reported&fixed in openwrt upstream, but even then @NeoRaider is the right person to do it i guess |
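To illustrate the point about not suppressing errors: below is a minimal sketch, in Python rather than the shell of the real sysupgrade script, of creating a backup archive and refusing to continue unless it can be read back completely. The function name `make_config_backup` and the file names are hypothetical, and this is not how sysupgrade is implemented.

```python
import os
import tarfile


def make_config_backup(paths, archive_path):
    """Write `paths` into a gzip-compressed tar and verify it by reading it back."""
    expected = {os.path.basename(p) for p in paths}

    # Any error while writing (missing file, no space left, ...) raises here.
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in paths:
            tar.add(p, arcname=os.path.basename(p))

    # Reading the archive back exercises the gzip stream;
    # any exception propagates to the caller instead of being swallowed.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()

    missing = expected - set(names)
    if missing:
        raise RuntimeError("backup incomplete, missing: %s" % sorted(missing))
    return names
```

A caller would only start the firmware write once this function has returned without raising.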
do we see this in master as well? I did not succeed to see it there. |
@Adorfer if your theory about busybox would be correct, Gluon v2017.1.x would also be affected as there was no change to busybox in the lede-17.01 branch since december 2017 |
@rotanid The problem is not that there is no "retrying". There is no valid reason for a |
is it possible that just a few communities did really hard tests of 2017? |
@Adorfer there are thousands of nodes running v2017.1.x, including our ~550 which had no update problems. |
What would be the cleanest way to get the current busybox into 2018.1.x for a test run? After all, the openwrt/lede busybox carries a large number (dozens?) of patches. |
I just checked whether i could see any error messages. at least the kernel messages still occur.
|
Hey all, today I've had quite a long debugging session diving deep into the issue. The trigger for our problem is a bug in libuclient. Under certain circumstances calling
As soon as tar is called by the sysupgrade script fd0 is reused as the file descriptor for the targzfile:
Since gzip compression was requested the tar process is forked and execs gzip:
But the code responsible for passing the output of tar to gzip assumes that the fd of the targzfile will not be zero and overwrites fd0 with a pipe to tar. This results in gzip trying to write to the unidirectional read-only pipe to tar producing an EBADF. |
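The sequence described above can be replayed in a throwaway process. This is a minimal sketch in Python, not the actual C code of libuclient or busybox, and it assumes a POSIX system. The file name and the `run_demo` helper are made up for illustration. The child closes fd 0 (standing in for the uclient bug), opens the "archive" (which then receives fd 0, the lowest free descriptor), and performs the same two `dup2` calls a tar-to-gzip hand-off would perform.

```python
import subprocess
import sys

# Program run in a child process so the parent's own stdin stays untouched.
CHILD = r"""
import errno, os, tempfile
os.close(0)                                  # stand-in for the bug: stdin gets closed
path = os.path.join(tempfile.mkdtemp(), "sysupgrade.tgz")  # hypothetical name
out_fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
print("archive fd:", out_fd, flush=True)     # lowest free descriptor is handed out
r, w = os.pipe()                             # pipe carrying tar output to gzip
os.dup2(r, 0)                                # gzip stdin := pipe read end (clobbers the archive fd)
os.dup2(out_fd, 1)                           # gzip stdout := "archive", now really the read end
try:
    os.write(1, b"compressed data")
except OSError as e:
    os.write(2, ("write failed: %s\n" % errno.errorcode[e.errno]).encode())
"""


def run_demo():
    """Run the child and return its (stdout, stderr) as stripped strings."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip(), result.stderr.strip()


if __name__ == "__main__":
    out, err = run_demo()
    print(out)
    print(err)
```

On Linux the child should report that the archive was opened as fd 0 and that the final write failed with EBADF, matching the behaviour described in the comment above.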
so what can we do about that except an upstream fix at libuclient? any chance for a hotfix? |
@Adorfer Fixing libuclient looks like the only option to me. I'll write a patchset for it |
I've just sent a patch for uclient on the openwrt-devel mailing list and a patch for busybox on the busybox mailing list. With a little luck we will have those two bugs fixed very soon. EDIT: The fix for uclient has just been merged |
wow, great work @TobleMiner
two of these have been fixed, one (uclient) is merged into openwrt-18.06 and also lede-17.01. |
about sysupgrade. i'm referring to line numbers in the following file: https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/sbin/sysupgrade;h=a378b029500ac5981d504889b0b3e6af3cf92a0d;hb=HEAD please check if my assumptions are correct: |
I guess the whole sysupgrade code could use some refactoring. Especially some
hm, although i don't disagree, that's like saying "uh, those diesel cars have a bug in their exhaust system, let's replace them by new cars". and you didn't answer the question if my assumptions are correct ;-) |
hooray, we found the root cause, which led to the bugs described above: also, i created a patch/PR for sysupgrade, see the comment above: #1496 (comment) |
@rotanid I'd not go as far as calling that the root cause. It's more like the scenario where the bug really surfaces. The root cause is still buggy libuclient closing STDIN which is now fixed in https://git.openwrt.org/?p=project/uclient.git;a=commit;h=ae1c656ff041c6f1ccb37b070fa261e0d71f2b12 . |
@Adorfer @TobleMiner i just pushed an update to v2018.1.x branch to update to current upstream lede-17.01 - which contains the updated/fixed uclient. i think we should leave this issue open at least until we/i/someone also updated the master branch to current openwrt-18.06 |
as far as i understood, libuclient is used 1) directly by the autoupdater and 2) by busybox. does this LEDE bump cover both of them? additionally i would opt to have error handling in the autoupdater in case sysupgrade returns with an errorlevel. (i assume that this is not the case currently. -> freeze of the device, needs to be powercycled. Please correct me if i am mistaken.) |
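The kind of error handling asked for here could look roughly like the sketch below. It is in Python and is not the autoupdater's actual code; `run_checked` is a hypothetical helper. Note that a successful sysupgrade normally reboots the device and never returns, so only the failure branch would matter in practice.

```python
import subprocess


def run_checked(cmd):
    """Run `cmd`; raise with the exit status and stderr if it fails."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure so the caller can log, retry or reboot cleanly
        # instead of leaving the device in an undefined state.
        raise RuntimeError(
            "command %r failed with status %d: %s"
            % (cmd, result.returncode, result.stderr.strip())
        )
    return result.stdout
```

A caller would wrap the upgrade command in `run_checked` and decide in the `except RuntimeError` branch whether to retry later or restore normal operation.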
Is 2018.1 also affected by this bug? I have a report about this and am not sure right now. For us it would mean an immediate stop of the rollout, and it would certainly be relevant to communicate. Edit: Sorry, I was in a hurry and didn't realize I had written in German; I was switching between channels with different languages. Is this bug also present in 2018.1? I got a report regarding this and there is some uncertainty. |
I asked around in our community and one case was reported when using the web interface updater. It is unclear, though, whether the checkbox was perhaps not set... (that would also have been 2017.7/8 to 2018.1). If I interpret the bug correctly, it has only been triggered since the newer firmware and depends on the firmware being updated from. Edit: If I interpret the bug correctly, it's only being triggered by the new firmware (2018.1) and depends on the old firmware (where the autoupdater is run). |
Please have a look at the initial posting of this issue. "2018.1.x" is meant as "gluon branch 'v2018.1.x'" |
Hi, sorry for my misunderstanding. I was just unsure if 2018.1.x includes 2018.1 as well. There was a discussion. I did not see the topic in the freifunk board before, as I was following this thread for a couple of days, which resulted in internal discussions. Your post there is quite clear. Thank you. I just wanted to eliminate any misunderstanding. Sorry if this disturbed ;-). If you prefer, I can delete those messages to keep the feed clean. |
@Adorfer no, the busybox tar update has to make it into OpenWrt first, it is not covered yet. |
fixed via 0cb9888 15 days ago. |
just for confirmation: is it considered safe now to update to 2018.1 as I'm not sure I understand correctly if the issue is only on update from the broken version on. So should I better wait on 2017.1.x and update to the next gluon release or is it safe enough to go to 2018.1 as released before? |
|
@Adorfer what makes you think v2017.1.x is affected? |
~20 of our nodes are already running on the v2018.1.x branch.
until recently, they ran based on v2018.1 release tag.
when updating to ea9a69f, 4 of those nodes lost their configuration during the update. (so they start into config mode)
these were different devices (Futro S550, Archer C7, UAP AC Mesh, Archer C59) from two different targets (ar71xx, x86)
i was able to reproduce the issue using a Futro and more easily using a virtual machine. (x86-64)
i downgrade to the v2018.1 build via sysupgrade - and after that i upgrade via autoupdater to the build based on ea9a69f
the configuration loss never happens with the downgrade, only when doing the autoupdater-upgrade, but even then not every time.
here's the log before the node reboots into the updated firmware:
here's the boot log after this: