Attempt to fix broken installation on slow hardware / VPS #172

Merged
merged 1 commit into from Dec 11, 2016

Psycojoker
Member

@Psycojoker Psycojoker commented Aug 8, 2016

Hello,

As reported here or here, YunoHost's post-install fails on slow hardware/VPS
because slapd is too slow to restart itself after its regen-conf.

This patch is an attempt to fix this but I don't have a good testing
environment (my vagrant is too fast for that). Maybe testing that it's possible
to run something using the admin user could be a better test but I don't see
how to do it easily.

A workaround would be to use my patch to run this kind of operation as
root instead of admin, but this is a workaround, not a real fix (and this bug
could still generate other problems).

Cheers,

@endorama

Hello, thank you for this project. I wanted to try it out installing on a Raspberry Pi model 2 and I'm experiencing this problem.
I tried this patch, but it doesn't seem to work, as the same error is displayed.

For clarity, I applied this patch to the file /usr/share/yunohost/hooks/conf_regen/06-slapd, then performed the post install again. The same error shows up. Am I missing something?

@Psycojoker related to testing this, I believe you could make your vagrant "slower" by setting the VM to 1 CPU and reducing the Execution Cap. From the VirtualBox docs: "This setting limits the amount of time a host CPU spends to emulate a virtual CPU. The default setting is 100% meaning that there is no limitation. A setting of 50% implies a single virtual CPU can use up to 50% of a single host CPU. Note that limiting the execution time of the virtual CPUs may induce guest timing problems."
Don't know if this will be enough.
( I'm assuming you are using Virtualbox, but the same should be available for VMWare )

Here is /var/log/yunohost/yunohost-cli.log

@Psycojoker
Member Author

Psycojoker commented Aug 27, 2016

@endorama thanks for testing this patch, that's valuable input :)

I thought about slowing down the vagrant box too but never took the time to actually look into how to do that (I'm not really into virtualisation and that kind of stuff).

I'll probably switch my test to something like "wait until you are able to log in as the admin user", which is the actual cause of failure here.
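
A minimal sketch of the "wait until the admin user is usable" idea, written in Python rather than in the hook's shell. It assumes the check is a name-service lookup through pwd.getpwnam; the helper names (admin_is_resolvable, wait_until) and the timeout values are hypothetical and not taken from the patch.

```python
import pwd
import time


def admin_is_resolvable(username="admin"):
    """Return True if the system's name service can resolve the user."""
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False


def wait_until(check, timeout=30.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or `timeout` seconds have elapsed.

    `sleep` and `clock` are injectable so the loop can be tested without
    actually waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, wait_until(admin_is_resolvable, timeout=60) would block for up to a minute after slapd restarts. A successful lookup only shows that the user resolves, not that a login works, so it is a weaker test than actually running a command as admin.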

But ... hmm...

ALREADY_EXISTS: {'desc': 'Already exists'}

This error is weird, I've never seen it before. Don't know what to think about it right now.

@a1ex4

a1ex4 commented Aug 27, 2016

@endorama Could you try this image I made for the RPi 2? I used it again yesterday and did not run into any problem. :)

@endorama

endorama commented Aug 27, 2016 via email

@Psycojoker
Member Author

@likeitneverwentaway hey :)

That's great news! Could you publish somewhere the script/way you've generated those images? We are still looking for someone to join us and handle the RPi images, since the people who used to do that aren't around anymore :/

@endorama

I can confirm it is working as expected! @likeitneverwentaway thank you.

@Psycojoker
Member Author

@endorama I've pushed a new version that this time tries to wait for the admin user to be accessible. Are you still able to try it? That would be great!

@likeitneverwentaway my request still holds ;)

@Psycojoker Psycojoker force-pushed the fix_slapd_regenconf_on_slowhardware branch from 8d698f7 to 9a66a00 Compare September 4, 2016 07:57
@a1ex4

a1ex4 commented Sep 4, 2016

@Psycojoker Sorry :) My workflow was as follows:

  • install latest raspbian lite
  • Yunohost installation : I used a mix of the official guide and this one until I managed to get everything working for the main script and the post install. Sorry for being vague but I'm pretty sure everything I used was from these guides.
  • For the first boot script I used that one, I'm not entirely sure I edited it though... Checking it on my image before the first boot might be worth something.

I also edited and translated the official guide with my additions. Keep in mind that I only have a RPi 2 to play with, and apparently my image does not work flawlessly on the 3, I'm pretty sure this is because of the packages installation before running the main script, things differ here if you have a 3.

If I remember correctly, when I'm here the first command updates the metronome package with this version from Jerome, maybe this happens only on the rpi 2 ? I'm pretty sure it all comes down to this package for the other boards.

I'll be happy to maintain the Raspberry image! Well, at least for the 2; all the feedback I received was good. I think officially publishing this image (maybe with an announcement?) would be good for feedback and for the project. When should the image be updated? With each new release of Raspbian, or each major release of YunoHost?

@Psycojoker
Member Author

@likeitneverwentaway thanks a lot for your answer and your work, that looks really cool :)

I'm going to talk about that with the other active people in YunoHost, to try to find someone better suited for that than me (I'm more into Python dev). If you want, you can join us on the XMPP chatroom where most of us hang out: dev@conference.yunohost.org

I would be very happy to see at least a rpi2 image maintained again :)

@endorama

endorama commented Sep 7, 2016

@Psycojoker I tried the latest master this evening, on a clean raspbian image, but the same error appeared.

Here is the installation log: http://pastebin.com/kZGhc0XF
yunohost-cli.log is empty
yunohost-api.log has http logs + the "missing admin user error"

@alexAubin
Member

Is there any update on this ? This looks like an important issue to fix...

I can't find the Unknown 'admin' user in the logs of your last comment, @endorama, instead it seems to be a failed grep on /etc/ldap/slapd.conf ?

After investigating the issue on my side, I think we actually need to call the hook right after (or inside) tools_ldapinit(). As this issue points out, the bug can also occur here (regen_conf for ssl before creating the CA) and will require the admin user to exist (somewhere inside the black magic of the pre-callback weird stuff) - but the regen_conf for ldap will only occur later.

This TODO actually points to this as well 😛 (though the last words are swapped I think).

@alexAubin
Member

If we really want to be paranoid, we can also add a check somewhere after tools_ldapinit() that the admin user exists. Something like :

import pwd

try:
    # pwd.getpwnam raises KeyError when the user cannot be resolved
    pwd.getpwnam("admin")
except KeyError:
    raise MoulinetteError(...)

@Psycojoker
Member Author

Yes, this is an important issue but since I don't have any dev environment to test it I've stopped working on it (that and lack of free time :/).

This isn't a part of the code I'm familiar with, do you think you can fix this?

@alexAubin
Member

Yes, I'll have a look ASAP (though I'm no more familiar with this part of the code than you 😛), but I'll probably be too busy until tomorrow night.

@Psycojoker
Member Author

Thanks to the work of @alexAubin we now have a solution: 504baefd87a4

@alexAubin
Member

Actually, looking carefully at these logs posted on issue 463, the fix you proposed here might really be needed.

If you look at the log, you see that at the beginning of post-install, the admin user really does exist. Otherwise conf_regen/02-ssl would have crashed as in here, which is what fix #191 addresses. (You even see the Creating directory '/home/admin'.)

But later, conf_regen/06-slapd is called, and after it's done goes in conf_regen/09-nslcd which then crashes with ... sudo: unknown user: admin !

So we really do need this fix too. I don't know if the wait loop is a good solution, but it should do the trick. Maybe it can also be improved with an nscd -i passwd, like in #191. Maybe inside the loop?

@Psycojoker
Member Author

I would expect this to be a caching problem again, so yes, we probably need to put sudo nscd -i passwd in those files too.
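
A rough sketch of that cache flush, assuming it is called from Python with root privileges. nscd -i passwd is the command discussed above; the function name and the injectable which/run parameters are hypothetical and only there so the helper can be tested without nscd installed.

```python
import shutil
import subprocess


def invalidate_passwd_cache(which=shutil.which, run=subprocess.call):
    """Ask nscd to drop its cached passwd entries.

    Returns True if the command ran and exited with status 0, and False
    (without raising) if nscd is not found on PATH or the command fails.
    """
    if which("nscd") is None:
        # nscd is not installed (or not on PATH): nothing to invalidate
        return False
    try:
        return run(["nscd", "-i", "passwd"]) == 0
    except OSError:
        return False
```

Returning False instead of raising keeps the caller free to treat a missing nscd as a warning rather than a fatal error.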

@opi
Contributor

opi commented Nov 28, 2016

Note that this "nscd" trick is already there for the user_create and user_delete methods. Source: https://github.com/YunoHost/yunohost/blob/unstable/src/yunohost/user.py#L191

@alexAubin
Member

We should also push this, it's also a critical issue (like #191). I'll test it as soon as I have some time. We need to decide if we do an nscd -i passwd somewhere.

@julienmalik
Member

@alexAubin @Psycojoker not sure I follow this one. You added the small decision label but you are thinking about changing it to the nscd -i passwd trick.

You both are more aware of this stuff because you debugged it, so I'm following you.
From what I understood, this nscd cache flush does not pose problems (but instead solves some), so why not put it anywhere we play inside the LDAP around the users/admin entries?
I would not hesitate too much. Since it seems quite hard to test/reproduce, I would be OK with shipping the regenconf helpers in the next release with a few more well-placed nscd -i passwd calls, and seeing whether we still get this bug in user reports or whether that killed it.

@Psycojoker
Member Author

@julienmalik well, I'm personally not fully convinced that this patch is still needed but @alexAubin thinks so. I haven't taken the time to fully think about it so I'm trusting him on this one.

I would also be in favor of going "nscd -i passwd ALL THE THINGS", I just haven't taken the time to do so.

@alexAubin
Member

alexAubin commented Dec 5, 2016

TL;DR : fix works ! But we should address issue #656 which is probably the root cause.


So, I've been able to reproduce and pinpoint the issue observed in the log. It's been a long journey 😄 and I learned some stuff, so here's what I did, if that's of any interest.

What I did was prefix the post-install command with nice, a tool that gives a command a higher (or lower) CPU priority. I actually launched other high-CPU processes just to keep the CPU really busy:

sudo apt-get install mathomatic-primes --yes
nice -n -5 matho-primes 0 9999999999 > /dev/null &
nice -n -5 matho-primes 0 9999999999 > /dev/null &
nice -n -5 matho-primes 0 9999999999 > /dev/null &
nice -n -5 matho-primes 0 9999999999 > /dev/null &
nice -n -17 yunohost tools postinstall -d yunohostdev123.netlib.re -p yunohost --ignore-dyndns --debug

That way, the post-install would go super-fast while other processes (and in particular the LDAP restart) would go slower, which I expected to simulate "slow hardware", though I'm still not sure it really does. I encountered the following message in the lines after the slapd force-reload in 06-slapd (using this PR's branch):

61585 INFO + sudo su admin -c ''
61592 WARNING sudo: ldap_sasl_bind_s(): Can't contact LDAP server

So it looks like it cannot contact the LDAP server (i.e. still 'rebooting'?). On the next try (I put a delay of 0.01 s), sudo was working fine. Not really what I wanted to obtain: in the logs of the actual issue, admin was reported to be unknown several times. How did that happen? I played around a bit more, in particular trying to invalidate nscd's cache with the famous nscd -i passwd, but couldn't get admin to be reported as unknown...

Then I wondered, what if nscd wasn't started at all ? I added a service nscd stop before launching the postinstall, and here it goes ! Everything went fine, up to the post_regen_conf where every hook was crashing with unknown user : admin ! You don't even need a slow hardware for this to happen.

So my best guess is that it's related to issue #656 : nscd isn't in Yunohost's dependencies. On most debian setups, we got lucky nscd is there somehow (it's only in the "Recommends" of nslcd, as found by @opi) - but maybe on some particular hardware or image, nscd isn't there by default.

Good news is: the currently proposed fix properly works around this! (The first time, it displays a funny WARNING sudo: unknown uid 1007: who are you?, because 06-slapd itself is actually running as admin, which is unknown lol! Then it realizes admin truly exists, because slapd is fully up I guess.) But we could properly avoid triggering this situation by making sure nscd is installed, and running.

@Psycojoker
Member Author

But we could properly avoid triggering this situation by making sure nscd is installed, and running.

This makes me wonder if we shouldn't run prechecks before certain operations to verify that everything is running as expected (like a series of asserts in a programming language like Eiffel). For example, we should check that ldap/nscd/nslcd etc. are running before starting a hook_exec.

Thanks a lot for the tests.

@alexAubin
Member

This makes me wonder if we shouldn't run prechecks before certain operations to verify that everything is running as expected (like a series of asserts in a programming language like Eiffel). For example, we should check that ldap/nscd/nslcd etc. are running before starting a hook_exec.

Agreed, I was thinking about this and would be really in favor of doing this. Maybe not for every command, but at least for the postinstall which is a quite critical part. We could open a dedicated ticket on Redmine.
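
One possible shape for such a precheck, as a sketch only. It assumes a systemd-based system where systemctl is-active --quiet is available, and that the services are named slapd, nslcd and nscd; the function names and the default service list are hypothetical, not existing YunoHost code.

```python
import subprocess


def service_is_active(name):
    """Return True if systemd reports the unit as active.

    Assumption: systemctl is present; if it is not, the service is
    reported as not active rather than raising.
    """
    try:
        return subprocess.call(["systemctl", "is-active", "--quiet", name]) == 0
    except OSError:
        return False


def missing_services(required=("slapd", "nslcd", "nscd"),
                     is_active=service_is_active):
    """Return the list of required services that are not running."""
    return [name for name in required if not is_active(name)]
```

The postinstall could then refuse to start (or at least log a warning) whenever missing_services() returns a non-empty list, instead of failing later inside a hook with "unknown user: admin".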

@Psycojoker
Member Author

We need another opinion on this one.

@M5oul M5oul added this to the 2.5.x milestone Dec 11, 2016
Contributor

@opi opi left a comment

( untested, but trusted )
