
Container does not start after reboot #9

Closed
nilrog opened this issue Apr 25, 2016 · 22 comments

Comments

@nilrog

nilrog commented Apr 25, 2016

I applied the latest DSM upgrade a few days ago and thought that the container worked fine, since I am using the latest version of this script. But after 3 days I got an email from CrashPlan saying that my client had not connected for 3 days.

So today when I looked closer I noticed that the CrashPlan container had not mounted any of the user-specified volumes. I then recreated the container, and when I checked again they were mounted properly.

So as one more step I wanted to reboot the Synology to check if the container started properly. But it did not start at all.

The following is logged in the Synology's syslog:

Apr 25 09:05:10 diskstation synoddsm-hostd_SYNO.Docker.Container_1_list[12808]: synoProfile.cpp:285 Profile /var/packages/Docker/etc/.config not exist
Apr 25 09:05:10 diskstation synoddsm-hostd_SYNO.Docker.Container_1_list[12808]: synoProfile.cpp:285 Profile /var/packages/Docker/etc/.config not exist
Apr 25 09:05:11 diskstation synoddsm-hostd_SYNO.Docker.Container_1_list: SYSTEM:        Last message 'synoProfile.cpp:285 ' repeated 1 times, suppressed by syslog-ng on diskstation
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = []
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [adoring_banach]
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [crashplan]
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [crashplan]
Apr 25 09:05:11 diskstation synoddsm-hostd: SYSTEM:     Last message 'util.cpp:105 SharePa' repeated 1 times, suppressed by syslog-ng on diskstation
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [trusty-crashplan]
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = []
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [adoring_banach]
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [crashplan]
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [crashplan]
Apr 25 09:05:11 diskstation synoddsm-hostd: SYSTEM:     Last message 'util.cpp:105 SharePa' repeated 1 times, suppressed by syslog-ng on diskstation
Apr 25 09:05:11 diskstation synoddsm-hostd[12808]: util.cpp:105 SharePathGetByName failed, name = [trusty-crashplan]
Apr 25 09:05:14 diskstation root: == DSM 6.0 7321-2 finished boot up ==

If I manually execute /usr/local/etc/rc.d/S99crashplandocker.sh status it starts properly.

Any hints on what is going wrong and where to start looking?

@ghost

ghost commented Apr 25, 2016

I am just updating to DSM 6.0-7321 Update 3 this morning, so I'll definitely be on the lookout for issues on my end. I had problems on restart when I first upgraded to DSM 6, and it seemed to be due to symlinks being recreated on startup around the time I was trying to use them.

In the meantime, would you mind looking in /var/log/upstart/3rdparty-services.log to see if there are any useful Crashplan/Docker bits in there?

@ghost

ghost commented Apr 25, 2016

Quick update - I updated to Update 3 and things seem to be working properly (although the container didn't start on the first boot because the Docker daemon didn't seem to start soon enough). I did some digging through Google, the Synology forums and my NAS log files, and I can't find anything that appears relevant to the issue you're having. It's worth pointing out that your logs say there's no /var/packages/Docker/etc/.config file. I confirmed that I don't have that file either, but I don't get any log messages about it. I hope there's something in your /var/log/upstart/3rdparty-services.log that'll give us more to go on.

One other thing that may be worth a shot is reinstalling the Synology Docker package. I don't necessarily think that'll fix anything, but it's along the lines of a reboot as far as being an easy thing to do that might have a useful effect.

@nilrog
Author

nilrog commented Apr 25, 2016

Thanks for the input...and the awesome container :)

Took a look at the log file you mentioned and it sure looks like the Docker daemon isn't ready by the time the container start script wants to start the container. Here is the relevant part from the log file that shows the reboot I did this morning:

2016-04-25T09:04:53+0200 ==================== start 3rdparty-services ====================
2016-04-25T09:04:53+0200 Mon Apr 25 07:04:53 UTC 2016
2016-04-25T09:04:53+0200 Found the user variables file. Updating...
2016-04-25T09:04:53+0200 CRASHPLAN_DIR=/volume1/crashplan.app
DATA_DIR=/volume1/crashplan/backupArchives
2016-04-25T09:04:53+0200 USER_VOLUMES=-v /volume1:/volume1:ro
2016-04-25T09:04:54+0200 Cannot connect to the Docker daemon. Is the docker daemon running on this host?
2016-04-25T09:04:54+0200 No existing CrashPlan container found. Running image "ajkerrigan/crashplan".
2016-04-25T09:04:54+0200 Cannot connect to the Docker daemon. Is the docker daemon running on this host?

The issue above is probably not related. What happened is really weird: the container was up and running, but without any volumes mounted. When I installed the Synology update I just checked the output of the status report from the script, and since it said the container was running I thought everything was OK.

The update I installed was Update 2...I just received the email about Update 3. So I will try that soon and report what happens. But I guess that unless there is some change to this start script, the container will not start by itself. Or, depending on timing, it might manage to start properly.

@ebiancarelli
Contributor

That could be a permissions issue with the wrong user running the script. I notice the same thing when I'm not root and try to run the script (or docker commands).

FWIW, my install is working OK.

@nilrog
Author

nilrog commented Apr 25, 2016

After installing Update 3 the container was started...but again without mounting /data and /volume1 so it is not working properly now.

This is what is logged:

2016-04-25T22:01:26+0200 ==================== start 3rdparty-services ====================
2016-04-25T22:01:26+0200 Mon Apr 25 20:01:26 UTC 2016
2016-04-25T22:01:26+0200 Found the user variables file. Updating...
2016-04-25T22:01:26+0200 CRASHPLAN_DIR=/volume1/crashplan.app
DATA_DIR=/volume1/crashplan/backupArchives
USER_VOLUMES=-v /volume1:/volume1:ro
2016-04-25T22:01:27+0200 Cannot connect to the Docker daemon. Is the docker daemon running on this host?
2016-04-25T22:01:27+0200 No existing CrashPlan container found. Running image "ajkerrigan/crashplan".
2016-04-25T22:01:27+0200 Cannot connect to the Docker daemon. Is the docker daemon running on this host?
2016-04-25T22:01:27+0200 Waiting for following services: pkgctl-DownloadStation pkgctl-PhotoStation pkgctl-HyperBackup pkgctl-PHP5.6 pkgctl-Docker pkgctl-StorageAnalyzer pkgctl-WebStation pkgctl-DokuWiki 
2016-04-25T22:01:30+0200 Waiting for following services: pkgctl-DownloadStation pkgctl-PhotoStation pkgctl-PHP5.6 pkgctl-Docker pkgctl-WebStation pkgctl-DokuWiki 
2016-04-25T22:01:33+0200 Waiting for following services: pkgctl-DownloadStation pkgctl-PhotoStation pkgctl-Docker pkgctl-WebStation pkgctl-DokuWiki 
2016-04-25T22:01:36+0200 Waiting for following services: pkgctl-PhotoStation pkgctl-Docker 
2016-04-25T22:01:39+0200 Waiting for following services: pkgctl-PhotoStation pkgctl-Docker 
2016-04-25T22:01:42+0200 Waiting for following services: pkgctl-Docker 
2016-04-25T22:01:45+0200 Reload related services (flags=0x40000000)

Doing a "stop" and then "start" does not fix the problem. Btw, "status" shows Crashplan Service Info even though it is no longer running.

@nilrog
Author

nilrog commented Apr 25, 2016

FYI, I am root when I'm manually invoking the script. The log snippets I have posted tonight are from when the script is called by the system.

@nilrog
Author

nilrog commented Apr 25, 2016

It looks like the container goes bad...the only way I know of to get the container back in working order is to issue a recreate. Then I can see that the volumes are mounted properly.

@nilrog
Author

nilrog commented Apr 25, 2016

And I tried reinstalling the Docker package, but the warning message in syslog is still there.

@nilrog
Author

nilrog commented Apr 25, 2016

I did a manual reboot of the Synology from the command line...when the Synology comes back up, the container is not running. Manually starting it works, and everything is fine again.

So I'm seeing two issues here:

  1. Applying a Synology DSM update somehow corrupts the container so that the volumes are not mounted. Running a recreate is necessary to fix the problem.
  2. Rebooting the Synology causes the container to not automatically start. But a manual start is ok.

@ghost

ghost commented Apr 25, 2016

OK, I think one approach here is to slip a check/wait loop into the script: check whether the Docker socket (/var/run/docker.sock) is ready, and sleep/check every second until it's there. I'd want to set a timeout so we don't block system startup. Probably create a variable like DOCKER_DAEMON_TIMEOUT with a default value of 30 seconds (overridable as with the rest of the config options), and stop the loop if the socket is still not available by the end of the timeout period.
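Roughly, the loop could look like this (a sketch of the idea described above, not the actual commit; the variable name and socket path are taken from this comment):

DOCKER_DAEMON_TIMEOUT=${DOCKER_DAEMON_TIMEOUT:-30}  # seconds; overridable like the other config options
while [ ! -e /var/run/docker.sock ] && [ "${DOCKER_DAEMON_TIMEOUT}" -gt 0 ]; do
    echo "Waiting for the Docker daemon (timeout in ${DOCKER_DAEMON_TIMEOUT} seconds)"
    sleep 1
    DOCKER_DAEMON_TIMEOUT=$((DOCKER_DAEMON_TIMEOUT - 1))
done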

I'll get something together, probably tonight. This seems like a good thing to have in general.

@ebiancarelli
Contributor

I checked my /var/log as well and have the same messages as @nilrog. The check/wait loop in the script seems reasonable to me too.

ghost pushed a commit that referenced this issue Apr 26, 2016
Two related changes to address issue #9:

- The time it takes for the Docker daemon to be ready appears
  to be unreliable. Add a configurable wait period to give it
  some time.

- Running Docker commands as a non-root user isn't going to be
  effective. Test for this immediately - it provides an easy way
  to eliminate a class of problems and not waste anyone's time.
@ghost

ghost commented Apr 26, 2016

I added a timeout to solve what I think is the root cause of this issue, and a root check to prevent permissions issues from masquerading as other problems. Give the update a shot when you get a chance, and let me know how it goes.

Thanks for reporting this and including logs as part of your troubleshooting. I hope you'll be OK now, but if not I'm sure we'll figure it out together.

@ebiancarelli
Contributor

Did a quick test... Sync'd to the latest and then rebooted my DS.
/var/log/upstart/3rd...log shows:

CRASHPLAN_DIR=/volume1/crashplan
DATA_DIR=/volume1/CrashplanBackups
USER_VOLUMES=-v /volume1:/volume1:ro
2016-04-25T20:05:37-0600 Cannot connect to the Docker daemon. Is the docker daemon running on this host?
2016-04-25T20:05:38-0600 Waiting for the Docker daemon (timeout in 9 seconds)
No existing CrashPlan container found. Running image "ajkerrigan/crashplan".
2016-04-25T20:05:43-0600 Error response from daemon: Conflict. The name "crashplan" is already in use by container 84c9b327346f. You have to remove (or rename) that container to be able to reuse that name.
2016-04-25T20:05:43-0600 Waiting for following services: pkgctl-SurveillanceStation pkgctl-Docker 
2016-04-25T20:05:46-0600 Waiting for following services: pkgctl-SurveillanceStation 
2016-04-25T20:05:49-0600 Reload related services (flags=0x40000000)

My client won't attach. I manually stopped and started the container from the shell on the NAS. Client still won't attach. I recreated the container and client attaches immediately.

I wonder if the container should be recreated on a reboot/restart of the NAS?

@ghost

ghost commented Apr 26, 2016

Thanks for testing. It looked OK on mine, but that means something is still inconsistent. The script should only try to run the image and recreate the container if it can't find an existing container.

I'm wondering if the test I used (-e) may be at fault. I'm testing to be sure the socket file exists, but not confirming that it's a socket. I wonder if that's introducing a small window where the docker ps command might fail (and not provide a CONTAINER_ID), but subsequent commands work.
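For reference, the difference is a single test flag (a sketch of the change, not the exact diff from the branch):

[ -e /var/run/docker.sock ]   # old test: true as soon as the path exists, whatever type it is
[ -S /var/run/docker.sock ]   # new test: true only once the path exists and is a socket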

I switched the syntax in the daemon-timeout branch, but haven't pushed the update to master yet because I'll try to consolidate commits while we're still troubleshooting. @ebiancarelli or @nilrog , please pull from that branch and test if you get a chance. Thanks!

I'm going to test it a number of times locally in the meantime, and see if I can find any inconsistent behavior. My setup was working before though, so what I'd really like to see is someone else's setup working :).

@nilrog
Author

nilrog commented Apr 26, 2016

With the latest script from the daemon-timeout branch the container starts after reboot. But it is not working properly: the volumes are not mounted.

Here are the logs from the reboot attempt:

2016-04-26T09:28:25+0200 ==================== start 3rdparty-services ====================
2016-04-26T09:28:25+0200 Tue Apr 26 07:28:25 UTC 2016
2016-04-26T09:28:25+0200 Found the user variables file. Updating...
2016-04-26T09:28:25+0200 CRASHPLAN_DIR=/volume1/crashplan.app
DATA_DIR=/volume1/crashplan/backupArchives
2016-04-26T09:28:25+0200 USER_VOLUMES=-v /volume1:/volume1:ro
2016-04-26T09:28:26+0200 Cannot connect to the Docker daemon. Is the docker daemon running on this host?
2016-04-26T09:28:27+0200 Waiting for the Docker daemon (timeout in 9 seconds)
2016-04-26T09:28:28+0200 Waiting for the Docker daemon (timeout in 8 seconds)
No existing CrashPlan container found. Running image "ajkerrigan/crashplan".
2016-04-26T09:28:43+0200 Error response from daemon: Conflict. The name "crashplan" is already in use by container cbb25d11c9cb. You have to remove (or rename) that container to be able to reuse that name.
2016-04-26T09:28:43+0200 Waiting for following services: pkgctl-PhotoStation pkgctl-Docker
2016-04-26T09:28:46+0200 Reload related services (flags=0x40000000)

So it looks like it waits for Docker to be up. But then it can't find the existing container, so it tries to create a new one...and by the time the run command executes, the daemon does see the existing container, so the run fails with a name conflict.

It's a pity that the stock Docker implementation in the Synology is half-baked...i.e. that there is no official support for automatically starting containers upon reboot.

@nilrog
Author

nilrog commented Apr 26, 2016

After reading up a bit on Docker I don't think you need a start script at all. Since we manually create the container outside of DSM we can use the option --restart=always with the run command. That will tell Docker to automatically start the container at boot.

So as a test I modified the following line:
RUN_CMD="${DOCKER} run -d --net=host --name=crashplan ${VOLUMES} ${PORTS}"

into this:
RUN_CMD="${DOCKER} run --restart=always -d --net=host --name=crashplan ${VOLUMES} ${PORTS}"

and recreated the container. I also removed the symlink to the start script from /usr/local/etc/rc.d/ and issued a reboot.

When the system was back up the CrashPlan container was running and the volumes were mounted as they should.

Here you can see the 3rdparty logfile and the output of docker ps.

sh-4.3# tail -n 10 /var/log/upstart/3rdparty-services.log
2016-04-26T13:25:19+0200 Reload related services (flags=0x40000000)
2016-04-26T13:39:06+0200 ==================== start 3rdparty-services ====================
2016-04-26T13:39:06+0200 Tue Apr 26 11:39:06 UTC 2016
2016-04-26T13:39:06+0200 Waiting for following services: pkgctl-DownloadStation pkgctl-HyperBackupVault pkgctl-PhotoStation pkgctl-HyperBackup pkgctl-PHP5.6 pkgctl-Docker pkgctl-StorageAnalyzer pkgctl-WebStation pkgctl-DokuWiki
2016-04-26T13:39:09+0200 Waiting for following services: pkgctl-DownloadStation pkgctl-PhotoStation pkgctl-PHP5.6 pkgctl-Docker pkgctl-WebStation pkgctl-DokuWiki
2016-04-26T13:39:12+0200 Waiting for following services: pkgctl-DownloadStation pkgctl-PhotoStation pkgctl-Docker pkgctl-WebStation pkgctl-DokuWiki
2016-04-26T13:39:15+0200 Waiting for following services: pkgctl-PhotoStation pkgctl-Docker
2016-04-26T13:39:18+0200 Waiting for following services: pkgctl-PhotoStation pkgctl-Docker
2016-04-26T13:39:21+0200 Waiting for following services: pkgctl-PhotoStation pkgctl-Docker
2016-04-26T13:39:24+0200 Reload related services (flags=0x40000000)
sh-4.3# docker ps
CONTAINER ID        IMAGE                  COMMAND             CREATED             STATUS              PORTS               NAMES
c5fa27aad422        ajkerrigan/crashplan   "/sbin/my_init"     24 minutes ago      Up 2 minutes                            crashplan
sh-4.3# uptime
 13:42:00 up 3 min,  1 user,  load average: 1.15, 1.01, 0.45

As you can see your script is no longer called and yet the container is up and running :)

I have tried rebooting twice and the container came back up successfully both times with this modification (with volumes mounted).
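For anyone who wants to double-check that the recreated container picked up the policy, the restart policy can be read back with a standard docker inspect (the container name matches the run command above):

docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' crashplan
# should print "always"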

So unless there is an edge case I'm missing, I think you only need to provide a script that can create the container and show its status. There is no need to have a script installed in /usr/local/etc/rc.d/. Just modify the script as shown above, possibly removing the start/stop/recreate actions and instead providing create/delete/status, as sketched below. And then change the instructions so that the user only needs to run it once.
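A rough sketch of what that reduced script's dispatch could look like (the action names are the ones suggested above, and the variables are reused from the existing script; this is illustrative, not the actual change):

case "$1" in
    create)
        # --restart=always makes the Docker daemon bring the container back at boot
        ${DOCKER} run --restart=always -d --net=host --name=crashplan ${VOLUMES} ${PORTS} ajkerrigan/crashplan
        ;;
    delete)
        ${DOCKER} rm -f crashplan
        ;;
    status)
        ${DOCKER} ps -a --filter name=crashplan
        ;;
esac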

The only thing I haven't tested is if this survives a DSM upgrade. That I cannot test until a new update is available :)

@ghost

ghost commented Apr 26, 2016

That's great @nilrog :). While troubleshooting this issue it did seem like I was hackily working around --restart=always. I could have sworn I tried that in my first pass and it didn't work, but clearly that's wrong! That's great news, since this script is essentially a giant hack that I'd love to not need in the first place.

Thanks a lot for doing the normal thing and helping us both avoid a bunch of unproductive flailing.

@nilrog
Author

nilrog commented Apr 26, 2016

You're welcome! It feels good to contribute something back and not only use your solution and hope for fixes :)

This Docker solution is so much better than the "native" package for CrashPlan. That one worked great in the beginning, but the constant breakage from all the automated upgrades was a big hassle, as you also discovered before me (I was looking at doing this myself when I found your solution). It has been flawless so far...apart from when you have rebooted the NAS.

It could be that the restart option didn't work in DSM 5.x? But in DSM 6 it looks like it is working as it should. But please give it a good test run. I will keep this as is and wait for the next DSM upgrade to see how that works.

@ebiancarelli
Contributor

Confirmed @nilrog's solution. Thanks!

And Docker was upgraded from DSM 5 to DSM 6. I think DSM 5 had Docker 1.6 and DSM 6 has 1.9.1. I actually opened a bug against Synology to upgrade the Docker package, so maybe I helped? :-)

@ghost

ghost commented Apr 27, 2016

Thank you both very much, you guys are pros. I'm glad we're all up and running, but I also like that there's a way forward that should be more reliable for anyone (likely ex-patters folks) who stumbles across this. I wouldn't have even known how fragile it was if not for you two.

I'm thinking I'll rename the script file so it doesn't look like a startup script, then work in @nilrog's changes and try not to let the documentation lag too much. In the meantime, may your backups be smooth :).

@ericsvendsen

Was about to try this out when I noticed this issue thread. From what I can gather, the way to go now is to modify the line in S99crashplandocker.sh as @nilrog suggested, and then skip the symlink step in the installation instructions. Does that sum things up?

@ghost

ghost commented Dec 29, 2016

@ericsvendsen Yes, that should do the trick.

To really close out this issue, I'll need to do some script/documentation tweaks to clean things up and stop attempting to treat this as a startup script. To be honest my setup has been working without problems for so long it's easy to forget about the script until someone asks a question. So thanks, I think! :)

This issue was closed.