arangod silently exits #1426

Closed
sleepycat opened this issue Aug 5, 2015 · 14 comments
@sleepycat
Contributor

I installed Debian Jessie on my Raspberry Pi as suggested in the Google group.
The install seems to go well but I cannot connect to arangodb after:

sar@jessie-rpi:~$ wget https://www.arangodb.com/repositories/raspbian/arangodb-2.6.3-raspbian.deb
...
sar@jessie-rpi:~$ sudo dpkg -i arangodb-2.6.3-raspbian.deb
...
sar@jessie-rpi:~$ curl -v localhost:8529
* Rebuilt URL to: localhost:8529/
* Hostname was NOT found in DNS cache
*   Trying ::1...
* connect to ::1 port 8529 failed: Connection refused
*   Trying 127.0.0.1...
* connect to 127.0.0.1 port 8529 failed: Connection refused
* Failed to connect to localhost port 8529: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 8529: Connection refused

I can see that the arangodb supervisor is running but no instances of arangod are around:

sar@jessie-rpi:~$ ps aux | grep arango
arangodb  2256  0.0  0.4  35412  4324 ?        Ss   16:37   0:00 arangodb [supervisor]
arangodb  3395 69.2  7.9 466084 74984 ?        Sl   16:49   0:04 arangodb [supervisor]
sar       3418  0.0  0.0   2060   540 pts/0    S+   16:49   0:00 grep arango

sar@jessie-rpi:~$ sudo systemctl --all | grep arango
  arangodb.service           loaded    active   running   LSB: arangodb

The cause of this appears to be that /usr/sbin/arangod silently exits after about 20 seconds.

sar@jessie-rpi:~$ sudo /usr/sbin/arangod -c /etc/arangodb/arangod.conf --pid-file /var/run/arangodb/arangod.pid            
2015-08-05T17:17:01Z [5947] INFO ArangoDB 2.6.3 32bit -- ICU 54.1, V8 3.31.74.1, OpenSSL 1.0.1k 8 Jan 2015
2015-08-05T17:17:01Z [5947] INFO using default language 'en'
2015-08-05T17:17:02Z [5947] INFO loaded database '_system' from '/var/lib/arangodb/databases/database-80211'
2015-08-05T17:17:02Z [5947] INFO using endpoint 'tcp://127.0.0.1:8529' for non-encrypted requests
2015-08-05T17:17:02Z [5947] INFO using default API compatibility: 20600
2015-08-05T17:17:02Z [5947] INFO JavaScript using startup '/usr/share/arangodb/js', application '/var/lib/arangodb-apps'
sar@jessie-rpi:~$

Nothing is printed in the logs:

sar@jessie-rpi:~$ sudo tail /var/log/arangodb/arangod.log
2015-08-05T17:16:27Z [5918] INFO loaded database '_system' from '/var/lib/arangodb/databases/database-80211'
2015-08-05T17:16:28Z [5918] INFO using endpoint 'tcp://127.0.0.1:8529' for non-encrypted requests
2015-08-05T17:16:28Z [5918] INFO using default API compatibility: 20600
2015-08-05T17:16:28Z [5918] INFO JavaScript using startup '/usr/share/arangodb/js', application '/var/lib/arangodb-apps'
2015-08-05T17:17:01Z [5947] INFO ArangoDB 2.6.3 32bit -- ICU 54.1, V8 3.31.74.1, OpenSSL 1.0.1k 8 Jan 2015
2015-08-05T17:17:01Z [5947] INFO using default language 'en'
2015-08-05T17:17:02Z [5947] INFO loaded database '_system' from '/var/lib/arangodb/databases/database-80211'
2015-08-05T17:17:02Z [5947] INFO using endpoint 'tcp://127.0.0.1:8529' for non-encrypted requests
2015-08-05T17:17:02Z [5947] INFO using default API compatibility: 20600
2015-08-05T17:17:02Z [5947] INFO JavaScript using startup '/usr/share/arangodb/js', application '/var/lib/arangodb-apps'
@jsteemann
Contributor

Can you try starting it manually from the command-line with option --javascript.v8-contexts 1?
If that doesn't help, can you try setting ulimit -c unlimited and try again? It would be interesting to see if the process writes a core file and what its exit code was (echo $?).

@sleepycat
Contributor Author

Here you go! For some reason ulimit is not installed. Not sure what to make of that yet.

sar@jessie-rpi:~$ sudo /usr/sbin/arangod -c /etc/arangodb/arangod.conf --pid-file /var/run/arangodb/arangod.pid --javascript.v8-contexts 1
2015-08-05T20:18:43Z [12162] INFO ArangoDB 2.6.3 32bit -- ICU 54.1, V8 3.31.74.1, OpenSSL 1.0.1k 8 Jan 2015
2015-08-05T20:18:43Z [12162] INFO using default language 'en'
2015-08-05T20:18:44Z [12162] INFO loaded database '_system' from '/var/lib/arangodb/databases/database-126282'
2015-08-05T20:18:44Z [12162] INFO running WAL recovery (1 logfiles)
2015-08-05T20:18:44Z [12162] INFO replaying WAL logfile '/var/lib/arangodb/journals/logfile-257354.db' (1 of 1)
2015-08-05T20:18:44Z [12162] INFO WAL recovery finished successfully
2015-08-05T20:18:44Z [12162] INFO using endpoint 'tcp://127.0.0.1:8529' for non-encrypted requests
2015-08-05T20:18:44Z [12162] INFO using default API compatibility: 20600
2015-08-05T20:18:44Z [12162] INFO JavaScript using startup '/usr/share/arangodb/js', application '/var/lib/arangodb-apps'
2015-08-05T20:18:47Z [12162] INFO In database '_system': Database is up-to-date (20603/prod/standalone/existing)
sar@jessie-rpi:~$ echo $?
135
sar@jessie-rpi:~$ sudo ulimit -c unlimited
sudo: ulimit: command not found
sar@jessie-rpi:~$ sudo apt-get install ulimit
Reading package lists... Done
Building dependency tree       
Reading state information... Done
E: Unable to locate package ulimit
sar@jessie-rpi:~$

@jsteemann
Contributor

An exit code of 135 is probably SIGBUS:
135 = 128 + 7 (7 = SIGBUS)

This is probably V8 triggering this. The position at which it fails also supports that hypothesis.

Regarding ulimit and core dumps: I think ulimit is a bash built-in, so it may work by omitting the sudo in front of it.

@sleepycat
Contributor Author

Oops! You are correct. ulimit says it's already set to unlimited.

@jsteemann
Contributor

cat /proc/sys/kernel/core_pattern will tell the location where coredumps are written to. If it starts with a | then the coredump will be piped to the executable specified after the |. I am not sure what the default is on a Raspberry Pi, and which default limit it will use for /proc/sys/kernel/core_pipe_limit (there may be an upper bound on the corefile size). Maybe they need to be adjusted in order to produce a core dump.

@sleepycat
Contributor Author

@jsteemann The core dump is pretty big so I sent you a link. Check your email.

@dothebart
Contributor

Hm, the package installs just fine on our Cubie truck. The core dump doesn't produce useful information about the offending thread, except that it was killed by SIGBUS. Which OS image did you use?

@dothebart
Contributor

OK, after upgrading the box, I can reproduce this:
Program received signal SIGILL, Illegal instruction.
_armv7_tick () at armv4cpuid.S:94
94 armv4cpuid.S: No such file or directory.
(gdb) bt
#0 _armv7_tick () at armv4cpuid.S:94
#1 0xb6e17872 in OPENSSL_cpuid_setup () at armcap.c:157

(it was somewhat behind since it was installed when jessie was still testing)

Seems OpenSSL is broken...
[Update:] that SIGILL is from OpenSSL probing, but after continuing, the real crash appears...

@jsteemann
Contributor

@sleepycat : as @dothebart mentioned, we were able to reproduce it locally with the 2.6.3 package after upgrading the OS. Interestingly enough, the issue didn't occur before the OS upgrade but with the same ArangoDB package.
The SIGILLs from OpenSSL are intentional: it's just probing for CPU features and catches the signals so the program goes on, until finally there is the SIGBUS error.
We're trying to reproduce that now with the source code version, which will hopefully give us more insight into what's going on. Cloning the repo & compiling there took quite a while, but we're on it.

@jsteemann jsteemann added 1 Bug 3 Build compiling on targets 3 Core labels Aug 6, 2015
@dothebart
Contributor

New 2.6.x packages are available - closing.

@dothebart dothebart reopened this Aug 18, 2015
@dothebart
Contributor

The Cubie truck accidentally became a Debian testing box. We now have a Debian jessie Banana Pi and are waiting for the V8 compile to finish.

@dothebart
Contributor

'Silently exited' turned out not to be true. After some research we found
[ 4511.502550] Alignment trap: not handling instruction ed930b00 at [<004f2100>]
in dmesg, which pointed us to http://blog.galemin.com/tag/alignment/ and -mno-unaligned-access in https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

The new 2.6.5 packages are uploaded. Their md5sum is: ca1860c20cec2ee7f4ed9c5e952e28a3

@dothebart
Contributor

We isolated the issue with the unaligned memory access. @jsteemann will work on fixing this later this week.

@dothebart
Contributor

If properly controlled (it's bit flags, my bad),
echo 2 > /proc/cpu/alignment
turns off the alignment check (bit 1 asks the kernel to fix up the unaligned access instead of trapping).
The -mno-unaligned-access option only fixes program structures which are not used by us.

We revalidated this with this simple test program doing unaligned access as we do it in ArangoDB:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Copies sizeof(T) bytes into a properly aligned union member, so no
// misaligned T pointer is ever dereferenced.
template <typename T>
static inline T TRI_ExtractShapeValue (char const* data) {
  union {
    char buffer[sizeof(T)];
    T value;
  } converter;
  memcpy(&converter.buffer[0], data, sizeof(T));
  return converter.value;
}

// Array access through a typed pointer, routed via the memcpy variant.
template <typename T>
static inline T TRI_ExtractShapeValue (T const* data, size_t index) {
  return TRI_ExtractShapeValue<T>((const char *) data + index * sizeof(T));
}

// Array access through a byte pointer, routed via the memcpy variant.
template <typename T>
static inline T TRI_ExtractShapeValue (char const* data, size_t index) {
  return TRI_ExtractShapeValue<T>(data + index * sizeof(T));
}

int main ()
{
  char buffer[4096];
  const char *pbuf = &buffer[0];
  int i;
  uint64_t testint;
  for (i = 0; i < 30; i++) {
    testint = *((uint64_t*) &buffer[i]);                      // direct unaligned access
    testint = TRI_ExtractShapeValue<uint64_t>(buffer + i);    // re-aligned via memcpy
    testint = TRI_ExtractShapeValue<uint64_t>(pbuf + i , 7);  // array access at an unaligned base
  }
}

(The first line in the loop does an unaligned access, the second re-aligns the pointer access, the third does an unaligned array access.)

One can simply compile the sample above using

g++ test.cpp

From now on arangod will autodetect the kernel's alignment settings at startup on ARM and decline to start if they are wrong for us.

@fceller fceller closed this as completed Oct 12, 2015