A customer occasionally runs into a problem where their node will not start back up after it has been shut down to perform a backup.
On startup, they see this kind of output in console.log:
2016-12-11 01:14:47.674 [error] <0.1611.0>@riak_kv_vnode:init:512 Failed to start riak_kv_eleveldb_backend backend for index 1381575766539369164864420818427111292018498732032 error: {db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}
2016-12-11 01:14:47.678 [notice] <0.1611.0>@riak:stop:43 "backend module failed to start."
2016-12-11 01:14:47.678 [error] <0.1611.0> gen_fsm <0.1611.0> in state started terminated with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184
2016-12-11 01:14:47.678 [error] <0.1611.0> CRASH REPORT Process <0.1611.0> with 1 neighbours exited with reason: no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in gen_fsm:terminate/7 line 600
2016-12-11 01:14:47.679 [error] <0.163.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.1611.0> exit with reason no function clause matching riak_kv_vnode:terminate({bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818..."}}}, undefined) line 1184 in context child_terminated
2016-12-11 01:14:47.679 [error] <0.160.0> Supervisor riak_core_sup had child riak_core_vnode_manager started with riak_core_vnode_manager:start_link() at <0.1585.0> exit with reason {{function_clause,[{riak_kv_vnode,terminate,[{bad_return_value,{stop,{db_open,"IO error: lock /data/riak/leveldb/1381575766539369164864420818427111292018498732032/LOCK: already held by process"}}},undefined],[{file,"src/riak_kv_vnode.erl"},{line,1184}]},{riak_core_vnode,terminate,3,[{file,"src/riak_core_vnode.erl"},{line,908}]},{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,597}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]},{gen_fsm,sync_send_event,[<0.1611.0>,wait_for_init,...]}} in context child_terminated
2016-12-11 01:14:47.683 [info] <0.500.0>@riak_kv_app:prep_stop:232 Stopping application riak_kv - marked service down.
I've attached the full console.log file for context.
The customer has also added some basic debugging to their startup script, since it was initially thought that a process might still be holding a lock on these files. This is an example of what the debugging output looked like when the node was up and running normally:
lsof:
...
beam.smp 177432 riak mem REG 253,8 32526336 1744834583 /data/riak/leveldb/1267395951122892374379757940871151681107879002112/000369.log
beam.smp 177432 riak mem REG 253,8 27365376 1879049262 /data/riak/leveldb/68507889249886074290797726533575766546371837952/000366.log
beam.smp 177432 riak mem REG 253,8 27631616 805307404 /data/riak/leveldb/125597796958124469533129165311555572001681702912/000367.log
beam.smp 177432 riak mem REG 253,8 32116736 1476398089 /data/riak/leveldb/1438665674247607560106752257205091097473808596992/000382.log
beam.smp 177432 riak mem REG 253,8 32694272 1476398103 /data/riak/leveldb/1039036320289938793410432185759232459286639542272/000382.log
...
After the stop, there are no leveldb entries in the output of lsof. Similarly, there are no beam processes running.
So when Riak restarts, there is no obvious reason why it should fail. When the customer then manually attempts a restart after realising the node is down, it comes up fine.
The customer is running RHEL 7.
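Since the failure is intermittent and only occurs at boot, one option is a pre-start check in the startup script that reports any LOCK file still open in a live process. This is a hypothetical sketch, not from the report; the /data/riak/leveldb layout is taken from the log paths above, and fuser is used in place of the customer's lsof-based check:

```shell
#!/bin/sh
# Hypothetical pre-start check: report any leveldb LOCK file that is
# still open in a live process before attempting to start the node.
# LEVELDB_DIR defaults to the path seen in the logs above.
LEVELDB_DIR="${LEVELDB_DIR:-/data/riak/leveldb}"

held=0
for lock in "$LEVELDB_DIR"/*/LOCK; do
    [ -e "$lock" ] || continue              # glob matched nothing
    if fuser "$lock" >/dev/null 2>&1; then  # fuser exits 0 iff some process holds the file open
        echo "still held: $lock" >&2
        held=1
    fi
done
if [ "$held" -eq 0 ]; then
    echo "no leveldb LOCK files held"
fi
```

A non-empty result at this point would confirm that something really does have a LOCK file open at the moment the start is attempted, which the lsof snapshots taken after `riak stop` have so far not shown.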
Basho-JIRA changed the title (Dec 12, 2016): Riak sometimes fails to start due to LOCK files being held due to lock "already held by process" [JIRA: RIAK-3118]
Perhaps it could also mention that while no process has the LOCK files open after `riak stop`, they do persist in the file system; also, the file system is XFS on a thinly provisioned LVM logical volume, which is used to create a snapshot during the node stop.
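That persistence is expected and, by itself, harmless: the lock lives in the kernel and dies with the holding process, while the LOCK file remains on disk. A small demonstration of this using flock(1) and a throwaway path (LevelDB actually takes an fcntl() lock, but the lifetime behaviour is the same for both kinds):

```shell
#!/bin/sh
# Demonstrate that a lock file persists on disk after its holder exits,
# while the lock itself does not. Uses flock(1); LevelDB uses fcntl(2),
# but both kinds of lock are released when the holding process dies.
LOCKFILE="$(mktemp /tmp/demo.LOCK.XXXXXX)"

# Hold an exclusive lock in a background shell for about a second.
( flock -x 9; sleep 1 ) 9>>"$LOCKFILE" &
holder=$!
sleep 0.2

# While the holder is alive, a non-blocking attempt fails.
if flock -n -x 9 9>>"$LOCKFILE"; then
    echo "acquired while held"
else
    echo "busy while held"
fi

wait "$holder"

# The file still exists afterwards, but the lock is gone.
[ -e "$LOCKFILE" ] && echo "LOCK file still on disk"
if flock -n -x 9 9>>"$LOCKFILE"; then
    echo "acquired after holder exit"
fi
rm -f "$LOCKFILE"
```

So the presence of the LOCK files on disk after `riak stop` cannot by itself produce the "already held by process" error; something must actually have the file open and locked at the moment the restart is attempted.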