Disco master crashes because of ddfs queries. #386

Closed
pooya opened this Issue Jan 10, 2014 · 2 comments

@pooya
Disco Project member

The error message:

=SUPERVISOR REPORT====
Supervisor: {<0.72.0>,disco_main}
Context: child_terminated
Reason:
{timeout,{gen_server,call,[fair_scheduler,{next_task,["disco_node"]}]}}
Offender:
[{pid,<0.81.0>},{name,disco_server},{mfargs,{disco_server,start_link,[]}},{restart_type,permanent},{shutdown,10},{child_type,worker}]

pooya was assigned Jan 21, 2014
@pooya
Disco Project member

There may be a lot of timeouts when ddfs is queried too often. I have been able to reproduce this issue with the following steps:
1. Start disco.
2. Start a long-running job with a lot of tasks (I use test_50k).
3. Run a lot of ddfs queries concurrently. I use the following loop to do so:

for i in $(seq 100)
do
    echo test$i
    ddfs chunk test$i ./AUTHORS &
    ddfs xcat test$i &
    ddfs rm test$i &
done
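
The loop above returns as soon as the commands are backgrounded, so it does not show how many of them failed. The following is a minimal sketch of a wrapper that runs the same three commands per tag, waits for all of them, and prints a failure count. The function name run_load and the DDFS and COUNT environment variables are hypothetical names introduced here for illustration; they are not part of disco or ddfs.

```shell
# Sketch only: run_load, DDFS and COUNT are assumed names, not disco/ddfs features.
# DDFS  - command to invoke (defaults to the real ddfs CLI)
# COUNT - number of test tags to create (defaults to 100, as in the loop above)
run_load() {
    ddfs_cmd="${DDFS:-ddfs}"
    count="${COUNT:-100}"
    pids=""

    for i in $(seq "$count"); do
        # Same three commands as the original loop, all started in the background.
        "$ddfs_cmd" chunk "test$i" ./AUTHORS &
        pids="$pids $!"
        "$ddfs_cmd" xcat "test$i" &
        pids="$pids $!"
        "$ddfs_cmd" rm "test$i" &
        pids="$pids $!"
    done

    # Collect the exit status of every background command.
    failed=0
    for pid in $pids; do
        wait "$pid" || failed=$((failed + 1))
    done

    echo "failed=$failed"
}
```

Calling run_load with no variables set would issue the same 300 ddfs commands as the loop above. Because the three commands for a tag still race each other, some failures are expected even on a healthy master, so the count is only useful when compared between runs.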

As soon as the HTTP requests kick in, the disco master behaves oddly. These are some of the behaviors I have seen:

  • The CPU utilization drops on the master.
  • The disco web page loads very slowly.
  • No new worker tasks are started.
  • The master sometimes crashes. Almost all of the crashes start from a timeout, and they have been observed in different parts of the code.
  • Jobs fail because of these timeouts.
  • Sometimes different Erlang processes fail without being restarted (they either have no supervisor or exceed the supervisor's restart limit).
@pooya
Disco Project member

The disco master has been made more resilient by increasing the timeouts for the more important gen_server calls.

pooya closed this Jul 29, 2014