
Deadlock when rewriting resources #840

Closed
GoogleCodeExporter opened this Issue Apr 6, 2015 · 10 comments

GoogleCodeExporter commented Apr 6, 2015

Sometimes under high load I can see rewrites being stuck for 10 seconds. gdb gives this backtrace:

#0  pthread_cond_timedwait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000aabc36 in net_instaweb::PthreadCondvar::TimedWait (this=0x11456e0, timeout_ms=0) at pagespeed/kernel/thread/pthread_condvar.cc:66
#2  0x000000000066dabe in net_instaweb::CheckingThreadSystem::CheckingCondvar::TimedWait (this=0x1145730, timeout_ms=10000) at pagespeed/kernel/base/checking_thread_system.cc:55
#3  0x0000000000695d64 in net_instaweb::Scheduler::AwaitWakeupUntilUs (this=0x11455d0, wakeup_time_us=1385368997267462) at pagespeed/kernel/thread/scheduler.cc:382
#4  0x0000000000695e51 in net_instaweb::Scheduler::ProcessAlarms (this=0x11455d0, timeout_us=10000000) at pagespeed/kernel/thread/scheduler.cc:401
#5  0x0000000000696034 in net_instaweb::SchedulerBlockingFunction::Block (this=0x7ffff0b69dd0) at pagespeed/kernel/thread/scheduler.cc:434
#6  0x0000000000540226 in net_instaweb::RewriteDriver::FinishParse (this=0x13e66e0) at net/instaweb/rewriter/rewrite_driver.cc:2470

It seems that sometimes Wakeup() can be called before AwaitWakeupUntilUs() starts 
waiting on the condition variable; the waiter then misses the condition variable 
broadcast and stays stuck until the 10-second timeout given in 
SchedulerBlockingFunction::Block() expires.

If I lock the mutex in Scheduler::Wakeup() before broadcasting on the condition 
variable, I can no longer see the timeouts; everything works fine.

I'm using a custom reverse proxy with PSOL 1.5.28.2. Looking at the code, I 
believe the issue is still present in trunk.

Original issue reported on code.google.com by mikear...@gmail.com on 25 Nov 2013 at 9:03

GoogleCodeExporter commented Apr 6, 2015

Thanks; will take a look. If Wakeup() got called early it shouldn't really 
matter, though, since it should evaluate the predicate before sleeping. 

Original comment by morlov...@google.com on 25 Nov 2013 at 3:29

GoogleCodeExporter commented Apr 6, 2015

This looks (from the backtrace) like it might be 
https://code.google.com/p/modpagespeed/issues/detail?id=759 which was fixed.  
Maks, what do you think?

If so, upgrading should sort the problem out.

Original comment by jmaes...@google.com on 25 Nov 2013 at 3:53

GoogleCodeExporter commented Apr 6, 2015

Issue 759 was reported by me as well, and I have that fix backported.

Original comment by mikear...@gmail.com on 25 Nov 2013 at 4:09

GoogleCodeExporter commented Apr 6, 2015

Likely schedule for the race:
1) Thread 1: Acquire mutex()
2) Thread 1: Read done = false (+memory acquire)
3) Thread 2: Write done = true  (+memory release)
4) Thread 2: Wakeup()
5) Thread 1: Wait()

Locking in Wakeup() of course makes it impossible for the broadcast to land at 
(4), but I think I would prefer to lock in SchedulerBlockingFunction::Cancel 
instead, to also make steps 3 + 4 atomic and a bit easier to reason about.

Original comment by morlov...@google.com on 3 Dec 2013 at 6:35

GoogleCodeExporter commented Apr 6, 2015

Yup, that looks like it. Agree with your suggested fix; it sounds like we're 
not consistent about whether Cancel() is run under mutex() or not, and that 
will need to be fixed first.

Original comment by jmaes...@google.com on 3 Dec 2013 at 6:56

GoogleCodeExporter commented Apr 6, 2015

Submitted https://code.google.com/p/modpagespeed/source/detail?r=3670
It's similar to what you suggested, but with an additional change to make sure 
that the locking state for callbacks is consistent (without it, one can 
deadlock trying to acquire the same lock twice from the same thread).

Original comment by morlov...@google.com on 9 Dec 2013 at 3:20

GoogleCodeExporter commented Apr 6, 2015

Original comment by jefftk@google.com on 2 Jan 2014 at 4:19

  • Changed state: Fixed
GoogleCodeExporter commented Apr 6, 2015

Original comment by jefftk@google.com on 2 Jan 2014 at 5:21

  • Added labels: Milestone-v31
GoogleCodeExporter commented Apr 6, 2015

Original comment by hui...@google.com on 6 Jan 2014 at 6:25

  • Changed title: Deadlock when rewriting resources
GoogleCodeExporter commented Apr 6, 2015

Not a deadlock, just a missed wakeup.

Original comment by morlov...@google.com on 7 Jan 2014 at 3:09
