Lock wait timeouts #6

Closed
aecker opened this issue Jan 8, 2013 · 14 comments

@aecker
Contributor

aecker commented Jan 8, 2013

I'm getting a lot of "lock wait timeout" errors these days when using the cluster to parpopulate tables (it happens already with 16 workers, but increasingly often with larger numbers). I have already researched it quite a bit and it is related to InnoDB's row-level locking, but I don't yet understand what causes the lock. I posted a question on Stack Exchange as well: http://dba.stackexchange.com/questions/31611/innodb-row-level-locks-without-selects

I was wondering if anyone else has the same issue and/or any ideas what causes it or how to fix it.

@peabody124

I don't have a good answer for this, but I do have a workaround. First, though, make sure you haven't been debugging recently, because I've found that accidentally breaking out of a process without canceling the transaction makes things much worse.

The workaround, though, is to change the granularity of the tasks you are computing. For example, if it is a cell-level analysis, make the parpopulate popRel run at the level of SpikeSets and then loop over the cells inside the populate function, as sketched below. That way the rows each cluster node works on are farther apart, which decreases the frequency of the problem.
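
Something like this is what I mean (a rough dj-matlab sketch; the schema, table, and helper names are all made up):

%{
ephys.CellStats (computed)  # hypothetical cell-level analysis
-> ephys.SpikeSet
cell_id  : int              # filled in by makeTuples, not by popRel
-----
stats    : longblob         # made-up result
%}

classdef CellStats < dj.Relvar & dj.AutoPopulate
    properties(Constant)
        table = dj.Table('ephys.CellStats')
        popRel = ephys.SpikeSet    % one job per SpikeSet, not per cell
    end
    methods
        function makeTuples(self, key)
            % loop over the cells inside a single job/transaction
            for cellId = fetchn(ephys.Cells & key, 'cell_id')'
                tuple = key;
                tuple.cell_id = cellId;
                tuple.stats = computeStats(tuple);   % made-up helper
                self.insert(tuple)
            end
        end
    end
end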

Can you try and get more information? E.g. can you query the InnoDB status and see which table is getting locked? It would be informative to know whether it's the jobs table (a bug in parpopulate) or the transaction locks for the populate. If the latter, and you are not doing anything with subtables, I still don't think the lock in that case is beneficial enough to justify the downsides.
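
For example, something along these lines dumps the InnoDB monitor output, which lists the active transactions, the locks they hold, and the last detected lock wait (assuming an open mym connection; the exact field names depend on the mym version):

status = mym('SHOW ENGINE INNODB STATUS');
disp(status.Status{1})   % look for the TRANSACTIONS and lock wait sections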

@aecker
Contributor Author

aecker commented Jan 15, 2013

OK, good to know that I'm not the only one who has this problem.

It's not because of uncanceled transactions. It occurs even if I restart the database server and create the tables from scratch before doing anything. I'm pretty sure it's the row-level locks of InnoDB.

It's definitely not the jobs table that is being locked. I can't change the granularity easily since it's a population analysis and the computation is being done on the SpikeSet level.

I now have a hypothesis about how I can get around it (below), but I would also like to understand what's causing the lock so I can prevent it in the future.

The situation is as follows: I'm fitting a model on the entire population using different values for the parameters and multiple cross-validation runs. So I have a ModelSet table as the parent, a Model table containing the individual fits, and a ModelUnits table to store which cells went into the model. I'm doing the inserts into each table as I go along, and I guess for some reason those inserts are causing some kind of lock (no idea why). Since the fitting takes quite a while, each ModelSet makeTuples runs for a few minutes, during which the ModelSet table seems to be locked and concurrent makeTuples get the timeouts.

I can see two fixes, neither of which is very satisfactory:

  1. Increase the lock wait timeout. This would cause fewer errors, but the parallelization would still be pretty inefficient, since workers would spend a lot of time waiting for locks.
  2. Compute all the tuples and do the inserts at the end (sketched below). This would make the code much less clean.
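
Fix 2 would look something like this inside makeTuples (a rough sketch; the schema prefix "pop" and the helper functions are made up):

function makeTuples(self, key)
    % accumulate all tuples in memory first
    modelTuples = [];
    unitTuples  = [];
    for p = getParamGrid(key)            % made-up helper: parameter values
        modelTuple = fitModel(key, p);   % made-up helper: the slow part
        modelTuples = [modelTuples, modelTuple];                %#ok<AGROW>
        unitTuples  = [unitTuples, getUnitTuples(modelTuple)];  %#ok<AGROW>
    end
    % all inserts happen here, so the transaction holds its row/index
    % locks only for the short duration of the inserts
    self.insert(key)                     % the ModelSet tuple itself
    insert(pop.Model, modelTuples)
    insert(pop.ModelUnits, unitTuples)
end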

@dimitri-yatsenko
Member

Have you tried MySQL 5.6 yet? It is supposed to be much faster.

@aecker
Contributor Author

aecker commented Feb 14, 2013

No, I haven't, but I should probably give that a shot, too.

It looks like after flattening the joins and moving all restrictions as far toward the end of the expression as possible, I got rid of most of my problems.
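
To illustrate with made-up relation names, the change is roughly:

% before: restrictions buried inside the join tree
rel = (pop.ModelSet & cond) * (pop.Model & cond) * pop.ModelUnits;

% after: flattened join with a single restriction applied at the end
rel = pop.ModelSet * pop.Model * pop.ModelUnits & cond;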

@dimitri-yatsenko
Member

Since this is not a DataJoint-specific problem, I propose writing a wiki page on query optimization and closing this issue, unless you think that there is something specific to DataJoint that could be improved. Performance issues will always exist for complex queries.

@aecker
Contributor Author

aecker commented Feb 14, 2013

Haha, I thought this was the issue on performance problems when I replied to your comment... of course this has nothing to do with query optimization.

The solution was indeed to do all the computations first and then insert all tuples at the end. It makes the makeTuples a bit less clean, but it has worked every time so far.

You have a point that it's not really DJ related, so I'll put together a wiki page as soon as I get to it. Let's leave the issue open for the time being as a reminder.

@dimitri-yatsenko
Member

I still have never encountered this problem without an obvious explanation. I can reproduce it in several ways, all of which result from incorrect populate calls or from makeTuples trying to insert something that it shouldn't. That can happen when another process is computing the same thing, which should not happen when parpopulate is used.

Do you still run into this problem? How often? Is it reproducible? Could I examine the schemas and classes where this problem occurs?

@dimitri-yatsenko
Member

OK, I have reproduced this consistently and I think I understand when it happens. It only happens when multiple processes populate an overlapping range of primary key values. This usually happens when makeTuples fills out a primary key attribute that is not last in the primary key. Transactions block contiguous ranges of indices.

The solution is to design the schemas so that makeTuples only adds primary key attributes at the end of the primary key. The reason I have not seen this problem before is that I have typically designed tables this way anyway. You could probably reduce this problem by changing the order of the primary foreign keys or changing the granularity of your popRel.

The workaround above is good: delay the insertions until all the tuples are computed. This shortens the total time that the index range is locked by any one process.

I think we should modify populate to recognize timeouts and not log them as errors. Instead, it should try the same call again later.

@aecker
Contributor Author

aecker commented Sep 21, 2013

Can you post your specific example to reproduce the problem somewhere? I've always been pretty sure that it's related to locking of index ranges, but could never figure out why it would do it. What do you mean by changing the order of primary foreign keys?

@dimitri-yatsenko
Member

Ideally, popRel's primary key comprises the top attributes of the table's primary key. Then makeTuples fills out the remaining primary attributes. However, it is easy to create cases where popRel's primary key includes the first and third attributes of the primary key and makeTuples must fill the second. This will result in timeouts, and that's the expected behavior. A timeout will occur any time process y attempts to insert a tuple between two tuples inserted by process x before x completes its transaction.

One such example is reso.PeriStimTrace.
Its primary key is defined as follows:

->reso.Sync
->reso.Trace
->psy.Trial

which produces the following primary key (animal_id,tp_session,scan_idx,slice_num,trace_id,psy_id,trial_idx).

Its popRel is reso.Sync*reso.Segment*psy.Session whose primary key is (animal_id,tp_session,scan_idx,slice_num,psy_id).

Then makeTuples must fill out trial_idx and trace_id. Because trace_id comes before psy_id, timeouts ensue.

We could attempt to solve the problem by changing the order of the primary foreign keys. For example,

reso.PeriStimTrace (computed) # calcium signals around the stimulus
->reso.Sync
->psy.Trial
->reso.Trace

Then the primary key will be (animal_id,tp_session,scan_idx,psy_id,trial_idx,slice_num,trace_id). The problem remains unsolved: makeTuples must still fill out trial_idx, which is not last in the primary key.

Instead, I could keep the original order but change the popRel to include the trace: popRel = reso.Sync*reso.Trace*psy.Session. Then makeTuples only needs to fill out trial_idx, which is last in the primary key. I would then also need to change makeTuples to process only one trace at a time, as sketched below.
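
In code, the fix would look roughly like this (a sketch showing only the relevant parts; computePeriStimTrace is a made-up helper):

properties(Constant)
    table = dj.Table('reso.PeriStimTrace')
    % include reso.Trace so that trial_idx, the last attribute of the
    % primary key, is the only one left for makeTuples to fill out
    popRel = reso.Sync * reso.Trace * psy.Session
end

methods
    function makeTuples(self, key)
        % key now identifies a single trace; only loop over trials
        for trialIdx = fetchn(psy.Trial & key, 'trial_idx')'
            tuple = key;
            tuple.trial_idx = trialIdx;
            tuple.trace = computePeriStimTrace(tuple);  % made-up helper
            self.insert(tuple)
        end
    end
end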

@dimitri-yatsenko
Member

In light of the above, I am making the following two changes to populate in DataJoint version 2.6.5.

  1. If popRel's primary key attributes are not at the top of the table's primary key, parpopulate will display a warning that timeouts may occur.
  2. If a timeout does occur, the offending key will be saved and attempted again after all other keys have been processed, up to maxTimeouts=5 attempts (see the sketch below). This will not always solve the problem, but it will make tasks more likely to complete, though less efficiently.
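
A minimal sketch of the retry logic (illustrative only, not the actual dj.AutoPopulate code; populateKey and isLockWaitTimeout are made-up helpers, the latter checking for MySQL error 1205):

maxTimeouts = 5;
keys = fetch(self.popRel - self);       % keys that still need populating
for attempt = 1:maxTimeouts
    failed = [];
    for key = keys'
        try
            self.populateKey(key)       % made-up: transaction + makeTuples
        catch err
            if isLockWaitTimeout(err)   % made-up check for MySQL error 1205
                failed = [failed; key]; %#ok<AGROW> % save the offending key
            else
                rethrow(err)            % real errors are still logged
            end
        end
    end
    if isempty(failed), break, end
    keys = failed;                      % retry only the timed-out keys
end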

@dimitri-yatsenko
Member

We need to remember to describe this more concisely on the wiki.

@eywalker
Contributor

This issue needs more thorough coverage in the wiki, with graphics to explain it; it is too much to ask users to read through this whole issue to understand it.

@mjlm
Contributor

mjlm commented Feb 8, 2018

I agree with @eywalker... I run into the timeouts a lot, but I don't understand what the problem is or how to solve it.
