-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lock wait timeouts #6
Comments
I don't have a good answer for this but I do have a work around. First though be sure that you haven't been debugging recently because I've definitely found accidentally breaking out of a process and not canceling the transaction makes things much worse. The workaround though is to change the granularity of the tasks you are computing. For example if it is a cell level analysis then make the parpopulate poprel run at the level of SpikeSets and then inside the populate function loop over the cells below that. It means that each cluster node will be more rows apart and decrease the frequency of the problem. Can you try and get more information? E.g. can you query the innodb status and try and see which table is getting locked? It would be information to know if it's the jobs table (bug in parpopulate) or the transaction locks for the populate. If the later and you are not doing anything with subtables I still don't think the lock in that case is beneficial enough for the downsides. |
Ok, good to know that' I'm not the only one that has the problem. It's not because of uncanceled transactions. It occurs even if I restart the database server and create the tables from scratch before doing anything. I'm pretty sure it's the row-level locks of InnoDB. It's definitely not the jobs table that is being locked. I can't change the granularity easily since it's a population analysis and the computation is being done on the SpikeSet level. I now have a hypothesis how I can get around it (below) but I would also like to understand what's causing the lock so I can prevent it in the future. The situation is as follows: I'm fitting a model on the entire population using different values for the parameters and multiple cross-validation runs. So I have a ModelSet table as the parent, a Model table containing the individual fits, and a ModelUnits table to store which cells went into the model. I'm doing the inserts into each table as I go along and I guess for some reason those inserts are causing some kind of lock (no idea why). Since the fitting takes quite a while each ModelSet makeTuples runs for a few minutes, during which the ModelSet table seems to be locked and concurrent makeTuples get the timeouts. I can see two fixes, both of which are not very satisfactory:
|
Have you tried MySQL 5.6 yet? It is supposed to be much faster. |
No I haven't but I should probably give that a shot, too It looks like after flattening the joins and moving all restrictions as far to the end out as possible I got rid of most of my problems. |
Since this is not a DataJoint-specific problem, I propose writing a wiki page on query optimization and closing this issue, unless you think that there is something specific to DataJoint that could be improved. Performance issues will always exist for complex queries. |
Haha, I thought this was the issue on performance problems when I answered to your comment... of course this has nothing to do with query optimization. The solution was indeed to do all the computations first and then insert all tuples at the end. Makes the makeTuples a bit less clean but has worked every time so far. You have a point that it's not really DJ related, so I'll put together a wiki page as soon as I get to it. Let's leave the issue open for the time being as a reminder. |
I still have never encountered this problem without an obvious explanation. I can reproduce this problem in several ways, which result from incorrect populate calls or when makeTuples tries to insert something that it shouldn't. This may happen when another process is computing the same thing. This should not happen when parpopulate is used. Do you still see run into this problem? How often? Is it reproducible? Could I examine the schemas and classes where this problem occurs? |
okay, i have reproduced this consistently and I think understand when this happens. It only happens when multiple processes populate an overlapping range of primary key values. This usually happens when makeTuples fill out a primary key attribute that is not last in the the primary key. Transactions block continuous ranges of indices. The solution is to design the schemas so that makeTuples only adds primary key attributes at the end of the primary key. The reason I have not seen this problem before is that I have typically designed tables this way anyway. You could probably reduce this problem by changing the order of the primary foreign keys or changing the granularity of your popRel. The workaround above is good: delay the insertions until all the tuples are computed. This shortens the total time that the index range is locked by any one process. I think we should modify populate to recognize timeouts and not log them as errors. Instead, it should try the same call again later. |
Can you post your specific example to reproduce the problem somewhere? I've always been pretty sure that it's related to locking of index ranges, but could never figure out why it would do it. What do you mean by changing the order of primary foreign keys? |
Ideally, One such example is reso.PeriStimTrace.
which produces the following primary key Its Then We could attempt to solve the problem by changing the order of the primary foreign keys. For example,
Then the primary key will be Instead I could keep the original order but change the |
in light of the above, I am making the following two changes in
|
We need to remember to describe this more concisely on wiki. |
This issue needs more thorough coverage in Wiki with graphics to explain - it will be to much to ask the users to read through this issue and to understand. |
I agree with @eywalker ... I run into the timeouts a lot but I don't understand what the problem is and how to solve it. |
I'm getting a lot of "lock wait timeout" errors these days when using the cluster to parpopulate tables (happens already with 16 workers but increasingly often with larger numbers). I already researched it quite a bit and it is related to InnoDB's row level locking but I don't understand yet what causes the lock. I posted a question on stackexchange as well: http://dba.stackexchange.com/questions/31611/innodb-row-level-locks-without-selects
I was wondering if anyone else has the same issue and/or any ideas what causes it or how to fix it.
The text was updated successfully, but these errors were encountered: