insert_id is wrong after batch insert of > 100 #42

Closed
bitbucket-import opened this Issue Aug 19, 2011 · 12 comments

Comments

Projects
None yet
6 participants

Since DB_active_rec does batch insert 100 at a time, insert_id will be wrong if the batch inset is more than 100 elems. it will return the insert id of the last 100-chunk.

A solution could be to save insert_id after the first 100-chunk.

derekjones pushed a commit that referenced this issue Aug 19, 2011

Merge pull request #42 from CUBRID/develop
CUBRID Database Driver for CI.
Contributor

narfbg commented Feb 28, 2012

How does that make it wrong?

sviande pushed a commit to sviande/CodeIgniter that referenced this issue Jan 3, 2014

Merge pull request #42 from CUBRID/develop
CUBRID Database Driver for CI.

bgeisel1 commented Feb 2, 2016

We just ran into the same issue as bitbucket-import. The reason this is "wrong" is two-fold:

  1. Calling insert_batch() invalidates the last_insert_id, which is unexpected by anyone who didn't read the insert_batch() code.
  2. Calling insert_batch() does not maintain isolation integrity of systems that implement an isolation solution for multiple insert (such that all inserts are guaranteed to be all-or-nothing and sequential without a race condition from an external writer).

In the case of MySQL at least, last_insert_id is guaranteed to point to the first item inserted in an autoincrement table. That guarantee is lost if you're using Codeigniter's insert_batch. What's worse, is if you need the value of the first insert there is now no safe way to obtain that information, it is lost forever.

I didn't notice any reason in the comments for choosing 100 for the limit. Perhaps it is a limit of one of the particular database systems, but if that's the case, it seems best to impose the limit in the driver for that specific database. My bug-fix recommendation would be to remove the limit altogether and send the entire insert query, or at least to document in the function why there is an imposed limit and how said limit was chosen. The MySQL limit seems to be constrained only by the maximum packet size which defaults to 16M.

How to fix outside of Codeigniter:
In case this issue isn't fixed and you're reading this on the Internet, the fix I would recommend would be to implement your own form of batch insert that does not use an arbitrary limit based on the average number of fingers possessed by humans. ;-) Or, you could re-write the insert_batch function to be wrapped in a transaction (so you know no other inserts asynchronously get added) that saves the last_insert_id from the first insert and returns it upon successful completion of the transaction. That solution is less clean, however, as you might have other code expecting a true/false answer or relying on last_insert_id, which would still be corrupted.

Contributor

narfbg commented Feb 3, 2016

Calling insert_batch() invalidates the last_insert_id

What do you mean by "invalidate" exactly?

In the case of MySQL at least, last_insert_id is guaranteed to point to the first item inserted in an autoincrement table

The first one? Am I missing something here? Isn't it supposed to be the last one?

Contributor

mwhitneysdsu commented Feb 3, 2016

http://dev.mysql.com/doc/refman/5.7/en/information-functions.html#function_last-insert-id

Apparently, MySQL's last_insert_id() function returns the first ID inserted by a batch insert. CodeIgniter doesn't use that function directly, but I'm not sure whether MySQLi's insert_id uses the function (or mimics its behavior). Either way, if you do a batch insert of more than 100 rows, CI's insert_id() function is probably not going to give you a useful value (though I wouldn't expect the behavior of MySQL's last_insert_id() function if I wasn't reading the MySQL manual).

One of the reasons for splitting up a batch insert is that MySQLi (and probably other PHP database extensions) has a limit on the size of the query ( http://dev.mysql.com/doc/refman/5.7/en/packet-too-large.html ).

While 100 is arbitrary, it apparently works for most common uses of insert_batch(), and anything non-arbitrary would have to be a platform-specific implementation, as you would have to determine the actual limit for the platform, then divide the insert into the appropriate size (which, at least for MySQLi, would break the batch down into a number of inserts based on the amount of data inserted into each row, which would be unnecessary 99.9% of the time).

At the very least, adding a "note" to the insert_batch public documentation would be worthwhile. Given that insert_batch and insert_id are both part of the public API, inserting >100 rows and retrieving the insert_id should either be supported, or marked as a known limitation. If keeping current behavior and updating documentation is the desired path, I'm happy to submit a pull request for the docs.

Contributor

narfbg commented Feb 3, 2016

I've got another idea about an eventual solution, but I still don't quite see the problem and am waiting for an answer to my first question for @bgeisel1.

bgeisel1 commented Feb 3, 2016

Hi Andrey & Mat, thanks for the quick response.

Details and examples below, but the main issue is this: insert_batch silently changes the API to a database. One of the quintessential aspects of a database is atomicity. From wikipedia:

In database systems, atomicity (or atomicness; from Greek a-tomos, undividable) is one of the ACID transaction properties. An atomic transaction is a series of database operations such that either all occur, or nothing occurs. The series of operations cannot be divided apart and executed partially from each other, which makes the series of operations "indivisible", hence the name.

In Codeigniter, the implementation of insert_batch defies the very definition of "atomicity" and "indivisible" for all of the databases behind it. The fact that MySQL has a limit means that at some point it will return an error, which the user code can handle. However, insert_batch makes the operation non-atomic which may result in errors the user cannot handle (i.e. non-deterministic change to last_insert_id). Sorry, I was probably unclear when I said "invalidates", but that's what I meant. It silently (no errors, etc.) changes the behavior of the underlying DB and I can no longer accurately obtain the IDs returned by the insert.

Because atomicity is broken, not only is last_insert_id broken, but what happens if one of the chunks in the middle fails? Now I have 1 set of IDs inserted and another set (or sets) that failed with no way to correlate what worked and what didn't. It's why the ACID principle is so important to databases.

As it was reported earlier, I suspect that other users are running into it. We didn't notice until data started getting corrupted in the asynchronous case (see below). It would be very easy to use this and never know that inserts > 100 could be corrupting your data (i.e. because you expect last_insert_id to be deterministic).

Ok, the examples and answer to @narfbg 's other question:

The first one? Am I missing something here? Isn't it supposed to be the last one?

No, as per the MySQL documentation, last_insert_id is always the FIRST entry in multiple inserts.

Important
If you insert multiple rows using a single INSERT statement, LAST_INSERT_ID() returns the value generated for the first inserted row only. The reason for this is to make it possible to reproduce easily the same INSERT statement against some other server.

For an example, let's say I'm inserting 500 entries. The first entry is inserted as id 2000. I do a MySQL multiple insert (not using insert_batch, but just a raw query). Calling last_insert_id will return 2000 since it was the first entry. I inserted 500 entries, so I now know my IDs are 2000 to 2499 (or of course if we hit a limit somewhere it will fail, and I will know because I received an error).

When I use insert_batch, two things go wrong here. In the non-threaded (simpler) case, I insert the same 500 entries in a table whose next ID will be 2000. After this call, however, last_insert_id won't return 2000 as we would expect, it will return 2400. Same as before, my IDs are 2000 to 2499, but now last_insert_id points somewhere in the middle. Not what the user is expecting.

In the threaded case, I can't even determine what my IDs are. For example, say insert_batch does its first batch, IDs 2000 through 2099. Then _insert_batch returns and our loop is about to call it again with the next batch. Right at that moment another thread writes 5 entries into the database. Now my second set is going to be 2104 through 2203 and information is lost (the list of IDs is now non-deterministic). My ids are now 2000-2099, 2104-2203, and so forth. By breaking my atomic query into a non-atomic query it has broken the ACID transaction property of such a database.

I hope that makes more sense. Please let me know if my examples are confusing or if I've brought up other questions.

Thanks!

narfbg added a commit that referenced this issue Feb 4, 2016

Contributor

narfbg commented Feb 4, 2016

To be honest, most of what you said is kind of irrelevant (and further confusing), but I believe I've got what you mean. :)

Does that last commit resolve the problem?

That commit is definitely an improvement, since it would be possible to pass count($set) in as $batch_size and bypass the batching mechanism. If you were to call insert_batch that way, I think you would re-gain access to insert_id afterward.

I don't like that the current batch implementation can silently fail to insert some of the batches, such as experienced by this user:
http://stackoverflow.com/questions/18015852/codeigniters-insert-batch-with-thousands-of-inserts-has-missing-records

I would suggest wrapping the batch inserts in a transaction, but not all databases support that. Any ideas on how to prevent some of the inserts from silently failing?

Contributor

narfbg commented Feb 4, 2016

That's not possible if you don't utilize the new parameter, not in a way that would work on all databases.

That's not possible if you don't utilize the new parameter, not in a way that would work on all databases.

That's really the reason it's such a bad idea to break a database query into chunks -- it will not work on all databases. It will seem to work until using it in a large production environment. Then it will silently corrupt data.

Breaking the query into chunks is bad. In some very valid use cases, it corrupts data.
Not breaking the query into chunks is good. It models the way databases work and doesn't corrupt data.

My recommendation would be to default insert_batch to not break the query into chunks. It was a mistake to chunk it in the first place. There is no good reason to continue chunking and many good reasons to discontinue chunking. Just remove the for loop that is doing the array_slice.

It used to corrupt data and no one noticed is not a good reason to go on doing so.

Contributor

mwhitneysdsu commented Feb 19, 2016

While it would probably be a good idea to not break the query up, it would also be a good idea to:

  • run the batch in a transaction (which you can do without any modification to the query builder code)
  • compare the result of $this->db->affected_rows() to the number of rows you wanted to insert
  • don't rely on the result of $this->db->insert_id() matching the behavior you described

The last is just because it won't match your description for all database platforms. In some cases, the method is simply not supported; in at least one case, the method will return the last ID inserted, rather than the first. (I expected it to return the last ID inserted, which is why I linked the documentation earlier to support your description of the behavior under MySQL.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment