Join GitHub today
GitHub is home to over 20 million developers working together to host and review code, manage projects, and build software together.
find_in_batches should use lexicographic order #150
Also please note that I did not attempt to fix #148. The new logic was complicated enough were I thought it would be best to fix the simple case first, then take a swing at the more complicated example later.
Let me know if you have any questions / concerns. I tried to keep the code code as readable as possible but I am alway open to feedback.
Thanks for taking the time to maintain this project and a big thanks to @bradediger who helped me fix this problem
Thanks for the patch!
Having looked at it though, I think we are taking the wrong approach. AR is making an assumption that the primary key is an interger that increments by 1 for each record. With CPK that's not going to happen.
I think the better approach is skip the equality comparison totally. Instead just use offset.
So say you have 10 records, batch size is 2. Order the records by the primary keys, then read the frst two, then the next two, etc. There is no need to do any comparison at all against the primary key values. In fact, I think AR should just do that by default since comparing against the primary key is fragile (what happens if you delete a record?).
So a patch would:
Do you want to do that or do you want me to take a look at it?
referenced this pull request
Apr 14, 2013
@cfis Hey there. I'm not the guy who wrote the patch but I discussed it with him :-)
I do think that the approach in this patch works generally for integer composite keys, and could be extended to work with any data types that can be ordered. If there is a counterexample that I'm missing, I'd love to hear it.
The issue that I see with offset is that in some databases (PostgreSQL at least), large offsets are very expensive, as the entire result must be generated, and then
I'm not sure I follow your "comparing against the primary key is fragile (what if you delete a record?)" line of argument. First of all, if you are concerned about consistency between multiple batches, you probably need to run the entire batch in a transaction, as otherwise all bets are off.
But secondly, outside of a transaction, I think the offset-based approach is more problematic than a comparison-based approach in the presence of modification. Say you are finding in batches of 20, and during your work on the first batch, one of the first 20 is deleted. With a comparison-based query (as in this patch), the query will grab the records lexicographically following the first batch (they will pick up in the right place). But with an offset-based query, the second batch will skip the first 20 records even though only 19 of them now exist (one had been deleted). So the 20th record will not be processed at all, and it was completely unrelated to the deleted item except by virtue of being in the same batch.
Thanks for the review -- I'm interested in hearing more about other potential approaches too if some arise.
Thanks for the feedback. So on n the fragile bit, I take that back, I think I'm mistaken on that.
The performance point is a good one, but I don't see how this could be made to work using a comparison approach. Let's say you have keys like this:
Batch is 5. So the last record of the first batch is 5 and e. Using a comparison, and following what AR does, you would create a query that says the first key is greater then 5 and a second key is greater than e. That will return the wrong answer. Even with integer keys, I'm still not seeing this working:
Query is first key is greater than 5, second key is greater than 5. Wrong answer again.
Maybe I'm missing something, but it seems to me that the comparison approach only works in some very specific cases. The offset case might be slower, but at least its correct.
Oh, I think you're misunderstanding how the code works. If the last key of the first batch is (5, e), then the query looks like:
(key1 > 5) OR ((key1 = 5) AND (key2 > e))
Same way you look something up in an alphabetized list... you compare on the first key, falling down to the second, third, and subsequent keys when you find an equal value.
Does that clarify things? Thanks!