Improve insert performance with batched insert queries #8260
base: 2.8.x
Conversation
Force-pushed from 6370a74 to 237bfdd
I don't think you should target 2.7; this does not qualify as a bugfix.
Shouldn't there be extra tests, and documentation?
@@ -273,38 +271,41 @@ public function executeInserts()
$stmt = $this->conn->prepare($this->getInsertSQL());
$tableName = $this->class->getTableName();

foreach ($this->queuedInserts as $entity) {
    $insertData = $this->prepareInsertData($entity);
foreach (array_chunk($this->queuedInserts, $this->getMaxBatchedInsertQueries()) as $queuedInsertsChunk) {
As soon as you do that, you now have the contents of $this->queuedInserts twice in memory. Wouldn't it be better to extract the first $this->getMaxBatchedInsertQueries() elements out of that array with array_slice() until it is empty? I see it will be emptied anyway after the loop, on line 308.
I think this is OK, as values in PHP are not duplicated unless modified (i.e. copy-on-write).
Doesn't look OK to me: https://3v4l.org/1LYhg. array_chunk does not create a simple copy; it creates another array entirely.
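The alternative the first reviewer suggests can be sketched like this. This is a hypothetical illustration, not code from the PR; `drainInBatches` and its callback are made-up names, standing in for the persister's chunked loop:

```php
<?php

// Hypothetical sketch of the array_slice()-style suggestion: drain the queue
// with array_splice(), so at most one batch exists alongside the (shrinking)
// queue, instead of array_chunk() materializing every chunk up front.
function drainInBatches(array &$queue, int $batchSize, callable $handler): void
{
    while ($queue !== []) {
        // Removes the first $batchSize elements from $queue in place
        // and returns them reindexed from 0.
        $batch = array_splice($queue, 0, $batchSize);
        $handler($batch);
    }
}

$queue = range(1, 7);
drainInBatches($queue, 3, function (array $batch): void {
    echo implode(',', $batch), PHP_EOL;
});
// $queue is now empty, mirroring the clearing the reviewer mentions on line 308.
```

This prints `1,2,3`, `4,5,6`, `7` on separate lines and leaves `$queue` empty, so the explicit clearing after the loop becomes redundant.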
{
    if ($this->insertSql !== null) {
        return $this->insertSql;
    }
So now, if we call this method several times in a row and there is no batching, we will recompute the insert SQL every time?
This is correct; I was not able to measure any significant performance difference. Also, other queries like UPDATE and DELETE are rebuilt from scratch every time.
It would be helpful if you could describe how you ran the performance tests to measure the improvement. If this PR gets merged, we will need a way to prevent future regressions that would erase these performance gains.
I completely agree that caching the full insert SQL is not possible. However, we should be able to store the columns and placeholders to avoid recomputing them every time; I recall a 2s slowdown across the whole test suite from dropping this cache.
Actually, now that I think about it, we should still store the full insert SQL for a single entry. We should also store the column count, so we can easily build new sets of placeholders.
As for UPDATE and DELETE, the reason they are not cached is that we're not doing full updates. If we were more in line with EclipseLink, Hibernate and EntityFramework, we would do full updates (updating all columns, excluding LOB fields) and cache that computation as well. For DELETE, id-based removals would be OK too...
But this is another micro-optimization that could be done in a separate PR.
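The caching idea described above (store the single-row SQL plus the column count, then derive multi-row statements by repeating a placeholder group) could be sketched as follows. `InsertSqlCache` and its method are hypothetical names for illustration, not the persister's actual API:

```php
<?php

// Hypothetical sketch: cache the single-row INSERT SQL and the column count,
// and build multi-row statements by appending extra placeholder groups,
// instead of recomputing columns and placeholders on every call.
final class InsertSqlCache
{
    private ?string $singleRowSql = null;
    private int $columnCount = 0;

    /** @param list<string> $columns */
    public function getInsertSql(string $table, array $columns, int $rows): string
    {
        if ($this->singleRowSql === null) {
            $this->columnCount  = count($columns);
            $group              = '(' . implode(', ', array_fill(0, $this->columnCount, '?')) . ')';
            $this->singleRowSql = sprintf(
                'INSERT INTO %s (%s) VALUES %s',
                $table,
                implode(', ', $columns),
                $group
            );
        }

        if ($rows === 1) {
            return $this->singleRowSql;
        }

        // Only the VALUES part varies; the cached column count is enough
        // to build the extra placeholder groups.
        $group = '(' . implode(', ', array_fill(0, $this->columnCount, '?')) . ')';

        return $this->singleRowSql . str_repeat(', ' . $group, $rows - 1);
    }
}
```

For example, `getInsertSql('t', ['a', 'b'], 3)` yields `INSERT INTO t (a, b) VALUES (?, ?), (?, ?), (?, ?)`, while the single-row case returns the cached string unchanged.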
    return 1;
}

return 50;
Magic number! Why 50? Shouldn't it be configurable?
Yes, it should. Please advise how.
I would use a constructor argument
I agree here. Any fixed number may perform differently depending on the original number of entities that need to be processed. Making it configurable, e.g. in the form of a constructor argument, would be an advantage.
This information should be set by a new annotation, @BatchSize(size = 100), configured at the entity level.
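The constructor-argument approach suggested above might look like the following. The class and method names are hypothetical placeholders, not the actual persister classes:

```php
<?php

// Hypothetical sketch of making the batch size a constructor argument
// with a sensible default, replacing the hard-coded 50.
class BatchingEntityPersister
{
    public const DEFAULT_MAX_BATCHED_INSERTS = 50;

    public function __construct(
        private int $maxBatchedInsertQueries = self::DEFAULT_MAX_BATCHED_INSERTS
    ) {
        if ($this->maxBatchedInsertQueries < 1) {
            throw new \InvalidArgumentException('Batch size must be at least 1.');
        }
    }

    public function getMaxBatchedInsertQueries(): int
    {
        return $this->maxBatchedInsertQueries;
    }
}
```

A global default could then live on the Configuration object (as suggested later in the thread) and be passed down when the persister is constructed, while an entity-level annotation could override it per class.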
Force-pushed from 77a0f3f to 3750a0b
Not a fan of doing this logic in the ORM directly. Batch insert should be contributed to DBAL first, and only then used by the ORM. We don't even know if this syntax is supported by all DBMSes.
This needs to target 2.8, and the batch size should be globally configurable via the Configuration class.
@guilhermeblanco does this work with the goals for Next?
I agree with @ostrolucky here. There was some initial work on DBAL; we should look into solving things there, IMHO.
If this PR brings a big performance gain, I would like to see how it was calculated/measured, and whether that measurement can be used by CI to avoid performance regressions.
I can add a test counting executed queries. An INSERT query typically takes <100 µs (i.e. a database that can insert 10k rows on a single connection/thread), while the RTT is around 500 µs (typical when the DB is not on the same machine), so there is roughly a 5x overhead per query. @SenseException is this the test you want?
@ostrolucky I can move the code of the INSERT query build to DBAL. Can
The logic for creating the SQL query can be dropped from the method; the whole method need not be dropped.
$isPostInsertId = $idGenerator->isPostInsertGenerator();

if ($isPostInsertId || $this->class->isVersioned) {
    return 1;
We can start naively here with 1, but we could be smarter depending on the post-generated identifier. When using auto-increment, certain RDBMS (such as MySQL and MariaDB, very likely PgSQL too) guarantee the ordering of incremental identifiers on a per-transaction basis. This means that, in case of concurrent writes, the first transaction would generate IDs 1 through X incrementally, and the second transaction would generate X+1 through Y, also incrementally.
But as an initial implementation, this is more than enough.
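The smarter handling hinted at above could, under favorable assumptions, look like this. It assumes a MySQL-style setup where the id returned for a multi-row INSERT is the one generated for the first row and the remaining ids are consecutive (this depends on server configuration, e.g. InnoDB's auto-increment lock mode, and is an assumption, not something the PR implements); the function name is hypothetical:

```php
<?php

// Hypothetical sketch: assign post-insert generated ids to a batch of
// entities by offsetting from the first generated id of the multi-row
// INSERT. ASSUMES the DB guarantees consecutive ids per statement.
function assignGeneratedIds(array $entities, int $firstInsertId): array
{
    $idsByObject = [];
    foreach (array_values($entities) as $offset => $entity) {
        // Row N of the batch received $firstInsertId + N.
        $idsByObject[spl_object_id($entity)] = $firstInsertId + $offset;
    }

    return $idsByObject;
}
```

If that consecutiveness cannot be guaranteed for a given platform, falling back to a batch size of 1 for post-insert id generators, as the PR does, is the safe choice.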
Yes, this is one of the goals of ORM 3. The insert logic could be either defined linearly as we do today (using strings) or with SQLStatement value objects, as I remember discussing with Steve Müller (deeky666, not @ 'ing him) in the past. Most of this logic could ideally be moved to DBAL, but it was very likely lost when the old develop branch there was removed.
In my opinion this should be done at the application level; at the library level it is an unneeded complication.
Force-pushed from 3750a0b to d25172c
Non-versioned, non-post-increment inserts can be grouped together and performed with far fewer queries.
Based on my measurements, batched inserts are about 50% faster with a local DB, and several times faster with the DB on another machine, due to fewer total RTTs (round-trip times) needed. This change substantially improves real-world performance for larger imports.
The provided implementation is fully backward compatible.
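The RTT saving described in the summary can be illustrated with a generic PDO-style sketch (table and column names are made up for the example): fifty single-row statements cost roughly fifty round trips, while one batched statement costs roughly one.

```php
<?php

// Illustration of the batching idea (hypothetical table/columns):
//
// Before: one round trip per row.
//   INSERT INTO users (name, rank) VALUES (?, ?)          -- executed N times
//
// After: one round trip for the whole chunk.
//   INSERT INTO users (name, rank) VALUES (?, ?), (?, ?), ...
$rows = [['Alice', 1], ['Bob', 2], ['Carol', 3]];

$placeholders = implode(', ', array_fill(0, count($rows), '(?, ?)'));
$sql          = "INSERT INTO users (name, rank) VALUES $placeholders";
$params       = array_merge(...$rows);

// With a real connection this would be a single execute():
// $stmt = $pdo->prepare($sql);
// $stmt->execute($params);
```

Here `$sql` becomes `INSERT INTO users (name, rank) VALUES (?, ?), (?, ?), (?, ?)` and `$params` is the flattened value list, which is why the per-query RTT overhead is paid once per chunk instead of once per row.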