
Improve insert performance with batched insert queries #8260

Open

mvorisek wants to merge 2 commits into base: 2.8.x from grouped_insert

Conversation

@mvorisek (Contributor) commented Sep 4, 2020:

Inserts for non-versioned entities that do not use post-insert ID generation can be grouped together and performed with a much lower query count.

Based on my measurements, batched inserts are about 50% faster with a local DB and several times faster with a DB on another machine, due to fewer total RTTs (round-trip times). This change significantly improves real-world performance for larger imports.

The provided implementation is fully backward compatible.
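For illustration, batching replaces one statement per row with a single multi-row statement (the table and column names below are made up):

```sql
-- before: one statement (and one round trip) per row
INSERT INTO my_entity (name, value) VALUES (?, ?);
INSERT INTO my_entity (name, value) VALUES (?, ?);
INSERT INTO my_entity (name, value) VALUES (?, ?);

-- after: up to the batch-size limit of rows per statement
INSERT INTO my_entity (name, value) VALUES (?, ?), (?, ?), (?, ?);
```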

@mvorisek changed the title from "Improve insert performance with batched insert" to "Improve insert performance with batched insert queries" on Sep 4, 2020
@mvorisek force-pushed the grouped_insert branch 2 times, most recently from 6370a74 to 237bfdd on September 4, 2020 18:16
@mvorisek marked this pull request as ready for review on September 4, 2020 18:18
@greg0ire (Member):
I don't think you should target 2.7, this does not qualify as a bugfix.

@mvorisek changed the base branch from 2.7 to 2.8.x on September 12, 2020 09:32
@greg0ire (Member) left a review:
Shouldn't there be extra tests, and documentation?

@@ -273,38 +271,41 @@ public function executeInserts()
         $stmt      = $this->conn->prepare($this->getInsertSQL());
         $tableName = $this->class->getTableName();

-        foreach ($this->queuedInserts as $entity) {
-            $insertData = $this->prepareInsertData($entity);
+        foreach (array_chunk($this->queuedInserts, $this->getMaxBatchedInsertQueries()) as $queuedInsertsChunk) {
Member:
As soon as you do that, you now have the contents of $this->queuedInserts twice in memory. Wouldn't it be better to extract the first $this->getMaxBatchedInsertQueries() elements out of that array with array_slice() until it is empty? I see it will be emptied anyway after the loop, on line 308.
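Something like this (a rough sketch of the suggestion, not the actual patch):

```php
// Extract one batch at a time instead of materializing all chunks up front
// with array_chunk(); only one chunk copy exists at any moment.
while ($this->queuedInserts !== []) {
    $batchSize           = $this->getMaxBatchedInsertQueries();
    $queuedInsertsChunk  = array_slice($this->queuedInserts, 0, $batchSize);
    $this->queuedInserts = array_slice($this->queuedInserts, $batchSize);

    // ... build and execute one batched INSERT for $queuedInsertsChunk ...
}
```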

@mvorisek (Contributor, author):
I think this is OK, as values in PHP are not duplicated unless modified (copy-on-write).

Member:
Doesn't look OK to me: https://3v4l.org/1LYhg
array_chunk() does not create a simple copy; it creates another array entirely.
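A standalone snippet makes the point (not part of the patch):

```php
$items = range(1, 100000);

$before = memory_get_usage();
$chunks = array_chunk($items, 50); // builds brand-new arrays, not COW aliases
$after  = memory_get_usage();

// The chunk arrays need their own storage, so usage grows by roughly the
// size of the original array (plus per-chunk overhead).
printf("%d bytes\n", $after - $before);
```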

    {
        if ($this->insertSql !== null) {
            return $this->insertSql;
        }
Member:
So now if we call this method several times in a row and there is no batching, we will re-compute the insertSQL every time?

@mvorisek (Contributor, author) commented Sep 12, 2020:
This is correct; I was not able to measure any significant performance difference. Also, other queries like UPDATE and DELETE are rebuilt from scratch every time.

Member:
It would be helpful if you could describe how you ran the performance tests, so the improvement can be verified. If this PR gets merged, we will need a way to prevent future regressions that would erase its performance gains.

Member:
I completely agree that caching the full insert SQL is not possible, but we should be able to store the columns and placeholders to avoid recomputing them every time. I recall a 2s slowdown in the whole test suite when this cache was dropped.
Actually, now that I think about it... we should still store the full insert SQL for a single entry. Also, we should store the column count, so we can easily build new sets of placeholders.

Now for UPDATE and DELETE, the reason they are not cached is that we're not doing full updates. If we were to be more in line with EclipseLink, Hibernate and EntityFramework, we would do full updates (where you update all columns, excluding LOB fields) and also cache this computation. For DELETE, ID-based removals would be OK too...
But this is another micro-optimization that could be done in a separate PR.
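A sketch of that idea (method and property names here are illustrative, not from the patch):

```php
// Cache the column list and the single-row placeholder group once per persister.
private function getInsertColumnsAndRowPlaceholders(): array
{
    if ($this->insertColumns === null) {
        $this->insertColumns         = $this->getInsertColumnList();
        $this->insertRowPlaceholders = '(' . implode(', ', array_fill(0, count($this->insertColumns), '?')) . ')';
    }

    return [$this->insertColumns, $this->insertRowPlaceholders];
}

// Only the VALUES list depends on the batch size, so building the SQL for
// any row count stays cheap.
private function getBatchedInsertSQL(int $rowCount): string
{
    [$columns, $rowPlaceholders] = $this->getInsertColumnsAndRowPlaceholders();

    return sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $this->class->getTableName(),
        implode(', ', $columns),
        implode(', ', array_fill(0, $rowCount, $rowPlaceholders))
    );
}
```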

            return 1;
        }

        return 50;
Member:
Magic number! Why 50? Shouldn't it be configurable?

@mvorisek (Contributor, author):
Yes, it should. Please advise how.

Member:
I would use a constructor argument

@SenseException (Member) commented Sep 14, 2020:
I agree here. Any fixed number may perform differently depending on the number of entities that need to be processed. Making it configurable, e.g. in the form of a constructor argument, would be an advantage.
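For example (a rough sketch; the real BasicEntityPersister constructor takes the EntityManager and ClassMetadata, and the extra argument is hypothetical):

```php
public function __construct(EntityManagerInterface $em, ClassMetadata $class, int $maxBatchedInsertQueries = 50)
{
    $this->em                      = $em;
    $this->class                   = $class;
    $this->maxBatchedInsertQueries = $maxBatchedInsertQueries; // hypothetical new option
    // ...
}
```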

@guilhermeblanco (Member) commented Sep 15, 2020:
This information should be set by a new annotation @BatchSize(size = 100) that is configured at the entity level.
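Usage could then look like this (the @BatchSize annotation is only a proposal; it does not exist in the ORM):

```php
/**
 * @Entity
 * @BatchSize(size=100)
 */
class Invoice
{
    // ...
}
```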

@ostrolucky (Member) left a review:
Not a fan of doing this logic in the ORM directly. Batch inserts should be contributed to DBAL first, and only then used by the ORM. We don't even know whether this syntax is supported by all DBMSes.
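For reference, multi-row VALUES support does differ across platforms, which is an argument for handling it in DBAL's platform layer:

```sql
-- Supported by MySQL/MariaDB, PostgreSQL, SQLite (3.7.11+) and SQL Server
-- (which caps a single VALUES list at 1000 rows):
INSERT INTO t (a, b) VALUES (1, 'x'), (2, 'y');

-- Oracle has no multi-row VALUES in this form; the closest equivalent is INSERT ALL:
INSERT ALL
    INTO t (a, b) VALUES (1, 'x')
    INTO t (a, b) VALUES (2, 'y')
SELECT 1 FROM DUAL;
```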

@beberlei (Member):
This needs to target 2.8, and the batch size should be globally configurable via the Configuration class.
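Something along these lines (setMaxBatchedInsertQueries() is hypothetical; it is not an existing Configuration method):

```php
$config = new \Doctrine\ORM\Configuration();
// ... metadata driver, proxy dir, cache, etc.
$config->setMaxBatchedInsertQueries(50); // hypothetical global setting

$em = \Doctrine\ORM\EntityManager::create($dbParams, $config);
```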

@beberlei (Member):
@guilhermeblanco does this work with goals for Next?

@lcobucci (Member):
I agree with @ostrolucky here. There was some initial work on DBAL. We should look into solving things there, IMHO.

@SenseException (Member) left a review:
If there is a big performance gain from this PR, I would like to see how it was calculated/measured, and whether this measurement can be used in CI to avoid performance regressions.

@mvorisek (Contributor, author):

> If there is a big performance gain from this PR, I would like to see how it was calculated/measured, and whether this measurement can be used in CI to avoid performance regressions.

I can add a test counting executed queries.
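A sketch of such a test using DBAL's DebugStack SQL logger (the entity and the expected count are made up):

```php
use Doctrine\DBAL\Logging\DebugStack;

$logger = new DebugStack();
$em->getConnection()->getConfiguration()->setSQLLogger($logger);

for ($i = 0; $i < 100; $i++) {
    $em->persist(new SimpleEntity()); // made-up non-versioned entity without post-insert ID
}

$em->flush();

$inserts = array_filter($logger->queries, static function (array $query): bool {
    return stripos($query['sql'], 'INSERT') === 0;
});

// With a batch size of 50, 100 queued rows should produce only 2 INSERTs.
self::assertCount(2, $inserts);
```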

A single INSERT query typically takes <100 µs (i.e. a database that can insert 10k rows per second on a single connection/thread).

When the RTT is around 500 µs (a typical RTT when the DB is not on the same machine), that adds a 5× overhead per query.
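Back-of-the-envelope with those numbers: 10,000 unbatched rows cost 10,000 × (100 + 500) µs ≈ 6 s, while batches of 50 need only 200 round trips, so roughly 10,000 × 100 µs + 200 × 500 µs ≈ 1.1 s, i.e. about 5× faster.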

@SenseException is this test what you want?

> Not a fan of doing this logic in the ORM directly. Batch inserts should be contributed to DBAL first, and only then used by the ORM. We don't even know whether this syntax is supported by all DBMSes.

@ostrolucky I can move the INSERT query building code to DBAL. Can the Doctrine\ORM\Persisters\Entity\BasicEntityPersister::getInsertSQL() method be dropped in 2.8.x then?

@ostrolucky (Member):

> Can the Doctrine\ORM\Persisters\Entity\BasicEntityPersister::getInsertSQL() method be dropped in 2.8.x then?

The logic for creating the SQL query can be dropped from the method, but not the whole method.

        $isPostInsertId = $idGenerator->isPostInsertGenerator();

        if ($isPostInsertId || $this->class->isVersioned) {
            return 1;
Member:
We can start naively here with 1, but we could be smarter depending on the post-generated identifier. When using auto-increment, certain RDBMS (such as MySQL and MariaDB, very likely PgSQL too) guarantee the ordering of incremented identifiers on a per-transaction basis. This means that with concurrent writes, the first transaction would get IDs 1 through X incrementally, and the second transaction would generate X+1 through Y, also incrementally.

But as an initial implementation, this is more than enough.
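For example, on MySQL/MariaDB (a sketch; consecutive allocation depends on innodb_autoinc_lock_mode):

```sql
INSERT INTO t (name) VALUES ('a'), ('b'), ('c');

-- LAST_INSERT_ID() returns the id generated for the FIRST inserted row;
-- with consecutive allocation the remaining rows got that value +1 and +2.
SELECT LAST_INSERT_ID();
```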

@guilhermeblanco (Member):

> @guilhermeblanco does this work with goals for Next?

Yes, this is one of the goals of ORM 3. The insert logic could either be defined linearly, as we do today (using strings), or using SQLStatement value objects like Statement\Insert, Statement\Update and Statement\Delete.

I remember talking about this with Steve Müller (deeky666, not @'ing him) in the past, as most of this logic could ideally be moved to DBAL, but it was very likely lost with the removal of the old develop branch there.

@Fedik (Contributor) commented Sep 15, 2020:

In my opinion this should be done at the application level; at the library level it is an unneeded complication.
I often use a similar approach for batch imports ("chunked" insert/update), but in a regular routine it does not make much sense.
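For context, the usual application-level pattern is the batch-processing recipe from the Doctrine docs (the entity here is made up):

```php
$batchSize = 50;

for ($i = 1; $i <= 10000; $i++) {
    $em->persist(new ImportRow($i)); // made-up entity

    if ($i % $batchSize === 0) {
        $em->flush();
        $em->clear(); // detach managed entities to keep memory bounded
    }
}

$em->flush();
$em->clear();
```

Note that this keeps UnitOfWork memory bounded but still issues one INSERT per row; the batching in this PR targets the per-statement round trips instead.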
