
Improve insert performance with batched insert queries #8260

Open

mvorisek wants to merge 2 commits into base: 2.8.x from grouped_insert

Conversation

@mvorisek (Contributor) commented Sep 4, 2020:

Inserts for non-versioned entities that do not use post-insert ID generation can be grouped together and performed with a much lower query count.

Based on my measurements, batched inserts are about 50% faster with a local DB and several times faster with a DB on another machine, due to fewer total RTTs (round-trip times). This change significantly improves real-world performance for larger imports.

The provided implementation is fully backward compatible.
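For illustration, batching replaces one statement per row with a single multi-row statement (the table and column names below are made up):

```sql
-- before: one statement (and one round trip) per row
INSERT INTO my_entity (name, value) VALUES (?, ?);
INSERT INTO my_entity (name, value) VALUES (?, ?);
INSERT INTO my_entity (name, value) VALUES (?, ?);

-- after: up to the batch-size limit of rows per statement
INSERT INTO my_entity (name, value) VALUES (?, ?), (?, ?), (?, ?);
```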

@mvorisek changed the title from "Improve insert performance with batched insert" to "Improve insert performance with batched insert queries" on Sep 4, 2020
@mvorisek force-pushed the grouped_insert branch 2 times, most recently from 6370a74 to 237bfdd on September 4, 2020 18:16
@mvorisek marked this pull request as ready for review on September 4, 2020 18:18
@greg0ire (Member):
I don't think you should target 2.7, this does not qualify as a bugfix.

@mvorisek changed the base branch from 2.7 to 2.8.x on September 12, 2020 09:32
@greg0ire (Member) left a review:
Shouldn't there be extra tests, and documentation?

@@ -273,38 +271,41 @@ public function executeInserts()
         $stmt      = $this->conn->prepare($this->getInsertSQL());
         $tableName = $this->class->getTableName();

-        foreach ($this->queuedInserts as $entity) {
-            $insertData = $this->prepareInsertData($entity);
+        foreach (array_chunk($this->queuedInserts, $this->getMaxBatchedInsertQueries()) as $queuedInsertsChunk) {
Member:
As soon as you do that, you now have the contents of $this->queuedInserts twice in memory. Wouldn't it be better to extract the first $this->getMaxBatchedInsertQueries() elements out of that array with array_slice() until it is empty? I see it will be emptied anyway after the loop, on line 308.
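Something like this (a rough sketch of the suggestion, not the actual patch):

```php
// Extract one batch at a time instead of materializing all chunks up front
// with array_chunk(); only one chunk copy exists at any moment.
while ($this->queuedInserts !== []) {
    $batchSize           = $this->getMaxBatchedInsertQueries();
    $queuedInsertsChunk  = array_slice($this->queuedInserts, 0, $batchSize);
    $this->queuedInserts = array_slice($this->queuedInserts, $batchSize);

    // ... build and execute one batched INSERT for $queuedInsertsChunk ...
}
```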

@mvorisek (Contributor, author):
I think this is OK, as values in PHP are not duplicated unless modified (copy-on-write).

Member:
Doesn't look OK to me: https://3v4l.org/1LYhg
array_chunk() does not create a simple copy; it creates another array entirely.
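A standalone snippet makes the point (not part of the patch):

```php
$items = range(1, 100000);

$before = memory_get_usage();
$chunks = array_chunk($items, 50); // builds brand-new arrays, not COW aliases
$after  = memory_get_usage();

// The chunk arrays need their own storage, so usage grows by roughly the
// size of the original array (plus per-chunk overhead).
printf("%d bytes\n", $after - $before);
```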

    {
        if ($this->insertSql !== null) {
            return $this->insertSql;
        }
Member:
So now if we call this method several times in a row and there is no batching, we will re-compute the insertSQL every time?

@mvorisek (Contributor, author) commented Sep 12, 2020:
This is correct; I was not able to measure any significant performance difference. Also, other queries like UPDATE and DELETE are rebuilt from scratch every time.

Member:
It would be helpful if you could describe how you ran the performance tests, so the improvement can be verified. If this PR gets merged, we will need a way to prevent future regressions that would erase its performance gains.

Member:
I completely agree that caching the full insert SQL is not possible, but we should be able to store the columns and placeholders to avoid recomputing them every time. I recall a 2s slowdown in the whole test suite when this cache was dropped.
Actually, now that I think about it... we should still store the full insert SQL for a single entry. Also, we should store the column count, so we can easily build new sets of placeholders.

Now for UPDATE and DELETE, the reason they are not cached is that we're not doing full updates. If we were to be more in line with EclipseLink, Hibernate and EntityFramework, we would do full updates (where you update all columns, excluding LOB fields) and also cache this computation. For DELETE, ID-based removals would be OK too...
But this is another micro-optimization that could be done in a separate PR.
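A sketch of that idea (method and property names here are illustrative, not from the patch):

```php
// Cache the column list and the single-row placeholder group once per persister.
private function getInsertColumnsAndRowPlaceholders(): array
{
    if ($this->insertColumns === null) {
        $this->insertColumns         = $this->getInsertColumnList();
        $this->insertRowPlaceholders = '(' . implode(', ', array_fill(0, count($this->insertColumns), '?')) . ')';
    }

    return [$this->insertColumns, $this->insertRowPlaceholders];
}

// Only the VALUES list depends on the batch size, so building the SQL for
// any row count stays cheap.
private function getBatchedInsertSQL(int $rowCount): string
{
    [$columns, $rowPlaceholders] = $this->getInsertColumnsAndRowPlaceholders();

    return sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $this->class->getTableName(),
        implode(', ', $columns),
        implode(', ', array_fill(0, $rowCount, $rowPlaceholders))
    );
}
```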

            return 1;
        }

        return 50;
Member:
Magic number! Why 50? Shouldn't it be configurable?

@mvorisek (Contributor, author):
Yes, it should. Please advise how.

Member:
I would use a constructor argument

@SenseException (Member) commented Sep 14, 2020:
I agree here. Any fixed number may perform differently depending on the number of entities that need to be processed. Making it configurable, e.g. in the form of a constructor argument, would be an advantage.
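For example (a rough sketch; the real BasicEntityPersister constructor takes the EntityManager and ClassMetadata, and the extra argument is hypothetical):

```php
public function __construct(EntityManagerInterface $em, ClassMetadata $class, int $maxBatchedInsertQueries = 50)
{
    $this->em                      = $em;
    $this->class                   = $class;
    $this->maxBatchedInsertQueries = $maxBatchedInsertQueries; // hypothetical new option
    // ...
}
```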

@guilhermeblanco (Member) commented Sep 15, 2020:
This information should be set by a new annotation @BatchSize(size = 100) that is configured at the entity level.
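Usage could then look like this (the @BatchSize annotation is only a proposal; it does not exist in the ORM):

```php
/**
 * @Entity
 * @BatchSize(size=100)
 */
class Invoice
{
    // ...
}
```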

@ostrolucky (Member) left a review:
Not a fan of doing this logic in the ORM directly. Batch inserts should be contributed to DBAL first, and only then used by the ORM. We don't even know whether this syntax is supported by all DBMSes.
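For reference, multi-row VALUES support does differ across platforms, which is an argument for handling it in DBAL's platform layer:

```sql
-- Supported by MySQL/MariaDB, PostgreSQL, SQLite (3.7.11+) and SQL Server
-- (which caps a single VALUES list at 1000 rows):
INSERT INTO t (a, b) VALUES (1, 'x'), (2, 'y');

-- Oracle has no multi-row VALUES in this form; the closest equivalent is INSERT ALL:
INSERT ALL
    INTO t (a, b) VALUES (1, 'x')
    INTO t (a, b) VALUES (2, 'y')
SELECT 1 FROM DUAL;
```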

@beberlei (Member):
This needs to target 2.8, and the batch size should be globally configurable via the Configuration class.
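Something along these lines (setMaxBatchedInsertQueries() is hypothetical; it is not an existing Configuration method):

```php
$config = new \Doctrine\ORM\Configuration();
// ... metadata driver, proxy dir, cache, etc.
$config->setMaxBatchedInsertQueries(50); // hypothetical global setting

$em = \Doctrine\ORM\EntityManager::create($dbParams, $config);
```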

@beberlei (Member):
@guilhermeblanco does this work with goals for Next?

@lcobucci (Member):
I agree with @ostrolucky here. There was some initial work on DBAL. We should look into solving things there, IMHO.

@SenseException (Member) left a review:
If there is a big performance gain from this PR, I would like to see how it was calculated/measured, and whether this measurement can be used in CI to avoid performance regressions.

@mvorisek (Contributor, author):

> If there is a big performance gain from this PR, I would like to see how it was calculated/measured, and whether this measurement can be used in CI to avoid performance regressions.

I can add a test counting executed queries.
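A sketch of such a test using DBAL's DebugStack SQL logger (the entity and the expected count are made up):

```php
use Doctrine\DBAL\Logging\DebugStack;

$logger = new DebugStack();
$em->getConnection()->getConfiguration()->setSQLLogger($logger);

for ($i = 0; $i < 100; $i++) {
    $em->persist(new SimpleEntity()); // made-up non-versioned entity without post-insert ID
}

$em->flush();

$inserts = array_filter($logger->queries, static function (array $query): bool {
    return stripos($query['sql'], 'INSERT') === 0;
});

// With a batch size of 50, 100 queued rows should produce only 2 INSERTs.
self::assertCount(2, $inserts);
```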

A single INSERT query typically takes <100 µs (i.e. a database that can insert 10k rows per second on a single connection/thread).

When the RTT is around 500 µs (a typical RTT when the DB is not on the same machine), that adds a 5× overhead per query.
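Back-of-the-envelope with those numbers: 10,000 unbatched rows cost 10,000 × (100 + 500) µs ≈ 6 s, while batches of 50 need only 200 round trips, so roughly 10,000 × 100 µs + 200 × 500 µs ≈ 1.1 s, i.e. about 5× faster.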

@SenseException is this test what you want?

> Not a fan of doing this logic in the ORM directly. Batch inserts should be contributed to DBAL first, and only then used by the ORM. We don't even know whether this syntax is supported by all DBMSes.

@ostrolucky I can move the INSERT query building code to DBAL. Can the Doctrine\ORM\Persisters\Entity\BasicEntityPersister::getInsertSQL() method be dropped in 2.8.x then?

@ostrolucky (Member):

> Can the Doctrine\ORM\Persisters\Entity\BasicEntityPersister::getInsertSQL() method be dropped in 2.8.x then?

The logic for creating the SQL query can be dropped from the method, but not the whole method.

        $isPostInsertId = $idGenerator->isPostInsertGenerator();

        if ($isPostInsertId || $this->class->isVersioned) {
            return 1;
Member:
We can start naively here with 1, but we could be smarter depending on the post-generated identifier. When using auto-increment, certain RDBMS (such as MySQL and MariaDB, very likely PgSQL too) guarantee the ordering of incremented identifiers on a per-transaction basis. This means that with concurrent writes, the first transaction would get IDs 1 through X incrementally, and the second transaction would generate X+1 through Y, also incrementally.

But as an initial implementation, this is more than enough.
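For example, on MySQL/MariaDB (a sketch; consecutive allocation depends on innodb_autoinc_lock_mode):

```sql
INSERT INTO t (name) VALUES ('a'), ('b'), ('c');

-- LAST_INSERT_ID() returns the id generated for the FIRST inserted row;
-- with consecutive allocation the remaining rows got that value +1 and +2.
SELECT LAST_INSERT_ID();
```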

@guilhermeblanco (Member):

> @guilhermeblanco does this work with goals for Next?

Yes, this is one of the goals of ORM 3. The insert logic could either be defined linearly, as we do today (using strings), or using SQLStatement value objects like Statement\Insert, Statement\Update and Statement\Delete.

I remember talking about this with Steve Müller (deeky666, not @'ing him) in the past, as most of this logic could ideally be moved to DBAL, but it was very likely lost with the removal of the old develop branch there.

@Fedik (Contributor) commented Sep 15, 2020:

In my opinion this should be done at the application level; at the library level it is an unneeded complication.
I often use a similar approach for batch imports ("chunked" insert/update), but in a regular routine it does not make much sense.
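For context, the usual application-level pattern is the batch-processing recipe from the Doctrine docs (the entity here is made up):

```php
$batchSize = 50;

for ($i = 1; $i <= 10000; $i++) {
    $em->persist(new ImportRow($i)); // made-up entity

    if ($i % $batchSize === 0) {
        $em->flush();
        $em->clear(); // detach managed entities to keep memory bounded
    }
}

$em->flush();
$em->clear();
```

Note that this keeps UnitOfWork memory bounded but still issues one INSERT per row; the batching in this PR targets the per-statement round trips instead.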
