
Optimize DBAFS file sync #672

Open — wants to merge 2 commits into base: master
Conversation

@m-vo (Contributor) commented Aug 25, 2019

The DBAFS file sync gets slow with lots of files. Here is an idea on how to improve it without changing the sync procedure (spoiler alert: only works for InnoDB tables).

Background

I profiled the sync routine for a bit: nearly all of the time is spent on the actual inserts into and updates of the database (e.g. https://github.com/contao/contao/blob/master/core-bundle/src/Resources/contao/library/Contao/Dbafs.php#L576-L588). Using models vs. native queries doesn't make a big difference; lots of individual single-row inserts just aren't that efficient in MySQL. 😄 If we wrap the whole thing in a transaction, things can be handled much more efficiently (think of allocation or index manipulation, for instance). Sidenote: performance does not improve if the table locking is left in place (I'm not exactly sure why), but locking shouldn't be needed anymore in this case.
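The idea can be sketched roughly like this, using Contao's legacy `Database` API (the `$arrFiles` loop and row data are hypothetical placeholders for the sync's per-file writes; error handling is reduced to a rollback):

```php
<?php

use Contao\Database;

$objDatabase = Database::getInstance();

// Wrapping all single-row writes in one transaction lets InnoDB
// defer flushing and index maintenance until commit, instead of
// paying that cost for every individual insert.
$objDatabase->beginTransaction();

try {
    foreach ($arrFiles as $arrFile) { // $arrFiles: hypothetical sync data
        $objDatabase
            ->prepare("INSERT INTO tl_files %s")
            ->set($arrFile)
            ->execute();
    }

    $objDatabase->commitTransaction();
} catch (\Throwable $e) {
    $objDatabase->rollbackTransaction();

    throw $e;
}
```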

This only works for InnoDB tables, of course; for MyISAM we need to fall back to table locking. There is probably a nicer way to find out which engine is being used or whether transactions are available.
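One possible way to detect the engine (an assumption on my part, not something this PR implements) would be to ask `information_schema` which storage engine backs `tl_files` and branch on that:

```php
<?php

use Contao\Database;

$objDatabase = Database::getInstance();

// Look up the storage engine of tl_files in the current schema.
// information_schema.TABLES is available in both MySQL and MariaDB.
$objResult = $objDatabase
    ->prepare("SELECT ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = ?")
    ->execute('tl_files');

if ('InnoDB' === $objResult->ENGINE) {
    // InnoDB supports transactions
    $objDatabase->beginTransaction();
} else {
    // MyISAM has no transactions, so fall back to table locking
    $objDatabase->lockTables(array('tl_files' => 'WRITE'));
}
```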

Comparison

Here are some measurements (synchronizing 1600 images, around 970MB) on my local machine. I triggered the synchronization via the console and simply measured the time that Dbafs::syncFiles(); took to complete.

Original:

Time spent: 16.840146064758s | Memory peak usage: 21.987663269043MiB
Synchronization complete (see system/tmp/d2d28f47cb00fbddd47a8bcab78b345c).

With this PR:

Time spent: 4.7125709056854s | Memory peak usage: 21.98282623291MiB
Synchronization complete (see system/tmp/34f891232625a5d3eea230725ddb60c1).

In my test this reduced the duration to around 28% of the original.

@leofeyer leofeyer added the feature label Aug 26, 2019

@leofeyer leofeyer added this to the 4.9 milestone Aug 26, 2019

$objDatabase->beginTransaction();
} else {
$objDatabase->lockTables(array('tl_files'=>'WRITE'));
}

@ausi (Member) commented Sep 3, 2019

Couldn't we use both table locking and transactions at the same time?

$objDatabase->lockTables(array('tl_files'=>'WRITE'));
$objDatabase->beginTransaction();
// ...
$objDatabase->commitTransaction();
$objDatabase->unlockTables();

This way we wouldn't need to detect whether transactions are supported at all, and I think the performance gains should be similar?

@m-vo (Author, Contributor) commented Sep 3, 2019

Nope, I tried that as well. Unfortunately, as soon as table locking is in place, the whole performance benefit goes away again. (See the sidenote in my initial description.)

@m-vo (Author, Contributor) commented Sep 14, 2019

@ausi @leofeyer

I just tried this again in my new implementation (locking together with a transaction), and there it didn't make a huge difference. I'm not sure why this wasn't the case in my earlier tests here. I'm going to change it accordingly; maybe one of you could run a test as well. I don't want to run into a problem caused by database config intricacies 😄.

@m-vo (Author, Contributor) commented Sep 8, 2019

@fritzmg Ok you got me.

I started working on a new implementation of the DBAFS (as a service, with tests…). In my latest tests, it seems we can reduce the time even further, to the point where the MD5 hashing becomes the bottleneck (at around 0.4 GB/s).

| Version | t [s] | % |
| ------- | ----- | --- |
| Current | 16.84 | 100 |
| This PR | 4.71 | 28 |
| New | 3.34 | 20 |

btw:

- Do we need the `found` flag in `tl_files` for anything other than the way the current implementation works?
- Is there a reason the current implementation forces me to execute the sync twice in order to update the folder hashes?
@ausi (Member) commented Sep 9, 2019

> Is there a reason the current implementation forces me to execute the sync twice in order to update the folder hashes?

The folder hashes have to be generated from the bottom up after the file hashes are calculated. We changed that in contao/core#8856 to fix a performance issue. Maybe this is related? Can you point me to the code where the sync currently gets run twice?

@m-vo (Author, Contributor) commented Sep 9, 2019

I'll have to dive into the old code to find out. Currently I'm running my implementation against the old one to see whether they differ.

It's easy to reproduce, though: add some folders/files, sync everything (back end), then move a folder to another place and sync again → the file hashes get updated, but the folder hashes only change if you sync once more.


By the way, I just found out: with parallel execution in place (amphp/parallel), the hash generation could even be four times as fast 😎…

@leofeyer (Member) commented Sep 10, 2019

> I started working on a new implementation of the DBAFS (as a service, with tests…).

I was actually going to merge this PR. Are you sure a re-implementation is worth the effort?

@m-vo (Author, Contributor) commented Sep 10, 2019

Yeah, I think it's fine to merge this one, as it's only a small change. Can we improve the check for the engine somehow?

I think having tests, better console output, getting rid of some legacy things and another good performance gain is still worthwhile. I'll open a draft PR in a few days; we can discuss the broader changes there. What do you think?

@leofeyer (Member) commented Sep 11, 2019

I was able to decrease the synchronization time on contao.org from 49s to 8s (≈ −84%). That's already huge! 💪

@m-vo m-vo force-pushed the m-vo:feature/optimize-filesync branch from cad8c23 to c3bfa09 Sep 14, 2019

@m-vo m-vo referenced this pull request Sep 15, 2019