Optimize DBAFS file sync #672
The DBAFS file sync gets slow with lots of files. Here is an idea on how to improve it without changing the sync procedure (spoiler alert: only works for InnoDB tables).
I profiled the sync routine for a bit: nearly all of the time is spent inserting into or updating the database (e.g. https://github.com/contao/contao/blob/master/core-bundle/src/Resources/contao/library/Contao/Dbafs.php#L576-L588). Using models vs. native queries doesn't make a big difference. Lots of individual single-row inserts just aren't that efficient in MySQL.
This only works for InnoDB tables of course - for MyISAM we need to fall back to table locking. There probably is a nicer way to find out which engine is being used or if transactions are available.
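To illustrate the idea of committing once per batch instead of once per row, here is a minimal sketch in Python. It uses the standard-library `sqlite3` module purely as a stand-in for MySQL/InnoDB, and it is not the code from this PR. The function name `insert_files` and the reduced `tl_files (path, hash)` schema are assumptions made for the example.

```python
import sqlite3


def insert_files(conn, rows, single_transaction=True):
    """Insert (path, hash) rows, committing once per batch or once per row.

    Hypothetical helper: sqlite3 stands in for MySQL/InnoDB here.
    """
    cur = conn.cursor()
    try:
        for path, file_hash in rows:
            cur.execute(
                "INSERT INTO tl_files (path, hash) VALUES (?, ?)",
                (path, file_hash),
            )
            if not single_transaction:
                conn.commit()  # one commit per row: the slow pattern
        if single_transaction:
            conn.commit()  # one commit for the whole batch
    except sqlite3.Error:
        conn.rollback()  # discard whatever has not been committed yet
        raise


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Assumed, heavily reduced schema for the example.
    conn.execute("CREATE TABLE tl_files (path TEXT PRIMARY KEY, hash TEXT)")
    rows = [(f"files/img{i}.jpg", f"{i:032x}") for i in range(1600)]
    insert_files(conn, rows)
    print(conn.execute("SELECT COUNT(*) FROM tl_files").fetchone()[0])
```

With a single transaction the database only has to make the changes durable once, which is the effect the transaction approach relies on. How large the gain is depends on the engine and its configuration.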
Here are some measurements (synchronizing 1600 images, around 970MB) on my local machine. I triggered the synchronization via the console and simply measured the time that
With this PR:
In my test this reduced the time to around 28% of the original duration.
@fritzmg Ok you got me.
I started working on a new implementation of the DBAFS (as a service, with tests...). In my latest tests it seems we can reduce the time even more so that the
The folder hashes have to be generated from the bottom up after the file hashes are calculated. We changed that in contao/core#8856 to fix a performance issue. Maybe this is related? Can you point me to the code where the sync currently gets run twice?
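The bottom-up order can be sketched roughly like this in Python. This is not Contao's implementation: the function name `folder_hashes` is hypothetical, paths are assumed to be relative, and building a folder hash as the MD5 of its sorted child names and child hashes is an assumption made only to keep the example self-contained.

```python
import hashlib
from pathlib import PurePosixPath


def folder_hashes(file_hashes: dict[str, str]) -> dict[str, str]:
    """Derive folder hashes bottom-up from a {relative file path: hash} map."""
    children: dict[str, dict[str, str]] = {}  # folder -> {child name: hash}
    folders: set[str] = set()

    for path, digest in file_hashes.items():
        p = PurePosixPath(path)
        children.setdefault(str(p.parent), {})[p.name] = digest
        for ancestor in p.parents:
            if str(ancestor) != ".":
                folders.add(str(ancestor))

    result: dict[str, str] = {}
    # Deepest folders first, so every child hash exists before its parent needs it.
    by_depth = sorted(folders, key=lambda f: len(PurePosixPath(f).parts), reverse=True)
    for folder in by_depth:
        entries = sorted(children.get(folder, {}).items())
        # Assumed hash recipe: concatenate child names and hashes.
        payload = "".join(name + digest for name, digest in entries)
        digest = hashlib.md5(payload.encode()).hexdigest()
        result[folder] = digest
        # Register this folder as a child of its own parent.
        f = PurePosixPath(folder)
        children.setdefault(str(f.parent), {})[f.name] = digest
    return result
```

Because each folder is processed only after all of its subfolders, moving a file or folder changes every affected ancestor hash in a single pass.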
I'll have to dive into the old code to find out. Currently I'm running my implementation against the old one to see if they differ.
It's easy to reproduce, though: add some folders/files, sync everything (BE), then move a folder to another place and sync again → the file hashes get updated, but the folder hashes only if you sync again.
btw., just found out: with parallel execution in place (amphp/parallel) the hash generation could even be four times as fast
Yeah, I think it's fine to merge this one as it's only a small change. Can we improve checking for the engine somehow?
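One possible way to check the engine is to ask MySQL's `information_schema`, which lists each table's engine and whether that engine supports transactions. The sketch below is in Python and only shows the query and the decision. The function name `batching_strategy`, the table name `tl_files`, the `%s` placeholder style and the DB-API cursor are assumptions, not part of this PR.

```python
# Looks up the table's storage engine and whether that engine is transactional.
ENGINE_QUERY = (
    "SELECT t.ENGINE, e.TRANSACTIONS "
    "FROM information_schema.TABLES t "
    "JOIN information_schema.ENGINES e ON e.ENGINE = t.ENGINE "
    "WHERE t.TABLE_SCHEMA = DATABASE() AND t.TABLE_NAME = %s"
)


def batching_strategy(cursor, table="tl_files"):
    """Return "transaction" or "lock_tables" for the given table.

    `cursor` is assumed to be a DB-API cursor from a MySQL driver that
    uses the "%s" parameter style; the table name is an assumption too.
    """
    cursor.execute(ENGINE_QUERY, (table,))
    row = cursor.fetchone()
    if row is None:
        raise LookupError(f"table {table!r} not found in current schema")
    _engine, transactions = row
    # information_schema.ENGINES reports TRANSACTIONS as 'YES' or 'NO'.
    return "transaction" if transactions == "YES" else "lock_tables"
```

Checking the `TRANSACTIONS` column instead of comparing the engine name against a fixed string would also cover other transactional engines.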
I think having tests, better console output, getting rid of some legacy things and another good performance gain is still a good thing. I'll make a PR with a draft in some days - we can discuss the broader changes there. Wdyt?