
Optimize DBAFS file sync #672

Merged
merged 2 commits on Oct 7, 2019

Conversation

m-vo
Member

@m-vo m-vo commented Aug 25, 2019

The DBAFS file sync gets slow with lots of files. Here is an idea for improving it without changing the sync procedure (spoiler alert: it only works for InnoDB tables).

Background

I profiled the sync routine for a bit: nearly all of the time is spent on the actual inserts into and updates of the database (e.g. https://github.com/contao/contao/blob/master/core-bundle/src/Resources/contao/library/Contao/Dbafs.php#L576-L588). Using models vs. native queries doesn't make a big difference. Lots of single inserts just aren't that efficient in MySQL. 😄 If we wrap the whole thing in a transaction, things can be handled much more efficiently (think of allocation or index manipulation, for instance). Side note: it doesn't get any faster when the table locking is left in place (not exactly sure why), but locking shouldn't be needed anymore in this case.

This only works for InnoDB tables, of course - for MyISAM we need to fall back to table locking. There is probably a nicer way to find out which engine is being used or whether transactions are available.
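The idea could be sketched roughly like this, using Doctrine DBAL (which Contao uses). The helper name and the engine check are hypothetical; this is not the actual PR code:

```php
<?php

use Doctrine\DBAL\Connection;

/**
 * Runs the sync callback inside a transaction on InnoDB, falling back
 * to table locking on engines without transaction support (e.g. MyISAM).
 * Hypothetical sketch, not the actual PR code.
 */
function runSync(Connection $connection, callable $syncCallback): void
{
    $engine = $connection->fetchOne(
        'SELECT ENGINE FROM information_schema.TABLES
         WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = ?',
        ['tl_files']
    );

    if ('InnoDB' === $engine) {
        // All inserts/updates become one atomic unit of work, so MySQL
        // can batch index maintenance instead of paying for it per row.
        $connection->beginTransaction();

        try {
            $syncCallback();
            $connection->commit();
        } catch (\Throwable $e) {
            $connection->rollBack();
            throw $e;
        }
    } else {
        // MyISAM: no transactions, keep the old table locking behavior
        $connection->executeStatement('LOCK TABLES tl_files WRITE');

        try {
            $syncCallback();
        } finally {
            $connection->executeStatement('UNLOCK TABLES');
        }
    }
}
```

Querying `information_schema` is just one way to detect the engine; as noted above, there may well be a nicer one.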

Comparison

Here are some measurements (synchronizing 1600 images, around 970 MB) on my local machine. I triggered the synchronization via the console and simply measured the time that Dbafs::syncFiles() took to complete.
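The measurement itself can be as simple as wrapping the call; a minimal sketch (the exact console command used is not shown in this thread):

```php
<?php

use Contao\Dbafs;

// Simple wall-clock and peak-memory measurement around the sync call
$start = microtime(true);

Dbafs::syncFiles();

printf(
    "Time spent: %ss | Memory peak usage: %sMiB\n",
    microtime(true) - $start,
    memory_get_peak_usage(true) / 1024 / 1024
);
```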

Original:

Time spent: 16.840146064758s | Memory peak usage: 21.987663269043MiB
Synchronization complete (see system/tmp/d2d28f47cb00fbddd47a8bcab78b345c).

With this PR:

Time spent: 4.7125709056854s | Memory peak usage: 21.98282623291MiB
Synchronization complete (see system/tmp/34f891232625a5d3eea230725ddb60c1).

In my test this reduced the time to around 28% of the original duration.

@leofeyer leofeyer added this to the 4.9 milestone Aug 26, 2019
@m-vo
Member Author

m-vo commented Sep 8, 2019

@fritzmg Ok you got me.

I started working on a new implementation of the DBAFS (as a service, with tests...). In my latest tests it seems we can reduce the time even more so that the md5 hashing becomes the bottleneck (with around ~0.4GB/s).

| Version | t [s] | % |
| ------- | ----: | -: |
| Current | 16.84 | 100 |
| This PR | 4.71 | 28 |
| New | 3.34 | 20 |

btw:

  • Do we need the found flag in tl_files for anything other than the way the current implementation works?
  • Is there a reason the current implementation forces me to execute the sync twice in order to update the folder hashes?

@ausi
Member

ausi commented Sep 9, 2019

Is there a reason the current implementation forces me to execute the sync twice in order to update the folder hashes?

The folder hashes have to be generated from the bottom up after the file hashes are calculated. We changed that in contao/core#8856 to fix a performance issue. Maybe this is related? Can you point me to the code where the sync currently gets run twice?
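The bottom-up order matters because a folder's hash is derived from the hashes of its children, so those must exist first. A sketch of the idea (the hash composition here is illustrative, not Contao's exact scheme, and the real DBAFS works on database rows rather than an in-memory recursion):

```php
<?php

/**
 * Computes folder hashes bottom-up: a folder's hash is derived from the
 * (already computed) hashes of its children. Illustrative only.
 */
function hashFolder(string $dir): string
{
    $childHashes = [];

    foreach (scandir($dir) as $entry) {
        if ('.' === $entry || '..' === $entry) {
            continue;
        }

        $path = $dir.'/'.$entry;

        // Recurse into subfolders first, hash files directly
        $childHashes[] = $entry.':'.(is_dir($path) ? hashFolder($path) : md5_file($path));
    }

    // Children first, then the parent - hence "bottom up"
    return md5(implode("\0", $childHashes));
}
```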

@m-vo
Member Author

m-vo commented Sep 9, 2019

I'll have to dive into the old code to find out. Currently I'm running my implementation against the old one to see if they differ.

It's easy to reproduce, though: add some folders/files, sync everything (back end), then move a folder to another place and sync again → the file hashes get updated, but the folder hashes only update once you sync a second time.


btw., I just found out: with parallel execution in place (amphp/parallel), the hash generation could even be four times as fast 😎...
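Parallel hashing with amphp could look roughly like this, using the amphp/parallel-functions wrapper on top of amphp/parallel (the file list is hypothetical; this is a sketch, not the new implementation):

```php
<?php

use function Amp\ParallelFunctions\parallelMap;
use function Amp\Promise\wait;

// Hash files concurrently in a worker pool instead of sequentially.
// Requires amphp/parallel-functions; $files is a hypothetical path list.
$files = ['files/a.jpg', 'files/b.jpg', 'files/c.jpg'];

$hashes = wait(parallelMap($files, static function (string $path): string {
    return md5_file($path);
}));
```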

@leofeyer
Member

I started working on a new implementation of the DBAFS (as a service, with tests...).

I was actually going to merge this PR. Are you sure a re-implementation is worth the effort?

@m-vo
Member Author

m-vo commented Sep 10, 2019

Yeah, I think it's fine to merge this one as it is only a small change. Can we improve the check for the engine somehow?

I think having tests, better console output, getting rid of some legacy things and another good performance gain is still worthwhile. I'll open a draft PR in a few days - we can discuss the broader changes there. Wdyt?

@leofeyer
Member

I was able to decrease the synchronization time on contao.org from 49s to 8s (≈-84%). That's already huge! 💪

@m-vo m-vo mentioned this pull request Sep 15, 2019
@leofeyer
Member

leofeyer commented Oct 7, 2019

Thank you @m-vo.

@m-vo m-vo deleted the feature/optimize-filesync branch January 2, 2021 12:06

6 participants