Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve --single-directory sync performance #992

Merged
merged 2 commits into from
Jul 21, 2020
Merged

Improve --single-directory sync performance #992

merged 2 commits into from
Jul 21, 2020

Conversation

kapec94
Copy link
Contributor

@kapec94 kapec94 commented Jul 14, 2020

Hello @abraunegg,
I've prepared a PR that immensely improves performance of --single-directory and sync_list synchronizations, especially those that require iterating through full OneDrive tree.

Disclaimer

Please note that this is simply a change proposal that may or may not be acceptable yet, but I'm willing to cooperate to make it better.

What does this change include?

This PR moves oneDriveMovedNotDeleted check part from sync.d:1597 from the beginning of the code block just before idsToDelete modification in sync.d:1643. What it does is that onedrive.getPathDetailsById invocation is only made if it can really make an impact whether a change item is to be discarded or deleted in synced directory.

Why is the change made?

Such change is important, because onedrive.getPathDetailsById invocation there is responsible for almost 90% of the time spent inside that function if there are a lot of discarded change items (I've done some profiling with valgrind). Example: on my 200 GB OneDrive: syncing --single-directory of a folder with one file took almost 12 hours before I stopped it out of pity (and it didn't finish!) (all the time was spent discarding changes).

How&why does this change works?

Basically said code block makes three mostly independent checks:

  1. Whether the changed item still exists on OneDrive
  2. Whether the changed item is already in local database
  3. Whether the changed item's local path matches the synced path.

Item is only deleted from synced folder if all of those checks are true. In every other case the change is discarded. So no matter the result of check 1., change will be discarded if check 2. and check 3. are false. The solution made is to make check 1. (which is extremely expensive) execute only after checks 2. and 3. pass.

That way changes that would be discarded because of check 2. or 3. are processed much faster. 10x faster.

What is the benefit?

  • faster synchronization code, especially for big drives, small sync_lists and resyncs,
  • less network traffic.

Is the change tested?

Well... Didn't find any automated tests around, so just made several smoke tests, like checking if the code works and it works the same for basic --single-directory operations like move, delete, out-of-folder delete. It does work and it does work the same. But I didn't manage to trigger oneDriveMovedNotDeleted check code path in any case.

Notes

I've just made a PR, instead of first reporting a ticket, because I needed to have this made change now for my private project needs.

I would also like to discuss with you several findings around the code I've modified:

  • I'm not sure how returns in oneDriveMovedNotDeleted finally blocks affect the code (for example sync.d:1605). If such return statement is executed, then applyDifferences function is interrupted and changes are not further processed. Are those returns there for any particular purpose?
  • I have some doubts about canFind(originalLocalPath, syncFolderChildPath) call in sync.d:1639. originalLocalPath evaluates to the file path that is not prefixed by /drive/root, as syncFolderChildPath variable is, so that canFind call always seem to return false. Unless I'm missing something?

I would really appreciate getting in touch with you to try making that change. I feel like it has a potential to make your tool's experience even greater.

This commit modifies code inside applyDifferences function,
the part responsible for deleting changes/moving them out
of the selected sync directories. This change makes an HTTP
request responsible for checking whether a changed item still
exists on OneDrive only be sent only if it may in fact influence
whether an item will be deleted from the synced folder or not
or just discarded.

This makes an enormous performance boost, because it limits
redundant HTTP requests that ask about changed items that will
be discarded or not.

Signed-off-by: Mateusz P-K <mateusz.kaplon@gmail.com>
@abraunegg
Copy link
Owner

@kapec94
Many thanks for the detailed analysis and work you have done here. It is GREATLY aprpeciated.

This PR moves oneDriveMovedNotDeleted check part ...

The premise behind this was when files / folders were moved either into scope of sync_list or now out of scope. With this code change (will test shortly) to see if this is still working correctly for moving files into / out of scope and correctly deleting / adding locally, but always remaining on OneDrive. Those files should only be deleted on OneDrive if deleted locally, when in sync scope.

I'm not sure how returns in oneDriveMovedNotDeleted finally blocks affect the code ...

I think you are right here. Those returns should not be there, and if executed, if there were any other changes to process, those returns would prevent them from being processed.

I have some doubts about canFind(originalLocalPath, syncFolderChildPath) call in ....

This call should be performing a 'search match' to see if 'originalLocalPath' is anywhere in 'syncFolderChildPath'. Depending on account types (business / personal) and folder types (local in account OneDrive, remote in another users account). This only ever gets set if the item object from OneDrive contains a 'parentReference' - which at this stage only ever is set when a shared folder is added to the users account via the 'Add to my OneDrive' web function (caveat: In testing business accounts / Sharepoint - this is also set) .. so yes ... 99% of the time this should return false, unless the item is from a shared folder (or one of the business accounts that add the data in - MS Graph API is at times very inconsistent).

Well... Didn't find any automated tests around ...

Yes ... I know. I have some basic testing via the Travis CI process (which is also why this PR is showing up as failed because you do not have access to the private key) - see #711 for further details.

Basically, it would be ideal to have a script or process that can generate data / config & test each of the options to ensure that nothing is broken. This way anyone could run those tests either with their own account or they create a test account. The challenge here would be 'shared folder' testing, which right now I have to do all manually with a few 'test' accounts that I have created.

@abraunegg abraunegg added this to the v2.4.4 milestone Jul 21, 2020
@abraunegg abraunegg self-requested a review July 21, 2020 04:45
abraunegg
abraunegg previously approved these changes Jul 21, 2020
* Remove erroneous 'return' statement which could prematurely end processing all changes returned from OneDrive
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators May 27, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants