Speed up publish when no media files were altered #56

Closed
frankenjoe opened this issue Apr 28, 2021 · 6 comments · Fixed by #216
Labels
publish question Further information is requested

Comments

@frankenjoe
Collaborator

Currently, publish() always checks whether media was changed. To do that, the checksum of every media file in the database has to be calculated, which can take quite a while on large databases. However, most of the time only the metadata is changed and maybe new media is added; it is rather rare that existing media changes. So I wonder if we should give the user the option to skip the test for altered media.
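
To illustrate why this check scales with the size of the database, here is a minimal sketch of a checksum-based comparison. It is not the actual publish() code: the function names, the use of MD5, and the `previous_checksums` mapping are assumptions made for the example.

```python
import hashlib
import os


def file_checksum(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def find_altered_media(root, previous_checksums):
    """Return relative paths whose current checksum differs from the stored one.

    previous_checksums is a hypothetical dict mapping
    relative media path -> checksum from the previous version.
    """
    altered = []
    for rel_path, old_checksum in previous_checksums.items():
        # Every file has to be read completely, even if nothing changed
        if file_checksum(os.path.join(root, rel_path)) != old_checksum:
            altered.append(rel_path)
    return altered
```

Every media file is read in full on each call, so the cost grows with the total amount of media even when only the metadata was edited.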

@hagenw
Member

hagenw commented Apr 28, 2021

So I wonder if we should give the user the option to skip the test for altered media.

I think that would be an easy and good solution. In the long run, it could also mean that you don't have to download all media files with audb.load_to() if you just want to fix the header or tables.

@hagenw
Member

hagenw commented Apr 28, 2021

But of course it's not exactly the same, as you might not alter existing media files, but add new ones.

@hagenw hagenw added the publish label May 3, 2021
@hagenw
Member

hagenw commented Jun 8, 2021

The "Find media" part indeed seems to be the slowest part of publishing a database. We cannot easily avoid this with new data (besides maybe offering a way to provide pre-calculated values?).

But for updating large databases we should definitely provide an option to skip it.

@hagenw
Member

hagenw commented May 9, 2022

Speed of checking existing media has increased, but it might still be a problem when you have a large number of files. On the other hand when adding the argument to skip checking, we introduce a possible source of error during publication.

@hagenw hagenw added the question Further information is requested label May 9, 2022
@frankenjoe
Collaborator Author

The worst use case is if you neither upload nor alter media, but only change the metadata.

On the other hand when adding the argument to skip checking, we introduce a possible source of error during publication.

I would say we could take that risk given the extreme speed up we would gain.

@frankenjoe
Collaborator Author

#216 now implements a solution without adding a new argument. Media files that are referenced in the tables and are part of the previous version no longer have to exist in the build folder, since we can safely assume that those files remain unchanged.
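
The rule described above could be sketched roughly as follows. This is only an illustration of the idea, not the code from #216: `classify_media` and its arguments are hypothetical names chosen for the example.

```python
import os


def classify_media(referenced, previous_version, build_root):
    """Sort media referenced in the tables into three groups.

    referenced: iterable of relative media paths found in the tables
    previous_version: set of relative paths that were part of the previous version
    build_root: path to the build folder

    Returns (unchanged, to_check, missing).
    """
    unchanged, to_check, missing = [], [], []
    for rel_path in referenced:
        exists = os.path.exists(os.path.join(build_root, rel_path))
        if exists:
            # Present in the build folder: new or possibly altered,
            # so it still needs a checksum
            to_check.append(rel_path)
        elif rel_path in previous_version:
            # Absent locally but already published: assume unchanged
            unchanged.append(rel_path)
        else:
            # Neither available locally nor published before
            missing.append(rel_path)
    return unchanged, to_check, missing
```

With this split, a metadata-only update leaves the build folder without media, so no checksums have to be calculated at all.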

2 participants