As the number of objects in a bucket increases, the publish command slows to a crawl #103
This is not a fix to the core problem, but I've devised a workaround. A review of the publish command's mechanics shows there are three reasons why we need to pull the complete list of the target bucket's keys from Amazon.
Overall, this system is intended to prevent pushing files that have not been changed -- but also to identify which keys present in the bucket are no longer present in the build directory. Once the bucket is filled with a certain number of keys -- particularly if they are small files that can be quickly uploaded -- it is faster to upload all of the local files than it is to download the lengthy key list and perform the comparisons. The only problem is that you still need the key list to identify deletions. So, I tweaked the publish command so that when the user opts to both force-publish everything and skip deletions, it bypasses the key-list download entirely. A sketch of that logic is below.

The result obviously depends on the balance between the number and size of my bucket's keys, and YMMV, but it hugely trimmed the publish time for the @latimes bucket that led me to file this ticket. I don't see much downside to including this edge-case handling in our next release. Nobody will be forced to use it. We can add a little documentation to the "common challenges" section of the docs. The main thing I'd want a new user to know is that uploading all your keys all the time might end up costing you a little more money, since Amazon charges for those upload requests.
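Here's a minimal sketch of that shortcut, written with boto3 rather than our actual code; the `force_publish` and `no_delete` flag names are hypothetical stand-ins for whatever options the command ends up exposing:

```python
import os
import boto3


def publish(build_dir, bucket_name, force_publish=False, no_delete=False):
    """Sync a local build directory to S3, skipping the key list when possible."""
    s3 = boto3.client("s3")

    # Gather every local file, keyed by its would-be S3 key.
    local_keys = {}
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            local_keys[os.path.relpath(path, build_dir)] = path

    if force_publish and no_delete:
        # The workaround: with no comparisons and no deletions to make,
        # we never need the bucket's key list, so skip the slow listing.
        remote_keys = set()
    else:
        # Otherwise pull the full key list, which is the step that crawls
        # on buckets with many objects.
        remote_keys = set()
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket_name):
            for obj in page.get("Contents", []):
                remote_keys.add(obj["Key"])

    for key, path in local_keys.items():
        # The unchanged-file check is reduced here to a simple presence
        # test; the real command's comparison logic is omitted for brevity.
        if force_publish or key not in remote_keys:
            s3.upload_file(path, bucket_name, key)

    if not no_delete:
        # Deleting orphaned keys is the one step that requires the list.
        for key in remote_keys - set(local_keys):
            s3.delete_object(Bucket=bucket_name, Key=key)
```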
This is not a solution, but I've developed a second workaround option. The publish command now accepts an option that limits the sync to keys under a given prefix. This is possible because Amazon's key-listing API can filter by prefix, so only the matching keys are ever returned. In cases where you only need to sync a certain, pre-defined segment of your bucket, this offers a strategy for avoiding a costly API hit to download all s3 keys.
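For illustration, here's roughly what prefix-filtered listing looks like against the S3 API (a boto3 sketch, not our actual code):

```python
import boto3


def list_keys(bucket_name, prefix):
    """Return only the keys under `prefix`, so the API cost scales with
    the segment being synced rather than the whole bucket."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


# e.g. only pull the keys for one project's directory:
# list_keys("my-bucket", "projects/2014/example/")
```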
I do not have any more ideas about how to solve this problem. If you do, please post them or submit a patch. The comment just above this is the best strategy I've devised for myself.
Here's the log from a recent build of an @latimes bucket. You can see it took nearly 7 minutes to retrieve the object list for comparison. I wonder if we can figure out a way to speed this up, or at least to allow users with large buckets to `--force-publish` without having to pull down their full object list.

cc @sheats @anthonyjpesce
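For context on why retrieving the list takes so long: S3 returns at most 1,000 keys per list request, and the pages have to be fetched sequentially because each response supplies the marker for the next one. A rough back-of-envelope follows; the bucket size and latency figures are assumptions for illustration, not measurements from our bucket:

```python
# S3 caps list responses at 1,000 keys, and pagination is sequential,
# so listing time grows linearly with the number of objects.
KEYS_IN_BUCKET = 500_000      # assumed bucket size, not a measurement
KEYS_PER_REQUEST = 1_000      # S3's hard cap per list call
SECONDS_PER_REQUEST = 0.8     # assumed round-trip latency

requests = KEYS_IN_BUCKET // KEYS_PER_REQUEST
print(f"{requests} requests, ~{requests * SECONDS_PER_REQUEST / 60:.1f} minutes")
# -> 500 requests, ~6.7 minutes
```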