
As the number of objects in a bucket increases, the publish command slows to a crawl #103

Closed
palewire opened this issue Apr 22, 2017 · 3 comments

Comments

@palewire
Owner

palewire commented Apr 22, 2017

Here's the log from a recent build of an @latimes bucket. You can see it took nearly 7 minutes to retrieve the object list for comparison. I wonder if we can figure out a way to speed this up, or at least to allow users with large buckets to --force-publish without having to pull down their full object list.

DEBUG|21/Apr/2017 16:58:04|publish|Retrieving objects now published in bucket
DEBUG|21/Apr/2017 17:04:55|publish|Retrieving files built locally

cc @sheats @anthonyjpesce

@palewire
Owner Author

palewire commented Apr 22, 2017

This is not a fix to the core problem, but I've devised a workaround.

A review of the publish command's mechanics shows there are three reasons why we need to pull the complete list of the target bucket's keys from Amazon.

  1. CREATE: Files in the local build directory not in the bucket's key list are considered to be new and pushed to the bucket.
  2. UPDATE: The key list includes the md5 hash summarizing the content of each key. That hash is compared against the file with the same key in the local build directory. If the local file's hash is different, it is pushed to Amazon to update the published copy.
  3. DELETE: Keys in the published bucket that do not exist in the latest build directory are identified. Those keys are then deleted from the published bucket.

Overall, this system is intended to prevent pushing files that have not been changed -- but also to identify which keys present in the bucket are no longer present in the build directory.
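
In rough pseudocode, that comparison amounts to something like the sketch below. To be clear, these are hypothetical names, not the command's actual internals, and the md5 hashes stand in for the ETags S3 reports for non-multipart uploads.

```python
import hashlib
import os


def plan_sync(build_dir, bucket_keys):
    """Sketch of the three-way comparison described above.

    ``bucket_keys`` is assumed to map each published key name to its md5 hash.
    """
    # Hash every file in the local build directory, keyed by its relative path.
    local = {}
    for root, _, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, build_dir)
            with open(path, "rb") as f:
                local[key] = hashlib.md5(f.read()).hexdigest()

    # 1. CREATE: local files missing from the bucket
    create = [k for k in local if k not in bucket_keys]
    # 2. UPDATE: local files whose hash differs from the published copy
    update = [k for k in local if k in bucket_keys and local[k] != bucket_keys[k]]
    # 3. DELETE: published keys with no matching local file
    delete = [k for k in bucket_keys if k not in local]
    return create, update, delete
```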

Once the bucket is filled with a certain number of keys -- particularly if they are small files that can be quickly uploaded -- it is faster to upload all of the local files than it is to download the lengthy key list and perform the comparisons. The only problem is that you still need the key list to identify deletions.

So, I tweaked the publish command so that if the user provides both the --force option and the --no-delete option, it triggers a new "blind upload" mode in which the key list is never retrieved from Amazon S3. All files are uploaded, as was already done with our --force option. And no files are deleted, as was already the case with our --no-delete option.
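
In spirit, the new path amounts to skipping the listing entirely when both flags are set. Here's a minimal sketch, assuming boto3 and a hypothetical blind_upload helper, not the command's actual code:

```python
import os

import boto3


def blind_upload(build_dir, bucket_name):
    """Push every local file to S3 without ever listing the bucket."""
    s3 = boto3.client("s3")
    for root, _, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, build_dir)
            # No key list, no comparison: upload unconditionally.
            s3.upload_file(path, bucket_name, key)
```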

The result obviously depends on the balance between the number and size of your bucket's keys, so YMMV, but it hugely trimmed the publish time for the @latimes bucket that led me to file this ticket.

I don't see much downside to including this edge-case handling in our next release. Nobody will be forced to use it. We can add a little documentation to the "common challenges" section of the docs. The main thing I'd want a new user to know is that uploading all your keys all the time might end up costing you a little more money, since Amazon charges for each upload request.

@palewire
Owner Author

This is not a solution, but I've developed a second workaround option.

The publish command now accepts an --aws-bucket-prefix option which, when provided, will only request and sync your local files against keys that start with that prefix.

This is possible because the list_objects method in boto has a Prefix keyword argument. When it is provided, only keys with that prefix are pulled from S3.
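
For example, something like this minimal sketch, assuming boto3's client API (keys_with_prefix is a hypothetical helper name):

```python
import boto3


def keys_with_prefix(bucket_name, prefix):
    """Fetch only the keys under ``prefix``, mapped to their ETags."""
    s3 = boto3.client("s3")
    keys = {}
    # list_objects accepts a Prefix argument, so S3 returns only the
    # slice of the bucket we actually want to sync.
    paginator = s3.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys[obj["Key"]] = obj["ETag"].strip('"')
    return keys
```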

In cases where you only need to sync a certain, pre-defined segment of your bucket, this offers a strategy for avoiding a costly API hit to download all s3 keys.

@palewire
Owner Author

palewire commented Dec 5, 2018

I do not have any more ideas about how to solve this problem. If you do, please post them or submit a patch. The comment just above this is the best strategy I've devised for myself.

@palewire palewire closed this as completed Dec 5, 2018