
As the number of objects in a bucket increases, the publish command slows to a crawl #103

Closed
palewire opened this issue Apr 22, 2017 · 3 comments

Comments

@palewire
Owner

palewire commented Apr 22, 2017

Here's the log from a recent build of an @latimes bucket. You can see it took nearly 7 minutes to retrieve the object list for comparison. I wonder if we can figure out a way to speed this up, or at least to allow users with large buckets to --force-publish without having to pull down their full object list.

DEBUG|21/Apr/2017 16:58:04|publish|Retrieving objects now published in bucket
DEBUG|21/Apr/2017 17:04:55|publish|Retrieving files built locally

cc @sheats @anthonyjpesce

@palewire
Owner Author

palewire commented Apr 22, 2017

This is not a fix to the core problem, but I've devised a workaround.

A review of the publish command's mechanics shows there are three reasons why we need to pull the complete list of the target bucket's keys from Amazon.

  1. CREATE: Files in the local build directory not in the bucket's key list are considered to be new and pushed to the bucket.
  2. UPDATE: The key list includes the md5 hash summarizing the content of each key. That hash is compared against the file with the same key in the local build directory. If the local file's hash is different, it is pushed to Amazon to update the published copy.
  3. DELETE: Keys in the published bucket that do not exist in the latest build directory are identified. Those keys are then deleted from the published bucket.

Overall, this system is intended to prevent pushing files that have not been changed -- but also to identify which keys present in the bucket are no longer present in the build directory.
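
In rough pseudocode, that comparison amounts to something like the sketch below. To be clear, these are hypothetical names, not the command's actual internals, and the md5 hashes stand in for the ETags S3 reports for non-multipart uploads.

```python
import hashlib
import os


def plan_sync(build_dir, bucket_keys):
    """Sketch of the three-way comparison described above.

    ``bucket_keys`` is assumed to map each published key name to its md5 hash.
    """
    # Hash every file in the local build directory, keyed by its relative path.
    local = {}
    for root, _, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, build_dir)
            with open(path, "rb") as f:
                local[key] = hashlib.md5(f.read()).hexdigest()

    # 1. CREATE: local files missing from the bucket
    create = [k for k in local if k not in bucket_keys]
    # 2. UPDATE: local files whose hash differs from the published copy
    update = [k for k in local if k in bucket_keys and local[k] != bucket_keys[k]]
    # 3. DELETE: published keys with no matching local file
    delete = [k for k in bucket_keys if k not in local]
    return create, update, delete
```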

Once the bucket is filled with a certain number of keys -- particularly if they are small files that can be quickly uploaded -- it is faster to upload all of the local files than it is to download the lengthy key list and perform the comparisons. The only problem is that you still need the key list to identify deletions.

So, I tweaked the publish command so that if the user provides both the --force option and the --no-delete option, it triggers a new "blind upload" mode in which the key list is never retrieved from Amazon S3. All files are uploaded, as was already done with our --force option. And no files are deleted, as was already the case with our --no-delete option.
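
In spirit, the new path amounts to skipping the listing entirely when both flags are set. Here's a minimal sketch, assuming boto3 and a hypothetical blind_upload helper, not the command's actual code:

```python
import os

import boto3


def blind_upload(build_dir, bucket_name):
    """Push every local file to S3 without ever listing the bucket."""
    s3 = boto3.client("s3")
    for root, _, files in os.walk(build_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.relpath(path, build_dir)
            # No key list, no comparison: upload unconditionally.
            s3.upload_file(path, bucket_name, key)
```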

The result obviously depends on the balance between the number and size of your bucket's keys, so YMMV, but it hugely trimmed the publish time for the @latimes bucket that led me to file this ticket.

I don't see much downside to including this edge-case handling in our next release. Nobody will be forced to use it. We can add a little documentation to the "common challenges" section of the docs. The main thing I'd want a new user to know is that uploading all your keys all the time might end up costing you a little more money, since Amazon charges for each upload request.

@palewire
Owner Author

This is not a solution, but I've developed a second workaround option.

The publish command now accepts an --aws-bucket-prefix option which, when provided, will only request and sync your local files against keys that start with that prefix.

This is possible because the list_objects method in boto has a Prefix keyword argument. When it is provided, only keys with that prefix are pulled from S3.
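
For example, something like this minimal sketch, assuming boto3's client API (keys_with_prefix is a hypothetical helper name):

```python
import boto3


def keys_with_prefix(bucket_name, prefix):
    """Fetch only the keys under ``prefix``, mapped to their ETags."""
    s3 = boto3.client("s3")
    keys = {}
    # list_objects accepts a Prefix argument, so S3 returns only the
    # slice of the bucket we actually want to sync.
    paginator = s3.get_paginator("list_objects")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys[obj["Key"]] = obj["ETag"].strip('"')
    return keys
```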

In cases where you only need to sync a certain, pre-defined segment of your bucket, this offers a strategy for avoiding a costly API hit to download all s3 keys.

@palewire
Owner Author

palewire commented Dec 5, 2018

I do not have any more ideas about how to solve this problem. If you do, please post them or submit a patch. The comment just above this is the best strategy I've devised for myself.

@palewire palewire closed this as completed Dec 5, 2018