Description
I want to sync just a few hundred specific files from my bucket to my local machine. The bucket contains about 500 000 files.
Issue 1: The sync process takes a long time, because the AWS CLI checks every file, both in the bucket and locally, for changes.
Solution 1: Use 'aws s3api list-objects' with a JMESPath query to list all files that have changed within the last 14 days (see the sketch after this list).
Issue 2: 'aws s3api list-objects' returns a list of file paths. How can each of these paths be synced?
Solution 2: Add each file path as '--include=<FILE_PATH>' to the 'aws s3 sync' command.
Issue 3: With a few hundred '--include' arguments the sync command still takes a long time, because sync iterates over all files and checks each one against every --include pattern.
Solution3: ???
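
For reference, here is a minimal sketch of how Solution 1 and Solution 2 combine. The bucket name, local directory, and the GNU 'date' invocation for the 14-day cutoff are assumptions; note that sync only restricts itself to the listed keys if everything else is excluded first with '--exclude "*"'.

```bash
# Minimal sketch of Solution 1 + Solution 2.
# Assumptions: bucket "my-bucket", target dir "./local-dir", GNU date available,
# and no spaces in the object keys (the word splitting below would break on them).
BUCKET=my-bucket
CUTOFF=$(date -d '14 days ago' +%Y-%m-%dT%H:%M:%S)

# Solution 1: list only the keys modified within the last 14 days.
KEYS=$(aws s3api list-objects \
  --bucket "$BUCKET" \
  --query "Contents[?LastModified>='${CUTOFF}'].Key" \
  --output text)

# Solution 2: exclude everything, then re-include each changed key.
INCLUDES=()
for key in $KEYS; do
  INCLUDES+=(--include "$key")
done

aws s3 sync "s3://$BUCKET" ./local-dir --exclude '*' "${INCLUDES[@]}"
```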
There are two feature requests I want to raise here. Fortunately, for one of them a ticket was already opened a few days ago: #5160
The other feature request is a command line argument for 'aws s3 sync' that allows passing plain paths instead of patterns. If you pass a pattern, sync has to iterate over all files; but if the arguments are known to be exact paths, that iteration is unnecessary. This would be a huge speed-up when syncing only a few files out of a large bucket with many objects.
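
To illustrate the idea, a sketch of what such an invocation could look like. The flag name '--files-from' is purely hypothetical and does not exist in the AWS CLI; the point is only that sync would receive a list of exact keys and could skip pattern matching entirely.

```bash
# Hypothetical: '--files-from' is NOT an existing AWS CLI option; it only
# illustrates the proposed feature (pass exact keys, skip pattern matching).
aws s3api list-objects \
  --bucket my-bucket \
  --query "Contents[?LastModified>='2020-05-01T00:00:00'].Key" \
  --output text | tr '\t' '\n' > changed-keys.txt

aws s3 sync s3://my-bucket ./local-dir --files-from changed-keys.txt
```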
An alternative would be calling the cp command once for each file path, appending the path to the bucket URL (see the loop sketched below this list). But there are two issues with this approach:
- It is slow as hell, because the whole cp command is executed once per file, which includes creating a new connection to the AWS servers each time.
- cp does not check whether the file actually needs to be downloaded because it changed; it just downloads it unconditionally, which makes the script even slower.
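
For completeness, the per-file cp alternative written out as a loop, assuming changed-keys.txt holds one key per line (e.g. produced by the list-objects query above); the bucket name and local directory are placeholders.

```bash
# One 'aws s3 cp' call per key: a new CLI process and connection each time, and
# no changed-file check, so every listed file is downloaded unconditionally.
while read -r key; do
  aws s3 cp "s3://my-bucket/$key" "./local-dir/$key"
done < changed-keys.txt
```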