aws s3 sync repeatedly uploads the same, unchanged files #5216

Closed
2 tasks done
BLuFeNiX opened this issue May 20, 2020 · 18 comments
Labels
closed-for-staleness guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.

Comments

@BLuFeNiX


Describe the bug
A certain folder in my S3-compatible storage is consistently detected as out of sync with my local source directory, i.e. aws s3 sync /data/foo/ s3://my.bucket.tld/foo/ always prints dozens of lines like the following, every time it is run:

upload: data/foo/bar/buzz/123.jpg to s3://my.bucket.tld/foo/bar/buzz/123.jpg

The files are identical; I have confirmed this by hashing them locally, and then downloading a copy directly via aws s3 cp and hashing that too.

There is an interesting clue as to the nature of the bug: making the path more specific gets rid of the behavior. For example, rather than running aws s3 sync /data/foo/ s3://my.bucket.tld/foo/, running aws s3 sync /data/foo/bar/buzz/ s3://my.bucket.tld/foo/bar/buzz/ shows no changed files and uploads nothing. The directories are the same in both scenarios, so it seems some comparison is being made against other objects in the bucket.

SDK version number
aws-cli/2.0.14 Python/3.7.3 Linux/4.19.107 botocore/2.0.0dev18

Platform/OS/Hardware/Device
Ubuntu 20.04 in a docker container

To Reproduce (observed behavior)
Unknown. It only happens with some files.

Expected behavior
Unchanged files will not be uploaded.

Logs/output
I do not wish to post full logs, but here is a relevant message:

2020-05-20 07:01:57,245 - MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: [REDACTED] -> [REDACTED], file does not exist at destination

It is the same message for every file, even though the file does exist in the destination bucket.

Additional context
It looks like this problem has existed since at least 2014, based on a similar report here:
https://forums.aws.amazon.com/thread.jspa?threadID=146851

I have also tried s3cmd, which does not exhibit this behavior.

@BLuFeNiX BLuFeNiX added the needs-triage This issue or PR still needs to be triaged. label May 20, 2020
@KaibaLopez
Contributor

Hi @BLuFeNiX ,
Sorry for the late response. I have not been able to reproduce this problem. Is there anything you can think of that might be special about your case, any non-default bucket settings, for example?

@KaibaLopez KaibaLopez added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. and removed needs-triage This issue or PR still needs to be triaged. labels May 29, 2020
@KaibaLopez KaibaLopez self-assigned this May 29, 2020
@albinj

albinj commented May 30, 2020

Same happens for me when I try to run s3 sync multiple times:
aws s3 sync "s3://static.domain.tld/" "folder"

It keeps downloading the same, identical files. It only happens with files that have a special character in their filename, such as ó æ á ð ø ú é í.

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label May 30, 2020
@agconti

agconti commented Jun 9, 2020

I have this same problem but I don't have special characters in my file names. The commands I'm using are:

aws s3 sync $BUILD_DIR $S3_BUCKET --include "*.html" \
--metadata-directive REPLACE --expires 2034-01-01T00:00:00Z --acl public-read \
--cache-control no-cache

aws s3 sync $BUILD_DIR $S3_BUCKET --exclude "*.html"  \
--metadata-directive REPLACE --expires 2034-01-01T00:00:00Z --acl public-read \
--cache-control max-age=31536000,public

A file like icon.svg will always be re-uploaded, despite not having changed.

@agconti

agconti commented Jun 10, 2020

I figured out what my problem was: sync decides what to sync based on a file's timestamp and size. My website build process was updating the output files' timestamps even when their contents did not change.

Adding --size-only fixed my problem by ignoring timestamps and telling sync to update files only when their sizes change.

Hope this helps!
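The timestamp-and-size comparison described above can be sketched in Python. This is a simplified illustration, not awscli's actual code; should_upload and its parameters are made up for this sketch:

```python
def should_upload(src_size, src_mtime, dest_size, dest_mtime, size_only=False):
    """Simplified sketch of the default sync decision: transfer when
    sizes differ, or (unless --size-only) when the local file's
    modification time is newer than the remote object's."""
    if src_size != dest_size:
        return True          # a size change always triggers a transfer
    if size_only:
        return False         # --size-only: identical sizes mean "in sync"
    return src_mtime > dest_mtime  # default: a newer local mtime re-uploads

# A rebuilt file: identical bytes (same size), fresher timestamp.
print(should_upload(4096, 200.0, 4096, 100.0))                  # True
print(should_upload(4096, 200.0, 4096, 100.0, size_only=True))  # False
```

This shows why a build step that rewrites unchanged files keeps triggering uploads, and why --size-only stops it.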

@BLuFeNiX
Author

> I figured out what my problem was: sync decides what to sync based on a file's timestamp and size. My website build process was updating the output files' timestamps even when their contents did not change.
>
> Adding --size-only fixed my problem by ignoring timestamps and telling sync to update files only when their sizes change.
>
> Hope this helps!

This is not the case with my situation. Files are identical (immutable, in fact) and still upload every time.

@boomshadow

> I figured out what my problem was: sync decides what to sync based on a file's timestamp and size. My website build process was updating the output files' timestamps even when their contents did not change.
>
> Adding --size-only fixed my problem by ignoring timestamps and telling sync to update files only when their sizes change.
>
> Hope this helps!

This was exactly my problem. Thank you so much!

@agconti

agconti commented Aug 7, 2020

@boomshadow sure thing!

@kdaily
Member

kdaily commented Sep 23, 2020

Hi @BLuFeNiX,

Without seeing more of the logs this is difficult to diagnose. Do you happen to be using a third-party S3-compatible API (Backblaze, DreamHost, etc.)? The symptom you're encountering sounds a lot like this: #5456 (comment)

In that case, the S3-compatible API was using the original ListObjects implementation instead of ListObjectsV2.

This would explain why specifying a more specific prefix (probably with fewer files, under 1,000) changes the behavior.
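The pagination failure mode described here can be illustrated with a toy sketch. list_page below is a stand-in for a ListObjectsV2-style call (it takes a continuation token and returns a page of keys plus the next token), not a real S3 client:

```python
def list_all_keys(list_page):
    """Collect every key by following ListObjectsV2-style
    continuation tokens until the server stops returning one."""
    keys, token = [], None
    while True:
        page, token = list_page(token)
        keys.extend(page)
        if token is None:
            return keys

# Simulate a prefix holding 2,500 keys, served in pages of 1,000.
all_keys = [f"foo/obj-{i:04d}" for i in range(2500)]

def well_behaved(token):
    start = int(token or 0)
    end = min(start + 1000, len(all_keys))
    return all_keys[start:end], (str(end) if end < len(all_keys) else None)

def broken(token):
    # A faulty endpoint that never returns a continuation token:
    # everything past the first 1,000 keys looks "missing" to sync,
    # so those files are re-uploaded on every run.
    return all_keys[:1000], None

print(len(list_all_keys(well_behaved)))  # 2500
print(len(list_all_keys(broken)))        # 1000
```

With the broken listing, sync logs "file does not exist at destination" for every object beyond the first page, matching the symptom in the original report.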

@kdaily kdaily added guidance Question that needs advice or information. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. labels Sep 23, 2020
@github-actions

Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 30, 2020
@github-actions github-actions bot closed this as completed Oct 4, 2020
@seemantk-schweitzer

The expected behavior of aws s3 sync would be similar to how rsync works on *nix systems, which compares file hashes. Can you please advise whether that is planned or being considered?

@BLuFeNiX
Author

BLuFeNiX commented Jul 3, 2021

Quick update, in case anyone else experiences similar problems. After much effort, I was able to prove that the original problem that I reported here was due to a bug in the Wasabi S3 service. I reported the bug with a reproducible test case, so the ball is in their court now.

@vikrambharadwaj

I'm facing the same issue as well.

@snerpton

I was experiencing this issue. Running the following command:

aws s3 sync --profile myProfile --region myRegion c:\path\a-directory s3://my-s3-bucket/a-directory --delete

repeatedly uploaded some files.

I confirmed the re-uploaded files hadn't changed and also that the timestamp hadn't changed since the file was initially uploaded.

Adding the --size-only flag prevented these files from being reuploaded, but this shouldn't be required as the file's content and timestamp haven't changed.

However, it became apparent that the files being repeatedly uploaded had a modified date in the year 2076. One might expect aws s3 sync to respect that the modified date hasn't changed; on the other hand, since a file can't have been modified in the future, perhaps it plays it safe and re-uploads to ensure the destination has the latest file.

In any case, I resolved the issue by setting the last-modified dates to something sensible (the creation date, which was in the past).
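If stray future timestamps are the cause, they can be normalized before syncing. A minimal sketch; clamp_future_mtimes is a hypothetical helper, not part of the AWS CLI:

```python
import os
import tempfile
import time

def clamp_future_mtimes(root):
    """Reset any file under `root` whose mtime is in the future
    (e.g. the year 2076) to the current time, so timestamp-based
    sync comparisons behave sensibly."""
    now = time.time()
    fixed = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            if os.path.getmtime(p) > now:
                os.utime(p, (now, now))
                fixed.append(p)
    return fixed

# Demo: a file stamped far in the future gets pulled back to "now".
demo = tempfile.mkdtemp()
path = os.path.join(demo, "photo.jpg")
with open(path, "w") as f:
    f.write("data")
os.utime(path, (4102444800, 4102444800))  # 2100-01-01 UTC, a future-skew stand-in
fixed = clamp_future_mtimes(demo)
print(fixed == [path])  # True
```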

AWS CLI info...
aws-cli/2.0.13 Python/3.7.7 Windows/10 botocore/2.0.0dev17

@anshul-pinto0410

@agconti How do we account for the case where a file has changed but its size has not? For example, if we change "hello" to "howdy", the size stays the same, so the file will not be uploaded to S3? Is there a solution for this case?
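One option for same-size changes is a content check. For objects uploaded in a single part without SSE-KMS encryption, the S3 ETag is the hex MD5 of the object body, so a local MD5 can be compared against it. A sketch; etag_matches is a hypothetical helper, and multipart ETags (which carry a "-N" suffix) need different handling:

```python
import hashlib
import os
import tempfile

def etag_matches(local_path, s3_etag):
    """Compare a local file's MD5 against a (single-part, non-KMS)
    S3 ETag, catching same-size edits that --size-only misses."""
    md5 = hashlib.md5()
    with open(local_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == s3_etag.strip('"')

# "hello" was edited to "howdy": same size, different bytes.
fd, path = tempfile.mkstemp()
os.write(fd, b"howdy")
os.close(fd)
print(etag_matches(path, '"5d41402abc4b2a76b9719d911017c592"'))  # MD5 of "hello" -> False
```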

@13Flo

13Flo commented Feb 13, 2023

+1, facing the same issue where existing files are uploaded again for no reason.
This problem does not happen when uploading to a Google Cloud Storage bucket (gsutil -m cp -r -n).

@iilei

iilei commented May 5, 2023

Encountered the same issue. In my case I could mitigate it like so:

# before mitigation: incorrect detection of untouched files:
aws s3 sync _some_directory_name_ ...

vs.

# after mitigation: correct detection of untouched files:
aws s3 sync (Resolve-Path -Path '_some_directory_name_').Path  ...

In other words: An absolute path made it work as expected.

@khilnani

khilnani commented Jun 28, 2023

Including some debug logs of the issue.

The local file was not modified between runs and shows "2023-06-28 03:53:05-04:00" on the local file system before and after each run.

A dry run shows the modified times differ; the local file is newer than the remote file in S3:

2023-06-28 00:10:18,671 - MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: /ULES01372DLCVOICE/PARAM.SFO -> BUCKET_NAME/ULES01372DLCVOICE/PARAM.SFO, size: 4912 -> 4912, modified time: 2023-06-28 03:53:05-04:00 -> 2023-06-28 00:02:03-04:00

Running without --dryrun shows the file was uploaded:

2023-06-28 00:19:42,458 - MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: /ULES01372DLCVOICE/PARAM.SFO -> BUCKET_NAME/ULES01372DLCVOICE/PARAM.SFO, size: 4912 -> 4912, modified time: 2023-06-28 03:53:05-04:00 -> 2023-06-28 00:02:03-04:00

2023-06-28 00:19:42,459 - ThreadPoolExecutor-1_2 - s3transfer.futures - DEBUG - Submitting task PutObjectTask(transfer_id=78, {'bucket': 'BUCKET_NAME', 'key': 'ULES01372DLCVOICE/PARAM.SFO', 'extra_args': {}}) to executor <s3transfer.futures.BoundedExecutor object at 0x11091bc50> for transfer request: 78.

2023-06-28 00:20:58,911 - ThreadPoolExecutor-0_3 - s3transfer.tasks - DEBUG - PutObjectTask(transfer_id=78, {'bucket': 'BUCKET_NAME', 'key': 'ULES01372DLCVOICE/PARAM.SFO', 'extra_args': {}}) about to wait for the following futures []

2023-06-28 00:20:58,911 - ThreadPoolExecutor-0_3 - s3transfer.tasks - DEBUG - PutObjectTask(transfer_id=78, {'bucket': 'BUCKET_NAME', 'key': 'ULES01372DLCVOICE/PARAM.SFO', 'extra_args': {}}) done waiting for dependent futures

2023-06-28 00:20:58,911 - ThreadPoolExecutor-0_3 - s3transfer.tasks - DEBUG - Executing task PutObjectTask(transfer_id=78, {'bucket': 'BUCKET_NAME', 'key': 'ULES01372DLCVOICE/PARAM.SFO', 'extra_args': {}}) with kwargs {'client': <botocore.client.S3 object at 0x110a4be10>, 'fileobj': <s3transfer.utils.ReadFileChunk object at 0x110c39050>, 'bucket': 'BUCKET_NAME', 'key': 'ULES01372DLCVOICE/PARAM.SFO', 'extra_args': {}}

2023-06-28 00:20:58,914 - ThreadPoolExecutor-0_3 - botocore.endpoint - DEBUG - Making request for OperationModel(name=PutObject) with params: {'url_path': '/ULES01372DLCVOICE/PARAM.SFO', 'query_string': {}, 'method': 'PUT', 'headers': {'User-Agent': 'aws-cli/2.11.0 Python/3.11.2 Darwin/22.4.0 exe/x86_64 prompt/off command/s3.sync', 'Content-MD5': 'jdvdvxuBnVv4Pl2SaLsGvA==', 'Expect': '100-continue'}, 'body': <s3transfer.utils.ReadFileChunk object at 0x110c39050>, 'auth_path': '/BUCKET_NAME/ULES01372DLCVOICE/PARAM.SFO', 'url': 'https://BUCKET_NAME.s3.us-east-1.amazonaws.com/ULES01372DLCVOICE/PARAM.SFO', 'context': {'client_region': 'us-east-1', 'client_config': <botocore.config.Config object at 0x110a4bd90>, 'has_streaming_input': True, 'auth_type': 'v4', 'signing': {'region': 'us-east-1', 'signing_name': 's3', 'disableDoubleEncoding': True}, 's3_redirect': {'redirected': False, 'bucket': 'BUCKET_NAME', 'params': {'Bucket': 'BUCKET_NAME', 'Key': 'ULES01372DLCVOICE/PARAM.SFO', 'Body': <s3transfer.utils.ReadFileChunk object at 0x110c39050>}}}} upload: ULES01372DLCVOICE/PARAM.SFO to s3://BUCKET_NAME/ULES01372DLCVOICE/PARAM.SFO

Performing the sync again shows that the uploaded file's modified timestamp was updated in S3, but incorrectly: it does not match the local modified time, and the file is selected for re-upload because it is evaluated as newer.

2023-06-28 00:31:29,157 - MainThread - awscli.customizations.s3.syncstrategy.base - DEBUG - syncing: /ULES01372DLCVOICE/PARAM.SFO -> BUCKET_NAME/ULES01372DLCVOICE/PARAM.SFO, size: 4912 -> 4912, modified time: 2023-06-28 03:53:05-04:00 -> 2023-06-28 00:20:59-04:00
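The loop in these logs can be reproduced from the timestamps alone: S3's LastModified records the upload time, so while the local mtime stays ahead of each new upload's timestamp, the default "local is newer" check fires on every run. A Python sketch using the values from the log above:

```python
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=-4))
local_mtime   = datetime(2023, 6, 28, 3, 53, 5, tzinfo=tz)   # local file, never touched
remote_upload = datetime(2023, 6, 28, 0, 20, 59, tzinfo=tz)  # S3 LastModified = upload time

# The default strategy re-uploads whenever the local side looks newer.
# Each upload stamps the object with the (earlier) upload time, so the
# comparison never stabilizes while the local mtime sits "in the future".
print(local_mtime > remote_upload)  # True -> selected for re-upload again
```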

The commands used were run in zsh on Mac M1

Version - aws-cli/2.11.0 Python/3.11.2 Darwin/22.4.0 exe/x86_64 prompt/off

Command without dry run - aws s3 sync . s3://BUCKET_NAME --no-follow-symlinks --delete --exclude '*/.tmp.driveupload*' --debug

Command with dry run - aws s3 sync . s3://BUCKET_NAME --no-follow-symlinks --delete --exclude '*/.tmp.driveupload*' --dryrun --debug

@acro5piano

acro5piano commented Sep 16, 2023

Note that --size-only is NOT appropriate for updating HTML files, especially if you are hosting a static website.

Static websites are typically built with bundled JS/CSS, so the HTML contains a tag like <script src="/vendor.123abc45.js" />. If a new build changes only the CSS, the hash in that filename changes, e.g. to <script src="/vendor.678def90.js" />, but the size of the built HTML stays the same. With --size-only the HTML file is therefore not updated, and it still points to the old CSS.

I made exactly this mistake. I hope this comment prevents others from doing the same before rolling to production.
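The pitfall is easy to demonstrate: content-hashed bundle filenames have a fixed length, so swapping one hash for another leaves the HTML size unchanged. Illustrative strings, not the commenter's actual build output:

```python
old_html = '<script src="/vendor.123abc45.js"></script>'
new_html = '<script src="/vendor.678def90.js"></script>'

print(len(old_html) == len(new_html))  # True: --size-only sees no difference
print(old_html == new_html)            # False: the content did change
```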
