
s3 sync from local to s3 bucket re-uploads files with version 1.3.6 #749

Closed
irfan-syed opened this issue Apr 8, 2014 · 10 comments · Fixed by #755
Assignees
Labels
bug This issue is a bug. s3sync

Comments

@irfan-syed

It is probably related to #718, but it is still happening on my OS X Mavericks with Python 2.7. I sync local files to S3 using "aws s3 sync", but some files always want to re-upload. I ran a debug and the comparator thinks that "file does not exist at destination".

I even removed all scripts and Python packages for awscli, botocore, and jmespath, and then reinstalled the latest 1.3.6, but no luck. I have used awscli for s3 sync since version 1.0 without any issues, and only started experiencing this with 1.3.2.

I'd appreciate it if you could look into it. Let me know if additional information is needed.

@jamesls
Member

jamesls commented Apr 8, 2014

Can you share the filename? If it's related to #718, it would have whitespace in the filename.

@irfan-syed
Author

There certainly is whitespace in the full path (not necessarily in the filename), as below. Should it not have been fixed by #718?

2014-04-09 09:30:06,383 - awscli.customizations.s3.comparator - DEBUG - syncing: /Users/OSXUser/Documents/Work/MEGA SHOTS/EmptyRoad/Hiring/Jeff So/iPhone Apps/MyFirstApp/build/Debug-iphonesimulator/Animation.app/ChooseStyle.nib -> megabucket/Documents/Work/MEGA SHOTS/EmptyRoad/Hiring/Jeff So/iPhone Apps/MyFirstApp/build/Debug-iphonesimulator/Animation.app/ChooseStyle.nib, file does not exist at destination

@jamesls
Member

jamesls commented Apr 9, 2014

Yep, seems to be working; here's what I tried:

$ aws --version
aws-cli/1.3.6 Python/2.7.5 Darwin/12.5.0
$ mkdir /tmp/synctest && cd /tmp/synctest
$ echo "test contents" > "file with spaces.txt"
$ aws s3 sync . s3://bucket-name/testwithspaces/
upload: ./file with spaces.txt to s3://bucket-name/testwithspaces/file with spaces.txt
$ aws s3 sync . s3://bucket-name/testwithspaces/   # <--- note did not resync the file.
$

Perhaps, after determining that the file needs to be synced, the actual file transfer fails for some reason. Can you confirm that the sync command exits cleanly (RC of 0) and that the file in question is actually uploaded to S3 successfully?

@irfan-syed
Author

Your test works for me too, but there is something else going on with my files.

The files are indeed re-uploaded successfully every time. In fact, my local files and S3 files (50,000+) are in sync except for 13 files, yet 1.3.6 wants to upload 1200+ files every time.

I just ran some tests with the same set of local files and the same S3 bucket:

awscli version | Behaviour
1.2.13         | Works as expected
1.3.1          | Works as expected
1.3.2          | Has the behaviour described in #718; wants to upload half of my files with spaces
1.3.6          | Doesn't have the same behaviour as #718, but still wants to upload 1200+ files every time, even if I let the upload complete for all files, or delete the files in S3 and re-sync

Please let me know if you need anything from me to identify this issue. I've downgraded to 1.3.1 (the last working version) for now.

@irfan-syed irfan-syed changed the title s3 sync from local to s3 bucket re-uploads file with version 1.3.6 s3 sync from local to s3 bucket re-uploads files with version 1.3.6 Apr 9, 2014
@laupow

laupow commented Apr 14, 2014

This is similar to issue #648, where awscli.customizations.s3.comparator yields a file does not exist at destination error.

In that issue the files move in the opposite direction (S3 to local disk), and there are no spaces in the file names (we replace them with underscores).

@jonbrock

I was just looking into this a little. It seems the two lists being compared in the comparator get out of sync at some point. From what I can tell, it's going along just fine until suddenly a retry is needed; then there's a big chunk of XML showing another set of keys. I think the comparator is overwriting dest_files with this new list instead of appending to it.

I added a simple print of src_file.compare_key and dest_file.compare_key.

Before retry:
2014-04-14 17:18:04,966 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0308.JPG, dest_file = My Pictures/October/IMG_0308.JPG
2014-04-14 17:18:04,967 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0314.JPG, dest_file = My Pictures/October/IMG_0314.JPG
2014-04-14 17:18:04,967 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0315.JPG, dest_file = My Pictures/October/IMG_0315.JPG
2014-04-14 17:18:04,968 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0319.JPG, dest_file = My Pictures/October/IMG_0319.JPG
2014-04-14 17:18:04,969 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0320.JPG, dest_file = My Pictures/October/IMG_0320.JPG

...

Then the retry.

Then:

2014-04-14 17:18:05,902 - botocore.hooks - DEBUG - Event service-created: calling handler <function register_retries_for_service at 0x1e7bc08>
2014-04-14 17:18:05,902 - botocore.handlers - DEBUG - Registering retry handlers for service: Service(s3)
2014-04-14 17:18:05,902 - awscli.customizations.s3.executor - DEBUG - Submitting task: <awscli.customizations.s3.tasks.BasicTask object at 0x29c3510>
2014-04-14 17:18:05,902 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0323.JPG, dest_file = Photos/2009/11/24/IMG_1336-1.CR2
2014-04-14 17:18:05,903 - awscli.customizations.s3.executor - DEBUG - Worker thread invoking task: <awscli.customizations.s3.tasks.BasicTask object at 0x29c3510>
2014-04-14 17:18:05,903 - awscli.customizations.s3.comparator - DEBUG - syncing: /home/jon/Pictures/My Pictures/October/IMG_0323.JPG -> 8342d92d-3b90-42e2-ac14-4877ed8aaee7/Pictures/My Pictures/October/IMG_0323.JPG, less_than, file does not exist at destination
2014-04-14 17:18:05,905 - botocore.service - DEBUG - Creating service object for: s3
2014-04-14 17:18:05,905 - botocore.hooks - DEBUG - Event service-data-loaded.s3: calling handler <function signature_overrides at 0x1e7bde8>
2014-04-14 17:18:05,904 - awscli.customizations.s3.executor - DEBUG - Received print task: {'message': u'(dryrun) upload: ./IMG_0322.JPG to s3://8342d92d-3b90-42e2-ac14-4877ed8aaee7/Pictures/My Pictures/October/IMG_0322.JPG', 'error': False}
2014-04-14 17:18:05,906 - botocore.hooks - DEBUG - Event service-created: calling handler <function register_retries_for_service at 0x1e7bc08>
2014-04-14 17:18:05,906 - botocore.handlers - DEBUG - Registering retry handlers for service: Service(s3)
yyyyyyyyyyyy My Pictures/October/IMG_0323.JPG
(dryrun) upload: ./IMG_0322.JPG to s3://8342d92d-3b90-42e2-ac14-4877ed8aaee7/Pictures/My Pictures/October/IMG_0322.JPG
Completed 2 part(s) with ... file(s) remaining^M2014-04-14 17:18:05,907 - awscli.customizations.s3.executor - DEBUG - Submitting task: <awscli.customizations.s3.tasks.BasicTask object at 0x29c3790>
2014-04-14 17:18:05,907 - awscli.customizations.s3.comparator - DEBUG - src_file = My Pictures/October/IMG_0324.JPG, dest_file = Photos/2009/11/24/IMG_1336-1.CR2
2014-04-14 17:18:05,908 - awscli.customizations.s3.executor - DEBUG - Worker thread invoking task: <awscli.customizations.s3.tasks.BasicTask object at 0x29c3790>
2014-04-14 17:18:05,908 - awscli.customizations.s3.comparator - DEBUG - syncing: /home/jon/Pictures/My Pictures/October/IMG_0324.JPG -> 8342d92d-3b90-42e2-ac14-4877ed8aaee7/Pictures/My Pictures/October/IMG_0324.JPG, less_than, file does not exist at destination

Again, the dest file that is not working is the first file returned in the new chunk of XML.

Hope that helps.
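A minimal sketch of the merge-style walk a sync comparator performs (illustrative code, not the actual awscli implementation; the function name is made up). Both listings must be sorted on the same key for the walk to stay aligned; if the destination listing loses or skips a page because of a bad pagination marker, every remaining source file compares as "missing":

```python
def sync_decisions(src_keys, dest_keys):
    """Yield source keys that appear absent from the destination listing."""
    dest_iter = iter(sorted(dest_keys))
    dest = next(dest_iter, None)
    for src in sorted(src_keys):
        # Advance the destination pointer past keys that sort before src.
        while dest is not None and dest < src:
            dest = next(dest_iter, None)
        if dest != src:
            yield src  # treated as "file does not exist at destination"

# With aligned listings, nothing is re-uploaded:
assert list(sync_decisions(["a.jpg", "b.jpg"], ["a.jpg", "b.jpg"])) == []
# If a page of destination keys is lost (bad marker), b.jpg re-syncs:
assert list(sync_decisions(["a.jpg", "b.jpg"], ["a.jpg"])) == ["b.jpg"]
```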

@jonbrock

Hi,

I don't really know how to use GitHub, but I have a patch that fixes the issue. The problem is that botocore/paginate.py is not unquoting the "next_marker" key: when it builds the query for the next request, the "+" in the key gets encoded as %2B.

diff --git a/botocore/paginate.py b/botocore/paginate.py
index c6efff6..95fedb4 100644
--- a/botocore/paginate.py
+++ b/botocore/paginate.py
@@ -11,6 +11,7 @@
 # ANY KIND, either express or implied. See the License for the specific
 # language governing permissions and limitations under the License.

+from compat import unquote_str
 from itertools import tee
 try:
     from itertools import zip_longest
@@ -251,7 +252,7 @@ class PageIterator(object):
                 return [None]
         next_tokens = []
         for token in self._output_token:
-            next_tokens.append(token.search(parsed))
+            next_tokens.append(unquote_str(token.search(parsed)))
         return next_tokens

     def result_key_iters(self):
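To see why the missing unquote matters, here is a small illustration using only the Python standard library (the key name is made up). With encoding_type="url", S3 returns key names percent-encoded, so a key containing '+' comes back with '%2B'. If that encoded name is reused verbatim as the next marker, it no longer matches the real key, and the next page of results starts in the wrong place:

```python
from urllib.parse import quote, unquote

key = "My Pictures/October/IMG+0323.JPG"  # hypothetical key containing '+'

# What S3 hands back as the last key of a page when encoding_type="url":
encoded = quote(key)  # 'My%20Pictures/October/IMG%2B0323.JPG'

# Buggy: reuse the encoded name verbatim as the next marker.
buggy_marker = encoded
# Fixed (what the patch's unquote_str accomplishes): decode it first.
fixed_marker = unquote(encoded)

assert fixed_marker == key   # matches a real key; pagination resumes correctly
assert buggy_marker != key   # matches nothing; the listing jumps elsewhere
```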

@jamesls
Member

jamesls commented Apr 15, 2014

Awesome, thanks for your help. Looking at this now...

@jamesls jamesls self-assigned this Apr 15, 2014
@jamesls
Member

jamesls commented Apr 17, 2014

Just wanted to give an update here. I'm able to repro the issue and can confirm what @jonbrock said. I want to put together a set of test cases to ensure we don't regress on this again. I'll send a PR once I have that worked out.

jamesls added a commit to jamesls/aws-cli that referenced this issue Apr 17, 2014
Fixes aws#749.  This was a regression from the fix for aws#675
where we use the encoding_type of "url" to workaround
the stdlib xmlparser not handling new lines.

The problem is that pagination in s3 uses the last key name as
the marker, and because the keys are returned urlencoded, we
need to urldecode the keys so botocore sends the correct next
marker.  In the case where urldecoded(key) != key we will incorrectly
sync new files.

Also added an integ test for syncing with '+' chars.
jamesls added a commit that referenced this issue Apr 18, 2014
@jamesls
Member

jamesls commented Apr 18, 2014

OK, I believe this issue has been fixed (#755). Thanks to everyone for their help in debugging this issue.

jamesls added a commit that referenced this issue Apr 21, 2014
* release-1.3.7: (28 commits)
  Bumping version to 1.3.7
  Add #742 to changelog
  Add a comment about why get_stdout_text_writer is needed.
  Code review feedback
  Py3 integ test fixes
  Update changelog with #749
  Add compat layer for text based stream writers
  Fix S3 sync issue with keys containing urlencode values
  Add issue to changelog
  Remove print statement in test
  Fix issue with scalar/non-scalar lists
  Fix doc example for s3api put-object
  Refactor load-cli-arg common event code
  Add 750 to the changelog
  Update paramfile custom argument to use events
  Aggregate dupe keys into a list in datapipeline translation
  Add issue to CHANGELOG
  Do not auto parse JSON based on filename
  Update tests to not mock __builtin__.open
  Allow custom param values to be read from files/urls
  ...