Provide an option to perform unicode normalization on local file names #1639
Comments
For what it's worth, the following patch solves my problem:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py 2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py 2015-11-15 18:56:31.000000000 +0100
@@ -13,6 +13,7 @@
import os
import sys
import stat
+import unicodedata
from dateutil.parser import parse
from dateutil.tz import tzlocal
@@ -116,11 +117,12 @@
``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
"""
def __init__(self, client, operation_name, follow_symlinks=True,
- page_size=None, result_queue=None):
+ page_size=None, normalize_unicode=False, result_queue=None):
self._client = client
self.operation_name = operation_name
self.follow_symlinks = follow_symlinks
self.page_size = page_size
+ self.normalize_unicode = normalize_unicode
self.result_queue = result_queue
if not result_queue:
self.result_queue = queue.Queue()
@@ -167,6 +169,8 @@
"""
join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
error, listdir = os.error, os.listdir
+ if self.normalize_unicode:
+ path = unicodedata.normalize('NFKC', path)
if not self.should_ignore_file(path):
if not dir_op:
size, last_update = get_file_stat(path)
@@ -185,6 +189,8 @@
listdir_names = listdir(path)
names = []
for name in listdir_names:
+ if self.normalize_unicode:
+ name = unicodedata.normalize('NFKC', name)
if not self.should_ignore_file_with_decoding_warnings(
path, name):
file_path = join(path, name)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/s3handler.py awscli/customizations/s3/s3handler.py
--- awscli.orig/customizations/s3/s3handler.py 2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/s3handler.py 2015-11-15 09:25:54.000000000 +0100
@@ -64,7 +64,8 @@
'grants': None, 'only_show_errors': False,
'is_stream': False, 'paths_type': None,
'expected_size': None, 'metadata_directive': None,
- 'ignore_glacier_warnings': False}
+ 'ignore_glacier_warnings': False,
+ 'normalize_unicode': False}
self.params['region'] = params['region']
for key in self.params.keys():
if key in params:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py 2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2015-11-15 18:18:23.000000000 +0100
@@ -301,12 +301,21 @@
}
+NORMALIZE_UNICODE = {
+ 'name': 'normalize-unicode', 'action': 'store_true',
+ 'help_text': (
+ 'Normalizes file names read from the local filesystem in unicode '
+ 'normal form KC. This is mainly useful when running on OS X.'
+ )
+}
+
+
TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
SSE, STORAGE_CLASS, GRANTS, WEBSITE_REDIRECT, CONTENT_TYPE,
CACHE_CONTROL, CONTENT_DISPOSITION, CONTENT_ENCODING,
CONTENT_LANGUAGE, EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
- PAGE_SIZE, IGNORE_GLACIER_WARNINGS]
+ PAGE_SIZE, IGNORE_GLACIER_WARNINGS, NORMALIZE_UNICODE]
def get_client(session, region, endpoint_url, verify):
@@ -770,10 +779,12 @@
operation_name,
self.parameters['follow_symlinks'],
self.parameters['page_size'],
+ self.parameters['normalize_unicode'],
result_queue=result_queue)
rev_generator = FileGenerator(self._client, '',
self.parameters['follow_symlinks'],
self.parameters['page_size'],
+ self.parameters['normalize_unicode'],
result_queue=result_queue)
taskinfo = [TaskInfo(src=files['src']['path'],
src_type='s3',
I'm not submitting it as a PR because it's missing at least tests and documentation. I'm mostly leaving it here in case others find it helpful. Of course, feel free to use it as a starting point for fixing this issue if my approach doesn't seem too off base. EDIT: just updated the patch to apply unicode normalization before sorting file names. |
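For readers unfamiliar with the underlying mismatch, here is a minimal sketch of what the two `unicodedata.normalize('NFKC', ...)` calls in the patch do. It assumes, as background that the patch itself does not state, that the local file system hands back names in decomposed form (as HFS+ on OS X typically does), while the same names tend to be stored elsewhere in composed form. The file names and the helper `normalized_sorted` are made up for illustration.

```python
import unicodedata

# Hypothetical names as a decomposing file system might return them:
# "e" followed by U+0301 COMBINING ACUTE ACCENT instead of a single U+00E9.
listed = ["cafe\u0301.txt", "cafz.txt"]

# The same first name in composed form.
composed = "caf\u00e9.txt"


def normalized_sorted(names, form="NFKC"):
    """Normalize every name first, then sort the normalized names."""
    return sorted(unicodedata.normalize(form, name) for name in names)


# The two spellings render identically but are different strings.
print(listed[0] == composed)                                 # False
print(unicodedata.normalize("NFKC", listed[0]) == composed)  # True

# Sort order changes with normalization: "e" (U+0065) sorts before "z",
# while the composed "é" (U+00E9) sorts after it.
print(sorted(listed) == listed)    # True: the decomposed name sorts first
print(normalized_sorted(listed))   # the composed name now sorts last
```

The order matters whenever a sorted local listing is compared against names that are already in composed form, which is presumably why the patch normalizes before sorting.
|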
Wow, nice work! We'll look into it |
I created a branch and opened a pull request in order to make it easier to maintain the patch -- the recent release broke it. |
Here's a new version of the patch, recreated against the latest release. In case someone else uses it:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py 2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py 2015-11-15 18:56:31.000000000 +0100
@@ -13,6 +13,7 @@
import os
import sys
import stat
+import unicodedata
from dateutil.parser import parse
from dateutil.tz import tzlocal
@@ -116,11 +117,12 @@
``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
"""
def __init__(self, client, operation_name, follow_symlinks=True,
- page_size=None, result_queue=None):
+ page_size=None, normalize_unicode=False, result_queue=None):
self._client = client
self.operation_name = operation_name
self.follow_symlinks = follow_symlinks
self.page_size = page_size
+ self.normalize_unicode = normalize_unicode
self.result_queue = result_queue
if not result_queue:
self.result_queue = queue.Queue()
@@ -167,6 +169,8 @@
"""
join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
error, listdir = os.error, os.listdir
+ if self.normalize_unicode:
+ path = unicodedata.normalize('NFKC', path)
if not self.should_ignore_file(path):
if not dir_op:
size, last_update = get_file_stat(path)
@@ -185,6 +189,8 @@
listdir_names = listdir(path)
names = []
for name in listdir_names:
+ if self.normalize_unicode:
+ name = unicodedata.normalize('NFKC', name)
if not self.should_ignore_file_with_decoding_warnings(
path, name):
file_path = join(path, name)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/s3handler.py awscli/customizations/s3/s3handler.py
--- awscli.orig/customizations/s3/s3handler.py 2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/s3handler.py 2015-11-15 09:25:54.000000000 +0100
@@ -64,7 +64,8 @@
'grants': None, 'only_show_errors': False,
'is_stream': False, 'paths_type': None,
'expected_size': None, 'metadata_directive': None,
- 'ignore_glacier_warnings': False}
+ 'ignore_glacier_warnings': False,
+ 'normalize_unicode': False}
self.params['region'] = params['region']
for key in self.params.keys():
if key in params:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py 2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2015-11-15 18:18:23.000000000 +0100
@@ -301,12 +301,21 @@
}
+NORMALIZE_UNICODE = {
+ 'name': 'normalize-unicode', 'action': 'store_true',
+ 'help_text': (
+ 'Normalizes file names read from the local filesystem in unicode '
+ 'normal form KC. This is mainly useful when running on OS X.'
+ )
+}
+
+
TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
SSE, STORAGE_CLASS, GRANTS, WEBSITE_REDIRECT, CONTENT_TYPE,
CACHE_CONTROL, CONTENT_DISPOSITION, CONTENT_ENCODING,
CONTENT_LANGUAGE, EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
- PAGE_SIZE, IGNORE_GLACIER_WARNINGS]
+ PAGE_SIZE, IGNORE_GLACIER_WARNINGS, NORMALIZE_UNICODE]
def get_client(session, region, endpoint_url, verify):
@@ -770,10 +779,12 @@
operation_name,
self.parameters['follow_symlinks'],
self.parameters['page_size'],
+ self.parameters['normalize_unicode'],
result_queue=result_queue)
rev_generator = FileGenerator(self._client, '',
self.parameters['follow_symlinks'],
self.parameters['page_size'],
+ self.parameters['normalize_unicode'],
result_queue=result_queue)
taskinfo = [TaskInfo(src=files['src']['path'],
src_type='s3', |
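The patch threads one boolean from the command line down to the code that lists files. The sketch below mimics that flow with the standard `argparse` module. It is an analogy only: awscli builds its arguments from dictionaries such as `NORMALIZE_UNICODE`, not from `argparse` calls like these, and `NameNormalizer`, `build_parser` and `make_normalizer` are hypothetical names.

```python
import argparse
import unicodedata


class NameNormalizer:
    """Hypothetical stand-in for the part of FileGenerator the patch touches."""

    def __init__(self, normalize_unicode=False):
        self.normalize_unicode = normalize_unicode

    def clean(self, name):
        # Same conditional as in the patch: only rewrite the name when asked.
        if self.normalize_unicode:
            return unicodedata.normalize("NFKC", name)
        return name


def build_parser():
    parser = argparse.ArgumentParser(prog="sync-sketch")
    # 'store_true' makes the option a flag that defaults to False.
    parser.add_argument(
        "--normalize-unicode",
        action="store_true",
        help="Normalize local file names to Unicode normal form KC.",
    )
    return parser


def make_normalizer(argv):
    # argparse stores "--normalize-unicode" under the key "normalize_unicode".
    params = vars(build_parser().parse_args(argv))
    return NameNormalizer(normalize_unicode=params["normalize_unicode"])


print(make_normalizer([]).clean("cafe\u0301") == "cafe\u0301")                     # True
print(make_normalizer(["--normalize-unicode"]).clean("cafe\u0301") == "caf\u00e9")  # True
```

Because the flag defaults to `False`, runs that do not pass it behave exactly as before.
|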
Thanks for the excellent analysis Aymeric, this is exactly the issue I'm experiencing and it was difficult to track down. I hope somebody from AWS can help us here. |
Updated version of the patch against the latest release.
commit 78640c7f7a345fb3740b72c239007470a5709caf
Author: Aymeric Augustin
Date: Tue Dec 20 23:05:49 2016 +0100
Add an option to normalize file names.
diff --git a/awscli/customizations/s3/filegenerator.py b/awscli/customizations/s3/filegenerator.py
index d33b77f..13a7f1d 100644
--- a/awscli/customizations/s3/filegenerator.py
+++ b/awscli/customizations/s3/filegenerator.py
@@ -13,6 +13,7 @@
import os
import sys
import stat
+import unicodedata
from dateutil.parser import parse
from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@ class FileGenerator(object):
``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
"""
def __init__(self, client, operation_name, follow_symlinks=True,
- page_size=None, result_queue=None, request_parameters=None):
+ page_size=None, result_queue=None, request_parameters=None,
+ normalize_unicode=False):
self._client = client
self.operation_name = operation_name
self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@ class FileGenerator(object):
self.request_parameters = {}
if request_parameters is not None:
self.request_parameters = request_parameters
+ self.normalize_unicode = normalize_unicode
def call(self, files):
"""
@@ -170,6 +173,8 @@ class FileGenerator(object):
"""
join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
error, listdir = os.error, os.listdir
+ if self.normalize_unicode:
+ path = unicodedata.normalize('NFKC', path)
if not self.should_ignore_file(path):
if not dir_op:
size, last_update = get_file_stat(path)
@@ -189,6 +194,8 @@ class FileGenerator(object):
listdir_names = listdir(path)
names = []
for name in listdir_names:
+ if self.normalize_unicode:
+ name = unicodedata.normalize('NFKC', name)
if not self.should_ignore_file_with_decoding_warnings(
path, name):
file_path = join(path, name)
diff --git a/awscli/customizations/s3/subcommands.py b/awscli/customizations/s3/subcommands.py
index 4bc7398..04afe3f 100644
--- a/awscli/customizations/s3/subcommands.py
+++ b/awscli/customizations/s3/subcommands.py
@@ -417,6 +417,14 @@ REQUEST_PAYER = {
)
}
+NORMALIZE_UNICODE = {
+ 'name': 'normalize-unicode', 'action': 'store_true',
+ 'help_text': (
+ 'Normalizes file names read from the local filesystem in unicode '
+ 'normal form KC. This is mainly useful when running on OS X.'
+ )
+}
+
TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -424,7 +432,8 @@ TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
WEBSITE_REDIRECT, CONTENT_TYPE, CACHE_CONTROL,
CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
- PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER]
+ PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
+ NORMALIZE_UNICODE]
def get_client(session, region, endpoint_url, verify, config=None):
@@ -963,12 +972,14 @@ class CommandArchitecture(object):
'client': self._source_client, 'operation_name': operation_name,
'follow_symlinks': self.parameters['follow_symlinks'],
'page_size': self.parameters['page_size'],
+ 'normalize_unicode': self.parameters['normalize_unicode'],
'result_queue': result_queue,
}
rgen_kwargs = {
'client': self._client, 'operation_name': '',
'follow_symlinks': self.parameters['follow_symlinks'],
'page_size': self.parameters['page_size'],
+ 'normalize_unicode': self.parameters['normalize_unicode'],
'result_queue': result_queue,
}
|
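A side note on the choice of normal form: the patch uses NFKC, which applies compatibility mappings on top of canonical composition. The sketch below compares it with NFC on a few made-up file names to show one possible consequence, namely that names which are distinct on disk can collapse into the same normalized name. `colliding_names` is a hypothetical helper, and nothing in the thread says whether this case matters in practice.

```python
import unicodedata


def colliding_names(names, form):
    """Group distinct names that become identical after normalization to `form`."""
    groups = {}
    for name in names:
        groups.setdefault(unicodedata.normalize(form, name), []).append(name)
    return {key: originals for key, originals in groups.items() if len(originals) > 1}


# Hypothetical directory listing: a ligature "ﬁ" (U+FB01) next to plain "fi",
# and a superscript two (U+00B2) next to a plain digit.
names = ["\ufb01le.txt", "file.txt", "x\u00b2.dat", "x2.dat"]

print(colliding_names(names, "NFC"))   # {}: canonical composition keeps them apart
print(colliding_names(names, "NFKC"))  # both pairs collapse to a single name each
```

NFC would still compose decomposed accents, so it may be worth weighing against NFKC if such collisions are a concern.
|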
This patch is perfect for me, thanks. 👍 |
Patch rebased on top of develop.
commit c5466f2191b073303edef62d531761591e7e6c90
Author: Aymeric Augustin <aymeric.augustin@m4x.org>
Date: Tue Dec 20 23:05:49 2016 +0100
Add an option to normalize file names.
diff --git a/awscli/customizations/s3/filegenerator.py b/awscli/customizations/s3/filegenerator.py
index f24ca187..70a17581 100644
--- a/awscli/customizations/s3/filegenerator.py
+++ b/awscli/customizations/s3/filegenerator.py
@@ -13,6 +13,7 @@
import os
import sys
import stat
+import unicodedata
from dateutil.parser import parse
from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@ class FileGenerator(object):
``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
"""
def __init__(self, client, operation_name, follow_symlinks=True,
- page_size=None, result_queue=None, request_parameters=None):
+ page_size=None, result_queue=None, request_parameters=None,
+ normalize_unicode=False):
self._client = client
self.operation_name = operation_name
self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@ class FileGenerator(object):
self.request_parameters = {}
if request_parameters is not None:
self.request_parameters = request_parameters
+ self.normalize_unicode = normalize_unicode
def call(self, files):
"""
@@ -170,6 +173,8 @@ class FileGenerator(object):
"""
join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
error, listdir = os.error, os.listdir
+ if self.normalize_unicode:
+ path = unicodedata.normalize('NFKC', path)
if not self.should_ignore_file(path):
if not dir_op:
stats = self._safely_get_file_stats(path)
@@ -188,6 +193,8 @@ class FileGenerator(object):
listdir_names = listdir(path)
names = []
for name in listdir_names:
+ if self.normalize_unicode:
+ name = unicodedata.normalize('NFKC', name)
if not self.should_ignore_file_with_decoding_warnings(
path, name):
file_path = join(path, name)
diff --git a/awscli/customizations/s3/subcommands.py b/awscli/customizations/s3/subcommands.py
index 02d591ea..b9b1d6c9 100644
--- a/awscli/customizations/s3/subcommands.py
+++ b/awscli/customizations/s3/subcommands.py
@@ -418,6 +418,14 @@ REQUEST_PAYER = {
)
}
+NORMALIZE_UNICODE = {
+ 'name': 'normalize-unicode', 'action': 'store_true',
+ 'help_text': (
+ 'Normalizes file names read from the local filesystem in unicode '
+ 'normal form KC. This is mainly useful when running on OS X.'
+ )
+}
+
TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -425,7 +433,8 @@ TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
WEBSITE_REDIRECT, CONTENT_TYPE, CACHE_CONTROL,
CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
- PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER]
+ PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
+ NORMALIZE_UNICODE]
def get_client(session, region, endpoint_url, verify, config=None):
@@ -964,12 +973,14 @@ class CommandArchitecture(object):
'client': self._source_client, 'operation_name': operation_name,
'follow_symlinks': self.parameters['follow_symlinks'],
'page_size': self.parameters['page_size'],
+ 'normalize_unicode': self.parameters['normalize_unicode'],
'result_queue': result_queue,
}
rgen_kwargs = {
'client': self._client, 'operation_name': '',
'follow_symlinks': self.parameters['follow_symlinks'],
'page_size': self.parameters['page_size'],
+ 'normalize_unicode': self.parameters['normalize_unicode'],
'result_queue': result_queue,
} |
I had a bit of free time this morning so I took a look at this. It doesn't look like this will work since we will need to operate on those files down the line and having the altered path will break that. I think the changes necessary to fully support this feature would need to be more invasive. |
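One way to read the concern above: once `path` and `name` are overwritten with their normalized forms, later calls that open or stat the file receive a string the file system may not recognize. A possible direction, sketched here purely as an illustration and not as what the maintainers had in mind, is to keep the name exactly as the OS returned it for file access and carry the normalized form alongside it for sorting and comparison. `LocalEntry`, `list_local` and the `listdir` parameter are hypothetical names, not part of awscli.

```python
import os
import unicodedata
from collections import namedtuple

# disk_path:   the path built from the name the OS returned, used for file access.
# compare_key: the normalized name, used only for sorting and comparison.
LocalEntry = namedtuple("LocalEntry", ["disk_path", "compare_key"])


def list_local(root, form="NFKC", listdir=os.listdir):
    """List `root`, pairing each on-disk path with its normalized name."""
    entries = [
        LocalEntry(os.path.join(root, name), unicodedata.normalize(form, name))
        for name in listdir(root)
    ]
    return sorted(entries, key=lambda entry: entry.compare_key)


def fake_listing(root):
    # Stand-in for os.listdir so the example does not depend on a real directory.
    return ["cafe\u0301.txt", "cafz.txt"]


entries = list_local("photos", listdir=fake_listing)
for entry in entries:
    print(repr(entry.disk_path), "->", repr(entry.compare_key))
```

The entries come out ordered by the normalized name, while `disk_path` still holds the decomposed spelling that the listing returned.
|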
Good Morning!
We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.
This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.
As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.
We’ve imported existing feature requests from GitHub - Search for this issue there!
And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.
GitHub will remain the channel for reporting bugs.
Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface
-The AWS SDKs & Tools Team
This entry can specifically be found on UserVoice at: https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168379-provide-an-option-to-perfom-unicode-normalization |
Patch updated (again).
diff -Naur awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py 2018-03-04 21:29:37.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py 2018-03-04 21:31:07.000000000 +0100
@@ -13,6 +13,7 @@
import os
import sys
import stat
+import unicodedata
from dateutil.parser import parse
from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@
``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
"""
def __init__(self, client, operation_name, follow_symlinks=True,
- page_size=None, result_queue=None, request_parameters=None):
+ page_size=None, result_queue=None, request_parameters=None,
+ normalize_unicode=False):
self._client = client
self.operation_name = operation_name
self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@
self.request_parameters = {}
if request_parameters is not None:
self.request_parameters = request_parameters
+ self.normalize_unicode = normalize_unicode
def call(self, files):
"""
@@ -170,6 +173,8 @@
"""
join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
error, listdir = os.error, os.listdir
+ if self.normalize_unicode:
+ path = unicodedata.normalize('NFKC', path)
if not self.should_ignore_file(path):
if not dir_op:
stats = self._safely_get_file_stats(path)
@@ -188,6 +193,8 @@
listdir_names = listdir(path)
names = []
for name in listdir_names:
+ if self.normalize_unicode:
+ name = unicodedata.normalize('NFKC', name)
if not self.should_ignore_file_with_decoding_warnings(
path, name):
file_path = join(path, name)
diff -Naur awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py 2018-03-04 21:29:37.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2018-03-04 21:33:41.000000000 +0100
@@ -427,6 +427,15 @@
)
}
+NORMALIZE_UNICODE = {
+ 'name': 'normalize-unicode', 'action': 'store_true',
+ 'help_text': (
+ 'Normalizes file names read from the local filesystem in unicode '
+ 'normal form KC. This is mainly useful when running on macOS.'
+ )
+}
+
+
TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -435,7 +444,7 @@
CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS, NO_PROGRESS,
PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
- REQUEST_PAYER]
+ REQUEST_PAYER, NORMALIZE_UNICODE]
def get_client(session, region, endpoint_url, verify, config=None):
@@ -978,12 +987,14 @@
'follow_symlinks': self.parameters['follow_symlinks'],
'page_size': self.parameters['page_size'],
'result_queue': result_queue,
+ 'normalize_unicode': self.parameters['normalize_unicode'],
}
rgen_kwargs = {
'client': self._client, 'operation_name': '',
'follow_symlinks': self.parameters['follow_symlinks'],
'page_size': self.parameters['page_size'],
'result_queue': result_queue,
+ 'normalize_unicode': self.parameters['normalize_unicode'],
}
fgen_request_parameters = \
FTR the last version of the patch still works.
Summary
aws s3 sync doesn't play well with HFS+ Unicode normalization on OS X. I suggest adding an option to normalize file names read locally to normal form C before doing anything with them.
Reproduction steps
Create a file on S3 whose name contains an accented character. For reasons that will become apparent later, do this on a Linux system.
Synchronize that file on a Mac.
Synchronize it back to S3.
At this point the file shows up twice in S3!
Why this happens
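Unicode allows some characters, typically the accented characters that are common in Western European languages and occasionally occur in English, to be represented in more than one way. Two of its normal forms matter here: NFC, which uses precomposed characters, and NFD, which decomposes them into a base character followed by combining marks.
The documentation of unicodedata.normalize, the Python function that converts between these forms, has a good explanation.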
A quick illustration:
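As a minimal sketch using only Python's standard unicodedata module (the sample name café.txt is an assumption chosen for illustration), the same visible name has two different representations:

```python
import unicodedata

# Sample file name (an assumption for illustration): 'é' as one precomposed code point.
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed form
nfd = unicodedata.normalize("NFD", name)  # 'e' followed by a combining acute accent

print(len(nfc), nfc.encode("utf-8"))  # 8 b'caf\xc3\xa9.txt'
print(len(nfd), nfd.encode("utf-8"))  # 9 b'cafe\xcc\x81.txt'
print(nfc == nfd)                     # False
```

Both strings render identically, but they compare as different strings and encode to different bytes.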
The default filesystem of OS X, HFS+, enforces something that resembles NFD. (Let's say I haven't encountered the difference yet.)
Pretty much everything else, including typing on a keyboard on Linux or OS X, uses NFC. I'm not sure about Windows.
Of course this is entirely HFS+'s fault, but since OS X is a popular system among your target audience, I hope you may have some interest in providing a solution to this problem.
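The duplicate can be reproduced with a simplified model of a sync comparison. This is only a sketch: names_to_upload is a hypothetical helper, not the comparison code used by aws s3 sync, and the file name is an assumed example.

```python
import unicodedata


def names_to_upload(local_names, remote_keys, normalize=False):
    """Return local names that have no identical key on the remote side.

    Simplified model for illustration; not the logic aws s3 sync uses.
    """
    if normalize:
        local_names = [unicodedata.normalize("NFC", n) for n in local_names]
    return sorted(set(local_names) - set(remote_keys))


# Key as created from Linux: precomposed 'é' (NFC).
remote_keys = {"caf\u00e9.txt"}
# Name roughly as an HFS+ directory listing reports it: 'e' + combining acute (NFD).
local_names = ["cafe\u0301.txt"]

# Without normalization the local name matches no remote key,
# so it would be uploaded as a second, visually identical key.
print(names_to_upload(local_names, remote_keys))

# With normalization the names match and nothing is uploaded.
print(names_to_upload(local_names, remote_keys, normalize=True))  # []
```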
What you can do about it
I think a --normalize-unicode option (possibly with a better name) for aws s3 sync would be useful. It would normalize file names read from the local filesystem with unicodedata.normalize('NFKC', filepath).
Its primary purpose would be to interact with S3 on OS X and have file names in NFC form on S3, which is what the rest of the world expects and will cause the least amount of problems.
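A rough sketch of what such an option would do to names read from a directory listing. The helper normalize_names is hypothetical and not part of aws-cli; in practice the names would come from os.listdir, and the two sample names are assumptions.

```python
import unicodedata


def normalize_names(names, form="NFKC"):
    """Return the names converted to the given Unicode normal form."""
    return [unicodedata.normalize(form, name) for name in names]


# Sample names (assumptions): a decomposed 'é', and the 'fi' ligature U+FB01.
listed = ["cafe\u0301.txt", "\ufb01le.txt"]

print(normalize_names(listed, "NFC"))   # composes 'é', keeps the ligature
print(normalize_names(listed, "NFKC"))  # composes 'é' and rewrites the ligature as 'fi'
```

NFKC applies compatibility mappings on top of composition, so it can change names in ways NFC does not. Either form yields precomposed accented characters.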
I don't know aws cli well enough to tell which other parts could use this option. I just encountered the problem when trying to replace "rsync to file server" with "aws s3 sync to S3".
FWIW rsync provides a solution to this problem with the --iconv option. A common idiom is --iconv=UTF8-MAC,UTF8 when rsync'ing from OS X to Linux and --iconv=UTF8,UTF8-MAC when rsync'ing from Linux to OS X. UTF8-MAC is how rsync calls the encoding of file names on HFS+.
However this isn't a good API to tackle the specific problem I'm raising here. This API is about the encoding of file names. The bug is related to Unicode normalization. These are different concepts. UTF8-MAC mixes them.
Thanks!