Provide an option to perform unicode normalization on local file names #1639

aaugustin opened this Issue Nov 14, 2015 · 13 comments

aaugustin commented Nov 14, 2015

Summary

aws s3 sync doesn't play well with HFS+ unicode normalization on OS X. I suggest adding an option that normalizes file names read from the local filesystem to normal form C (NFC) before doing anything with them.

Reproduction steps

  1. Upload a file whose name contains an accented character to S3. For reasons that will become apparent later, do this on a Linux system.

    (linux) % echo test > test/café.txt
    (linux) % aws s3 sync test s3://<test-bucket>/test
    
  2. Synchronize that file on a Mac.

    (OS X) % aws s3 sync s3://<test-bucket>/test test
    download: s3://<test-bucket>/test/café.txt to test/café.txt
    
  3. Synchronize it back to S3.

    (OS X) % aws s3 sync test s3://<test-bucket>/test
    upload: test/café.txt to s3://<test-bucket>/test/café.txt
    
  • Expected result: no upload, because the file is identical locally and on S3: it was just sync'd!
  • Actual result: the file is uploaded again.

At this point the file shows up twice in S3!

[Screenshot, 2015-11-14 22:45:38]

Why this happens

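Unicode defines several normalization forms. The two canonical ones, NFC (composed) and NFD (decomposed), represent some characters with different code point sequences. This typically affects accented characters, which are common in Western European languages and even occur in English.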

The documentation of unicodedata.normalize, the Python function that converts between the two forms, has a good explanation.

A quick illustration:

>>> import unicodedata
>>> "café".encode('utf-8')
b'caf\xc3\xa9'
>>> unicodedata.normalize('NFC', "café").encode('utf-8')
b'caf\xc3\xa9'
>>> unicodedata.normalize('NFD', "café").encode('utf-8')
b'cafe\xcc\x81'

The default filesystem of OS X, HFS+, enforces something that resembles NFD. (Let's say I haven't encountered the difference yet.)

Pretty much everything else, including typing on a keyboard on Linux or OS X, uses NFC. I'm not sure about Windows.
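
To make the mismatch concrete, here is a minimal Python sketch. The file name and the same_name helper are illustrative only and are not part of aws cli. It shows that the NFC and NFD spellings of one name are different strings, which is presumably why a comparison of local names against S3 keys sees two distinct files, and that normalizing both sides makes them compare equal.

```python
import unicodedata

# The same visible name in the two canonical forms.
nfc_name = unicodedata.normalize("NFC", "caf\u00e9.txt")   # é as one code point
nfd_name = unicodedata.normalize("NFD", nfc_name)          # e + combining accent


def same_name(a, b):
    """Compare two file names after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A plain comparison treats the two spellings as different names.
names_differ = nfc_name != nfd_name
lengths = (len(nfc_name), len(nfd_name))

print(names_differ, lengths, same_name(nfc_name, nfd_name))
```

The two strings render identically in most terminals, which is part of what makes the problem hard to spot.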

Of course this is entirely HFS+'s fault, but since OS X is a popular system among your target audience, I hope you may have some interest in providing a solution to this problem.

What you can do about it

I think a --normalize-unicode option (possibly with a better name) for aws s3 sync would be useful. It would normalize file names read from the local filesystem with unicodedata.normalize('NFKC', filepath).

Its primary purpose would be to interact with S3 on OS X and have file names in NFC form on S3, which is what the rest of the world expects and will cause the least amount of problems.
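
A note on the choice of form: the stated goal is NFC on S3, while the suggested call uses NFKC. For a name like café both give the same result, but NFKC also applies compatibility mappings, so it can rewrite characters that NFC leaves alone. A small sketch, with made-up file names chosen only to show the difference:

```python
import unicodedata

decomposed = "cafe\u0301.txt"   # 'e' + combining acute accent (decomposed spelling)
ligature = "\ufb01le.txt"       # starts with U+FB01, the single-code-point "fi" ligature

# For accented characters, NFC and NFKC agree.
nfc_accent = unicodedata.normalize("NFC", decomposed)
nfkc_accent = unicodedata.normalize("NFKC", decomposed)

# For compatibility characters, only NFKC changes the name.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)

print(nfc_accent == nfkc_accent)   # both are the composed spelling
print(nfc_ligature == ligature)    # NFC keeps the ligature
print(nfkc_ligature)               # NFKC expands it to plain "fi"
```

Whether that extra rewriting is desirable for file names is a design decision; if the only goal is to undo HFS+ decomposition, NFC may be enough.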

I don't know aws cli well enough to tell which other parts could use this option. I just encountered the problem when trying to replace "rsync to file server" with "aws s3 sync to S3".

FWIW rsync provides a solution to this problem with the --iconv option. A common idiom is --iconv=UTF8-MAC,UTF8 when rsync'ing from OS X to Linux and --iconv=UTF8,UTF8-MAC when rsync'ing from Linux to OS X. UTF8-MAC is what rsync calls the encoding of file names on HFS+.

However this isn't a good API to tackle the specific problem I'm raising here. This API is about the encoding of file names. The bug is related to Unicode normalization. These are different concepts. UTF8-MAC mixes them.

Thanks!

aaugustin commented Nov 15, 2015

For what it's worth, the following patch solves my problem:

diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py   2015-11-15 18:56:31.000000000 +0100
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,11 +117,12 @@
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None):
+                 page_size=None, normalize_unicode=False, result_queue=None):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
         self.page_size = page_size
+        self.normalize_unicode = normalize_unicode
         self.result_queue = result_queue
         if not result_queue:
             self.result_queue = queue.Queue()
@@ -167,6 +169,8 @@
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 size, last_update = get_file_stat(path)
@@ -185,6 +189,8 @@
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/s3handler.py awscli/customizations/s3/s3handler.py
--- awscli.orig/customizations/s3/s3handler.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/s3handler.py   2015-11-15 09:25:54.000000000 +0100
@@ -64,7 +64,8 @@
                        'grants': None, 'only_show_errors': False,
                        'is_stream': False, 'paths_type': None,
                        'expected_size': None, 'metadata_directive': None,
-                       'ignore_glacier_warnings': False}
+                       'ignore_glacier_warnings': False,
+                       'normalize_unicode': False}
         self.params['region'] = params['region']
         for key in self.params.keys():
             if key in params:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py    2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2015-11-15 18:18:23.000000000 +0100
@@ -301,12 +301,21 @@
 }


+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, STORAGE_CLASS, GRANTS, WEBSITE_REDIRECT, CONTENT_TYPE,
                  CACHE_CONTROL, CONTENT_DISPOSITION, CONTENT_ENCODING,
                  CONTENT_LANGUAGE, EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, NORMALIZE_UNICODE]


 def get_client(session, region, endpoint_url, verify):
@@ -770,10 +779,12 @@
                                        operation_name,
                                        self.parameters['follow_symlinks'],
                                        self.parameters['page_size'],
+                                       self.parameters['normalize_unicode'],
                                        result_queue=result_queue)
         rev_generator = FileGenerator(self._client, '',
                                       self.parameters['follow_symlinks'],
                                       self.parameters['page_size'],
+                                      self.parameters['normalize_unicode'],
                                       result_queue=result_queue)
         taskinfo = [TaskInfo(src=files['src']['path'],
                              src_type='s3',

I'm not submitting it as a PR because it's missing at least tests and documentation. I'm mostly leaving it here in case others find it helpful.

Of course, feel free to use it as a starting point for fixing this issue if my approach doesn't seem too off base.

EDIT: just updated the patch to apply unicode normalization before sorting file names.
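
A possible illustration of why the order matters: Python sorts strings by code point, and the decomposed and composed spellings of a name do not sort the same way relative to other names. The sketch below uses two made-up file names. Whether a given aws cli version relies on sorted listings is an assumption here, not something checked against its source.

```python
import unicodedata

# Names as a decomposing filesystem might return them (illustrative).
disk_names = ["cafz.txt", "cafe\u0301.txt"]

# Normalizing first, then sorting, yields a list ordered by the composed names.
normalize_then_sort = sorted(unicodedata.normalize("NFC", n) for n in disk_names)

# Sorting first, then normalizing, keeps the order of the decomposed names.
sort_then_normalize = [unicodedata.normalize("NFC", n) for n in sorted(disk_names)]

# The second list is no longer in sorted order after normalization.
still_sorted = sort_then_normalize == sorted(sort_then_normalize)

print(normalize_then_sort)
print(sort_then_normalize)
print(still_sorted)
```

In decomposed form the plain "e" sorts before "z", while the composed "é" (U+00E9) sorts after it, so the two approaches disagree.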

@JordonPhillips JordonPhillips added the bug label Nov 16, 2015

Member

JordonPhillips commented Nov 16, 2015

Wow, nice work! We'll look into it

aaugustin commented Nov 21, 2015

I created a branch and opened a pull request in order to make it easier to maintain the patch -- the recent release broke it.

aaugustin commented Oct 14, 2016

Here's a new version of the patch, recreated against the latest release.

In case someone else uses it:

  • I plan to maintain it for the foreseeable future because I need it. I'll post occasional updates here. Changes are extremely limited and should be easy to port to future versions.
  • While the initial response was positive, it's unclear whether AWS plans to fix this bug. Unfortunately, in my experience, American companies tend not to care much about Unicode, even if they do business internationally, so I'm not getting my hopes up too high.
  • For this reason, I suggest sticking to ASCII file names on S3 rather than using this if it isn't too late for you. (It is too late for me.)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py   2015-11-15 18:56:31.000000000 +0100
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,11 +117,12 @@
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None):
+                 page_size=None, normalize_unicode=False, result_queue=None):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
         self.page_size = page_size
+        self.normalize_unicode = normalize_unicode
         self.result_queue = result_queue
         if not result_queue:
             self.result_queue = queue.Queue()
@@ -167,6 +169,8 @@
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 size, last_update = get_file_stat(path)
@@ -185,6 +189,8 @@
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/s3handler.py awscli/customizations/s3/s3handler.py
--- awscli.orig/customizations/s3/s3handler.py  2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/s3handler.py   2015-11-15 09:25:54.000000000 +0100
@@ -64,7 +64,8 @@
                        'grants': None, 'only_show_errors': False,
                        'is_stream': False, 'paths_type': None,
                        'expected_size': None, 'metadata_directive': None,
-                       'ignore_glacier_warnings': False}
+                       'ignore_glacier_warnings': False,
+                       'normalize_unicode': False}
         self.params['region'] = params['region']
         for key in self.params.keys():
             if key in params:
diff -r -u -x '*.pyc' -w awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py    2015-11-15 08:50:45.000000000 +0100
+++ awscli/customizations/s3/subcommands.py 2015-11-15 18:18:23.000000000 +0100
@@ -301,12 +301,21 @@
 }


+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, STORAGE_CLASS, GRANTS, WEBSITE_REDIRECT, CONTENT_TYPE,
                  CACHE_CONTROL, CONTENT_DISPOSITION, CONTENT_ENCODING,
                  CONTENT_LANGUAGE, EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, NORMALIZE_UNICODE]


 def get_client(session, region, endpoint_url, verify):
@@ -770,10 +779,12 @@
                                        operation_name,
                                        self.parameters['follow_symlinks'],
                                        self.parameters['page_size'],
+                                       self.parameters['normalize_unicode'],
                                        result_queue=result_queue)
         rev_generator = FileGenerator(self._client, '',
                                       self.parameters['follow_symlinks'],
                                       self.parameters['page_size'],
+                                      self.parameters['normalize_unicode'],
                                       result_queue=result_queue)
         taskinfo = [TaskInfo(src=files['src']['path'],
                              src_type='s3',

BenAbineriBubble commented Dec 14, 2016

Thanks for the excellent analysis Aymeric, this is exactly the issue I'm experiencing and it was difficult to track down.

I hope somebody from AWS can help us here.

aaugustin commented Dec 20, 2016

Updated version of the patch against the latest release.

commit 78640c7f7a345fb3740b72c239007470a5709caf
Author: Aymeric Augustin
Date:   Tue Dec 20 23:05:49 2016 +0100

    Add an option to normalize file names.

diff --git a/awscli/customizations/s3/filegenerator.py b/awscli/customizations/s3/filegenerator.py
index d33b77f..13a7f1d 100644
--- a/awscli/customizations/s3/filegenerator.py
+++ b/awscli/customizations/s3/filegenerator.py
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata
 
 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@ class FileGenerator(object):
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None, request_parameters=None):
+                 page_size=None, result_queue=None, request_parameters=None,
+                 normalize_unicode=False):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@ class FileGenerator(object):
         self.request_parameters = {}
         if request_parameters is not None:
             self.request_parameters = request_parameters
+        self.normalize_unicode = normalize_unicode
 
     def call(self, files):
         """
@@ -170,6 +173,8 @@ class FileGenerator(object):
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 size, last_update = get_file_stat(path)
@@ -189,6 +194,8 @@ class FileGenerator(object):
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff --git a/awscli/customizations/s3/subcommands.py b/awscli/customizations/s3/subcommands.py
index 4bc7398..04afe3f 100644
--- a/awscli/customizations/s3/subcommands.py
+++ b/awscli/customizations/s3/subcommands.py
@@ -417,6 +417,14 @@ REQUEST_PAYER = {
     )
 }
 
+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -424,7 +432,8 @@ TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  WEBSITE_REDIRECT, CONTENT_TYPE, CACHE_CONTROL,
                  CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
                  EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
+                 NORMALIZE_UNICODE]
 
 
 def get_client(session, region, endpoint_url, verify, config=None):
@@ -963,12 +972,14 @@ class CommandArchitecture(object):
             'client': self._source_client, 'operation_name': operation_name,
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
         }
         rgen_kwargs = {
             'client': self._client, 'operation_name': '',
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
         }
 

ishikawa commented Jan 1, 2017

This patch is perfect for me, thanks. 👍

ishikawa added a commit to ishikawa/aws-cli that referenced this issue Jan 1, 2017

aaugustin commented May 2, 2017

Patch rebased on top of develop.

commit c5466f2191b073303edef62d531761591e7e6c90
Author: Aymeric Augustin <aymeric.augustin@m4x.org>
Date:   Tue Dec 20 23:05:49 2016 +0100

    Add an option to normalize file names.

diff --git a/awscli/customizations/s3/filegenerator.py b/awscli/customizations/s3/filegenerator.py
index f24ca187..70a17581 100644
--- a/awscli/customizations/s3/filegenerator.py
+++ b/awscli/customizations/s3/filegenerator.py
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@ class FileGenerator(object):
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None, request_parameters=None):
+                 page_size=None, result_queue=None, request_parameters=None,
+                 normalize_unicode=False):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@ class FileGenerator(object):
         self.request_parameters = {}
         if request_parameters is not None:
             self.request_parameters = request_parameters
+        self.normalize_unicode = normalize_unicode

     def call(self, files):
         """
@@ -170,6 +173,8 @@ class FileGenerator(object):
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 stats = self._safely_get_file_stats(path)
@@ -188,6 +193,8 @@ class FileGenerator(object):
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff --git a/awscli/customizations/s3/subcommands.py b/awscli/customizations/s3/subcommands.py
index 02d591ea..b9b1d6c9 100644
--- a/awscli/customizations/s3/subcommands.py
+++ b/awscli/customizations/s3/subcommands.py
@@ -418,6 +418,14 @@ REQUEST_PAYER = {
     )
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on OS X.'
+    )
+}
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -425,7 +433,8 @@ TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  WEBSITE_REDIRECT, CONTENT_TYPE, CACHE_CONTROL,
                  CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
                  EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS,
-                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER]
+                 PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
+                 NORMALIZE_UNICODE]


 def get_client(session, region, endpoint_url, verify, config=None):
@@ -964,12 +973,14 @@ class CommandArchitecture(object):
             'client': self._source_client, 'operation_name': operation_name,
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
         }
         rgen_kwargs = {
             'client': self._client, 'operation_name': '',
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
+            'normalize_unicode': self.parameters['normalize_unicode'],
             'result_queue': result_queue,
         }

@JordonPhillips JordonPhillips added feature-request and removed bug labels Jul 25, 2017

Member

JordonPhillips commented Nov 3, 2017

I had a bit of free time this morning so I took a look at this. It doesn't look like this will work since we will need to operate on those files down the line and having the altered path will break that. I think the changes necessary to fully support this feature would need to be more invasive.
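
One way to read this concern: once a name has been normalized, it may no longer match the name stored on disk, so later file operations still need the original. A minimal sketch of carrying both along, using hypothetical helpers pair_names and list_local that are not part of aws cli:

```python
import os
import unicodedata


def pair_names(names, form="NFC"):
    """Return (on_disk_name, normalized_name) pairs, sorted by normalized name.

    The on-disk name is kept for opening the file; the normalized name is
    what would be compared against, or used as, the remote key.
    """
    pairs = [(name, unicodedata.normalize(form, name)) for name in names]
    return sorted(pairs, key=lambda pair: pair[1])


def list_local(directory, form="NFC"):
    """Pair every entry of a directory with its normalized name (hypothetical helper)."""
    return pair_names(os.listdir(directory), form)


# Example with names as a decomposing filesystem might return them.
pairs = pair_names(["cafe\u0301.txt", "b.txt"])
for disk_name, key_name in pairs:
    print(repr(disk_name), "->", repr(key_name))
```

This only sketches the data flow; where such pairs would have to be threaded through the existing code is exactly the more invasive change described above.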

Contributor

ASayre commented Feb 6, 2018

Good Morning!

We're closing this issue here on GitHub, as part of our migration to UserVoice for feature requests involving the AWS CLI.

This will let us get the most important features to you, by making it easier to search for and show support for the features you care the most about, without diluting the conversation with bug reports.

As a quick UserVoice primer (if not already familiar): after an idea is posted, people can vote on the ideas, and the product team will be responding directly to the most popular suggestions.

We’ve imported existing feature requests from GitHub - Search for this issue there!

And don't worry, this issue will still exist on GitHub for posterity's sake. As it’s a text-only import of the original post into UserVoice, we’ll still be keeping in mind the comments and discussion that already exist here on the GitHub issue.

GitHub will remain the channel for reporting bugs.

Once again, this issue can now be found by searching for the title on: https://aws.uservoice.com/forums/598381-aws-command-line-interface

-The AWS SDKs & Tools Team

This entry can specifically be found on UserVoice at: https://aws.uservoice.com/forums/598381-aws-command-line-interface/suggestions/33168379-provide-an-option-to-perfom-unicode-normalization

@ASayre ASayre closed this Feb 6, 2018

salmanwaheed commented Feb 6, 2018

salmanwaheed commented Feb 6, 2018

aaugustin commented Mar 4, 2018

Patch updated (again).

diff -Naur awscli.orig/customizations/s3/filegenerator.py awscli/customizations/s3/filegenerator.py
--- awscli.orig/customizations/s3/filegenerator.py    2018-03-04 21:29:37.000000000 +0100
+++ awscli/customizations/s3/filegenerator.py 2018-03-04 21:31:07.000000000 +0100
@@ -13,6 +13,7 @@
 import os
 import sys
 import stat
+import unicodedata

 from dateutil.parser import parse
 from dateutil.tz import tzlocal
@@ -116,7 +117,8 @@
     ``FileInfo`` objects to send to a ``Comparator`` or ``S3Handler``.
     """
     def __init__(self, client, operation_name, follow_symlinks=True,
-                 page_size=None, result_queue=None, request_parameters=None):
+                 page_size=None, result_queue=None, request_parameters=None,
+                 normalize_unicode=False):
         self._client = client
         self.operation_name = operation_name
         self.follow_symlinks = follow_symlinks
@@ -127,6 +129,7 @@
         self.request_parameters = {}
         if request_parameters is not None:
             self.request_parameters = request_parameters
+        self.normalize_unicode = normalize_unicode

     def call(self, files):
         """
@@ -170,6 +173,8 @@
         """
         join, isdir, isfile = os.path.join, os.path.isdir, os.path.isfile
         error, listdir = os.error, os.listdir
+        if self.normalize_unicode:
+            path = unicodedata.normalize('NFKC', path)
         if not self.should_ignore_file(path):
             if not dir_op:
                 stats = self._safely_get_file_stats(path)
@@ -188,6 +193,8 @@
                 listdir_names = listdir(path)
                 names = []
                 for name in listdir_names:
+                    if self.normalize_unicode:
+                        name = unicodedata.normalize('NFKC', name)
                     if not self.should_ignore_file_with_decoding_warnings(
                             path, name):
                         file_path = join(path, name)
diff -Naur awscli.orig/customizations/s3/subcommands.py awscli/customizations/s3/subcommands.py
--- awscli.orig/customizations/s3/subcommands.py  2018-03-04 21:29:37.000000000 +0100
+++ awscli/customizations/s3/subcommands.py   2018-03-04 21:33:41.000000000 +0100
@@ -427,6 +427,15 @@
     )
 }

+NORMALIZE_UNICODE = {
+    'name': 'normalize-unicode', 'action': 'store_true',
+    'help_text': (
+        'Normalizes file names read from the local filesystem in unicode '
+        'normal form KC. This is mainly useful when running on macOS.'
+    )
+}
+
+
 TRANSFER_ARGS = [DRYRUN, QUIET, INCLUDE, EXCLUDE, ACL,
                  FOLLOW_SYMLINKS, NO_FOLLOW_SYMLINKS, NO_GUESS_MIME_TYPE,
                  SSE, SSE_C, SSE_C_KEY, SSE_KMS_KEY_ID, SSE_C_COPY_SOURCE,
@@ -435,7 +444,7 @@
                  CONTENT_DISPOSITION, CONTENT_ENCODING, CONTENT_LANGUAGE,
                  EXPIRES, SOURCE_REGION, ONLY_SHOW_ERRORS, NO_PROGRESS,
                  PAGE_SIZE, IGNORE_GLACIER_WARNINGS, FORCE_GLACIER_TRANSFER,
-                 REQUEST_PAYER]
+                 REQUEST_PAYER, NORMALIZE_UNICODE]


 def get_client(session, region, endpoint_url, verify, config=None):
@@ -978,12 +987,14 @@
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
             'result_queue': result_queue,
+            'normalize_unicode': self.parameters['normalize_unicode'],
         }
         rgen_kwargs = {
             'client': self._client, 'operation_name': '',
             'follow_symlinks': self.parameters['follow_symlinks'],
             'page_size': self.parameters['page_size'],
             'result_queue': result_queue,
+            'normalize_unicode': self.parameters['normalize_unicode'],
         }

         fgen_request_parameters = \