Encoding for non unicode environments #365

christian-intra2net · 2018-11-05T09:43:42Z

This is an attempt to solve issue #361: when the shell environment is unicode-unfriendly, print() may fail with a unicode error.

This was meant to be a simple call to a helper function that wraps sys.stdout in an encoder in case stdout can only handle ASCII or Latin1. It turned out that open() also relies on locale.getpreferredencoding, which returns a different value when LANG=C or when output is redirected or piped. Therefore, I also created a function uopen(), which helps to handle unicode in files correctly.

Replacing open() with uopen() where appropriate revealed two cases where open() should have used binary mode in the first place.

With this branch, I can now run the unit tests in my Linux shell with LANG=C without problems. No guarantees on whether this is enough for running in strange Windows/Mac environments.
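To check which encodings Python picks up in a given environment (for example under LANG=C, or with output piped), a small diagnostic like the following can help. This is only a sketch: `describe_io_encodings` is a hypothetical name and is not part of oletools.

```python
import locale
import sys


def describe_io_encodings():
    """Collect the encodings Python would currently use for text I/O."""
    return {
        # Encoding that open() normally falls back to when no encoding= is given
        "preferred": locale.getpreferredencoding(False),
        # Encoding used by print(); may be None if stdout was replaced
        "stdout": getattr(sys.stdout, "encoding", None),
        # False when output is piped or redirected
        "stdout_is_tty": bool(getattr(sys.stdout, "isatty", lambda: False)()),
    }


if __name__ == "__main__":
    for key, value in sorted(describe_io_encodings().items()):
        print("{}: {}".format(key, value))
```

Running this once normally, once with LANG=C, and once piped into another program shows how the values differ. Note that newer Python 3 releases may report UTF-8 even under LANG=C, so results depend on the interpreter version.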

christian-intra2net · 2018-12-06T14:54:41Z

Force-pushed to simplify the changes to msodde; added commits to bump the version and update the changelog.

christian-intra2net · 2019-01-04T15:36:36Z

Rebased onto current master

When print()ing unicode, Python relies on locale.getpreferredencoding to determine how to represent unicode text. This fails in several cases, e.g. when redirecting output, piping output into other programs, or when the shell environment has no locale defined (e.g. on Linux with LANG=C). In all these cases, print()ing non-ASCII characters raises unicode exceptions. Prevent these errors by encoding output in case of redirection and replacing unhandled chars in case of unicode-unfriendly shells. This tries to solve issue decalage2#361.
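The idea of replacing unencodable characters instead of raising can be sketched as follows. This is not the code from this PR: `can_encode` and `make_stream_unicode_safe` are hypothetical names, and the sketch assumes Python 3.7+, where `io.TextIOWrapper.reconfigure()` is available.

```python
import io


def can_encode(stream, sample=u"\u2713"):
    """Return True if the stream's declared encoding can represent `sample`."""
    encoding = getattr(stream, "encoding", None) or "ascii"
    try:
        sample.encode(encoding)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


def make_stream_unicode_safe(stream):
    """Make writes of non-ASCII text degrade to '?' instead of raising."""
    if can_encode(stream):
        # Unicode-friendly stream: leave it untouched
        return stream
    if hasattr(stream, "reconfigure"):
        # Keep the encoding, but replace characters it cannot represent
        stream.reconfigure(errors="replace")
    return stream
```

A script would call `make_stream_unicode_safe(sys.stdout)` once at startup. On a UTF-8 capable terminal this is a no-op, which matches the goal of not changing behaviour on modern systems.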

open() of text files also depends on locale.getpreferredencoding, which is "ascii" (or similar) if e.g. LANG=C or when redirecting output in Python 2. Provide a function uopen() that ensures text files are always opened such that unicode text can be read properly.
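A minimal sketch of that idea, assuming UTF-8 is an acceptable default for text files: `open_text` is a hypothetical stand-in and not the actual uopen() from this branch. It defers to the builtin behaviour as soon as the caller asks for binary mode or names an encoding.

```python
import io


def open_text(path, mode="r", encoding=None, **kwargs):
    """Open a file so that text is decoded independently of the locale."""
    if "b" in mode or encoding is not None:
        # Binary mode or explicit encoding: behave exactly like the builtin
        return io.open(path, mode, encoding=encoding, **kwargs)
    # Assumption: UTF-8 instead of locale.getpreferredencoding()
    return io.open(path, mode, encoding="utf-8", **kwargs)
```

With this, `open_text(path).read()` yields the same text under LANG=C as under a UTF-8 locale, while `open_text(path, "rb")` still returns raw bytes.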

christian-intra2net · 2019-07-16T08:37:08Z

Rebasing onto master required a few changes. Also realized that log_helper can call the stdout-wrapping function for us, so I removed the calls from olevba, msodde and ooxml.test.
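Centralising the stream fix in the logging setup could look roughly like this. It is a sketch only: `make_logger` is a hypothetical function, not the real log_helper API, and it assumes Python 3.7+ for `reconfigure()`.

```python
import io
import logging


def make_logger(name, stream):
    """Create a logger writing to `stream`, tolerating non-ASCII messages."""
    encoding = getattr(stream, "encoding", None) or "ascii"
    try:
        u"\u00e9".encode(encoding)
    except (UnicodeEncodeError, LookupError):
        if hasattr(stream, "reconfigure"):
            # Replace unencodable characters rather than failing on write
            stream.reconfigure(errors="replace")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Because every module obtains its logger through one setup function, the individual scripts no longer need their own call to fix stdout.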

decalage2

Hi @christian-intra2net, this is a long overdue PR. In general I see why it is useful to better manage unicode output when the locale is not optimal such as "C". However, the code in io_encoding looks quite complex with lots of if/else clauses. Many things can break depending on the python version and the OS config. Moreover, it looks like Python's behaviour with unicode/UTF-8/encodings may change in the future (for example https://discuss.python.org/t/use-utf-8-as-default-text-file-encoding/1785), so it's really a tricky topic.
How can we be sure all the corner cases are covered, and how do we test them?
All that to say I will merge this PR, but we need to be careful not to cause exceptions on normal/modern systems that support UTF-8.
And thanks a lot for your hard work! :-)

decalage2 · 2019-10-10T06:51:34Z

oletools/common/io_encoding.py

+
+    In order to read unicode from text, python uses locale.getpreferredencoding
+    to translate bytes to str. If the environment only provides ASCII encoding,
+    this will fail since most office files contain unicode.


AFAIK, MS Office files are all binary. In which cases do we need to deal with text files with unicode in oletools?

XML files are text, for example. msodde also works on CSV files, which are text.
The open() code will not do anything if a file is opened in binary mode.
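For XML specifically, one of the commits in this PR notes that the XML parser takes the encoding from the file header. A small sketch of why binary mode is enough in that case, using only the standard library (`parse_xml_file` and `demo` are hypothetical names for illustration):

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_xml_file(path):
    """Parse XML from a binary file so the parser decodes the bytes itself."""
    with open(path, "rb") as handle:
        return ET.parse(handle).getroot()


def demo():
    """Write a Latin-1 encoded XML file and read it back, ignoring the locale."""
    text = u'<?xml version="1.0" encoding="ISO-8859-1"?><word>caf\u00e9</word>'
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(text.encode("latin-1"))
        return parse_xml_file(path).text
    finally:
        os.remove(path)
```

Here the locale never enters the picture: the parser reads the encoding declaration and decodes the bytes accordingly.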

christian-intra2net · 2019-10-14T08:31:05Z

I tried to be careful, falling back to builtin behaviour as soon as any encoding is specified.
Interesting link, good to see the python pros are aware of the problem and try to do something about it.

decalage2 self-requested a review November 5, 2018 10:11

decalage2 added the 👍 enhancement label Nov 5, 2018

decalage2 added this to the oletools 0.54 milestone Nov 5, 2018

christian-intra2net force-pushed the encoding-for-non-unicode-environments branch 2 times, most recently from 94172db to 2302b36 Compare December 6, 2018 14:42

christian-intra2net force-pushed the encoding-for-non-unicode-environments branch from 2302b36 to e662d78 Compare January 4, 2019 15:35

decalage2 modified the milestones: oletools 0.54, oletools 0.55 Mar 15, 2019

christian-intra2net mentioned this pull request Jul 11, 2019

msodde UnicodeEncodeError Fix #267 #465

Closed

christian-intra2net added 19 commits July 16, 2019 09:21

tests: create unittests for unicode checker

9d5c9d3

msodde: Replace custom unicode checker with global one

15469ea

olevba[3]: ensure stdout can handle unicode

0cdbb2d

olemeta: ensure stdout can handle unicode

a972bb3

This replaces an earlier partial custom solution

oleobj: ensure stdout can handle unicode

6de903e

ooxml: ensure stdout can handle unicode

0798cd1

This is only an unimportant test that apparently has never been run (had a fatal error)

tests: test common.uopen

e9d29e0

msodde: open CSV files with correct mode & newlines

6d20641

This makes usage of uopen unnecessary.

ooxml: Ensure unicode can be read from text files

ffa7ec2

msodde: minor fixes

5d37234

ooxml: open files in binary mode

d796314

The xml parser takes the encoding from the file header

common: Risk calling setlocale for getting correct encoding

701e692

Without this I got ASCII encoding on my machine

common: make uopen behave like open() wrt. mode

48c1f3a

common: use encoding-related func to own module

d312787

various: adjust import of io_encoding functions

ae5ff5e

io_encoding: warn when modifying encoding

0e3efec

ooxml: Create __version__, add license and start changelog

5bf585f

christian-intra2net added 5 commits July 16, 2019 10:24

log_helper: ensure stdout handles unicode if logging there

48637fb

This way, all modules that use the log_helper do not need to call ensure_stdout_handles_unicode (e.g. msodde, olevba)

msodde: Remove unnecessary ensure_stdout_handles_unicode

ce3dd53

log_helper does that for us

olevba: Remove unnecessary ensure_stdout_handles_unicode

b59e636

log_helper does that for us

ooxml: Remove unnecessary ensure_stdout_handles_unicode

94f4566

log_helper does that for us

tests: handle unicode output in test bypassing main()

0b3af2d

christian-intra2net force-pushed the encoding-for-non-unicode-environments branch from e662d78 to 0b3af2d Compare July 16, 2019 08:35

decalage2 approved these changes Oct 10, 2019

decalage2 merged commit 2f7a1ef into decalage2:master Oct 10, 2019

decalage2 mentioned this pull request Oct 10, 2019

Avoid unicode errors in non-unicode environment #361

Closed

This was referenced Oct 14, 2019

Use log helper #449

Merged

Decrypt in oleobj #464

Closed

Pcode options + fixes #479

Merged

christian-intra2net mentioned this pull request Nov 14, 2019

Multiple tests failing on UnicodeEncodeError: 'ascii' codec can't encode character - RHEL7 + python3.6 #505

Open

christian-intra2net deleted the encoding-for-non-unicode-environments branch October 22, 2020 10:24
