
Encoding for non unicode environments #365

Conversation

christian-intra2net
Contributor

This is a solution attempt for issue #361 (when the shell environment is unicode-unfriendly, print() may fail with a unicode error).

This was meant to be a simple call to a helper function that wraps sys.stdout in an encoder in case stdout can only handle ASCII or Latin-1. It turned out that open() also relies on locale.getpreferredencoding, which returns a different encoding when LANG=C or when output is redirected or piped. Therefore, I also created a function uopen() that helps handle unicode in files correctly.

Replacing open() with uopen() where appropriate revealed two cases where open() should have used binary mode in the first place.

With this branch, I can now run the unit tests in my Linux shell with LANG=C without problems. No guarantees on whether this is enough for running in unusual Windows/Mac environments.
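The stdout-wrapping helper described above could look roughly like this. This is a minimal sketch; the function name and the encoding heuristics are illustrative assumptions, not the exact oletools implementation:

```python
import codecs


def wrap_stream_for_unicode(stream):
    """Return a writer that will not raise on non-ASCII output.

    If the stream's encoding is missing or limited to ASCII/Latin-1
    (e.g. LANG=C, or output redirected/piped), wrap its underlying
    byte stream in a UTF-8 writer that replaces any unencodable
    characters. Otherwise the stream is returned unchanged.
    """
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    if encoding.lower().replace('-', '') in ('ascii', 'usascii',
                                             'latin1', 'iso88591'):
        raw = getattr(stream, 'buffer', stream)  # text wrapper -> bytes
        return codecs.getwriter('utf-8')(raw, errors='replace')
    return stream
```

A module would then run `sys.stdout = wrap_stream_for_unicode(sys.stdout)` once at startup; the log_helper change discussed later in the thread does this in one place for all modules.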

@decalage2 decalage2 self-requested a review November 5, 2018 10:11
@decalage2 decalage2 added this to the oletools 0.54 milestone Nov 5, 2018
@christian-intra2net christian-intra2net force-pushed the encoding-for-non-unicode-environments branch 2 times, most recently from 94172db to 2302b36 Compare December 6, 2018 14:42
@christian-intra2net
Contributor Author

christian-intra2net commented Dec 6, 2018

Did that force-push to simplify the changes to msodde; added commits to increase the version and update the changelog.

@christian-intra2net christian-intra2net force-pushed the encoding-for-non-unicode-environments branch from 2302b36 to e662d78 Compare January 4, 2019 15:35
@christian-intra2net
Contributor Author

Rebased onto current master

When print()ing unicode, Python relies on locale.getpreferredencoding to determine how to represent unicode text. This fails in several cases, e.g. when redirecting output, piping output into other programs, or when the shell environment has no locale defined (e.g. on Linux with LANG=C). In all these cases, print()ing non-ASCII characters raises unicode exceptions.

Prevent these errors by encoding output in case of redirection, replacing unhandled chars in case of unicode-unfriendly shells.

- This tries to solve issue decalage2#361. This replaces an earlier partial custom solution.
- This is only an unimportant test that apparently has never been run (it had a fatal error).
- open() of text files also depends on locale.getpreferredencoding, which is "ascii" (or similar) if e.g. LANG=C or if output is redirected in Python 2. Provide a function uopen() that ensures text files are always opened such that unicode text can be read properly.
- This makes usage of uopen unnecessary.
- The XML parser takes the encoding from the file header.
- Without this I got ASCII encoding on my machine.
- This way, all modules that use the log_helper do not need to call ensure_stdout_handles_unicode (e.g. msodde, olevba).
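The uopen() helper mentioned in these commits might be sketched like this (an illustrative assumption about its shape; the real oletools implementation may differ in details). `io.open` is the Python-2/3-compatible spelling of the builtin open():

```python
import io


def uopen(filename, mode='r', **kwargs):
    """open() a text file with an explicit UTF-8 default encoding.

    Binary modes and calls that already pass an explicit 'encoding'
    fall through to the builtin behaviour unchanged, so this only
    overrides locale.getpreferredencoding() for plain text access.
    """
    if 'b' in mode or 'encoding' in kwargs:
        return io.open(filename, mode, **kwargs)
    return io.open(filename, mode, encoding='utf-8', **kwargs)
```

With this, reading a UTF-8 text file works the same whether LANG is a UTF-8 locale, "C", or unset, while binary reads are untouched.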
@christian-intra2net christian-intra2net force-pushed the encoding-for-non-unicode-environments branch from e662d78 to 0b3af2d Compare July 16, 2019 08:35
@christian-intra2net
Contributor Author

Rebasing onto master required a few changes. I also realized that log_helper can call the stdout-wrapping function for us, so I removed those calls from olevba, msodde and ooxml.test.

Owner

@decalage2 decalage2 left a comment


Hi @christian-intra2net, this is a long overdue PR. In general I see why it is useful to better manage unicode output when the locale is not optimal such as "C". However, the code in io_encoding looks quite complex with lots of if/else clauses. Many things can break depending on the python version and the OS config. Moreover, it looks like Python's behaviour with unicode/UTF-8/encodings may change in the future (for example https://discuss.python.org/t/use-utf-8-as-default-text-file-encoding/1785), so it's really a tricky topic.
How to be sure all the corner cases are covered, and how to test them?
All that to say, I will merge this PR, but we need to be careful not to cause exceptions on normal/modern systems that support UTF-8.
And thanks a lot for your hard work! :-)


In order to read unicode from text, python uses locale.getpreferredencoding
to translate bytes to str. If the environment only provides ASCII encoding,
this will fail since most office files contain unicode.
Owner


AFAIK, MS Office files are all binary. In which cases do we need to deal with text files with unicode in oletools?

Contributor Author


XML files are text, for example; msodde also works on CSV files, which are text too.
The open() code will not do anything if a file is opened in binary mode.
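The XML case shows why binary mode is often the right answer here: the parser reads the encoding from the XML declaration itself, so no uopen()-style wrapper is needed for it. A minimal sketch of that behaviour:

```python
import xml.etree.ElementTree as ET

# An XML document that declares its own (non-UTF-8) encoding:
data = (u'<?xml version="1.0" encoding="iso-8859-1"?>'
        u'<doc title="caf\xe9"/>').encode('iso-8859-1')

# Feeding raw bytes lets the parser honour the declared encoding;
# opening the file in text mode with a locale-guessed encoding would
# risk mis-decoding the bytes before the parser ever sees them.
root = ET.fromstring(data)
```

This is why replacing open() with uopen() in the XML-reading code was unnecessary, as noted in the commit messages above.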

@decalage2 decalage2 merged commit 2f7a1ef into decalage2:master Oct 10, 2019
@christian-intra2net
Contributor Author

I tried to be careful, falling back to the builtin behaviour as soon as any encoding is specified.
Interesting link; good to see the Python pros are aware of the problem and are trying to do something about it.
