-
Notifications
You must be signed in to change notification settings - Fork 562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encoding for non unicode environments #365
Encoding for non unicode environments #365
Conversation
94172db
to
2302b36
Compare
Did that force-push to simplify changes to msodde; added commits to increase version and update changelog |
2302b36
to
e662d78
Compare
Rebased onto current master |
When print()ing unicode, python relies on locale.getpreferredencoding to determine how to represent unicode text. This fails in several cases, e.g. when redirecting output, piping output into other programs or when the shell environment has no locale defined (e.g. in linux with LANG=C). In all these cases, print()ing non-ascii characters raises unicode exceptions. Prevent these errors by encoding output in case of redirection, replacing unhandleded chars in case of unicode-unfriendly shells. This tries to solve issue decalage2#361
This replaces an earlier partial custom solution
This is only an unimportant test that apparently has never been run (had a fatal error)
open() of text-files also depends on locale.getpreferredencoding which is "ascii" (or so) if e.g. LANG=C or if redirecting output in python2. Provide a function uopen() that ensures text-files are always opened such that unicode text can be read properly.
This makes usage of uopen unnecessary.
The xml parser takes the encoding from the file header
Without this I got ASCII encoding on my machine
This way, all modules that use the log_helper do not need to call ensure_stdout_handles_unicode (e.g. msodde, olevba)
log_helper does that for us
log_helper does that for us
log_helper does that for us
e662d78
to
0b3af2d
Compare
Rebased onto master required a few changes. Also realized that log_helper can call the stdout-wrapping-function for us, so removed calls from olevba, msodde and ooxml.test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @christian-intra2net, this is a long overdue PR. In general I see why it is useful to better manage unicode output when the locale is not optimal such as "C". However, the code in io_encoding looks quite complex with lots of if/else clauses. Many things can break depending on the python version and the OS config. Moreover, it looks like Python's behaviour with unicode/UTF-8/encodings may change in the future (for example https://discuss.python.org/t/use-utf-8-as-default-text-file-encoding/1785), so it's really a tricky topic.
How to be sure all the corner cases are covered, and how to test them?
All that to say I will merge this PR, but we need to be careful not to cause exceptions on normal/modern systems that support UTF-8.
And thanks a lot for your hard work! :-)
|
||
In order to read unicode from text, python uses locale.getpreferredencoding | ||
to translate bytes to str. If the environment only provides ASCII encoding, | ||
this will fail since most office files contain unicode. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, ms office files are all binary. In which cases do we need to deal with text files with unicode in oletools?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XML files are text, for example, msodde also works on CSV files, which are text.
The open() code will not do anything if a file is opened in binary mode
I tried to be careful, falling back to builtin behaviour as soon as any encoding is specified. |
This is a solution attempt for issue #361 (when shell environment is unicode-unfriendly, print() may fail with unicode error).
This was meant to be a simple call to a helper function that wraps sys.stdout into a encoder in case stdout can only handle ASCII or Latin1. It turned out that also open() relies on locale.getpreferredencoding which is different when LANG=C or output is redirected or piped. Therefore, I also created function uopen() which helps to correctly handle unicode in files.
Replacing open() with uopen() where appropriate revealed two cases where open() should have used binary mode in the first place.
With this branch, I can now run the unittests in my linux shell with LANG=C without problems. No guarantees on whether this is enough for running in strange windows/mac environments