New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON API: Encoding? #2273

Open
enkore opened this Issue Mar 8, 2017 · 11 comments

Comments

4 participants
@enkore
Copy link
Contributor

enkore commented Mar 8, 2017

From #2249

Or we fix IO encoding to UTF-8 (irrespective of locale) in JSON mode, which probably makes more sense and is less error prone for downstream developers anyway.

I.e., --log-json → stderr text is always UTF-8, stdin text (so not "borg create -") is always UTF-8 (prompts, passwords), env-vars are read as UTF-8 (dito)

--json → stdout text is always UTF-8

@enkore enkore added the c: json api label Apr 1, 2017

@enkore enkore added this to the 1.1 - near future goals milestone Aug 6, 2017

@enkore

This comment has been minimized.

Copy link
Contributor Author

enkore commented Aug 6, 2017

Attached to 1.1 milestone.

@enkore enkore added this to Borg 1.1 release pipe in Columns-of-issues Aug 7, 2017

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.0rc2, 1.1 - near future goals Aug 15, 2017

@ThomasWaldmann

This comment has been minimized.

Copy link
Member

ThomasWaldmann commented Aug 15, 2017

As this is a quite fundamental change, guess it should get some testing, in rc2.

Guess we won't have problems if the json is written to a file, not so sure about when it gets output on the console. Or piped and processed badly by other side.

@enkore

This comment has been minimized.

Copy link
Contributor Author

enkore commented Aug 16, 2017

Guess we won't have problems if the json is written to a file

When you write text (strings) to stdout/stderr, then they are encoded to bytes using an encoding guessed by Python. That's independent of whether stdout is connected to a TTY/terminal or redirected to a file.

@ThomasWaldmann

This comment has been minimized.

Copy link
Member

ThomasWaldmann commented Aug 16, 2017

Yes, but you suggested to always use utf-8.

So how does e.g. a cygwin console or latin1/ascii console react when you output utf-8 on it?
Guess we could live with funny characters, but it shouldn't crash or hang.

Or when doing borg --json | othertool (and othertool guesses encoding), it might guess wrong when utf-8 is not the native system / fs encoding? If othertool is specialized on borg, it would use the right encoding, but if not, could it be told to use utf-8?

@enkore enkore self-assigned this Aug 18, 2017

@enkore

This comment has been minimized.

Copy link
Contributor Author

enkore commented Aug 18, 2017

Depends on #2925;

Note: borg list already uses UTF-8 regardless of system preference (via safe_encode), but only for listing archive contents.

Yes, but you suggested to always use utf-8.

The alternative is to make step one of using --json: "Replicate the way Python guesses encodings [which changes over Python releases]." i.e. "Use Python.". That's not acceptable.

@RonnyPfannschmidt

This comment has been minimized.

Copy link
Contributor

RonnyPfannschmidt commented Aug 18, 2017

well - a completely ascii-save way could be to do unicode-escape then all unicode is escaped as \u...

@knutov

This comment has been minimized.

Copy link

knutov commented Aug 25, 2017

just as an idea - why not to skip support for latin1 and other non-unicode terminals now?

latin1 simbols inside utf8 will look the same on latin1 terminal I suppose. Other simbols will not be readable anyway, there is no good solution for this. And there is iconv for those who need 8bit encodings and knows what he is doing.

@ThomasWaldmann

This comment has been minimized.

Copy link
Member

ThomasWaldmann commented Aug 27, 2017

@enkore still working / do you still want to work on this?

@enkore

This comment has been minimized.

Copy link
Contributor Author

enkore commented Aug 27, 2017

I prepared an initial, functionally incomplete patch I was completely dissatisfied with. I've been working to fix this for good by replacing most of these interactions (os.environ, input/yes, get_passphrase, ...) to use a iosys-class that determines encoding (from Python) and decodes stuff. But this is still incomplete and touches many of the more annoying parts of the code, so it may be reasonable to just go forward with rc2 and perhaps even 1.1.0 without having this resolved yet — on most (Linux/BSD) systems it will "mostly just work", because UTF-8 is a very widespread locale codeset and typically assumed. (OpenBSD has an especially good grip on things here for a Unix, because they only support UTF-8 and 7-bit ASCII). In this case it may be best to add a short note in the docs to say that encoding will be finalized to UTF-8 later.

This will fall apart on Linux when no locale is configured (because Python will fallback to 7-bit ASCII), or glibc things no locale is configured, or considers the configuration invalid (e.g. partial or missing locale files). And of course every locale that is not UTF-8.

@ThomasWaldmann

This comment has been minimized.

Copy link
Member

ThomasWaldmann commented Aug 27, 2017

OK, so let's have some docs now and the fix for 1.1.1 or so.

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1 - near future goals, 1.1.0rc2 Aug 27, 2017

@enkore enkore added the bug label Sep 7, 2017

enkore added a commit that referenced this issue Sep 8, 2017

ThomasWaldmann added a commit to ThomasWaldmann/borg that referenced this issue Sep 8, 2017

ThomasWaldmann added a commit that referenced this issue Sep 9, 2017

Merge pull request #3019 from ThomasWaldmann/json-utf8-locale-1.1
document utf-8 locale requirement for json mode, #2273 (#3009)

@enkore enkore modified the milestones: 1.1.x, 1.1.0 release Sep 9, 2017

@ThomasWaldmann

This comment has been minimized.

Copy link
Member

ThomasWaldmann commented Sep 18, 2017

From https://docs.python.org/3.4/library/sys.html#sys.stdin / sys.stdout / sys.stderr:

The character encoding is platform-dependent.
Under Windows, if the stream is interactive (that is, if its isatty() method returns
True), the console codepage is used, otherwise the ANSI code page.
Under other platforms, the locale encoding is used (see locale.getpreferredencoding()).

Under all platforms though, you can override this value by setting the
PYTHONIOENCODING environment variable before starting Python.

@enkore enkore removed their assignment Oct 14, 2017

@ThomasWaldmann ThomasWaldmann removed this from the 1.1.1rc1 milestone Oct 14, 2017

@ThomasWaldmann ThomasWaldmann added this to the 1.1.2rc1 milestone Oct 14, 2017

@ThomasWaldmann ThomasWaldmann modified the milestones: 1.1.2rc1, 1.1.x Nov 4, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment