Consolidate string encoding handling across C API #1040

@giampaolo

UPDATE: final situation

This issue has been fixed in PR #1052. Starting from version 5.3.0 psutil will fully support unicode. The notes below apply to any API returning a string such as process exe() or cwd() including non-filesystem related APIs such as process username() or WindowsService.description(). This is what users will get with psutil 5.3.0:

  • all strings are encoded by using the OS filesystem encoding which varies depending on the platform (e.g. UTF-8 on Linux, mbcs on Win)
  • unicode is now correctly supported on Windows (no corrupted data is returned) by using specific unicode Windows APIs
  • no API call is supposed to crash with UnicodeDecodeError
  • instead, in case of badly encoded data returned by the OS, the following error handlers are used to replace the bad characters in the string:
    • Python 3: sys.getfilesystemencodeerrors() (PY 3.6+) or "surrogateescape" on POSIX and "replace" on Windows
    • Python 2: "replace"
  • on Python 2 all APIs return bytes (str type), never unicode
  • on Python 2 you can go back to unicode and do funky string comparisons by doing:
    unicode(p.exe(), sys.getdefaultencoding(), errors="replace")
    Example that filters processes by a funky name and works with both Python 2 and 3:
# -*- coding: utf-8 -*-
import sys

import psutil

PY3 = sys.version_info[0] == 3
LOOKFOR = u"ƒőő"
for proc in psutil.process_iter(attrs=['name']):
    name = proc.info['name']
    if not PY3:
        # On Python 2 psutil returns bytes (str); convert to unicode to compare.
        name = unicode(name, sys.getdefaultencoding(), errors="replace")
    if LOOKFOR == name:
        print("process %s found" % proc)
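
The error-handler selection listed above can be sketched in pure Python. This is only an illustration of the rules as stated in this issue, not psutil's actual C implementation; `fs_error_handler` and `decode_fs` are hypothetical helper names and the sample bytes are made up.

```python
import os
import sys


def fs_error_handler():
    """Pick the error handler following the rules listed above (hypothetical helper)."""
    if sys.version_info[0] == 2:
        return "replace"
    if hasattr(sys, "getfilesystemencodeerrors"):  # Python 3.6+
        return sys.getfilesystemencodeerrors()
    return "surrogateescape" if os.name == "posix" else "replace"


def decode_fs(raw):
    """Decode bytes from the OS with the filesystem encoding and the handler above."""
    # On Python 2 getfilesystemencoding() may return None; fall back to UTF-8.
    encoding = sys.getfilesystemencoding() or "utf-8"
    return raw.decode(encoding, fs_error_handler())


print(fs_error_handler())
print(repr(decode_fs(b"plain-ascii")))
if os.name == "posix":
    # A stray 0xff byte is escaped (Python 3) or replaced (Python 2), not raised.
    print(repr(decode_fs(b"bad\xff")))
```

The point of routing every string through one function like this is that callers only ever have to reason about a single encoding and a single failure mode.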

Original issue

(NOTE: this content is updated as I go)

So, psutil has different APIs returning a string, many of which misbehave when it comes to unicode.

  • A: may raise a decoding error on Python 3 in case of a non-ASCII string
  • B: returns unicode on Python 2 instead of str
  • C: returns incorrect / invalid encoded data in case of a non-ASCII string
API Linux Win OSX FreeBSD NetBSD OpenBSD SunOS
Process.cmdline()
Process.connections() A A A
Process.cwd()
Process.environ() B
Process.exe()
Process.memory_maps() B,C A A A
Process.name()
Process.open_files() A
Process.username()
disk_io_counters()
disk_partitions() A A A A A A
disk_usage(str)
net_connections() A A
net_if_addrs()
net_if_stats() B
net_io_counters()
sensors_fans()
sensors_temperatures()
users() A A A A A A
WinService.binpath() B
WinService.description() B,C
WinService.display_name() B,C
WinService.name()
WinService.status()
WinService.username() B

Right now there are 3 distinct problems with it.

Filesystem or locale encoding?

The first problem is that the C extension currently uses 2 approaches when it comes to decoding and returning a string:

  • PyUnicode_DecodeFSDefault
    • PyUnicode_Decode(Py_FileSystemDefaultEncoding, "replace") on Python 2 (kinda equivalent)
  • PyUnicode_DecodeLocale (Python 3 only)

Most of the time we use PyUnicode_DecodeFSDefault, but not always. The first issue, then, is to figure out which APIs should use one or the other. It seems clear that PyUnicode_DecodeFSDefault should be used for all fs-related APIs such as process exe(), open_files() etc. It is less clear when to use PyUnicode_DecodeLocale. To my understanding, maybe we should use it for things such as:

  • WindowsService.description()
  • WindowsService.display_name()

...and maybe (but less likely) for:

  • Process.username()
  • users()

UPDATE: decided it's better for the user to deal with one encoding only (filesystem) and not have to think about which API they are using.
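
From Python 3, the two candidate encodings discussed above can be compared directly. This is a rough sketch: `os.fsdecode()` is used as the Python-level counterpart of `PyUnicode_DecodeFSDefault`, while `locale.getpreferredencoding(False)` is only an approximation of the encoding `PyUnicode_DecodeLocale` would use. The path bytes are made up, and `errors="replace"` on the locale route is an arbitrary choice for the example.

```python
import locale
import os
import sys

raw = b"/tmp/f\xc5\x91\xc5\x91"  # made-up path bytes, UTF-8 encoded

fs_encoding = sys.getfilesystemencoding()
locale_encoding = locale.getpreferredencoding(False)
print("filesystem encoding:", fs_encoding)
print("locale encoding:    ", locale_encoding)

# Filesystem route: the single encoding this issue settled on.
via_fs = os.fsdecode(raw)

# Locale route: may differ from the above when the two encodings differ.
via_locale = raw.decode(locale_encoding, errors="replace")

print(repr(via_fs))
print(repr(via_locale))
```

On many Linux systems both encodings are UTF-8 and the two results are identical; the difference only shows up when the locale and the filesystem encoding disagree.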

Error handling

Second question is what to do in case the string cannot be correctly decoded.

About FS APIs

Right now we tend to use "surrogateescape", which is also the default for PyUnicode_DecodeFSDefault on Python 3, so I'm pretty sure for fs-related paths we should do this every time we have the chance (on Python 3 at least).

Note: on Windows the default is "surrogatepass" (py 3.6) or "replace" as per PEP-529.

It must be noted that, AFAIK, on Python 2 the os module has no fs-APIs returning a string (e.g. os.listdir()) which may crash with UnicodeDecodeError, so we should do the same and use "replace". There are already some tests for this (see test_unicode.py).

About other APIs

Shall we use "strict" (and raise exception) or "surrogateescape"? Not sure.
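
To make the trade-off concrete, here is a small Python 3 illustration of how the handlers mentioned in this section treat an undecodable byte. The byte string is invented for the example and UTF-8 is assumed as the encoding.

```python
raw = b"f\xc5\x91\xc5\x91-\xff"  # valid UTF-8 followed by a stray 0xff byte

escaped = raw.decode("utf-8", "surrogateescape")
replaced = raw.decode("utf-8", "replace")

# "surrogateescape" is lossless: the original bytes can be recovered.
assert escaped.encode("utf-8", "surrogateescape") == raw

# "replace" is lossy: 0xff becomes U+FFFD and the original byte is gone.
assert replaced.endswith("\ufffd")
assert replaced.encode("utf-8") != raw

# "strict" is the only one of the three that raises.
try:
    raw.decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    print("strict failed:", exc.reason)
```

In short, "surrogateescape" keeps enough information to hand the string back to the OS unchanged, "replace" trades that for a string that is always printable, and "strict" pushes the problem onto the caller.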

Python 2 vs. 3

And here come the troubles. Whereas it is fairly clear what to do on Python 3, Python 2 is different. To correctly handle and represent all kinds of strings on Python 2 we should return... well, unicode instead of str, but I don't want to do that, nor do I want APIs which return two different types depending on the circumstance. Since unicode support is already broken in Python 2 and its stdlib (see bpo-18695) I'm happy to always return str, use the "replace" error handler and consider unicode support in psutil + Python 2 broken (EDIT: it turns out it's not, as you can retrieve the correct string by doing unicode(proc.exe(), sys.getdefaultencoding(), errors="replace")).

There's still the question about when to use PyUnicode_DecodeFSDefault and (a variant of) PyUnicode_DecodeLocale but on Python 2 this is less important as unicode handling is broken anyway.

Summary / TODO

Python 3

  • figure out whether / when to use PyUnicode_DecodeFSDefault and PyUnicode_DecodeLocale
  • figure out the default error handler for PyUnicode_DecodeLocale (if used)
  • discover APIs which do not use PyUnicode_DecodeFSDefault and as such may crash

Python 2

  • same as above but never fail with UnicodeDecodeError in case of PyUnicode_DecodeLocale
  • never return unicode, always str and have tests for it
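
A test for the "always str" rule could look roughly like the sketch below. It is not taken from psutil's test suite: `iter_strings` and `assert_native_str` are hypothetical helpers, and the sample data merely stands in for real return values such as those of `disk_partitions()` or `Process.environ()`.

```python
import collections
import sys

PY2 = sys.version_info[0] == 2
text_type = unicode if PY2 else str  # noqa: F821 (name exists on Python 2 only)


def iter_strings(value):
    """Yield every string found in a possibly nested return value."""
    if isinstance(value, (bytes, text_type)):
        yield value
    elif isinstance(value, dict):
        for key, val in value.items():
            for s in iter_strings(key):
                yield s
            for s in iter_strings(val):
                yield s
    elif isinstance(value, (list, tuple)):  # also covers namedtuples
        for item in value:
            for s in iter_strings(item):
                yield s


def assert_native_str(value):
    """Fail if any string inside value is not the native str type."""
    for s in iter_strings(value):
        assert type(s) is str, "%r is %s, not str" % (s, type(s).__name__)


# Stand-in data; a real test would pass the result of a psutil call instead.
Mount = collections.namedtuple("Mount", ["device", "mountpoint"])
assert_native_str([Mount("/dev/sda1", "/"), {"HOME": "/root"}])
```

Checking `type(s) is str` rather than `isinstance` means the same assertion rejects unicode on Python 2 and bytes on Python 3.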
