Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Make applications use UTF-8 code page #902

Merged
merged 1 commit into from Feb 14, 2020

Conversation

@lassoan
Copy link
Member

lassoan commented Feb 10, 2020

This pull request allows storing DICOM database in and importing DICOM files from folders that contains special characters in their names. Tested with CTK DICOM examples and in 3D Slicer.

In general, all std::string or char buffers in CTK are now expected to contain UTF-8 encoded string. This is what was the default on Mac OSX and Linux and it has been made available as an option on Windows as well. This is where VTK and ITK (and most other software libraries) are moving towards, too.

  • use add_executable_utf8 instead of add_executable to make the application use UTF-8 code page on all platforms (already the default on Mac OSX and Linux, but this makes behavior on Windows the same, too)
  • use toUtf8 instead of toLatin1 when converting from QString to std::string or char*, especially when converting file names, since now file IO functions expect filenames in UTF-8
  • use qPrintable to convert QString to string to be printed on console (std::cout, std::cerr): this converts the string to the local encoding (which is often not be UTF-8 and maybe not Latin1)
  • use qUtf8Printable with Qt logging macros (qDebug(), qWarning(), qCritical()) as per Qt documentation
  • fixed ctkDICOMAppWidget (there were minor errors, such as signal mismatches)
  • use UTF-8 encoding when passing strings to VTK (this is where VTK goes toward https://discourse.vtk.org/t/what-is-state-of-the-art-unicode-file-names-on-windows/1821)

DICOM field content encoding/decoding does not use UTF-8 everywhere (always Latin1 encoding is used when creating DICOM queries or when exporting DICOM items to file, while it could make sense to switch to UTF-8 when values cannot be encoded as Latin1; when exporting files, non-Latin1 characters are stripped for better compatibility with less-advanced file systems).

@lassoan lassoan requested review from pieper and jcfr Feb 10, 2020
@pieper

This comment has been minimized.

Copy link
Member

pieper commented Feb 10, 2020

LGTM. Very glad to see this addressed. 👍

@pieper
pieper approved these changes Feb 10, 2020
@lassoan lassoan force-pushed the lassoan:ctk-utf8-string-encoding branch from 31d1816 to 1126096 Feb 10, 2020
Copy link
Member

jcfr left a comment

Thanks 🙏

Few minor tweaks and it will be good for integration.

I will also report back when I hear back from the CMake core team.

#!
#! \ingroup CMakeUtilities

macro(add_executable_utf8)

This comment has been minimized.

Copy link
@jcfr

jcfr Feb 11, 2020

Member

Let's change this to use function(add_executable_utf8)/enfunction()

@@ -1319,6 +1319,8 @@ void ctkDICOMBrowser::exportSeries(QString dirPath, QStringList uids)
destinationDir += sep;

// make sure only ascii characters are in the directory path
// (it is safer to avoid special characters in case the files are
// to be more compatible with exported to external drive or network storage)

This comment has been minimized.

Copy link
@jcfr

jcfr Feb 11, 2020

Member

are -> have
with -> when

    // (it is safer to avoid special characters in case the files have
    // to be more compatible when exported to external drive or network storage)
@lassoan lassoan force-pushed the lassoan:ctk-utf8-string-encoding branch 4 times, most recently from 6d6ca78 to b25af7f Feb 12, 2020
In general, all std::string or char buffers in CTK are now expected to contain UTF-8 encoded string. This is what was the default on Mac OSX and Linux and it has been made available as an option on Windows from Version 1903 (May 2019 Update). This is where VTK and ITK (and most other software libraries) are moving towards, too.

- use add_executable_utf8 instead of add_executable to make the application use UTF-8 code page on all platforms (already the default on Mac OSX and Linux, but this makes behavior on Windows the same, too)
- use toUtf8 instead of toLatin1 when converting from QString to std::string or char*, especially when converting file names, since now file IO functions expect filenames in UTF-8
- use qPrintable to convert QString to string to be printed on console (std::cout, std::cerr): this converts the string to the local encoding (which is often not be UTF-8 and maybe not Latin1)
- use qUtf8Printable with Qt logging macros (qDebug(), qWarning(), qCritical()) as per Qt documentation
- fixed ctkDICOMAppWidget (there were minor errors, such as signal mismatches)
- use UTF-8 encoding when passing strings to VTK (this is where VTK goes toward https://discourse.vtk.org/t/what-is-state-of-the-art-unicode-file-names-on-windows/1821)

DICOM field content encoding/decoding does not use UTF-8 everywhere (always Latin1 encoding is used when creating DICOM queries or when exporting DICOM items to file, while it could make sense to switch to UTF-8 when values cannot be encoded as Latin1; when exporting files, non-Latin1 characters are stripped for better compatibility with less-advanced file systems).
@lassoan lassoan force-pushed the lassoan:ctk-utf8-string-encoding branch from b25af7f to 2cda29e Feb 14, 2020
@lassoan lassoan merged commit daa513f into commontk:master Feb 14, 2020
0 of 2 checks passed
0 of 2 checks passed
ci/circleci: build-qt4 Your tests failed on CircleCI
Details
ci/circleci: build-qt5 Your tests failed on CircleCI
Details
@jcfr

This comment has been minimized.

Copy link
Member

jcfr commented Feb 14, 2020

For reference, here is the reply I got talking with our CMake code dev:

From CMake code dev:

We do not have any builtin way to do that.  I think it makes sense
to do it as part of the application's CMake code.

CMake itself adapts to the active code page of the environment in
which it is run, converts strings to UTF-8 internally, and converts
back on output.  Is this manifest setting some new way to avoid all
that trouble?  What happens when running such a tool at a command
prompt in which `chcp` does not report utf-8?

And the email I sent:

From @jcfr:

On windows, executable needs to be compiled with a manifest. This ensure the executable uses the appropriate code page.

At the build system level, we are adding a function called "add_executable_utf8" that will be using in CTK, Slicer, ...

Before we move forward with integrating the relevant pull requests in CTK and Slicer, I wanted to check if there was already a way to enable UTF8 support in executable added with CMake.

add_executable_utf8
https://github.com/commontk/CTK/pull/902/files#diff-9fa77245970f38e9202992bc68b23649

WindowsApplicationUseUtf8.manifest
https://github.com/commontk/CTK/pull/902/files#diff-7c1eb7c10351f0c01680f24569381945
@lassoan

This comment has been minimized.

Copy link
Member Author

lassoan commented Feb 14, 2020

CMake itself adapts to the active code page of the environment in which it is run, converts strings to UTF-8 internally, and converts back on output. Is this manifest setting some new way to avoid all that trouble?

Yes, all that trouble will soon be gone completely. The only catch is that it only works on recent Windows10 versions. If all the Windows-specific manual conversions have been already implemented then probably it is safer to keep them around for 1-2 years and then they can be removed.

What happens when running such a tool at a command prompt in which chcp does not report utf-8?

The manifest forces the application to be use UTF8 code page, regardless of what was the code page in the environment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.