Potential for encoding corruption, depending on the environment #21

MattBlissett · 2021-01-12T13:12:49Z

Reported in gbif/portal-feedback#3191, but also affecting other datasets.

Mac OS sets the locale environment variable LC_CTYPE=UTF-8, which is not recognized on Linux. Linux would use en_US.UTF-8 or similar, or leave it unset and use LANG.

When Java starts up on Linux with the Mac OS value LC_CTYPE=UTF-8, Charset.defaultCharset() returns US-ASCII. This causes problems wherever the default character set is used: System.out, I/O streams without a specified character set, convenience classes like FileReader and FileWriter, etc.
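To check what a given environment resolves to, something like the following can be run on the affected host. It is only a diagnostic sketch: the class and method names are made up for illustration, and it assumes the usual POSIX precedence for the character-type locale (LC_ALL, then LC_CTYPE, then LANG). Note that JDK 18 and later default to UTF-8 regardless of locale, so the problem shows up on older JDKs.

```java
import java.nio.charset.Charset;
import java.util.Map;

public class LocaleCharsetReport {

    /**
     * Returns the variable that decides the character-type locale,
     * following POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
     */
    static String effectiveCtype(Map<String, String> env) {
        for (String key : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            String value = env.get(key);
            if (value != null && !value.isEmpty()) {
                return key + "=" + value;
            }
        }
        return "(unset)";
    }

    public static void main(String[] args) {
        // What the shell handed to the JVM
        System.out.println("Effective setting: " + effectiveCtype(System.getenv()));
        // What the JVM made of it
        System.out.println("Default charset:   " + Charset.defaultCharset().name());
    }
}
```

Running it once with `LC_CTYPE=UTF-8` and once with `LC_CTYPE=en_US.UTF-8` on the same Linux host should make the difference visible.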

In the case above, a FileWriter is used to output sorted DWCA data. With the mixed environment variables, the file is written as ASCII, which corrupts any non-ASCII characters in the data.
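The effect can be reproduced without touching the environment by binding a writer to US-ASCII explicitly, which is what FileWriter ends up doing when the default character set is ASCII. This is a minimal sketch, not the gbif-common code; `ExplicitCharsetDemo` and `roundTrip` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    /**
     * Writes text through a Writer bound to writerCharset, then reads the
     * resulting bytes back as UTF-8, as a downstream consumer would.
     */
    static String roundTrip(String text, Charset writerCharset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(bytes, writerCharset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Sa\u00efd"; // "Saïd", escaped so the source file stays ASCII

        // An ASCII-bound writer silently substitutes '?' for the unmappable character
        System.out.println(roundTrip(name, StandardCharsets.US_ASCII)); // Sa?d

        // A writer with an explicit UTF-8 charset preserves the text
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name)); // true
    }
}
```

When writing to a file, the same idea applies: `Files.newBufferedWriter(path, StandardCharsets.UTF_8)` or, on Java 11+, `new FileWriter(file, StandardCharsets.UTF_8)` removes the dependency on the environment.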

In other words, gbif-common assumes a correctly configured UTF-8 environment.


MattBlissett · 2021-01-13T13:32:31Z

The commit improves the code (removing an encoding assumption), and logs a warning if FileUtils is used where the default character set is ASCII.
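For readers who want a similar guard in their own code, a warning check along these lines is one option. This is a sketch of the general idea, not the contents of the commit: the class name, method name, and message are invented, and it uses java.util.logging only to stay dependency-free.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

public class CharsetWarning {

    private static final Logger LOG = Logger.getLogger(CharsetWarning.class.getName());

    /**
     * Logs a warning and returns true when the given charset is US-ASCII,
     * which usually indicates a misconfigured locale rather than intent.
     */
    static boolean warnIfAscii(Charset charset) {
        if (StandardCharsets.US_ASCII.equals(charset)) {
            LOG.warning("Default character set is " + charset.name()
                + "; non-ASCII data may be corrupted. Check LC_ALL, LC_CTYPE and LANG.");
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Typically called once, e.g. from a static initializer of the I/O utility class
        warnIfAscii(Charset.defaultCharset());
    }
}
```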

I've also configured the servers to stop accepting locale environment variables from clients connecting over SSH.

MattBlissett added a commit that referenced this issue Jan 12, 2021

#21: Warn if the environment is not UTF-8, and fix the most obvious p…

628585f

…roblem.

MattBlissett closed this as completed Jan 13, 2021
