Potential for encoding corruption, depending on the environment #21

Closed
MattBlissett opened this issue Jan 12, 2021 · 1 comment

@MattBlissett

Reported in gbif/portal-feedback#3191, but also affecting other datasets.

Mac OS sets the locale environment variable LC_CTYPE=UTF-8, which is not recognized on Linux. Linux would use en_US.UTF-8 or similar, or leave it unset and use LANG.

When Java starts up on Linux with the Mac OS value LC_CTYPE=UTF-8, Charset.defaultCharset() returns US-ASCII. This causes problems wherever the default character set is used: System.out, I/O streams created without an explicit character set, convenience classes like FileReader and FileWriter, etc.

In the case above, a FileWriter is used to output sorted DWCA data. With the mixed environment variables, the file is written as ASCII, so every non-ASCII character is replaced and the data is corrupted.
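
To illustrate the failure mode, here is a minimal sketch (not the gbif-common code) that writes the same string through a Writer with two different character sets. The class name `ExplicitCharsetDemo` and the method `encode` are made up for this example. Passing the charset explicitly, rather than relying on the default as FileWriter does, is what keeps the output independent of the environment.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are assumptions for illustration.
final class ExplicitCharsetDemo {

  // Writes text through a Writer that uses the given charset and
  // returns the bytes that would have ended up in the file.
  static byte[] encode(String text, Charset charset) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (Writer writer = new OutputStreamWriter(out, charset)) {
      writer.write(text);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }

  public static void main(String[] args) {
    // \u00f1 is "n with tilde"; the escape keeps this source file pure ASCII.
    String name = "Ca\u00f1on";

    // What happens when the default charset has fallen back to US-ASCII:
    // the unmappable character is silently replaced.
    byte[] ascii = encode(name, StandardCharsets.US_ASCII);
    System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "Ca?on"

    // With an explicit UTF-8 charset the text survives a round trip.
    byte[] utf8 = encode(name, StandardCharsets.UTF_8);
    System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(name)); // prints "true"
  }
}
```

No exception is thrown in the ASCII case, which is why the corruption only shows up later in the data.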

In other words, gbif-common assumes a correctly configured UTF-8 environment.

@MattBlissett

The commit improves the code (removing an encoding assumption), and logs a warning if FileUtils is used where the default character set is ASCII.
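
As a rough sketch of what such a warning could look like (this is not the actual commit): the class `CharsetGuard`, the method `warnIfAscii`, and the use of java.util.logging are all assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

// Hypothetical helper; not part of gbif-common.
final class CharsetGuard {

  private static final Logger LOG = Logger.getLogger(CharsetGuard.class.getName());

  // Logs a warning and returns true when the given charset is US-ASCII,
  // which usually points at a misconfigured locale environment.
  static boolean warnIfAscii(Charset charset) {
    if (StandardCharsets.US_ASCII.equals(charset)) {
      LOG.warning("Default charset is US-ASCII; non-ASCII data may be corrupted. "
          + "Check LANG and LC_CTYPE (LC_CTYPE=" + System.getenv("LC_CTYPE") + ")");
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    // Check the charset the JVM actually picked up from the environment.
    warnIfAscii(Charset.defaultCharset());
  }
}
```

Taking the charset as a parameter keeps the check testable without having to change the environment of the running JVM.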

I've also stopped the servers from accepting locale environment variables passed in over SSH.
