Setting default encoding in RawDocument to UTF-8 #731

aurambaj merged 1 commit into box:master from maallen:fix_bom_issue
Conversation
```diff
 public RawDocument(CharSequence inputCharSequence, LocaleId sourceLocale, LocaleId targetLocale) {
-    super(inputCharSequence, sourceLocale, targetLocale);
+    super(new ByteArrayInputStream(inputCharSequence.toString().getBytes()), "utf-8", sourceLocale, targetLocale);
```
`getBytes()` is platform sensitive, so the version that takes an explicit encoding (`getBytes(StandardCharsets.UTF_8)`) should be used to be sure of getting a UTF-8 byte array. You can also replace the `"utf-8"` literal with the constant.

Did you figure out why all the other filters seem fine with UTF-16 here but not the CSV/JSON ones? I'm just trying to understand whether fixing the filter would be an alternative. I'm also wondering about the side effects of using UTF-8 here, since UTF-16 is what the default implementation uses; that choice might just be irrelevant (downstream logic should work regardless of the content encoding).
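The platform sensitivity the reviewer points out can be shown with a small standalone snippet (illustrative only, not project code): the no-argument `getBytes()` uses the JVM's default charset, while passing `StandardCharsets.UTF_8` is deterministic everywhere.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        String content = "héllo";
        // Platform-dependent: uses the JVM default charset (e.g. windows-1252
        // on some Windows setups), which may encode "é" as a single byte.
        byte[] platformBytes = content.getBytes();
        // Deterministic: always UTF-8, where "é" is two bytes.
        byte[] utf8Bytes = content.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length); // prints 6 (h=1, é=2, l=1, l=1, o=1)
    }
}
```

Wrapping the result in `new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8))` then guarantees the stream content matches the `"utf-8"` encoding label passed alongside it.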
The other filters handle this slightly differently: the PO filter reads the source file and resets its reader with a new encoding, and the XML filter does something similar using the encoding returned by the DocumentParser. I could do something similar for JSON by overriding createStartFilterEvent, but I don't see an easy fix for CSV without replicating a lot of the underlying library code to set the encoding.
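The "reset the reader with the detected encoding" approach described above can be sketched generically. This is a hypothetical helper, not Okapi or Mojito code: it peeks at the leading bytes, picks a charset from the BOM (defaulting to UTF-8 when none is found), and returns a reader in that encoding with the BOM consumed.

```java
import java.io.*;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomDetect {
    // Hypothetical helper: sniff a BOM, choose the charset accordingly,
    // and push back any non-BOM bytes so the reader sees the full content.
    static Reader readerFor(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.read(head);
        Charset cs;
        int bomLen;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                   && (head[2] & 0xFF) == 0xBF) {
            cs = StandardCharsets.UTF_8;    bomLen = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            cs = StandardCharsets.UTF_16BE; bomLen = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            cs = StandardCharsets.UTF_16LE; bomLen = 2;
        } else {
            cs = StandardCharsets.UTF_8;    bomLen = 0; // no BOM: assume UTF-8
        }
        if (n > bomLen) {
            pin.unread(head, bomLen, n - bomLen); // put back the non-BOM bytes
        }
        return new InputStreamReader(pin, cs);
    }
}
```

The CSV case is harder precisely because the underlying library owns the stream-to-reader conversion, so there is no equivalent hook to swap the charset in after detection.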
...a/com/box/l10n/mojito/service/assetintegritychecker/integritychecker/IntegrityCheckStep.java (outdated; resolved)
```java
try {
    documentContent = CharStreams.toString(rawDocument.getReader());
} catch (IOException e) {
    logger.error("Error reading document content", e);
```
Rethrow a RuntimeException here, since that would be a hard failure.
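A minimal sketch of the suggested change, using a plain `Reader` and the standard `UncheckedIOException` in place of the project's Guava/`rawDocument` context (names and setup here are illustrative, not the actual Mojito code):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class HardFailure {
    static String readOrFail(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            // Surface the hard failure to the caller instead of only logging it.
            throw new UncheckedIOException("Error reading document content", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readOrFail(new StringReader("abc"))); // prints abc
    }
}
```

Logging and swallowing the exception would leave `documentContent` unset and push the failure to a more confusing point downstream; rethrowing makes the integrity-check step fail fast.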
I'm gonna merge this and update later since I need that change.
Updated the Okapi RawDocument to use a default encoding of UTF-8: localized files were using UTF-16 as the default encoding, which meant a BOM was being added to the generated localized files.
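The BOM behavior described in the PR summary is easy to reproduce with Java's standard charsets: encoding with the `"UTF-16"` charset prepends a big-endian BOM (`FE FF`), while UTF-8 output carries no BOM.

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    public static void main(String[] args) {
        // Java's "UTF-16" charset writes a big-endian BOM before the content.
        byte[] utf16 = "abc".getBytes(StandardCharsets.UTF_16);
        // UTF-8 output has no BOM: just the three ASCII bytes.
        byte[] utf8 = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.printf("utf16[0..1] = %02X %02X, utf8 length = %d%n",
                utf16[0], utf16[1], utf8.length);
        // prints: utf16[0..1] = FE FF, utf8 length = 3
    }
}
```

This is why defaulting the RawDocument content to UTF-16 leaked a BOM into generated localized files, and why switching the default to UTF-8 avoids it.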