Skip to content

Setting default encoding in RawDocument to UTF-8#731

Merged
aurambaj merged 1 commit intobox:masterfrom
maallen:fix_bom_issue
Oct 9, 2021
Merged

Setting default encoding in RawDocument to UTF-8#731
aurambaj merged 1 commit intobox:masterfrom
maallen:fix_bom_issue

Conversation

@maallen
Copy link
Contributor

@maallen maallen commented Oct 7, 2021

Updated the okapi RawDocument to use a default encoding of UTF-8 as localized files were using UTF-16 as the default encoding which meant that a BOM was being added to the generated localized files.


public RawDocument(CharSequence inputCharSequence, LocaleId sourceLocale, LocaleId targetLocale) {
super(inputCharSequence, sourceLocale, targetLocale);
super(new ByteArrayInputStream(inputCharSequence.toString().getBytes()), "utf-8", sourceLocale, targetLocale);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getBytes() is platform sensitive so the version with the encoding (getBytes(StandardCharsets.UTF_8) should be used to be sure to get a utf-8 array. Also you can replace "utf-8" with the constant.

Did you figure out why all the other filters seem fine with utf-16 here and not the CVS/JSON one? just trying to understand if the filter should be fixed as an alternative. I'm wondering about the side effect of having utf-8 here since utf-16 is used in the default implementation. this choice might just be irrelevant (downstream logic should be working regardless of the content encoding)

Copy link
Contributor Author

@maallen maallen Oct 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other filters handle this slightly differently, Po filter reads the source file and resets its reader with a new encoding. The XML filter does similar by using the encoding returned from the DocumentParser, I could do something similar for JSON by overriding the createStartFilterEvent but I don't see an easy fix for CSV without replicating a lot of the underlying library code to set the encoding.

try {
documentContent = CharStreams.toString(rawDocument.getReader());
} catch (IOException e) {
logger.error("Error reading document content", e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rethrow a RuntimeException here since that would be a hard failure

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm gonna merge this and update later since i need that change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants