Setting default encoding in RawDocument to UTF-8 by maallen · Pull Request #731 · box/mojito

maallen · 2021-10-07T10:28:23Z

Updated the Okapi RawDocument to use UTF-8 as its default encoding. Localized files were being generated with UTF-16 as the default encoding, which meant that a BOM was being added to them.
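As an illustration of why a UTF-16 default can lead to a BOM in the output, here is a minimal JDK-only sketch. It does not touch Okapi or mojito code; the class and method names (BomDemo, encodeUtf16, encodeUtf8, startsWithUtf16Bom) are hypothetical and exist only for this example. It relies on the documented behavior of the JDK's UTF-16 charset, which writes a big-endian byte order mark when encoding, while the UTF-8 encoder writes none.

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {

    /** Encodes with the JDK's "UTF-16" charset, which prepends a big-endian BOM. */
    static byte[] encodeUtf16(String text) {
        return text.getBytes(StandardCharsets.UTF_16);
    }

    /** Encodes with UTF-8; the JDK encoder does not write a BOM. */
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true if the array starts with the UTF-16 big-endian BOM (FE FF). */
    static boolean startsWithUtf16Bom(byte[] bytes) {
        return bytes.length >= 2
                && (bytes[0] & 0xFF) == 0xFE
                && (bytes[1] & 0xFF) == 0xFF;
    }

    public static void main(String[] args) {
        byte[] utf16 = encodeUtf16("a");
        byte[] utf8 = encodeUtf8("a");
        System.out.println("UTF-16 length: " + utf16.length + ", BOM: " + startsWithUtf16Bom(utf16));
        System.out.println("UTF-8 length:  " + utf8.length + ", BOM: " + startsWithUtf16Bom(utf8));
    }
}
```

Whether the BOM in the generated localized files came from exactly this JDK behavior is an assumption; the sketch only shows that the two encodings differ in this respect.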

aurambaj · 2021-10-07T14:59:41Z

common/src/main/java/com/box/l10n/mojito/okapi/RawDocument.java


    public RawDocument(CharSequence inputCharSequence, LocaleId sourceLocale, LocaleId targetLocale) {
-        super(inputCharSequence, sourceLocale, targetLocale);
+        super(new ByteArrayInputStream(inputCharSequence.toString().getBytes()), "utf-8", sourceLocale, targetLocale);


getBytes() is platform-sensitive, so the version that takes the encoding, getBytes(StandardCharsets.UTF_8), should be used to be sure of getting a UTF-8 array. You can also replace "utf-8" with the constant.

Did you figure out why all the other filters seem fine with UTF-16 here and not the CSV/JSON ones? I'm just trying to understand whether the filter should be fixed as an alternative. I'm wondering about the side effects of having UTF-8 here, since UTF-16 is used in the default implementation. This choice might just be irrelevant (downstream logic should work regardless of the content encoding).

The other filters handle this slightly differently. The Po filter reads the source file and resets its reader with a new encoding. The XML filter does something similar by using the encoding returned from the DocumentParser. I could do something similar for JSON by overriding createStartFilterEvent, but I don't see an easy fix for CSV without replicating a lot of the underlying library code to set the encoding.
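The getBytes() point raised in this thread can be sketched with the JDK alone. This is an illustrative example, not the code that was merged; ExplicitCharsetDemo, toUtf8 and utf8Name are hypothetical names. It shows the overload that takes a charset and one way to obtain the encoding name from the constant instead of a string literal.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    /** Always yields UTF-8 bytes, regardless of the platform default charset. */
    static byte[] toUtf8(CharSequence text) {
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** The canonical name of the constant, usable where an encoding name string is expected. */
    static String utf8Name() {
        return StandardCharsets.UTF_8.name();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // "e" with acute accent, written as an escape to avoid source-encoding issues
        // The no-argument overload depends on the platform default charset.
        System.out.println("platform default charset: " + Charset.defaultCharset());
        System.out.println("default getBytes() length: " + text.getBytes().length);
        // The explicit overload is the same on every platform.
        System.out.println("UTF-8 getBytes() length: " + toUtf8(text).length);
        System.out.println("encoding name from constant: " + utf8Name());
    }
}
```

On a JVM whose default charset is a single-byte encoding such as ISO-8859-1, the first length would differ from the second, which is the platform sensitivity the review comment warns about.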

aurambaj · 2021-10-08T22:26:16Z

...a/com/box/l10n/mojito/service/assetintegritychecker/integritychecker/IntegrityCheckStep.java

+        try {
+            documentContent = CharStreams.toString(rawDocument.getReader());
+        } catch (IOException e) {
+            logger.error("Error reading document content", e);


Rethrow a RuntimeException here, since that would be a hard failure.

I'm going to merge this and update it later, since I need that change.
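The suggestion to rethrow could look roughly like the sketch below. It is not the follow-up change itself: to stay self-contained it reads from a plain java.io.Reader with a manual loop instead of Guava's CharStreams and Okapi's RawDocument, and ReadOrFail and readAll are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ReadOrFail {

    /**
     * Reads the whole reader into a String. An IOException is treated as a
     * hard failure and rethrown unchecked instead of being logged and ignored.
     */
    static String readAll(Reader reader) {
        StringBuilder content = new StringBuilder();
        char[] buffer = new char[4096];
        try {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                content.append(buffer, 0, read);
            }
        } catch (IOException e) {
            // Keep the original exception as the cause so the stack trace is preserved.
            throw new RuntimeException("Error reading document content", e);
        }
        return content.toString();
    }

    public static void main(String[] args) {
        System.out.println(readAll(new StringReader("hello")));
    }
}
```

Compared with logging only, this stops the step from continuing with missing document content.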

aurambaj reviewed Oct 7, 2021

Setting default encoding in RawDocument to UTF-8

c51ccfc

aurambaj reviewed Oct 8, 2021

aurambaj approved these changes Oct 9, 2021

aurambaj merged commit 9bbda57 into box:master Oct 9, 2021

wadimw mentioned this pull request Feb 3, 2026

Fix unexpected document end when importing drops with large XLIFF files #1049

Open

aurambaj merged 1 commit into box:master from maallen:fix_bom_issue
