fix(rest-api): preserve Unicode characters in uploaded filenames (#35266) by dsilvam · Pull Request #35276 · dotCMS/core

dsilvam · 2026-04-09T20:43:37Z

Proposed Changes

Re-interpret Jersey's ISO-8859-1 decoded filename as UTF-8 in TempFileResource.sanitizeFileName()
Normalize to NFC after decoding to handle NFD filenames sent by macOS browsers
Replace the original ASCII-stripping regex with the Unicode-safe FileUtil.sanitizeFileName()
Add integration test that simulates the browser encoding and asserts the filename is preserved

Root Cause

Jersey decodes multipart Content-Disposition filenames as ISO-8859-1. macOS browsers send filenames in NFD UTF-8, so bytes like 0xCC 0x88 (combining diaeresis) are misread as Ì (U+00CC) + an invisible control character, producing aÌ instead of ä. The original regex [^\x00-\x7F] then stripped everything above ASCII 127, silently truncating the filename.
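The misdecoding described above can be reproduced with the JDK alone. The following is a minimal sketch, not code from the PR: the class and method names (`FilenameMojibakeDemo`, `decodeAsLatin1`, `stripNonAscii`) are hypothetical, and `decodeAsLatin1` only stands in for what Jersey is described as doing with the Content-Disposition filename.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class FilenameMojibakeDemo {

    // Stand-in for the decoding described above: UTF-8 bytes read as ISO-8859-1.
    static String decodeAsLatin1(final String original) {
        final byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // The ASCII-only filter quoted in the root cause.
    static String stripNonAscii(final String name) {
        return name.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(final String[] args) {
        // NFD form of "ä.txt": 'a' followed by U+0308 (bytes 0xCC 0x88 in UTF-8).
        final String nfd = Normalizer.normalize("\u00E4.txt", Normalizer.Form.NFD);
        final String mangled = decodeAsLatin1(nfd);

        // 0xCC is read as U+00CC and 0x88 as the control character U+0088.
        System.out.println(mangled.equals("a\u00CC\u0088.txt")); // prints "true"

        // The regex then drops both mangled characters.
        System.out.println(stripNonAscii(mangled)); // prints "a.txt"
    }
}
```

The diaeresis is lost entirely: the stored name becomes `a.txt` rather than `ä.txt`.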

Checklist

Tests
Translations
Security Implications Contemplated — fix does not relax any sanitization; FileUtil.sanitizeFileName() still strips all illegal filesystem characters

Additional Info

The fix is isolated to TempFileResource (the dotEvergreen upload path). The existing MultiPartUtils.getBinariesFromMultipart() path was already correct. FileUtil.java was left unchanged.

Note

Medium Risk
Changes multipart upload filename decoding/sanitization in TempFileResource, which could alter stored filenames for some non-UTF-8/legacy clients and affect downstream references. Scope is small and covered by a new integration test, but it touches file creation inputs.

Overview
Fixes temp-file multipart uploads to preserve Unicode filenames instead of stripping non-ASCII characters.

TempFileResource.sanitizeFileName() now re-interprets Jersey’s ISO-8859-1-decoded Content-Disposition filename bytes as UTF-8, normalizes to NFC, and sanitizes via FileUtil.sanitizeFileName().
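The decode-and-normalize step can be sketched in isolation as below. This is an illustration under assumptions, not the merged implementation: `FilenameDecodeSketch` and `recoverUnicodeName` are hypothetical names, and the final `FileUtil.sanitizeFileName()` call is left out because its implementation is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class FilenameDecodeSketch {

    // Recovers the bytes the client sent, decodes them as UTF-8, then composes to NFC.
    static String recoverUnicodeName(final String latin1Decoded) {
        // ISO-8859-1 maps each char U+0000..U+00FF back to exactly one byte,
        // so this restores the original byte sequence without loss.
        final byte[] originalBytes = latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
        final String utf8Name = new String(originalBytes, StandardCharsets.UTF_8);
        // NFC merges 'a' + U+0308 into the single code point U+00E4.
        return Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    }

    public static void main(final String[] args) {
        // What an ISO-8859-1 decoder produces for the NFD bytes of "ä.txt".
        final String misdecoded = "a\u00CC\u0088.txt";
        System.out.println(recoverUnicodeName(misdecoded).equals("\u00E4.txt")); // prints "true"
    }
}
```

Plain ASCII names pass through this round trip unchanged, since ASCII bytes are identical in ISO-8859-1 and UTF-8.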

Adds an integration test that simulates Jersey’s filename decoding (including macOS NFD input) and asserts the uploaded temp file keeps the expected Unicode filename.

Reviewed by Cursor Bugbot for commit d724e18. Bugbot is set up for automated code reviews on this repo.

) Jersey decodes multipart Content-Disposition filenames as ISO-8859-1, mangling non-ASCII characters. Re-interpret the bytes as UTF-8, normalize to NFC, then delegate to FileUtil.sanitizeFileName() for illegal-char removal. Refs: #35266 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude · 2026-04-09T20:44:30Z

Claude finished @dsilvam's task in 6m 28s

🔍 dotCMS Backend Review

[🟠 Medium] dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java:234

When every byte of the uploaded filename is invalid UTF-8 (e.g. a pure high-byte ISO-8859-1 name), .replace("\uFFFD", "") produces an empty string. FileUtil.sanitizeFileName("") then returns a random 10-letter alphanumeric fallback (e.g. AbCdEfGhIj.ukn). The upload succeeds with HTTP 200, the stored file carries a name bearing no relation to what the client sent, and no error or warning is emitted. The caller cannot distinguish this silent rename from a legitimate upload.

final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
        .replace("\uFFFD", "");   // may produce ""
final String nfcName = Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
return FileUtil.sanitizeFileName(nfcName);  // returns random name if empty — no error returned

💡 Detect the empty-after-stripping case and throw an IllegalArgumentException (the existing catch (Exception e) at line 210 will convert it to a proper ErrorEntity with HTTP 400). Alternatively, at minimum log a Logger.warn so operators can identify ISO-8859-1 legacy clients.
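One way the "throw on empty" option might look is sketched below. `EmptyNameGuardSketch` and `decodeOrReject` are hypothetical names, the exception message is invented, and how the surrounding resource maps the exception to an HTTP response is outside this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class EmptyNameGuardSketch {

    // Decodes the filename and rejects it when nothing survives the U+FFFD stripping.
    static String decodeOrReject(final String raw) {
        final String utf8Name =
                new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
                        .replace("\uFFFD", "");
        if (utf8Name.isEmpty()) {
            // Fail loudly instead of letting a random fallback name be generated.
            throw new IllegalArgumentException(
                    "Filename contains no valid UTF-8 characters");
        }
        return Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    }

    public static void main(final String[] args) {
        try {
            // 0xFF and 0xFE never occur in valid UTF-8, so both become U+FFFD.
            decodeOrReject("\u00FF\u00FE");
        } catch (final IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Throwing trades compatibility for visibility: clients sending such names would start failing instead of receiving a silently renamed file, which is the judgment call the review leaves to the author.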

[🟡 Medium] dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java:200 vs 233

The null/dotfile pre-check at line 200 validates the raw Jersey/ISO-8859-1 decoded string, but sanitizeFileName() re-interprets those same bytes as UTF-8, producing a potentially different string. A filename that only becomes a dotfile or hidden-path pattern after UTF-8 decoding bypasses the pre-check. FileUtil.sanitizeFileName() provides a second layer of defense (path resolution + leading-dot strip), so this is not exploitable, but the two guards are conceptually misaligned and the pre-check is redundant for the decoded path.

// Line 200 — guard on ISO-8859-1 decoded raw string
if (fileName == null || fileName.startsWith(".") || fileName.contains("/.")) { ... }

// Line 233 — sanitizeFileName operates on UTF-8 re-interpreted string
final String raw = meta.getFileName();  // second call to getFileName()
final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)…

💡 Decode once early (before the pre-check) and run all validation against the decoded string. Passing the already-validated String into sanitizeFileName instead of ContentDisposition also removes the redundant meta.getFileName() call.
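The "decode once, then validate" ordering could be sketched roughly as follows. All names here (`DecodeThenValidateSketch`, `decode`, `isMissingOrHidden`, `validatedName`) are hypothetical, and the pre-check only mirrors the condition quoted above rather than the actual code at line 200.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class DecodeThenValidateSketch {

    // Single decoding step: ISO-8859-1 chars back to bytes, bytes to UTF-8, strip U+FFFD, NFC.
    static String decode(final String raw) {
        final String utf8Name =
                new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
                        .replace("\uFFFD", "");
        return Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    }

    // Same condition as the quoted pre-check, applied to whatever string it is given.
    static boolean isMissingOrHidden(final String name) {
        return name == null || name.startsWith(".") || name.contains("/.");
    }

    // Decodes first, then validates the decoded string that will actually be used.
    static String validatedName(final String rawFromHeader) {
        final String decoded = rawFromHeader == null ? null : decode(rawFromHeader);
        if (isMissingOrHidden(decoded)) {
            throw new IllegalArgumentException("Invalid or hidden filename");
        }
        return decoded;
    }

    public static void main(final String[] args) {
        // The raw string does not start with '.', but the decoded one does:
        // the 0xFF byte turns into U+FFFD and is stripped.
        final String raw = "\u00FF.hidden";
        System.out.println(isMissingOrHidden(raw));         // prints "false"
        System.out.println(isMissingOrHidden(decode(raw))); // prints "true"
    }
}
```

With this ordering the guard and the sanitizer see the same string, so the sanitizer can accept a `String` and no second `getFileName()` call is needed.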

[🟡 Low] dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java:234

When a genuine ISO-8859-1 filename (e.g. "café.txt" from a legacy client) is silently downgraded — high bytes map to \uFFFD and are stripped — there is no warning logged. Operators have no visibility into which clients are still sending non-UTF-8 filenames.

final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
        .replace("\uFFFD", "");   // silent data loss, no Logger.warn

💡 Add Logger.warn(TempFileResource.class, "Non-UTF-8 bytes in Content-Disposition filename; high bytes dropped: " + raw) when the decoded string contains \uFFFD before stripping.
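A self-contained sketch of the suggested warning is shown below. It uses `java.util.logging` purely as a stand-in for the dotCMS `Logger` named in the suggestion, and `LossyDecodeWarningSketch` and `decodeWithWarning` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

public class LossyDecodeWarningSketch {

    // Stand-in logger; the suggestion above refers to the project's own Logger class.
    private static final Logger LOG =
            Logger.getLogger(LossyDecodeWarningSketch.class.getName());

    // Decodes the filename and logs once when invalid UTF-8 bytes had to be dropped.
    static String decodeWithWarning(final String raw) {
        String decoded =
                new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        if (decoded.indexOf('\uFFFD') >= 0) {
            LOG.warning("Non-UTF-8 bytes in Content-Disposition filename; "
                    + "high bytes dropped: " + raw);
            decoded = decoded.replace("\uFFFD", "");
        }
        return decoded;
    }

    public static void main(final String[] args) {
        // "café.txt" as a genuine ISO-8859-1 name: the lone 0xE9 byte is not valid UTF-8.
        System.out.println(decodeWithWarning("caf\u00E9.txt")); // prints "caf.txt"
    }
}
```

Because the raw filename is client-controlled, a real implementation would probably want to escape or truncate it before writing it to the log.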

Next steps

🟠 Fix locally and push — the silent-random-rename path needs your judgment (throw on empty vs. log-and-continue)
🟡 You can ask me to handle mechanical fixes inline: @claude fix silent \uFFFD data-loss logging and pre-check alignment in TempFileResource.java
Every new push triggers a fresh review automatically

…malization path
- Remove duplicate `import java.nio.charset.StandardCharsets` from TempFileResource.java (was imported at both line 49 and 62).
- Remove duplicate `import java.nio.charset.StandardCharsets` from TempFileResourceTest.java (was imported at line 9 and 50).
- Update unicode filename test to use NFD-encoded input so that the Normalizer.normalize(…, NFC) step in sanitizeFileName is actually exercised (previously the test used NFC literals, making the normalisation call a no-op).
Co-authored-by: Daniel Silva <undefined@users.noreply.github.com>

claude · 2026-04-09T20:49:53Z


Remove duplicate java.nio.charset.StandardCharsets import and move both new imports (StandardCharsets, Normalizer) into the existing java.* group in alphabetical order. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

#35266) When a non-browser client sends a genuine ISO-8859-1 filename whose bytes are not valid UTF-8, re-decoding inserts U+FFFD replacement characters. Strip them immediately so they are not persisted in the stored filename. FileUtil.sanitizeFileName() already falls back to a random name if stripping leaves an empty string. Refs: #35266 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…5266) Add comment clarifying that the ISO-8859-1 → UTF-8 round-trip assumes modern browsers send UTF-8 bytes per HTML5/RFC 6266, and that high bytes from genuine ISO-8859-1 legacy clients are silently dropped. Refs: #35266 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-project-automation bot added this to dotCMS - Product Planning Apr 9, 2026

github-actions bot mentioned this pull request Apr 9, 2026

File upload strips non-ASCII characters (umlauts, accented chars) from filenames #35266

Closed

github-actions bot added the Area : Backend PR changes Java/Maven backend code label Apr 9, 2026

claude bot added the AI: Safe To Rollback label Apr 9, 2026

dsolistorres approved these changes Apr 9, 2026

gortiz-dotcms approved these changes Apr 9, 2026

Daniel Silva and others added 2 commits April 9, 2026 16:16

dsolistorres reviewed Apr 9, 2026

Comment thread dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java Outdated

Daniel Silva and others added 2 commits April 9, 2026 16:25

fix(rest-api): resolve merge conflict — keep StandardCharsets/Normali…

89063d3

…zer imports Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dsilvam enabled auto-merge April 10, 2026 13:02

dsilvam added this pull request to the merge queue Apr 10, 2026

Merged via the queue into main with commit bf6b6d5 Apr 10, 2026
50 checks passed

dsilvam deleted the fix/issue-35266-non-ascii-filename-strip branch April 10, 2026 14:56

dsilvam merged 6 commits into main from fix/issue-35266-non-ascii-filename-strip

3 participants
