
fix(rest-api): preserve Unicode characters in uploaded filenames (#35266) #35276

Merged: dsilvam merged 6 commits into main from fix/issue-35266-non-ascii-filename-strip on Apr 10, 2026

dsilvam (Member) commented Apr 9, 2026

Closes #35266

Proposed Changes

  • Re-interpret Jersey's ISO-8859-1 decoded filename as UTF-8 in TempFileResource.sanitizeFileName()
  • Normalize to NFC after decoding to handle NFD filenames sent by macOS browsers
  • Replace the original ASCII-stripping regex with the Unicode-safe FileUtil.sanitizeFileName()
  • Add integration test that simulates the browser encoding and asserts the filename is preserved
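
The first two changes above can be sketched as follows. This is a minimal illustration, not the merged code: the class and method names are hypothetical, and the final `FileUtil.sanitizeFileName()` step is left out because its implementation is not shown in this PR.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

/** Hypothetical stand-in for the decoding steps listed above; not the dotCMS source. */
final class FilenameDecodeSketch {

    /** Undoes an ISO-8859-1 decode of UTF-8 bytes, then composes the result to NFC. */
    static String decodeAndNormalize(final String jerseyDecoded) {
        // ISO-8859-1 maps each char U+0000..U+00FF to exactly one byte,
        // so encoding with it recovers the bytes that were on the wire.
        final byte[] rawBytes = jerseyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        final String utf8Name = new String(rawBytes, StandardCharsets.UTF_8);
        return Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    }

    public static void main(final String[] args) {
        // "a" + U+0308 is the NFD spelling of "ä"; mis-decode it the way the PR describes.
        final String nfd = "a\u0308.txt";
        final String mangled = new String(nfd.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(decodeAndNormalize(mangled).equals("\u00E4.txt")); // prints "true"
    }
}
```

The round trip is only lossless when every character of the input is at or below U+00FF, which holds for a string that was itself produced by an ISO-8859-1 decode.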

Root Cause

Jersey decodes multipart Content-Disposition filenames as ISO-8859-1. macOS browsers send filenames in NFD UTF-8, so bytes like 0xCC 0x88 (combining diaeresis) are misread as Ì (U+00CC) + an invisible control character (U+0088), producing aÌ followed by that control character instead of ä. The original regex [^\x00-\x7F] then stripped everything above ASCII 127, silently dropping those characters from the filename.
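
The failure described here can be reproduced in isolation. The sketch below is illustrative only: the class name is hypothetical, and applying the quoted pattern through `String.replaceAll` is an assumption about how the original code used it.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical reproduction of the root cause; not taken from the dotCMS code base. */
final class MojibakeDemo {

    /** Simulates decoding a filename's UTF-8 bytes as ISO-8859-1. */
    static String misdecode(final String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Applies the ASCII-only pattern quoted above (replaceAll usage is assumed). */
    static String stripNonAscii(final String name) {
        return name.replaceAll("[^\\x00-\\x7F]", "");
    }

    public static void main(final String[] args) {
        final String mangled = misdecode("a\u0308.txt"); // NFD "ä.txt"
        // Print each code point: the two bytes 0xCC 0x88 show up as U+00CC and U+0088.
        mangled.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
        System.out.println(stripNonAscii(mangled)); // prints "a.txt"
    }
}
```

With a precomposed (NFC) ä the outcome is worse: its UTF-8 bytes 0xC3 0xA4 both decode to characters above U+007F, so the same pattern leaves only ".txt".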

Checklist

  • Tests
  • Translations
  • Security Implications Contemplated — fix does not relax any sanitization; FileUtil.sanitizeFileName() still strips all illegal filesystem characters

Additional Info

The fix is isolated to TempFileResource (the dotEvergreen upload path). The existing MultiPartUtils.getBinariesFromMultipart() path was already correct. FileUtil.java was left unchanged.


Note

Medium Risk
Changes multipart upload filename decoding/sanitization in TempFileResource, which could alter stored filenames for some non-UTF-8/legacy clients and affect downstream references. Scope is small and covered by a new integration test, but it touches file creation inputs.

Overview
Fixes temp-file multipart uploads to preserve Unicode filenames instead of stripping non-ASCII characters.

TempFileResource.sanitizeFileName() now re-interprets Jersey’s ISO-8859-1-decoded Content-Disposition filename bytes as UTF-8, normalizes to NFC, and sanitizes via FileUtil.sanitizeFileName().

Adds an integration test that simulates Jersey’s filename decoding (including macOS NFD input) and asserts the uploaded temp file keeps the expected Unicode filename.

Reviewed by Cursor Bugbot for commit d724e18. Bugbot is set up for automated code reviews on this repo.

)

Jersey decodes multipart Content-Disposition filenames as ISO-8859-1,
mangling non-ASCII characters. Re-interpret the bytes as UTF-8, normalize
to NFC, then delegate to FileUtil.sanitizeFileName() for illegal-char removal.

Refs: #35266

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude bot (Contributor) commented Apr 9, 2026

Claude finished @dsilvam's task in 6m 28s


🔍 dotCMS Backend Review

[🟠 Medium] dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java:234

When every byte of the uploaded filename is invalid UTF-8 (e.g. a pure high-byte ISO-8859-1 name), .replace("\uFFFD", "") produces an empty string. FileUtil.sanitizeFileName("") then returns a random 10-letter alphanumeric fallback (e.g. AbCdEfGhIj.ukn). The upload succeeds with HTTP 200, the stored file carries a name bearing no relation to what the client sent, and no error or warning is emitted. The caller cannot distinguish this silent rename from a legitimate upload.

final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
        .replace("\uFFFD", "");   // may produce ""
final String nfcName = Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
return FileUtil.sanitizeFileName(nfcName);  // returns random name if empty — no error returned

💡 Detect the empty-after-stripping case and throw an IllegalArgumentException (the existing catch (Exception e) at line 210 will convert it to a proper ErrorEntity with HTTP 400). Alternatively, at minimum log a Logger.warn so operators can identify ISO-8859-1 legacy clients.
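
One way the first suggestion might look is sketched below. The class name, method name, and exception message are hypothetical; whether the surrounding resource maps the exception to HTTP 400 depends on the catch block mentioned above, which is not reproduced here.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

/** Hypothetical variant of the decode step that refuses to fall back to a random name. */
final class StrictFilenameDecoder {

    static String decodeOrReject(final String raw) {
        final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
                .replace("\uFFFD", "");
        if (utf8Name.isEmpty()) {
            // Nothing survived the U+FFFD strip: fail loudly instead of renaming silently.
            throw new IllegalArgumentException("Filename contains no valid UTF-8 characters");
        }
        return Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    }
}
```

Note that this check also rejects an empty input string, since it cannot tell that case apart from a name that was stripped to nothing.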


[🟡 Medium] dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java:200 vs 233

The null/dotfile pre-check at line 200 validates the raw Jersey/ISO-8859-1 decoded string, but sanitizeFileName() re-interprets those same bytes as UTF-8, producing a potentially different string. A filename that only becomes a dotfile or hidden-path pattern after UTF-8 decoding bypasses the pre-check. FileUtil.sanitizeFileName() provides a second layer of defense (path resolution + leading-dot strip), so this is not exploitable, but the two guards are conceptually misaligned and the pre-check is redundant for the decoded path.

// Line 200 — guard on ISO-8859-1 decoded raw string
if (fileName == null || fileName.startsWith(".") || fileName.contains("/.")) { ... }

// Line 233 — sanitizeFileName operates on UTF-8 re-interpreted string
final String raw = meta.getFileName();  // second call to getFileName()
final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)…

💡 Decode once early (before the pre-check) and run all validation against the decoded string. Passing the already-validated String into sanitizeFileName instead of ContentDisposition also removes the redundant meta.getFileName() call.
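
A sketch of the decode-first ordering, under the assumption that the pre-check and decode steps are exactly the ones quoted in this review. All class and method names are hypothetical and do not come from TempFileResource.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

/** Hypothetical helper showing validation applied to the decoded name only. */
final class DecodeThenValidate {

    /** Same decode steps as the snippet quoted in the review. */
    static String decode(final String raw) {
        final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
                .replace("\uFFFD", "");
        return Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    }

    /** The pre-check quoted in the review, applied to whatever string it is given. */
    static boolean isMissingOrHidden(final String fileName) {
        return fileName == null || fileName.startsWith(".") || fileName.contains("/.");
    }

    /** Decodes once, then runs the guard against the decoded string. */
    static String validatedName(final String rawFromHeader) {
        final String decoded = rawFromHeader == null ? null : decode(rawFromHeader);
        if (isMissingOrHidden(decoded)) {
            throw new IllegalArgumentException("Invalid or hidden filename");
        }
        return decoded;
    }
}
```

For example, the raw string "\u00E9.hidden" passes the guard as received, but decodes to ".hidden" once the invalid byte is stripped, so only the decode-first order catches it.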


[🟡 Low] dotCMS/src/main/java/com/dotcms/rest/api/v1/temp/TempFileResource.java:234

When a genuine ISO-8859-1 filename (e.g. "café.txt" from a legacy client) is silently downgraded — high bytes map to \uFFFD and are stripped — there is no warning logged. Operators have no visibility into which clients are still sending non-UTF-8 filenames.

final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
        .replace("\uFFFD", "");   // silent data loss, no Logger.warn

💡 Add Logger.warn(TempFileResource.class, "Non-UTF-8 bytes in Content-Disposition filename; high bytes dropped: " + raw) when the decoded string contains \uFFFD before stripping.
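
The logging suggestion could be sketched like this. `java.util.logging.Logger` stands in for the dotCMS `Logger` class so that the example is self-contained; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;

/** Hypothetical decode step that reports lossy input before stripping it. */
final class LossyDecodeLogging {

    // Stand-in for the dotCMS Logger, which is not available outside the project.
    private static final Logger LOG = Logger.getLogger(LossyDecodeLogging.class.getName());

    static String decodeWithWarning(final String raw) {
        final String decoded = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        if (decoded.indexOf('\uFFFD') >= 0) {
            // At least one byte sequence was not valid UTF-8 and became U+FFFD.
            LOG.warning("Non-UTF-8 bytes in Content-Disposition filename; high bytes dropped: " + raw);
            return decoded.replace("\uFFFD", "");
        }
        return decoded;
    }
}
```

Because the raw value is client-supplied, it may be worth escaping control characters before it reaches the log.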


Next steps

  • 🟠 Fix locally and push — the silent-random-rename path needs your judgment (throw on empty vs. log-and-continue)
  • 🟡 You can ask me to handle mechanical fixes inline: @claude fix silent \uFFFD data-loss logging and pre-check alignment in TempFileResource.java
  • Every new push triggers a fresh review automatically

…malization path

- Remove duplicate `import java.nio.charset.StandardCharsets` from
  TempFileResource.java (was imported at both line 49 and 62).
- Remove duplicate `import java.nio.charset.StandardCharsets` from
  TempFileResourceTest.java (was imported at line 9 and 50).
- Update unicode filename test to use NFD-encoded input so that the
  Normalizer.normalize(…, NFC) step in sanitizeFileName is actually
  exercised (previously the test used NFC literals, making the
  normalisation call a no-op).
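
The no-op described in the last point can be checked directly. This is an illustrative snippet with a hypothetical class name, not the test from the PR.

```java
import java.text.Normalizer;

/** Hypothetical check of whether NFC normalization alters a given input. */
final class NormalizationNoOpDemo {

    static boolean normalizationChangesInput(final String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFC).equals(s);
    }

    public static void main(final String[] args) {
        final String nfc = "\u00E4.txt";  // precomposed ä
        final String nfd = "a\u0308.txt"; // a + combining diaeresis
        System.out.println(normalizationChangesInput(nfc)); // prints "false"
        System.out.println(normalizationChangesInput(nfd)); // prints "true"
    }
}
```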

Co-authored-by: Daniel Silva <undefined@users.noreply.github.com>

Daniel Silva and others added 2 commits April 9, 2026 16:16
Remove duplicate java.nio.charset.StandardCharsets import and move both
new imports (StandardCharsets, Normalizer) into the existing java.* group
in alphabetical order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#35266)

When a non-browser client sends a genuine ISO-8859-1 filename whose bytes
are not valid UTF-8, re-decoding inserts U+FFFD replacement characters.
Strip them immediately so they are not persisted in the stored filename.
FileUtil.sanitizeFileName() already falls back to a random name if stripping
leaves an empty string.

Refs: #35266

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Daniel Silva and others added 2 commits April 9, 2026 16:25
…zer imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…5266)

Add comment clarifying that the ISO-8859-1 → UTF-8 round-trip assumes
modern browsers send UTF-8 bytes per HTML5/RFC 6266, and that high bytes
from genuine ISO-8859-1 legacy clients are silently dropped.

Refs: #35266

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dsilvam enabled auto-merge April 10, 2026 13:02
dsilvam added this pull request to the merge queue Apr 10, 2026
Merged via the queue into main with commit bf6b6d5 Apr 10, 2026
50 checks passed
dsilvam deleted the fix/issue-35266-non-ascii-filename-strip branch April 10, 2026 14:56

Labels

AI: Safe To Rollback, Area : Backend (PR changes Java/Maven backend code)

Development

Successfully merging this pull request may close these issues.

File upload strips non-ASCII characters (umlauts, accented chars) from filenames

3 participants