fix(rest-api): preserve Unicode characters in uploaded filenames (#35266) #35276
Jersey decodes multipart Content-Disposition filenames as ISO-8859-1, mangling non-ASCII characters. Re-interpret the bytes as UTF-8, normalize to NFC, then delegate to FileUtil.sanitizeFileName() for illegal-char removal.

Refs: #35266
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude finished @dsilvam's task in 6m 28s

🔍 dotCMS Backend Review

[🟠 Medium]

    final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
            .replace("\uFFFD", ""); // may produce ""
    final String nfcName = Normalizer.normalize(utf8Name, Normalizer.Form.NFC);
    return FileUtil.sanitizeFileName(nfcName); // returns random name if empty; no error returned

💡 Detect the empty-after-stripping case and throw an …

[🟡 Medium]

    // Line 200: guard on ISO-8859-1 decoded raw string
    if (fileName == null || fileName.startsWith(".") || fileName.contains("/.")) { ... }
    // Line 233: sanitizeFileName operates on UTF-8 re-interpreted string
    final String raw = meta.getFileName(); // second call to getFileName()
    final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)…

💡 Decode once early (before the pre-check) and run all validation against the decoded string. Passing the already-validated …

[🟡 Low]

    final String utf8Name = new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8)
            .replace("\uFFFD", ""); // silent data loss, no Logger.warn

💡 Add …
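The three review findings can be read together as one flow: decode once, warn when bytes are dropped, and reject an empty result instead of silently substituting a random name. The following is only a minimal sketch of that idea, not the code in this PR. The class name `DecodeOnceSketch`, the method `decodeAndValidate`, and the choice of `IllegalArgumentException` are hypothetical (the review text does not show which exception it recommends), `java.util.logging` stands in for the dotCMS `Logger`, and the final call to `FileUtil.sanitizeFileName()` is left out so the example stays self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.logging.Logger;

// Hypothetical helper illustrating the review suggestions; not part of dotCMS.
class DecodeOnceSketch {

    // Assumption: java.util.logging replaces the dotCMS Logger in this sketch.
    private static final Logger LOG = Logger.getLogger(DecodeOnceSketch.class.getName());

    /** Decodes the Jersey-provided filename once, then validates the decoded value. */
    static String decodeAndValidate(final String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("Missing filename");
        }
        // Recover the original bytes and re-interpret them as UTF-8.
        final String reinterpreted =
                new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        // Malformed UTF-8 shows up as U+FFFD; strip it, but leave a trace in the log.
        final String stripped = reinterpreted.replace("\uFFFD", "");
        if (stripped.length() != reinterpreted.length()) {
            LOG.warning("Dropped undecodable bytes from uploaded filename");
        }
        final String name = Normalizer.normalize(stripped, Normalizer.Form.NFC);
        // Fail loudly instead of falling back to a random name.
        if (name.isEmpty()) {
            throw new IllegalArgumentException("Filename is empty after decoding");
        }
        // Same guard as the existing pre-check, now applied to the decoded string.
        if (name.startsWith(".") || name.contains("/.")) {
            throw new IllegalArgumentException("Invalid filename");
        }
        return name;
    }

    public static void main(final String[] args) {
        // Simulate Jersey: NFD UTF-8 bytes read as ISO-8859-1.
        final String raw = new String(
                "a\u0308.txt".getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(decodeAndValidate(raw).equals("\u00E4.txt"));
    }
}
```

Running every check against the single decoded value avoids the situation the second finding describes, where the guard inspects one string and the sanitizer another.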
…malization path

- Remove duplicate `import java.nio.charset.StandardCharsets` from TempFileResource.java (was imported at both lines 49 and 62).
- Remove duplicate `import java.nio.charset.StandardCharsets` from TempFileResourceTest.java (was imported at lines 9 and 50).
- Update unicode filename test to use NFD-encoded input so that the Normalizer.normalize(…, NFC) step in sanitizeFileName is actually exercised (previously the test used NFC literals, making the normalisation call a no-op).

Co-authored-by: Daniel Silva <undefined@users.noreply.github.com>
Remove duplicate java.nio.charset.StandardCharsets import and move both new imports (StandardCharsets, Normalizer) into the existing java.* group in alphabetical order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
#35266)

When a non-browser client sends a genuine ISO-8859-1 filename whose bytes are not valid UTF-8, re-decoding inserts U+FFFD replacement characters. Strip them immediately so they are not persisted in the stored filename. FileUtil.sanitizeFileName() already falls back to a random name if stripping leaves an empty string.

Refs: #35266
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
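To make the U+FFFD case in this commit message concrete, here is a small self-contained sketch using only JDK classes. The class and method names (`ReplacementCharSketch`, `reinterpret`, `strip`) are hypothetical and chosen for illustration; the sample filenames are made up. It relies on the documented behaviour of `new String(byte[], Charset)`, which substitutes the charset's default replacement string for malformed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; only JDK APIs are used.
class ReplacementCharSketch {

    /** Re-interprets a string that was decoded as ISO-8859-1 as if its bytes were UTF-8. */
    static String reinterpret(final String latin1Decoded) {
        return new String(latin1Decoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    /** Removes the U+FFFD replacement characters left behind by malformed UTF-8. */
    static String strip(final String s) {
        return s.replace("\uFFFD", "");
    }

    public static void main(final String[] args) {
        // A genuine ISO-8859-1 name: the byte 0xE9 (e with acute) is not valid UTF-8 on its own.
        final String legacy = "caf\u00E9.txt";
        final String reinterpreted = reinterpret(legacy);
        System.out.println(reinterpreted.contains("\uFFFD")); // the lone 0xE9 byte became U+FFFD
        System.out.println(strip(reinterpreted));             // the accented letter is lost

        // A name made only of such bytes ends up empty after stripping.
        System.out.println(strip(reinterpret("\u00E9\u00E8")).isEmpty());
    }
}
```

The second case is the one where the stripped result is empty, which is where the fallback to a random name mentioned above would apply.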
…zer imports

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Closes #35266
Proposed Changes
- `TempFileResource.sanitizeFileName()`
- `FileUtil.sanitizeFileName()`

Root Cause
Jersey decodes multipart `Content-Disposition` filenames as ISO-8859-1. macOS browsers send filenames in NFD UTF-8, so bytes like `0xCC 0x88` (combining diaeresis) are misread as `Ì` (U+00CC) + an invisible control character, producing `aÌ` instead of `ä`. The original regex `[^\x00-\x7F]` then stripped everything above ASCII 127, silently truncating the filename.

Checklist
- `FileUtil.sanitizeFileName()` still strips all illegal filesystem characters

Additional Info
The fix is isolated to `TempFileResource` (the dotEvergreen upload path). The existing `MultiPartUtils.getBinariesFromMultipart()` path was already correct. `FileUtil.java` was left unchanged.

Note
Medium Risk
Changes multipart upload filename decoding/sanitization in `TempFileResource`, which could alter stored filenames for some non-UTF-8/legacy clients and affect downstream references. Scope is small and covered by a new integration test, but it touches file creation inputs.

Overview
Fixes temp-file multipart uploads to preserve Unicode filenames instead of stripping non-ASCII characters.

`TempFileResource.sanitizeFileName()` now re-interprets Jersey's ISO-8859-1-decoded `Content-Disposition` filename bytes as UTF-8, normalizes to NFC, and sanitizes via `FileUtil.sanitizeFileName()`.

Adds an integration test that simulates Jersey's filename decoding (including macOS NFD input) and asserts the uploaded temp file keeps the expected Unicode filename.
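The decode and normalize steps described above can be reproduced with JDK classes alone. The sketch below is an illustration under stated assumptions, not the integration test from this PR: `FilenameDecodingSketch`, `misdecodeAsLatin1`, and `recover` are hypothetical names, and the `FileUtil.sanitizeFileName()` step is omitted because it lives in dotCMS.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class; it mirrors the described fix but is not the PR's code.
class FilenameDecodingSketch {

    /** Simulates the decoding described in the PR: UTF-8 bytes read as ISO-8859-1. */
    static String misdecodeAsLatin1(final String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Recovers the original bytes, decodes them as UTF-8, and composes to NFC. */
    static String recover(final String latin1Decoded) {
        final String utf8 =
                new String(latin1Decoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
        return Normalizer.normalize(utf8, Normalizer.Form.NFC);
    }

    public static void main(final String[] args) {
        // NFD input as a macOS browser would send it: 'a' followed by a combining diaeresis.
        final String nfd = "a\u0308.txt";
        final String mangled = misdecodeAsLatin1(nfd);
        // The combining mark's first byte (0xCC) surfaces as U+00CC.
        System.out.println(mangled.charAt(1) == '\u00CC');
        // After recovery the name is the single precomposed character U+00E4 plus the extension.
        System.out.println(recover(mangled).equals("\u00E4.txt"));
    }
}
```

The ISO-8859-1 round trip is lossless because that charset maps every byte value 0x00 to 0xFF to the code point with the same number, which is what makes recovering the original bytes possible.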
Reviewed by Cursor Bugbot for commit d724e18.