🐛 Fix / UTF-8 characters in SAMLResponse rejected by xmerl_scan #22

Merged
docJerem merged 1 commit into main from fix/utf-8-characters on May 6, 2026

docJerem (Owner) commented on May 6, 2026

Summary

  • Fixes a production crash on POST /api/sp/consume/... where xmerl_scan rejected any non-ASCII character (e.g. é in an AttributeValue) with {:wfc_Legal_Character, {:bad_character, 233}}.
  • Root cause: to_charlist/1 (and String.to_charlist/1) was applied to the raw UTF-8 XML binary before :xmerl_scan.string/2. to_charlist decodes UTF-8 into Unicode codepoints, but xmerl_scan expects a list of raw UTF-8 bytes and performs its own decoding. It therefore read codepoint 233 (0xE9) as a single byte, which in UTF-8 is the lead byte of a three-byte sequence rather than a complete character, so the input was not valid UTF-8 and the scan bailed out.
  • Switches the four xmerl call sites to :binary.bin_to_list/1: decode_response/2 (DEFLATE + plain-base64 branches in Core.Binding), the encrypted-assertion scan in Core.SP, and Metadata.parse/1. Only the Core.Binding paths were observed in prod; the other two were latent and would have failed on any IdP returning accents in encrypted assertions or in metadata.
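
To make the difference between the two conversions concrete, here is a minimal sketch. The XML fragment and the `scan` helper are hypothetical and are not taken from the codebase; only `String.to_charlist/1`, `:binary.bin_to_list/1` and `:xmerl_scan.string/2` are the calls discussed above.

```elixir
# Hypothetical fragment; the element name is only illustrative.
xml = "<AttributeValue>é</AttributeValue>"

# String.to_charlist/1 decodes UTF-8: one integer per Unicode codepoint.
codepoints = String.to_charlist("é")
# :binary.bin_to_list/1 does no decoding: one integer per byte.
bytes = :binary.bin_to_list("é")

IO.inspect(codepoints, charlists: :as_lists)  # [233]
IO.inspect(bytes, charlists: :as_lists)       # [195, 169]

# Hypothetical helper: turns a rejected document into a tagged tuple
# instead of letting the failure propagate to the caller.
scan = fn chars ->
  try do
    {root, _rest} = :xmerl_scan.string(chars, quiet: true)
    {:ok, root}
  catch
    kind, reason -> {:error, {kind, reason}}
  end
end

# Raw bytes: xmerl performs the UTF-8 decoding itself.
{:ok, root} = scan.(:binary.bin_to_list(xml))
IO.inspect(elem(root, 1))  # :AttributeValue

# Codepoints: the variant this PR removes. Per the summary above it is rejected.
IO.inspect(scan.(String.to_charlist(xml)))
```

For pure ASCII input both conversions produce the same list, which is presumably why the problem only surfaced once an IdP returned accented values.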

Test plan

  • mix test — 199 tests, 0 failures (51 new regression tests).
  • Reproduced original prod crash locally with a SAMLResponse containing é — fails on main, passes on this branch.
  • Stash-and-rerun verified the new tests fail without the fix (real regression coverage, not just decoration).
  • Coverage spans 2-byte UTF-8 (é/è/ñ/ß/ü/ø/ł/č), 3-byte (€/Ω/cyrillic/CJK) and 4-byte (emoji 🎉), in element text and attribute values, on both DEFLATE and non-DEFLATE decode paths, plus <OrganizationName> in metadata.
  • Smoke check on a staging IdP returning a French given_name once deployed.
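
A regression test along these lines could look like the sketch below. It is not the test code from this PR: the module name, the sample list and the `scan_text/1` helper are assumptions made for illustration, and it exercises `:xmerl_scan` directly rather than decode_response/2. It also assumes raw DEFLATE via `:zlib.zip/1` and `:zlib.unzip/1` as a stand-in for the DEFLATE branch.

```elixir
ExUnit.start()

defmodule Utf8ScanSketchTest do
  use ExUnit.Case, async: true

  # Hypothetical samples covering 2-, 3- and 4-byte UTF-8 sequences.
  @samples ["é", "ß", "€", "Ω", "🎉"]

  # Scans a UTF-8 XML binary as raw bytes and returns the root element's text.
  defp scan_text(xml) when is_binary(xml) do
    {root, _rest} = :xmerl_scan.string(:binary.bin_to_list(xml), quiet: true)

    # Index 8 of the xmlElement record tuple is its content list;
    # xmlText tuples carry their value in the fifth position.
    for {:xmlText, _parents, _pos, _lang, value, _type} <- elem(root, 8), into: "" do
      IO.chardata_to_string(value)
    end
  end

  test "plain base64 path keeps multi-byte characters" do
    for sample <- @samples do
      encoded = Base.encode64("<AttributeValue>#{sample}</AttributeValue>")
      assert encoded |> Base.decode64!() |> scan_text() == sample
    end
  end

  test "DEFLATE path keeps multi-byte characters" do
    for sample <- @samples do
      # :zlib.zip/1 and :zlib.unzip/1 use raw DEFLATE without zlib headers.
      encoded =
        "<AttributeValue>#{sample}</AttributeValue>"
        |> :zlib.zip()
        |> Base.encode64()

      assert encoded |> Base.decode64!() |> :zlib.unzip() |> scan_text() == sample
    end
  end
end
```

Swapping `:binary.bin_to_list/1` for `String.to_charlist/1` inside `scan_text/1` is a quick way to check that such a test actually fails without the fix, mirroring the stash-and-rerun step above.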

decode_response/2 (and the encrypted-assertion + metadata parse paths)
piped the XML binary through to_charlist/1 before :xmerl_scan.string/2.
to_charlist decodes UTF-8 into Unicode codepoints, but xmerl_scan
expects raw UTF-8 bytes and does its own decoding. Feeding it
codepoints made it reject any non-ASCII char with
{:wfc_Legal_Character, {:bad_character, _}}, crashing /api/sp/consume
on assertions containing accents.

Switches to :binary.bin_to_list/1 at the four call sites and adds
regression coverage across 2/3/4-byte UTF-8 sequences in element text,
attribute values, DEFLATE and non-DEFLATE response paths, plus
metadata OrganizationName.
docJerem merged commit 64e6e30 into main on May 6, 2026
3 checks passed