Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xml compliance against control characters #62

Merged
merged 3 commits into from
May 26, 2022
Merged

Conversation

NelsonVides
Copy link

@NelsonVides NelsonVides commented May 25, 2022

I checked all the pertaining RFCs, and not only XMPP states being defined for XML 1.0 fifth edition, but also both XML 1.0 or XML 1.1 wouldn’t allow such character on the body (1.1, which is barely used anyway, allows such characters in very special places, but the cdata is not one of them).

Also indeed, I can confirm that our XML parser accepts the escape char inside the body, in the screenshot below we can see that a body with just a space is quickly stripped, but the one with the escape character keeps the \e code in the xml cdata section:
image

Not only that, but the XML specification says that only the following chars are allowed:

Char	   ::=   	#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

and then I can confirm that the xml parser also accepts 0x1c, 0x1d, etc.


The fix is to now, on every character being processed, check if such character is any of the disallowed ones. A few tests are added. This unfortunately has an impact on performance, with an average worsening of around I guess 10-15%

benchmarks

I have the xml payloads from this test generated in my computer, feel free to ask to share them

##### With input 1 #####
Name                ips        average  deviation         median         99th %
master         398.65 K        2.51 μs  ±3113.73%        1.07 μs        2.69 μs
this brch      366.11 K        2.73 μs  ±2870.52%        1.25 μs        2.83 μs

##### With input 2 #####
Name                ips        average  deviation         median         99th %
master         318.30 K        3.14 μs  ±2568.48%        1.33 μs        3.04 μs
this brch      305.15 K        3.28 μs  ±2411.22%        1.47 μs        3.18 μs

##### With input 3 #####
Name                ips        average  deviation         median         99th %
master           6.85 K      145.91 μs    ±54.58%      116.55 μs      424.71 μs
this brch        6.41 K      156.03 μs    ±51.03%      127.66 μs      472.68 μs

##### With input 4 #####
Name                ips        average  deviation         median         99th %
master           5.87 K      170.41 μs    ±51.77%      130.51 μs      516.84 μs
this brch        5.43 K      184.22 μs    ±46.77%      144.83 μs      531.45 μs

##### With input 5 #####
Name                ips        average  deviation         median         99th %
master         844.20 K        1.18 μs  ±6480.92%        0.47 μs        1.60 μs
this brch      825.42 K        1.21 μs  ±6204.77%        0.51 μs        1.50 μs

##### With input big big test #####
Name                ips        average  deviation         median         99th %
master            29.49       33.91 ms    ±33.59%       42.60 ms       50.83 ms
this brch         27.65       36.17 ms    ±32.21%       38.00 ms       63.03 ms

##### With input generated #####
Name                ips        average  deviation         median         99th %
master          77.78 K       12.86 μs   ±614.82%        6.36 μs       38.81 μs
this brch       76.87 K       13.01 μs   ±588.76%        6.68 μs       37.31 μs

##### With input generated-oracle #####
Name                ips        average  deviation         median         99th %
master         102.55 K        9.75 μs   ±878.33%        4.47 μs       18.48 μs
this brch      101.30 K        9.87 μs   ±840.00%        4.52 μs       20.40 μs

##### With input mystic-library #####
Name                ips        average  deviation         median         99th %
master           192.34        5.20 ms    ±30.87%        4.51 ms        7.94 ms
this brch        183.72        5.44 ms    ±29.26%        4.73 ms        8.25 ms

A possible solution would be to make it configurable, so that the control char validation could be given as a flag to the erlang call. But on the C level that would probably make the code size grow considerably, as part of the optimisations in rapidxml is the auto-generated code for each branch, instead of one code with many branching conditionals.

Copy link

@arcusfelis arcusfelis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok

@arcusfelis arcusfelis merged commit 1fe6419 into master May 26, 2022
@arcusfelis arcusfelis deleted the xml_compliance branch May 26, 2022 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants