Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

base64 and widebase64 modifiers #1185

Open
wants to merge 10 commits into
base: master
from
Open

base64 and widebase64 modifiers #1185

wants to merge 10 commits into from

Conversation

@wxsBSD
Copy link
Collaborator

wxsBSD commented Dec 23, 2019

Add two new modifiers: base64 and widebase64.

These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.

I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:

VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90

Those three strings are then transformed into a RE that looks like this:

(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)

That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.

I've also implemented support for specifying custom alphabets:

base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")

This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.

To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.

Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.

wxsBSD added 4 commits Dec 15, 2019
Add two new modifiers: base64 and widebase64.

These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.

I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:

VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90

Those three strings are then transformed into a RE that looks like this:

(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)

That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.

I've also implemented support for specifying custom alphabets:

base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")

This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.

To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.

Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.
@wxsBSD

This comment has been minimized.

Copy link
Collaborator Author

wxsBSD commented Dec 23, 2019

I'm putting this up for review so I can gather comments and iterate on it. I think this belongs in its own set of source files, instead of forcing it into re.(c|h) for the most part. I'll move things over to base64.(c|h) and get them into the build system, but would first like to gather feedback on the design.

The details are in the documentation (included in this PR) but the short version is that:

$a = "This program cannot" base64 will search for the three base64 representations of that string and $a = "This program cannot" base64 wide will apply the wide modifier and then the base64 - it does not include the plaintext version if you use base64.

The widebase64 modifier works just like the base64 modifier but inserts null bytes between each character in the base64 strings.

My first implementation of this was generating the RE AST by hand, but it was far too fragile so I switched over to the current method (generating an RE string and letting the RE compiler do the heavy lifting for me). This is an easy approach to this but I'm not super happy with it either. I think a better long-term answer here is to have a way to have different representations of strings but it is still only referenced by a single string internally. This is a lot more work and may not be worth the effort until a more concrete use-case is found.

wxsBSD added 2 commits Dec 30, 2019
Specifically, you can only use these modifiers with text strings (hex strings
and regular expressions are not supported) and you can't use the xor or nocase
modifiers in combination with these.
@plusvic

This comment has been minimized.

Copy link
Collaborator

plusvic commented Jan 7, 2020

I have two concerns about these modifiers, which are:

  1. The more modifiers we have, the larger the number of possible combinations and therefore it's more difficult to understand how they interact between them. For example, a user may wonder what happens if base64 and nocase are used together. Does it take all possible case combinations for the original string and apply base64 to them? Does it find any case combination of the resulting base64 string? It's an error?. In this case it will be an error, but the more intuitive options if probably the first one. I'm afraid that by adding more modifiers is going to be difficult both for the user and for ourselves to reason about how the different modifiers interact between them.

  2. These modifiers are convenient, but not necessary. Other modifiers like nocase and xor are necessary because the alternative would be writing hundreds or even thousands of possible strings, but base64 can be replaced by just three strings instead of one. It's certainly easier for the user to use base64 than having to deal with writing the three strings by herself, but considering the maintenance burden and the added complexity (even for the user) I'm not convinced about adding them.

Thoughts?

@wxsBSD

This comment has been minimized.

Copy link
Collaborator Author

wxsBSD commented Jan 7, 2020

The more modifiers we have, the larger the number of possible combinations and therefore it's more difficult to understand how they interact between them. For example, a user may wonder what happens if base64 and nocase are used together. Does it take all possible case combinations for the original string and apply base64 to them? Does it find any case combination of the resulting base64 string? It's an error?. In this case it will be an error, but the more intuitive options if probably the first one. I'm afraid that by adding more modifiers is going to be difficult both for the user and for ourselves to reason about how the different modifiers interact between them.

I understand this concern but I believe the benefit of a more expressive and readable language is worth it. You are right that it does increase the cognitive burden on users, but with good documentation (which I tried to provide a start to) we can help users understand what is and what is not supported. With tests, which I tried to make very thorough, I think we can lessen the burden on ourselves for maintenance.

These modifiers are convenient, but not necessary. Other modifiers like nocase and xor are necessary because the alternative would be writing hundreds or even thousands of possible strings, but base64 can be replaced by just three strings instead of one. It's certainly easier for the user to use base64 than having to deal with writing the three strings by herself, but considering the maintenance burden and the added complexity (even for the user) I'm not convinced about adding them.

You're right that these are more for convenience but not having them means each user needs to write code (or find someone else's) which does the base64 encoding with proper padding and trimming of the encoded strings. It's not hard to do, and there are a handful of implementations out there already. In fact, when I was writing this I was using one which turned out to be buggy. The fact that we can provide a convenient way for users to express this while also increasing readability makes this at least worth considering. The alternative is users needing to write or find a script to do this, and any bugs that may include. Also, this is more expressive than the versions I've seen that don't let you specify a custom alphabet or even wide support.

I'd much rather see $a = "This program cannot" base64 instead of the alternatives. :)

@thehellu

This comment has been minimized.

Copy link

thehellu commented Jan 8, 2020

Jumping on the debate as a Yara user, as some feedback was asked on Twitter.

At the moment, and probably as other folks, I use a third-party script to generate the 3 possible base64 strings for every input. This technique proved to be successful to monitor some threat actors.
When you search a specific string without knowing if it will be ascii or wide, you need to generate not only 3 but 6 strings. In Yara, this could be replaced by one line, with the "ascii" and "wide" modifier in addition to "base64" and "widebase64". So, as a user, those new modifiers would be useful, convenient, and would make rules more readable as they would turn 6 lines into 1. The ability to specify the alphabet is a nice bonus.

Regarding the risk of "user confusion", my feeling is that the basic user will not bother with the modifier, and the advanced user will read the documentation.

I cannot say how much burden it adds as a developer, my first thought was that the base64 algorithm is not going to change, so once the code is written and tested you should be fine, but then I realized you will need to handle the modifier against every new modifier that you may add in the future. You are in the best position to answer this question, and I will still cherish Yara if you decide not to implement the feature :)

@JohnLaTwC

This comment has been minimized.

Copy link

JohnLaTwC commented Jan 8, 2020

I strongly support this addition.

Base64 is a very common encoding format found across most major languages and platforms. It is used to encode multi-line parameters & scripts, encode binary content (e.g. compressed data), and to support common file encodings (e.g. X509 certificate encodings).

Attackers make use of Base64 encoding very often because it allows them to utilize native functionality to achieve simple obfuscation objectives. Payloads across .NET, PowerShell, Python, PHP, VBA Macros, and beyond have used base64. It is probably the #1 encoding used by attackers for simple evasion and obfuscation goals.

Matching encoded content is also the way an entry infosec practitioner can progress beyond matching on "plaintext strings" to find more advanced payloads. These payloads require additional analysis skills and tools (e.g. CyberChef) that help practitioners gain the skills necessary to defend their environments. Giving defenders more power in a convenient way simplifies the progression of mastery.

While this feature can be approximated by generating the appropriate strings in standalone tools, native support for this encoding would simplify the effort required by defenders to match content and eliminate points of error (e.g. forgetting to including multiple base64 strings to account for leading and trailing bytes, neglecting unicode, etc).

@plusvic

This comment has been minimized.

Copy link
Collaborator

plusvic commented Jan 8, 2020

Given the arguments, and how passionately @wxsBSD has defended this feature, I'm changing my mind. Let me do the final review before merging it.

libyara/grammar.y Show resolved Hide resolved
libyara/parser.c Outdated Show resolved Hide resolved
@wxsBSD

This comment has been minimized.

Copy link
Collaborator Author

wxsBSD commented Jan 13, 2020

I think all of the code I added in re.c belongs in a new file (I'm thinking base64.c) - really the majority of the code has nothing to do with the re code. By moving it to it's own file it will make things cleaner. Do you want me to do that?

@plusvic

This comment has been minimized.

Copy link
Collaborator

plusvic commented Jan 13, 2020

I totally agree on creating the base64.c file.

wxsBSD added 2 commits Jan 13, 2020
I haven't tested the build with bazel but the normal build process works and
tests are passing.
@plusvic

This comment has been minimized.

Copy link
Collaborator

plusvic commented Feb 4, 2020

I was about to merge this PR and noticed that for some reason the tests in Travis CI are failing. They pass in my local machine though.

@wxsBSD

This comment has been minimized.

Copy link
Collaborator Author

wxsBSD commented Feb 4, 2020

Not sure why those are failing and I can’t find a way to get the test logs to see what failed exactly. :(

@wxsBSD

This comment has been minimized.

Copy link
Collaborator Author

wxsBSD commented Feb 5, 2020

The windows builds are broken because I didn't update the VS solution files - I can download a copy of VS in a VM if you want me to fix that, but it may be a day or two before I can get to it.

I have no idea why those tests are failing and I can't seem to get travis to give me the logs I need to debug it. Maybe it makes sense to modify the Makefile so it cats out the test-suite.log if tests fail? This way we can see the test failures in travis (and it just makes it easier to not have to cat that file all the time during development).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.