Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
Sign upbase64 and widebase64 modifiers #1185
Conversation
Add two new modifiers: base64 and widebase64.
These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.
I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:
VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90
Those three strings are then transformed into a RE that looks like this:
(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)
That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.
I've also implemented support for specifying custom alphabets:
base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")
This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.
To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.
Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.
This comment has been minimized.
This comment has been minimized.
|
I'm putting this up for review so I can gather comments and iterate on it. I think this belongs in its own set of source files, instead of forcing it into re.(c|h) for the most part. I'll move things over to base64.(c|h) and get them into the build system, but would first like to gather feedback on the design. The details are in the documentation (included in this PR) but the short version is that:
The My first implementation of this was generating the RE AST by hand, but it was far too fragile so I switched over to the current method (generating an RE string and letting the RE compiler do the heavy lifting for me). This is an easy approach to this but I'm not super happy with it either. I think a better long-term answer here is to have a way to have different representations of strings but it is still only referenced by a single string internally. This is a lot more work and may not be worth the effort until a more concrete use-case is found. |
This comment has been minimized.
This comment has been minimized.
|
I have two concerns about these modifiers, which are:
Thoughts? |
This comment has been minimized.
This comment has been minimized.
I understand this concern but I believe the benefit of a more expressive and readable language is worth it. You are right that it does increase the cognitive burden on users, but with good documentation (which I tried to provide a start to) we can help users understand what is and what is not supported. With tests, which I tried to make very thorough, I think we can lessen the burden on ourselves for maintenance.
You're right that these are more for convenience but not having them means each user needs to write code (or find someone else's) which does the base64 encoding with proper padding and trimming of the encoded strings. It's not hard to do, and there are a handful of implementations out there already. In fact, when I was writing this I was using one which turned out to be buggy. The fact that we can provide a convenient way for users to express this while also increasing readability makes this at least worth considering. The alternative is users needing to write or find a script to do this, and any bugs that may include. Also, this is more expressive than the versions I've seen that don't let you specify a custom alphabet or even wide support. I'd much rather see |
This comment has been minimized.
This comment has been minimized.
thehellu
commented
Jan 8, 2020
|
Jumping on the debate as a Yara user, as some feedback was asked on Twitter. At the moment, and probably as other folks, I use a third-party script to generate the 3 possible base64 strings for every input. This technique proved to be successful to monitor some threat actors. Regarding the risk of "user confusion", my feeling is that the basic user will not bother with the modifier, and the advanced user will read the documentation. I cannot say how much burden it adds as a developer, my first thought was that the base64 algorithm is not going to change, so once the code is written and tested you should be fine, but then I realized you will need to handle the modifier against every new modifier that you may add in the future. You are in the best position to answer this question, and I will still cherish Yara if you decide not to implement the feature :) |
This comment has been minimized.
This comment has been minimized.
JohnLaTwC
commented
Jan 8, 2020
|
I strongly support this addition. Base64 is a very common encoding format found across most major languages and platforms. It is used to encode multi-line parameters & scripts, encode binary content (e.g. compressed data), and to support common file encodings (e.g. X509 certificate encodings). Attackers make use of Base64 encoding very often because it allows them to utilize native functionality to achieve simple obfuscation objectives. Payloads across .NET, PowerShell, Python, PHP, VBA Macros, and beyond have used base64. It is probably the #1 encoding used by attackers for simple evasion and obfuscation goals. Matching encoded content is also the way an entry infosec practitioner can progress beyond matching on "plaintext strings" to find more advanced payloads. These payloads require additional analysis skills and tools (e.g. CyberChef) that help practitioners gain the skills necessary to defend their environments. Giving defenders more power in a convenient way simplifies the progression of mastery. While this feature can be approximated by generating the appropriate strings in standalone tools, native support for this encoding would simplify the effort required by defenders to match content and eliminate points of error (e.g. forgetting to including multiple base64 strings to account for leading and trailing bytes, neglecting unicode, etc). |
This comment has been minimized.
This comment has been minimized.
|
Given the arguments, and how passionately @wxsBSD has defended this feature, I'm changing my mind. Let me do the final review before merging it. |
This comment has been minimized.
This comment has been minimized.
|
I think all of the code I added in |
This comment has been minimized.
This comment has been minimized.
|
I totally agree on creating the |
This comment has been minimized.
This comment has been minimized.
|
I was about to merge this PR and noticed that for some reason the tests in Travis CI are failing. They pass in my local machine though. |
This comment has been minimized.
This comment has been minimized.
|
Not sure why those are failing and I can’t find a way to get the test logs to see what failed exactly. :( |
This comment has been minimized.
This comment has been minimized.
|
The windows builds are broken because I didn't update the VS solution files - I can download a copy of VS in a VM if you want me to fix that, but it may be a day or two before I can get to it. I have no idea why those tests are failing and I can't seem to get travis to give me the logs I need to debug it. Maybe it makes sense to modify the Makefile so it cats out the test-suite.log if tests fail? This way we can see the test failures in travis (and it just makes it easier to not have to cat that file all the time during development). |
wxsBSD commentedDec 23, 2019
Add two new modifiers: base64 and widebase64.
These modifiers take the given text string and generate 3 different strings
being careful to trim off the bytes which are dependent upon leading or
trailing bytes in the larger search space.
I've implemented it by slightly cheating. I generate all the search strings in
a list and then creating one large string suitable for the RE compiler to deal
with. For example, the string "This program cannot" generates these three
base64 encoded strings:
VGhpcyBwcm9ncmFtIGNhbm5vd
RoaXMgcHJvZ3JhbSBjYW5ub3
UaGlzIHByb2dyYW0gY2Fubm90
Those three strings are then transformed into a RE that looks like this:
(VGhpcyBwcm9ncmFtIGNhbm5vd|RoaXMgcHJvZ3JhbSBjYW5ub3|UaGlzIHByb2dyYW0gY2Fubm90)
That string is then passed to the RE compiler for parsing and AST generation.
The AST is then emitted into the appropriate spot and YARA believes it was
given a regex from that point on.
I've also implemented support for specifying custom alphabets:
base64("!@#$%^&*(){}[].,|ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstu")
This means that I have to be careful to escape any special RE characters in the
string passed to the compiler. The base64 alphabet has to be 64 bytes long, and
does support escaped bytes properly too.
To avoid the need to deal with escaping I had a first implementation which
attempted to generate the AST by hand, which was mostly working but was very
cumbersome to maintain. In doing this I ended up improving yr_re_print_node()
so that it indents the tree properly, which made debugging that attempt easier.
My apologies for including it in this diff. I can split it out if you want.
Lastly, I did most of the work in re.c but I think it belongs in it's own
base64.c. I'm happy to move it out of re.c if you would prefer.