Performance & hash based whitelisting #592

Closed
gwillem opened this issue Jan 6, 2017 · 5 comments

Comments

gwillem commented Jan 6, 2017

What is the recommended way to implement hash-based whitelists in YARA? Some projects, such as php malware finder, do it like this:

import "hash"

global private rule Whitelist {
    condition:
        hash.sha1(0, filesize) != "c9cf738d8b1a8a77f6d200f327c5d4ec8201a99d" and
        hash.sha1(0, filesize) != "40a0a6e5ff86f75e6723e0008ddae29b1ed384c8" and
        [...]
}

However, this seems to take O(n) time in the number of hashes, while I would expect O(1).
Proof: timings with a single SHA-1 hash, then with 100 hashes:

$ time yara -r whitelist-1-hash.yar magento-2.0
real	0m2.780s
user	0m4.056s
sys	0m2.344s

$ time yara -r whitelist-100-hashes.yar magento-2.0
real	0m38.553s
user	2m15.468s
sys	0m2.348s

So checking 100 hashes takes about 33 times as much user CPU time as checking 1 hash (and roughly 14 times the wall-clock time). What is the best way to whitelist thousands of hashes?

gwillem changed the title from "Performance & whitelisting" to "Performance & hash based whitelisting" on Jan 6, 2017

aschuster99 commented Jan 6, 2017

I suggest calculating the hash externally (once per sample!) and passing it to your ruleset as an external variable.

$ sha1deep -b sample
bb1ab80641f80fdd0e6258a032a8e9dd9f2f5ee6 sample

Test run with 100 hashes:

$ time yara -d ext_hash="bb1ab80641f80fdd0e6258a032a8e9dd9f2f5ee6" ruleset.yar sample
real    0m0.005s
user    0m0.001s
sys     0m0.002s

Your whitelist rule would now look like this:

$ cat ruleset.yar
rule whitelist {
  condition:
    ext_hash != "1e6f6dcbc28d0fdcd01d49a71a90d0e2e447c96a" and
    ext_hash != "f0d3ad63910d8d3051bb4dd7af8513652259d796" and
    ext_hash != "47e000551379fe895e4f13a34ea7eb77db5439e2" and
    ...
}
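
With thousands of hashes, a rule like this is probably easier to generate than to write by hand. The following is only a sketch of one way to do that in Python: the helper names (`sha1_of_file`, `build_whitelist_rule`, `yara_command`) are made up for illustration, and the rule name and the `ext_hash` variable are simply taken from the example above. It hashes the sample once and builds the `yara -d ext_hash=...` command line without running it.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_whitelist_rule(hashes, variable="ext_hash", rule_name="whitelist"):
    """Render a YARA rule that holds only if `variable` equals none of `hashes`."""
    # Normalise, de-duplicate and sort so the generated file is stable.
    unique = sorted({h.strip().lower() for h in hashes if h.strip()})
    if not unique:
        raise ValueError("need at least one hash")
    clauses = " and\n    ".join(f'{variable} != "{h}"' for h in unique)
    return f"rule {rule_name} {{\n  condition:\n    {clauses}\n}}\n"


def yara_command(ruleset_path, sample_path):
    """Build the argv for one scan; the sample is hashed exactly once."""
    digest = sha1_of_file(sample_path)
    return ["yara", "-d", f"ext_hash={digest}", ruleset_path, sample_path]
```

The string returned by `build_whitelist_rule` can be written to `ruleset.yar`, and the list from `yara_command` can be passed to `subprocess.run` if the `yara` binary is on the path.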


gwillem commented Jan 6, 2017

Thanks! That's what I do now in Python, but it's 25% slower than YARA's built-in hash module (for single hashes, that is).

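import hashlib

# Wrapper added for runnability: `rules` is a compiled yara.Rules object,
# `whitelist` a set of SHA-1 hex digests.
def scan(path, rules, whitelist):
    with open(path, 'rb') as fh:
        data = fh.read()
    digest = hashlib.sha1(data).hexdigest()
    if digest in whitelist:
        return False  # whitelisted: skip the YARA scan
    return rules.match(data=data)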

So I was hoping YARA could memoize the results of its hash computations (or use some other caching mechanism).
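
To make the memoization idea concrete, here is a rough Python sketch of what caching per (offset, length) range could look like. It only illustrates the concept and is not how YARA implements it; the `HashCache` class and `is_whitelisted_chain` function are hypothetical names.

```python
import hashlib


class HashCache:
    """Memoize SHA-1 digests per (offset, length) range of one buffer."""

    def __init__(self, data):
        self.data = data
        self.computed = 0  # number of digests actually calculated
        self._cache = {}

    def sha1(self, offset, length):
        key = (offset, length)
        if key not in self._cache:
            self.computed += 1
            chunk = self.data[offset:offset + length]
            self._cache[key] = hashlib.sha1(chunk).hexdigest()
        return self._cache[key]


def is_whitelisted_chain(cache, whitelist):
    """Mimic a chain of `hash.sha1(0, filesize) != "..."` comparisons.

    Every comparison asks for the digest again, as the rule does, but only
    the first request pays for hashing the data.
    """
    size = len(cache.data)
    return any(cache.sha1(0, size) == h for h in whitelist)
```

With 100 non-matching hashes in `whitelist`, `cache.sha1` is called 100 times but `cache.computed` stays at 1, so the cost of the comparisons no longer scales with the size of the file.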


plusvic commented Jan 9, 2017

@gwillem YARA 3.5 doesn't cache hash results, but the latest version in master does.

plusvic closed this as completed on Jan 9, 2017

gwillem commented Jan 9, 2017

Awesome, thank you!


jvoisin commented Jan 18, 2017

Does anyone have benchmarks to share about this?
For the record, the commit implementing this behaviour is 22ce5e0.


4 participants