initial commit for backend-smda #355

danielplohmann · 2020-10-29T11:06:28Z

Hey all!

We propose to integrate SMDA as a disassembler backend into capa to enable full Python 3 support.
SMDA is a lightweight and fast recursive disassembler built on top of capstone, covering PE and ELF files in 32/64bit. It is compatible with both Python 2 and Python 3.

The pull request contains an implementation structurally oriented along the vivisect implementation and provides all capa features covered by vivisect. We also made sure that the code is in line with the style requirements and has full test coverage using the tests/fixtures.get_extractor() method.

This initial PR sets SMDA as a default backend for Python 3.
This choice was made to allow you an easier evaluation of the contribution.

As discussed out of band, we don't necessarily ask to make SMDA the default backend for Python 3. Instead, we propose letting users choose their preferred backend through a CLI parameter, which could also be used to pass additional parameters to the disassembler backend (e.g. SMDA can process shellcode and memory dumps, and can use ApiScout to reconstruct dynamic imports, increasing the potential matches of rules).

This PR was a collaborative effort of @jcrussell and @danielplohmann during GeekWeek 7.
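
The backend-selection idea proposed above could look roughly like the following. This is only a sketch: the `--backend` flag name, the `BACKENDS` tuple, and `build_parser` are assumptions made for illustration and are not part of this PR or of capa's CLI at this point.

```python
import argparse

# Hypothetical backend names; the thread only mentions vivisect and SMDA.
BACKENDS = ("vivisect", "smda")


def build_parser(default_backend):
    """Build a parser with an assumed --backend option (not defined by this PR)."""
    parser = argparse.ArgumentParser(description="backend selection sketch")
    parser.add_argument("sample", type=str, help="path to the sample to analyze")
    parser.add_argument(
        "--backend",
        choices=BACKENDS,
        default=default_backend,
        help="disassembler backend to use",
    )
    return parser


# The default could depend on the interpreter, e.g. SMDA on Python 3.
args = build_parser("smda").parse_args(["--backend", "vivisect", "sample.exe_"])
```

Keeping the default as a parameter of the parser builder would let the Python 2 and Python 3 code paths pick different defaults without duplicating the argument definition.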

williballenthin

this is a really great piece of work, nice job.

I can tell from the changes you made that you've gotten this working quite successfully. I also appreciate that the style is consistent, both within the PR and with our existing code base - that really helps us integrate the changes a lot more easily.

The biggest thing I'd like to see before merging is the addition of feature unit tests against the SMDA backend. there are 100 or so unit tests that ensure that each backend behaves essentially the same. fortunately, registering the SMDA backend should be pretty easy (if not, it's a bug, and I'd like to work to make it easy for you).

would you duplicate test_viv_features.py as test_smda_features.py and update it appropriately?


williballenthin · 2020-10-29T15:58:13Z

capa/main.py

+            # https://stackoverflow.com/a/22947334/ offers a solution and decoding using getfilesystemencoding works
+            # in our testing, however other sources suggest `sys.stdin.encoding` (https://stackoverflow.com/q/4012571/)
+            "sample",
+            type=str,


a fix for #354, thanks!

williballenthin · 2020-10-29T15:58:39Z

capa/main.py

@@ -550,7 +573,7 @@ def main(argv=None):
            # during the load of the RuleSet, we extract subscope statements into their own rules
            # that are subsequently `match`ed upon. this inflates the total rule count.
            # so, filter out the subscope rules when reporting total number of loaded rules.
-            len(filter(lambda r: "capa/subscope-rule" not in r.meta, rules.rules.values())),
+            len([i for i in filter(lambda r: "capa/subscope-rule" not in r.meta, rules.rules.values())]),
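
For context on the change above: in Python 3, `filter()` returns a lazy iterator, so calling `len()` on it raises `TypeError`. A minimal sketch of the problem and two equivalent ways to count, using a made-up `rules` dictionary rather than capa's real `RuleSet`:

```python
# Stand-in for rule metadata; the real objects in capa look different.
rules = {
    "rule a": {"name": "rule a"},
    "rule b": {"name": "rule b", "capa/subscope-rule": True},
    "rule c": {"name": "rule c"},
}


def is_top_level(meta):
    """True for rules that are not generated subscope rules."""
    return "capa/subscope-rule" not in meta


# On Python 3, filter() yields an iterator, which has no len().
try:
    len(filter(is_top_level, rules.values()))
    filter_has_len = True
except TypeError:
    filter_has_len = False

# Two ways to count that behave the same on Python 2 and Python 3.
count_via_list = len([m for m in rules.values() if is_top_level(m)])
count_via_sum = sum(1 for m in rules.values() if is_top_level(m))
```

The `sum(1 for ...)` form avoids building an intermediate list, although for a few hundred rules the difference is negligible.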


williballenthin · 2020-10-29T16:05:54Z

also, I broke master by merging a rule but not syncing the testfiles submodule. that's since been fixed. if you can sync this branch from master then we should see the test results show up in CI.


danielplohmann · 2020-10-30T14:54:47Z

> this is a really great piece of work, nice job.
>
> I can tell from the changes you made that you've gotten this working quite successfully. I also appreciate that the style is consistent, both within the PR and with our existing code base - that really helps us integrate the changes a lot more easily.
>
> The biggest thing I'd like to see before merging is the addition of feature unit tests against the SMDA backend. there are 100 or so unit tests that ensure that each backend behaves essentially the same. fortunately, registering the SMDA backend should be pretty easy (if not, it's a bug, and I'd like to work to make it easy for you).
>
> would you duplicate test_viv_features.py as test_smda_features.py and update it appropriately?

Alright, here we go!
Fully implemented and (almost) passing tests for test_smda_features.py.

We found that the only failing test (x64 nested thunks) does not succeed because disassemblers disagree where to define function borders. It seems vivisect wants 0x140059342 as function start, while IDA/SMDA want 0x14005E0C0 as a function start and say 0x140059342 is just a function with a single jump instruction to 0x14005E0C0.
In fixtures.py, I commented on this and also proposed another, simpler function, 0x1400615c0, which could serve as a possible replacement for an x64 nested thunks test. It calls IsProcessorFeaturePresent via double indirection, but this function appears to be missed by vivisect. :(
Please advise how to proceed.
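
To illustrate what "nested thunk" resolution means in the discussion above, here is a toy sketch. It is not the SMDA or vivisect implementation: `resolve_nested_thunk`, the `jumps` and `imports` dictionaries, and all addresses are invented for illustration.

```python
def resolve_nested_thunk(address, jumps, imports, max_depth=5):
    """Follow single-jump functions until an imported API is reached.

    jumps:   maps a thunk's start address to its jump target
    imports: maps an import address to the API name
    Returns the API name, or None if the chain does not end at an import.
    """
    for _ in range(max_depth + 1):
        if address in imports:
            return imports[address]
        if address not in jumps:
            return None
        address = jumps[address]
    return None


# Illustrative addresses only: two thunks in a row, then the import.
jumps = {0x1000: 0x2000, 0x2000: 0x3000}
imports = {0x3000: "IsProcessorFeaturePresent"}

api = resolve_nested_thunk(0x1000, jumps, imports)
```

Whether a backend reports the API at the outer call site depends on where it draws function borders, which is the disagreement described above.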

mr-tz

very cool, this looks great, I think the following items are left to do:

- find a new nested thunk test case that works across the board (explain insn/api: x64 nested thunk test case #356)
- decide how to use/specify the analysis backend
  - specify command-line argument(s)
  - default to vivisect for Py2 and SMDA for Py3?


mr-tz · 2020-11-03T13:35:18Z

capa/main.py

+            # https://stackoverflow.com/a/22947334/ offers a solution and decoding using getfilesystemencoding works
+            # in our testing, however other sources suggest `sys.stdin.encoding` (https://stackoverflow.com/q/4012571/)
+            "sample",
+            type=str,



williballenthin

lgtm

@mr-tz @mike-hunhoff will one of you take a second look and then do the merge?

mike-hunhoff

Awesome, thank you for all of your hard work on this implementation! It's really cool to see capa running in Python 3 and your approach to implementing a new extractor for capa. I left a few comments to consider but overall this looks great!

I tested locally and encountered errors while running the smda tests on Windows with Python v3.7.8 (see below):

> C:\Exclusions\capa>python -m pytest tests\test_smda_features.py
...
=========================================================================================================== short test summary info ===========================================================================================================
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True0] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True1] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True0] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True1] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800017D0-characteristic(peb access)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001068-characteristic(gs access)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001068-characteristic(cross section flow)-False] - struct.error: unpack requires a buffer of 4 bytes

Results (47.59s):
     104 passed
       9 failed
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True0]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True1]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True0]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True1]
         - tests/test_smda_features.py:13 test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800017D0-characteristic(peb access)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001068-characteristic(gs access)-True]
         - tests/test_smda_features.py:13 test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001068-characteristic(cross section flow)-False]

The failed tests appear to be the result of the following error:

C:\Exclusions\capa>python -m capa.main tests\data\al-khaser_x64.exe_
WARNING:capa:skipping non-.yml file: 11736.py
loading : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 422/422 [00:00<00:00, 1227.46     rules/s]
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Exclusions\capa\capa\main.py", line 713, in <module>
    sys.exit(main())
  File "C:\Exclusions\capa\capa\main.py", line 595, in main
    extractor = get_extractor(args.sample, args.format, disable_progress=args.quiet)
  File "C:\Exclusions\capa\capa\main.py", line 319, in get_extractor
    return get_extractor_py3(path, format, disable_progress=disable_progress)
  File "C:\Exclusions\capa\capa\main.py", line 308, in get_extractor_py3
    smda_report = smda_disasm.disassembleFile(path)
  File "smda\Disassembler.py", line 37, in disassembleFile
    loader = FileLoader(file_path, map_file=True)
  File "smda\utility\FileLoader.py", line 17, in __init__
    self._loadFile()
  File "smda\utility\FileLoader.py", line 32, in _loadFile
    self._base_addr = loader.getBaseAddress(self._raw_data)
  File "smda\utility\PeFileLoader.py", line 94, in getBaseAddress
    base_addr = struct.unpack("L", binary[pe_offset + 0x30:pe_offset + 0x38])[0]
struct.error: unpack requires a buffer of 4 bytes

Let's address this error and then we should be ready for a merge 🚀


danielplohmann · 2020-11-06T09:46:26Z

Hehe, seems like I fell into the struct format char size trap - I now changed this to Q, so it should be fine across all platforms.
All other comments are addressed.
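
The "format char size trap" mentioned above comes from `struct` using the platform's native C sizes when no byte-order prefix is given: `"L"` is 4 bytes on Windows but 8 bytes on 64-bit Linux. The sketch below shows the effect with made-up header bytes. Note that it uses `"<Q"` (explicit little-endian, fixed 8 bytes), which is an assumption on my part; the fix described in this thread uses plain `"Q"`.

```python
import struct

# Eight sample bytes standing in for a 64-bit image base field
# (little-endian 0x0000000140000000); illustrative data only.
raw = bytes.fromhex("0000004001000000")

# Native mode: the size of "L" depends on the platform's C `unsigned long`.
native_long_size = struct.calcsize("L")

# Standard mode (with a byte-order prefix): sizes are fixed everywhere.
fixed_quad_size = struct.calcsize("<Q")

# A 4-byte format applied to an 8-byte slice fails, like in the traceback.
try:
    struct.unpack("<L", raw)
    four_byte_format_failed = False
except struct.error:
    four_byte_format_failed = True

# An 8-byte format matches the slice length on every platform.
base_addr = struct.unpack("<Q", raw)[0]
```

Using an explicit byte-order prefix also pins the endianness, which matters for PE fields since they are stored little-endian regardless of the host.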

What's still open for discussion:

1. find a new nested thunk test case that works across the board (explain insn/api: x64 nested thunk test case #356)
2. decide how to use/specify the analysis backend
   i. specify command-line argument(s)
   ii. default to vivisect for Py2 and SMDA for Py3?

1) I just checked this in my Python 2 venv and it looks like my proposed test replacement for an x64 nested thunk is processed correctly by vivisect as well. If you can confirm this, we would be able to close explain insn/api: x64 nested thunk test case #356 as well.
2.i) I would go ahead in the near future and prepare this in a separate PR.
2.ii) That is at least the case for now. As stated initially, feel free to swap out SMDA as the default at any time if another Python 3 disassembler makes a good replacement. 2.i) would then at least conveniently allow SMDA to remain available.

mr-tz · 2020-11-09T12:50:13Z

Thanks, the test case is great! This closes explain insn/api: x64 nested thunk test case #356
That works for me.

Before we merge, can you please:

- Add a check to verify Python3, see initial commit for backend-smda #355 (comment); ref: https://stackoverflow.com/questions/62569556/how-to-graceful-exit-if-the-required-verion-of-python-is-not-used
- Investigate why the Python 3.9.0-rc.1 setup fails, see https://github.com/fireeye/capa/pull/355/checks?check_run_id=1374027450#step:5:249
  - Do we need a workaround?
- Merge in the improvements I suggest in https://github.com/fireeye/capa/compare/mrtz-backend-smda, see Tests/Improvements backend smda #360


danielplohmann · 2020-11-09T15:33:07Z

@mr-tz I've just merged your improvements, thank you!
I've put the runtime check for Python 3 into the initialization of SmdaFeatureExtractor and throw an UnsupportedRuntimeError in case this condition is not met. Is this sufficient or would anyone prefer a different way of handling this?
The failure of Python 3.9.0-rc.1 is pretty unfortunate, as it is due to LIEF not being available via PyPI for Python 3.9 yet. There are recent commits in their repo working towards supporting it. It also seems to be possible already to compile it from source for Python 3.9 using the GitHub repository, so I assume it will be available for Python 3.9 in the near future.
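
The runtime check described above could be sketched as follows. Only the exception name `UnsupportedRuntimeError` and the idea of checking inside the extractor's initialization come from this thread; the class body, the `require_python3` helper, and the `version_info` parameter (added so the check can be exercised without a Python 2 interpreter) are assumptions.

```python
import sys


class UnsupportedRuntimeError(RuntimeError):
    """Raised when the running interpreter cannot support the extractor."""


def require_python3(version_info=None):
    """Raise UnsupportedRuntimeError unless the major version is at least 3."""
    version_info = sys.version_info if version_info is None else version_info
    if version_info[0] < 3:
        raise UnsupportedRuntimeError("the SMDA backend requires Python 3")


class SmdaFeatureExtractorSketch(object):
    """Hypothetical stand-in for the real extractor class."""

    def __init__(self, version_info=None):
        # Fail early, before any disassembly work starts.
        require_python3(version_info)


extractor = SmdaFeatureExtractorSketch()
```

Raising a dedicated exception rather than calling `sys.exit()` leaves the decision about how to report the problem to the caller, which seems preferable when capa is used as a library.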

mr-tz · 2020-11-09T19:51:19Z

Great, this looks good to me then. Well done and many thanks again!
@williballenthin, @mike-hunhoff are we good to merge?

williballenthin · 2020-11-09T20:03:48Z

let's do it!

initial commit for backend-smda

3682292

williballenthin requested review from mr-tz, mike-hunhoff and williballenthin October 29, 2020 15:40

williballenthin added the enhancement New feature or request label Oct 29, 2020

williballenthin requested changes Oct 29, 2020


williballenthin mentioned this pull request Oct 29, 2020

capa can't be used as a library on py3 #50

Closed

Daniel Plohmann (jupiter) and others added 10 commits October 29, 2020 17:38

Merge remote-tracking branch 'origin/master' into backend-smda

669d348

addressing review

60ddf04

tests: add smda backend test

b12d0b6

40 failed, 73 passed.

down to 14 failed

74b2c18

add check for pointer to string

8f6a46e

Check if memory referenced is a pointer to a string. Fixes mimikatz string test.

use magical derefs

0c85e76

Found derefs in viv/insn.py, does exactly what we need!

test fixes

4a0f1f2

Merge branch 'backend-smda' of github.com:danielplohmann/capa into ba…

f3b59b3

…ckend-smda

comments on a test where disassembly differs among backends

d276a07

formatting

6bcdf64

mr-tz reviewed Nov 3, 2020


pnx@pyrite added 2 commits November 5, 2020 12:58

adjusted identification of thunks via SMDA.

3a43ffa

replacement test for nested x64 thunks - still needs to be verified f…

1e25604

…or vivisect

williballenthin approved these changes Nov 5, 2020


mike-hunhoff requested changes Nov 5, 2020


Daniel Plohmann (jupiter) added 2 commits November 6, 2020 09:50

Merge branch 'master' of github.com:fireeye/capa into backend-smda

1a34029

addressing the comments in the PR discussion

7d4888b

mr-tz added 2 commits November 9, 2020 13:22

disable fail-fast for tests job

75defc1

improvements for PR mandiant#355

dfc805b

throw UnsupportedRuntimeError if SmdaFeatureExtractor is used with a …

f7492c7

…Python version < 3.0

This was referenced Nov 9, 2020

disable python3.9 on CI until LIEF updates #361

Closed

enable python3.9 on CI when LIEF updates #362

Closed

mr-tz merged commit 1c1fb20 into mandiant:master Nov 9, 2020

williballenthin added this to the v1.5.0 milestone Jan 19, 2021
