Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial commit for backend-smda #355

Merged
merged 18 commits into from Nov 9, 2020
Merged

Conversation

danielplohmann
Copy link

@danielplohmann danielplohmann commented Oct 29, 2020

Hey all!

We propose to integrate SMDA as a disassembler backend into capa to enable full Python 3 support.
SMDA is a lightweight and fast recursive disassembler built on top of capstone, covering PE and ELF files in 32/64bit. It is compatible with both Python 2 and Python 3.

The pull request contains an implementation structurally oriented along the vivisect implementation and provides all capa-features covered by vivisect. We also made sure that the code is in line with the style requirements and has full test coverage using the tests/fixtures.get_extractor() method:

CapaTests

This initial PR sets SMDA as a default backend for Python 3.
This choice was made to allow you an easier evaluation of the contribution.

As discussed out of band, we don't necessarily ask to make SMDA the default backend for Python 3 but instead propose to allow users choosing their preferred backend through a CLI parameter, which could also be used to pass additional parameters to the disassembler backend (e.g. SMDA can process shellcode / memory dumps and make use of ApiScout for reconstruction of dynamic imports - increasing the potential matches of rules).

This PR was a collaborative effort of @jcrussell and @danielplohmann during GeekWeek 7.

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a really great piece of work, nice job.

I can tell from the changes you made that you've gotten this working quite successfully. I also appreciate that the style is consistent, both within the PR and our existing code base - that really helps us integrate the changes a lot easier.

The biggest thing I'd like to see before merging is the addition of feature unit tests against the SMDA backend. there are 100 or so unit tests that ensure that each backend behaves essentially the same. fortunately, registering the SMDA backend should be pretty easy (if not, its a bug, and I'd like to work to make it easy for you).

would you duplicate test_viv_features.py as test_smda_features.py and update it appropriately?

capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/basicblock.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
# https://stackoverflow.com/a/22947334/ offers a solution and decoding using getfilesystemencoding works
# in our testing, however other sources suggest `sys.stdin.encoding` (https://stackoverflow.com/q/4012571/)
"sample",
type=str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a fix for #354, thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great

@@ -550,7 +573,7 @@ def main(argv=None):
# during the load of the RuleSet, we extract subscope statements into their own rules
# that are subsequently `match`ed upon. this inflates the total rule count.
# so, filter out the subscope rules when reporting total number of loaded rules.
len(filter(lambda r: "capa/subscope-rule" not in r.meta, rules.rules.values())),
len([i for i in filter(lambda r: "capa/subscope-rule" not in r.meta, rules.rules.values())]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch

@williballenthin
Copy link
Collaborator

also, i broke master by merging a rule but not syncing the testfiles submodule. thats since fixed. if you can sync this branch from master then we should see the test results show up in CI.

@danielplohmann
Copy link
Author

this is a really great piece of work, nice job.

I can tell from the changes you made that you've gotten this working quite successfully. I also appreciate that the style is consistent, both within the PR and our existing code base - that really helps us integrate the changes a lot easier.

The biggest thing I'd like to see before merging is the addition of feature unit tests against the SMDA backend. there are 100 or so unit tests that ensure that each backend behaves essentially the same. fortunately, registering the SMDA backend should be pretty easy (if not, its a bug, and I'd like to work to make it easy for you).

would you duplicate test_viv_features.py as test_smda_features.py and update it appropriately?

20201030-capa-smda

Alright, here we go!
Fully implemented and (almost) passing tests for test_smda_features.py.

We found that the only failing test (x64 nested thunks) does not succeed because disassemblers disagree where to define function borders. It seems vivisect wants 0x140059342 as function start, while IDA/SMDA want 0x14005E0C0 as a function start and say 0x140059342 is just a function with a single jump instruction to 0x14005E0C0.
In fixtures.py, I commented on this and also proposed another, simpler function 0x1400615c0 which would serve as possible replacement for a x64 nested thunks test. It's calling IsProcessorFeaturePresent via double indirection, but this function appears to be missed by vivisect. :(
Please advise how to proceed.

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool, this looks great, I think the following items are left to do:

capa/features/extractors/smda/function.py Outdated Show resolved Hide resolved
# https://stackoverflow.com/a/22947334/ offers a solution and decoding using getfilesystemencoding works
# in our testing, however other sources suggest `sys.stdin.encoding` (https://stackoverflow.com/q/4012571/)
"sample",
type=str,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great

Copy link
Collaborator

@williballenthin williballenthin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mr-tz @mike-hunhoff will one of you take a second look and then do the merge?

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you for all of your hard work on this implementation! It's really cool to see capa running in Python 3 and your approach to implementing a new extractor for capa. I left a few comments to consider but overall this looks great!

I tested locally and encountered errors while running the smda tests on Windows with Python v3.7.8 (see below):

> C:\Exclusions\capa>python -m pytest tests\test_smda_features.py
...
=========================================================================================================== short test summary info ===========================================================================================================
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True0] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True1] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True0] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True1] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800017D0-characteristic(peb access)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001068-characteristic(gs access)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001068-characteristic(cross section flow)-False] - struct.error: unpack requires a buffer of 4 bytes

Results (47.59s):
     104 passed
       9 failed
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True0]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True1]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True0]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True1]
         - tests/test_smda_features.py:13 test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800017D0-characteristic(peb access)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001068-characteristic(gs access)-True]
         - tests/test_smda_features.py:13 test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001068-characteristic(cross section flow)-False]

The failed tests appear to be the result of the following error:

C:\Exclusions\capa>python -m capa.main tests\data\al-khaser_x64.exe_
WARNING:capa:skipping non-.yml file: 11736.py
loading : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 422/422 [00:00<00:00, 1227.46     rules/s]
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Exclusions\capa\capa\main.py", line 713, in <module>
    sys.exit(main())
  File "C:\Exclusions\capa\capa\main.py", line 595, in main
    extractor = get_extractor(args.sample, args.format, disable_progress=args.quiet)
  File "C:\Exclusions\capa\capa\main.py", line 319, in get_extractor
    return get_extractor_py3(path, format, disable_progress=disable_progress)
  File "C:\Exclusions\capa\capa\main.py", line 308, in get_extractor_py3
    smda_report = smda_disasm.disassembleFile(path)
  File "smda\Disassembler.py", line 37, in disassembleFile
    loader = FileLoader(file_path, map_file=True)
  File "smda\utility\FileLoader.py", line 17, in __init__
    self._loadFile()
  File "smda\utility\FileLoader.py", line 32, in _loadFile
    self._base_addr = loader.getBaseAddress(self._raw_data)
  File "smda\utility\PeFileLoader.py", line 94, in getBaseAddress
    base_addr = struct.unpack("L", binary[pe_offset + 0x30:pe_offset + 0x38])[0]
struct.error: unpack requires a buffer of 4 bytes

Let's address this error and then we should be ready for a merge 🚀

capa/features/extractors/smda/basicblock.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/basicblock.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/basicblock.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/function.py Outdated Show resolved Hide resolved
capa/features/extractors/smda/insn.py Outdated Show resolved Hide resolved
@danielplohmann
Copy link
Author

Awesome, thank you for all of your hard work on this implementation! It's really cool to see capa running in Python 3 and your approach to implementing a new extractor for capa. I left a few comments to consider but overall this looks great!

I tested locally and encountered errors while running the smda tests on Windows with Python v3.7.8 (see below):

> C:\Exclusions\capa>python -m pytest tests\test_smda_features.py
...
=========================================================================================================== short test summary info ===========================================================================================================
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True0] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True1] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True0] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True1] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x1800017D0-characteristic(peb access)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001068-characteristic(gs access)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True] - struct.error: unpack requires a buffer of 4 bytes
FAILED tests/test_smda_features.py::test_smda_features[kernel32-64-function=0x180001068-characteristic(cross section flow)-False] - struct.error: unpack requires a buffer of 4 bytes

Results (47.59s):
     104 passed
       9 failed
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True0]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001010-api(RtlVirtualUnwind)-True1]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True0]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800202B0-api(RtlCaptureContext)-True1]
         - tests/test_smda_features.py:13 test_smda_features[al-khaser x64-function=0x14004B4F0-api(__vcrt_GetModuleHandle)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x1800017D0-characteristic(peb access)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001068-characteristic(gs access)-True]
         - tests/test_smda_features.py:13 test_smda_features[a1982...-function=0x4014D0-characteristic(cross section flow)-True]
         - tests/test_smda_features.py:13 test_smda_features[kernel32-64-function=0x180001068-characteristic(cross section flow)-False]

The failed tests appear to be the result of the following error:

C:\Exclusions\capa>python -m capa.main tests\data\al-khaser_x64.exe_
WARNING:capa:skipping non-.yml file: 11736.py
loading : 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 422/422 [00:00<00:00, 1227.46     rules/s]
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Program Files\Python37\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Exclusions\capa\capa\main.py", line 713, in <module>
    sys.exit(main())
  File "C:\Exclusions\capa\capa\main.py", line 595, in main
    extractor = get_extractor(args.sample, args.format, disable_progress=args.quiet)
  File "C:\Exclusions\capa\capa\main.py", line 319, in get_extractor
    return get_extractor_py3(path, format, disable_progress=disable_progress)
  File "C:\Exclusions\capa\capa\main.py", line 308, in get_extractor_py3
    smda_report = smda_disasm.disassembleFile(path)
  File "smda\Disassembler.py", line 37, in disassembleFile
    loader = FileLoader(file_path, map_file=True)
  File "smda\utility\FileLoader.py", line 17, in __init__
    self._loadFile()
  File "smda\utility\FileLoader.py", line 32, in _loadFile
    self._base_addr = loader.getBaseAddress(self._raw_data)
  File "smda\utility\PeFileLoader.py", line 94, in getBaseAddress
    base_addr = struct.unpack("L", binary[pe_offset + 0x30:pe_offset + 0x38])[0]
struct.error: unpack requires a buffer of 4 bytes

Let's address this error and then we should be ready for a merge

Hehe, seems like I fell into the struct format char size trap - I now changed this to Q, so it should be fine across all platforms.
All other comments are addressed.

What's still open for discussion:

  1. find a new nested thunk test case that works across the board (explain insn/api: x64 nested thunk test case #356)
  2. decide how to use/specify the analysis backend
    1. specify command-line argument(s)
    2. default to vivisect for Py2 and SMDA for Py3?
  1. I just checked this in my python2 venv and it looks my proposed test replacement for a x64 nested thunk is processed correctly by vivisect as well - if you can confirm this, we would be able to close explain insn/api: x64 nested thunk test case #356 as well.
    2.i) I would go ahead in the near future and prepare this in a separate PR.
    2.ii) That is at least the case for now. As stated initially, feel free to change out SMDA as default any time if another python3 disassembler makes a good replacement. 2.i) would then at least conveniently allow for SMDA to remain available.

@mr-tz
Copy link
Collaborator

mr-tz commented Nov 9, 2020

  1. Thanks, the test case is great! This closes explain insn/api: x64 nested thunk test case #356
  2. That works for me.

Before we merge, can you please:

@danielplohmann
Copy link
Author

  • @mr-tz I've just merged your improvements, thank you!
  • I've put the runtime check for Python 3 into the initialization of SmdaFeatureExtractor and throw a UnsupportedRuntimeError in case this condition is not met. Is this sufficient or would anyone prefer a different way of handling this?
  • The failure of Python 3.9.0-rc.1 is pretty unfortunate as this is due to LIEF not being available via PyPI for Python 3.9 yet. There are recent commits in their repo working towards supporting it. It seems that it is also already possible to compile it from source for Python 3.9 using the Github repository, so I assume it will be available for Python 3.9 in the near future.

@mr-tz
Copy link
Collaborator

mr-tz commented Nov 9, 2020

Great, this looks good to me then. Well done and many thanks again!
@williballenthin, @mike-hunhoff are we good to merge?

@williballenthin
Copy link
Collaborator

lets do it!

@mr-tz mr-tz merged commit 1c1fb20 into mandiant:master Nov 9, 2020
@williballenthin williballenthin added this to the v1.5.0 milestone Jan 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants