Possible extended ASCII string decompress problem. #490

Open
kirk-sayre-work opened this issue Oct 17, 2019 · 3 comments

@kirk-sayre-work
Contributor

Affected tool:
olevba

Describe the bug
It looks like olevba may be improperly decompressing some VBA strings that contain extended ASCII characters. Several distinct extended ASCII characters in the VBA source produce the same byte sequence in olevba's output.

File/Malware sample to reproduce the bug
An example Word document is available at https://github.com/kirk-sayre-work/talks/blob/master/test.docm

How To Reproduce the bug
Compare the output of olevba on the file with the output of oledump.py test.docm -s A3 -v. The string contents differ between the two tools, and the oledump.py output for the string appears to be the correct one.

Version information:

  • OS: Linux
  • OS version: Ubuntu 16, 64 bit
  • Python version: 2.7, 64 bit
  • oletools version: olevba 0.55.dev4 on Python 2.7.12

Additional context
Some maldoc campaigns (currently IcedID) encode payloads in strings with extended ASCII characters. ViperMonkey fails to decode these payloads properly, apparently because of issues with how the extended ASCII strings are decompressed.

@decalage2
Owner

decalage2 commented Mar 10, 2021

In this sample, the VBA string with special characters seems to be 8F 88 in hex. This is what I get using oledump, or when copy-pasting from the VBA editor into a text editor. The code page for the sample is 1252, so it's standard Western encoding.
olevba on Python 2 converts the string to EF BF BD CB 86. The proper encoding of 8F 88 in UTF-8 should be C2 8F C2 88.
When olevba parses the VBA code, VBA_Module.code_raw contains the right string with 8F 88. So the issue happens when converting that raw string to unicode using the cp1252 codec, and then converting the unicode to UTF-8.
In fact, the cp1252 codec triggers an exception when converting 8F 88 to Unicode:

>>> s=b'\x8F\x88'
>>> u=s.decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 0: character maps to <undefined>

But that exception is hidden by olevba because it uses errors='replace' in VBA_Project.decode_bytes:

>>> u=s.decode('cp1252', errors='replace')
>>> u
u'\ufffd\u02c6'
>>> u.encode('utf8')
'\xef\xbf\xbd\xcb\x86'

And this is why the UTF-8 encoded output is incorrect.
It looks like the Python cp1252 codec treats 8F as an undefined character (88 decodes to U+02C6, as the output above shows), whereas MS Office accepts both bytes...

On Wikipedia about CP1252: "According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too."
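A minimal Python 3 sketch of one way to approximate the Windows behaviour described in that quote: a custom codec error handler that maps each byte the cp1252 codec rejects to the code point with the same value (the C1 control codes for 81, 8D, 8F, 90 and 9D). The handler name c1_fallback is hypothetical and is not part of olevba or of the Python standard library; whether this matches MultiByteToWideChar in every case is an assumption that would need checking on Windows.

```python
import codecs


def c1_fallback(exc):
    """Hypothetical error handler: keep undecodable bytes as same-valued code points."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Each rejected byte (e.g. 0x8F) becomes U+008F instead of U+FFFD.
    return "".join(chr(b) for b in bad), exc.end


# Register under an illustrative name so it can be passed as errors=...
codecs.register_error("c1_fallback", c1_fallback)

raw = b"\x8f\x88"
text = raw.decode("cp1252", errors="c1_fallback")

print([hex(ord(c)) for c in text])   # ['0x8f', '0x2c6']
print(text.encode("utf-8").hex())    # c28fcb86
```

With this handler, 8F is kept as U+008F while 88 still decodes to U+02C6 through the normal cp1252 table, so the UTF-8 result is C2 8F CB 86 rather than EF BF BD CB 86.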

@decalage2
Owner

TODO:

  • At least detect when unicode conversion fails (without errors="replace"), and issue a warning that the VBA source code contains special characters that cannot be converted to unicode.
  • Improve the olevba API so that calling applications can be informed when special characters are present, and can get the raw source code instead of the unicode/UTF-8 on demand.
  • Check how Windows converts those special characters from code page 1252 to unicode, and to UTF-8
  • If possible build a modified version of the cp1252 codec to mimic the behaviour of Windows
  • Check if other code pages have the same issue with undefined characters
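A rough Python 3 sketch of the first and last items in this list, under stated assumptions: decode_with_warning and undefined_bytes are hypothetical helper names, not existing olevba functions. The first tries a strict decode and only falls back to errors="replace" after logging a warning; the second lists which single byte values a given single-byte codec cannot decode, which is one way to check other code pages.

```python
import logging

log = logging.getLogger("vba_decode_sketch")


def decode_with_warning(raw, codec="cp1252"):
    """Return (text, lossy); lossy is True when strict decoding failed."""
    try:
        return raw.decode(codec), False
    except UnicodeDecodeError as exc:
        log.warning("byte 0x%02x at offset %d is undefined in %s",
                    raw[exc.start], exc.start, codec)
        # Same fallback as today, but the caller now knows data was lost.
        return raw.decode(codec, errors="replace"), True


def undefined_bytes(codec):
    """List byte values (0..255) that a single-byte codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


text, lossy = decode_with_warning(b"\x8f\x88")
print(lossy)                                        # True
print([hex(b) for b in undefined_bytes("cp1252")])  # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
```

The undefined_bytes check only makes sense for single-byte code pages; with multi-byte code pages a lone lead byte also fails to decode, so the result would need a different interpretation there.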

@kirk-sayre-work
Contributor Author

There is an additional weird wrinkle to the extended ASCII characters. You have tried copying and pasting from the VBA editor; now try adding a loop that uses Debug.Print on each character in the string with Mid(), copy/paste the debug text, and look at the byte values in that text. In this case the original single-byte value (128...255) is used for each of the extended ASCII characters. So it looks like Office uses Unicode for display in the VBA editor, but under the covers it still uses the single-byte extended ASCII values when the string is accessed in VBA (this is also the behavior I see with VBA string decode loops). Maybe there could be an olevba option to choose between display text values and raw/underlying text values?
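To illustrate the display vs. raw distinction, here is a small Python 3 sketch. The function name both_views is hypothetical and this is not an existing olevba option. It assumes the raw view can be modelled by decoding with latin-1, which maps every byte to the code point of the same value, so a decode loop sees the original 128...255 values.

```python
def both_views(raw, codec="cp1252"):
    """Return the display text and a byte-preserving text for the same raw bytes."""
    return {
        # What a text editor would roughly show, lossy for undefined bytes.
        "display": raw.decode(codec, errors="replace"),
        # Each character's code point equals the original byte value.
        "raw": raw.decode("latin-1"),
    }


views = both_views(b"\x8f\x88")
print([hex(ord(c)) for c in views["display"]])  # ['0xfffd', '0x2c6']
print([hex(ord(c)) for c in views["raw"]])      # ['0x8f', '0x88']
print(views["raw"].encode("utf-8").hex())       # c28fc288
```

The raw view round-trips exactly (encoding it back with latin-1 yields the original bytes), which is the property a tool emulating VBA string decode loops would need.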
