Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle invalid unicode in metadata values. #136

Merged
merged 3 commits into from
Apr 23, 2020

Conversation

wxsBSD
Copy link
Contributor

@wxsBSD wxsBSD commented Apr 9, 2020

In #135 it was brought up that you can crash the python interpreter if you have
invalid unicode in a metadata value. This is my attempt to fix that by
attempting to create a string, and if that fails falling back to a bytes object.
On the weird chance that the bytes object fails to create I added a safety check
so that we don't add a NULL ptr to the dictionary (this is how the crash was
manifesting).

It's debatable if we want to ONLY add strings as metadata, and NOT fallback to
bytes. If we don't fall back to bytes the only other option I see is to silently
drop that metadata on the floor. The tradeoff here is that now you may end up
with a string or a bytes object in your metadata dictionary, which is less than
ideal IMO.

I'm open to suggestions on this one.

Fixes #135

In VirusTotal#135 it was brought up that you can crash the python interpreter if you have
invalid unicode in a metadata value. This is my attempt to fix that by
attempting to create a string, and if that fails falling back to a bytes object.
On the weird chance that the bytes object fails to create I added a safety check
so that we don't add a NULL ptr to the dictionary (this is how the crash was
manifesting).

It's debatable if we want to ONLY add strings as metadata, and NOT fallback to
bytes. If we don't fall back to bytes the only other option I see is to silently
drop that metadata on the floor. The tradeoff here is that now you may end up
with a string or a bytes object in your metadata dictionary, which is less than
ideal IMO.

I'm open to suggestions on this one.

Fixes VirusTotal#135
@wxsBSD
Copy link
Contributor Author

wxsBSD commented Apr 9, 2020

As I said in the commit, I'm not happy with either option. I don't like silently dropping metadata that is not valid unicode, and having it be either bytes or strings is less than ideal also.

wxs@wxs-mbp yara-python % cat test.py
import yara

rules = yara.compile(source=r'rule test { meta: a = "\x80" b = "foo" condition: true }')
print(list(rules)[0].meta["a"])

matches = rules.match("/bin/ls")
print(list(matches)[0].meta)
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.14-x86_64-3.7 python3 test.py
b'\x80'
{'a': b'\x80', 'b': 'foo'}
wxs@wxs-mbp yara-python %

@malvidin
Copy link
Contributor

The PyUnicode_DecodeUTF8 and other PyUnicode_Decode* APIs can handle the errors with backslashdecode, but those require a length.

I don't know the length is available, could be made available in the future, or if the result of the PyBytes_FromString has a length that could then be passed to PyUnicode_FromEncodedObject with backslashdecode error handling.

At this point, the way that YARA processes metadata strings makes "\x20" and " " identical, yara-python doesn't see the strings until after the escaping happens. In the long term, I would hope that YARA won't convert metadata strings like "\x00" and "\x80" into bytes, and accepts all other bytes except null. But that only decreases the likelihood of the issue, it doesn't not remove it.

@malvidin
Copy link
Contributor

What about replacing this:
object = PyBytes_FromString(meta->string);
with this:
object = PyUnicode_DecodeUTF8(meta->string, strlen(meta->string), "replace");

Using replace loses the invalid characters entirely, but the output is simple to handle. If surrogateescape is used instead of replace, the original bytes are available if desired, but printing requires setting the the IO encoding error handling. Another option is ignore, which removes all invalid characters.

With replace:
These commands will return the original data, with invalid bytes replaced with .
print(list(rules)[0].meta['a']) # '�'
print(list(rules)[0].meta['a'].encode('utf8')) # b'\xef\xbf\xbd'

With surrogateescape:
The second command will return the original UTF8, with invalid bytes intact.
print(repr(list(rules)[0].meta['a'])) # '\udc80'
print(list(rules)[0].meta['a'].encode('utf8', 'surrogateescape')) # b'\x80'

This raises a UnicodeEncodeError, unless the error handling was changed with something like PYTHONIOENCODING=utf-8:surrogateescape.
print(list(rules)[0].meta['a'])
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

With any of these, Python doesn't crash.

@plusvic
Copy link
Member

plusvic commented Apr 13, 2020

I would say that metadata should be only text, at least that was the original intention.

@malvidin
Copy link
Contributor

Based on that intention, I recommend dropping the invalid bytes using ignore.

Metadata test accepts stripped or original characters
@malvidin
Copy link
Contributor

For the metadata decoding, is there a drawback to defining PY_STRING as PyUnicode_DecodeUTF8, rather than PyUnicode_FromString?
Or is it better to use PyUnicode_FromString and handle any errors with PyUnicode_DecodeUTF8?

For the tests, I recommend accepting an empty string '' or the original string r'\x80'. The first accepts current behavior when YARA unescapes the original string and decoding issue is handled. The second accepts the original string if YARA provides that in the future, which matches the original intention of metadata.

wxsBSD@c9f4260

@googlebot
Copy link

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

@wxsBSD
Copy link
Contributor Author

wxsBSD commented Apr 19, 2020

OK, I've updated this with your commit @malvidin - but I modified the tests in your commit to not include the check for returning the raw string. It seems wrong to me to include a test which is not testing the current expected behavior, but is testing something YARA may do in the future (as unlikely as it is to do that).

I'm still not happy with ignoring bytes from strings which do not decode as UTF8, but as you've pointed out there is no good way to handle this. I'm going to include another commit which updates the documentation to say strings in metadata must be valid UTF8 codepoints.

@wxsBSD
Copy link
Contributor Author

wxsBSD commented Apr 19, 2020

I put up VirusTotal/yara#1260 as a documentation update to go along with this PR if it is merged.

@plusvic plusvic merged commit bc4e0cd into VirusTotal:master Apr 23, 2020
plusvic added a commit that referenced this pull request Apr 23, 2020
@wxsBSD wxsBSD deleted the issue_135 branch April 29, 2020 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unescaped Metadata Crashes Python
4 participants