Handle invalid unicode in metadata values. #136

wxsBSD · 2020-04-09T20:15:43Z

In #135 it was brought up that you can crash the python interpreter if you have
invalid unicode in a metadata value. This is my attempt to fix that by
attempting to create a string, and if that fails falling back to a bytes object.
On the weird chance that the bytes object fails to create I added a safety check
so that we don't add a NULL ptr to the dictionary (this is how the crash was
manifesting).

It's debatable if we want to ONLY add strings as metadata, and NOT fallback to
bytes. If we don't fall back to bytes the only other option I see is to silently
drop that metadata on the floor. The tradeoff here is that now you may end up
with a string or a bytes object in your metadata dictionary, which is less than
ideal IMO.

I'm open to suggestions on this one.

Fixes #135

In VirusTotal#135 it was brought up that you can crash the python interpreter if you have invalid unicode in a metadata value. This is my attempt to fix that by attempting to create a string, and if that fails falling back to a bytes object. On the weird chance that the bytes object fails to create I added a safety check so that we don't add a NULL ptr to the dictionary (this is how the crash was manifesting). It's debatable if we want to ONLY add strings as metadata, and NOT fallback to bytes. If we don't fall back to bytes the only other option I see is to silently drop that metadata on the floor. The tradeoff here is that now you may end up with a string or a bytes object in your metadata dictionary, which is less than ideal IMO. I'm open to suggestions on this one. Fixes VirusTotal#135

wxsBSD · 2020-04-09T20:17:10Z

As I said in the commit, I'm not happy with either option. I don't like silently dropping metadata that is not valid unicode, and having it be either bytes or strings is less than ideal also.

wxs@wxs-mbp yara-python % cat test.py
import yara

rules = yara.compile(source=r'rule test { meta: a = "\x80" b = "foo" condition: true }')
print(list(rules)[0].meta["a"])

matches = rules.match("/bin/ls")
print(list(matches)[0].meta)
wxs@wxs-mbp yara-python % PYTHONPATH=build/lib.macosx-10.14-x86_64-3.7 python3 test.py
b'\x80'
{'a': b'\x80', 'b': 'foo'}
wxs@wxs-mbp yara-python %

malvidin · 2020-04-11T09:32:22Z

The PyUnicode_DecodeUTF8 and other PyUnicode_Decode* APIs can handle the errors with backslashdecode, but those require a length.

I don't know the length is available, could be made available in the future, or if the result of the PyBytes_FromString has a length that could then be passed to PyUnicode_FromEncodedObject with backslashdecode error handling.

At this point, the way that YARA processes metadata strings makes "\x20" and " " identical, yara-python doesn't see the strings until after the escaping happens. In the long term, I would hope that YARA won't convert metadata strings like "\x00" and "\x80" into bytes, and accepts all other bytes except null. But that only decreases the likelihood of the issue, it doesn't not remove it.

malvidin · 2020-04-12T20:46:00Z

What about replacing this:
object = PyBytes_FromString(meta->string);
with this:
object = PyUnicode_DecodeUTF8(meta->string, strlen(meta->string), "replace");

Using replace loses the invalid characters entirely, but the output is simple to handle. If surrogateescape is used instead of replace, the original bytes are available if desired, but printing requires setting the the IO encoding error handling. Another option is ignore, which removes all invalid characters.

With replace:
These commands will return the original data, with invalid bytes replaced with �.
print(list(rules)[0].meta['a']) # '�'
print(list(rules)[0].meta['a'].encode('utf8')) # b'\xef\xbf\xbd'

With surrogateescape:
The second command will return the original UTF8, with invalid bytes intact.
print(repr(list(rules)[0].meta['a'])) # '\udc80'
print(list(rules)[0].meta['a'].encode('utf8', 'surrogateescape')) # b'\x80'

This raises a UnicodeEncodeError, unless the error handling was changed with something like PYTHONIOENCODING=utf-8:surrogateescape.
print(list(rules)[0].meta['a'])
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed

With any of these, Python doesn't crash.

plusvic · 2020-04-13T09:10:41Z

I would say that metadata should be only text, at least that was the original intention.

malvidin · 2020-04-13T11:05:27Z

Based on that intention, I recommend dropping the invalid bytes using ignore.

Metadata test accepts stripped or original characters

malvidin · 2020-04-18T10:39:29Z

For the metadata decoding, is there a drawback to defining PY_STRING as PyUnicode_DecodeUTF8, rather than PyUnicode_FromString?
Or is it better to use PyUnicode_FromString and handle any errors with PyUnicode_DecodeUTF8?

For the tests, I recommend accepting an empty string '' or the original string r'\x80'. The first accepts current behavior when YARA unescapes the original string and decoding issue is handled. The second accepts the original string if YARA provides that in the future, which matches the original intention of metadata.

wxsBSD@c9f4260

googlebot · 2020-04-19T02:29:42Z

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).

ℹ️ Googlers: Go here for more info.

wxsBSD · 2020-04-19T02:32:33Z

OK, I've updated this with your commit @malvidin - but I modified the tests in your commit to not include the check for returning the raw string. It seems wrong to me to include a test which is not testing the current expected behavior, but is testing something YARA may do in the future (as unlikely as it is to do that).

I'm still not happy with ignoring bytes from strings which do not decode as UTF8, but as you've pointed out there is no good way to handle this. I'm going to include another commit which updates the documentation to say strings in metadata must be valid UTF8 codepoints.

wxsBSD · 2020-04-19T02:35:26Z

I put up VirusTotal/yara#1260 as a documentation update to go along with this PR if it is merged.

Add error handling to conversion to Unicode

c9f4260

Metadata test accepts stripped or original characters

Remove 'or' clause from tests and add another NULL test check.

9a7145e

wxsBSD mentioned this pull request Apr 19, 2020

Update documentation for metadata strings. VirusTotal/yara#1260

Merged

plusvic approved these changes Apr 23, 2020

View reviewed changes

plusvic merged commit bc4e0cd into VirusTotal:master Apr 23, 2020

plusvic added a commit that referenced this pull request Apr 23, 2020

Fix issue #136 for Python 2.7 too. (#140)

c57e97d

wxsBSD deleted the issue_135 branch April 29, 2020 20:08

malvidin mentioned this pull request Mar 30, 2023

YARA UI & Service Updates CybercentreCanada/assemblyline#55

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle invalid unicode in metadata values. #136

Handle invalid unicode in metadata values. #136

wxsBSD commented Apr 9, 2020

wxsBSD commented Apr 9, 2020

malvidin commented Apr 11, 2020

malvidin commented Apr 12, 2020

plusvic commented Apr 13, 2020

malvidin commented Apr 13, 2020

malvidin commented Apr 18, 2020

googlebot commented Apr 19, 2020

wxsBSD commented Apr 19, 2020

wxsBSD commented Apr 19, 2020

Handle invalid unicode in metadata values. #136

Handle invalid unicode in metadata values. #136

Conversation

wxsBSD commented Apr 9, 2020

wxsBSD commented Apr 9, 2020

malvidin commented Apr 11, 2020

malvidin commented Apr 12, 2020

plusvic commented Apr 13, 2020

malvidin commented Apr 13, 2020

malvidin commented Apr 18, 2020

googlebot commented Apr 19, 2020

wxsBSD commented Apr 19, 2020

wxsBSD commented Apr 19, 2020