Handle invalid unicode in metadata values. #136
In VirusTotal#135 it was brought up that you can crash the Python interpreter if you have invalid Unicode in a metadata value. This is my attempt to fix that by attempting to create a string and, if that fails, falling back to a bytes object. On the off chance that the bytes object also fails to create, I added a safety check so that we don't add a NULL pointer to the dictionary (this is how the crash was manifesting).

It's debatable whether we want to ONLY add strings as metadata and NOT fall back to bytes. If we don't fall back to bytes, the only other option I see is to silently drop that metadata on the floor. The tradeoff here is that you may now end up with either a string or a bytes object in your metadata dictionary, which is less than ideal IMO.

I'm open to suggestions on this one.

Fixes VirusTotal#135
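As a rough illustration of the behavior described above, here is a pure-Python sketch of the "try a string, fall back to bytes, never store a missing value" logic. The actual change lives in the C extension; `metadata_value` and the sample keys are hypothetical names used only for this example.

```python
def metadata_value(raw: bytes):
    """Hypothetical helper: return str for valid UTF-8, otherwise the raw bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fallback discussed in this PR: keep the data, but as bytes.
        return raw


meta = {}
for key, raw in [("author", b"alice"), ("note", b"\xff\xfe bad")]:
    value = metadata_value(raw)
    # Analog of the NULL check: only store a value that was actually created.
    if value is not None:
        meta[key] = value
```

With this approach a consumer of `meta` has to be prepared for both `str` and `bytes` values, which is exactly the tradeoff raised above.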
As I said in the commit, I'm not happy with either option. I don't like silently dropping metadata that is not valid unicode, and having it be either bytes or strings is less than ideal also.
The PyUnicode_DecodeUTF8 and other PyUnicode_Decode* APIs can handle the errors with [...]. I don't know if the length is available, could be made available in the future, or if the result of the PyBytes_FromString has a length that could then be passed to PyUnicode_FromEncodedObject with [...]. At this point, the way that YARA processes metadata strings makes [...]
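For reference, the `errors` argument of the C-level decode functions takes the same handler names as Python's `bytes.decode`. A minimal sketch of how the common handlers treat one invalid byte (the sample value and the `try_strict` helper are made up for illustration):

```python
raw = b"caf\xc3\xa9 \xff!"  # valid UTF-8 for "café ", then a stray 0xFF byte


def try_strict(data: bytes):
    """Return the decoded text, or the exception raised by strict decoding."""
    try:
        return data.decode("utf-8")  # "strict" is the default handler
    except UnicodeDecodeError as exc:
        return exc


strict_result = try_strict(raw)                   # a UnicodeDecodeError instance
ignored = raw.decode("utf-8", errors="ignore")    # invalid byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
```

None of these paths hands a NULL object to the caller: strict decoding reports an exception that can be checked, and the other two always produce a string.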
What about replacing this: [...] Using [...] With [...] With [...] This raises a UnicodeEncodeError, unless the error handling was changed with something like [...]. With any of these, Python doesn't crash.
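One way a decoded value can later raise a UnicodeEncodeError is when it was decoded with the `surrogateescape` handler. That handler is an assumption chosen for illustration here, not necessarily the one proposed in this comment. A minimal sketch:

```python
raw = b"bad \xff byte"

# surrogateescape maps each undecodable byte to a lone surrogate (here U+DCFF).
text = raw.decode("utf-8", errors="surrogateescape")

try:
    text.encode("utf-8")  # strict encoding rejects lone surrogates
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")
```

The string is created without crashing the interpreter, but any consumer that encodes or prints it with strict error handling has to cope with the exception.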
I would say that metadata should be only text, at least that was the original intention.
Based on that intention, I recommend dropping the invalid bytes using [...]
Metadata test accepts stripped or original characters
For the metadata decoding, is there a drawback to defining [...]? For the tests, I recommend accepting an empty string [...]
OK, I've updated this with your commit, @malvidin, but I modified the tests in your commit to not include the check for returning the raw string. It seems wrong to me to include a test that does not test the current expected behavior but instead tests something YARA may do in the future (as unlikely as that is). I'm still not happy with ignoring bytes from strings which do not decode as UTF-8, but as you've pointed out, there is no good way to handle this. I'm going to include another commit which updates the documentation to say strings in metadata must be valid UTF-8.
I put up VirusTotal/yara#1260 as a documentation update to go along with this PR if it is merged.