fix: decode malformed strings #187
eth_abi/decoding.py
Outdated
"binary data.",
) from e
return value
return data.decode("utf-8", errors="surrogateescape")
`decode` can take a number of different error callbacks for strings: `"strict"`, `"ignore"`, `"replace"`, `"surrogateescape"`, `"backslashreplace"`, and `"xmlcharrefreplace"`. We can allow users to use any of these callbacks if we define our own `__call__` and `decode` functions inside the `StringDecoder` class that accept an error handler. What do you think about that?
So the class would look something like:
class StringDecoder(ByteStringDecoder):
    def __call__(self, stream: ContextFramesBytesIO, errors="surrogateescape") -> Any:
        return self.decode(stream, errors)

    @parse_type_str("string")
    def from_type_str(cls, abi_type, registry):
        return cls()

    def decode(self, stream, errors="surrogateescape"):
        raw_data = self.read_data_from_stream(stream)
        data, padding_bytes = self.split_data_and_padding(raw_data)
        value = data.decode("utf-8", errors)
        self.validate_padding_bytes(value, padding_bytes)
        return value
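For reference, here is a minimal sketch of how the standard-library `bytes.decode` handlers treat a malformed UTF-8 byte string. The sample bytes are made up for illustration and nothing here touches eth-abi. As far as the Python `codecs` documentation describes it, `xmlcharrefreplace` is an encode-only handler, so this comparison leaves it out.

```python
# Hypothetical malformed payload: valid ASCII followed by a byte that is
# never valid in UTF-8.
malformed = b"abc\xff"

# Handlers that work when *decoding* (besides "strict", handled below).
handlers = ["ignore", "replace", "surrogateescape", "backslashreplace"]

results = {name: malformed.decode("utf-8", errors=name) for name in handlers}

# "strict" is the Python default and raises instead of returning a value.
try:
    malformed.decode("utf-8", errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

for name, value in results.items():
    # repr() keeps lone surrogates printable on any terminal encoding.
    print(f"{name:>16}: {value!r}")
print(f"{'strict':>16}: raised={strict_raised}")
```

Only `surrogateescape` keeps enough information to recover the original bytes later; `ignore` and `replace` are lossy.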
yes, this looks good
267ebf3 to e0b4845
fix: use surrogateescape
chore: lint
cdc1a83 to 9d4c971
I've updated the class as described and adjusted testing, let me know if you have any thoughts. I'll work on an example for the docs tomorrow.
Thanks for taking this over, @pacrob! What do you think about allowing users to pass in `errors` on the method call too so that they don't have to add their own custom Encoder/Decoder through the registry? Maybe the API looks like: `decode(["string"], test_string, errors='backslashreplace')`. I haven't thought about this too hard, so feel free to push back if that doesn't make sense.
We'll also want to make sure this isn't a breaking change because then I think we'd have to pin eth-abi in web3 which wouldn't be ideal. I think a breaking change can be avoided if we set the `errors` key to the Python default, which is `strict`.
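The backward-compatibility point can be sketched with a standalone helper. This is not eth-abi's implementation: `decode_abi_string_tail` and `encode_tail` are hypothetical names, and the sketch only handles the tail of a single ABI string (a 32-byte big-endian length followed by zero-padded data). Because `errors` defaults to `"strict"`, callers that never pass it see the same behavior as before.

```python
def decode_abi_string_tail(tail: bytes, errors: str = "strict") -> str:
    """Decode an ABI string tail: 32-byte length, then data right-padded
    with zero bytes to a multiple of 32. Hypothetical helper, not eth-abi."""
    length = int.from_bytes(tail[:32], "big")
    data = tail[32 : 32 + length]
    if len(data) != length:
        raise ValueError("not enough bytes for the declared length")
    padded_len = ((length + 31) // 32) * 32
    padding = tail[32 + length : 32 + padded_len]
    if any(padding):
        raise ValueError("padding bytes must be zero")
    return data.decode("utf-8", errors=errors)


def encode_tail(data: bytes) -> bytes:
    """Build the matching tail for raw bytes (also hypothetical)."""
    padded_len = ((len(data) + 31) // 32) * 32
    return len(data).to_bytes(32, "big") + data.ljust(padded_len, b"\x00")


good = encode_tail("hello".encode("utf-8"))
bad = encode_tail(b"hi\xff")  # 0xff is never valid UTF-8

print(decode_abi_string_tail(good))
print(repr(decode_abi_string_tail(bad, errors="backslashreplace")))

# With the default, malformed input still fails as it did before.
try:
    decode_abi_string_tail(bad)
    default_raised = False
except UnicodeDecodeError:
    default_raised = True
```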
Not too complex, got it added. User can now call the existing
Not sure about the docs location - it could be appropriate to put it in the Decoding section too. Thoughts?
eth_abi/codec.py
Outdated
@@ -139,6 +145,10 @@ def decode(self, types: Iterable[TypeStr], data: Decodable) -> Tuple[Any, ...]:

        decoders = [self._registry.get_decoder(type_str) for type_str in types]

        for dec in decoders:
@kclowes this feels clunky, but I'm not sure how else to apply the update
I could be convinced otherwise, but I don't think that passing in a different error handler on a one-off decode call should change the whole class's error handler. But I think either way we should document that behavior somewhere.
Since the `StringDecoder` is the only one that takes the `errors` argument, I think we can just pass it in to the decode function on the `StringDecoder` and then we don't have to loop here. See my comment below.
With `from eth_abi import decode`, the user is getting an instance of `ABIDecoder.decode`, which then builds a registry of the various decoder classes needed for the types passed in the first argument.
The stream object can't be passed directly to `StringDecoder`'s `decode` function - it has to go through a `HeadTailDecoder` first to handle the padding.
I manually tested this looping solution. The `errors` argument is sticky from one `decode` call to the next, so having it default to `None` doesn't work, but by having it default to `strict` I can get the desired behavior of only overriding if `errors` is provided by the user.
Edit: but by setting it to default to `strict` in `ABIDecoder.decode`, it then overrides it for any registered `StringDecoder`, even if it was registered with a different `errors` arg. Not seeing a good way to get both behaviors.
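The two failure modes described above can be reproduced with a toy model. The classes below are hypothetical stand-ins, not eth-abi code; the only assumption carried over from the thread is that the codec reuses one decoder instance across calls, so an error handler stored on that instance outlives the call that set it.

```python
from typing import Optional


class ToyStringDecoder:
    """Stand-in for a decoder whose error handler is instance state."""

    def __init__(self, errors: str = "strict") -> None:
        self.errors = errors

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8", errors=self.errors)


class ToyCodec:
    """Stand-in for a codec that reuses one shared decoder instance."""

    def __init__(self, decoder: ToyStringDecoder) -> None:
        self._decoder = decoder

    def decode(self, data: bytes, errors: Optional[str] = None) -> str:
        # Looping approach: mutate the shared decoder only when asked to.
        if errors is not None:
            self._decoder.errors = errors
        return self._decoder.decode(data)


class StrictDefaultCodec(ToyCodec):
    """Variant that always applies its own default of "strict"."""

    def decode(self, data: bytes, errors: Optional[str] = "strict") -> str:
        self._decoder.errors = errors
        return self._decoder.decode(data)


bad = b"hi\xff"

# Problem 1: with a None default, a one-off override sticks.
codec = ToyCodec(ToyStringDecoder(errors="strict"))
first = codec.decode(bad, errors="replace")
second = codec.decode(bad)  # no override given, yet "replace" still applies

# Problem 2: with a "strict" default, a decoder registered as lenient
# is silently switched back to strict.
lenient = StrictDefaultCodec(ToyStringDecoder(errors="surrogateescape"))
try:
    lenient.decode(bad)
    overwrite_raised = False
except UnicodeDecodeError:
    overwrite_raised = True
```

In this model neither default gives both behaviors at once, which matches the conclusion reached in the thread.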
Okay, yeah, I think it makes the most sense how you have it then, by only allowing the registration of a different decoder. I think one API is better than potentially unexpected behavior ¯_(ツ)_/¯
"The default behavior has been returned to the default `strict`"
It looks like the default is still `surrogateescape` as of the last commit? I might be missing something though.
"Not sure about the docs location - it could be appropriate to put it in the Decoding section too. Thoughts?"
Yeah, I think the more docs the merrier! The eth-abi docs are very confusing to me. We could probably use some doc auditing here at some point /cc @wolovim
You are correct, the defaults were not updated, I must not have saved properly. I've added a test to make sure of it now. I've removed the
LGTM!
newsfragments/187.feature.rst
Outdated
@@ -0,0 +1 @@
updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `surrogateescape` by default
- updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `surrogateescape` by default
+ updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `strict` by default
and add example usage to docs
Thanks @banteg!
What was wrong?
decoding was failing for some malformed strings.
you can see an example of this if you try to decode logs from mainnet transaction `0x505870232ebd6cefd2a59c760924664212f72759e58fd2df82d61b67ffe0dd75`.
it should be possible to read the useful data even from a malformed input.
How was it fixed?
added `errors="surrogateescape"` to `decode` as suggested here https://peps.python.org/pep-0383/
i deliberately skipped adding the same to `encode` because passing an unencodable string usually signifies a user error.
To-Do
Cute Animal Picture
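The PEP 383 behavior referenced in the description can be illustrated with the standard library alone. This is a sketch with made-up bytes, not the data from the transaction above: `surrogateescape` maps each undecodable byte to a lone surrogate, the readable part survives, and the original bytes can be recovered on request, while a plain strict `encode` of the result still fails.

```python
# Hypothetical payload: readable text followed by two invalid UTF-8 bytes.
raw = b"name\xff\xfe"

# Decoding no longer fails; each bad byte 0xNN becomes U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")
readable = text[:4]

# The same handler on encode restores the exact original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# A strict encode rejects the lone surrogates, so passing such a string
# to an encoder that stays strict surfaces as an error.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

print(repr(text), readable, restored == raw, strict_encode_failed)
```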