fix: decode malformed strings #187

Merged
merged 2 commits into from May 24, 2023

Contributor

@banteg banteg commented Aug 26, 2022

What was wrong?

decoding was failing for some malformed strings.

you can see an example of this if you try to decode logs from mainnet transaction 0x505870232ebd6cefd2a59c760924664212f72759e58fd2df82d61b67ffe0dd75.

it should be possible to read the useful data even from a malformed input.

How was it fixed?

added errors="surrogateescape" to decode as suggested here https://peps.python.org/pep-0383/

i deliberately skipped adding the same to encode because passing an unencodable string usually signifies a user error.
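
A rough, stdlib-only illustration of what the PEP 383 handler does. The byte string is made up for the example and is not taken from the transaction above:

```python
# Hypothetical payload: valid UTF-8 text followed by one invalid byte (0xff).
malformed = b"useful data\xff"

# Strict decoding (Python's default) rejects the whole value.
try:
    malformed.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape keeps the readable part and maps the bad byte to U+DCFF.
text = malformed.decode("utf-8", errors="surrogateescape")

# The mapping is lossless: encoding with the same handler restores the bytes.
restored = text.encode("utf-8", errors="surrogateescape")
```

Encoding `text` without the handler would raise, which is consistent with leaving encode strict.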

To-Do

  • Clean up commit history

Cute Animal Picture

"binary data.",
) from e
return value
return data.decode("utf-8", errors="surrogateescape")
Contributor

decode accepts several error handlers for malformed input: "strict", "ignore", "replace", "surrogateescape", and "backslashreplace" ("xmlcharrefreplace" only works when encoding). We can let users pick any of these if we define our own __call__ and decode functions inside the StringDecoder class that accept an error handler. What do you think about that?

So the class would look something like:

class StringDecoder(ByteStringDecoder):
    def __call__(self, stream: ContextFramesBytesIO, errors="surrogateescape") -> Any:
        return self.decode(stream, errors)

    @parse_type_str("string")
    def from_type_str(cls, abi_type, registry):
        return cls()

    def decode(self, stream, errors="surrogateescape"):
        raw_data = self.read_data_from_stream(stream)
        data, padding_bytes = self.split_data_and_padding(raw_data)
        value = data.decode("utf-8", errors)
        self.validate_padding_bytes(value, padding_bytes)

        return value
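
For comparison, a stdlib-only sketch of how a few of these handlers treat the same invalid byte. The input is made up for illustration, and "strict" is left out because it simply raises UnicodeDecodeError:

```python
# Hypothetical malformed input: "ab", an invalid byte 0xff, then "c".
data = b"ab\xffc"

handlers = ["ignore", "replace", "backslashreplace", "surrogateescape"]

# Decode the same bytes once per handler and collect the results.
results = {name: data.decode("utf-8", errors=name) for name in handlers}

for name, value in results.items():
    # !a prints an ASCII-only repr, so the output is terminal-safe.
    print(f"{name:>16}: {value!a}")
```

Only "surrogateescape" and "backslashreplace" keep enough information to recover the original byte; "ignore" and "replace" discard it.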

Contributor Author

yes, this looks good

@pacrob pacrob force-pushed the fix/strings branch 5 times, most recently from 267ebf3 to e0b4845 Compare May 16, 2023 19:54
fix: use surrogateescape

chore: lint
@pacrob pacrob force-pushed the fix/strings branch 2 times, most recently from cdc1a83 to 9d4c971 Compare May 16, 2023 20:48
Contributor

pacrob commented May 16, 2023

I've updated the class as described and adjusted testing, let me know if you have any thoughts. I'll work on an example for the docs tomorrow.

Contributor

@kclowes kclowes left a comment

Thanks for taking this over, @pacrob! What do you think about allowing users to pass in errors on the method call too so that they don't have to add their own custom Encoder/Decoder through the registry? Maybe the API looks like:
decode(["string"], test_string, errors='backslashreplace'). I haven't thought about this too hard, so feel free to push back if that doesn't make sense.

We'll also want to make sure this isn't a breaking change because then I think we'd have to pin eth-abi in web3 which wouldn't be ideal. I think a breaking change can be avoided if we set the errors key to the Python default, which is strict.

Contributor

pacrob commented May 19, 2023

Not too complex, got it added. User can now call the existing string decoder like you wrote, or add multiple StringDecoders with different behaviors to the registry. The default behavior has been returned to the default strict.
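
The registry half of this can be sketched without eth-abi at all. Everything below (ToyStringDecoder, the toy_registry dict, the type labels) is a hypothetical stand-in, not the library's classes or signatures. It only illustrates fixing the handler at construction time and registering differently configured decoders side by side:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStringDecoder:
    """Hypothetical stand-in for a string decoder; not eth-abi's StringDecoder."""

    # Python's default handler, so behavior is unchanged unless a caller opts in.
    errors: str = "strict"

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8", errors=self.errors)


# A plain dict stands in for the registry; the keys are made-up type labels.
toy_registry = {
    "string": ToyStringDecoder(),
    "lenient_string": ToyStringDecoder(errors="surrogateescape"),
}

malformed = b"ok\xff"

# The lenient decoder returns the readable text plus an escaped byte.
lenient = toy_registry["lenient_string"].decode(malformed)

# The default decoder still raises on the same input.
try:
    toy_registry["string"].decode(malformed)
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```

Because each instance is immutable here, one decoder's handler cannot leak into another's.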

Not sure about the docs location - it could be appropriate to put it in the Decoding section too. Thoughts?

eth_abi/codec.py Outdated
@@ -139,6 +145,10 @@ def decode(self, types: Iterable[TypeStr], data: Decodable) -> Tuple[Any, ...]:

decoders = [self._registry.get_decoder(type_str) for type_str in types]

for dec in decoders:
Contributor

@kclowes this feels clunky, but I'm not sure how else to apply the update

Contributor

I could be convinced otherwise, but I don't think that passing in a different error handler on a one-off decode call should change the whole class's error handler. But I think either way we should document that behavior somewhere.

Since the StringDecoder is the only one that takes the errors argument, I think we can just pass it in to the decode function on the StringDecoder and then we don't have to loop here. See my comment below.

Contributor

@pacrob pacrob May 23, 2023

With from eth_abi import decode, the user is calling ABIDecoder.decode, which then collects from the registry the decoders needed for the types passed in the first argument.

The stream object can't be passed directly to StringDecoder's decode function - it has to go through a HeadTailDecoder first to handle the padding.

I manually tested this looping solution. The errors argument is sticky from one decode call to the next, so having it default to None doesn't work, but by having it default to strict I can get the desired behavior of only overriding if errors is provided by the user.

Edit: but by setting it to default to strict in ABIDecoder.decode, it then overrides it for any registered StringDecoder, even if it was registered with a different errors arg. Not seeing a good way to get both behaviors.
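
A self-contained sketch of the stickiness, using hypothetical classes rather than eth-abi's code. It assumes the cause is that every lookup returns the same decoder instance, so a per-call override that mutates it persists:

```python
from typing import Optional


class ToyDecoder:
    """Hypothetical decoder that stores its handler on the instance."""

    def __init__(self, errors: str = "strict") -> None:
        self.errors = errors

    def decode(self, data: bytes) -> str:
        return data.decode("utf-8", errors=self.errors)


# Assumption: the registry hands back this same instance on every lookup.
SHARED = ToyDecoder()


def toy_decode(data: bytes, errors: Optional[str] = None) -> str:
    decoder = SHARED
    if errors is not None:
        # Per-call override implemented by mutating the shared instance.
        decoder.errors = errors
    return decoder.decode(data)


malformed = b"hi\xff"

first = toy_decode(malformed, errors="ignore")

# No override is passed, yet the earlier "ignore" is still in effect,
# so this call does not raise the way a strict decoder would.
second = toy_decode(malformed)
```

Defaulting the argument to "strict" resets the instance on every call, but it also overwrites a decoder that was deliberately registered with another handler, which is the conflict described in the edit above.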

Contributor

Okay, yeah, I think it makes the most sense how you have it then, by only allowing the registration of a different decoder. I think one API is better than potentially unexpected behavior ¯_(ツ)_/¯

Contributor

@kclowes kclowes left a comment

> The default behavior has been returned to the default strict

It looks like the default is still surrogateescape as of the last commit? I might be missing something though.

> Not sure about the docs location - it could be appropriate to put it in the Decoding section too. Thoughts?

Yeah, I think the more docs the merrier! The eth-abi docs are very confusing to me. We could probably use some doc auditing here at some point /cc @wolovim

eth_abi/decoding.py (resolved)
Contributor

pacrob commented May 23, 2023

You are correct, the defaults were not updated, I must not have saved properly. I've added a test to make sure of it now.

I've removed the errors argument from the top-level decode function as I don't see a way forward for it, but we can discuss further.

Contributor

@kclowes kclowes left a comment

LGTM!

@@ -0,0 +1 @@
updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `surrogateescape` by default
Contributor

Suggested change
- updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `surrogateescape` by default
+ updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `strict` by default

@pacrob pacrob merged commit 6189429 into ethereum:master May 24, 2023
18 checks passed
Contributor

pacrob commented May 24, 2023

Thanks @banteg!


3 participants