fix: decode malformed strings #187

banteg · 2022-08-26T01:17:58Z

What was wrong?

decoding was failing for some malformed strings.

you can see an example of this if you try to decode logs from mainnet transaction 0x505870232ebd6cefd2a59c760924664212f72759e58fd2df82d61b67ffe0dd75.

it should be possible to read the useful data even from a malformed input.

How was it fixed?

added errors="surrogateescape" to the decode call, as suggested in PEP 383: https://peps.python.org/pep-0383/

i deliberately skipped adding the same to encode because passing an unencodable string usually signifies a user error.
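For readers unfamiliar with the handler, here is a minimal sketch of what it does, using only the standard library. The byte string is a made-up example, not the data from the transaction mentioned above.

```python
# Made-up payload: ASCII text with one byte (0xff) that is never valid in UTF-8.
malformed = b"name\xffsuffix"

# Strict decoding (the Python default) rejects the whole payload.
try:
    malformed.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN,
# so the readable parts survive and no information is lost.
decoded = malformed.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")

print(strict_failed, repr(decoded), round_tripped == malformed)
```

The decoded string contains a lone surrogate, so a plain `decoded.encode("utf-8")` without the handler raises UnicodeEncodeError. That is consistent with leaving encode strict.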

To-Do

Clean up commit history

Add entry to the release notes

Cute Animal Picture

kclowes · 2022-08-29T22:05:30Z

eth_abi/decoding.py

-                "binary data.",
-            ) from e
-        return value
+        return data.decode("utf-8", errors="surrogateescape")


decode can take a number of different error handlers for strings: "strict", "ignore", "replace", "surrogateescape", and "backslashreplace" ("xmlcharrefreplace" applies only when encoding). We can allow users to use any of these handlers if we define our own __call__ and decode functions inside the StringDecoder class that accept an error handler. What do you think about that?
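To compare the handlers side by side, here is a small standard-library sketch on a made-up input. As far as I can tell, "xmlcharrefreplace" is an encode-only handler in CPython, so it is checked separately rather than included in the loop.

```python
malformed = b"abc\xffdef"  # made-up input with a single invalid byte

results = {}
for handler in ("strict", "ignore", "replace", "surrogateescape", "backslashreplace"):
    try:
        results[handler] = malformed.decode("utf-8", errors=handler)
    except UnicodeDecodeError:
        # "strict" raises instead of returning a string.
        results[handler] = None

# "xmlcharrefreplace" only knows how to handle encoding errors; when decoding
# hits an invalid byte, the handler itself fails with a TypeError.
try:
    malformed.decode("utf-8", errors="xmlcharrefreplace")
    xml_supported = True
except TypeError:
    xml_supported = False

for handler, value in results.items():
    print(f"{handler:>16}: {value!r}")
print("xmlcharrefreplace usable for decoding:", xml_supported)
```

"ignore" silently drops the bad byte, "replace" substitutes U+FFFD, and only "surrogateescape" and "backslashreplace" keep the original byte value visible in the result.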

So the class would look something like:

class StringDecoder(ByteStringDecoder):
    def __call__(self, stream: ContextFramesBytesIO, errors="surrogateescape") -> Any:
        return self.decode(stream, errors)

    @parse_type_str("string")
    def from_type_str(cls, abi_type, registry):
        return cls()

    def decode(self, stream, errors="surrogateescape"):
        raw_data = self.read_data_from_stream(stream)
        data, padding_bytes = self.split_data_and_padding(raw_data)
        value = data.decode("utf-8", errors)
        self.validate_padding_bytes(value, padding_bytes)
        return value
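A self-contained sketch of the same idea that runs without eth_abi. SketchStringDecoder and the payload layout handling are illustrative stand-ins, not the library's classes. It assumes the usual ABI layout for a string body: a 32-byte big-endian length followed by the data, zero-padded to a multiple of 32 bytes.

```python
import io


class SketchStringDecoder:
    """Hypothetical stand-in for StringDecoder; not the eth_abi class."""

    def __init__(self, errors="strict"):
        self.errors = errors

    def __call__(self, stream, errors=None):
        # A per-call handler wins; otherwise use the one set at construction.
        errors = self.errors if errors is None else errors

        length = int.from_bytes(stream.read(32), "big")
        padded_length = -(-length // 32) * 32  # round up to a multiple of 32
        raw = stream.read(padded_length)
        data, padding = raw[:length], raw[length:]

        if any(padding):
            raise ValueError("non-zero padding bytes")
        return data.decode("utf-8", errors=errors)


# Made-up payload: length 3, then b"a\xffb" padded with zeros to 32 bytes.
payload = (3).to_bytes(32, "big") + b"a\xffb".ljust(32, b"\x00")

lenient = SketchStringDecoder(errors="surrogateescape")
print(repr(lenient(io.BytesIO(payload))))

strict = SketchStringDecoder()
try:
    strict(io.BytesIO(payload))
except UnicodeDecodeError as exc:
    print("strict decoder rejected the payload:", exc.reason)
```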

yes, this looks good

fix: use surrogateescape chore: lint

pacrob · 2023-05-16T20:53:06Z

I've updated the class as described and adjusted testing, let me know if you have any thoughts. I'll work on an example for the docs tomorrow.

kclowes

Thanks for taking this over, @pacrob! What do you think about allowing users to pass in errors on the method call too so that they don't have to add their own custom Encoder/Decoder through the registry? Maybe the API looks like:
decode(["string"], test_string, errors='backslashreplace'). I haven't thought about this too hard, so feel free to push back if that doesn't make sense.

We'll also want to make sure this isn't a breaking change because then I think we'd have to pin eth-abi in web3 which wouldn't be ideal. I think a breaking change can be avoided if we set the errors key to the Python default, which is strict.
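A small sketch of why defaulting to strict keeps existing callers unaffected. decode_text is a hypothetical helper used only for illustration; it is not eth_abi's signature.

```python
def decode_text(data: bytes, errors: str = "strict") -> str:
    """Hypothetical helper: the new keyword defaults to Python's own default."""
    return data.decode("utf-8", errors=errors)


good = b"hello"
bad = b"a\xffb"  # made-up malformed input

# Callers that never pass errors see exactly the old behaviour.
unchanged = decode_text(good) == good.decode("utf-8")

try:
    decode_text(bad)
    still_raises = False
except UnicodeDecodeError:
    still_raises = True

# Only callers that opt in get lenient decoding.
opted_in = decode_text(bad, errors="backslashreplace")

print(unchanged, still_raises, opted_in)
```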

pacrob · 2023-05-19T17:19:45Z

Not too complex, got it added. User can now call the existing string decoder like you wrote, or add multiple StringDecoders with different behaviors to the registry. The default behavior has been returned to the default strict.

Not sure about the docs location - it could be appropriate to put it in the Decoding section too. Thoughts?

pacrob · 2023-05-19T17:21:08Z

eth_abi/codec.py

@@ -139,6 +145,10 @@ def decode(self, types: Iterable[TypeStr], data: Decodable) -> Tuple[Any, ...]:

        decoders = [self._registry.get_decoder(type_str) for type_str in types]

+        for dec in decoders:


@kclowes this feels clunky, but I'm not sure how else to apply the update

I could be convinced otherwise, but I don't think that passing in a different error handler on a one-off decode call should change the whole class's error handler. But I think either way we should document that behavior somewhere.

Since the StringDecoder is the only one that takes the errors argument, I think we can just pass it in to the decode function on the StringDecoder and then we don't have to loop here. See my comment below.

With from eth_abi import decode, the user is getting an instance of ABIDecoder.decode, which then builds a registry of the various decoder classes needed for the types passed in the first argument.

The stream object can't be passed directly to StringDecoder's decode function - it has to go through a HeadTailDecoder first to handle the padding.

I manually tested this looping solution. The errors argument is sticky from one decode call to the next, so having it default to None doesn't work, but by having it default to strict I can get the desired behavior of only overriding if errors is provided by the user.

Edit: but by setting it to default to strict in ABIDecoder.decode, it then overrides it for any registered StringDecoder, even if it was registered with a different errors arg. Not seeing a good way to get both behaviors.
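The two problems described above can be reproduced with a toy model. Everything here is hypothetical: ToyStringDecoder, TOY_REGISTRY, and both toy_decode functions are illustrative, and the sketch assumes the registry returns the same decoder instance on every lookup.

```python
class ToyStringDecoder:
    def __init__(self, errors="strict"):
        self.errors = errors

    def __call__(self, data):
        return data.decode("utf-8", errors=self.errors)


# Assumption: lookups return shared instances rather than fresh ones.
TOY_REGISTRY = {
    "string": ToyStringDecoder(),
    "lenient_string": ToyStringDecoder(errors="surrogateescape"),
}

bad = b"a\xffb"  # made-up malformed input


def toy_decode(type_str, data, errors=None):
    dec = TOY_REGISTRY[type_str]
    if errors is not None:
        dec.errors = errors  # mutates the shared instance
    return dec(data)


# Problem 1: with a None default, the override leaks into later calls.
first = toy_decode("string", bad, errors="ignore")
second = toy_decode("string", bad)  # no errors passed, "ignore" still applies


def toy_decode_strict_default(type_str, data, errors="strict"):
    dec = TOY_REGISTRY[type_str]
    dec.errors = errors  # always overwrites, including a registered choice
    return dec(data)


# Problem 2: with a "strict" default, a decoder registered as lenient is reset.
try:
    toy_decode_strict_default("lenient_string", bad)
    registered_choice_kept = True
except UnicodeDecodeError:
    registered_choice_kept = False

print(first, second, registered_choice_kept)
```

Both failures come from writing the per-call value onto a shared object. Avoiding that would mean threading the handler through every intermediate decoder call, which is the part that has no obvious clean path here.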

Okay, yeah, I think it makes the most sense how you have it then, by only allowing the registration of a different decoder. I think one API is better than potentially unexpected behavior ¯_(ツ)_/¯

kclowes

The default behavior has been returned to the default strict

It looks like the default is still surrogateescape as of the last commit? I might be missing something though.

Not sure about the docs location - it could be appropriate to put it in the Decoding section too. Thoughts?

Yeah, I think the more docs the merrier! The eth-abi docs are very confusing to me. We could probably use some doc auditing here at some point /cc @wolovim

kclowes · 2023-05-19T18:15:15Z

eth_abi/codec.py

@@ -139,6 +145,10 @@ def decode(self, types: Iterable[TypeStr], data: Decodable) -> Tuple[Any, ...]:

        decoders = [self._registry.get_decoder(type_str) for type_str in types]

+        for dec in decoders:


I could be convinced otherwise, but I don't think that passing in a different error handler on a one-off decode call should change the whole class's error handler. But I think either way we should document that behavior somewhere.

Since the StringDecoder is the only one that takes the errors argument, I think we can just pass it in to the decode function on the StringDecoder and then we don't have to loop here. See my comment below.

eth_abi/decoding.py

pacrob · 2023-05-23T21:11:12Z

You are correct, the defaults were not updated, I must not have saved properly. I've added a test to make sure of it now.

I've removed the errors argument from the top-level decode function as I don't see a way forward for it, but we can discuss further.

kclowes

LGTM!

kclowes · 2023-05-24T16:17:03Z

newsfragments/187.feature.rst

@@ -0,0 +1 @@
+updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `surrogateescape` by default


Suggested change

-updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `surrogateescape` by default
+updated `StringDecoder` class to allow user-defined handling of malformed strings, handle with `strict` by default

kclowes · 2023-05-24T16:22:32Z

eth_abi/codec.py

@@ -139,6 +145,10 @@ def decode(self, types: Iterable[TypeStr], data: Decodable) -> Tuple[Any, ...]:

        decoders = [self._registry.get_decoder(type_str) for type_str in types]

+        for dec in decoders:


Okay, yeah, I think it makes the most sense how you have it then, by only allowing the registration of a different decoder. I think one API is better than potentially unexpected behavior ¯_(ツ)_/¯

pacrob · 2023-05-24T17:12:33Z

Thanks @banteg !

kclowes reviewed Aug 29, 2022

pacrob force-pushed the fix/strings branch 5 times, most recently from 267ebf3 to e0b4845 Compare May 16, 2023 19:54

fix: string decoding errors replace

b5e5565

fix: use surrogateescape chore: lint

pacrob force-pushed the fix/strings branch 2 times, most recently from cdc1a83 to 9d4c971 Compare May 16, 2023 20:48

pacrob force-pushed the fix/strings branch from 9d4c971 to 74f4d25 Compare May 18, 2023 21:16

pacrob requested a review from kclowes May 18, 2023 21:21

kclowes reviewed May 19, 2023

pacrob force-pushed the fix/strings branch from 74f4d25 to 49100ea Compare May 19, 2023 17:13

pacrob reviewed May 19, 2023

kclowes reviewed May 19, 2023

pacrob force-pushed the fix/strings branch from 10762c1 to 61a4d27 Compare May 23, 2023 21:30

kclowes approved these changes May 24, 2023

update StringDecoder to allow user-defined error handler, adjust tests

2ac3b9c

and add example usage to docs

pacrob force-pushed the fix/strings branch from 61a4d27 to 2ac3b9c Compare May 24, 2023 16:53

pacrob merged commit 6189429 into ethereum:master May 24, 2023
18 checks passed
