Save the original raw body in the entity #111

jameshoulahan · 2020-10-21T08:02:46Z

There's a good chance you won't like these changes due to unbounded allocations. I fully agree. This PR is more an invitation to discussion than a proposal of a concrete solution.

But it's important to have access to the original raw entity body in case it can't be decoded/read (example: messages that specify a transfer encoding yet have illegal base64/qp data). In that case, it's good to still have the original body there to do something with.

Can you think of any io.Reader magic that would facilitate this without the need for ioutil.ReadAll like I've done? I was thinking about wrapping the source reader in a TeeReader that writes the source data to a buffer, which then becomes RawBody. But that would mean you have to read all from Body before RawBody contains what you want, which isn't so great (you might not care about the decoded body at all and only ever want the raw body and be surprised that it's always empty if you never read from Body).

Any other smart ideas?

emersion · 2020-10-21T08:39:15Z

There's a good chance you won't like these changes due to unbounded allocations. I fully agree. This PR is more an invitation to discussion than a proposal of a concrete solution.

OK. Yes, I can't merge this as-is. The whole point of this package is to provide a streaming API.

However, I've been thinking of adding another package in the repo for cases where buffering the whole message is fine: #79. I'm not sure if that would be a good fit for your use-case, and I don't know if/how this PR's functionality would be translate to this package, but feedback is welcome.

But it's important to have access to the original raw entity body in case it can't be decoded/read (example: messages that specify a transfer encoding yet have illegal base64/qp data). In that case, it's good to still have the original body there to do something with.

I'd like to hear more about your use-case. Why is it important to grab the raw, broken data? What do you plan to do with it?

jameshoulahan · 2020-10-21T08:56:57Z

I'd like to hear more about your use-case. Why is it important to grab the raw, broken data? What do you plan to do with it?

We wanted to add more tolerance to messages that specify wrong transfer encodings or contain badly encoded data. Specifically, while setting up our mime walker, we wanted access to the raw body in case we cannot read without error from the decoded body: github.com/ProtonMail/proton-bridge/pkg/message/parser/parser.go:123

If e.Body cannot be read without error, instead of giving up, it would be beneficial to have e.RawBody available. What we will actually do with this raw body is currently undecided; important thing is that we have access to it somehow.

However, I've been thinking of adding another package in the repo for cases where buffering the whole message is fine: #79. I'm not sure if that would be a good fit for your use-case, and I don't know if/how this PR's functionality would be translate to this package, but feedback is welcome.

We have already built a parser using your library that does what you propose: it parses the message once, building the mime representation in memory, then provides walker/visitor access over the mime structure. If such functionality were to be provided directly in this library, it would likely simplify our implementation.

emersion · 2020-10-22T13:46:20Z

What we will actually do with this raw body is currently undecided; important thing is that we have access to it somehow.

Well, to be honest, that's exactly the kind of feedback I'm interested in. I really wonder what you could do with broken part content.

Maybe what you're trying to know is whether the error is due to a broken encoding (in which case the rest of the message can be parsed) or whether it's due to something else (in which case the whole message is probably toast)?

We have already built a parser using your library that does what you propose: it parses the message once, building the mime representation in memory, then provides walker/visitor access over the mime structure. If such functionality were to be provided directly in this library, it would likely simplify our implementation.

Cool! Do you have a link to your implementation?

emersion · 2020-10-22T13:49:14Z

I really wonder what you could do with broken part content.

One use-case I can think of is saving it somewhere to be able to create a bug report. But in this case it'd be best to save the whole message, because the decoding of the part might be affected by MIME boundaries and headers.

jameshoulahan · 2020-10-22T14:54:58Z

I really wonder what you could do with broken part content.

One example is when the transfer encoding itself is wrong. We've noticed a few messages in the wild which had quoted-printable transfer encoding specified despite the body not actually being valid quoted-printable data; someone wrongly stuck that in the header. But I agree it can become problematic and ideally, we'd just reject anything we can't decode, because otherwise you can't realistically know what to really do with the data. That's why it's pending a decision about what the best UX here is (import the raw body as-is despite the failed decode attempt, or first let the user decide if the parsed result is "correct enough" before importing, or reject it entirely).

Do you have a link to your implementation?

Here is the parsing package. Here is how it is used.

jameshoulahan · 2020-10-27T09:32:15Z

Keeping the discussion going, would you consider an alternative implementation that wraps the underlying source io.Reader in a TeeReader before giving it to the decoder? The TeeReader would be used to write the original out to a buffer. Then, if reading from the decoded body fails, the RawBody would be readable from the buffer.

emersion · 2020-10-28T17:14:21Z

I really wonder what you could do with broken part content.

I see your package also has logic to detect the charset from the meta tags if the part is text/html.

Keeping the discussion going, would you consider an alternative implementation that wraps the underlying source io.Reader in a TeeReader before giving it to the decoder? The TeeReader would be used to write the original out to a buffer. Then, if reading from the decoded body fails, the RawBody would be readable from the buffer.

A TeeReader would need to write to a buffer (which will at the end of the process contain the whole part body in memory), and that's a non-goal for the message package.

I've started to work on #79 in a branch: https://github.com/emersion/go-message/compare/bufmsg. I don't know if it would be enough for your use-case, but feedback is welcome in any case.

emersion · 2020-12-22T09:31:15Z

Closing because this approach is incompatible with streaming.

Save the original raw body in the entity

36046e3

emersion closed this Dec 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Save the original raw body in the entity #111

Save the original raw body in the entity #111

jameshoulahan commented Oct 21, 2020 •

edited

Loading

emersion commented Oct 21, 2020

jameshoulahan commented Oct 21, 2020 •

edited

Loading

emersion commented Oct 22, 2020

emersion commented Oct 22, 2020 •

edited

Loading

jameshoulahan commented Oct 22, 2020 •

edited

Loading

jameshoulahan commented Oct 27, 2020

emersion commented Oct 28, 2020

emersion commented Dec 22, 2020

Save the original raw body in the entity #111

Save the original raw body in the entity #111

Conversation

jameshoulahan commented Oct 21, 2020 • edited Loading

emersion commented Oct 21, 2020

jameshoulahan commented Oct 21, 2020 • edited Loading

emersion commented Oct 22, 2020

emersion commented Oct 22, 2020 • edited Loading

jameshoulahan commented Oct 22, 2020 • edited Loading

jameshoulahan commented Oct 27, 2020

emersion commented Oct 28, 2020

emersion commented Dec 22, 2020

jameshoulahan commented Oct 21, 2020 •

edited

Loading

jameshoulahan commented Oct 21, 2020 •

edited

Loading

emersion commented Oct 22, 2020 •

edited

Loading

jameshoulahan commented Oct 22, 2020 •

edited

Loading