Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save the original raw body in the entity #111

Closed
wants to merge 1 commit into from
Closed

Save the original raw body in the entity #111

wants to merge 1 commit into from

Conversation

jameshoulahan
Copy link
Contributor

@jameshoulahan jameshoulahan commented Oct 21, 2020

There's a good chance you won't like these changes due to unbounded allocations. I fully agree. This PR is more an invitation to discussion than a proposal of a concrete solution.

But it's important to have access to the original raw entity body in case it can't be decoded/read (example: messages that specify a transfer encoding yet have illegal base64/qp data). In that case, it's good to still have the original body there to do something with.

Can you think of any io.Reader magic that would facilitate this without the need for ioutil.ReadAll like I've done? I was thinking about wrapping the source reader in a TeeReader that writes the source data to a buffer, which then becomes RawBody. But that would mean you have to read all from Body before RawBody contains what you want, which isn't so great (you might not care about the decoded body at all and only ever want the raw body and be surprised that it's always empty if you never read from Body).

Any other smart ideas?

@emersion
Copy link
Owner

There's a good chance you won't like these changes due to unbounded allocations. I fully agree. This PR is more an invitation to discussion than a proposal of a concrete solution.

OK. Yes, I can't merge this as-is. The whole point of this package is to provide a streaming API.

However, I've been thinking of adding another package in the repo for cases where buffering the whole message is fine: #79. I'm not sure if that would be a good fit for your use-case, and I don't know if/how this PR's functionality would be translate to this package, but feedback is welcome.

But it's important to have access to the original raw entity body in case it can't be decoded/read (example: messages that specify a transfer encoding yet have illegal base64/qp data). In that case, it's good to still have the original body there to do something with.

I'd like to hear more about your use-case. Why is it important to grab the raw, broken data? What do you plan to do with it?

@jameshoulahan
Copy link
Contributor Author

jameshoulahan commented Oct 21, 2020

I'd like to hear more about your use-case. Why is it important to grab the raw, broken data? What do you plan to do with it?

We wanted to add more tolerance to messages that specify wrong transfer encodings or contain badly encoded data. Specifically, while setting up our mime walker, we wanted access to the raw body in case we cannot read without error from the decoded body: github.com/ProtonMail/proton-bridge/pkg/message/parser/parser.go:123

If e.Body cannot be read without error, instead of giving up, it would be beneficial to have e.RawBody available. What we will actually do with this raw body is currently undecided; important thing is that we have access to it somehow.

However, I've been thinking of adding another package in the repo for cases where buffering the whole message is fine: #79. I'm not sure if that would be a good fit for your use-case, and I don't know if/how this PR's functionality would be translate to this package, but feedback is welcome.

We have already built a parser using your library that does what you propose: it parses the message once, building the mime representation in memory, then provides walker/visitor access over the mime structure. If such functionality were to be provided directly in this library, it would likely simplify our implementation.

@emersion
Copy link
Owner

What we will actually do with this raw body is currently undecided; important thing is that we have access to it somehow.

Well, to be honest, that's exactly the kind of feedback I'm interested in. I really wonder what you could do with broken part content.

Maybe what you're trying to know is whether the error is due to a broken encoding (in which case the rest of the message can be parsed) or whether it's due to something else (in which case the whole message is probably toast)?

We have already built a parser using your library that does what you propose: it parses the message once, building the mime representation in memory, then provides walker/visitor access over the mime structure. If such functionality were to be provided directly in this library, it would likely simplify our implementation.

Cool! Do you have a link to your implementation?

@emersion
Copy link
Owner

emersion commented Oct 22, 2020

I really wonder what you could do with broken part content.

One use-case I can think of is saving it somewhere to be able to create a bug report. But in this case it'd be best to save the whole message, because the decoding of the part might be affected by MIME boundaries and headers.

@jameshoulahan
Copy link
Contributor Author

jameshoulahan commented Oct 22, 2020

I really wonder what you could do with broken part content.

One example is when the transfer encoding itself is wrong. We've noticed a few messages in the wild which had quoted-printable transfer encoding specified despite the body not actually being valid quoted-printable data; someone wrongly stuck that in the header. But I agree it can become problematic and ideally, we'd just reject anything we can't decode, because otherwise you can't realistically know what to really do with the data. That's why it's pending a decision about what the best UX here is (import the raw body as-is despite the failed decode attempt, or first let the user decide if the parsed result is "correct enough" before importing, or reject it entirely).

Do you have a link to your implementation?

Here is the parsing package. Here is how it is used.

@jameshoulahan
Copy link
Contributor Author

Keeping the discussion going, would you consider an alternative implementation that wraps the underlying source io.Reader in a TeeReader before giving it to the decoder? The TeeReader would be used to write the original out to a buffer. Then, if reading from the decoded body fails, the RawBody would be readable from the buffer.

@emersion
Copy link
Owner

I really wonder what you could do with broken part content.

I see your package also has logic to detect the charset from the meta tags if the part is text/html.

Keeping the discussion going, would you consider an alternative implementation that wraps the underlying source io.Reader in a TeeReader before giving it to the decoder? The TeeReader would be used to write the original out to a buffer. Then, if reading from the decoded body fails, the RawBody would be readable from the buffer.

A TeeReader would need to write to a buffer (which will at the end of the process contain the whole part body in memory), and that's a non-goal for the message package.

I've started to work on #79 in a branch: https://github.com/emersion/go-message/compare/bufmsg. I don't know if it would be enough for your use-case, but feedback is welcome in any case.

@emersion
Copy link
Owner

Closing because this approach is incompatible with streaming.

@emersion emersion closed this Dec 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants