-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of multipart emails #488
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Previously, ingest-file would add the body of HTML parts to the `bodyHtml property and the body of plaintext parts to the `bodyText` property. Given an email with the following structure… ``` multipart/mixed text/plain text/html text/plain text/html ``` … this would result in an FtM entity that has two values in the `bodyHtml` and two values in the `bodyText` property with no way of reconstructing the order of all four parts in the original email. In case of `multipart/mixed` emails, we need to display all parts in their original order (no matter the content type of the individual parts) to provide a complete, correct preview of the email. I have solved this by generating alternative HTML or plaintext representation for the parts of `multipart/mixed` emails and adding them to the respective entity properties. * If a mixed email has a plaintext part, the text is HTML encoded and line breaks are converted to `<br>` tags. * If a mixed email has an HTML part, all HTML markup is stripped (we already do this for search). This approach adds some storage overhead as we need to store the contents of email twice. But that is the whole idea of multipart emails with alternative representations… Also, the text contents of most emails are rather short. Fixes #3161
tillprochaska
force-pushed
the
fix/3161-multipart-emails
branch
2 times, most recently
from
July 8, 2023 16:19
b345b63
to
82b00cf
Compare
Hiding the recursion in the `walk_parts` iterator is a bit pointless when we also do that explicitly in `parse_part`. Makes the code much easier to follow in my opinion and allows us to stop the traversal earlier, e.g. when we encounter an attachment. The latter is key for fixing #3164.
While ingest-file already handles parts of multipart emails with the `Content-Disposition: attachment` header, the current implementation fails for attached emails. This is due to the common structure of emails with attached emails: Structure of an email with an attached plaintext email: ``` multipart/mixed text/plain message/rfc822 text/plain ``` Structure of an email with an attached multipart email: ``` multipart/mixed text/plain message/rfc822 multipart/alternative text/plain text/html ``` In both cases, the `Content-Disposition` header will be set only on the `message/rfc822` part. The previous implementation would simply skip this part because it also is a multipart part itself. The iterator would then descend into its subparts and handle the `text/plain` part (in the first example) or the `mutlipart/alternative` part (in the second example). As the subparts do not have the `Content-Disposition` header set, they would be handled like any other part of the original email. The new implementation first checks whether a part is an attachment. If it is, it is processed as an attachment and we do not continue the recursion (to prevent the "inlining" of subparts of the attachment). Fix #3164.
Emails can also contains attached inline emails. This is common especially for automated delivery notifications, but some email clients seem to use attached inline emails when replying to another email. Most common email client display relevant headers of the attached inline emails. This change appends a text-representation of relevant email headers to the `bodyText` and `bodyHtml` properties. While a better solution might be to store the parsed email contents in a more structured way, we're a bit limit by FtM and this seems like a pragmatic solution for a case that is probably rather uncommon.
We now store the extracted text contents in the `bodyText` property. Also storing the text contents in `indexText` would be redundant, increase index size and lead to duplicate highlights in search results.
We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148
tillprochaska
force-pushed
the
fix/3161-multipart-emails
branch
from
July 12, 2023 16:43
338b138
to
7a05b69
Compare
stchris
approved these changes
Jul 13, 2023
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Please do not merge this. This is currently blocked until alephdata/followthemoney#1148 is merged and released.
This fixes handling of multipart emails to resolve alephdata/aleph#3161 and alephdata/aleph#3164. Check the commit messages for details!
To do:
indexText
property?followthemoney
before mergingHandle DSN / delivery status emails correctly (stretch goal)As a note to myself, a list of helpful email test datasets: