Fix handling of multipart emails #488

tillprochaska · 2023-07-06T08:13:00Z

Please do not merge this. This is currently blocked until alephdata/followthemoney#1148 is merged and released.

This fixes handling of multipart emails to resolve alephdata/aleph#3161 and alephdata/aleph#3164. Check the commit messages for details!

To do:

Do we still need to add HTML contents to the indexText property?
Handle inline attached emails in a similar way like Thunderbird or Apple Mail
Add test cases for base64 encoded email payload
Remove unreleased followthemoney before merging
~~Handle DSN / delivery status emails correctly (stretch goal)~~

As a note to myself, a list of helpful email test datasets:

Previously, ingest-file would add the body of HTML parts to the `bodyHtml property and the body of plaintext parts to the `bodyText` property. Given an email with the following structure… ``` multipart/mixed text/plain text/html text/plain text/html ``` … this would result in an FtM entity that has two values in the `bodyHtml` and two values in the `bodyText` property with no way of reconstructing the order of all four parts in the original email. In case of `multipart/mixed` emails, we need to display all parts in their original order (no matter the content type of the individual parts) to provide a complete, correct preview of the email. I have solved this by generating alternative HTML or plaintext representation for the parts of `multipart/mixed` emails and adding them to the respective entity properties. * If a mixed email has a plaintext part, the text is HTML encoded and line breaks are converted to `<br>` tags. * If a mixed email has an HTML part, all HTML markup is stripped (we already do this for search). This approach adds some storage overhead as we need to store the contents of email twice. But that is the whole idea of multipart emails with alternative representations… Also, the text contents of most emails are rather short. Fixes #3161

Hiding the recursion in the `walk_parts` iterator is a bit pointless when we also do that explicitly in `parse_part`. Makes the code much easier to follow in my opinion and allows us to stop the traversal earlier, e.g. when we encounter an attachment. The latter is key for fixing #3164.

While ingest-file already handles parts of multipart emails with the `Content-Disposition: attachment` header, the current implementation fails for attached emails. This is due to the common structure of emails with attached emails: Structure of an email with an attached plaintext email: ``` multipart/mixed text/plain message/rfc822 text/plain ``` Structure of an email with an attached multipart email: ``` multipart/mixed text/plain message/rfc822 multipart/alternative text/plain text/html ``` In both cases, the `Content-Disposition` header will be set only on the `message/rfc822` part. The previous implementation would simply skip this part because it also is a multipart part itself. The iterator would then descend into its subparts and handle the `text/plain` part (in the first example) or the `mutlipart/alternative` part (in the second example). As the subparts do not have the `Content-Disposition` header set, they would be handled like any other part of the original email. The new implementation first checks whether a part is an attachment. If it is, it is processed as an attachment and we do not continue the recursion (to prevent the "inlining" of subparts of the attachment). Fix #3164.

Emails can also contains attached inline emails. This is common especially for automated delivery notifications, but some email clients seem to use attached inline emails when replying to another email. Most common email client display relevant headers of the attached inline emails. This change appends a text-representation of relevant email headers to the `bodyText` and `bodyHtml` properties. While a better solution might be to store the parsed email contents in a more structured way, we're a bit limit by FtM and this seems like a pragmatic solution for a case that is probably rather uncommon.

We now store the extracted text contents in the `bodyText` property. Also storing the text contents in `indexText` would be redundant, increase index size and lead to duplicate highlights in search results.

We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148

tillprochaska changed the title ~~Fix handling~~ Fix handling of multipart emails Jul 6, 2023

tillprochaska linked an issue Jul 6, 2023 that may be closed by this pull request

BUG: ingest-file "inlines" contents of attached emails as if they were part of the original email alephdata/aleph#3164

Closed

tillprochaska mentioned this pull request Jul 6, 2023

Feature/3123 switch email type alephdata/aleph#3186

Closed

tillprochaska force-pushed the fix/3161-multipart-emails branch 2 times, most recently from b345b63 to 82b00cf Compare July 8, 2023 16:19

tillprochaska marked this pull request as ready for review July 9, 2023 19:34

tillprochaska requested review from stchris, Rosencrantz and catileptic July 9, 2023 19:34

tillprochaska added 6 commits July 12, 2023 18:41

Add test to ensure email payloads are encoded properly

225727e

Do not add extracted text contents from HTML parts to indexText

496b17d

We now store the extracted text contents in the `bodyText` property. Also storing the text contents in `indexText` would be redundant, increase index size and lead to duplicate highlights in search results.

Bump followthemoney to 3.4.4

7a05b69

We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148

tillprochaska force-pushed the fix/3161-multipart-emails branch from 338b138 to 7a05b69 Compare July 12, 2023 16:43

stchris approved these changes Jul 13, 2023

View reviewed changes

tillprochaska merged commit e1c15eb into main Jul 13, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix handling of multipart emails #488

Fix handling of multipart emails #488

tillprochaska commented Jul 6, 2023 •

edited

Loading

Fix handling of multipart emails #488

Fix handling of multipart emails #488

Conversation

tillprochaska commented Jul 6, 2023 • edited Loading

tillprochaska commented Jul 6, 2023 •

edited

Loading