Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of multipart emails #488

Merged
merged 7 commits into from
Jul 13, 2023
Merged

Conversation

tillprochaska
Copy link
Contributor

@tillprochaska tillprochaska commented Jul 6, 2023

Please do not merge this. This is currently blocked until alephdata/followthemoney#1148 is merged and released.


This fixes handling of multipart emails to resolve alephdata/aleph#3161 and alephdata/aleph#3164. Check the commit messages for details!

To do:

  • Do we still need to add HTML contents to the indexText property?
  • Handle inline attached emails in a similar way like Thunderbird or Apple Mail
  • Add test cases for base64 encoded email payload
  • Remove unreleased followthemoney before merging
  • Handle DSN / delivery status emails correctly (stretch goal)

As a note to myself, a list of helpful email test datasets:

Previously, ingest-file would add the body of HTML parts to the `bodyHtml property and the body of plaintext parts to the `bodyText` property.

Given an email with the following structure…

```
multipart/mixed
    text/plain
    text/html
    text/plain
    text/html
```

… this would result in an FtM entity that has two values in the `bodyHtml` and two values in the `bodyText` property with no way of reconstructing the order of all four parts in the original email.

In case of `multipart/mixed` emails, we need to display all parts in their original order (no matter the content type of the individual parts) to provide a complete, correct preview of the email.

I have solved this by generating alternative HTML or plaintext representation for the parts of `multipart/mixed` emails and adding them to the respective entity properties.

* If a mixed email has a plaintext part, the text is HTML encoded and line breaks are converted to `<br>` tags.
* If a mixed email has an HTML part, all HTML markup is stripped (we already do this for search).

This approach adds some storage overhead as we need to store the contents of email twice. But that is the whole idea of multipart emails with alternative representations… Also, the text contents of most emails are rather short.

Fixes #3161
@tillprochaska tillprochaska changed the title Fix handling Fix handling of multipart emails Jul 6, 2023
@tillprochaska tillprochaska force-pushed the fix/3161-multipart-emails branch 2 times, most recently from b345b63 to 82b00cf Compare July 8, 2023 16:19
@tillprochaska tillprochaska marked this pull request as ready for review July 9, 2023 19:34
Hiding the recursion in the `walk_parts` iterator is a bit pointless when we also do that explicitly in `parse_part`. Makes the code much easier to follow in my opinion and allows us to stop the traversal earlier, e.g. when we encounter an attachment. The latter is key for fixing #3164.
While ingest-file already handles parts of multipart emails with the `Content-Disposition: attachment` header, the current implementation fails for attached emails. This is due to the common structure of emails with attached emails:

Structure of an email with an attached plaintext email:

```
multipart/mixed
    text/plain
    message/rfc822
        text/plain
```

Structure of an email with an attached multipart email:

```
multipart/mixed
    text/plain
    message/rfc822
        multipart/alternative
            text/plain
            text/html
```

In both cases, the `Content-Disposition` header will be set only on the `message/rfc822` part. The previous implementation would simply skip this part because it also is a multipart part itself. The iterator would then descend into its subparts and handle the `text/plain` part (in the first example) or the `mutlipart/alternative` part (in the second example). As the subparts do not have the `Content-Disposition` header set, they would be handled like any other part of the original email.

The new implementation first checks whether a part is an attachment. If it is, it is processed as an attachment and we do not continue the recursion (to prevent the "inlining" of subparts of the attachment).

Fix #3164.
Emails can also contains attached inline emails. This is common especially for automated delivery notifications, but some email clients seem to use attached inline emails when replying to another email. Most common email client display relevant headers of the attached inline emails. This change appends a text-representation of relevant email headers to the `bodyText` and `bodyHtml` properties.

While a better solution might be to store the parsed email contents in a more structured way, we're a bit limit by FtM and this seems like a pragmatic solution for a case that is probably rather uncommon.
We now store the extracted text contents in the `bodyText` property. Also storing the text contents in `indexText` would be redundant, increase index size and lead to duplicate highlights in search results.
We need this fix to preserve the original order of email parts: alephdata/followthemoney#1148
@tillprochaska tillprochaska merged commit e1c15eb into main Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants