Wrong HTML sanitization for search column #30

Closed
kzaitsev opened this issue Dec 30, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@kzaitsev

Hello, it seems something is wrong with the HTML sanitization used to build the search column. It looks like some tags are ignored rather than unwrapped to text. As a result, searching for a word that appears in the email body does not find the message.

To reproduce, I've attached a zipped .eml file. In this case, the text "massmailgoodhost" is dropped, and the "search" field does not contain it.

example.com - sales sales@example.com massmail testufof9hiyjgo8best regards! liam nison autotestmassmailwasservice@good.good https //www.example.com test. https //portal.example.com/services/my/15413 you have received this notification because you are a example.com customer.email address autotestmassmailwasservice@good.good is attached to account 11092.

This looks like a bug in https://github.com/k3a/html2text. Instead of using it, why not use the Text field of the envelope structure returned by the eml parser (github.com/jhillyerd/enmime)?

d80e6ca4-fb3c-4dcb-a6f8-030af2f8278f.eml.zip

@axllent
Owner

axllent commented Dec 30, 2022

Thanks for the information @kzaitsev. We can't rely on the envelope Text value, because so many HTML emails contain something like "You require an HTML-compatible email program to read this" rather than an actual text version of the HTML, or only a very dumbed-down or broken version of it. From memory, the enmime Text value isn't a conversion of the HTML but rather the Content-Type: text/plain part of the email (if set, else blank).

The best solution is still to manually convert the HTML (if set) to text, but I'll need to dig much deeper to find out exactly why this is happening. If it is an issue with html2text, it will need to be reported there to be fixed. Unfortunately I'm just heading off for a short holiday today, so it will probably be two weeks before I can look into this.

I also see that I am stripping out ":" when I "clean" the text, which results in "https //www.example.com ..." (just noting it here so I don't forget to remove that).
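As a rough sketch of the colon problem, the following compares two cleaning patterns. Both regular expressions and the `clean` helper are hypothetical and are not Mailpit's actual rules; they only show how leaving ":" out of the allowed character set splits a URL, and how allowing it keeps the URL intact.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical patterns, for illustration only.
var (
	// stripsColon replaces everything except letters, digits, whitespace and . / @ -
	stripsColon = regexp.MustCompile(`[^\pL\pN\s./@-]+`)
	// keepsColon additionally allows ":" so that URL schemes survive.
	keepsColon = regexp.MustCompile(`[^\pL\pN\s./@:-]+`)
	// spaces collapses runs of whitespace left behind by the replacement.
	spaces = regexp.MustCompile(`\s+`)
)

// clean lowercases s, replaces disallowed characters with a space,
// and collapses the resulting whitespace.
func clean(re *regexp.Regexp, s string) string {
	s = re.ReplaceAllString(strings.ToLower(s), " ")
	return strings.TrimSpace(spaces.ReplaceAllString(s, " "))
}

func main() {
	in := "Visit https://www.example.com (now)!"
	fmt.Println(clean(stripsColon, in)) // visit https //www.example.com now
	fmt.Println(clean(keepsColon, in))  // visit https://www.example.com now
}
```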

@axllent added the bug label Dec 30, 2022
@kzaitsev
Author

kzaitsev commented Dec 30, 2022

@axllent thank you for your quick response, I understand.

I did some investigating, and it seems https://github.com/jhillyerd/enmime uses https://github.com/jaytaylor/html2text to convert HTML to text for HTML-only emails.

@axllent closed this as completed in d47eb09 Jan 4, 2023
@axllent
Owner

axllent commented Jan 4, 2023

Thanks again for reporting this. I found an option I could pass to html2text to include anchor content in the returned output, so this should now be solved in the latest v1.3.5 release. Please feel free to re-open this if it does not solve your issue!
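For readers wondering what "include anchor content" changes, here is a toy converter written with the standard library only. It is not html2text's implementation; `toText` and its `keepLinkText` flag are hypothetical names, and the sample markup is invented. In k3a/html2text itself the relevant option is, as far as I recall, `WithLinksInnerText()` passed to `HTML2TextWithOptions`, but treat that as an assumption and check the library's README.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// anchorRe captures the href ($1) and the inner content ($2) of a link.
	anchorRe = regexp.MustCompile(`(?is)<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>`)
	// tagRe matches any remaining tag.
	tagRe = regexp.MustCompile(`<[^>]+>`)
	// spaceRe collapses runs of whitespace.
	spaceRe = regexp.MustCompile(`\s+`)
)

// toText (hypothetical) flattens HTML to text. With keepLinkText=false a link
// is reduced to its href only, so the visible link text is lost. With
// keepLinkText=true the link text is kept in front of the href.
func toText(html string, keepLinkText bool) string {
	repl := "$1"
	if keepLinkText {
		repl = "$2 $1"
	}
	s := anchorRe.ReplaceAllString(html, repl)
	s = tagRe.ReplaceAllString(s, " ")
	return strings.TrimSpace(spaceRe.ReplaceAllString(s, " "))
}

func main() {
	html := `<p>Hosted by <a href="https://www.example.com">massmailgoodhost</a></p>`
	fmt.Println(toText(html, false)) // Hosted by https://www.example.com
	fmt.Println(toText(html, true))  // Hosted by massmailgoodhost https://www.example.com
}
```

With the link text dropped, a search for "massmailgoodhost" cannot match, which is consistent with the behaviour reported above.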
