Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Decode non-latin attachements filenames. #36

Closed
seriyps opened this Issue May 4, 2013 · 37 comments

Comments

Projects
None yet
5 participants
Contributor

seriyps commented May 4, 2013

Mail:

MIME-Version: 1.0
Received: by 10.49.107.74 with HTTP; Sat, 4 May 2013 10:02:04 -0700 (PDT)
Date: Sat, 4 May 2013 21:02:04 +0400
Delivered-To: seriy.pr@gmail.com
Message-ID: <CAMZozm52CWSvHnhDu+qd+ct6R1bWvfwHy=H=-VjwgdWP+7Ps+A@mail.gmail.com>
Subject: 
From: =?KOI8-R?B?88XSx8XKIPDSz8jP0s/X?= <seriy.pr@gmail.com>
To: aonfmqcpw@95.28.172.21
Content-Type: multipart/mixed; boundary=047d7b672bf44fac7f04dbe76d1f

--047d7b672bf44fac7f04dbe76d1f
Content-Type: multipart/alternative; boundary=047d7b672bf44fac7c04dbe76d1d

--047d7b672bf44fac7c04dbe76d1d
Content-Type: text/plain; charset=ISO-8859-1



--047d7b672bf44fac7c04dbe76d1d
Content-Type: text/html; charset=ISO-8859-1

<div dir="ltr"><br></div>

--047d7b672bf44fac7c04dbe76d1d--
--047d7b672bf44fac7f04dbe76d1f
Content-Type: text/plain; charset=US-ASCII; name="=?KOI8-R?B?1MXT1M/X2cogxsHKzC50eHQ=?="
Content-Disposition: attachment; filename="=?KOI8-R?B?1MXT1M/X2cogxsHKzC4=?=
    =?KOI8-R?B?dHh0?="
Content-Transfer-Encoding: base64
X-Attachment-Id: f_hgb1h6xc0

cXdlcXdlCg==
--047d7b672bf44fac7f04dbe76d1f--

Gives:

{<<"multipart">>,<<"mixed">>,
 [{<<"MIME-Version">>,<<"1.0">>},
  {<<"Received">>,
   <<"by 10.49.107.74 with HTTP; Sat, 4 May 2013 10:02:04 -0700 (PDT)">>},
  {<<"Date">>,<<"Sat, 4 May 2013 21:02:04 +0400">>},
  {<<"Delivered-To">>,<<"seriy.pr@gmail.com">>},
  {<<"Message-ID">>,
   <<"<CAMZozm52CWSvHnhDu+qd+ct6R1bWvfwHy=H=-VjwgdWP+7Ps+A@mail.gmail.com>">>},
  {<<"Subject">>,<<>>},
  {<<"From">>,
   <<208,161,208,181,209,128,208,179,208,181,208,185,32,208,159,209,128,208,
     190,209,133,208,190,209,128,13,208,190,32,60,115,101,114,105,121,46,112,
     114,64,103,109,97,105,108,46,99,111,109,62>>},
  {<<"To">>,<<"aonfmqcpw@95.28.172.21">>},
  {<<"Content-Type">>,
   <<"multipart/mixed; boundary=047d7b672bf44fac7f04dbe76d1f">>}],
 [{<<"content-type-params">>,
   [{<<"boundary">>,<<"047d7b672bf44fac7f04dbe76d1f">>}]},
  {<<"disposition">>,<<"inline">>},
  {<<"disposition-params">>,[]}],
 [{<<"multipart">>,<<"alternative">>,
   [{<<"Content-Type">>,
     <<"multipart/alternative; boundary=047d7b672bf44fac7c04dbe76d1d">>}],
   [{<<"content-type-params">>,
     [{<<"boundary">>,<<"047d7b672bf44fac7c04dbe76d1d">>}]},
    {<<"disposition">>,<<"inline">>},
    {<<"disposition-params">>,[]}],
   [{<<"text">>,<<"plain">>,
     [{<<"Content-Type">>,<<"text/plain; charset=ISO-8859-1">>}],
     [{<<"content-type-params">>,[{<<"charset">>,<<"ISO-8859-1">>}]},
      {<<"disposition">>,<<"inline">>},
      {<<"disposition-params">>,[]}],
     <<"\r\n">>},
    {<<"text">>,<<"html">>,
     [{<<"Content-Type">>,<<"text/html; charset=ISO-8859-1">>}],
     [{<<"content-type-params">>,[{<<"charset">>,<<"ISO-8859-1">>}]},
      {<<"disposition">>,<<"inline">>},
      {<<"disposition-params">>,[]}],
     <<"<div dir=\"ltr\"><br></div>\r\n">>}]},
  {<<"text">>,<<"plain">>,
   [{<<"Content-Type">>,
     <<"text/plain; charset=US-ASCII; name=\"=?KOI8-R?B?1MXT1M/X2cogxsHKzC50eHQ=?=\"">>},
    {<<"Content-Disposition">>,
     <<"attachment; filename=\"=?KOI8-R?B?1MXT1M/X2cogxsHKzC4=?==?KOI8-R?B?dHh0?=\"">>},
    {<<"Content-Transfer-Encoding">>,<<"base64">>},
    {<<"X-Attachment-Id">>,<<"f_hgb1h6xc0">>}],
   [{<<"content-type-params">>,
     [{<<"charset">>,<<"US-ASCII">>},
      {<<"name">>,<<"=?KOI8-R?B?1MXT1M/X2cogxsHKzC50eHQ=?=">>}]},
    {<<"disposition">>,<<"attachment">>},
    {<<"disposition-params">>,
     [{<<"filename">>,
       <<"=?KOI8-R?B?1MXT1M/X2cogxsHKzC4=?==?KOI8-R?B?dHh0?=">>}]}],
   <<"qweqwe\n">>}]}

Note:

{<<"Content-Type">>,
     <<"text/plain; charset=US-ASCII; name=\"=?KOI8-R?B?1MXT1M/X2cogxsHKzC50eHQ=?=\"">>}
    {<<"Content-Disposition">>,
     <<"attachment; filename=\"=?KOI8-R?B?1MXT1M/X2cogxsHKzC4=?==?KOI8-R?B?dHh0?=\"">>},
{<<"name">>,<<"=?KOI8-R?B?1MXT1M/X2cogxsHKzC50eHQ=?=">>}
{<<"filename">>,
       <<"=?KOI8-R?B?1MXT1M/X2cogxsHKzC4=?==?KOI8-R?B?dHh0?=">>}

It should be "тестовый файл.txt"

Contributor

seriyps commented May 5, 2013

This patch can fix the problem, but we should pass encoding from options, not get_default_encoding() one.

diff --git a/src/mimemail.erl b/src/mimemail.erl
index 84fa362..3e202a9 100644
--- a/src/mimemail.erl
+++ b/src/mimemail.erl
@@ -326,8 +326,10 @@ split_body_by_boundary_(Body, Boundary, Acc) ->
                0 ->
                        lists:reverse([{[], TrimmedBody} | Acc]);
                Index ->
+            {ParsedHdrs, BodyRest} = parse_headers(binstr:substr(TrimmedBody, 1, Index - 1)),
+            DecodedHdrs = decode_headers(ParsedHdrs, [], get_default_encoding()),
                        split_body_by_boundary_(binstr:substr(TrimmedBody, Index + byte_size(Boundary)), Boundary,
-                               [parse_headers(binstr:substr(TrimmedBody, 1, Index - 1)) | Acc])
+                               [{DecodedHdrs, BodyRest} | Acc])
        end.

 -spec(parse_headers/1 :: (Body :: binary()) -> {[{binary(), binary()}], binary()}).
Contributor

seriyps commented Nov 22, 2013

I fix this issue in my fork there seriyps/gen_smtp@53f3673 not sure if I can convert it to pull-request...

seriyps added a commit to seriyps/gen_smtp that referenced this issue Nov 22, 2013

Collaborator

mworrell commented Nov 22, 2013

This patch looks good to me. @arjan what do you think?

Collaborator

mworrell commented Nov 25, 2013

@seriyps Is this commit seriyps/gen_smtp@53f3673 together with the eunit test the complete patch? If so then I will merge them via a patch.

Contributor

seriyps commented Nov 25, 2013

yeah, 2 commits: one with implementation and one with unit-test is complete solution for this issue.
Anyway, I will make another pull-request for 2'nd feature soon. Sorry not a separate branches..

Contributor

seriyps commented Nov 25, 2013

But yeah, you can merge it as patch or may merge whole fork. It's stable enough, I run it in production.

Collaborator

mworrell commented Nov 25, 2013

Your patches for the "Permissive mime decoder", are they stable as well?

(Looking at them because of the "get_vablue(…)" fix.)

Collaborator

mworrell commented Nov 25, 2013

I tried to merge your branch, though I do get the following eunit test failure:

mimemail: rfc2047_decode_test_ (multiple unicode email addresses)...*failed*
in function mimemail:decode_header_tokens_permissive/3 (src/mimemail.erl, line 196)
  called as decode_header_tokens_permissive([],"utf-8",[<<" <jz@erlang-solutions.com>">>,
 <<" <jacek.zlydach@erlang-solutions.com>, ">>,
 <<" <jz@erlang-solutions.com>, ">>,
 <<" <jacek.zlydach@erlang-solutions.com>, ">>])
in call from mimemail:decode_header/2 (src/mimemail.erl, line 132)
in call from mimemail:'-rfc2047_decode_test_/0-fun-20-'/1 (src/mimemail.erl, line 1594)
**error:function_clause

@ghost ghost assigned mworrell Nov 25, 2013

Contributor

seriyps commented Nov 25, 2013

yeah, already run it in production with no issues. But it little conflicts with one of unit-tests https://github.com/seriyps/gen_smtp/blob/master/src/mimemail.erl#L1587 - this one will fall, because of other error return. I can change the test or try to preserve original error reason...

Collaborator

mworrell commented Nov 25, 2013

Would be great if you can fix it.

BTW, which iconv library are you using?

Contributor

seriyps commented Nov 26, 2013

Ok, I will fix unit-test in next few days.

I use https://github.com/edescourtis/iconverl. I had some major issues with https://github.com/Vagabond/erlang-iconv like broken UTF-8 or so (but have no concrete examples now).

As I can see, http://github.com/zotonic/eiconv is recommended for now. I will try it then.

Collaborator

mworrell commented Nov 26, 2013

Both are good, I just used eiconv because I had experience with it.

We have some discussion of rolling it into iconverl so that we have a single preferred library.

Contributor

seriyps commented Nov 26, 2013

The only minor issue with iconverl I noticed is one edescourtis/iconverl#1. But it's good enough for me.

Contributor

seriyps commented Nov 27, 2013

Can you, please, merge commits seriyps/gen_smtp@53f3673 and seriyps/gen_smtp@4635759 as a patch for now...
I think I will make separate pull-request for 'permissive parser' some time later.

Collaborator

mworrell commented Nov 27, 2013

I have patched your commits.

Also rewrote the test case a little bit, so that we don't run into Erlang compiler-character-set issues.

See:

0595a24
23d3b6b
06aaf72

@mworrell mworrell closed this Nov 27, 2013

Collaborator

mworrell commented Nov 27, 2013

And thank you for your patches!

Th issue is resolved in 3.0.11

On Tue, Nov 26, 2013 at 7:10 AM, Sergey Prokhorov
notifications@github.comwrote:

The only minor issue with iconverl I noticed is one edescourtis/iconverl#1edescourtis/iconverl#1.
But it's good enough for me.


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29287292
.

Collaborator

mworrell commented Nov 27, 2013

Soon we will merge the iconv libraries into one-size-fits-all. Then we can also set the dependency in gen_smtp.

Sounds good to me.

On Wed, Nov 27, 2013 at 4:12 PM, Marc Worrell notifications@github.comwrote:

Soon we will merge the iconv libraries into one-size-fits-all. Then we can
also set the dependency in gen_smtp.


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29420504
.

ericdc commented Nov 27, 2013

Do you have a strategy in mind for merging the iconv libraries?

I tried to implement all interfaces of other libraries.

I could improve the specs on the eiconv interface along with adding extra unit tests for other interfaces.

Collaborator

mworrell commented Nov 27, 2013

iconv is another issue - but @mmzeeman had some ideas.

Contributor

mmzeeman commented Nov 28, 2013

Ideas on what?

Personally I think Erlang should get a generic interface for character set encoders and decoders. I’m not sure if the iconv is the right interface for that sort of usage.

I’ve made eiconv because the original iconv implementation could not handle conversion of blobs > 64kb.

Maas

iconv is another issue - but @mmzeeman had some ideas.


Reply to this email directly or view it on GitHub.

Collaborator

mworrell commented Nov 28, 2013

Ideas to use only a single library and what is needed for that.

I agree that iconv should be in Erlang's stdlib

Contributor

seriyps commented Nov 28, 2013

In my opinion, http://github.com/zotonic/eiconv has a bit better code quality (smaller codebase, accept iolists as encoding names and data) but at the same time, https://github.com/edescourtis/iconverl has streaming converter API, which may be usefull feature too.

Contributor

mmzeeman commented Nov 28, 2013

eiconv also has a stream api, zotonic’s fork is a bit out of date. It also automatically splits data in chunks to prevent problems with the scheduler when large blobs of data are converted.

Maas

In my opinion, http://github.com/zotonic/eiconv has a bit better code quality (smaller codebase, accept iolists as encoding names and data) but at the same time, https://github.com/edescourtis/iconverl has streaming converter API, which may be usefull feature too.


Reply to this email directly or view it on GitHub.

ericdc commented Nov 28, 2013

I use a fixed 4KB buffer instead of 128K for the conversion with little or
no performance penalty.

On Thu, Nov 28, 2013 at 8:12 AM, Maas-Maarten Zeeman <
notifications@github.com> wrote:

eiconv also has a stream api, zotonic’s fork is a bit out of date. It also
automatically splits data in chunks to prevent problems with the scheduler
when large blobs of data are converted.

Maas

In my opinion, http://github.com/zotonic/eiconv has a bit better code
quality (smaller codebase, accept iolists as encoding names and data) but
at the same time, https://github.com/edescourtis/iconverl has streaming
converter API, which may be usefull feature too.


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29462589
.

Also worth mentioning is that I don't try to guess what the output size
will be and then resize my buffer to a larger size.
https://github.com/zotonic/eiconv/blob/master/c_src/eiconv_nif.c#L126 vs
https://github.com/edescourtis/iconverl/blob/master/c_src/iconverl.c#L188 .
This is why my library in theory should use less memory, perform better and
more predictably with some encodings. I agree some improved interface with
multiple selectable implementations would be nice.

On Thu, Nov 28, 2013 at 8:22 AM, ericdc notifications@github.com wrote:

I use a fixed 4KB buffer instead of 128K for the conversion with little or
no performance penalty.

On Thu, Nov 28, 2013 at 8:12 AM, Maas-Maarten Zeeman <
notifications@github.com> wrote:

eiconv also has a stream api, zotonic’s fork is a bit out of date. It
also
automatically splits data in chunks to prevent problems with the
scheduler
when large blobs of data are converted.

Maas

In my opinion, http://github.com/zotonic/eiconv has a bit better code
quality (smaller codebase, accept iolists as encoding names and data)
but
at the same time, https://github.com/edescourtis/iconverl has streaming
converter API, which may be usefull feature too.


Reply to this email directly or view it on GitHub.


Reply to this email directly or view it on GitHub<
https://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29462589>
.


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29463178
.

ericdc commented Nov 28, 2013

I guess ideally we should have it build into the unicode library as a set
of possible encodings.

On Thu, Nov 28, 2013 at 3:57 AM, Marc Worrell notifications@github.comwrote:

Ideas to use only a single library and what is needed for that.

I agree that iconv should be in Erlang's stdlib


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29448053
.

Contributor

mmzeeman commented Nov 28, 2013

Cool, you handle e2big on the erlang side. Nice trick.

Contributor

seriyps commented Nov 28, 2013

One more issue for both libraries is how they handle 'einval' error when encoding specified with //IGNORE flag. See https://github.com/Vagabond/erlang-iconv/blob/master/c_src/iconv_drv.c#L190-L194
This is the reason of why 2 tests fall in gen_smtp: https://github.com/Vagabond/gen_smtp/blob/master/src/mimemail.erl#L1361 and https://github.com/Vagabond/gen_smtp/blob/master/src/mimemail.erl#L1512 with iconverl as well as with eiconv.

$ echo "this contains a copyright © symbol" | iconv  -f us-ascii -t utf-8
this contains a copyright iconv: illegal input sequence at position 26

$ echo "this contains a copyright © symbol" | iconv  -f us-ascii -t utf-8//ignore 
this contains a copyright  symbol
iconv: illegal input sequence at position 36

$ echo "this contains a copyright © symbol" | iconv -f us-ascii -t utf-8//ignore 2>/dev/null
this contains a copyright  symbol

$ echo "this contains a copyright © symbol" | iconv -c -f us-ascii -t utf-8  # note '-c' switch
this contains a copyright  symbol

$ erl
1> eiconv:convert("us-ascii", "utf-8", <<"this contains a copyright © symbol">>).
{error,eilseq}
2> eiconv:convert("us-ascii", "utf-8//IGNORE", <<"this contains a copyright © symbol">>).
{error,eilseq}
3> iconverl:conv("utf-8//IGNORE", "us-ascii", <<"this contains a copyright © symbol">>).
{error,eilseq}
Collaborator

mworrell commented Nov 29, 2013

When I am running the tests on master with eiconv (make test) on OS X, then all tests pass.

All 191 tests passed.

Maybe there is a platform issue with the iconv C libraries? Or some other difference?

Contributor

seriyps commented Nov 29, 2013

On my current environment (Ubuntu 13.10, erlang R16B01) a significant amount of tests fail: Failed: 15 Passed: 126.
Most of the failures is related to new requirement application:start(asn1) before starting public_key, crypto and ssl in R16, but I will make another ticket about that.

Which Erlang version do you use? Can you try to run

$ erl
1> eiconv:convert("us-ascii", "utf-8", <<"this contains a copyright © symbol">>).
{error,eilseq}
2> eiconv:convert("us-ascii", "utf-8//IGNORE", <<"this contains a copyright © symbol">>).
{error,eilseq}
3> iconverl:conv("utf-8//IGNORE", "us-ascii", <<"this contains a copyright © symbol">>).
{error,eilseq}
Contributor

mmzeeman commented Nov 29, 2013

Interesting, this indeed fails on linux.

After looking it seems that this is caused by a difference in
understanding of the posix iconv spec. Gnu glibc wants to follow the posix
spec, libiconv thinks it is a bug in spec and fixes its api.

This should be fixed in the nif.

ericdc commented Nov 30, 2013

Interesting thats good to know.

On Thu, Nov 28, 2013 at 5:58 PM, Sergey Prokhorov
notifications@github.comwrote:

One more issue for both libraries is how they handle 'einval' error when
encoding specified with //IGNORE flag. See
https://github.com/Vagabond/erlang-iconv/blob/master/c_src/iconv_drv.c#L190-L194
This is the reason of why 2 tests fall in gen_smtp:
https://github.com/Vagabond/gen_smtp/blob/master/src/mimemail.erl#L1361and
https://github.com/Vagabond/gen_smtp/blob/master/src/mimemail.erl#L1512with iconverl as well as with eiconv.


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29488442
.

Okay it's fixed in the latest version of iconverl 3.0.15

Contributor

mmzeeman commented Nov 30, 2013

Maintaining two separate iconv wrappers is madness. If it is ok with you I will point others to your repository.

ericdc commented Dec 1, 2013

Okay no problem.

On Sat, Nov 30, 2013 at 4:43 PM, Maas-Maarten Zeeman <
notifications@github.com> wrote:

Maintaining two separate iconv wrappers is madness. If it is ok with you I
will point others to your repository.


Reply to this email directly or view it on GitHubhttps://github.com/Vagabond/gen_smtp/issues/36#issuecomment-29561719
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment