Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smtp: Support internationalised characters in envelope addresses #4892

Closed

Conversation

@captain-caveman2k
Copy link
Member

@captain-caveman2k captain-caveman2k commented Feb 7, 2020

This patch set fixes support for internationalised characters in mailbox addresses, which effects the MAIL, RCPT TO and VRFY commands. by either a) using IDN to encode a UTF-8 host name into a ACE which is 7-bit or b) by using the SMTPUTF8 extension to support UTF-8 in both the local part of the mailbox as well as the host name part.

When the SMTPUTF8 extension is used the EXPN will also advertise, to the servers, it's support for the extension.

As such, this PR not only fixes issue #4828 but also allows UTF-8 to be used "legally" in other parts of an SMTP message.

The log from issue #4828 shows that the server supports the SMTPUTF8 extension, as per RFC-6531, but either of the above two fixes would solve that particular issue. However, if UTF-8 mailboxes are to be used in other headers then only the option b can be used to implement the SMTPUTF8 extension fully.

I have utilised the existing IDN conversion code from url.c for the UTF-8 to ASCII conversion. We might want to consider moving that code out of url.c into an IDN or hostname module but we can do that at a later date.

When I first opened this PR I had the following questions. I have kept them here for historical reasons but I believe we have resolved them now.

--

  • Do we only need to do this if the server supports SMTPUTF8?
  • I haven't separated the email address from the domain name in the MAIL_FROM field - do we need to?
  • Are there other fields we need to do this for? For example: MAIL_AUTH and RCPT TO
  • Do we need to implement something similar for IMAP?
@captain-caveman2k captain-caveman2k requested a review from bagder Feb 7, 2020
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch 2 times, most recently from 306cad2 to 2d8f264 Feb 7, 2020
@bagder
Copy link
Member

@bagder bagder commented Feb 7, 2020

Do we only need to do this if the server supports SMTPUTF8?

I don't think so. This is about IDN in the domain names, not what the SMTP server supports.

Update: did some more thinking and I'm no longer really sure what to think. We either need to test it out or find this documented to learn what the answer is.

I haven't separated the email address from the domain name in the MAIL_FROM field - do we need to?

Yes I think so. IDN is only for the host name and I would be a bit scared of what could happen by accident if we encode more than necessary.

Are there other fields we need to do this for? For example: MAIL_AUTH and RCPT TO

I'd say should rather than need, but yes probably?

Do we need to implement something similar for IMAP?

Again should rather than need but yes if the host name in the protocol, there will probably be users with IDN names who would like to use that name rather that the punycoded version.

@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch from 2d8f264 to b683fe0 Feb 7, 2020
lib/smtp.c Outdated
}
}

auth = aprintf("%s", mail_auth.name);

This comment has been minimized.

@bagder

bagder Feb 7, 2020
Member

If this returns NULL we need to set result too for failure, right?

This comment has been minimized.

@captain-caveman2k

captain-caveman2k Feb 7, 2020
Author Member

Yep - isn't that taken care of by the block that starts at line 568? Unless I've screwed up the indentation:

    if(!auth) {
      free(from);

      return CURLE_OUT_OF_MEMORY;
    }
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch 2 times, most recently from e214ee6 to a4b0793 Feb 8, 2020
@captain-caveman2k captain-caveman2k changed the title smtp: IDN encode the MAIL FROM parameter if necessary [wipsmtp: IDN encode the MAIL FROM parameter if necessary Feb 8, 2020
@captain-caveman2k captain-caveman2k changed the title [wipsmtp: IDN encode the MAIL FROM parameter if necessary [wip] [wipsmtp: IDN encode the MAIL FROM parameter if necessary Feb 8, 2020
@captain-caveman2k captain-caveman2k changed the title [wip] [wipsmtp: IDN encode the MAIL FROM parameter if necessary [wip] smtp: IDN encode the MAIL FROM parameter if necessary Feb 8, 2020
@bagder
Copy link
Member

@bagder bagder commented Feb 8, 2020

This looks good, but we really need a test case or two for this as well, right?

@captain-caveman2k
Copy link
Member Author

@captain-caveman2k captain-caveman2k commented Feb 8, 2020

This looks good, but we really need a test case or two for this as well, right?

Yep - that's this evening's job. I've dug out the HTTP IDN test cases (165, 1034, 1035, 1448, 2046 and 2047) which will hopefully help me out with the non-ASCII stuff ;-)

Just waiting for my munchies to be delivered by Ocado (other supermarket delivery services are available) before I get stuck in ;-)

@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch 2 times, most recently from 07e9c9e to 83c8162 Feb 9, 2020
captain-caveman2k added a commit to captain-caveman2k/curl that referenced this pull request Feb 9, 2020
…ings

This avoids the duplication of strings when the optional AUTH and SIZE
parameters are required. It also assists with modifications in curl#4892.
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch from 83c8162 to bc7e3d1 Feb 9, 2020
captain-caveman2k added a commit to captain-caveman2k/curl that referenced this pull request Feb 9, 2020
This avoids the duplication of strings when the optional AUTH and SIZE
parameters are required. It also assists with the modifications that
are part of curl#4892.
captain-caveman2k added a commit that referenced this pull request Feb 9, 2020
This avoids the duplication of strings when the optional AUTH and SIZE
parameters are required. It also assists with the modifications that
are part of #4892.

Closes #4903
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch 3 times, most recently from 9dfe4d6 to 1b86261 Feb 9, 2020
@captain-caveman2k
Copy link
Member Author

@captain-caveman2k captain-caveman2k commented Feb 13, 2020

I think this is about there and ready for public scrutiny:

Do we only need to do this if the server supports SMTPUTF8?

I don't think so. This is about IDN in the domain names, not what the SMTP server supports.

Update: did some more thinking and I'm no longer really sure what to think. We either need to test it out or find this documented to learn what the answer is.

My thoughts were similar when I started this journey and I must admit I am still in two minds even now, having spent a a fair amount of time with RFC-6531.

My understanding is that if the client wants to transmit UTF-8 in either a) the local part of the mailbox in an envelope command, or b) any part of a mailbox address used in one or more headers then the SMTPUTF8 extension is required.

However if the client wants to transmit UTF-8 in only the host part of mailboxes in an envelope command (but not in any header) then the client may choose to convert these, at its discretion, using IDN. This is detailed in [1] and I think is where I started out.

You could argue that with client side SMTPUTF8 support there is no need to convert the host part to an A-label as it could be sent as a U-label - See [2].

Note: The server side extension is present in the logs from #4828 so in summary I think our only issue there is we didn't advertise SMTPUTF8 support in the MAIL command. If we had of done there the server should have accepted that address.

I haven't separated the email address from the domain name in the MAIL_FROM field - do we need to?

Yes I think so. IDN is only for the host name and I would be a bit scared of what could happen by accident if we encode more than necessary.

Yep - agreed, I have made changes to implement this.

Are there other fields we need to do this for? For example: MAIL_AUTH and RCPT TO

I'd say should rather than need, but yes probably?

I've ended up implementing this for the MAIL command (both FROM and AUTH parameters), RCPT TO command and VRFY command. I've also added the advertisement of SMTPUTF8 to the EXPN command so the server my respond with UTF-8 based addresses.

Do we need to implement something similar for IMAP?

Again should rather than need but yes if the host name in the protocol, there will probably be users with IDN names who would like to use that name rather that the punycoded version.

I think I'll leave IMAP for another rainy day.

[1] RFC-6531 Section 3.1 Paragraph 3 states:

If the SMTPUTF8 SMTP extension is not offered by the SMTP server, the SMTPUTF8-aware SMTP client MUST NOT transmit an internationalized email address and MUST NOT transmit a mail message containing internationalized mail headers as described in RFC 6532 [RFC6532] at any level within its MIME structure [RFC2045].

However, the paragraph then goes on to state:

(For this paragraph, the internationalized domain name in A-label form as specified in IDNA definitions [RFC5890] is not considered to be "internationalized".)

So I was tempted to implement support for ASCII based recipients (ie. The local part of a mailbox) on UTF-8 hosts using the IDN conversion to A-label.

Instead, if an SMTPUTF8-aware SMTP client (sender) attempts to transfer an internationalized message and encounters an SMTP server that does not support the extension, the best action for it to take depends on other conditions.

[2] RFC-6531 Section 3.2 Paragraph 2 states:

An SMTP client that receives the SMTPUTF8 extension keyword in response to the EHLO command MAY transmit mailbox names within SMTP commands as internationalized strings in UTF-8 form. It MAY send a UTF-8 header [RFC6532] (which may also include mailbox names in UTF-8). It MAY transmit the domain parts of mailbox names within SMTP commands or the message header as A-labels or U-labels [RFC5890]. The presence of the SMTPUTF8 extension does not change the server-relaying behaviors described in RFC 5321.

@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch from 1b86261 to 108e625 Feb 13, 2020
@captain-caveman2k captain-caveman2k changed the title [wip] smtp: IDN encode the MAIL FROM parameter if necessary [WIP] smtp: Support internationalised characters in envelope addresses Feb 13, 2020
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch 2 times, most recently from f9af971 to 9cee717 Feb 13, 2020
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch 3 times, most recently from def9c17 to 383e24b Feb 14, 2020
@captain-caveman2k captain-caveman2k changed the title [WIP] smtp: Support internationalised characters in envelope addresses smtp: Support internationalised characters in envelope addresses Feb 20, 2020
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch from 383e24b to ed853a5 Feb 24, 2020
…eter

Non-ASCII host names will be ACE encoded if IDN is supported.
…eter
…-6531
…meter

Support the SMTPUTF8 extension when sending mailbox information in the
FROM parameter. Non-ASCII domain names will be ACE encoded if IDN is
supported.

Reported-by: ygthien on github
Fixes #4828
…ameter
…meter
…ameter
Simply notify the server we support the SMTPUTF8 extension if it does.
Closes #xxxx
@captain-caveman2k captain-caveman2k force-pushed the captain-caveman2k:smtp_idn_mail_from branch from ed853a5 to f064111 Feb 25, 2020
@captain-caveman2k
Copy link
Member Author

@captain-caveman2k captain-caveman2k commented Feb 25, 2020

Hi @mback2k,

Are you able to take a look at my Windows builds and suggest a way forward please?

I'm trying to update/add some tests that output UTF-8 characters which is then breaking those tests on Windows :(

For example:

https://ci.appveyor.com/project/curlorg/curl/builds/31045056/job/knhcb7deywn9j1ks

@mback2k
Copy link
Member

@mback2k mback2k commented Feb 25, 2020

@captain-caveman2k I can try to take a look this week. Unfortunately I have never been able to solve similar issues while porting the test suite to Windows, because the classic Windows Console is not so keen about UTF-8.

captain-caveman2k added a commit that referenced this pull request Feb 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants
You can’t perform that action at this time.