-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NIFI-2519 Fixed and refactored ListenSMTP processor #827
Conversation
olegz
commented
Aug 10, 2016
- Removed message queueing which could result in data loss
- Fixed life-cycle issues that coudl put processor in an unstable state
- Fixed PropertyDescriptor translation for Time units and Byte sizes
- Fixed broken tests
- Added additional tests
0423a24
to
ed2de8d
Compare
"certificates used by an TLS peer"), | ||
@WritesAttribute(attribute = "smtp.from", description = "The value used during MAIL FROM (i.e. envelope)"), | ||
@WritesAttribute(attribute = "smtp.to", description = "The value used during RCPT TO (i.e. envelope)"), | ||
@WritesAttribute(attribute = "smtp.src", description = "The source IP of the SMTP connection")}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Attributes were in there for a reason: They allow you to understand the connection (where it came from, what certificates where used, etc) and record this information for posterior use.
I would not remove this functionality
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why we removed them is to align then with how IMAP/POP3 works. Basically the email is written as raw bytes to the content of the Flow File. We can easily have ParseEmailProcessor downstream in the future which is configurable to extract whatever attributes you want. Remember the conversation we all had on the dev list about doing one thing but doing it well. . .? This would be one example. This processor's job IMHO is to receive email and send it downstream. There could be variety of use cases for extracting attributes, parsing content, searching etc., and we should not munge them into a single processor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that conversation, however this information cannot be derived from downstream events as they are not part of the message but part of the SMTP conversation.
I works like this:
$ telnet 0 2525
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.
220 localhost ESMTP Apache NiFi
helo .
250 localhost
mail from: aaa@aaa.com
250 Ok
rcpt to: bbb@bbb.com
250 Ok
data
354 End data with <CR><LF>.<CR><LF>
From: olegz@nowhere.com
To: sowhat@nowhere.com
Subject: Some relevant data was missed in here
Don't you think?
.
250 Ok
quit
221 Bye
Connection closed by foreign host.
Results in:
Received: from . (localhost [127.0.0.1])
by localhost
with SMTP (Apache NiFi) id IRPNED6Q
for bbb@bbb.com;
Thu, 11 Aug 2016 11:30:51 +1000 (AEST)
From: olegz@nowhere.com
To: sowhat@nowhere.com
Subject: Some relevant data was missed in here
Don't you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I just mentioned in the previous moment, adding things in the future is possible, removing is not. So I'd like to know (outside of "just in case") practical implication of those attributes to justify having them now. Also, the to/from/and few others can be derived from message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See how the original MAIL FROM data is fully lost while the rest now needs to be processed via RegEx to the detriment of the user?
RFC-2822 clearly distinguishes the nature of both sets of data, reason while nearly every single SMTP server will provide you with the option of logging the details provided during the SMTP exchange (envelope), while mail parsers will usually look into the Headers of the message (EmailExtractHeaders does that)
MAIL FROM:<>
As discussed in section 2.4.1, a relay SMTP has no need to inspect or
act upon the headers or body of the message data and MUST NOT do so
except to add its own "Received:" header (section 4.4) and,
optionally, to attempt to detect looping in the mail system (see
section 6.2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I am not against it, but outside of "nice to have" i don't see practical justification for it.
Remember, the idea here is not to implement a fully capable SMTP server, if so we would need a separate project just for that. The idea is to have a simple network listener that is capable of speaking SMTP. For anything more complex use an existing/external/secure/managed/monitored Email server and connect to it via IMAP/POP3 etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olegz: Yes, there a concrete reason, we already use it together with QueryDNS to implement RBL on NiFi.
ListenSMTP -> (with from IP) -> QueryDNS does DNS lookup to RBL -> RouteonAttribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand you may not want to add those in IMAP and POP but they are intrinsically different processors:
Listeners get connections from entities which may be outside the control of the DFM, Getters (in theory) will only establish connections to targets defined by the DFM.
It is therefore critical for the chain of custody of a flow (an inbound email may a medical record...) we record information through the ingestion of the data to its delivery. Attributes help with that.
Note that this isn't something particular to this processor, it is a pattern used all across NiFi.
HandleHttpRequest records TLS client details and things like:
http.remote.host The hostname of the requestor
http.remote.addr The hostname:port combination of the requestor
http.remote.user The username of the requestor
ListenRELP
relp.sender The sending host of the messages.
relp.port The sending port the messages were received over.
ListenTCP & ListenUDP
tcp.sender The sending host of the messages.
udp.sender The sending host of the messages.
ListenSyslog records both the Hostname and the Sender.
syslog.hostname The hostname of the Syslog message.
syslog.sender The hostname of the Syslog server that sent the message.
You may not be fully aware but the sender of a syslog message is the server IP, while the hostname is a field of the syslog message. Think of sender as MAIL FROM envelope while hostname is the "From" of a message.
Why is this recorded?
Because the sender may just be forwarding messages on behalf of a certain host and you want to know that. Otherwise evilmachine.com may send messages on behalf of clueless.org and you would never know.
Add that to the fact @bbende allowed for decoupled processing (ParseSyslog) and you find a pattern where Listeners record "wire" level details and parsers deal with payload content. Hence ListenSMTP (ingest and record envelope data) -> ExtractEmailAttributes (deals exclusively with Header data) -> ExtractAttachments (deals only with attachments) -> ExtractTNEFAttachments (handles attachments that happen to be TNEF files)
Hope this helps to clarify.
@olegz - Tested the processor and other than the view that attributes should not be removed (as they are useful for many auditing and downstream applications) and a few remarks on SubEthaSMTP intricacies I am amazed by the refactoring. Truly awesome work in here. |
email.setFrom("alice@nifi.apache.org"); | ||
email.setSubject("This is a test"); | ||
email.setMsg("Test test test chocolate"); | ||
email.setMsg("MSG-" + i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/joke on
NOOOOOOOOOO!!!! What have you done!??!! :-)
https://github.com/apache/nifi/search?utf8=%E2%9C%93&q=chocolate
/joke off
@olegz You are clearly confusing envelope (the data exchanged within an SMTP session) with header information (the data added to the body of the message after the DATA command). They don't need to match. MAIL FROM is an envelope detail it is NOT added to a message. Please refer back to this comment: Note how although MAIL FROM was set to aaa@aaa.com this information never makes into the resulting flowfile. The RCPT TO input makes to it but requires a multiline regex to parse it and that regex will be quite brittle as resolution will require a IPv4 / IPv6 match, variable domains and so it goes. |
i see now. Putting it in |
|
||
|
||
protected static final PropertyDescriptor SMTP_PORT = new PropertyDescriptor.Builder() | ||
@WritesAttribute(attribute = "smtp.helo", description = "The value used during HELO"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olegz My apologies but I missed the removal of the mime type attribute completely...
The original attribute description is wrong but in any case:
This is NOT the mime type of the attachments (these are extracted by ExtractEmailAttachments)
Instead this was set with a non variable value "message/rfc822" (This is the standard mime type of a text file containing an email message (which is what this Listener produces)).
As with previous the addition of mime.type attribute follows the pattern used everywhere around as seen here:
Apologies for noting this earlier.
@trixpan and @olegz great discussion going here. I just ask you find a good 'this is right for first iteration' then let's get it in the hands of users and get more feedback to improve on. We can absolutely add and make this more full over time. But it must first be super stable and tests super reliable. |
@joewitt, @olegz code is a great improvement to the previous code. I do think the the removal of the maximum message size and timeout checks should be addressed as soon as possible: I base my opinion solely on the fact their absence will cause issues to the clients sending emails/flows via this interface. As the processor currently lacks the ability to provide the appropriate SMTP reply codes back to the connecting clients, both ListenSMTP and connecting MTAs will waste resource every time a a connection times out (I am not 100% sure how the size issue will unfold). Postfix documentation describes this scenario:
Note the whooping 600s before postfix gives up and disconnect. Having said that, I am happy to defer this decision to the wider group. |
@olegz is this expected?
|
@trixpan could you please describe scenario that triggered the above exception? |
@trixpan i think you've closed client while it was in he middle of reading InputStream. If that's the case then yes the stack trace above is exactly what one would expect. |
wait for longer than timeout |
Stopping the processor with an agent connected also seems to throw an exception (not sure if expected)
|
The above is actually correct and is exactly what we were aiming for. No data loss. There is actually new test that validates for that exact message. So all is good. User is notified that something went wrong and can resend the message |
@olegz have you pushed the notification recently? I got just an abrupt termination
Note the connection close as soon I stop the server |
@trixpan just confirmed, the "Read timed out" exception is the correct behavior, but I will wrap it into a nicer log message. But just to confirm the behavior is expected |
@trixpan what notification? The exception is thrown by the SMTPServer's thread without ever sending any type of ack about successful consumption of the message so abrupt termination especially with message such as 'closed by foreign host' simply means server is done and your message was not successfully delivered. |
@trixpan with regard to "Read timeout" I've just added better exception handling to notify user 406b37b#diff-be7be540c5e3dfb436cea96d2073e28fR216 |
@olegz thanks for the clarification. Just so that we are on the same page, I am not concerned about the DFM in this case. What I mean by notification is the following: SMTP allows you to convey messages back to the client (These messages are called reply codes). You can send a 5XX reply back and the server will not retry Or you can send a 4XX and the server will retry after a few minutes. Send nothing and reaction will depend on who is sending. Here's a list: http://www.greenend.org.uk/rjk/tech/smtpreplies.html This is why this was in here: |
I understand that and these are all good suggestions to improve in the future, but as Joe pointed out let's first settle on "this is right for first iteration" - meaning that one can interact with it and get message exchange going without message loss. Then let's give it some miles and see how it is used/applied in the workforce and then improve based on requirements at that time. |
2a47de0
to
f9e2c46
Compare
.build(); | ||
|
||
protected static final PropertyDescriptor SMTP_MAXIMUM_CONNECTIONS = new PropertyDescriptor.Builder() | ||
static final PropertyDescriptor SMTP_MAXIMUM_CONNECTIONS = new PropertyDescriptor.Builder() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olegz is SMTP_MAXIMUM_CONNECTIONS being enforced at all?
$ date && telnet 0 2525
Fri Aug 12 09:02:31 AEST 2016
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.
220 localhost ESMTP Apache NiFi
$ date && telnet 0 2525
Fri Aug 12 09:02:30 AEST 2016
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.
220 localhost ESMTP Apache NiFi
I did this test and it doesn't seem to be:
$ date && netstat -an| grep 2525 | pcregrep "2525\s+ES"
Fri Aug 12 09:02:33 AEST 2016
tcp 0 0 0.0.0.0:2525 0.0.0.0:* LISTEN
tcp 0 0 127.0.0.1:38840 127.0.0.1:2525 ESTABLISHED
tcp 0 0 127.0.0.1:38842 127.0.0.1:2525 ESTABLISHED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is propagated to the target API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olegz I imagined so. Not working as expected though (not sure why to be honest).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@olegz If we cannot enforce Size, Timeout and Max Connections perhaps its time to consider changing the SMTP server API used here?
the new PR uses instance array for buffer. This will be a problem under multiple threads. the references in LICENSE to JOpt Simple and JLines..is this copy and paste or are those really in the build? |
- Removed message queueing which could result in data loss - Fixed life-cycle issues that coudl put processor in an unstable state - Fixed PropertyDescriptor translation for Time units and Byte sizes - Fixed broken tests - Added additional tests NIFI-2519 added default for SMTP_MAXIMUM_CONNECTIONS NIFI-2519 addressed PR comments, polishing - fixed intermittent deadlock on processor stop and added test for it - the attributes that can not be extracted from the message but available via MessageContext are written into the outgoing FlowFile - other minor fixes NIFI-2519 addressed lates PR comments NIFI-2519 added better messaging when server closes the connection NIFI-2519 some polishing and additional tests to validate deadlocks NIFI-2519 address latest PR comments fixed deadlock condition for when the consumer is stopped while server is distributing messages fixed MAX message size issue ensuring it is validated set max connections to SMTPServer polished pom added L&N NIFI-2519 PR comments - fixed LICENSE - Added usage of LimitingInputStream - simplified SmtpConsumer by removing hasMessage operation