
KAFKA-4493: Throw InvalidTransportLayerException when a plaintext client receives a TLS response #5940

Open · wants to merge 2 commits into trunk
Conversation

@stanislavkozlovski (Contributor) commented Nov 22, 2018

Currently, if a client is misconfigured to not use SSL and connects to a broker with SSL listeners, the broker responds with a TLS alert.
The client interprets the first 4 bytes of that TLS alert as the size of the response, and so tries to allocate 352518912 bytes.
This either raises an OutOfMemory exception or stalls on the allocate() call. Either way, we should fail fast with a clearer exception so the user can tell what is wrong.
I personally spent more time than I'm comfortable sharing debugging this issue and I'd like to see it improved.
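
To make the failure mode concrete, here is a small standalone illustration (not part of this patch) of why the size comes out to exactly that number: a TLS alert record starts with the content type byte 0x15 followed by the record version (0x0303 for TLS 1.2), and reading those bytes as a big-endian 4-byte size prefix yields ~352 MB.

import java.nio.ByteBuffer;

public class TlsAlertSizeDemo {
    public static void main(String[] args) {
        // First five bytes of a TLS 1.2 alert record: content type 0x15, version 0x0303, length 0x0002
        ByteBuffer tlsAlertStart = ByteBuffer.wrap(new byte[]{0x15, 0x03, 0x03, 0x00, 0x02});
        int misreadSize = tlsAlertStart.getInt(); // the first 4 bytes, read as a big-endian size prefix
        System.out.println(misreadSize);          // prints 352518912 (0x15030300), i.e. ~352 MB
    }
}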

Implementation

On the first read of a PlaintextTransportLayer channel, we check against the pre-defined TLS alert header, which consists of the alert byte 0x15 followed by a TLS version of 0x0301, 0x0302, or 0x0303.

I believe it should be alright to close such connections, since we would never expect to receive a response this large (~352 MB) as the first response on a newly-established connection; we should at least negotiate API_VERSIONS first.
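
A minimal sketch of that first-read check, under the assumption that the helper below is called once from PlaintextTransportLayer.read() before the first read completes (names are illustrative, not the exact code in this PR):

import java.nio.ByteBuffer;

final class TlsAlertDetector {
    private static final byte TLS_ALERT_CONTENT_TYPE = 0x15;

    // Returns true if the bytes read so far look like the start of a TLS alert record.
    // Uses absolute get(index), so the buffer's position is not disturbed.
    static boolean looksLikeTlsAlert(ByteBuffer dst) {
        if (dst.position() < 3)
            return false; // not enough bytes yet to decide
        int recordVersion = ((dst.get(1) & 0xFF) << 8) | (dst.get(2) & 0xFF);
        return dst.get(0) == TLS_ALERT_CONTENT_TYPE
                && (recordVersion == 0x0301 || recordVersion == 0x0302 || recordVersion == 0x0303);
    }
}

On the first read, the channel could call looksLikeTlsAlert(dst) and, if it returns true, throw InvalidTransportLayerException instead of handing the bogus size to the allocator.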

I'm not entirely certain whether this warrants creation of a KIP since we are raising a new exception - InvalidTransportLayerException. That is not publicly exposed, right?

cc @rajinisivaram for review

@ijuma (Contributor) commented Nov 22, 2018

It would be better if the broker did the right thing here. Have we considered that? Given that client upgrades take forever, any fix on the client side is less effective in the short/medium term.

@stanislavkozlovski (Contributor, Author) commented Nov 23, 2018

I had not considered that. Thanks for the suggestion @ijuma.

After some digging around, I found that the first place where the server detects that something is wrong is a raised SSLException. As the docstring of maybeProcessHandshakeFailure(SSLException sslException, boolean flush, IOException ioException) says:

// SSL handshake failures are typically thrown as SSLHandshakeException, SSLProtocolException,
// SSLPeerUnverifiedException or SSLKeyException if the cause is known. These exceptions indicate
// authentication failures (e.g. configuration errors) which should not be retried. But the SSL engine
// may also throw exceptions using the base class SSLException in a few cases:
// a) If there are no matching ciphers or TLS version or the private key is invalid, client will be
// unable to process the server message and an SSLException is thrown:
// javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
// b) If server closes the connection gracefully during handshake, client may receive close_notify
// and an SSLException is thrown:
// javax.net.ssl.SSLException: Received close_notify during handshake
// We want to handle a) as a non-retriable SslAuthenticationException and b) as a retriable IOException.
// To do this we need to rely on the exception string. Since it is safer to throw a retriable exception
// when we are not sure, we will treat only the first exception string as a handshake exception.

javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection? is exactly what gets raised here. Since we do not know the root cause of the issue (and there doesn't seem to be a way to find out), we flush whatever bytes we have down the wire to the client. These bytes happen to represent a TLS alert message (any one of https://www.gnutls.org/manual/html_node/The-TLS-Alert-Protocol.html); I could not get Java to print out the whole bytes and decode what type of alert the SSLEngine produced.

It seems like we want to preserve the behavior of flushing the bytes to the client. For example, if the client's private key is invalid, we would want to send the alert message back.
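
For readers following along, a hedged, simplified illustration of the string-based classification that docstring describes (the real logic lives in SslTransportLayer and is more involved; the class and helper names below are made up):

import javax.net.ssl.SSLException;
import javax.net.ssl.SSLHandshakeException;
import javax.net.ssl.SSLKeyException;
import javax.net.ssl.SSLPeerUnverifiedException;
import javax.net.ssl.SSLProtocolException;

final class HandshakeFailureClassifier {
    // True when the exception should be treated as a non-retriable authentication failure,
    // false when it should be surfaced as a retriable IOException (e.g. close_notify).
    static boolean isHandshakeFailure(SSLException e) {
        if (e instanceof SSLHandshakeException || e instanceof SSLProtocolException
                || e instanceof SSLPeerUnverifiedException || e instanceof SSLKeyException)
            return true; // cause is known: configuration/authentication error, do not retry
        String message = e.getMessage();
        // case (a) from the docstring: plaintext bytes arriving where SSL was expected
        return message != null && message.contains("Unrecognized SSL message");
    }
}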

@ijuma (Contributor) commented Dec 8, 2018

Thanks for the investigation. My concern is that this is a bit of a layering violation. The plaintext code now has to know about TLS. What about other protocols that can cause the same issue? It would be better if the plaintext code could validate based on what it knows about its own protocol.

@stanislavkozlovski (Contributor, Author) commented

I agree. What do you think would be a better approach? Maybe we could throw an exception if we receive a message size above a certain threshold. That would catch this scenario, where the plaintext layer thinks it is receiving a ~352 MB message. It sounds reasonable to me, since newly-established connections should first negotiate API_VERSIONS, and that does not need big messages.

@ijuma (Contributor) commented Dec 9, 2018

That could work, but the current code is checking all reads, right? Not just the first response.

@stanislavkozlovski (Contributor, Author) commented

No, the current code checks only the very first read, due to the completedRead flag. Since we store one TransportLayer per connection (KafkaChannel), this should only catch the very first read on each connection.

@ijuma (Contributor) commented Jan 9, 2019

One thing to keep in mind is that we don't use ApiVersionRequest for inter-broker communication.

@stanislavkozlovski (Contributor, Author) commented

@ijuma From reading the code and quickly testing locally, I believe that inter-broker communication starts with either:

  • UPDATE_METADATA request when it is from a controller to another broker, which contains all the partitions present in the cluster:
[2019-07-13 23:08:32,238] ERROR [KafkaApi-0] Handling request:RequestHeader(apiKey=UPDATE_METADATA, apiVersion=4, clientId=0, correlationId=32) -- {controller_id=0,controller_epoch=3,partition_states=[{topic=test,partition=0,controller_epoch=3,leader=2,leader_epoch=6,isr=[2],zk_version=8,replicas=[1,2],offline_replicas=[1]}],live_brokers=[{id=2,end_points=[{port=9094,host=localhost,listener_name=PLAINTEXT,security_protocol_type=0}],rack=null},{id=0,end_points=[{port=9092,host=stanislozlovski,listener_name=PLAINTEXT,security_protocol_type=0}],rack=null}]} from connection 192.168.1.104:9092-192.168.1.104:61588-0;securityProtocol:PLAINTEXT,principal:User:ANONYMOUS (kafka.server.KafkaApis)
  • OFFSET_FOR_LEADER_EPOCH request when it is from a follower to a leader, which contains all the partitions the follower is interested in:
[2019-07-13 23:07:35,463] ERROR [KafkaApi-0] Handling request:RequestHeader(apiKey=OFFSET_FOR_LEADER_EPOCH, apiVersion=2, clientId=broker-2-fetcher-0, correlationId=161) -- {topics=[{topic=test,partitions=[{partition=0,current_leader_epoch=3,leader_epoch=0}]}]} from connection 192.168.1.104:9092-192.168.1.104:61715-4;securityProtocol:PLAINTEXT,principal:User:ANONYMOUS (kafka.server.KafkaApis)

It seems like the UpdateMetadataRequest is likely to go over 1 MB in size. From a very rough calculation (basically multiplying out the most frequently-variable piece of that request, UpdateMetadataPartitionState), I estimate that an UpdateMetadataRequest will go over a 1 MB threshold once the cluster reaches around 20k partitions.
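
As a sanity check on that estimate: assuming each serialized UpdateMetadataPartitionState is on the order of 50 bytes (an assumption, not a measurement), 20,000 partitions × 50 bytes ≈ 1,000,000 bytes, which is right around a 1 MB threshold.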

Perhaps we could raise the size limit of the check here? My initial message said that in such a misconfiguration the client thinks it needs to allocate ~352 MB, so a simple 100 MB check should suffice, I think.

@stanislavkozlovski (Contributor, Author) commented

cc @ijuma thoughts on this?

int readSize = socketChannel.read(dst);

if (!completedRead && ((ByteBuffer) dst.rewind()).getInt() > 1_000_000) // first responses should negotiate API_VERSION and be small in size
throw new IllegalTransportLayerStateException("Received a first response larger than 1MB (Is this a plaintext response?)");
Contributor commented on the snippet above:

For your consideration: the message could be more direct.

Is this a plaintext response? --> The broker expects SSL, is your client configured for SSL correctly? (or something of that nature)

stanislavkozlovski (Author) replied:

At this layer we don't know whether the broker expects SSL. The broker is actually configured to expect plaintext in this scenario and we don't know whether the response here is SSL. As @ijuma said earlier, that is a layering violation so I think it's better to keep the current message

Contributor replied:

@stanislavkozlovski understood. My proposed wording was aggressively making assumptions, but perhaps we could find a middle ground that still directs users to check their port? The issue with Received a first response larger than 1MB (Is this a plaintext response?) is that users won't know what to DO with that response. It would be great to reword it accurately to give some proposed action/suggestions to the user...

stanislavkozlovski (Author) replied:

Ack. Any other suggestions? Would something like (Please verify that the configuration is correct. It may be that we are not connecting to a plaintext port) work better?

Contributor commented:

Adding an extra "if" statement on every call to read() seems like a very heavy price to pay for this feature. We should look at where we're using the total length and check there, I think.

@stanislavkozlovski (Author) replied on Nov 14, 2019:

@cmccabe @ijuma Doesn't the compiler optimize the if check?

I do not know how to proceed with this change.

@astubbs's example wouldn't work for the common case where we try to allocate ~352 MB, and it would also be applied to every receive we get, whereas here we are trying to scope the check to the first receive on a specific channel.

Contributor replied:

Are you referring to my fix, or my problem? I just would like my problem solved :) I'm not concerned about my fix - I was mainly trying to demonstrate the problem.

Contributor replied:

If you wanted to minimize the cost, you could move the exception and second part of the if to a separate method so that it's not in the hot path. Btw, calling rewind on the ByteBuffer mutates it, which seems generally unsafe?
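
For illustration, one shape that suggestion could take (names are assumed; note that ByteBuffer.getInt(int index) reads at an absolute offset and does not move the buffer's position, unlike rewind()):

int readSize = socketChannel.read(dst);
if (!completedRead)
    maybeThrowOnSuspiciousFirstRead(dst); // cheap, almost-always-false branch on the hot path

// elsewhere in the same class, off the hot path:
private void maybeThrowOnSuspiciousFirstRead(ByteBuffer dst) {
    if (dst.position() >= 4 && dst.getInt(0) > 1_000_000)
        throw new IllegalTransportLayerStateException(
                "Received a first response larger than 1MB (Is this a plaintext response?)");
}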

Contributor added:

Also, I thought the previous discussion indicated that 1 MB might be too low a threshold?

Contributor commented:

I think the right time to check this would be when we originally create the NetworkReceive object, not in the hot path for every read. Also, whatever we choose as the "upper limit" for a size to be interpreted as valid needs to become the new upper limit for message.max.bytes (which in turn, means that maybe this needs a KIP, since it changes a public config)
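
A hedged sketch of that alternative, validating the size prefix once where the receive is created rather than on every read (the class and parameter names are illustrative, not the actual NetworkReceive code):

import java.nio.ByteBuffer;

final class SizePrefixValidator {
    // maxReceiveSize would be derived from (and would therefore bound) message.max.bytes, per the comment above.
    static int validatedReceiveSize(ByteBuffer sizeBuffer, int maxReceiveSize) {
        int receiveSize = sizeBuffer.getInt(0);
        if (receiveSize < 0 || receiveSize > maxReceiveSize)
            throw new IllegalStateException("Invalid receive size " + receiveSize
                    + " (max allowed " + maxReceiveSize + "); is the peer really using plaintext?");
        return receiveSize;
    }
}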

@guozhangwang (Contributor) commented:

test this please
