HADOOP-17975 Fallback to simple auth does not work for a secondary DistributedFileSystem instance. #3579
Conversation
sodonnel left a comment:
This change LGTM provided the CI checks come back good.

💔 -1 overall
This message was automatically generated.

💔 -1 overall
This message was automatically generated.

Looks like there are some related test failures in TestSaslRPC, where it is checking the exception string.

It seems the issue is caused by the ConnectionPool not being able to distinguish two different connections?

@symious actually it is not two different connections, but the same connection to the same DataNode (it is a socket connection from the client node to the DataNode, which knows nothing about the object instance using it or about the type of that instance). The problem is with what we send via the connection. In a secure environment we do a SASL handshake, and for that we send a SASL auth header; but a DataNode expecting simple auth does not expect a SASL auth header, it expects a version number, and so it reads the first bytes as one. In the SASL auth case those bytes are 0xDEAD, which read as a signed 16-bit value is the cryptic version number, -8531, in the DN log.

The setupIOStreams method works based on configuration and on server defaults coming in via the DfsClient (if I remember correctly), and the shared AtomicBoolean is set to control the behaviour of the SaslDataTransferClient inside the DfsClient, so that the SaslDataTransferClient does not try to do a SASL auth and does not set a SASL header, but sends the plain request instead. The first client behaves correctly, but when the second client tries to connect to the DN, its shared AtomicBoolean (instantiated in the DfsClient constructor) is not set before the patch, so it sends a SASL header and initiates a SASL handshake first, then fails on the EOF it gets, as the DN closes the connection when the exception happens on its side.
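A minimal, self-contained illustration (plain Java, not Hadoop code) of where that -8531 comes from, assuming the handshake bytes 0xDE 0xAD described above are read as a plain version field:

```java
import java.nio.ByteBuffer;

public class VersionMismatchDemo {
    public static void main(String[] args) {
        // The first bytes sent in the SASL handshake, per the comment above.
        byte[] saslHeaderStart = {(byte) 0xDE, (byte) 0xAD};
        // A DataNode expecting simple auth reads a version number here instead.
        short version = ByteBuffer.wrap(saslHeaderStart).getShort();
        System.out.println(version); // prints -8531, the value from the DN log
    }
}
```
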
@sodonnel thank you for the review (and sorry for the delay, I was off last week). I also have an idea on how to add a test for this case to demonstrate the problem better; I am working on that and on the fix for the failing tests, and will add both to the PR when I am ready.

@fapifta Thanks for the explanation. Is the current situation as follows? Correct me if I'm wrong. Server S has two clients: client A with non-secure settings and "fallbackToSimpleAuth = false", and client B with secure settings and "fallbackToSimpleAuth = true". Both connect to DataNode D. Since client A connects to DataNode D first, the connection is created in the connections pool, and fallbackToSimpleAuth is set to false.

If the issue is caused by the above situation, I think distinguishing the connections of client A and client B would solve it. Currently the ConnectionId of client A and client B is the same; if we add "fallbackToSimpleAuth" to ConnectionId, then since client A has "false" and client B has "true", they will generate different ConnectionIds, different connections will be created, and they won't affect each other.

@symious Let me correct the case a bit, but in general you see it the right way. We have code that uses the FS API to connect to the cluster. This code creates clients A and B, and during their creation an AtomicBoolean (fallbackToSimpleAuth) instance is created for each. Clients A and B connect to the same DataNode D, A first and B later. During setting up the connection for client A, setupIOStreams sets client A's fallbackToSimpleAuth instance; later on, as the connection between the host running the code and DataNode D is already established, client B re-uses that connection and skips setupIOStreams, so client B's fallbackToSimpleAuth instance remains false. Due to this, the SaslRpcClient in client B does not fall back to simple auth but tries to do a SASL handshake, which is not expected by DataNode D.

Your suggestion to add the client's ID to the ConnectionId would help, but there is a tradeoff: in that case client B has to set up a new socket connection to DataNode D. Because of this I am reluctant to modify the behaviour in a way where we start to use multiple connections towards the same DataNode from one single piece of client code.

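A hedged reproduction sketch of the scenario above; the URI, path, and class name are illustrative, and it assumes a secure client configuration pointed at a SIMPLE-auth cluster:

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FallbackRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disabling the FS cache forces two separate DfsClient instances.
        conf.setBoolean("fs.hdfs.impl.disable.cache", true);
        conf.setBoolean("ipc.client.fallback-to-simple-auth-allowed", true);

        URI uri = URI.create("hdfs://namenode:8020"); // illustrative address
        Path file = new Path("/tmp/some-file");       // illustrative path

        try (FileSystem fsA = FileSystem.get(uri, conf);
             FileSystem fsB = FileSystem.get(uri, conf)) {
            try (FSDataInputStream in = fsA.open(file)) {
                in.read(); // client A: setupIOStreams sets its fallback flag
            }
            try (FSDataInputStream in = fsB.open(file)) {
                in.read(); // before the patch: client B reuses the pooled
                           // connection, its own flag stays false, and the
                           // read attempts a SASL handshake against a
                           // SIMPLE-auth DataNode
            }
        }
    }
}
```
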
@sodonnel I still owe this PR some kind of test. The prior test failure should be fixed now, but it revealed that my earlier changes caused a behaviour change in the case where the fallbackToSimpleAuth boolean is null. I also realised, while evaluating the problem with TestSaslRPC, that adding setFallbackToSimpleAuth at the beginning of setupIOStreams changed behaviour when the authProtocol is not SASL. I had a plan to add a MiniCluster based test for this, but the TestSaslRPC code gave me some other ideas, so I am exploring the possibilities there as well.

💔 -1 overall
This message was automatically generated.

💔 -1 overall
This message was automatically generated.

@fapifta Thanks for the comment. Regarding the trade-off, I think it's mainly about the reuse of the Connection, that is, the ConnectionId. The current comparison of ConnectionId is as follows (a paraphrased sketch is shown below). From that comparison, two clients on the same server with "rpcTimeout" or "pingInterval" set differently already end up on different connections, so I think the overhead is not that heavy; there are already many situations where clients do not reuse the same connection. IMHO the connections are like bridges: client A is like a bridge 5 meters wide, client B like a bridge 10 meters wide; it seems more succinct to use different connections for different usage.

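For reference, a hypothetical, simplified stand-in for the kind of field-by-field comparison org.apache.hadoop.ipc.Client.ConnectionId performs; the exact field set is an assumption based on this discussion, not the verbatim Hadoop source:

```java
import java.net.InetSocketAddress;
import java.util.Objects;

// Simplified pool key: a connection is reused only when every field matches.
final class ConnectionIdSketch {
    final InetSocketAddress address;
    final Class<?> protocol;
    final String user;       // stands in for the UGI ticket
    final int rpcTimeout;
    final int maxIdleTime;
    final int pingInterval;
    final boolean doPing;
    final boolean tcpNoDelay;

    ConnectionIdSketch(InetSocketAddress address, Class<?> protocol, String user,
                       int rpcTimeout, int maxIdleTime, int pingInterval,
                       boolean doPing, boolean tcpNoDelay) {
        this.address = address;
        this.protocol = protocol;
        this.user = user;
        this.rpcTimeout = rpcTimeout;
        this.maxIdleTime = maxIdleTime;
        this.pingInterval = pingInterval;
        this.doPing = doPing;
        this.tcpNoDelay = tcpNoDelay;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == this) {
            return true;
        }
        if (!(obj instanceof ConnectionIdSketch)) {
            return false;
        }
        ConnectionIdSketch that = (ConnectionIdSketch) obj;
        // Clients differing in rpcTimeout or pingInterval already get distinct
        // pool keys, and therefore distinct connections.
        return Objects.equals(address, that.address)
            && Objects.equals(protocol, that.protocol)
            && Objects.equals(user, that.user)
            && rpcTimeout == that.rpcTimeout
            && maxIdleTime == that.maxIdleTime
            && pingInterval == that.pingInterval
            && doPing == that.doPing
            && tcpNoDelay == that.tcpNoDelay;
    }

    @Override
    public int hashCode() {
        return Objects.hash(address, protocol, user, rpcTimeout, maxIdleTime,
                            pingInterval, doPing, tcpNoDelay);
    }
}
```

Under the proposal in the comment above, a fallbackToSimpleAuth field would join this comparison, so clients A and B would map to different pool keys and therefore to different connections.
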
@symious I would say yes and no :) We have three levels in the communication: the DfsClient, which connects to HDFS and is one of many users of the SASL protocol layer (SaslRpcClient), which in turn uses the basic network communication layer, the ipc Client. (At least that is how I understand the system so far, so correct me if I am wrong.) In our problem scenario we have two DfsClients which use two separate SaslRpcClients, but under the hood they use the same ipc Client, as ipc clients are cached in the RpcEngine regardless of whether we cache the DfsClient in a higher layer.

On your suggestion as I understand it: I believe it is already unfortunate that the ipc layer decides and controls whether the SASL layer uses SASL or falls back to simple-auth-based communication; if we also add client ids from one more layer up, then the behaviour of the SASL layer would be defined by the layer above SASL via the layer below SASL, which does not seem better at all.

In my view the high-level client id does not, and should not, distinguish between connections. The connection itself is a socket with one end on a port on the local machine and the other end on a port on a remote machine. It does not, and should not, matter who uses the connection, as long as the user wants to use the connection between those two ports with the same settings. If you look at it this way, the things in the equals method all qualify to differentiate between two connections, while the id of the user really does not.

💔 -1 overall
This message was automatically generated.

🎊 +1 overall
This message was automatically generated.

@symious

@fapifta Sorry for the late reply. I think it's not about adding a new field like ClientID. I created a commit as an example, could you help check it? symious@1389696

@symious ah, I see what you are thinking about, and what I had assumed about the proposal and its impact is not correct. After looking into this further and reviewing more of how the client code is interconnected and works together, I have to admit that this might be a good idea. I am still hesitant, as it feels more risky and it is a change from the user's point of view, but as the ipc Client class is tagged as Public and Evolving, we can change it from minor version to minor version even if we break compatibility. Let me explore this a bit and test it out; I will get back to this sometime during the week.

@fapifta Sorry for not making it clear before. Thank you for the test : )

@symious I am closing this PR for now. As we discussed implementing the other solution, we probably won't need this anymore, but I wanted to keep the discussion and the solution separate, hence I created a clean new PR for the fix and referenced this discussion for those who are interested.

At the end of the day, after an extensive discussion, we agreed with @symious to continue with this solution. I see one small improvement, which I will update the PR with; after that, @sodonnel, if you are ok to commit this one, we are probably ready to go. I am not sure if we want to ask anyone else to review.

…stributedFileSystem instance.
…eams method as it had before the changes in all other cases than the problem we want to solve here.
…on in TestRpcBase to make it easier to add.
…et to true, as in this case we would set it to true again in the same connection. Also removed some logs; all information from the removed messages can be inferred from the rest.

🎊 +1 overall
This message was automatically generated.

🎊 +1 overall
This message was automatically generated.

This change LGTM, but there are a few checkstyle issues highlighted in the last CI run. Could you fix them please?

Oh, I hadn't spotted that; fixed. Let's see the CI results. Thank you @sodonnel for the review!

🎊 +1 overall
This message was automatically generated.

Description of PR
When an ipc.Client.Connection is created for the first time, it is added to the connection pool keyed by ConnectionId, and it stays there until the connection is closed.
If someone instantiates two clients (e.g. when fs.hdfs.impl.disable.cache is true), the same connection is shared between the two DfsClients because of this pooling.
Connection.setupIOStreams is responsible for setting the fallbackToSimpleAuth AtomicBoolean, but when a connection is accessed a second time, setupIOStreams returns without setting the AtomicBoolean properly.
This leads to read failures in the second DfsClient when the client runs with a secure configuration but connects to a SIMPLE auth cluster with ipc.client.fallback-to-simple-auth-allowed set to true, unless the second client is created after the first one is closed.
The fix is to set the fallbackToSimpleAuth AtomicBoolean properly even when the socket has already been created. In that case the authMethod is not null, but there is a sanity check for it: if we somehow get there with a null authMethod, we simply do not touch the AtomicBoolean, and likewise if the AtomicBoolean itself is null (see the sketch below).
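A hedged sketch of that logic, capturing the intent described above rather than the exact patch; AuthMethod here is a stand-in for org.apache.hadoop.security.SaslRpcServer.AuthMethod:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only: the real code lives in ipc.Client.Connection.setupIOStreams.
class FallbackFlagSketch {
    enum AuthMethod { SIMPLE, KERBEROS, TOKEN }

    static void setFallbackToSimpleAuth(AuthMethod authMethod,
                                        AtomicBoolean fallbackToSimpleAuth) {
        // Sanity checks from the description: if the auth method or the flag
        // itself is null, leave the AtomicBoolean untouched.
        if (authMethod == null || fallbackToSimpleAuth == null) {
            return;
        }
        // Fall back to simple auth exactly when the negotiated method is SIMPLE.
        fallbackToSimpleAuth.set(authMethod == AuthMethod.SIMPLE);
    }
}
```
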
How was this patch tested?
In a real environment with different iterations of the code that is shown in the JIRA ticket. Event orders that were tested:
Get two DfsClients with cache enabled - pass (passed before patch)
Get one DfsClient with cache disabled, then get one DfsClient with cache enabled - pass (failed before patch)
Get two DfsClients with cache disabled - pass (failed before patch)
The next tests were run with cache disabled:
Get one DfsClient, close it, then get the second DfsClient - pass (passed before patch)
Get one DfsClient, close it after the second DfsClient is created - pass (failed before patch)
Get two DfsClients without deliberately closing either of them - pass (failed before patch)