HADOOP-17975 Fallback to simple auth does not work for a secondary DistributedFileSystem instance #3658
Conversation
🎊 +1 overall
This message was automatically generated.
      UserGroupInformation ticket, int rpcTimeout,
-     RetryPolicy connectionRetryPolicy, Configuration conf) {
+     RetryPolicy connectionRetryPolicy, Configuration conf,
+     AtomicBoolean fallbackToSimpleAuth) {
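The diff above threads the fallbackToSimpleAuth reference into the ConnectionId constructor. As a rough, simplified sketch (not the actual Hadoop class; names and fields abbreviated), a ConnectionId that keys on the AtomicBoolean reference behaves like this:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simplified sketch of a ConnectionId that includes the per-client
// fallbackToSimpleAuth reference in equals/hashCode (illustrative only).
public class ConnectionIdSketch {
    final String address;                     // host:port the connection targets
    final AtomicBoolean fallbackToSimpleAuth; // per-DfsClient flag

    ConnectionIdSketch(String address, AtomicBoolean fallback) {
        this.address = address;
        this.fallbackToSimpleAuth = fallback;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ConnectionIdSketch)) return false;
        ConnectionIdSketch other = (ConnectionIdSketch) o;
        // AtomicBoolean does not override equals, so this is an identity
        // comparison: same flag instance means same DfsClient.
        return address.equals(other.address)
            && fallbackToSimpleAuth.equals(other.fallbackToSimpleAuth);
    }

    @Override
    public int hashCode() {
        // AtomicBoolean.hashCode() is Object.hashCode(), i.e. identity-based.
        return 31 * address.hashCode() + fallbackToSimpleAuth.hashCode();
    }

    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(false);
        ConnectionIdSketch a = new ConnectionIdSketch("nn:8020", flag);
        ConnectionIdSketch b = new ConnectionIdSketch("nn:8020", flag);
        ConnectionIdSketch c = new ConnectionIdSketch("nn:8020", new AtomicBoolean(false));
        System.out.println(a.equals(b)); // true: same flag instance
        System.out.println(a.equals(c)); // false: distinct flag instances
    }
}
```

The point of the identity semantics is that two ids are equal only when they come from the same DfsClient, even if both flags currently hold the same value.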
I'm not sure whether we should change this to boolean instead of AtomicBoolean here.
The equals of AtomicBoolean does not seem to fit the situation.
I tried to simply use a boolean here, but it did not work, and I realized that we need to distinguish between connections based on the reference of the AtomicBoolean, which is exactly what Object.equals does here.
Here is why:
If we pass in the boolean value of the AtomicBoolean, then the event sequence is this:
- the DfsClient is created with its AtomicBoolean set to false
- we get the connection based on the connection id and the boolean value false
- we call setupIOStreams at first connect, which sets the AtomicBoolean to true
- at the next call, when we get a connection, we get a different one, based on the boolean value true
- the next client creates another AtomicBoolean at initialization, then gets the connection associated with the boolean value false, and its own AtomicBoolean is never initialized via setupIOStreams
The situation is similar if we use the value of the AtomicBooleans in hashCode and equals.
That is why I chose to compare via AtomicBoolean's equals (which falls back to Object.equals), and to use AtomicBoolean's hashCode (which is likewise based on Object.hashCode).
With that, every distinct client, having its own AtomicBoolean, gets a distinct connection and initializes the fallback properly via setupIOStreams.
Objects.equals(new AtomicBoolean(false), new AtomicBoolean(false)) is false.
So if clientA and clientB connect from the same server to the NameNode, they should be using the same connection, but now they use different connections.
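This identity behavior is easy to verify with a tiny program (hypothetical demo class):

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicBoolean;

// AtomicBoolean does not override equals/hashCode, so two instances
// holding the same value still compare unequal (identity semantics).
public class AtomicBooleanEquality {
    static boolean sameByEquals(AtomicBoolean a, AtomicBoolean b) {
        return Objects.equals(a, b);
    }

    public static void main(String[] args) {
        AtomicBoolean x = new AtomicBoolean(false);
        AtomicBoolean y = new AtomicBoolean(false);
        System.out.println(sameByEquals(x, y)); // false: different instances
        System.out.println(sameByEquals(x, x)); // true: same instance
        System.out.println(x.get() == y.get()); // true: same boolean value
    }
}
```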
Yes, they are not equal, but let me reiterate: they should not be equal.
Three things to note here:
- every DfsClient creates a new AtomicBoolean initialized to false
- these AtomicBooleans are set to true by the ipc Client Connection's setupIOStreams method, which can only be called after we have already obtained a connection based on the ConnectionID, whose value we would then possibly change in setupIOStreams.
- in the ipc Client layer, a connection to a NameNode is no different from a connection to a DataNode; it is just a connection to a host:port combination with a given network-related setup.
Comparing the boolean values instead of the AtomicBoolean references brings us back to the initial problem. We have a false value when we first try to get the connection based on the ConnectionID, then after we get the connection we change the value to true. This is even worse: if fallback is on, we are guaranteed two connections even for a single DfsClient, because when we get the connection the second time the value is true, so we create another connection for the new ConnectionID, and we possibly leak the previous one, as that client never creates a ConnectionID with false again. Meanwhile a second DfsClient gets the connection whose ID contains the false value, and we hit the very problem we are trying to fix (the unset AtomicBoolean prevents fallback for the second DfsClient).
Regarding the number of connections we discussed in the previous PR: at that point the NN was out of the picture, but yes, this change affects the NN connections as well if multiple DfsClients are active.
Your suggestion to add the client's ID to the ConnectionID would help, but there is a tradeoff there, as in this case client B has to set up a new socket connection to DataNode D.
That would cause two things:
- the overhead of creating a new connection instead of reusing the one we already have.
By default, connections from the same server with the same configuration reuse the connection tagged with the same ConnectionId. Say we have Client1 through Client10: these 10 clients will reuse the same connection.
If we add an AtomicBoolean field to the comparison of ConnectionId, Client1 through Client10 will each create a new AtomicBoolean; since Object.equals is always false between them, they will have 10 connections. That's a connection leak.
If we change AtomicBoolean to boolean, there will be at most two different connections. Since "false" is the default value, normally we will have only one connection; only when a connection has a "true" value will a second connection be created.
I assume the problem we are initially trying to solve is that clients with "fallback" set to true will be omitted. When we compare with a "boolean" instead of an "AtomicBoolean", normal clients will go to the connection with "boolean" set to "false", and special clients will go to the connection with "boolean" set to "true".
IMO, two connections are not a leak, since idle connections are collected and closed.
Here is the problem again, in other words, as I am starting to feel that we don't understand each other :)
When client code initiates a connection to HDFS, the following decisions are made:
- if the DfsClient is cached, then we will have the same DfsClient with the same AtomicBoolean for all connections (this is the default)
- if the DfsClient is not cached, or if we explicitly create a new instance without checking the cache (we have an API for this), then there will be as many distinct DfsClients, each with its own distinct AtomicBoolean, as are created.
With distinct AtomicBooleans there are distinct ipc ConnectionIDs, which means there are distinct connections.
Ipc Connections are closed when idle for 10 seconds (by default).
So if we include the AtomicBoolean reference in the ConnectionId, we might create extra connections when distinct DfsClients are created (when DfsClients are not cached, or a distinct instance is created via an explicit call); these are then either used by the different clients or closed after the max idle time.
Note that the AtomicBoolean value is not there to determine whether we are allowed to fall back; that comes from configuration via the fallbackAllowed variable into Connection#setupIOStreams. This AtomicBoolean determines, during the real communication, which type of authentication will be used, and we only know whether we really have to fall back to simple auth after we have connected to the server address and discovered its expected authentication method.
As I stated earlier, we cannot use the boolean value of the AtomicBoolean in the key. Here is what happens when we do.
Let's say we have DfsClient1 and DfsClient2. DfsClient1 makes 2 RPC calls to the same address and is not closed, while DfsClient2 is created in the meantime and makes 1 RPC call to the same address.
- DfsClient1 creates ConnectionID1 with the boolean value false.
- DfsClient1 realizes it should fall back to simple auth, so it sets AtomicBoolean1 to true.
- DfsClient1 sends the first RPC via the connection that has ConnectionID1.
- DfsClient1 tries to send the second RPC, and in order to do so it creates a connection with ConnectionID2, where the boolean value is true.
- DfsClient1 sends the second RPC via the connection that has ConnectionID2.
- DfsClient1 goes idle.
- DfsClient2 is created with AtomicBoolean2 set to false.
- DfsClient2 initiates its RPC call and gets the connection with ConnectionID1, which has the boolean value false.
- DfsClient2 returns early from setupIOStreams, as the connection is already established, so AtomicBoolean2 remains false.
- DfsClient2 fails the RPC call with the same error as documented in HADOOP-17975, which we are trying to solve.
While if the ConnectionID holds the reference to the AtomicBoolean instead of its boolean value, then DfsClient1 does not create a second connection with ConnectionID2 on the second RPC call, but DfsClient2 creates a second connection with its own AtomicBoolean, and setupIOStreams can then properly initialize AtomicBoolean2.
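The event sequence above can be modeled with a toy cache (a plain HashMap standing in for the real ipc connection cache; all names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of the ipc connection cache. Keying by the boolean *value*
// makes DfsClient1 open a second connection after its flag flips, while
// DfsClient2 silently reuses the stale "false" entry and its own flag is
// never initialized (the bug). Keying by the AtomicBoolean *reference*
// gives each client exactly one connection of its own.
public class ConnectionCacheModel {

    // Returns DfsClient2's flag after the sequence when keying by value.
    static boolean client2FlagWithValueKey() {
        Map<Boolean, String> cache = new HashMap<>();
        AtomicBoolean flag1 = new AtomicBoolean(false);
        cache.putIfAbsent(flag1.get(), "conn-1"); // DfsClient1, 1st RPC
        flag1.set(true);                          // setupIOStreams flips it
        cache.putIfAbsent(flag1.get(), "conn-2"); // 2nd RPC: extra connection
        AtomicBoolean flag2 = new AtomicBoolean(false);
        cache.putIfAbsent(flag2.get(), "conn-3"); // no-op: reuses conn-1
        return flag2.get();                       // stays false: the bug
    }

    // Returns the connection count after the sequence when keying by reference.
    static int connectionCountWithReferenceKey() {
        Map<AtomicBoolean, String> cache = new HashMap<>();
        AtomicBoolean flag1 = new AtomicBoolean(false);
        cache.putIfAbsent(flag1, "conn-1"); // DfsClient1, 1st RPC
        flag1.set(true);                    // identity hash is unchanged
        cache.putIfAbsent(flag1, "conn-2"); // no-op: conn-1 is reused
        AtomicBoolean flag2 = new AtomicBoolean(false);
        cache.putIfAbsent(flag2, "conn-3"); // distinct entry for DfsClient2
        flag2.set(true);                    // fresh setupIOStreams can now
                                            // initialize flag2 properly
        return cache.size();                // one connection per client
    }

    public static void main(String[] args) {
        System.out.println(client2FlagWithValueKey());         // false
        System.out.println(connectionCountWithReferenceKey()); // 2
    }
}
```

Mutating the AtomicBoolean under the reference-keyed map is safe here because its hashCode is identity-based and does not change with the value.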
> normally we will only have one connection, only when we have a "true" value connection will the second connection be created.
This might sound appealing, but it is not true. Initially we always have a ConnectionID with the value false, and this false value may change while the connection is being set up, when it is already cached under the false boolean value. As we cannot know the final value until we have set up the sockets and initiated some communication to see what auth method the server side uses, we will never initiate a connection with a ConnectionID that contains true.
Yes, you are right that changing the ConnectionID to have a distinct connection per DfsClient just for this specific case, and creating multiple connections in all other cases, is bad; but if we insist on having this property in the ConnectionID, I do not see any way to avoid it.
Alternatively, as this seems problematic, we can switch back to my previous proposal, which does not increase the number of connections but simply sets the AtomicBoolean value in setupIOStreams whenever it is called, if the auth protocol is SASL, the server-side auth method is SIMPLE, the client runs with a secured configuration, and fallback is allowed (these conditions are checked lazily, in this order).
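That alternative can be sketched roughly as follows; the enum and method names are illustrative, not the real fields in hadoop.ipc.Client:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of the alternative proposal: instead of changing
// ConnectionId, (re)set the shared flag inside setupIOStreams whenever
// the lazily checked conditions hold. Illustrative names throughout.
public class SetupIoStreamsSketch {
    enum AuthProtocol { NONE, SASL }
    enum AuthMethod { SIMPLE, KERBEROS, TOKEN }

    // Conditions checked lazily, in this order: SASL was negotiated, the
    // server expects SIMPLE auth, the client is secured, fallback allowed.
    static void maybeMarkFallback(AuthProtocol authProtocol,
                                  AuthMethod serverAuthMethod,
                                  boolean clientIsSecured,
                                  boolean fallbackAllowed,
                                  AtomicBoolean fallbackToSimpleAuth) {
        if (authProtocol == AuthProtocol.SASL
            && serverAuthMethod == AuthMethod.SIMPLE
            && clientIsSecured
            && fallbackAllowed) {
            // Set on every call, so a later client reusing an already
            // established connection still gets its own flag initialized.
            fallbackToSimpleAuth.set(true);
        }
    }

    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(false);
        maybeMarkFallback(AuthProtocol.SASL, AuthMethod.SIMPLE, true, true, flag);
        System.out.println(flag.get()); // true: fallback marked
    }
}
```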
> DfsClient2 is created with AtomicBoolean2 set to false
If the AtomicBoolean is set to false, DFSClient2 is supposed to fail, since it won't fall back to simple auth.
The case in HADOOP-17975 is that DFSClient2 uses "fallback = true" but reuses the connection with "fallback = false", so it fails.
Let me quote myself here:
> Note that the AtomicBoolean value is not there to determine whether we are allowed to fall back; that comes from configuration via the fallbackAllowed variable into Connection#setupIOStreams. This AtomicBoolean determines, during the real communication, which type of authentication will be used, and we only know whether we really have to fall back to simple auth after we have connected to the server address and discovered its expected authentication method.
Please see this code: the fallbackAllowed boolean is what you are talking about. It is initialized from config here.
The AtomicBoolean is initialized in the Connection's setupIOStreams method, which is called from the getConnection method, which also caches the connection based on ConnectionID here.
The fallbackToSimpleAuth AtomicBoolean is created in the DfsClient here and shared with the SaslDataTransferClient here, while it is provided to the ipc Client via the protocol proxy between these two points.
Later on, the SaslDataTransferClient uses the AtomicBoolean to decide whether to fall back or not in this method.
So again:
We cannot use the config, because even if fallback is allowed we do not want to fall back against a secured server, and we do not know whether the server is secure or insecure until we are connected to it.
The AtomicBoolean is initialized to false and set to true only after the Connection has been set up and cached based on the ConnectionID, so we cannot rely on its boolean value.
So the question still stands: do you still think we should go down this route and tinker with the ConnectionID, instead of setting the AtomicBoolean's value in setupIOStreams whenever it should be true?
Alright, thank you for confirming; I am reopening that PR and closing this one.
Thank you for the review and the continued discussion about selecting the final solution.
Description of PR
In the JIRA there is a long description of the problem; the TL;DR is that if there are two DFSClient instances at the same time on the client side, the second one cannot properly fall back to SIMPLE auth, due to how the SaslRpcClient handles the fallback via an AtomicBoolean created by the DfsClient and set by hadoop.ipc.Client.
The initial idea/solution was posted as #3579, where we had a long discussion with @symious about the candidate solutions and their tradeoffs. In the end I was convinced that this is a proper solution that fits into the current system, so I am adding this PR to finally fix the issue.
The solution is to add the AtomicBoolean fallbackToSimpleAuth, which controls whether and how the SaslRpcClient falls back to simple auth, to the ConnectionId class, and with that distinguish the connections of the two DfsClients based on the fallbackToSimpleAuth AtomicBoolean instance. If it differs from any other connection's AtomicBoolean instance, we get a distinct connection, and with that the value of fallbackToSimpleAuth is initialized properly.
How was this patch tested?
JUnit test added.
The problem manifested itself as a failing HBase ExportSnapshot job between two small clusters, one using Kerberos auth and one using Simple auth. With this fix the ExportSnapshot job was able to run as well.