-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-17975 Fallback to simple auth does not work for a secondary DistributedFileSystem instance #3658
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
HADOOP-17975 Fallback to simple auth does not work for a secondary DistributedFileSystem instance #3658
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we should change it to boolean instead of AtomicBoolean here.
The "equals" of AtomicBoolean seem doesn't fit the situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried to simply use a boolean here, but it did not work, and I realized that we need to distinguish between connections based on the reference of the AtomicBoolean what Object.equals will do here.
Here is why:
If we pass in the boolean value of AtomicBoolean, then the event sequence is this:
The situation is similar also, if we use the value of the AtomicBooleans in the hashcode and equals.
That is why I chose to compare based on Object.equals (which will use AtomicBoolean's equals, which falls back to Object.equals), and also the hashcode method of AtomicBoolean (which will also give back something based on the Object.hashcode).
With that every distinct client with its own AtomicBoolean will get a distinct connection, and will initialize the fallback properly via setupIOStreams.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Objects.equals(new AtomicBoolean(false), new AtomicBoolean(false) is false.
So if clientA and clientB from the same server connecting to NameNode, they should be using the same connection, but now they are using different connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they are not equals, but let me re-iterate... they should not be equals.
Three things to note here:
Comparing the boolean values instead of the atomic boolean reference will bring us back to the initial problem. We will have a false value when we first try to get the connection based on the ConnectionID, then after we get the connection we change the value to true. Which is even worse, as in this case if fallback is on, we would have for sure two connections even for a single DfsClient, as when we get the connection the second time, the value would be true, so we would create an other connection for the new ConnectionID, and we possibly leak the previous one, as we would never create a ConnectionID with false again from that client. While on the other hand, from a second DfsClient, we would get the connection with the ID contains the false value, and we would have the problem we are trying to fix (unset AtomicBoolean prevents fallback of the second DfsClient).
About the number of connections we discussed in the previous PR, though at that point NN was out of picture, but yes, this change affects the NN connections as well if there are multiple DfsClients that are active.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, connections from the same server with same configurations will reuse the connection tagged with the same ConnectionId. Say Client1, to Client10, these 10 clients will reuse the same connection.
If we add a field of AtomicBoolean in the comparison of ConnectionId, Client1 to Client10 will each initial a new AtomicBoolean, since the result of Object.equals are always false, the Client1 to Client10 will have 10 connections. That's connection leak.
If we change AtomicBoolean to boolean, there will at most be two different connections, since "false" is the default value, normally we will only have have one connection, only when we have a "true" value connection will the second connection be created.
I assume the problem we initially trying to solve is that clients with "fallback" set to true will be omitted. When we check with "boolean" instead of "AtomicBoolean", normal clients will go to the connection with "boolean" set to "false", special clients will go to the connection with "boolean" set to "true".
IMO, two connections are not leak, since idle connection will be collected and closed.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the problem again with other words, as I am starting to feel that we don't understand each other :)
When a client code initiates a connection to HDFS, the following decisions are made:
With the distinct AtomicBooleans, there are distinct ipc connection ids, which means there are distinct connections.
Ipc Connections are closed when idle for 10 seconds (by default).
So if we have the AtomicBoolean reference in the ConnectionId, we might create connections if distinct DfsClients are created (DfsClients are not cached, or via explicit call to create a distinct instance), which are then either used by the different clients, or closed after the max idle time.
Note that the AtomicBoolean value is not there to determine if we are allowed to fall back, that comes from configuration via a variable fallbackAllowed into Connection#setupIOStreams. This atomicBoolean determines during the real communication which type of authentication will be used, and we do only know if we really have to fall back to simple auth after we connected to the server address and discovered its expected authentication method.
As I stated earlier, we can not use the boolean value of the AtomicBoolean in the key. Here is what happens when we do:
Let's say we have DfsClient1 and DfsClient2. DfsClient1 does 2 RPC calls to the same address then it is not closed, but DfsClient2 is created in the meantime which does 1 RPC call to the same address.
DfsClient1 creates ConnectionID1 with boolean value false.
DfsClient1 realizes it should fall back to simple auth, so it sets AtomicBoolean1 to true
DfsClient1 sends the first RPC via the connection that has ConnectionID1
DfsClient1 tries to send the second RPC, and in order to do so it creates a connection with ConnectionID2 where the boolean value is true
DfsClient1 sends the second RPC via the connection that has ConnectionID2
DfsClient1 goes idle
DfsClient2 is created with AtomicBoolean2 set to false
DfsCllient2 initiates the RPC call, and gets the connection with ConnectionID1 which has the boolean value as false
DfsClient2 does yield from setupIOStreams as the connection is already established, with that AtomicBoolean2 remains false
DfsClient2 fails the RPC call with the same error as documented in HADOOP-17975 which we are trying to solve.
While if the ConnectionID has the reference to the AtomicBoolean instead of the boolean value of it, then DfsClient1 does not create a second connection on the second RPC call with ConnectionID2, but DfsClient2 will create a second connection with its own AtomicBoolean, and then setupIOStreams is able to properly initialize AtomicBoolean2.
This might sound appealing, but it is not true, as initially we would always have a ConnectionID with the value false, and this false value might change during setting up the connection when it is already cached by the false boolean value. As we do not know the final value and can not know the final value until we haven't set up the sockets and initiated some communication to see what auth method the server side has we will never initiate a connection with a ConnectionID that contains true.
Yes you are right in that, changing the ConnectionID to have a distinct connection per DfsClient for just this specific case, and create multiple connections for all the other cases is bad, but if we insists on having this property in the ConnectionID, then I do not see any way to avoid this.
Alternatively as this seems to be problematic, we can switch back to my previous proposal, which does not increases the number of connections, but just sets the AtomicBoolean value in setupIOStreams always when it is called if the auth protocol is SASL, and server side auth method is SIMPLE, and the client runs with a secured configuration, and if fallback is allowed where these conditions are checked in this order lazily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the AtomicBoolean is set to false, DFSClient2 is supposed to fail. Since it won't fallback to simpleAuth.
The case of HADOOP-17975 is DFSClient2 is using "fallback = true", but it's reusing the connection with "fallback = false", so it fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me quote myself here:
Please see this code, the falbackAllowed boolean is what you are talking about. It is initialized based on config here.
The AtomicBoolean is initialized in the Connection's setupIOStreams method which is called from the getConnection method, which also caches the connection based on ConnectionID here.
The fallbackToSimpleAuth AtomicBoolean is created in the DfsClient here, and is shared with the SaslDataTransferClient here, while it is provided to the ipc Client via the protocol proxy between this two points.
Later on the SaslDataTransferClient uses the AtomicBoolean to decide wether to fallback or not in this method.
So again:
the config we can not use, as even if the fallback is allowed we do not want to fallback against a secured server, and we do not know if the server is secure or unsecure until we are connected to it.
The AtomicBoolean is initialized as false, and set to be true only after the Connection has been set up and cached based on the ConnectionID, so we can not rely on its boolean value.
So the question still stands, do you still think we should go down this route, and tinker with the ConnectionID, instead of setting the AtomicBoolean's value in setupIOStreams if it should be true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fapifta I see.
Seems we need to fallback to the solution of #3579.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, thank you for confirming, I am reopening that PR, and closing this one.
Thank you for the review, and the continued discussion about selecting the final solution.